CN117235527B - End-to-end containerized big data model construction method, device, equipment and medium - Google Patents


Info

Publication number
CN117235527B
CN117235527B (application CN202311282136.1A)
Authority
CN
China
Prior art keywords
preset
data set
data
configuration file
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311282136.1A
Other languages
Chinese (zh)
Other versions
CN117235527A (en)
Inventor
王蒴
孙召进
李希明
王沛
王龙振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Civic Se Commercial Middleware Co ltd
Original Assignee
Shandong Civic Se Commercial Middleware Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Civic Se Commercial Middleware Co ltd filed Critical Shandong Civic Se Commercial Middleware Co ltd
Priority to CN202311282136.1A
Publication of CN117235527A
Application granted
Publication of CN117235527B
Legal status: Active


Classifications

    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management (climate change mitigation technologies in information and communication technologies [ICT])
    • Y02T 10/40 — Engine management systems (climate change mitigation technologies related to transportation; internal combustion engine [ICE] based vehicles)

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract


The present application discloses an end-to-end containerized big data analysis model construction method, device, equipment and storage medium, relating to the field of machine learning. The method includes: acquiring raw data and classifying it to obtain structured data and unstructured data, then grouping to obtain a target data set; determining parameter information of an initial model, generating an algorithm configuration file from the parameter information, and inputting it into a preset virtualized container to verify the mapping relationship between the target data set and the algorithm configuration file; and, if the relationship is correct, training the initial model according to the mapping relationship with a preset executor, which during training calls a preset parameter scheduler to dynamically adjust the hyperparameters of a preset optimizer, thereby optimizing the initial model into a target model. Through data set classification, algorithm adaptation and the executor, model construction becomes flexible, efficient and scalable across different scenarios, and control of the virtualized container provides environment isolation that ensures the completion of training tasks.

Description

End-to-end containerized big data model construction method, device, equipment and medium
Technical Field
The present invention relates to the field of machine learning, and in particular, to a method, an apparatus, a device, and a storage medium for constructing an end-to-end containerized big data analysis model.
Background
In today's age of information explosion, big data analysis and artificial intelligence are important forces driving social development and business innovation. However, building an efficient, accurate AI (Artificial Intelligence) model often faces a series of challenges, including differing model construction requirements across application scenarios, the diversity and complexity of data sets, algorithm adaptation, and the complexity of the training process.
First, different application scenarios place very different requirements on an AI model, so a flexible model construction platform is needed to meet these diverse demands. Traditional model building methods often require specific code to be written for each scenario, increasing the complexity of development and maintenance. Second, the diversity and complexity of data sets is itself a challenge when constructing AI models. Real-world data includes both structured data (e.g., tabular data) and unstructured data (e.g., images, text, etc.). Different data types require different preprocessing, feature extraction and labeling methods, so a unified data set implementation is required. Furthermore, algorithm adaptation and the training process are key issues in constructing an AI model. Different algorithms impose different requirements on hyperparameters, training strategies, optimizers and the like, making the model construction process complex and error-prone. Moreover, conventional training processes often require extensive training code and procedures to be written, which hinders rapid experimentation and iteration. How to provide a flexible, efficient and unified AI model building platform that meets model construction requirements under different application scenarios is therefore a problem to be solved in the art.
Disclosure of Invention
In view of the above, the present invention aims to provide an end-to-end containerized big data analysis model construction method, apparatus, device and storage medium, which achieve flexibility, efficiency and extensibility of model construction under different scenarios by means of data set classification, algorithm adaptation and an executor, and achieve environment isolation through control of a virtualized container, thereby ensuring the completion of training tasks. The specific scheme is as follows:
in a first aspect, the present application provides a method for constructing an end-to-end containerized big data analysis AI model, including:
acquiring original data sent by a preset data source, classifying the original data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set;
determining parameter information of an initial model, generating an algorithm configuration file in a preset format according to the parameter information, inputting the algorithm configuration file into a preset virtualization container, and checking the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container;
if the mapping relationship is correct, training the initial model by using a preset executor according to the mapping relationship, and calling a preset parameter scheduler by using the preset executor during training to dynamically adjust the hyperparameters of a preset optimizer, so as to optimize the initial model with the preset optimizer and generate a target model.
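The three steps above can be sketched as a minimal pipeline. All function names, the dict/non-dict classification rule, and the 80/20 grouping rule below are illustrative assumptions, not taken from the patent:

```python
def classify(raw_records):
    """Split raw records into structured (dict-like) and unstructured (everything else)."""
    structured = [r for r in raw_records if isinstance(r, dict)]
    unstructured = [r for r in raw_records if not isinstance(r, dict)]
    return structured, unstructured

def group(initial_dataset, train_ratio=0.8):
    """A preset grouping rule: simple train/test split by position."""
    cut = int(len(initial_dataset) * train_ratio)
    return {"train": initial_dataset[:cut], "test": initial_dataset[cut:]}

raw = [{"age": 31}, {"age": 45}, "a text document", {"age": 52}, "cat.jpg"]
structured, unstructured = classify(raw)
target_dataset = group(structured + unstructured)
```

A real system would replace the toy classification rule with format detection (tables and CSV versus text, images, audio), but the data flow is the same.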
Optionally, the generating an initial data set based on the structured data and the unstructured data includes:
performing data quality inspection, data cleaning and conversion operations on the fields of the structured data to generate a corresponding structured data set;
labeling the unstructured data to generate an unstructured data set in a label format;
generating an initial data set from the structured data set and the unstructured data set.
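One way the three optional steps above could look in code. The cleaning policy (drop incomplete rows, cast numeric strings) and the labeling callback are illustrative assumptions:

```python
def clean_structured(rows):
    """Quality inspection, cleaning and conversion: drop incomplete rows, cast numeric strings."""
    cleaned = []
    for row in rows:
        if any(v is None or v == "" for v in row.values()):
            continue  # quality check: reject rows with missing fields
        cleaned.append({k: float(v) if isinstance(v, str) and v.replace(".", "", 1).isdigit() else v
                        for k, v in row.items()})
    return cleaned

def label_unstructured(items, labeler):
    """Wrap each raw item with a label, producing a label-format data set."""
    return [{"data": item, "label": labeler(item)} for item in items]

structured_set = clean_structured([{"age": "31"}, {"age": None}, {"age": "45.5"}])
unstructured_set = label_unstructured(["good movie", "bad movie"],
                                      labeler=lambda t: "pos" if "good" in t else "neg")
initial_dataset = structured_set + unstructured_set
```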
Optionally, after the initial data set is grouped according to a preset grouping rule to obtain a target data set, the method further includes:
inputting the target data set into the preset virtualized container, and providing an application programming interface corresponding to the target data set in the form of a network request.
Optionally, the generating an algorithm configuration file in a preset format according to the parameter information includes:
assembling the parameter information into a parameter dictionary according to a preset algorithm, and storing the parameter dictionary in an algorithm configuration file in .yml or .json format.
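For instance, using the standard-library `json` module (a .yml file would work the same way through a YAML library; the parameter names in the dictionary are illustrative, not from the patent):

```python
import json
import os
import tempfile

# Assemble the parameter information into a parameter dictionary.
param_dict = {
    "network": "resnet18",
    "learning_rate": 0.001,
    "batch_size": 32,
    "loss": "cross_entropy",
}

# Store the dictionary as an algorithm configuration file in JSON format.
config_path = os.path.join(tempfile.gettempdir(), "algo_config.json")
with open(config_path, "w") as f:
    json.dump(param_dict, f, indent=2)

# The container later reads the file back into the same dictionary.
with open(config_path) as f:
    loaded = json.load(f)
```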
Optionally, the training the initial model by using a preset executor according to the mapping relationship includes:
reading the parameter dictionary in the algorithm configuration file, and loading the corresponding initial model according to the parameter dictionary by using the preset virtualization container, so that the preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model.
Optionally, the verifying, by using the preset virtualization container, the mapping relationship between the target data set and the algorithm configuration file includes:
determining, by using the preset virtualization container, whether the features corresponding to the target data set and the algorithm configuration file match; if so, judging that the mapping relationship between the target data set and the algorithm configuration file is correct.
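A minimal version of this container-side check. The metadata layout (a `features` list on each side) is an assumption for illustration:

```python
def mapping_is_correct(dataset_meta, algo_config):
    """Verify the data set/config mapping: does the data set provide every feature the algorithm expects?"""
    expected = set(algo_config.get("features", []))
    provided = set(dataset_meta.get("features", []))
    return expected.issubset(provided)

dataset_meta = {"features": ["age", "income", "label"]}
good_config = {"features": ["age", "income"]}
bad_config = {"features": ["age", "height"]}   # "height" is missing from the data set
```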
Optionally, the training the initial model by using a preset executor according to the mapping relationship further includes:
setting a hook by using the preset executor, so as to acquire a user-defined request from the user during training according to the hook and train the initial model through the user-defined request.
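The hook mechanism amounts to a generic event-callback pattern; a sketch follows, in which the event names and the `Runner` skeleton are assumptions rather than the patent's actual interface:

```python
class Runner:
    """Executor skeleton: fires user-registered hooks at fixed points in the training loop."""

    def __init__(self):
        self._hooks = {"before_epoch": [], "after_epoch": []}

    def register_hook(self, event, callback):
        self._hooks[event].append(callback)

    def _fire(self, event, **context):
        for callback in self._hooks[event]:
            callback(**context)

    def train(self, epochs):
        for epoch in range(epochs):
            self._fire("before_epoch", epoch=epoch)
            # ... one epoch of actual training would run here ...
            self._fire("after_epoch", epoch=epoch)

runner = Runner()
seen = []
runner.register_hook("after_epoch", lambda epoch: seen.append(epoch))
runner.train(3)
```

Custom user requests (logging, checkpointing, early stopping) plug in as callbacks without modifying the training loop itself.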
In a second aspect, the present application provides an end-to-end containerized big data analysis model building apparatus, including:
a data set grouping module, used for obtaining original data sent by a preset data source, classifying the original data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set;
a configuration file verification module, used for determining parameter information of an initial model, generating an algorithm configuration file in a preset format according to the parameter information, inputting the algorithm configuration file into a preset virtualization container, and verifying the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container;
a model training module, used for training the initial model according to the mapping relationship by using a preset executor if the mapping relationship is correct, and calling a preset parameter scheduler by using the preset executor during training to dynamically adjust the hyperparameters of the preset optimizer, so as to optimize the initial model with the preset optimizer and generate a target model.
In a third aspect, the application provides an electronic device comprising a processor and a memory, wherein the memory is used for storing a computer program, and the computer program is loaded and executed by the processor to realize the big data analysis model construction method of end-to-end containerization.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which when executed by a processor implements the foregoing end-to-end containerized big data analysis model construction method.
The method comprises: obtaining original data sent by a preset data source; classifying the original data to obtain corresponding structured data and unstructured data; generating an initial data set based on the structured data and the unstructured data; grouping the initial data set according to a preset grouping rule to obtain a target data set; determining parameter information of an initial model; generating an algorithm configuration file in a preset format according to the parameter information; inputting the algorithm configuration file into a preset virtualization container; checking the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container; and, if the mapping relationship is correct, training the initial model by using a preset executor according to the mapping relationship, the preset executor calling a preset parameter scheduler during training to dynamically adjust the hyperparameters of the preset optimizer so as to optimize the initial model and generate a target model. In this way, the application can conveniently process different types of data through a unified data set implementation and form a unified data set format, and can construct and train different types of algorithms in a one-key training manner through an adaptive algorithm implementation. Meanwhile, by introducing an executor as the core module of the training engine, efficient execution and management of training, testing and inference tasks is achieved. Many otherwise independent steps required for AI model construction, such as data collection, data preprocessing, feature extraction, model selection and hyperparameter tuning, are thereby avoided, simplifying the construction of an AI model for non-specialists.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for constructing an end-to-end containerized big data analysis model provided by the application;
FIG. 2 is a flow chart of data set integration provided by the present application;
FIG. 3 is a flow chart of model training provided by the present application;
FIG. 4 is a flowchart of a specific end-to-end containerized big data analysis model construction method provided by the application;
FIG. 5 is a containerization flow chart provided by the present application;
FIG. 6 is a deep learning flow chart provided by the present application;
FIG. 7 is a schematic diagram of a device for constructing an end-to-end containerized big data analysis model according to the present application;
Fig. 8 is a block diagram of an electronic device according to the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present, constructing an efficient and accurate AI model often faces problems such as differing model construction requirements under different application scenarios, the diversity and complexity of data sets, algorithm adaptation, and the complexity of the training process. The present application can conveniently process different types of data through a unified data set implementation and form a unified data set format, can construct and train different types of algorithms in a one-key training manner through an adaptation algorithm implementation, and introduces an executor as the core module of the training engine, thereby achieving efficient execution and management of training, testing and inference tasks.
Referring to fig. 1, the embodiment of the invention discloses a method for constructing a big data analysis model of end-to-end containerization, which comprises the following steps:
Step S11, obtaining original data sent by a preset data source, classifying the original data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set.
In this embodiment, raw data in a data source is first acquired and classified to obtain corresponding structured data and unstructured data; an initial data set is generated based on the classified structured and unstructured data; and the initial data set is then grouped according to a preset grouping rule to obtain a target data set. Specifically, the original data in the data source is cleaned, extracted and filtered, then labeled to form a data set, and the data set is grouped into a training set and a test set, so that an adapted algorithm can subsequently be assembled for different data sets and the training, evaluation and other processes can be carried out in a container.
In the process of classifying, processing and labeling the data sources, data quality inspection, data cleaning and conversion operations are first performed on the fields of the structured data to generate corresponding structured data sets; the unstructured data is labeled to generate unstructured data sets in a label format; and initial data sets are then generated from the structured and unstructured data sets. As shown in fig. 2, generating the data set specifically includes the following steps:
• Data classification: the original data is classified into structured data and unstructured data according to its characteristics. Structured data refers to data with fixed formats and fields, such as database tables and CSV (Comma-Separated Values) files; unstructured data refers to data without fixed formats or fields, such as text documents, images, audio and video.
• Structured data analysis: analysis is performed at the field level. Data quality inspection, data cleaning and conversion operations are first carried out on each field to ensure the accuracy and consistency of the data; the fields are then subjected to statistics, aggregation, sorting and other operations according to analysis requirements to obtain useful information and insight.
• Unstructured data labeling: the unstructured data is labeled to generate a data set in a label format. Labeling means giving the unstructured data a specific label or mark for subsequent model training and analysis, for example object detection and image classification labels for image data, or named entity recognition and sentiment analysis labels for text data.
• Grouping and sampling: the data set is grouped and sampled according to actual requirements. It is divided into different subsets according to certain characteristics or attributes so as to be applied more accurately to different model construction tasks, and part of the samples are drawn from the whole data set for quick verification and analysis during model construction.
• Unified format: the processed and labeled data is assembled into data sets with a unified structure and format, which may be stored in a particular file format.
In summary, the data sources are classified into structured and unstructured types, the structured data is analyzed at the field level, the unstructured data is labeled to form a data set in a label format, the data sets are then grouped and sampled, a data set of unified type is finally formed, and it is made available through the container via network requests.
It should be noted that after the initial data set is grouped according to the preset grouping rule to obtain the target data set, the target data set is input into the preset virtualized container, and an application programming interface corresponding to the target data set is provided in the form of network requests. By feeding the resulting data set into a container and providing it to the user through network requests, the user can obtain the data set, including the raw data and data set labels, for subsequent model construction and analysis by accessing an API (Application Programming Interface) in the container. Through this data set implementation, the embodiment provides a unified, flexible and efficient method that lets a user conveniently process and use structured and unstructured data for model construction and analysis. During AI model training, the storage requirements of massive data and the proliferation of per-task training files are two common problems. This embodiment can form different data sets for different data sources to quickly construct a dedicated model, avoiding the wasted storage and low training efficiency of traditional methods, in which a large amount of storage space is needed and each task requires an independent training file. It also supports very large data sets well, solving the storage problem of massive data and the technical problem of per-task training file proliferation.
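The in-container API described above can be sketched with only the standard library; the route shape (`/datasets/<id>`), the dataset IDs and the stored contents are invented for illustration, and a real container would serve this handler over HTTP:

```python
import json

# In-container dataset store (contents invented for illustration).
DATASETS = {
    "ds-001": {"data": [[0.1, 0.2], [0.3, 0.4]], "labels": ["cat", "dog"]},
}

def handle_request(path):
    """Answer GET /datasets/<id> with the raw data and data set labels as JSON."""
    prefix = "/datasets/"
    if not path.startswith(prefix):
        return 404, json.dumps({"error": "not found"})
    ds = DATASETS.get(path[len(prefix):])
    if ds is None:
        return 404, json.dumps({"error": "unknown dataset"})
    return 200, json.dumps(ds)

status, body = handle_request("/datasets/ds-001")
```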
Step S12, determining parameter information of an initial model, generating an algorithm configuration file in a preset format according to the parameter information, inputting the algorithm configuration file into a preset virtualization container, and checking the mapping relation between the target data set and the algorithm configuration file by using the preset virtualization container.
In this embodiment, after the parameter information of the initial model is determined, an algorithm configuration file in a preset format is generated according to the parameter information, the algorithm configuration file is input into a preset virtualization container, whether the characteristics corresponding to the target data set and the algorithm configuration file are matched or not is determined by using the preset virtualization container, and if so, the mapping relationship between the target data set and the algorithm configuration file is determined to be correct.
And S13, if the mapping relation is correct, training the initial model by using a preset executor according to the mapping relation, and calling a preset parameter scheduler by using the preset executor in the training process to dynamically adjust the super parameters of a preset optimizer so as to optimize the initial model by using the preset optimizer to generate a target model.
In this embodiment, if the mapping relationship is determined to be correct, the preset executor is used to train the initial model according to the mapping relationship, and during training the preset executor calls the preset parameter scheduler to dynamically adjust the hyperparameters of the preset optimizer, so that the preset optimizer optimizes the initial model to generate the target model. The executor (Runner) is responsible for executing training, testing and inference tasks and managing all required components, and allows users to extend and execute custom logic through hooks (Hook); the optimizer wrapper performs back propagation to optimize the model; and the parameter scheduler dynamically adjusts the hyperparameters of the optimizer.
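The optimizer/scheduler interaction just described reduces to something like the following toy step-decay policy. The class names, the halving rule and the schedule are assumptions for illustration; a real system would wrap a framework optimizer:

```python
class Optimizer:
    """Stand-in for an optimizer wrapper; holds only the hyperparameter being scheduled."""
    def __init__(self, lr):
        self.lr = lr

class StepScheduler:
    """Parameter scheduler: multiply the learning rate by `factor` every `step` epochs."""
    def __init__(self, optimizer, step=2, factor=0.5):
        self.optimizer = optimizer
        self.step = step
        self.factor = factor

    def on_epoch_end(self, epoch):
        if (epoch + 1) % self.step == 0:
            self.optimizer.lr *= self.factor

opt = Optimizer(lr=0.1)
sched = StepScheduler(opt, step=2, factor=0.5)
for epoch in range(4):  # the executor would call this at the end of each training epoch
    sched.on_epoch_end(epoch)
```

After four epochs the learning rate has been halved twice (at epochs 1 and 3), from 0.1 down to 0.025.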
In the method, as shown in fig. 3, original data sent by a preset data source is acquired and classified to obtain corresponding structured data and unstructured data; an initial data set is generated, and a target data set is produced through data labeling, normalization and grouped sampling; parameter information of an initial model is determined, an algorithm configuration file in a preset format is generated according to the parameter information, the algorithm configuration file is input into a preset virtualization container, and the mapping relationship between the target data set and the algorithm configuration file is verified by the preset virtualization container. If the mapping relationship is correct, the initial model is trained by a preset executor according to the mapping relationship, and during training the preset executor calls a preset parameter scheduler to dynamically adjust the hyperparameters of the preset optimizer, so that the preset optimizer optimizes the initial model to generate a target model. An end-to-end containerized big data analysis AI model construction technique is thus provided to solve the above problems: the application can conveniently process different types of data and form a unified data set format through a unified data set implementation, can construct and train different types of algorithms in a one-key training manner through an adaptation algorithm implementation, and introduces an executor as the core module of the training engine to achieve efficient execution and management of training, testing and inference tasks.
Based on the above embodiment, the present application implements execution and management of the model training task through data set processing, an adaptation algorithm implementation, and the introduction of an executor. Next, this embodiment describes in detail the process of implementing the model training task by introducing the executor. Referring to fig. 4, the embodiment of the application discloses a specific end-to-end containerized big data analysis model construction method, which comprises the following steps:
Step S21, obtaining original data sent by a preset data source, classifying the original data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set.
Step S22, determining parameter information of an initial model, assembling the parameter information into a parameter dictionary according to a preset algorithm, storing the parameter dictionary in an algorithm configuration file in .yml or .json format, inputting the algorithm configuration file into a preset virtualization container, and checking the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container.
In this embodiment, the adaptation algorithm is implemented by assembling information such as hyperparameters and training strategies into dictionary form and inputting it into the container as a .yml or .json file, thereby providing a unified interface for training tasks. In this way, different types of algorithms can be run through one-key training, and a mapping relationship exists between the data set and the algorithm. When training a large model, different algorithms may have different evaluation requirements and metrics. The application lets multiple algorithms share one evaluation function during training: by introducing a unified evaluation framework and a flexible algorithm interface design, unified evaluation of multiple algorithms is achieved, providing accurate performance metrics and monitoring information to guide model training and tuning. This avoids the code duplication and maintenance burden of traditional methods, which design and implement an evaluation function separately for each algorithm and thereby increase the complexity and time cost of evaluation, and it solves the technical problem of unified model evaluation when training large models. It will be appreciated that this embodiment may select either a conventional machine learning model or a deep learning model as the model architecture.
As shown in fig. 5, the hyperparameters, training strategies and other information are assembled in dictionary form, input into the container through a .yml or .json file, and exposed through a unified interface to achieve one-key training of different types of algorithms; at the same time, the data set and the algorithm are checked to ensure that a correct mapping relationship exists, guaranteeing correct execution of the algorithm. Model training specifically includes the following steps:
• Parameter assembly: parameters including, but not limited to, learning rate, batch size, network structure and loss function are assembled into a dictionary according to the requirements of the specific algorithm, making the algorithm convenient to configure and manage.
• Configuration file input: the parameter dictionary is stored in a .yml or .json file and sent into the container as a file, so that parameters can be flexibly modified and adjusted without changing code, improving the adaptability and configurability of the algorithm.
• Checking: the data set and the algorithm are checked inside the container to ensure that a correct mapping relationship exists between them; if the characteristics of the data set do not match or are wrong, a corresponding error prompt and suggestion are given.
• One-key training: by reading the parameter dictionary in the configuration file, the container automatically loads the corresponding algorithm model according to the requirements of different algorithms and trains it with the designated data set, so that a user can quickly start a training task in one-key fashion without manually writing complex training code and procedures.
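"One-key training" as described above then amounts to: parse the configuration file, look the model up in a registry, and start training. A minimal sketch, in which the registry contents, builder signatures and config keys are illustrative assumptions:

```python
import json

# Registry mapping config "network" names to model builders (contents invented for illustration).
MODEL_REGISTRY = {
    "logreg": lambda cfg: {"type": "logreg", "lr": cfg["learning_rate"]},
    "mlp": lambda cfg: {"type": "mlp", "lr": cfg["learning_rate"],
                        "layers": cfg.get("layers", 2)},
}

def one_key_train(config_text):
    """Parse the algorithm configuration file and automatically load the matching model."""
    cfg = json.loads(config_text)
    try:
        build = MODEL_REGISTRY[cfg["network"]]
    except KeyError:
        raise ValueError(f"no algorithm registered for network {cfg['network']!r}")
    return build(cfg)  # training on the mapped data set would start from here

model = one_key_train('{"network": "mlp", "learning_rate": 0.01, "layers": 3}')
```

Because the algorithm is selected by name from the registry, swapping algorithms only means editing the configuration file, never the training code.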
Through this adaptation algorithm implementation, the embodiment provides a unified interface through which different types of data sets and different algorithms can be conveniently mapped and trained. Combining containerization technology with big data analysis algorithms gives users a unified platform on which AI models can be conveniently constructed and deployed to solve various practical problems; users can easily select and apply complex algorithms and adjust and optimize parameters as needed, achieving an efficient model construction and training process.
The technical scheme also introduces an efficient model format conversion algorithm that can quickly convert a model from one format to another. Based on the mapping relationship between model structure and parameters, and by optimizing computation and data transfer, the algorithm reduces the computation and storage overhead of the conversion process, improves conversion speed and efficiency, and solves the problem of quickly converting and generating model formats.
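A conversion based on a structure/parameter mapping could look roughly like the following sketch, in which the parameter names, the mapping table, and the plain-list "tensors" are all hypothetical stand-ins; parameter values are passed through by reference rather than copied, which is one way the storage overhead stays low:

```python
# Source-format parameters (names and values are illustrative).
source_model = {
    "conv1.weight": [0.1, 0.2],
    "conv1.bias": [0.0],
    "fc.weight": [0.5, 0.6],
}

# Hypothetical mapping between the two formats' parameter names.
name_map = {
    "conv1.weight": "features.0.kernel",
    "conv1.bias": "features.0.bias",
    "fc.weight": "classifier.kernel",
}

def convert(params, mapping):
    """Rename parameters into the target format; unmapped keys are rejected
    so a structural mismatch is caught before any model is produced."""
    missing = set(params) - set(mapping)
    if missing:
        raise KeyError(f"no mapping for: {sorted(missing)}")
    return {mapping[k]: v for k, v in params.items()}

target_model = convert(source_model, name_map)
```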
Step S23: if the mapping relationship is correct, the parameter dictionary in the algorithm configuration file is read, and the preset virtualization container loads the corresponding initial model according to the parameter dictionary, so that a preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model. During training, the preset executor sets hooks, through which user-defined requests are obtained, and the initial model is trained according to those requests.
In this embodiment, as shown in fig. 6, the core module of the training engine used for model training and model evaluation is an executor (Runner), which is responsible for executing and managing every component of the training, testing and inference tasks. The executor allows users to extend, insert and execute custom logic by setting hooks (Hook), which run at specific positions within training, testing and inference. In fig. 6, the data set (Dataset) is responsible for constructing the data needed by training, testing and inference tasks and passing it to the model; in practice, the data set is typically wrapped by a data loader (DataLoader), which starts multiple sub-processes to load data and thereby improve loading efficiency. The model (Model) accepts data as input and outputs a loss (during training) or predictions (during testing and inference); for distributed training and similar scenarios, the model is typically wrapped by a model wrapper (Model Wrapper) such as MMDistributedDataParallel. The optimizer wrapper (Optimizer) is responsible for back propagation to optimize the model and provides a unified interface supporting mixed-precision training and gradient accumulation. The parameter scheduler (Parameter Scheduler) dynamically adjusts the optimizer's hyperparameters, such as the learning rate and momentum, during training to improve the training effect. Between training rounds or in the test stage, the evaluation metrics and evaluator (Metrics & Evaluator) measure model performance: the evaluator assesses the model's predictions against the data set, and the metrics compute specific indices such as recall and accuracy.
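A toy version of the executor-and-hook mechanism might look like this; the Runner API and the hook-point names below are illustrative, not the actual training engine:

```python
class Runner:
    """Minimal executor: runs a training loop and fires registered hooks
    at fixed positions (before training, after each epoch, after training)."""

    def __init__(self, model_step, epochs=2):
        self.model_step = model_step  # callable doing one epoch of work
        self.epochs = epochs
        self.hooks = {"before_train": [], "after_epoch": [], "after_train": []}

    def register_hook(self, point, fn):
        self.hooks[point].append(fn)

    def _call(self, point, **kw):
        for fn in self.hooks[point]:
            fn(**kw)

    def train(self):
        self._call("before_train")
        for epoch in range(self.epochs):
            loss = self.model_step(epoch)
            self._call("after_epoch", epoch=epoch, loss=loss)
        self._call("after_train")

# User-defined logic is inserted without touching the loop itself:
log = []
runner = Runner(model_step=lambda e: 1.0 / (e + 1))  # fake loss per epoch
runner.register_hook("after_epoch", lambda epoch, loss: log.append((epoch, loss)))
runner.train()
# log now holds one (epoch, loss) record per epoch
```

A custom request from the user (logging, checkpointing, early stopping) would be just another callable registered at the appropriate hook point.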
In this way, through the cooperation of the executor and these components, the training engine can efficiently perform training, testing and inference tasks while remaining flexibly extensible, so users can customize and insert logic to meet different requirements and application scenarios. During training, this embodiment can optionally use a GPU (Graphics Processing Unit) or a CPU (Central Processing Unit) for model training, supports most current algorithms including traditional machine learning and deep learning algorithms, supports a variety of application scenarios, and monitors and controls containerization to achieve environment isolation and guarantee completion of training tasks.
For more specific processing in step S21, reference may be made to the corresponding content disclosed in the foregoing embodiment, and no further description is given here.
Through the above technical scheme, the method classifies the raw data to obtain structured data and unstructured data, generates an initial data set, and groups the initial data set to obtain target data sets. Parameter information of the initial model is determined and assembled into a parameter dictionary according to a preset algorithm; the parameter dictionary is saved as an algorithm configuration file in .yml or .json format and input into a preset virtualization container, where the mapping relationship between the target data set and the algorithm configuration file is checked. If the mapping relationship is correct, the parameter dictionary in the algorithm configuration file is read and the preset virtualization container loads the corresponding initial model according to it, so that a preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model; during training, the preset executor sets hooks through which user-defined requests are obtained and applied to training.
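The classify-then-group step summarized above can be sketched as follows, with a hypothetical record schema and a hypothetical grouping attribute "scene" standing in for the preset grouping rule:

```python
# Raw records: those carrying tabular fields are treated as structured,
# free-form records as unstructured (schema is illustrative).
raw = [
    {"fields": {"age": 31}, "scene": "finance"},
    {"text": "free-form note", "scene": "finance"},
    {"fields": {"age": 45}, "scene": "medical"},
]

structured = [r for r in raw if "fields" in r]
# Unstructured records get a label slot, awaiting annotation.
unstructured = [{**r, "label": None} for r in raw if "text" in r]
initial_dataset = structured + unstructured

def group_by(dataset, key):
    """Group the initial data set by a data attribute (the grouping rule)."""
    groups = {}
    for record in dataset:
        groups.setdefault(record[key], []).append(record)
    return groups

# One target data set per group; each can back its own target model.
target_datasets = group_by(initial_dataset, "scene")
```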
Through the classification of data sets, the adaptation of algorithms and the core module of the training engine, this embodiment achieves flexible, efficient and extensible model construction across different application scenarios. It provides a flexible, efficient and unified AI model construction platform that can meet model construction requirements in different scenarios, simplifies the model construction process, and improves development efficiency and model accuracy. The containerized, end-to-end design also makes it convenient to integrate different components and tools, enabling rapid iteration and deployment, resolving the complexity of AI model construction, and realizing an end-to-end model construction technique.
Referring to fig. 7, the embodiment of the application also discloses an end-to-end containerized big data analysis model construction device, which comprises:
The data set grouping module 11 is configured to acquire raw data sent by a preset data source, classify the raw data to obtain corresponding structured data and unstructured data, generate an initial data set based on the structured data and the unstructured data, and group the initial data set according to a preset grouping rule to obtain a target data set;
The configuration file verification module 12 is configured to determine parameter information of an initial model, generate an algorithm configuration file in a preset format according to the parameter information, input the algorithm configuration file into a preset virtualization container, and verify a mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container;
and the model training module 13 is configured to, if the mapping relationship is correct, train the initial model according to the mapping relationship by using a preset executor, and during training call a preset parameter scheduler through the preset executor to dynamically adjust the hyperparameters of a preset optimizer, so that the preset optimizer optimizes the initial model to generate a target model.
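The scheduler-driven hyperparameter adjustment performed by the model training module can be sketched as below; the step-decay schedule and all numeric values are assumed purely for illustration:

```python
class Optimizer:
    """Stub optimizer whose only tunable hyperparameter is the learning rate."""
    def __init__(self, lr):
        self.lr = lr

class StepScheduler:
    """Parameter scheduler: decays the optimizer's learning rate
    by `gamma` every `step` epochs."""
    def __init__(self, optimizer, step=2, gamma=0.1):
        self.opt, self.step, self.gamma = optimizer, step, gamma

    def adjust(self, epoch):
        if epoch > 0 and epoch % self.step == 0:
            self.opt.lr *= self.gamma

# The executor would call the scheduler once per epoch during training.
opt = Optimizer(lr=0.1)
sched = StepScheduler(opt)
history = []
for epoch in range(4):
    sched.adjust(epoch)
    history.append(round(opt.lr, 6))
```

Swapping in a different schedule (cosine decay, warm-up, momentum adjustment) would only replace the scheduler object, leaving the training loop untouched.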
In this embodiment, raw data sent by a preset data source is acquired and classified into corresponding structured data and unstructured data; an initial data set is generated from them and grouped according to a preset grouping rule to obtain a target data set. Parameter information of an initial model is determined, an algorithm configuration file in a preset format is generated from it and input into a preset virtualization container, and the mapping relationship between the target data set and the algorithm configuration file is checked using the container. If the mapping relationship is correct, a preset executor trains the initial model according to the mapping relationship and, during training, calls a preset parameter scheduler to dynamically adjust the hyperparameters of a preset optimizer, so that the optimizer optimizes the initial model to generate a target model. In this way, the application can conveniently process different types of data through a unified data set implementation, forming a unified data set format; can construct and train different types of algorithms in one-key fashion through the adaptation-algorithm implementation; and, by introducing the executor as the core module of the training engine, achieves efficient execution and management of training, testing and inference tasks.
In some specific embodiments, the data set grouping module 11 specifically includes:
the first data preprocessing unit is used for performing data quality inspection, data cleaning and conversion operation on the fields of the structured data and generating a corresponding structured data set;
the second data preprocessing unit is used for marking the unstructured data and generating an unstructured data set in a label format;
And the data set generating unit is used for generating an initial data set according to the structured data set and the unstructured data set.
In some specific embodiments, the data set grouping module 11 further includes:
the interface determining unit is used for inputting the target data set into the preset virtualized container and providing an application programming interface corresponding to the target data set in a network request mode.
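Providing an application programming interface for the target data set in the form of a network request could be sketched with the Python standard library alone; the `/dataset` route and the payload shape are assumptions for illustration:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical target data set held inside the container.
TARGET_DATASET = {"name": "structured_v1", "records": [{"age": 31}, {"age": 45}]}

class DatasetAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/dataset":
            body = json.dumps(TARGET_DATASET).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), DatasetAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client (e.g. the training step) fetches the data set over HTTP.
url = f"http://127.0.0.1:{server.server_port}/dataset"
payload = json.loads(urlopen(url).read())
server.shutdown()
```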
In some embodiments, the configuration file verification module 12 specifically includes:
and the parameter dictionary assembling unit is used for assembling the parameter information into a parameter dictionary according to a preset algorithm and saving the parameter dictionary as an algorithm configuration file in .yml or .json format.
In some embodiments, the model training module 13 specifically includes:
the model training unit is used for reading the parameter dictionary in the algorithm configuration file, loading the corresponding initial model by using the preset virtualization container according to the parameter dictionary so as to read the corresponding target data set by using the preset executor according to the mapping relation, and training the initial model.
In some embodiments, the configuration file verification module 12 specifically includes:
And the mapping judgment unit is used for determining whether the characteristics corresponding to the target data set and the algorithm configuration file are matched by utilizing the preset virtualization container, and if so, judging that the mapping relation between the target data set and the algorithm configuration file is correct.
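The feature-matching check performed by the mapping judgment unit could be sketched as below; the feature names and the wording of the error message are illustrative:

```python
def check_mapping(dataset_features, config):
    """Compare the features the algorithm configuration expects with the
    features the target data set actually provides; on mismatch, return an
    error message with a suggestion instead of proceeding to training."""
    expected = set(config["expected_features"])
    actual = set(dataset_features)
    if expected <= actual:
        return True, "mapping correct"
    missing = sorted(expected - actual)
    return False, (
        f"feature mismatch, missing {missing}; "
        "regroup the data set or edit the algorithm configuration file"
    )

ok, msg = check_mapping(["age", "income"], {"expected_features": ["age", "income"]})
bad, err = check_mapping(["age"], {"expected_features": ["age", "income"]})
```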
In some embodiments, the model training module 13 further includes:
the hook setting unit is used for setting a hook by using the preset executor so as to acquire a user-defined request of a user in the training process according to the hook, and training the initial model through the user-defined request.
Further, the embodiment of the present application also discloses an electronic device. Fig. 8 is a block diagram of an electronic device 20 according to an exemplary embodiment; nothing in the figure should be considered a limitation on the scope of the application.
Fig. 8 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may include, in particular, at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement relevant steps in the end-to-end containerized big data analysis AI model construction method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 provides working voltages for the hardware devices on the electronic device 20; the communication interface 24 creates a data transmission channel between the electronic device 20 and external devices, following any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; and the input/output interface 25 obtains external input data or outputs data to external devices, its specific interface type being selectable according to the needs of the particular application and likewise not specifically limited herein.
As a carrier for storing resources, the memory 22 may be a read-only memory, a random access memory, a magnetic disk or an optical disk; the resources stored on it may include an operating system 221, a computer program 222 and the like, and the storage may be temporary or permanent.
The operating system 221 manages and controls the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, etc. In addition to the computer program for performing the end-to-end containerized big data analysis AI model construction method disclosed in any of the previous embodiments and executed by the electronic device 20, the computer program 222 may further include computer programs for performing other specific tasks.
Furthermore, the application also discloses a computer readable storage medium for storing a computer program, wherein the computer program is executed by a processor to realize the method for constructing the end-to-end containerized big data analysis AI model. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
The foregoing describes the principles and embodiments of the present application through specific examples, which are provided only to assist in understanding the method of the application and its core idea. For those of ordinary skill in the art, variations may be made to the specific embodiments and the scope of application in light of the idea of the present application; accordingly, the content of this specification should not be construed as limiting the application.

Claims (7)

1. A method for constructing an end-to-end containerized big data analysis model, comprising: acquiring raw data sent by a preset data source, classifying the raw data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set; determining parameter information of an initial model, generating an algorithm configuration file in a preset format according to the parameter information, inputting the algorithm configuration file into a preset virtualization container, and verifying a mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container; if the mapping relationship is correct, training the initial model according to the mapping relationship by using a preset executor, and during training calling a preset parameter scheduler through the preset executor to dynamically adjust hyperparameters of a preset optimizer, so that the preset optimizer optimizes the initial model to generate a target model; wherein generating the algorithm configuration file in the preset format according to the parameter information comprises: assembling the parameter information into a parameter dictionary according to a preset algorithm, and saving the parameter dictionary as an algorithm configuration file in .yml or .json format; correspondingly, training the initial model according to the mapping relationship by using the preset executor comprises: reading the parameter dictionary in the algorithm configuration file, and loading the corresponding initial model according to the parameter dictionary by using the preset virtualization container, so that the preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model; and verifying the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container comprises: determining, by using the preset virtualization container, whether features corresponding to the target data set and the algorithm configuration file match, and if they match, determining that the mapping relationship between the target data set and the algorithm configuration file is correct; and grouping the initial data set according to the preset grouping rule to obtain the target data set comprises: grouping the initial data set according to the preset grouping rule to obtain several target data sets, so as to construct the corresponding target model according to each target data set, the preset grouping rule being a rule constructed according to features or attributes of the data in the initial data set.
2. The end-to-end containerized big data analysis model construction method according to claim 1, wherein generating the initial data set based on the structured data and the unstructured data comprises: performing data quality checks, data cleaning and conversion operations on fields of the structured data, and generating a corresponding structured data set; annotating the unstructured data to generate an unstructured data set in a label format; and generating the initial data set according to the structured data set and the unstructured data set.
3. The end-to-end containerized big data analysis model construction method according to claim 1, further comprising, after grouping the initial data set according to the preset grouping rule to obtain the target data set: inputting the target data set into the preset virtualization container, and providing an application programming interface corresponding to the target data set in the form of a network request.
4. The end-to-end containerized big data analysis model construction method according to any one of claims 1 to 3, wherein training the initial model according to the mapping relationship by using the preset executor further comprises: setting a hook by using the preset executor, so as to obtain a user-defined request of a user during training according to the hook, and training the initial model through the user-defined request.
5. A device for constructing an end-to-end containerized big data analysis model, comprising: a data set grouping module configured to acquire raw data sent by a preset data source, classify the raw data to obtain corresponding structured data and unstructured data, generate an initial data set based on the structured data and the unstructured data, and group the initial data set according to a preset grouping rule to obtain a target data set; a configuration file verification module configured to determine parameter information of an initial model, generate an algorithm configuration file in a preset format according to the parameter information, input the algorithm configuration file into a preset virtualization container, and verify a mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container; and a model training module configured to, if the mapping relationship is correct, train the initial model according to the mapping relationship by using a preset executor, and during training call a preset parameter scheduler through the preset executor to dynamically adjust hyperparameters of a preset optimizer, so that the preset optimizer optimizes the initial model to generate a target model; wherein the configuration file verification module specifically comprises: a parameter dictionary assembling unit configured to assemble the parameter information into a parameter dictionary according to a preset algorithm, and save the parameter dictionary as an algorithm configuration file in .yml or .json format; correspondingly, the model training module specifically comprises: a model training unit configured to read the parameter dictionary in the algorithm configuration file and load the corresponding initial model according to the parameter dictionary by using the preset virtualization container, so that the preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model; and the configuration file verification module specifically comprises: a mapping judgment unit configured to determine, by using the preset virtualization container, whether features corresponding to the target data set and the algorithm configuration file match, and if they match, determine that the mapping relationship between the target data set and the algorithm configuration file is correct.
6. An electronic device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the computer program is loaded and executed by the processor to implement the end-to-end containerized big data analysis model construction method according to any one of claims 1 to 4.
7. A computer-readable storage medium for storing a computer program which, when executed by a processor, implements the end-to-end containerized big data analysis model construction method according to any one of claims 1 to 4.
CN202311282136.1A 2023-09-28 2023-09-28 End-to-end containerized big data model construction method, device, equipment and medium Active CN117235527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311282136.1A CN117235527B (en) 2023-09-28 2023-09-28 End-to-end containerized big data model construction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311282136.1A CN117235527B (en) 2023-09-28 2023-09-28 End-to-end containerized big data model construction method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117235527A CN117235527A (en) 2023-12-15
CN117235527B true CN117235527B (en) 2025-02-11

Family

ID=89094636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311282136.1A Active CN117235527B (en) 2023-09-28 2023-09-28 End-to-end containerized big data model construction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117235527B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891531B (en) * 2024-03-14 2024-06-14 蒲惠智造科技股份有限公司 System parameter configuration method, system, medium and electronic equipment for SAAS software

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463301A (en) * 2020-11-30 2021-03-09 常州微亿智造科技有限公司 Container-based model training test tuning and deployment method and device
CN115080021A (en) * 2022-05-13 2022-09-20 北京思特奇信息技术股份有限公司 Zero code modeling method and system based on automatic machine learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325541A (en) * 2018-09-30 2019-02-12 北京字节跳动网络技术有限公司 Method and apparatus for training pattern
CN112883654B (en) * 2021-03-24 2023-01-31 国家超级计算天津中心 Model training system based on data driving
US20230229735A1 (en) * 2022-01-18 2023-07-20 Chime Financial, Inc. Training and implementing machine-learning models utilizing model container workflows
CN114116684B (en) * 2022-01-27 2022-05-24 中国传媒大学 Version management method of deep learning large model and large data set based on Docker containerization

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463301A (en) * 2020-11-30 2021-03-09 常州微亿智造科技有限公司 Container-based model training test tuning and deployment method and device
CN115080021A (en) * 2022-05-13 2022-09-20 北京思特奇信息技术股份有限公司 Zero code modeling method and system based on automatic machine learning

Also Published As

Publication number Publication date
CN117235527A (en) 2023-12-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant