CN117235527B - End-to-end containerized big data model construction method, device, equipment and medium - Google Patents


Info

Publication number
CN117235527B
CN117235527B (application CN202311282136.1A)
Authority
CN
China
Prior art keywords
preset
data set
data
configuration file
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311282136.1A
Other languages
Chinese (zh)
Other versions
CN117235527A (en)
Inventor
王蒴
孙召进
李希明
王沛
王龙振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Civic Se Commercial Middleware Co ltd
Original Assignee
Shandong Civic Se Commercial Middleware Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Civic Se Commercial Middleware Co ltd filed Critical Shandong Civic Se Commercial Middleware Co ltd
Priority to CN202311282136.1A
Publication of CN117235527A
Application granted
Publication of CN117235527B
Legal status: Active


Classifications

    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management (climate change mitigation technologies in information and communication technologies [ICT])
    • Y02T 10/40 — Engine management systems (climate change mitigation technologies related to transportation; internal combustion engine [ICE] based vehicles)

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract


The present application discloses an end-to-end containerized big data analysis model construction method, device, equipment and storage medium, relating to the field of machine learning. The method includes: acquiring raw data and classifying it to obtain structured data and unstructured data, then grouping to obtain a target data set; determining parameter information of an initial model, generating an algorithm configuration file from the parameter information, and inputting it into a preset virtualized container to verify the mapping relationship between the target data set and the algorithm configuration file; and, if the relationship is correct, training the initial model according to the mapping relationship with a preset executor, which during training calls a preset parameter scheduler to dynamically adjust the hyperparameters of a preset optimizer, thereby optimizing the initial model into a target model. Through data set classification, algorithm adaptation and the executor, model construction becomes flexible, efficient and scalable across different scenarios, and control of the virtualized container provides environment isolation that ensures the completion of training tasks.

Description

End-to-end containerized big data model construction method, device, equipment and medium
Technical Field
The present invention relates to the field of machine learning, and in particular, to a method, an apparatus, a device, and a storage medium for constructing an end-to-end containerized big data analysis model.
Background
In today's age of information explosion, big data analysis and artificial intelligence are important forces driving social development and business innovation. However, building an efficient, accurate AI (Artificial Intelligence) model often faces a series of challenges, including differing model construction requirements across application scenarios, the diversity and complexity of data sets, algorithm adaptation, and the complexity of the training process.
First, different application scenarios place very different requirements on an AI model, so a flexible model construction platform is needed to meet these diverse demands. Traditional model building methods often require specific code to be written for each scenario, increasing the complexity of development and maintenance. Second, the diversity and complexity of data sets is itself a challenge when constructing AI models. Real-world data includes both structured data (e.g., tabular data) and unstructured data (e.g., images, text, etc.). Different data types require different preprocessing, feature extraction and labeling methods, so a unified data set implementation is required. Furthermore, algorithm adaptation and the training process are key issues in constructing an AI model. Different algorithms impose different requirements on hyperparameters, training strategies, optimizers and the like, making the model construction process complex and error-prone. Moreover, conventional training processes often require extensive training code and procedures to be written, which hinders rapid experimentation and iteration. How to provide a flexible, efficient and unified AI model building platform that meets model construction requirements under different application scenarios is therefore a problem to be solved in the art.
Disclosure of Invention
In view of the above, the present invention aims to provide an end-to-end containerized big data analysis model construction method, apparatus, device and storage medium, which achieve flexibility, efficiency and extensibility of model construction under different scenarios by means of data set classification, algorithm adaptation and an executor, and achieve environment isolation through control of a virtualized container, thereby ensuring the completion of training tasks. The specific scheme is as follows:
in a first aspect, the present application provides a method for constructing an end-to-end containerized big data analysis AI model, including:
acquiring original data sent by a preset data source, classifying the original data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set;
determining parameter information of an initial model, generating an algorithm configuration file in a preset format according to the parameter information, inputting the algorithm configuration file into a preset virtualization container, and checking the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container;
if the mapping relationship is correct, training the initial model by using a preset executor according to the mapping relationship, and calling a preset parameter scheduler by using the preset executor during training to dynamically adjust the hyperparameters of a preset optimizer, so as to optimize the initial model with the preset optimizer and generate a target model.
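The three steps above can be sketched as a minimal pipeline. All function names, the dict/non-dict classification rule, and the 80/20 grouping rule below are illustrative assumptions, not taken from the patent:

```python
def classify(raw_records):
    """Split raw records into structured (dict-like) and unstructured (everything else)."""
    structured = [r for r in raw_records if isinstance(r, dict)]
    unstructured = [r for r in raw_records if not isinstance(r, dict)]
    return structured, unstructured

def group(initial_dataset, train_ratio=0.8):
    """A preset grouping rule: simple train/test split by position."""
    cut = int(len(initial_dataset) * train_ratio)
    return {"train": initial_dataset[:cut], "test": initial_dataset[cut:]}

raw = [{"age": 31}, {"age": 45}, "a text document", {"age": 52}, "cat.jpg"]
structured, unstructured = classify(raw)
target_dataset = group(structured + unstructured)
```

A real system would replace the toy classification rule with format detection (tables and CSV versus text, images, audio), but the data flow is the same.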
Optionally, the generating an initial data set based on the structured data and the unstructured data includes:
performing data quality inspection, data cleaning and conversion operations on the fields of the structured data to generate a corresponding structured data set;
labeling the unstructured data to generate an unstructured data set in a label format;
generating an initial data set from the structured data set and the unstructured data set.
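One way the three optional steps above could look in code. The cleaning policy (drop incomplete rows, cast numeric strings) and the labeling callback are illustrative assumptions:

```python
def clean_structured(rows):
    """Quality inspection, cleaning and conversion: drop incomplete rows, cast numeric strings."""
    cleaned = []
    for row in rows:
        if any(v is None or v == "" for v in row.values()):
            continue  # quality check: reject rows with missing fields
        cleaned.append({k: float(v) if isinstance(v, str) and v.replace(".", "", 1).isdigit() else v
                        for k, v in row.items()})
    return cleaned

def label_unstructured(items, labeler):
    """Wrap each raw item with a label, producing a label-format data set."""
    return [{"data": item, "label": labeler(item)} for item in items]

structured_set = clean_structured([{"age": "31"}, {"age": None}, {"age": "45.5"}])
unstructured_set = label_unstructured(["good movie", "bad movie"],
                                      labeler=lambda t: "pos" if "good" in t else "neg")
initial_dataset = structured_set + unstructured_set
```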
Optionally, after the initial data set is grouped according to a preset grouping rule to obtain a target data set, the method further includes:
inputting the target data set into the preset virtualized container, and providing an application programming interface corresponding to the target data set in the form of a network request.
Optionally, the generating an algorithm configuration file in a preset format according to the parameter information includes:
assembling the parameter information into a parameter dictionary according to a preset algorithm, and storing the parameter dictionary in an algorithm configuration file in .yml or .json format.
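For instance, using the standard-library `json` module (a .yml file would work the same way through a YAML library; the parameter names in the dictionary are illustrative, not from the patent):

```python
import json
import os
import tempfile

# Assemble the parameter information into a parameter dictionary.
param_dict = {
    "network": "resnet18",
    "learning_rate": 0.001,
    "batch_size": 32,
    "loss": "cross_entropy",
}

# Store the dictionary as an algorithm configuration file in JSON format.
config_path = os.path.join(tempfile.gettempdir(), "algo_config.json")
with open(config_path, "w") as f:
    json.dump(param_dict, f, indent=2)

# The container later reads the file back into the same dictionary.
with open(config_path) as f:
    loaded = json.load(f)
```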
Optionally, the training the initial model by using a preset executor according to the mapping relationship includes:
reading the parameter dictionary in the algorithm configuration file, and loading the corresponding initial model according to the parameter dictionary by using the preset virtualization container, so that the preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model.
Optionally, the verifying, by using the preset virtualization container, the mapping relationship between the target data set and the algorithm configuration file includes:
determining, by using the preset virtualization container, whether the features corresponding to the target data set and the algorithm configuration file match; if so, judging that the mapping relationship between the target data set and the algorithm configuration file is correct.
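A minimal version of this container-side check. The metadata layout (a `features` list on each side) is an assumption for illustration:

```python
def mapping_is_correct(dataset_meta, algo_config):
    """Verify the data set/config mapping: does the data set provide every feature the algorithm expects?"""
    expected = set(algo_config.get("features", []))
    provided = set(dataset_meta.get("features", []))
    return expected.issubset(provided)

dataset_meta = {"features": ["age", "income", "label"]}
good_config = {"features": ["age", "income"]}
bad_config = {"features": ["age", "height"]}   # "height" is missing from the data set
```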
Optionally, the training the initial model by using a preset executor according to the mapping relationship further includes:
setting a hook by using the preset executor, so as to acquire a user-defined request from the user during training according to the hook and train the initial model through the user-defined request.
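The hook mechanism amounts to a generic event-callback pattern; a sketch follows, in which the event names and the `Runner` skeleton are assumptions rather than the patent's actual interface:

```python
class Runner:
    """Executor skeleton: fires user-registered hooks at fixed points in the training loop."""

    def __init__(self):
        self._hooks = {"before_epoch": [], "after_epoch": []}

    def register_hook(self, event, callback):
        self._hooks[event].append(callback)

    def _fire(self, event, **context):
        for callback in self._hooks[event]:
            callback(**context)

    def train(self, epochs):
        for epoch in range(epochs):
            self._fire("before_epoch", epoch=epoch)
            # ... one epoch of actual training would run here ...
            self._fire("after_epoch", epoch=epoch)

runner = Runner()
seen = []
runner.register_hook("after_epoch", lambda epoch: seen.append(epoch))
runner.train(3)
```

Custom user requests (logging, checkpointing, early stopping) plug in as callbacks without modifying the training loop itself.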
In a second aspect, the present application provides an end-to-end containerized big data analysis model building apparatus, including:
a data set grouping module, used for obtaining original data sent by a preset data source, classifying the original data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set;
a configuration file verification module, used for determining parameter information of an initial model, generating an algorithm configuration file in a preset format according to the parameter information, inputting the algorithm configuration file into a preset virtualization container, and verifying the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container;
a model training module, used for training the initial model according to the mapping relationship by using a preset executor if the mapping relationship is correct, and calling a preset parameter scheduler by using the preset executor during training to dynamically adjust the hyperparameters of the preset optimizer, so as to optimize the initial model with the preset optimizer and generate a target model.
In a third aspect, the application provides an electronic device comprising a processor and a memory, wherein the memory is used for storing a computer program, and the computer program is loaded and executed by the processor to realize the big data analysis model construction method of end-to-end containerization.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which when executed by a processor implements the foregoing end-to-end containerized big data analysis model construction method.
The method comprises: obtaining original data sent by a preset data source; classifying the original data to obtain corresponding structured data and unstructured data; generating an initial data set based on the structured data and the unstructured data; grouping the initial data set according to a preset grouping rule to obtain a target data set; determining parameter information of an initial model; generating an algorithm configuration file in a preset format according to the parameter information; inputting the algorithm configuration file into a preset virtualization container; checking the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container; and, if the mapping relationship is correct, training the initial model by using a preset executor according to the mapping relationship, the preset executor calling a preset parameter scheduler during training to dynamically adjust the hyperparameters of the preset optimizer so as to optimize the initial model and generate a target model. In this way, the application can conveniently process different types of data through a unified data set implementation and form a unified data set format, and can construct and train different types of algorithms in a one-key training manner through an adaptive algorithm implementation. Meanwhile, by introducing an executor as the core module of the training engine, efficient execution and management of training, testing and inference tasks is achieved. Many otherwise independent steps required for AI model construction, such as data collection, data preprocessing, feature extraction, model selection and hyperparameter tuning, are thereby avoided, simplifying the construction of an AI model for non-specialists.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for constructing an end-to-end containerized big data analysis model provided by the application;
FIG. 2 is a flow chart of data set integration provided by the present application;
FIG. 3 is a flow chart of model training provided by the present application;
FIG. 4 is a flowchart of a specific end-to-end containerized big data analysis model construction method provided by the application;
FIG. 5 is a containerization flow chart provided by the present application;
FIG. 6 is a deep learning flow chart provided by the present application;
FIG. 7 is a schematic diagram of a device for constructing an end-to-end containerized big data analysis model according to the present application;
Fig. 8 is a block diagram of an electronic device according to the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present, constructing an efficient and accurate AI model often faces problems such as differing model construction requirements under different application scenarios, the diversity and complexity of data sets, algorithm adaptation, and the complexity of the training process. The present application can conveniently process different types of data through a unified data set implementation and form a unified data set format, can construct and train different types of algorithms in a one-key training manner through an adaptation algorithm implementation, and introduces an executor as the core module of the training engine, thereby achieving efficient execution and management of training, testing and inference tasks.
Referring to fig. 1, the embodiment of the invention discloses a method for constructing a big data analysis model of end-to-end containerization, which comprises the following steps:
Step S11, obtaining original data sent by a preset data source, classifying the original data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set.
In this embodiment, raw data in a data source is first acquired and classified to obtain corresponding structured data and unstructured data; an initial data set is generated based on the classified structured and unstructured data; and the initial data set is then grouped according to a preset grouping rule to obtain a target data set. Specifically, the original data in the data source is cleaned, extracted and filtered, then labeled to form a data set, and the data set is grouped into a training set and a test set, so that an adapted algorithm can subsequently be assembled for different data sets and the training, evaluation and other processes can be carried out in a container.
In the process of classifying, processing and labeling the data sources, data quality inspection, data cleaning and conversion operations are first performed on the fields of the structured data to generate corresponding structured data sets; the unstructured data is labeled to generate unstructured data sets in a label format; and initial data sets are then generated from the structured and unstructured data sets. As shown in fig. 2, generating the data set specifically includes the following steps:
• Data classification: the original data is classified into structured data and unstructured data according to its characteristics. Structured data refers to data with fixed formats and fields, such as database tables and CSV (Comma-Separated Values) files; unstructured data refers to data without fixed formats or fields, such as text documents, images, audio and video.
• Structured data analysis: analysis is performed at the field level. Data quality inspection, data cleaning and conversion operations are first carried out on each field to ensure the accuracy and consistency of the data; the fields are then subjected to statistics, aggregation, sorting and other operations according to analysis requirements to obtain useful information and insight.
• Unstructured data labeling: the unstructured data is labeled to generate a data set in a label format. Labeling means giving the unstructured data a specific label or mark for subsequent model training and analysis, for example object detection and image classification labels for image data, or named entity recognition and sentiment analysis labels for text data.
• Grouping and sampling: the data set is grouped and sampled according to actual requirements. It is divided into different subsets according to certain characteristics or attributes so as to be applied more accurately to different model construction tasks, and part of the samples are drawn from the whole data set for quick verification and analysis during model construction.
• Unified format: the processed and labeled data is assembled into data sets with a unified structure and format, which may be stored in a particular file format.
In summary, the data sources are classified into structured and unstructured types, the structured data is analyzed at the field level, the unstructured data is labeled to form a data set in a label format, the data sets are then grouped and sampled, a data set of unified type is finally formed, and it is made available through the container via network requests.
It should be noted that after the initial data set is grouped according to the preset grouping rule to obtain the target data set, the target data set is input into the preset virtualized container, and an application programming interface corresponding to the target data set is provided in the form of network requests. By feeding the resulting data set into a container and providing it to the user through network requests, the user can obtain the data set, including the raw data and data set labels, for subsequent model construction and analysis by accessing an API (Application Programming Interface) in the container. Through this data set implementation, the embodiment provides a unified, flexible and efficient method that lets a user conveniently process and use structured and unstructured data for model construction and analysis. During AI model training, the storage requirements of massive data and the proliferation of per-task training files are two common problems. This embodiment can form different data sets for different data sources to quickly construct a dedicated model, avoiding the wasted storage and low training efficiency of traditional methods, in which a large amount of storage space is needed and each task requires an independent training file. It also supports very large data sets well, solving the storage problem of massive data and the technical problem of per-task training file proliferation.
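The in-container API described above can be sketched with only the standard library; the route shape (`/datasets/<id>`), the dataset IDs and the stored contents are invented for illustration, and a real container would serve this handler over HTTP:

```python
import json

# In-container dataset store (contents invented for illustration).
DATASETS = {
    "ds-001": {"data": [[0.1, 0.2], [0.3, 0.4]], "labels": ["cat", "dog"]},
}

def handle_request(path):
    """Answer GET /datasets/<id> with the raw data and data set labels as JSON."""
    prefix = "/datasets/"
    if not path.startswith(prefix):
        return 404, json.dumps({"error": "not found"})
    ds = DATASETS.get(path[len(prefix):])
    if ds is None:
        return 404, json.dumps({"error": "unknown dataset"})
    return 200, json.dumps(ds)

status, body = handle_request("/datasets/ds-001")
```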
Step S12, determining parameter information of an initial model, generating an algorithm configuration file in a preset format according to the parameter information, inputting the algorithm configuration file into a preset virtualization container, and checking the mapping relation between the target data set and the algorithm configuration file by using the preset virtualization container.
In this embodiment, after the parameter information of the initial model is determined, an algorithm configuration file in a preset format is generated according to the parameter information, the algorithm configuration file is input into a preset virtualization container, whether the characteristics corresponding to the target data set and the algorithm configuration file are matched or not is determined by using the preset virtualization container, and if so, the mapping relationship between the target data set and the algorithm configuration file is determined to be correct.
And S13, if the mapping relation is correct, training the initial model by using a preset executor according to the mapping relation, and calling a preset parameter scheduler by using the preset executor in the training process to dynamically adjust the super parameters of a preset optimizer so as to optimize the initial model by using the preset optimizer to generate a target model.
In this embodiment, if the mapping relationship is determined to be correct, the preset executor is used to train the initial model according to the mapping relationship, and during training the preset executor calls the preset parameter scheduler to dynamically adjust the hyperparameters of the preset optimizer, so that the preset optimizer optimizes the initial model to generate the target model. The executor (Runner) is responsible for executing training, testing and inference tasks and managing all required components, and allows users to extend and execute custom logic through hooks (Hook); the optimizer wrapper performs back propagation to optimize the model; and the parameter scheduler dynamically adjusts the hyperparameters of the optimizer.
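The optimizer/scheduler interaction just described reduces to something like the following toy step-decay policy. The class names, the halving rule and the schedule are assumptions for illustration; a real system would wrap a framework optimizer:

```python
class Optimizer:
    """Stand-in for an optimizer wrapper; holds only the hyperparameter being scheduled."""
    def __init__(self, lr):
        self.lr = lr

class StepScheduler:
    """Parameter scheduler: multiply the learning rate by `factor` every `step` epochs."""
    def __init__(self, optimizer, step=2, factor=0.5):
        self.optimizer = optimizer
        self.step = step
        self.factor = factor

    def on_epoch_end(self, epoch):
        if (epoch + 1) % self.step == 0:
            self.optimizer.lr *= self.factor

opt = Optimizer(lr=0.1)
sched = StepScheduler(opt, step=2, factor=0.5)
for epoch in range(4):  # the executor would call this at the end of each training epoch
    sched.on_epoch_end(epoch)
```

After four epochs the learning rate has been halved twice (at epochs 1 and 3), from 0.1 down to 0.025.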
In the method, as shown in fig. 3, original data sent by a preset data source is acquired and classified to obtain corresponding structured data and unstructured data; an initial data set is generated, and a target data set is produced through data labeling, normalization and grouped sampling; parameter information of an initial model is determined, an algorithm configuration file in a preset format is generated according to the parameter information, the algorithm configuration file is input into a preset virtualization container, and the mapping relationship between the target data set and the algorithm configuration file is verified by the preset virtualization container. If the mapping relationship is correct, the initial model is trained by a preset executor according to the mapping relationship, and during training the preset executor calls a preset parameter scheduler to dynamically adjust the hyperparameters of the preset optimizer, so that the preset optimizer optimizes the initial model to generate a target model. An end-to-end containerized big data analysis AI model construction technique is thus provided to solve the above problems: the application can conveniently process different types of data and form a unified data set format through a unified data set implementation, can construct and train different types of algorithms in a one-key training manner through an adaptation algorithm implementation, and introduces an executor as the core module of the training engine to achieve efficient execution and management of training, testing and inference tasks.
Based on the above embodiment, the present application implements execution and management of the model training task through data set processing, an adaptation algorithm implementation, and the introduction of an executor. Next, this embodiment describes in detail the process of implementing the model training task by introducing the executor. Referring to fig. 4, the embodiment of the application discloses a specific end-to-end containerized big data analysis model construction method, which comprises the following steps:
Step S21, obtaining original data sent by a preset data source, classifying the original data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set.
Step S22, determining parameter information of an initial model, assembling the parameter information into a parameter dictionary according to a preset algorithm, storing the parameter dictionary in an algorithm configuration file in .yml or .json format, inputting the algorithm configuration file into a preset virtualization container, and checking the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container.
In this embodiment, the adaptation algorithm is implemented by assembling information such as hyperparameters and training strategies into dictionary form and inputting it into the container as a .yml or .json file, thereby providing a unified interface for training tasks. In this way, different types of algorithms can be run through one-key training, and a mapping relationship exists between the data set and the algorithm. When training a large model, different algorithms may have different evaluation requirements and metrics. The application lets multiple algorithms share one evaluation function during training: by introducing a unified evaluation framework and a flexible algorithm interface design, unified evaluation of multiple algorithms is achieved, providing accurate performance metrics and monitoring information to guide model training and tuning. This avoids the code duplication and maintenance burden of traditional methods, which design and implement an evaluation function separately for each algorithm and thereby increase the complexity and time cost of evaluation, and it solves the technical problem of unified model evaluation when training large models. It will be appreciated that this embodiment may select either a conventional machine learning model or a deep learning model as the model architecture.
As shown in fig. 5, the hyperparameters, training strategies and other information are assembled in dictionary form, input into the container through a .yml or .json file, and exposed through a unified interface to achieve one-key training of different types of algorithms; at the same time, the data set and the algorithm are checked to ensure that a correct mapping relationship exists, guaranteeing correct execution of the algorithm. Model training specifically includes the following steps:
• Parameter assembly: parameters including, but not limited to, learning rate, batch size, network structure and loss function are assembled into a dictionary according to the requirements of the specific algorithm, making the algorithm convenient to configure and manage.
• Configuration file input: the parameter dictionary is stored in a .yml or .json file and sent into the container as a file, so that parameters can be flexibly modified and adjusted without changing code, improving the adaptability and configurability of the algorithm.
• Checking: the data set and the algorithm are checked inside the container to ensure that a correct mapping relationship exists between them; if the characteristics of the data set do not match or are wrong, a corresponding error prompt and suggestion are given.
• One-key training: by reading the parameter dictionary in the configuration file, the container automatically loads the corresponding algorithm model according to the requirements of different algorithms and trains it with the designated data set, so that a user can quickly start a training task in one-key fashion without manually writing complex training code and procedures.
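"One-key training" as described above then amounts to: parse the configuration file, look the model up in a registry, and start training. A minimal sketch, in which the registry contents, builder signatures and config keys are illustrative assumptions:

```python
import json

# Registry mapping config "network" names to model builders (contents invented for illustration).
MODEL_REGISTRY = {
    "logreg": lambda cfg: {"type": "logreg", "lr": cfg["learning_rate"]},
    "mlp": lambda cfg: {"type": "mlp", "lr": cfg["learning_rate"],
                        "layers": cfg.get("layers", 2)},
}

def one_key_train(config_text):
    """Parse the algorithm configuration file and automatically load the matching model."""
    cfg = json.loads(config_text)
    try:
        build = MODEL_REGISTRY[cfg["network"]]
    except KeyError:
        raise ValueError(f"no algorithm registered for network {cfg['network']!r}")
    return build(cfg)  # training on the mapped data set would start from here

model = one_key_train('{"network": "mlp", "learning_rate": 0.01, "layers": 3}')
```

Because the algorithm is selected by name from the registry, swapping algorithms only means editing the configuration file, never the training code.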
Through this adaptation algorithm implementation, the embodiment provides a unified interface through which different types of data sets and different algorithms can be conveniently mapped and trained. Combining containerization technology with big data analysis algorithms gives users a unified platform on which AI models can be conveniently constructed and deployed to solve various practical problems; users can easily select and apply complex algorithms and adjust and optimize parameters as needed, achieving an efficient model construction and training process.
The technical scheme also introduces an efficient model format conversion algorithm that can quickly convert a model from one format to another. Based on the mapping relationship between model structure and parameters, and by optimizing computation and data transfer, the algorithm reduces the computation and storage overhead of the conversion process, improves conversion speed and efficiency, and solves the problem of quickly converting and generating model formats.
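A conversion based on a structure/parameter mapping could look roughly like the following sketch, in which the parameter names, the mapping table, and the plain-list "tensors" are all hypothetical stand-ins; parameter values are passed through by reference rather than copied, which is one way the storage overhead stays low:

```python
# Source-format parameters (names and values are illustrative).
source_model = {
    "conv1.weight": [0.1, 0.2],
    "conv1.bias": [0.0],
    "fc.weight": [0.5, 0.6],
}

# Hypothetical mapping between the two formats' parameter names.
name_map = {
    "conv1.weight": "features.0.kernel",
    "conv1.bias": "features.0.bias",
    "fc.weight": "classifier.kernel",
}

def convert(params, mapping):
    """Rename parameters into the target format; unmapped keys are rejected
    so a structural mismatch is caught before any model is produced."""
    missing = set(params) - set(mapping)
    if missing:
        raise KeyError(f"no mapping for: {sorted(missing)}")
    return {mapping[k]: v for k, v in params.items()}

target_model = convert(source_model, name_map)
```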
Step S23: if the mapping relationship is correct, the parameter dictionary in the algorithm configuration file is read, and the preset virtualization container loads the corresponding initial model according to the parameter dictionary, so that a preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model. During training, the preset executor sets hooks, through which user-defined requests are obtained, and the initial model is trained according to those requests.
In this embodiment, as shown in fig. 6, the core module of the training engine used for model training and model evaluation is an executor (Runner), which is responsible for executing and managing every component of the training, testing and inference tasks. The executor allows users to extend, insert and execute custom logic by setting hooks (Hook), which run at specific positions within training, testing and inference. In fig. 6, the data set (Dataset) is responsible for constructing the data needed by training, testing and inference tasks and passing it to the model; in practice, the data set is typically wrapped by a data loader (DataLoader), which starts multiple sub-processes to load data and thereby improve loading efficiency. The model (Model) accepts data as input and outputs a loss (during training) or predictions (during testing and inference); for distributed training and similar scenarios, the model is typically wrapped by a model wrapper (Model Wrapper) such as MMDistributedDataParallel. The optimizer wrapper (Optimizer) is responsible for back propagation to optimize the model and provides a unified interface supporting mixed-precision training and gradient accumulation. The parameter scheduler (Parameter Scheduler) dynamically adjusts the optimizer's hyperparameters, such as the learning rate and momentum, during training to improve the training effect. Between training rounds or in the test stage, the evaluation metrics and evaluator (Metrics & Evaluator) measure model performance: the evaluator assesses the model's predictions against the data set, and the metrics compute specific indices such as recall and accuracy.
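A toy version of the executor-and-hook mechanism might look like this; the Runner API and the hook-point names below are illustrative, not the actual training engine:

```python
class Runner:
    """Minimal executor: runs a training loop and fires registered hooks
    at fixed positions (before training, after each epoch, after training)."""

    def __init__(self, model_step, epochs=2):
        self.model_step = model_step  # callable doing one epoch of work
        self.epochs = epochs
        self.hooks = {"before_train": [], "after_epoch": [], "after_train": []}

    def register_hook(self, point, fn):
        self.hooks[point].append(fn)

    def _call(self, point, **kw):
        for fn in self.hooks[point]:
            fn(**kw)

    def train(self):
        self._call("before_train")
        for epoch in range(self.epochs):
            loss = self.model_step(epoch)
            self._call("after_epoch", epoch=epoch, loss=loss)
        self._call("after_train")

# User-defined logic is inserted without touching the loop itself:
log = []
runner = Runner(model_step=lambda e: 1.0 / (e + 1))  # fake loss per epoch
runner.register_hook("after_epoch", lambda epoch, loss: log.append((epoch, loss)))
runner.train()
# log now holds one (epoch, loss) record per epoch
```

A custom request from the user (logging, checkpointing, early stopping) would be just another callable registered at the appropriate hook point.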
In this way, through the cooperation of the executor and these components, the training engine can efficiently perform training, testing and inference tasks while remaining flexibly extensible, so users can customize and insert logic to meet different requirements and application scenarios. During training, this embodiment can optionally use a GPU (Graphics Processing Unit) or a CPU (Central Processing Unit) for model training, supports most current algorithms including traditional machine learning and deep learning algorithms, supports a variety of application scenarios, and monitors and controls containerization to achieve environment isolation and guarantee completion of training tasks.
For more specific processing in step S21, reference may be made to the corresponding content disclosed in the foregoing embodiment, and no further description is given here.
Through the above technical scheme, the method classifies the raw data to obtain structured data and unstructured data, generates an initial data set, and groups the initial data set to obtain target data sets. Parameter information of the initial model is determined and assembled into a parameter dictionary according to a preset algorithm; the parameter dictionary is saved as an algorithm configuration file in .yml or .json format and input into a preset virtualization container, where the mapping relationship between the target data set and the algorithm configuration file is checked. If the mapping relationship is correct, the parameter dictionary in the algorithm configuration file is read and the preset virtualization container loads the corresponding initial model according to it, so that a preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model; during training, the preset executor sets hooks through which user-defined requests are obtained and applied to training.
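The classify-then-group step summarized above can be sketched as follows, with a hypothetical record schema and a hypothetical grouping attribute "scene" standing in for the preset grouping rule:

```python
# Raw records: those carrying tabular fields are treated as structured,
# free-form records as unstructured (schema is illustrative).
raw = [
    {"fields": {"age": 31}, "scene": "finance"},
    {"text": "free-form note", "scene": "finance"},
    {"fields": {"age": 45}, "scene": "medical"},
]

structured = [r for r in raw if "fields" in r]
# Unstructured records get a label slot, awaiting annotation.
unstructured = [{**r, "label": None} for r in raw if "text" in r]
initial_dataset = structured + unstructured

def group_by(dataset, key):
    """Group the initial data set by a data attribute (the grouping rule)."""
    groups = {}
    for record in dataset:
        groups.setdefault(record[key], []).append(record)
    return groups

# One target data set per group; each can back its own target model.
target_datasets = group_by(initial_dataset, "scene")
```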
Through the classification of data sets, the adaptation of algorithms and the core module of the training engine, this embodiment achieves flexible, efficient and extensible model construction across different application scenarios. It provides a flexible, efficient and unified AI model construction platform that can meet model construction requirements in different scenarios, simplifies the model construction process, and improves development efficiency and model accuracy. The containerized, end-to-end design also makes it convenient to integrate different components and tools, enabling rapid iteration and deployment, resolving the complexity of AI model construction, and realizing an end-to-end model construction technique.
Referring to fig. 7, the embodiment of the application also discloses an end-to-end containerized big data analysis model construction device, which comprises:
The data set grouping module 11 is configured to acquire raw data sent by a preset data source, classify the raw data to obtain corresponding structured data and unstructured data, generate an initial data set based on the structured data and the unstructured data, and group the initial data set according to a preset grouping rule to obtain a target data set;
The configuration file verification module 12 is configured to determine parameter information of an initial model, generate an algorithm configuration file in a preset format according to the parameter information, input the algorithm configuration file into a preset virtualization container, and verify a mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container;
and the model training module 13 is configured to, if the mapping relationship is correct, train the initial model according to the mapping relationship by using a preset executor, and during training call a preset parameter scheduler through the preset executor to dynamically adjust the hyperparameters of a preset optimizer, so that the preset optimizer optimizes the initial model to generate a target model.
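The scheduler-driven hyperparameter adjustment performed by the model training module can be sketched as below; the step-decay schedule and all numeric values are assumed purely for illustration:

```python
class Optimizer:
    """Stub optimizer whose only tunable hyperparameter is the learning rate."""
    def __init__(self, lr):
        self.lr = lr

class StepScheduler:
    """Parameter scheduler: decays the optimizer's learning rate
    by `gamma` every `step` epochs."""
    def __init__(self, optimizer, step=2, gamma=0.1):
        self.opt, self.step, self.gamma = optimizer, step, gamma

    def adjust(self, epoch):
        if epoch > 0 and epoch % self.step == 0:
            self.opt.lr *= self.gamma

# The executor would call the scheduler once per epoch during training.
opt = Optimizer(lr=0.1)
sched = StepScheduler(opt)
history = []
for epoch in range(4):
    sched.adjust(epoch)
    history.append(round(opt.lr, 6))
```

Swapping in a different schedule (cosine decay, warm-up, momentum adjustment) would only replace the scheduler object, leaving the training loop untouched.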
In this embodiment, raw data sent by a preset data source is acquired and classified into corresponding structured data and unstructured data; an initial data set is generated from them and grouped according to a preset grouping rule to obtain a target data set. Parameter information of an initial model is determined, an algorithm configuration file in a preset format is generated from it and input into a preset virtualization container, and the mapping relationship between the target data set and the algorithm configuration file is checked using the container. If the mapping relationship is correct, a preset executor trains the initial model according to the mapping relationship and, during training, calls a preset parameter scheduler to dynamically adjust the hyperparameters of a preset optimizer, so that the optimizer optimizes the initial model to generate a target model. In this way, the application can conveniently process different types of data through a unified data set implementation, forming a unified data set format; can construct and train different types of algorithms in one-key fashion through the adaptation-algorithm implementation; and, by introducing the executor as the core module of the training engine, achieves efficient execution and management of training, testing and inference tasks.
In some specific embodiments, the data set grouping module 11 specifically includes:
the first data preprocessing unit is used for performing data quality inspection, data cleaning and conversion operation on the fields of the structured data and generating a corresponding structured data set;
the second data preprocessing unit is used for marking the unstructured data and generating an unstructured data set in a label format;
And the data set generating unit is used for generating an initial data set according to the structured data set and the unstructured data set.
In some specific embodiments, the data set grouping module 11 further includes:
the interface determining unit is used for inputting the target data set into the preset virtualized container and providing an application programming interface corresponding to the target data set in a network request mode.
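Providing an application programming interface for the target data set in the form of a network request could be sketched with the Python standard library alone; the `/dataset` route and the payload shape are assumptions for illustration:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical target data set held inside the container.
TARGET_DATASET = {"name": "structured_v1", "records": [{"age": 31}, {"age": 45}]}

class DatasetAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/dataset":
            body = json.dumps(TARGET_DATASET).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), DatasetAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client (e.g. the training step) fetches the data set over HTTP.
url = f"http://127.0.0.1:{server.server_port}/dataset"
payload = json.loads(urlopen(url).read())
server.shutdown()
```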
In some embodiments, the configuration file verification module 12 specifically includes:
and the parameter dictionary assembling unit is used for assembling the parameter information into a parameter dictionary according to a preset algorithm and saving the parameter dictionary as an algorithm configuration file in .yml or .json format.
In some embodiments, the model training module 13 specifically includes:
the model training unit is used for reading the parameter dictionary in the algorithm configuration file, loading the corresponding initial model by using the preset virtualization container according to the parameter dictionary so as to read the corresponding target data set by using the preset executor according to the mapping relation, and training the initial model.
In some embodiments, the configuration file verification module 12 specifically includes:
And the mapping judgment unit is used for determining whether the characteristics corresponding to the target data set and the algorithm configuration file are matched by utilizing the preset virtualization container, and if so, judging that the mapping relation between the target data set and the algorithm configuration file is correct.
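The feature-matching check performed by the mapping judgment unit could be sketched as below; the feature names and the wording of the error message are illustrative:

```python
def check_mapping(dataset_features, config):
    """Compare the features the algorithm configuration expects with the
    features the target data set actually provides; on mismatch, return an
    error message with a suggestion instead of proceeding to training."""
    expected = set(config["expected_features"])
    actual = set(dataset_features)
    if expected <= actual:
        return True, "mapping correct"
    missing = sorted(expected - actual)
    return False, (
        f"feature mismatch, missing {missing}; "
        "regroup the data set or edit the algorithm configuration file"
    )

ok, msg = check_mapping(["age", "income"], {"expected_features": ["age", "income"]})
bad, err = check_mapping(["age"], {"expected_features": ["age", "income"]})
```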
In some embodiments, the model training module 13 further includes:
the hook setting unit is used for setting a hook by using the preset executor so as to acquire a user-defined request of a user in the training process according to the hook, and training the initial model through the user-defined request.
Further, the embodiment of the present application also discloses an electronic device. Fig. 8 is a block diagram of an electronic device 20 according to an exemplary embodiment; nothing in the figure should be considered a limitation on the scope of the application.
Fig. 8 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may include, in particular, at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input-output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement relevant steps in the end-to-end containerized big data analysis AI model construction method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 provides working voltages for the hardware devices on the electronic device 20; the communication interface 24 creates a data transmission channel between the electronic device 20 and external devices, following any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; and the input/output interface 25 obtains external input data or outputs data to external devices, its specific interface type being selectable according to the needs of the particular application and likewise not specifically limited herein.
As a carrier for storing resources, the memory 22 may be a read-only memory, a random access memory, a magnetic disk or an optical disk; the resources stored on it may include an operating system 221, a computer program 222 and the like, and the storage may be temporary or permanent.
The operating system 221 manages and controls the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, etc. In addition to the computer program for performing the end-to-end containerized big data analysis AI model construction method disclosed in any of the previous embodiments and executed by the electronic device 20, the computer program 222 may further include computer programs for performing other specific tasks.
Furthermore, the application also discloses a computer readable storage medium for storing a computer program, wherein the computer program is executed by a processor to realize the method for constructing the end-to-end containerized big data analysis AI model. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
The foregoing describes the principles and embodiments of the present application through specific examples, which are provided only to assist in understanding the method of the application and its core idea. For those of ordinary skill in the art, variations may be made to the specific embodiments and the scope of application in light of the idea of the present application; accordingly, the content of this specification should not be construed as limiting the application.

Claims (7)

1. A method for constructing an end-to-end containerized big data analysis model, comprising: acquiring raw data sent by a preset data source, classifying the raw data to obtain corresponding structured data and unstructured data, generating an initial data set based on the structured data and the unstructured data, and grouping the initial data set according to a preset grouping rule to obtain a target data set; determining parameter information of an initial model, generating an algorithm configuration file in a preset format according to the parameter information, inputting the algorithm configuration file into a preset virtualization container, and verifying a mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container; if the mapping relationship is correct, training the initial model according to the mapping relationship by using a preset executor, and during training calling a preset parameter scheduler through the preset executor to dynamically adjust hyperparameters of a preset optimizer, so that the preset optimizer optimizes the initial model to generate a target model; wherein generating the algorithm configuration file in the preset format according to the parameter information comprises: assembling the parameter information into a parameter dictionary according to a preset algorithm, and saving the parameter dictionary as an algorithm configuration file in .yml or .json format; correspondingly, training the initial model according to the mapping relationship by using the preset executor comprises: reading the parameter dictionary in the algorithm configuration file, and loading the corresponding initial model according to the parameter dictionary by using the preset virtualization container, so that the preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model; and verifying the mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container comprises: determining, by using the preset virtualization container, whether features corresponding to the target data set and the algorithm configuration file match, and if they match, determining that the mapping relationship between the target data set and the algorithm configuration file is correct; and grouping the initial data set according to the preset grouping rule to obtain the target data set comprises: grouping the initial data set according to the preset grouping rule to obtain several target data sets, so as to construct the corresponding target model according to each target data set, the preset grouping rule being a rule constructed according to features or attributes of the data in the initial data set.
2. The end-to-end containerized big data analysis model construction method according to claim 1, wherein generating the initial data set based on the structured data and the unstructured data comprises: performing data quality checks, data cleaning and conversion operations on fields of the structured data, and generating a corresponding structured data set; annotating the unstructured data to generate an unstructured data set in a label format; and generating the initial data set according to the structured data set and the unstructured data set.
3. The end-to-end containerized big data analysis model construction method according to claim 1, further comprising, after grouping the initial data set according to the preset grouping rule to obtain the target data set: inputting the target data set into the preset virtualization container, and providing an application programming interface corresponding to the target data set in the form of a network request.
4. The end-to-end containerized big data analysis model construction method according to any one of claims 1 to 3, wherein training the initial model according to the mapping relationship by using the preset executor further comprises: setting a hook by using the preset executor, so as to obtain a user-defined request of a user during training according to the hook, and training the initial model through the user-defined request.
5. A device for constructing an end-to-end containerized big data analysis model, comprising: a data set grouping module configured to acquire raw data sent by a preset data source, classify the raw data to obtain corresponding structured data and unstructured data, generate an initial data set based on the structured data and the unstructured data, and group the initial data set according to a preset grouping rule to obtain a target data set; a configuration file verification module configured to determine parameter information of an initial model, generate an algorithm configuration file in a preset format according to the parameter information, input the algorithm configuration file into a preset virtualization container, and verify a mapping relationship between the target data set and the algorithm configuration file by using the preset virtualization container; and a model training module configured to, if the mapping relationship is correct, train the initial model according to the mapping relationship by using a preset executor, and during training call a preset parameter scheduler through the preset executor to dynamically adjust hyperparameters of a preset optimizer, so that the preset optimizer optimizes the initial model to generate a target model; wherein the configuration file verification module specifically comprises: a parameter dictionary assembling unit configured to assemble the parameter information into a parameter dictionary according to a preset algorithm, and save the parameter dictionary as an algorithm configuration file in .yml or .json format; correspondingly, the model training module specifically comprises: a model training unit configured to read the parameter dictionary in the algorithm configuration file and load the corresponding initial model according to the parameter dictionary by using the preset virtualization container, so that the preset executor reads the corresponding target data set according to the mapping relationship and trains the initial model; and the configuration file verification module specifically comprises: a mapping judgment unit configured to determine, by using the preset virtualization container, whether features corresponding to the target data set and the algorithm configuration file match, and if they match, determine that the mapping relationship between the target data set and the algorithm configuration file is correct.
6. An electronic device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the computer program is loaded and executed by the processor to implement the end-to-end containerized big data analysis model construction method according to any one of claims 1 to 4.
7. A computer-readable storage medium for storing a computer program which, when executed by a processor, implements the end-to-end containerized big data analysis model construction method according to any one of claims 1 to 4.
CN202311282136.1A 2023-09-28 2023-09-28 End-to-end containerized big data model construction method, device, equipment and medium Active CN117235527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311282136.1A CN117235527B (en) 2023-09-28 2023-09-28 End-to-end containerized big data model construction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311282136.1A CN117235527B (en) 2023-09-28 2023-09-28 End-to-end containerized big data model construction method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN117235527A CN117235527A (en) 2023-12-15
CN117235527B true CN117235527B (en) 2025-02-11

Family

ID=89094636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311282136.1A Active CN117235527B (en) 2023-09-28 2023-09-28 End-to-end containerized big data model construction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117235527B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891531B (en) * 2024-03-14 2024-06-14 蒲惠智造科技股份有限公司 System parameter configuration method, system, medium and electronic equipment for SAAS software

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463301A (en) * 2020-11-30 2021-03-09 常州微亿智造科技有限公司 Container-based model training test tuning and deployment method and device
CN115080021A (en) * 2022-05-13 2022-09-20 北京思特奇信息技术股份有限公司 Zero code modeling method and system based on automatic machine learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325541A (en) * 2018-09-30 2019-02-12 北京字节跳动网络技术有限公司 Method and apparatus for training pattern
CN112883654B (en) * 2021-03-24 2023-01-31 国家超级计算天津中心 Model training system based on data driving
US20230229735A1 (en) * 2022-01-18 2023-07-20 Chime Financial, Inc. Training and implementing machine-learning models utilizing model container workflows
CN114116684B (en) * 2022-01-27 2022-05-24 中国传媒大学 Version management method of deep learning large model and large data set based on Docker containerization

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463301A (en) * 2020-11-30 2021-03-09 常州微亿智造科技有限公司 Container-based model training test tuning and deployment method and device
CN115080021A (en) * 2022-05-13 2022-09-20 北京思特奇信息技术股份有限公司 Zero code modeling method and system based on automatic machine learning

Also Published As

Publication number Publication date
CN117235527A (en) 2023-12-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant