Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used herein, "module," "system," and the like refer to a computer-related entity: hardware, a combination of hardware and software, or software in execution. For example, an element may be, but is not limited to, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers, and may operate through various computer-readable media. Elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., a signal from a data packet interacting with another element in a local system, in a distributed system, and/or across a network such as the Internet with other systems.
Finally, it should be further noted that the terms "comprises" and "comprising," when used herein, specify the presence of the stated elements but do not preclude other elements not expressly listed or inherent to such processes, methods, articles, or devices. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Fig. 1 shows a flowchart of an example of a vocoder model based speech synthesis method according to an embodiment of the present invention.
As shown in fig. 1, in step 110, source information to be subjected to speech synthesis is obtained, and a feature data set corresponding to the source information is determined. Here, the source information may be determined in various ways, for example by recognizing the user intention corresponding to a user request; different user intentions may correspond to respective source information, so that matching source information can be determined according to the recognized intention.
Further, a feature data set corresponding to the source information is determined through feature engineering. Here, the source information may be a piece of content information having continuity in the time domain, and accordingly, the feature data set includes a plurality of feature data distributed continuously in the time domain.
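As an illustrative sketch of how a continuous time-domain signal could be turned into a set of time-domain-continuous feature data, the following frames a waveform into consecutive frames. The 16 kHz sample rate, 10 ms frame size, and use of the raw frame as the "feature" are assumptions for the example; a real front end would compute e.g. per-frame cepstral features.

```python
import numpy as np

def frame_features(signal, frame_size=160, hop=160):
    """Split a continuous time-domain signal into consecutive frames,
    yielding one feature vector per frame (placeholder feature: the raw
    frame itself)."""
    n_frames = (len(signal) - frame_size) // hop + 1
    frames = [signal[i * hop : i * hop + frame_size] for i in range(n_frames)]
    return np.stack(frames)

# A 1-second signal at 16 kHz yields 100 frames of 10 ms each.
signal = np.zeros(16000)
feats = frame_features(signal, frame_size=160, hop=160)
print(feats.shape)  # (100, 160)
```

The resulting rows are consecutive in the time domain, matching the description of a feature data set whose elements are distributed continuously in time.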
Next, in step 120, the feature data set is provided to a vocoder model to determine, from the vocoder model, a synthesized speech corresponding to the feature data set. Here, the type of vocoder model is not limited; for example, an LPCNet structure may be employed. It should be understood that LPCNet is a network whose main body is an RNN (Recurrent Neural Network) structure.
In this embodiment, the operating frequency of each hidden layer in the vocoder model is reduced during the synthesis of speech. Since the time domain signal predicted by the vocoder model changes continuously and slowly, reducing the operating frequency does not excessively affect the quality of the synthesized speech, while it reduces the computational complexity of the vocoder model.
Further, a first hidden layer in the vocoder model interfaces with a plurality of second hidden layers, and the method further comprises, after providing the feature data set to the vocoder model: up-sampling the hidden layer state data output by the first hidden layer for each piece of feature data; inputting the up-sampled hidden layer state data to each second hidden layer respectively; controlling the first hidden layer to operate at a first operating frequency; and controlling the second hidden layers to operate at a second operating frequency, the first operating frequency being less than the second operating frequency. In this way, the operating frequencies of different hidden layers are reduced asynchronously, each hidden layer in the vocoder model can process a plurality of hidden layer state data, and the computing resources consumed by real-time operation of the vocoder model are effectively reduced.
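A minimal numerical sketch of this asynchronous scheme follows. The shapes, the reduction factors M and N, and the tanh stand-ins for the hidden layers are illustrative assumptions, not the patent's implementation; the point is the data flow: the first hidden layer runs at a lower rate, its outputs are up-sampled, and the up-sampled states are distributed to several second hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_hidden_layer(x):
    # Stand-in for one step of e.g. a GRU: one state vector per call.
    return np.tanh(x.mean(axis=0, keepdims=True))

def second_hidden_layer(h):
    return np.tanh(h)

M, N = 4, 2                                # frequency-reduction factors
features = rng.standard_normal((16, 8))    # 16 base-rate steps of 8-dim features

# First hidden layer runs at 1/M of the base rate: one output per M inputs.
h1 = np.concatenate([first_hidden_layer(features[t:t + M])
                     for t in range(0, len(features), M)])   # shape (4, 8)

# Up-sample by M / N so the second layers see one state per N base steps.
h1_up = np.repeat(h1, M // N, axis=0)                        # shape (8, 8)

# N second hidden layers share the work, each at 1/N of the base rate.
outputs = [second_hidden_layer(h1_up) for _ in range(N)]
print(h1.shape, h1_up.shape, len(outputs))  # (4, 8) (8, 8) 2
```

Each layer therefore performs fewer steps than the base rate would require, which is the source of the computational saving.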
Fig. 2 shows an example of the structure of an LPCNet in the current related art. As shown in fig. 2, the RNN structure of the LPCNet contains a plurality of hidden layers, such as GRU (Gated Recurrent Unit) A and GRU B, and the processing in GRU A and GRU B accounts for most of the computational complexity of the LPCNet structure. Here, GRU A is the hidden layer interfacing with the input layer, and GRU B is the hidden layer interfacing with the output layer.
Note that the output of some vocoder models (e.g., LPCNet) is a continuous time domain signal, and the output value of each step does not change much from the previous step. In view of this continuity, it is proposed herein that the operating frequency of the hidden layer can be reduced to reduce the complexity of network computation.
Fig. 3 is a block diagram illustrating an example of a vocoder model according to an embodiment of the present invention.
As shown in fig. 3, the vocoder model 300 includes an input layer 310, hidden layers 320, an up-sampling unit 330, and an output layer 340, and the hidden layers 320 include a plurality of hidden layers 1 to n. Here, the input layer 310 may be used to receive the feature data set corresponding to the source information for speech synthesis. Each of the hidden layers 320 is used to determine hidden layer state data for the respective feature data. Further, the output layer 340 may be configured to determine the synthesized speech from the hidden layer state data output by the plurality of hidden layers for each piece of feature data. For further details of synthesizing speech with a vocoder model, reference may be made to the description of vocoder models in the related art, which is not repeated herein.
In this embodiment, the hidden layers 320 include at least one first hidden layer and a plurality of second hidden layers each interfacing with a first hidden layer. This differs from the hidden layer arrangement of the original vocoder model structure in fig. 2, for example by extending the hidden layer connected to the output layer in fig. 2.
In addition, the operating frequency of the first hidden layer is less than the operating frequency of the second hidden layers. Specifically, the operating frequency of each hidden layer can be reduced by a different amount to realize the above operation process. When the operating frequency of a hidden layer is reduced, that hidden layer processes a plurality of hidden layer state data. Specifically, each first hidden layer is configured to process a first preset amount of first hidden layer state data corresponding to its operating frequency, and each second hidden layer is configured to process a second preset amount of second hidden layer state data corresponding to its operating frequency; for example, when a hidden layer's operating frequency is reduced by a factor of 4, it may process 4 times the amount of hidden layer state data relative to the preceding hidden layer. In addition, the frequency reduction applied to the second hidden layer is less than or equal to that applied to the first hidden layer, so a single second hidden layer processes fewer hidden layer state data than the first hidden layer it interfaces with, and the plurality of second hidden layers cooperate to process the data.
The up-sampling unit 330 may up-sample the hidden layer state data output by the first hidden layer for each piece of feature data, and input the up-sampled hidden layer state data to each of the interfaced second hidden layers respectively. Here, because the first hidden layer interfaces with a plurality of second hidden layers, the hidden layer state data output by the first hidden layer is amplified by up-sampling.
In an example of this embodiment, the number of second hidden layers interfacing with the first hidden layer corresponds to the operating frequency of the second hidden layers. For example, when the operating frequency of the second hidden layers is reduced by a factor of 2 and the number of first hidden layers is unchanged, the number of second hidden layers may be doubled, in which case two second hidden layers interface with each first hidden layer. Accordingly, the magnification of the up-sampling unit 330 may be determined according to the ratio between the operating frequency of the first hidden layer and that of the second hidden layers. Illustratively, when the operating frequency of the first hidden layer is reduced by a factor of 6 and that of the second hidden layers by a factor of 2, the magnification of the up-sampling unit may be 3.
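The relationship between the two reduction factors and the up-sampling magnification can be stated as a one-line calculation. This is a sketch of the arithmetic described above (the function name is ours, not from the source):

```python
def upsampling_factor(first_reduction, second_reduction):
    """Magnification of the up-sampling unit: the first hidden layer's
    frequency-reduction factor divided by the second hidden layers',
    i.e. the ratio of the second layers' operating frequency to the
    first layer's. Reductions are factors relative to the base rate."""
    assert first_reduction % second_reduction == 0, "factors must divide evenly"
    return first_reduction // second_reduction

# First hidden layer reduced 6x, second layers reduced 2x -> up-sample 3x.
print(upsampling_factor(6, 2))  # 3
```

With the fig. 4 example below (M = 4, N = 2), the same calculation gives a magnification of 2.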
In this embodiment, since the output signal of the vocoder model is continuous in the time domain and the output value of each step does not change much from the previous step, the operating frequency of the hidden layers can be adjusted (e.g., reduced) without much influence on the quality of the synthesized speech. In addition, because the first operating frequency of the first hidden layer is smaller than the second operating frequency of the second hidden layers, each hidden layer can process a plurality of hidden layer state data, reducing the computational complexity of the network.
Fig. 4 is a schematic diagram illustrating an example of a vocoder model using an LPCNet structure according to an embodiment of the present invention.
As shown in fig. 4, by reducing the operating frequencies of GRU A and GRU B in the LPCNet structure, the computational complexity of LPCNet is effectively reduced.
It should be noted that, since the output of the LPCNet is a continuous time domain signal, the output value of each step does not change much from the previous step. In this embodiment, the operating frequencies of GRU A and GRU B are reduced based on this time domain continuity, which can effectively reduce the computational complexity of the network.
Specifically, the operating frequency of GRU A can be reduced by a factor of M while its network size is kept unchanged, so that GRU A takes in M pieces of input to produce one piece of hidden layer state data. The hidden layer state data output by GRU A is then up-sampled by a factor of M/N. In addition, the operating frequency of GRU B is reduced by a factor of N and its network is expanded N times, so that each piece of hidden layer state data output by GRU B is divided into N parts.
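The shape bookkeeping of this M/N scheme can be sketched as follows. The concrete sizes (16 base-rate steps, 8-dim states) are assumptions for illustration, and the zero arrays only stand in for the real GRU states; the sketch checks that after up-sampling by M/N and splitting GRU B's N-times-expanded states, one output vector per base-rate step is recovered.

```python
import numpy as np

M, N = 4, 2        # illustrative reduction factors for GRU A and GRU B
T, dim = 16, 8     # base-rate time steps and state dimension (assumed)

# GRU A runs at 1/M of the base rate with unchanged network size:
# it consumes M inputs per step and emits T/M state vectors in total.
gru_a_states = np.zeros((T // M, dim))

# Up-sampling by M/N gives GRU B (running at 1/N rate) T/N inputs.
gru_b_inputs = np.repeat(gru_a_states, M // N, axis=0)

# GRU B is expanded N times, so each of its T/N states splits into
# N parts, restoring one output vector per base-rate time step.
gru_b_states = np.zeros((T // N, N * dim))
outputs = gru_b_states.reshape(T, dim)
print(gru_b_inputs.shape, outputs.shape)  # (8, 8) (16, 8)
```

Both GRUs thus take fewer steps than the base rate, while the output still covers every time step.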
It should be noted that different choices of M and N yield different computational complexity for the network, as shown in table 1 below:
M | N | LPCNet (Gflops)
--|---|----------------
1 | 1 | 3.0
2 | 1 | 2.3
4 | 1 | 1.9
4 | 2 | 1.8

TABLE 1
To balance synthesized sound quality against computational complexity, the operating frequency of the first hidden layer may be set to 1/2 of the operating frequency of the second hidden layers. Illustratively, when M is 2 and N is 1, the synthesized sound quality hardly changes while the computational complexity drops to 77% of the baseline; when M is 4 and N is 2, the sound quality is still acceptable and the complexity drops to 60%, which is the example depicted in fig. 4.
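The percentages quoted above follow directly from the Gflops figures in table 1, taking the M = N = 1 configuration as the baseline:

```python
# Complexity ratios relative to the unmodified network (Table 1).
baseline = 3.0   # Gflops at M = 1, N = 1
ratios = {(m, n): gflops / baseline
          for m, n, gflops in [(2, 1, 2.3), (4, 1, 1.9), (4, 2, 1.8)]}

for (m, n), r in sorted(ratios.items()):
    print(f"M={m}, N={n}: complexity at {r:.0%} of the baseline")
```

This reproduces the 77% (M=2, N=1) and 60% (M=4, N=2) figures stated in the text.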
In the embodiment of the invention, the characteristic that the time domain signal predicted by the network changes continuously and slowly is exploited: the operating frequencies of GRU A and GRU B are reduced, thereby reducing the computational complexity of LPCNet.
Fig. 5 is a block diagram illustrating an example of a vocoder model-based speech synthesis apparatus according to an embodiment of the present invention.
As shown in fig. 5, the vocoder model-based speech synthesis apparatus 500 includes a source information characteristic determination unit 510 and a vocoder model call unit 520.
The source information feature determination unit 510 is configured to acquire source information to be subjected to speech synthesis, and determine a feature data set corresponding to the source information. The operation of the source information characteristic determination unit 510 may refer to the description above with reference to step 110 in fig. 1.
The vocoder model invoking unit 520 is configured to provide the feature data set to a vocoder model to determine a synthesized speech corresponding to the feature data set from the vocoder model, and to reduce the operating frequency of each hidden layer in the vocoder model during the synthesis of the speech. The operation of the vocoder model invoking unit 520 may be as described above with reference to step 120 in fig. 1.
In an example of this embodiment, when the first hidden layer in the vocoder model is interfaced with a plurality of second hidden layers, the vocoder model invoking unit 520 may be further configured to: the hidden layer state data output by the first hidden layer and aiming at each feature data are up-sampled, and the up-sampled hidden layer state data are respectively input to each second hidden layer; controlling the first hidden layer to operate according to a first operating frequency; and controlling the second hidden layer to operate according to a second operating frequency, wherein the first operating frequency is less than the second operating frequency.
The apparatus according to the above embodiment of the present invention may be used to execute the corresponding method embodiment of the present invention, and accordingly achieve the technical effect achieved by the method embodiment of the present invention, which is not described herein again.
In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
In another aspect, an embodiment of the present invention provides a storage medium having a computer program stored thereon which, when executed by a processor, performs the steps of the above vocoder model-based speech synthesis method.
The above product can execute the method provided by the embodiments of the present application, and has the corresponding functional modules and beneficial effects of executing the method. For technical details not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions, in essence or in the parts contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments or of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.