Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used herein, "module," "system," and the like refer to a computer-related entity: hardware, a combination of hardware and software, or software in execution. For example, an element may be, but is not limited to, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers, and may operate through various computer-readable media. Elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., a signal from a data packet interacting with another element in a local system, in a distributed system, and/or across a network such as the Internet with other systems.
Finally, it should be further noted that the terms "comprises" and "comprising," when used herein, specify the presence of the stated elements but do not preclude other elements not expressly listed or inherent to such processes, methods, articles, or devices. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Fig. 1 shows a flowchart of an example of a vocoder model based speech synthesis method according to an embodiment of the present invention.
As shown in fig. 1, in step 110, source information to be subjected to speech synthesis is obtained, and a feature data set corresponding to the source information is determined. Here, the source information may be determined in various ways, for example by recognizing the user intention corresponding to a user request; different user intentions may correspond to respective source information, so that matching source information can be determined according to the recognized intention.
Further, a feature data set corresponding to the source information is determined through feature engineering. Here, the source information may be a piece of content information having continuity in the time domain, and accordingly, the feature data set includes a plurality of feature data distributed continuously in the time domain.
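As an illustrative sketch of how a continuous time-domain signal could be turned into a set of time-domain-continuous feature data, the following frames a waveform into consecutive frames. The 16 kHz sample rate, 10 ms frame size, and use of the raw frame as the "feature" are assumptions for the example; a real front end would compute e.g. per-frame cepstral features.

```python
import numpy as np

def frame_features(signal, frame_size=160, hop=160):
    """Split a continuous time-domain signal into consecutive frames,
    yielding one feature vector per frame (placeholder feature: the raw
    frame itself)."""
    n_frames = (len(signal) - frame_size) // hop + 1
    frames = [signal[i * hop : i * hop + frame_size] for i in range(n_frames)]
    return np.stack(frames)

# A 1-second signal at 16 kHz yields 100 frames of 10 ms each.
signal = np.zeros(16000)
feats = frame_features(signal, frame_size=160, hop=160)
print(feats.shape)  # (100, 160)
```

The resulting rows are consecutive in the time domain, matching the description of a feature data set whose elements are distributed continuously in time.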
Next, in step 120, the feature data set is provided to a vocoder model to determine, from the vocoder model, a synthesized speech corresponding to the feature data set. Here, the type of vocoder model is not limited; for example, an LPCNet structure may be employed. It should be understood that LPCNet is a network whose main body is an RNN (Recurrent Neural Network) structure.
In this embodiment, the operating frequency of each hidden layer in the vocoder model is reduced during the synthesis of speech. Since the time domain signal predicted by the vocoder model changes continuously and slowly, reducing the operating frequency does not excessively affect the quality of the synthesized speech, while it reduces the computational complexity of the vocoder model.
Further, a first hidden layer in the vocoder model interfaces with a plurality of second hidden layers, and the method further comprises, after providing the feature data set to the vocoder model: up-sampling the hidden layer state data output by the first hidden layer for each piece of feature data; inputting the up-sampled hidden layer state data to each second hidden layer respectively; controlling the first hidden layer to operate at a first operating frequency; and controlling the second hidden layers to operate at a second operating frequency, the first operating frequency being less than the second operating frequency. In this way, the operating frequencies of different hidden layers are reduced asynchronously, each hidden layer in the vocoder model can process a plurality of hidden layer state data, and the computing resources consumed by real-time operation of the vocoder model are effectively reduced.
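A minimal numerical sketch of this asynchronous scheme follows. The shapes, the reduction factors M and N, and the tanh stand-ins for the hidden layers are illustrative assumptions, not the patent's implementation; the point is the data flow: the first hidden layer runs at a lower rate, its outputs are up-sampled, and the up-sampled states are distributed to several second hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_hidden_layer(x):
    # Stand-in for one step of e.g. a GRU: one state vector per call.
    return np.tanh(x.mean(axis=0, keepdims=True))

def second_hidden_layer(h):
    return np.tanh(h)

M, N = 4, 2                                # frequency-reduction factors
features = rng.standard_normal((16, 8))    # 16 base-rate steps of 8-dim features

# First hidden layer runs at 1/M of the base rate: one output per M inputs.
h1 = np.concatenate([first_hidden_layer(features[t:t + M])
                     for t in range(0, len(features), M)])   # shape (4, 8)

# Up-sample by M / N so the second layers see one state per N base steps.
h1_up = np.repeat(h1, M // N, axis=0)                        # shape (8, 8)

# N second hidden layers share the work, each at 1/N of the base rate.
outputs = [second_hidden_layer(h1_up) for _ in range(N)]
print(h1.shape, h1_up.shape, len(outputs))  # (4, 8) (8, 8) 2
```

Each layer therefore performs fewer steps than the base rate would require, which is the source of the computational saving.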
Fig. 2 shows an example of the structure of an LPCNet in the current related art. As shown in fig. 2, the RNN structure of the LPCNet contains a plurality of hidden layers, such as GRU (Gated Recurrent Unit) A and GRU B, and the processing in GRU A and GRU B accounts for most of the computational complexity of the LPCNet structure. Here, GRU A is the hidden layer interfacing with the input layer, and GRU B is the hidden layer interfacing with the output layer.
Note that the output of some vocoder models (e.g., LPCNet) is a continuous time domain signal, and the output value of each step does not change much from the previous step. In view of this continuity, it is proposed herein that the operating frequency of the hidden layer can be reduced to reduce the complexity of network computation.
Fig. 3 is a block diagram illustrating an example of a vocoder model according to an embodiment of the present invention.
As shown in fig. 3, the vocoder model 300 includes an input layer 310, hidden layers 320, an up-sampling unit 330, and an output layer 340, and the hidden layers 320 include a plurality of hidden layers 1 to n. Here, the input layer 310 may be used to receive the feature data set corresponding to the source information for speech synthesis. Each of the hidden layers 320 is used to determine hidden layer state data for the respective feature data. Further, the output layer 340 may be configured to determine the synthesized speech from the hidden layer state data output by the plurality of hidden layers for each piece of feature data. For further details of synthesizing speech with a vocoder model, reference may be made to the description of vocoder models in the related art, which is not repeated herein.
In this embodiment, the hidden layers 320 include at least one first hidden layer and a plurality of second hidden layers each interfacing with a first hidden layer. This differs from the hidden layer arrangement of the original vocoder model structure in fig. 2, for example by extending the hidden layer connected to the output layer in fig. 2.
In addition, the operating frequency of the first hidden layer is less than the operating frequency of the second hidden layers. Specifically, the operating frequency of each hidden layer can be reduced by a different amount to realize the above operation process. When the operating frequency of a hidden layer is reduced, that hidden layer processes a plurality of hidden layer state data. Specifically, each first hidden layer is configured to process a first preset amount of first hidden layer state data corresponding to its operating frequency, and each second hidden layer is configured to process a second preset amount of second hidden layer state data corresponding to its operating frequency; for example, when a hidden layer's operating frequency is reduced by a factor of 4, it may process 4 times the amount of hidden layer state data relative to the preceding hidden layer. In addition, the frequency reduction applied to the second hidden layer is less than or equal to that applied to the first hidden layer, so a single second hidden layer processes fewer hidden layer state data than the first hidden layer it interfaces with, and the plurality of second hidden layers cooperate to process the data.
The up-sampling unit 330 may up-sample the hidden layer state data output by the first hidden layer for each piece of feature data, and input the up-sampled hidden layer state data to each of the interfaced second hidden layers respectively. Here, because the first hidden layer interfaces with a plurality of second hidden layers, the hidden layer state data output by the first hidden layer is amplified by up-sampling.
In an example of this embodiment, the number of second hidden layers interfacing with the first hidden layer corresponds to the operating frequency of the second hidden layers. For example, when the operating frequency of the second hidden layers is reduced by a factor of 2 and the number of first hidden layers is unchanged, the number of second hidden layers may be doubled, in which case two second hidden layers interface with each first hidden layer. Accordingly, the magnification of the up-sampling unit 330 may be determined according to the ratio between the operating frequency of the first hidden layer and that of the second hidden layers. Illustratively, when the operating frequency of the first hidden layer is reduced by a factor of 6 and that of the second hidden layers by a factor of 2, the magnification of the up-sampling unit may be 3.
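The relationship between the two reduction factors and the up-sampling magnification can be stated as a one-line calculation. This is a sketch of the arithmetic described above (the function name is ours, not from the source):

```python
def upsampling_factor(first_reduction, second_reduction):
    """Magnification of the up-sampling unit: the first hidden layer's
    frequency-reduction factor divided by the second hidden layers',
    i.e. the ratio of the second layers' operating frequency to the
    first layer's. Reductions are factors relative to the base rate."""
    assert first_reduction % second_reduction == 0, "factors must divide evenly"
    return first_reduction // second_reduction

# First hidden layer reduced 6x, second layers reduced 2x -> up-sample 3x.
print(upsampling_factor(6, 2))  # 3
```

With the fig. 4 example below (M = 4, N = 2), the same calculation gives a magnification of 2.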
In this embodiment, since the output signal of the vocoder model is continuous in the time domain and the output value of each step does not change much from the previous step, the operating frequency of the hidden layers can be adjusted (e.g., reduced) without much influence on the quality of the synthesized speech. In addition, because the first operating frequency of the first hidden layer is smaller than the second operating frequency of the second hidden layers, each hidden layer can process a plurality of hidden layer state data, reducing the computational complexity of the network.
Fig. 4 is a schematic diagram illustrating an example of a vocoder model using an LPCNet structure according to an embodiment of the present invention.
As shown in fig. 4, by reducing the operating frequencies of GRU A and GRU B in the LPCNet structure, the computational complexity of LPCNet is effectively reduced.
It should be noted that, since the output of the LPCNet is a continuous time domain signal, the output value of each step does not change much from the previous step. In this embodiment, the operating frequencies of GRU A and GRU B are reduced based on this time domain continuity, which can effectively reduce the computational complexity of the network.
Specifically, the operating frequency of GRU A can be reduced by a factor of M while its network size is kept unchanged, so that GRU A takes in M pieces of input to produce one piece of hidden layer state data. The hidden layer state data output by GRU A is then up-sampled by a factor of M/N. In addition, the operating frequency of GRU B is reduced by a factor of N and its network is expanded N times, so that each piece of hidden layer state data output by GRU B is divided into N parts.
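The shape bookkeeping of this M/N scheme can be sketched as follows. The concrete sizes (16 base-rate steps, 8-dim states) are assumptions for illustration, and the zero arrays only stand in for the real GRU states; the sketch checks that after up-sampling by M/N and splitting GRU B's N-times-expanded states, one output vector per base-rate step is recovered.

```python
import numpy as np

M, N = 4, 2        # illustrative reduction factors for GRU A and GRU B
T, dim = 16, 8     # base-rate time steps and state dimension (assumed)

# GRU A runs at 1/M of the base rate with unchanged network size:
# it consumes M inputs per step and emits T/M state vectors in total.
gru_a_states = np.zeros((T // M, dim))

# Up-sampling by M/N gives GRU B (running at 1/N rate) T/N inputs.
gru_b_inputs = np.repeat(gru_a_states, M // N, axis=0)

# GRU B is expanded N times, so each of its T/N states splits into
# N parts, restoring one output vector per base-rate time step.
gru_b_states = np.zeros((T // N, N * dim))
outputs = gru_b_states.reshape(T, dim)
print(gru_b_inputs.shape, outputs.shape)  # (8, 8) (16, 8)
```

Both GRUs thus take fewer steps than the base rate, while the output still covers every time step.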
It should be noted that different choices of M and N yield different computational complexity for the network, as shown in table 1 below:
M | N | LPCNet (Gflops)
--|---|----------------
1 | 1 | 3.0
2 | 1 | 2.3
4 | 1 | 1.9
4 | 2 | 1.8

TABLE 1
To balance synthesized sound quality against computational complexity, the operating frequency of the first hidden layer may be set to 1/2 of the operating frequency of the second hidden layers. Illustratively, when M is 2 and N is 1, the synthesized sound quality hardly changes while the computational complexity drops to 77% of the baseline; when M is 4 and N is 2, the sound quality is still acceptable and the complexity drops to 60%, which is the example depicted in fig. 4.
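The percentages quoted above follow directly from the Gflops figures in table 1, taking the M = N = 1 configuration as the baseline:

```python
# Complexity ratios relative to the unmodified network (Table 1).
baseline = 3.0   # Gflops at M = 1, N = 1
ratios = {(m, n): gflops / baseline
          for m, n, gflops in [(2, 1, 2.3), (4, 1, 1.9), (4, 2, 1.8)]}

for (m, n), r in sorted(ratios.items()):
    print(f"M={m}, N={n}: complexity at {r:.0%} of the baseline")
```

This reproduces the 77% (M=2, N=1) and 60% (M=4, N=2) figures stated in the text.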
In the embodiment of the invention, the characteristic that the time domain signal predicted by the network changes continuously and slowly is exploited: the operating frequencies of GRU A and GRU B are reduced, thereby reducing the computational complexity of LPCNet.
Fig. 5 is a block diagram illustrating an example of a vocoder model-based speech synthesis apparatus according to an embodiment of the present invention.
As shown in fig. 5, the vocoder model-based speech synthesis apparatus 500 includes a source information characteristic determination unit 510 and a vocoder model call unit 520.
The source information feature determination unit 510 is configured to acquire source information to be subjected to speech synthesis, and determine a feature data set corresponding to the source information. The operation of the source information characteristic determination unit 510 may refer to the description above with reference to step 110 in fig. 1.
The vocoder model invoking unit 520 is configured to provide the feature data set to a vocoder model to determine a synthesized speech corresponding to the feature data set from the vocoder model, and to reduce the operating frequency of each hidden layer in the vocoder model during the synthesis of the speech. The operation of the vocoder model invoking unit 520 may be as described above with reference to step 120 in fig. 1.
In an example of this embodiment, when the first hidden layer in the vocoder model is interfaced with a plurality of second hidden layers, the vocoder model invoking unit 520 may be further configured to: the hidden layer state data output by the first hidden layer and aiming at each feature data are up-sampled, and the up-sampled hidden layer state data are respectively input to each second hidden layer; controlling the first hidden layer to operate according to a first operating frequency; and controlling the second hidden layer to operate according to a second operating frequency, wherein the first operating frequency is less than the second operating frequency.
The apparatus according to the above embodiment of the present invention may be used to execute the corresponding method embodiment of the present invention, and accordingly achieve the technical effect achieved by the method embodiment of the present invention, which is not described herein again.
In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
In another aspect, an embodiment of the present invention provides a storage medium having a computer program stored thereon which, when executed by a processor, performs the steps of the above vocoder model-based speech synthesis method.
The above product can execute the method provided by the embodiments of the present application, and has the corresponding functional modules and beneficial effects of executing the method. For technical details not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the above technical solutions, in essence or in the parts contributing to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments or of some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.