CN111128117B - Vocoder model, speech synthesis method and device - Google Patents

Vocoder model, speech synthesis method and device

Info

Publication number
CN111128117B
CN111128117B (application CN201911391057.8A)
Authority
CN
China
Prior art keywords
hidden layer
hidden
operating frequency
vocoder model
feature data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911391057.8A
Other languages
Chinese (zh)
Other versions
CN111128117A (en)
Inventor
李翰正
陈宽
张辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201911391057.8A priority Critical patent/CN111128117B/en
Publication of CN111128117A publication Critical patent/CN111128117A/en
Application granted granted Critical
Publication of CN111128117B publication Critical patent/CN111128117B/en
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a vocoder model, a speech synthesis method, and a speech synthesis apparatus. The vocoder model includes: an input layer for receiving a feature data set corresponding to source information for speech synthesis; a plurality of hidden layers, each of which determines hidden layer state data for each piece of feature data, the plurality of hidden layers including at least one first hidden layer and a plurality of second hidden layers respectively interfaced with each first hidden layer, where the operating frequency of the first hidden layer is lower than that of the second hidden layers; an upsampling unit for upsampling the hidden layer state data output by the first hidden layer for each piece of feature data and feeding the upsampled state data to each of the interfaced second hidden layers; and an output layer for determining synthesized speech from the hidden layer state data output by the plurality of hidden layers for each piece of feature data, thereby reducing the computational resources the vocoder model requires.


Description

Vocoder model, speech synthesis method and device
Technical Field
The invention belongs to the field of Internet technology, and in particular relates to a vocoder model, a speech synthesis method, and a speech synthesis apparatus.
Background
Speech synthesis is the technology of generating artificial speech by mechanical or electronic means; it converts text information, whether generated by a computer itself or input from outside, into intelligible, fluent speech output.
At present, applying neural networks to synthesize speech intelligently has become a trend; for example, speech is synthesized by neural network vocoders such as WaveNet or LPCNet. However, such vocoders have extremely high computational complexity: they are practical only on high-end CPUs and cannot run on many common processors or mobile phones.
In view of the above problems, no effective solution has yet been proposed in the industry.
Disclosure of Invention
An embodiment of the present invention provides a vocoder model, a speech synthesis method, and a speech synthesis apparatus, which are intended to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a vocoder model, including: an input layer for receiving a feature data set corresponding to source information for speech synthesis; a plurality of hidden layers, each of which determines hidden layer state data for each piece of feature data, the plurality of hidden layers including at least one first hidden layer and a plurality of second hidden layers respectively interfaced with each first hidden layer, where the operating frequency of the first hidden layer is lower than that of the second hidden layers; an upsampling unit for upsampling the hidden layer state data output by the first hidden layer for each piece of feature data and feeding the upsampled state data to each interfaced second hidden layer; and an output layer for determining synthesized speech from the hidden layer state data output by the plurality of hidden layers for each piece of feature data.
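As a purely illustrative aid (not part of the claimed embodiments), the structure of this first aspect might be sketched in PyTorch as follows; the class name LowRateVocoder, the layer sizes, the choice of GRU cells, and the 256-way (8-bit) sample output are all assumptions of the sketch, with m and n standing for the frequency-reduction factors of the first and second hidden layers:

    import torch
    import torch.nn as nn

    class LowRateVocoder(nn.Module):   # hypothetical name for this sketch
        def __init__(self, feat_dim=20, hidden=384, m=4, n=2):
            super().__init__()
            assert m % n == 0, "first-layer reduction must be a multiple of n"
            self.m, self.n = m, n                  # frequency-reduction factors
            # First hidden layer: runs once per m samples, so it ingests m frames at a time.
            self.gru_a = nn.GRU(feat_dim * m, hidden, batch_first=True)
            # n second hidden layers jointly replace the single output-side layer.
            self.gru_b = nn.ModuleList(
                nn.GRU(hidden, hidden, batch_first=True) for _ in range(n))
            self.out = nn.Linear(hidden, 256)      # e.g. 8-bit mu-law sample logits

        def forward(self, feats):                  # feats: (B, T, feat_dim), T % m == 0
            b, t, d = feats.shape
            x = feats.reshape(b, t // self.m, d * self.m)          # first layer at 1/m rate
            h_a, _ = self.gru_a(x)                                 # (B, T/m, hidden)
            # Upsampling unit: magnify by m/n so the second layers run at 1/n rate.
            h_up = h_a.repeat_interleave(self.m // self.n, dim=1)  # (B, T/n, hidden)
            # Each second layer emits one of every n output samples.
            outs = [self.out(g(h_up)[0]) for g in self.gru_b]      # n x (B, T/n, 256)
            y = torch.stack(outs, dim=2)                           # (B, T/n, n, 256)
            return y.reshape(b, t, 256)                            # per-sample logits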
In a second aspect, an embodiment of the present invention provides a speech synthesis method based on the vocoder model described above, the method including: acquiring source information to be synthesized into speech and determining a feature data set corresponding to the source information; and providing the feature data set to the vocoder model described above so that it determines the synthesized speech corresponding to the feature data set, the operating frequency of each hidden layer in the vocoder model being reduced during speech synthesis.
In a third aspect, an embodiment of the present invention provides a speech synthesis apparatus based on the vocoder model described above, the apparatus including: a source information feature determining unit configured to acquire source information to be synthesized into speech and determine a feature data set corresponding to the source information; and a vocoder model invoking unit configured to provide the feature data set to the vocoder model described above so that it determines the synthesized speech corresponding to the feature data set, the operating frequency of each hidden layer in the vocoder model being reduced during speech synthesis.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the above method.
In a fifth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the above method.
The embodiments of the present invention have the following beneficial effects: by keeping the operating frequency of the first hidden layer lower than that of the second hidden layers, each hidden layer in the vocoder model can process multiple pieces of hidden layer state data per step. The scheme exploits the fact that the time-domain signal predicted by the vocoder model changes continuously and slowly: reducing the operating frequencies of different hidden layers asynchronously leaves the quality of the synthesized speech unaffected while effectively cutting the computational resources consumed by running the vocoder model in real time, making the model suitable for a wider range of processors.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram illustrating an example of a vocoder model based speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an example of an LPCNet in the current related art;
FIG. 3 is a block diagram illustrating an example of a vocoder model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating an example of a vocoder model using an LPCNet structure according to an embodiment of the present invention; and
FIG. 5 is a block diagram illustrating an example of a vocoder model-based speech synthesis apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used herein, "module," "system," and the like are intended to refer to a computer-related entity: hardware, a combination of hardware and software, or software in execution. For example, an element may be, but is not limited to, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. A server, or an application or script running on a server, may also be an element. One or more elements may reside within a process and/or thread of execution; an element may be localized on one computer and/or distributed between two or more computers, and may operate through various computer-readable media. Elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, for example a signal from a data packet interacting with another element in a local system or distributed system, and/or with other systems across a network such as the internet.
Finally, it should be further noted that, as used herein, the terms "comprises" and "comprising" cover not only the listed elements but also other elements not expressly listed or inherent to the process, method, article, or device in question. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or device that comprises the element.
Fig. 1 shows a flowchart of an example of a vocoder model based speech synthesis method according to an embodiment of the present invention.
As shown in fig. 1, in step 110, source information to be synthesized into speech is acquired, and a feature data set corresponding to the source information is determined. The source information may be determined in various ways, for example by recognizing the user intention corresponding to a user request; since different user intentions may correspond to different source information, matching source information can be determined from the user intention.
Further, a feature data set corresponding to the source information is determined through feature engineering. Here, the source information may be a piece of content information with continuity in the time domain; accordingly, the feature data set includes a plurality of feature data distributed continuously in the time domain.
Next, in step 120, the feature data set is provided to a vocoder model, which determines the synthesized speech corresponding to the feature data set. The type of vocoder model is not limited; for example, an LPCNet structure may be employed. It should be understood that LPCNet is a network whose main body is an RNN (Recurrent Neural Network) structure.
In this embodiment, the operating frequency of each hidden layer in the vocoder model is reduced during the synthesis of speech. Since the time domain signal predicted by the vocoder model changes continuously and slowly, reducing the operating frequency does not have an excessive influence on the quality of the synthesized speech, and also can reduce the computational complexity of the vocoder model.
Further, a first hidden layer in the vocoder model interfaces with a plurality of second hidden layers, and after the feature data set is provided to the vocoder model, the method further includes: upsampling the hidden layer state data output by the first hidden layer for each piece of feature data and feeding the upsampled state data to each second hidden layer; controlling the first hidden layer to operate at a first operating frequency; and controlling the second hidden layers to operate at a second operating frequency, the first operating frequency being lower than the second operating frequency. In this way the operating frequencies of different hidden layers are reduced asynchronously, each hidden layer in the vocoder model can process multiple pieces of hidden layer state data, and the computational resources consumed by running the vocoder model in real time are effectively reduced.
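Continuing the hypothetical LowRateVocoder sketch given under the first aspect above, a forward pass exercising these steps could look like this (batch size, frame count, and feature dimension are arbitrary assumptions):

    import torch

    model = LowRateVocoder(feat_dim=20, hidden=384, m=4, n=2)
    feats = torch.randn(1, 160, 20)   # 160 frames, divisible by m = 4
    logits = model(feats)             # GRU A runs 40 steps; each GRU B runs 80
    print(logits.shape)               # torch.Size([1, 160, 256])

Here the first hidden layer runs at 1/4 of the output rate and the second hidden layers at 1/2 of it, so the first operating frequency is indeed lower than the second.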
Fig. 2 shows an example of the structure of an LPCNet in the current related art. As shown in fig. 2, the RNN structure of the LPCNet contains several hidden layers, such as GRU (Gated Recurrent Unit) A and GRU B, and the computations of GRU A and GRU B account for most of the computational complexity of the LPCNet structure. Here, GRU A is the hidden layer interfacing with the input layer, and GRU B is the hidden layer interfacing with the output layer.
Note that the output of some vocoder models (e.g., LPCNet) is a continuous time-domain signal, so the output value of each step does not differ much from that of the previous step. In view of this continuity, it is proposed herein that the operating frequency of the hidden layers can be reduced to lower the computational complexity of the network.
Fig. 3 is a block diagram illustrating an example of a vocoder model according to an embodiment of the present invention.
As shown in fig. 3, the vocoder model 300 includes an input layer 310, hidden layers 320, an upsampling unit 330, and an output layer 340, where the hidden layers 320 comprise a plurality of hidden layers 1 to n. The input layer 310 receives the feature data set corresponding to the source information for speech synthesis. Each of the hidden layers 320 determines hidden layer state data for the respective feature data. The output layer 340 determines the synthesized speech from the hidden layer state data output by the plurality of hidden layers for each piece of feature data. For further details of synthesizing speech with a vocoder model, reference may be made to descriptions of vocoder models in the related art, which are not repeated here.
In this embodiment, the hidden layers 320 include at least one first hidden layer and a plurality of second hidden layers respectively interfaced with each first hidden layer. This differs from the hidden layer arrangement of the original vocoder model structure in fig. 2, for example by expanding the hidden layer connected to the output layer in fig. 2.
In addition, the operating frequency of the first hidden layer is lower than that of the second hidden layers. Specifically, the operating frequency of each hidden layer can be reduced by a different factor to realize this arrangement. When a hidden layer's operating frequency is lowered, the hidden layer processes multiple pieces of hidden layer state data per step: each first hidden layer processes a first preset amount of first hidden layer state data corresponding to its operating frequency, and each second hidden layer processes a second preset amount of second hidden layer state data corresponding to its operating frequency. For example, when a hidden layer's operating frequency is reduced 4-fold, it can process 4 times as much hidden layer state data per step as before. Moreover, because the frequency reduction of the second hidden layers is less than or equal to that of the first hidden layer, a single second hidden layer processes fewer pieces of hidden layer state data per step than the interfaced first hidden layer does, and multiple second hidden layers cooperate to process them all.
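The bookkeeping behind this paragraph can be checked with a few lines of Python (the sample count and reduction factors are assumed example values):

    total_samples = 160
    m, n = 4, 2                        # frequency reductions of the two layer types
    first_steps = total_samples // m   # 40 steps; each covers m = 4 states
    second_steps = total_samples // n  # 80 steps; each covers n = 2 states
    assert first_steps * m == second_steps * n == total_samples
    # A single second hidden layer covers fewer states per step (2) than the
    # interfaced first hidden layer (4), so n = 2 second layers cooperate.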
The upsampling unit 330 upsamples the hidden layer state data for each piece of feature data output by the first hidden layer and feeds the upsampled state data to each of the interfaced second hidden layers. Since a first hidden layer interfaces with a plurality of second hidden layers, the hidden layer state data it outputs is expanded by upsampling.
In an example of this embodiment, the number of second hidden layers interfaced with a first hidden layer corresponds to the operating frequency of the second hidden layers. For example, when the operating frequency of the second hidden layers is reduced 2-fold, their number may be doubled while the number of first hidden layers stays unchanged; in that case each first hidden layer interfaces with 2 second hidden layers. Accordingly, the magnification of the upsampling unit 330 may be determined by the ratio between the operating frequency of the first hidden layer and that of the second hidden layers. Illustratively, if the operating frequency of the first hidden layer is reduced 6-fold and that of the second hidden layers is reduced 2-fold, the magnification of the upsampling unit may be 3.
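The magnification arithmetic of this example is simply the ratio between the two reduction factors (the values below are the example figures from the text):

    m, n = 6, 2               # first layer reduced 6-fold, second layers 2-fold
    magnification = m // n    # the upsampling unit repeats each state 3 times
    second_layers = n         # second hidden layers interfaced per first layer
    assert magnification == 3 and second_layers == 2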
In this embodiment, since the output signal of the vocoder model is continuous in the time domain and the output value of each step does not differ much from that of the previous step, the operating frequency of the hidden layers can be adjusted (e.g., reduced) without much effect on the quality of the synthesized speech. In addition, because the first operating frequency of the first hidden layer is lower than the second operating frequency of the second hidden layers, each hidden layer can output multiple pieces of hidden layer state data, which reduces the computational complexity of the network.
Fig. 4 is a schematic diagram illustrating an example of a vocoder model using an LPCNet structure according to an embodiment of the present invention.
As shown in fig. 4, by reducing the operating frequency of GRU a and GRU B in the LPCNet structure, the computational complexity of LPCNet is effectively reduced.
It should be noted that, since the output of the LPCNet is a continuous time domain signal, the output value of each step does not change much from the previous step. In this embodiment, the operating frequencies of the GRU a and the GRU B are reduced based on the time domain continuity of the signal, which can effectively reduce the computational complexity of the network.
Specifically, the operating frequency of GRU A can be reduced M-fold while its network size is kept unchanged, so that instead of outputting one piece of hidden layer state data per piece of feature data, GRU A takes in M pieces of input and outputs one piece of hidden layer state data. The hidden state data output by GRU A is then upsampled by a factor of M/N. In addition, the operating frequency of GRU B is reduced N-fold and its network is expanded N-fold, so that the hidden layer state data output by GRU B is divided into N parts.
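A shape-level numpy walkthrough of this M/N scheme for M = 4 and N = 2 may make the data flow concrete; all dimensions are assumptions, and tanh matmuls stand in for the real GRU cells:

    import numpy as np

    M, N = 4, 2
    T, F, H = 16, 20, 8                  # samples, feature dim, hidden size
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((T, F))

    # GRU A runs M times slower: M frames in, one hidden state out per step.
    a_in = feats.reshape(T // M, M * F)                     # (4, 80)
    h_a = np.tanh(a_in @ rng.standard_normal((M * F, H)))   # (4, 8)

    # Up-sample GRU A's states by M/N so GRU B receives one state per step.
    h_up = np.repeat(h_a, M // N, axis=0)                   # (8, 8)

    # GRU B runs N times slower than the output rate but is N times wider,
    # so each of its steps yields N output samples.
    b_out = np.tanh(h_up @ rng.standard_normal((H, N)))     # (8, 2)
    samples = b_out.reshape(T)                              # all 16 samples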
It should be noted that different choices of M and N lead to different computational complexities for the network, as shown in Table 1 below:

M    N    LPCNet (Gflops)
1    1    3.0
2    1    2.3
4    1    1.9
4    2    1.8

TABLE 1
To balance synthesized sound quality against computational complexity, the operating frequency of the first hidden layer may be set to 1/2 of the operating frequency of the second hidden layer. Exemplarily, when M is 2 and N is 1, the synthesized sound quality hardly changes while the computational complexity drops to 77% of the baseline; when M is 4 and N is 2, the synthesized sound quality is still acceptable and the complexity drops to 60%, which is the example depicted in fig. 4.
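The percentages quoted here follow directly from Table 1 (the Gflops figures below are copied from the table; the M=4, N=1 row works out to about 63%):

    baseline = 3.0                                  # the M = 1, N = 1 cost
    for m, n, gflops in [(2, 1, 2.3), (4, 1, 1.9), (4, 2, 1.8)]:
        print(f"M={m}, N={n}: {gflops / baseline:.0%} of baseline")
    # M=2, N=1: 77%   M=4, N=1: 63%   M=4, N=2: 60%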
In the embodiment of the present invention, the characteristic that the time-domain signal predicted by the network changes continuously and slowly is exploited: the operating frequencies of GRU A and GRU B are reduced, which lowers the computational complexity of the LPCNet.
Fig. 5 is a block diagram illustrating an example of a vocoder model-based speech synthesis apparatus according to an embodiment of the present invention.
As shown in fig. 5, the vocoder model-based speech synthesis apparatus 500 includes a source information characteristic determination unit 510 and a vocoder model call unit 520.
The source information feature determination unit 510 is configured to acquire source information to be subjected to speech synthesis, and determine a feature data set corresponding to the source information. The operation of the source information characteristic determination unit 510 may refer to the description above with reference to step 110 in fig. 1.
The vocoder model invoking unit 520 is configured to provide the feature data set to a vocoder model to determine a synthesized speech corresponding to the feature data set from the vocoder model, and to reduce the operating frequency of each hidden layer in the vocoder model during the synthesis of the speech. The operation of the vocoder model invoking unit 520 may be as described above with reference to step 120 in fig. 1.
In an example of this embodiment, when the first hidden layer in the vocoder model interfaces with a plurality of second hidden layers, the vocoder model invoking unit 520 may be further configured to: upsample the hidden layer state data output by the first hidden layer for each piece of feature data and feed the upsampled state data to each second hidden layer; control the first hidden layer to operate at a first operating frequency; and control the second hidden layers to operate at a second operating frequency, the first operating frequency being lower than the second operating frequency.
The apparatus according to the above embodiment of the present invention may be used to execute the corresponding method embodiment of the present invention, and accordingly achieve the technical effect achieved by the method embodiment of the present invention, which is not described herein again.
In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
In another aspect, an embodiment of the present invention provides a storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the steps of the above vocoder model-based speech synthesis method.
The product can execute the method provided by the embodiments of the present application, and has functional modules and beneficial effects corresponding to the executed method. For technical details not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players (e.g., ipods), handheld game consoles, electronic books, smart toys, and portable car navigation devices.
(4) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly also by hardware. Based on this understanding, the essence of the above technical solutions, or the part that contributes to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments or parts thereof.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A vocoder model, comprising:
an input layer for receiving a feature data set corresponding to source information for speech synthesis;
a plurality of hidden layers, each of which determines hidden layer state data for each piece of feature data, the plurality of hidden layers including at least one first hidden layer and a plurality of second hidden layers respectively interfaced with each first hidden layer, wherein the operating frequency of the first hidden layer is lower than the operating frequency of the second hidden layer;
an upsampling unit for upsampling the hidden layer state data for each piece of feature data output by the first hidden layer, and feeding the upsampled hidden layer state data to each of the interfaced second hidden layers; and
an output layer for determining synthesized speech from the hidden layer state data for each piece of feature data output by the plurality of hidden layers.

2. The vocoder model of claim 1, wherein each first hidden layer processes a first preset amount of first hidden layer state data corresponding to the operating frequency of the first hidden layer, and each second hidden layer processes a second preset amount of second hidden layer state data corresponding to the operating frequency of the second hidden layer.

3. The vocoder model of claim 1 or 2, wherein the number of second hidden layers interfaced with the first hidden layer corresponds to the operating frequency of the second hidden layers; and the magnification of the upsampling unit is determined by the ratio of the operating frequency of the first hidden layer to the operating frequency of the second hidden layer.

4. The vocoder model of claim 1, wherein the vocoder model adopts an LPCNet structure, and the hidden layers are GRUs in the LPCNet structure.

5. The vocoder model of claim 1 or 4, wherein the operating frequency of the first hidden layer is 1/2 of the operating frequency of the second hidden layer.

6. A speech synthesis method based on a vocoder model, comprising:
acquiring source information to be synthesized into speech, and determining a feature data set corresponding to the source information; and
providing the feature data set to the vocoder model of any one of claims 1-5, so that the vocoder model determines synthesized speech corresponding to the feature data set, the operating frequency of each hidden layer in the vocoder model being reduced during speech synthesis.

7. The method of claim 6, wherein the first hidden layer in the vocoder model interfaces with a plurality of second hidden layers, and after the feature data set is provided to the vocoder model, the method further comprises:
upsampling the hidden layer state data for each piece of feature data output by the first hidden layer, and feeding the upsampled hidden layer state data to each of the interfaced second hidden layers;
controlling the first hidden layer to operate at a first operating frequency; and
controlling the second hidden layers to operate at a second operating frequency, the first operating frequency being lower than the second operating frequency.

8. A speech synthesis apparatus based on a vocoder model, comprising:
a source information feature determining unit configured to acquire source information to be synthesized into speech and determine a feature data set corresponding to the source information; and
a vocoder model invoking unit configured to provide the feature data set to the vocoder model of any one of claims 1-5, so that the vocoder model determines synthesized speech corresponding to the feature data set, the operating frequency of each hidden layer in the vocoder model being reduced during speech synthesis.

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of claim 6 or 7.

10. A storage medium on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the method of claim 6 or 7 are implemented.
CN201911391057.8A 2019-12-30 2019-12-30 Vocoder model, speech synthesis method and device Active CN111128117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911391057.8A CN111128117B (en) 2019-12-30 2019-12-30 Vocoder model, speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911391057.8A CN111128117B (en) 2019-12-30 2019-12-30 Vocoder model, speech synthesis method and device

Publications (2)

Publication Number Publication Date
CN111128117A CN111128117A (en) 2020-05-08
CN111128117B (en) 2022-03-29

Family

ID=70504630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911391057.8A Active CN111128117B (en) 2019-12-30 2019-12-30 Vocoder model, speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN111128117B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106997767A (en) * 2017-03-24 2017-08-01 百度在线网络技术(北京)有限公司 Method of speech processing and device based on artificial intelligence
CN108269569A (en) * 2017-01-04 2018-07-10 三星电子株式会社 Audio recognition method and equipment
CN110050302A (en) * 2016-10-04 2019-07-23 纽昂斯通讯有限公司 Speech synthesis
CN110473515A (en) * 2019-08-29 2019-11-19 郝洁 A kind of end-to-end speech synthetic method based on WaveRNN

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10896669B2 (en) * 2017-05-19 2021-01-19 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110050302A (en) * 2016-10-04 2019-07-23 纽昂斯通讯有限公司 Speech synthesis
CN108269569A (en) * 2017-01-04 2018-07-10 三星电子株式会社 Audio recognition method and equipment
CN106997767A (en) * 2017-03-24 2017-08-01 百度在线网络技术(北京)有限公司 Method of speech processing and device based on artificial intelligence
CN110473515A (en) * 2019-08-29 2019-11-19 郝洁 A kind of end-to-end speech synthetic method based on WaveRNN

Also Published As

Publication number Publication date
CN111128117A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
JP2020170200A (en) End-to-end text-to-speech conversion
CN108156317B (en) Call voice control method and device, storage medium and mobile terminal
US11990124B2 (en) Language model prediction of API call invocations and verbal responses
US11868898B2 (en) Machine learning through multiple layers of novel machine trained processing nodes
US11514919B1 (en) Voice synthesis for virtual agents
CN112732340B (en) Man-machine dialogue processing method and device
WO2022005615A1 (en) Speech enhancement
CN113921032A (en) Training method and device for audio processing model, and audio processing method and device
CN117035111A (en) Multitasking method, system, computer device and storage medium
CN113990347A (en) Signal processing method, computer equipment and storage medium
CN117351299A (en) Image generation and model training methods, devices, equipment and storage media
CN111128117B (en) Vocoder model, speech synthesis method and device
WO2022266872A1 (en) Virtual meeting control
CN115803805A (en) Conditional Output Generation via Data Density Gradient Estimation
CN115798453A (en) Voice reconstruction method and device, computer equipment and storage medium
CN110874343B (en) Method for processing voice based on deep learning chip and deep learning chip
CN114023313B (en) Training of speech processing model, speech processing method, apparatus, equipment and medium
EP3523800B1 (en) Shared three-dimensional audio bed
US12204532B2 (en) Parameterized narrations for data analytics systems
US20230102798A1 (en) Instruction applicable to radix-3 butterfly computation
CN116051365A (en) Image processing method, device, equipment and medium
CN114974207A (en) Speech synthesis method, speech synthesis device and electronic device
CN116257611A (en) Question-answering model training method, question-answering processing device and storage medium
CN113255233A (en) Business requirement processing method and device, storage medium and electronic equipment
US20250298981A1 (en) Midstream processing of streaming input to generate streaming output

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

GR01 Patent grant