CN108877765A - Processing method and apparatus, computer device, and readable medium for concatenative speech synthesis - Google Patents
Processing method and apparatus, computer device, and readable medium for concatenative speech synthesis
- Publication number
- CN108877765A (application CN201810552365.3A)
- Authority
- CN
- China
- Prior art keywords
- synthesis
- sound library
- text
- model
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a processing method and apparatus, a computer device, and a readable medium for concatenative speech synthesis. The method includes: expanding a speech corpus (sound library) according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material; and performing concatenative speech synthesis using the expanded corpus. By expanding the corpus so that it contains sufficient speech material, more speech segments are available for selection during concatenation, which improves the coherence and naturalness of the synthesized speech to the point that it meets the needs of normal use.
Description
【Technical field】
The present invention relates to the field of computer application technology, and in particular to a processing method and apparatus, a computer device, and a readable medium for concatenative speech synthesis.
【Background art】
Speech synthesis is an important component of human-computer interaction. Common synthesis techniques fall into two major classes: parametric synthesis based on statistical modeling, and concatenative (splicing) synthesis based on unit selection. Because it reuses segments of natural speech, concatenative synthesis offers better sound quality, so existing commercial synthesis systems mainly use it. A typical commercial concatenative synthesis system requires tens of thousands of recorded utterances, amounting to ten hours of data or more, plus extensive manual checking and annotation, before it can guarantee an acceptable synthesis result for arbitrary text.
In scenarios such as celebrity-voice synthesis and personalized voice customization, however, large amounts of speech data usually cannot be collected. Recording costs for a celebrity are high, and asking one to record a large-scale corpus is impractical; likewise, a personalization product cannot require each user to record tens of thousands of utterances before use. Yet these scenarios have great commercial value: celebrity-voice synthesis can effectively raise a product's visibility and reach, and personalization lets users hear their own voice or that of family and friends, increasing engagement and novelty and improving the user experience. In existing celebrity-voice and personalization scenarios, only a small amount of material can be collected into the corpus, so during concatenation very few speech segments are available for selection; the synthesized speech is badly discontinuous, its naturalness is poor, and the concatenated result is essentially unusable.
【Summary of the invention】
The present invention provides a processing method and apparatus, a computer device, and a readable medium for concatenative speech synthesis, for improving the coherence and naturalness of synthesized speech.
The present invention provides a processing method for concatenative speech synthesis, the method including:
expanding a speech corpus according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material; and
performing concatenative speech synthesis using the expanded corpus.
Further optionally, in the method described above, expanding the corpus according to the pre-trained speech synthesis model and the acquired synthesis text specifically includes:
synthesizing, with the speech synthesis model and the acquired synthesis text, the synthesized speech corresponding to the synthesis text; and
updating the corpus with the synthesis text and the corresponding synthesized speech as synthesized material.
Further optionally, in the method described above, before expanding the corpus according to the pre-trained speech synthesis model and the acquired synthesis text, the method includes:
training the speech synthesis model according to the manually collected original material in the corpus before expansion.
Further optionally, in the method described above, the original material includes original text and corresponding original speech, and training the speech synthesis model according to the manually collected original material in the corpus specifically includes training the speech synthesis model according to the original text and the corresponding original speech.
Further optionally, in the method described above, before expanding the corpus, the method includes crawling the synthesis text from the network.
Further optionally, in the method described above, the speech synthesis model is a WaveNet model.
The present invention provides a processing apparatus for concatenative speech synthesis, the apparatus including:
an expansion module, configured to expand a speech corpus according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material; and
a processing module, configured to perform concatenative speech synthesis using the expanded corpus.
Further optionally, in the apparatus described above, the expansion module is specifically configured to: synthesize, with the speech synthesis model and the acquired synthesis text, the synthesized speech corresponding to the synthesis text; and update the corpus with the synthesis text and the corresponding synthesized speech as synthesized material.
Further optionally, the apparatus described above further includes a training module, configured to train the speech synthesis model according to the manually collected original material in the corpus before expansion.
Further optionally, in the apparatus described above, the original material includes original text and corresponding original speech, and the training module is specifically configured to train the speech synthesis model according to the original text and the corresponding original speech.
Further optionally, the apparatus described above further includes a crawling module, configured to crawl the synthesis text from the network.
Further optionally, in the apparatus described above, the speech synthesis model is a WaveNet model.
The present invention also provides a computer device, including:
one or more processors; and
a memory for storing one or more programs;
where, when the one or more programs are executed by the one or more processors, the one or more processors implement the processing method for concatenative speech synthesis described above.
The present invention also provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the processing method for concatenative speech synthesis described above.
With the processing method and apparatus, computer device, and readable medium for concatenative speech synthesis of the present invention, the corpus is expanded according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material, and concatenative synthesis is performed using the expanded corpus. By expanding the corpus so that it contains sufficient material, more speech segments are available for selection during concatenation, which improves the coherence and naturalness of the synthesized speech to the point that it meets the needs of normal use.
【Brief description of the drawings】
Fig. 1 is a flowchart of Embodiment 1 of the processing method for concatenative speech synthesis of the present invention.
Fig. 2 is a flowchart of Embodiment 2 of the processing method for concatenative speech synthesis of the present invention.
Fig. 3 is a structural diagram of Embodiment 1 of the processing apparatus for concatenative speech synthesis of the present invention.
Fig. 4 is a structural diagram of Embodiment 2 of the processing apparatus for concatenative speech synthesis of the present invention.
Fig. 5 is a structural diagram of an embodiment of the computer device of the present invention.
Fig. 6 is an example diagram of a computer device provided by the present invention.
【Specific embodiment】
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of Embodiment 1 of the processing method for concatenative speech synthesis of the present invention. As shown in Fig. 1, the processing method of this embodiment may specifically include the following steps:
100. Expand the speech corpus according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material.
101. Perform concatenative speech synthesis using the expanded corpus.
The executing subject of this method may be a processing apparatus for concatenative speech synthesis. The apparatus expands the corpus required for concatenation so that it contains enough material to meet the needs of the concatenation technique, and then performs concatenative synthesis using the expanded corpus.
In this embodiment, expanding the corpus according to the pre-trained speech synthesis model and the acquired synthesis text means that the expanded corpus contains not only the manually collected original material but also synthesized material produced by the speech synthesis model from the acquired synthesis text. The content of the expanded corpus is therefore rich enough for subsequent concatenative synthesis: because sufficient material is available, the speech produced by concatenation is coherent and natural enough for normal use.
In the processing method of this embodiment, the corpus is expanded according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material, and concatenative synthesis is performed using the expanded corpus. By expanding the corpus so that it contains sufficient material, more speech segments are available for selection during concatenation, which improves the coherence and naturalness of the synthesized speech and allows it to meet the needs of normal use.
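The effect just described can be illustrated with a toy unit-selection sketch in Python. This is a deliberate simplification and not the patent's actual concatenation algorithm: the "units" here are characters of text rather than speech segments, and synthesis greedily picks the longest substring of the target that occurs verbatim in the corpus, so a larger corpus yields fewer joins. All names (`concatenate`, the sample corpora) are illustrative assumptions.

```python
def concatenate(target: str, corpus: list[str]) -> list[str]:
    """Greedy longest-match unit selection: cover `target` with the
    fewest segments found verbatim in `corpus` (toy illustration)."""
    pool = " ".join(corpus)
    segments, i = [], 0
    while i < len(target):
        # Try the longest remaining prefix first, shrink until found.
        for j in range(len(target), i, -1):
            if target[i:j] in pool:
                segments.append(target[i:j])
                i = j
                break
        else:
            # Unit not in corpus at all: fall back to a single character.
            segments.append(target[i])
            i += 1
    return segments

small = ["good morning", "good night"]
large = small + ["good morning everyone", "everyone is here"]
# With more material in the corpus, the same target needs fewer joins.
print(len(concatenate("good morning everyone", small)))  # 9 segments
print(len(concatenate("good morning everyone", large)))  # 1 segment
```

With the small corpus the target must be stitched from nine pieces; after expansion the whole target is covered by a single segment, which is exactly why a larger corpus improves coherence at the joins.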
Fig. 2 is a flowchart of Embodiment 2 of the processing method for concatenative speech synthesis of the present invention. As shown in Fig. 2, building on the technical solution of the embodiment shown in Fig. 1, this embodiment introduces the technical solution of the present invention in more detail and may specifically include the following steps:
200. Train the speech synthesis model according to the manually collected original material in the corpus before expansion.
201. Crawl synthesis text from the network.
202. Synthesize, with the speech synthesis model and the acquired synthesis text, the synthesized speech corresponding to the synthesis text.
203. Update the corpus with the synthesis text and the corresponding synthesized speech as synthesized material.
Steps 202 and 203 are a specific implementation of step 100 of the embodiment shown in Fig. 1.
204. Perform concatenative speech synthesis using the expanded corpus.
Specifically, in this embodiment, part of the original material is first collected manually; for example, the original material may include original text and corresponding original speech, collected by staff. In a celebrity-voice scenario, the original speech is recorded by the celebrity from the provided original text; in a personalization scenario, it is recorded by the user or by the user's family and friends. Because recording is costly and time-consuming, especially for a celebrity, only a small amount of data is collected, for example only one hour of original speech. This original speech carries the characteristics of the speaker, such as timbre. The speech synthesis model is then trained on the manually collected original material so that the speech it synthesizes has the same timbre and other characteristics as the original speech, making the original and synthesized speech sound as if they were produced by the same speaker.
For example, the speech synthesis model of this embodiment may be a WaveNet model. WaveNet, a model with raw-waveform modeling ability proposed by the DeepMind team in 2016, has attracted wide attention from industry and academia since its introduction.
In this embodiment, training the speech synthesis model on the manually collected original material specifically means training it on the original text and the corresponding original speech. For example, multiple training samples may first be extracted from the original material, each consisting of a segment of the original speech and the corresponding fragment of the original text. Before training, the parameters of the WaveNet model are initialized. During training, the text fragment of each sample is input to the WaveNet model, which outputs a synthesized speech segment for that fragment; the cross entropy between the synthesized segment and the sample's recorded segment is then computed, and the parameters of the WaveNet model are adjusted by gradient descent so that the cross entropy reaches a minimum, meaning the synthesized segment is sufficiently close to the recorded one. Training proceeds in this manner over the training samples until the parameters of the WaveNet model are determined, at which point training of the WaveNet model is complete.
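The training loop described above (text features in, cross entropy against the recorded target, parameters adjusted by gradient descent until the loss is minimized) can be sketched in miniature with NumPy. This is a toy stand-in, not WaveNet: a single softmax layer maps a one-hot "text" feature to a distribution over quantized waveform levels, and all sizes and data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_text, n_levels = 4, 8          # text-feature classes, quantized waveform levels
W = rng.normal(scale=0.1, size=(n_text, n_levels))  # model parameters

# Toy "corpus": each text feature is paired with one quantized waveform level.
X = np.eye(n_text)               # one-hot text features
y = np.array([1, 3, 5, 7])       # target quantized samples

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W):
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y]).mean()

lr = 1.0
loss_before = cross_entropy(W)
for _ in range(200):             # gradient descent on the cross entropy
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1  # d(loss)/d(logits) for softmax + CE
    W -= lr * (X.T @ p) / len(y)
loss_after = cross_entropy(W)
print(loss_before > loss_after)  # True: training reduced the cross entropy
```

The real model replaces the single linear layer with WaveNet's stacked dilated convolutions, but the objective (cross entropy over quantized waveform values) and the optimization (gradient descent) are the same in outline.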
After the WaveNet-based speech synthesis model has been obtained in the manner described above, synthesized material can be generated from it to expand the corpus. Specifically, synthesis text is first acquired, preferably matched to the application domain of the concatenative synthesis. For example, if the synthesis is used in the aviation domain, a large amount of aviation text can be acquired from the network as synthesis text; for the artificial intelligence domain, a large amount of AI text; for the education domain, a large amount of education text; and so on. For each domain, text on related topics can be acquired from the network as synthesis text. The acquired synthesis text is then input to the trained speech synthesis model, which synthesizes the corresponding speech. The synthesized speech has the same timbre and other characteristics as the original speech in the original material and sounds like the voice of the same speaker. Finally, the synthesis text and the corresponding synthesized speech are taken as synthesized material and added to the corpus. The synthesis text of this embodiment may be a single text or several texts. Moreover, the amount of synthesized speech can far exceed the amount of original speech: if the original speech amounts to one hour, the synthesized speech can reach twenty hours or more. Using the updated corpus for concatenative synthesis can then satisfy the needs of more concatenation, so that the concatenated result is relatively coherent with good naturalness, meeting the needs of more practical applications.
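Once the model is trained, the expansion step itself reduces to synthesizing speech for each crawled text and inserting the (text, speech) pair into the corpus alongside the recorded material. A minimal sketch follows, with the WaveNet-based model stubbed out as a placeholder `synthesize` function; the function and sample data are assumptions for illustration, not part of the patent.

```python
def synthesize(text: str) -> bytes:
    """Placeholder for the trained speech synthesis model: in the real
    system this would return a waveform in the target speaker's voice."""
    return f"<audio for: {text}>".encode()

def expand_corpus(corpus: dict[str, bytes], crawled_texts: list[str]) -> dict[str, bytes]:
    """Add a synthesized (text, speech) pair for every crawled text that
    is not already covered by the manually recorded material."""
    for text in crawled_texts:
        if text not in corpus:
            corpus[text] = synthesize(text)
    return corpus

# One hour of manually recorded material ...
corpus = {"hello there": b"<recorded audio>"}
# ... expanded with domain text crawled from the web.
crawled = ["flight CA123 is boarding", "welcome aboard", "hello there"]
expand_corpus(corpus, crawled)
print(len(corpus))  # 3: the original recording plus two synthesized entries
```

Note that already-recorded entries are kept as-is, so the natural speech always takes precedence over synthesized material for the text it covers.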
From the above it can be seen that the processing method of this embodiment relies on the offline synthesis capability of the WaveNet-based speech synthesis model: a WaveNet-based model is first built from a small amount of data (for example one hour of recordings), and that model is then used to synthesize a large-scale corpus of some twenty hours with high text coverage. In particular, material that occurs frequently in the target domain can be added for the concrete scenario of the synthesized-speech application. Finally, a concatenative synthesis system is built on the twenty-hour corpus synthesized by the WaveNet-based model. Because the sound quality of speech synthesized by the WaveNet-based model is high, reaching the same quality as manually collected speech, and because the corpus has been extended to the twenty-hour scale, enough units are available for selection during concatenation, ensuring that the concatenated result is relatively coherent and natural.
Compared with a traditional corpus that collects only a small amount of material, the processing method of this embodiment clearly improves the sound quality and fluency of the concatenated result even with little data. When building a celebrity-voice corpus, it reduces the amount the celebrity must record and thus the cost; when building a personalized corpus from the small amount of data a user provides, it can synthesize a high-fidelity voice and improve the user experience.
The processing method of this embodiment can quickly improve the small-data synthesis quality of existing commercial synthesis systems (such as the speech synthesis systems of various companies). As computing power grows and the WaveNet model is further optimized, the WaveNet model may at some point also be deployed online directly. The method makes full use of WaveNet's modeling ability while effectively avoiding the high computational cost, high latency, and poor real-time factor of using WaveNet directly, and markedly improves online synthesis quality under small data volumes.
Fig. 3 is a structural diagram of Embodiment 1 of the processing apparatus for concatenative speech synthesis of the present invention. As shown in Fig. 3, the processing apparatus of this embodiment may specifically include:
an expansion module 10, configured to expand the speech corpus according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material; and
a processing module 11, configured to perform concatenative speech synthesis using the corpus expanded by the expansion module 10.
The implementation principle and technical effect with which the processing apparatus of this embodiment realizes concatenative synthesis using the above modules are the same as those of the related method embodiments above; for details, refer to the description of those embodiments, which is not repeated here.
Fig. 4 is a structural diagram of Embodiment 2 of the processing apparatus for concatenative speech synthesis of the present invention. As shown in Fig. 4, building on the technical solution of the embodiment shown in Fig. 3, this embodiment introduces the technical solution of the present invention in more detail.
In the processing apparatus of this embodiment, the expansion module 10 is specifically configured to: synthesize, with the speech synthesis model and the acquired synthesis text, the synthesized speech corresponding to the synthesis text; and update the corpus with the synthesis text and the corresponding synthesized speech as synthesized material.
Further optionally, as shown in Fig. 4, the processing apparatus of this embodiment further includes a training module 12, configured to train the speech synthesis model according to the manually collected original material in the corpus before expansion.
Further optionally, the original material may include original text and corresponding original speech, and the training module 12 is specifically configured to train the speech synthesis model according to the original text and the corresponding original speech. Accordingly, the expansion module 10 expands the corpus according to the speech synthesis model pre-trained by the training module 12 and the acquired synthesis text.
Further optionally, as shown in Fig. 4, the processing apparatus of this embodiment further includes a crawling module 13, configured to crawl the synthesis text from the network. Accordingly, the expansion module 10 expands the corpus according to the speech synthesis model pre-trained by the training module 12 and the synthesis text acquired by the crawling module 13.
Further optionally, in the processing apparatus of this embodiment, the speech synthesis model is a WaveNet model.
The implementation principle and technical effect with which the processing apparatus of this embodiment realizes concatenative synthesis using the above modules are the same as those of the related method embodiments above; for details, refer to the description of those embodiments, which is not repeated here.
Fig. 5 is the structure chart of computer equipment embodiment of the invention.As shown in figure 5, the computer equipment of the present embodiment,
Including:One or more processors 30 and memory 40, memory 40 work as memory for storing one or more programs
The one or more programs stored in 40 are executed by one or more processors 30, so that one or more processors 30 are realized such as
The processing method of figure 1 above-embodiment illustrated in fig. 2 voice joint synthesis.To include multiple processors 30 in embodiment illustrated in fig. 5
For.
For example, Fig. 6 is an exemplary diagram of a computer device provided by the present invention. Fig. 6 shows a block diagram of an exemplary computer device 12a suitable for implementing embodiments of the present invention. The computer device 12a shown in Fig. 6 is only an example and should not impose any limitation on the functions and scope of use of embodiments of the present invention.
As shown in Fig. 6, the computer device 12a is embodied in the form of a general-purpose computing device. Components of the computer device 12a may include, but are not limited to: one or more processors 16a, a system memory 28a, and a bus 18a connecting different system components (including the system memory 28a and the processors 16a).
The bus 18a represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12a typically comprises a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer device 12a, including volatile and non-volatile media, and removable and non-removable media.
The system memory 28a may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30a and/or a cache memory 32a. The computer device 12a may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34a may be used for reading from and writing to non-removable, non-volatile magnetic media (not shown in Fig. 6, commonly referred to as a "hard disk drive"). Although not shown in Fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a CD-ROM, a DVD-ROM, or other optical media), may also be provided. In these cases, each drive may be connected to the bus 18a via one or more data media interfaces. The system memory 28a may include at least one program product having a set of (e.g., at least one) program modules configured to carry out the functions of the embodiments of Figs. 1-4 of the present invention described above.
A program/utility 40a, having a set of (at least one) program modules 42a, may be stored, for example, in the system memory 28a. Such program modules 42a include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42a generally carry out the functions and/or methods of the embodiments of Figs. 1-4 described in the present invention.
The computer device 12a may also communicate with one or more external devices 14a (e.g., a keyboard, a pointing device, a display 24a, etc.), with one or more devices that enable a user to interact with the computer device 12a, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer device 12a to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 22a. Moreover, the computer device 12a may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 20a. As shown, the network adapter 20a communicates with the other modules of the computer device 12a via the bus 18a. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer device 12a, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
By running the programs stored in the system memory 28a, the processor 16a executes various functional applications and data processing, for example, implementing the processing method for speech splicing synthesis shown in the above embodiments.
The present invention further provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the processing method for speech splicing synthesis shown in the above embodiments.
The computer-readable medium of this embodiment may include the RAM 30a, and/or the cache memory 32a, and/or the storage system 34a in the system memory 28a of the embodiment shown in Fig. 6.
With the development of technology, the propagation channel of a computer program is no longer limited to tangible media: the program may also be downloaded directly from a network, or obtained in other ways. Accordingly, the computer-readable medium in this embodiment may include not only tangible media but also intangible media.
The computer-readable medium of this embodiment may employ any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal, in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium, other than a computer-readable storage medium, that can send, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical fiber cable, RF, or any suitable combination thereof.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and other division manners are possible in actual implementation.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place, or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (14)
1. A processing method for speech splicing synthesis, wherein the method comprises:
expanding a sound library according to a speech synthesis model trained in advance and acquired synthesis text, wherein the sound library before expansion includes original corpus collected manually; and
performing speech splicing synthesis processing using the expanded sound library.
2. The method according to claim 1, wherein expanding the sound library according to the speech synthesis model trained in advance and the acquired synthesis text specifically comprises:
synthesizing, using the speech synthesis model and the acquired synthesis text, synthesized speech corresponding to the synthesis text; and
updating the synthesis text and the corresponding synthesized speech into the sound library as synthesized corpus.
3. The method according to claim 1, wherein before expanding the sound library according to the speech synthesis model trained in advance and the acquired synthesis text, the method comprises:
training the speech synthesis model according to the original corpus collected manually in the sound library before expansion.
4. The method according to claim 3, wherein the original corpus includes original text and corresponding original speech; and
training the speech synthesis model according to the original corpus collected manually in the sound library specifically comprises:
training the speech synthesis model according to the original text and the corresponding original speech.
5. The method according to claim 1, wherein before expanding the sound library according to the speech synthesis model trained in advance and the acquired synthesis text, the method comprises:
crawling the synthesis text from a network.
6. The method according to any one of claims 1-5, wherein the speech synthesis model is a WaveNet model.
7. A processing apparatus for speech splicing synthesis, wherein the apparatus comprises:
an enlargement module, configured to expand a sound library according to a speech synthesis model trained in advance and acquired synthesis text, wherein the sound library before expansion includes original corpus collected manually; and
a processing module, configured to perform speech splicing synthesis processing using the expanded sound library.
8. The apparatus according to claim 7, wherein the enlargement module is specifically configured to:
synthesize, using the speech synthesis model and the acquired synthesis text, synthesized speech corresponding to the synthesis text; and
update the synthesis text and the corresponding synthesized speech into the sound library as synthesized corpus.
9. The apparatus according to claim 7, wherein the apparatus further comprises:
a training module, configured to train the speech synthesis model according to the original corpus collected manually in the sound library before expansion.
10. The apparatus according to claim 9, wherein the original corpus includes original text and corresponding original speech; and
the training module is specifically configured to train the speech synthesis model according to the original text and the corresponding original speech.
11. The apparatus according to claim 7, wherein the apparatus further comprises:
an acquiring module, configured to crawl the synthesis text from a network.
12. The apparatus according to any one of claims 7-11, wherein the speech synthesis model is a WaveNet model.
13. A computer device, wherein the device comprises:
one or more processors; and
a memory, configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-6.
14. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810552365.3A CN108877765A (en) | 2018-05-31 | 2018-05-31 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
US16/226,321 US10803851B2 (en) | 2018-05-31 | 2018-12-19 | Method and apparatus for processing speech splicing and synthesis, computer device and readable medium |
JP2018239323A JP6786751B2 (en) | 2018-05-31 | 2018-12-21 | Voice connection synthesis processing methods and equipment, computer equipment and computer programs |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810552365.3A CN108877765A (en) | 2018-05-31 | 2018-05-31 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108877765A true CN108877765A (en) | 2018-11-23 |
Family
ID=64335626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810552365.3A Pending CN108877765A (en) | 2018-05-31 | 2018-05-31 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
Country Status (3)
Country | Link |
---|---|
US (1) | US10803851B2 (en) |
JP (1) | JP6786751B2 (en) |
CN (1) | CN108877765A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109448694A (en) * | 2018-12-27 | 2019-03-08 | 苏州思必驰信息科技有限公司 | A kind of method and device of rapid synthesis TTS voice |
CN110162176A (en) * | 2019-05-20 | 2019-08-23 | 北京百度网讯科技有限公司 | The method for digging and device terminal, computer-readable medium of phonetic order |
CN110390928A (en) * | 2019-08-07 | 2019-10-29 | 广州多益网络股份有限公司 | It is a kind of to open up the speech synthesis model training method and system for increasing corpus automatically |
CN111369966A (en) * | 2018-12-06 | 2020-07-03 | 阿里巴巴集团控股有限公司 | Method and device for personalized speech synthesis |
CN112242134A (en) * | 2019-07-01 | 2021-01-19 | 北京邮电大学 | Speech synthesis method and device |
CN112634860A (en) * | 2020-12-29 | 2021-04-09 | 苏州思必驰信息科技有限公司 | Method for screening training corpus of children voice recognition model |
US20210110273A1 (en) * | 2019-10-10 | 2021-04-15 | Samsung Electronics Co., Ltd. | Apparatus and method with model training |
CN115312024A (en) * | 2022-06-24 | 2022-11-08 | 普强时代(珠海横琴)信息技术有限公司 | Method and device for making sound library based on end-to-end splicing synthesis |
CN115602146A (en) * | 2022-09-08 | 2023-01-13 | 建信金融科技有限责任公司(Cn) | Spliced voice generation method and device, electronic equipment and storage medium |
CN115836300A (en) * | 2020-07-09 | 2023-03-21 | 谷歌有限责任公司 | Self-training WaveNet for text-to-speech |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180197438A1 (en) * | 2017-01-10 | 2018-07-12 | International Business Machines Corporation | System for enhancing speech performance via pattern detection and learning |
CN108877765A (en) * | 2018-05-31 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
JP7381020B2 (en) * | 2019-05-24 | 2023-11-15 | 日本電信電話株式会社 | Data generation model learning device, data generation device, data generation model learning method, data generation method, program |
CN111862933A (en) * | 2020-07-20 | 2020-10-30 | 北京字节跳动网络技术有限公司 | Method, apparatus, device, and medium for generating synthesized speech |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7430503B1 (en) * | 2004-08-24 | 2008-09-30 | The United States Of America As Represented By The Director, National Security Agency | Method of combining corpora to achieve consistency in phonetic labeling |
CN101350195A (en) * | 2007-07-19 | 2009-01-21 | 财团法人工业技术研究院 | Speech synthesizer generating system and method |
CN105304080A (en) * | 2015-09-22 | 2016-02-03 | 科大讯飞股份有限公司 | Speech synthesis device and speech synthesis method |
CN106297766A (en) * | 2015-06-04 | 2017-01-04 | 科大讯飞股份有限公司 | Phoneme synthesizing method and system |
Family Cites Families (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7082396B1 (en) * | 1999-04-30 | 2006-07-25 | At&T Corp | Methods and apparatus for rapid acoustic unit selection from a large speech corpus |
US6865533B2 (en) * | 2000-04-21 | 2005-03-08 | Lessac Technology Inc. | Text to speech |
JP4680429B2 (en) * | 2001-06-26 | 2011-05-11 | Okiセミコンダクタ株式会社 | High speed reading control method in text-to-speech converter |
JP2003058181A (en) * | 2001-08-14 | 2003-02-28 | Oki Electric Ind Co Ltd | Voice synthesizing device |
US20040030555A1 (en) * | 2002-08-12 | 2004-02-12 | Oregon Health & Science University | System and method for concatenating acoustic contours for speech synthesis |
US7280967B2 (en) * | 2003-07-30 | 2007-10-09 | International Business Machines Corporation | Method for detecting misaligned phonetic units for a concatenative text-to-speech voice |
JP4034751B2 (en) | 2004-03-31 | 2008-01-16 | 株式会社東芝 | Speech synthesis apparatus, speech synthesis method, and speech synthesis program |
US7475016B2 (en) * | 2004-12-15 | 2009-01-06 | International Business Machines Corporation | Speech segment clustering and ranking |
EP1872361A4 (en) * | 2005-03-28 | 2009-07-22 | Lessac Technologies Inc | Hybrid speech synthesizer, method and use |
CN1889170B (en) * | 2005-06-28 | 2010-06-09 | 纽昂斯通讯公司 | Method and system for generating synthesized speech based on recorded speech template |
JP2007024960A (en) * | 2005-07-12 | 2007-02-01 | Internatl Business Mach Corp <Ibm> | System, program and control method |
JP5457706B2 (en) | 2009-03-30 | 2014-04-02 | 株式会社東芝 | Speech model generation device, speech synthesis device, speech model generation program, speech synthesis program, speech model generation method, and speech synthesis method |
WO2011025532A1 (en) * | 2009-08-24 | 2011-03-03 | NovaSpeech, LLC | System and method for speech synthesis using frequency splicing |
CN102117614B (en) * | 2010-01-05 | 2013-01-02 | 索尼爱立信移动通讯有限公司 | Personalized text-to-speech synthesis and personalized speech feature extraction |
US20120316881A1 (en) * | 2010-03-25 | 2012-12-13 | Nec Corporation | Speech synthesizer, speech synthesis method, and speech synthesis program |
JP5758713B2 (en) * | 2011-06-22 | 2015-08-05 | 株式会社日立製作所 | Speech synthesis apparatus, navigation apparatus, and speech synthesis method |
JP6170384B2 (en) | 2013-09-09 | 2017-07-26 | 株式会社日立超エル・エス・アイ・システムズ | Speech database generation system, speech database generation method, and program |
CN104142909B (en) * | 2014-05-07 | 2016-04-27 | 腾讯科技(深圳)有限公司 | A kind of phonetic annotation of Chinese characters method and device |
US9679554B1 (en) * | 2014-06-23 | 2017-06-13 | Amazon Technologies, Inc. | Text-to-speech corpus development system |
US10186251B1 (en) * | 2015-08-06 | 2019-01-22 | Oben, Inc. | Voice conversion using deep neural network with intermediate voice training |
US9697820B2 (en) * | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
CN105206258B (en) * | 2015-10-19 | 2018-05-04 | 百度在线网络技术(北京)有限公司 | The generation method and device and phoneme synthesizing method and device of acoustic model |
CN105185372B (en) * | 2015-10-20 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device |
US10147416B2 (en) * | 2015-12-09 | 2018-12-04 | Amazon Technologies, Inc. | Text-to-speech processing systems and methods |
US9934775B2 (en) * | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US10319365B1 (en) * | 2016-06-27 | 2019-06-11 | Amazon Technologies, Inc. | Text-to-speech processing with emphasized output audio |
US10339925B1 (en) * | 2016-09-26 | 2019-07-02 | Amazon Technologies, Inc. | Generation of automated message responses |
US10448115B1 (en) * | 2016-09-28 | 2019-10-15 | Amazon Technologies, Inc. | Speech recognition for localized content |
WO2018058425A1 (en) * | 2016-09-29 | 2018-04-05 | 中国科学院深圳先进技术研究院 | Virtual reality guided hypnotic voice processing method and apparatus |
US11069335B2 (en) * | 2016-10-04 | 2021-07-20 | Cerence Operating Company | Speech synthesis using one or more recurrent neural networks |
US10565989B1 (en) * | 2016-12-16 | 2020-02-18 | Amazon Technologies, Inc. | Ingesting device specific content |
US10276149B1 (en) * | 2016-12-21 | 2019-04-30 | Amazon Technologies, Inc. | Dynamic text-to-speech output |
US10325599B1 (en) * | 2016-12-28 | 2019-06-18 | Amazon Technologies, Inc. | Message response routing |
US10872598B2 (en) * | 2017-02-24 | 2020-12-22 | Baidu Usa Llc | Systems and methods for real-time neural text-to-speech |
US20180330713A1 (en) * | 2017-05-14 | 2018-11-15 | International Business Machines Corporation | Text-to-Speech Synthesis with Dynamically-Created Virtual Voices |
US10896669B2 (en) * | 2017-05-19 | 2021-01-19 | Baidu Usa Llc | Systems and methods for multi-speaker neural text-to-speech |
US10418033B1 (en) * | 2017-06-01 | 2019-09-17 | Amazon Technologies, Inc. | Configurable output data formats |
US10332517B1 (en) * | 2017-06-02 | 2019-06-25 | Amazon Technologies, Inc. | Privacy mode based on speaker identifier |
US10446147B1 (en) * | 2017-06-27 | 2019-10-15 | Amazon Technologies, Inc. | Contextual voice user interface |
CN107393556B (en) | 2017-07-17 | 2021-03-12 | 京东方科技集团股份有限公司 | A method and device for implementing audio processing |
US10672416B2 (en) * | 2017-10-20 | 2020-06-02 | Board Of Trustees Of The University Of Illinois | Causing microphones to detect inaudible sounds and defense against inaudible attacks |
US10600408B1 (en) * | 2018-03-23 | 2020-03-24 | Amazon Technologies, Inc. | Content output management based on speech quality |
US10770063B2 (en) * | 2018-04-13 | 2020-09-08 | Adobe Inc. | Real-time speaker-dependent neural vocoder |
CN108877765A (en) * | 2018-05-31 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
CN108550363B (en) * | 2018-06-04 | 2019-08-27 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method and device, computer equipment and readable medium |
Patent family events (2018):
- 2018-05-31: CN application CN201810552365.3A filed (CN108877765A, status: Pending)
- 2018-12-19: US application 16/226,321 filed (US10803851B2, status: Active)
- 2018-12-21: JP application 2018239323 filed (JP6786751B2, status: Active)
Also Published As
Publication number | Publication date |
---|---|
US10803851B2 (en) | 2020-10-13 |
JP6786751B2 (en) | 2020-11-18 |
US20190371291A1 (en) | 2019-12-05 |
JP2019211747A (en) | 2019-12-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108877765A (en) | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis | |
JP6019108B2 (en) | Video generation based on text | |
CN105185372B (en) | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device | |
JP6783479B1 (en) | Video generation program, video generation device and video generation method | |
US11847726B2 (en) | Method for outputting blend shape value, storage medium, and electronic device | |
US9037956B2 (en) | Content customization | |
US8849676B2 (en) | Content customization | |
JP2014519082A5 (en) | ||
US20220345796A1 (en) | Systems and methods for generating synthetic videos based on audio contents | |
Steinmetz et al. | Multimedia fundamentals, Volume 1: Media coding and content processing | |
CN113299312A (en) | Image generation method, device, equipment and storage medium | |
CN110047121A (en) | Animation producing method, device and electronic equipment end to end | |
CN109599090A (en) | A kind of method, device and equipment of speech synthesis | |
WO2024122284A1 (en) | Information processing device, information processing method, and information processing program | |
KR20180012166A (en) | Story-telling system for changing 3 dimension character into 3 dimension avatar | |
CN112383721B (en) | Method, apparatus, device and medium for generating video | |
CN112750184B (en) | Method and equipment for data processing, action driving and man-machine interaction | |
WO2023090419A1 (en) | Content generation device, content generation method, and program | |
US11289067B2 (en) | Voice generation based on characteristics of an avatar | |
JP2020204683A (en) | Electronic publication audio-visual system, audio-visual electronic publication creation program, and program for user terminal | |
CN112383722B (en) | Method and apparatus for generating video | |
CN117711372A (en) | Speech synthesis method, device, computer equipment and storage medium | |
CN117577122A (en) | A data processing method, device and related equipment | |
CN120321350A (en) | Digital human video synthesis method, server and storage medium | |
WO2025094234A1 (en) | Information processing device, information processing method, and program |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181123 |