CN108877765A - Processing method and apparatus, computer device, and readable medium for concatenative speech synthesis - Google Patents
Processing method and apparatus, computer device, and readable medium for concatenative speech synthesis
- Publication number
- CN108877765A (application CN201810552365.3A)
- Authority
- CN
- China
- Prior art keywords
- synthesis
- sound library
- text
- model
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/027—Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a processing method and apparatus, a computer device, and a readable medium for concatenative speech synthesis. The method includes: expanding a speech corpus (sound library) according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material; and performing concatenative speech synthesis using the expanded corpus. By expanding the corpus so that it contains sufficient speech material, more speech segments are available for selection during concatenation, which improves the coherence and naturalness of the synthesized speech to the point that it meets the needs of normal use.
Description
【Technical field】
The present invention relates to the field of computer application technology, and in particular to a processing method and apparatus, a computer device, and a readable medium for concatenative speech synthesis.
【Background art】
Speech synthesis is an important component of human-computer interaction. Common synthesis techniques fall into two major classes: parametric synthesis based on statistical modeling, and concatenative (splicing) synthesis based on unit selection. Because it reuses segments of natural speech, concatenative synthesis offers better sound quality, so existing commercial synthesis systems mainly use it. A typical commercial concatenative synthesis system requires tens of thousands of recorded utterances, amounting to ten hours of data or more, plus extensive manual checking and annotation, before it can guarantee an acceptable synthesis result for arbitrary text.
In scenarios such as celebrity-voice synthesis and personalized voice customization, however, large amounts of speech data usually cannot be collected. Recording costs for a celebrity are high, and asking one to record a large-scale corpus is impractical; likewise, a personalization product cannot require each user to record tens of thousands of utterances before use. Yet these scenarios have great commercial value: celebrity-voice synthesis can effectively raise a product's visibility and reach, and personalization lets users hear their own voice or that of family and friends, increasing engagement and novelty and improving the user experience. In existing celebrity-voice and personalization scenarios, only a small amount of material can be collected into the corpus, so during concatenation very few speech segments are available for selection; the synthesized speech is badly discontinuous, its naturalness is poor, and the concatenated result is essentially unusable.
【Summary of the invention】
The present invention provides a processing method and apparatus, a computer device, and a readable medium for concatenative speech synthesis, for improving the coherence and naturalness of synthesized speech.
The present invention provides a processing method for concatenative speech synthesis, the method including:
expanding a speech corpus according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material; and
performing concatenative speech synthesis using the expanded corpus.
Further optionally, in the method described above, expanding the corpus according to the pre-trained speech synthesis model and the acquired synthesis text specifically includes:
synthesizing, with the speech synthesis model and the acquired synthesis text, the synthesized speech corresponding to the synthesis text; and
updating the corpus with the synthesis text and the corresponding synthesized speech as synthesized material.
Further optionally, in the method described above, before expanding the corpus according to the pre-trained speech synthesis model and the acquired synthesis text, the method includes:
training the speech synthesis model according to the manually collected original material in the corpus before expansion.
Further optionally, in the method described above, the original material includes original text and corresponding original speech, and training the speech synthesis model according to the manually collected original material in the corpus specifically includes training the speech synthesis model according to the original text and the corresponding original speech.
Further optionally, in the method described above, before expanding the corpus, the method includes crawling the synthesis text from the network.
Further optionally, in the method described above, the speech synthesis model is a WaveNet model.
The present invention provides a processing apparatus for concatenative speech synthesis, the apparatus including:
an expansion module, configured to expand a speech corpus according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material; and
a processing module, configured to perform concatenative speech synthesis using the expanded corpus.
Further optionally, in the apparatus described above, the expansion module is specifically configured to: synthesize, with the speech synthesis model and the acquired synthesis text, the synthesized speech corresponding to the synthesis text; and update the corpus with the synthesis text and the corresponding synthesized speech as synthesized material.
Further optionally, the apparatus described above further includes a training module, configured to train the speech synthesis model according to the manually collected original material in the corpus before expansion.
Further optionally, in the apparatus described above, the original material includes original text and corresponding original speech, and the training module is specifically configured to train the speech synthesis model according to the original text and the corresponding original speech.
Further optionally, the apparatus described above further includes a crawling module, configured to crawl the synthesis text from the network.
Further optionally, in the apparatus described above, the speech synthesis model is a WaveNet model.
The present invention also provides a computer device, including:
one or more processors; and
a memory for storing one or more programs;
where, when the one or more programs are executed by the one or more processors, the one or more processors implement the processing method for concatenative speech synthesis described above.
The present invention also provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the processing method for concatenative speech synthesis described above.
With the processing method and apparatus, computer device, and readable medium for concatenative speech synthesis of the present invention, the corpus is expanded according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material, and concatenative synthesis is performed using the expanded corpus. By expanding the corpus so that it contains sufficient material, more speech segments are available for selection during concatenation, which improves the coherence and naturalness of the synthesized speech to the point that it meets the needs of normal use.
【Brief description of the drawings】
Fig. 1 is a flowchart of Embodiment 1 of the processing method for concatenative speech synthesis of the present invention.
Fig. 2 is a flowchart of Embodiment 2 of the processing method for concatenative speech synthesis of the present invention.
Fig. 3 is a structural diagram of Embodiment 1 of the processing apparatus for concatenative speech synthesis of the present invention.
Fig. 4 is a structural diagram of Embodiment 2 of the processing apparatus for concatenative speech synthesis of the present invention.
Fig. 5 is a structural diagram of an embodiment of the computer device of the present invention.
Fig. 6 is an example diagram of a computer device provided by the present invention.
【Specific embodiment】
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of Embodiment 1 of the processing method for concatenative speech synthesis of the present invention. As shown in Fig. 1, the processing method of this embodiment may specifically include the following steps:
100. Expand the speech corpus according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material.
101. Perform concatenative speech synthesis using the expanded corpus.
The executing subject of this method may be a processing apparatus for concatenative speech synthesis. The apparatus expands the corpus required for concatenation so that it contains enough material to meet the needs of the concatenation technique, and then performs concatenative synthesis using the expanded corpus.
In this embodiment, expanding the corpus according to the pre-trained speech synthesis model and the acquired synthesis text means that the expanded corpus contains not only the manually collected original material but also synthesized material produced by the speech synthesis model from the acquired synthesis text. The content of the expanded corpus is therefore rich enough for subsequent concatenative synthesis: because sufficient material is available, the speech produced by concatenation is coherent and natural enough for normal use.
In the processing method of this embodiment, the corpus is expanded according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material, and concatenative synthesis is performed using the expanded corpus. By expanding the corpus so that it contains sufficient material, more speech segments are available for selection during concatenation, which improves the coherence and naturalness of the synthesized speech and allows it to meet the needs of normal use.
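The effect just described can be illustrated with a toy unit-selection sketch in Python. This is a deliberate simplification and not the patent's actual concatenation algorithm: the "units" here are characters of text rather than speech segments, and synthesis greedily picks the longest substring of the target that occurs verbatim in the corpus, so a larger corpus yields fewer joins. All names (`concatenate`, the sample corpora) are illustrative assumptions.

```python
def concatenate(target: str, corpus: list[str]) -> list[str]:
    """Greedy longest-match unit selection: cover `target` with the
    fewest segments found verbatim in `corpus` (toy illustration)."""
    pool = " ".join(corpus)
    segments, i = [], 0
    while i < len(target):
        # Try the longest remaining prefix first, shrink until found.
        for j in range(len(target), i, -1):
            if target[i:j] in pool:
                segments.append(target[i:j])
                i = j
                break
        else:
            # Unit not in corpus at all: fall back to a single character.
            segments.append(target[i])
            i += 1
    return segments

small = ["good morning", "good night"]
large = small + ["good morning everyone", "everyone is here"]
# With more material in the corpus, the same target needs fewer joins.
print(len(concatenate("good morning everyone", small)))  # 9 segments
print(len(concatenate("good morning everyone", large)))  # 1 segment
```

With the small corpus the target must be stitched from nine pieces; after expansion the whole target is covered by a single segment, which is exactly why a larger corpus improves coherence at the joins.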
Fig. 2 is a flowchart of Embodiment 2 of the processing method for concatenative speech synthesis of the present invention. As shown in Fig. 2, building on the technical solution of the embodiment shown in Fig. 1, this embodiment introduces the technical solution of the present invention in more detail and may specifically include the following steps:
200. Train the speech synthesis model according to the manually collected original material in the corpus before expansion.
201. Crawl synthesis text from the network.
202. Synthesize, with the speech synthesis model and the acquired synthesis text, the synthesized speech corresponding to the synthesis text.
203. Update the corpus with the synthesis text and the corresponding synthesized speech as synthesized material.
Steps 202 and 203 are a specific implementation of step 100 of the embodiment shown in Fig. 1.
204. Perform concatenative speech synthesis using the expanded corpus.
Specifically, in this embodiment, part of the original material is first collected manually; for example, the original material may include original text and corresponding original speech, collected by staff. In a celebrity-voice scenario, the original speech is recorded by the celebrity from the provided original text; in a personalization scenario, it is recorded by the user or by the user's family and friends. Because recording is costly and time-consuming, especially for a celebrity, only a small amount of data is collected, for example only one hour of original speech. This original speech carries the characteristics of the speaker, such as timbre. The speech synthesis model is then trained on the manually collected original material so that the speech it synthesizes has the same timbre and other characteristics as the original speech, making the original and synthesized speech sound as if they were produced by the same speaker.
For example, the speech synthesis model of this embodiment may be a WaveNet model. WaveNet, a model with raw-waveform modeling ability proposed by the DeepMind team in 2016, has attracted wide attention from industry and academia since its introduction.
In this embodiment, training the speech synthesis model on the manually collected original material specifically means training it on the original text and the corresponding original speech. For example, multiple training samples may first be extracted from the original material, each consisting of a segment of the original speech and the corresponding fragment of the original text. Before training, the parameters of the WaveNet model are initialized. During training, the text fragment of each sample is input to the WaveNet model, which outputs a synthesized speech segment for that fragment; the cross entropy between the synthesized segment and the sample's recorded segment is then computed, and the parameters of the WaveNet model are adjusted by gradient descent so that the cross entropy reaches a minimum, meaning the synthesized segment is sufficiently close to the recorded one. Training proceeds in this manner over the training samples until the parameters of the WaveNet model are determined, at which point training of the WaveNet model is complete.
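The training loop described above (text features in, cross entropy against the recorded target, parameters adjusted by gradient descent until the loss is minimized) can be sketched in miniature with NumPy. This is a toy stand-in, not WaveNet: a single softmax layer maps a one-hot "text" feature to a distribution over quantized waveform levels, and all sizes and data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_text, n_levels = 4, 8          # text-feature classes, quantized waveform levels
W = rng.normal(scale=0.1, size=(n_text, n_levels))  # model parameters

# Toy "corpus": each text feature is paired with one quantized waveform level.
X = np.eye(n_text)               # one-hot text features
y = np.array([1, 3, 5, 7])       # target quantized samples

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W):
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y]).mean()

lr = 1.0
loss_before = cross_entropy(W)
for _ in range(200):             # gradient descent on the cross entropy
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1  # d(loss)/d(logits) for softmax + CE
    W -= lr * (X.T @ p) / len(y)
loss_after = cross_entropy(W)
print(loss_before > loss_after)  # True: training reduced the cross entropy
```

The real model replaces the single linear layer with WaveNet's stacked dilated convolutions, but the objective (cross entropy over quantized waveform values) and the optimization (gradient descent) are the same in outline.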
After the WaveNet-based speech synthesis model has been obtained in the manner described above, synthesized material can be generated from it to expand the corpus. Specifically, synthesis text is first acquired, preferably matched to the application domain of the concatenative synthesis. For example, if the synthesis is used in the aviation domain, a large amount of aviation text can be acquired from the network as synthesis text; for the artificial intelligence domain, a large amount of AI text; for the education domain, a large amount of education text; and so on. For each domain, text on related topics can be acquired from the network as synthesis text. The acquired synthesis text is then input to the trained speech synthesis model, which synthesizes the corresponding speech. The synthesized speech has the same timbre and other characteristics as the original speech in the original material and sounds like the voice of the same speaker. Finally, the synthesis text and the corresponding synthesized speech are taken as synthesized material and added to the corpus. The synthesis text of this embodiment may be a single text or several texts. Moreover, the amount of synthesized speech can far exceed the amount of original speech: if the original speech amounts to one hour, the synthesized speech can reach twenty hours or more. Using the updated corpus for concatenative synthesis can then satisfy the needs of more concatenation, so that the concatenated result is relatively coherent with good naturalness, meeting the needs of more practical applications.
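Once the model is trained, the expansion step itself reduces to synthesizing speech for each crawled text and inserting the (text, speech) pair into the corpus alongside the recorded material. A minimal sketch follows, with the WaveNet-based model stubbed out as a placeholder `synthesize` function; the function and sample data are assumptions for illustration, not part of the patent.

```python
def synthesize(text: str) -> bytes:
    """Placeholder for the trained speech synthesis model: in the real
    system this would return a waveform in the target speaker's voice."""
    return f"<audio for: {text}>".encode()

def expand_corpus(corpus: dict[str, bytes], crawled_texts: list[str]) -> dict[str, bytes]:
    """Add a synthesized (text, speech) pair for every crawled text that
    is not already covered by the manually recorded material."""
    for text in crawled_texts:
        if text not in corpus:
            corpus[text] = synthesize(text)
    return corpus

# One hour of manually recorded material ...
corpus = {"hello there": b"<recorded audio>"}
# ... expanded with domain text crawled from the web.
crawled = ["flight CA123 is boarding", "welcome aboard", "hello there"]
expand_corpus(corpus, crawled)
print(len(corpus))  # 3: the original recording plus two synthesized entries
```

Note that already-recorded entries are kept as-is, so the natural speech always takes precedence over synthesized material for the text it covers.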
From the above it can be seen that the processing method of this embodiment relies on the offline synthesis capability of the WaveNet-based speech synthesis model: a WaveNet-based model is first built from a small amount of data (for example one hour of recordings), and that model is then used to synthesize a large-scale corpus of some twenty hours with high text coverage. In particular, material that occurs frequently in the target domain can be added for the concrete scenario of the synthesized-speech application. Finally, a concatenative synthesis system is built on the twenty-hour corpus synthesized by the WaveNet-based model. Because the sound quality of speech synthesized by the WaveNet-based model is high, reaching the same quality as manually collected speech, and because the corpus has been extended to the twenty-hour scale, enough units are available for selection during concatenation, ensuring that the concatenated result is relatively coherent and natural.
Compared with a traditional corpus that collects only a small amount of material, the processing method of this embodiment clearly improves the sound quality and fluency of the concatenated result even with little data. When building a celebrity-voice corpus, it reduces the amount the celebrity must record and thus the cost; when building a personalized corpus from the small amount of data a user provides, it can synthesize a high-fidelity voice and improve the user experience.
The processing method of this embodiment can quickly improve the small-data synthesis quality of existing commercial synthesis systems (such as the speech synthesis systems of various companies). As computing power grows and the WaveNet model is further optimized, the WaveNet model may at some point also be deployed online directly. The method makes full use of WaveNet's modeling ability while effectively avoiding the high computational cost, high latency, and poor real-time factor of using WaveNet directly, and markedly improves online synthesis quality under small data volumes.
Fig. 3 is a structural diagram of Embodiment 1 of the processing apparatus for concatenative speech synthesis of the present invention. As shown in Fig. 3, the processing apparatus of this embodiment may specifically include:
an expansion module 10, configured to expand the speech corpus according to a pre-trained speech synthesis model and acquired synthesis text, the corpus before expansion containing manually collected original material; and
a processing module 11, configured to perform concatenative speech synthesis using the corpus expanded by the expansion module 10.
The implementation principle and technical effect with which the processing apparatus of this embodiment realizes concatenative synthesis using the above modules are the same as those of the related method embodiments above; for details, refer to the description of those embodiments, which is not repeated here.
Fig. 4 is a structural diagram of Embodiment 2 of the processing apparatus for concatenative speech synthesis of the present invention. As shown in Fig. 4, building on the technical solution of the embodiment shown in Fig. 3, this embodiment introduces the technical solution of the present invention in more detail.
In the processing apparatus of this embodiment, the expansion module 10 is specifically configured to: synthesize, with the speech synthesis model and the acquired synthesis text, the synthesized speech corresponding to the synthesis text; and update the corpus with the synthesis text and the corresponding synthesized speech as synthesized material.
Further optionally, as shown in Fig. 4, the processing apparatus of this embodiment further includes a training module 12, configured to train the speech synthesis model according to the manually collected original material in the corpus before expansion.
Further optionally, the original material may include original text and corresponding original speech, and the training module 12 is specifically configured to train the speech synthesis model according to the original text and the corresponding original speech. Accordingly, the expansion module 10 expands the corpus according to the speech synthesis model pre-trained by the training module 12 and the acquired synthesis text.
Further optionally, as shown in Fig. 4, the processing apparatus of this embodiment further includes a crawling module 13, configured to crawl the synthesis text from the network. Accordingly, the expansion module 10 expands the corpus according to the speech synthesis model pre-trained by the training module 12 and the synthesis text acquired by the crawling module 13.
Further optionally, in the processing apparatus of this embodiment, the speech synthesis model is a WaveNet model.
The implementation principle and technical effect with which the processing apparatus of this embodiment realizes concatenative synthesis using the above modules are the same as those of the related method embodiments above; for details, refer to the description of those embodiments, which is not repeated here.
Fig. 5 is the structure chart of computer equipment embodiment of the invention.As shown in figure 5, the computer equipment of the present embodiment,
Including:One or more processors 30 and memory 40, memory 40 work as memory for storing one or more programs
The one or more programs stored in 40 are executed by one or more processors 30, so that one or more processors 30 are realized such as
The processing method of figure 1 above-embodiment illustrated in fig. 2 voice joint synthesis.To include multiple processors 30 in embodiment illustrated in fig. 5
For.
For example, Fig. 6 is an exemplary diagram of a computer device provided by the present invention. Fig. 6 shows a block diagram of an exemplary computer device 12a suitable for implementing embodiments of the present invention. The computer device 12a shown in Fig. 6 is only an example and should not impose any limitation on the functions and scope of use of embodiments of the present invention.
As shown in Fig. 6, the computer device 12a is embodied in the form of a general-purpose computing device. Components of the computer device 12a may include, but are not limited to: one or more processors 16a, a system memory 28a, and a bus 18a connecting different system components (including the system memory 28a and the processors 16a).
The bus 18a represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12a typically comprises a variety of computer-system-readable media. These media may be any available media that can be accessed by the computer device 12a, including volatile and non-volatile media, and removable and non-removable media.
The system memory 28a may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30a and/or a cache memory 32a. The computer device 12a may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34a may be used for reading from and writing to non-removable, non-volatile magnetic media (not shown in Fig. 6, commonly referred to as a "hard disk drive"). Although not shown in Fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a CD-ROM, a DVD-ROM, or other optical media), may also be provided. In these cases, each drive may be connected to the bus 18a via one or more data media interfaces. The system memory 28a may include at least one program product having a set of (e.g., at least one) program modules configured to carry out the functions of the embodiments of Figs. 1-4 of the present invention described above.
A program/utility 40a, having a set of (at least one) program modules 42a, may be stored, for example, in the system memory 28a. Such program modules 42a include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42a generally carry out the functions and/or methods of the embodiments of Figs. 1-4 described in the present invention.
The computer device 12a may also communicate with one or more external devices 14a (e.g., a keyboard, a pointing device, a display 24a, etc.), with one or more devices that enable a user to interact with the computer device 12a, and/or with any device (e.g., a network card, a modem, etc.) that enables the computer device 12a to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 22a. Moreover, the computer device 12a may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via a network adapter 20a. As shown, the network adapter 20a communicates with the other modules of the computer device 12a via the bus 18a. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the computer device 12a, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
By running the programs stored in the system memory 28a, the processor 16a executes various functional applications and data processing, for example, implementing the processing method for speech splicing synthesis shown in the above embodiments.
The present invention further provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the processing method for speech splicing synthesis shown in the above embodiments.
The computer-readable medium of this embodiment may include the RAM 30a, and/or the cache memory 32a, and/or the storage system 34a in the system memory 28a of the embodiment shown in Fig. 6.
With the development of technology, the propagation channel of a computer program is no longer limited to tangible media: the program may also be downloaded directly from a network, or obtained in other ways. Accordingly, the computer-readable medium in this embodiment may include not only tangible media but also intangible media.
The computer-readable medium of this embodiment may employ any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a propagated data signal, in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium, other than a computer-readable storage medium, that can send, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device.
Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including, but not limited to, wireless, wireline, optical fiber cable, RF, or any suitable combination thereof.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and other division manners are possible in actual implementation.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place, or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (14)
1. A processing method for speech splicing synthesis, wherein the method comprises:
expanding a sound library according to a speech synthesis model trained in advance and acquired synthesis text, wherein the sound library before expansion includes original corpus collected manually; and
performing speech splicing synthesis processing using the expanded sound library.
2. The method according to claim 1, wherein expanding the sound library according to the speech synthesis model trained in advance and the acquired synthesis text specifically comprises:
synthesizing, using the speech synthesis model and the acquired synthesis text, synthesized speech corresponding to the synthesis text; and
updating the synthesis text and the corresponding synthesized speech into the sound library as synthesized corpus.
3. The method according to claim 1, wherein before expanding the sound library according to the speech synthesis model trained in advance and the acquired synthesis text, the method comprises:
training the speech synthesis model according to the original corpus collected manually in the sound library before expansion.
4. The method according to claim 3, wherein the original corpus includes original text and corresponding original speech; and
training the speech synthesis model according to the original corpus collected manually in the sound library specifically comprises:
training the speech synthesis model according to the original text and the corresponding original speech.
5. The method according to claim 1, wherein before expanding the sound library according to the speech synthesis model trained in advance and the acquired synthesis text, the method comprises:
crawling the synthesis text from a network.
6. The method according to any one of claims 1-5, wherein the speech synthesis model is a WaveNet model.
7. A processing apparatus for speech splicing synthesis, wherein the apparatus comprises:
an enlargement module, configured to expand a sound library according to a speech synthesis model trained in advance and acquired synthesis text, wherein the sound library before expansion includes original corpus collected manually; and
a processing module, configured to perform speech splicing synthesis processing using the expanded sound library.
8. The apparatus according to claim 7, wherein the enlargement module is specifically configured to:
synthesize, using the speech synthesis model and the acquired synthesis text, synthesized speech corresponding to the synthesis text; and
update the synthesis text and the corresponding synthesized speech into the sound library as synthesized corpus.
9. The apparatus according to claim 7, wherein the apparatus further comprises:
a training module, configured to train the speech synthesis model according to the original corpus collected manually in the sound library before expansion.
10. The apparatus according to claim 9, wherein the original corpus includes original text and corresponding original speech; and
the training module is specifically configured to train the speech synthesis model according to the original text and the corresponding original speech.
11. The apparatus according to claim 7, wherein the apparatus further comprises:
an acquiring module, configured to crawl the synthesis text from a network.
12. The apparatus according to any one of claims 7-11, wherein the speech synthesis model is a WaveNet model.
13. A computer device, wherein the device comprises:
one or more processors; and
a memory, configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-6.
14. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810552365.3A CN108877765A (en) | 2018-05-31 | 2018-05-31 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
US16/226,321 US10803851B2 (en) | 2018-05-31 | 2018-12-19 | Method and apparatus for processing speech splicing and synthesis, computer device and readable medium |
JP2018239323A JP6786751B2 (en) | 2018-05-31 | 2018-12-21 | Voice connection synthesis processing methods and equipment, computer equipment and computer programs |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810552365.3A CN108877765A (en) | 2018-05-31 | 2018-05-31 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108877765A true CN108877765A (en) | 2018-11-23 |
Family
ID=64335626
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810552365.3A Pending CN108877765A (en) | 2018-05-31 | 2018-05-31 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
Country Status (3)
Country | Link |
---|---|
US (1) | US10803851B2 (en) |
JP (1) | JP6786751B2 (en) |
CN (1) | CN108877765A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109448694A (en) * | 2018-12-27 | 2019-03-08 | 苏州思必驰信息科技有限公司 | A kind of method and device of rapid synthesis TTS voice |
CN110162176A (en) * | 2019-05-20 | 2019-08-23 | 北京百度网讯科技有限公司 | The method for digging and device terminal, computer-readable medium of phonetic order |
CN110390928A (en) * | 2019-08-07 | 2019-10-29 | 广州多益网络股份有限公司 | It is a kind of to open up the speech synthesis model training method and system for increasing corpus automatically |
CN111369966A (en) * | 2018-12-06 | 2020-07-03 | 阿里巴巴集团控股有限公司 | Method and device for personalized speech synthesis |
CN112242134A (en) * | 2019-07-01 | 2021-01-19 | 北京邮电大学 | Speech synthesis method and device |
CN112634860A (en) * | 2020-12-29 | 2021-04-09 | 苏州思必驰信息科技有限公司 | Method for screening training corpus of children voice recognition model |
US20210110273A1 (en) * | 2019-10-10 | 2021-04-15 | Samsung Electronics Co., Ltd. | Apparatus and method with model training |
CN115312024A (en) * | 2022-06-24 | 2022-11-08 | 普强时代(珠海横琴)信息技术有限公司 | Method and device for making sound library based on end-to-end splicing synthesis |
CN115602146A (en) * | 2022-09-08 | 2023-01-13 | 建信金融科技有限责任公司(Cn) | Spliced voice generation method and device, electronic equipment and storage medium |
CN115836300A (en) * | 2020-07-09 | 2023-03-21 | 谷歌有限责任公司 | Self-training WaveNet for text-to-speech |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180197438A1 (en) * | 2017-01-10 | 2018-07-12 | International Business Machines Corporation | System for enhancing speech performance via pattern detection and learning |
CN108877765A (en) * | 2018-05-31 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
JP7381020B2 (en) * | 2019-05-24 | 2023-11-15 | 日本電信電話株式会社 | Data generation model learning device, data generation device, data generation model learning method, data generation method, program |
CN111862933A (en) * | 2020-07-20 | 2020-10-30 | 北京字节跳动网络技术有限公司 | Method, apparatus, device, and medium for generating synthesized speech |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7430503B1 (en) * | 2004-08-24 | 2008-09-30 | The United States Of America As Represented By The Director, National Security Agency | Method of combining corpora to achieve consistency in phonetic labeling |
CN101350195A (en) * | 2007-07-19 | 2009-01-21 | 财团法人工业技术研究院 | Speech synthesizer generating system and method |
CN105304080A (en) * | 2015-09-22 | 2016-02-03 | 科大讯飞股份有限公司 | Speech synthesis device and speech synthesis method |
CN106297766A (en) * | 2015-06-04 | 2017-01-04 | 科大讯飞股份有限公司 | Phoneme synthesizing method and system |
Family Cites Families (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7082396B1 (en) * | 1999-04-30 | 2006-07-25 | At&T Corp | Methods and apparatus for rapid acoustic unit selection from a large speech corpus |
US6865533B2 (en) * | 2000-04-21 | 2005-03-08 | Lessac Technology Inc. | Text to speech |
JP4680429B2 (en) * | 2001-06-26 | 2011-05-11 | Okiセミコンダクタ株式会社 | High speed reading control method in text-to-speech converter |
JP2003058181A (en) * | 2001-08-14 | 2003-02-28 | Oki Electric Ind Co Ltd | Voice synthesizing device |
US20040030555A1 (en) * | 2002-08-12 | 2004-02-12 | Oregon Health & Science University | System and method for concatenating acoustic contours for speech synthesis |
US7280967B2 (en) * | 2003-07-30 | 2007-10-09 | International Business Machines Corporation | Method for detecting misaligned phonetic units for a concatenative text-to-speech voice |
JP4034751B2 (en) | 2004-03-31 | 2008-01-16 | 株式会社東芝 | Speech synthesis apparatus, speech synthesis method, and speech synthesis program |
US7475016B2 (en) * | 2004-12-15 | 2009-01-06 | International Business Machines Corporation | Speech segment clustering and ranking |
EP1872361A4 (en) * | 2005-03-28 | 2009-07-22 | Lessac Technologies Inc | Hybrid speech synthesizer, method and use |
CN1889170B (en) * | 2005-06-28 | 2010-06-09 | 纽昂斯通讯公司 | Method and system for generating synthesized speech based on recorded speech template |
JP2007024960A (en) * | 2005-07-12 | 2007-02-01 | Internatl Business Mach Corp <Ibm> | System, program and control method |
JP5457706B2 (en) | 2009-03-30 | 2014-04-02 | 株式会社東芝 | Speech model generation device, speech synthesis device, speech model generation program, speech synthesis program, speech model generation method, and speech synthesis method |
WO2011025532A1 (en) * | 2009-08-24 | 2011-03-03 | NovaSpeech, LLC | System and method for speech synthesis using frequency splicing |
CN102117614B (en) * | 2010-01-05 | 2013-01-02 | 索尼爱立信移动通讯有限公司 | Personalized text-to-speech synthesis and personalized speech feature extraction |
US20120316881A1 (en) * | 2010-03-25 | 2012-12-13 | Nec Corporation | Speech synthesizer, speech synthesis method, and speech synthesis program |
JP5758713B2 (en) * | 2011-06-22 | 2015-08-05 | 株式会社日立製作所 | Speech synthesis apparatus, navigation apparatus, and speech synthesis method |
JP6170384B2 (en) | 2013-09-09 | 2017-07-26 | 株式会社日立超エル・エス・アイ・システムズ | Speech database generation system, speech database generation method, and program |
CN104142909B (en) * | 2014-05-07 | 2016-04-27 | 腾讯科技(深圳)有限公司 | A kind of phonetic annotation of Chinese characters method and device |
US9679554B1 (en) * | 2014-06-23 | 2017-06-13 | Amazon Technologies, Inc. | Text-to-speech corpus development system |
US10186251B1 (en) * | 2015-08-06 | 2019-01-22 | Oben, Inc. | Voice conversion using deep neural network with intermediate voice training |
US9697820B2 (en) * | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
CN105206258B (en) * | 2015-10-19 | 2018-05-04 | 百度在线网络技术(北京)有限公司 | The generation method and device and phoneme synthesizing method and device of acoustic model |
CN105185372B (en) * | 2015-10-20 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device |
US10147416B2 (en) * | 2015-12-09 | 2018-12-04 | Amazon Technologies, Inc. | Text-to-speech processing systems and methods |
US9934775B2 (en) * | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US10319365B1 (en) * | 2016-06-27 | 2019-06-11 | Amazon Technologies, Inc. | Text-to-speech processing with emphasized output audio |
US10339925B1 (en) * | 2016-09-26 | 2019-07-02 | Amazon Technologies, Inc. | Generation of automated message responses |
US10448115B1 (en) * | 2016-09-28 | 2019-10-15 | Amazon Technologies, Inc. | Speech recognition for localized content |
WO2018058425A1 (en) * | 2016-09-29 | 2018-04-05 | 中国科学院深圳先进技术研究院 | Virtual reality guided hypnotic voice processing method and apparatus |
US11069335B2 (en) * | 2016-10-04 | 2021-07-20 | Cerence Operating Company | Speech synthesis using one or more recurrent neural networks |
US10565989B1 (en) * | 2016-12-16 | 2020-02-18 | Amazon Technologies, Inc. | Ingesting device specific content |
US10276149B1 (en) * | 2016-12-21 | 2019-04-30 | Amazon Technologies, Inc. | Dynamic text-to-speech output |
US10325599B1 (en) * | 2016-12-28 | 2019-06-18 | Amazon Technologies, Inc. | Message response routing |
US10872598B2 (en) * | 2017-02-24 | 2020-12-22 | Baidu Usa Llc | Systems and methods for real-time neural text-to-speech |
US20180330713A1 (en) * | 2017-05-14 | 2018-11-15 | International Business Machines Corporation | Text-to-Speech Synthesis with Dynamically-Created Virtual Voices |
US10896669B2 (en) * | 2017-05-19 | 2021-01-19 | Baidu Usa Llc | Systems and methods for multi-speaker neural text-to-speech |
US10418033B1 (en) * | 2017-06-01 | 2019-09-17 | Amazon Technologies, Inc. | Configurable output data formats |
US10332517B1 (en) * | 2017-06-02 | 2019-06-25 | Amazon Technologies, Inc. | Privacy mode based on speaker identifier |
US10446147B1 (en) * | 2017-06-27 | 2019-10-15 | Amazon Technologies, Inc. | Contextual voice user interface |
CN107393556B (en) | 2017-07-17 | 2021-03-12 | 京东方科技集团股份有限公司 | A method and device for implementing audio processing |
US10672416B2 (en) * | 2017-10-20 | 2020-06-02 | Board Of Trustees Of The University Of Illinois | Causing microphones to detect inaudible sounds and defense against inaudible attacks |
US10600408B1 (en) * | 2018-03-23 | 2020-03-24 | Amazon Technologies, Inc. | Content output management based on speech quality |
US10770063B2 (en) * | 2018-04-13 | 2020-09-08 | Adobe Inc. | Real-time speaker-dependent neural vocoder |
CN108877765A (en) * | 2018-05-31 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis |
CN108550363B (en) * | 2018-06-04 | 2019-08-27 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method and device, computer equipment and readable medium |
Patent family events (2018):
- 2018-05-31: CN application CN201810552365.3A filed (CN108877765A, status: Pending)
- 2018-12-19: US application 16/226,321 filed (US10803851B2, status: Active)
- 2018-12-21: JP application 2018239323 filed (JP6786751B2, status: Active)
Also Published As
Publication number | Publication date |
---|---|
US10803851B2 (en) | 2020-10-13 |
JP6786751B2 (en) | 2020-11-18 |
US20190371291A1 (en) | 2019-12-05 |
JP2019211747A (en) | 2019-12-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108877765A (en) | Processing method and processing device, computer equipment and the readable medium of voice joint synthesis | |
JP6019108B2 (en) | Video generation based on text | |
CN105185372B (en) | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device | |
JP6783479B1 (en) | Video generation program, video generation device and video generation method | |
US11847726B2 (en) | Method for outputting blend shape value, storage medium, and electronic device | |
US9037956B2 (en) | Content customization | |
US8849676B2 (en) | Content customization | |
JP2014519082A5 (en) | ||
US20220345796A1 (en) | Systems and methods for generating synthetic videos based on audio contents | |
Steinmetz et al. | Multimedia fundamentals, Volume 1: Media coding and content processing | |
CN113299312A (en) | Image generation method, device, equipment and storage medium | |
CN110047121A (en) | Animation producing method, device and electronic equipment end to end | |
CN109599090A (en) | A kind of method, device and equipment of speech synthesis | |
WO2024122284A1 (en) | Information processing device, information processing method, and information processing program | |
KR20180012166A (en) | Story-telling system for changing 3 dimension character into 3 dimension avatar | |
CN112383721B (en) | Method, apparatus, device and medium for generating video | |
CN112750184B (en) | Method and equipment for data processing, action driving and man-machine interaction | |
WO2023090419A1 (en) | Content generation device, content generation method, and program | |
US11289067B2 (en) | Voice generation based on characteristics of an avatar | |
JP2020204683A (en) | Electronic publication audio-visual system, audio-visual electronic publication creation program, and program for user terminal | |
CN112383722B (en) | Method and apparatus for generating video | |
CN117711372A (en) | Speech synthesis method, device, computer equipment and storage medium | |
CN117577122A (en) | A data processing method, device and related equipment | |
CN120321350A (en) | Digital human video synthesis method, server and storage medium | |
WO2025094234A1 (en) | Information processing device, information processing method, and program |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181123 |