CN110600004A - Voice synthesis playing method and device and storage medium - Google Patents

Voice synthesis playing method and device and storage medium

Info

Publication number
CN110600004A
Authority
CN
China
Prior art keywords
synthesized
voice
speech
text
pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910848598.2A
Other languages
Chinese (zh)
Inventor
杨木文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910848598.2A priority Critical patent/CN110600004A/en
Publication of CN110600004A publication Critical patent/CN110600004A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/086 - Detection of language

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

本发明实施例公开了一种语音合成播放方法、装置和存储介质,其中,用户终端可以接收语音合成请求,并根据语音合成请求获取需要进行语音合成的待合成文本,然后将待合成文本发送至语音合成服务器进行语音合成,得到对应的合成语音,然后播放该合成语音,并接收对合成语音的发音校正请求,根据发音校正请求接收对应于合成语音的校正数据,将该校正数据发送至语音合成服务器用于更新合成语音,从而得到更新后的合成语音,将当前播放的合成语音替换为更新后的合成语音进行播放。相比于相关技术,本发明在播放合成语音的过程中,能够实时对播放的合成语音进行校正、更新,由此,即使在多音字的发音预测错误时,也能够及时校正其发音。

The embodiment of the present invention discloses a speech synthesis playback method, device and storage medium, wherein a user terminal may receive a speech synthesis request, obtain the text to be synthesized according to the speech synthesis request, and send the text to be synthesized to a speech synthesis server for speech synthesis to obtain the corresponding synthesized speech; the terminal then plays the synthesized speech, receives a pronunciation correction request for the synthesized speech, receives correction data corresponding to the synthesized speech according to the pronunciation correction request, and sends the correction data to the speech synthesis server to update the synthesized speech, thereby obtaining updated synthesized speech; the currently played synthesized speech is then replaced with the updated synthesized speech for playback. Compared with the related art, the present invention can correct and update the synthesized speech in real time while it is being played, so that even when the pronunciation of a polyphonic character is predicted incorrectly, the pronunciation can be corrected in time.

Description

一种语音合成播放方法、装置和存储介质A speech synthesis playback method, device and storage medium

技术领域technical field

本发明涉及语音技术领域,具体涉及一种语音合成播放方法、装置和存储介质。The present invention relates to the technical field of speech, in particular to a speech synthesis playback method, device and storage medium.

背景技术Background technique

语音合成技术，也被称为文语转换技术(Text To Speech,TTS)，其目标是让机器通过识别和理解，把文本信息转换成语音输出，从而让机器能够说话，是未来人机交互的重要分支。Speech synthesis technology, also known as text-to-speech (Text To Speech, TTS), aims to let machines convert text information into speech output through recognition and understanding so that machines can speak; it is an important branch of future human-computer interaction.

语音合成技术应用广泛,比如网页内容朗读、小说有声阅读、电子邮件的阅读等。以小说有声阅读为例,通过语音合成,手机、平板电脑等用户终端能够将用户阅读的小说朗读出来,使得用户能够闭眼“看”小说。Speech synthesis technology is widely used, such as web page content reading aloud, novel audio reading, e-mail reading and so on. Take the audio reading of novels as an example. Through speech synthesis, user terminals such as mobile phones and tablet computers can read the novels read by users aloud, so that users can "read" novels with their eyes closed.

在对现有技术的研究和实践过程中，本发明的发明人发现，现有语音合成技术的多音字处理能力存在缺陷，在面临不常见的上下文语境时，往往无法准确的预测出多音字的发音。During research into and practice of the prior art, the inventors of the present invention found that the polyphonic-character processing capability of existing speech synthesis technology is deficient: when faced with an uncommon context, the pronunciation of a polyphonic character often cannot be accurately predicted.

发明内容Contents of the invention

本发明实施例提供一种语音合成播放方法、装置和存储介质,能够在多音字的发音预测错误时,及时校正其发音。Embodiments of the present invention provide a speech synthesis playback method, device and storage medium, capable of timely correcting the pronunciation of a polyphone when the pronunciation prediction of a polyphonic character is wrong.

本发明实施例提供一种语音合成播放方法,包括:An embodiment of the present invention provides a speech synthesis playback method, including:

接收语音合成请求,并根据所述语音合成请求获取需要进行语音合成的待合成文本;receiving a speech synthesis request, and obtaining text to be synthesized that needs to be speech synthesized according to the speech synthesis request;

将所述待合成文本发送至语音合成服务器进行语音合成,使得所述语音合成服务器返回对应所述待合成文本的合成语音;Sending the text to be synthesized to a speech synthesis server for speech synthesis, so that the speech synthesis server returns a synthesized speech corresponding to the text to be synthesized;

播放所述合成语音,并接收对所述合成语音的发音校正请求;Playing the synthesized voice, and receiving a pronunciation correction request for the synthesized voice;

根据所述发音校正请求接收输入的对应于所述合成语音的校正数据，并将所述校正数据发送至所述语音合成服务器，使得所述语音合成服务器根据所述校正数据更新所述合成语音，并返回更新后的合成语音；receiving input correction data corresponding to the synthesized speech according to the pronunciation correction request, and sending the correction data to the speech synthesis server, so that the speech synthesis server updates the synthesized speech according to the correction data and returns the updated synthesized speech;

将当前播放的所述合成语音替换为所述更新后的合成语音进行播放。The currently played synthesized voice is replaced with the updated synthesized voice for playback.

本发明实施例还提供一种语音合成播放方法,包括:The embodiment of the present invention also provides a speech synthesis playback method, including:

当接收到来自于用户终端的待合成文本时,根据预先训练的语音合成模型对所述待合成文本进行语音合成,得到合成语音;When receiving the text to be synthesized from the user terminal, performing speech synthesis on the text to be synthesized according to a pre-trained speech synthesis model to obtain a synthesized speech;

将所述合成语音返回至所述用户终端进行播放,并接收所述用户终端返回的对应所述合成语音的校正数据;returning the synthesized voice to the user terminal for playing, and receiving correction data corresponding to the synthesized voice returned by the user terminal;

根据所述校正数据更新所述合成语音,得到更新后的合成语音;updating the synthesized speech according to the correction data to obtain the updated synthesized speech;

将所述更新后的合成语音返回至所述用户终端,使得所述用户终端将所述合成语音替换为所述更新后的合成语音进行播放。returning the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice for playing.

本发明实施例还提供一种语音合成播放装置,包括:The embodiment of the present invention also provides a speech synthesis playback device, comprising:

文本获取模块,用于接收语音合成请求,并根据所述语音合成请求获取需要进行语音合成的待合成文本;A text acquisition module, configured to receive a speech synthesis request, and obtain text to be synthesized that requires speech synthesis according to the speech synthesis request;

语音合成模块,用于将所述待合成文本发送至语音合成服务器进行语音合成,使得所述语音合成服务器返回对应所述待合成文本的合成语音;A speech synthesis module, configured to send the text to be synthesized to a speech synthesis server for speech synthesis, so that the speech synthesis server returns the synthesized speech corresponding to the text to be synthesized;

语音播放模块,用于播放所述合成语音,并接收对所述合成语音的发音校正请求;A voice playback module, configured to play the synthesized voice and receive a pronunciation correction request for the synthesized voice;

文本校正模块，用于根据所述发音校正请求接收输入的对应于所述合成语音的校正数据，并将所述校正数据发送至所述语音合成服务器，使得所述语音合成服务器根据所述校正数据更新所述合成语音，并返回更新后的合成语音；A text correction module, configured to receive input correction data corresponding to the synthesized speech according to the pronunciation correction request, and send the correction data to the speech synthesis server, so that the speech synthesis server updates the synthesized speech according to the correction data and returns the updated synthesized speech;

所述语音播放模块还用于将当前播放的所述合成语音替换为所述更新后的合成语音进行播放。The voice playing module is also used to replace the currently played synthesized voice with the updated synthesized voice for playback.

在一实施例中，在根据发音校正请求接收对应于合成语音的校正数据时，所述文本校正模块用于：In one embodiment, when receiving the correction data corresponding to the synthesized speech according to the pronunciation correction request, the text correction module is configured to:

根据所述发音校正请求展示发音校正界面,所述发音校正界面包括字输入控件和发音控件;According to the pronunciation correction request, a pronunciation correction interface is displayed, and the pronunciation correction interface includes a word input control and a pronunciation control;

基于所述字输入控件接收所述待合成文本中需要校正的目标字;receiving a target word to be corrected in the text to be synthesized based on the word input control;

基于所述发音控件接收对应所述目标字的目标发音;receiving a target pronunciation corresponding to the target word based on the pronunciation control;

将所述目标字和所述目标发音设为所述校正数据。The target word and the target pronunciation are set as the correction data.

在一实施例中,在基于发音控件接收对应目标字的目标发音时,所述文本校正模块用于:In one embodiment, when the target pronunciation of the corresponding target word is received based on the pronunciation control, the text correction module is used for:

校验所述目标字是否为多音字;Check whether the target word is a polyphonic word;

当判定所述目标字为多音字时,根据预设的多音字和发音的对应关系,获取所述目标字对应的多个发音;When determining that the target word is a polyphonic character, according to the preset correspondence between polyphonic characters and pronunciations, obtain a plurality of pronunciations corresponding to the target word;

基于所述发音控件展示所述多个发音,并接收对展示的发音的选择操作;displaying the plurality of pronunciations based on the pronunciation control, and receiving a selection operation on the displayed pronunciations;

将所述选择操作对应的发音设为所述目标字的目标发音。The pronunciation corresponding to the selection operation is set as the target pronunciation of the target word.

在一实施例中,在根据语音合成请求获取需要进行语音合成的待合成文本时,所述文本获取模块用于:In one embodiment, when obtaining the text to be synthesized that needs to be synthesized according to the speech synthesis request, the text acquisition module is used for:

根据所述语音合成请求提取前台应用的展示内容中的文本,得到提取文本;Extracting the text in the display content of the foreground application according to the speech synthesis request to obtain the extracted text;

按照预设分句策略,将所述提取文本划分为多个分句;Dividing the extracted text into multiple clauses according to a preset clause strategy;

将所述分句设为所述待合成文本。Set the clause as the text to be synthesized.

在一实施例中,在播放合成语音的过程中,所述语音播放模块还用于:In one embodiment, in the process of playing synthesized speech, the speech playback module is also used for:

按照预设规则对所述合成语音对应的分句进行标识。The clauses corresponding to the synthesized speech are identified according to preset rules.

在一实施例中,在接收对合成语音的发音校正请求时,所述语音播放模块用于:In one embodiment, when receiving a pronunciation correction request for synthesized speech, the speech playback module is used to:

在所述合成语音对应的分句的预设范围内展示发音校正控件;Displaying a pronunciation correction control within a preset range of a sentence corresponding to the synthesized speech;

基于所述发音校正控件接收对合成语音的发音校正请求。A pronunciation correction request for synthesized speech is received based on the pronunciation correction control.

在一实施例中,本发明实施例提供的语音合成播放装置还包括数据存储模块,用于:In one embodiment, the speech synthesis playback device provided by the embodiment of the present invention also includes a data storage module for:

将所述待合成文本、所述合成语音和/或所述更新后的合成语音存储至分布式系统中。storing the text to be synthesized, the synthesized speech and/or the updated synthesized speech in a distributed system.

本发明实施例还提供一种语音合成播放装置,包括语音合成模块、语音下发模块以及语音更新模块,其中,An embodiment of the present invention also provides a speech synthesis playback device, including a speech synthesis module, a speech delivery module, and a speech update module, wherein,

所述语音合成模块,用于在接收到来自于用户终端的待合成文本时,根据预先训练的语音合成模型对所述待合成文本进行语音合成,得到合成语音;The speech synthesis module is configured to, when receiving the text to be synthesized from the user terminal, perform speech synthesis on the text to be synthesized according to a pre-trained speech synthesis model to obtain synthesized speech;

所述语音下发模块,用于将所述合成语音返回至所述用户终端进行播放,并接收所述用户终端返回的对应所述待合成文本的校正数据;The voice sending module is used to return the synthesized voice to the user terminal for playback, and receive correction data corresponding to the text to be synthesized returned by the user terminal;

所述语音更新模块,用于根据所述校正数据更新所述合成语音,得到更新后的合成语音;The voice update module is used to update the synthesized voice according to the correction data to obtain the updated synthesized voice;

所述语音下发模块,还用于所述将所述更新后的合成语音返回至所述用户终端,使得所述用户终端将所述合成语音替换为所述更新后的合成语音进行播放。The voice sending module is further configured to return the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice for playback.

在一实施例中,本发明实施例提供的语音合成播放装置还包括模型更新模块,用于:In one embodiment, the speech synthesis playback device provided by the embodiment of the present invention also includes a model update module, which is used for:

根据所述待合成文本以及所述校正数据对所述语音合成模型进行更新。The speech synthesis model is updated according to the text to be synthesized and the correction data.

在一实施例中,本发明实施例提供的语音合成播放装置还包括数据存储模块,用于:In one embodiment, the speech synthesis playback device provided by the embodiment of the present invention also includes a data storage module for:

将所述待合成文本、所述合成语音和/或所述更新后的合成语音存储至分布式系统中。storing the text to be synthesized, the synthesized speech and/or the updated synthesized speech in a distributed system.

此外,本发明实施例还提供一种存储介质,所述存储介质存储有多条指令,所述指令适于处理器进行加载,以执行本发明实施例所提供的任一种语音合成播放方法。In addition, an embodiment of the present invention also provides a storage medium, the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to execute any speech synthesis playback method provided in the embodiments of the present invention.

本发明中，通过用户终端接收语音合成请求，并根据语音合成请求获取需要进行语音合成的待合成文本，然后将待合成文本发送至语音合成服务器进行语音合成，得到对应的合成语音，然后播放该合成语音，并接收对合成语音的发音校正请求，根据发音校正请求接收对应于合成语音的校正数据，将该校正数据发送至语音合成服务器用于更新合成语音，从而得到更新后的合成语音，将当前播放的合成语音替换为更新后的合成语音进行播放。相比于相关技术，本发明在播放合成语音的过程中，能够实时对播放的合成语音进行校正、更新，由此，即使在多音字的发音预测错误时，也能够及时校正其发音。In the present invention, a user terminal receives a speech synthesis request, obtains the text to be synthesized according to the request, and sends the text to a speech synthesis server for synthesis to obtain the corresponding synthesized speech. The terminal then plays the synthesized speech, receives a pronunciation correction request for it, receives correction data corresponding to the synthesized speech according to that request, and sends the correction data to the speech synthesis server to update the synthesized speech, thereby obtaining the updated synthesized speech, with which the currently playing synthesized speech is replaced for playback. Compared with the related art, the present invention can correct and update the synthesized speech in real time while it is being played, so that even when the pronunciation of a polyphonic character is predicted incorrectly, the pronunciation can be corrected in time.

附图说明Description of drawings

为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings that need to be used in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention. For those skilled in the art, other drawings can also be obtained based on these drawings without any creative effort.

图1是本发明实施例中的语音合成播放系统的架构示意图;Fig. 1 is a schematic diagram of the architecture of a speech synthesis playback system in an embodiment of the present invention;

图2a是本发明实施例提供的语音合成播放方法的一流程示意图;Fig. 2a is a schematic flow chart of a speech synthesis playback method provided by an embodiment of the present invention;

图2b是本发明实施例中展示语音合成控件的示意图;Fig. 2b is a schematic diagram showing speech synthesis controls in an embodiment of the present invention;

图2c是本发明实施例中标识合成语音对应的分句一示意图；Fig. 2c is a schematic diagram of marking the clause corresponding to the synthesized speech in an embodiment of the present invention;

图2d是本发明实施例中标识合成语音对应的分句另一示意图；Fig. 2d is another schematic diagram of marking the clause corresponding to the synthesized speech in an embodiment of the present invention;

图2e是本发明实施例中展示发音校正控件的示意图;Fig. 2e is a schematic diagram showing pronunciation correction controls in an embodiment of the present invention;

图2f是本发明实施例中展示发音校正界面的示意图;Fig. 2f is a schematic diagram showing a pronunciation correction interface in an embodiment of the present invention;

图2g是本发明实施例中分布式系统的结构示意图;Fig. 2g is a schematic structural diagram of a distributed system in an embodiment of the present invention;

图2h是本发明实施例中区块结构的示意图;Fig. 2h is a schematic diagram of a block structure in an embodiment of the present invention;

图3是本发明实施例提供的语音合成播放方法另一流程示意图;Fig. 3 is another schematic flow chart of the speech synthesis playback method provided by the embodiment of the present invention;

图4是本发明实施例提供的语音合成播放方法另一流程示意图;Fig. 4 is another schematic flow chart of the speech synthesis playback method provided by the embodiment of the present invention;

图5是本发明实施例提供的语音合成播放装置的一结构示意图；FIG. 5 is a schematic structural diagram of a speech synthesis playback device provided by an embodiment of the present invention;

图6是本发明实施例提供的语音合成播放装置的一结构示意图;FIG. 6 is a schematic structural diagram of a speech synthesis playback device provided by an embodiment of the present invention;

图7是本发明实施例中用户终端的结构示意图;FIG. 7 is a schematic structural diagram of a user terminal in an embodiment of the present invention;

图8是本发明实施例中语音合成服务器的结构示意图。Fig. 8 is a schematic structural diagram of the speech synthesis server in the embodiment of the present invention.

具体实施方式Detailed ways

下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative efforts fall within the protection scope of the present invention.

语音技术(Speech Technology)的关键分支有自动语音识别技术(AutomaticSpeech Recognition,ASR)和语音合成技术(Text To Speech,TTS)以及声纹识别技术。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中,语音合成技术成为未来最被看好的人机交互方式之一。The key branches of Speech Technology include Automatic Speech Recognition (ASR), Text To Speech (TTS) and voiceprint recognition. Enabling computers to hear, see, speak, and feel is the development direction of human-computer interaction in the future. Among them, speech synthesis technology has become one of the most promising methods of human-computer interaction in the future.

早期的语音合成一般采用专用的芯片实现,如德州仪器公司的TMS50C10/TMS50C57、飞利浦的PH84H36等,但主要应用在家用电器和儿童玩具中。Early speech synthesis was generally implemented using dedicated chips, such as Texas Instruments' TMS50C10/TMS50C57, Philips' PH84H36, etc., but it was mainly used in household appliances and children's toys.

如今的语音合成一般采用纯软件实现，其文本到语音的转换过程为：首先对文本进行预处理、分词、词性标注、多音字预测、韵律层级预测等处理，然后再通过声学模型，预测各个单元对应的声学特征，最后利用声学参数直接通过声码器合成声音，或者从录音词库中挑选单元进行拼接，以生成与文本对应的语音。Today's speech synthesis is generally implemented purely in software. The text-to-speech conversion process is as follows: the text is first subjected to preprocessing, word segmentation, part-of-speech tagging, polyphonic-character prediction, prosodic hierarchy prediction and other processing; an acoustic model then predicts the acoustic features corresponding to each unit; finally, the acoustic parameters are used to synthesize sound directly through a vocoder, or units are selected from a recorded speech corpus and concatenated, to generate the speech corresponding to the text.

对于中文语音合成而言，目前比较关键的研究方向就是中文韵律处理、符号数字、多音字预测、以及构词等，需要不断研究，以使得中文语音合成的自然化程度提高。For Chinese speech synthesis, the key research directions at present are Chinese prosody processing, symbol and digit reading, polyphonic-character prediction, word formation and so on, which require continuous research to make Chinese speech synthesis sound more natural.

其中多音字预测是中文语音合成的基础之一，多音字发音的正确与否，极大地影响了听者对合成声音的语义理解情况，如果多音字预测准确率高，将极大改善用户体验，使合成出来的语音易于理解，听起来也更加自然流畅。Polyphonic-character prediction is one of the foundations of Chinese speech synthesis. Whether a polyphonic character is pronounced correctly greatly affects the listener's semantic understanding of the synthesized speech; a high prediction accuracy for polyphonic characters greatly improves the user experience, making the synthesized speech easier to understand and more natural and fluent.

目前,针对于多音字,现有的语音合成多采用如下合成策略:At present, for polyphonic characters, the existing speech synthesis mostly adopts the following synthesis strategy:

若多音字可以和上下文组成词语，则按照固定搭配中的多音字来进行语音合成，比如重(zhong4)点、重(chong2)新，其中，拼音后的数字表示声调；If the polyphonic character can form a word with its context, speech synthesis is performed according to the polyphonic character in the fixed collocation, for example 重(zhong4)点 and 重(chong2)新, where the digit after the pinyin indicates the tone;

若多音字以单字形式出现，则利用预先采用大量样本数据训练得到的语音合成模型来预测其发音，比如为(wei4)人民服务、结果为(wei2)零。If the polyphonic character appears as a single character, a speech synthesis model trained in advance on a large amount of sample data is used to predict its pronunciation, for example 为(wei4)人民服务 and 结果为(wei2)零.

其中,常用的语音合成模型的训练方法包括但不限于:条件随机场(ConditionalRandom Field,CRF)方法,隐马尔科夫模型(Hidden Markov Model,HMM)方法,决策树方法等等。这些方法的特点是需要大量多音字的发音来进行训练。优点是可以仅凭文本来预测多音字的发音,且对于出现在常见上下文语境中的多音字预测准确率较高,缺点是对于不常见上下文语境中的多音字的处理能力很差。Among them, commonly used methods for training speech synthesis models include but are not limited to: Conditional Random Field (Conditional Random Field, CRF) method, Hidden Markov Model (Hidden Markov Model, HMM) method, decision tree method and so on. The characteristic of these methods is that the pronunciation of a large number of polyphonic characters is required for training. The advantage is that the pronunciation of polyphonic characters can be predicted based on the text alone, and the prediction accuracy for polyphonic characters that appear in common contexts is high. The disadvantage is that the processing ability for polyphonic characters in uncommon contexts is poor.

基于现有技术中的以上缺陷,本发明实施例提供一种语音合成播放方法、装置和存储介质。其中,包括适用于用户终端的语音合成播放方法、装置和存储介质,以及适用于语音合成服务器的语音合成播放方法、装置和存储介质。Based on the above defects in the prior art, embodiments of the present invention provide a speech synthesis playback method, device and storage medium. Among them, it includes a speech synthesis playing method, device and storage medium suitable for a user terminal, and a speech synthesis playing method, device and storage medium suitable for a speech synthesis server.

请参阅图1，本发明实施例还提供一种语音合成播放系统，该语音合成播放系统包括用户终端10、语音合成服务器20以及网络30(可以为有线网络，也可以为无线网络)，用户终端10通过网络30与语音合成服务器20进行交互。其中，网络30中包括路由器、网关等等网络实体，图1中并未一一示意出。Referring to FIG. 1, an embodiment of the present invention further provides a speech synthesis playback system, which includes a user terminal 10, a speech synthesis server 20 and a network 30 (which may be a wired or a wireless network); the user terminal 10 interacts with the speech synthesis server 20 over the network 30. The network 30 includes network entities such as routers and gateways, which are not all shown in FIG. 1.

基于图1所示的语音合成播放系统，用户终端10可以接收语音合成请求，并根据语音合成请求获取需要进行语音合成的待合成文本，然后将待合成文本发送至语音合成服务器20；语音合成服务器20在接收到来自于用户终端10的待合成文本之后，对待合成文本进行语音合成，得到对应的合成语音，并将该合成语音返回至用户终端10；用户终端10在接收到语音合成服务器20返回的合成语音之后，播放该合成语音，并接收对合成语音的发音校正请求，然后根据发音校正请求接收对应于合成语音的校正数据，并将该校正数据发送至语音合成服务器20；语音合成服务器20在接收到来自于用户终端10的校正数据之后，根据该校正数据更新合成语音，并将更新后的合成语音返回至用户终端10；用户终端10在接收到语音合成服务器20返回的更新后的合成语音之后，将当前播放的合成语音替换为更新后的合成语音进行播放。Based on the speech synthesis playback system shown in FIG. 1, the user terminal 10 may receive a speech synthesis request, obtain the text to be synthesized according to the request, and send the text to the speech synthesis server 20; after receiving the text to be synthesized from the user terminal 10, the speech synthesis server 20 performs speech synthesis on it to obtain the corresponding synthesized speech and returns the synthesized speech to the user terminal 10; after receiving the synthesized speech returned by the speech synthesis server 20, the user terminal 10 plays it, receives a pronunciation correction request for the synthesized speech, then receives correction data corresponding to the synthesized speech according to the request, and sends the correction data to the speech synthesis server 20; after receiving the correction data from the user terminal 10, the speech synthesis server 20 updates the synthesized speech according to the correction data and returns the updated synthesized speech to the user terminal 10; after receiving the updated synthesized speech returned by the speech synthesis server 20, the user terminal 10 replaces the currently playing synthesized speech with the updated synthesized speech for playback.

需要说明的是,上述图1示出的仅是实现本发明实施例的一个系统架构实例,本发明实施例并不限于上述图1所示的系统架构。基于该系统架构,以下分别进行详细说明。需说明的是,以下实施例的顺序不作为对实施例优选顺序的限定。It should be noted that the foregoing FIG. 1 shows only an example of a system architecture for implementing the embodiment of the present invention, and the embodiment of the present invention is not limited to the system architecture shown in the foregoing FIG. 1 . Based on the system architecture, detailed descriptions are given below respectively. It should be noted that the order of the following examples is not intended to limit the preferred order of the examples.

实施例一、Embodiment one,

本发明实施例提供一种语音合成播放方法，适用于用户终端，包括：接收语音合成请求，并根据语音合成请求获取需要进行语音合成的待合成文本；将待合成文本发送至语音合成服务器进行语音合成，使得语音合成服务器返回对应待合成文本的合成语音；播放合成语音，并接收对合成语音的发音校正请求；根据发音校正请求接收对应于合成语音的校正数据，并将校正数据发送至语音合成服务器，使得语音合成服务器根据校正数据更新合成语音，并返回更新后的合成语音；将当前播放的合成语音替换为更新后的合成语音进行播放。An embodiment of the present invention provides a speech synthesis playback method suitable for a user terminal, including: receiving a speech synthesis request, and obtaining the text to be synthesized according to the speech synthesis request; sending the text to be synthesized to a speech synthesis server for speech synthesis, so that the speech synthesis server returns the synthesized speech corresponding to the text to be synthesized; playing the synthesized speech, and receiving a pronunciation correction request for the synthesized speech; receiving correction data corresponding to the synthesized speech according to the pronunciation correction request, and sending the correction data to the speech synthesis server, so that the speech synthesis server updates the synthesized speech according to the correction data and returns the updated synthesized speech; and replacing the currently played synthesized speech with the updated synthesized speech for playback.

请参照图2a,该语音合成播放方法的流程可以如下:Please refer to Figure 2a, the flow of the speech synthesis playback method can be as follows:

201,接收语音合成请求,并根据语音合成请求获取需要进行语音合成的待合成文本。201. Receive a speech synthesis request, and acquire text to be synthesized that requires speech synthesis according to the speech synthesis request.

本发明实施例中,用户终端可以实时接收外部输入的语音合成请求,从而触发进行语音合成,将对应的文本转换为语音进行输出。其中,用户终端可以接收用户直接输入的语音合成请求,也可以接收其它用户终端输入的语音合成请求。In the embodiment of the present invention, the user terminal may receive an externally input speech synthesis request in real time, thereby triggering speech synthesis, and converting the corresponding text into speech for output. Wherein, the user terminal may receive the speech synthesis request directly input by the user, or may receive the speech synthesis request input by other user terminals.

示例性的,对用户而言,可以通过多种不同的方式向用户终端输入语音合成请求。Exemplarily, for the user, the speech synthesis request can be input to the user terminal in many different ways.

比如，用户可以采用语音指令的方式说出"请朗读当前界面/全文"等，从而向用户终端输入用于指示对当前界面(比如，网页浏览界面、文本浏览界面等)或全文中的文本进行语音合成的语音合成请求。For example, the user may say "please read the current interface/full text" as a voice command, thereby inputting to the user terminal a speech synthesis request indicating that speech synthesis should be performed on the text in the current interface (for example, a web-page browsing interface, a text browsing interface, etc.) or on the full text.

又比如，用户终端提供有输入语音合成请求的语音合成控件，如图2b所示，该语音合成控件可以为按钮形式，并通过"朗读"标识，使得用户可以直接点击该语音合成控件以向用户终端输入用于指示对当前界面中的文本进行语音合成的语音合成请求。For another example, the user terminal provides a speech synthesis control for inputting a speech synthesis request. As shown in FIG. 2b, the speech synthesis control may take the form of a button labeled "朗读" (read aloud), so that the user can directly tap the control to input to the user terminal a speech synthesis request indicating that speech synthesis should be performed on the text in the current interface.

应当说明的是,本领域普通技术人员可以根据实际需要对用户终端进行配置,使得用户终端还能够接收以上未示出的其它方式所输入的语音合成请求。It should be noted that those skilled in the art can configure the user terminal according to actual needs, so that the user terminal can also receive speech synthesis requests input in other ways not shown above.

当接收到语音合成请求之后,用户终端进一步根据该语音合成请求获取需要进行语音合成的待合成文本。After receiving the speech synthesis request, the user terminal further acquires the text to be synthesized that needs to be speech synthesized according to the speech synthesis request.

在一实施例中,“根据语音合成请求获取需要进行语音合成的待合成文本”,包括:In an embodiment, "obtaining the text to be synthesized that needs to be synthesized according to the speech synthesis request" includes:

(1)根据语音合成请求提取前台应用的展示内容中的文本,得到提取文本;(1) Extract the text in the display content of the foreground application according to the speech synthesis request, and obtain the extracted text;

(2)按照预设分句策略,将提取文本划分为多个分句;(2) Divide the extracted text into multiple clauses according to the preset clause strategy;

(3)将分句设为待合成文本。(3) Set the clause as the text to be synthesized.

本发明实施例中,语音合成请求用于指示用户终端对其前台应用中的文本进行语音合成,前台应用即用户当前正在展示的应用。In the embodiment of the present invention, the speech synthesis request is used to instruct the user terminal to perform speech synthesis on the text in its foreground application, and the foreground application is the application currently displayed by the user.

相应的,用户终端在根据语音合成请求获取需要进行语音合成的待合成文本时,首先根据接收到语音合成请求,对前台应用的展示内容中的文本进行提取,将提取出的文本记为提取文本。Correspondingly, when the user terminal obtains the text to be synthesized that needs to be synthesized according to the speech synthesis request, it first extracts the text in the display content of the foreground application according to the received speech synthesis request, and records the extracted text as the extracted text .

比如，当用户终端在通过浏览器应用浏览网页期间接收到语音合成请求，则根据该语音合成请求接收网页的DOM树，并基于文本密度计算方法抽取出网页中的文本，如资讯文章正文或小说章节内容等；For example, when the user terminal receives a speech synthesis request while browsing a web page through a browser application, it obtains the DOM tree of the web page according to the speech synthesis request and extracts the text in the web page based on a text-density calculation method, such as the body text of a news article or the content of a novel chapter;

又比如，当用户终端在通过文本阅读应用浏览本地的文档(比如txt、word等格式文件)期间接收到语音合成请求，则根据该语音合成请求对本地文档进行编解码，解析出以GB2312编码的纯文本内容。For another example, when the user terminal receives a speech synthesis request while browsing a local document (for example, a txt or word file) through a text reading application, it decodes the local document according to the speech synthesis request and parses out the plain text content encoded in GB2312.

当提取得到对应前台应用的提取文本之后，用户终端进一步按照预设分句策略，将提取文本划分为多个分句。应当说明的是，本发明实施例对该预设分句策略的配置不做具体限制，可由本领域普通技术人员根据实际需要进行配置，比如，本发明实施例中，配置的预设分句策略为根据标点符号和长度进行分句。After the extracted text corresponding to the foreground application is obtained, the user terminal further divides the extracted text into multiple clauses according to a preset sentence-splitting strategy. It should be noted that the embodiment of the present invention does not specifically limit the configuration of the preset sentence-splitting strategy, which can be configured by those skilled in the art according to actual needs; for example, in the embodiment of the present invention, the preset strategy is to split sentences according to punctuation and length.

对于划分得到的多个分句,用户终端依次将每一分句设为待合成文本,以对每一分句进行语音合成。For the multiple divided sentences, the user terminal sequentially sets each sentence as the text to be synthesized, so as to perform speech synthesis on each sentence.
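As a rough illustration of the sentence-splitting strategy described above (split on punctuation, then cap the clause length), the following Python sketch shows one possible implementation; the punctuation set and the 50-character cap are assumptions made for illustration only and are not values fixed by this disclosure.

```python
import re

# Hypothetical clause-splitting helper: break the extracted text on
# sentence-ending punctuation, then further split any clause that is too long.
def split_into_clauses(text: str, max_len: int = 50) -> list:
    parts = re.split(r'(?<=[。！？!?；;])', text)  # keep the punctuation with its clause
    clauses = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        # enforce a maximum length so each synthesis request stays small
        while len(part) > max_len:
            clauses.append(part[:max_len])
            part = part[max_len:]
        clauses.append(part)
    return clauses

# Each returned clause would in turn be set as one "text to be synthesized".
print(split_into_clauses("开始的时候王宝乐不懂。后来他明白了！"))
```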

202,将待合成文本发送至语音合成服务器进行语音合成,使得语音合成服务器返回对应待合成文本的合成语音。202. Send the text to be synthesized to the speech synthesis server for speech synthesis, so that the speech synthesis server returns synthesized speech corresponding to the text to be synthesized.

用户终端在获取到需要进行语音合成的待合成文本之后，按照预先预定的数据格式，构建携带待合成文本的语音合成能力请求，并将该语音合成能力请求发送至语音合成服务器，指示语音合成服务器进行语音合成。After obtaining the text to be synthesized, the user terminal constructs, in a predetermined data format, a speech synthesis capability request carrying the text to be synthesized, and sends the speech synthesis capability request to the speech synthesis server to instruct the speech synthesis server to perform speech synthesis.

示例性的,以下为一语音合成能力请求的数据格式示意:Exemplarily, the data format of a speech synthesis capability request is as follows:

其中,header表示语音合成能力请求的请求头,header.guid表示用户终端的唯一标识,header.qua表示用户终端的设备及应用信息,header.user表示用户信息,header.user.user_id表示用户的唯一标识,header.lbs表示用户位置信息,header.lbs.longitude表示经度,header.lbs.latitude表示维度,header.ip表示用户终端的IP地址,header.device.network表示用户终端的网络类型。Among them, header indicates the request header of the speech synthesis capability request, header.guid indicates the unique identifier of the user terminal, header.qua indicates the device and application information of the user terminal, header.user indicates user information, and header.user.user_id indicates the user's unique ID, header.lbs indicates user location information, header.lbs.longitude indicates longitude, header.lbs.latitude indicates latitude, header.ip indicates the IP address of the user terminal, and header.device.network indicates the network type of the user terminal.

payload表示语音合成能力请求的请求内容,payload.speech_meta表示语音配置信息,payload.speech_meta.compress表示压缩类型,payload.speech_meta.person表示发音人,payload.speech_meta.volume表示发音音量,payload.speech_meta.speed表示发音语速,payload.speech_meta.pitch表示音调,payload.session_id表示会话ID,payload.index表示请求的语音片序号,payload.single_request表示语音合成类型,payload.content表示语音合成的内容,payload.content.text用于填充待合成文本。payload indicates the request content of the speech synthesis capability request, payload.speech_meta indicates the voice configuration information, payload.speech_meta.compress indicates the compression type, payload.speech_meta.person indicates the speaker, payload.speech_meta.volume indicates the pronunciation volume, and payload.speech_meta.speed Indicates the pronunciation speed, payload.speech_meta.pitch indicates the pitch, payload.session_id indicates the session ID, payload.index indicates the serial number of the requested voice film, payload.single_request indicates the type of speech synthesis, payload.content indicates the content of speech synthesis, payload.content .text is used to fill in the text to be synthesized.
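The sample message itself is not reproduced in this text, so the following is a hedged reconstruction assembled only from the field descriptions above; every concrete value is an invented placeholder, not data from the original publication.

```python
# Hedged reconstruction of a speech synthesis capability request,
# built from the fields described above. All values are placeholders.
speech_synthesis_request = {
    "header": {
        "guid": "device-unique-id",                       # unique identifier of the user terminal
        "qua": "app-and-device-info",                     # device and application information
        "user": {"user_id": "user-unique-id"},            # user information
        "lbs": {"longitude": 114.05, "latitude": 22.55},  # user location
        "ip": "203.0.113.1",                              # IP address of the user terminal
        "device": {"network": "wifi"},                    # network type
    },
    "payload": {
        "speech_meta": {
            "compress": "mp3",      # compression type
            "person": "speaker-1",  # speaker
            "volume": 5,            # volume
            "speed": 5,             # speaking rate
            "pitch": 5,             # pitch
        },
        "session_id": "session-0001",
        "index": 0,                 # sequence number of the requested speech segment
        "single_request": True,     # synthesis type
        "content": {"text": "开始的时候王宝乐不懂"},  # text to be synthesized
    },
}
```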

另一方面，语音合成服务器在接收到来自于用户终端的语音合成能力请求时，根据该语音合成能力请求进行语音合成，得到对应待合成文本的合成语音，并将该合成语音返回用户终端。On the other hand, when receiving the speech synthesis capability request from the user terminal, the speech synthesis server performs speech synthesis according to the request, obtains the synthesized speech corresponding to the text to be synthesized, and returns the synthesized speech to the user terminal.

比如，对应于以上示出的语音合成能力请求的数据格式，语音合成服务器返回合成语音的数据格式如下：For example, corresponding to the data format of the speech synthesis capability request shown above, the data format of the synthesized speech returned by the speech synthesis server is as follows:

其中,header表示消息头,header.session表示会话,header.session.session_id表示会话ID,payload表示消息体,payload.speech_finished表示是否结束,payload.speech_base64表示合成语音的Base64数据。Among them, header indicates the message header, header.session indicates the session, header.session.session_id indicates the session ID, payload indicates the message body, payload.speech_finished indicates whether it is finished, and payload.speech_base64 indicates the Base64 data of the synthesized speech.
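Again, the sample message is not reproduced here; the sketch below is a hedged reconstruction from the three fields listed above, with placeholder values.

```python
# Hedged reconstruction of the synthesized-speech response; values are placeholders.
# The client would Base64-decode payload["speech_base64"] before playback.
speech_synthesis_response = {
    "header": {"session": {"session_id": "session-0001"}},
    "payload": {
        "speech_finished": True,                         # whether synthesis has finished
        "speech_base64": "<Base64-encoded audio data>",  # synthesized audio
    },
}
```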

203,播放合成语音,并接收对合成语音的发音校正请求。203. Play the synthesized voice, and receive a pronunciation correction request for the synthesized voice.

用户终端在接收到语音合成服务器返回的合成语音之后,播放该合成语音,并在播放该合成语音的过程中接收对该合成语音的发音校正请求。After receiving the synthesized voice returned by the speech synthesis server, the user terminal plays the synthesized voice, and receives a pronunciation correction request for the synthesized voice during the process of playing the synthesized voice.

在一实施例中,在播放合成语音的过程中,本发明实施例提供的语音合成播放方法还包括:In one embodiment, during the process of playing synthesized speech, the speech synthesis playback method provided by the embodiment of the present invention further includes:

按照预设规则对合成语音对应的分句进行标识。The clauses corresponding to the synthesized speech are identified according to preset rules.

本发明实施例中,用户终端可以在提取得到前台应用的提取文本之后,创建一个覆盖前台应用的语音合成播放界面,并在该语音合成播放界面中展示提取文本。In the embodiment of the present invention, after extracting the extracted text of the foreground application, the user terminal can create a speech synthesis playback interface covering the foreground application, and display the extracted text in the speech synthesis playback interface.

在播放合成语音的过程中，用户终端按照预设规则对合成语音对应的分句进行标识，其中，通过对合成语音对应的分句进行标识，目的在于突出展示该分句，使得该分句区别展示于其它分句，进而使得用户能够从提取文本中快速定位到正在播放的分句。应当说明的是，本发明实施例中对于预设规则的配置方式不做具体限制，可由本领域普通技术人员根据实际需要进行配置。During playback of the synthesized speech, the user terminal marks the clause corresponding to the synthesized speech according to a preset rule; the purpose of marking the clause is to highlight it so that it is displayed differently from the other clauses, allowing the user to quickly locate the clause being played in the extracted text. It should be noted that the embodiment of the present invention does not specifically limit how the preset rule is configured, which can be configured by those skilled in the art according to actual needs.

比如，预设规则可以配置为增大展示比例，如图2c所示，被设为待合成文本的分句为"开始的时候王宝乐不懂"，该分句相较于其它分句具有更大的展示比例，使得其明显区别于其它分句。For example, the preset rule may be configured to enlarge the display scale. As shown in FIG. 2c, the clause set as the text to be synthesized is "开始的时候王宝乐不懂", which is displayed at a larger scale than the other clauses, clearly distinguishing it from them.

又比如，预设规则可以配置为调整展示颜色，如图2d所示，被设为待合成文本的分句为"开始的时候王宝乐不懂"，该分句相较于其它分句具有不同的展示颜色，使得其明显区别于其它分句。For another example, the preset rule may be configured to adjust the display color. As shown in FIG. 2d, the clause set as the text to be synthesized is "开始的时候王宝乐不懂", which is displayed in a different color from the other clauses, clearly distinguishing it from them.

在一实施例中,“接收对合成语音的发音校正请求”,包括:In an embodiment, "receiving a pronunciation correction request for synthesized speech" includes:

(1)在合成语音对应的分句的预设范围内展示发音校正控件;(1) displaying pronunciation correction controls within the preset range of the sentence corresponding to the synthesized speech;

(2)基于发音校正控件接收对合成语音的发音校正请求。(2) A pronunciation correction request for the synthesized speech is received based on the pronunciation correction control.

本发明实施例中,用户终端在播放合成语音的过程中,除了按照预设规则对合成语音对应的分句进行标识之外,还在合成语音对应的分句的预设范围内展示发音校正控件,从而通过该发音校正控件接收对播放的合成语音的发音校正请求。应当说明的是,对于预设范围的配置,本发明实施例中不做具体限制,可由本领域普通技术人员根据实际需要进行配置。In the embodiment of the present invention, during the process of playing the synthesized voice, the user terminal not only identifies the clauses corresponding to the synthesized voice according to the preset rules, but also displays the pronunciation correction control within the preset range of the clauses corresponding to the synthesized voice , so as to receive a pronunciation correction request for the played synthesized speech through the pronunciation correction control. It should be noted that, for the configuration of the preset range, there is no specific limitation in the embodiment of the present invention, and it can be configured by those of ordinary skill in the art according to actual needs.

比如，如图2e所示，分句"开始的时候王宝乐不懂"被设为待合成文本，用户终端在播放"开始的时候王宝乐不懂"对应的合成语音的过程中，通过改变展示颜色的方式对"开始的时候王宝乐不懂"进行标识，与此同时，用户终端在"开始的时候王宝乐不懂"的末尾展示发音校正控件，使得用户可以通过点击该发音校正控件来向用户终端输入发音校正请求。For example, as shown in FIG. 2e, the clause "开始的时候王宝乐不懂" is set as the text to be synthesized. While playing the synthesized speech corresponding to this clause, the user terminal marks it by changing its display color, and at the same time displays a pronunciation correction control at the end of the clause, so that the user can tap the pronunciation correction control to input a pronunciation correction request to the user terminal.

204,根据发音校正请求接收对应于合成语音的校正数据,并将校正数据发送至语音合成服务器,使得语音合成服务器根据校正数据更新合成语音,并返回更新后的合成语音。204. Receive correction data corresponding to the synthesized speech according to the pronunciation correction request, and send the correction data to the speech synthesis server, so that the speech synthesis server updates the synthesized speech according to the correction data, and returns the updated synthesized speech.

本发明实施例中，用户终端在接收输入的发音校正请求之后，进一步根据该发音校正请求接收对应于合成语音的校正数据，并在接收到对应于合成语音的校正数据之后，将该校正数据发送至语音合成服务器，使得语音合成服务器根据校正数据更新合成语音，并返回更新后的合成语音。其中，校正数据包括需要校正的字，以及正确的发音。In the embodiment of the present invention, after receiving the input pronunciation correction request, the user terminal further receives the correction data corresponding to the synthesized speech according to the pronunciation correction request, and after receiving the correction data, sends it to the speech synthesis server, so that the speech synthesis server updates the synthesized speech according to the correction data and returns the updated synthesized speech. The correction data includes the character that needs to be corrected and its correct pronunciation.

比如，用户终端可以采用语音合成能力请求的方式发送校正数据，但与以上所示的语音合成能力请求的数据格式的区别在于，此处额外增加了两个字段，分别为"report_correction"和"correct_phonetic"，其中，report_correction用于表示此次发送的语音合成能力请求是否用于更新合成语音，当写入值为"true"时，表示更新，当写入值为"false"时，表示正常进行语音合成，correct_phonetic用于写入校正数据。当语音合成服务器接收到来自于用户终端的语音合成能力请求时，根据其中"report_correction"确定是否为更新合成语音，若是，则从"correct_phonetic"中提取出校正数据和待合成文本，并根据该校正数据以及待合成文本重新合成得到新的合成语音，设为更新后的合成语音返回至用户终端。For example, the user terminal may send the correction data in the form of a speech synthesis capability request; the difference from the data format of the speech synthesis capability request shown above is that two additional fields are added here, namely "report_correction" and "correct_phonetic". report_correction indicates whether the speech synthesis capability request sent this time is used to update the synthesized speech: when the written value is "true", it indicates an update; when the written value is "false", it indicates normal speech synthesis. correct_phonetic is used to carry the correction data. When the speech synthesis server receives a speech synthesis capability request from the user terminal, it determines from "report_correction" whether the request is for updating the synthesized speech; if so, it extracts the correction data and the text to be synthesized from "correct_phonetic", re-synthesizes a new synthesized speech according to the correction data and the text to be synthesized, and returns it to the user terminal as the updated synthesized speech.
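A hedged sketch of how the two extra fields might be carried alongside the ordinary request body is shown below; the disclosure only names the fields "report_correction" and "correct_phonetic", so their placement and the internal layout of the correction data are assumptions.

```python
# Hedged sketch of a correction upload. Field placement and the structure of
# "correct_phonetic" are assumptions; only the two field names come from the text above.
correction_request = {
    "header": {"guid": "device-unique-id"},  # same header fields as before, abbreviated here
    "payload": {
        "session_id": "session-0001",
        "report_correction": "true",         # "true": update the synthesized speech; "false": normal synthesis
        "correct_phonetic": {
            "text": "开始的时候王宝乐不懂",    # text to be synthesized again
            "target_word": "乐",             # character whose pronunciation is corrected
            "target_pronunciation": "le4",   # chosen pronunciation, tone as a digit
        },
    },
}
```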

在一实施例中“根据发音校正请求接收对应于合成语音的校正数据”,包括:In an embodiment, "receiving correction data corresponding to synthesized speech according to a pronunciation correction request" includes:

(1)根据发音校正请求展示发音校正界面,发音校正界面包括字输入控件和发音控件;(1) Display the pronunciation correction interface according to the pronunciation correction request, and the pronunciation correction interface includes word input controls and pronunciation controls;

(2)基于字输入控件接收待合成文本中需要校正的目标字;(2) receiving the target word to be corrected in the text to be synthesized based on the word input control;

(3)基于发音控件接收对应目标字的目标发音;(3) receiving the target pronunciation of the corresponding target word based on the pronunciation control;

(4)将目标字和目标发音设为校正数据。(4) Set the target word and target pronunciation as correction data.

本发明实施例中，在根据发音校正请求接收对应于合成语音的校正数据时，用户终端首先根据接收到的发音校正请求展示发音校正界面，该发音校正界面包括字输入控件和发音控件，其中，字输入控件用于接收待合成文本中需要校正的目标字，发音控件用于接收对目标字的目标发音。In the embodiment of the present invention, when receiving the correction data corresponding to the synthesized speech according to the pronunciation correction request, the user terminal first displays a pronunciation correction interface according to the received request. The pronunciation correction interface includes a character input control and a pronunciation control, where the character input control is used to receive the target character in the text to be synthesized that needs to be corrected, and the pronunciation control is used to receive the target pronunciation of the target character.

由此,用户终端可以基于字输入控件接收用户输入的需要校正的目标字,以及基于发音控件接收对应该目标字的目标发音,并将目标字及其对应目标发音设为校正数据。Thus, the user terminal can receive the target word to be corrected input by the user based on the character input control, and receive the target pronunciation corresponding to the target word based on the pronunciation control, and set the target word and its corresponding target pronunciation as correction data.

其中，在将目标字及其对应的目标发音设为校正数据之前，用户终端还识别目标字是否归属于合成语音对应的待合成文本，在且仅在目标字归属于待合成文本时，才将接收到的目标字和目标发音设为对应于播放的合成语音的校正数据，由此来确保对合成语音校正的准确性。Before setting the target character and its corresponding target pronunciation as the correction data, the user terminal also identifies whether the target character belongs to the text to be synthesized corresponding to the synthesized speech; only when the target character belongs to that text are the received target character and target pronunciation set as the correction data corresponding to the played synthesized speech, thereby ensuring the accuracy of correcting the synthesized speech.
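A minimal sketch of the membership check described above, assuming the correction is simply discarded when the target character does not occur in the clause being played:

```python
# Minimal sketch: accept correction data only if the target character
# actually occurs in the clause currently being synthesized/played.
def accept_correction(text_to_synthesize: str, target_word: str, target_pronunciation: str):
    if target_word not in text_to_synthesize:
        return None  # input does not belong to this clause; ignore it
    return {"target_word": target_word, "target_pronunciation": target_pronunciation}

print(accept_correction("开始的时候王宝乐不懂", "乐", "le4"))   # accepted
print(accept_correction("开始的时候王宝乐不懂", "书", "shu1"))  # rejected -> None
```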

在一实施例中,“基于发音控件接收对应目标字的目标发音”,包括:In one embodiment, "receiving the target pronunciation of the corresponding target word based on the pronunciation control" includes:

(1)校验目标字是否为多音字;(1) check whether the target word is a polyphonic word;

(2)当判定目标字为多音字时,根据预设的多音字和发音的对应关系,获取目标字对应的多个发音;(2) When determining that the target word is a polyphonic word, according to the preset correspondence between the polyphonic word and the pronunciation, obtain a plurality of pronunciations corresponding to the target word;

(3)基于发音控件展示多个发音,并接收对展示的发音的选择操作;(3) Display multiple pronunciations based on the pronunciation control, and receive a selection operation on the displayed pronunciations;

(4)将选择操作对应的发音设为目标字的目标发音。(4) Set the pronunciation corresponding to the selection operation as the target pronunciation of the target word.

本发明实施例中，为了进一步确保对合成语音校正的准确性，用户终端在基于发音控件接收对应目标字的目标发音时，首先校验目标字是否为多音字，以在源头排除用户误输入而导致的误校正。In the embodiment of the present invention, to further ensure the accuracy of correcting the synthesized speech, when receiving the target pronunciation corresponding to the target character based on the pronunciation control, the user terminal first verifies whether the target character is a polyphonic character, so as to rule out at the source miscorrections caused by erroneous user input.

比如,用户终端中预先配置有多音字数据库,该多音字数据库存储有已知的多音字,在校验目标字是否为多音字,用户终端可以查询多音字数据库中是否存在用户输入的目标字,如存在,则校验通过,判定用户输入的目标字为多音字。For example, the polyphonic database is pre-configured in the user terminal, and the polyphonic database stores known polyphonic characters. When checking whether the target word is a polyphonic character, the user terminal can query whether there is a target word input by the user in the polyphonic database. If it exists, then the verification is passed, and it is determined that the target word input by the user is a polyphonic word.

当判断输入的目标字为多音字时,用户终端进一步根据预设的多音字和发音的对应关系,获取目标字对应的多个发音。然后,用户终端基于发音控件展示获取到的对应于目标字的多个发音,并接收对展示的发音的选择操作,将选择操作对应的发音设为目标字的目标发音。When judging that the input target word is a polyphonic character, the user terminal further acquires multiple pronunciations corresponding to the target word according to the preset correspondence between polyphonic characters and pronunciations. Then, the user terminal displays the acquired multiple pronunciations corresponding to the target word based on the pronunciation control, receives a selection operation on the displayed pronunciations, and sets the pronunciation corresponding to the selection operation as the target pronunciation of the target word.
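A hedged sketch of the polyphone check and candidate lookup described above; the dictionary below is a tiny made-up sample for illustration, not the polyphonic-character database of this disclosure.

```python
# Tiny illustrative polyphone table (made-up sample): maps a polyphonic
# character to its candidate pronunciations, with the tone written as a digit.
POLYPHONE_DB = {
    "乐": ["le4", "yue4"],
    "重": ["zhong4", "chong2"],
    "为": ["wei2", "wei4"],
}

def candidate_pronunciations(target_word: str) -> list:
    # An empty list means the character is not a known polyphone,
    # so the verification fails and the correction is not accepted.
    return POLYPHONE_DB.get(target_word, [])

# The pronunciation control would display these candidates; the user's
# selection becomes the target pronunciation of the target character.
print(candidate_pronunciations("乐"))   # ['le4', 'yue4']
print(candidate_pronunciations("书"))   # [] -> not a polyphone
```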

示例性的,请参照图2f,发音校正界面展示有:For example, please refer to Figure 2f, the pronunciation correction interface shows:

被设为待合成文本的分句“开始的时候王宝乐不懂”;The clause set as the text to be synthesized is "Wang Baole didn't understand at the beginning";

输入框形式的字输入控件以及第一提示信息“请输入需要校正的字”,提示用户输入需要校正的目标字,比如图示中用户输入了“乐”;The word input control in the form of an input box and the first prompt message "Please enter the word to be corrected" prompt the user to input the target word to be corrected, for example, the user has entered "乐" in the illustration;

选择框形式的发音控件以及第二提示信息"请勾选正确发音"，提示用户选择正确的发音作为目标发音，其中，发音控件的个数与获取到的对应目标字发音个数相同，比如图示中展示发音"le4"的发音控件和展示发音"yue4"的发音控件；pronunciation controls in the form of check boxes and a second prompt "请勾选正确发音" (please check the correct pronunciation), prompting the user to select the correct pronunciation as the target pronunciation, where the number of pronunciation controls equals the number of pronunciations obtained for the target character; for example, the figure shows one pronunciation control displaying "le4" and another displaying "yue4";

用于指示输入完成的"上报校正"控件，当用户输入完成时，可点击该"上报校正"控件，使得用户终端获取到目标字"乐"以及对应的目标发音"le4"。a "上报校正" (report correction) control used to indicate that input is complete; when the user finishes the input, the control can be tapped, so that the user terminal obtains the target character "乐" and the corresponding target pronunciation "le4".

205,将当前播放的合成语音替换为更新后的合成语音进行播放。205. Replace the currently played synthesized voice with the updated synthesized voice for playback.

其中,用户终端在接收到语音合成服务器所返回的更新后的合成语音时,即将当前播放的合成语音替换为更新后的合成语音进行播放,实现对合成语音的发音校正。Wherein, when the user terminal receives the updated synthesized voice returned by the speech synthesis server, it replaces the currently played synthesized voice with the updated synthesized voice and plays it, so as to realize the pronunciation correction of the synthesized voice.

在一实施例中,本发明实施例提供的语音合成播放方法,还包括:In one embodiment, the speech synthesis playback method provided by the embodiment of the present invention further includes:

将待合成文本、合成语音和/或更新后的合成语音存储至分布式系统中。The text to be synthesized, the synthesized speech and/or the updated synthesized speech are stored in the distributed system.

以分布式系统为区块链系统为例，请参照图2g，图2g是本发明实施例提供的分布式系统100应用于区块链的一个可选的结构示意图，其由多个节点(本发明以上实施例提及的用户终端、其它用户终端和语音合成服务器)和客户端形成，节点之间形成点对点(P2P，Peer To Peer)网络，P2P协议是一个运行在传输控制协议(TCP，Transmission Control Protocol)协议之上的应用层协议。节点包括硬件层、中间层、操作系统层和应用层。Taking a blockchain system as an example of the distributed system, reference is made to FIG. 2g, which is an optional structural diagram of the distributed system 100 applied to a blockchain according to an embodiment of the present invention. The system is formed of multiple nodes (the user terminal, other user terminals and the speech synthesis server mentioned in the above embodiments of the invention) and a client; a peer-to-peer (P2P) network is formed between the nodes, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). A node includes a hardware layer, an intermediate layer, an operating system layer and an application layer.

参照图2g示出的区块链系统中各节点的功能,涉及的功能包括:Referring to the functions of each node in the blockchain system shown in Figure 2g, the functions involved include:

1)路由,节点具有的基本功能,用于支持节点之间的通信。1) Routing, the basic function of nodes, is used to support communication between nodes.

节点除具有路由功能外,还可以具有以下功能:In addition to the routing function, nodes can also have the following functions:

2)应用，用于部署在区块链中，根据实际业务需求而实现特定业务，记录实现功能相关的数据形成记录数据，在记录数据中携带数字签名以表示任务数据的来源，将记录数据发送到区块链系统中的其他节点，供其他节点在验证记录数据来源以及完整性成功时，将记录数据添加到临时区块中。2) Application, deployed in the blockchain to implement specific services according to actual business needs; it records data related to those functions to form record data, carries a digital signature in the record data to indicate the source of the task data, and sends the record data to the other nodes in the blockchain system, so that the other nodes add the record data to a temporary block when they successfully verify the source and integrity of the record data.

例如,应用实现的业务包括:For example, the services implemented by the application include:

2.1)钱包，用于提供进行电子货币的交易的功能，包括发起交易(即，将当前交易的交易记录发送给区块链系统中的其他节点，其他节点验证成功后，作为承认交易有效的响应，将交易的记录数据存入区块链的临时区块中)；当然，钱包还支持查询电子货币地址中剩余的电子货币；2.1) Wallet, which provides the function of conducting electronic currency transactions, including initiating a transaction (that is, sending the transaction record of the current transaction to the other nodes in the blockchain system; after the other nodes verify it successfully, as an acknowledgement that the transaction is valid, the transaction record data is stored in a temporary block of the blockchain); of course, the wallet also supports querying the electronic currency remaining at an electronic currency address;

2.2)共享账本，用于提供账目数据的存储、查询和修改等操作的功能，将对账目数据的操作的记录数据发送到区块链系统中的其他节点，其他节点验证有效后，作为承认账目数据有效的响应，将记录数据存入临时区块中，还可以向发起操作的节点发送确认。2.2) Shared ledger, which provides functions such as storing, querying and modifying account data; the record data of an operation on the account data is sent to the other nodes in the blockchain system, and after the other nodes verify it as valid, as an acknowledgement that the account data is valid, the record data is stored in a temporary block, and a confirmation may also be sent to the node that initiated the operation.

2.3)智能合约，计算机化的协议，可以执行某个合约的条款，通过部署在共享账本上的用于在满足一定条件时而执行的代码实现，根据实际的业务需求代码用于完成自动化的交易，例如查询买家所购买商品的物流状态，在买家签收货物后将买家的电子货币转移到商户的地址；当然，智能合约不仅限于执行用于交易的合约，还可以执行对接收的信息进行处理的合约。2.3) Smart contract, a computerized agreement that can execute the terms of a contract, implemented by code deployed on the shared ledger and executed when certain conditions are met; according to actual business needs, the code is used to complete automated transactions, for example querying the logistics status of goods purchased by a buyer and transferring the buyer's electronic currency to the merchant's address after the buyer signs for the goods; of course, smart contracts are not limited to contracts for executing transactions, but can also execute contracts that process received information.

3)区块链，包括一系列按照产生的先后时间顺序相互接续的区块(Block)，新区块一旦加入到区块链中就不会再被移除，区块中记录了区块链系统中节点提交的记录数据。3) Blockchain, comprising a series of blocks that succeed one another in the chronological order of their generation; once a new block is added to the blockchain it is never removed, and each block records the record data submitted by the nodes in the blockchain system.

参照图2h,图2h是本发明实施例提供的区块结构(Block Structure)一个可选的示意图,每个区块中包括本区块存储交易记录的哈希值(本区块的哈希值)、以及前一区块的哈希值,各区块通过哈希值连接形成区块链。另外,区块中还可以包括有区块生成时的时间戳等信息。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了相关的信息,用于验证其信息的有效性(防伪)和生成下一个区块。Referring to Fig. 2h, Fig. 2h is an optional schematic diagram of the block structure (Block Structure) provided by the embodiment of the present invention, each block includes the hash value of the block storage transaction record (the hash value of the block ), and the hash value of the previous block, each block is connected through the hash value to form a blockchain. In addition, the block may also include information such as a time stamp when the block was generated. Blockchain (Blockchain), essentially a decentralized database, is a series of data blocks associated with each other using cryptographic methods. Each data block contains relevant information to verify the validity of its information. (anti-counterfeiting) and generate the next block.
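
The following is a minimal sketch, not the patented implementation, of the hash linking described above: each block carries its record data, a timestamp, the previous block's hash and its own hash computed over those fields. Field names such as "records" and "prev_hash" are assumptions made for illustration.

import hashlib
import json
import time

def block_hash(block: dict) -> str:
    # Hash every field except the block's own hash, in a stable order.
    payload = json.dumps(
        {k: block[k] for k in ("index", "timestamp", "records", "prev_hash")},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_block(chain: list, records: list) -> dict:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {
        "index": len(chain),
        "timestamp": time.time(),
        "records": records,          # e.g. text-to-speech correction records
        "prev_hash": prev_hash,
    }
    block["hash"] = block_hash(block)
    chain.append(block)
    return block

chain: list = []
append_block(chain, [{"text": "开始的时候王宝乐不懂", "audio_id": "a1"}])
append_block(chain, [{"text": "开始的时候王宝乐不懂", "audio_id": "a1-corrected"}])
assert chain[1]["prev_hash"] == chain[0]["hash"]   # blocks are linked by hash values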

In this embodiment of the present invention, the user terminal may further store the text to be synthesized in the above speech synthesis process, together with its corresponding synthesized speech and/or updated synthesized speech, in the distributed system where the user terminal is located, for record keeping.

As can be seen from the above, in this embodiment of the present invention, the user terminal can receive a speech synthesis request, acquire, according to the speech synthesis request, the text to be synthesized that needs to undergo speech synthesis, and then send the text to be synthesized to the speech synthesis server for speech synthesis to obtain the corresponding synthesized speech; the user terminal then plays the synthesized speech, receives a pronunciation correction request for the synthesized speech, receives correction data corresponding to the synthesized speech according to the pronunciation correction request, and sends the correction data to the speech synthesis server for updating the synthesized speech, thereby obtaining updated synthesized speech, and replaces the currently played synthesized speech with the updated synthesized speech for playback. Compared with the related art, the present invention can correct and update the played synthesized speech in real time during playback, so that even when the pronunciation of a polyphonic character is predicted incorrectly, the pronunciation can be corrected in time.

Embodiment 2

An embodiment of the present invention further provides a speech synthesis playing method applicable to a speech synthesis server, including: when text to be synthesized is received from a user terminal, performing speech synthesis on the text to be synthesized according to a pre-trained speech synthesis model to obtain synthesized speech; returning the synthesized speech to the user terminal for playback, and receiving correction data corresponding to the synthesized speech returned by the user terminal; updating the synthesized speech according to the correction data to obtain updated synthesized speech; and returning the updated synthesized speech to the user terminal, so that the user terminal replaces the synthesized speech with the updated synthesized speech for playback.

Referring to FIG. 3, the flow of the speech synthesis playing method may be as follows:

301: When text to be synthesized is received from the user terminal, perform speech synthesis on the text to be synthesized according to a pre-trained speech synthesis model to obtain synthesized speech.

When receiving an input speech synthesis request, the user terminal acquires, according to the speech synthesis request, the text to be synthesized that needs to undergo speech synthesis. Correspondingly, the speech synthesis server receives the text to be synthesized from the user terminal, and when the text to be synthesized is received, performs speech synthesis on it according to the pre-trained speech synthesis model to obtain synthesized speech.

It should be noted that the speech synthesis model may be pre-trained using a conditional random field (CRF) method, a hidden Markov model (HMM) method, a decision tree method or the like, which is not described in detail in the present invention.

302: Return the synthesized speech to the user terminal for playback, and receive correction data corresponding to the synthesized speech returned by the user terminal.

After obtaining the synthesized speech, the speech synthesis server returns the synthesized speech to the user terminal for playback.

On the other hand, during playback of the synthesized speech, the user terminal receives a pronunciation correction request for the synthesized speech, receives input correction data according to the pronunciation correction request, and sends the correction data to the speech synthesis server. Correspondingly, the speech synthesis server also receives the correction data, corresponding to the text to be synthesized, returned by the user terminal.

303: Update the synthesized speech according to the correction data to obtain updated synthesized speech.

After receiving the correction data returned by the user terminal, the speech synthesis server updates the previously synthesized speech according to the correction data to obtain updated synthesized speech.

For example, for the text to be synthesized "开始的时候王宝乐不懂", the pronunciation of "乐" in the synthesized speech is "yue4"; after the update, the pronunciation of "乐" in the updated synthesized speech is "le4".

304: Return the updated synthesized speech to the user terminal, so that the user terminal replaces the synthesized speech with the updated synthesized speech for playback.

After obtaining the updated synthesized speech, the speech synthesis server returns the updated synthesized speech to the user terminal, so that the user terminal replaces the currently played synthesized speech with the updated synthesized speech for playback, thereby implementing the pronunciation correction.
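
A minimal sketch of the server-side flow of steps 301 to 304 follows. The class and function names (SynthesisModel, handle_request, handle_correction) and the placeholder "waveform" are assumptions made for illustration; they are not interfaces defined by this document.

from dataclasses import dataclass, field

@dataclass
class SynthesisModel:
    # Per-request pronunciation overrides: character -> pinyin (e.g. "乐" -> "le4").
    overrides: dict = field(default_factory=dict)

    def predict_pinyin(self, char: str) -> str:
        # Stand-in for the trained model's grapheme-to-phoneme prediction.
        return self.overrides.get(char, f"<predicted:{char}>")

    def synthesize(self, text: str) -> bytes:
        # A real system would run an acoustic model and vocoder; here we return
        # a placeholder "waveform" derived from the predicted pronunciations.
        pinyin = " ".join(self.predict_pinyin(c) for c in text)
        return pinyin.encode("utf-8")

def handle_request(model: SynthesisModel, text: str) -> bytes:
    return model.synthesize(text)                      # steps 301 and 302

def handle_correction(model: SynthesisModel, text: str, correction: dict) -> bytes:
    model.overrides[correction["word"]] = correction["pronunciation"]
    return model.synthesize(text)                      # steps 303 and 304

model = SynthesisModel()
audio = handle_request(model, "开始的时候王宝乐不懂")
updated = handle_correction(model, "开始的时候王宝乐不懂",
                            {"word": "乐", "pronunciation": "le4"})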

It should be noted that, for parts of this embodiment that are not described in detail, reference may be made to the related descriptions in the above embodiment of the speech synthesis playing method applicable to the user terminal, and details are not repeated here.

In an embodiment, the speech synthesis playing method provided by this embodiment of the present invention further includes:

updating the speech synthesis model according to the text to be synthesized and the correction data.

For example, each time the speech synthesis server receives correction data from the user terminal, it stores the correction data and its corresponding text to be synthesized as training corpus in a pre-created database, continuously enriching the training corpus; when the training corpus in the database has accumulated to a preset amount, the speech synthesis model is updated in a supervised-learning manner according to the stored training corpus, so that the speech synthesis model can predict the pronunciation of polyphonic characters more accurately.
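
A minimal sketch of this corpus-accumulation idea follows; the threshold value and the retrain() placeholder are assumptions, not values or interfaces fixed by this document.

RETRAIN_THRESHOLD = 1000   # preset number of accumulated training samples
corpus: list = []

def retrain(samples: list) -> None:
    # Placeholder for supervised re-training of the pronunciation prediction model.
    print(f"retraining on {len(samples)} samples")

def record_correction(text: str, word: str, pronunciation: str) -> None:
    corpus.append({"text": text, "word": word, "pronunciation": pronunciation})
    if len(corpus) >= RETRAIN_THRESHOLD:
        retrain(corpus)
        corpus.clear()

record_correction("开始的时候王宝乐不懂", "乐", "le4")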

In an embodiment, the speech synthesis method provided by this embodiment of the present invention further includes:

storing the text to be synthesized, the synthesized speech and/or the updated synthesized speech in a distributed system.

Taking a blockchain system as an example of the distributed system, it is formed by a plurality of nodes (the user terminal, the speech synthesis server and the like mentioned in the above embodiments of the present invention) and clients.

In this embodiment of the present invention, the speech synthesis server may also store the text to be synthesized during the speech synthesis process, together with its corresponding synthesized speech and/or updated synthesized speech, in the distributed system where it is located, for record keeping.

Embodiment 3

Based on the methods described in the foregoing embodiments, further description is given below by way of example.

As shown in FIG. 4, the flow of the speech synthesis playing method may be as follows:

401: The user terminal receives a speech synthesis request, extracts the text in the display content of the foreground application according to the speech synthesis request to obtain extracted text, divides the extracted text into a plurality of clauses according to a preset sentence-segmentation strategy, and successively sets each of the obtained clauses as the text to be synthesized and sends it to the speech synthesis server.

In this embodiment of the present invention, the user terminal may receive an externally input speech synthesis request in real time, thereby triggering speech synthesis and converting the corresponding text into speech for output.

After receiving the speech synthesis request, the user terminal further acquires, according to the speech synthesis request, the text to be synthesized that needs to undergo speech synthesis. Specifically, the text in the display content of the foreground application is first extracted according to the received speech synthesis request, and the extracted text is recorded as the extracted text.

For example, when the user terminal receives a speech synthesis request while browsing a web page through a browser application, it obtains the DOM tree of the web page according to the speech synthesis request and extracts the text in the web page, such as the body of a news article or the content of a novel chapter, based on a text-density calculation method;

For another example, when the user terminal receives a speech synthesis request while browsing a local document (for example, a file in txt or word format) through a text reading application, it encodes and decodes the local document according to the speech synthesis request and parses out plain-text content encoded in GB2312.

After the extracted text corresponding to the foreground application is obtained, the user terminal further divides the extracted text into a plurality of clauses according to the preset sentence-segmentation strategy. It should be noted that this embodiment of the present invention does not specifically limit the configuration of the preset sentence-segmentation strategy, which may be configured by a person of ordinary skill in the art according to actual needs; for example, in this embodiment of the present invention, the configured strategy is to segment sentences according to punctuation and length.

For the plurality of clauses obtained by the division, the user terminal successively sets each clause as the text to be synthesized, so as to perform speech synthesis on each clause; one possible segmentation is sketched below.
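
The sketch below shows one possible sentence-segmentation strategy based on punctuation and length; the punctuation set and the maximum clause length are illustrative assumptions rather than values fixed by this document.

import re

MAX_LEN = 80  # assumed maximum clause length, in characters

def split_into_clauses(text: str) -> list:
    # Split after sentence-final punctuation, then further split any clause
    # that exceeds the maximum length.
    pieces = [p for p in re.split(r"(?<=[。！？；!?;])", text) if p.strip()]
    clauses = []
    for piece in pieces:
        while len(piece) > MAX_LEN:
            clauses.append(piece[:MAX_LEN])
            piece = piece[MAX_LEN:]
        clauses.append(piece)
    return clauses

print(split_into_clauses("开始的时候王宝乐不懂。后来他明白了！"))
# -> ['开始的时候王宝乐不懂。', '后来他明白了！']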

After obtaining the text to be synthesized that needs to undergo speech synthesis, the user terminal constructs, in a predetermined data format, a speech synthesis capability request carrying the text to be synthesized, and sends the speech synthesis capability request to the speech synthesis server to instruct the speech synthesis server to perform speech synthesis.
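
Because the predetermined data format is not specified here, the JSON envelope below is purely an assumed example of such a speech synthesis capability request; all field names are illustrative.

import json

def build_synthesis_request(clause: str, session_id: str) -> str:
    return json.dumps(
        {"type": "tts.synthesize", "session": session_id, "text": clause},
        ensure_ascii=False,
    )

print(build_synthesis_request("开始的时候王宝乐不懂", "session-001"))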

402: The speech synthesis server performs speech synthesis on the text to be synthesized according to the pre-trained speech synthesis model to obtain synthesized speech, and returns it to the user terminal.

It should be noted that the speech synthesis model may be pre-trained using a conditional random field (CRF) method, a hidden Markov model (HMM) method, a decision tree method or the like, which is not described in detail in the present invention.

The speech synthesis server receives the text to be synthesized from the user terminal, and when the text to be synthesized is received, performs speech synthesis on it according to the pre-trained speech synthesis model to obtain synthesized speech, which is returned to the user terminal.

403: The user terminal plays the synthesized speech, marks the clause corresponding to the synthesized speech according to a preset rule, and displays a pronunciation correction control within a preset range of the clause.

404: The user terminal receives a pronunciation correction request for the synthesized speech based on the pronunciation correction control.

After receiving the synthesized speech returned by the speech synthesis server, the user terminal plays the synthesized speech and, during playback, marks the clause corresponding to the synthesized speech according to the preset rule.

The purpose of marking the clause corresponding to the synthesized speech is to highlight the clause so that it is displayed distinctly from the other clauses, enabling the user to quickly locate, within the extracted text, the clause that is being played. It should be noted that this embodiment of the present invention does not specifically limit the way the preset rule is configured, which may be configured by a person of ordinary skill in the art according to actual needs.

In addition to marking the clause corresponding to the synthesized speech according to the preset rule, the user terminal also displays a pronunciation correction control within a preset range of the clause corresponding to the synthesized speech, so as to receive, through the pronunciation correction control, a pronunciation correction request for the synthesized speech being played. It should be noted that the configuration of the preset range is not specifically limited in this embodiment of the present invention and may be configured by a person of ordinary skill in the art according to actual needs.

405: The user terminal displays a pronunciation correction interface according to the pronunciation correction request, the pronunciation correction interface including a character input control and pronunciation controls.

406: The user terminal receives, based on the character input control, the target character in the text to be synthesized that needs to be corrected, receives, based on a pronunciation control, the target pronunciation corresponding to the target character, and sends the target character and the target pronunciation to the speech synthesis server as the correction data.

In this embodiment of the present invention, when receiving the correction data corresponding to the synthesized speech according to the pronunciation correction request, the user terminal first displays a pronunciation correction interface according to the received pronunciation correction request. The pronunciation correction interface includes a character input control and pronunciation controls, where the character input control is used to receive the target character in the text to be synthesized that needs to be corrected, and a pronunciation control is used to receive the target pronunciation of the target character.

For example, referring to FIG. 2f, the pronunciation correction interface displays:

the clause set as the text to be synthesized, "开始的时候王宝乐不懂";

a character input control in the form of an input box and first prompt information "请输入需要校正的字" (please enter the character to be corrected), prompting the user to input the target character that needs correction; for example, in the figure the user has entered "乐";

pronunciation controls in the form of check boxes and second prompt information "请勾选正确发音" (please tick the correct pronunciation), prompting the user to select the correct pronunciation as the target pronunciation, where the number of pronunciation controls is the same as the number of acquired pronunciations of the target character; for example, the figure shows a pronunciation control displaying the pronunciation "le4" and a pronunciation control displaying the pronunciation "yue4";

a "上报校正" (report correction) control used to indicate completion of the input; when the user has finished the input, the user can click this control, so that the user terminal obtains the target character "乐" and the corresponding target pronunciation "le4".

After receiving, based on the character input control, the target character input by the user that needs correction, and receiving, based on the pronunciation control, the target pronunciation corresponding to the target character, the user terminal sets the target character and its corresponding target pronunciation as the correction data and sends it to the speech synthesis server.
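
A sketch of the correction data reported to the speech synthesis server follows; the field names are assumptions used for illustration only, since the document does not prescribe a wire format.

import json

def build_correction_data(clause: str, target_word: str, target_pinyin: str) -> str:
    return json.dumps(
        {"type": "tts.correction", "text": clause,
         "word": target_word, "pronunciation": target_pinyin},
        ensure_ascii=False,
    )

print(build_correction_data("开始的时候王宝乐不懂", "乐", "le4"))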

407: The speech synthesis server updates the synthesized speech according to the correction data to obtain updated synthesized speech, and returns it to the user terminal.

After receiving the correction data returned by the user terminal, the speech synthesis server updates the previously synthesized speech according to the correction data to obtain the updated synthesized speech, which is returned to the user terminal.

For example, for the text to be synthesized "开始的时候王宝乐不懂", the pronunciation of "乐" in the synthesized speech is "yue4"; after the update, the pronunciation of "乐" in the updated synthesized speech is "le4".

408: The user terminal replaces the currently played synthesized speech with the updated synthesized speech for playback.

When receiving the updated synthesized speech returned by the speech synthesis server, the user terminal replaces the currently played synthesized speech with the updated synthesized speech for playback, thereby implementing the pronunciation correction of the synthesized speech.

Embodiment 4

In order to better implement the above speech synthesis playing method, an embodiment of the present invention further provides a speech synthesis playing apparatus, which may specifically be integrated in a user terminal.

For example, as shown in FIG. 5, the speech synthesis playing apparatus may include a text acquisition module 501, a speech synthesis module 502, a speech playing module 503 and a text correction module 504, as follows:

The text acquisition module 501 is configured to receive a speech synthesis request and acquire, according to the speech synthesis request, the text to be synthesized that needs to undergo speech synthesis.

The speech synthesis module 502 is configured to send the text to be synthesized to a speech synthesis server for speech synthesis, so that the speech synthesis server returns synthesized speech corresponding to the text to be synthesized.

The speech playing module 503 is configured to play the synthesized speech and receive a pronunciation correction request for the synthesized speech.

The text correction module 504 is configured to receive correction data corresponding to the synthesized speech according to the pronunciation correction request and send the correction data to the speech synthesis server, so that the speech synthesis server updates the synthesized speech according to the correction data and returns the updated synthesized speech.

The speech playing module 503 is further configured to replace the currently played synthesized speech with the updated synthesized speech for playback.

In an embodiment, when receiving the correction data corresponding to the synthesized speech according to the pronunciation correction request, the text correction module 504 is configured to:

display a pronunciation correction interface according to the pronunciation correction request, the pronunciation correction interface including a character input control and pronunciation controls;

receive, based on the character input control, the target character in the text to be synthesized that needs to be corrected;

receive, based on the pronunciation control, the target pronunciation corresponding to the target character;

set the target character and the target pronunciation as the correction data.

In an embodiment, when receiving the target pronunciation corresponding to the target character based on the pronunciation control, the text correction module 504 is configured to:

check whether the target character is a polyphonic character;

when it is determined that the target character is a polyphonic character, acquire a plurality of pronunciations corresponding to the target character according to a preset correspondence between polyphonic characters and pronunciations (an illustrative lookup is sketched below, after this list);

display the plurality of pronunciations based on the pronunciation controls, and receive a selection operation on the displayed pronunciations;

set the pronunciation corresponding to the selection operation as the target pronunciation of the target character.
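
A minimal sketch of such a polyphone lookup, assuming a preset mapping from polyphonic characters to candidate pinyin readings (the table entries are illustrative), is given below.

POLYPHONE_TABLE = {
    "乐": ["le4", "yue4"],
    "行": ["xing2", "hang2"],
}

def candidate_pronunciations(target_word: str) -> list:
    # Returns all preset readings if the character is polyphonic,
    # otherwise an empty list (no choice needs to be offered).
    return POLYPHONE_TABLE.get(target_word, [])

print(candidate_pronunciations("乐"))   # ['le4', 'yue4'], shown on the pronunciation controls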

In an embodiment, when acquiring, according to the speech synthesis request, the text to be synthesized that needs to undergo speech synthesis, the text acquisition module 501 is configured to:

extract the text in the display content of the foreground application according to the speech synthesis request to obtain extracted text;

divide the extracted text into a plurality of clauses according to a preset sentence-segmentation strategy;

set the clauses as the text to be synthesized.

In an embodiment, during playback of the synthesized speech, the speech playing module 503 is further configured to:

mark the clause corresponding to the synthesized speech according to a preset rule.

In an embodiment, when receiving the pronunciation correction request for the synthesized speech, the speech playing module 503 is configured to:

display a pronunciation correction control within a preset range of the clause corresponding to the synthesized speech;

receive the pronunciation correction request for the synthesized speech based on the pronunciation correction control.

In an embodiment, the speech synthesis playing apparatus provided by this embodiment of the present invention further includes a data storage module, configured to:

store the text to be synthesized, the synthesized speech and/or the updated synthesized speech in a distributed system.

Embodiment 5

In order to better implement the above speech synthesis playing method, an embodiment of the present invention further provides a speech synthesis playing apparatus, which may specifically be integrated in a speech synthesis server.

For example, as shown in FIG. 6, the speech synthesis playing apparatus may include a speech synthesis module 601, a speech delivery module 602 and a speech update module 603, as follows:

The speech synthesis module 601 is configured to, when text to be synthesized is received from a user terminal, perform speech synthesis on the text to be synthesized according to a pre-trained speech synthesis model to obtain synthesized speech.

The speech delivery module 602 is configured to return the synthesized speech to the user terminal for playback, and receive correction data corresponding to the synthesized speech returned by the user terminal.

The speech update module 603 is configured to update the synthesized speech according to the correction data to obtain updated synthesized speech.

The speech delivery module 602 is further configured to return the updated synthesized speech to the user terminal, so that the user terminal replaces the synthesized speech with the updated synthesized speech for playback.

In an embodiment, the speech synthesis playing apparatus provided by this embodiment of the present invention further includes a model update module, configured to:

update the speech synthesis model according to the text to be synthesized and the correction data.

In an embodiment, the speech synthesis playing apparatus provided by this embodiment of the present invention further includes a data storage module, configured to:

store the text to be synthesized, the synthesized speech and/or the updated synthesized speech in a distributed system.

Embodiment 6

An embodiment of the present invention further provides a user terminal, which may be a device such as a mobile phone, a tablet computer or a notebook computer. FIG. 7 shows a schematic structural diagram of the user terminal involved in this embodiment of the present invention. Specifically:

The user terminal may include components such as a processor 701 having one or more processing cores, a memory 702 having one or more computer-readable storage media, a power supply 703 and an input unit 704. A person skilled in the art can understand that the user terminal structure shown in FIG. 7 does not constitute a limitation on the user terminal, and the user terminal may include more or fewer components than shown, or combine certain components, or have a different component arrangement, wherein:

The processor 701 is the control center of the user terminal, and connects the various parts of the entire user terminal using various interfaces and lines. By running or executing the software programs and/or modules stored in the memory 702 and invoking the data stored in the memory 702, it performs the various functions of the user terminal and processes data.

The memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by running the software programs and modules stored in the memory 702. In addition, the memory 702 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device or another solid-state storage device. Correspondingly, the memory 702 may further include a memory controller to provide the processor 701 with access to the memory 702.

The user terminal further includes a power supply 703 that supplies power to the components. Preferably, the power supply 703 may be logically connected to the processor 701 through a power management system, so that functions such as charging, discharging and power consumption management are implemented through the power management system.

The user terminal may further include an input unit 704, which may be used to receive input digit or character information and to generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.

Although not shown, the user terminal may further include a display unit and the like, which are not described in detail here. Specifically, in this embodiment, the processor 701 in the user terminal loads the executable files corresponding to the processes of one or more application programs into the memory 702 according to the following instructions, and the processor 701 runs the application programs stored in the memory 702, thereby implementing various functions as follows:

receiving a speech synthesis request, and acquiring, according to the speech synthesis request, the text to be synthesized that needs to undergo speech synthesis; sending the text to be synthesized to a speech synthesis server for speech synthesis, so that the speech synthesis server returns synthesized speech corresponding to the text to be synthesized; playing the synthesized speech and receiving a pronunciation correction request for the synthesized speech; receiving correction data corresponding to the synthesized speech according to the pronunciation correction request, and sending the correction data to the speech synthesis server, so that the speech synthesis server updates the synthesized speech according to the correction data and returns the updated synthesized speech; and replacing the currently played synthesized speech with the updated synthesized speech for playback.

It should be noted that the user terminal provided by this embodiment of the present invention and the speech synthesis playing method applicable to the user terminal in the above embodiments belong to the same concept; for the specific implementation process, refer to the above method embodiments, and details are not repeated here.

Embodiment 7

An embodiment of the present invention further provides a speech synthesis server. FIG. 8 shows a schematic structural diagram of the speech synthesis server involved in this embodiment of the present invention. Specifically:

The speech synthesis server may include components such as a processor 801 having one or more processing cores, a memory 802 having one or more computer-readable storage media, a power supply 803 and an input unit 804. A person skilled in the art can understand that the speech synthesis server structure shown in FIG. 8 does not constitute a limitation on the speech synthesis server, and the server may include more or fewer components than shown, or combine certain components, or have a different component arrangement, wherein:

The processor 801 is the control center of the speech synthesis server, and connects the various parts of the entire speech synthesis server using various interfaces and lines. By running or executing the software programs and/or modules stored in the memory 802 and invoking the data stored in the memory 802, it performs the various functions of the speech synthesis server and processes data.

The memory 802 may be used to store software programs and modules, and the processor 801 executes various functional applications and data processing by running the software programs and modules stored in the memory 802. In addition, the memory 802 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device or another solid-state storage device. Correspondingly, the memory 802 may further include a memory controller to provide the processor 801 with access to the memory 802.

The speech synthesis server further includes a power supply 803 that supplies power to the components. Preferably, the power supply 803 may be logically connected to the processor 801 through a power management system, so that functions such as charging, discharging and power consumption management are implemented through the power management system.

Specifically, in this embodiment, the processor 801 in the speech synthesis server loads the executable files corresponding to the processes of one or more application programs into the memory 802 according to the following instructions, and the processor 801 runs the application programs stored in the memory 802, thereby implementing various functions as follows:

when text to be synthesized is received from a user terminal, performing speech synthesis on the text to be synthesized according to a pre-trained speech synthesis model to obtain synthesized speech; returning the synthesized speech to the user terminal for playback, and receiving correction data corresponding to the synthesized speech returned by the user terminal; updating the synthesized speech according to the correction data to obtain updated synthesized speech; and returning the updated synthesized speech to the user terminal, so that the user terminal replaces the synthesized speech with the updated synthesized speech for playback.

It should be noted that the speech synthesis server provided by this embodiment of the present invention and the speech synthesis playing method applicable to the speech synthesis server in the above embodiments belong to the same concept; for the specific implementation process, refer to the above method embodiments, and details are not repeated here.

Embodiment 8

A person of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments may be completed by instructions, or by instructions controlling related hardware, and the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, an embodiment of the present invention provides a storage medium in which a plurality of instructions are stored. The instructions can be loaded by a processor of a user terminal to execute the speech synthesis playing method applicable to the user terminal provided by the embodiments of the present invention. For example, the instructions may perform the following steps:

receiving a speech synthesis request, and acquiring, according to the speech synthesis request, the text to be synthesized that needs to undergo speech synthesis; sending the text to be synthesized to a speech synthesis server for speech synthesis, so that the speech synthesis server returns synthesized speech corresponding to the text to be synthesized; playing the synthesized speech and receiving a pronunciation correction request for the synthesized speech; receiving correction data corresponding to the synthesized speech according to the pronunciation correction request, and sending the correction data to the speech synthesis server, so that the speech synthesis server updates the synthesized speech according to the correction data and returns the updated synthesized speech; and replacing the currently played synthesized speech with the updated synthesized speech for playback.

In addition, an embodiment of the present invention provides a storage medium in which a plurality of instructions are stored. The instructions can be loaded by a processor of a speech synthesis server to execute the speech synthesis playing method applicable to the server provided by the embodiments of the present invention. For example, the instructions may perform the following steps:

when text to be synthesized is received from a user terminal, performing speech synthesis on the text to be synthesized according to a pre-trained speech synthesis model to obtain synthesized speech; returning the synthesized speech to the user terminal for playback, and receiving correction data corresponding to the synthesized speech returned by the user terminal; updating the synthesized speech according to the correction data to obtain updated synthesized speech; and returning the updated synthesized speech to the user terminal, so that the user terminal replaces the synthesized speech with the updated synthesized speech for playback.

The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.

The storage medium provided by the embodiments of the present invention can achieve the beneficial effects achievable by the corresponding speech synthesis playing method provided by the embodiments of the present invention; for details, refer to the foregoing embodiments, and details are not repeated here.

The speech synthesis playing method, apparatus and storage medium provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the descriptions of the above embodiments are only intended to help understand the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific implementations and the scope of application in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method for synthesizing and playing speech, comprising:
receiving a voice synthesis request, and acquiring a text to be synthesized, which needs to be subjected to voice synthesis, according to the voice synthesis request;
sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns a synthesized voice corresponding to the text to be synthesized;
playing the synthesized voice and receiving a pronunciation correction request for the synthesized voice;
receiving input correction data corresponding to the synthesized voice according to the pronunciation correction request, and sending the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data and returns the updated synthesized voice;
and replacing the currently played synthesized voice with the updated synthesized voice for playing.
2. The speech synthesis playing method according to claim 1, wherein the step of receiving input correction data corresponding to the synthesized speech according to the pronunciation correction request comprises:
displaying a pronunciation correction interface according to the pronunciation correction request, wherein the pronunciation correction interface comprises a character input control and a pronunciation control;
receiving target words needing to be corrected in the text to be synthesized based on the word input control;
receiving a target pronunciation corresponding to the target character based on the pronunciation control;
setting the target word and the target pronunciation as the correction data.
3. The speech synthesis playing method according to claim 2, wherein the step of receiving a target pronunciation corresponding to the target word based on the pronunciation control comprises:
checking whether the target character is a polyphone character;
when the target character is judged to be a polyphone character, acquiring a plurality of pronunciations corresponding to the target character according to a preset corresponding relation between the polyphone character and the pronunciations;
displaying the plurality of pronunciations based on the pronunciation control, and receiving selection operation of the displayed pronunciations;
and setting the pronunciation corresponding to the selection operation as the target pronunciation of the target character.
4. The speech synthesis playing method according to any one of claims 1 to 3, wherein the step of acquiring the text to be synthesized that needs to be subjected to speech synthesis according to the speech synthesis request comprises:
extracting a text in the display content of the foreground application according to the voice synthesis request to obtain an extracted text;
dividing the extracted text into a plurality of clauses according to a preset clause strategy;
and setting the clauses as the texts to be synthesized.
5. The speech synthesis playing method according to any one of claims 1 to 3, wherein the speech synthesis playing method further comprises:
and storing the text to be synthesized, the synthesized voice and/or the updated synthesized voice into a distributed system.
6. A method for synthesizing and playing speech, comprising:
when a text to be synthesized from a user terminal is received, carrying out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model to obtain synthesized voice;
returning the synthesized voice to the user terminal for playing, and receiving correction data corresponding to the synthesized voice returned by the user terminal;
updating the synthesized voice according to the correction data to obtain updated synthesized voice;
and returning the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice for playing.
7. The speech synthesis playing method according to claim 6, further comprising:
and updating the voice synthesis model according to the text to be synthesized and the correction data.
8. A speech synthesis playback apparatus, comprising:
the text acquisition module is used for receiving a voice synthesis request and acquiring a text to be synthesized, which needs voice synthesis, according to the voice synthesis request;
the voice synthesis module is used for sending the text to be synthesized to a voice synthesis server for voice synthesis, so that the voice synthesis server returns the synthesized voice corresponding to the text to be synthesized;
the voice playing module is used for playing the synthesized voice and receiving a pronunciation correction request of the synthesized voice;
the text correction module is used for receiving input correction data corresponding to the synthesized voice according to the pronunciation correction request and sending the correction data to the voice synthesis server, so that the voice synthesis server updates the synthesized voice according to the correction data and returns the updated synthesized voice;
the voice playing module is further configured to replace the currently played synthesized voice with the updated synthesized voice for playing.
9. A speech synthesis playing device is characterized in that the device comprises a speech synthesis module, a speech sending module and a speech updating module, wherein,
the voice synthesis module is used for carrying out voice synthesis on the text to be synthesized according to a pre-trained voice synthesis model when receiving the text to be synthesized from the user terminal to obtain synthesized voice;
the voice issuing module is used for returning the synthesized voice to the user terminal for playing and receiving the correction data corresponding to the text to be synthesized returned by the user terminal;
the voice updating module is used for updating the synthesized voice according to the correction data to obtain the updated synthesized voice;
the voice issuing module is further configured to return the updated synthesized voice to the user terminal, so that the user terminal replaces the synthesized voice with the updated synthesized voice to play the synthesized voice.
10. A storage medium storing instructions adapted to be loaded by a processor to perform a speech synthesis playing method according to any one of claims 1 to 5 or to perform a speech synthesis playing method according to claim 6 or 7.
CN201910848598.2A 2019-09-09 2019-09-09 Voice synthesis playing method and device and storage medium Pending CN110600004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910848598.2A CN110600004A (en) 2019-09-09 2019-09-09 Voice synthesis playing method and device and storage medium


Publications (1)

Publication Number Publication Date
CN110600004A true CN110600004A (en) 2019-12-20

Family

ID=68858178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910848598.2A Pending CN110600004A (en) 2019-09-09 2019-09-09 Voice synthesis playing method and device and storage medium

Country Status (1)

Country Link
CN (1) CN110600004A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130179170A1 (en) * 2012-01-09 2013-07-11 Microsoft Corporation Crowd-sourcing pronunciation corrections in text-to-speech engines
CN103366731A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Text to speech (TTS) method and system
CN104142909A (en) * 2014-05-07 2014-11-12 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
CN105609096A (en) * 2015-12-30 2016-05-25 小米科技有限责任公司 Text data output method and device
CN106205600A (en) * 2016-07-26 2016-12-07 浪潮电子信息产业股份有限公司 Interactive Chinese text voice synthesis system and method
CN109461436A (en) * 2018-10-23 2019-03-12 广东小天才科技有限公司 Method and system for correcting pronunciation errors of voice recognition

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021135713A1 (en) * 2019-12-30 2021-07-08 华为技术有限公司 Text-to-voice processing method, terminal and server
CN113129861A (en) * 2019-12-30 2021-07-16 华为技术有限公司 Text-to-speech processing method, terminal and server
US20230045631A1 (en) * 2019-12-30 2023-02-09 Huawei Technologies Co., Ltd. Text to Speech Processing Method, Terminal, and Server
US12277924B2 (en) * 2019-12-30 2025-04-15 Huawei Technologies Co., Ltd. Text to speech processing method, terminal, and server
CN111145724A (en) * 2019-12-31 2020-05-12 出门问问信息科技有限公司 Polyphone marking method and device and computer readable storage medium
CN111145724B (en) * 2019-12-31 2022-08-19 出门问问信息科技有限公司 Polyphone marking method and device and computer readable storage medium
CN112037756A (en) * 2020-07-31 2020-12-04 北京搜狗科技发展有限公司 Voice processing method, apparatus and medium
CN111930900A (en) * 2020-09-28 2020-11-13 北京世纪好未来教育科技有限公司 Standard pronunciation generation method and related device
CN111930900B (en) * 2020-09-28 2021-09-21 北京世纪好未来教育科技有限公司 Standard pronunciation generating method and related device
CN112562638A (en) * 2020-11-26 2021-03-26 北京达佳互联信息技术有限公司 Voice preview method and device and electronic equipment
CN114495893A (en) * 2021-12-20 2022-05-13 青岛海尔科技有限公司 Voice and audio processing method, device and electronic device
CN115394282A (en) * 2022-06-01 2022-11-25 北京网梯科技发展有限公司 Information interaction method and device, teaching platform, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110600004A (en) Voice synthesis playing method and device and storage medium
CN112687259B (en) Speech synthesis method, device and readable storage medium
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
US8396714B2 (en) Systems and methods for concatenation of words in text to speech synthesis
US8352272B2 (en) Systems and methods for text to speech synthesis
US8712776B2 (en) Systems and methods for selective text to speech synthesis
US8583418B2 (en) Systems and methods of detecting language and natural language strings for text to speech synthesis
US8352268B2 (en) Systems and methods for selective rate of speech and speech preferences for text to speech synthesis
CN111862942B (en) Training method and system for hybrid speech recognition model of Mandarin and Sichuan dialect
US20100082328A1 (en) Systems and methods for speech preprocessing in text to speech synthesis
JP2018537727A5 (en)
CN110765270B (en) Training method and system of text classification model for spoken language interaction
CN116072098B (en) Audio signal generation method, model training method, device, equipment and medium
US8645141B2 (en) Method and system for text to speech conversion
CN114283786A (en) Speech recognition method, device and computer readable storage medium
CN114254649A (en) A language model training method, device, storage medium and device
CN111354325B (en) Automatic word and song creation system and method thereof
CN118245600B (en) Digital thinking course knowledge graph construction method and related device
CN118471191A (en) Audio generation method, model training method, device, equipment and storage medium
WO2025035667A1 (en) Audio generation method, apparatus and device, and storage medium
CN114822492B (en) Speech synthesis method and device, electronic equipment and computer readable storage medium
CN114255736B (en) Rhythm annotation method and system
CN114299909B (en) Audio data processing method, device, equipment and storage medium
KR102020341B1 (en) System for realizing score and replaying sound source, and method thereof
CN114822489A (en) Text transcription method and text transcription device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191220)