Text to speech synthesis system for mobile applications

Ramakrishnan Angarai Ganesan

doi:10.13140/RG.2.1.4560.7528

Proc. Workshop in Image and Signal Processing (WISP-2007)

Abstract

This paper discusses a Text-To-Speech (TTS) synthesis system embedded in a mobile phone. The TTS system is a unit-selection-based concatenative speech synthesizer, in which a speech unit is selected from the database based on its phonetic and prosodic context. The speech unit considered in the synthesis is larger than a phone, diphone, or syllable; usually it is a word or a phrase. While corpus-based TTS technology has significantly improved the quality of synthesized speech, there remains a practical trade-off between database size and the quality of the synthetic speech, especially in a mobile environment. Several speech compression schemes currently used in mobile phones are applied to the database. Speech is synthesized from the input text using the compressed speech in the database, and the intelligibility and naturalness of the synthesized speech are studied. Mobile phones already contain a speech codec as one of the modules in the baseband processing. The idea of this paper is to propose a methodology that uses this already available speech codec to read an SMS aloud to the listener when TTS is embedded in a mobile phone. Experimental results demonstrate the feasibility of this idea.

Table 1. Database size for various compression schemes.
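The size trade-off behind this table follows directly from the codec bit rates. The sketch below is illustrative only and does not reproduce the table's values: the bit rates are the standard ones for GSM FR, GSM EFR, and two AMR modes, while the uncompressed format (8 kHz, 16-bit PCM) and the one-hour database duration are assumptions made for the example.

```python
# Bit rates in bit/s. The codec rates are the standard ones; the
# uncompressed PCM format is an assumption, not taken from the paper.
BIT_RATES = {
    "PCM 8 kHz 16-bit (assumed)": 8000 * 16,
    "GSM FR": 13_000,
    "GSM EFR": 12_200,
    "AMR 12.2": 12_200,
    "AMR 4.75": 4_750,
}

# Hypothetical amount of recorded speech in the unit database.
HYPOTHETICAL_DURATION_S = 3600


def database_size_mb(bit_rate, duration_s):
    """Storage in megabytes (10^6 bytes) for speech coded at bit_rate."""
    return bit_rate * duration_s / 8 / 1e6


sizes = {
    name: database_size_mb(rate, HYPOTHETICAL_DURATION_S)
    for name, rate in BIT_RATES.items()
}

for name, size in sizes.items():
    print(f"{name:28s} {size:7.2f} MB")
```

Under these assumptions, the ratio between the uncompressed and compressed sizes depends only on the bit rates, so it holds for any database duration.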

The procedure for text-to-speech synthesis remains the same as discussed in Section 1, with the exception that the units selected from the database are now compressed units. During the final stage of waveform generation, the compressed speech units are decompressed, joined, and smoothed at the concatenation points. The result is then played out on the speaker. Exact compressed speech frame boundaries must be respected during synthesis; even a single-byte mismatch results in complete degradation of the synthesized speech. This is because all the compression schemes are based on linear prediction, in which correlation exists between frames and the filter memories need to be updated continuously. If the decoding algorithm resets the filter memories before decoding the initial frame of a compressed speech unit picked from the database, the resulting error propagates to all frames of the decoded speech unit. To decode this initial frame correctly, the decoder instead needs the filter memories as updated during the decoding of the previous frame; this results in error-free decoding of the compressed speech unit. A block diagram of synthesis using the compressed database is shown in Fig. 1.
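The effect of the filter memories can be illustrated with a toy all-pole synthesis filter. This is a minimal sketch and not a GSM FR, EFR, or AMR decoder: the filter coefficients, the frame length, and the excitation signal are all hypothetical values chosen for the example. It compares decoding a unit with reset memories against decoding it with the memories left by the previous frame.

```python
def synthesize(excitation, coeffs, memory):
    """Toy all-pole filter: y[n] = e[n] + sum_k coeffs[k] * y[n-1-k].

    memory holds past outputs, most recent first. Returns the output
    samples and the updated memory.
    """
    mem = list(memory)
    out = []
    for e in excitation:
        y = e + sum(a * m for a, m in zip(coeffs, mem))
        out.append(y)
        mem = [y] + mem[:-1]
    return out, mem


FRAME = 4                      # hypothetical frame length in samples
COEFFS = [0.9, -0.2]           # hypothetical stable predictor coefficients
ZERO_MEMORY = [0.0, 0.0]
excitation = [1.0, 0.0, 0.0, 0.0,    # frame 0 (precedes the unit)
              0.5, 0.0, 0.0, 0.0,    # frame 1 (first frame of the unit)
              0.0, 0.0, 0.0, 0.0]    # frame 2

# Reference: decode the whole stream continuously.
reference, _ = synthesize(excitation, COEFFS, ZERO_MEMORY)
target = reference[FRAME:]

unit = excitation[FRAME:]

# (a) Decode the unit with the filter memories reset.
reset_out, _ = synthesize(unit, COEFFS, ZERO_MEMORY)

# (b) Decode the previous frame first and carry its memories over.
_, warm_memory = synthesize(excitation[:FRAME], COEFFS, ZERO_MEMORY)
warm_out, _ = synthesize(unit, COEFFS, warm_memory)

reset_errors = [abs(a - b) for a, b in zip(reset_out, target)]
warm_errors = [abs(a - b) for a, b in zip(warm_out, target)]

print("max error, reset memories:  ", max(reset_errors))
print("max error, carried memories:", max(warm_errors))
```

In this toy setting the reset decoder deviates from the reference in every sample of the unit, while the decoder that carries the memories over reproduces it exactly. One possible design consequence, stated here as an assumption rather than the paper's method, is to decode the frame preceding each unit (or to store its filter state) before decoding the unit itself.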

Table 2. List of compression schemes used in our study.

Table 3. MOS ratings of the synthesized sentences (mean of 5 listeners).

The perception experiments showed that the TTS engine produces high-quality synthetic speech, even with a highly compressed database. This holds promise for the future, where messages in Indian languages could be read aloud on the mobile phone. Good scores were obtained for GSM FR and EFR, which means that optimized code can readily fit into GSM mobile phones. Since the synthesis quality is very good, this has high market potential. AMR schemes, however, are usually used when the communication channel is taken into consideration; we have used AMR in our experiments to understand the effect of very high compression rates on synthesized speech quality. Very high compression rates lead to a very low memory requirement for the database. It is interesting to note that the MOS of speech synthesized from compressed data is very high compared to that obtained with uncompressed data. Listeners felt that the signal synthesized from compressed data has a relatively smoother envelope. However, listeners also felt that the pitch variation of the units is good locally but needs further modification globally, where local refers to the word level and global refers to a complete sentence or phrase. Pauses need to be modeled effectively to improve the naturalness of the synthetic speech globally. The combination of a high-quality database and robust unit selection has resulted in good synthesis quality.
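For reference, a mean opinion score is the arithmetic mean of the listeners' ratings for a sentence. The sketch below assumes the common five-point scale (1 = bad, 5 = excellent), which the text above does not state explicitly, and uses hypothetical ratings from five listeners rather than the paper's data.

```python
def mean_opinion_score(ratings):
    """Arithmetic mean of listener ratings on an assumed 1-to-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1-5 scale")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from five listeners for one synthesized sentence.
hypothetical_ratings = [4, 4, 5, 3, 4]
mos = mean_opinion_score(hypothetical_ratings)
print(f"MOS = {mos:.2f}")
```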

