Predicting F0 and voicing from NAM-captured whispered speech
Abstract
The NAM-to-speech conversion proposed by Toda and colleagues, which converts Non-Audible Murmur (NAM) to audible speech by statistical mapping trained on aligned corpora, is a very promising technique, but its performance is still insufficient, mainly because of the difficulty of estimating the F0 of the transformed voice from unvoiced speech. In this paper, we propose a method to improve F0 estimation and voicing decision in a NAM-to-speech conversion system based on Gaussian Mixture Models (GMM) applied to whispered speech. Instead of combining the voicing decision and F0 estimation in a single GMM, a simple feed-forward neural network detects voiced segments in the whisper, while a GMM estimates a continuous melodic contour trained on voiced segments only. The voiced/unvoiced decision error rate of the network is 6.8%, compared to 9.2% with the original system. Our proposal also reduces the F0 estimation error.
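The two-stage idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual system: the feature dimension, network size, training scheme, and synthetic data are all assumptions, and the F0 regression stage is only represented by where the voicing mask would be applied.

```python
import numpy as np

# Hedged sketch: a one-hidden-layer feed-forward network makes the per-frame
# voiced/unvoiced (V/UV) decision; F0 would then be estimated (e.g. by a GMM
# regressor, as in the paper) only on frames flagged as voiced.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VUVNet:
    """Tiny feed-forward V/UV classifier (sizes are illustrative)."""
    def __init__(self, n_in, n_hidden=8, lr=0.5):
        self.W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
        self.b2 = np.zeros(1)
        self.lr = lr

    def forward(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)
        return sigmoid(self.h @ self.W2 + self.b2).ravel()

    def train_step(self, X, y):
        p = self.forward(X)                     # predicted P(voiced)
        g = (p - y)[:, None] / len(y)           # cross-entropy gradient wrt logits
        gh = g @ self.W2.T * (1.0 - self.h**2)  # backprop through tanh
        self.W2 -= self.lr * (self.h.T @ g)
        self.b2 -= self.lr * g.sum(0)
        self.W1 -= self.lr * (X.T @ gh)
        self.b1 -= self.lr * gh.sum(0)

# Synthetic "whisper frames": 4-D features, voiced frames clustered around +1,
# unvoiced around -1 (purely illustrative stand-in for real acoustic features).
X = np.vstack([rng.normal(+1.0, 0.7, size=(200, 4)),
               rng.normal(-1.0, 0.7, size=(200, 4))])
y = np.concatenate([np.ones(200), np.zeros(200)])

net = VUVNet(n_in=4)
for _ in range(300):
    net.train_step(X, y)

voiced_mask = net.forward(X) > 0.5   # F0 regression would run on these frames
accuracy = (voiced_mask == y).mean()
print(f"V/UV accuracy on synthetic data: {accuracy:.2f}")
```

Separating the binary voicing decision from the continuous F0 regression, as the abstract describes, lets each model be trained on the data it is suited to: the classifier sees all frames, while the contour model is fitted on voiced segments only.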
References
- Toda, T.; Shikano, K., 2005. NAM-to-Speech Conversion with Gaussian Mixture Models. In Proc. Interspeech. Lisboa, 1957-1960.
- Toda, T.; Black, A.W.; Tokuda, K., 2007. Voice Conversion Based on Maximum Likelihood Estimation of Spectral Parameter Trajectory. In IEEE Transactions on Audio, Speech and Language Processing. Vol. 15, No. 8, 2222-2235.
- Ohtani, Y.; Toda, T.; Saruwatari, H.; Shikano, K., 2006. Maximum Likelihood Voice Conversion Based on GMM with STRAIGHT Mixed Excitation. In Proc. Interspeech - ICSLP. Pittsburgh, USA, 2266-2269.
- Nakagiri, M.; Toda, T.; Kashioka, H.; Shikano, K., 2006. Improving Body Transmitted Unvoiced Speech with Statistical Voice Conversion. In Proc. Interspeech - ICSLP. Pittsburgh, USA, 2270-2273.
- Nakajima, Y.; Kashioka, H.; Shikano, K.; Campbell, N., 2003. Non-audible murmur recognition. In Proc. Interspeech (Eurospeech). Geneva, Switzerland, 2601-2604.
- Heracleous, P.; Nakajima, Y., 2004. Audible (normal) speech and inaudible murmur recognition using NAM microphone. In EUSIPCO, Vienna, Austria.
- Ito, T.; Takeda, K.; Itakura, F., 2005. Analysis and recognition of whispered speech. In Speech Communication. Vol. 45, Issue 2, 139-152.
- Higashikawa, M.; Nakai, K.; Sakakura, A.; Takahashi, H., 1996. Perceived Pitch of Whispered Vowels - Relationship with formant frequencies: A preliminary study. Journal of Voice, 155-158.
- Higashikawa, M.; Minifie, F.D., 1999. Acoustical-perceptual correlates of "whispered pitch" in synthetically generated vowels. In Journal of Speech, Language, and Hearing Research. Vol. 42, 583-591.
- Stylianou, Y.; Cappé, O.; Moulines, E., 1998. Continuous probabilistic transform for voice conversion. In IEEE Trans. Speech and Audio Processing. Vol. 6, No. 2, 131-142.
- Kain, A.; Macon, M.W., 1998. Spectral voice conversion for text-to-speech synthesis. In Proc. ICASSP. Seattle, USA. Vol. 1, 285-288.
- Hueber, T.; Chollet, G.; Denby, B.; Dreyfus, G.; Stone, M., 2007. Continuous-Speech Phone Recognition from Ultrasound and Optical Images of the Tongue and Lips. In Proc. Interspeech. Antwerp, Belgium.
- Kawahara, H.; Masuda-Katsuse, I.; de Cheveigné, A., 1999. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. In Speech Communication. Vol. 27, No. 3-4, 187-207.
- Kawahara, H.; Estill, J.; Fujimura, O., 2001. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In Proc. MAVEBA. Firenze, Italy.