
System fusion for high-performance voice conversion

Interspeech 2015

https://doi.org/10.21437/INTERSPEECH.2015-581

Abstract

A number of voice conversion methods have been developed in recent years. These methods attempt to improve conversion performance by using diverse mapping techniques in various acoustic domains, e.g., high-resolution spectra and low-resolution Mel-cepstral coefficients. Each individual method has its own pros and cons. In this paper, we introduce a system fusion framework that combines the strengths of these state-of-the-art, and even potential future, conversion methods. For instance, methods delivering high speech quality can be fused with methods that capture speaker characteristics well, yielding a further performance gain. To examine the feasibility of the proposed framework, we select two state-of-the-art methods, Gaussian mixture model and frequency warping based systems, as a case study. Experimental results show that the fusion system outperforms each individual method in both objective and subjective evaluations, demonstrating the effectiveness of the proposed fusion framework.
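The abstract does not specify how the outputs of the two converters are combined, so the following is only an illustrative sketch of one common fusion strategy: weighted interpolation of the converted magnitude spectra in the log-spectral domain. The function name `fuse_spectra`, the weight `alpha`, and the assumption that both converters produce frame-aligned magnitude spectra are all hypothetical, not taken from the paper.

```python
import numpy as np

def fuse_spectra(spec_a, spec_b, alpha=0.5):
    """Fuse two converted magnitude spectra (frames x bins) by
    weighted interpolation in the log-spectral domain.

    spec_a, spec_b: outputs of two converters, e.g. a GMM-based
    and a frequency-warping-based system (hypothetical setup).
    alpha: fusion weight given to spec_a, in [0, 1].
    """
    # floor the spectra to avoid log(0), then interpolate in log domain
    log_a = np.log(np.maximum(spec_a, 1e-10))
    log_b = np.log(np.maximum(spec_b, 1e-10))
    return np.exp(alpha * log_a + (1.0 - alpha) * log_b)

# toy example: two constant 3-frame, 4-bin spectra
a = np.full((3, 4), 2.0)
b = np.full((3, 4), 8.0)
fused = fuse_spectra(a, b, alpha=0.5)
# with alpha = 0.5 this is the per-bin geometric mean, i.e. 4.0 here
```

Interpolating in the log domain keeps the fused spectrum positive and behaves like a per-bin geometric mean; in practice `alpha` could be tuned (or made frequency-dependent) to trade off the quality and speaker-similarity strengths of the two systems.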
