Audio-Visual Speech Super-Resolution

2021

Abstract

In this paper, we present an audio-visual model to perform speech super-resolution at large scale factors (8× and 16×). Previous works attempted to solve this problem using only the audio modality as input and were thus limited to low scale factors of 2× and 4×. In contrast, we propose to incorporate both visual and auditory signals to super-resolve speech with sampling rates as low as 1 kHz. In such challenging situations, the visual features assist in learning the content and improve the quality of the generated speech. Further, we demonstrate the applicability of our approach to arbitrary speech signals where the visual stream is not accessible. Our “pseudo-visual network” precisely synthesizes the visual stream solely from the low-resolution speech input. Extensive experiments and the demo video illustrate our method’s remarkable results and benefits over state-of-the-art audio-only speech super-resolution approaches.

Figure 1: We present an audio-visual model for super-resolving v...
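As a minimal illustration (not from the paper), the sketch below shows how a scale factor relates the low- and high-resolution sampling rates when constructing training pairs for speech super-resolution: at a 16× factor, a 16 kHz target corresponds to 1 kHz input audio. The function name, the 16 kHz target rate, and the use of polyphase resampling are assumptions made for this example only.

```python
# Illustrative sketch (assumptions, not the paper's pipeline): simulate a
# low-resolution input from a high-resolution waveform at a given scale factor.
import numpy as np
from scipy.signal import resample_poly

def make_lr_hr_pair(wav: np.ndarray, sr_high: int = 16000, scale: int = 16):
    """Downsample a high-resolution waveform to fake a low-resolution input.

    sr_high and scale are illustrative defaults; a super-resolution model
    would then be trained to recover `wav` from the returned `lr` signal.
    """
    sr_low = sr_high // scale                  # e.g. 16000 / 16 = 1000 Hz
    lr = resample_poly(wav, up=1, down=scale)  # anti-aliased downsampling
    return lr, sr_low

# Example: one second of audio at 16 kHz becomes 1000 samples at 1 kHz.
hr = np.random.randn(16000).astype(np.float32)
lr, sr_low = make_lr_hr_pair(hr)
assert sr_low == 1000 and len(lr) == 1000
```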
