INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
Interspeech 2022
https://doi.org/10.21437/INTERSPEECH.2022-10829
Abstract
Audio Packet Loss Concealment (PLC) is the hiding of gaps in audio streams caused by data transmission failures in packet-switched networks. This is a common problem, and one of growing importance as end-to-end VoIP telephony and teleconferencing systems become the default, ever more widely used form of communication in both business and personal settings. This paper presents the INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge. We first give an overview of the PLC problem and introduce some classical approaches to PLC as well as recent work. We then present the open-source dataset released as part of this challenge, along with the evaluation methods and metrics used to determine the winner. We also briefly introduce PLCMOS, a novel data-driven metric that can be used to quickly evaluate the performance of PLC systems. Finally, we present the results of the INTERSPEECH 2022 Audio Deep PLC Challenge and summarize the key takeaways.
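To make the PLC problem concrete, the sketch below simulates packet loss on a sample stream and applies one simple classical-style concealment: each lost packet is replaced by the previous output packet, attenuated so that bursts of consecutive losses fade toward silence. It is an illustrative baseline only, not the challenge's evaluation pipeline or any participant's method. The 16 kHz sample rate and 20 ms packet length, as well as the helper names `apply_loss` and `conceal_repeat` and the `decay` parameter, are assumptions chosen for illustration.

```python
from typing import List, Sequence

# Assumptions for illustration: 16 kHz audio split into 20 ms packets.
SAMPLE_RATE = 16_000
PACKET_MS = 20
PACKET_LEN = SAMPLE_RATE * PACKET_MS // 1000  # 320 samples per packet


def apply_loss(signal: Sequence[float], lost: Sequence[bool],
               packet_len: int = PACKET_LEN) -> List[float]:
    """Simulate packet loss: every sample belonging to a lost packet becomes 0."""
    out = list(signal)
    for i, is_lost in enumerate(lost):
        if is_lost:
            start = i * packet_len
            end = min(start + packet_len, len(out))
            for n in range(start, end):
                out[n] = 0.0
    return out


def conceal_repeat(received: Sequence[float], lost: Sequence[bool],
                   packet_len: int = PACKET_LEN, decay: float = 0.5) -> List[float]:
    """Hypothetical baseline: fill each lost packet with the previous output
    packet scaled by `decay`. Because concealed packets are reused, a burst of
    losses decays geometrically (decay, decay**2, ...) toward silence."""
    out = list(received)
    for i, is_lost in enumerate(lost):
        if not is_lost or i == 0:
            # Nothing to conceal, or no history for the first packet
            # (it is left exactly as received).
            continue
        start = i * packet_len
        end = min(start + packet_len, len(out))
        prev = out[start - packet_len:start]  # copy of the preceding packet
        for k, n in enumerate(range(start, end)):
            out[n] = decay * prev[k]
    return out


if __name__ == "__main__":
    # Toy example with 4-sample "packets" so the result is easy to read.
    signal = [float(x) for x in range(1, 13)]
    lost = [False, True, True]
    received = apply_loss(signal, lost, packet_len=4)
    print(received)   # [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    concealed = conceal_repeat(received, lost, packet_len=4)
    print(concealed)  # [1.0, 2.0, 3.0, 4.0, 0.5, 1.0, 1.5, 2.0, 0.25, 0.5, 0.75, 1.0]
```

Zero-filling (the output of `apply_loss`) produces audible dropouts, while plain repetition avoids silence but sounds buzzy or robotic over long bursts. Classical codec concealment refines this idea with pitch-synchronous repetition and smoothing at packet boundaries. Deep PLC systems such as those evaluated in this challenge instead learn to predict the missing audio from its context.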