A Multi-View Approach To Audio-Visual Speaker Verification
2021, arXiv (Cornell University)
https://doi.org/10.48550/ARXIV.2102.06291

Abstract
Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then propose a novel approach to handle cross-modal verification at test time. Specifically, we investigate unimodal and concatenation-based AV fusion and report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset using our best system. Since these fusion methods cannot perform cross-modal verification, we introduce a multi-view model that uses a shared classifier to map audio and video into the same space. This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
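The two designs described in the abstract can be sketched compactly: concatenation-based fusion joins the two modality embeddings before scoring, while the multi-view model routes separate audio and visual encoders through a shared speaker classifier so that both modalities land in one embedding space. The PyTorch sketch below is illustrative only; the module names, embedding sizes, the linear "heads" standing in for the audio and face backbones, and the speaker count are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConcatFusionVerifier(nn.Module):
    """Concatenation-based AV fusion (sketch): audio and face embeddings
    are concatenated, projected to a joint space, and compared by cosine
    similarity at verification time."""

    def __init__(self, audio_dim=512, face_dim=512, joint_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim + face_dim, joint_dim),
            nn.BatchNorm1d(joint_dim),
            nn.ReLU(),
        )

    def forward(self, audio_emb, face_emb):
        fused = torch.cat([audio_emb, face_emb], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)


class MultiViewModel(nn.Module):
    """Multi-view model (sketch): per-modality encoders feed a *shared*
    speaker classifier, which ties audio and face embeddings to the same
    space and so allows cross-modal trials at test time."""

    def __init__(self, audio_dim=512, face_dim=512, emb_dim=256, n_speakers=1000):
        super().__init__()
        # Placeholder linear heads stand in for the audio / visual backbones.
        self.audio_head = nn.Linear(audio_dim, emb_dim)
        self.face_head = nn.Linear(face_dim, emb_dim)
        # One classifier shared by both modalities (speaker count is a placeholder).
        self.shared_classifier = nn.Linear(emb_dim, n_speakers, bias=False)

    def embed(self, feats, modality):
        head = self.audio_head if modality == "audio" else self.face_head
        return F.normalize(head(feats), dim=-1)

    def forward(self, feats, modality):
        # Shared logits are what pull the two views into a common space.
        return self.shared_classifier(self.embed(feats, modality))


# Cross-modal trial: score an audio embedding against a face embedding.
model = MultiViewModel()
audio_feat = torch.randn(1, 512)  # e.g. utterance-level speaker features
face_feat = torch.randn(1, 512)   # e.g. pooled face-CNN features
score = F.cosine_similarity(
    model.embed(audio_feat, "audio"), model.embed(face_feat, "face")
)
```

Because the classifier weights are shared, training on identity labels from both modalities encourages audio and face embeddings of the same person to be close under cosine similarity, which is what makes the cross-modal verification condition reported above possible.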