Detecting audio-visual synchrony using deep neural networks
Interspeech 2015
https://doi.org/10.21437/INTERSPEECH.2015-201

Abstract
In this paper, we address the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not. This is of interest in a wide range of applications, for example spoof detection in biometrics, lip-syncing, speaker detection and diarization in multi-subject videos, and video data quality assurance. In our adopted approach, we investigate the use of deep neural networks (DNNs) for this purpose. The proposed synchrony DNNs operate directly on audio and visual features over relatively wide contexts, or, alternatively, on appropriate hidden (bottleneck) or output layers of DNNs trained for single-modal or audio-visual automatic speech recognition. In all cases, the synchrony DNN classes consist of the "in-sync" and a number of "out-of-sync" targets, the latter considered at multiples of ± 30 msec steps of overall asynchrony between the two modalities. We apply the proposed approach on two multi-subject audio-visual databases, one of high-quality data recorded in studio-like conditions, and one of data recorded by smart cell-phone devices. On both sets, and under a speaker-independent experimental framework, we are able to achieve very low equal-error-rates in distinguishing "in-sync" from "out-of-sync" data.
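As a rough illustration of the classification setup described in the abstract, the sketch below defines a feed-forward synchrony DNN over stacked audio-visual features, with one output class per asynchrony level (0 ms plus multiples of ±30 ms). This is a minimal PyTorch sketch under assumed settings: the feature dimensions, context width, layer sizes, offset set, and the feature-shifting helper are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a synchrony DNN in the spirit of the abstract above.
# All dimensions, layer sizes, offsets, and the shifting scheme are assumptions.
import torch
import torch.nn as nn

AUDIO_DIM = 40        # assumed per-frame audio feature size
VISUAL_DIM = 30       # assumed per-frame visual feature size
CONTEXT = 11          # assumed number of stacked frames (wide temporal context)
# Classes: "in-sync" (0 ms) plus out-of-sync targets at multiples of +/- 30 ms.
OFFSETS_MS = [-90, -60, -30, 0, 30, 60, 90]
NUM_CLASSES = len(OFFSETS_MS)

class SynchronyDNN(nn.Module):
    """Feed-forward DNN over stacked audio-visual feature windows."""
    def __init__(self, hidden=512):
        super().__init__()
        in_dim = CONTEXT * (AUDIO_DIM + VISUAL_DIM)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, NUM_CLASSES),   # one logit per asynchrony class
        )

    def forward(self, x):
        # x: (batch, CONTEXT * (AUDIO_DIM + VISUAL_DIM)) stacked features
        return self.net(x)

def make_training_example(audio, visual, offset_frames, center):
    """Stack a context window of audio frames around `center` together with
    visual frames shifted by `offset_frames`, simulating a known asynchrony.
    `audio`/`visual` are (num_frames, dim) tensors at a common frame rate
    (e.g., at 100 fps, a 30 ms step corresponds to 3 frames)."""
    half = CONTEXT // 2
    a = audio[center - half : center + half + 1]
    v = visual[center - half + offset_frames : center + half + 1 + offset_frames]
    return torch.cat([a, v], dim=1).reshape(-1)
```

In such a setup, an "in-sync" versus "out-of-sync" decision could be obtained by comparing the posterior of the 0 ms class against the remaining classes, with the decision threshold chosen on held-out data (e.g., at the equal-error-rate operating point); this scoring rule is likewise an assumption for illustration.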