

Audiovisual speech synchrony measure: application to biometrics

2007, EURASIP Journal on Applied Signal Processing

https://doi.org/10.1155/2007/70186

Abstract

Speech is a means of communication which is intrinsically bimodal: the audio signal originates from the dynamics of the articulators. This paper reviews recent work in the field of audiovisual speech and, more specifically, techniques developed to measure the level of correspondence between audio and visual speech. It covers the most common audio and visual speech front-end processing, transformations performed on audio, visual, or joint audiovisual feature spaces, and the actual measure of correspondence between audio and visual speech. Finally, the use of a synchrony measure for biometric identity verification based on talking faces is evaluated on the BANCA database.
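The simplest family of correspondence measures the paper surveys correlates a one-dimensional audio feature stream with a one-dimensional visual feature stream frame by frame. The sketch below is a minimal illustration of that idea using Pearson correlation on synthetic data; the feature choices (audio energy, a lip-motion proxy) and all variable names are illustrative assumptions, not the front-ends actually used in the paper.

```python
import numpy as np

def synchrony_score(audio_feat: np.ndarray, visual_feat: np.ndarray) -> float:
    """Pearson correlation between one audio and one visual feature stream.

    A stand-in for the richer correspondence measures surveyed in the
    paper (mutual-information, CCA, or co-inertia based); both inputs are
    assumed to be sampled at the same frame rate and aligned in time.
    """
    # Standardize each stream (zero mean, unit variance), then average
    # the element-wise products: this is the Pearson correlation.
    a = (audio_feat - audio_feat.mean()) / (audio_feat.std() + 1e-12)
    v = (visual_feat - visual_feat.mean()) / (visual_feat.std() + 1e-12)
    return float(np.mean(a * v))

# Toy data: a visual stream that follows the audio energy should score
# higher than an unrelated one.
rng = np.random.default_rng(0)
audio = rng.random(200)                                # per-frame audio energy
visual_sync = audio + 0.1 * rng.standard_normal(200)   # lips tracking the audio
visual_rand = rng.random(200)                          # unrelated motion

print(synchrony_score(audio, visual_sync))
print(synchrony_score(audio, visual_rand))
```

In a genuine liveness or identity-verification setting, the score would be thresholded (or fed to a classifier) to decide whether the observed face is actually producing the recorded speech.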

References (41)

  1. G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-visual automatic speech recognition: an overview," in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds., chapter 10, MIT Press, Cambridge, Mass, USA, 2004.
  2. T. Chen, "Audiovisual speech processing," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9-21, 2001.
  3. C. C. Chibelushi, F. Deravi, and J. S. Mason, "A review of speech-based bimodal recognition," IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23-37, 2002.
  4. J. P. Barker and F. Berthommier, "Evidence of correlation between acoustic and visual features of speech," in Proceedings of the 14th International Congress of Phonetic Sciences (ICPhS '99), pp. 199-202, San Francisco, Calif, USA, August 1999.
  5. H. Yehia, P. Rubin, and E. Vatikiotis-Bateson, "Quantitative association of vocal-tract and facial behavior," Speech Communication, vol. 26, no. 1-2, pp. 23-43, 1998.
  6. E. Bailly-Baillière, S. Bengio, F. Bimbot, et al., "The BANCA database and evaluation protocol," in Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA '03), vol. 2688 of Lecture Notes in Computer Science, pp. 625-638, Springer, Guildford, UK, January 2003.
  7. J. Hershey and J. Movellan, "Audio-vision: using audio-visual synchrony to locate sounds," in Advances in Neural Information Processing Systems 11, M. S. Kearns, S. A. Solla, and D. A. Cohn, Eds., pp. 813-819, MIT Press, Cambridge, Mass, USA, 1999.
  8. H. Bredin, A. Miguel, I. H. Witten, and G. Chollet, "Detecting replay attacks in audiovisual identity verification," in Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), vol. 1, pp. 621-624, Toulouse, France, May 2006.
  9. J. W. Fisher III and T. Darrell, "Speaker association with signal-level audiovisual fusion," IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 406-413, 2004.
  10. G. Chetty and M. Wagner, ""Liveness" verification in audio-video authentication," in Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST '04), pp. 358-363, Sydney, Australia, December 2004.
  11. R. Cutler and L. Davis, "Look who's talking: speaker detection using video and audio correlation," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '00), vol. 3, pp. 1589-1592, New York, NY, USA, July-August 2000.
  12. G. Iyengar, H. J. Nock, and C. Neti, "Audio-visual synchrony for detection of monologues in video archives," in Proceedings of IEEE International Conference on Multimedia and Expo (ICME '03), vol. 1, pp. 329-332, Baltimore, Md, USA, July 2003.
  13. H. J. Nock, G. Iyengar, and C. Neti, "Assessing face and speech consistency for monologue detection in video," in Proceedings of the 10th ACM International Conference on Multimedia (MULTIMEDIA '02), pp. 303-306, Juan-les-Pins, France, December 2002.
  14. M. Slaney and M. Covell, "FaceSync: a linear operator for measuring synchronization of video facial images and audio tracks," in Advances in Neural Information Processing Systems 13, pp. 814-820, MIT Press, Cambridge, Mass, USA, 2000.
  15. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.
  16. N. Sugamura and F. Itakura, "Speech analysis and synthesis methods developed at ECL in NTT: from LPC to LSP," Speech Communication, vol. 5, no. 2, pp. 199-215, 1986.
  17. C. Bregler and Y. Konig, ""Eigenlips" for robust speech recognition," in Proceedings of the 19th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), vol. 2, pp. 669-672, Adelaide, Australia, April 1994.
  18. M. Turk and A. Pentland, "Eigenfaces for recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
  19. R. Goecke and B. Millar, "Statistical analysis of the relationship between audio and video speech parameters for Australian English," in Proceedings of the ISCA Tutorial and Research Workshop on Audio Visual Speech Processing (AVSP '03), pp. 133-138, Saint-Jorioz, France, September 2003.
  20. N. Eveno and L. Besacier, "A speaker independent "liveness" test for audio-visual biometrics," in Proceedings of the 9th European Conference on Speech Communication and Technology (EuroSpeech '05), pp. 3081-3084, Lisbon, Portugal, September 2005.
  21. N. Eveno and L. Besacier, "Co-inertia analysis for "liveness" test in audio-visual biometrics," in Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA '05), pp. 257-261, Zagreb, Croatia, September 2005.
  22. N. Fox and R. B. Reilly, "Audio-visual speaker identification based on the use of dynamic audio and visual features," in Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA '03), vol. 2688 of Lecture Notes in Computer Science, pp. 743-751, Springer, Guildford, UK, June 2003.
  23. C. C. Chibelushi, J. S. Mason, and F. Deravi, "Integrated person identification using voice and facial features," in IEE Colloquium on Image Processing for Security Applications, vol. 4, pp. 1-5, London, UK, March 1997.
  24. A. Hyvärinen, "Survey on independent component analysis," Neural Computing Surveys, vol. 2, pp. 94-128, 1999.
  25. D. Sodoyer, L. Girin, C. Jutten, and J.-L. Schwartz, "Speech extraction based on ICA and audio-visual coherence," in Proceedings of the 7th International Symposium on Signal Processing and Its Applications (ISSPA '03), vol. 2, pp. 65-68, Paris, France, July 2003.
  26. P. Smaragdis and M. Casey, "Audio/visual independent components," in Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA '03), pp. 709-714, Nara, Japan, April 2003.
  27. ICA, http://www.cis.hut.fi/projects/ica/fastica/.
  28. Canonical Correlation Analysis, http://people.imt.liu.se/~magnus/cca/.
  29. S. Dolédec and D. Chessel, "Co-inertia analysis: an alternative method for studying species-environment relationships," Freshwater Biology, vol. 31, pp. 277-294, 1994.
  30. J. W. Fisher, T. Darrell, W. T. Freeman, and P. Viola, "Learning joint statistical models for audio-visual fusion and segregation," in Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds., pp. 772-778, MIT Press, Cambridge, Mass, USA, 2001.
  31. D. Sodoyer, J.-L. Schwartz, L. Girin, J. Klinkisch, and C. Jutten, "Separation of audio-visual speech sources: a new approach exploiting the audio-visual coherence of speech stimuli," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1165-1173, 2002.
  32. L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
  33. S. Bengio, "An asynchronous hidden Markov model for audio-visual speech recognition," in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds., pp. 1213-1220, MIT Press, Cambridge, Mass, USA, 2003.
  34. H. Bredin, G. Aversano, C. Mokbel, and G. Chollet, "The BioSecure talking-face reference system," in Proceedings of the 2nd Workshop on Multimodal User Authentication (MMUA '06), Toulouse, France, May 2006.
  35. K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: the extended M2VTS database," in Proceedings of International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA '99), pp. 72-77, Washington, DC, USA, March 1999.
  36. BT-DAVID, http://eegalilee.swan.ac.uk/.
  37. S. Garcia-Salicetti, C. Beumier, G. Chollet, et al., "BIOMET: a multimodal person authentication database including face, voice, fingerprint, hand and signature modalities," in Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA '03), pp. 845-853, Guildford, UK, June 2003.
  38. A. F. Martin, G. R. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, "The DET curve in assessment of detection task performance," in Proceedings of the 5th European Conference on Speech Communication and Technology (EuroSpeech '97), vol. 4, pp. 1895-1898, Rhodes, Greece, September 1997.
  39. M. E. Sargin, E. Erzin, Y. Yemez, and A. M. Tekalp, "Multimodal speaker identification using canonical correlation analysis," in Proceedings of the 31st IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), vol. 1, pp. 613-616, Toulouse, France, May 2006.
  40. Text Retrieval Conference Video Track. http://trec.nist.gov/.
  41. H. Bredin and G. Chollet, "Audio-visual speech synchrony measure for talking-face identity verification," in Proceedings of the 32nd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '07), Honolulu, Hawaii, USA, April 2007.