Current trends in joint audio-video signal processing: a review

2005, Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, 2005.

https://doi.org/10.1109/ISSPA.2005.1580198

Abstract

Multimodal signal processing has gained considerable significance in recent years, owing to advances in computer technology and the availability of more sophisticated sensors. One example is the joint processing of audio and video signals in a variety of applications. This paper serves as a broad introduction to the special session on "Audio-Video Signal Processing and its Applications". It reviews current trends and developments in joint audio-video (AV) signal processing and gives an overview of current issues in theory and application in this area, focusing on speech processing, person authentication, and affective sensing as examples. An overview of available AV data corpora is also given.
