
On Dual View Lipreading Using High Speed Camera

Abstract

Lipreading is receiving increasing attention from the scientific community. However, many aspects related to lipreading are still unknown or poorly understood. In this paper we present the entire data engineering process used for building a lipreading recognizer. Firstly, we provide detailed information on compiling an advanced multimodal data corpus for audio-visual speech recognition, lipreading and related domains. This corpus contains synchronized dual-view recordings acquired with a high speed camera. We paid careful attention to the language content of the corpus and to the affective state of the speakers. Secondly, we introduce several methods for extracting features from both views and discuss the problem of combining the information from the two views. While the frontal view processing largely follows the state of the art, we also contribute valuable new information and analysis for the profile view.
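As an illustration of how per-frame information from the two synchronized views might be combined, the following Python sketch performs simple feature-level (early) fusion by concatenating a frontal-view and a profile-view descriptor. The descriptors, dimensions and function names are illustrative assumptions for this sketch, not the extraction methods actually used in the paper.

# Hedged sketch of feature-level (early) fusion of dual-view lip features.
# The toy descriptors and dimensions below are assumptions, not the
# paper's actual feature extraction pipeline.
import numpy as np


def frontal_features(frame: np.ndarray, n_coeffs: int = 16) -> np.ndarray:
    """Toy frontal-view descriptor: low-order spectral coefficients of a
    mean-pooled row profile of the mouth region (stand-in for a DCT)."""
    roi = frame.astype(np.float64)
    row_profile = roi.mean(axis=1)        # collapse columns of the ROI
    spectrum = np.fft.rfft(row_profile)
    return np.abs(spectrum[:n_coeffs])


def profile_features(frame: np.ndarray, n_coeffs: int = 8) -> np.ndarray:
    """Toy profile-view descriptor: low-order spectral coefficients of a
    mean-pooled column profile (rough lip-contour statistics)."""
    roi = frame.astype(np.float64)
    col_profile = roi.mean(axis=0)        # collapse rows of the ROI
    spectrum = np.fft.rfft(col_profile)
    return np.abs(spectrum[:n_coeffs])


def fuse_views(frontal_frame: np.ndarray, profile_frame: np.ndarray) -> np.ndarray:
    """Early fusion: concatenate per-frame feature vectors from both views.
    Synchronized high-speed recordings make this frame-level alignment trivial."""
    return np.concatenate([frontal_features(frontal_frame),
                           profile_features(profile_frame)])


if __name__ == "__main__":
    # Two synthetic, already-synchronized 64x64 mouth-region frames.
    rng = np.random.default_rng(0)
    front = rng.integers(0, 256, size=(64, 64))
    side = rng.integers(0, 256, size=(64, 64))
    print(fuse_views(front, side).shape)  # (24,) combined observation vector

Early fusion of this kind relies on the frame-level synchronization of the two views; a late-fusion alternative would instead combine the outputs of separate per-view recognizers.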
