Abstract
Lip-reading is a technique for understanding speech by observing a speaker's lip movements. It has numerous applications; for example, it helps hearing-impaired persons and supports speech understanding in noisy environments. Most previous work on lip-reading focused on frontal and near-frontal faces, and some targeted multiple poses in high-quality videos. However, their results are not satisfactory on low-quality videos containing multiple poses. In this research work, a lip-reading framework is proposed for improving the recognition rate on low-quality videos. A Multiple Pose (MP) dataset of low-quality videos containing multiple extreme poses is built. The proposed framework decomposes the input video into frames and enhances them by applying the Contrast Limited Adaptive Histogram Equalization (CLAHE) method. Next, faces are detected in the enhanced frames, and the multiple poses are frontalized using a face-frontalization Generative Adversarial ...
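The frame-enhancement step described above rests on contrast-limited histogram equalization. The sketch below is not the paper's implementation; it illustrates only the per-tile core of CLAHE (clip the histogram, redistribute the excess, equalize via the CDF), omitting the bilinear interpolation between neighbouring tile mappings that full CLAHE performs. The tile size, `clip_limit`, and the synthetic low-contrast tile are illustrative assumptions.

```python
import numpy as np

def clipped_equalize(tile, clip_limit=40, n_bins=256):
    """Contrast-limited histogram equalization for one 8-bit tile.

    This is the per-tile step of CLAHE; a full implementation also
    bilinearly interpolates mappings of neighbouring tiles to avoid
    block artifacts.
    """
    # Histogram of the tile's intensities.
    hist, _ = np.histogram(tile, bins=n_bins, range=(0, n_bins))
    # Clip each bin at clip_limit and spread the clipped mass
    # uniformly over all bins (this limits contrast amplification).
    excess = np.maximum(hist - clip_limit, 0).sum()
    hist = np.minimum(hist, clip_limit) + excess // n_bins
    # Equalize: map intensities through the normalized CDF.
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) * (n_bins - 1) / max(cdf[-1] - cdf.min(), 1)
    return cdf.astype(np.uint8)[tile]

# Usage: enhance a synthetic low-contrast grayscale tile.
rng = np.random.default_rng(0)
tile = np.clip(rng.normal(100, 10, (8, 8)), 0, 255).astype(np.uint8)
out = clipped_equalize(tile)
```

After equalization, the tile's intensities are stretched toward the full 0-255 range, which is what makes facial landmarks easier to detect in low-quality frames.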