GestSync: Determining who is speaking without a talking head

2023, arXiv preprint

https://doi.org/10.48550/ARXIV.2310.05304

Abstract

In this paper we introduce a new synchronisation task, Gesture-Sync: determining whether or not a person's gestures are correlated with their speech. In comparison to Lip-Sync, Gesture-Sync is far more challenging, as the relationship between voice and body movement is much looser than that between voice and lip motion. We introduce a dual-encoder model for this task, and compare a number of input representations, including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync for audio-visual synchronisation, and for determining who is speaking in a crowd without seeing their faces. The code, datasets and pre-trained models can be found at: https://www.robots.ox.ac.uk/~vgg/research/gestsync.

Figure 1: Who is speaking in these scenes? Our model, dubbed GestSync, learns to identify whether a person's gestures and speech are "in-sync". The learned embeddings from our model are used to determine "who is speaking" in the crowd, without looking at their faces. Please refer to the demo video for examples.
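
To make the dual-encoder idea concrete, the following is a minimal sketch in PyTorch, not the authors' implementation: one encoder embeds a short window of gesture keypoints, a second embeds the corresponding speech features, and the cosine similarity between the two embeddings acts as the "in-sync" score, so the active speaker in a crowd is the person whose gestures score highest against the audio. All layer sizes, input shapes, and the choice of keypoint vectors over RGB frames or keypoint images are illustrative assumptions.

# A minimal sketch (not the authors' implementation) of the dual-encoder idea
# described in the abstract. Encoder architectures, dimensions and input
# shapes are placeholder assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GestureEncoder(nn.Module):
    """Embeds a window of 2D keypoints: (batch, frames, joints * 2) -> (batch, dim)."""

    def __init__(self, num_joints: int = 25, frames: int = 25, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                  # (B, frames * joints * 2)
            nn.Linear(frames * num_joints * 2, 1024),
            nn.ReLU(),
            nn.Linear(1024, dim),
        )

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(keypoints), dim=-1)    # unit-norm embedding


class SpeechEncoder(nn.Module):
    """Embeds a mel-spectrogram window: (batch, mels, steps) -> (batch, dim)."""

    def __init__(self, mels: int = 80, steps: int = 100, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(mels * steps, 1024),
            nn.ReLU(),
            nn.Linear(1024, dim),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(spec), dim=-1)


def sync_score(gesture_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between embeddings; higher means better aligned."""
    return (gesture_emb * speech_emb).sum(dim=-1)


if __name__ == "__main__":
    g_enc, s_enc = GestureEncoder(), SpeechEncoder()

    # One audio window and three candidate people visible in the scene.
    audio = torch.randn(1, 80, 100)
    people = torch.randn(3, 25, 25 * 2)

    speech_emb = s_enc(audio)                       # (1, 512)
    gesture_emb = g_enc(people)                     # (3, 512)
    scores = sync_score(gesture_emb, speech_emb)    # (3,)

    # "Who is speaking": the person whose gestures score highest against the audio.
    print("active speaker index:", scores.argmax().item())

Because both embeddings are L2-normalised, the same score can also be thresholded to decide whether a single person's gestures and speech are in sync, or evaluated over shifted audio windows for audio-visual synchronisation, as described in the abstract.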
