Joint Learning of Facial Expression and Head Pose from Speech
Interspeech 2018
https://doi.org/10.21437/INTERSPEECH.2018-2587

Abstract
Natural movement plays a significant role in realistic speech animation, and numerous studies have demonstrated the contribution visual cues make to how acceptable human observers find an animation. Natural, expressive, emotive, and prosodic speech exhibits motion patterns that vary considerably across visual modalities and are therefore difficult to predict. Recently, there have been some impressive demonstrations of face animation derived in some way from the speech signal. Each of these methods has taken a unique approach, but none has included rigid head pose in its predicted output. We observe a high degree of correspondence between facial activity and rigid head pose during speech, and exploit this observation to jointly learn full face animation together with head pose rotation and translation. From our own corpus, we train Deep Bi-Directional LSTMs (BLSTMs), capable of learning long-term structure in language, to model the relationship between speech and the complex activity of the face. We define a model architecture that encourages rigid head motion to be learned via the latent space of the speaker's facial activity. The result is a model that predicts lip sync and other facial motion, along with rigid head motion, directly from audible speech.
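To make the joint prediction concrete, the following is a minimal sketch of such a model, assuming a Keras-style implementation: a stacked BLSTM encodes the speech features, one branch emits per-frame facial-activity parameters, and a head-pose branch reads from that facial representation so pose is learned via the face's latent activity. All dimensions (acoustic feature size, number of face parameters, 6-DoF pose) and layer sizes are illustrative assumptions, not the authors' published configuration.

```python
# A minimal sketch, assuming a Keras-style stacked BLSTM; all sizes are
# illustrative assumptions, not the paper's published configuration.
from tensorflow.keras.layers import (Input, LSTM, Bidirectional, Dense,
                                     TimeDistributed)
from tensorflow.keras.models import Model

T, N_AUDIO = 100, 26    # assumed: frames per window, acoustic features per frame
N_FACE, N_POSE = 30, 6  # assumed: face-model parameters; 6-DoF pose (3 rot + 3 trans)

speech = Input(shape=(T, N_AUDIO), name="speech")

# Deep (stacked) bidirectional LSTM encoder over the acoustic sequence.
h = Bidirectional(LSTM(128, return_sequences=True))(speech)
h = Bidirectional(LSTM(128, return_sequences=True))(h)

# Facial-activity branch: per-frame face parameters.
face = TimeDistributed(Dense(N_FACE), name="face")(h)

# Head-pose branch fed from the facial-activity output, encouraging the
# model to derive rigid head motion from the face's latent activity.
p = Bidirectional(LSTM(64, return_sequences=True))(face)
pose = TimeDistributed(Dense(N_POSE), name="pose")(p)

model = Model(inputs=speech, outputs=[face, pose])
model.compile(optimizer="adam", loss="mse")  # joint MSE loss on both outputs
model.summary()
```

The key design choice this sketch illustrates is that the pose branch consumes the facial representation rather than the raw audio, which is what ties head motion to facial activity in the joint formulation.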