
One Shot Audio to Animated Video Generation

2021

Abstract

We consider the challenging problem of audio-to-animated-video generation. We propose a novel method, OneShotAu2AV, that generates an animated video of arbitrary length from an audio clip and a single unseen image of a person. The proposed method consists of two stages. In the first stage, OneShotAu2AV generates a talking-head video in the human domain given the audio and the person's image. In the second stage, the talking-head video is converted from the human domain to the animated domain. The first-stage architecture consists of a spatially-adaptive-normalization-based multi-level generator and multiple multi-level discriminators, trained with several adversarial and non-adversarial losses. The second stage leverages an attention-based, normalization-driven GAN architecture together with a temporal-predictor-based recycle loss and a blink loss, coupled with a lip-sync loss, for unsupervised generation of the animated video. In our approach, the input audio clip is not restricted to ...
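The first-stage generator is described as using spatially adaptive normalization. As a rough illustration only, the sketch below shows a minimal PyTorch block of that kind; the module structure, layer sizes, and the choice of conditioning input are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch (not the authors' code) of a spatially-adaptive
# normalization block, as used in SPADE-style generators.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    def __init__(self, feature_channels: int, cond_channels: int, hidden: int = 128):
        super().__init__()
        # Parameter-free normalization of the generator features.
        self.norm = nn.BatchNorm2d(feature_channels, affine=False)
        # A small conv net predicts per-pixel scale (gamma) and shift (beta)
        # from the conditioning map (e.g. the input person image); the exact
        # conditioning signal here is a hypothetical choice.
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gamma = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Resize the conditioning map to the feature resolution.
        cond = F.interpolate(cond, size=features.shape[2:], mode="nearest")
        h = self.shared(cond)
        # Spatially modulate the normalized features.
        return self.norm(features) * (1 + self.gamma(h)) + self.beta(h)
```

In a multi-level generator of this kind, one such block would typically sit at each resolution, so the conditioning image re-injects spatial detail that plain normalization would otherwise wash out.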
