
Robust One Shot Audio to Video Generation

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

https://doi.org/10.1109/CVPRW50498.2020.00393

Abstract

Audio to Video generation is an interesting problem with numerous applications across industry verticals including filmmaking, multimedia, marketing, education and others. High-quality video generation with expressive facial movements is a challenging problem that involves complex learning steps for generative adversarial networks. Further, enabling one-shot learning from a single unseen image increases the complexity of the problem while simultaneously making it more applicable to practical scenarios. In this paper, we propose a novel approach, OneShotA2V, to synthesize a talking person video of arbitrary length using as input an audio signal and a single unseen image of a person. OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person. Further, it feeds the features generated from the audio input directly into a generative adversarial network, and it adapts to any given u...
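To make the data flow the abstract describes more concrete, below is a minimal PyTorch sketch of the inference loop: audio features for each time window and an encoding of the single reference image condition a per-frame generator. The module names (AudioEncoder, IdentityEncoder, FrameGenerator), layer sizes, and input shapes are illustrative assumptions, not the architecture reported in the paper.

```python
# Sketch of a one-shot audio-to-video generation loop: each audio window plus a
# fixed identity code (from one unseen image) is decoded into one video frame.
# All architectural details here are assumptions for illustration only.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a short window of mel-spectrogram features into a vector."""
    def __init__(self, n_mels=80, frames=16, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * frames, 512), nn.ReLU(),
            nn.Linear(512, dim),
        )

    def forward(self, mel):            # mel: (B, n_mels, frames)
        return self.net(mel)

class IdentityEncoder(nn.Module):
    """Encodes the single reference image of the unseen speaker."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),   # 64 -> 32
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, img):            # img: (B, 3, 128, 128)
        return self.net(img)

class FrameGenerator(nn.Module):
    """Decodes the concatenated audio + identity code into one video frame."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(dim, 64 * 16 * 16)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(),  # 32 -> 64
            nn.ConvTranspose2d(16, 3, 4, 2, 1), nn.Tanh(),   # 64 -> 128
        )

    def forward(self, code):
        x = self.fc(code).view(-1, 64, 16, 16)
        return self.deconv(x)

def synthesize(audio_windows, ref_image):
    """Generate one frame per audio window, all conditioned on the same image."""
    audio_enc, id_enc, gen = AudioEncoder(), IdentityEncoder(), FrameGenerator()
    id_code = id_enc(ref_image)                        # (1, 256)
    frames = []
    for mel in audio_windows:                          # each: (1, 80, 16)
        code = torch.cat([audio_enc(mel), id_code], dim=1)
        frames.append(gen(code))                       # (1, 3, 128, 128)
    return torch.stack(frames, dim=1)                  # (1, T, 3, 128, 128)

if __name__ == "__main__":
    windows = [torch.randn(1, 80, 16) for _ in range(4)]       # 4 audio windows
    video = synthesize(windows, torch.randn(1, 3, 128, 128))   # single unseen image
    print(video.shape)  # torch.Size([1, 4, 3, 128, 128])
```

The sketch only shows the inference-time data flow; in the approach the abstract describes, the generator would be trained adversarially, with curriculum learning stages governing how the expressive facial movements are learned.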
