
Adaptive Fusion Techniques for Multimodal Data

2021

Abstract

Effective fusion of data from multiple modalities, such as video, speech, and text, is challenging due to the heterogeneous nature of multimodal data. In this paper, we propose adaptive fusion techniques that aim to model context from different modalities effectively. Instead of defining a deterministic fusion operation, such as concatenation, for the network, we let the network decide “how” to combine a given set of multimodal features more effectively. We propose two networks: 1) Auto-Fusion, which learns to compress information from different modalities while preserving the context, and 2) GAN-Fusion, which regularizes the learned latent space given context from complementing modalities. A quantitative evaluation on the tasks of multimodal machine translation and emotion recognition suggests that our lightweight, adaptive networks can better model context from other modalities than existing methods, many of which employ massive transformer-based networks.
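The abstract describes the two networks only at a high level, so the sketch below is a rough illustration of the Auto-Fusion idea rather than the authors' exact model: an encoder compresses the concatenated modality features into a fused latent vector, a decoder reconstructs the concatenation, and a reconstruction loss pushes the fused vector to preserve context from every modality. The class name, layer sizes, activation, and MSE objective are all assumptions made for the example.

```python
# Hypothetical sketch of an Auto-Fusion-style module in PyTorch.
# Layer sizes, the Tanh activation, and the MSE reconstruction loss are
# assumptions; this is an illustration, not the authors' published model.
import torch
import torch.nn as nn


class AutoFusion(nn.Module):
    """Compress concatenated modality features into a fused latent vector,
    using a reconstruction loss to keep the latent informative."""

    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(input_dim, latent_dim),
            nn.Tanh(),
        )
        self.decode = nn.Linear(latent_dim, input_dim)
        self.criterion = nn.MSELoss()

    def forward(self, text_feat, audio_feat, video_feat):
        # Concatenation is only the starting point; the network learns what
        # to keep when squeezing it into the smaller latent space.
        concat = torch.cat([text_feat, audio_feat, video_feat], dim=-1)
        fused = self.encode(concat)
        recon = self.decode(fused)
        recon_loss = self.criterion(recon, concat)
        return fused, recon_loss


# Usage with dummy per-modality features (batch of 8):
text = torch.randn(8, 300)
audio = torch.randn(8, 128)
video = torch.randn(8, 512)
model = AutoFusion(input_dim=300 + 128 + 512, latent_dim=256)
fused, loss = model(text, audio, video)  # fused: (8, 256)
```

For GAN-Fusion, a natural reading of the abstract is that a discriminator is additionally trained to tell the fused latent apart from a target modality's representation, so an adversarial game, rather than a fixed fusion rule, regularizes the latent space; that extension is not shown in the sketch above.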
