A Crossmodal Approach to Multimodal Fusion in Video Hyperlinking

IEEE MultiMedia

https://doi.org/10.1109/MMUL.2018.023121161

Abstract

This paper investigates the fusion of multimodal data through a crossmodal approach to video hyperlinking. It highlights the limitations of traditional aggregation strategies such as late score fusion and shows that crossmodal translation methods are more effective. Building on an analysis of single-modal representations of speech and of the visual channel, the work develops a bidirectional deep neural network (BiDNN) architecture that learns the common representation space needed for successful multimodal interaction.
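The BiDNN idea summarized above can be pictured as two symmetric translation paths (speech to visual and visual to speech) whose central layers define a shared space; a multimodal embedding is then the concatenation of the two central activations. Below is a minimal NumPy sketch of that structure with random, untrained weights — all dimensions, variable names, and the simple weight-tying scheme are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper):
# 100-d speech features, 128-d visual features, 64-d shared layer.
D_S, D_V, D_H = 100, 128, 64

# One encoder per modality into the shared central space.
W_s = rng.normal(0.0, 0.1, (D_S, D_H))  # speech -> shared
W_v = rng.normal(0.0, 0.1, (D_V, D_H))  # visual -> shared

def encode(x, W):
    # Non-linear projection into the shared representation space.
    return np.tanh(x @ W)

def translate_speech_to_visual(speech):
    # Crossmodal translation: encode speech, then decode with the
    # visual encoder's transpose (a simple tied-weights stand-in for
    # the symmetric decoder that couples the two paths).
    return encode(speech, W_s) @ W_v.T

def embed(speech, visual):
    # Multimodal embedding: concatenated central activations.
    return np.concatenate([encode(speech, W_s),
                           encode(visual, W_v)], axis=-1)

speech = rng.normal(size=(1, D_S))
visual = rng.normal(size=(1, D_V))
z = embed(speech, visual)
v_hat = translate_speech_to_visual(speech)
print(z.shape, v_hat.shape)  # (1, 128) (1, 128)
```

In the actual architecture the two translation networks are trained so that each modality can be reconstructed from the other, which is what forces the central layers into a common space; the sketch only shows the data flow, not the training.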

References (19)

  1. J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning" in Intl. Conf. on Machine Learning, 2011.
  2. F. Feng, X. Wang, and R. Li, "Cross-modal retrieval with correspondence autoencoder" in ACM Intl. Conf. on Multimedia, 2014, pp. 7-16.
  3. G. Awad, A. Butt, J. Fiscus, D. Joy, A. Delgado, M. Michel, A. F. Smeaton, Y. Graham, W. Kraaij, G. Quenot et al., "TRECVID 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking" in Proceedings of TRECVID, 2017.
  4. M. Demirdelen, M. Budnik, G. Sargent, R. Bois, and G. Gravier, "IRISA at TRECVID 2017: Beyond crossmodal and multimodal models for video hyperlinking" in Working Notes of the TRECVID 2017 Workshop, 2017.
  5. M. Campr and K. Jezek, "Comparing semantic models for evaluating automatic document summarization" in Text, Speech, and Dialogue, 2015.
  6. A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806-813.
  7. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks" in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
  8. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition" arXiv preprint arXiv:1409.1556, 2014.
  9. T. Chen and R. R. Rao, "Audio-visual integration in multimodal communication" Proceedings of the IEEE, vol. 86, no. 5, pp. 837-852, 1998.
  10. C. Guinaudeau, A. R. Simon, G. Gravier, and P. Sebillot, "HITS and IRISA at MediaEval 2013: Search and hyperlinking task" in Working Notes MediaEval Workshop, 2013.
  11. V. Vukotić, C. Raymond, and G. Gravier, "Bidirectional joint representation learning with symmetrical deep neural networks for multimodal and crossmodal applications" in Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 2016, pp. 343-346.
  12. M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. Jones, "The search and hyperlinking task at MediaEval 2014" in Working Notes MediaEval Workshop, 2014.
  13. K. Chatfield and A. Zisserman, "VISOR: Towards on-the-fly large-scale object category retrieval" in Asian Conference on Computer Vision. Springer, 2012, pp. 432-446.
  14. P. Over, G. Awad, J. Fiscus, M. Michel, D. Joy, A. Smeaton, W. Kraaij, G. Quenot, R. Ordelman, and R. Aly, "TRECVID 2015: An overview of the goals, tasks, data, evaluation mechanisms, and metrics" in Proceedings of TRECVID, 2015.
  15. L. Lamel and J.-L. Gauvain, "Speech processing for audio indexing" in Advances in Natural Language Processing. Springer, 2008, pp. 4-15.
  16. V. Vukotić, C. Raymond, and G. Gravier, "Multimodal and Crossmodal Representation Learning from Textual and Visual Features with Bidirectional Deep Neural Networks for Video Hyperlinking" in ACM Multimedia 2016 Workshop: Vision and Language Integration Meets Multimedia Fusion (iV&L-MM'16). Amsterdam, Netherlands: ACM Multimedia, Oct. 2016.
  17. X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks" in AISTATS, vol. 9, 2010, pp. 249-256.
  18. G. Awad, J. Fiscus, M. Michel, D. Joy, W. Kraaij, A. F. Smeaton, G. Quenot, M. Eskevich, R. Aly, and R. Ordelman, "TRECVID 2016: Evaluating video search, video event detection, localization, and hyperlinking" in Proceedings of TRECVID, 2016.

ABOUT THE AUTHORS

Vedran Vukotić is a Ph.D. student at IRISA and INSA Rennes. He received his B.Sc. and M.Sc. in computer science in 2012 and 2014, respectively, and his M.Sc. in nautical sciences in 2010. His main areas of interest are unsupervised methods for obtaining multimodal representations with deep learning architectures, in particular the multimodal fusion of speech and vision in video hyperlinking.
Christian Raymond received his Ph.D. in computer science in 2005 from the University of Avignon, France. In 2009 he was appointed associate professor at INSA Rennes, France. He is a member of the LinkMedia team, devoted to multimedia document analysis, at the IRISA research unit. His research focuses mainly on speech understanding and machine learning for natural language processing.