Cross-Media Learning for Image Sentiment Analysis in the Wild
2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
https://doi.org/10.1109/ICCVW.2017.45

Abstract
Much progress has been made in the field of sentiment analysis in the past years. For a long time, researchers relied on textual data for this task, and only recently have they started investigating approaches to predict sentiment from multimedia content. With the increasing amount of data shared on social media, there is also a rapidly growing interest in approaches that work "in the wild", i.e. that are able to deal with uncontrolled conditions. In this work, we faced the challenge of training a visual sentiment classifier starting from a large set of user-generated and unlabeled content. In particular, we collected more than 3 million tweets containing both text and images, and we leveraged the sentiment polarity of the textual content to train a visual sentiment classifier. To the best of our knowledge, this is the first time that a cross-media learning approach has been proposed and tested in this context. We assessed the validity of our model by conducting comparative studies and evaluations on a benchmark for visual sentiment analysis. Our empirical study shows that although the text associated with each image is often noisy and only weakly correlated with the image content, it can be profitably exploited to train a deep Convolutional Neural Network that effectively predicts the sentiment polarity of previously unseen images.
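To make the cross-media idea concrete, below is a minimal sketch, assuming a Keras/TensorFlow setup with an ImageNet-pretrained VGG16 backbone and a hypothetical text polarity classifier (`text_sentiment`); the image-loading side is likewise a placeholder. The text model assigns weak negative/neutral/positive labels to the tweet texts, and those labels supervise the training of the image CNN. This is only an illustration of distant supervision from text to images, not the authors' exact pipeline.

```python
# Sketch of cross-media (distant-supervision) training: a text sentiment model
# provides weak polarity labels for tweet images, which are then used to train
# a visual sentiment classifier. `text_sentiment` is a hypothetical placeholder
# for any text polarity classifier returning 0 (NEG), 1 (NEU), or 2 (POS).

import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 3  # negative / neutral / positive


def build_visual_classifier():
    # VGG16 backbone pre-trained on ImageNet; only the new classification
    # head is trained in this first pass (the backbone is frozen).
    base = VGG16(weights="imagenet", include_top=False, pooling="avg",
                 input_shape=(224, 224, 3))
    base.trainable = False
    head = layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
    model = models.Model(base.input, head)
    model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def weak_labels(texts, text_sentiment):
    # The text classifier's predictions become (noisy) training labels
    # for the images attached to the same tweets.
    return np.array([text_sentiment(t) for t in texts])


# Usage, assuming `images` is an array of preprocessed 224x224x3 tweet images
# aligned one-to-one with their tweet `texts`:
#   labels = weak_labels(texts, text_sentiment)
#   model = build_visual_classifier()
#   model.fit(images, labels, batch_size=64, epochs=5)
```

The underlying design choice is that label noise from the text side is tolerable when the CNN sees a very large number of weakly labeled images; in practice one would typically follow this with a second pass that unfreezes the upper convolutional blocks for fine-tuning.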