Vector of Locally-Aggregated Word Embeddings (
2019, Proceedings of the 2019 Conference of the North
https://doi.org/10.18653/V1/N19-1033
Abstract
In this paper, we propose a novel representation for text documents based on aggregating word embedding vectors into document embeddings. Our approach is inspired by the Vector of Locally-Aggregated Descriptors used for image representation, and it works as follows. First, the word embeddings gathered from a collection of documents are clustered by k-means in order to learn a codebook of semantically related word embeddings. Each word embedding is then associated with its nearest cluster centroid (codeword). The Vector of Locally-Aggregated Word Embeddings (VLAWE) representation of a document is then computed by accumulating the differences between each codeword vector and each word vector (from the document) associated with the respective codeword. We plug the VLAWE representation, which is learned in an unsupervised manner, into a classifier and show that it is useful for a diverse set of text classification tasks. We compare our approach with a broad range of recent state-of-the-art methods, demonstrating the effectiveness of our approach. Furthermore, we obtain a considerable improvement on the Movie Review data set, reporting an accuracy of 93.3%, which represents an absolute gain of 10% over the state-of-the-art approach. Our code is available at https:
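The three steps described in the abstract (k-means codebook, nearest-codeword assignment, residual accumulation) can be illustrated with a minimal sketch. This is not the authors' released code: the function names `kmeans` and `vlawe`, the toy 2-dimensional vectors, and the sign convention (word vector minus codeword) are assumptions made for illustration. Any normalization or classifier stage is left out because the abstract does not specify it.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centroids):
    """Index of the centroid (codeword) closest to the given vector."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(vector, centroids[j]))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids that form the codebook."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(v, centroids)].append(v)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def vlawe(doc_vectors, centroids):
    """Accumulate per-codeword residuals and concatenate them.

    Assumed sign convention: residual = word vector - codeword.
    The result has len(centroids) * dim components.
    """
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for v in doc_vectors:
        j = nearest(v, centroids)
        for d in range(dim):
            residuals[j][d] += v[d] - centroids[j][d]
    return [x for row in residuals for x in row]


# Toy word embeddings pooled from a hypothetical document collection.
corpus_vectors = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
codebook = sorted(kmeans(corpus_vectors, k=2))  # sorted only for a stable order

# Toy word embeddings of a single document.
document_vectors = [[1.0, 1.0], [10.0, 0.0], [12.0, -1.0]]
document_embedding = vlawe(document_vectors, codebook)
print(document_embedding)
```

With k codewords and d-dimensional word embeddings, the resulting document vector has k × d components, so the codebook size directly controls the dimensionality of the representation passed to the classifier.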