A Topical Word Embeddings for Text Classification
Proceedings of the XV Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2018)
Abstract
This paper presents an approach that uses LDA-based topic models to represent documents in text categorization problems. Each document is represented by the cosine similarity between its document embedding and the embeddings of the words of each topic, producing a Bag-of-Topics (BoT) variant. The performance of this representation is compared against two others: Bag-of-Words (BoW) and Topic Model, both based on standard tf-idf weighting. To isolate the effect of the classifier, we also compare the nonlinear SVM classifier against the linear Naive Bayes classifier, taken as baseline. The approach is evaluated on two corpora, one multi-label (RCV1) and one single-label (20 Newsgroups). The model achieves results comparable to the state of the art while using a much lower-dimensional representation.
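To make the BoT construction concrete, the sketch below follows the recipe in the abstract: an LDA model supplies each topic's top words, a word2vec model embeds words, each topic is embedded as the average vector of its top words, and each document is scored by its cosine similarity to every topic vector, yielding one feature per topic. The libraries (gensim 4.x, scikit-learn), hyper-parameters, toy corpus, and classifier settings are illustrative assumptions; the abstract does not specify the authors' exact pipeline.

```python
# Minimal sketch of a Bag-of-Topics (BoT) representation (assumed pipeline, not
# the authors' exact configuration).
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec
from sklearn.svm import SVC


def cosine(a, b, eps=1e-10):
    # Cosine similarity with a small epsilon guarding against zero vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def topic_vectors(lda, w2v, topn=10):
    # Embed each LDA topic as the mean word2vec vector of its top-n words.
    vecs = []
    for k in range(lda.num_topics):
        words = [w for w, _ in lda.show_topic(k, topn=topn) if w in w2v.wv]
        vecs.append(np.mean([w2v.wv[w] for w in words], axis=0))
    return np.vstack(vecs)


def doc_vector(tokens, w2v):
    # Embed a document as the mean vector of its in-vocabulary tokens.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)


def bot_features(docs, lda, w2v, topn=10):
    # K-dimensional BoT representation: one cosine similarity per topic.
    topics = topic_vectors(lda, w2v, topn)
    return np.array([[cosine(doc_vector(d, w2v), t) for t in topics] for d in docs])


# Toy usage; replace with tokenized 20 Newsgroups or RCV1 documents.
docs = [["space", "nasa", "orbit"], ["game", "team", "score"], ["court", "law", "case"]]
labels = [0, 1, 2]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=3, passes=10, random_state=0)
w2v = Word2Vec(docs, vector_size=50, min_count=1, seed=0)

X = bot_features(docs, lda, w2v)         # shape: (n_documents, n_topics)
clf = SVC(kernel="rbf").fit(X, labels)   # nonlinear SVM on the compact BoT features
```

In this form the document representation has only one dimension per topic, which is how the approach keeps the feature space small relative to a full BoW vocabulary.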
References (11)
- Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022.
- Kim, H. K., Kim, H., and Cho, S. (2017). Bag-of-concepts: Comprehending document representation through clustering words in distributed representation. Neurocomputing, 266:336-352.
- Lau, J. H., Grieser, K., Newman, D., and Baldwin, T. (2011). Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 1536-1545. Association for Computational Linguistics.
- Li, S., Chua, T.-S., Zhu, J., and Miao, C. (2016). Generative Topic Embedding: a Continuous Representation of Documents (Extended Version with Proofs).
- Liu, Y., Liu, Z., Chua, T.-S., and Sun, M. (2015). Topical Word Embeddings. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI'15), pages 2418-2424.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc.
- Mouriño-García, M., Pérez-Rodríguez, R., and Anido-Rifón, L. (2015). Bag-of-Concepts Document Representation for Textual News Classification. 6(1):173-188.
- Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 248-256. Association for Computational Linguistics.
- Rubin, T. N., Chambers, A., Smyth, P., and Steyvers, M. (2012). Statistical topic models for multi-label document classification. Machine Learning, 88(1-2):157-208.
- Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
- Sriurai, W. (2011). Improving Text Categorization by using a Topic Model. Advanced Computing, 2(6):21-27.