Semantic N-Gram Topic Modeling

2018, ICST Transactions on Scalable Information Systems

https://doi.org/10.4108/EAI.13-7-2018.163131

Abstract

This paper presents a novel approach to effective topic modeling. It differs from traditional vector-space topic modeling, which follows the Bag of Words (BOW) approach. The novelty of our approach lies in its phrase-based vector space: measures such as pointwise mutual information (PMI) and log-frequency-based mutual dependency (LGMD) are applied to score each phrase's suitability for a particular topic, and the best semantic N-gram phrases and terms are retained for subsequent topic modeling. In our experiments, the proposed semantic N-gram topic modeling is compared with collocation Latent Dirichlet Allocation (coll-LDA) and with the most widely used state-of-the-art technique, Latent Dirichlet Allocation (LDA). Evaluation shows that perplexity improves drastically and that the coherence score improves significantly, especially for short-text data sets such as movie reviews and political blogs.
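The phrase-selection step described above can be sketched as follows. This is a minimal illustration of PMI-based bigram scoring only, not the authors' implementation: the toy corpus, the function name `pmi_bigrams`, and the `min_count` parameter are assumptions for illustration, and the paper's LGMD measure is not shown.

```python
import math
from collections import Counter

def pmi_bigrams(docs, min_count=1):
    """Score candidate bigram phrases by pointwise mutual information:
    PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) ).
    Bigrams that co-occur more often than chance get high PMI and can be
    kept as semantic phrases in the topic model's vocabulary."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # discard rare candidates before scoring
        p_joint = count / n_bi
        p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log(p_joint / (p1 * p2))
    return scores

# Toy short-text corpus (illustrative only).
docs = [
    "the movie review praised latent dirichlet allocation",
    "latent dirichlet allocation models topics",
    "the review was short",
]
scores = pmi_bigrams(docs)
# Phrases ranked by PMI; the top-scoring ones would be treated as
# single vocabulary units for the downstream topic model.
top = sorted(scores, key=scores.get, reverse=True)
```

A genuine collocation such as "latent dirichlet" scores higher than a chance pairing such as "the review", which is the basis for keeping only the best semantic N-gram phrases before topic modeling.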
