A Hybrid Model for Documents Representation

2021, International Journal of Advanced Computer Science and Applications

https://doi.org/10.14569/IJACSA.2021.0120339

Abstract

Text representation is a critical step in extracting insight from text. Many models represent text in well-defined forms such as numeric vectors, which make it easy to compute the similarity between documents using well-known distance measures. In this paper, we build a model that represents text semantically, for a single document or for multiple documents, by combining hierarchical Latent Dirichlet Allocation (hLDA), Word2vec, and Isolation Forest. The proposed model learns a vector for each document from the relationship between its word vectors and the topic hierarchy generated by hLDA. An Isolation Forest model then represents multiple documents as a single profile, which makes it easier to find documents similar to that profile. The proposed text representation model outperforms the traditional text representation models when applied ...
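The pipeline described in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's implementation: the abstract does not say how word vectors and the topic hierarchy are combined, so the weighted average below is an assumption. The toy `WORD_VECTORS`, the `TOPIC_WEIGHTS` table, and every function name are hypothetical stand-ins for trained Word2vec embeddings and hLDA output. The isolation forest is a simplified pure-Python version that builds each tree on the whole profile, without the subsampling used in the original algorithm.

```python
import math
import random

# Toy 2-d word vectors standing in for trained Word2vec embeddings (hypothetical).
WORD_VECTORS = {
    "topic": [0.9, 0.1], "model": [0.8, 0.2], "corpus": [0.7, 0.3],
    "forest": [0.1, 0.9], "tree": [0.2, 0.8],
}
# Hypothetical per-word weights standing in for information from the hLDA hierarchy.
TOPIC_WEIGHTS = {"topic": 2.0}


def document_vector(tokens):
    """Weighted average of the known word vectors (an assumed combination rule)."""
    dim = len(next(iter(WORD_VECTORS.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for tok in tokens:
        if tok in WORD_VECTORS:
            w = TOPIC_WEIGHTS.get(tok, 1.0)
            weight_sum += w
            for i, x in enumerate(WORD_VECTORS[tok]):
                total[i] += w * x
    return [x / weight_sum for x in total] if weight_sum else total


def average_path_length(n):
    """Expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, rng, depth, max_depth):
    """Recursively split on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo, hi = min(p[dim] for p in points), max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    return {
        "dim": dim, "split": split,
        "left": build_tree([p for p in points if p[dim] < split], rng, depth + 1, max_depth),
        "right": build_tree([p for p in points if p[dim] >= split], rng, depth + 1, max_depth),
    }


def path_length(point, node):
    """Depth at which the point lands, plus a correction for unsplit leaves."""
    depth = 0
    while "split" in node:
        node = node["left"] if point[node["dim"]] < node["split"] else node["right"]
        depth += 1
    return depth + average_path_length(node["size"])


def fit_profile(vectors, n_trees=200, seed=0):
    """Represent a set of document vectors (at least two) as a forest of trees."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(max(len(vectors), 2)))
    return [build_tree(vectors, rng, 0, max_depth) for _ in range(n_trees)]


def anomaly_score(point, trees, sample_size):
    """Score in (0, 1]; higher means the point is less like the profile."""
    mean_path = sum(path_length(point, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / average_path_length(sample_size))


profile_docs = [
    ["topic", "model"], ["topic", "corpus"], ["model", "corpus"], ["topic"],
    ["model"], ["corpus"], ["topic", "model", "corpus"], ["model", "corpus", "tree"],
]
profile = [document_vector(doc) for doc in profile_docs]
trees = fit_profile(profile)

similar_score = anomaly_score(
    document_vector(["topic", "model", "corpus", "model"]), trees, len(profile))
different_score = anomaly_score(document_vector(["forest", "tree"]), trees, len(profile))
print(similar_score, different_score)
```

In this sketch a candidate document with a lower anomaly score is treated as closer to the profile, so ranking candidates by ascending score gives a simple similarity ordering. In practice, a library implementation of Isolation Forest and real embeddings would replace the toy pieces.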
