A solution of semantic clustering of text documents

https://doi.org/10.13140/2.1.1828.4806

Abstract

Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters. Measuring how similar or dissimilar two documents are is not always a clear-cut problem, and it depends on the topical affiliation of the documents. For example, when clustering research papers, two documents are regarded as similar if they share similar topics. When clustering is applied to web sites, we are usually more interested in clustering the component pages according to the type of information presented on each page. A variety of similarity and distance measures have been proposed and widely applied, such as cosine similarity, the Pearson correlation coefficient, and Euclidean distance. This paper deals with semantic clustering of text documents written in the Serbian language. The aim is to prepare documents of different formats for clustering, to find keywords in the set of documents, to cluster the documents based on those keywords, and to find the most appropriate document for a given question.
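
The three measures named in the abstract can be illustrated with a minimal sketch. It assumes that each document is represented as a vector of raw term counts over a shared vocabulary, which is a common choice but is not stated in the abstract. The function names (`term_vector`, `cosine_similarity`, `pearson_correlation`, `euclidean_distance`) and the toy English sentences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text (assumed representation)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation coefficient; 0.0 if either vector has zero variance."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy documents, used only for illustration.
doc_1 = "clustering groups similar documents"
doc_2 = "clustering groups documents by topic"
vocabulary = sorted(set(doc_1.split()) | set(doc_2.split()))

v1 = term_vector(doc_1, vocabulary)
v2 = term_vector(doc_2, vocabulary)

print("cosine similarity:", cosine_similarity(v1, v2))
print("Pearson correlation:", pearson_correlation(v1, v2))
print("Euclidean distance:", euclidean_distance(v1, v2))
```

Cosine similarity and Pearson correlation grow as documents become more alike, whereas Euclidean distance shrinks, so a clustering routine has to treat the first two as similarities and the third as a distance.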
