A new hybrid approach for document clustering

2017, 2017 13th International Computer Engineering Conference (ICENCO)

https://doi.org/10.1109/ICENCO.2017.8289803

Abstract

The k-means algorithm is widely used for clustering because of its simplicity. Unfortunately, its output depends on how the cluster centroids are initialized. In this paper, we propose a new hybrid approach for document clustering that uses the output of single pass clustering (SPC) to initialize the k-means algorithm. We aim to combine the advantages of careful seeding through single pass clustering with the benefits of the k-means algorithm. The experimental results show that the proposed approach outperforms the traditional k-means algorithm on both unsupervised and supervised evaluation measures, especially as the number of required clusters increases.
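
The idea of seeding k-means with single pass clustering can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the similarity measure, the threshold, or how the number of SPC clusters is matched to a requested k. The sketch assumes cosine similarity on dense document vectors, a hypothetical `threshold` parameter, and simply takes k to be however many clusters the single pass produces. The function names `single_pass_seeds` and `kmeans` are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_pass_seeds(docs, threshold):
    """One pass over docs: a document joins the most similar existing cluster
    if that similarity reaches `threshold` (assumed parameter), otherwise it
    opens a new cluster. Returns the cluster centroids."""
    clusters = []  # each cluster is a list of member vectors
    for d in docs:
        best, best_sim = None, -1.0
        for c in clusters:
            s = cosine(d, mean(c))
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= threshold:
            best.append(d)
        else:
            clusters.append([d])
    return [mean(c) for c in clusters]

def kmeans(docs, init_centroids, max_iter=100):
    """Lloyd-style iterations starting from the given centroids
    (cosine similarity is an assumption of this sketch)."""
    centroids = [list(c) for c in init_centroids]  # do not mutate the caller's seeds
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda j: cosine(d, centroids[j]))
            for d in docs
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [d for d, l in zip(docs, labels) if l == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = mean(members)
    return labels, centroids

# Toy term-weight vectors standing in for documents (made-up data).
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
seeds = single_pass_seeds(docs, threshold=0.8)  # SPC output used as initial centroids
labels, centroids = kmeans(docs, seeds)
print(len(seeds), labels)
```

Because the seeds come from a deterministic pass over the data rather than from random sampling, repeated runs on the same document order give the same starting centroids. The trade-off in this sketch is that the result depends on document order and on the chosen threshold.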
