Data Mining Process Using Clustering: A Survey

irpds.com

Abstract

Clustering is a basic and useful method for understanding and exploring a data set. It is the division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. Interest in clustering has recently increased in new application areas, including data mining, bioinformatics, web mining, text mining, and image analysis. This survey focuses on clustering in data mining and reviews the different clustering algorithms used there. A categorization of clustering algorithms is provided, and the survey closely follows it. The basics of hierarchical clustering, including linkage metrics, hierarchical clusters of arbitrary shapes, and binary divisive partitioning, are discussed first. Next come the partitioning relocation clustering algorithms, which include probabilistic clustering, k-medoids methods, and k-means methods. Density-based partitioning, grid-based methods, and co-occurrence of categorical data are covered in the remaining sections. Comparisons among these algorithms are mostly based on specific applications and made under certain conditions, so the results may differ considerably if the conditions change.
