Cluster ensembles

2011, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery

https://doi.org/10.1002/WIDM.32

Abstract

Cluster ensembles combine multiple clusterings of a set of objects into a single consolidated clustering, often referred to as the consensus solution. Consensus clustering can be used to generate more robust and stable clustering results compared to a single clustering approach, perform distributed computing under privacy or sharing constraints, or reuse existing knowledge. This paper describes a variety of algorithms that have been proposed to address the cluster ensemble problem, organizing them in conceptual categories that bring out the common threads and lessons learnt while simultaneously highlighting unique features of individual approaches.

Key takeaways

  1. Cluster ensembles improve clustering quality and robustness by consolidating multiple base clusterings into a consensus solution.
  2. Consensus methods must cope with base clusterings that differ in their number of clusters and whose cluster labels are purely symbolic, so labels cannot be compared directly across base solutions.
  3. Key algorithm families for combining clusterings include probabilistic models, co-association-based approaches, and heuristic methods.
  4. Metrics like Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Variation of Information (VI) assess clustering quality.
  5. Applications of cluster ensembles range from document clustering to gene expression analysis, enhancing diverse data analysis tasks.
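
To make the co-association idea in the takeaways concrete, here is a minimal sketch, assuming hard base clusterings given as plain label lists. It counts how often each pair of objects shares a cluster, then links pairs that agree in more than half of the base clusterings and reads the connected components as consensus clusters. The function names (`coassociation`, `consensus`) and the 0.5 threshold are illustrative assumptions, not taken from the paper; the thresholding step is one simple choice, loosely in the spirit of evidence-accumulation methods, and not the specific algorithm of any surveyed work.

```python
from itertools import combinations


def coassociation(labelings):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    base clusterings that place objects i and j in the same cluster."""
    n = len(labelings[0])
    m = len(labelings)
    counts = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                # Only equality of labels matters, so the symbolic
                # label names never need to be aligned across clusterings.
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    return [[c / m for c in row] for row in counts]


def consensus(labelings, threshold=0.5):
    """Link objects that co-occur in more than `threshold` of the base
    clusterings and return connected components as consensus labels."""
    co = coassociation(labelings)
    n = len(co)
    parent = list(range(n))  # union-find forest over the objects

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three base clusterings of six objects. The second is the same partition
# as the first under different label names; the third uses three clusters.
base = [
    [0, 0, 0, 1, 1, 1],
    ["b", "b", "b", "a", "a", "a"],
    [0, 0, 1, 1, 2, 2],
]
result = consensus(base)
print(result)  # [0, 0, 0, 1, 1, 1]
```

Because the matrix only records whether two objects share a label, the sketch works unchanged when base clusterings use different label names or different numbers of clusters. The quadratic pair loop is kept for readability and would need replacing for large data sets.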
