
Data Weighting Mechanisms for Clustering Ensembles

DOI: https://doi.org/10.1016/J.COMPELECENG.2013.02.004

Abstract

Inspired by bagging and boosting algorithms in classification, non-weighting- and weighting-based sampling approaches for clustering are proposed and studied in this paper. The effectiveness of the non-weighting-based sampling technique is investigated by comparing the efficacy of sampling with and without replacement in conjunction with several consensus algorithms. Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and boosting techniques. Subsamples of small size can reduce the computational cost and measurement complexity of many unsupervised data mining tasks with distributed sources of data. This empirical study also compares the performance of boosting and bagging clustering ensembles using different consensus functions on a number of datasets.
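To make the resampling idea concrete, below is a minimal sketch of a bagging-style clustering ensemble: bootstrap samples of the data are clustered with k-means, and the resulting partitions are combined through an evidence-accumulation (co-association) consensus function. It assumes NumPy, SciPy, and scikit-learn are available; the function name `bootstrap_coassociation_ensemble` and the parameter defaults are illustrative and not taken from the paper.

```python
# A minimal sketch of a bootstrap (bagging-style) clustering ensemble with a
# co-association consensus function. Illustrative only; not the authors' exact
# procedure or implementation.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def bootstrap_coassociation_ensemble(X, n_clusters=3, n_partitions=50, rng=None):
    """Cluster bootstrap samples of X and combine the partitions."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    coassoc = np.zeros((n, n))

    for _ in range(n_partitions):
        # Sampling with replacement (bootstrapping); subsampling would use
        # replace=False and a smaller sample size.
        idx = rng.choice(n, size=n, replace=True)
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=int(rng.integers(1 << 31))).fit(X[idx])
        # Extend the sample partition to every point via the nearest centroid,
        # then accumulate pairwise co-membership evidence.
        labels = km.predict(X)
        coassoc += (labels[:, None] == labels[None, :])

    # Consensus: average-link clustering of the co-association "distance".
    dist = 1.0 - coassoc / n_partitions
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


if __name__ == "__main__":
    gen = np.random.default_rng(0)
    X = np.vstack([gen.normal(loc=c, size=(50, 2)) for c in (0.0, 4.0, 8.0)])
    print(bootstrap_coassociation_ensemble(X, n_clusters=3)[:10])
```

Switching `replace=True` to `replace=False` with a reduced sample size gives the without-replacement (subsampling) variant compared in the study.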
