A comparative study of various clustering techniques on big data sets using Apache Mahout

Venkateswara Reddy Eluri; M. Ramesh; Amina Salim Mohd Al-Jabri; Mare Jane

doi:10.1109/ICBDSC.2016.7460397

COMPARATIVE STUDY OF VARIOUS CLUSTERING TECHNIQUES

IJCSMC Journal



8 pages


Abstract

Clustering is the process of dividing data into groups such that objects within a group are similar to one another and dissimilar to objects in other groups. Representing data by fewer clusters necessarily loses fine detail but achieves simplification: the data is modeled by its clusters. Clustering plays a significant part in data mining applications such as scientific data exploration, information retrieval, text mining, city planning, earthquake studies, marketing, spatial database applications, Web analysis, medical diagnostics, and computational biology. It is an area of active research in several fields, including statistics, pattern recognition, and machine learning. Data mining adds to clustering the complication of very large datasets with many attributes of different types, which imposes unique computational requirements on the relevant clustering algorithms. A variety of clustering algorithms have recently emerged that meet these requirements and have been successfully applied to many real-life data mining problems.

Key takeaways
AI

Clustering techniques simplify data representation by grouping similar objects, pivotal in data mining applications.
The study reviews various clustering algorithms, including hierarchical, partitioning, and grid-based methods.
K-means and K-medoids provide effective partitioning techniques by optimizing cluster centers or representative points.
Feature selection using the proposed FAST algorithm enhances efficiency by removing irrelevant and redundant features.
Kruskal's algorithm constructs minimum spanning trees to improve clustering performance on high-dimensional datasets.
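
As an illustration of the centre-optimizing partitioning mentioned in the takeaways, here is a minimal k-means sketch in plain Python. It is not taken from the paper: the function names (`kmeans`, `assign`, `mean`, `sq_dist`), the random initialization, and the stopping rule are assumptions chosen for illustration, and the Apache Mahout setup named in the title is not reproduced here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def assign(points, centers):
    """Group each point with its nearest center (ties go to the lower index)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and center updates until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k data points
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (2, 0), (100, 0), (101, 0), (102, 0)]
    centers, clusters = kmeans(data, k=2)
    print(sorted(centers))
```

Each pass only improves the within-cluster squared distance locally, so the result can depend on the initial centers when the data is less cleanly separated than in this toy example.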

Figures (4)

Partitioning methods are classified into two subcategories, viz. centroid and medoid algorithms. Centroid algorithms represent each cluster by the centre of gravity of its instances. Medoid algorithms represent each cluster by the instance closest to the centre of gravity. Partitioning clustering algorithms try to locally improve a certain criterion: they compute the similarity or distance values, order the results, and pick the one that optimizes the criterion. Hence, the majority of them can be considered greedy-like algorithms. [7]
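
The difference between the two cluster representatives described above can be sketched as follows. The helper names `centroid` and `closest_member` are hypothetical, and the medoid here follows the caption's wording (the instance closest to the centre of gravity); many k-medoids formulations instead pick the member with the smallest total dissimilarity to the other members.

```python
def centroid(points):
    """Centre of gravity: coordinate-wise mean, usually not a data point."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def closest_member(points):
    """Medoid in the caption's sense: the actual instance nearest the centroid."""
    c = centroid(points)
    return min(points, key=lambda p: sum((x - y) ** 2 for x, y in zip(p, c)))


if __name__ == "__main__":
    cluster = [(0, 0), (1, 0), (2, 1), (9, 3)]
    print(centroid(cluster))        # (3.0, 1.0)
    print(closest_member(cluster))  # (2, 1)
```

The centroid is pulled toward the outlying point (9, 3), while the medoid is constrained to be one of the observed instances.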

Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree (MST) clustering method.
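
A generic MST-based clustering can be sketched with Kruskal's algorithm and a disjoint-set structure. This is an assumption-laden illustration, not the FAST procedure itself: the edge weights are hypothetical dissimilarities between nodes (for example, features), and the rule of cutting the k-1 heaviest tree edges is a common textbook choice rather than the partitioning rule used by FAST.

```python
class DisjointSet:
    """Union-find over integer nodes 0..n-1."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def kruskal_mst(n, edges):
    """edges: iterable of (weight, u, v). Returns MST edges in ascending weight."""
    ds = DisjointSet(n)
    return [(w, u, v) for w, u, v in sorted(edges) if ds.union(u, v)]


def mst_clusters(n, edges, k):
    """Drop the k-1 heaviest MST edges and return the connected components."""
    mst = kruskal_mst(n, edges)
    kept = mst[: max(len(mst) - (k - 1), 0)]
    ds = DisjointSet(n)
    for _, u, v in kept:
        ds.union(u, v)
    groups = {}
    for node in range(n):
        groups.setdefault(ds.find(node), []).append(node)
    return sorted(groups.values())


if __name__ == "__main__":
    edges = [(1, 0, 1), (2, 1, 2), (9, 2, 3), (1, 3, 4), (10, 0, 4), (3, 0, 2)]
    print(kruskal_mst(5, edges))
    print(mst_clusters(5, edges, k=2))
```

Sorting the edges dominates the cost, so the tree is built in O(E log E) time for E edges, which is one reason MST construction is attractive when many features must be compared.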

Table 4.1: Advantages and Disadvantages

5. COMPARATIVE STUDY OF EXISTING METHODS

N = number of objects, K = number of clusters, S = size of sample.

References (10)

Qinbao Song, Jingjie Ni, and Guangtao Wang. A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data. IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, 2013.
Osama Abu Abbas. Comparisons between Data Clustering Algorithms. The International Arab Journal of Information Technology, vol. 5, no. 3, July 2008.
A Review: Comparative Study of Various Clustering Techniques in Data Mining. International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, issue 3, March 2013.
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proceedings of the International Conference on Management of Data (SIGMOD), volume 27(2) of SIGMOD Record, pages 94-105, Seattle, WA, USA, 1-4 June 1998. ACM Press.
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
Literature Survey on Clustering Techniques, http://www.slideshare.net/IOSR/a0310112-26684753.
L. Bottou and Y. Bengio. Convergence properties of the K-means algorithms. In G. Tesauro and D. Touretzky (Eds.), Advances in Neural Information Processing Systems 7, pages 585-592, The MIT Press, Cambridge, MA, 1995.
I. Dhillon, J. Fan, and Y. Guan. Efficient clustering of very large document collections. In R.L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R.R. Namburu (Eds.), Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001.
Alexander Hinneburg and Daniel A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, (KDD), pages 58-65, New York, NY, USA, 27-31 August 1998. AAAI Press.
