Academia.edu

Data Clustering

2,735 papers
207 followers
About this topic
Data clustering is a machine learning technique that involves grouping a set of objects in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups. It is used for exploratory data analysis and pattern recognition.

Key research themes

1. How do different fundamental clustering algorithm paradigms address data complexity and application needs in data mining?

This theme explores the comparative roles, methodological foundations, and practical implementations of the main clustering paradigms—partitioning, hierarchical, density-based, grid-based, and model-based clustering—in data mining. It highlights how these paradigms adapt to handle large, high-dimensional, or complex datasets and accommodate different data types and clustering objectives across various applications.

Key finding: Provides a comprehensive taxonomy of clustering algorithms used in big data mining, dividing them into hierarchical and partitioning methods, further delineating subtypes such as agglomerative/divisive, k-means, k-medoids,...
Key finding: Offers an extensive survey highlighting strengths and weaknesses of partitioning (e.g., k-means, k-medoids), hierarchical (agglomerative and divisive), density-based (DBSCAN, OPTICS), and grid-based algorithms. It emphasizes...
Key finding: Analyzes the performance of classical clustering methods on high-dimensional datasets, emphasizing challenges of scalability and meaningful pattern extraction. It discusses issues like the curse of dimensionality causing...
Key finding: Presents a tutorial overview and mathematical underpinnings of diverse clustering approaches including hierarchical, partitioning, density-based, model-based, grid-based, and soft computing. It rigorously discusses the...
Key finding: Focuses on clustering methods specifically tailored for Web data, emphasizing adaptation of classical clustering, graph clustering, and neural network approaches to domain-specific data representations (text, hyperlinks,...

2. What optimization and ensemble strategies improve clustering robustness, accuracy, and shape flexibility beyond traditional centroid-based methods?

This theme investigates advanced methodologies that enhance clustering performance via mathematical programming, ensemble evidence accumulation, non-centroid discrete optimization, and hybrid parallel approaches. It elucidates how these methods address issues such as local minima trapping, robustness over multiple runs, identification of arbitrary-shaped clusters, and computational scalability.

Key finding: Introduces a mixed-integer linear programming (MILP) model to obtain provably optimal cluster assignments minimizing total within-cluster dissimilarities, incorporating constraints such as group precedence and size limits....
Key finding: Proposes an ensemble clustering approach that aggregates multiple clusterings generated by repeated random initializations of k-means, forming a co-association matrix representing pairwise pattern similarity. By clustering...
Key finding: Develops an information-theoretical framework using normalized mutual information and bootstrap variance to quantify clustering ensemble consistency and robustness. It formulates evidence accumulation as an optimization of...
Key finding: Presents novel K-medoids based clustering algorithms tailored for set-valued data, bypassing centroid computation limitations in categorical and set data by leveraging classical set-distance measures (Jaccard, Otsuka-Ochiai)...
Key finding: Combines multi-agent system (MAS) concepts with K-means algorithm to introduce Multi-K-means (MK-means), a parallel clustering technique improving global optimization convergence and accuracy. Agents collaborate by monitoring...
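The co-association idea mentioned in the ensemble finding above can be illustrated with a minimal sketch. It is not the method of any listed paper: the function names `co_association` and `consensus_labels`, the 0.5 threshold, and the toy label lists are all assumptions made for illustration. The sketch counts how often each pair of points shares a cluster across several base clusterings, then links pairs whose co-association exceeds the threshold and reads off connected components as the consensus clusters.

```python
def co_association(labelings):
    """Fraction of base clusterings in which each pair of points shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(labelings)
    return [[c / runs for c in row] for row in counts]


def consensus_labels(matrix, threshold=0.5):
    """Link points whose co-association exceeds the threshold (an assumed cut-off)
    and label each connected component as one consensus cluster."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three hypothetical base clusterings of five points (label names differ per run).
base_runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
matrix = co_association(base_runs)
print(consensus_labels(matrix))
```

Because the matrix depends only on whether two points share a label, the arbitrary label names used by each run do not need to be aligned before combining them.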

3. How can dimensionality reduction and integration of clustering algorithms enhance clustering effectiveness in high-dimensional and domain-specific datasets?

This theme surveys approaches combining dimensionality reduction techniques like Principal Component Analysis (PCA) and integrated or hybrid clustering frameworks to address the curse of dimensionality, improve clustering interpretability, and optimize domain-specific applications such as telecom customer segmentation.

Key finding: Demonstrates a hybrid approach where PCA reduces high-dimensional telecom customer data to critical principal components, enabling effective hierarchical K-nearest neighbors clustering for client segmentation. This method...
Key finding: Introduces a discrete differential evolution algorithm that eschews traditional centroid reliance, searching instead for label assignments directly in discrete space. This enables discovery of non-spherical clusters and...
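As a rough illustration of reducing dimensionality before clustering, the sketch below projects data onto its first principal component and then splits the points by the sign of the projection. It is a simplification under stated assumptions, not the pipeline of the paper summarized above: the power-iteration PCA, the sign-based split (used here in place of hierarchical K-nearest neighbors clustering), the helper names, and the toy `customers` rows are all hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (mean, unit direction) of the leading principal component,
    estimated by power iteration on the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Deterministic start vector; assumes it is not orthogonal to the leading direction.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, direction):
    """Coordinate of each row along the given direction after centering."""
    return [
        sum((r[k] - mean[k]) * direction[k] for k in range(len(mean)))
        for r in rows
    ]


# Hypothetical three-feature records forming two groups along one dominant direction.
customers = [
    [1.0, 1.1, 0.10],
    [1.2, 0.9, -0.10],
    [0.9, 1.0, 0.00],
    [-1.0, -1.1, 0.05],
    [-1.1, -0.9, 0.00],
    [-1.0, -1.0, -0.05],
]
mean, direction = first_principal_component(customers)
scores = project(customers, mean, direction)
segments = [1 if s > 0 else 0 for s in scores]
print(segments)
```

In practice more than one component is usually retained and a proper clustering algorithm is run on the reduced coordinates; the single component and sign split are used here only to keep the example short.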

All papers in Data Clustering

The extraction of numeric features to characterize textures on images takes special relevance in certain satellite and aerial images classification processes. The wide range of the methodological approaches used and their applications in... more
Due to the dramatic increase of data volumes in different applications, it is becoming infeasible to keep these data in one centralized machine. It is becoming more and more natural to deal with distributed databases and networks. That is... more
Rb-Sr and K-Ar ages have been obtained on six biotites, two muscovites and one hornblende from samples of micaschist, gneiss and amphibolite of Lower Paleozoic to Precambrian age at a depth exceeding 2,000 m in basement rocks of the... more
Cluster analysis is an unsupervised learning technique that is widely used in the process of topic discovery from text. The research presented here proposes a novel unsupervised learning approach based on aggregation of clusterings... more
We recently introduced the idea of solving cluster ensembles using a Weighted Shared nearest neighbors Graph (WSnnG). Preliminary experiments have shown promising results in terms of integrating different clusterings into a combined one,... more
Traditional sketch-based image or video search systems rely on machine learning concepts as their core technology. However, in many applications, machine learning alone is impractical since videos may not be semantically annotated... more
Summary: This work proposes a new approach, suggested by graph theory, to reduce the loss of information caused by improper deletion of data during the editing process, and to improve the... more
In the paper the most recent methodological and technological advancements at ISTAT in the area of editing and imputation are described. A recently developed model-based method for localizing systematic unity measure errors and some... more
Data clustering constitutes at present a commonly used technique for extracting fuzzy system rules from experimental data. Detailed studies in the field have shown that using above-mentioned method results in significantly reduced... more
Feature extraction is an essential process in machine learning (ML). It is the process of deriving new features from the original features in order to enhance the quality or representation of the data for different reasons, such as... more
This study explores the principal methods of data mining and their diverse applications across industries. Its purpose is to provide a comprehensive overview of key techniques-classification, clustering, association rule learning,... more
Data analysis supported by unsupervised clustering algorithms is one of the fastest-growing research areas in data mining, because of the availability of huge quantities of data to analyze and the need to extract useful information based on new improve... more
Wireless sensor networks (WSNs) based on the Internet of Things (IoT) are now one of the most prominent wireless sensor communication technologies. WSNs are often developed for particular applications such as monitoring or tracking in... more
As network traffic gets more complex, conventional manual techniques of identifying network traffic are becoming less successful. Fraudulent activities are no longer allowed by internet auction sites like those that allow shill bidding;... more
Segmentation is a fundamental step in image description or classification. In recent years, several computational models have been used to implement segmentation methods but without establishing a single analytic solution. However, the... more
Conversational message thread identification regards a wide spectrum of applications, ranging from social network marketing to virus propagation, digital forensics, etc. Many different approaches have been proposed in literature for the... more
Electric vehicles (EVs) are emerging as the future of individual mobility systems in smart cities since they reduce greenhouse gas emissions and fossil fuel dependence. However, the deepening penetration of battery EVs forecasted for the... more
by Jan Hunady and 1 more
Panel data, also known as longitudinal data, is collected and analysed across various research areas. This type of data consists of statistical objects that are periodically observed over time. In comparison to cross-sectional data, there... more
Tables of earthquakes and clustering results, maps of questionnaire results, an animation showing the evolution of the Barcelonnette event questionnaire clustering with time, and figures showing clustering comparisons (zipped archive).
This paper is devoted to the proposal of two classes of compromise conditional Gaussian networks for data clustering as well as to their experimental evaluation and comparison on synthetic and real-world databases. According to the... more
Abstract: This paper introduces a novel enhancement for unsupervised learning of conditional Gaussian networks that benefits from feature selection. Our proposal is based on the assumption that, in the absence of labels reflecting the... more
This version may not include final proof corrections and does not include published layout or pagination. Citation for the version of the work held in 'OpenAIR@RGU': BRUZA, P. D. and SONG, D., 2003. A comparison of various approaches for... more
Data clustering is an approach for automatically finding classes, concepts, or groups of patterns. It also aims at representing large datasets by a small number of prototypes or clusters. It brings simplicity in modelling data and plays an... more
Feature selection aims to reduce dimensionality for building comprehensible learning models with good generalization performance. Feature selection algorithms are largely studied separately according to the type of learning: supervised or... more
Real-world datasets commonly present high dimensional data, which means an increased amount of information. However, this does not always imply an improvement in learning technique performance. Furthermore, some features may be correlated... more
In this paper we have presented an effective hybrid genetic algorithm for solving clustering problems with multi-dimensional grid structure. The algorithm is basically a combination of Genetic Algorithm (GA) and Tabu Search (TS) so that... more
Parallel implementations of two computer vision algorithms on distributed cluster platforms are described. The first algorithm is a square-error data clustering method whose parallel implementation is based on the well-known sequential... more
Algorithmic enhancements are described that enable large computational reduction in mean square-error data clustering. These improvements are incorporated into a parallel data-clustering tool, P-CLUSTER, designed to execute on a network... more
Cluster analysis is one of the prominent techniques in the field of data mining, and k-means is one of the best-known and most popular partition-based clustering algorithms. The k-means clustering algorithm is widely used in clustering. The... more
A fundamental assumption often made in unsupervised learning is that the problem is static, i.e., the description of the classes does not change with time. However, many practical clustering tasks involve changing environments. It is... more
We investigate the adaptation and performance of modularity-based algorithms, designed in the scope of complex networks, to analyze the mesoscopic structure of correlation matrices. Using a multiresolution analysis, we are able to... more
Context prediction is useful for energy saving and hence eco-efficient context-aware service by increasing the interval of context sensing. One way of predicting context is to recognize context patterns in an accurate manner.... more
Content inappropriate for children on Internet television is a serious problem in today's multimedia world. There are numerous methods which are used to control the content of the transmitted television programmes. However, these... more
Complex systems are described with high-dimensional data that is hard to visualise. Inselberg's parallel coordinates are one representation technique for visualising high-dimensional data. Here we generalise Inselberg's approach, and use... more
Flying ad hoc networks (FANETs) are becoming a promising solution for various application scenarios involving unmanned aerial vehicles, such as urban surveillance or search and rescue missions. However,... more
Clustering is widely used to explore and understand large collections of data. In this thesis, we introduce LIMBO, a scalable hierarchical categorical clustering algorithm based on the Information Bottleneck (IB) framework for quantifying... more
Fuzzy clustering algorithm is one of the data mining methods that is applied in different fields. According to the fuzzy clustering algorithm, each object is allocated to the clusters regarding its percentage of belonging to each of the... more
The huge size of multimedia data requires efficient data classification and organization to provide effective multimedia data manipulation. Those valuable data must be captured and stored for potential purposes. One of the main... more
This study examined the level of community satisfaction with the operational performance of the 1st Provincial Mobile Force Company (PMFC) in Basilan concerning its anti-criminality campaign. Using a descriptive research design, data were... more
Clustering is a widely used technique in data mining applications for discovering patterns in underlying data. Most traditional clustering algorithms are limited in handling datasets that contain categorical attributes. However, datasets... more
Knowledge discovery in multi-dimensional data is a challenging problem in engineering design. For example, in trade space exploration of large design data sets, designers need to select a subset of data of interest and examine data from... more
We develop a new method to measure neutron star parameters and derive constraints on the equation of state of dense matter by fitting the frequencies of simultaneous Quasi Periodic Oscillation modes observed in the X-ray flux of accreting... more
Clustering algorithms have evolved over the last decade. With the continuous success of nature-inspired algorithms in solving many engineering problems, it is imperative to scrutinize the success of these methods applied to data... more