This paper presents a probabilistic model for combining cluster ensembles utilizing information theoretic measures. Starting from a co-association matrix which summarizes the ensemble, we extract a set of association distributions, which are modelled as discrete probability distributions of the object labels, conditional on each data object. The key objectives are, first, to model the associations of neighboring data objects, and second, to allow for the manipulation of the defined probability distributions using statistical and information theoretic means. A Jensen-Shannon Divergence based Clustering Combination (JSDCC) method is proposed. The method selects cluster prototypes from the set of association distributions based on entropy maximization and maximization of the generalized JS divergence among the selected prototypes. The method proceeds by grouping association distributions by minimizing their JS divergences to the selected prototypes. By aggregating the grouped association distributions, we can represent empirical cluster conditional probability distributions of the object labels, for each of the combined clusters. Finally, data objects are assigned to their most likely clusters, and their cluster assignment probabilities are estimated. Experiments are performed to assess the presented method and compare its performance with other alternative co-association based methods.
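The two ingredients named in the abstract above, a co-association matrix and the Jensen-Shannon divergence between per-object distributions, can be illustrated with a minimal sketch. This is not the JSDCC method itself: the prototype selection and grouping steps are omitted, and treating each row-normalized co-association row as an "association distribution" is an assumption made here for illustration. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def co_association(ensemble: Sequence[Sequence[int]]) -> List[List[float]]:
    """Fraction of clusterings in which objects i and j share a cluster."""
    m = len(ensemble)
    n = len(ensemble[0])
    counts = [[0] * n for _ in range(n)]
    for labels in ensemble:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def normalize(row: Sequence[float]) -> List[float]:
    """Scale a non-negative row so it sums to 1 (assumed association distribution)."""
    total = sum(row)  # always > 0 because every object co-occurs with itself
    return [v / total for v in row]


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(v * math.log2(v) for v in p if v > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Equal-weight Jensen-Shannon divergence between two distributions (bits)."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(mix) - (entropy(p) + entropy(q)) / 2


# Toy ensemble: three clusterings of six objects (labels are arbitrary per clustering).
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 2, 2],
]
dists = [normalize(row) for row in co_association(ensemble)]

print(js_divergence(dists[0], dists[1]))  # objects always clustered together
print(js_divergence(dists[0], dists[2]))  # objects clustered together in 2 of 3 runs
print(js_divergence(dists[0], dists[5]))  # objects never clustered together
```

In this toy ensemble the divergence grows as two objects co-occur less often, which is the property a divergence-based grouping step would rely on.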
Finding Natural Clusters Using Multi-clusterer Combiner Based on Shared Nearest Neighbors
In this paper, we present a multiple data clusterings combiner, based on a proposed Weighted Shared nearest neighbors Graph (WSnnG). While combining multiple classifiers (supervised learners) is now an active and mature area, only a limited amount of contemporary research on combining multiple data clusterings (un-supervised learners) appears in the literature. The problem addressed in this paper is that of generating a reliable clustering to represent the natural cluster structure in a set of patterns, when a number of different clusterings of the data are available or can be generated. The underlying model of the proposed shared nearest neighbors based combiner is a weighted graph, whose vertices correspond to the set of patterns, and are assigned relative weights based on a ratio of a balancing factor to the size of their shared nearest neighbors population. The edges in the graph exist only between patterns that share a pre-specified portion of their nearest neighborhood. The graph can be further partitioned into a desired number of clusters. Preliminary experiments show promising results, and comparison with a recent study justifies the combiner’s suitability to the pre-defined problem domain.
Topic Discovery from Text Using Aggregation of Different Clustering Methods
Cluster analysis is an un-supervised learning technique that is widely used in the process of topic discovery from text. The research presented here proposes a novel un-supervised learning approach based on aggregation of clusterings produced by different clustering techniques. By examining and combining two different clusterings of a document collection, the aggregation aims at revealing a better structure of the data rather than one that is imposed or constrained by the clustering method itself. When clusters of documents are formed, a process called topic extraction picks terms from the feature space (i.e. the vocabulary of the whole collection) to describe the topic of each cluster. It is proposed at this stage to re-compute term weights according to the revealed cluster structure. The work further investigates the adaptive setup of the parameters required for the clustering and aggregation techniques. Finally, a topic accuracy measure is developed and used along with the F-measure to evaluate and compare the extracted topics and the clustering quality (respectively) before and after the aggregation. Experimental evaluation shows that the aggregation can successfully improve the clustering quality and the topic accuracy over individual clustering techniques.
In this paper, we propose a cluster-based cumulative representation for cluster ensembles. Cluster labels are mapped to incrementally accumulated clusters, and a matching criterion based on maximum similarity is used. The ensemble method is investigated with bootstrap re-sampling, where the k-means algorithm is used to generate high granularity clusterings. For combining, group average hierarchical meta-clustering is applied and the Jaccard measure is used for cluster similarity computation. Patterns are assigned to combined meta-clusters based on estimated cluster assignment probabilities. The cluster-based cumulative ensembles are more compact than co-association-based ensembles. Experimental results on artificial and real data show reduction of the error rate across varying ensemble parameters and cluster structures.
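As a rough illustration of the Jaccard measure and the maximum-similarity matching criterion mentioned in the abstract above, the following sketch compares clusters from two labelings as sets of object indices. It does not reproduce the cumulative representation or the meta-clustering step; the helper names `clusters_of`, `jaccard`, and `best_match` are hypothetical.

```python
from typing import Dict, Sequence, Set


def clusters_of(labels: Sequence[int]) -> Dict[int, Set[int]]:
    """Group object indices by cluster label."""
    groups: Dict[int, Set[int]] = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(idx)
    return groups


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(cluster: Set[int], candidates: Dict[int, Set[int]]) -> int:
    """Label of the candidate cluster with maximum Jaccard similarity."""
    return max(candidates, key=lambda lab: jaccard(cluster, candidates[lab]))


# Two clusterings of the same six objects, with unrelated label names.
first = clusters_of([0, 0, 0, 1, 1, 1])
second = clusters_of([5, 5, 7, 7, 7, 7])

for lab, members in first.items():
    print(lab, "->", best_match(members, second))
```

Because clusters are compared as sets of members, the arbitrary label names used by each clustering do not affect the similarity.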
Refined Shared Nearest Neighbors Graph for Combining Multiple Data Clusterings
We recently introduced the idea of solving cluster ensembles using a Weighted Shared nearest neighbors Graph (WSnnG). Preliminary experiments have shown promising results in terms of integrating different clusterings into a combined one, such that the natural cluster structure of the data can be revealed. In this paper, we further study and extend the basic WSnnG. First, we introduce the use of a fixed number of nearest neighbors in order to reduce the size of the graph. Second, we use refined weights on the edges and vertices of the graph. Experiments show that it is possible to capture the similarity relationships between the data patterns on a compact refined graph. Furthermore, the quality of the combined clustering based on the proposed WSnnG surpasses the average quality of the ensemble and that of an alternative clustering combining method based on partitioning of the patterns’ co-association matrix.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008
Over the past few years, there has been a renewed interest in the consensus clustering problem. Several new methods have been proposed for finding a consensus partition for a set of n data objects that optimally summarizes an ensemble. In this paper, we propose new consensus clustering algorithms with linear computational complexity in n. We consider clusterings generated with a random number of clusters, which we describe by categorical random variables. We introduce the idea of cumulative voting as a solution for the problem of cluster label alignment, where unlike the common one-to-one voting scheme, a probabilistic mapping is computed. We seek a first summary of the ensemble that minimizes the average squared distance between the mapped partitions and the optimal representation of the ensemble, where the selection criterion of the reference clustering is defined based on maximizing the information content as measured by the entropy. We describe cumulative vote weighting schemes and corresponding algorithms to compute an empirical probability distribution summarizing the ensemble. Given the arbitrary number of clusters of the input partitions, we formulate the problem of extracting the optimal consensus as that of finding a compressed summary of the estimated distribution that preserves the maximum relevant information. An efficient solution is obtained using an agglomerative algorithm that minimizes the average generalized Jensen-Shannon divergence within the cluster. The empirical study demonstrates significant gains in accuracy and superior performance compared to several recent consensus clustering algorithms.
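The abstract above states that the reference clustering is chosen by maximizing information content as measured by entropy. A minimal sketch of that selection idea follows, assuming (this is not confirmed by the abstract) that the entropy in question is the Shannon entropy of each partition's cluster-size distribution. The names `partition_entropy` and `pick_reference` are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def partition_entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (bits) of the cluster-size distribution of one partition."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def pick_reference(ensemble: Sequence[Sequence[int]]) -> int:
    """Index of the partition with the highest entropy (assumed selection rule)."""
    return max(range(len(ensemble)), key=lambda i: partition_entropy(ensemble[i]))


# Partitions with different numbers of clusters, as in a randomized ensemble.
ensemble = [
    [0, 0, 0, 0, 1, 1],  # two unbalanced clusters
    [0, 0, 1, 1, 2, 2],  # three balanced clusters
    [0, 0, 0, 1, 1, 1],  # two balanced clusters
]
print(pick_reference(ensemble))
```

Under this assumption, partitions with more clusters and more balanced cluster sizes are favored as the reference.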
Voting-based consensus clustering refers to a distinct class of consensus methods in which the cluster label mismatch problem is explicitly addressed. The voting problem is defined as the problem of finding the optimal relabeling of a given partition with respect to a reference partition. It is commonly formulated as a weighted bipartite matching problem. In this paper, we present a more general formulation of the voting problem as a regression problem with multiple-response and multiple-input variables. We show that a recently introduced cumulative voting scheme is a special case corresponding to a linear regression method. We use a randomized ensemble generation technique, where an overproduced number of clusters is randomly selected for each ensemble partition. We apply an information theoretic algorithm for extracting the consensus clustering from the aggregated ensemble representation and for estimating the number of clusters. We apply it in conjunction with bipartite matching and cumulative voting. We present empirical evidence showing substantial improvements in clustering accuracy, stability, and estimation of the true number of clusters based on cumulative voting. The improvements are achieved in comparison to consensus algorithms based on bipartite matching, which perform very poorly with the chosen ensemble generation technique, and also to other recent consensus algorithms.
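The one-to-one relabeling that the abstract above describes as a weighted bipartite matching problem can be sketched as follows. This is only the baseline formulation, not the regression-based or cumulative voting scheme. For brevity the sketch searches all label permutations, which is practical only for a handful of clusters; an assignment solver such as the Hungarian algorithm would normally replace it. It also assumes both partitions use the labels 0..k-1. The function name `relabel` is hypothetical.

```python
from itertools import permutations
from typing import List, Sequence


def relabel(partition: Sequence[int], reference: Sequence[int], k: int) -> List[int]:
    """Relabel `partition` to agree as much as possible with `reference`.

    Brute-force search over one-to-one label mappings (assumes labels 0..k-1).
    """
    # contingency[a][b] = number of objects labeled a in partition and b in reference
    contingency = [[0] * k for _ in range(k)]
    for a, b in zip(partition, reference):
        contingency[a][b] += 1

    # Choose the mapping a -> perm[a] that maximizes the total overlap.
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(contingency[a][perm[a]] for a in range(k)),
    )
    return [best[a] for a in partition]


reference = [0, 0, 1, 1, 2, 2]
partition = [2, 2, 0, 0, 1, 1]  # same grouping, different label names
print(relabel(partition, reference, k=3))
```

A one-to-one mapping like this requires the two partitions to have the same number of clusters, which is consistent with the abstract's remark that bipartite matching performs poorly when the number of clusters is randomized across the ensemble.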
We propose a novel design of a Student Success System (S3), a holistic analytical system for identifying and treating at-risk students. S3 synthesizes several strands of risk analytics: the use of predictive models to identify academically at-risk students, the creation of data visualizations for reaching diagnostic insights, and the application of a case-based approach for managing interventions. Such a system poses numerous design, implementation, and research challenges. In this paper we discuss a core research challenge for designing early warning systems such as S3. We then propose our approach for meeting that challenge. A practical implementation of a student risk early warning system, utilizing predictive models, must meet two design criteria: a) the methodology for generating predictive models must be flexible to allow generalization from one context to another; b) the underlying mechanism of prediction should be easily interpretable by practitioners whose end goal is to design meaningful interventions on behalf of students. Our proposed solution applies an ensemble method for predictive modeling using a strategy of decomposition. Decomposition provides a flexible technique for generating and generalizing predictive models across different contexts. Decomposition into interpretable semantic units, when coupled with data visualizations and case management tools, allows practitioners, such as instructors and advisors, to build a bridge between prediction and intervention.
Papers by Hanan Ayad