Key research themes
1. How do similarity and distance measures impact the effectiveness of partitional text clustering?
This research area investigates how the choice of similarity or distance function affects partitional clustering algorithms such as K-means when applied to text documents. Because clustering quality depends on how accurately the measure captures closeness between documents, the choice among measures such as cosine similarity, Euclidean distance, the Jaccard coefficient, and Kullback-Leibler divergence is critical. These measures differ in their mathematical properties and in their suitability for high-dimensional, sparse textual data; for example, Euclidean distance is sensitive to document length, whereas cosine similarity is not. Empirical evaluations on diverse datasets help identify which measures yield the best clustering performance.
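To make the differences between these measures concrete, the following is a minimal sketch that computes all four on raw term-count vectors. The function names, the whitespace tokenizer, the set-based form of the Jaccard coefficient, and the additive smoothing constant `eps` used for Kullback-Leibler divergence are illustrative assumptions, not choices prescribed by any particular study.

```python
import math
from collections import Counter


def term_counts(text):
    """Tokenize on whitespace (an assumption) and count terms."""
    return Counter(text.lower().split())


def to_vectors(a, b):
    """Align two term-count maps on their shared vocabulary."""
    vocab = sorted(set(a) | set(b))
    return [a[t] for t in vocab], [b[t] for t in vocab]


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def jaccard_coefficient(a, b):
    """Set-based Jaccard: shared terms divided by all distinct terms."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def kl_divergence(u, v, eps=1e-9):
    """KL(P || Q) over smoothed term distributions; note it is asymmetric."""
    p = [x + eps for x in u]
    q = [y + eps for y in v]
    sum_p, sum_q = sum(p), sum(q)
    return sum((pi / sum_p) * math.log((pi / sum_p) / (qi / sum_q))
               for pi, qi in zip(p, q))


short_doc = term_counts("cats chase mice")
long_doc = term_counts("cats chase mice cats chase mice")  # same content, twice as long
other_doc = term_counts("dogs chase cats")

u_same, v_same = to_vectors(short_doc, long_doc)
u_diff, v_diff = to_vectors(short_doc, other_doc)

print("cosine    (same content):", cosine_similarity(u_same, v_same))
print("euclidean (same content):", euclidean_distance(u_same, v_same))
print("cosine    (different)   :", cosine_similarity(u_diff, v_diff))
print("jaccard   (different)   :", jaccard_coefficient(short_doc, other_doc))
print("KL        (different)   :", kl_divergence(u_diff, v_diff))
```

In this toy case the short document and its doubled copy have a cosine similarity of 1 but a non-zero Euclidean distance, which illustrates why length-insensitive measures are often preferred for text unless vectors are normalized first.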
2. What challenges do short texts pose to clustering algorithms and what methods improve short text clustering effectiveness?
Short text clustering (STC) groups highly sparse, context-poor, and noisy texts such as tweets, search queries, and social media posts. Because such texts contain few words, traditional clustering approaches often underperform on them. This research theme centers on the representations, similarity measures, dimensionality reduction techniques, and algorithmic adaptations needed to overcome data sparsity and high dimensionality while preserving semantic coherence. Approaches tailored to short-text characteristics are critical for applications such as social media analysis, sentiment detection, and real-time information extraction.
3. How can semantic-enriched document representations and modern embedding methods enhance text clustering quality?
This theme explores document representations that go beyond bag-of-words and frequency-based vectors by incorporating semantic relations, word embeddings, lexical databases, and representations derived from large language models (LLMs). By capturing synonymy, polysemy, and contextual meaning, semantic-enriched methods aim to produce clusters that are more coherent and easier to tell apart. A key focus is how such embeddings and domain-specific knowledge sources affect clustering algorithms such as K-means, spectral clustering, and fuzzy clustering. The theme also covers the use of LLM embeddings and hybrid semantic techniques to improve clustering purity and topic separability.
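The benefit of capturing synonymy can be illustrated with a small sketch. It compares a bag-of-words representation with a mean-of-word-vectors representation on documents that share meaning but no terms. The two-dimensional vectors in `TOY_EMBEDDINGS` are hand-made, hypothetical values chosen only for illustration; a real system would obtain vectors from a trained embedding model or an LLM.

```python
import math

# Hypothetical 2-d word vectors, invented for this example only.
TOY_EMBEDDINGS = {
    "car": [0.90, 0.10],
    "automobile": [0.85, 0.15],
    "engine": [0.80, 0.20],
    "motor": [0.80, 0.25],
    "banana": [0.10, 0.90],
    "fruit": [0.15, 0.85],
}


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bag_of_words(tokens, vocab):
    """Term-count vector over a fixed vocabulary."""
    return [tokens.count(term) for term in vocab]


def mean_embedding(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


doc_a = "car engine".split()
doc_b = "automobile motor".split()  # same topic as doc_a, no shared terms
doc_c = "banana fruit".split()      # different topic

vocab = sorted(set(doc_a) | set(doc_b) | set(doc_c))

bow_ab = cosine(bag_of_words(doc_a, vocab), bag_of_words(doc_b, vocab))
emb_ab = cosine(mean_embedding(doc_a), mean_embedding(doc_b))
emb_ac = cosine(mean_embedding(doc_a), mean_embedding(doc_c))

print("bag-of-words cosine, a vs b:", bow_ab)
print("embedding cosine,    a vs b:", emb_ab)
print("embedding cosine,    a vs c:", emb_ac)
```

Under these assumed vectors, the bag-of-words similarity between the two vehicle documents is zero because they share no terms, while the embedding-based similarity is high and clearly exceeds the similarity to the unrelated document. Averaging word vectors is only one simple way to build a document vector and does not handle polysemy, which is one motivation for contextual LLM embeddings.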