Academia.edu

Text Clustering

397 papers
131 followers
About this topic
Text clustering is a natural language processing technique that involves grouping a set of documents or text data into clusters based on their similarity, typically using algorithms that analyze the content and structure of the text. This method aids in organizing, summarizing, and retrieving information from large datasets.

Key research themes

1. How do similarity and distance measures impact the effectiveness of partitional text clustering?

This research area investigates how the choice of similarity or distance function affects partitional clustering algorithms such as K-means when they are applied to text documents. Because clustering quality depends heavily on accurately capturing the closeness between documents, the choice of measure (for example cosine similarity, Euclidean distance, the Jaccard coefficient, or Kullback-Leibler divergence) is critical. These measures differ in their mathematical properties and in their suitability for high-dimensional, sparse textual data, and empirical evaluations on diverse datasets help identify best practices for clustering performance.
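The measures named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the formulation used in any of the listed papers: the whitespace tokenizer, the function names, the set-based variant of the Jaccard coefficient, and the use of Jensen-Shannon divergence as one symmetric, averaged form of Kullback-Leibler divergence are all assumptions made for the example.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase, whitespace-tokenize, and count terms (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance over the union of terms (missing terms count as 0)."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


def jaccard_coefficient(a, b):
    """Set-based Jaccard (an assumed variant): shared distinct terms / all distinct terms."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return len(a.keys() & b.keys()) / len(union)


def js_divergence(a, b):
    """Jensen-Shannon divergence: the mean KL divergence of each document's
    term distribution from their midpoint distribution (natural log)."""
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both documents must contain at least one term")
    divergence = 0.0
    for t in a.keys() | b.keys():
        p = a[t] / total_a
        q = b[t] / total_b
        m = 0.5 * (p + q)
        if p > 0:
            divergence += 0.5 * p * math.log(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log(q / m)
    return divergence


if __name__ == "__main__":
    d1 = term_counts("text clustering groups similar text documents")
    d2 = term_counts("document clustering groups related documents")
    for measure in (cosine_similarity, euclidean_distance,
                    jaccard_coefficient, js_divergence):
        print(measure.__name__, round(measure(d1, d2), 4))
```

Cosine similarity and the Jaccard coefficient are similarities (higher means closer), while Euclidean distance and the divergence are dissimilarities (lower means closer), so a clustering routine has to treat the two groups with opposite sign.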

Key finding: This paper empirically compared five commonly used distance/similarity measures (Euclidean distance, cosine similarity, Jaccard coefficient, Pearson correlation coefficient, and averaged Kullback-Leibler divergence) within...
Key finding: The paper surveyed the impact of diverse distance measures on clustering quality across different data types, emphasizing text data characterized by high dimensionality and sparsity. It reinforced that cosine similarity often...
Key finding: Introducing the Condensed Star (ACONS) algorithm that operates on a thresholded similarity graph representation of documents, this work showed improved clustering quality and reduced cluster count compared to Star and...

2. What challenges do short texts pose to clustering algorithms and what methods improve short text clustering effectiveness?

Short text clustering (STC) deals with clustering highly sparse, context-poor, and noisy textual data such as tweets, search queries, and social media posts. Due to limited length, traditional clustering approaches often underperform on short texts. This research theme centers on addressing the representation, similarity measure, dimensionality reduction, and algorithmic adaptations necessary to overcome data sparsity and high dimensionality while preserving semantic coherence in STC. Advances in approaches specifically tailored to short text characteristics are critical for enhancing applications in social media analysis, sentiment detection, and real-time information extraction.
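The sparsity problem described above can be made concrete with a small sketch. Two short texts about the same subject often share no whole words, so word-level overlap is zero; character n-grams are one commonly used representation that recovers some of that overlap. The example texts, function names, and the choice of trigrams are illustrative assumptions, and the technique is not attributed to any specific paper listed here.

```python
def word_set(text):
    """Distinct lowercase word tokens of a text."""
    return set(text.lower().split())


def char_ngrams(text, n=3):
    """Distinct character n-grams of the whitespace-normalized, lowercased text."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(x, y):
    """Shared elements divided by all elements; 0.0 for two empty sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


if __name__ == "__main__":
    a = "clustering tweets"
    b = "tweet clusters"
    # Word level: no token is shared, so the two texts look unrelated.
    print("word overlap:   ", jaccard(word_set(a), word_set(b)))
    # Character trigrams: shared stems such as "clu" and "twe" are picked up.
    print("trigram overlap:", round(jaccard(char_ngrams(a), char_ngrams(b)), 3))
```

Character n-grams enlarge the feature space, so in practice they are usually paired with some form of weighting or dimensionality reduction.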

Key finding: This comprehensive survey identified the intrinsic challenges in short text clustering, such as data sparsity, meaningful feature representation, and noisy, informal language. It emphasized that conventional methods like...
Key finding: Through empirical evaluation of eight clustering algorithms across six high-dimensional text datasets, the study demonstrated that dimensionality reduction techniques, such as Singular Value Decomposition (SVD) and Principal...
Key finding: The Cassiopeia model introduced pre-processing via summarization to reduce dimensionality and sparse data problems prior to clustering. Experiments comparing clustering on full-text versus summarized texts showed that...

3. How can semantic-enriched document representations and modern embedding methods enhance text clustering quality?

This theme explores advances in document representation that go beyond simple bag-of-words and frequency-based vectors to incorporate semantic relations, word embeddings, lexical databases, and representations derived from large language models (LLMs). By capturing synonymy, polysemy, and contextual meanings, semantic-enriched methods aim to improve the discrimination and coherence of clusters. The impact of such advanced embeddings and domain-specific knowledge sources on clustering algorithms like K-means, spectral clustering, and fuzzy clustering is a key focus. The theme also includes leveraging LLM embeddings and hybrid semantic techniques to improve clustering purity and topic separability.
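As a rough illustration of how semantic enrichment can change document similarity, the sketch below adds shared concept features to a plain bag-of-words vector so that synonyms contribute to the overlap. The `TOY_CONCEPTS` dictionary is a hypothetical, hand-made stand-in for a lexical database such as WordNet, and the `enrich` function and its `weight` parameter are assumptions for the example, not the method of any paper listed here.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: maps surface terms to a shared concept label.
# A real system would consult a lexical database or an embedding model instead.
TOY_CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "film": "movie",
    "movie": "movie",
}


def bag_of_words(text):
    """Plain term-count vector from whitespace tokens."""
    return Counter(text.lower().split())


def enrich(bag, concepts=TOY_CONCEPTS, weight=1.0):
    """Return a copy of the bag with an extra 'concept:<label>' feature
    for every term that the lexicon maps to a concept."""
    enriched = Counter(bag)
    for term, count in bag.items():
        concept = concepts.get(term)
        if concept is not None:
            enriched["concept:" + concept] += weight * count
    return enriched


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    d1 = bag_of_words("fast car")
    d2 = bag_of_words("fast automobile")
    print("plain bag-of-words:", round(cosine(d1, d2), 3))
    print("concept-enriched:  ", round(cosine(enrich(d1), enrich(d2)), 3))
```

The enriched vectors can be passed to any clustering algorithm that accepts a similarity or distance function; the `weight` parameter controls how strongly concept matches count relative to exact term matches.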

by Aravind Dupati and 1 more
Key finding: By integrating fuzzy membership values derived from semantic relationships captured by WordNet into TF-IDF weighted document vectors, and applying a regularized K-means clustering with adaptive group LASSO penalties, this...
Key finding: This work utilizes semantic similarity derived from document summaries and lexical preprocessing from NLTK to construct TF-IDF matrices, subsequently clustered via graph-based spectral methods. Applying this approach to movie...
Key finding: This study evaluates embeddings from large language models (LLMs) such as GPT-3.5 Turbo and BERT in combination with traditional clustering algorithms, demonstrating that LLM embeddings capture language subtleties and...
Key finding: This paper proposed a novel hierarchical clustering algorithm incorporating semantic similarity between words calculated via WordNet ontology combined with TF-IDF vectors. By representing documents semantically, the algorithm...

All papers in Text Clustering

Data-mining techniques that detect trends and patterns in structured data are often ill-suited for analysis of unstructured text. Information critical to business, and generated by groups such as employees, customers, and the public, appears...
Tagging aims to address the challenge of searching for relevant text documents given a set of tags. In addition, tag-based approaches have received wide attention as a possible solution to big content. Probabilistic topic model methods,...
The detection and extraction of scene text from document images is a challenging research area. Many researchers have detected and extracted text from a plain text background. But multi-oriented scene text detection is one...
Results of queries by personal names often contain documents related to several people because of the namesake problem. In order to differentiate documents related to different people, an effective method is needed to measure document...
This paper summarizes the goals, organization and results of the first RepLab competitive evaluation campaign for Online Reputation Management Systems (RepLab 2012). RepLab focused on the reputation of companies, and asked participant...
The third WePS (Web People Search) Evaluation campaign took place in 2009-2010 and attracted the participation of 13 research groups from Europe, Asia and North America. Given the top web search results for a person name, two tasks were...
Text mining, also known as knowledge discovery from text, has emerged as a possible solution to the current information explosion; it refers to the process of extracting non-trivial and useful patterns from unstructured text....
by Ranjna Garg and 1 more
Text-based mining is the process of analyzing a document or set of documents to understand the content and meaning of the information they contain. Text mining enhances humans' ability to process massive quantities of information and it...
Fuzzy c-means clustering is one of the common clustering algorithms used in various fields in the literature. Dimensionality reduction, which transforms large data sets into equivalent lower-dimensional data sets with minimal information loss, is a...
This paper presents a new knowledge-based vector space model (VSM) for text clustering. In the new model, semantic relationships between terms (e.g., words or concepts) are included in representing text documents as a set of vectors. The...
With the growing need for area coverage and the increasing use of unmanned vehicles, area coverage with unmanned vehicles, that is, traversing all or part of an area with unmanned vehicles with minimal effort, is rapidly gaining importance...
Clustering constitutes a ubiquitous problem when dealing with huge data sets for data compression, visualization, or preprocessing. Prototype-based neural methods such as neural gas or the self-organizing map offer an intuitive and fast...
In this paper, we present a particle swarm optimizer (PSO) to solve the variable weighting problem in projected clustering of high-dimensional data. Many subspace clustering algorithms fail to yield good cluster quality because they do...
Information is commonly reflected in news articles. However, texts are unstructured and thus demanding to analyze automatically. To identify and capture the facts in a news story we propose a novel approach, which utilizes natural...
by Qin Lu
Web Person Disambiguation (WPD) is often done through clustering of web documents to identify the different namesakes for a given name. This paper presents a clustering algorithm using key phrases as the basic feature. However, key...
Text documents on the internet are increasing tremendously, and hierarchical document clustering has proven useful for grouping similar documents in large applications. Still, most documents suffer from problems of high dimensionality,...
The content of a text is mainly defined by keywords and named entities occurring in it. In particular for news articles, named entities are usually important to define their semantics. However, named entities have ontological features,...
This paper adapts a widespread formalism of Knowledge Representation known in the AI literature as J. Sowa's Conceptual Graphs to the purposes of Content Analysis. It is proposed that instead of nested contexts, negation and modalities...
Obtaining correct and relevant information at the right time in response to a user's query is quite a difficult task. It becomes even more complex if the query terms have many meanings and occur in a variety of domains. This paper presents a...
Textual data sets are generally represented using different models, but some of these do not capture the text structure, while others preserve it. The vector space model is also known as the ‘bag-of-words model’. To...
Textual data sets are generally represented using different models, but these sometimes do not capture the text arrangement as it is. The vector space model is also known as the bag-of-words model. To represent textual documents using...
Today, determining the true number of targets in complex radar target echo signal environments is among the important topics in defense systems and communications. In particular, where multiple target signals are present, this...
This study focuses on the mathematical foundations of the input layers of the Transformer architecture. Input layers play a critical role in transforming raw data into a meaningful form that the model can process. In this context, tokenization,...
In this work we present a new algorithm for hierarchical document clustering and automatic generation of portal sites. The model is inspired by the self-assembling behavior observed in real ants, where ants progressively get attached...
Traditional approaches to document classification require labeled data in order to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available, and often too expensive to obtain, especially for large...
Traditional approaches to document classification require labeled data in order to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available, and often too expensive to obtain, especially for...
In this paper we describe our participation in the Third Web People Search (WePS3) evaluation campaign. We took part in the Online Reputation Management (ORM) task. Ambiguity of organization names (e.g., “Amazon” or “Apple”) raises...
In this paper we describe our participation in the Second Web People Search workshop (WePS2) and detail our approaches. For the clustering task, our focus was on replicating the lessons learned at WePS1 on the data set made available as...
Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the...
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a...
Delivering effective customer service via the Internet requires attention to many aspects of knowledge management if it is to be convenient and satisfying for customers, while at the same time efficient and economical for the company or...
The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of...
In this paper some results of a new text clustering methodology are presented. A prototype is an interesting document or a part of an extracted, interesting text. The given prototype is matched with the existing document database or the...
There is a vast amount of financial information on companies' financial performance available to investors today. While automatic analysis of financial figures is common, it has been difficult to automatically extract meaning from the...
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 2, May 2011, ISSN (Online): 1694-0814, www.IJCSI.org ... Ahmad Zuhdi, Aniati Murni Arymurthy and Heru Suhartanto ... Informatics Engineering Dept., Trisakti...
We discuss our experiences in analyzing customer-support issues from the unstructured free-text fields of technical-support call logs. The identification of frequent issues and their accurate quantification is essential in order to track...
In this study, using CCD UBVRI photometry data of 20 open star clusters observed with the 84 cm telescope at the San Pedro Martir (SPM) National Astronomical Observatory in Mexico, their observational parameters, namely color excesses, metal and heavy element...
Decide Madrid is the civic technology of Madrid City Council which allows users to create and support online petitions. Despite the initial success, the platform is encountering problems with the growth of petition signing because...
Analyzing financial performance in today's information-rich society can be a daunting task. With the evolution of the Internet, access to massive amounts of financial data, typically in the form of financial statements, is widespread....
Text clustering is a text mining technique used to group similar documents into a single cluster by means of a similarity measure, while separating dissimilar documents. Popular clustering algorithms available for text...
With the adoption of deep learning for natural language processing problems, solutions to many problems in this field have improved significantly. Word sequence labeling problems are also frequently ... with deep learning methods...