Academia.edu

Topic Identification

156 papers
3 followers
About this topic
Topic identification is the process of determining and articulating the central theme or subject matter of a research study. It involves analyzing existing literature, understanding research gaps, and formulating specific questions or hypotheses that guide the investigation, ensuring relevance and clarity in the research objectives.

Key research themes

1. How can unsupervised topic modeling techniques identify and track the evolution of scientific ideas and fields over time?

This research area focuses on applying unsupervised probabilistic topic modeling methods, such as Latent Dirichlet Allocation (LDA), to large scientific corpora to analyze the temporal dynamics of research topics and intellectual trends. Understanding how scientific ideas emerge, grow, decline, or shift in prominence over time provides insights into paradigm changes and the structural evolution of academic disciplines. It matters because it offers a data-driven, quantitative complement to traditional historiographic methods, enabling nuanced tracking of thematic diversity and convergence across venues and subfields.

Key finding: Applied LDA to over 12,500 computational linguistics papers from the ACL Anthology spanning 1978-2006, revealing significant historical trends such as the rise of probabilistic methods from 1988 and decline in semantics...
Key finding: Proposed a novel computational approach combining word embeddings with dynamic semantic similarity networks and clustering to detect temporal evolution of topics in large corpora. Demonstrated the ability to model complex...
Key finding: Introduced semantic-LDA, an enhanced topic modeling approach which integrates external ontologies (Probase) to capture word semantics more accurately within the input corpus context by quantifying word-concept relationships...
Key finding: Provided a scientometric analysis of topic modeling research evolution, highlighting LDA's dominance and applications across multiple domains, particularly in large-scale text analysis. It underscored LDA’s theoretical...
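
The temporal tracking described in this theme can be illustrated with a minimal sketch. It assumes a topic model such as LDA has already produced a topic distribution for every document, and it estimates how prominent each topic is in a given year by averaging those distributions over the documents published that year. The function name and the toy numbers are hypothetical and are not taken from any of the papers listed here.

```python
from collections import defaultdict


def topic_prevalence_by_year(doc_years, doc_topic_weights):
    """Average each topic's weight over the documents published in each year.

    doc_years: one publication year per document.
    doc_topic_weights: one list of topic weights per document, e.g. the
        per-document topic distribution produced by a fitted topic model
        (assumed to be computed elsewhere).
    Returns {year: [mean weight of topic 0, mean weight of topic 1, ...]}.
    """
    sums = {}
    counts = defaultdict(int)
    for year, weights in zip(doc_years, doc_topic_weights):
        if year not in sums:
            sums[year] = [0.0] * len(weights)
        for k, w in enumerate(weights):
            sums[year][k] += w
        counts[year] += 1
    # Sort the years so the result reads as a timeline.
    return {
        year: [s / counts[year] for s in sums[year]]
        for year in sorted(sums)
    }


# Toy input: three documents, two topics (illustrative values only).
years = [1990, 1990, 2000]
weights = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
for year, prevalence in topic_prevalence_by_year(years, weights).items():
    print(year, [round(p, 2) for p in prevalence])
```

A topic whose yearly average rises across the timeline is gaining prominence, and one whose average falls is declining.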

2. What methods exist for topic identification in massive, heterogeneous text corpora, and how do they compare in scalability and interpretability?

This research area investigates diverse computational approaches for discovering latent topics in large and diverse textual datasets, emphasizing techniques that differ in scalability, parameter requirements, and interpretability. It includes probabilistic generative models like LDA which require pre-specification of topics, and alternative hashing-based and graph-based algorithms able to handle massive vocabularies and documents without strict prior constraints. Understanding these methods aids in selecting effective solutions for practical large-scale applications such as social media analytics and web corpus organization.

Key finding: Presented Sampled Min-Hashing (SMH), an alternative to LDA for massive corpora topic discovery that obviates the need to predetermine topic number and dramatically reduces computational resource requirements. By generating...
Key finding: Proposed an iterative clustering approach based on consensus matrices combined with semantic enrichment via word embeddings for topic detection in short texts like tweets. This method addresses instability and noise...
Key finding: Applied LDA to Twitter datasets concerning social events in Kenya, utilizing evaluation metrics such as Normalized Mutual Information (NMI) and topic coherence to select optimal models. Demonstrated that LDA effectively...
Key finding: Developed an XCES-compliant corpus portal for Brazilian Portuguese newspaper corpora enabling corpus partitioning based on automatic topic identification using term covariance and multidimensional projections (Projection...
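
The hashing-based methods mentioned in this theme build on min-hashing, which can be sketched briefly. The code below is plain MinHash for estimating the Jaccard similarity of two sets; it is not the Sampled Min-Hashing algorithm itself. It assumes each word is represented by the set of document IDs in which it occurs, so that words with similar signatures tend to co-occur. All names, the seed, and the toy data are hypothetical.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # large prime modulus for the hash family


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Return the MinHash signature of a non-empty collection of items."""
    # crc32 maps each item to a stable integer ID across runs.
    ids = [zlib.crc32(str(x).encode("utf-8")) for x in items]
    return [min((a * i + b) % PRIME for i in ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Toy inverted index: word -> IDs of the documents that contain it.
occurrences = {
    "goal": {1, 2, 3, 5},
    "match": {1, 2, 3, 4},
    "tax": {7, 8, 9},
}
params = make_hash_params(128, seed=42)
signatures = {w: minhash_signature(d, params) for w, d in occurrences.items()}
print(estimate_jaccard(signatures["goal"], signatures["match"]))
print(estimate_jaccard(signatures["goal"], signatures["tax"]))
```

The signature length trades accuracy for memory: more hash functions give a tighter similarity estimate, but each set costs more to store.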

3. How can topic identification facilitate practical applications such as social media analysis, information retrieval, and cyber-security through tailored approaches?

This research area concentrates on leveraging topic identification methods specifically designed or adapted for domains like social media analytics, text classification in cybercrime, and enterprise network security. It involves integrating topic detection with sentiment analysis, classification techniques, and domain-specific preprocessing to extract actionable insights from noisy, multilingual, or domain-specific textual data. These application-driven studies inform the development of targeted computational tools that enhance real-time monitoring, information filtering, or anomaly detection in complex operational environments.

Key finding: Proposed an unsupervised approach combining term ranking, localized language analysis (including informal language like Singlish), multilingual sentiment analysis, and unsupervised clustering to extract relevant topics from...
Key finding: Applied supervised machine learning models combined with feature extractors (TF-IDF, Word2Vec) for automatic classification of cybercrime news articles into types. Found Random Forest with TF-IDF achieved highest accuracy...
Key finding: Utilized Latent Dirichlet Allocation to model patterns of user authentication events from real enterprise network data at Los Alamos National Laboratory. Treated daily authentication logs as documents and destination...
Key finding: Implemented an intelligent topic tracking and infoveillance system combining NLP and graph mining to analyze streams of Italian tweets related to the Russia-Ukraine conflict. This unsupervised graph-based method effectively...
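
Several studies in this theme rely on TF-IDF features before classification. As a rough illustration, the sketch below computes one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency). The listed papers may use a different weighting, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs: list of token lists.
    Returns one {term: weight} dict per document, where
    weight = (term count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus of two tokenized news snippets (illustrative only).
corpus = [
    ["phishing", "email", "bank"],
    ["ransomware", "email", "files"],
]
for vector in tfidf(corpus):
    print(vector)
```

Under this variant a term that appears in every document receives a weight of zero, so only terms that discriminate between documents contribute to the feature vector passed to a classifier.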

All papers in Topic Identification

Abstract: In this article we present a topic identification task based on Bayesian networks. These networks are trained from the semantic concepts that have been labeled for each sentence to be processed and that have been...
Recently, context-awareness has been a hot topic in the ubiquitous computing field. Numerous methods for capturing, representing and inferring context have been developed and relevant projects have been performed. Existing research has...
This paper presents an approach to routing telephone calls automatically, based upon their speech content. Our data consist of a set of calls collected from a customer-service center with a two-level menu, which allows jumping past the...
Knowledge mining is a young and rapidly growing discipline aiming at automatically identifying valuable knowledge in digital documents. This paper presents the results of a study of the application of document retrieval and text mining...
We have developed tools to explore social networks that share information in medical forums to better understand the unmet informational needs of patients and family members facing cancer treatments. We define metrics that demonstrate...
The rapid growth in the number of documents available to various end users from around the world has led to a greatly increased need for machine understanding of their topics, as well as for automatic grouping of related documents. This...
Abstract: The number of cybercrime cases has increased in this country, especially after the pandemic. The nation has created numerous strategic plans, including the introduction of the Malaysia Cyber Security Strategy (MCSS), which...
Recently, the usage of social media websites has become an attractive phenomenon in our daily life. These sites allow their users to communicate with each other through various tools. This results in learning and sharing of valuable...
Educational Data Mining (EDM) is an emerging field that is concerned with mining and exploring the useful patterns in educational data. The main objective of this study is to predict the students' academic performance based on a new...
Nowadays, news broadcast via social media networks is provided almost entirely in a textual format. The broadcast text is considered unstructured text. Text mining techniques play an essential role in converting the...
As the number of internet users increases, online electronic content grows proportionally, irrespective of language. Many research works on English text summarization have come to light to deal with this gigantic body of online...
In this paper, we propose an ontology-based approach that enables the detection of emerging relational conflicts between persons who cooperate on computer-supported projects. In order to detect these conflicts, we analyze, using this...
Taming the Tiger Topic: An XCES Compliant Corpus Portal to Generate Subcorpora Based on Automatic Text-Topic Identification Marcelo Muniz, 1 Fernando V. Paulovich, 1 Rosane Minghim, 1 Kleber Infante, 1 Fernando Muniz, 1 Renata Vieira 2...
The continuous growth of information on the Internet and the availability of a large mass of electronic documents in the Arabic language mean that Natural Language Processing (NLP) tasks play an important role to enhance and facilitate the access...
Clustering short text streams is a challenging task due to its unique properties: infinite length, sparse data representation and cluster evolution. Existing approaches often exploit short text streams in a batch way. However, determine...
Nowadays, the amount of Arabic documents has increased significantly in different domains, such as news articles, emails, business summaries, biomedicine, web sites and social media documents. Some databases have increased in size to...
As dialog systems evolve to handle unconstrained input and for use in open environments, addressee detection (detecting speech to the system versus to other people) becomes an increasingly important challenge. We study a corpus in which...
The CALO Meeting Assistant (MA) provides for distributed meeting capture, annotation, automatic transcription and semantic analysis of multiparty meetings, and is part of the larger CALO personal assistant system. This paper presents the...
Multi-document summarization has had a great impact on the research community ever since the growth of online information and its availability. Selecting the most important sentences from such a huge repository of data is quite tricky and...
We explore the problem of resolving the second person English pronoun you in multi-party dialogue, using a combination of linguistic and visual features. First, we distinguish generic and referential uses, then we classify the referential...
In face-to-face meetings, assigning and agreeing to carry out future actions is a frequent subject of conversation. Work thus far on identifying these action item discussions has focused on extracting them from entire transcripts of...
We present a system for extracting useful information from multi-party meetings and presenting the results to users via a browser. Users can view automatically extracted discussion topics and action items, initially seeing high-level...
Presenting complex information in an understandable manner using speech is a challenging task to do well. Significant limitations, both in the generation process and from the human listeners' capabilities, typically make for poorly...
This paper describes an integrated system that enables the storage and retrieval of meeting transcripts (e.g. staff meetings). The system gives users who have not attended a meeting, or who want to review a particular point, enhanced...
The aim of this study is topic identification using two methods: a new one that we have proposed, the TR-classifier, which is based on computing triggers, and the well-known k Nearest Neighbors. Performances are acceptable,...
Topic Identification is one of the important keys for the success of many applications. Indeed, there are few works in this field concerning the Arabic language because of a lack of standard corpora. In this study, we will provide directly...
In this paper we present two well-known categorization methods and their use in topic identification for Modern Standard Arabic. The first one is the TFIDF approach, and the second is a Support Vector Machines (SVM) based classifier. In...
This paper focuses on studying topic identification for the Arabic language by using two methods. The first method is the well-known kNN (k Nearest Neighbors), which is used as a baseline. The second one is the TR-Classifier, mainly based on...
The Dialog State Tracking Challenge 4 (DSTC 4) proposes several pilot tasks. In this paper, we focus on the spoken language understanding pilot task, which consists of tagging a given utterance with speech acts and semantic slots. We...
The CALO Meeting Assistant is an integrated, multimodal meeting assistant technology that captures speech, gestures, and multimodal data from multiparty interactions during meetings, and uses machine learning and robust discourse...
With the dramatic improvement in automated speech recognition (ASR) accuracy, a variety of machine learning (ML) and natural language processing (NLP) algorithms are designed for human conversation data. Supervised machine learning and...
The exponential growth of online textual data triggered the crucial need for an effective and powerful tool that automatically provides the desired content in a summarized form while preserving core information. In this paper, we propose...
We study the problem of topic segmentation of manually transcribed speech in order to facilitate information extraction from dialogs. Our approach is based on a combination of multi-source knowledge modeled by hidden Markov models. We...
Abstract: Outlier analysis has become a popular topic in the field of data mining but there has been less work on how to detect outliers in web content. Mining Web Content Outliers is used to detect irrelevant web content within a web...
We present a novel lecture browser that utilizes ranked key phrases displayed on a stream graph to overcome the shortcomings of traditional extractive (query-based) summaries. The system extracts key phrases from the ASR transcripts,...
We propose an alternative evaluation metric to Word Error Rate (WER) for the decision audit task of meeting recordings, which exemplifies how to evaluate speech recognition within a legitimate application context. Using machine learning...
The task of automatically detecting the end of a device-directed user request is particularly challenging when switching between short commands and long free-form utterances. While low-latency end-pointing configurations typically lead to...
Given a large hierarchical concept dictionary (thesaurus, or ontology), the task of selection of the concepts that describe the contents of a given document is considered. A statistical method of document indexing driven by such a...
In early 2001 we reported (at the Human Language Technology meeting) the early stages of an ICSI project on processing speech from meetings (in collaboration with other sites, principally SRI, Columbia, and UW). In this paper we report...
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article...
Keywords are used to index data, generate tag clouds or for searching. Alchemy API's keyword extraction API is capable of finding keywords in text and ranking them. This paper addresses the problem of getting the related keywords from...