Academia.edu

Topic Models

598 papers
2,331 followers
About this topic
Topic models are statistical algorithms used in natural language processing and text mining to discover abstract topics within a collection of documents. They analyze word co-occurrence patterns to identify clusters of words that frequently appear together, enabling the extraction of themes and the organization of large text corpora.
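As a rough illustration of the word co-occurrence patterns mentioned above, the following sketch counts how many documents contain each pair of words. It is not a topic model, only the kind of raw statistic such models build on. The function name `cooccurrence_counts` and the toy documents are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents contain each unordered word pair."""
    counts = Counter()
    for text in documents:
        # Use the set of distinct words so a pair counts once per document.
        vocab = sorted(set(text.lower().split()))
        for pair in combinations(vocab, 2):
            counts[pair] += 1
    return counts


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "topic models discover latent topics",
    "latent topics group related words",
    "word embeddings capture related words",
]
pair_counts = cooccurrence_counts(docs)
print(pair_counts.most_common(3))
```

Pairs that co-occur across many documents are the signal a topic model would group into a shared theme.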

Key research themes

1. How can topic models be effectively adapted to represent and analyze short and multimodal texts, such as social media messages and memes?

Short texts (e.g., tweets, microblogs) and multimodal content (e.g., memes combining text and images) present unique challenges to traditional topic modeling approaches due to information sparsity, multimodality, and semantic ambiguity. This theme investigates methods to adapt topic models for better semantic capture, improved interpretability, and accurate topic discovery in such domains, addressing the need for aggregation, semantic augmentation, and multimodal integration.

Key finding: This paper empirically demonstrates that training standard topic models on aggregated Twitter messages (such as aggregated user posts) yields higher quality topics and significantly better classification performance compared...
Key finding: Introduces the Latent Concept Topic Model (LCTM), which incorporates word embeddings and latent concepts as Gaussian distributions in vector space to overcome data sparsity in short texts (e.g., SNS posts). By modeling topics...
Key finding: Proposes PromptMTopic, a novel multimodal prompt-based topic modeling framework that leverages large language models and visual description extraction to jointly model text and image modalities in memes. The model effectively...
Key finding: Assesses neural topic models (NTMs) combined with pretrained word embeddings as superior approaches for extracting coherent and interpretable topics from short social texts compared to traditional topic models. The study...
Key finding: Demonstrates that using word embeddings (Word2Vec) significantly improves topic modeling quality on political-linguistic short-text datasets compared to latent Dirichlet allocation (LDA) alone, particularly when accompanied...
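The message-aggregation idea from the first key finding in this theme can be sketched as follows: short messages are pooled by author into longer pseudo-documents before any topic model is trained. This is a minimal sketch under stated assumptions, not the cited paper's pipeline; the function name `pool_by_author` and the sample messages are hypothetical.

```python
from collections import defaultdict


def pool_by_author(messages):
    """Merge each author's short messages into one pseudo-document."""
    pooled = defaultdict(list)
    for author, text in messages:
        pooled[author].append(text)
    # Join in arrival order so each author yields a single longer text.
    return {author: " ".join(texts) for author, texts in pooled.items()}


# Hypothetical (author, message) pairs standing in for tweets.
messages = [
    ("alice", "new lda paper out"),
    ("bob", "election debate tonight"),
    ("alice", "topic coherence results"),
]
pseudo_docs = pool_by_author(messages)
print(pseudo_docs)
```

The pooled texts would then be passed to a standard topic model in place of the individual messages, which gives the model more word co-occurrences per document to work with.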

2. How can semantic knowledge and external ontologies be integrated into topic models to enhance semantic coherence and disambiguation?

Traditional probabilistic topic models rely largely on word co-occurrence statistics, often ignoring underlying semantic relationships between words and their contextual meanings. This theme explores the integration of semantic resources like ontologies, knowledge bases, and concept mappings into topic modeling frameworks to better capture word meanings, handle ambiguity, and produce more interpretable, semantically coherent topics.

Key finding: Introduces Semantic-LDA, a topic model that incorporates external ontologies (e.g., Probase) by computing word-concept relationship strengths directly from the input text collection rather than fixed ontology-derived weights....
Key finding: Develops OCTVis, a visual analytics framework that maps topics from multiple topic models onto domain ontologies to facilitate qualitative evaluation and interpretability. By aligning topic terms with ontology concepts and...
Key finding: Presents a non-Bayesian framework called Additive Regularization of Topic Models (ARTM) that simplifies modeling by adding domain-specific regularizers to stochastic matrix factorization for topic learning. ARTM enables...

3. What are the methodological advances and limitations of variational inference and dynamic topic modeling approaches in large-scale and temporally evolving corpora?

This theme investigates advanced inference methods (including stochastic gradient variational Bayes) and dynamic topic models designed to capture correlations and temporal evolution in topics across large-scale datasets. It further examines inherent limitations such as instability, non-conjugacy issues, and challenges in topic interpretability over time, aiming to refine modeling techniques for reliable, scalable, and temporally aware topic discovery.

Key finding: Proposes an efficient stochastic gradient variational Bayes (SGVB) inference method for the Correlated Topic Model (CTM), which models correlations among latent topics using a logistic normal prior. The method avoids explicit...
Key finding: Demonstrates that naive mean field variational inference for Latent Dirichlet Allocation (LDA) can produce misleading non-trivial topic decompositions even when the data contain no information about the true topic structure,...
Key finding: Introduces a novel approach combining word embeddings with dynamic network clustering to identify and model temporal evolution of topics in large corpora, circumventing challenges of embedding alignment and stochasticity in...
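For reference, the logistic normal prior named in the first key finding of this theme is commonly written as below. The notation is the usual textbook form and an assumption here, since the truncated summary does not give the paper's own symbols: \eta_d is a Gaussian vector for document d, \mu and \Sigma are its mean and covariance, and \theta_d is the resulting vector of K topic proportions.

```latex
\begin{aligned}
\eta_d &\sim \mathcal{N}(\mu, \Sigma), \\
\theta_{d,k} &= \frac{\exp(\eta_{d,k})}{\sum_{j=1}^{K} \exp(\eta_{d,j})}, \qquad k = 1, \dots, K.
\end{aligned}
```

The off-diagonal entries of \Sigma are what let topics correlate. Because this prior is not conjugate to the multinomial likelihood over words, inference needs approximations, which is the non-conjugacy issue this theme refers to.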

4. How have topic models evolved in research, and what are their applications and methodological trends across different disciplines?

This theme maps the historical development, key methodologies, and cross-disciplinary applications of topic modeling, emphasizing bibliometric and scientometric analyses. It helps researchers understand influential models, predominant research areas, and how topic modeling tools are tailored to various data types and research questions.

Key finding: Provides a comprehensive scientometric analysis of topic modeling literature, revealing the dominance of Latent Dirichlet Allocation (LDA) and its widespread application in short-text domains like social networks and blogs....
Key finding: Delivers a detailed introduction and survey of LDA and its probabilistic foundations, discussing variations like PLSA and extensions including hierarchical and multilingual topic models. Provides insights into model...
Key finding: Proposes combining outputs from multiple topic modeling frameworks (e.g., LDA and doc2vec neural embeddings) to gain richer semantic interpretations beyond single model outputs. The study shows how mapping topic term...
Key finding: Utilizes latent Dirichlet allocation topic models on a corpus of flagship astrobiology journals over five decades to identify thematic researcher communities, revealing how semantic profiles group authors by shared research...

All papers in Topic Models

Understanding thematic trends and user roles is an important challenge in the field of information retrieval. In this paper, we worked on a novel probabilistic model for capturing the evolution of users' interests in terms of content they...
Latent Dirichlet Allocation, also known as LDA, is a generative probabilistic model in the domain of Natural Language Processing (NLP), that is frequently used for the purpose of discovering latent thematic structures that are concealed...
Online trading has become essential to day-to-day life with the rapid growth of the internet and e-commerce. Customers are targeted through online advertisements and must decide which product to purchase among similar ones....
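We ran (Labeled) LDA with word forms as documents and their character (skippy) n-grams as terms, and evaluated the resulting clustering by recasting it as self-supervised classification, which confirmed that it achieves sufficient classification performance. The results show that the assumption that topics are semantic in nature is unnecessary.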
Digitization and computer science have established a whole new set of methods to analyze large collections of texts. One of these methods is particularly promising for economic historians: topic models, statistical algorithms that...
Online reviews have been gaining relevance in hospitality and tourism management and represent an important research avenue for academia. This study illustrates the discrimination between positive and negative reviews based on single word...
This article examines the automatic identification of Web registers, that is, text varieties such as news articles and reviews. Most studies have focused on corpora restricted to include only preselected classes with well-defined...
Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on...
Online Social Networks (OSNs), such as Twitter, offer attractive means of social interactions and communications, but also raise privacy and security issues. The OSNs provide valuable information to marketing and competitiveness based on...
One of those tools, Latent Dirichlet Allocation (LDA), is used in this research to identify the latent topics in the New Testament once it is collated into two corpora: Paul's books and the other books in the New Testament. The purpose is...
We develop and apply unsupervised statistical topic models, in particular Latent Dirichlet Allocation, to identify functional components of source code and study their evolution over multiple project versions. We present results for two...
The incorporation of artificial intelligence (AI) into educational management offers personalized learning, adaptive tutoring, and efficient resource management. However, ethical considerations such as fairness, transparency,...
"Zou Fan" is currently the largest "tree hole" on Weibo, where people having suicidal ideation often express their thoughts and use this channel to seek support. Therefore, early suicide monitoring and timely crisis intervention based on...
Unstructured geo-text annotations volunteered by users of web map services enrich the basic geographic data. However, irrelevant geo-texts can be added to the web map, and these geo-texts reduce utility to users. Therefore, this study...
This study offers a first quantitative approach to poetry written by women in the Spanish Golden Age (Siglo de Oro) by applying computational and stylometric methodologies, following in the wake of works...
This paper discusses approaches that promote the ability of Japanese language learners to carry out bottom-up processing when reading texts. Chunking and reading aloud have been suggested to automate bottom-up processing skills and are...
of paper 1031 presented at the Digital Humanities Conference 2019 (DH2019), Utrecht, the Netherlands, 9-12 July 2019.
Datafication, or the translation of our everyday actions into quantifiable metrics, underwrites a wide set of contemporary discursive practices. With a specific focus on the social media platform Instagram, this paper analyzes mediatized...
Astrobiology is often defined as the study of the origin, evolution, distribution and future of life on Earth and in the Universe and thought of as a discipline. In practice though, the delineation of astrobiology-related research and...
Opposition is a core component of any democracy, yet it is scarcely studied. Leaning on research prescribing blurred lines between government and opposition in parliamentary democracies, we use word embeddings in tandem with sentiment...
This research uses artificial intelligence and manual content-analysis to examine the diffusion of incivility against political leaders on Twitter during the 2022 Italian election campaign. Using a mixed approach (artificial intelligence...
This research paper explores India's speeches in United Nations General Debate (UNGD) sessions to investigate the nation's priority settings and challenges faced during the last five decades. The study adopted Latent Dirichlet Allocation...
Spectral topic modeling algorithms operate on matrices/tensors of word co-occurrence statistics to learn topic-specific word distributions. This approach removes the dependence on the original documents and produces substantial gains in...
This study compares and contrasts the results of two lexical-based methods aimed at identifying content temporal trends in diachronic text corpora. A corpus of end-of-year addresses of the presidents of the Italian Republic constitutes a...
Scatter/Gather systems are increasingly becoming useful in browsing document corpora. Usability of the present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the...
Topic models (TM), as introduced by Blei et al. (2003), are techniques used to infer linguistic themes from the co-occurrence of words in various textual entities, such as documents, paragraphs, sentences, or tweets. Traditionally, the...
A major challenge for many analyses of Wikipedia dynamics (e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion) is grouping the very diverse range of...
This article addresses the need for a comprehensive understanding of the rapidly evolving field of Artificial Intelligence (AI) in education, given its potential to transform teaching and learning practices. The study analyzed 1,234...
This paper introduces a multimodal discussion corpus for the study of head movement and turn-taking patterns in debates. Given that participants either acted alone or in a pair, cooperation and competition and their nonverbal correlates...
The rapid development of AI, particularly in the form of language models such as OpenAI's ChatGPT, is fundamentally transforming the academic and educational landscape. AI tools like ChatGPT play a crucial role in accelerating text...
Most of the relevance feedback algorithms only use document terms as feedback (local features) in order to update the query and re-rank the documents to show to the user. This approach is limited by the terms of those documents without...
Since its launch in 2022, ChatGPT has become a phenomenon. With or without empirical evidence, many people believe it has the potential to replace intellectual workers in many fields. The advent and development of...
Research Problem: Islamic ideas on rulership and governance are disseminated through various textual traditions, among which the mirrors for princes genre holds a prominent place. These texts, serving as manuals for rulers and future...
Generative Artificial Intelligence (GenAI) models such as LLMs, GPTs, and Diffusion Models have recently gained widespread attention from both the research and the industrial communities. This survey explores their application in network...
The study offers a comprehensive analysis of global research on Chatbot and ChatGPT from 2002 to 2023. A rapid research growth has been noted after the year 2017. The research growth, authorship analysis, keyword analysis, citation...
This paper aims to explore the integration of Systematic Inventive Thinking (SIT) methodology with Large Language Models (LLMs) to enhance innovative processes. It seeks to assess how LLMs can support analytical and creative processes in...
Topic modeling (TM) is an unsupervised technique used to recognize hidden or abstract topics in large corpora, extracting meaningful patterns of words (semantics). This paper explores TM within data mining (DM), focusing on challenges and...
The history of ideas emerged as a distinct discipline in the 1940s under the leadership of Arthur Lovejoy, whose foundational principles emphasized meticulous qualitative analysis and the interpretation of selected thinkers' works....
In the face of the unprecedented COVID-19 pandemic, various government-led initiatives and individual actions (e.g., lockdowns, social distancing, and masking) have resulted in diverse pandemic experiences. This study aims to explore...
For more than a year and a half, artificial intelligence has once again captured the imagination of both regular users and researchers worldwide. A pivotal moment was the release of the chatbot called ChatGPT to a broad audience on 30...
With Generative AI's (GenAI) rapid development and the ability to generate sophisticated human-like text, it has evolved as a powerful technology in various domains. However, its application in the education domain was initially met with...
In recent years, the number of digital information sources available on different subjects has grown enormously. Most of the systems that support access to these digital information sources have focused on browsing, searching, and information retrieval tools....
Traditional works in sentiment analysis and aspect rating prediction do not take author preferences and writing style into account during rating prediction of reviews. In this work, we introduce Joint Author Sentiment Topic Model (JAST),...
We present the details of a large-scale user profiling framework that we developed here on Apache Hadoop. We address the problem of extracting and maintaining a very large number of user profiles extracted from large scale...
Topic modelling (TM) is a significant natural language processing (NLP) task and is becoming more popular, especially, in the context of literature synthesis and analysis. Despite the growing volume of studies on the use of and...