Academia.edu

Text Representation

18 papers
0 followers
About this topic
Text representation refers to the methods and techniques used to convert textual data into a structured format suitable for analysis, processing, or machine learning. This includes various approaches such as tokenization, vectorization, and embedding, which facilitate the understanding and manipulation of text by computational models.
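As a rough illustration of the tokenization and vectorization steps mentioned above, the following minimal sketch builds bag-of-words count vectors using only the Python standard library. The function names (`tokenize`, `build_vocabulary`, `vectorize`) and the two sample sentences are hypothetical and chosen only for this example; they do not come from any of the listed papers, and embedding methods are not covered here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Return a bag-of-words count vector for one document.

    Tokens that are not in the vocabulary are ignored.
    """
    counts = Counter(tokenize(text))
    # Dicts preserve insertion order, so columns follow the vocabulary indices.
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical sample documents, used only for illustration.
docs = ["The cat sat on the mat", "The dog sat"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]

print(vocab)    # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each document becomes a fixed-length vector of word counts. Word order is discarded, which is one of the bag-of-words limitations that the research themes below set out to address.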

Key research themes

1. How can semantic and structural information be incorporated into text representations to improve understanding and communication?

This research theme explores methods of representing text beyond simple bag-of-words models by integrating semantic, syntactic, and graphical information to enhance comprehension, disambiguation, and communication effectiveness. Incorporating semantic relations helps bridge gaps stemming from synonymy, polysemy, and structural features of language, which are essential for applications like Augmentative and Alternative Communication (AAC), cognitive modeling, and natural language understanding.

Key finding: This paper demonstrates an effective system translating Dutch text into pictographs using shallow linguistic analysis aligned with lexical-semantic databases (e.g., Cornetto linked to WordNet synsets), which significantly...
Key finding: This work critically evaluates bag-of-words limitations such as synonymy, polysemy, and high dimensionality, and discusses probabilistic topic models (e.g., PLSA, LDA) to represent documents by latent topics to capture...
Key finding: This comprehensive review demonstrates that graph-based text representations (e.g., word co-occurrence networks, conceptual graphs) effectively capture syntactic and semantic relations beyond bag-of-words limitations by...
Key finding: This paper situates text representation within cognitive models of human language understanding, emphasizing that individuals process and represent linguistic information through evolving mental structures constrained by...
Key finding: This study argues that digital text representations must not only capture the physical form but also enable automatic processing by encoding structural and semantic relations, reconciling representation form and text model...

2. What role do dimensionality reduction and embedding methods play in constructing effective low-dimensional text representations?

This theme focuses on how dimensionality reduction techniques, including deep learning autoencoders and embedding models, can produce compact, informative representations of text data. These techniques aim to address the curse of dimensionality inherent in text data, capturing latent semantic structures at various levels (words, sentences, documents) to enhance similarity measurement, classification, and other downstream tasks.

Key finding: The paper presents a detailed evaluation of deep autoencoders (bDA and rsDA) focusing on sentence-level text representations, proposing novel quantitative metrics (SPI, SAI) to assess reconstruction quality and structural...
Key finding: This study empirically validates that monolingual deep learning text representation models (FastText, Flair, BERT) trained on larger, high-quality Basque corpora outperform multilingual counterparts trained on smaller or...
Key finding: This paper introduces a novel symbolic representation of text documents based on clustering term frequency vectors and constructing interval-valued features from means and standard deviations within these clusters. By...

3. How can multimodal and cross-lingual methods be leveraged to generate or expand text representations from other modalities or enhance text understanding?

This research area investigates approaches that integrate information across modalities (e.g., visual to text) and across linguistic levels (e.g., text expansion, normalization, or anaphora resolution) to produce richer, more context-aware textual representations. These methods are crucial for tasks such as image captioning in low-resource languages, text normalization for speech and translation systems, expanding text representations for narratology, and resolving linguistic ambiguities in MT.

Key finding: This paper contributes a large-scale manually-curated Bangla image description dataset (Biboron) and develops two models (Local Attention and Transformer-based multi-head attention) for automated image-to-text generation. The...
Key finding: This study proposes a language-independent text normalization architecture based on statistical machine translation techniques that converts non-standard words (e.g., numbers, abbreviations) into regular forms in...
Key finding: The authors propose a hierarchical text representation framework combining linguistic decomposition (following Harris’s grammar) and question/answer structures that supports progressive text expansion from kernel versions to...
Key finding: This work emphasizes the critical necessity of anaphora resolution for accurate machine translation, especially when language pairs differ in pronoun gender, omission, or reference. By developing an integrated pronominal...

All papers in Text Representation

Psychologists have used tests or carefully designed survey questions, such as Beck's Depression Inventory (BDI), to identify the presence of depression and to assess its severity level. On the other hand, methods for automatic depression...
The increasing use of social media makes it possible to extract valuable information for the early prevention of some risks. One such case is the use of blogs for the early detection of people with signs of depression. In order to address this problem, we...
This work addresses the as yet insufficiently covered problem of Anaphora Resolution in Machine Translation. To start with, the paper discusses the translation of pronominal anaphors, the necessity of anaphora resolution in Machine...
Developments in Information Technology (IT) have encouraged the use of the Igbo language in text creation, online news reporting, online searching, and article publication. As the information stored in text format of this language is...
This paper presents an improved classification model for Igbo text using N-gram and K-Nearest Neighbour approaches. The N-gram model was used for text representation and the classification was carried out on the text using the K-Nearest...
A system for converting Arabic numerals to their textual equivalents is an important tool in Natural Language Processing (NLP), especially in high-level speech processing and machine translation. Such a system is scarcely available for most...
We present a comprehensive study on the use of autoencoders for modelling text data, in which (unlike previous studies) we focus our attention on the following issues: i) we explore the suitability of two different models bDA...
Abstract: In this paper we give an overview of an approach to anaphora resolution that takes a whole variety of different factors into account. These factors concern on the one hand structural information like agreement, proximity,...
Improvements in Information Technology have encouraged the use of Igbo in the creation of online text such as resources and news articles. Text similarity is of great importance in any text-based application. This paper presents a...
Advances in Information Technology (IT) have helped incorporate the three major Nigerian languages into text-based applications such as text mining, information retrieval, and natural language processing. The interest of this paper...
Analysis and modeling of crime text report data has important applications, including refinement of crime classifications, clustering of documents, and feature extraction for spatiotemporal forecasts. Having better neural network...
Sentiment analysis is a challenging and important task that involves natural language processing, web mining, and machine learning. To date, little research has been conducted on sentiment classification for Arabic languages due to...
Text categorization and feature selection are two of the many text data mining problems. In text categorization, a document containing a collection of text is converted to a dataset format, a dataset that consists of features...