Text Representation

About this topic

Text representation refers to the methods and techniques used to convert textual data into a structured format suitable for analysis, processing, or machine learning. This includes various approaches such as tokenization, vectorization, and embedding, which facilitate the understanding and manipulation of text by computational models.
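As an illustration of the tokenization and vectorization steps named above, the following is a minimal bag-of-words sketch using only the Python standard library. The function names (`tokenize`, `build_vocabulary`, `vectorize`) and the two sample sentences are assumptions made for this example and do not come from any of the papers listed here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(text, vocabulary):
    """Return a term-count vector for `text`, one entry per vocabulary token."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


# Hypothetical sample documents, used only to show the data flow.
docs = ["The monkey ate the banana", "The banana was ripe"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]

print(list(vocab))  # column order of the vectors
print(vectors)      # one count vector per document
```

Each document becomes a fixed-length count vector over a shared vocabulary. Embedding methods replace these sparse counts with dense learned vectors, which this sketch does not attempt.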

Key research themes

1. How can semantic and structural information be incorporated into text representations to improve understanding and communication?

This research theme explores methods of representing text beyond simple bag-of-words models by integrating semantic, syntactic, and graphical information to enhance comprehension, disambiguation, and communication effectiveness. Incorporating semantic relations helps bridge gaps stemming from synonymy, polysemy, and structural features of language, which are essential for applications like Augmentative and Alternative Communication (AAC), cognitive modeling, and natural language understanding.

Translating text into pictographs

by Ineke Schuurman

2022, Natural Language Engineering

Key finding: This paper demonstrates an effective system translating Dutch text into pictographs using shallow linguistic analysis aligned with lexical-semantic databases (e.g., Cornetto linked to WordNet synsets), which significantly...

Constructing inferences and relations during text comprehension

by Peter Hastings

2024, Human Cognitive Processing

Key finding: This work critically evaluates bag-of-words limitations such as synonymy, polysemy, and high dimensionality, and discusses probabilistic topic models (e.g., PLSA, LDA) to represent documents by latent topics to capture...

Graph-Based Text Representation and Matching: A Review of the State of the Art and Future Challenges

by Ahmed Hamza Osman

2022, IEEE Access

Key finding: This comprehensive review demonstrates that graph-based text representations (e.g., word co-occurrence networks, conceptual graphs) effectively capture syntactic and semantic relations beyond bag-of-words limitations by...

Cognitive psychology and text processing: From text representation to text-world

by Guy Denhiere

2015

Key finding: This paper situates text representation within cognitive models of human language understanding, emphasizing that individuals process and represent linguistic information through evolving mental structures constrained by...

Textual data representation

by Mikaela Keller

2024

Key finding: This study argues that digital text representations must not only capture the physical form but also enable automatic processing by encoding structural and semantic relations, reconciling representation form and text model...

2. What role do dimensionality reduction and embedding methods play in constructing effective low-dimensional text representations?

This theme focuses on how dimensionality reduction techniques, including deep learning autoencoders and embedding models, can produce compact, informative representations of text data. These techniques aim to address the curse of dimensionality inherent in text data, capturing latent semantic structures at various levels (words, sentences, documents) to enhance similarity measurement, classification, and other downstream tasks.

Squeezing bottlenecks: Exploring the limits of autoencoder semantic representation capabilities

by Parth Gupta

2023, Neurocomputing

Key finding: The paper presents a detailed evaluation of deep autoencoders (bDA and rsDA) focusing on sentence-level text representations, proposing novel quantitative metrics (SPI, SAI) to assess reconstruction quality and structural...

Give your Text Representation Models some Love: the Case for Basque

by Xabier Saralegi

2022

Key finding: This study empirically validates that monolingual deep learning text representation models (FastText, Flair, BERT) trained on larger, high-quality Basque corpora outperform multilingual counterparts trained on smaller or...

Classifying Text Documents using Unconventional Representation

by Manjunath Shantharamu

2015

Key finding: This paper introduces a novel symbolic representation of text documents based on clustering term frequency vectors and constructing interval-valued features from means and standard deviations within these clusters. By...

3. How can multimodal and cross-lingual methods be leveraged to generate or expand text representations from other modalities or enhance text understanding?

This research area investigates approaches that integrate information across modalities (e.g., visual to text) and across linguistic levels (e.g., text expansion, normalization, or anaphora resolution) to produce richer, more context-aware textual representations. These methods are crucial for tasks such as image captioning in low-resource languages, text normalization for speech and translation systems, expanding text representations for narratology, and resolving linguistic ambiguities in MT.

Transformation of Visual Information into Bangla Textual Representation

by Md Humaion Kabir Mehedi

2023, 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC)

Key finding: This paper contributes a large-scale manually-curated Bangla image description dataset (Biboron) and develops two models (Local Attention and Transformer-based multi-head attention) for automated image-to-text generation. The...

Architecture for text normalization using statistical machine translation techniques

by Juan M Montero

2025

Key finding: This study proposes a language-independent text normalization architecture based on statistical machine translation techniques that converts non-standard words (e.g., numbers, abbreviations) into regular forms in...

Towards a Text Representation for Text Expanding

by Jacques Virbel

2014

Key finding: The authors propose a hierarchical text representation framework combining linguistic decomposition (following Harris’s grammar) and question/answer structures that supports progressive text expansion from kernel versions to...

Anaphora Resolution in Machine Translation

by Carla Umbach

2023

Key finding: This work emphasizes the critical necessity of anaphora resolution for accurate machine translation, especially when language pairs differ in pronoun gender, omission, or reference. By developing an integrated pronominal...

All papers in Text Representation

Constructing inferences and relations during text comprehension

by Peter Hastings

2024, Human Cognitive Processing

Towards Measuring the Severity of Depression in Social Media via Text Classification

by Marcelo Luis Errecalde

2024, XXV Congreso Argentino de Ciencias de la Computación (CACIC) (Universidad Nacional de Río Cuarto, Córdoba, 14 al 18 de octubre de 2019)

Psychologists have used tests or carefully designed survey questions, such as Beck's Depression Inventory (BDI), to identify the presence of depression and to assess its severity level. On the other hand, methods for automatic depression...

k-TVT: a flexible and effective method for early depression detection

by Marcelo Luis Errecalde

2024

The increasing use of social media allows the extraction of valuable information for the early prevention of some risks. Such is the case of the use of blogs for the early detection of people with signs of depression. In order to address this problem, we...

k-TVT: a flexible and effective method for early depression detection

by Maria Paula Villegas

2023

Anaphora Resolution in Machine Translation

by Carla Umbach

2023

This work addresses the as yet insufficiently covered problem of Anaphora Resolution in Machine Translation. To start with, the paper discusses the translation of pronominal anaphors, the necessity of anaphora resolution in Machine Translation (MT) and previous work in the field. Next, it briefly presents the MT system CAT2 and reports on an anaphora resolution model developed as an extension to the system. Finally, some implementation issues of the model are outlined and its further improvement is discussed.

Furthermore, the translation of the predicates connected with the pronoun (verbs, nouns etc.) may change according to different antecedents. Anaphora resolution reflects two essential topics in Machine Translation: ambiguity in an MT context and translation of discourse instead of isolated sentences. Anaphora can be viewed as a sort of ambiguity, in that the antecedent of a given pronoun might be uncertain and referential relations are one of the means that constitute coherence of texts.

2.1 Translation of Pronominal Anaphors

In the majority of language pairs and cases the pronouns in the source language are translated by target language pronouns which correspond to the referent of the anaphor. However, there are various exceptions. In some languages the pronoun is translated directly by its referent. In English to Malay translation for instance, there is a tendency to replace 'it' with its referent. Replacing a pronominal anaphor with its referent means, however, that the translator (program) must be able to identify the referent first. Very often pronominal anaphors are simply omitted in the target language. For example, though the English personal pronouns have their equivalents in Spanish, they are frequently not translated because of the typical Spanish elliptical zero-subject constructions. Another interesting example is English-to-Korean translation. The English pronouns can be omitted elliptically, translated by a definite noun phrase, by their referent, or by one or two possible Korean pronouns, depending on the syntactic information and semantic class of the noun the anaphor refers to ([Mitkov et al. 94]).

2.2 The Necessity of Anaphora Resolution in Machine Translation

Whereas in most European language pairs anaphora resolution is "compulsory" (or else we risk rendering in certain cases quite unacceptable translations), there are certain language pairs and cases where anaphora resolution may seem "optional". Consider the following sentences [Hutchins & Somers 92]:

(1) The monkey ate the banana because it was hungry.

(2) The monkey ate the banana because it was ripe.

(3) The monkey ate the banana because it was teatime.

In each case the pronoun it refers to something different: in (1) the monkey, in (2) the banana and in (3) the abstract notion of time. If we have to translate the above sentences into German, then anaphora resolution is inevitable, since the pronouns take the gender of their antecedents, and since the German words Affe (masculine, "monkey"), Banane (feminine, "banana") and es (neuter, "it" for time notion) are of different gender. Consider the translation of the sentences (1)-(3) from English to Korean and their literal descriptions [Mitkov et al. 94]:

(1') Pae.go.pa.so won.sung.yi.nun pa.na.na.rul mo.got.ta.
hungry-CAUSAL monkey-NOM banana-ACC eat-PAST,DECL

(2') I.go.so won.sung.yi.nun pa.na.na.rul mo.got.ta.
ripe-CAUSAL monkey-NOM banana-ACC eat-PAST,DECL

An Efficient Feature Selection Model for IGBO Text

by Ifeanyi-reuben Nkechi Jacinta

2023, International Journal of Data Mining & Knowledge Management Process

The development of Information Technology (IT) has encouraged the use of the Igbo language in text creation, online news reporting, online searching and article publication. As the information stored in text format of this language is...

An Improved Classification Model for Igbo Text Using N-Gram And K-Nearest Neighbour Approaches

by Ifeanyi-reuben Nkechi Jacinta

2023, ArXiv

This paper presents an improved classification model for Igbo text using N-gram and K-Nearest Neighbour approaches. The N-gram model was used for text representation and the classification was carried out on the text using the K-Nearest...

collections, Text pre-processing, Text representation and Similarity measurement.

The Euclidean distance metric measures the distance between points in multidimensional data in real applications. Fig. 4 illustrates the Euclidean distance between two points. Assume the two points are (X1, Y1) and (X2, Y2) with the following feature frequencies:

Fig 5: Igbo Similarity Result on Unigram Text Chart

Table 2. Bigram Representation of Sample Igbo Text
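To make the n-gram representation and the Euclidean distance comparison described in these excerpts concrete, here is a minimal sketch in standard-library Python. It is not the authors' implementation. The helper names (`ngrams`, `ngram_counts`, `euclidean_distance`) are hypothetical, whitespace tokenization is an assumption, and English placeholder sentences stand in for the Igbo sample text, which is not reproduced on this page.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-gram frequencies after lowercasing and whitespace splitting."""
    return Counter(ngrams(text.lower().split(), n))


def euclidean_distance(counts_a, counts_b):
    """Euclidean distance between two n-gram frequency profiles.

    N-grams missing from one profile count as frequency 0 there
    (a Counter returns 0 for absent keys).
    """
    keys = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a[k] - counts_b[k]) ** 2 for k in keys))


# Placeholder documents (assumed for illustration only).
doc_a = "the monkey ate the banana"
doc_b = "the monkey ate the apple"

unigram_dist = euclidean_distance(ngram_counts(doc_a, 1), ngram_counts(doc_b, 1))
bigram_dist = euclidean_distance(ngram_counts(doc_a, 2), ngram_counts(doc_b, 2))
print(unigram_dist, bigram_dist)
```

A smaller distance means the two frequency profiles are more alike, and identical documents give a distance of 0. A nearest-neighbour classifier would rank training documents by this distance, a step this sketch omits.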

Computational Analysis of Igbo Numerals in a Number-to-text Conversion System

by Uchenna Benjamin

2023, Journal of Computer and Education Research

A system for converting Arabic numerals to their textual equivalents is an important tool in Natural Language Processing (NLP), especially in high-level speech processing and machine translation. Such a system is scarcely available for most...

Squeezing bottlenecks: Exploring the limits of autoencoder semantic representation capabilities

by Parth Gupta

2023, Neurocomputing

We present a comprehensive study on the use of autoencoders for modelling text data, in which (differently from previous studies) we focus our attention on the following issues: i) we explore the suitability of two different models bDA...

Anaphora resolution in machine translation

by Carla Umbach

2022

Abstract: In this paper we give an overview of an approach to anaphora resolution that takes a whole variety of different factors into account. These factors concern on the one hand structural information like agreement, proximity,...

Computational Analysis of Igbo Numerals in a Number-to-text Conversion System

by Abimbola Iyanda

2022, Journal of Computer and Education Research

Anaphora resolution in machine translation

by Carla Umbach

2022

Computational Analysis of Igbo Numerals in a Number-to-text Conversion System

by Nwachi Uchenna Benjamin

2022, Journal of Computer and Education Research

numeral system were used as the metrics. The analysis is shown in Tables 6 and 7.

Figure 3. Result of number 25 for desktop application.

Figure 4. Result of Number 574 for Android Application

Table 1. Traditional and Decimalized Igbo Number representation of Hindu-Arabic Numerals

Table 6. Analysis of the description of the respondents

Table 7. Data of the knowledge versus score

Although this work has provided an assessment of the Igbo numeral system, however,

An Efficient Feature Selection Model for IGBO Text

by Mrs. Nkechi Ifeanyi-Reuben

2022, International Journal of Data Mining & Knowledge Management Process

Comparative Analysis of N-gram Text Representation on Igbo Text Document Similarity

by Dr Ugwu Chidiebere

2022

The improvement in Information Technology has encouraged the use of Igbo in the creation of text such as resources and news articles online. Text similarity is of great importance in any text-based application. This paper presents a...

Analysis and representation of Igbo text document for a text-based system

by Dr Ugwu Chidiebere

2022, arXiv: Computation and Language

The advancement of Information Technology (IT) has helped to incorporate the three major Nigerian languages into text-based applications such as text mining, information retrieval and natural language processing. The interest of this paper...

Analysis and Representation of Igbo Text Document for a Text-Based System

by Dr Ugwu Chidiebere

2022, International Journal of Data Mining Techniques and Applications

An Efficient Feature Selection Model for IGBO Text

by MERCY BENSON-EMENIKE

2021, International Journal of Data Mining & Knowledge Management Process

An Efficient Feature Selection Model for Igbo Text

by MERCY BENSON-EMENIKE

2021

Assessing GAN-based approaches for generative modeling of crime text reports

by Samira Khorshidi

2021, IEEE

Analysis and modeling of crime text report data has important applications, including refinement of crime classifications, clustering of documents, and feature extraction for spatiotemporal forecasts. Having better neural network...

INCOMPLETE WORDS AND THEIR RESTORED VERSION s approached v and asked v to fight s hit v multiple times v defended himself and punched s. s fled location southbound on kingsley from melrose.

Fig. 1. Intertopic Distance Map. We reduced the dimension via Jensen-Shannon divergence and Principal Coordinate Analysis (PCoA). Top: LDA; bottom: MalletLDA. Each bubble represents a topic. The larger the bubble, the more prevalent that topic is.

COHERENCE SCORE OF LDA AND MALLET’S LDA ON REAL DATA. HIGHER COHERENCE SCORE MEANS BETTER PERFORMANCE.

METHOD APPLICATION COMPARISON USING NEGATIVE LOG LIKELIHOOD AND EMBEDDING SIMILARITY TABLE V

Analysis and Representation of Igbo Text Document for a Text-Based System

by ijdmta iir

2020, Integrated Intelligent Research

Fig. 2. Igbo Text Unicode Decoding and Encoding

The Unicode model was used for extracting and processing Igbo texts from file because Igbo, unlike English, is one of the languages that employ non-ASCII character sets. Unicode supports many character sets. Each character in a set is given a number called a code point. This enabled us to manipulate the Igbo text loaded from a file just like any other normal text. When Unicode characters are stored in a file, they are encoded as a stream of bytes. Some encodings support only a small subset of Unicode. Processing Igbo text needs UTF-8 encoding. UTF-8 uses multiple bytes and can represent the complete collection of Unicode characters. This is achieved with the mechanisms of decoding and encoding. Decoding translates text stored in a file in a particular encoding, such as Igbo text written with Igbo character sets, into Unicode, while encoding converts Unicode into an appropriate encoding and writes it to a file [12]. We achieved this with the help of a Python program and other Natural Language Processing tools. The mechanism is illustrated in Fig. 2.
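The encode/decode round trip described in this excerpt can be sketched with Python's built-in string methods. This is an illustration only, not the paper's program. The sample string is an assumption chosen because it contains one non-ASCII letter used in Igbo orthography (U+1ECB, "i with dot below"), written as an escape sequence so that the code point is unambiguous.

```python
import os
import tempfile

# Assumed sample text: "Nd\u1ecb Igbo" contains one non-ASCII letter (U+1ECB).
text = "Nd\u1ecb Igbo"

# Encoding: Unicode code points -> bytes. The dotted letter takes 3 bytes in UTF-8.
encoded = text.encode("utf-8")

# Decoding: bytes -> Unicode code points.
decoded = encoded.decode("utf-8")

# ASCII covers only a small subset of Unicode, so encoding this text fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Round trip through a file, stating the encoding explicitly on write and read.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(len(text), len(encoded), ascii_ok, round_trip == text)
```

Passing `encoding="utf-8"` explicitly to `open` avoids depending on the platform's default encoding, which may not be UTF-8.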

A list is created for these data. The normalization process, shown in Fig. 3, is performed before text tokenization. An algorithm designed for the text normalization task is given in Algorithm 1.

Table 4: Igbo TextDoc1 N-Gram Key Features

From our observation and analysis of the results obtained, semantically, the unigram model (BOW) is not an ideal model for representing Igbo textual documents in any text-based

Ensemble of Classification Algorithms for Subjectivity and Sentiment Analysis of Arabic Customers' Reviews

by Nazlia Omar

2016

Sentiment analysis is a very challenging and important task that involves natural language processing, web mining and machine learning. To date, little research has been conducted on sentiment classification for Arabic languages due to...

Filter-Wrapper Approach to Feature Selection Using PSO-GA for Arabic Document Classification with Naive Bayes Multinomial

by IOSR Journals

2016

Text categorization and feature selection are two of the many text data mining problems. In text categorization, a document that contains a collection of text is converted to a dataset format, a dataset that consists of features...