Academia.edu

Text Indexing

449 papers
19 followers
About this topic
Text indexing is the process of organizing and storing textual data in a structured format to facilitate efficient retrieval and searching. It involves creating an index that maps keywords or phrases to their locations within the text, enhancing the speed and accuracy of information retrieval in databases and search engines.
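As a rough illustration of the keyword-to-location mapping described above, the following minimal sketch builds a positional inverted index in Python. The function names `build_index` and `search`, the tokenization rule, and the sample documents are assumptions made for this example only; production indexes add compression, ranking, and far more careful tokenization.

```python
from collections import defaultdict
import re

def build_index(docs):
    """Map each lowercased word to a list of (doc_id, word_position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Simplistic tokenization (an assumption): runs of word characters.
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((doc_id, pos))
    return index

def search(index, word):
    """Return the sorted ids of the documents that contain the word."""
    return sorted({doc_id for doc_id, _ in index.get(word.lower(), [])})

docs = [
    "Suffix arrays support substring queries",
    "Inverted indexes map words to documents",
    "Compressed indexes trade space for time",
]
index = build_index(docs)
print(search(index, "indexes"))  # [1, 2]
print(index["suffix"])           # [(0, 0)]
```

Storing positions as well as document ids is what allows phrase and proximity queries later, at the cost of a larger index.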

Key research themes

1. How can semantic and concept-based methods improve text indexing compared to traditional keyword-based approaches?

This research area focuses on enhancing the indexing and retrieval of text documents by moving beyond simple keyword matching to semantic-aware methods. These approaches leverage linguistic resources, concept identification, and semantic similarity measures to better capture the inherent meaning and context in documents. The goal is to address challenges posed by synonymy, polysemy, and lexical ambiguities that limit keyword-based indexing.

Key finding: Proposes a novel semantic indexing method that integrates WordNet and WordNetDomains to identify and weight concepts rather than mere keywords. It introduces a new concept centrality weighting scheme that outperforms... Read more
by Jaesung Lee and 1 more
Key finding: Introduces an automatic indexing approach based on Latent Dirichlet Allocation (LDA) for capturing latent topics in documents, overcoming limitations of traditional vector space models that ignore conceptual meaning. It... Read more
Key finding: Develops DigiDoc MetaEdit, a semi-automatic indexing tool exploiting embedded semantic metadata in HTML and a controlled thesaurus. The approach balances term frequency and semantic relevance, achieving about 50% overlap with... Read more
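To make the contrast between keyword and concept indexing concrete, here is a toy Python sketch in which terms are mapped to concept labels before indexing, so that synonyms retrieve the same documents. The `THESAURUS` table is a hypothetical hand-made stand-in for a resource such as WordNet or a controlled thesaurus; it is not the method of any paper listed here, and it does nothing about polysemy or concept weighting.

```python
from collections import defaultdict

# Hypothetical hand-made thesaurus (an assumption): surface term -> concept label.
THESAURUS = {
    "car": "VEHICLE", "automobile": "VEHICLE", "truck": "VEHICLE",
    "physician": "DOCTOR", "doctor": "DOCTOR",
}

def concept_of(term):
    """Map a term to its concept, falling back to the term itself."""
    return THESAURUS.get(term, term)

def build_concept_index(docs):
    """Map each concept to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[concept_of(term)].add(doc_id)
    return index

def search(index, term):
    """Look a query term up by its concept rather than its surface form."""
    return sorted(index.get(concept_of(term.lower()), set()))

docs = ["the automobile was parked", "a physician arrived", "the car left"]
index = build_concept_index(docs)
print(search(index, "car"))     # [0, 2]
print(search(index, "doctor"))  # [1]
```

A plain keyword index would return only document 2 for "car"; the concept layer is what recovers document 0.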

2. What indexing structures and algorithms enable efficient document retrieval at scale, especially using inverted and cluster-based indexes?

This theme addresses the design, optimization, and application of indexing data structures such as inverted indexes and clustering-enhanced variants to enable fast and scalable document retrieval. The focus lies on supporting varied query types including word-based, substring, and complex queries over large text corpora, while balancing time and space efficiency. Emerging data structures like wavelet trees and clustering algorithms improve indexing precision and retrieval speed.

Key finding: Presents a comprehensive review of word-based inverted indexes and full-text indexes, emphasizing their respective strengths and limitations. Highlights the wavelet tree data structure as an emerging method for compressing... Read more
Key finding: Introduces a novel cluster-based inverted index that integrates piecewise fuzzy C-means clustering with classical inverted indexing, using Bhattacharyya distance for query matching and Pearson correlation for query... Read more
Key finding: Proposes applying compressed self-indexes, previously developed for arbitrary strings, to sequences of words in natural language text. Demonstrates that such self-indexes can occupy space close to the best word-based... Read more
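Word-based inverted indexes cannot answer arbitrary substring queries, which is where full-text indexes come in. The sketch below shows one of the simplest, a suffix array queried by binary search. The construction is deliberately naive (it sorts whole suffixes, which is far from the linear-time algorithms in the literature), and the names `build_suffix_array` and `find_all` are assumptions for this example.

```python
def build_suffix_array(text):
    """Sort suffix start positions lexicographically (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, pattern):
    """Return the sorted start positions of every occurrence of pattern."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
print(sa)                         # [5, 3, 1, 0, 4, 2]
print(find_all(text, sa, "ana"))  # [1, 3]
```

Each query costs O(m log n) character comparisons here; the compressed self-indexes mentioned above aim for similar functionality in much less space.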

3. How can indexing approaches handle uncertainty and variability in texts, such as weighted sequences or approximate matching?

The focus here is on indexing methods that accommodate uncertain, imprecise, or approximate text representations. This includes weighted sequences where each position represents probabilistic letter distributions, and approximate dictionary matching where exact matches are relaxed to allow errors or mismatches. Such methods must balance indexing size, preprocessing time, and query performance especially in applications like bioinformatics and noisy data retrieval.

Key finding: Develops an O(nz)-time and O(nz)-space indexing structure for weighted sequences with optimal query time O(m + Occ). The novel data structure uses a family of special strings encoding all probable occurrences above a... Read more
Key finding: Proposes a practically efficient split index leveraging the Dirichlet principle for dictionary matching allowing few mismatches (especially one). Experimental evaluations reveal microsecond-level query times and effective... Read more
Key finding: Presents an approach relying on extensive computational lexicons and explicit linguistic knowledge for term and phrase extraction and normalization in highly inflectional languages. This reduces errors in indexing due to... Read more
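The Dirichlet (pigeonhole) principle behind split indexes can be shown in a few lines: if two words of equal length differ in at most one position and each is cut into two pieces at the same point, at least one piece must match exactly. The Python sketch below uses that observation for dictionary matching with at most one mismatch. It is a simplified illustration under these assumptions (Hamming distance only, two pieces, hypothetical function names), not the data structure of the cited paper.

```python
from collections import defaultdict

def halves(word):
    """Split a word into a left and a right piece at the midpoint."""
    mid = len(word) // 2
    return word[:mid], word[mid:]

def build_split_index(dictionary):
    """Index every word under (length, side, piece) for both of its pieces."""
    index = defaultdict(set)
    for word in dictionary:
        left, right = halves(word)
        index[(len(word), "L", left)].add(word)
        index[(len(word), "R", right)].add(word)
    return index

def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def match_one_mismatch(index, query):
    """Return dictionary words of the same length with at most one mismatch."""
    left, right = halves(query)
    n = len(query)
    # Pigeonhole: one mismatch can spoil only one piece, so the other matches exactly.
    candidates = index.get((n, "L", left), set()) | index.get((n, "R", right), set())
    return sorted(w for w in candidates if hamming(w, query) <= 1)

index = build_split_index(["index", "inlex", "under", "indent", "imdex"])
print(match_one_mismatch(index, "index"))  # ['imdex', 'index', 'inlex']
print(match_one_mismatch(index, "andex"))  # ['index']
```

Only words sharing an exact piece with the query are verified, which keeps the candidate set small compared with scanning the whole dictionary.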

All papers in Text Indexing

The recent results of the research in the construction of the electronic dictionary of Serbo-Croatian are presented. This research involves the development of methodological and theoretical principles for the construction of the lexicon... more
In this paper we describe the resources and tools for the processing of texts written in Serbian. Most of the resources have been developed within the University of Belgrade NLP group located at the Faculty of Mathematics. The main... more
We present an optimal adaptive algorithm for context queries in tagged content. The queries consist of locating instances of a tag within a context specified by the query using patterns with preorder, ancestor-descendant and proximity... more
Digital games, as a popular technology in youth entertainment, constitute a fast-growing field which has been affecting various aspects of education for several years now. The research project “Lexipaignio” focuses on the development of... more
A major engineering challenge in statistical machine translation systems is the efficient representation of extremely large translation rulesets. In phrase-based models, this problem can be addressed by storing the training data in memory... more
The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a... more
We introduce a new method for conducting an exact search in a uni- and bidirectional FM index in $\mathcal{O}(1)$ time per step while using $\mathcal{O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)$ bits of space. This is... more
For [Formula: see text], define [Formula: see text] as the set of integers [Formula: see text]. Given an integer [Formula: see text] and a string [Formula: see text] of length [Formula: see text] over [Formula: see text], we count the... more
Suffix trees and suffix arrays are two of the most widely used data structures for text indexing. Each uses linear space and can be constructed in linear time for polynomially sized alphabets. However, when it comes to answering queries... more
We introduce the first index that can be built in o(n) time for a text of length n, and can also be queried in o(q) time for a pattern of length q. On an alphabet of size σ, our index uses O(n √ log n log σ) bits, is built in O(n((log log... more
We propose a new method of extracting texts related to a given keyword from Web pages collected by a search engine. By combining structural pattern matching and text classification, texts related to a given keyword such as reputations of... more
The SPIRIT search engine provides a test bed for the development of web search technology that is specialised for access to geographical information. Major components include the user interface, geographical ontology, maintenance and... more
Stream computing research is moving from terascale to petascale levels. It aims to rapidly analyze data as it streams in from many sources and make decisions with high speed and accuracy in fields as diverse as security surveillance and... more
The YADDA framework facilitates information exchange between digital document repositories. YaddaWeb, its web-based interface, provides browse and search functionalities. Content providers use the DeskLight application to add or modify metadata... more
YADDA2 is an open software platform which facilitates creation of digital library applications. It consists of versatile building blocks providing, among others: storage, relational and full-text indexing, process management, and... more
In this paper we investigate some properties and algorithms related to a text sparsification technique based on the identification of local maxima in the given string. As the number of local maxima depends on the order assigned to the... more
In theory, speech recognition technology can make any spoken words in video or audio media usable for text indexing, search and retrieval. This article describes the News-on-Demand application created within the Informedia™ Digital... more
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and... more
In this article we present the MIRTO platform-under development at the University Stendhal of Grenoble-and how it addresses common flaws of CALL software. This platform led to another project: the creation of a pedagogically indexed text... more
This communication presents our project for a pedagogically indexed text base. After introducing the notion of pedagogical indexation, which needs to be articulated around teachers' needs, we explain to what extent existing... more
Defining pedagogical indexation of texts for language learning as an indexation allowing users to query for texts in order to use them in language teaching requires taking into account the influence of the properties of the teaching... more
In today's world, e-learning is one of the popular modes of learning, and video lectures are more prominent in keeping learners engaged with a course. The Internet has made it possible to keep a large number of video lectures online. To search for a required... more
A qualitative, descriptive-exploratory study in which 13 nursing team professionals were interviewed using semi-structured interviews, complemented by systematic non-participant observations.... more
The NZDL aims to impose structure on anarchic and uncataloged repositories of information, providing information consumers with effective tools to locate and peruse what they need. Our goal is to produce an easy-to-use digital library... more
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome... more
Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal... more
Prior work inspired by compression algorithms has described how the Burrows-Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and... more
There seems to be no limit to the amount of data we need to store on our computers and send to our friends and colleagues. For this purpose people tend to store a lot of files inside their storage. When the storage... more
Consider an input text string T ≡ T[1, N] drawn from an unbounded alphabet, so text positions can be accessed using comparisons. We study partial computation in suffix-based problems for Data Compression and Text Indexing such as •... more
Data sets are growing rapidly and there is an attendant need for tools that facilitate human analysis of them in a timely manner. To help meet this need, column-oriented databases (or "column stores") have come into wide use because of... more
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multistring generalization of the Burrows-Wheeler Transform (BWT): large requirements of... more
Indexing of static and dynamic sets is fundamental to a large set of applications such as information retrieval and caching. Denoting the characteristic vector of the set by B, we consider the problem of encoding sets and multisets to... more
We consider the range mode problem where given a sequence and a query range in it, we want to find items with maximum frequency in the range. We give time- and space-efficient algorithms for this problem. Our algorithms are efficient for... more
This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly. In this paper we consider matches in... more
On-demand string sorting is the problem of preprocessing a set of strings to allow subsequent queries for finding the k lexicographically smallest strings (and afterward the next k, and so on). This on-demand variant strongly resembles the... more
In this chapter we discuss various bitmap index technologies for efficient query processing in data warehousing applications. We review the existing literature and organize the technology into three categories, namely bitmap encoding,... more
Bitmap indexes are known as the most effective indexing methods for range queries on append-only data, especially for low cardinality attributes. Recently, bitmap indexes were also shown to be just as effective for high cardinality... more
This article presents a survey of techniques for ranking results in search engines, with emphasis on link-based ranking methods and the PageRank algorithm. The problem of selecting, in relation to a user search query, the most relevant... more
A Scalable Distributed Data Structure (SDDS) allows a large scalable file to be stored over distributed RAM. The file scales up transparently for the application over the nodes of a multicomputer, e.g., a local network of PCs. The prototype... more
In this paper, we present simple and area-efficient VLSI architectures for Huffman coding, an industrial standard proposed by MPEG, JPEG, and others. We use a memory of size O(n log n) bits to store a Huffman code tree, where n is the number... more
These working notes describe the runs and results obtained by the LIG at ImageCLEFphoto 2008. The submitted runs are: two runs (text only and text+image) without diversification on classes, and two runs (text only and text+image) with... more
Many ancient manuscripts, a cultural heritage of inestimable value, can still be found in Yogyakarta. If these manuscripts can be converted into digital format, many benefits can be gained.... more
The run-length compressed Burrows-Wheeler transform (RLBWT) used in conjunction with the backward search introduced in the FM index is the centerpiece of most compressed indexes working on highly-repetitive data sets like biological... more
Let S and S′ be two strings, having the same length, over a totally-ordered alphabet. We consider the following two variants of string matching. Parameterized Matching: The characters of S and S′ are partitioned into static characters and... more
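Several abstracts in this list rely on the backward search of the FM index over the Burrows-Wheeler transform. The Python sketch below counts pattern occurrences that way. It is a teaching aid under stated assumptions: the transform is built naively from sorted rotations, occurrence counts are stored uncompressed, the sentinel `"\0"` is assumed not to appear in the text, and the names `bwt`, `build_fm`, and `count` are hypothetical.

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive, quadratic)."""
    s = text + "\0"  # sentinel: assumed absent from text and smaller than any symbol
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def build_fm(text):
    """Build the C array and occurrence tables used by backward search."""
    last = bwt(text)
    counts = Counter(last)
    # C[c] = number of symbols in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in last[:i].
    occ = {c: [0] * (len(last) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(last)

def count(fm, pattern):
    """Count occurrences of pattern, processing it from right to left."""
    C, occ, n = fm
    lo, hi = 0, n  # half-open range of matching rows in the sorted rotations
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

fm = build_fm("banana")
print(bwt("banana").replace("\0", "$"))  # annb$aa
print(count(fm, "ana"))                  # 2
```

Each step narrows the row range using only the transformed text and the counters, which is why the original text never needs to be stored alongside the index.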