Text Indexing

449 papers

19 followers

About this topic

Text indexing is the process of organizing and storing textual data in a structured format to facilitate efficient retrieval and searching. It involves creating an index that maps keywords or phrases to their locations within the text, enhancing the speed and accuracy of information retrieval in databases and search engines.
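
The mapping from keywords to their locations described above can be sketched as a small positional inverted index. This is a minimal illustration only: the function names `build_inverted_index` and `search`, the regex tokenizer, and the sample documents are assumptions made for the example, not taken from any paper listed on this page.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to a list of (doc_id, word_position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Hypothetical tokenizer: runs of word characters, lowercased.
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((doc_id, position))
    return index


def search(index, word):
    """Return every (doc_id, word_position) where the word occurs."""
    return index.get(word.lower(), [])


docs = [
    "Text indexing speeds up search",
    "An index maps words to text locations",
]
index = build_inverted_index(docs)
print(search(index, "text"))
```

A lookup touches only the posting list of the queried word, which is why retrieval cost does not grow with the number of documents that lack the word.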

Key research themes

1. How can semantic and concept-based methods improve text indexing compared to traditional keyword-based approaches?

This research area focuses on enhancing the indexing and retrieval of text documents by moving beyond simple keyword matching to semantic-aware methods. These approaches leverage linguistic resources, concept identification, and semantic similarity measures to better capture the inherent meaning and context in documents. The goal is to address challenges posed by synonymy, polysemy, and lexical ambiguities that limit keyword-based indexing.

CONCEPT-BASED INDEXING IN TEXT INFORMATION RETRIEVAL

by International Journal of Computer Science & Information Technology (IJCSIT)

2016

Key finding: Proposes a novel semantic indexing method that integrates WordNet and WordNetDomains to identify and weight concepts rather than mere keywords. It introduces a new concept centrality weighting scheme that outperforms... Read more

Indexing by Latent Dirichlet Allocation and an Ensemble Model

by Jaesung Lee and

2015

Key finding: Introduces an automatic indexing approach based on Latent Dirichlet Allocation (LDA) for capturing latent topics in documents, overcoming limitations of traditional vector space models that ignore conceptual meaning. It... Read more

A semi-automatic indexing system based on embedded information in HTML documents

by Mari Vallez

2022, Library Hi Tech

Key finding: Develops DigiDoc MetaEdit, a semi-automatic indexing tool exploiting embedded semantic metadata in HTML and a controlled thesaurus. The approach balances term frequency and semantic relevance, achieving about 50% overlap with... Read more

2. What indexing structures and algorithms enable efficient document retrieval at scale, especially using inverted and cluster-based indexes?

This theme addresses the design, optimization, and application of indexing data structures such as inverted indexes and clustering-enhanced variants to enable fast and scalable document retrieval. The focus lies on supporting varied query types including word-based, substring, and complex queries over large text corpora, while balancing time and space efficiency. Emerging data structures like wavelet trees and clustering algorithms improve indexing precision and retrieval speed.

Document Retrieval using Efficient Indexing Techniques

by Rajesh Prasad

2023, International Journal of Business Analytics

Key finding: Presents a comprehensive review of word-based inverted indexes and full-text indexes, emphasizing their respective strengths and limitations. Highlights the wavelet tree data structure as an emerging method for compressing... Read more

An approach for document retrieval using cluster-based inverted indexing

by GUNJAN CHANDWANI

2021, Journal of Information Science

Key finding: Introduces a novel cluster-based inverted index that integrates piecewise fuzzy C-means clustering with classical inverted indexing, using Bhattacharyya distance for query matching and Pearson correlation for query... Read more

Self-indexing Natural Language

by Antonio Fariña

2023, Lecture Notes in Computer Science

Key finding: Proposes applying compressed self-indexes, previously developed for arbitrary strings, to sequences of words in natural language text. Demonstrates that such self-indexes can occupy space close to the best word-based... Read more

3. How can indexing approaches handle uncertainty and variability in texts, such as weighted sequences or approximate matching?

The focus here is on indexing methods that accommodate uncertain, imprecise, or approximate text representations. This includes weighted sequences where each position represents probabilistic letter distributions, and approximate dictionary matching where exact matches are relaxed to allow errors or mismatches. Such methods must balance indexing size, preprocessing time, and query performance especially in applications like bioinformatics and noisy data retrieval.

Indexing weighted sequences: Neat and efficient

by Carl Barton

2022, Information and Computation

Key finding: Develops an O(nz)-time and O(nz)-space indexing structure for weighted sequences with optimal query time O(m + Occ). The novel data structure uses a family of special strings encoding all probable occurrences above a... Read more

A Practical Index for Approximate Dictionary Matching with Few Mismatches

by Szymon Grabowski

2023, Computing and Informatics

Key finding: Proposes a practically efficient split index leveraging the Dirichlet principle for dictionary matching allowing few mismatches (especially one). Experimental evaluations reveal microsecond-level query times and effective... Read more
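
The Dirichlet (pigeonhole) principle mentioned in this finding can be illustrated with a short sketch: if a pattern is cut into two pieces and at most one mismatch is allowed, at least one piece must match a dictionary word exactly. The class below is a hypothetical toy built on that general idea, assuming Hamming distance and equal-length comparison; it is not the split index from the paper, and the name `SplitIndex` is an assumption for the example.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class SplitIndex:
    """Toy dictionary index for queries with at most one mismatch."""

    def __init__(self, words):
        self.left = defaultdict(list)   # (length, first half) -> words
        self.right = defaultdict(list)  # (length, second half) -> words
        for w in words:
            mid = len(w) // 2
            self.left[(len(w), w[:mid])].append(w)
            self.right[(len(w), w[mid:])].append(w)

    def query(self, pattern):
        """Return all words within Hamming distance 1 of the pattern."""
        mid = len(pattern) // 2
        # Pigeonhole: one mismatch can spoil only one of the two halves,
        # so every valid answer shares an exact half with the pattern.
        candidates = set(self.left.get((len(pattern), pattern[:mid]), []))
        candidates.update(self.right.get((len(pattern), pattern[mid:]), []))
        return sorted(w for w in candidates if hamming(w, pattern) <= 1)


idx = SplitIndex(["index", "indux", "under", "text"])
print(idx.query("undex"))
```

Only the candidates sharing an exact half are verified, so the verification cost depends on the bucket sizes rather than on the whole dictionary.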

Extraction and normalization of IR indexing terms and phrases in a highly inflectional language

by Christos Tsalidis

2023, Journal of Greek Linguistics

Key finding: Presents an approach relying on extensive computational lexicons and explicit linguistic knowledge for term and phrase extraction and normalization in highly inflectional languages. This reduces errors in indexing due to... Read more

All papers in Text Indexing

Recent results in Serbian computational lexicography

by Gordana Pavlović-Lažetić

2025

The recent results of the research in the construction of the electronic dictionary of Serbo-Croatian are presented. This research involves the development of methodological and theoretical principles for the construction of the lexicon... more

An overview of resources and basic tools for the processing of Serbian written texts

by Gordana Pavlović-Lažetić

2025

In this paper we describe the resources and tools for the processing of texts written in Serbian. Most of the resources have been developed within the University of Belgrade NLP group located at the Faculty of Mathematics. The main... more

Efficient Algorithms for Context Query Evaluation over a Tagged Corpus

by Jérémy Barbay

2025, 2009 International Conference of the Chilean Computer Science Society

We present an optimal adaptive algorithm for context queries in tagged content. The queries consist of locating instances of a tag within a context specified by the query using patterns with preorder, ancestor-descendant and proximity... more

Implementing Language Games with NLP Tools: The Greek Case (short paper)

by Fountana Maria

2025

Digital games, as a popular technology in youth entertainment, constitute a fast-growing field which has been affecting various aspects of education for several years now. The research project “Lexipaignio” focuses on the development of... more

Hierarchical phrase-based translation with suffix arrays

by Adam Lopez

2025

A major engineering challenge in statistical machine translation systems is the efficient representation of extremely large translation rulesets. In phrase-based models, this problem can be addressed by storing the training data in memory... more

Constant-time and space-efficient unidirectional and bidirectional FM-indices using EPR-dictionaries

by Knut Reinert

2025, arXiv (Cornell University)

The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a... more

Constant-time and space-efficient unidirectional and bidirectional FM-indices using EPR-dictionaries

by Knut Reinert

2025, ArXiv

We introduce a new method for conducting an exact search in a uni- and bidirectional FM index in $\mathcal{O}(1)$ time per step while using $\mathcal{O}(\log \sigma \cdot n) + o(\log \sigma \cdot \sigma \cdot n)$ bits of space. This is... more

Near-optimal algorithm to count occurrences of subsequences of a given length

by Rene Peralta

2025, Discrete Mathematics, Algorithms and Applications

For [Formula: see text], define [Formula: see text] as the set of integers [Formula: see text]. Given an integer [Formula: see text] and a string [Formula: see text] of length [Formula: see text] over [Formula: see text], we count the... more

Suffix Trays and Suffix Trists: Structures for Faster Text Indexing

by Moshe Lewenstein

2025, Springer eBooks

Suffix trees and suffix arrays are two of the most widely used data structures for text indexing. Each uses linear space and can be constructed in linear time for polynomially sized alphabets. However, when it comes to answering queries... more
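
As a rough illustration of how a suffix array answers pattern queries, the sketch below builds the array by plain sorting and locates a pattern with two binary searches. It is a simplified baseline under stated assumptions: the construction here is not linear time, and the names `build_suffix_array` and `find_occurrences` are hypothetical, not from the chapter.

```python
def build_suffix_array(text):
    """Starting positions of all suffixes, in lexicographic order.

    Naive construction by sorting suffix strings; fine for small inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted starting positions of pattern in text."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa, find_occurrences(text, sa, "ana"))
```

All suffixes that start with the pattern occupy one contiguous range of the array, so a query needs a logarithmic number of string comparisons, each of length at most that of the pattern.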

Text Indexing and Searching in Sublinear Time

by Ian Munro

2024, arXiv (Cornell University)

We introduce the first index that can be built in o(n) time for a text of length n, and can also be queried in o(q) time for a pattern of length q. On an alphabet of size σ, our index uses O(n √ log n log σ) bits, is built in O(n((log log... more

Reputation extraction using both structural and content information

by Mineichi Kudo

2024

We propose a new method of extracting texts related to a given keyword from Web pages collected by a search engine. By combining structural pattern matching and text classification, texts related to a given keyword such as reputations of... more

The SPIRIT Spatial Search Engine: Architecture, Ontologies and Spatial Indexing

by Alia Abdelmoty

2024, Lecture Notes in Computer Science

The SPIRIT search engine provides a test bed for the development of web search technology that is specialised for access to geographical information. Major components include the user interface, geographical ontology, maintenance and... more

Highly scalable algorithm for distributed real-time text indexing

by Vijay Garg

2024, 2009 International Conference on High Performance Computing (HiPC)

Stream computing research is moving from terascale to petascale levels. It aims to rapidly analyze data as it streams in from many sources and make decisions with high speed and accuracy in fields as diverse as security surveillance and financial services, including stock trading. We specifically consider real-time text indexing and search with high input data rates (10 GB/s or more) along with a small index age-off (expiry) time. This makes it necessary to have maximal indexing rates for large volumes of data as well as minimal indexing latency (the time between the start of indexing for a document and its availability for search) while maintaining very low search response time. In addition, future massively parallel architectures with storage class memories will enable high-speed in-memory real-time indexing, where the index can be stored completely in a high-capacity storage class memory. In this paper, we present the design of distributed data structures and a distributed real-time text indexing algorithm for parallel systems having a large number (thousands to a hundred thousand) of cores/processors, while simultaneously providing acceptable search performance [1]. The inherent trade-offs involved in index space, indexing throughput and search response time make this problem particularly challenging. Our algorithm uses group-based index construction and leverages novel index data structures that reduce load imbalance and make the text indexing and merge process more scalable and efficient. We show analytically that the asymptotic parallel time complexity of our distributed indexing algorithm is at least an Ω(log(P)) factor better than typical indexing approaches, where P is the number of indexing nodes in a group. We further demonstrate the performance and scalability of our distributed indexing algorithm on an MPP architecture (Blue Gene/L, http://www.research.ibm.com/bluegene) using actual IBM intranet data. We achieved high indexing throughput of around 312 GB/min on an 8K node Blue Gene/L machine. In comparison with parallel indexing implemented using typical approaches like CLucene (http://www.sourceforge.net/projects/clucene), this is 3× to 7× better. To the best of our knowledge, this is the first published result on indexing throughput at such a large scale, with sustained search performance. We further show that our approach is scalable to 128K nodes, giving an estimated indexing throughput of 5 TB/min. We also achieved indexing latency that is around 10× better than typical indexing approaches.

Migration of the Mathematical Collection of Polish Virtual Library of Science to the YADDA Platform

by Tomasz Rosiek

2024

YADDA framework facilitates information exchange between digital document repositories. YaddaWeb, its web-based interface, provides browse and search functionalities. Content providers use DeskLight application to add or modify metadata... more

YADDA2

by Tomasz Rosiek

2024

YADDA2 is an open software platform which facilitates creation of digital library applications. It consists of versatile building blocks providing, among others: storage, relational and full-text indexing, process management, and... more

Text sparsification via local maxima

by Gianluca Rossi

2024, Theoretical Computer Science

In this paper we investigate some properties and algorithms related to a text sparsification technique based on the identification of local maxima in the given string. As the number of local maxima depends on the order assigned to the... more

Informedia TM: News-On-Demand Experiments in Speech Recognition

by Michael Witbrock

2024

In theory, speech recognition technology can make any spoken words in video or audio media usable for text indexing, search and retrieval. This article describes the News-on-Demand application created within the Informedia TM Digital... more

Burst tries

by Steffen Heinz

2024, ACM Transactions on Information Systems

Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and... more

Recent Research Developments in Learning Technologies. <hal-00190734>

by Claude Ponton

2024

In this article we present the MIRTO platform-under development at the University Stendhal of Grenoble-and how it addresses common flaws of CALL software. This platform led to another project: the creation of a pedagogically indexed text... more

In Antoniadis et al. [1], we identify current CALL software flaws: the poverty of meaning associated with any linguistic sequence, the rigidity of software, and the necessity for language teacher users to express their pedagogical solutions in computer-understandable terms instead of resorting to language didactics. These flaws mostly stem from the divergences between computer science's and didactics' views of the notion of “language”. “Computer science can only consider and process the form of language independently of any interpretation, while, for language didactics, the form only exists through its properties and the concepts it is supposed to represent” [2]. The MIRTO platform, currently under development at the University Stendhal of Grenoble, plans to address these problems via the use of NLP (Natural Language Processing) tools and collaborative work with didactics experts. In order to support this approach, we resorted to the following architecture:

The concept of “text facet” as a means to achieve pedagogical indexation of a text base dedicated to language teaching

by Claude Ponton

2024

This communication is meant to present our project of pedagogically indexed text base. After introducing the notion of pedagogical indexation, which needs to be articulated around the teachers needs, we explain to which extent existing... more

“Facets” and “Prisms” as a Means to Achieve Pedagogical Indexation of Texts for Language Learning: Consequences of the Notion of Pedagogical Context

by Claude Ponton

2024, Communications in Computer and Information Science

Defining pedagogical indexation of texts for language learning as an indexation allowing users to query for texts in order to use them in language teaching requires to take into account the influence of the properties of the teaching... more

Pedagogical text indexation and exploitation for language teaching

by Claude Ponton

2024

An experimental comparative study on slide change detection in lecture videos

by Ramani Kasarapu

2024, International Journal of Information Technology

In today's world, e-learning is one of the popular modes of learning, and video lectures are prominent in keeping learners engaged with a course. The Internet has made it possible to keep a large number of video lectures online. To search for a required... more

In the above equations, $P_x$ and $P_y$ denote the pixel distribution differences along the X and Y axes, $b_x$ and $b_y$ are values in different bins, and $I(x, y)$ is the binarized image pixel intensity.

more than the values in Figs. 6 and 7 with time intervals of 1 and 2 s, respectively. The experimental results show that:

A assistência de enfermagem no tratamento de pessoas com feridas cirúrgicas decorrentes de acidentes de trânsito (Nursing care in the treatment of people with surgical wounds resulting from traffic accidents)

by Francisca Adriana Barreto

2024

A qualitative, descriptive-exploratory study in which 13 professionals from the nursing team were interviewed using semi-structured interviews, complemented by systematic non-participant observations.... more

A Public Digital Library based on Full-Text Retrieval: Collections and Experience

by Sally Jo Cunningham

2024

The NZDL aims to impose structure on anarchic and uncataloged repositories of information, providing information consumers with effective tools to locate and peruse what they need. Our goal is to produce an easy-to-use digital library... more

Storage and Retrieval of Individual Genomes

by Gonzalo Navarro

2024, Lecture Notes in Computer Science

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N . Examples of such collections are version control data and genome... more

Using Compressed Suffix-Arrays for a compact representation of temporal-graphs

by Diego Caro

2024, Information Sciences

Temporal graphs represent binary relationships that change along time. They can model the dynamism of, for example, social and communication networks. Temporal graphs are defined as sets of contacts that are edges tagged with the temporal... more

A New Burrows Wheeler Transform Markov Distance

by Charles Nicholas

2024, arXiv (Cornell University)

Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and... more

Noiseless Data Compression Technique by Using Burrows-Wheeler Transform

by Karuna Khobragade

2024

It seems that there is no limit to the amount of data we need to store in our computers and also send these data to our friends and colleagues. For this purpose people tend to store a lot of files inside their storage. When the storage... more

Partial Data Compression and Text Indexing via Optimal Suffix Multi-Selection

by Roberto Grossi

2024

Consider an input text string $T \equiv T[1, N]$ drawn from an unbounded alphabet, so text positions can be accessed using comparisons. We study partial computation in suffix-based problems for Data Compression and Text Indexing such as
• retrieve any segment of $K \le N$ consecutive symbols from the Burrows-Wheeler transform of $T$, which is at the heart of the bzip2 family of text compressors, and
• retrieve any chunk of $K \le N$ consecutive entries of the Suffix Array or the Suffix Tree, two popular Text Indexing data structures for $T$.
Prior literature would take $O(N \log N)$ comparisons (and time) to solve these problems by solving the total problem of building the entire Burrows-Wheeler transform or Text Index for $T$, and performing a post-processing to single out the wanted portion. The technical challenge is that the suffixes of interest are potentially of size $O(KN)$ and overlap in intricate ways: we have to use structural properties of these overlaps to avoid rescanning them repeatedly. We introduce a novel adaptive approach to the partial computational problems above, and solve both partial problems in $O(K \log K + N)$ comparisons and time, improving the best known running times of $O(N \log N)$ for $K = o(N)$. These partial-computation problems are intimately related since they share a common bottleneck: the suffix multi-selection problem, which is to output the suffixes of rank $r_1, r_2, \ldots, r_K$ under the lexicographic order, where $r_1 < r_2 < \cdots < r_K$ and $r_i \in [1, N]$. Special cases of this problem are well known: $K = N$ is the suffix sorting problem that is the workhorse in Stringology with hundreds of applications, and $K = 1$ is the recently studied suffix selection. We show that suffix multi-selection can be solved in $\Theta\left(N \log N - \sum_{j=0}^{K} \Delta_j \log \Delta_j + N\right)$ time and comparisons, where $r_0 = 0$, $r_{K+1} = N + 1$, and $\Delta_j = r_{j+1} - r_j$ for $0 \le j \le K$. This is asymptotically optimal, and also matches the bound in [7] for multi-selection on atomic elements (not suffixes).
Matching the bound known for atomic elements for strings is a long-running theme and challenge from the 1970s, which we achieve for the suffix multi-selection problem. The partial suffix problems as well as the suffix multi-selection problem have many applications.
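
To make the problem statement concrete, the sketch below solves suffix multi-selection the way the abstract attributes to prior literature: sort every suffix, then pick out the requested ranks. It is only the naive baseline, not the adaptive algorithm of the paper, and the function name `suffix_multi_select_naive` is an assumption for the example. Ranks are 1-based as in the abstract; returned suffix start positions are 0-based.

```python
def suffix_multi_select_naive(text, ranks):
    """Return start positions of the suffixes with the given 1-based ranks.

    Baseline only: it solves the "total problem" by sorting all suffixes
    and then post-processes the result to single out the wanted ranks.
    """
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return [order[r - 1] for r in ranks]


selected = suffix_multi_select_naive("banana", [1, 3, 6])
print(selected)
```

Choosing every rank from 1 to N reproduces the full suffix array, and choosing a single rank corresponds to suffix selection, matching the two special cases named in the abstract.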

Leveraging compression in the tableau data engine

by Paweł Terlecki

2024

Data sets are growing rapidly and there is an attendant need for tools that facilitate human analysis of them in a timely manner. To help meet this need, column-oriented databases (or "column stores") have come into wide use because of... more

Leveraging compression in the tableau data engine

by Paweł Terlecki

2024, Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data

Figure 2. Invisible join for a string column.

Figure 3. Predicate push-down on a Rank Join.

Where applicable, we also ran the tests with heap acceleration on and off, as well as with encodings on and off. The results are displayed in Figure 2. By comparing the encoded and un-encoded results for the “All” and “Scalars” scenarios in Fig. 4, we can see that in all situations

The version 1 Flights database with only run-length encoding and dictionary compression was 4.1GB. Figure 5 also shows the logical and physical sizes of this table. The total disk savings from the original 25GB flat file is 21GB (84%) and the savings from the logical size is 15GB (85%). Figure 5. Compression Savings.

Figure 6. Number of Sorted Heaps. Note that with no encoding, there were a total of five sorted heaps in the table set (the blue bars in the figure), mostly due to the TPC-H data generation algorithm or other accidents. With encoding on, however, all string heaps are sorted except one (L_comment), which has a large domain with low duplication.

The use of minimum width representations for scalars and tokens is another important optimization. When values have minimal widths, the system can choose better hashing algorithms for joins and aggregation. In Fig. 8, we can see that about three quarters of the string columns had their token width reduced from the default width of 8 bytes, often down to one byte. This can mean the difference between using an imperfect hash function with collision detection and using a perfect hash, or even a fast direct hash during joins and aggregation. A similar transformation can be performed on integer columns. Integers are parsed with a default width of 8 bytes, but often contain numbers from a much smaller domain. In Fig. 9, we can again see that about three quarters of the integer columns had their width reduced, often down to one byte, indicating that the values are in a very small range near zero.

The first plan is a control, which fulfills the query using the existing system. The second plan applies the filter to the index, but relies on hash aggregation. The third plan also sorts the index, before scanning to allow the use of ordered aggregation.

The SPIRIT Spatial Search Engine: Architecture, Ontologies and Spatial Indexing

by Christopher Jones

2024, Lecture Notes in Computer Science

Computing the multi-string BWT and LCP array in external memory

by Paola Bonizzoni

2024, Theoretical Computer Science

Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multistring generalization of the Burrows-Wheeler Transform (BWT): large requirements of... more

Approximate Query Processing over Static Sets and Sliding Windows

by Srinivasa Satti

2024, arXiv (Cornell University)

Indexing of static and dynamic sets is fundamental to a large set of applications such as information retrieval and caching. Denoting the characteristic vector of the set by B, we consider the problem of encoding sets and multisets to... more

Enumerating Range Modes

by Srinivasa Satti

2024, ArXiv

We consider the range mode problem where given a sequence and a query range in it, we want to find items with maximum frequency in the range. We give time- and space- efficient algorithms for this problem. Our algorithms are efficient for... more

Dictionary matching and indexing with errors and don't cares

by Richard Cole

2024, Proceedings of the thirty-sixth annual ACM symposium on Theory of computing

This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly. In this paper we consider matches in... more

Suffix Trays and Suffix Trists: Structures for Faster Text Indexing

by Richard Cole

2024, Algorithmica

On demand string sorting over unbounded alphabets

by Moshe Lewenstein

2024, Theoretical Computer Science

On-demand string sorting is the problem of preprocessing a set of strings to allow subsequent queries for finding the k lexicographically smallest strings (and afterward the next k etc.) This on-demand variant strongly resembles the... more

Dictionary matching and indexing with errors and don't cares

by Moshe Lewenstein

2024, Proceedings of the thirty-sixth annual ACM symposium on Theory of computing

Bitmap Indices for Data Warehouses

by Kesheng Wu

2024, IGI Global eBooks

In this chapter we discuss various bitmap index technologies for efficient query processing in data warehousing applications. We review the existing literature and organize the technology into three categories, namely bitmap encoding,... more

Performances of Multi-Level and Multi-Component Compressed Bitmap Indices

by Kesheng Wu

2024

Bitmap indexes are known as the most effective indexing methods for range queries on append-only data, especially for low cardinality attributes. Recently, bitmap indexes were also shown to be just as effective for high cardinality... more
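
For readers unfamiliar with bitmap indexes, here is a minimal sketch of the basic equality-encoded form answering a range query on a low-cardinality column. It is uncompressed and single-level, so it does not reflect the multi-level or multi-component schemes the paper evaluates; Python integers stand in for bit vectors, and the names `build_bitmap_index` and `range_query` are hypothetical.

```python
def build_bitmap_index(values):
    """One bitmap per distinct value; bit `row` is set if values[row] == value."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def range_query(bitmaps, low, high):
    """Return the row numbers whose value lies in [low, high]."""
    result = 0
    for value, bits in bitmaps.items():
        if low <= value <= high:
            result |= bits  # OR together the bitmaps of qualifying values
    return [row for row in range(result.bit_length()) if (result >> row) & 1]


column = [3, 1, 2, 3, 1, 5]
bitmaps = build_bitmap_index(column)
print(range_query(bitmaps, 1, 2))
```

The query cost grows with the number of distinct values inside the range, which is one reason the basic scheme is usually associated with low-cardinality attributes.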

The Mathematics of Internet Search Engines

by Sergei Silvestrov

2024, Acta Applicandae Mathematicae

This article presents a survey of techniques for ranking results in search engines, with emphasis on link-based ranking methods and the PageRank algorithm. The problem of selecting, in relation to a user search query, the most relevant... more

Disk Backup Through Algebraic Signatures in Scalable Distributed Data Structures

by Riad Mokadem

2024

A Scalable Distributed Data Structure (SDDS) allows a large scalable file to be stored over a distributed RAM. The file scales up transparently for the application over the nodes of a multicomputer, e.g., a local network of PCs. The prototype... more

Area efficient VLSI architectures for Huffman coding

by Heonchul Park

2024, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing

In this paper, we present simple and area efficient VLSI architectures for Huffman coding, an industrial standard proposed by MPEG, JPEG, and others. We use a memory of size O(n log n) bits to store a Huffman code tree, where n is the number... more

Fig. 1. (a) A tree with fixed length codes. (b) A tree with variable length codes.

The number of I/O pins in the proposed design is 28, including signal and power ports. The number of I/O pins can be reduced to 20 by using bit-serial input for memory update. The layout is shown in Fig. 9. The proposed design is more compact compared with the known design for 7-bit ASCII symbols in [1], which requires a 6.8 × 6.9 mm² area using a 2 micron SCMOS cell library. Notice that our die size for 8-bit symbols becomes 5.8 × 5.8 mm² for the 2 micron process. An SCMOS standard cell requires much smaller area than a CMOSN cell. Also, the design in [2] employs customized RAM cells.

Fig. 9. VLSI layout of the Huffman codec for 8-bit symbols.

quires max {1.25 log n + 3.5m, (a + vn Mllog 2/2) + D} bits of memory for encoding and decoding. Notice that our algorithm requires smaller memory for log n < 10, which occurs very frequently in practice. Also, our design requires O(log n) time units for encoding/decoding a symbol on the average, while the storage scheme in [7] requires O(n) time on the average.

Table I. Computational requirements in Huffman coding for several video signals. CCIR is the International Consultative Committee on Broadcasting.

LIG at ImageCLEFphoto 2008

by Philippe Mulhem

2024

These working notes describe the runs and results obtained by the LIG at ImageCLEFphoto 2008. The submitted runs are: two runs (text only and text+image) without diversification on classes, and two runs (text only and text+image) with... more

Pengenalan Citra dokumen Sastra Jawa konsep dan implementasinya (Javanese Literary Document Image Recognition: Concept and Implementation)

by Rita Widiarti

2024

In Yogyakarta, many ancient manuscripts can still be found that constitute a priceless cultural heritage. If these manuscripts can be converted into digital format, many benefits can be gained.... more

FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns

by Wing-kai Hon

2024, 2022 Data Compression Conference (DCC)

The run-length compressed Burrows-Wheeler transform (RLBWT) used in conjunction with the backward search introduced in the FM index is the centerpiece of most compressed indexes working on highly-repetitive data sets like biological... more
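
The backward search mentioned here can be sketched on a plain (not run-length compressed) Burrows-Wheeler transform. This is a teaching-sized illustration under stated assumptions: the BWT is built by sorting rotations, rank queries scan the string instead of using a succinct structure, and the terminator `$` is assumed to be absent from the text and smaller than every text character. The function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (small inputs only)."""
    text += "$"  # assumed terminator, smaller than every text character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count pattern occurrences using FM-index style backward search."""
    # first_row[c]: number of characters in the text strictly smaller than c.
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first_row, total = {}, 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def rank(ch, i):
        # Occurrences of ch in bwt_string[:i]; a real index answers this fast.
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)  # current range of sorted rotations
    for ch in reversed(pattern):  # extend the match one character leftwards
        if ch not in first_row:
            return 0
        lo = first_row[ch] + rank(ch, lo)
        hi = first_row[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
print(transformed, count_occurrences(transformed, "ana"))
```

Each pattern character costs two rank queries, so with a constant-time rank structure the count takes time proportional to the pattern length, independent of the text length.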

Editors: Roberto Grossi and Moshe Lewenstein

by Wing-kai Hon

2024

Let S and S be two strings, having the same length, over a totally-ordered alphabet. We consider the following two variants of string matching. Parameterized Matching: The characters of S and S are partitioned into static characters and... more
