Cross Language Information Retrieval Research Papers

2025

Explicit Versus Latent Concept Models for Cross-Language Information Retrieval

2025, IJCAI 2009: Proceedings of the 21st International Joint Conference on Artificial Intelligence

The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-word based models. Many...

Explicit versus latent concept models for cross-language information retrieval

by Steffen Staab

2025

The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-word based models. Many...

Non-adjacent Digrams Improve Matching of Cross-Lingual Spelling Variants

by Kal Jarvelin

2025, Lecture Notes in Computer Science

Dictionary-Based CLIR Loses Highly Relevant Documents

by Kal Jarvelin

2025, Lecture Notes in Computer Science

FITE-TRT

by Kal Jarvelin

2025, Proceedings of the 2006 ACM symposium on Applied computing

We devised a novel statistical technique for the identification of the translation equivalents of source words obtained by transformation rule based translation (TRT). The effectiveness of the devised FITE (frequency-based identification...

A Novel Implementation of the FITE-TRT Translation Method

by Kal Jarvelin

2025, Lecture Notes in Computer Science

Cross-language Information Retrieval requires good methods for translating cross-lingual spelling variants which are not covered by the available dictionary resources. FITE-TRT is an established method employing frequency-based...

Translating cross-lingual spelling variants using transformation rules

by Kal Jarvelin

2025, Information Processing & Management

Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR). However, technical terms and proper names in different languages often share the same Latin or Greek origin, being...

UTACLIR @ CLEF 2001

by Kal Jarvelin

2025, Working Notes for CLEF 2001 Workshop

We participated in CLEF'2001 with four automated bilingual runs. UTACLIR is an automatic query translation and construction system for cross-language information retrieval. The system automatically extracts topical information from...

Name Extraction and Translation for Distillation

by Matthias Blume

2025

Name translation is important well beyond the relative frequency of names in a text: a correctly translated passage, but with the wrong name, may lose most of its value. The Nightingale team has built a name translation component which...

Justification: Life as Memory Validation in the Absence of Biology

by Maxim Kolesnikov

2025, 146

This paper proposes a formal justification for recognizing life in non-biological systems through the structural validation of memory, coherence, and phase alignment. Building on the Harmonic Nexus framework and recent extensions to RMS ×...

A Cross Lingual Information Retrieval (CLIR) System for Afaan Oromo-English using a Corpus Based Approach

by Daniel Bekele

2025, International Journal of Engineering Research and

The goal of Cross Language Information Retrieval (CLIR) is to provide users with access to information that is in a different language from their queries. It lets users issue a query in one language and retrieve documents in...

Visual knowledge representation of conceptual semantic networks

by Robert Wyatt

2025, Social Network Analysis and Mining

This article presents methods of using visual analysis to represent large amounts of dynamic, ambiguous data held in a repository of learning objects. These methods are based on the semantic representation of these...

Automated Discovery, Categorization and Retrieval of Personalized Semantically Enriched E-learning Resources

by Robert Wyatt

2025, 2009 IEEE International Conference on Semantic Computing

Other friends who have supported me technically or morally during these 5 years are too numerous to list here individually; to all of them I say "thanks." I cannot end without acknowledging the generous encouragement that I have received...

A Hybrid Approach to Cross-Linguistic Tokenization: Morphology with Statistics

by Logan Kearsley

2025

Logan R. Kearsley, Department of Linguistics and English Language, BYU, Master of Arts. Tokenization, or word boundary detection, is a critical first step for...

Grammatical Framework: Programming with Multilingual Grammars

by Aarne Ranta

2025

The string "this Italian cheese is expensive" is parsed to the tree This tree is linearized to the string Tree, graphically

Comprehensive Part-Of-Speech Tag Set and SVM based POS Tagger for Sinhala

by Sandareka Fernando

2025

This paper presents a new comprehensive multi-level Part-Of-Speech tag set and a Support Vector Machine based Part-Of-Speech tagger for the Sinhala language. The currently available tag set for Sinhala has two limitations: the...

Zero-Shot Query Generation for Approximate Search Algorithm Evaluation

by Mark Turin

2025, Proceedings of the Eighth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Approximate search is a valuable component of online dictionaries for learners, allowing them to find words even when they have not fully mastered the orthography or cannot reliably perceive phonemic differences in the language. However,...

Using Internet News Flows as Marketing Data Component

by Татьяна Гончаренко

2025

Theoretical research was carried out on the representation of Internet news as a marketing data component concerning price strategies in the polymer market. The research includes some intermediate tasks. Firstly, the classification of marketing...

Measuring Semantic Distance using distributional profiles of concepts

by Saif Mohammad

2025

Automatic measures of semantic distance can be classified into two kinds: (1) those, such as WordNet, that rely on the structure of manually created lexical resources and (2) those that rely only on co-occurrence statistics from large...

Combination Approaches in Information Retrieval: Words vs. N-grams and Query Translation vs. Document Translation

by Jong-hyeok Lee

2025, NTCIR

This paper reports our proposal and experimental results at the NTCIR-4 CLIR task. For monolingual information retrieval, we use a combination strategy that integrates words and n-grams at the ranked list level. In combining words and...

Memory-based Pattern Completion in Database Semantics

by Roland R. Hausser

2025, Eon'eo wa jeong'bo

Pattern recognition in cognitive agents is based on (i) the uninterpreted input data (e.g. parameter values) provided by the agent's hardware devices and (ii) interpreted patterns (e.g. templates) provided by the agent's memory....

NTT question answering system in TREC 2001

by Eisaku Maeda

2025

In this report, we describe our question-answering system SAIQA-e (System for Advanced Interactive Question Answering in English), which ran the main task of TREC-10's QA track. Our system has two characteristics: (1) named entity...

TREC 2004 HARD Track Experiments in Clustering

by James Shanahan

2025

The Clairvoyance team participated in the High Accuracy Retrieval from Documents (HARD) Track of TREC 2004, submitting three runs. The principal hypothesis we have been pursuing is that small numbers of documents in clusters can provide a...

Arabic user search Query correction and expansion

by Mouhssine Bouzoubaa

2025, Proceedings of the 1st …

This paper describes correction and expansion techniques for multilingual search queries submitted to the Arabic-centred search engine Barq. Key features of the correction technique are the use of 1. Arabic language morphology, 2. Arab...

Corpus-based cross-language information retrieval in retrieval of highly relevant documents

by Kal Jarvelin

2025, Journal of the American Society for Information Science and Technology

IR systems' ability to retrieve highly relevant documents has become more and more important in the age of extremely large collections, such as the WWW. Our aim was to find out how well corpus-based CLIR performs in retrieving highly relevant...

A study on automatic creation of a comparable document collection in cross‐language information retrieval

by Kal Jarvelin

2025, Journal of Documentation

We present a new method for creating a comparable document collection from two document collections in different languages. The best query keys were extracted from a Finnish source collection (articles of the newspaper Aamulehti) with the...

Dictionary-Based Cross-Language Information Retrieval: Learning Experiences from CLEF 2000–2002

by Kal Jarvelin

2025, Information Retrieval

In this study, the basic framework and performance analysis results are presented for the three-year development process of the dictionary-based UTACLIR system. The tests expand from bilingual CLIR for three language pairs Swedish,...

Focused web crawling in the acquisition of comparable corpora

by Kal Jarvelin

2025, Information Retrieval

CLIR resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a...

Aspects of Swedish morphology and semantics from the perspective of mono- and cross-language information retrieval

by Kal Jarvelin

2025, Information Processing & Management

This paper analyzes the features of the Swedish language from the viewpoint of mono- and cross-language information retrieval (CLIR). The study was motivated by the fact that Swedish is poorly known from the IR perspective. This paper shows...

EXPANSIONTOOL: Formal definition of concept-based query expansion and construction

by Kal Jarvelin

2025

We develop a deductive data model for concept-based query expansion. It is based on three abstraction levels: the conceptual, linguistic and string levels. Concepts and relationships among them are represented at the conceptual level. The...

Frequency-based identification of correct translation equivalents (FITE) obtained through transformation rules

by Kal Jarvelin

2025, ACM Transactions on Information Systems

We devised a novel statistical technique for the identification of the translation equivalents of source words obtained by transformation rule based translation (TRT). The effectiveness of the technique called frequency-based...

Automatic Translation in Cross-Lingual Access to Legislative Databases

by Catherine Bounsaythip

2025

This paper considers the use of controlled languages for query translation in a legislative document retrieval system. The problem statement and an analysis of the approach are described. The use of controlled languages is motivated by the fact...

Building a Multilingual and Mixed Arabic-English Corpus

by Mohammed Mustafa Ali

2025, pubs.cs.uct.ac.za

Most currently available test collections and almost all CLIR collections have focused upon general-domain news stories. In addition, most of these corpora are built to help with retrieval of documents based on monolingual...

Indexing and weighting of multilingual and mixed documents

by Mohammed Mustafa Ali

2025, Proceedings of the South African Institute of Computer Scientists and Information Technologists Conference on Knowledge, Innovation and Leadership in a Diverse, Multidisciplinary Environment

Non-English-speaking users, such as Arabic speakers, are not always able to express terminology in their native languages, especially in scientific domains. Such difficulty forces many Arabic authors and scholars to use English terms in...

Incorporating Statistical Topic Models in the Retrieval of Healthcare Documents

by Karla Caballero

2025

We present a framework based on Statistical Topic Models, Language Models, Information Extraction, and Ontology Analysis to retrieve healthcare-related documents for the CLEF eHealth 2013 Task 3. In this framework we add global...

Definirea functiilor trigonometrice

by Vacaru Daniel

2025

A presentation of how the trigonometric functions are defined. Starting from the values of the trigonometric functions in the first quadrant, they are extended to the whole interval [0, 2π], and then to R.

Bilingual Database Software Framework for Thirukkural

by Merlin Florrence

2025

Tamil literature has a rich and long literary tradition spanning more than two thousand years. The oldest extant works show the signs of maturity indicating an even longer period of evolution. This article presents a software framework...

A Formalized Reference Grammar for UNL-Based Machine Translation between English and Arabic

by Sameh Alansary

2025

The Universal Networking Language (UNL) is an artificial language that can replicate human language functions in cyberspace in terms of hyper semantic networks. This paper aims to: a) design a reference grammar capable of dealing with the...

Gradable Predicates and Directed Motion Constructions

by Dongsik Lim

2025

Searching Pattern of Users of Online Public Access Catalogues for Retrieval of Bengali Documents in the University Libraries of West Bengal

by Sambhu Nath Halder, Ph.D.

2025, LIBRARIAN

This paper covers a use study of the Online Public Access Catalogues (OPACs) at the University Libraries of West Bengal. It highlights subject access for Bengali documents in OPACs. It finds that most of the users are postgraduate...

Semantic Noise in Information Representation and Retrieval

by Informology Journal

2025, Informology

In the field of Library and Information Science, the accurate representation and retrieval of information are of utmost importance. Information representation and indexing are critical processes that facilitate the efficient access and utilization of knowledge. However, these processes are not without challenges. One significant issue that arises is “semantic noise”, a phenomenon that can distort the meaning of information and hinder effective communication between information retrieval (IR) systems and users. This study aims to explore the concept of semantic noise, its causes, and its implications for information representation and indexing. The current study is primarily theoretical in nature, focusing on the conceptual exploration of semantic noise and its impact on information representation and retrieval. This study investigates the concept of semantic noise, its causes, and its implications for information representation and indexing in the field of library and information science. The results of the research highlight that semantic noise, caused by irrelevant, ambiguous, or conflicting elements in information representation and indexing, significantly disrupts the clarity and accuracy of information retrieval. Key causes include ambiguity in language and representation, varying contexts, inconsistent terminology, and cultural or linguistic barriers, which collectively introduce complexity and hinder effective communication between information retrieval systems and users. Semantic noise reduces retrieval accuracy, leads to inefficient query processing, and poses challenges for natural language processing (NLP) systems, often resulting in user frustration and diminished trust in information retrieval (IR) systems. Semantic noise disrupts the clarity and accuracy of information representation and retrieval, leading to inefficiencies, misinterpretations, and user dissatisfaction. 
Addressing and mitigating semantic noise requires advanced techniques in natural language understanding, such as contextual analysis, semantic search, semantic modeling, and machine learning. These techniques ensure that information retrieval (IR) systems can effectively bridge the gap between user intent and stored data. These findings underscore the critical need for precision in language, standardized terminology, and context-aware approaches to minimize semantic noise and enhance the reliability of information representation and retrieval.

A Comprehensive Morphological Analyzer for Swedish

by Fred Karlsson

2025

SWETWOL is implemented in the framework of the two-level model. It contains a 48,000-item lexicon and a full inflectional description. Special attention was paid to the design of a computational analysis of productive Swedish compounds. Recall (coverage) and precision of SWETWOL meet high standards. SWETWOL has been extensively tested on various types of texts.
("svensk" A UTR INDEF SG NOM)) ("<<tiger>>" ("tiger" N UTR INDEF SG NOM) ("tiga" V ACT PRES))
All readings are retrieved that are feasible in relation to the lexicon and inflectional description. SWETWOL is capable of generating the base form of inflected forms, most clearly visible in the second reading of the word tiger. SWETWOL is meant to be used as the basic morphological tool in systems for text analysis, information indexing, storage and retrieval, and machine translation. It is an integral part of a parsing system where the other major components are (i) a preprocessor converting text to the format presupposed by SWETWOL, and (ii) the Constraint Grammar Parser CGP , 1990, 1992a,b, Karlsson, Voutilainen, Heikkilä, and Anttila 1991). CGP provides a language-independent formalism and a computer program for morphological disambiguation and surface-syntactic analysis. The sequential set-up of analysis in the present framework is thus:
- preprocessing,
- morphological analysis by SWETWOL,
- local disambiguation (on top of TWOL, or by CGP),
- context-sensitive disambiguation, clause-boundary determination, and surface-syntactic analysis by CGP.
Local disambiguation and preprocessing are discussed in sections 12, 13. Context-sensitive disambiguation of Swedish is treated in Karlsson (in preparation).
The pioneering effort of Swedish computational lexicology and morphology was Allén's Nusvensk frekvensordbok (NFO). This project faced many aspects of computational morphology, such as text preprocessing, morpheme identity and segmentation, the distinction between homonymy and synonymy, homograph separation, and text type effects on vocabulary composition . One offspring of NFO was Hellberg's (1978) algorithm for Swedish morphology, a precise description of word inflection and derivation, partly also of compounding. Hellberg set up a system of 235 inflectional paradigms, plus a basic vocabulary containing 8,609 lemmas. This vocabulary had a coverage of 90% in the corpus of newspaper text that NFO was based on (Hellberg 1978:12). The algorithm was implemented by M. Eeg-Olofsson. Precise data on coverage in the realm of new texts are not included in . It is somewhat unclear how precise the morphotactic description of compounds was.
Hellberg's system was later reimplemented by . This dictionary contained some 7,200 entries (Rankin, ms., p. 18). Rankin (1986:166-) reports considerable overgeneration in the analysis of compounds. E.g., bil 'car' yielded spurious analyses such as "bi-l" (N -abbreviation), and ironiska 'ironic (pl.)' yielded a spurious interpretation "ironi-ska" (N -V). (In this paper, compound boundaries are indicated by a dash, "-", and other morpheme boundaries by a plus sign, "+".) The morphotactic description of compound formation was obviously not strict enough.

Improving the performance of gene mention recognition system using reformed lexicon-based support vector machine

by Yifei Chen

2025, margin

In this paper, we propose a gene mention recognition system for biomedical literature using a Support Vector Machine based on a reformed lexicon. Then we present an ensemble of rule-based post-processing modules, an integrity check...

LangHIT: Fast and Fault-Tolerant Multilingual Lexical Search via eager sparse scoring and RapidFuzz

by Chandan O Singh

2025

We introduce LangHIT, an innovative extension of BM25S that integrates multilingual tokenization, adaptive multilingual stopword filtering, and ultra-fast typo correction using RapidFuzz within the BM25S token matching framework. When...

Cross-Language Tourism News Retrieval System Using Google Translate API on SEBI Search Engine

by Arif Muntasa

2025, Elinvo (Electronics, Informatics, and Vocational Education)

Cross-Language Information Retrieval (CLIR) retrieves information stored in a language different from the language of the query provided by the user. Some translation methods commonly used in CLIR are Dictionary,...

Experiments in TREC-2003 Genomics Track at NTT

by Eisaku Maeda

2025

In TREC-2003, we participated in Question Answering and Genomics Tracks. Since the QA system was essentially the same as the past years' systems[1, 2], we describe our results with the Genomics Track in this paper.

Evaluation of sentence alignment methods on Portuguese-English parallel texts

by Profa. Helena Caseli

2025

Parallel texts, i.e., texts in one language and their translations into other languages, are very useful for many applications such as machine translation and multilingual information retrieval. If these texts are aligned in...

Developing Amharic Regular Expressions in Perl

by Daniel Yacob

2025, Internet

Follow-up to the 1997 community note "Thoughts on Regular Expressions for Ethiopic". This document has been prepared for members of the perl-unicode email list to describe by way of example the limitations of working with the Ethiopic...

LIMA: A Multilingual Framework for Linguistic Analysis and Linguistic Resources Development and Evaluation

by Nasredine Semmar

2025, Language Resources and Evaluation

The increasing amount of available textual information makes necessary the use of Natural Language Processing (NLP) tools. These tools have to be used on large collections of documents in different languages. But NLP is a complex task...