The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-words-based models. Many...
Cross-language Information Retrieval requires good methods for translating cross-lingual spelling variants which are not covered by the available dictionary resources. FITE-TRT is an established method employing frequency-based...
Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR). However, technical terms and proper names in different languages often share the same Latin or Greek origin, being...
We participated in CLEF'2001 with four automated bilingual runs. UTACLIR is an automatic query translation and construction system for cross-language information retrieval. The system automatically extracts topical information from...
Name translation is important well beyond the relative frequency of names in a text: a correctly translated passage, but with the wrong name, may lose most of its value. The Nightingale team has built a name translation component which...
This paper proposes a formal justification for recognizing life in non-biological systems through the structural validation of memory, coherence, and phase alignment. Building on the Harmonic Nexus framework and recent extensions to RMS ×...
The goal of Cross Language Information Retrieval (CLIR) is to provide users with access to information that is in a different language from their queries. It lets users issue a query in one language and retrieve documents in...
This article presents visual analysis methods for representing large volumes of dynamic, ambiguous data stored in a repository of learning objects. These methods are based on the semantic representation of these...
Other friends who have supported me technically or morally during these 5 years are too numerous to list here individually; to all of them I say "thanks." I cannot end without acknowledging the generous encouragement that I have received...
A Hybrid Approach to Cross-Linguistic Tokenization: Morphology with Statistics. Logan R. Kearsley, Department of Linguistics and English Language, BYU, Master of Arts. Tokenization, or word boundary detection, is a critical first step for...
The string "this Italian cheese is expensive" is parsed to the tree. This tree is linearized to the string. Tree, graphically
This paper presents a new comprehensive multi-level Part-Of-Speech tag set and a Support Vector Machine based Part-Of-Speech tagger for the Sinhala language. The currently available tag set for Sinhala has two limitations: the...
Approximate search is a valuable component of online dictionaries for learners, allowing them to find words even when they have not fully mastered the orthography or cannot reliably perceive phonemic differences in the language. However,...
Theoretical research was carried out on the representation of Internet news as a marketing data component relevant to price strategies in the polymer market. The research includes some intermediate tasks. Firstly, the classification of marketing...
Automatic measures of semantic distance can be classified into two kinds: (1) those, such as WordNet, that rely on the structure of manually created lexical resources and (2) those that rely only on co-occurrence statistics from large...
This paper reports our proposal and experimental results at the NTCIR-4 CLIR task. For monolingual information retrieval, we use a combination strategy that integrates words and n-grams at the ranked list level. In combining words and...
Pattern recognition in cognitive agents is based on (i) the uninterpreted input data (e.g. parameter values) provided by the agent's hardware devices and (ii) the interpreted patterns (e.g. templates) provided by the agent's memory....
In this report, we describe our question-answering system SAIQA-e (System for Advanced Interactive Question Answering in English) which ran the main task of TREC-10's QA-track. Our system has two characteristics (1) named entity...
The Clairvoyance team participated in the High Accuracy Retrieval from Documents (HARD) Track of TREC 2004, submitting three runs. The principal hypothesis we have been pursuing is that small numbers of documents in clusters can provide a...
This paper describes correction and expansion techniques for multilingual search queries submitted to the Arabic-centred search engine Barq. Key features of the correction technique are the use of 1. Arabic language morphology, 2. Arab...
IR systems' ability to retrieve highly relevant documents has become more and more important in the age of extremely large collections, such as the WWW. Our aim was to find out how well corpus-based CLIR performs in retrieving highly relevant...
We present a new method for creating a comparable document collection from two document collections in different languages. The best query keys were extracted from a Finnish source collection (articles of the newspaper Aamulehti) with the...
In this study the basic framework and performance analysis results are presented for the three-year-long development process of the dictionary-based UTACLIR system. The tests expand from bilingual CLIR for three language pairs Swedish,...
CLIR resources, such as dictionaries and parallel corpora, are scarce for special domains. Obtaining comparable corpora automatically for such domains could be an answer to this problem. The Web, with its vast volumes of data, offers a...
This paper analyzes the features of the Swedish language from the viewpoint of mono- and cross-language information retrieval (CLIR). The study was motivated by the fact that Swedish is poorly known from the IR perspective. This paper shows...
We develop a deductive data model for concept-based query expansion. It is based on three abstraction levels: the conceptual, linguistic and string levels. Concepts and relationships among them are represented at the conceptual level. The...
We devised a novel statistical technique for the identification of the translation equivalents of source words obtained by transformation rule based translation (TRT). The effectiveness of the technique called frequency-based...
This paper considers the use of controlled languages for query translation in a legislative document retrieval system. Problem statement and analysis of the approach are described. The use of controlled languages is motivated by the fact...
Abstract: Most currently available test collections and almost all CLIR collections have focused upon general-domain news stories. In addition, most of these corpora are built to help with retrieval of documents based on monolingual...
Non-English-speaking users, such as Arabic speakers, are not always able to express terminology in their native languages, especially in scientific domains. Such difficulty forces many Arabic authors and scholars to use English terms in...
We present a framework based on Statistical Topics Models, Language Models, Information Extraction, and Ontology Analysis to retrieve healthcare related documents for the CLEF eHealth 2013 Task 3. In this framework we add global...
A presentation of how the trigonometric functions are defined. Starting from their values in the first quadrant, the functions are extended to the whole interval [0, 2*pi], and then to R.
Tamil literature has a rich and long literary tradition spanning more than two thousand years. The oldest extant works show the signs of maturity indicating an even longer period of evolution. This article presents a software framework...
The Universal Networking Language (UNL) is an artificial language that can replicate human language functions in cyberspace in terms of hyper semantic networks. This paper aims to: a) design a reference grammar capable of dealing with the...
This paper covers a use study of the Online Public Access Catalogues (OPACs) at the University Libraries of West Bengal. It highlights subject access for Bengali documents in OPACs. It finds that most of the users are postgraduate...
In the field of Library and Information Science, the accurate representation and retrieval of information are of utmost importance. Information representation and indexing are critical processes that facilitate the efficient access and...
SWETWOL is implemented in the framework of the two-level model. It contains a 48,000-item lexicon and a full inflectional description. Special attention was paid to the design of a computational analysis of productive Swedish compounds....
Abstract: In this paper, we propose a gene mention recognition system for biomedical literature using Support Vector Machine based on a reformed lexicon. Then we present an ensemble of rule-based post-processing modules, an integrity check...
LangHIT: Fast and Fault-Tolerant Multilingual Lexical Search via eager sparse scoring and RapidFuzz
We introduce LangHIT, an innovative extension of BM25S that integrates multilingual tokenization, adaptive multilingual stopword filtering, and ultra-fast typo correction using RapidFuzz within the BM25S token matching framework. When...
Cross-Language Information Retrieval (CLIR) is responsible for retrieving information stored in a language different from the language of the query provided by the user. Some translation methods commonly used in CLIR are Dictionary,...
In TREC-2003, we participated in Question Answering and Genomics Tracks. Since the QA system was essentially the same as the past years' systems[1, 2], we describe our results with the Genomics Track in this paper.
Parallel texts, i.e., texts in one language and their translations to other languages, are very useful nowadays for many applications such as machine translation and multilingual information retrieval. If these texts are aligned in...
Followup to the 1997 community note "Thoughts on Regular Expressions for Ethiopic". This document has been prepared for members of the perl-unicode email list to describe by way of example the limitations of working with the Ethiopic... more
The increasing amount of available textual information makes the use of Natural Language Processing (NLP) tools necessary. These tools have to be used on large collections of documents in different languages. But NLP is a complex task...