Cross Language Information Retrieval Research Papers

Call for Papers: International Journal on Natural Language Computing (IJNLC)

by International Journal on Natural Language Computing (IJNLC)

2022

Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze,... more

Keyword-Driven Suffix Arrays for On-Line Keyword Searching from Documents In Chinese

by International Journal of Artificial Intelligence (IJAIA)

On-line keyword searching from documents in Chinese tends to use inverted indexing as the main technique, which has its difficulties. Suffix Array is widely used for processing text in Western languages. However, it fails to get widely... more

International research on bilingualism: Cross-language and cross-cultural perspectives

by Norbert Francis

2018, Ethnologia

Linguistics and the science of Anthropology have much in common. In fact, to a large extent the two fields overlap. Field workers utilize research models of the ethnographic type as well as approaches that are experimental, methods that... more

December 2021: Top 10 Read Article in Natural Language Computing

by International Journal on Natural Language Computing (IJNLC)

2021

Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze,... more

Evaluation of Oromo-English Cross-Language Information Retrieval

by Prasad Pingali

2000

This paper reports on the first Oromo-English CLIR system that is based on dictionary-based query translation techniques. The basic objective of the study is to design and develop an Oromo-English CLIR system with a view to enable Afaan... more

Table 2: Summary of average results for the three runs

Table 2 reveals that the OMTD (title and description) run and the OMTDN (title, description and narration) run achieved almost the same level of performance (a MAP of about 25%). The title run has slightly lower performance, with a MAP of 22%. We feel this is due to the fact that most of the title fields in the Afaan Oromo topics were very short.

Integrating different strategies for cross-language information retrieval in the MIETTA project

by Feiyu Xu

1998, Proceedings of TWLT14, …

TWLT is an acronym of Twente Workshop(s) on Language Technology. These workshops on natural language theory and technology are organised by the Parlevink Project, a language theory and technology project of the . For each workshop... more

Table 3: Decomposing an example sentence for translation. En: Release the wheel suspension on the left-hand side.

On an abstract level, the task of CLIR is not much different from traditional IR. A set of documents is acquired that are of potential interest to some user. The documents are processed in order to obtain information that may be relevant in a query situation. We may call this information simply document information. The document information is stored in a format that permits the matching of queries against this information. Addresses of the documents are stored together with the document information to enable the actual retrieval. The addresses may be URLs, locations on a storage medium or database identifiers.

hours”. On the other hand, related terms like “church” and “monastery”, which correspond to BUILDING, will lead to a template that includes slots describing “building period” and “architect”.

These mappings are crucial to the multilinguality of the information extraction and natural language generation systems and will be, as mentioned before, organized in a multilingual thesaurus that maps terms in one language onto those of another. The figure below (due to [9]) shows how language independent templates interact with generation into individual languages on the one side and mappings between information extracted in these languages on the other.

The query forms are meant to support the user by guiding his search, i.e., making it clear to him on what aspects of a certain class or category he can pose more specific queries. To further support a precise formulation of the (free text) query, the user can submit it to the query expansion component. Here, the query is first processed and normalized, where appropriate, and envisioned to take a central role in this. We assume that the user has basically two choices for searching: one is on the basis of free text queries, the other on the basis of concept-oriented classifications. Free text queries can take the user to different kinds of full text indices and/or to a classification hierarchy, if the search term happens to map onto one of the conceptual classes.

Figure 1: vector product weighting algorithm

If more than one translation per source language query term is used for searching we might still treat the translated query as a bag-of-words. As we will see in section 5 the way of weighting the possible translations is crucial for unstructured queries. In particular it is important to normalise the possible translations in such a way that for each source language query term the weights of possible translations sum up to one. Not using normalisation will make source language query terms with a lot of possible translations unintentionally more important than source language query terms that have fewer possible translations.
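The normalisation step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout and the raw weights are hypothetical and are not taken from the paper, which does not show its weighting code here.

```python
from collections import defaultdict


def normalise_translations(translations):
    """Scale each source term's translation weights so they sum to one."""
    normalised = {}
    for source_term, candidates in translations.items():
        total = sum(candidates.values())
        if total == 0:
            continue  # no usable translations for this source term
        normalised[source_term] = {t: w / total for t, w in candidates.items()}
    return normalised


def bag_of_words(normalised):
    """Flatten the per-term translations into one weighted bag of target terms."""
    bag = defaultdict(float)
    for candidates in normalised.values():
        for target_term, weight in candidates.items():
            bag[target_term] += weight
    return dict(bag)


# Hypothetical dictionary entries with made-up raw weights (assumption).
raw = {
    "bank": {"bank": 3.0, "bench": 1.0, "couch": 1.0, "sofa": 1.0},
    "rente": {"interest": 2.0},
}
query = bag_of_words(normalise_translations(raw))
```

After normalisation each source term contributes a total weight of one to the translated query, so the four-way ambiguous term no longer outweighs the unambiguous one.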

Figure 4: translation chart of third world war

Figure 1: Multimedia database architecture

Figure 2 proposes the design of the Mirror query processor. An IR system is described by its retrieval model, which defines the document representation, the query formulation, and the ranking function [26]. These three aspects are reflected in the design of our multimedia query processor, in, successively, the concept layer (document representation), the evidential reasoning layer (ranking function), and the relevance feedback layer (query formulation).

Figure 2: The multimedia query processor

Figure 3: A multimedia retrieval model based on Bayesian inference networks

data collections. Its main characteristic is the strict separation between the logical and physical databases. This separation provides data independence, and allows for algebraic query optimization in the translation from expressions at the logical level to queries executed in the physical database. Also, parallelization of the physical algebra is orthogonal to the logical algebra, such that we can transparently distribute the data over different database servers by changing only the mapping between the two views. In this paper, we only discuss query processing at the logical level. The interested reader is referred to [9] for a discussion of the implementation in the physical database.

MOA is an object algebra for the logical level, being developed by our research group. It provides an extensible nested object data model and an algebra on this model. The prototype implementation does not yet provide a query language at the conceptual level; queries can only be specified using MOA expressions. The MOA Tools translate the query expressions specified in MOA into efficient MIL programs that are executed in the Monet database system [1]. Monet is an extensible parallel database kernel that is intended to serve as a backend in various application domains [2]; e.g., image retrieval is supported by an extension module defining the ‘Acoi’ algebra [18]. Monet has also been used successfully for geographic information systems as well as commercial data mining applications.

(Chinchor [5]) has in fact shown that there were no significant differences between the top five systems in MUC-6 at a 98% level of confidence.

first N would be taken as the output of the hybrid system. duces a ranked order of the M most relevant texts from the new corpus. A figure of merit for routing systems is "precision at N", i.e. the percentage of relevant texts out of the first N retrieved.
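The "precision at N" figure of merit mentioned above can be computed as in the minimal sketch below. The function name and the example document identifiers are assumptions made for illustration; dividing by N even when fewer than N texts are returned is one common convention, not something the excerpt specifies.

```python
def precision_at_n(ranked_ids, relevant_ids, n):
    """Fraction of the first n retrieved texts that are relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_ids[:n]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    # Dividing by n (not len(top)) penalises result lists shorter than n.
    return hits / n


# Hypothetical ranking and relevance judgements (assumption).
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}
p_at_3 = precision_at_n(ranked, relevant, 3)
p_at_5 = precision_at_n(ranked, relevant, 5)
```

Here two of the first three texts are relevant, so precision at 3 is 2/3, and precision at 5 drops to 2/5 because no further relevant text appears.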

Figure 4: Enriched canonical tree structure. SU-node contains the syntactic subject and TOP collects all topicalised elements (CP’s, PP’s, NP’s, adverbs). V contains the main verb, and C&A all verbal complements and adverbials. In enriched structures, TOP and SU are linked to their deep structure positions. SU is linked to the deep structure subject position of the main verb in case of active, and to the leftmost position under C&A in case of passive sentences.

Figure 5: Syntactic tree containing the Linguistic Information Package for the VP of zonisamide affects epilepsy. The LIP consists of a candidate relation (REL), consisting of a candidate co-ordinator (COO), an agent (AGE) and an object (OBJ). Constituents are linked to deep structure positions by indexes (e.g. <1>).

the muscular-skeletal movements of the dancers and their positions in space. Movement notation systems are expected to ‘provide the key to relatively unambiguous communication through the creation of an agreed symbol system’ [3]. An example of a prominent system, Labanotation, is shown in Fig. 1. The notation is read from bottom to top, along a vertical temporal axis delimited by bars akin to those of a musical score. Symbols to the left of the centre refer to movements made by the left-hand side of the body: the foot, leg, torso, arm, hand and fingers, in that order. The symbols’ points, shadings and size capture the movement dynamics of direction, level of extension and duration.

has been extended to implement a video object database, alongside text analysis modules. Collateral text is processed in order to semi-automate video annotation by generating video objects with reference to a knowledge-base, Fig. 2.

Fig. 2: The KAB Video Annotation Overview

The KAB prototype, Fig. 3, lets the user build collections of linked videos and collateral texts. Annotations can be attached to the video in the form of video objects through a series of dynamic menus which show a selection of available representations (updated through the ‘Add Lexical Knowledge’ option). Searching is achieved by making a selection from similar menus, which returns a set of matching video objects. Current work is implementing the ‘Process Texts’ function so that collateral texts are analysed to automatically suggest video objects grounded in lexical resources and knowledge-bases. As well as being used to match queries for retrieval purposes, the expressions attached to video objects can also be used to explain the video contents to the viewer when browsing, e.g. by showing an expert’s commentary on a sequence or offering a link to related media. For further information about the development of KAB see [20].

Fig. 3: The KAB Prototype main menu and example video with collateral text

Figure 4 Part of the story board (images)

Figure 3 Text hits in the subtitles

Figure 1: Illustration for clause 1

Clause 1 is illustrated by an image depicting the squeezing of ointment (Figure 1). Clause 2 is illustrated by a picture showing a finger entering the left nostril (Figure 2), while clause 3 is illustrated by a similar picture involving the right nostril (Figure 3):

Figure 1: Plot of average precision against term weighting parameters b and K for TREC-7/SDR local development queries (left), and TREC-7/SDR evaluation queries (right).

Figure 2: Effect of the query expansion parameters nr (maximum number of relevant documents to consider) and nt (maximum number of terms to add) on the average precision for our local development queries using ABBOT speech recognizer output. The Jun 1997 - Feb 1998 LA Times/Washington Post portion of the 1998 TREC-7/SDR language model corpus was used as the query expansion collection.

Figure 3: Effect of query expansion on retrieval of recognizer output for local development queries. Query expansion was performed on (1) LA Times/Washington Post newswire text (LM-qe); (2) the recognizer transcripts that made up the test collection (S1-qe); and (3) no query expansion (noqe).

Figure 4: Recall-precision curves of the THISL system running on various transcripts submitted for TREC-7/SDR.

Figure 5: Effect of query expansion on recall-precision for evaluation R1 and S1 conditions (post-evaluation experiment).

Figure 6: Query-by-query effect of query expansion in terms of change in average precision compared with no query expansion.

Figure 1: System architecture

The following figure shows the architecture of the TNO system as used in the TREC7 experiments. For the pilot experiment, which is described in section 3, a more simplistic term weighting strategy was used.

Figure 2: Trading speed for Recall Hybrid system performance

Figure 3: Rough estimate of false alarm rate

Figure 1: Architecture of the Sumatra system

The Sumatra system uses the following architecture to perform these steps: The input of the system consists of a text. The output consists of a summarized version of that text. Let’s take a closer look at the different components of the system.

Text generator

A relation-oriented graph-pruning method selects certain relations from a semantic structure, together with their arguments, and discards the rest of the structure. Figure 2 shows the selection of relation A and the resulting summarized structure.

Figure 3: The connectivity of a relation

There are several ways to define the importance value of a relation. A possible definition could use the connectivity of a relation as its importance value. The connectivity of a relation is defined as the number of connections that must be severed to cut the relation and all its arguments free from the structure. Consider, for example, the following semantic structure:

Figure 2, Query screen

the query bar a table is presenting the documents that contain this query phrase or a phrase that according to some similarity measure is related to it. The table gives the relevant document, its source (which can be either the Twenty-One database containing electronic versions of paper documents, or documents from remote sites that have been marked as relevant to the application domain), the page in the document containing the matching phrase, the

Table 3: results of ’unstructured query’ runs

Table 4: results of ’structured query’ runs

Table 4 lists the results of the structured query runs. Normalisation of term weights is implicit in the structured query, so run3a and run3b will give exactly the same results as run3c and run3d respectively.

Table 1: A translation memory example interface

Fr: Monter l’élément de suspension droit.

Table 2: Granularity and type of sub-sentence alignments generated by TRINITY

En: Connect the wheel suspension on the right-hand side.

Table 4: Estimating expected counts via the Iterative Proportional Fitting Algorithm

Table 1: Excerpts from two corresponding D- texts and one I-text

Table 3: Preponderant general language lexical items in the D-texts

The mid-ranges of the SL / GL list were filled with more generally familiar words which describe movement and actions, e.g. bend, come, hold, roll, kneel, walk. Words that locate movements and actions in space and time also appear, e.g. diagonal, forward, left, right, across and continuing, occasional, while, sporadic.

Table 4: Semi-fossilized phrases in the D-texts

Table 5: Distribution of clauses by information content in one D-text

The even spread of clauses suggests that this is a useful classification for further analysis to be based on. Each clause can refer to 1, 2 or 3+ dancers, dancing in unison or taking different roles in an interaction. The contents of a clause can be modified by adjectives and adverbs to describe quality, and by prepositional phrases to situate them in space and time.

Table 7: Cohesion through reference in D-texts

In the above passage, a mention of a previously unseen dancer is cued by an indefinite article, another in this case. Subsequent mentions are referential (who, his, he) or elliptical. The return of a previous dancer is cued by the definite article the.

Table 9: Linking of phrases to form interpretations in the I-texts

Table 1: Use of multiple transcriptions derived from ABBOT on the TREC-6 known-item retrieval task. R1 are the reference transcripts, S1 are the transcripts produced by ABBOT using frame-level merging. Forward and backward are the decodings produced by the nets in isolation. The term ‘merged’ implies the concatenation of two or more sets of transcripts whereas the term ‘union’ implies the union of sets of transcripts, in which multiple occurrences of the same term are discarded.
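The distinction between ‘merged’ and ‘union’ drawn in this caption can be illustrated with a small sketch. The function names and the toy token lists are hypothetical, and treating ‘union’ as a plain set of terms is an assumption about what "multiple occurrences are discarded" means in practice.

```python
from collections import Counter


def merged(*transcripts):
    """Concatenate transcripts: term counts from each decoding add up."""
    counts = Counter()
    for tokens in transcripts:
        counts.update(tokens)
    return counts


def union(*transcripts):
    """Take the set union of terms: every term is kept exactly once."""
    terms = set()
    for tokens in transcripts:
        terms.update(tokens)
    return {term: 1 for term in terms}


# Toy forward and backward decodings of the same audio (assumption).
forward = "the cat sat".split()
backward = "the cat sang".split()
merged_counts = merged(forward, backward)
union_counts = union(forward, backward)
```

With merging, terms on which both decodings agree receive a higher term frequency, whereas the union only records that a term was hypothesised at least once.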

Table 2: Summary of TREC-7 Spoken Document Retrieval track results for different recognizer conditions, evaluated in terms of word error rate (WER), term error rate (TER) defined in the text, average precision (AveP) and R-precision (R-P). R1 refers to the reference transcripts; S1 refers to THISL speech recognition described in the paper; B1 and B2 are baseline recognition runs with different levels of pruning using CMU Sphinx-I] at NIST; CR-CUHTK refers to Cambridge University (HTK) speech recognition; CR-DERASRU-S1 and CR-DERASRU-S2 refer to DERA/SRU speech recognition; CR-DRAGON-S1 refers to Dragon Systems speech recognition.

Table 2: ISM + wordspotter hybrid system

Table 3: Results of the SDR TREC7 runs

where dl is the document length and avdl represents the average document length. This resulted in the following average precision values for the tasks R1, B1 and B2. For the S1 run we submitted a run based on the method described in The only conceptual difference with the DAS+ pilot set-up on the IR

and an MMR reranked order with λ = .5. They were asked to perform nine different search tasks to find information and were asked various questions about the tasks. They used two methods to retrieve documents, known only as R and S. Parallel tasks were constructed so that one set of users would perform method R on one task and method S on a similar task. Users were not told how the documents were presented, only that either “method R” or “method S” was used and that they needed to try to distinguish the differences between the methods. After each task we asked them to record the information found. We also asked them to look at the ranking for method R and method S and see if they could tell any difference between the two. The majority of people said they preferred the method which gave, in their opinion, the most broad and interesting topics. In the final section they were asked to select a search method and use it for a search task. 80% (4 out of 5) chose the method MMR to use. The person who chose Smart stated it was because “it tends to group more like stories together.” The users indicated a differential preference for MMR in navigation and for locating the relevant candidate documents more quickly, and pure-relevance ranking when looking at related documents within that band. Three of the five users clearly discovered the differential utility of diversity search and relevance-only search. One user explicitly stated his strategy:

3. DOCUMENT REORDERING

We implemented MMR in two retrieval engines, PURSUIT (an upgraded version of the original

Table 1. Sample output from the CBA program: List of the most highly weighted candidate substrings for the role of SPECIES.

After the text has been read in, all the fillers found for each role are collated. The substrings may be weighted, since some templates are more reliable than others. If any substring of a filler is found more than once, the weight associated with each instance is combined. The weight for each substring is also enhanced if the substring occurs in the title of the document or can be lexically validated [2] by matching a word or phrase in a lexicon of terms known to be feasible role fillers. The most highly weighted substring found in this way is taken to be the most likely interpretation of a given role.

Table 3. Machine generated list of roles and fillers corresponding to Table 2.

In many information retrieval experiments, the effectiveness of the retrieval is measured using the measures of recall and precision. Recall (R) measures the proportion of relevant material actually retrieved in response to a search [5]. Since our ultimate aim is to retrieve ALL t ne ne (f), followed by no match (g) at all. In making these comparisons, the best-match criterion of Gaussier, Langé & Meunier is employed [4]. Only matches between each machine-generated term and the best matching human-selected term are considered. However, if the best-matching term in the human-selected list can be matched even more strongly with a different machine-generated term, the match with the first machine-generated term is not considered.

Besides connectivity, other parameters can be used to determine the importance value of a relation. The first and last sentence of a paragraph are usually more important than other sentences, and the occurrence of certain signal words in a sentence can indicate that it is important. The importance value of relations that have been derived from such sentences can be altered by multiplying it with a certain boost factor. The Sumatra system uses the following heuristics to alter the importance value if a relation conforms to certain conditions:

The values for these boost factors have been obtained by extensive testing with six texts used in the final exams of the Dutch grammar school. For each text, a list of the relevant information elements was available and a script has been used to automatically determine the percentage of relevant information elements a summary contains. The values for the boost factors have been varied to maximize this percentage. The combination of values for the boost factors has been chosen manually, and because of the enormous search space of this optimizing problem, it is very likely that a better combination exists. A better combination could be found by using a neural network or genetic algorithm to find a higher (local) maximum.
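The idea of multiplying a relation's importance value by boost factors can be sketched as below. Everything specific here is an assumption made for illustration: the `Relation` fields, the two conditions and especially the boost values are made up, since the excerpt does not list the actual heuristics or values used in Sumatra.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    name: str
    connectivity: int  # connections to sever to cut the relation free
    in_first_or_last_sentence: bool = False
    has_signal_word: bool = False


# Made-up boost factors; the values tuned in the paper are not given here.
BOOST_POSITION = 1.5
BOOST_SIGNAL_WORD = 2.0


def importance(rel):
    """Start from connectivity and multiply in every boost that applies."""
    value = float(rel.connectivity)
    if rel.in_first_or_last_sentence:
        value *= BOOST_POSITION
    if rel.has_signal_word:
        value *= BOOST_SIGNAL_WORD
    return value


def prune(relations, keep):
    """Keep the `keep` most important relations and discard the rest."""
    return sorted(relations, key=importance, reverse=True)[:keep]


a = Relation("A", connectivity=3)
b = Relation("B", connectivity=2, in_first_or_last_sentence=True,
             has_signal_word=True)
summary = prune([a, b], keep=1)
```

In this toy case relation B has lower connectivity than A but ends up more important once both boosts apply, which is the effect the boost factors are meant to have.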

Arabic to French sentence alignment: exploration of a cross-language information retrieval approach

by Christian Fluhr

Sentence alignment consists in estimating which sentence or sentences in the source language correspond with which sentence or sentences in a target language. We present in this paper a new approach to aligning sentences from a parallel... more

Top 10 Read Article in Natural Language Computing: July 2021

by International Journal on Natural Language Computing (IJNLC)

2021

Natural Language Processing is a programmed approach to analyze text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze,... more

Integrating Different Strategies for Cross-Language Information Retrieval In the MIETTA Project

by Klaus Netter

1998, TWLT 14 Language Technology in …

Integrating Different Strategies for Cross-Language Information Retrieval in the MIETTA Project Paul Buitelaar, Klaus Netter, Feiyu Xu DFKI Language Technology Lab Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany {paulb, netter, feiyu}@... more

Arabic to French Sentence Alignment: Exploration of a Cross-language Information Retrieval Approach

by Christian Fluhr

Exploring Information Retrieval by Latent Semantic Indexing and Latent Dirichlet Allocation Techniques

by Radha Guha

International Research Journal of Computer Science

Today we are living in modern Internet era. We can get all our information from the internet anytime and from anywhere using a desktop PC or a smart phone. However, the underlying technology for relevant information retrieval from the... more

Evaluation of Two Statistical Machine Translation Systems within a Greek-English Cross-Language Information Retrieval Architecture

by Nikos Katris

Cross-language information retrieval (CLIR) systems cater for the requirements of users who need to access a pool of information published in a language that they do not speak. A CLIR system uses an information retrieval (IR) architecture with the addition of a machine translation component.
The purpose of this paper was to evaluate the performance of two statistical machine translation (SMT) systems within a CLIR pipeline: KantanMT, a cloud-based machine translation (MT) platform, and Moses, an open-source SMT toolkit. To train the MT systems we used the 1,073,225-sentence-pair parallel bilingual (Greek-English) EMEA corpus (Tiedemann 2009) and the 62,452-sentence-pair parallel bilingual (Greek-English) QTLP corpus. Both corpora were from the medical domain.
For the IR part of the experiment, the OHSUMED medical test collection was used. OHSUMED (Hersh et al. 1994) contains 233,445 abstracts from MEDLINE, and 63 English queries along with their correct answers. Prof. Theodore Kalamboukis from the Athens University of Economics and Business (AUEB) very generously provided the Greek version of the queries (Kotsonis et al. 2008). The Greek queries were then translated back into English using the two MT systems. Finally, the machine-translated queries were used for the retrieval of relevant documents from the OHSUMED database using Apache Solr.
Three experiments were conducted: one using the queries translated by KantanMT and Moses to retrieve relevant documents from the OHSUMED collection, one where we calculated the BLEU score for each MT system using the independent 2,469-sentence ECDC corpus, and one using the original human-produced queries to retrieve relevant documents. The top 10 retrieved documents from each set of 63 queries (KantanMT, Moses, human-produced) were compared to the gold standard provided in the OHSUMED collection.
In the first experiment, KantanMT’s precision (0.12) was found to be slightly better than the precision of Moses (0.10) and the same applied to F-measure, where KantanMT achieved a score of 0.07 and Moses a score of 0.05. However, both systems produced the same recall score (0.06). In the second experiment, Moses was found to yield a higher BLEU score (17.72) than KantanMT (11.74), which seems to confirm the theory that there is no correlation between translation quality and IR performance. In the third experiment, we concluded that the original English queries produced around double the F-measure (0.13) compared to the queries from KantanMT (0.07) and Moses (0.05).
Finally, we conclude that there is a lot of room for more research on the use of full SMT systems in CLIR applications, especially involving the Greek language.
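The precision, recall and F-measure figures reported above compare the top retrieved documents for each query with a gold standard. A minimal sketch of that kind of evaluation follows; the function names and the toy document identifiers are assumptions, and averaging per-query scores is only one common aggregation choice, since the abstract does not say how its figures were aggregated.

```python
def precision_recall_f1(retrieved, relevant):
    """Score one query: retrieved is a ranked list, relevant is a gold set."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_average(per_query_scores):
    """Average precision, recall and F1 over queries (one possible choice)."""
    if not per_query_scores:
        raise ValueError("no queries to average")
    n = len(per_query_scores)
    return tuple(sum(s[i] for s in per_query_scores) / n for i in range(3))


# Toy data: top-10 results for one query and its gold-standard answers.
top10 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gold = {2, 5, 11, 12}
p, r, f1 = precision_recall_f1(top10, gold)
averaged = macro_average([(p, r, f1), (0.0, 0.0, 0.0)])
```

Note that the average of per-query F-measures generally differs from the F-measure computed from averaged precision and recall, so the aggregation method matters when comparing systems.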

Reliable measures for aligning Japanese-English news articles and sentences

by Hitoshi Isahara

2003, Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - ACL '03

We have aligned Japanese and English news articles and sentences to make a large parallel corpus. We first used a method based on cross-language information retrieval (CLIR) to align the Japanese and English articles and then used a... more

Aperfeiçoando a Interação entre Estudantes de Medicina e Máquinas de Busca

by Thais Machado

Abstract - This article seeks to identify difficulties encountered by medical students when searching for technical medical information. The main objective is to show the importance of information retrieval in the medical field by means of... more

Dictionary-Based Cross-Language Information Retrieval: Learning Experiences from CLEF 2000–2002

by Heikki Keskustalo

2000, Information Retrieval

This paper reviews literature on dictionary-based cross-language information retrieval (CLIR) and presents CLIR research done at the University of Tampere (UTA). The main problems associated with dictionary-based CLIR, as well as... more

A cross-language information retrieval based on an Arabic ontology in the legal domain

by Samah Alydai

2005, Proceedings of the …

In this paper, we describe a web-based multilingual tool for Arabic information retrieval based on ontology in the legal domain. We illustrate the manual construction of the ontology and the way it is edited using Protégé2000. Using... more

Figure 1. The general architecture of the proposed system

The system that we propose to improve Arabic information retrieval on the Web in the legal domain is situated in a general architecture of an Arabic search engine supporting the translation of English or French queries. The aim is to return documents written in Arabic, French or English.

Figure 3. Presentation of the concepts by using Protégé-2000

Figure 2. The hierarchy of the concepts

A SURVEY ON INFORMATION RETRIEVAL METHODS IN REGIONAL LANGUAGES

by IRJCS: International Research Journal of Computer Science

2019, IRJCS: AM Publications, India

Data available on the web is growing at an exponential rate, so creating knowledge or extracting information is of paramount importance. Information Retrieval (IR) plays a crucial role in knowledge management as it helps us to find the... more

PERSONAL IDENTITY MATCHING

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

Despite all existing methods to identify a person, such as fingerprint, iris, and facial recognition, the personal name is still one of the most common ways to identify an individual. In this paper, we propose two novel algorithms: The... more

Automatic Language-Specific Stemming in Information Retrieval

by John Goldsmith

2001, Lecture Notes in Computer Science

We employ Automorphology, an MDL-based algorithm that determines the suffixes present in a language-sample with no prior knowledge of the language in question, and describe our experiments on the usefulness of this approach for... more

Fig. 1. The basic design of the Chicago IR system, using Automorphology to stem terms from queries and documents, and employing standard SMART vector-based retrieval.

Our system was run in monolingual IR tests in the CLEF project in 2000 involving Italian, French, and German. The principal results are presented in Figure 2.

Fig. 2. Precision rates for CLEF experiments on French, German, and Italian

Hindi to English and Marathi to English cross language information retrieval evaluation

by Gajanan Shastrakar

2008, … in Multilingual and …

In this paper, we present our Hindi to English and Marathi to English CLIR systems developed as part of our participation in the CLEF 2007 Ad-Hoc Bilingual task. We take a query translation based approach using bi-lingual dictionaries.... more

Fig. 2. Translation Disambiguation: Co-occurrence Graph for Disambiguating Translations and Transliterations, Comparison of Dice Coefficient and PMI

Fig. 3. CLEF 2007 Ad-Hoc Monolingual and Bilingual Precision-Recall Curves

Table 2. CLEF 2007 Ad-Hoc Monolingual and Bilingual Overall Results (Percentage of monolingual performance given in brackets below the actual numbers)

DC is defined as follows:


"They Are Out There, If You Know Where to Look": Mining Transliterations of OOV Query Terms for Cross-Language Information Retrieval

by Raghavendra Udupa

2009

It is well known that the use of a good Machine Transliteration system improves the retrieval performance of Cross-Language Information Retrieval (CLIR) systems when the query and document languages have different orthography and phonetic... more


How to conduct legal research in the English legal system.

by Constanze Andel

How to conduct legal research in the English legal system using only official resources from the courts, without paying for a subscription to a legal information provider.


Dictionary-Independent Translation In CLIR Between Publication Details

by Sanna Kumpulainen

en.scientificcommons.org

Figure 1: Example of a set of candidate pages with probabilities and keywords and a hierarchical structure for this set. ‘Nota’ is short for ‘None of the above’.

Table 1: The average number of steps of simulated users with perfect choices.

Figure 4: The average number of steps of simulated users with various noise levels.

Figure 5: The average number of steps that human users needed to reach their target pages.

Table 2: The average number of steps of human users and the average number of targets that were found by human users.

Figure 4: Number of result clicks per user in the focused interface. Dark: Clicks on page-links. Light: Clicks on focused-links.

Figure 5: Linear depth of Wikipedia pages which have one or more sections. The distribution of pages over the number of sections is plotted on a log-log scale.

Figure 6: Number of page visits per user in the focused interface. Dark: Pages visited via the result list. Light: Pages visited via internal links.

Table 5: Responses on system suitability for answering general questions: Mean rating and standard deviation (in brackets). Answers were on a 5-point scale, ranging from 1 (“very unsuitable”) to 5 (“very suitable”).

Table 4: Responses on system suitability for answering specific questions: Mean rating and standard deviation (in brackets). Answers were on a 5-point scale, ranging from 1 (“very unsuitable”) to 5 (“very suitable”).

Table 9: Page-link clicks vs. focused-link clicks in the focused interface: mean number of clicks and standard deviation (in brackets). Each search task contained three distinct search assignments.

Table 8: Time spent per search task (minutes): mean time and standard deviation (in brackets). Each search task was divided into three distinct search assignments.

Table 7: Page views per search task: Mean number of page views and standard deviation (in brackets). Each search task was divided into three distinct search assignments.

Table 10: Analysis of focused-clicks in the focused interface. Left: Type of element clicked (hierarchical depth). Right: Section number (in the Wikipedia source) of the sections clicked (linear depth).

Table 3: Comparison of the re-ranking approaches on R-precision scores. The underlined scores are statistically significant improvements over the baseline.

Table 1: With an increase in network size, the average in-degree k increases.

Figure 1: ViPF interface after a query was submitted: result panel (1), graph panel (2), info panel (3), subject panel (4), fitness color bar (5)

Figure 2: In-degree distribution of the Citebase data set with γ_in = 2.1564

Figure 3: Preferential attachment measurements of the Citebase data set

//Email[ subject = "Multimedia" and from = "Makici" and date="October"]

Figure 1: Examples of different ways of expressing an information need.

Figure 1. Recall-precision curves for all queries.

The extracted information is stored in an XML file which is then accessible by the retrieval component of the system. This component, which is currently under development, dynamically forms an index of the processed conferences based on the information found in the XML repository. When projected to the client’s browser, conferences are classified as open or past and are categorized by date. This tool will also allow multicriteria retrieval of conferences, such as “show me conferences in Athens or near Athens which are about Web mining and will take place this summer”. Support for these queries will be based on the location knowledge base and on the month dictionary.

Figure 1. Flowchart of the extraction procedure.

Figure 1: Precision vs. Recall for CACM.

Figure 3: Precision vs. Recall for ILK.

results of, for instance, a probabilistic IR model. However, it would also be interesting to investigate whether using other IR models such as probabilistic retrieval or a language modelling approach indeed shows this increase to be universal over the entire range of IR approaches.

Figure 2: Precision vs. Recall for CISI.

Table 2: Statistics on the main tags in the query set.

were 91 instances where query terms were “about” a field in an email, which indicates that a considerable amount of noise is present within the queries. We refer to this noise as ambiguity as the query being expressed to the system contains uncertain information with respect to the target email. We classified queries according to how much ambiguity was present, using three grades: (0) not, (1) somewhat, or (3) very ambiguous. If all the features occurred in the known-item, then there is little or no ambiguity (i.e., we assumed that the query term was put there with specific reference to some field in the email). However, if more than half the query features are present in the known-item email, then there is some ambiguity in the query. If the majority of query terms do not occur in the email then the query is very ambiguous. To some extent this measure reflects the loss of recall that is experienced by a user when formulating the query, assuming that they are trying to select (remember) the exact words and phrases from the email they have in mind. The more vague the user is about their missing email, presumably the less precise and more ambiguous the query will be as a result.

Figure 2: Examples of different ways of expressing an information need.
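The three-grade ambiguity rule described in the excerpt above can be sketched as follows. This is only an illustration under stated assumptions: features are matched by case-insensitive substring search (the excerpt does not say how matching was done), and a query with exactly half of its features present is treated as very ambiguous because the excerpt leaves that case open. The function name and labels are hypothetical.

```python
def grade_ambiguity(query_features, email_text):
    """Grade a known-item query by how many of its features occur in the email.

    Assumption: a feature "occurs" if it is a case-insensitive substring
    of the email text.
    """
    features = [f.lower() for f in query_features]
    if not features:
        raise ValueError("query has no features")
    text = email_text.lower()
    present = sum(1 for f in features if f in text)
    if present == len(features):
        return "not ambiguous"       # every feature is found in the email
    if 2 * present > len(features):
        return "somewhat ambiguous"  # more than half, but not all
    return "very ambiguous"          # half or fewer (tie handling is assumed)
```

For example, a query whose subject and sender both appear in the target email is graded "not ambiguous", while one that also names a month missing from the email drops to "somewhat ambiguous".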

Table 3: Classification Accuracy shown as a percentage (%) correct per class on the Independence and Dependence Classification models.

Table 1: Manual element name expansion lists based on INEX 2004 assessments.

Table 3: Equivalence classes in INEX IEEE collection.

Table 2: Score region algebra operators.

Table 4: INEX 2004 CAS experiments with different expansion classes evaluated using inex_eval and precision at different recall points.

Table 5: INEX 2005 CAS experiments with different expansion classes evaluated using nxCG at different recall points and ep/gr.

Table 6: INEX 2005 COS experiments with different expansion classes evaluated using nxCG at different recall points and ep/gr.

Table 7: INEX 2004 CAS experiments with different vague scenarios and rewriting techniques evaluated using inex_eval and precision at different recall points.

Table 8: INEX 2005 CAS experiments with different vague scenarios and rewriting techniques evaluated using nxCG at different recall points and ep/gr.

Table 9: INEX 2005 COS experiments with different vague scenarios and rewriting techniques evaluated using nxCG at different recall points and ep/gr.

Table 10: INEX 2004 CAS experiments on combining vague search and rewriting techniques evaluated using inex_eval and precision at different recall points.

Table 11: INEX 2005 CAS experiments on combining vague search and rewriting techniques evaluated using nxCG at different recall points and ep/gr.

Table 12: INEX 2005 COS experiments on combining vague search and rewriting techniques evaluated using nxCG at different recall points and ep/gr.

Table 1. The MAP values (%) for the test queries and their difference to the baselines (%) (* statistically significant difference, ** statistically highly significant difference)

Table 2. The interpolated recall precision averages (%) at standard recall level 10 for the test queries, and their difference to the baselines. (* statistically significant difference, ** statistically highly significant difference)

Table 3. The interpolated recall precision averages (%) at standard recall level 50 for the test queries, and their difference to the baselines. (* statistically significant difference, ** statistically highly significant difference)

Table 1. Precision and Recall of the extraction procedure

Table 2. Precision and Recall of location queries in Google

Table 1: Recall of Academy Award Winners

Recall. The number of entries in IMDb exceeds our ontology by far. Although our algorithm performs especially well on recent productions, we are interested in how well it performs on classic movies, actors and directors. First, we made lists of all Academy Award winners (1927-2005) in a number of relevant categories, and checked the recall (Table 1). IMDb has a top 250 of best movies ever. The algorithm found 85% of them. We observe that results are strongly oriented towards Hollywood productions. We also made a list of all winners of the Cannes Film Festival, the ‘Palme d’Or’. Alas, our algorithm only found 26 of the 58 winning movies in this category.

Table 2: Some examples to illustrate the difficulties in discriminating between person names and other text fragments.

Not all text fragments we have found in the extraction phase will be person names. Typically, historic periods, art styles, geographic names, etc. can also directly precede a time interval. Table 2 illustrates the difficulties in discriminating between person names and other text fragments. We note that West Mae is an inversion of the person name Mae West and that Napoleon Hill refers to a person as well as to a geographic location in the state of Idaho (USA).

nist. Table 3 gives the top 40 of the professions found, ranked by the number of times that these professions were found in the excerpts.

Table 3: The professions that were found most often.

Table 1: Characteristics of the three main test collections used in the experiments. The total author count (‘# total authors’) is the sum of the author count over all documents; the total number of unique authors (‘# unique authors’) is the sum of the author count over all documents with each author counted only once.

Table 2: Author-related characteristics of the six special test collections.


Proper Noun Extracting Algorithm for Arabic Language

by Riyad al-Shalabi and

2009

Information Retrieval, the results are not encouraging. Proper names are problematic for cross-language information retrieval (CLIR); detecting and extracting proper nouns in Arabic is a key step toward improving the effectiveness... more

Table 1. Keywords and special verbs

The location entity is recognized by the rule that stipulates: if the text contains a word whose lemma is in this list ( ) followed by a proper noun, this sequence of words is marked as a location.
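A minimal sketch of such a rule is shown below. The keyword list did not survive in this listing, so `LOCATION_KEYWORDS` holds hypothetical English placeholders, and the input is assumed to be already lemmatized and part-of-speech tagged, with `"PROPN"` as an assumed tag for proper nouns. None of these names come from the paper.

```python
# Hypothetical placeholder lemmas; the paper's actual Arabic keyword list is not shown here.
LOCATION_KEYWORDS = {"city", "village", "country"}


def mark_locations(tokens):
    """Return (keyword, proper_noun) pairs that the rule marks as locations.

    tokens: list of (word, lemma, pos) triples, assumed to come from an
    upstream lemmatizer and tagger.
    """
    marked = []
    for (word, lemma, _), (next_word, _, next_pos) in zip(tokens, tokens[1:]):
        # The rule fires when a keyword lemma is immediately followed by a proper noun.
        if lemma in LOCATION_KEYWORDS and next_pos == "PROPN":
            marked.append((word, next_word))
    return marked
```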


Improving Query Translation In English-Korean Cross-Language Information Retrieval

by Sung-Hyon Myaeng

2005, Information Processing & …


MIRACLE Retrieval Experiments with East Asian Languages

by Jose C Gonzalez-Cristobal

2005

This paper describes the participation of MIRACLE in NTCIR 2005 CLIR task. Although our group has a strong background and long expertise in Computational Linguistics and Information Retrieval applied to European languages and using Latin... more


Machine Translation on the Medical Domain: The Role of BLEU/NIST and METEOR in a Controlled Vocabulary Setting

by Sergio Furuie

2015

The main objective of our project is to extract clinical information from thoracic radiology reports in Portuguese using Machine Translation (MT) and cross language information retrieval techniques. To accomplish this task we need to... more


Structural and Semantic Transformations of Scientific and Technical Texts Using Modern English Machine Translation

by Ilya (w-495) Nikitin

2011, Structural and Semantic Transformations in Scientific and Technical Text under Machine Translation in Modern English

Aim of the work: to study the features of scientific and technical style with respect to machine translation. Objectives: to outline the stylistics of scientific and technical texts; to describe the operating principles and main types of machine translation systems; to study the linguistic devices of scientific style; to determine which linguistic devices remain in a text after translation; and to compare a text translated by machine with one translated by a professional translator. The object of the study is the transformations that arise in machine translation of scientific and technical literature. The subject of the study is the structural and semantic transformations in the machine translation of a particular scientific work from English into Russian. The following methods were used: analytical; comparative; study of monographs and articles; continuous sampling; methods of linguistic analysis; and generalization. The theoretical significance of the work lies in an attempt to collect and systematize information on a little-studied aspect of the interaction of two fields of knowledge: the stylistic features of scientific and technical literature, and the features of machine translation systems. The practical value of the work is determined by: a description of the features of scientific style in the form of diagrams and charts, which in our view is the most accessible and methodologically justified form; a body of collected information on machine translation systems, likewise presented as diagrams and charts; a stylistic analysis of excerpts from D. Knuth's classic work "The Art of Computer Programming"; and a comparison of the machine and "human" translations of that excerpt. The theoretical basis of the study comprises classic works by A. I. Galperin and I. V. Arnold on English stylistics and by Ya. I. Retsker and A. D. Schweitzer on the general theory of translation, as well as recent works by Yu. N. Marchuk, P. N. Khromenkov, and B. N. Rakhimberdiev on machine translation.
Special mention should be made of the work by S. Russell, which explains the operating principles of many artificial intelligence applications, including machine translation.


Dual PECCS: A Cognitive System for Conceptual Representation and Categorization

by Antonio Lieto and

In this article we present an advanced version of Dual-PECCS, a cognitively-inspired knowledge representation and reasoning system aimed at extending the capabilities of artificial systems in conceptual categorization tasks. It combines... more


Making miracles: Interactive translingual search for cebuano and hindi

by Anton Leuski

2003, ACM Transactions on …

Searching is inherently a user-centered process; people pose the questions for which machines seek answers, and ultimately people judge the degree to which retrieved documents meet their needs. Rapid development of interactive systems... more


Combining bidirectional translation and synonymy for cross-language information retrieval

by Jianqiang Wang

2006

This paper introduces a general framework for the use of translation probabilities in cross-language information retrieval based on the notion that information retrieval fundamentally requires matching what the searcher means with what... more


Interactive Cross-Language Document Selection

by Jianqiang Wang

2004, Information Retrieval

The problem of finding documents written in a language that the searcher cannot read is perhaps the most challenging application of cross-language information retrieval technology. In interactive applications, that task involves at least... more


Language Identification Strategies for Cross Language Information Retrieval

by Alessio Bosca

2010, Clef

In our participation in the 2010 LogCLEF track we focused on the analysis of the European Library (TEL) logs and in particular we experimented with the identification of the natural language used in the queries. Language identification is... more


NTCIR-2 ECIR Experiments at Maryland: Comparing Pirkola's Structured Queries and Balanced Translation

by Jianqiang Wang

2001

European languages. Monolingual Chinese retrieval experiments, by contrast, often find that character bigrams perform as well as (and sometimes better than) automatically segmented words. During the Mandarin-English Information (MEI)... more


CLIR Evaluation at TREC

by Michael Kluck

2000

Starting in 1997, the National Institute of Standards and Technology conducted 3 years of evaluation of cross-language information retrieval systems in the Text REtrieval Conference (TREC). Twenty-two participating systems used topics... more


Portuguese-English Experiments using Latent Semantic Indexing

by Christian Huyck

2000

This paper reports the work of Middlesex University for the CLEF bilingual task. We have carried out experiments using Portuguese queries to retrieve documents in English. The approach used was Latent Semantic Indexing, which is an... more


English-Latvian Toponym Processing: Translation Strategies and Linguistic Patterns

by Inguna Skadina

Available at: http://dspace. utlib. ee …

The paper presents a study of a challenging task in machine translation and cross-language information retrieval: translation of toponyms. Due to their linguistic and extra-linguistic nature, toponyms deserve a special treatment. The... more

Generally, LTTPs are the ways source toponymic units are rendered into target toponymic units. LTTPs can be of two types: in-word patterns and multi-word patterns. The in-word LTTP is a word transformation model, based on English-Latvian transliteration rules, including the most frequent prefixes, suffixes, and letter

Target string normalisation modifies a toponymic unit according to the rules of the Latvian grammar and orthography, e.g. all populated places are feminine gender (see P1): Newcastle > Ņūkāsla, which is indicated by the ending -a (feminine, singular nominative).


Language-specific encoding in multilingual corpora: Requirements and solutions

by Jost Gippert

1999

This is a special Internet edition of the article “Language-specific encoding in multilingual corpora: Requirements and solutions” by Jost Gippert (1999). It should not be cited. Citations should refer to the original edition in Multilinguale... more


Evaluating Language Resources for English-Indonesian CLIR

by Herika Hayurani

2006

We present a report on our participation in the Indonesian-English ad hoc bilingual task of the 2006 Cross-Language Evaluation Forum (CLEF). This year we compare the use of several language resources to translate Indonesian queries into... more


iCompileCorpora: A Web-based Application to Semi-automatically Compile Multilingual Comparable Corpora

by Hernani Costa and

This article presents an ongoing project that aims to design and develop a robust and agile web-based application capable of semi-automatically compiling monolingual and multilingual comparable corpora, which we named iCompileCorpora. The... more


A Cross Lingual Information Retrieval (CLIR) System for Afaan Oromo-English using a Corpus Based Approach

by Daniel Bekele

2015, International Journal of Engineering Research and

The goal of Cross Language Information Retrieval (CLIR) is to provide users with access to information that is in a different language from their queries. It has the ability to issue a query in one language and retrieve documents in... more

Figure 3.1 Architecture of the Afaan Oromo-English CLIR System

The architecture of the Afaan Oromo-English CLIR system is shown diagrammatically in figure 3.1 (adopted from Aynalem, 2009). As illustrated in the figure, the proposed CLIR system uses a number of phases to translate a given Afaan Oromo query into an English query. The major components involved in the Afaan Oromo-English cross-language information retrieval system are explained in the following sections. The corpus-based CLIR method relies on multilingual text collections, from which translation knowledge is derived using various statistical methods (Talvensaari et al., 2007). It is one of the query translation approaches of CLIR and uses either parallel or comparable corpora (Kishida, 2005) to establish a link between the query and the documents. However, Talvensaari (2008) showed that more accurate translation knowledge is extracted from a parallel corpus than from a comparable corpus. This research, therefore, uses parallel documents of Afaan Oromo and English to study the application of the corpus-based query translation approach of CLIR.

Figure 4.1 Average Recall-Precision graph of experimentation phase one for Afaan Oromo documents

Figure 4.2 Average Recall-Precision graph of experimentation phase one for English documents

Figure 4.3 Average Recall-Precision graph of experimentation phase two for Afaan Oromo documents

Figure 4.4 Average Recall-Precision graph of experimentation phase two for

In both experiments, the number of queries for which no documents were retrieved was larger for the bilingual run than for the monolingual run. This low performance of the bilingual run arose because source words were not correctly aligned with the corresponding target words. The wrong alignment was caused by the limited size of the parallel corpora used for building the bilingual dictionary.

Table 3.1 Vocabulary file for the given Afaan Oromo sentence

A word alignment for a parallel sentence pair represents the correspondence between words in a source language and their translations in a target language (Brown et al., 1993). In this study, word alignment represents the mapping between Afaan Oromo (source language) and English (target language). Word-aligned bilingual corpora are now used as an important source of knowledge. The word alignment model was first introduced in SMT by Brown et al. (1993). GIZA++ uses a statistical alignment model that computes a translation probability for each co-occurring word pair. A given source-language word may be aligned with several candidate target words, each with its own probability value. For example, for the following Afaan Oromo-English parallel sentence pair selected from the collected corpora, the vocabulary files are given in table 3.1 and table 3.2 and the bitext file generated for the sentence pair is given in table 3.3.

daani’eel akkamitti akka waaqeffachuu qabu beeka ture
daniel knew how to worship

The translation of Afaan Oromo queries into English was based on the Afaan Oromo-English bilingual dictionary, which was constructed automatically from the collected Afaan Oromo-English parallel corpora. The dictionary stores source words together with their corresponding target-word translations. Each word alignment in the dictionary lists all possible translations of a source word together with the probability of each alignment. This probability shows how likely it is that the Afaan Oromo word translates into the given English word; the highest probability indicates the best translation among the candidates. The bilingual dictionary was therefore constructed with the help of these probability values. A Python script was developed to select the alignment with the highest probability when a source word has more than one. Table 3.4 shows a sample of the constructed Afaan Oromo-English bilingual dictionary.

Table 3.4 Sample Afaan Oromo-English bilingual dictionary constructed

D. Translation

This component takes a query in one language and translates it into another; that is, it is the query translation phase. Query translation achieves CLIR with the help of the bilingual dictionary built from the collected parallel corpora. Translating the query is needed to retrieve documents in the target language. For this research, a given Afaan Oromo query was translated into its equivalent English query
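The selection and translation steps described above can be sketched roughly as follows. This is not the authors' script: the input format (source word, target word, probability triples), the function names, and the choice to drop out-of-vocabulary query words are all assumptions made for illustration, and the sample entries are placeholders rather than real GIZA++ output.

```python
def build_dictionary(alignments):
    """Keep, for each source word, the target word with the highest alignment probability.

    alignments: iterable of (source_word, target_word, probability) triples
    (an assumed format, not the actual GIZA++ file layout).
    """
    best = {}
    for source, target, prob in alignments:
        if source not in best or prob > best[source][1]:
            best[source] = (target, prob)
    return {source: target for source, (target, _) in best.items()}


def translate_query(query, dictionary):
    """Translate a query word by word; words missing from the dictionary are dropped."""
    return [dictionary[word] for word in query.lower().split() if word in dictionary]


# Placeholder entries standing in for aligned word pairs and their probabilities.
sample = [("w1", "t1", 0.2), ("w1", "t2", 0.7), ("w2", "t3", 0.9)]
dictionary = build_dictionary(sample)
```

Keeping only the single best candidate makes the dictionary compact, at the cost of discarding alternative translations that a weighted approach could still exploit.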

Table 3.5 Minimum edit distance of strings

The following table presents the edit distance for the selected strings. As shown in table 3.5, the greater the edit distance, the more the strings differ (i.e. they are not morphologically related). The edit distance is 0 (zero) if the strings are identical. A smaller edit distance indicates that the strings are morphologically related or are likely variants of each other.

The normalized Levenshtein distance returns a value between 0 and 1. If the value is 1, there is a strong similarity between the strings; if it is 0, there is no similarity between them. The closer the value is to 1, the more certain it is that the character strings are the same; the closer to 0, the less certain. Using this normalized edit distance minimizes the difficulties indicated in the above situation. The normalized edit distance values of the strings in table 3.5 are given in table 3.6.
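A minimal sketch of both measures is given below. The excerpt does not state the normalization formula, so the version here assumes the common choice of one minus the edit distance divided by the length of the longer string, which yields 1 for identical strings and 0 for strings with nothing in common, matching the behaviour described above. Function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

Normalizing by length keeps a one-character difference in a long word from being scored the same as a one-character difference in a very short word.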


CoLesIR at CLEF 2006: Rapid Prototyping of an N-gram-Based CLIR System

by Jesus Vilares

2006

In this our first joint participation as the CoLesIR group, our team has participated in the Portuguese monolingual ad-hoc task and in all robust ad-hoc tasks: all monolingual tasks, the English-to-German bilingual task, and the... more


Automated Construction of Arabic-English Parallel Corpus

by Ali Allam

aliallam.com

Large-scale parallel corpora have become a reliable resource for crossing the language barrier between the user and the web. These parallel texts provide the primary training material for statistical translation models and testing machine... more


Transitive dictionary translation challenges direct dictionary translation in CLIR

by Kalervo Järvelin

2004

The paper reports on experiments carried out in transitive translation, a branch of cross-language information retrieval (CLIR). By transitive translation we mean translation of search queries into the language of the document collection... more


Linear Combinations Based on Document Structure and Varied Stemming for Arabic Retrieval

by Mohammed Aljlayl

2002, Proceedings of the 11th …

For TREC 10 we participated in the Named Page Finding Task and the Cross-Lingual Task. In the web track, we explored the use of linear combinations of term collections based on document structure. Our goal was to examine the effects of... more


Experiments in Cross Language Query Focused Multi-Document Summarization

by Prasad Pingali

2000


Proper noun extracting algorithm for arabic language

by Riyad al-Shalabi and

2009, … Conference on IT, …

Information Retrieval, the results are not encouraging. Proper names are problematic for cross-language information retrieval (CLIR); detecting and extracting proper nouns in Arabic is a key step toward improving the effectiveness... more


Cognitive and Psychological Factors in Cross-Language Information Retrieval

by Rowena Li

Advances in Library and Information Science

While a lot of research has focused on the effectiveness of system functionality, few studies have examined information needs and social aspects related to cross-language information retrieval. This chapter aims to speculate on the human and... more


Improving translation accuracy in web-based translation extraction

by Shlomo Geva

In this paper, we present some approaches to improve translation accuracy in web-based translation extraction. In previous work, the term extraction techniques that researchers used were proposed for large static corpora. We proposed some... more


Web-based pattern learning for named entity translation in Korean–Chinese cross-language information retrieval

by Richard Tzong-Han Tsai

2009, Expert Systems with Applications

Named entity (NE) translation plays an important role in many applications, such as information retrieval and machine translation. In this paper, we focus on translating NEs from Korean to Chinese in order to improve Korean-Chinese... more

