In our participation in this evaluation campaign, our first objective was to analyze retrieval effectiveness when using The European Library (TEL) corpora, composed of very short descriptions (library catalog records), and also to evaluate the retrieval effectiveness of several IR models. As a second objective we wanted to design and evaluate a stopword list and a light stemming strategy for Persian (Farsi), a member of the Indo-European family of languages whose morphology is more complex than that of English.
In participating in this CLEF evaluation campaign, our first objective is to propose and evaluate various indexing and search strategies for the Russian language, in order to obtain better retrieval effectiveness than that provided by the language-independent approach (n-gram). Our second objective is to more effectively measure the relative merit of various search engines when used for German and, to a lesser extent, English. To do so we evaluate the GIRT-4 test collection using the Okapi model, various IR models derived from the Divergence from Randomness (DFR) paradigm, and the statistical language model (LM), together with the classical tf-idf vector-processing scheme. We also evaluate different pseudo-relevance feedback approaches. For the Russian language, we find that word-based indexing with our light stemming procedure results in better retrieval effectiveness than a 4-gram indexing strategy (relative difference around 30%). Using the GIRT corpora (available in German and English), we examine certain variations in retrieval effectiveness that result from applying the specialized thesaurus to automatically enlarge topic descriptions. In this case, the performance variations were relatively small and usually not significant.
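The language-independent n-gram approach mentioned above can be illustrated with a minimal sketch. It assumes overlapping character n-grams generated within each whitespace-separated word, with shorter words kept whole; the function name `char_ngrams` and these conventions are illustrative assumptions, not the exact procedure used in the paper.

```python
def char_ngrams(text, n=4):
    """Split text into overlapping character n-grams, word by word.

    Assumption for this sketch: n-grams do not span word boundaries,
    and words of length <= n are indexed as a single term.
    """
    terms = []
    for word in text.lower().split():
        if len(word) <= n:
            terms.append(word)
        else:
            # Slide a window of width n over the word.
            terms.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return terms


print(char_ngrams("Okapi model"))
```

Because no stemmer or stopword list is needed, the same function applies unchanged to any language written with word separators, which is why it serves as a baseline against language-specific stemming.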
For our participation in the CLEF 2006 campaign, our first objective was to propose and evaluate a decompounding algorithm and a more aggressive stemmer for the Hungarian language. Our second objective was to obtain a better picture of the relative merit of various search engines for the French, Portuguese/Brazilian and Bulgarian languages. To achieve this we evaluated the test collections using the Okapi approach, some of the models derived from the Divergence from Randomness (DFR) family and a language model (LM), as well as two vector-processing approaches. In the bilingual track, we evaluated the effectiveness of various machine translation systems for a query submitted in English and automatically translated into French and Portuguese. After blind query expansion, the MAP achieved by the best single MT system was around 95% of that of the corresponding monolingual search when French was the target language, or 83% with Portuguese. Finally, in the robust retrieval task we investigated various techniques for improving retrieval performance on difficult topics.
This paper first describes various strategies (character, bigram, automatic segmentation) used to index the Chinese (ZH), Japanese (JA) and Korean (KR) languages. Second, based on the NTCIR-5 test collections, it evaluates various retrieval models, ranging from classical vector-space models to more recent developments in probabilistic and language models. While no clear conclusion was reached for the Japanese language, the bigram-based indexing strategy seems to be the best choice for Korean, and the combined "unigram & bigram" indexing strategy is best for traditional Chinese. Across languages, the Divergence from Randomness (DFR) probabilistic model usually results in the best mean average precision. Finally, upon evaluating four different statistical tests, we find that their conclusions correlate, particularly when comparing the non-parametric bootstrap with the t-test.
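As a rough illustration of the combined "unigram & bigram" indexing strategy, the sketch below emits every character as a term, followed by every pair of adjacent characters. The function name `unigram_bigram_terms` and the handling of whitespace are assumptions made for this example; the paper's actual preprocessing (punctuation, Latin characters, stoplists) is not reproduced here.

```python
def unigram_bigram_terms(text):
    """Return all single characters plus all overlapping character bigrams.

    Assumption for this sketch: whitespace is dropped before indexing
    and no other filtering is applied.
    """
    chars = [c for c in text if not c.isspace()]
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    return chars + bigrams


print(unigram_bigram_terms("資訊檢索"))
```

This avoids the need for an automatic word segmenter: a multi-character word is still matched through its bigrams, while the unigrams cover single-character words.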
For our participation in this CLEF evaluation campaign, the first objective was to propose and evaluate various indexing and search strategies for the Hungarian language in order to produce better retrieval effectiveness than a language-independent approach (n-gram). Using both a new stemmer that includes some derivational suffix removals and a more aggressive automatic decompounding scheme, we were able to produce better retrieval effectiveness than the corresponding 4-gram indexing scheme. Our second objective was to obtain a better picture of the relative merit of various search engines with the French, Brazilian/Portuguese and Bulgarian languages. To do so we evaluated these test collections using the Okapi, Divergence from Randomness (DFR) and language model (LM) approaches together with nine vector-processing approaches. After pseudo-relevance feedback, either the DFR or the LM approach tends to produce the best IR performance. For the Bulgarian language, we also found that word-based in...
This paper describes our second participation in an evaluation campaign involving the Chinese, Japanese, Korean and English languages (NTCIR-5). Our participation is motivated by four objectives: 1) study the retrieval performances of various IR models for these languages; 2) compare the relative retrieval effectiveness of bigram and automatic word-segmenting approaches for Chinese and Japanese languages; 3) propose a new blind-query expansion hopefully capable of improving mean average precision; and 4) evaluate the relative performance of the various merging strategies used to combine separate result lists extracted from a corpus written in
This paper describes our participation in the TREC 2006 Genomics evaluation campaign. In an effort to find text passages that will meet user requests, we propose and evaluate a new approach to the generation of orthographic variants of search terms (mainly genomic names in our case). We also evaluate the retrieval effectiveness of both the Okapi (BM25) model and the I(n)B2 probabilistic model derived from the Divergence from Randomness paradigm. In our experiments, we find that in terms of mean average precision the latter model performs clearly better than the Okapi model (with a relative difference of 50%). Moreover, when comparing a 5-gram indexing approach to word-based indexing schemes, the mean average precision decreases by about 10% when using the n-gram indexing scheme. Additionally, including the article’s title in all passages generated from a given article does not improve retrieval effectiveness. Finally, the generation of passages delimited by HTML tags was not a succes...
In information retrieval, Asian languages present multiple challenges. Unlike in European languages, words are not explicitly delimited, which poses a problem for indexing. For this reason, several studies have proposed different strategies for representing documents (and queries) written in these languages. This article presents a comparison of the most common indexing strategies. In particular, we compared four strategies for Chinese and Japanese (unigrams, bigrams, unigrams and bigrams combined, and finally words) and three for Korean (words, bigrams and morphemes). Based on the NTCIR-5 test collections, we evaluated these different approaches using eleven retrieval models, namely two probabilistic approaches and nine vector-space approaches. A statistical analysis reveals that the four tests commonly used in information retrieval are correlated and that this relationship is...
This paper describes our participation in TREC-2005 for the ad hoc Genomic track, in which we evaluate five different stemming approaches to performing domain-specific searches within a MEDLINE subset. We also evaluate the impact that manually assigned descriptors (MeSH headings) have on retrieval effectiveness. We design a domain-specific query expansion scheme and compare it with the more classic Rocchio approach. In our experiments on this collection subset, we find that mean average precision does not improve when using different stemming algorithms. We then show that the presence of the MeSH headings significantly enhances mean average precision, by about 9%. Finally, we illustrate how the use of various query expansion techniques can impair retrieval performance.
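For readers unfamiliar with the classic Rocchio approach used as the comparison baseline, here is a minimal sketch of blind (pseudo-relevance) query expansion in that style. The top-ranked documents are assumed to be relevant, their term weights are averaged into a centroid, and the strongest centroid terms are added to the query. The function name `rocchio_expand`, the dictionary representation, and the values of `alpha`, `beta` and `k` are illustrative assumptions, not the settings used in the paper.

```python
from collections import Counter


def rocchio_expand(query, top_docs, alpha=1.0, beta=0.75, k=10):
    """Expand a query with the k strongest terms of the top documents.

    query:    dict mapping term -> weight
    top_docs: list of dicts (term -> weight), assumed relevant
    alpha, beta, k: hypothetical defaults chosen for this sketch
    """
    # Centroid (average term weight) of the pseudo-relevant documents.
    centroid = Counter()
    for doc in top_docs:
        for term, weight in doc.items():
            centroid[term] += weight / len(top_docs)

    # Re-weight the original query terms.
    new_query = {term: alpha * weight for term, weight in query.items()}

    # Add the k highest-weighted centroid terms (ties broken alphabetically).
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    for term, weight in ranked[:k]:
        new_query[term] = new_query.get(term, 0.0) + beta * weight
    return new_query


expanded = rocchio_expand(
    {"gene": 1.0},
    [{"gene": 2.0, "protein": 4.0}, {"protein": 2.0}],
    k=1,
)
print(expanded)
```

Because the top documents are only assumed to be relevant, the added terms can drift away from the user's intent, which is one way such expansion can hurt rather than help.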
Based on a relatively large subset representing one third of the MEDLINE collection, this paper evaluates ten different IR models (probabilistic, language model and vector-space approaches) using three different stemmers. The impact that manually assigned descriptors (MeSH headings) have on retrieval effectiveness is also evaluated. Finally, we propose both a new general blind-query expansion and a domain-specific query expansion scheme and compare them with the classic Rocchio approach. KEYWORDS: Information retrieval; evaluation; probabilistic model; language model; automatic query expansion; manual indexing.
This paper describes our third participation in an evaluation campaign involving the Chinese, Japanese and Korean languages (NTCIR-6). Our participation is motivated by three objectives: 1) study the retrieval performances of various probabilistic and language models for these languages; 2) compare the relative retrieval effectiveness of a combined “unigram & bigram” indexing scheme with that of an automatic word-segmenting approach for Chinese and Japanese languages; and 3) evaluate the relative performance of the various data fusion strategies used to combine separate result lists in order to enhance retrieval effectiveness.
Many people contributed, directly or indirectly, to the success of this thesis work. I thank my parents, my sisters and my brothers for their help and encouragement throughout my studies. I especially thank my brother Farid and my mother Houria for their support at every level. I will always be indebted to them for all the efforts they made on my behalf. I also warmly thank my friend Sandra for her patience, her kindness and her encouragement throughout this period. I would like to express my gratitude and sincere thanks to my thesis supervisor, Professor Jacques Savoy, who gave me the opportunity to work with him and to discover the world of information retrieval. His enthusiasm, his guidance and his humane approach to work provided an ideal environment in which to carry out this project.
Conférence en Recherche d'Information et Applications, 2006
In information retrieval, the Chinese and Japanese languages present multiple challenges. Unlike in European languages, words are not explicitly delimited, which poses a problem for indexing. For this reason, several studies have proposed different strategies for representing documents (and queries) written in these languages. This article presents a comparison of the most common indexing strategies. In particular, we compared four strategies for Chinese (unigrams, bigrams, unigrams and bigrams combined, and finally words), two for Japanese (bigrams and words) and three for Korean (words, bigrams and morphemes). Based on the NTCIR-5 test collections, we evaluated these different approaches using nine retrieval models, namely two probabilistic approaches and seven vector-space approaches.
Conférence en Recherche d'Information et Applications, 2007
This paper describes and evaluates vector-space, probabilistic and language IR models used to retrieve news articles from a corpus written in the French language. Based on three CLEF test collections and 151 topics, we analyze the retrieval effectiveness of these approaches and examine the poor retrieval results of hard topics. An appropriate robust evaluation is not easy because both the mean
This article describes the MEDLINE document database, from which a test collection of about 4.5 million structured documents was built on the basis of the TREC evaluation campaigns. In a second part, we evaluate and compare the retrieval effectiveness of ten models (probabilistic, language model and vector-space approaches). This evaluation is complemented by an analysis of the effectiveness of three stemmers for information retrieval operating in a domain-specific context. The impact of the MeSH descriptors, manually selected for each article, completes this analysis. Finally, we designed two new approaches to automatic query expansion, one general and the other domain-specific, and evaluated them by comparing them with the model proposed by Rocchio.
Papers by Samir Abdou