As the volume of information on the internet grows at a staggering rate, new methods are needed for retrieving documents and ranking them according to their relevance to the user's query. Information Retrieval... more
The paper demonstrates how the Laboratory Research Framework fits into the holistic Cognitive Framework for IR. It first discusses the Laboratory Framework with emphasis on its underlying assumptions and known limitations. This is... more
Abstract: In this paper we compare the relevance of information obtained from "discriminative" media and from "non-discriminative" media. Discriminative media are the ones which accumulate and deliver information using... more
Canonical Information Retrieval systems perform a ranked keyword search strategy: Given a user's one-off information need (query), a list of documents, ordered by relevance, is returned. The main limitation of that “one fits all”... more
How does an information user perceive a document as relevant? The literature on relevance has identified numerous factors affecting such a judgment. Taking a cognitive approach, this study focuses on the criteria users employ in making... more
Using a fuzzy-logic-based calculus of linguistically quantified propositions, we present FQUERY III+, a new, more "human-friendly" and easier-to-use implementation of a querying scheme proposed originally by Kacprzyk and Ziolkowski to... more
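The abstract above names a calculus of linguistically quantified propositions; the fragment below is only a minimal, hedged sketch of how such a quantified condition can be evaluated in the spirit of Zadeh's calculus, not the actual FQUERY III+ implementation. The piecewise-linear membership function for "most" and the example satisfaction degrees are illustrative assumptions.

```python
# Illustrative sketch only: the truth of "most of the query conditions are
# satisfied" is the quantifier's membership value at the mean satisfaction degree.
def most(r):
    # Assumed piecewise-linear membership function for the relative quantifier "most".
    if r >= 0.8:
        return 1.0
    if r <= 0.3:
        return 0.0
    return (r - 0.3) / 0.5

def truth_most_satisfied(condition_degrees):
    # condition_degrees: per-condition satisfaction degrees in [0, 1].
    return most(sum(condition_degrees) / len(condition_degrees))

print(truth_most_satisfied([1.0, 0.9, 0.4, 0.7]))  # degree to which "most" conditions hold
```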
Traditionally, keyphrases (or keywords) have been manually assigned to documents by their authors or by human indexers. This, however, has become impractical due to the massive growth of documents—particularly short articles (e.g.... more
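The snippet stops before the method itself; as a placeholder, the sketch below shows the simplest automatic alternative to manual indexing, a frequency-based keyword picker. It is not the approach of the paper above, and the stopword list and sample sentence are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "that", "this"}

def extract_keywords(text, k=5):
    # Tokenize to lowercase alphabetic tokens, drop stopwords and very short tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    # Rank the remaining terms by raw frequency and keep the top k.
    return [word for word, _ in counts.most_common(k)]

print(extract_keywords("Automatic keyphrase extraction assigns keyphrases to documents "
                       "without manual indexing, which scales to large document collections."))
```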
In this paper we describe a type of data fusion involving the combination of evidence derived from multiple document representations. Our aim is to investigate if a composite representation can improve the online detection of novel events... more
In this paper, we explore the effects of data fusion on First Story Detection [1] in a broadcast news domain. The data fusion element of this experiment involves the combination of evidence derived from two distinct representations of... more
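Neither abstract spells out the fusion rule, so the following is only a hedged sketch of score-level fusion in the CombSUM style: novelty scores from two document representations are min-max normalised and summed. The document ids and scores are made up for illustration.

```python
def normalise(scores):
    # Min-max normalise a {doc_id: score} mapping into [0, 1].
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def comb_sum(scores_a, scores_b):
    # Sum the normalised evidence from the two representations.
    a, b = normalise(scores_a), normalise(scores_b)
    return {doc: a.get(doc, 0.0) + b.get(doc, 0.0) for doc in set(a) | set(b)}

# Example: novelty scores from a "terms" run and a "named entities" run.
fused = comb_sum({"d1": 0.9, "d2": 0.2, "d3": 0.5}, {"d1": 0.4, "d2": 0.8, "d3": 0.1})
print(max(fused, key=fused.get))  # document most likely to report a novel event
```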
Information is commonly reflected in news articles. However, texts are unstructured and therefore difficult to analyze automatically. To identify and capture the facts in a news story we propose a novel approach, which utilizes natural... more
In this paper, we address the problem of collection selection, which is important for locating responses in digital libraries. The aim of such information retrieval methods is to reduce the amount of exchanged... more
Many modern applications produce and process XML data, which is queried in both its structural and textual components. This is especially useful if we consider a casual user who looks for information in web-based database systems or... more
The author has granted a non-exclusive licence allowing the National Library of Canada to... more
Documents exist in different formats. When we have document images, in order to access some, and preferably all, of the information contained in those images, we have to deploy a document image analysis application. Document images can be... more
This paper proposes a method to improve existing approaches for classifying documents into categories based on supervised machine learning. It includes converting the unstructured text data into a numerical vector form for... more
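As a hedged illustration of the pipeline this abstract describes (text converted to numerical vectors, then a supervised classifier), the sketch below uses TF-IDF vectors and a naive Bayes model from scikit-learn; the categories and training sentences are placeholders, and the paper's own improvements are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data standing in for a labelled document collection.
train_texts = ["stock markets rallied today", "the team won the championship game"]
train_labels = ["business", "sport"]

# Unstructured text -> numerical vectors -> supervised classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["shares fell after the earnings report"]))
```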
In this article, we present an approach for measuring semantic similarity between heterogeneous texts of differing quality coming from different Web sources. Our approach begins by extracting the content of the texts... more
Risk Management is a practice composed of processes, methods and tools for managing risks in projects; this activity typically starts during the initial phase of a project and continues throughout the whole project life cycle.... more
Identifying topics and concepts associated with a set of documents is a critical task for information retrieval systems. One approach is to associate a query with a set of topics selected from a fixed ontology or vocabulary of terms. The... more
The widespread use of the XML format for document representation and message exchange has influenced data integration techniques in recent years. The development of various XML languages, methods and tools has given rise to so-called XML... more
Document classification presents difficult challenges due to the sparsity and the high dimensionality of text data, and to the complex semantics of natural language. The traditional document representation is a word-based vector (Bag... more
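For reference, the word-based (bag-of-words) vector the abstract criticises can be sketched in a few lines; the toy documents below are assumptions, and the length of the vocabulary list hints at the sparsity problem.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat"]
# Shared vocabulary over the whole collection; one dimension per distinct word.
vocab = sorted({w for d in docs for w in d.split()})

def to_vector(doc):
    # Each document becomes a vector of raw term counts over the vocabulary.
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(to_vector(d))
```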
Our work is situated within a project on descriptive, conceptual and thematic annotation of textual corpora. In this article, we focus our attention on conceptual annotation, and more precisely on the... more
In this paper, we introduce TextRank, a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications. In particular, we propose two innovative unsupervised methods for... more
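A minimal sketch of a TextRank-style keyword ranker is given below: words become graph vertices, co-occurrence within a small window adds edges, and a PageRank-like iteration scores the vertices. It is deliberately simplified (no part-of-speech filtering, unweighted edges) compared with the model the paper proposes.

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=2, d=0.85, iters=30, k=5):
    words = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    # Add an undirected edge between words that co-occur within the window.
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    # PageRank-style iteration over the unweighted co-occurrence graph.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[u] / len(graph[u]) for u in graph[w])
                 for w in graph}
    return sorted(score, key=score.get, reverse=True)[:k]

print(textrank_keywords("graph based ranking models rank the vertices of a graph "
                        "recursively using global information from the graph"))
```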
In Automatic Text Processing tasks, documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document... more
New text analysis software emerging from research fields such as Machine Learning and Natural Language Processing is proving to be a relevant tool for the language sciences. Littératron is a new data-processing tool for the automatic... more
In this information age, a massive amount of unstructured data is made up of document collections. Examples include news articles, blog posts, scholarly publications, and reports generated by organizations as well as people. Many data... more
Most text classification systems use a bag-of-words representation of documents to find the classification target function. Linguistic structures such as morphology, syntax and semantics are completely neglected in the learning process. This... more
Most studies on authorship identification report a drop in identification accuracy when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across... more
In metaphorical conceptualization, the structure of one conceptual system is projected onto another. Our previous work suggests that idiomaticity in specialized languages is based in part on this phenomenon. However, since metaphorical... more
One of the most challenging aspects of developing information systems is the processing and management of large volumes of information. One way to overcome this problem is to implement efficient data indexing and classification systems.... more
Figure 13: Comparison of various loading techniques. The vertical line indicates the approximate size (1250 nodes) of cnn.com.
In this research, we enhanced the performance of Support Vector Machine (SVM) in text classification by applying semantic-knowledge enrichment. We propose a semantic-knowledge enrichment scheme to inject new concepts into the original... more
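The enrichment source and injection scheme are not given in this snippet, so the sketch below only illustrates the general idea: expand each document with related concepts from a toy, hand-made concept map before vectorising and training a linear SVM. The concept map, texts and labels are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for an external knowledge source mapping terms to broader concepts.
CONCEPTS = {"car": ["vehicle"], "football": ["sport"], "stock": ["finance"]}

def enrich(text):
    # Append the related concepts of every known term to the document.
    extra = [c for w in text.split() for c in CONCEPTS.get(w, [])]
    return text + " " + " ".join(extra)

texts = ["car prices and stock markets", "football season starts"]
labels = ["business", "sport"]

vec = TfidfVectorizer()
X = vec.fit_transform([enrich(t) for t in texts])
clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform([enrich("stock exchange news")])))
```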
Text document clustering plays an important role in providing better document retrieval, document browsing, and text mining. Traditionally, clustering techniques do not consider the semantic relationships between words, such as synonymy... more
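One simple, hedged way to picture the point about synonymy: normalise synonyms to a shared canonical term (a toy stand-in for a thesaurus such as WordNet) before vectorising and clustering, so that "car" and "automobile" documents can fall into the same cluster. This is an illustration, not the clustering method of the paper above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy synonym map; a real system would draw on a thesaurus or ontology.
SYNONYMS = {"automobile": "car", "soccer": "football"}

def normalise(text):
    return " ".join(SYNONYMS.get(w, w) for w in text.lower().split())

docs = ["the car engine failed", "an automobile repair shop",
        "football match tonight", "local soccer league results"]

X = TfidfVectorizer().fit_transform([normalise(d) for d in docs])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```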
MAD (Movie Authoring and Design) is a novel design and authoring system that facilitates the process of creating dynamic visual presentations. MAD aids this process by simultaneously allowing easy creation and modification of structured... more
This paper presents the first results of a study on the bargaining process of web standards in World Wide Web Consortium (W3C) arenas. This process is analysed through bargaining habits and through networks of actors who take part in it.... more
An overwhelming number of users now use e-books as their primary format. Gone are the days when buying a physical copy was the only option. Even though technology has advanced enormously in the display and visualization of texts... more
Feature selection (FS) is a widely used method for removing redundant or irrelevant features to improve classification accuracy and decrease the model's computational cost. In this paper, we present an improved method (referred to... more
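The improved method itself is not described in this snippet; the sketch below shows only the baseline filter-style FS step it builds on, scoring each term with chi-squared against the class labels and keeping the top-k features before training. The spam/ham toy data is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

texts = ["cheap pills buy now", "meeting agenda attached",
         "win money now", "quarterly report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(texts)
selector = SelectKBest(chi2, k=4)          # keep the 4 highest-scoring terms
X_reduced = selector.fit_transform(X, labels)
clf = MultinomialNB().fit(X_reduced, labels)
# Show which terms survived the selection.
print([vec.get_feature_names_out()[i] for i in selector.get_support(indices=True)])
```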
Answering mobile users' queries intelligently is one of the significant challenges in information retrieval (IR) for intelligent systems. Current popular Quranic retrieval applications rank documents by counting the occurrences of each... more
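The occurrence-counting ranking the abstract refers to can be sketched in a few lines; the verse texts below are placeholders, not actual Quranic text, and the paper's proposed improvement is not shown.

```python
def rank_by_term_count(query, documents):
    # Score each document by the total number of query-term occurrences.
    terms = query.lower().split()
    scores = {doc_id: sum(text.lower().split().count(t) for t in terms)
              for doc_id, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

docs = {"d1": "mercy and patience and mercy", "d2": "patience in hardship"}
print(rank_by_term_count("mercy patience", docs))  # ['d1', 'd2']
```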
Language resources are typically defined and created for application in speech technology contexts, but the documentation of languages which are unlikely ever to be provided with enabling technologies nevertheless plays an important role... more
This paper describes an algorithm for document representation in a reduced vector space by a process of feature extraction. The algorithm is applied and evaluated in the context of the supervised classification of news articles from... more
This paper describes an algorithm for document representation in a reduced vector space by a process of feature extraction. The algorithm is evaluated in the context of the supervised classification of news articles. We are generating... more
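The feature-extraction algorithm itself is not given in these snippets; as a hedged stand-in, the sketch below projects TF-IDF vectors into a small latent space with truncated SVD (latent semantic analysis), one common way to obtain a reduced vector space for supervised classification. The sample articles are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["central bank raises interest rates", "new smartphone released this week",
        "markets react to the rate decision", "review of the latest phone camera"]

X = TfidfVectorizer().fit_transform(docs)          # high-dimensional sparse vectors
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(X_reduced.shape)                             # (4, 2): each article in 2 dimensions
```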
Abstract: OntoGen is a semi-automatic and data-driven ontology editor focused on editing topic ontologies. It utilizes text mining tools to make ontology-related tasks simpler for the user. This focus on building ontologies from... more
Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract... more
Adobe's Acrobat software, released in June 1993, is based around a new Portable Document Format (PDF) which makes it possible to view and exchange electronic documents, independent of the originating software, across a... more
This paper draws a parallel between document preparation and the traditional processes of compilation and link editing for computer programs. A block-based document model is described which allows for separate compilation of various... more
The two complementary de facto standards for the publication of electronic documents are HTML on the World Wide Web and Adobe's Acrobat viewers using PDF (Portable Document Format). A brief overview is given of these two systems followed... more