Academia.edu

Document representation

572 papers
21 followers
About this topic
Document representation refers to the methods and techniques used to encode and structure textual or visual information in a format suitable for processing, analysis, and retrieval by computer systems. It encompasses various approaches, including vector space models, semantic representations, and markup languages, aimed at facilitating efficient information access and manipulation.
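As a rough illustration of the vector space models mentioned above, the following minimal sketch encodes each document as a bag-of-words term-frequency vector and compares two documents by cosine similarity. The function names (`bag_of_words`, `cosine_similarity`), the whitespace tokenizer, and the sample sentences are assumptions made for this example; they are not taken from any of the papers listed on this page.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to a sparse term-frequency vector (term -> count).

    Assumption: lowercasing plus whitespace splitting stands in for a
    real tokenizer.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, for illustration only.
doc_a = bag_of_words("text categorization assigns texts to categories")
doc_b = bag_of_words("document clustering groups similar texts")
print(cosine_similarity(doc_a, doc_b))
```

Raw term counts are the simplest weighting choice; weighting schemes such as TF-IDF are a common refinement of the same representation.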
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize... more
Text categorization, the assignment of natural language texts to one or more predefined categories based on their content, is an important component in many information organization and management tasks. We compare the effectiveness of... more
In this paper, an introduction and survey over probabilistic information retrieval (IR) is given. First, the basic concepts of this approach are described: the probability-ranking principle shows that optimum retrieval quality can be... more
This document representation is successively transformed into a sequence of sentence plans (together with formatting instructions in a selectable target format; SGML, LaTeX, Zmacs and - for screen output - formatted ASCII are currently... more
Previous research on advanced representations for document retrieval has shown that statistical state-of-the-art models are not improved by a variety of different linguistic representations. Phrases, word senses and syntactic relations... more
Although we see the positive results of information retrieval research embodied throughout the Internet, on our computer desktops, and in many other aspects of daily life, at the same time we notice that people still have a wide variety... more
Drawing on the theory of documents representation (Perfetti, Rouet, & Britt, 1999), we argue that successfully dealing with multiple documents on the WWW requires readers to form documents models, that is, to deal with contents and... more
This study investigates the use of criteria to assess relevant, partially relevant, and not-relevant documents. Study participants identified passages within 20 document representations that they used to make relevance judgments; judged... more
In this paper we present a new document representation model based on implicit user feedback obtained from search engine queries. The main objective of this model is to achieve better results in non-supervised tasks, such as clustering... more
Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia... more
In this paper we describe work relating to classification of web documents using a graph-based model instead of the traditional vector-based model for document representation. We compare the classification accuracy of the vector model... more
Scholarly digital libraries increasingly provide analytics to information within documents themselves. This includes information about the logical document structure of use to downstream components, such as search, navigation, and... more
is becoming the most relevant standardization effort in the area of document representation through markup languages. Through XML, it is possible to define complex documents, containing information at different degrees of sensitivity.... more
Document representation using the bag-of-words approach may require bringing the dimensionality of the representation down in order to be able to make effective use of various statistical classification methods. Latent Semantic Indexing... more
Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when... more
The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available in the Internet. We are facing a new challenge due to the fact that web documents have a rich structure and are... more
In this paper we describe a type of data fusion involving the combination of evidence derived from multiple document representations. Our aim is to investigate if a composite representation can improve the online detection of novel events... more
The publication of material in 'electronic form' should ideally preserve, in a unified document representation, all of the richness of the printed document while maintaining enough of its underlying structure to enable searching and other... more
The use of the computing with words paradigm for the automatic text documents categorization problem is discussed. This specific problem of information retrieval (IR) becomes more and more important, notably in view of a fast... more
Active reading, involving acts such as highlighting, writing notes, etc., is an important part of knowledge workers' activities. Most computer-based active reading support seeks to replicate the affordances of paper, but paper has... more
Multiclass support vector machine (SVM) methods are well studied in recent literature. Comparison studies on UCI/statlog multiclass datasets suggest using one-against-one method for multiclass SVM classification. However, in unilabel... more
We present a text document search engine with several new visualization front-ends that aid navigation through the set of documents returned by a query (short "returned documents"). Our methods are based on identifying and selecting... more
Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this... more
Increasingly large text datasets and the high dimensionality associated with natural language create a great challenge in text mining. In this research, a systematic study is conducted, in which three different document representation... more
Artificial intelligence models may be used to improve performance of information retrieval (IR) systems and the genetic algorithms (GAs) are an example of such a model. This paper presents an application of GAs as a relevance feedback... more
In the Web context, link-based evidence is most commonly used in conjunction with content-based evidential information in order to improve retrieval effectiveness. This paper examines the impact the various types of link-based evidence and... more
Traditional information retrieval systems aim at satisfying most users for most of their searches, leaving aside the context in which the search takes place. We propose to model two main aspects of context: The themes of the user's... more
This paper presents a new document representation with vectorized multiple features including term frequency and term-connection-frequency. A document is represented by undirected and directed graph, respectively. Then terms and... more
This paper provides a thematic frame analysis of Australian newspaper reporting of the outcome and implications of the trial of Rolah McCabe versus British American Tobacco Australasia (BATA). In this trial, a Melbourne woman was awarded... more
Effective retrieval of structured documents should exploit the content and structural knowledge associated with the documents. This knowledge can be used to focus retrieval to the best entry points: document components that contain... more
In this paper, we explore the effects of data fusion on First Story Detection [1] in a broadcast news domain. The data fusion element of this experiment involves the combination of evidence derived from two distinct representations of... more
Most web content categorization methods are based on the vector-space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based... more
Text mining, intelligent text analysis, text data mining and knowledge-discovery in text are commonly used names for the process of extracting relevant and non-trivial information from text. Some crucial issues arise when trying to... more
In this paper we propose to automatically classify documents based on the meanings of words and the relationships between groups of meanings or concepts. Our proposed classification algorithm builds on the word structures provided by... more
Document clustering has the goal of discovering groups with similar documents. The success of the document clustering algorithms depends on the model used for representing these documents. Documents are commonly represented with the... more
In this paper the problem of indexing heterogeneous structured documents and of retrieving semistructured documents is considered. We propose a flexible paradigm for both indexing such documents and formulating user queries specifying... more
In this paper we describe a solution for incorporating background knowledge into the OntoGen system for semi-automatic ontology construction. This makes it easier for different users to construct different and more personalized ontologies... more
Being able to author a hypermedia document once for presentation under a wide variety of potential circumstances requires that it be stored in a manner that is adaptable to these circumstances. Since the nature of these circumstances is... more
Most recent document standards like XML rely on structured representations. On the other hand, current information retrieval systems have been developed for flat document representations and cannot be easily extended to cope with more... more
The third Cross-Language Evaluation Forum workshop (CLEF-2002) provides the unprecedented opportunity to evaluate retrieval in eight different languages using a uniform set of topics and assessment methodology. This year the Johns Hopkins... more
Most recent document standards rely on structured representations. On the other hand, current information retrieval systems have been developed for flat document representations and cannot be easily extended to cope with more complex... more
In Automatic Text Processing tasks, documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document... more
Most web content classification methods are based on the vector-space model of information retrieval. One of the important advantages of this representation model is that it can be used by both instance-based and model-based classifiers... more