Document representation Research Papers

Machine learning in automated text categorization

2002, ACM Computing Surveys

The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize... more

12 Cooper [1995] has pointed out that in this case the full independence assumption of (4) is not actually made in the Naive Bayes classifier; the assumption needed here is instead the weaker linked dependence assumption, which may be written as $\frac{P(d_j \mid c_i)}{P(d_j \mid \bar{c}_i)} = \prod_{k=1}^{|T|} \frac{P(w_{kj} \mid c_i)}{P(w_{kj} \mid \bar{c}_i)}$.

We may further observe that in TC the document space is partitioned into two categories, $c_i$ and its complement $\bar{c}_i$, such that $P(\bar{c}_i \mid d_j) = 1 - P(c_i \mid d_j)$. If we plug in (4) and (5) into (3) and take logs we obtain

Fig. 2. A decision tree equivalent to the DNF rule of Figure 1. Edges are labeled by terms and leaves are labeled by categories (underlining denotes negation).

Fig. 3. A comparison between the TC behavior of (a) the Rocchio classifier, and (b) the k-NN classifier. Small crosses and circles denote positive and negative training instances, respectively. The big circles denote the “influence area” of the classifier. Note that, for ease of illustration, document similarities are here viewed in terms of Euclidean distance rather than, as is more common, in terms of dot product or cosine.

Fig. 4. Learning support vector classifiers. The small crosses and circles represent positive and negative training examples, respectively, whereas lines represent decision surfaces. Decision surface $\sigma_i$ (indicated by the thicker line) is, among those shown, the best possible one, as it is the middle element of the widest set of parallel decision surfaces (i.e., its minimum distance to any training example is maximum). Small boxes indicate the support vectors.

- microaveraging: $\pi$ and $\rho$ are obtained by summing over all individual decisions:
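As a rough illustration of the averaging step, the sketch below computes microaveraged and macroaveraged precision and recall from per-category counts. The function names, the `(TP, FP, FN)` tuple layout, and the convention of returning 0.0 for an empty denominator are assumptions made for this example, not taken from the paper.

```python
from typing import List, Tuple

# One tuple per category: (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def _ratio(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def micro_average(counts: List[Counts]) -> Tuple[float, float]:
    """Sum the individual decisions over all categories, then take the ratios."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return _ratio(tp, tp + fp), _ratio(tp, tp + fn)


def macro_average(counts: List[Counts]) -> Tuple[float, float]:
    """Compute precision and recall per category, then average the results."""
    if not counts:
        return 0.0, 0.0
    n = len(counts)
    precision = sum(_ratio(tp, tp + fp) for tp, fp, _ in counts) / n
    recall = sum(_ratio(tp, tp + fn) for tp, _, fn in counts) / n
    return precision, recall
```

With the made-up counts `[(10, 2, 3), (1, 1, 4)]`, the microaveraged precision is 11/14 (about 0.79) while the macroaveraged precision is about 0.67, because the second, rarer category weighs as much as the first in the macroaverage.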

with the lowest value for $\chi^2(t_k, c_i)$ are thus the most independent from $c_i$; since we are interested in the terms which are not, we select the terms for which $\chi^2(t_k, c_i)$ is highest.

However, it should be noted that these results are just indicative, and that more general statements on the relative merits of these functions could be made only as a result of comparative experiments performed in thoroughly controlled conditions and on a variety of different situations (e.g., different classifiers, different initial corpora, ...).
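The selection rule described in the fragment above (keep the terms with the highest chi-square score for a category) can be sketched as follows. This uses the common 2x2 contingency-table form of the statistic, which is an assumption here since the excerpt does not show the paper's own formula; the function names and the example counts are hypothetical.

```python
from typing import Dict, List, Tuple

# (a, b, c, d) for one term and one category:
#   a = documents in the category that contain the term
#   b = documents outside the category that contain the term
#   c = documents in the category that lack the term
#   d = documents outside the category that lack the term
Table = Tuple[int, int, int, int]


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a 2x2 term/category contingency table."""
    n = a + b + c + d
    den = (a + c) * (b + d) * (a + b) * (c + d)
    if den == 0:
        return 0.0  # assumed convention for degenerate tables
    return n * (a * d - c * b) ** 2 / den


def select_terms(tables: Dict[str, Table], k: int) -> List[str]:
    """Return the k terms with the highest chi-square score."""
    ranked = sorted(tables, key=lambda t: chi_square(*tables[t]), reverse=True)
    return ranked[:k]
```

A term whose occurrence is independent of the category (for example counts of 25 in every cell) scores 0 and is ranked last, which matches the reasoning in the text.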

Table VI. Comparative Results Among Different Classifiers Obtained on Five Different Versions of Reuters. (Unless otherwise noted, entries indicate the microaveraged breakeven point; within parentheses, “M” indicates macroaveraging and “F1” indicates use of the F1 measure; boldface indicates the best performer on the collection)

the categories are the “postable terms” of the MESH thesaurus.

Inductive learning algorithms and representations for text categorization

by David Heckerman

1998, Proceedings of the seventh international conference on Information and knowledge management - CIKM '98

Text categorization -the assignment of natural language texts to one or more predefined categories based on their content -is an important component in many information organization and management tasks. We compare the effectiveness of... more

Probabilistic Models in Information Retrieval

by Norbert Fuhr

1992, The Computer Journal

In this paper, an introduction and survey over probabilistic information retrieval (IR) is given. First, the basic concepts of this approach are described: the probability-ranking principle shows that optimum retrieval quality can be... more

TECHDOC: Multilingual generation of online and offline instructional text

by Dietmar Rösner

This document representation is successively transformed into a sequence of sentence plans (together with formatting instructions in a selectable target format; SGML, LaTeX, Zmacs and - for screen output - formatted ASCII are currently... more

Complex Linguistic Features for Text Classification: A Comprehensive Study

by Alessandro Moschitti

2004, Lecture Notes in Computer Science

Previous researches on advanced representations for document retrieval have shown that statistical state-of-the-art models are not improved by a variety of different linguistic representations. Phrases, word senses and syntactic relations... more

On modeling of information retrieval concepts in vector spaces

by Vijay Raghavan

1987, ACM Transactions on Database Systems

Some(what) grand challenges for information retrieval

by Nicholas Belkin

2008, ACM SIGIR Forum

Although we see the positive results of information retrieval research embodied throughout the Internet, on our computer desktops, and in many other aspects of daily life, at the same time we notice that people still have a wide variety... more

Effective information retrieval using genetic algorithms based matching functions adaptation

by Praveen Pathak

2000, Proceedings of the 33rd Annual Hawaii International Conference on System Sciences

Dealing with multiple documents on the WWW: The role of metacognition in the formation of documents models

by Marc Stadtler

2007

Drawing on the theory of documents representation (Perfetti, Rouet, & Britt, 1999), we argue that successfully dealing with multiple documents on the WWW requires readers to form documents models, that is, to deal with contents and... more

User perspectives on relevance criteria: A comparison among relevant, partially relevant, and not-relevant judgments

by Diane H. Sonnenwald

2002, Journal of the American Society for Information Science and Technology

This study investigates the use of criteria to assess relevant, partially relevant, and not-relevant documents. Study participants identified passages within 20 document representations that they used to make relevance judgments; judged... more

FIG. 1. Data collection and analysis process.

FIG. 2. Use of criteria within relevance judgments.

FIG. 3. Positive and negative contributions of criteria used in evaluation of passages.

FIG. 4. Positive and negative values of criteria used in evaluation of documents.

TABLE 1. Examples of types of relevance.

and passed on to a search intermediary, the user’s information need and the written query may no longer be closely tied: “The issue is not what the requester meant to ask but what the request itself actually said” (pp. 391-392). Howard (1994) elaborates on this by stating that objective relevance “is taken to be that relationship which is system-based and usually measured by topicality. That is, the crucial relation is how well the topic of the information request is represented in the topics of the responses” (p. 172). Objective relevance, as defined by both Swanson and Howard, is the relationship between the stated request and the response to that request. This implies that all items containing one or more query terms could conceivably be objectively relevant although IR systems typically consider the number and frequency of query terms in items. However, the user’s perception of how those items relate to his or her information need is not considered when calculating objective relevance.

TABLE 2. Overlapping relevance categories identified in studies comparing relevance criteria literature. * Category assumed but not studied directly.

depending on whether the participant used the criterion as a positive or negative indication of relevance. The criteria identification was an iterative process. document relevance were compared to the criteria identified in Schamber (1991), Park (1992, 1993), Cool et al. (1993), Barry (1993, 1994), Wang and White (1994), and Spink, Greisdorf and Bateman (1998, 1999). The combined set of criteria identified in this literature did not appear to fully capture the information discussed by the participants in this study, although there was overlap. Therefore, as suggested by Stempel (1981), a new set of codes for the participants’ criteria was developed.

TABLE 4. Use of category and criteria in both passage and document evaluation.

TABLE 6. Distribution of positive and negative values of criteria in document judgments.

As with the passage evaluation, the relationship between positive and negative values of the criteria varied across categories and types of document relevance judgment (Fig. 4). In document relevance determination, abstract criteria were only mentioned positively while participant criteria were only mentioned negatively. This differed from passage evaluation where abstract criteria were mentioned more negatively in partially relevant judgments and more positively in relevant judgments, and participant criteria were mentioned more positively in both partially relevant and relevant judgments. This may indicate that the most noteworthy aspects of these criteria in document evaluation are the positive aspects of the abstract but the negative aspects of participant criteria.

related information for their jobs, making geographic proximity a very important criterion. The criteria identified in studies may also be influenced by the design of the study. For example, in this study, participants were promised full-text versions of articles they deemed relevant before they evaluated the document representation. Therefore, availability, unlike in other studies (e.g., Barry, 1994; Park, 1992; Schamber & Bateman, 1998), was not a criterion in this study.

TABLE 7. Use of passage criteria and criterion values in document relevance judgments.

TABLE 8. Synthesis of common concepts for relevance criteria in literature. * Category often assumed but not studied directly.

TABLE 9. Frequency of category usage across relevance judgments.

... participants indicated that they were unsatisfied with the informativeness of the abstract more frequently in partially relevant than relevant document representations, and that they made more guesses about the content of the full-text document with partially relevant document representations. The results from this study also support theories from Bookstein (1983) and Janes (1993) that partially relevant documents are selected based on the same criteria as relevant documents; they just do not meet as many criteria or do not satisfy the criteria to the same degree.

Document Understanding for a Broad Class of Documents

by Marco Aiello and

2002, International Journal on …

Query-sets

by Ricardo Baeza-Yates

2008, Proceeding of the 17th international conference on World Wide Web - WWW '08

In this paper we present a new document representation model based on implicit user feedback obtained from search engine queries. The main objective of this model is to achieve better results in non-supervised tasks, such as clustering... more

Clustering documents with active learning using wikipedia

by Eibe Frank and

2008, Proceedings - IEEE International Conference on Data Mining, ICDM

Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia... more

Classification of web documents using a graph model

by Mark Last

2003

In this paper we describe work relating to classification of web documents using a graph-based model instead of the traditional vector-based model for document representation. We compare the classification accuracy of the vector model... more

Logical structure recovery in scholarly articles with rich document features

by Thuy Nguyen

2011, Journal of Digital Library Systems. Forthcoming

Scholarly digital libraries increasingly provide analytics to information within documents themselves. This includes information about the logical document structure of use to downstream components, such as search, navigation, and... more

Controlled access and dissemination of XML documents

by Elisa Bertino

1999, Proceedings of the 2nd …

is becoming the most relevant standardization effort in the area of document representation through markup languages. Through XML, it is possible to define complex documents, containing information at different degrees of sensitivity.... more

Linear discriminant analysis in document classification

by Kari Torkkola

2001

Document representation using the bag-of-words approach may require bringing the dimensionality of the representation down in order to be able to make effective use of various statistical classification methods. Latent Semantic Indexing... more

Harnessing the expertise of 70,000 human editors: Knowledge-based feature generation for text categorization

by Shaul Markovitch

2007

Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when... more

A probabilistic description-oriented approach for categorizing web documents

by Norbert Fuhr

1999

The automatic categorisation of web documents is becoming crucial for organising the huge amount of information available in the Internet. We are facing a new challenge due to the fact that web documents have a rich structure and are... more

Query-Driven Document Partitioning and Collection Selection

by Fabrizio Silvestri

2006, Proceedings of the 1st …

Combining semantic and syntactic document classifiers to improve first story detection

by Joe Carthy

2001, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '01

In this paper we describe a type of data fusion involving the combination of evidence derived from multiple document representations. Our aim is to investigate if a composite representation can improve the online detection of novel events... more

Journal Publishing with Acrobat: the CAJUN Project

by David Brailsford

1993, Electronic Publishing - Origination, Dissemination and Design

The publication of material in 'electronic form' should ideally preserve, in a unified document representation, all of the richness of the printed document while maintaining enough of its underlying structure to enable searching and other... more

Computing with words for text processing: An approach to the text categorization

by Slawomir Zadrozny

2006, Information Sciences

The use of the computing with words paradigm for the automatic text documents categorization problem is discussed. This specific problem of information retrieval (IR) becomes more and more important, notably in view of a fast... more

Comparison of different thresholding strategies for the M.I.2 matching scheme

Comparison of matching schemes for T.II thresholding strategy

LiquidText: a flexible, multitouch environment to support active reading

by Craig Tashman

2011, … of the 2011 annual conference on Human …

Active reading, involving acts such as highlighting, writing notes, etc., is an important part of knowledge workers' activities. Most computer-based active reading support seeks to replicate the affordances of paper, but paper has... more

GVU Center, Georgia Institute of Technology, 85 5th St., Atlanta, GA 30308 USA

Figure 6. A) Holding document with thumb while putting finger on selection. B) Dragging selection as indicated by arrow to create excerpt.

Figure 7. Attaching two excerpts to form a group.

A comparison study on multiple binary-class SVM methods for unilabel text categorization

by Ch. Mani Kumar, P. Rajendra Babu, R Narendra Kumar, Y. Raghu Ram

2010, Pattern Recognition Letters

Multiclass support vector machine (SVM) methods are well studied in recent literature. Comparison studies on UCI/statlog multiclass datasets suggest using one-against-one method for multiclass SVM classification. However, in unilabel... more

Fig. 1. Accuracy of each category in WebKB (left) and 20 Newsgroups (right).

Fig. 2. Accuracy of OAA and OAO with varying training sizes.

Fig. 3. Accuracy of OAA and OAO binary SVM classifiers on 20 Newsgroups with WC representation (left) and MI representation (right).

Fig. 4. Accuracy of OAA and OAO with default parameter setting on varying training sizes.

Table 8. Analysis of AO results.

Table 6. Training time (in seconds) comparisons of OAA, OAO and AO.

with lesser number of documents when compared to binary SVMs of OAA, which are always trained with the complete set of documents. Table 6 shows some sample results of total training time (across 4-folds and all categories) taken by OAA, OAO and AO methods on all the three text corpuses with MI based document representation. The training times for DAGSVM/BTS/c-BTS were the same as that of OAO.

Table 7. Analysis of BTS/c-BTS results on 20 Newsgroups (WC).

used in AO, introduced more new errors than what it corrects (as shown in 20 Newsgroups with MI results). Thus the reason for AO not performing better than OAA is these new errors introduced by the binary classifier of OAO.

Info Navigator: A visualization tool for document searching and browsing

by Stefan Rueger

2003

We present a text document search engine with several new visualization front-ends that aid navigation through the set of documents returned by a query (short "returned documents"). Our methods are based on identifying and selecting... more

Indexing Shared Content in Information Retrieval Systems

by Ronny Lempel

2006, Lecture Notes in Computer Science

Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this... more

Document Representation and Dimension Reduction for Text Clustering

by Raymond Spiteri

2007, 2007 IEEE 23rd International Conference on Data Engineering Workshop

Increasingly large text datasets and the high dimensionality associated with natural language create a great challenge in text mining. In this research, a systematic study is conducted, in which three different document representation... more

Figure 1. Comparing dimension reduction techniques, ICA, LSI, and DF on the word representation of a typical dataset (RD-256).

The results show that for all datasets, clustering quality using ICA is better than using LSI in the whole range of dimensionalities investigated. For low dimensionalities, especially lower than 50, for all datasets, the DF based method has the worst performance among the dimension reduction methods used. The best performance of DF is comparable to the best performance of LSI and ICA, but at much higher dimensionalities.

Figure 2. Comparing dimension reduction techniques, ICA, LSI and DF, on the term representation of a typical dataset (RD-256).

Large Population or Many Generations for Genetic Algorithms? Implications in Information Retrieval

by Dana Vrajitoru

2000, Studies in Fuzziness and Soft Computing

Artificial intelligence models may be used to improve performance of information retrieval (IR) systems and the genetic algorithms (GAs) are an example of such a model. This paper presents an application of GAs as a relevance feedback... more

Combining evidence for Web retrieval using the inference network model: an experimental study

by M. Lalmas

2004, Information Processing & Management

In the Web context, link-based evidence is most commonly used in conjunction with content-based evidential information in order to improve retrieval effectiveness. This paper examines the impact the various types of link-based evidence and... more

Modeling context through domain ontologies

by Nathalie Hernandez

2007, Information Retrieval

Traditional information retrieval systems aim at satisfying most users for most of their searches, leaving aside the context in which the search takes place. We propose to model two main aspects of context: The themes of the user's... more

A new document representation using term frequency and vectorized graph connectionists with application to document retrieval

by MK Masukur Rahman

2009

This paper presents a new document representation with vectorized multiple features including term frequency and term-connection-frequency. A document is represented by undirected and directed graph, respectively. Then terms and... more

Individual versus corporate responsibility for smoking-related illness: Australian press coverage of the Rolah McCabe trial

by Kim McLeod

2003, Health Promotion International

This paper provides a thematic frame analysis of Australian newspaper reporting of the outcome and implications of the trial of Rolah McCabe versus British American Tobacco Australasia (BATA). In this trial, a Melbourne woman was awarded... more

A Model for the Representation and Focussed Retrieval of Structured Documents Based on Fuzzy Aggregation

by Gabriella Kazai

2001

Effective retrieval of structured documents should exploit the content and structural knowledge associated with the documents. This knowledge can be used to focus retrieval to the best entry points: document components that contain... more

First story detection using a composite document representation

by Joe Carthy

2000, Human Language Technology

In this paper, we explore the effects of data fusion on First Story Detection [1] in a broadcast news domain. The data fusion element of this experiment involves the combination of evidence derived from two distinct representations of... more

The hybrid representation model for web document classification

by Alex Markov

2008, International Journal of Intelligent Systems

Most web content categorization methods are based on the vector-space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based... more

On Text-based Mining with Active Learning and Background Knowledge Using SVM

by Catarina Silva

2007, Soft Computing

Text mining, intelligent text analysis, text data mining and knowledge-discovery in text are generally used aliases to the process of extracting relevant and non-trivial information from text. Some crucial issues arise when trying to... more

Document classifications based on word semantic hierarchies

by Ben Choi

2005, Proc. of International Conference on Artificial …

In this paper we proposed to automatically classify documents based on the meanings of words and the relationships between groups of meanings or concepts. Our proposed classification algorithm builds on the word structures provided by... more

Document Clustering Based on Maximal Frequent Sequences

by Edith Reyes

2006

Document clustering has the goal of discovering groups with similar documents. The success of the document clustering algorithms depends on the model used for representing these documents. Documents are commonly represented with the... more

The Case for Explicit Knowledge In Documents

by Gary Wills

2004, Proceedings of the 2004 …

Personalised Indexing and Retrieval of Heterogeneous Structured Documents

by Gabriella Pasi

2005, Information Retrieval

In this paper the problem of indexing heterogeneous structured documents and of retrieving semistructured documents is considered. We propose a flexible paradigm for both indexing such documents and formulating user queries specifying... more

Background knowledge for ontology construction

by Marko Grobelnik

2006

In this paper we describe a solution for incorporating background knowledge into the OntoGen system for semi-automatic ontology construction. This makes it easier for different users to construct different and more personalized ontologies... more

A multi-layered Bayesian network model for structured document retrieval

by Fabio Crestani

2004, … to Reasoning with …

A framework for generating adaptable hypermedia documents

by Lloyd Rutledge

1997

Being able to author a hypermedia document once for presentation under a wide variety of potential circumstances requires that it be stored in a manner that is adaptable to these circumstances. Since the nature of these circumstances is... more

A bayesian framework for xml information retrieval: Searching and learning with the inex collection

by Benjamin Piwowarski

2005, Information Retrieval

Most recent document standards like XML rely on structured representations. On the other hand, current information retrieval systems have been developed for flat document representations and cannot be easily extended to cope with more... more

Figure 1. A simple BN for modelling a document composed of 2 sections with respectively 1 and 2 paragraphs
where the summation is over all possible values d and sz (R and ¬R in this example) of BN
variables D and S,. The score of an element thus depends on its context as defined by the
dependence relations encoded in the BN. With this simple model, this context is reduced

Figure 2. Independence in the BN. Knowing the relevance of a journal (the double circled Journal variable in
the figure), the relevance of the journal collection has no influence on the articles relevance within this journal.
In this model, due to the conditional independence property of the BN variables, relevance
is a local property in the following sense: if we knew that the journal is (not) relevant, the
relevance value of the journal collection would not bring any new information on the
relevance of one article of this journal. This choice of model structure has mainly been
motivated by practical considerations. We also considered using models taking into account
relations between siblings, or models where the dependency between variables reflects a

Figure 3. A local view of the BN—two local baseline models (model 1 and 2) are used here.
and to doxel X and $m_i$ its realisation. As in classical IR this variable will take two values:
R (relevant) and ¬R (not relevant), i.e. $m_i \in \{R, \neg R\}$. The local relevance score of X
given the query q for the ith model is $P(M_i(X) = R \mid q)$.

Figure 4. Okapi runs with the entire INEX collection; measures are “highly specific” inex_eval precision recall
- only highly specific elements are taken into account in this measure (bottom) and ERR measure (top). The ERR
measure is a generalisation of recall (Piwowarski and Gallinari 2003). The first letter corresponds to the “doxel
collection” considered for the computation of the document frequency, it can take three values: D (“document
frequency”), E (“element frequency”) and T (“tag frequency”). The second one (length normalisation) can also
take three different values: C (“corpus”), T (“tag”) and P (“parent”).

Figure 5. Mapping the INEX scale to a probability distribution on states I, B and E. The table to the right gives
the exact distribution associated to each E.X;Spj; assessment. The graph to the left gives the same information in
a more intuitive way. Since P(X = E) + P(X = B) + P(X = I) = 1, we removed the P(X = I) axis from the
graph.

Figure 6. Train (top) and Test (bottom) learning curves with Okapi-D-T base model (BN-1). X-axis corresponds
to learning iterations and y-axis to ratios BN-measure/baseline Okapi measure. The 5 plotted measures are detailed
in the text.

Figure 8. Plots correspond to R Phs ratio between the BN models and Okapi-D measured for all 30 individual
queries after 9000 iterations of the learning algorithm. Numbers on the x-axis for the plots correspond to the query
identifiers used in INEX (integers between 90 and 125). The same information appears in a more synthetic way
with the boxplots computed over all ratio values.

Figure 9. Plots correspond to ERR @50 ratio between the BN model and Okapi-D computed for all 30 individual
queries after 9000 iterations of the learning algorithm (the ratios were bounded by 0.01 and 100) since for two
queries (query 98 and 100) the ratio was 0 for BN-2 and above 100 for query 121 and BN-1. Numbers on the
x-axis for the plots correspond to the query identifiers used in INEX (integers between 90 and 125). The same
information appears in a more synthetic way with the boxplots computed over all ratio values.

We give here the derivation of the updating formula (7) for learning the parameters of the BN.
Notations are the same as in Section 6. The parameter to be updated is $\theta$. In the following,
only BN variables associated to doxels are considered. Let us assume that for any node j the
ancestors of $X_j$ are the variables $X_l$ where $l \in anc(j)$. We will denote pa(k) the parent of
the node k and ANC(j) the set of ancestors including j, that is $ANC(j) = anc(j) \cup \{j\}$.
We will also use the abbreviation $v_k$ for $X_k = v_k$ within probabilities. The $P_\theta$ derivative
in (6) decomposes as:

Table 1. Conditional probabilities associated to node X in
figure 3. Each table entry gives the probability that X is I, B
or E given the values of its parent variables Y and $M_i(X)$ for
i = 1, 2. For example, the “*” in the table represents three
values: $P(X = I \mid Y = B, M_1(X) = R, M_2(X) = \neg R)$,
$P(X = B \mid Y = B, M_1(X) = R, M_2(X) = \neg R)$ and
$P(X = E \mid Y = B, M_1(X) = R, M_2(X) = \neg R)$.

$P(X = v_X \mid Y = v_Y, M_1(X) = m_1, \ldots, M_n(X) = m_n)$ is usually encoded into conditional
probability tables, one table for each doxel. Table 1 for example is the conditional
probability table for doxel X in figure 3.

Table 2. Element categories used in the experiments on INEX 2003.
In order to enforce these constraints on the BN conditional probabilities, the probability
estimates are defined as follows:

Scalable Multilingual Information Access

by James Mayfield

2003, Lecture Notes in Computer Science

The third Cross-Language Evaluation Forum workshop (CLEF-2002) provides the unprecedented opportunity to evaluate retrieval in eight different languages using a uniform set of topics and assessment methodology. This year the Johns Hopkins... more

A machine learning model for information retrieval with structured documents

by Benjamin Piwowarski

2003, Machine Learning and Data Mining in Pattern …

Most recent document standards rely on structured representations. On the other hand, current information retrieval systems have been developed for flat document representations and cannot be easily extended to cope with more complex... more

Theme Topic Mixture Model for Document Representation

by egrvbvfd fdbrfdd

In Automatic Text Processing tasks, documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document... more

Model-Based Classification of Web Documents Represented by Graphs

by Alex Markov

2006

Most web content classification methods are based on the vector-space model of information retrieval. One of the important advantages of this representation model is that it can be used by both instance-based and model-based classifiers... more
