As the volume of information on the internet grows at a staggering rate, new methods are needed for retrieving documents and ranking them according to their relevance to the user's query. Information Retrieval... more
The paper demonstrates how the Laboratory Research Framework fits into the holistic Cognitive Framework for IR. It first discusses the Laboratory Framework with emphasis on its underlying assumptions and known limitations. This is... more
Abstract: In this paper we compare the relevance of information obtained from "discriminative" media and from "non-discriminative" media. Discriminative media are the ones which accumulate and deliver information using... more
Canonical Information Retrieval systems perform a ranked keyword search strategy: Given a user's one-off information need (query), a list of documents, ordered by relevance, is returned. The main limitation of that “one fits all”... more
How does an information user perceive a document as relevant? The literature on relevance has identified numerous factors affecting such a judgment. Taking a cognitive approach, this study focuses on the criteria users employ in making... more
Using a fuzzy-logic-based calculus of linguistically quantified propositions, we present FQUERY III+, a new, more "human-friendly" and easier-to-use implementation of a querying scheme proposed originally by Kacprzyk and Ziolkowski to... more
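The abstract above names a calculus of linguistically quantified propositions; the fragment below is only a minimal, hedged sketch of how such a quantified condition can be evaluated in the spirit of Zadeh's calculus, not the actual FQUERY III+ implementation. The piecewise-linear membership function for "most" and the example satisfaction degrees are illustrative assumptions.

```python
# Illustrative sketch only: the truth of "most of the query conditions are
# satisfied" is the quantifier's membership value at the mean satisfaction degree.
def most(r):
    # Assumed piecewise-linear membership function for the relative quantifier "most".
    if r >= 0.8:
        return 1.0
    if r <= 0.3:
        return 0.0
    return (r - 0.3) / 0.5

def truth_most_satisfied(condition_degrees):
    # condition_degrees: per-condition satisfaction degrees in [0, 1].
    return most(sum(condition_degrees) / len(condition_degrees))

print(truth_most_satisfied([1.0, 0.9, 0.4, 0.7]))  # degree to which "most" conditions hold
```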
Traditionally, keyphrases (or keywords) have been manually assigned to documents by their authors or by human indexers. This, however, has become impractical due to the massive growth of documents—particularly short articles (e.g.... more
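The snippet stops before the method itself; as a placeholder, the sketch below shows the simplest automatic alternative to manual indexing, a frequency-based keyword picker. It is not the approach of the paper above, and the stopword list and sample sentence are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "that", "this"}

def extract_keywords(text, k=5):
    # Tokenize to lowercase alphabetic tokens, drop stopwords and very short tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    # Rank the remaining terms by raw frequency and keep the top k.
    return [word for word, _ in counts.most_common(k)]

print(extract_keywords("Automatic keyphrase extraction assigns keyphrases to documents "
                       "without manual indexing, which scales to large document collections."))
```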
In this paper we describe a type of data fusion involving the combination of evidence derived from multiple document representations. Our aim is to investigate if a composite representation can improve the online detection of novel events... more
In this paper, we explore the effects of data fusion on First Story Detection [1] in a broadcast news domain. The data fusion element of this experiment involves the combination of evidence derived from two distinct representations of... more
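Neither abstract spells out the fusion rule, so the following is only a hedged sketch of score-level fusion in the CombSUM style: novelty scores from two document representations are min-max normalised and summed. The document ids and scores are made up for illustration.

```python
def normalise(scores):
    # Min-max normalise a {doc_id: score} mapping into [0, 1].
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def comb_sum(scores_a, scores_b):
    # Sum the normalised evidence from the two representations.
    a, b = normalise(scores_a), normalise(scores_b)
    return {doc: a.get(doc, 0.0) + b.get(doc, 0.0) for doc in set(a) | set(b)}

# Example: novelty scores from a "terms" run and a "named entities" run.
fused = comb_sum({"d1": 0.9, "d2": 0.2, "d3": 0.5}, {"d1": 0.4, "d2": 0.8, "d3": 0.1})
print(max(fused, key=fused.get))  # document most likely to report a novel event
```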
Information is commonly reflected in news articles. However, texts are unstructured and therefore difficult to analyze automatically. To identify and capture the facts in a news story we propose a novel approach, which utilizes natural... more
In this paper, we address the problem of collection selection, which is important for locating responses in digital libraries. The aim of such information retrieval methods is to reduce the amount of exchanged... more
Many modern applications produce and process XML data, which is queried in both its structural and textual components. This is especially useful if we consider a casual user who looks for information in web-based database systems or... more
The author has granted a non-exclusive licence allowing the National Library of Canada to... more
Documents exist in different formats. When we have document images, in order to access some, and preferably all, of the information contained in those images, we have to deploy a document image analysis application. Document images can be... more
This paper proposes a method to improve existing approaches for classifying documents into categories based on supervised machine learning. It includes converting the unstructured text data into a numerical vector form for... more
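As a hedged illustration of the pipeline this abstract describes (text converted to numerical vectors, then a supervised classifier), the sketch below uses TF-IDF vectors and a naive Bayes model from scikit-learn; the categories and training sentences are placeholders, and the paper's own improvements are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data standing in for a labelled document collection.
train_texts = ["stock markets rallied today", "the team won the championship game"]
train_labels = ["business", "sport"]

# Unstructured text -> numerical vectors -> supervised classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["shares fell after the earnings report"]))
```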
In this article, we present an approach for measuring semantic similarity between heterogeneous texts of differing quality coming from different Web sources. Our approach begins by extracting the content of the texts... more
Risk Management is a practice composed of processes, methods and tools for managing risks in projects; this activity typically starts during the initial phase of a project and continues throughout the whole project life cycle.... more
Identifying topics and concepts associated with a set of documents is a critical task for information retrieval systems. One approach is to associate a query with a set of topics selected from a fixed ontology or vocabulary of terms. The... more
The widespread use of the XML format for document representation and message exchange has influenced data integration techniques in recent years. The development of various XML languages, methods and tools has given rise to so-called XML... more
Document classification presents difficult challenges due to the sparsity and the high dimensionality of text data, and to the complex semantics of natural language. The traditional document representation is a word-based vector (Bag... more
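For reference, the word-based (bag-of-words) vector the abstract criticises can be sketched in a few lines; the toy documents below are assumptions, and the length of the vocabulary list hints at the sparsity problem.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat"]
# Shared vocabulary over the whole collection; one dimension per distinct word.
vocab = sorted({w for d in docs for w in d.split()})

def to_vector(doc):
    # Each document becomes a vector of raw term counts over the vocabulary.
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for d in docs:
    print(to_vector(d))
```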
Our work is situated within a project on descriptive, conceptual and thematic annotation of textual corpora. In this article, we focus our attention on conceptual annotation, and more precisely on the... more
In this paper, we introduce TextRank, a graph-based ranking model for text processing, and show how this model can be successfully used in natural language applications. In particular, we propose two innovative unsupervised methods for... more
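A minimal sketch of a TextRank-style keyword ranker is given below: words become graph vertices, co-occurrence within a small window adds edges, and a PageRank-like iteration scores the vertices. It is deliberately simplified (no part-of-speech filtering, unweighted edges) compared with the model the paper proposes.

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=2, d=0.85, iters=30, k=5):
    words = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    # Add an undirected edge between words that co-occur within the window.
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    # PageRank-style iteration over the unweighted co-occurrence graph.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[u] / len(graph[u]) for u in graph[w])
                 for w in graph}
    return sorted(score, key=score.get, reverse=True)[:k]

print(textrank_keywords("graph based ranking models rank the vertices of a graph "
                        "recursively using global information from the graph"))
```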
In Automatic Text Processing tasks, documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document... more
New text analysis software emerging from research fields such as Machine Learning and Natural Language Processing is proving to be a relevant tool for the language sciences. Littératron is a new data-processing tool for the automatic... more
In this information age, a massive amount of unstructured data is made up of document collections. Examples include news articles, blog posts, scholarly publications, and reports generated by organizations as well as people. Many data... more
Most text classification systems use a bag-of-words representation of documents to find the classification target function. Linguistic structures such as morphology, syntax and semantics are completely neglected in the learning process. This... more
Most studies on authorship identification report a drop in identification accuracy when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across... more
In metaphorical conceptualization, the structure of one conceptual system is projected onto another. Our previous work suggests that idiomaticity in specialized languages is based in part on this phenomenon. However, since metaphorical... more
One of the most challenging aspects of developing information systems is the processing and management of large volumes of information. One way to overcome this problem is to implement efficient data indexing and classification systems.... more
Figure 13: Comparison of various loading techniques. The vertical line indicates the approximate size (1250 nodes) of cnn.com.
In this research, we enhanced the performance of Support Vector Machine (SVM) in text classification by applying semantic-knowledge enrichment. We propose a semantic-knowledge enrichment scheme to inject new concepts into the original... more
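The enrichment source and injection scheme are not given in this snippet, so the sketch below only illustrates the general idea: expand each document with related concepts from a toy, hand-made concept map before vectorising and training a linear SVM. The concept map, texts and labels are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for an external knowledge source mapping terms to broader concepts.
CONCEPTS = {"car": ["vehicle"], "football": ["sport"], "stock": ["finance"]}

def enrich(text):
    # Append the related concepts of every known term to the document.
    extra = [c for w in text.split() for c in CONCEPTS.get(w, [])]
    return text + " " + " ".join(extra)

texts = ["car prices and stock markets", "football season starts"]
labels = ["business", "sport"]

vec = TfidfVectorizer()
X = vec.fit_transform([enrich(t) for t in texts])
clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform([enrich("stock exchange news")])))
```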
Text document clustering plays an important role in providing better document retrieval, document browsing, and text mining. Traditionally, clustering techniques do not consider the semantic relationships between words, such as synonymy... more
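One simple, hedged way to picture the point about synonymy: normalise synonyms to a shared canonical term (a toy stand-in for a thesaurus such as WordNet) before vectorising and clustering, so that "car" and "automobile" documents can fall into the same cluster. This is an illustration, not the clustering method of the paper above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy synonym map; a real system would draw on a thesaurus or ontology.
SYNONYMS = {"automobile": "car", "soccer": "football"}

def normalise(text):
    return " ".join(SYNONYMS.get(w, w) for w in text.lower().split())

docs = ["the car engine failed", "an automobile repair shop",
        "football match tonight", "local soccer league results"]

X = TfidfVectorizer().fit_transform([normalise(d) for d in docs])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```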
MAD (Movie Authoring and Design) is a novel design and authoring system that facilitates the process of creating dynamic visual presentations. MAD aids this process by simultaneously allowing easy creation and modification of structured... more
This paper presents the first results of a study on the bargaining process of web standards in World Wide Web Consortium (W3C) arenas. This process is analysed through bargaining habits and through networks of actors who take part in it.... more
An overwhelming number of users now use e-books as their primary format. Gone are the days when buying a physical copy was the only option. Even though technology has advanced enormously in the display and visualization of texts... more
Feature selection (FS) is a widely used method for removing redundant or irrelevant features to improve classification accuracy and decrease the model's computational cost. In this paper, we present an improved method (referred to... more
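The improved method itself is not described in this snippet; the sketch below shows only the baseline filter-style FS step it builds on, scoring each term with chi-squared against the class labels and keeping the top-k features before training. The spam/ham toy data is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

texts = ["cheap pills buy now", "meeting agenda attached",
         "win money now", "quarterly report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(texts)
selector = SelectKBest(chi2, k=4)          # keep the 4 highest-scoring terms
X_reduced = selector.fit_transform(X, labels)
clf = MultinomialNB().fit(X_reduced, labels)
# Show which terms survived the selection.
print([vec.get_feature_names_out()[i] for i in selector.get_support(indices=True)])
```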
Answering mobile users' queries intelligently is one of the significant challenges in information retrieval (IR) for intelligent systems. Current popular Quranic retrieval applications rank documents by counting the occurrences of each... more
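The occurrence-counting ranking the abstract refers to can be sketched in a few lines; the verse texts below are placeholders, not actual Quranic text, and the paper's proposed improvement is not shown.

```python
def rank_by_term_count(query, documents):
    # Score each document by the total number of query-term occurrences.
    terms = query.lower().split()
    scores = {doc_id: sum(text.lower().split().count(t) for t in terms)
              for doc_id, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

docs = {"d1": "mercy and patience and mercy", "d2": "patience in hardship"}
print(rank_by_term_count("mercy patience", docs))  # ['d1', 'd2']
```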
Language resources are typically defined and created for application in speech technology contexts, but the documentation of languages which are unlikely ever to be provided with enabling technologies nevertheless plays an important role... more
This paper describes an algorithm for document representation in a reduced vector space by a process of feature extraction. The algorithm is applied and evaluated in the context of the supervised classification of news articles from... more
This paper describes an algorithm for document representation in a reduced vector space by a process of feature extraction. The algorithm is evaluated in the context of the supervised classification of news articles. We are generating... more
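The feature-extraction algorithm itself is not given in these snippets; as a hedged stand-in, the sketch below projects TF-IDF vectors into a small latent space with truncated SVD (latent semantic analysis), one common way to obtain a reduced vector space for supervised classification. The sample articles are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["central bank raises interest rates", "new smartphone released this week",
        "markets react to the rate decision", "review of the latest phone camera"]

X = TfidfVectorizer().fit_transform(docs)          # high-dimensional sparse vectors
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
print(X_reduced.shape)                             # (4, 2): each article in 2 dimensions
```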
Abstract: OntoGen is a semi-automatic and data-driven ontology editor focused on editing topic ontologies. It utilizes text mining tools to make ontology-related tasks simpler for the user. This focus on building ontologies from... more
Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract... more
Adobe's Acrobat software, released in June 1993, is based around a new Portable Document Format (PDF) which makes it possible to view and exchange electronic documents, independent of the originating software, across a... more
This paper draws a parallel between document preparation and the traditional processes of compilation and link editing for computer programs. A block-based document model is described which allows for separate compilation of various... more
The two complementary de facto standards for the publication of electronic documents are HTML on the World Wide Web and Adobe's Acrobat viewers using PDF (Portable Document Format). A brief overview is given of these two systems followed... more