Document Representation Using Global Association Distance Model
Lecture Notes in Computer Science
https://doi.org/10.1007/978-3-540-71496-5_52
8 pages
Abstract
Text information processing depends critically on the proper representation of documents. Traditional models, such as the vector space model, have significant limitations because they do not consider semantic relations among terms. In this paper we analyze a document representation using the association graph scheme and present a new approach called the Global Association Distance Model (GADM). Finally, we compare GADM, using a K-NN classifier, against the classical vector space model and the association graph model. In the vector space model, the similarity between two documents is typically the cosine measure:

sim(d_i, d_j) = cos(d_i, d_j) = (Σ_r w_ir · w_jr) / (||d_i|| · ||d_j||),

where d_i, d_j are the vectors of documents i and j, ||d_i||, ||d_j|| are the norms of those vectors, and w_ir, w_jr are the term weights in d_i and d_j, respectively. Other common measures are the Dice and Jaccard coefficients.
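The cosine, Dice, and Jaccard measures mentioned in the abstract can be sketched as follows for documents stored as sparse term-weight dictionaries. This is an illustrative sketch with hypothetical function names, not code from the paper:

```python
from math import sqrt

def cosine_similarity(wi, wj):
    """Cosine of the angle between two term-weight vectors,
    given as dicts mapping term -> weight."""
    dot = sum(w * wj.get(t, 0.0) for t, w in wi.items())
    ni = sqrt(sum(w * w for w in wi.values()))
    nj = sqrt(sum(w * w for w in wj.values()))
    return dot / (ni * nj) if ni and nj else 0.0

def dice_coefficient(wi, wj):
    """Weighted Dice coefficient: 2·(di·dj) / (||di||² + ||dj||²)."""
    dot = sum(w * wj.get(t, 0.0) for t, w in wi.items())
    den = sum(w * w for w in wi.values()) + sum(w * w for w in wj.values())
    return 2.0 * dot / den if den else 0.0

def jaccard_coefficient(wi, wj):
    """Weighted Jaccard coefficient: (di·dj) / (||di||² + ||dj||² - di·dj)."""
    dot = sum(w * wj.get(t, 0.0) for t, w in wi.items())
    den = sum(w * w for w in wi.values()) + sum(w * w for w in wj.values()) - dot
    return dot / den if den else 0.0
```

All three measures reduce to 1.0 for identical documents and 0.0 for documents with no shared terms.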
Related papers
Knowledge and Information Systems, 2012
The rapid proliferation of the World Wide Web has increased the importance and prevalence of text as a medium for dissemination of information. A variety of text mining and management algorithms have been developed in recent years, such as clustering, classification, indexing, and similarity search. Almost all these applications use the well-known vector-space model for text representation and analysis. While the vector-space model has proven itself to be an effective and efficient representation for mining purposes, it does not preserve information about the ordering of the words in the representation. In this paper, we will introduce the concept of distance graph representations of text data. Such representations preserve information about the relative ordering and distance between the words in the graphs and provide a much richer representation in terms of sentence structure of the underlying data. Recent advances in graph mining and in the hardware capabilities of modern computers enable us to process more complex representations of text. We will see that such an approach has clear advantages from a qualitative perspective. This approach enables knowledge discovery from text which is not possible with the use of a pure vector-space representation, because it loses much less information about the ordering of the underlying words. Furthermore, this representation does not require the development of new mining and management techniques, because the technique can also be converted into a structural version of the vector-space representation, which allows the use of all existing tools for text. In addition, existing techniques for graph and XML data can be directly leveraged with this new representation. Thus, a much wider spectrum of algorithms is available for processing this representation. We will apply this technique to a variety of mining and management applications and show its advantages and richness in exploring the structure of the underlying text documents.
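The order-preserving idea behind distance graphs can be sketched as below: nodes are distinct words and a directed edge (u, v) with weight w means v occurred within k positions after u exactly w times. `distance_graph` is an illustrative helper under that assumption, not code from the paper:

```python
from collections import defaultdict

def distance_graph(words, k=2):
    """Directed distance graph of order k over a tokenized document.

    Returns a dict mapping (u, v) -> count of times v appeared
    within k positions after u, preserving relative word order."""
    edges = defaultdict(int)
    for i, u in enumerate(words):
        for v in words[i + 1 : i + 1 + k]:
            edges[(u, v)] += 1
    return dict(edges)
```

Unlike a bag-of-words vector, two documents with the same word counts but different word order produce different edge sets.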
International Journal of Engineering and Technology, 2017
Text representation is the essential step in text mining tasks. To represent textual information more expressively, a Text-Mining-based Semantic Graph approach is proposed, in which semantic and ordering information among terms, as well as the structural information of the text, is incorporated. Such a model is constructed by extracting representative terms from texts together with their mutual semantic relationships. The proposed work is implemented in the Java and Python environments. WordNet supplies the relationships among word nodes, and the Gephi tool is used to construct the semantic graph more effectively. The comparative performance against traditional approaches is also reported, with memory consumption and time consumption taken as the benchmark parameters. The experimental results demonstrate the better performance of the proposed text representation model in terms of its time and space complexity. Keywords: SVM, semantic graphs, POS tagging, WordNet, text representation, text mining, graph model, semantic networks. I. INTRODUCTION. Advances in digital technology and the World Wide Web have led to an increase in digital documents used for various purposes such as publishing and digital libraries. This phenomenon raises awareness of the need for effective techniques that can help in the search and retrieval of text. Nowadays, using digital and computational techniques, we can store, manage, and retrieve information automatically without any printed or hard copy of a document. In addition, automated text analysis, or text mining, plays an important role in various applications such as medical science, library management, and social media. Typical tasks in these areas include text classification, information extraction, document summarization, and text pattern mining [1].
Nowadays, text is the most common form of storing information, and document representation is an important step in the text mining process. The challenge, then, is an appropriate representation of textual information that is capable of capturing the semantic content of the text [2]. In this work, we develop a graph-based document model which leverages valuable knowledge about relations between entities; the work is intended to deliver a mechanism for constructing a semantic graph of text documents. A. Semantic Graph. The data structure we focus on is the semantic graph. Semantic graphs are appropriate for representing semantic information because they carry it on both their nodes and edges. A semantic graph links different objects: nodes represent objects (e.g., persons, papers, organizations) and links (or edges) represent binary relationships between those objects (e.g., friendship, citation, authorship). A semantic graph is a powerful representation structure that can encode semantic relationships between different types of objects; the edge information tells us how two object nodes are connected and what the connection means. These graphs encode relationships as typed links between pairs of typed nodes. Such semantically structured graphs are also called relational data graphs or attributed relational graphs. Indeed, semantic graphs are very similar to the semantic networks and multi-relational networks (MRNs) used in artificial intelligence and knowledge representation [3].
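The "typed nodes, typed links" structure described above can be sketched with a minimal container; the class and method names here are illustrative, not an API from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticGraph:
    """Semantic graph: nodes carry a type, edges carry a relation label."""
    nodes: dict = field(default_factory=dict)   # node id -> node type
    edges: list = field(default_factory=list)   # (src id, relation, dst id)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, src, relation, dst):
        # Both endpoints must already be typed nodes.
        assert src in self.nodes and dst in self.nodes
        self.edges.append((src, relation, dst))

# Example: a person connected to a paper by an authorship relation.
g = SemanticGraph()
g.add_node("alice", "person")
g.add_node("paper1", "paper")
g.add_edge("alice", "authorship", "paper1")
```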
2003
In this paper we describe work relating to classification of web documents using a graph-based model instead of the traditional vector-based model for document representation. We compare the classification accuracy of the vector model approach using the k-Nearest Neighbor (k-NN) algorithm to a novel approach which allows the use of graphs for document representation in the k-NN algorithm. The proposed method is evaluated on three different web document collections using the leave-one-out approach for measuring classification accuracy. The results show that the graph-based k-NN approach can outperform traditional vector-based k-NN methods in terms of both accuracy and execution time.
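A common way to plug graphs into k-NN, in this line of work, is a distance based on the maximum common subgraph (MCS), which is cheap to compute when nodes are uniquely labeled. The following is an illustrative sketch under that assumption, not the paper's implementation:

```python
from collections import Counter

def graph_distance(g1, g2):
    """MCS-based distance for graphs with uniquely labeled nodes.
    Each graph is a pair (set_of_node_labels, set_of_directed_edges)."""
    n1, e1 = g1
    n2, e2 = g2
    mcs_size = len(n1 & n2) + len(e1 & e2)          # shared nodes + shared edges
    max_size = max(len(n1) + len(e1), len(n2) + len(e2))
    return 1.0 - mcs_size / max_size if max_size else 0.0

def knn_classify(query, labeled_graphs, k=3):
    """Majority vote among the k training graphs nearest to `query`."""
    nearest = sorted(labeled_graphs,
                     key=lambda pair: graph_distance(query, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

The distance is 0 for identical graphs and grows toward 1 as the graphs share fewer nodes and edges.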
Proceedings. 13th International Workshop on Database and Expert Systems Applications, 2002
This paper addresses similarity models and term association for similarity-based document categorization. Both Euclidean-distance-based and cosine-based similarity models are widely used as measures of document similarity in the information retrieval and document categorization communities. These similarity models rest on the assumption that term vectors are orthogonal, so term associations are ignored; in fact, this assumption does not hold. In the context of document categorization, we analyze the properties of the term-document space, the term-category space, and the category-document space. Then, without the assumption of term independence, we propose a new mathematical model to estimate the association between terms. Unlike other models of term relationships, we make the best possible use of the existing category membership represented by the corpus, with the objective of improving categorization performance. By introducing the ε-association between terms, we take term associations into account when calculating document similarity and define an ε-similarity model of documents. Experiments have been done with a k-NN classifier over the Reuters-21578 corpus. The empirical results show that exploiting term association can improve the effectiveness of the categorization system, and that the ε-similarity model outperforms models that do not consider term association.
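The general idea of folding term associations into document similarity can be illustrated with a generalized cosine under an association matrix A, where A[r][s] scores how related terms r and s are and the ordinary cosine is recovered when A is the identity. This is a hedged sketch of the general technique, not the paper's exact ε-similarity definition:

```python
import numpy as np

def assoc_similarity(di, dj, A):
    """Generalized cosine between dense term-weight vectors di, dj
    under a symmetric term-association matrix A (diagonal = 1)."""
    num = di @ A @ dj
    den = np.sqrt(di @ A @ di) * np.sqrt(dj @ A @ dj)
    return float(num / den) if den else 0.0
```

With the identity matrix, documents on disjoint terms score 0; a nonzero off-diagonal association lets related-but-distinct terms contribute to the similarity.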
2009
This paper presents a new document representation with vectorized multiple features including term frequency and term-connection-frequency. A document is represented by undirected and directed graph, respectively. Then terms and vectorized graph connectionists are extracted from the graphs by employing several feature extraction methods. This hybrid document feature representation more accurately reflects the underlying semantics that are difficult to achieve from the currently used term histograms, and it facilitates the matching of complex graph. In application level, we develop a document retrieval system based on self-organizing map (SOM) to speed up the retrieval process. We perform extensive experimental verification, and the results suggest that the proposed method is computationally efficient and accurate for document retrieval.
A method is proposed for creating vector space representations of documents based on modeling target inter-document similarity values. The target similarity values are assumed to capture semantic relationships, or associations, between the documents. The vector representations are chosen so that the inner-product similarities between document vector pairs closely match their target inter-document similarities. The method is closely related to the Latent Semantic Indexing approach; in fact, they are equivalent when the target similarities are derived directly from document similarities based on term co-occurrence. However, our method allows for external sources of inter-document similarity to be incorporated.
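Fitting vectors whose inner products match a given symmetric similarity matrix can be sketched with a truncated eigendecomposition (negative eigenvalues clipped to zero). This is an illustrative sketch of the general idea, not the paper's algorithm:

```python
import numpy as np

def vectors_from_similarities(S, dim):
    """Return an (n, dim) matrix V such that V @ V.T approximates the
    symmetric target similarity matrix S, using the top `dim` eigenpairs."""
    vals, vecs = np.linalg.eigh(S)               # ascending eigenvalues
    order = np.argsort(vals)[::-1][:dim]         # keep the largest `dim`
    vals = np.clip(vals[order], 0.0, None)       # drop negative spectrum
    return vecs[:, order] * np.sqrt(vals)
```

When S itself has rank at most `dim` and is positive semidefinite, the reconstruction V @ V.T is exact; otherwise it is the best rank-`dim` approximation over the nonnegative spectrum.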
2014 International Conference on Big Data and Smart Computing (BIGCOMP), 2014
Classification of text documents is one of the most common themes in the field of machine learning. Although a text document expresses a wide range of information, it lacks the imposed structure of a traditional database; unstructured data, particularly free-running text, therefore has to be transformed into structured data. Hence, in this paper we represent text documents unconventionally by making use of symbolic data analysis concepts. We propose a new method of representing documents based on clustering of term frequency vectors. The term frequency vectors of each cluster are used to form a symbolic representation by means of the mean and standard deviation, yielding interval-valued features. To cluster the term frequency vectors, we make use of the Single Linkage, Complete Linkage, Average Linkage, K-Means, and Fuzzy C-Means clustering algorithms. To corroborate the efficacy of the proposed model, we conducted extensive experiments on standard datasets such as 20 Newsgroup Large, 20 Mini Newsgroup, and Vehicles Wikipedia, as well as our own datasets, Google Newsgroup and Research Article Abstracts. Experimental results reveal that the proposed model gives better results than the state-of-the-art techniques. In addition, as the method is based on a simple matching scheme, it requires negligible time.
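The mean/standard-deviation interval representation and the simple matching scheme described above can be sketched as follows; the interval [mean − std, mean + std] per term is one plausible reading of the construction, stated here as an assumption:

```python
import numpy as np

def interval_features(cluster_vectors):
    """Symbolic representation of a cluster of term-frequency vectors:
    per term, the interval [mean - std, mean + std]."""
    X = np.asarray(cluster_vectors, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    return np.stack([mu - sd, mu + sd], axis=1)   # shape (n_terms, 2)

def match_score(query, intervals):
    """Simple matching: number of query dimensions that fall inside
    the corresponding interval."""
    return sum(lo <= q <= hi for q, (lo, hi) in zip(query, intervals))
```

A query document is assigned to the cluster whose intervals it matches in the most dimensions, which needs only comparisons rather than distance computations.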
2018
Text document representation is one of the main issue in the text analysis areas such as topic extraction and text similarities. Standard Bag-of-Word representation does not deal with relationships between words. In order to overcome this limitation, we introduce a new approach based on the joint use of co-occurrence graph and semantic network of English language called Wordnet. To do this, a word sense disambiguation algorithm has been used in order to establish semantic links between terms given the surrounding context. Experimentations on standard datasets show good performances of the proposed approach. MOTS-CLÉS : Représentation des textes, WordNet, graphe, désambiguïsation des mots, sémantique.
2012
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. To address it, patterns of co-occurring words or characters are typically extracted from the textual content of Web documents. However, not all documents are of the same quality; for example, the curated content of news articles usually entails lower levels of noise than the user-generated content of blog posts and other social media. In this paper, we provide some insight and a preliminary study on a tripartite categorization of Web documents, based on inherent document characteristics. We claim and support that each category calls for different classification settings with respect to the representation model. We verify this claim experimentally, by showing that topic classification on these different document types yields very different results per type. In addition, we consider a novel approach that improves the performance of topic classification across all types of Web documents: the n-gram graphs. This model goes beyond the established bag-of-words one, representing each document as a graph. Individual graphs can be combined into a class graph, and graph similarities are then employed to position and classify documents in the vector space. Accuracy is increased due to the contextual information that is encapsulated in the edges of the n-gram graphs; efficiency, on the other hand, is boosted by reducing the feature space to a limited set of dimensions that depend on the number of classes, rather than the size of the vocabulary. Our experimental study over three large-scale, real-world data sets validates the higher performance of n-gram graphs in all three domains of Web documents.
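Building a character n-gram graph can be sketched as below: nodes are n-grams, and an edge weight counts how often two n-grams co-occur within a small window. The `window` parameter and the helper name are illustrative assumptions, not the paper's exact formulation:

```python
from collections import defaultdict

def ngram_graph(text, n=3, window=3):
    """Character n-gram graph: returns a dict mapping (g, h) -> number
    of times n-gram h starts within `window` positions after n-gram g."""
    grams = [text[i : i + n] for i in range(len(text) - n + 1)]
    edges = defaultdict(int)
    for i, g in enumerate(grams):
        for h in grams[i + 1 : i + 1 + window]:
            edges[(g, h)] += 1
    return dict(edges)
```

Because edges capture local context, two texts with similar character sequences share many weighted edges even when tokenization or minor spelling differences would break word-level features.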
2009
ABSTRACT Automatic classification and clustering are two of the most common operations performed on text documents. Numerous algorithms have been proposed for this and invariably, all of these algorithms use some variation of the vector space model to represent the documents. Traditionally, the Bag of Words (BoW) representation is used to model the documents in a vector space. The BoW scheme is a simple and popular scheme, but it suffers from numerous drawbacks.

References (9)
- Salton, G.: The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, New Jersey (1971).
- Berry, M.: Survey of Text Mining: Clustering, Classification and Retrieval. Springer (2004).
- Feldman, R., Dagan, I.: Knowledge Discovery in Textual Databases (KDT). In: Proc. of the First International Conference on Knowledge Discovery and Data Mining, KDD'95, Montreal (1995) 112-117.
- Kou, H., Gardarin, G.: Similarity Model and Term Association for Document Categorization. In: NLDB 2002, Lecture Notes in Computer Science, Vol. 2553. Springer-Verlag, Berlin Heidelberg New York (2002) 223-229.
- Becker, J., Kuropka, D.: Topic-based Vector Space Model. In: Proc. of Business Information Systems (BIS) 2003 (2003).
- Wong, S.K.M., Ziarko, W., Wong, P.C.N.: Generalized Vector Space Model in Information Retrieval. In: Proc. of the 8th Int. ACM SIGIR Conference on Research and Development in Information Retrieval, New York, ACM (1985).
- Medina-Pagola, J.E., Guevara-Martínez, E., Hernández-Palancar, J., Hechavarría-Díaz, A., Hernández-León, R.: Similarity Measures in Documents Using Association Graphs. In: Proc. of CIARP 2005, Lecture Notes in Computer Science, Vol. 3773 (2005) 741-751.
- Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees. In: International Conference on New Methods in Language Processing, Manchester, UK (1994).
- Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, Vol. 1, No. 1/2 (1999) 67-88.