Academia.edu

Document Classification

1,100 papers
291 followers
About this topic
Document classification is the process of categorizing documents into predefined classes or categories based on their content and features. This involves the application of algorithms and techniques from machine learning and natural language processing to automate the organization and retrieval of information in various formats.
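As a concrete illustration of that process, here is a minimal sketch, assuming Python with scikit-learn; the category labels and toy texts are invented for illustration and are not drawn from any of the papers listed below:

```python
# Minimal document-classification sketch: TF-IDF features + Naive Bayes.
# The toy texts and labels below are invented for illustration only.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "The court ruled on the contract dispute",
    "The patient was prescribed a new medication",
    "The striker scored twice in the final match",
]
train_labels = ["legal", "medical", "sports"]

# Vectorize documents into TF-IDF features, then fit a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["The judge dismissed the appeal"]))  # e.g. ['legal']
```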

Key research themes

1. How do machine learning techniques enhance document classification accuracy and efficiency?

This research theme focuses on applying and comparing various supervised machine learning algorithms to improve the precision and scalability of document classification across diverse textual datasets, including language-specific corpora and domain-specific documents. It matters because automated categorization enables effective handling of the ever-growing volume of digital texts, reduces manual labor, and supports scalable information retrieval.

Key finding: The application of four supervised learning algorithms (Decision Tree C4.5, K-Nearest Neighbour, Naïve Bayes, Support Vector Machine) on Bangla web documents demonstrated that SVM attained superior performance in categorizing... Read more
Key finding: A comparative survey of text classification algorithms including decision trees, SVM, and Naïve Bayes shows that each approach has unique advantages and that SVM often leads in accuracy and robustness; the study highlights... Read more
Key finding: This comparative study finds that supervised classifiers like k-NN, SVM, and ensemble methods benefit significantly from advanced word embedding representations such as GloVe compared to traditional TF-IDF vectors; this leads... Read more
Key finding: The paper surveys the dominant machine learning paradigm for text categorization, highlighting the superiority of inductive learning algorithms over knowledge engineering (manual rule-based approaches) due to their... Read more
Key finding: Using the Vector Space Model for document representation combined with well-known algorithms, the study validates automatic classification approaches on journalistic documents, showing the feasibility and efficiency of... Read more
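The kind of comparison these findings summarize can be sketched with a small benchmark harness. The example below is a hypothetical setup using scikit-learn and the public 20 Newsgroups corpus (not the Bangla, journalistic, or other corpora from the cited studies); it cross-validates several of the classifiers mentioned above on TF-IDF features:

```python
# Hedged sketch: comparing supervised text classifiers on TF-IDF features.
# Uses the public 20 Newsgroups corpus as a stand-in for the corpora in the
# studies above; scores are illustrative, not a reproduction of their results.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(max_features=20000, stop_words="english").fit_transform(data.data)
y = data.target

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Linear SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(max_depth=50),
}

# 3-fold cross-validation for each classifier on the same feature matrix.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
    print(f"{name:>13}: mean accuracy {scores.mean():.3f}")
```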

2. What are effective document representation and feature engineering techniques improving classification performance?

This theme explores innovative document representation methods and feature extraction techniques that address challenges such as high dimensionality, sparsity, semantic loss, and short-text data limitations. Advancements in representation and feature engineering matter because they directly impact classifier accuracy, computational efficiency, and the ability to capture context and semantics within documents.

Key finding: By clustering term frequency vectors and creating symbolic interval-valued features capturing statistical measures (mean and standard deviation), the study proposes an unconventional document representation that mitigates... Read more
Key finding: The proposed classification approach preserves term sequence information using a novel 'Status Matrix' data structure combined with B-tree indexing of terms associated with class labels, overcoming limitations of bag-of-words... Read more
Key finding: Introduces a supervised feature extraction method that aggregates original document features into low-dimensional abstract features representing class-evidence strengths, achieving consistent improvements across seven... Read more
Key finding: The TextNetTopics Pro framework leverages lexical features grouped into topics alongside document-topic distributions from multiple short-text topic models, mitigating data sparsity and limited context typical in short text... Read more
Key finding: This work demonstrates that representing clinical reports through topic distributions generated by Latent Dirichlet Allocation (LDA) and related probabilistic models yields more compact and interpretable features than... Read more
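One of the representation ideas above, using topic distributions as compact document features, can be sketched as follows. This is an illustrative scikit-learn pipeline on a public corpus, not the specific models or clinical data from the cited papers:

```python
# Hedged sketch: topic-distribution features for document classification.
# LDA compresses sparse term counts into a dense, low-dimensional
# document-topic matrix, which then serves as the classifier input.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"],
                          remove=("headers", "footers", "quotes"))

pipeline = make_pipeline(
    CountVectorizer(max_features=5000, stop_words="english"),
    LatentDirichletAllocation(n_components=20, random_state=0),  # 20 topic features
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipeline, data.data, data.target, cv=3)
print(f"mean accuracy with 20 topic features: {scores.mean():.3f}")
```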

3. How can document classification be enhanced in domain-specific or metadata-constrained scenarios?

This theme investigates approaches addressing domain-specific challenges such as legal, medical, or scientific documents, and situations where full content is inaccessible, focusing on leveraging metadata, multimodal features, or domain taxonomies. Exploring these methods is essential for enabling classification in real-world applications where standard text content may be limited, noisy, or structured differently.

Key finding: Demonstrates that multi-label classification of scientific articles using only metadata (title, keywords) can approach the performance of content-based methods, providing an effective alternative when full texts are... Read more
Key finding: Utilizing BERT-based neural networks for fully automated classification of artificial intelligence-related scientific documents, the model achieved 96.5% accuracy, uncovering substantial subject overlaps in other domains and... Read more
Key finding: Proposes a pipeline coupling template-based document classification with mixed integer programming for key information extraction in noisy scanned forms, achieving high f1 scores (0.97 for classification, 0.94 for KIE),... Read more
Key finding: Develops a machine learning-based legal document classification system using Multinomial Naive Bayes trained on document keywords and also summarizing documents via TextRank algorithm, effectively addressing challenges... Read more
Key finding: Applies k-Nearest Neighbors using Cosine-Binary similarity on Indonesian-language student thesis documents classified under the BATAN nuclear competence taxonomy, achieving 97% accuracy, and showcasing the utility of text... Read more
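As a rough illustration of the metadata-only and taxonomy-driven setups described above, the sketch below classifies toy metadata records (title plus keywords, all invented; not the BATAN taxonomy or the datasets from the cited work) with a cosine-similarity k-NN:

```python
# Hedged sketch: metadata-only classification with cosine-similarity k-NN.
# Each record is represented only by its title and keywords, standing in for
# scenarios where the full text is unavailable.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy metadata records (title + keywords) with invented taxonomy labels.
records = [
    "Deep learning for radiology reports; keywords: CNN, diagnosis, imaging",
    "Contract law and arbitration clauses; keywords: litigation, statute",
    "Reactor shielding simulation; keywords: neutron flux, Monte Carlo",
    "Transformer models for clinical notes; keywords: BERT, EHR, NLP",
]
labels = ["medical", "legal", "nuclear", "medical"]

# TF-IDF vectors are L2-normalized by default, so nearest neighbours under
# cosine distance match the intuition of cosine similarity on term vectors.
model = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=1, metric="cosine"),
)
model.fit(records, labels)

print(model.predict(["Dose mapping around a research reactor; keywords: gamma"]))
```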

All papers in Document Classification

This research deals with the use of self-organising maps for the classification of text documents. The aim was to classify documents into separate classes according to their topics. We therefore constructed self-organising maps that were... more
We propose a novel bio-inspired solution for biomedical article classification. Our method draws from an existing model of T-cell cross-regulation in the vertebrate immune system (IS), which is a complex adaptive system of millions of... more
We present and study an agent-based model of T-Cell cross-regulation in the adaptive immune system, which we apply to binary classification. Our method expands an existing analytical model of T-cell cross-regulation that was used to study... more
Document (text) classification is a common method in e-business, facilitating users in tasks such as document collection, analysis, categorization and storage. Semantic analysis can help to improve the performance of document... more
Abstract. This paper presents a deployed semantic web application in the cultural domain: the semantic portal MuseumFinland. It is a demonstration of a community portal and a publication channel by which heterogeneous collection database... more
Defining the architecture of a software system is an essential step in its development. It is the first stage at which design decisions are made. It is, in particular, where decisions are taken concerning the... more
Automatic classification of textual content in an under-resourced language is challenging, since lexical resources and preprocessing tools are not available for such languages. Their bag-of-words (BoW) representation is usually highly... more
This paper presents a new reduction algorithm which employs Constraint Satisfaction Techniques for removing redundant literals of a clause efficiently. Inductive Logic Programming (ILP) learning algorithms using a generate and test... more
In this paper we describe the two classification approaches (i.e. categorization and clustering) and their preceding steps. For each approach we give a brief description of the underlying theory and outline the advantages and... more
This paper introduces a novel paradigm to impute missing data that combines a decision tree with an auto-associative neural network (AANN) based model and a principal component analysis-neural network (PCA-NN) based model. For each model,... more
Text document classification is an important research topic in the field of information retrieval, and so is the way we represent the information extracted from the documents to be classified. There exist document classification methods... more
The increasing procedural demand in judicial institutions has caused a workload overload, impacting the efficiency of the legal system. This scenario, exacerbated by limited human resources, highlights the need for technological solutions... more
Effective decision making is based on accurate and timely information. However, human decision makers are often overwhelmed by the huge amount of electronic data these days. The main contribution of this paper is the development of... more
As businesses move toward smarter automation and AI-driven processes, understanding structured documents like invoices, forms, receipts, and reports has become essential. This is where LayoutLM plays a groundbreaking role. Developed by... more
by Ranjna Garg and 1 more
Text-based mining is the process of analyzing a document or set of documents to understand the content and meaning of the information they contain. Text mining enhances humans' ability to process massive quantities of information and it... more
This paper describes the design and collection of NameDat, a database containing English proper names spoken by native Norwegians. The database was designed to cover the typical acoustic and phonetic variations that appear when Norwegians... more
Recent analyses have identified a significant rise in transactional fraud, where bad actors seek to deceive individuals or firms into unauthorized financial actions. Traditional fraud detection systems frequently struggle to effectively... more
Proposal of an image-referencing method to assist the architectural design process.
The quick spurt in online transactions has also brought with it a parallel surge in fraud. With digital payments, the scope for fraud remains very high. Credit card fraud, among others, can cause heavy losses to customers and erode the... more
This paper describes a methodology for semi-automatic classification schema definition (a classification schema is a taxonomy of categories useful for automatic document classification). The methodology is based on: (i) an extensional... more
This paper describes our approach towards the ECML/PKDD Discovery Challenge 2010. The challenge consists of three tasks: (1) a Web genre and facet classification task for English hosts, (2) an English quality task, and (3) a multilingual... more
SchoolNet provides global Internet access to secondary schools (grades 7-12) throughout Thailand. By using technology to improve our education system, this project supports the human-resource-development emphasised in the 8th... more
Various platforms, including patent systems and repositories like GitHub and arXiv, support knowledge dissemination across domains. As knowledge increasingly spans multiple disciplines, there is a need to track innovations that intersect... more
A huge number of documents is accumulating rapidly; therefore, organizing them in digitized form makes text categorization a challenging issue. A major issue for text categorization is its large number of features. Most of the features are... more
Some of the key questions in paleography are those of classification, namely trying to ascertain when and where a given manuscript was written, and — if possible — by whom. Paleographers bring many skills and tools to bear on these... more
Documents are one of the most common methods for maintaining data and records. Every day a lot of documents/files are generated with a lot of data for future research purposes or for business analytics. These files/documents should be... more
At the preceding GREC, we have proposed to use a "bag of symbols" formalism (similar to the bag of words approach) for the indexing of a graphical document image database. In this paper, we extend the proposed approach through the... more
by Av Av
An unprecedented rise in Unified Payment Interface (UPI) transactions in rural areas makes them prone to fraud. In this paper, the authors present a new method for detecting UPI fraud based on the name of the bank account, transaction... more
Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism-detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major... more
Credit card fraud detection remains a significant challenge for financial institutions and consumers globally, prompting the adoption of advanced data analytics and machine learning techniques. In this study, we investigate the... more
The detection of outliers in text documents is a highly challenging task, primarily due to the unstructured nature of documents and the curse of dimensionality. Text document outliers refer to text data that deviates from the text found... more
We propose a new approach denoted BNDI (Bayesian Network for Document Indexing) for indexing biomedical documents with controlled biomedical vocabulary based on a Bayesian Network. BNDI uses the probability inference to extract... more
In this paper, we present an application using the SUMMA-LSA platform developed by Baier, Lehnard, Hoffmann & Schneider (this volume). SUMMA-LSA was used to evaluate the biology knowledge of 7th and 8th grade students dealing with The human... more
State-of-the-art review on opinion mining from online customers' feedback. In Aruka, Y & Namatame, A (Eds.
In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF [1]). Although simple and intuitive, sBoW style representations... more
We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such... more
Feature selection is an essential preprocessing step for classifiers with high dimensional training sets. In pattern recognition, feature selection improves the performance of classification by reducing the feature space but preserving... more
In recent years, the kNN algorithm has received attention from many researchers and has proved to be one of the best text categorization algorithms. Text categorization uses a training set with assigned class labels to decide a new document... more
Frequent Itemset Mining (FIM) is a critical data mining operation, but the computational overhead increases in the presence of large data sets and deteriorates the efficiency of FIM computation. In this paper, we explore algorithms based... more
Recently, the development of high-performance automatic text classification as well as feature selection has been a very challenging and tedious process because many problems are still unresolved. This paper presents the work done so far... more
The concept of example credibility evaluates how much a classifier can trust an example when building a classification model. It is given by a credibility function, which is application dependent and estimated according to a series of... more
Classification algorithms usually assume that any example in the training set should contribute equally to the classification model being generated. However, this is not always the case. This paper shows that the contribution of an... more
Adverse Drug Reaction (ADR) extraction is the process of identifying drug implications mentioned in social posts. Handling medical text for the identification of ADR is vital to research in terms of configuring the side effect and other... more
In the last decade, a variety of topic models have been proposed for text engineering. However, except for Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), most existing topic models are seldom applied... more