Academia.edu

Document Classification

1,100 papers
291 followers
About this topic
Document classification is the process of categorizing documents into predefined classes or categories based on their content and features. This involves the application of algorithms and techniques from machine learning and natural language processing to automate the organization and retrieval of information in various formats.
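As a concrete illustration of that process, here is a minimal sketch, assuming Python with scikit-learn; the category labels and toy texts are invented for illustration and are not drawn from any of the papers listed below:

```python
# Minimal document-classification sketch: TF-IDF features + Naive Bayes.
# The toy texts and labels below are invented for illustration only.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "The court ruled on the contract dispute",
    "The patient was prescribed a new medication",
    "The striker scored twice in the final match",
]
train_labels = ["legal", "medical", "sports"]

# Vectorize documents into TF-IDF features, then fit a Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["The judge dismissed the appeal"]))  # e.g. ['legal']
```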

Key research themes

1. How do machine learning techniques enhance document classification accuracy and efficiency?

This research theme focuses on applying and comparing various supervised machine learning algorithms to improve the precision and scalability of document classification across diverse textual datasets, including language-specific corpora and domain-specific documents. It matters because automated categorization enables effective handling of the ever-growing volume of digital texts, reduces manual labor, and supports scalable information retrieval.

Key finding: The application of four supervised learning algorithms (Decision Tree C4.5, K-Nearest Neighbour, Naïve Bayes, Support Vector Machine) on Bangla web documents demonstrated that SVM attained superior performance in categorizing... Read more
Key finding: A comparative survey of text classification algorithms including decision trees, SVM, and Naïve Bayes shows that each approach has unique advantages and that SVM often leads in accuracy and robustness; the study highlights... Read more
Key finding: This comparative study finds that supervised classifiers like k-NN, SVM, and ensemble methods benefit significantly from advanced word embedding representations such as GloVe compared to traditional TF-IDF vectors; this leads... Read more
Key finding: The paper surveys the dominant machine learning paradigm for text categorization, highlighting the superiority of inductive learning algorithms over knowledge engineering (manual rule-based approaches) due to their... Read more
Key finding: Using the Vector Space Model for document representation combined with well-known algorithms, the study validates automatic classification approaches on journalistic documents, showing the feasibility and efficiency of... Read more
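The kind of comparison these findings summarize can be sketched with a small benchmark harness. The example below is a hypothetical setup using scikit-learn and the public 20 Newsgroups corpus (not the Bangla, journalistic, or other corpora from the cited studies); it cross-validates several of the classifiers mentioned above on TF-IDF features:

```python
# Hedged sketch: comparing supervised text classifiers on TF-IDF features.
# Uses the public 20 Newsgroups corpus as a stand-in for the corpora in the
# studies above; scores are illustrative, not a reproduction of their results.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(max_features=20000, stop_words="english").fit_transform(data.data)
y = data.target

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Linear SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(max_depth=50),
}

# 3-fold cross-validation for each classifier on the same feature matrix.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
    print(f"{name:>13}: mean accuracy {scores.mean():.3f}")
```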

2. What are effective document representation and feature engineering techniques improving classification performance?

This theme explores innovative document representation methods and feature extraction techniques that address challenges such as high dimensionality, sparsity, semantic loss, and short-text data limitations. Advancements in representation and feature engineering matter because they directly impact classifier accuracy, computational efficiency, and the ability to capture context and semantics within documents.

Key finding: By clustering term frequency vectors and creating symbolic interval-valued features capturing statistical measures (mean and standard deviation), the study proposes an unconventional document representation that mitigates... Read more
Key finding: The proposed classification approach preserves term sequence information using a novel 'Status Matrix' data structure combined with B-tree indexing of terms associated with class labels, overcoming limitations of bag-of-words... Read more
Key finding: Introduces a supervised feature extraction method that aggregates original document features into low-dimensional abstract features representing class-evidence strengths, achieving consistent improvements across seven... Read more
Key finding: The TextNetTopics Pro framework leverages lexical features grouped into topics alongside document-topic distributions from multiple short-text topic models, mitigating data sparsity and limited context typical in short text... Read more
Key finding: This work demonstrates that representing clinical reports through topic distributions generated by Latent Dirichlet Allocation (LDA) and related probabilistic models yields more compact and interpretable features than... Read more
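One of the representation ideas above, using topic distributions as compact document features, can be sketched as follows. This is an illustrative scikit-learn pipeline on a public corpus, not the specific models or clinical data from the cited papers:

```python
# Hedged sketch: topic-distribution features for document classification.
# LDA compresses sparse term counts into a dense, low-dimensional
# document-topic matrix, which then serves as the classifier input.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"],
                          remove=("headers", "footers", "quotes"))

pipeline = make_pipeline(
    CountVectorizer(max_features=5000, stop_words="english"),
    LatentDirichletAllocation(n_components=20, random_state=0),  # 20 topic features
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipeline, data.data, data.target, cv=3)
print(f"mean accuracy with 20 topic features: {scores.mean():.3f}")
```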

3. How can document classification be enhanced in domain-specific or metadata-constrained scenarios?

This theme investigates approaches addressing domain-specific challenges such as legal, medical, or scientific documents, and situations where full content is inaccessible, focusing on leveraging metadata, multimodal features, or domain taxonomies. Exploring these methods is essential for enabling classification in real-world applications where standard text content may be limited, noisy, or structured differently.

Key finding: Demonstrates that multi-label classification of scientific articles using only metadata (title, keywords) can approach the performance of content-based methods, providing an effective alternative when full texts are... Read more
Key finding: Utilizing BERT-based neural networks for fully automated classification of artificial intelligence-related scientific documents, the model achieved 96.5% accuracy, uncovering substantial subject overlaps in other domains and... Read more
Key finding: Proposes a pipeline coupling template-based document classification with mixed integer programming for key information extraction in noisy scanned forms, achieving high f1 scores (0.97 for classification, 0.94 for KIE),... Read more
Key finding: Develops a machine learning-based legal document classification system using Multinomial Naive Bayes trained on document keywords and also summarizing documents via TextRank algorithm, effectively addressing challenges... Read more
Key finding: Applies k-Nearest Neighbors using Cosine-Binary similarity on Indonesian-language student thesis documents classified under the BATAN nuclear competence taxonomy, achieving 97% accuracy, and showcasing the utility of text... Read more
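As a rough illustration of the metadata-only and taxonomy-driven setups described above, the sketch below classifies toy metadata records (title plus keywords, all invented; not the BATAN taxonomy or the datasets from the cited work) with a cosine-similarity k-NN:

```python
# Hedged sketch: metadata-only classification with cosine-similarity k-NN.
# Each record is represented only by its title and keywords, standing in for
# scenarios where the full text is unavailable.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy metadata records (title + keywords) with invented taxonomy labels.
records = [
    "Deep learning for radiology reports; keywords: CNN, diagnosis, imaging",
    "Contract law and arbitration clauses; keywords: litigation, statute",
    "Reactor shielding simulation; keywords: neutron flux, Monte Carlo",
    "Transformer models for clinical notes; keywords: BERT, EHR, NLP",
]
labels = ["medical", "legal", "nuclear", "medical"]

# TF-IDF vectors are L2-normalized by default, so nearest neighbours under
# cosine distance match the intuition of cosine similarity on term vectors.
model = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=1, metric="cosine"),
)
model.fit(records, labels)

print(model.predict(["Dose mapping around a research reactor; keywords: gamma"]))
```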

All papers in Document Classification

This research deals with the use of self-organising maps for the classification of text documents. The aim was to classify documents into separate classes according to their topics. We therefore constructed self-organising maps that were... more
We propose a novel bio-inspired solution for biomedical article classification. Our method draws from an existing model of T-cell cross-regulation in the vertebrate immune system (IS), which is a complex adaptive system of millions of... more
We present and study an agent-based model of T-Cell cross-regulation in the adaptive immune system, which we apply to binary classification. Our method expands an existing analytical model of T-cell cross-regulation that was used to study... more
Document (text) classification is a common method in e-business, facilitating users in tasks such as document collection, analysis, categorization and storage. Semantic analysis can help to improve the performance of document... more
Abstract. This paper presents a deployed semantic web application in the cultural domain: the semantic portal MuseumFinland. It is a demonstration of a community portal and a publication channel by which heterogeneous collection database... more
Defining the architecture of a software system is an essential step in its development. It is the first stage at which design decisions are made. It is, in particular, where decisions are taken concerning the... more
Automatic classification of textual content in an under-resourced language is challenging, since lexical resources and preprocessing tools are not available for such languages. Their bag-of-words (BoW) representation is usually highly... more
This paper presents a new reduction algorithm which employs Constraint Satisfaction Techniques for removing redundant literals of a clause efficiently. Inductive Logic Programming (ILP) learning algorithms using a generate and test... more
In this paper we describe the two classification approaches (i.e. categorization and clustering) and their preceding steps. For each approach we give a brief description of the underlying theory and outline the advantages and... more
This paper introduces a novel paradigm to impute missing data that combines a decision tree with an auto-associative neural network (AANN) based model and a principal component analysis-neural network (PCA-NN) based model. For each model,... more
Text document classification is an important research topic in the field of information retrieval, and so is the way we represent the information extracted from the documents to be classified. There exist document classification methods... more
The increasing procedural demand in judicial institutions has caused a workload overload, impacting the efficiency of the legal system. This scenario, exacerbated by limited human resources, highlights the need for technological solutions... more
Effective decision making is based on accurate and timely information. However, human decision makers are often overwhelmed by the huge amount of electronic data these days. The main contribution of this paper is the development of... more
As businesses move toward smarter automation and AI-driven processes, understanding structured documents like invoices, forms, receipts, and reports has become essential. This is where LayoutLM plays a groundbreaking role. Developed by... more
by Ranjna Garg and 1 more
Text-based mining is the process of analyzing a document or set of documents to understand the content and meaning of the information they contain. Text mining enhances humans' ability to process massive quantities of information and it... more
This paper describes the design and collection of NameDat, a database containing English proper names spoken by native Norwegians. The database was designed to cover the typical acoustic and phonetic variations that appear when Norwegians... more
Recent analyses have identified a significant rise in transactional fraud, where bad actors seek to deceive individuals or firms into unauthorized financial actions. Traditional fraud detection systems frequently struggle to effectively... more
Proposal of an image-referencing method to assist the architectural design process.
The quick spurt in online transactions has also brought with it a parallel surge in fraud. With digital payments, the scope for fraud remains very high. Credit card fraud, among others, can cause heavy losses to customers and erode the... more
This paper describes a methodology for semi-automatic classification schema definition (a classification schema is a taxonomy of categories useful for automatic document classification). The methodology is based on: (i) an extensional... more
This paper describes our approach towards the ECML/PKDD Discovery Challenge 2010. The challenge consists of three tasks: (1) a Web genre and facet classification task for English hosts, (2) an English quality task, and (3) a multilingual... more
SchoolNet provides global Internet access to secondary schools (grades 7-12) throughout Thailand. By using technology to improve our education system, this project supports the human-resource-development emphasised in the 8th... more
Various platforms, including patent systems and repositories like GitHub and arXiv, support knowledge dissemination across domains. As knowledge increasingly spans multiple disciplines, there is a need to track innovations that intersect... more
A huge number of documents is accumulating rapidly; therefore, organizing them in digitized form makes text categorization a challenging issue. A major issue for text categorization is its large number of features. Most of the features are... more
Some of the key questions in paleography are those of classification, namely trying to ascertain when and where a given manuscript was written, and — if possible — by whom. Paleographers bring many skills and tools to bear on these... more
Documents are one of the most common methods for maintaining data and records. Every day a lot of documents/files are generated with a lot of data for future research purposes or for business analytics. These files/documents should be... more
At the preceding GREC, we have proposed to use a "bag of symbols" formalism (similar to the bag of words approach) for the indexing of a graphical document image database. In this paper, we extend the proposed approach through the... more
by Av Av
An unprecedented rise in Unified Payment Interface (UPI) transactions in rural areas makes them prone to fraud. In this paper, the authors present a new method for detecting UPI fraud based on the name of the bank account, transaction... more
Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism-detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major... more
Credit card fraud detection remains a significant challenge for financial institutions and consumers globally, prompting the adoption of advanced data analytics and machine learning techniques. In this study, we investigate the... more
The detection of outliers in text documents is a highly challenging task, primarily due to the unstructured nature of documents and the curse of dimensionality. Text document outliers refer to text data that deviates from the text found... more
We propose a new approach denoted BNDI (Bayesian Network for Document Indexing) for indexing biomedical documents with controlled biomedical vocabulary based on a Bayesian Network. BNDI uses the probability inference to extract... more
In this paper, we present an application using the SUMMA-LSA platform developed by Baier, Lehnard, Hoffmann & Schneider (this volume). SUMMA-LSA was used to evaluate the biology knowledge of 7th and 8th grade students dealing with The human... more
State-of-the-art review on opinion mining from online customers' feedback. In Aruka, Y & Namatame, A (Eds.
In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF [1]). Although simple and intuitive, sBoW style representations... more
We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings. It ensures a representation generated as such... more
Feature selection is an essential preprocessing step for classifiers with high dimensional training sets. In pattern recognition, feature selection improves the performance of classification by reducing the feature space but preserving... more
In recent years, the kNN algorithm has received attention from many researchers and has proved to be one of the best text categorization algorithms. Text categorization uses a training set with assigned class labels to decide a new document... more
Frequent Itemset Mining (FIM) is a critical data mining operation, but the computational overhead increases in the presence of large data sets and deteriorates the efficiency of FIM computation. In this paper, we explore algorithms based... more
Recently, the development of high-performance automatic text classification as well as feature selection has been a very challenging and tedious process because many problems are still unresolved. This paper presents the work done so far... more
The concept of example credibility evaluates how much a classifier can trust an example when building a classification model. It is given by a credibility function, which is application dependent and estimated according to a series of... more
Classification algorithms usually assume that any example in the training set should contribute equally to the classification model being generated. However, this is not always the case. This paper shows that the contribution of an... more
Adverse Drug Reaction (ADR) extraction is the process of identifying drug implications mentioned in social posts. Handling medical text for the identification of ADR is vital to research in terms of configuring the side effect and other... more
In the last decade, a variety of topic models have been proposed for text engineering. However, except for Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), most existing topic models are seldom applied... more