Arabic Text Processing

13 papers

45 followers

About this topic

Arabic Text Processing is the field of study focused on the computational handling of Arabic language text. It encompasses various tasks such as tokenization, morphological analysis, syntactic parsing, and semantic understanding, utilizing techniques from natural language processing (NLP) to enable effective interaction with Arabic written content.

Key research themes

1. How can morphological analysis and disambiguation address Arabic's complex morphology and dialectal variations effectively?

Arabic's morphological richness, cliticization, and high ambiguity pose major challenges to NLP tasks such as tokenization, part-of-speech tagging, lemmatization, and diacritization. Morphological analysis coupled with contextual disambiguation is essential to handle multiple valid analyses per word and to process both Modern Standard Arabic (MSA) and dialectal variants accurately. Developments focus on combining morphological analyzers with machine learning models to balance depth of analysis, speed, and support for different Arabic varieties.

MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic

by Ramy Nagah Eskander

2024

Key finding: MADAMIRA combines morphological analysis with Support Vector Machines and language models to produce accurate in-context morphological disambiguation for both MSA and Egyptian Arabic dialects, achieving broad coverage and...

Arabic-SOS: Segmentation, Stemming, and Orthography Standardization for Classical and pre-Modern Standard Arabic

by Emad Mohamed

2022

Key finding: Arabic-SOS addresses segmentation and orthographic variation specifically for pre-MSA Arabic texts by training a Gradient Boosting model on annotated historical corpora (Al-Manar and Classical Arabic). The segmenter achieves...

Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text

by Eric S Atwell

2021

Key finding: A fine-grained morphological analyzer and POS tagger that encodes rich morphological features such as gender, number, case, mood, and voice improves disambiguation accuracy by capturing detailed grammatical distinctions...

2. What role do preprocessing and text segmentation techniques play in improving Arabic text classification and NLP downstream tasks?

Arabic text preprocessing (tokenization, stemming, lemmatization, and stopword removal) and sentence segmentation critically influence the performance of text classification, retrieval, and other NLP applications. Because of Arabic's complex morphology, ambiguous word boundaries, and orthographic challenges, linguistically informed preprocessing tailored to Arabic significantly improves feature quality and classification accuracy. Furthermore, segmenting unpunctuated Arabic text into meaningful sentences facilitates better downstream understanding.
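
The preprocessing steps named above can be illustrated with a minimal sketch using only the Python standard library. It is not the pipeline of any paper listed here: the stop-word set, the prefix and suffix lists, and the function names `normalize`, `light_stem`, and `preprocess` are all illustrative assumptions, and the orthographic normalization shown (alef variants, alef maqsura, ta marbuta) is one common convention among several.

```python
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) plus tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, or hamza below.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# A token is a run of Arabic letters (U+0621..U+064A).
TOKEN = re.compile("[\u0621-\u064A]+")

# Tiny illustrative stop-word set (an assumption, not a standard resource).
# Entries are stored in normalized form, e.g. "الي" rather than "إلى".
STOP_WORDS = {"في", "من", "علي", "الي", "عن", "هذا", "ان"}

# Illustrative affix lists (assumptions); longer affixes are tried first.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ه", "ي")


def normalize(text: str) -> str:
    """Strip diacritics and tatweel, then unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")   # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")   # ta marbuta -> ha
    return text


def light_stem(token: str) -> str:
    """Remove at most one prefix and one suffix, keeping >= 3 letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def preprocess(text: str) -> list[str]:
    """Normalize, tokenize, drop stop words, and light-stem the rest."""
    tokens = TOKEN.findall(normalize(text))
    return [light_stem(t) for t in tokens if t not in STOP_WORDS]
```

Calling `preprocess` on a short sentence returns a list of stemmed content tokens that could then feed a bag-of-words or TF-IDF feature extractor. A light stemmer of this kind only strips surface affixes, so it will both over-stem and under-stem some words; the studies listed below compare such trade-offs empirically.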

A Study of Text Preprocessing Tools for Arabic Text Categorization

by Nevin Darwish et al.

2016

Key finding: Comparative evaluation of Arabic preprocessing tools shows that light stemming combined with feature selection yields a positive impact on text categorization accuracy while reducing dimensionality. Contradictory findings in...

The effects of Pre-Processing Techniques on Arabic Text Classification

by WARSE The World Academy of Research in Science and Engineering et al.

2021, International Journal of Advanced Trends in Computer Science and Engineering

Key finding: Experimental results demonstrate that applying Arabic-specific preprocessing techniques such as stop-word removal, stemming, and lemmatization, either individually or in combination, can increase classification accuracy by up...

Employing a Multilingual Transformer Model for Segmenting Unpunctuated Arabic Text

by Ahmad Alkhodre

2025

Key finding: PDTS, a punctuation detection and segmentation approach based on fine-tuning a pre-trained multilingual BERT model combined with linguistic rules, effectively segments long unpunctuated Arabic texts into candidate...

3. How do Arabic language resources and corpora support advancements in natural language processing for Arabic?

Comprehensive, large-scale Arabic corpora and language resources provide foundational data critical for training, evaluation, and development of Arabic NLP systems. These include newspaper archives, historical manuscript collections, and annotated datasets spanning classical to modern dialects. Resources encoded with various markup schemes and diverse dialect coverage enable researchers to capture the linguistic variability of Arabic, drive improved morphological analysis, and support downstream NLP tasks such as speech recognition, text classification, and information retrieval.

1.5 billion words Arabic Corpus

by Ibrahim Abu El-Khair

2025, arXiv (Cornell University)

Key finding: Creation of a free, large-scale Arabic corpus containing over 1.5 billion words from 5 million newspaper articles across 10 news sources and 8 countries over 14 years, encoded in UTF-8 and Windows CP-1256 with SGML and XML...
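
For readers working with a corpus distributed in both encodings, the following is a small sketch of converting between Windows CP-1256 and UTF-8 with Python's built-in codecs. The helper name `transcode` and the sample string are assumptions for illustration; nothing here reflects the corpus's actual file layout or tooling.

```python
def transcode(raw: bytes, src: str, dst: str) -> bytes:
    """Decode raw bytes from codec `src` and re-encode them in codec `dst`."""
    return raw.decode(src).encode(dst)


sample = "عربي"  # four Arabic letters

# CP-1256 stores each of these letters in a single byte.
as_cp1256 = sample.encode("cp1256")

# UTF-8 stores each of these letters in two bytes.
as_utf8 = transcode(as_cp1256, "cp1256", "utf-8")
```

The conversion from CP-1256 to UTF-8 always succeeds for valid input, while the reverse direction raises `UnicodeEncodeError` for any character that CP-1256 cannot represent.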

HADARA – A Software System for Semi-Automatic Processing of Historical Handwritten Arabic Documents

by Ofer Biller

2021

Key finding: The HADARA system integrates document digitization, annotation, layout analysis, and recognition for Arabic historical manuscripts, supporting semi-automatic transcription with modules tuned to handle complex scripts,...

Arabic Language Resources and Tools for Speech and Natural Language

by Mansour Alghamdi

2023

Key finding: Comprehensive development of morphological analyzers, stemmers, lexica, and morphological rules at KACST and UOB has produced Arabic language resources that underpin NLP and speech applications. These resources...

All papers in Arabic Text Processing

Arabic dialects identification: North African dialects case study

by Mourad Oussalah

2024

Arabic is the fourth most used language on the Internet and the official language of more than 20 countries around the world. It has three main varieties: Modern Standard Arabic, which is used in books, news and education, local Dialects...

Figure 1: Different dialects in the Maghreb countries, where people living near the borders speak a dialect similar to that of the neighbouring country. The graphic shows how to say 'what are you doing' in different dialects [8].

Figure 2: Generic graph showing the overall methodology.

Size of the dataset after splitting the long sentences into smaller ones.

Figure 3: Sample of sentences written in Arabizi.

Figure 4: Sample of sentences written in local dialects.

Figure 5: Accuracy level according to the number of features used in TF-IDF.

Figure 6: Accuracy level according to the number of features used in TF-IDF without stopwords.

A Grammatically and Structurally Based Part of Speech (POS) Tagger for Arabic Language

by Mohamed Elhadi

2024, International Journal on Natural Language Computing

In this paper we report on an experimental syntactically and morphologically driven rule-based Arabic tagger. The tagger is developed using Arabic language grammatical rules and regulations. The tagger requires no pre-tagged text and is...

Arabic text steganography using lunar and solar diacritics

by Ban N. Dhannoon

2024, Indonesian Journal of Electrical Engineering and Computer Science

The need to hide essential information has rapidly increased as mobile devices and the internet have grown. Steganography is a method for creating hidden communication. Recently, methods have been developed to hide important...

A Grammatically and Structurally Based Part of Speech (POS) Tagger for Arabic Language

by Mohamed Elhadi

2024, Zenodo (CERN European Organization for Nuclear Research)

A Grammatically and Structurally Based Part of Speech (POS) Tagger for Arabic Language

by Mohamed Elhadi

2024, International Journal on Natural Language Computing

The MTE Tagger is based entirely on readily available primitive data lists and a complex set of linguistic rules, both of which are highlighted next:

Table 1. Sample Closed POS Categories List

International Journal on Natural Language Computing (IJNLC) Vol.11, No.5, October 2022

Table 2. Grammatical, Morphological and Structural Procedures (Rules)

5. EXPERIMENTS, EVALUATIONS, AND DISCUSSIONS

The evaluation process consists of the following experiments, with results as shown in Table 3. Two sets of experiments were performed. The first set consists of four runs on four different unannotated data sets to compare the performance (accuracy and timing) of the new tagger with that of the Stanford Tagger.

Table 3. Accuracy and timing results comparison.

The second set of experiments is based on a small selected dataset that is manually annotated. Both taggers are run on the data set, and tagging accuracy and speed are noted and compared. Accuracy represents the number of correctly tagged tokens, while performance is the speed of tagging. Due to the expectation that rule-based systems tend to be much faster and more robust, the measurements taken are only indicative and lack the features of a well-controlled experiment.

Table 4. A sentence example made of 27 tokens. The taggers match on 19 and mismatch on 8.

Table 5. Overall missed percentage for the different POS categories

Looking at the overall success of tagging, we can see that adjectives (JJ) are the least accurately tagged category in MTE, and better rules will still have to be devised to improve the classification of JJs.

A Grammatically and Structurally Based Part of Speech (POS) Tagger for Arabic Language

by Ramadan Alfared

2023, Zenodo (CERN European Organization for Nuclear Research)

Generating Arabic Stop-Word for Hadith

by Yousef Hazzaimeh

2023, Malaysian Journal of Science, Health & Technology

Stop-words (or function words) play an important role in performing various functions in sentences, but are typically of little use for retrieval. They consist of several elements such as common nouns, pronouns and prepositions. With...

A note on the Greek and Ethiopic text of 1 Enoch 5:8

by Fiodar Litvinau

2019, Journal for the Study of the Pseudepigrapha 29.1

Greek and Ethiopic versions of 1 Enoch 5:8 preserve a different text at the end of the passage. This note aims to demonstrate the superiority of the Ethiopic text of 1 En. 5:8 over the version preserved in Codex Panapolitanus by arguing...

The Fantastic Four: Alexander, Sesonchosis, Ninus and Semiramis

by Yvona Trnka-Amrhein

2018, The Alexander Romance: History and Literature

Targum: Translation in Hellenistic and Roman Imperial Prose Fiction

by Daniel L. Selden

2013

Paul de Lagarde and the Coptic New Testament

by Heike Behlmer

2013, Essays in Honour of Frederik Wisse: Scholar, Churchman, Mentor ed. W. Kappeler

Lithargoel in the Acts of Peter and the Twelve

by Istvan Czachesz

2013, In: A. Hilhorst and G.H. van Kooten (eds), The Wisdom of Egypt: Jewish, Early Christian, and Gnostic Essays in Honour of Gerard P. Luttikhuizen, Leiden: Brill, 2005, 485-502.

Text Networks

by Daniel L. Selden

2012