Academia.edu

Arabic NLP

1,000 papers
10,082 followers
About this topic
Arabic Natural Language Processing (NLP) is a subfield of artificial intelligence and computational linguistics focused on the interaction between computers and the Arabic language. It involves the development of algorithms and models to enable machines to understand, interpret, and generate Arabic text and speech, addressing unique linguistic features and challenges of the language.

Key research themes

1. How can multidialectal Arabic NLP be advanced to address the diversity and complexity of Arabic dialects in tasks like Named Entity Recognition?

This research area focuses on developing robust NLP models that handle multiple Arabic dialects simultaneously, overcoming the challenge posed by the linguistic diversity, morphological richness, and lack of standardized dialectal resources. It is crucial because Arabic dialects differ considerably from Modern Standard Arabic (MSA) and from each other, leading to poor performance of MSA-centric tools on dialectal texts, thus hindering real-world applications such as information retrieval, machine translation, and question answering.

Key finding: Proposed ARDIAL-BERT, the first multidialectal NER model covering major Arabic dialects (Levantine, Maghrebi, Egyptian, Gulf), and demonstrated that continual pretraining on regionally grouped datasets notably improves NER...
Key finding: Introduced CAMeL Tools, an open-source Python toolkit supporting morphological modeling, dialect identification, named entity recognition, and sentiment analysis tailored to Arabic and its dialects. The toolkit addresses...
Key finding: Developed MADAMIRA, a fast tool for morphological analysis and disambiguation applicable to both MSA and dialectal Arabic. It integrates morphologically rich analysis, diacritization, POS tagging, and tokenization using...

2. What are the effective methodologies for constructing and utilizing large-scale Arabic language corpora and lexicons to support NLP applications, including dialectal variations?

This theme explores the creation, structuring, and use of large Arabic language corpora and lexical resources to enhance NLP tasks. Given the diglossic nature of Arabic, with its standard and multiple dialectal forms, language resources must represent this diversity. Properly designed corpora and lexicons enable better empirical analysis, lexicography, semantic understanding, and help overcome the scarcity of annotated data for dialects, which is a key bottleneck in Arabic NLP development.

Key finding: Created Tharwa, a pioneering three-way lexicon connecting Egyptian Dialectal Arabic, Modern Standard Arabic, and English, covering over 73,000 dialect entries. It includes detailed linguistic features such as POS, gender,...
Key finding: Established a standardized set of rules for consistent data division of Arabic treebanks (Modern Standard Arabic and Egyptian dialects) into train, development, and test splits. This methodological contribution enables...
Key finding: Outlined the design and compilation of ICA, an effort to build a large, representative Arabic corpus encompassing diverse genres and regional varieties, addressing the shortage of Arabic corpora for linguistic research and...
Key finding: Presented a large-scale, free Arabic corpus comprising over 1.5 billion words collected from 5 million newspaper articles across 8 countries over 14 years. The corpus offers diverse, multi-source, multi-country data...

3. What are the specific challenges and linguistic features of Arabic that must be addressed in NLP, and how can morphological structures like schemes and multiword expressions enhance Arabic NLP systems?

Arabic’s unique linguistic characteristics—such as rich morphology, complex word formation via roots and schemes, orthographic ambiguity due to optional diacritics, diglossia, and pervasive use of multiword expressions—pose significant challenges in NLP. Research focuses on modeling these features accurately, including leveraging scheme-based abstractions to reduce vocabulary sparsity and compiling annotated repositories of multiword expressions to improve language understanding and processing accuracy.

Key finding: Explored the use of Arabic morphological schemes (templates guiding root-based derivation) as abstractions to reduce model sparsity in NLP. Demonstrated a vocabulary reduction of over 90% when converting text to schemes, with a...
Key finding: Compiled a manually curated, morphosyntactically annotated repository of approximately 5,000 Arabic multiword expressions (MWEs), categorized by syntactic type and enriched with context-sensitive morphological analysis....
Key finding: Reviewed critical linguistic challenges in Arabic NLP arising from Arabic's derivational and inflectional morphology, syntactic free word order, diglossia (Classical, MSA, dialects), and orthographic ambiguities due to absent...
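
The orthographic ambiguity mentioned in this theme can be illustrated with a minimal Python sketch. It is not taken from any of the papers listed here. It assumes that the diacritics of interest are the marks in the Unicode range U+064B to U+0652 (tanwin, short vowels, shadda, sukun), and the helper name strip_diacritics is hypothetical. Three distinct diacritized words collapse to one written form once the marks are dropped, which is the form most Arabic text actually uses.

```python
import re

# Assumption: "diacritics" here means the marks U+064B..U+0652
# (tanwin, fatha, damma, kasra, shadda, sukun). Other marks are left untouched.
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritic marks above removed (hypothetical helper)."""
    return DIACRITICS.sub("", text)

# Three different words built on the consonants k-t-b, written with escapes
# so the exact code points are unambiguous.
diacritized = [
    "\u0643\u064E\u062A\u064E\u0628\u064E",  # kataba, "he wrote"
    "\u0643\u064F\u062A\u0650\u0628\u064E",  # kutiba, "it was written"
    "\u0643\u064F\u062A\u064F\u0628",        # kutub, "books"
]

# Without diacritics, all three share a single surface form.
bare = {strip_diacritics(word) for word in diacritized}
print(len(diacritized), "diacritized forms ->", len(bare), "undiacritized form")
```

A system that sees only the undiacritized form has to recover the intended word from context, which is why morphological disambiguation and diacritization recur across the findings above.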

All papers in Arabic NLP

We developed an original approach to Arabic traditional morphology, involving new concepts in Semitic lexicology, morphology, and grammar for standard written Arabic. This new methodology for handling the rich and complex Semitic... more
Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is to say,... more
We describe a lexicon of Arabic verbs constructed on the basis of Semitic patterns and used in a resource-based method of morphological annotation of written Arabic text. The annotated output is a graph of morphemes with accurate... more
A natural path for Arabic morphology consists in adopting or adapting both the traditional Semitic model and finite-state technologies. On the one hand, we have to facilitate the linguist’s task of lexical encoding by proposing a familiar... more
Data annotation is an important and necessary task for all NLP applications. Designing and implementing a web-based application that enables many annotators to annotate and enter their input into one central database is not a trivial... more
This paper presents a language identification system designed to detect the language of each word, in its context, in multilingual documents as generated in social media by bilingual/multilingual communities, in our case speakers of... more
DIRA is a query expansion tool that generates search terms in Standard Arabic and/or its dialects when provided with queries in English or Standard Arabic. The retrieval of dialectal Arabic text has recently become necessary due to the... more
Most developing economies continue to be overdependent on agriculture, yet farmers fail to access professional knowledge and information in a timely fashion that may help them generate higher yields, reduce risks, and make decisions. In... more
The rapid growth of social media in recent years has fed into some highly undesirable phenomena such as proliferation of abusive and offensive language on the Internet. Previous research suggests that such hateful content tends to come... more
The rapid growth of social media in recent years has fed into some highly undesirable phenomena such as proliferation of hateful and offensive language on the Internet. Previous research suggests that such abusive content tends to come... more
Extracting contextual information from low-resource languages such as Igbo remains a significant challenge due to limited linguistic data. This paper proposes a novel hybrid approach that leverages both global and subword-level... more
This research investigates the use of Machine Learning (ML) and Deep Learning, including BiLSTM approaches, for Sentiment Analysis (SA) of consumer reviews on social media sites. Businesses are increasingly depending on online reviews to... more
The relevance of textual analysis appears in numerous case studies across fields of social, business and academic communication. A central question in multilingual research is to develop a universal concept representation using a variety... more
This overview presents the Author Profiling and Deception Detection in Arabic (APDA) shared task at PAN@FIRE 2019. This year's task had two main aims: i) to profile the age, gender and native language of a Twitter user; ii) to... more
In recent times, machine learning and deep learning have quickly risen to prominence as highly effective instruments across a multitude of domains, encompassing areas such as image and speech interpretation, the processing of natural... more
This paper introduces SERTUS (Speech Emotion Recognition TUnisian Spontaneous), an extensive dataset collection intended to propel research in Speech Emotion Recognition (SER), particularly within the realm of Tunisian Dialect (TD).... more
Emotional expressions are a fundamental aspect of human communication, with speech being one of the most natural modes of interaction. Speech Emotion Recognition (SER) is a significant research topic in Natural Language Processing (NLP),... more
Speech Emotion Recognition (SER) using Natural Language Processing (NLP) for underrepresented dialects faces significant challenges due to the lack of annotated corpora. This research addresses this issue by constructing and annotating... more
Artificial intelligence (AI) tools such as DeepSeek R1 and ChatGPT 4.5 have emerged as promising aids in Arabic-English literary translation. This study aims to compare the translation performance of these two systems using a... more
Named Entity Recognition (NER) is among the main tasks of Natural Language Processing (NLP). NER is a critical and fundamental component for several NLP applications including Information Retrieval (IR), Question-Answering (QA) and... more
Irony and satire in Orwell's Animal Farm are lexically investigated in the current paper, in order to find out the correlation between both concepts. The researcher adopts a qualitative method of analysis, focusing on chapter 10. The... more
Paraphrase plagiarism is a significant and widespread problem and research shows that it is hard to detect. Several methods and automatic systems have been proposed to deal with it. However, evaluation and comparison of such solutions is... more
In this paper, we describe the process of building a corpus for Tunisian Speech Emotion Recognition (SER). To the best of our knowledge, it is the first work in the SER field that uses spontaneous speech emotion in Tunisian dialect.... more
The field of artificial intelligence (AI) is evolving at an unprecedented pace, with transfer learning and transformer-based models now forming the backbone of many state-of-the-art systems. This book, 200 Questions About Transfer... more
Question Answering (QA) is a specialized area in the field of Information Retrieval (IR). QA systems are concerned with providing relevant answers in response to questions posed in natural language. QA is therefore composed of... more
In the Philippines, parents refused to have their children receive anti-measles and anti-dengue vaccines, which led to an outbreak. This might not have happened if product warnings had been given and explained to the parents. Indeed, product... more
Websites are regarded as domains of limitless information which anyone and everyone can access. The new trend of technology has shaped the way we do and manage our businesses. Today, advancements in Internet technology have given rise to... more
The widespread dissemination of fake news across digital platforms has emerged as a critical issue, undermining public trust and influencing societal discourse. This paper presents a lightweight yet effective fake news detection system... more
In this work, we examine the limitations of digital tools in facilitating cross-linguistic and cross-cultural research from a humanistic perspective. Our primary objective is to draw comparisons between the TenTen corpora, assessing their... more
This paper explores the interplay between artificial intelligence (AI) in natural language processing (NLP) and linguistics, offering NLP engineers actionable methodologies (e.g., syntactic probes, evaluation metrics) and linguists... more
The increasing diffusion of misinformation in online media has raised alarm as a significant threat to information credibility and societal trust. The ease of disseminating false information across social media platforms, news websites,... more
Social media is used as a dominant source of news distribution among users. Major decisions, such as political ones, are promoted on social media to sway users' choices. However, the... more
In this paper we implement a document retrieval system using the Lucene tool and we conduct some experiments in order to compare the efficiency of two different weighting schemes: the well-known TF-IDF and the BM25. Then, we expand queries... more
The emergence and subsequent development of deep learning, specifically transformer-based architectures, Generative Adversarial Networks (GANs), and attention mechanisms, have had revolutionary implications on Natural Language Processing... more
The paper presents the creation of an end-to-end voice assistant system designed for a lesser-resourced dialect of Arabic, Libyan Tripolitanian, which does not receive local support in commercial ASR and NLP applications. To remediate this... more
This study introduces a novel approach to sentiment classification of Arabic tweets regarding educational reforms in Saudi Arabia. The complexity of the Arabic language, with its numerous dialects, poses challenges for natural language... more
This paper describes a comprehensive set of experiments conducted in order to classify Arabic Wikipedia articles into predefined sets of Named Entity classes. We tackle this task using four different classifiers, namely: Naïve Bayes, Multinomial... more
This paper presents Small Synthetic Embedding Dataset, a fully synthetic dataset in Ukrainian designed for training, fine-tuning, and evaluating text embedding models. The use of large language models (LLMs) allows for controlling the... more
Automatic Text Classification is a machine learning task that automatically assigns a given text document to a set of pre-defined categories based on the features extracted from its textual content. Most online communication forums,... more
Arabic nouns can be marked for definiteness or indefiniteness. The definite article is the prefix "Al-," which confines the determiner class to a single element "Al-." This topic is generally discussed under noun inflections, such as... more
Fake News (FN) dissemination on websites and online platforms influences human behaviours, sociopolitical domains, and the sovereignty of a country. The outpour of biased news and propaganda on online portals can be addressed by... more
This article gives mathematical pseudocode for large language model training based on a dataset from a corpus, inference, and chat with possibly lengthy human prompts and generated replies. It introduces the concepts of "microlect" and... more
The goal of natural language processing (NLP), which has recently gained popularity, is to improve the capacity of computers to comprehend and interact with human language. Consequently, to converse using natural language, it is crucial... more
Financial Named Entity Recognition (NER) presents a pivotal task in extracting structured information from unstructured financial data, especially when extending its application to languages beyond English. In this paper, we present... more
A question answering system (QAS) is essential to satisfy the need to query information available in various formats, including structured data (ontologies, databases) or unstructured data (documents, web). The QAS provides a correct response... more