Academia.edu

Arabic Language NLP

10 papers
4 followers
About this topic
Arabic Language NLP (Natural Language Processing) is a subfield of artificial intelligence focused on the interaction between computers and the Arabic language. It involves the development of algorithms and models to enable machines to understand, interpret, and generate Arabic text, addressing unique linguistic features and complexities inherent to the language.

Key research themes

1. How can linguistic lexicons bridging Modern Standard Arabic, Dialectal Arabic, and English improve NLP performance across Arabic varieties?

This research theme focuses on building and employing large-scale multilingual lexicons that link Dialectal Arabic (DA)—primarily Egyptian Arabic—with Modern Standard Arabic (MSA) and English. The goal is to address the challenges posed by the significant morphological, phonological, and lexical divergences between Arabic varieties, which negatively affect NLP tool performance when applied across dialects. By integrating lexicons enriched with detailed morphological and linguistic annotations, researchers aim to enhance both theoretical linguistic studies and computational applications such as machine translation, sentiment analysis, and morphological disambiguation.

Key finding: Tharwa provides a pioneering large-scale electronic tri-lingual lexicon connecting over 73,000 Egyptian Arabic entries with their equivalents in MSA and English. The lexicon includes detailed morphological features (POS,... Read more
Key finding: This extended overview of Tharwa emphasizes its role in filling the lexical resource gap for Egyptian Arabic as a pilot dialect, showcasing how it supports analyses of phonological and lexical variation from MSA. By capturing... Read more
Key finding: This work highlights the need to study Arabic dialects explicitly due to their pervasive use in spontaneous speech and increasingly written forms on social media, and the inadequacy of applying MSA tools directly. It surveys... Read more
Key finding: CAMeL Tools integrates multiple Arabic NLP functionalities including morphological modeling, dialect identification, and named entity recognition, incorporating support for dialectal processing alongside MSA. This toolkit... Read more

2. What role do large-scale Arabic text corpora play in advancing NLP applications and linguistic research?

This theme addresses the development and utilization of sizable and representative Arabic corpora as critical foundations for data-driven NLP and linguistic studies. Given Arabic's diglossic and dialectal properties, large annotated and raw corpora spanning various domains, dialects, and writing styles provide empirical evidence necessary for lexicography, syntactic analysis, semantic studies, and machine learning model training. The advancement of Arabic NLP systems depends heavily on the availability of such corpora, which improve resource coverage and performance across tasks like sentiment analysis, information retrieval, and machine translation.

Key finding: This study presents a large-scale Arabic corpus containing over 1.5 billion words collected from newspaper articles across ten major news sources from eight Arabic countries, spanning fourteen years. The corpus is encoded in... Read more
Key finding: This paper outlines the creation of the International Corpus of Arabic (ICA), a representative and balanced Arabic corpus covering the entire Arab world. The ICA supports multiple fields including lexicography, grammar,... Read more
Key finding: This paper introduces a novel Arabic-English mixed multilingual corpus designed to reflect real-world scientific documents containing both languages in tightly integrated forms. Unlike most monolingual or parallel corpora,... Read more
Key finding: Leveraging a small manually annotated Tunisian Arabic corpus (6,000 words), this study compares rule-based and machine learning POS tagging methods for a severely under-resourced dialect. Despite the limited data size,... Read more

3. How can morphological patterns and multiword expressions enhance Arabic NLP tool development and accuracy?

Arabic’s rich, templatic morphology and widespread use of fixed multiword expressions (MWEs) pose unique challenges and opportunities for NLP. Research in this theme involves leveraging schemes (morphological templates) to reduce lexical sparsity and build text classifiers and parsers, as well as the compilation and annotation of extensive Arabic MWE repositories. Accurate morphological analysis and MWE identification improve key NLP functions such as tokenization, parsing, and semantic interpretation, which are essential for applications ranging from sentiment analysis to machine translation.

Key finding: This study pioneers the exploitation of Arabic morphological schemes, rather than surface words, to reduce data sparsity and build NLP systems including a neural network text classifier and a probabilistic context-free... Read more
Key finding: The authors manually compile a large repository of Arabic MWEs from multiple dictionaries, annotate every word with detailed context-sensitive morphological analyses, and automatically tag occurrences of these MWEs in a large... Read more
Key finding: MADAMIRA integrates state-of-the-art morphological analysis and disambiguation by combining strengths of prior tools (MADA and AMIRA), handling both MSA and dialects. By leveraging rich morphological analyzers producing... Read more

All papers in Arabic Language NLP

This study investigates the influence of various Natural Language Processing (NLP) models on the accuracy and efficiency of Arabic linguistic applications. Employing a systematic review and comparative analysis, the research evaluates... more
This study investigates the multifaceted challenges of Arabic language processing in artificial intelligence (AI) systems, emphasizing linguistic, technical, and ethical dimensions. Employing a qualitative analysis of current research, it... more
AraPlagDet is the first shared task that addresses the evaluation of plagiarism detection methods for Arabic texts. It has two subtasks, namely external plagiarism detection and intrinsic plagiarism detection. A total of 8 runs have been submitted... more
In the era of digital architecture, parametric design plays a fundamental role in the generative architectural design process. The most important of its benefits are that it allows a visual representation of the design process, a designer... more
As e-commerce becomes more and more popular, the number of customer reviews that a product receives grows rapidly. In order to enhance customer satisfaction and their shopping experiences, it has become important to analyze customers... more
Unpublished paper comparing the hand-rolled parser/POS tagger used in Tunisiya.org with some ML methods. This paper presents a comparison of several different part-of-speech taggers trained on a hand-annotated Tunisian Arabic sample of... more
Measuring the amount of shared information between two documents is key to addressing a number of Natural Language Processing (NLP) challenges such as Information Retrieval (IR), Semantic Textual Similarity (STS), Sentiment Analysis (SA)... more
AraPlagDet is the first shared task that addresses the evaluation of plagiarism detection methods for Arabic texts. It has two subtasks, namely external plagiarism detection and intrinsic plagiarism detection. A total of 8 runs have been... more
With the easy access that the Internet provides to vast quantities of electronic data, textual plagiarism has become a major concern, especially in academic documents and in research and scientific institutions. So with the increasing amount of... more
The paper investigates a method for the style breach detection task. We developed a method based on mapping sentences into a high-dimensional vector space. Each sentence vector depends on the previous and next sentence vectors. As main... more
The anonymity of a text’s writer is an important topic for some domains, such as witness protection and anonymity programs. Stylometry can be used to reveal the true author of a text even if s/he wishes to hide his/her identity. In this... more
Various approaches to plagiarism detection have been implemented for authors' work and academic publications; with the increasing number of publications, there is a need for reliable and performant plagiarism detection.... more
Opinion mining applications work with a large number of opinion holders. This means a summary of opinions is important so we can easily interpret holders' opinions. The aim of this paper is to provide a feature-based summarization for... more
In this paper, we have used one of the preconditioned conjugate gradient algorithms with the quasi-Newton approximation, namely the BFGS preconditioned algorithm suggested by AL-Bayati and Aref (2001). In this paper we have... more
We describe our submitted algorithm to the text alignment sub-task of the plagiarism detection task in the PAN2014 challenge, which achieved a plagdet score of 0.855. By extracting contextual features for each document character and grouping... more
This paper presents our approach to the Author Clustering task at PAN 2017. We performed a hierarchical clustering analysis of different document features: typed and untyped character n-grams, and word n-grams. We experimented with two... more
Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multilingual content on the Web has increased cross-language text reuse to an... more
PAN 2018 explores several authorship analysis tasks enabling a systematic comparison of competitive approaches and advancing research in digital text forensics. More specifically, this edition of PAN introduces a shared task in... more
Plagiarism detection systems are critical in identifying instances of plagiarism, particularly in the educational sector when it comes to scientific publications and papers. Plagiarism occurs when any material is copied without the... more
Many approaches to opinion feature extraction rely only on a single review corpus, ignoring non-trivial disparities in word distribution across different corpora. In the proposed work, a new technique... more
This thesis deals with two major topics: plagiarism detection in Arabic documents, and plagiarism detection based on the writing style changes in the suspicious document, which is called intrinsic plagiarism detection. This approach is an... more
Topics: Topic 1: some concepts related to the machine programming of language. Topic 2: automatic processing of the morphological system. Topic 3: automatic processing of the syntactic system. Topic 4: automatic processing of the writing system. Topic 5: ... more
Extreme Learning Machine (ELM) is a new learning algorithm for feedforward neural networks for classification or regression with a single layer of hidden nodes, where the weights connecting inputs to hidden nodes are randomly assigned.... more
The Web is considered one of the main sources of customer opinions and reviews, which are represented in two formats: structured data (numeric ratings) and unstructured data (textual comments). Millions of textual comments about goods and... more
This article shows the differences between human and machine translations.
An opinion target is defined as the object about which users express their opinions, typically a noun or noun phrase. Opinion words are the words that are used to express users' opinions. Constructing an opinion words lexicon is also... more
Most opinion feature extraction techniques rely on mining patterns from a single evaluated corpus. A measure called Domain Relevance is used to identify candidate features from domain... more
Paper abstract: Members of the language research group at the School of Computing, University of Leeds, are interested in computational research on Arabic language processing. When in the past we ... very few of these tools, but we realized... more
Statistical Machine Translation (SMT) is considered a subfield of computational linguistics, which in turn is regarded as a branch of Artificial Intelligence (AI) dedicated to Natural Language Processing (NLP). The main purpose of this... more
Plagiarism detection means determining whether a document has been copied or stolen from another document. The main goal is to detect this by analyzing the writing style using the technique of intrinsic plagiarism detection. Text mining is... more
This paper describes a new tool that helps extract clean text from the Arabic Wikisource dump in order to build corpora. The tool's purpose is illustrated by the generation of a subcorpus from Wikisource, which is a step towards the... more