Arabic Language NLP

10 papers

4 followers

About this topic

Arabic Language NLP (Natural Language Processing) is a subfield of artificial intelligence focused on the interaction between computers and the Arabic language. It involves the development of algorithms and models to enable machines to understand, interpret, and generate Arabic text, addressing unique linguistic features and complexities inherent to the language.

Key research themes

1. How can linguistic lexicons bridging Modern Standard Arabic, Dialectal Arabic, and English improve NLP performance across Arabic varieties?

This research theme focuses on building and employing large-scale multilingual lexicons that link Dialectal Arabic (DA)—primarily Egyptian Arabic—with Modern Standard Arabic (MSA) and English. The goal is to address the challenges posed by the significant morphological, phonological, and lexical divergences between Arabic varieties, which negatively affect NLP tool performance when applied across dialects. By integrating lexicons enriched with detailed morphological and linguistic annotations, researchers aim to enhance both theoretical linguistic studies and computational applications such as machine translation, sentiment analysis, and morphological disambiguation.

Tharwa: A Large Scale Dialectal Arabic - Standard Arabic - English Lexicon

by Maryam Aminian

2015

Key finding: Tharwa provides a pioneering large-scale electronic tri-lingual lexicon connecting over 73,000 Egyptian Arabic entries with their equivalents in MSA and English. The lexicon includes detailed morphological features (POS,...

Tharwa: A Large Scale Dialectal Arabic-Standard Arabic-English Lexicon

by Ramy Nagah Eskander and

2024

Key finding: This extended overview of Tharwa emphasizes its role in filling the lexical resource gap for Egyptian Arabic as a pilot dialect, showcasing how it supports analyses of phonological and lexical variation from MSA. By capturing...

CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing

by Mai Oudah

2022

Key finding: CAMeL Tools integrates multiple Arabic NLP functionalities including morphological modeling, dialect identification, and named entity recognition, incorporating support for dialectal processing alongside MSA. This toolkit...

2. What role do large-scale Arabic text corpora play in advancing NLP applications and linguistic research?

This theme addresses the development and utilization of sizable and representative Arabic corpora as critical foundations for data-driven NLP and linguistic studies. Given Arabic's diglossic and dialectal properties, large annotated and raw corpora spanning various domains, dialects, and writing styles provide empirical evidence necessary for lexicography, syntactic analysis, semantic studies, and machine learning model training. The advancement of Arabic NLP systems depends heavily on the availability of such corpora, which improve resource coverage and performance across tasks like sentiment analysis, information retrieval, and machine translation.

1.5 billion words Arabic Corpus

by Ibrahim Abu El-Khair

2025, arXiv (Cornell University)

Key finding: This study presents a large-scale Arabic corpus containing over 1.5 billion words collected from newspaper articles across ten major news sources from eight Arabic countries, spanning fourteen years. The corpus is encoded in...

Building an international corpus of Arabic (ICA): Progress of Compilation Stage

by Magdy Nagi

2021, 7th Int. Conf. on Language Eng. Cairo, Egypt

Key finding: This paper outlines the creation of the International Corpus of Arabic (ICA), a representative and balanced Arabic corpus covering the entire Arab world. The ICA supports multiple fields including lexicography, grammar,...

Building a Multilingual and Mixed Arabic-English Corpus

by Mohammed Mustafa Ali

2025, pubs.cs.uct.ac.za

Key finding: This paper introduces a novel Arabic-English mixed multilingual corpus designed to reflect real-world scientific documents containing both languages in tightly integrated forms. Unlike most monolingual or parallel corpora,...

Part of Speech Tagger for Tunisian Arabic: Comparing manual and ML methods for under-resourced languages

by Karen McNeil

2022

Key finding: Leveraging a small manually annotated Tunisian Arabic corpus (6,000 words), this study compares rule-based and machine learning POS tagging methods for a severely under-resourced dialect. Despite the limited data size,...

3. How can morphological patterns and multiword expressions enhance Arabic NLP tool development and accuracy?

Arabic’s rich, templatic morphology and widespread use of fixed multiword expressions (MWEs) pose unique challenges and opportunities for NLP. Research in this theme involves leveraging schemes (morphological templates) to reduce lexical sparsity and build text classifiers and parsers, as well as the compilation and annotation of extensive Arabic MWE repositories. Accurate morphological analysis and MWE identification improve key NLP functions such as tokenization, parsing, and semantic interpretation, which are essential for applications ranging from sentiment analysis to machine translation.

Exploring the Potential of Schemes in Building NLP Tools for Arabic Language

by Mohamed Aziz Ben Mohamed

2023, International Arab Journal of Information Technology

Key finding: This study pioneers the exploitation of Arabic morphological schemes, rather than surface words, to reduce data sparsity and build NLP systems including a neural network text classifier and a probabilistic context-free...

Building an Arabic multiword expressions repository

by Mona Diab

2024

Key finding: The authors manually compile a large repository of Arabic MWEs from multiple dictionaries, annotate every word with detailed context-sensitive morphological analyses, and automatically tag occurrences of these MWEs in a large...

MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic

by Ramy Nagah Eskander

2024

Key finding: MADAMIRA integrates state-of-the-art morphological analysis and disambiguation by combining strengths of prior tools (MADA and AMIRA), handling both MSA and dialects. By leveraging rich morphological analyzers producing...

All papers in Arabic Language NLP

Evaluation of the performance of Moses statistical engine adapted to English-Arabic language combination

by Ouafa BENTERKI

Statistical Machine Translation (SMT) is considered a subfield of computational linguistics, which is in turn regarded as a branch of Artificial Intelligence (AI) dedicated to Natural Language Processing (NLP). The main purpose of this...

Call For Papers: December Issue: International Journal on Natural Language Computing (IJNLC)

by International Journal on Natural Language Computing (IJNLC)

2021

Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that...

Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection

by Imene Bensalem

FIRE 2015

AraPlagDet is the first shared task that addresses the evaluation of plagiarism detection methods for Arabic texts. It has two subtasks, namely external plagiarism detection and intrinsic plagiarism detection. A total of 8 runs have been...

Fig. 3. Two passages with the same words, but the 2nd passage contains some letters with diacritics (highlighted in green) and a substitution of some interchangeable letters (highlighted in yellow). A simple plagiarism detector may fail to match them.

Regarding this aspect, Magooda et al. reported the use of two language-dependent processes in the source retrieval phase: stemming queries before submitting them to the search engine, and extracting named entities. In the text alignment phase, words are stemmed in the skip-gram approach. Moreover, their methods pre-process the text by removing diacritics and normalizing letters. Alzahrani's method is nearly language independent. The only reported language-specific process was stop-word removal, applied as a pre-processing step on suspicious and source documents.
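The pre-processing mentioned above (removing diacritics and normalizing letters) can be sketched roughly as follows. This is a minimal illustration, not the participants' code: the function name `normalize_arabic` is hypothetical, and the particular set of "interchangeable" letters mapped here (alef variants, alef maqsura, ta marbuta) is an assumption, since the overview does not list which substitutions were applied.

```python
import re

# Arabic short-vowel and related marks (U+064B..U+0652), superscript alef (U+0670),
# plus the tatweel elongation character (U+0640), which is not a diacritic but is
# commonly stripped at the same stage.
_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Assumed normalization map; the source does not specify the exact letters.
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})


def normalize_arabic(text: str) -> str:
    """Strip diacritics, then collapse commonly interchanged letter forms."""
    return _MARKS.sub("", text).translate(_LETTER_MAP)


if __name__ == "__main__":
    vocalized = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"  # "kitab" with diacritics
    plain = "\u0643\u062A\u0627\u0628"                        # "kitab" without diacritics
    # After normalization the two spellings compare equal.
    print(normalize_arabic(vocalized) == normalize_arabic(plain))
```

Applying the same function to both the suspicious and the source text before comparison is what lets a simple matcher treat the two passages in Fig. 3 as identical.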

See [37] for more information on plagiarism detection evaluation measures. Table 4 provides the performance results of the participants’ methods as well as the baseline on the test corpus.

Fig. 5. Intrinsic plagiarism detection methods building block.

Table 1. Statistics on the external plagiarism detection training and test corpora.

Table 3. Text alignment approaches used in participants methods.

Table 5. Detailed performance of the participants' methods. In each measure, the underlined values are the highest per parameter.

5.2 Method Description

Table 6. Statistics on the intrinsic plagiarism detection training and test corpora.

5.3.3 Detailed Results

Table 8. Performance of the intrinsic plagiarism detection methods.

Unlike the external approach, we think that the performance of the intrinsic approach could be influenced by the document length and the percentage of plagiarism it incorporates. Table 9 presents the performance of Mahgoub et al. and the baseline methods on the test corpus according to the aforementioned parameters in addition to the case length. The segmentation strategy of the baseline does not produce short chunks, therefore the precision is not computed in detected short cases. However, the actual short cases are detected with high recall. For both methods, the best performance is obtained in the medium cases, the short documents and the documents with much plagiarism. Nonetheless, since we have only two methods, we cannot generalize any observed pattern.

Plagiarism Detection: A focus on the Intrinsic Approach and the Evaluation in the Arabic Language

by Imene Bensalem

2020

This thesis deals with two major topics: plagiarism detection in Arabic documents, and plagiarism detection based on the writing style changes in the suspicious document, which is called intrinsic plagiarism detection. This approach is an...

Chapter II provides a survey on the current methods of detecting plagiarism in Arabic documents. The survey shows that almost all the methods are based on uncovering plagiarism by comparing the suspicious document to the potential sources of plagiarism (the external approach). This motivates us to conduct the first experiments on Arabic documents that attempt to detect plagiarism by spotting the writing style changes (the intrinsic approach). In the light of these experiments, which utilise a small ad-hoc corpus, we felt the necessity to build a larger evaluation corpus that allows for a better assessment of the task performance on Arabic documents. Besides the technical aspect of Arabic plagiarism detection, this chapter discusses another ...

Figure II-1. External plagiarism detection methods building blocks.

... coining standard terminology. Therefore, we refer the reader to PAN overview papers ...

Chapter II. Arabic Plagiarism Detection: Critical Review

Figure II-2. Intrinsic plagiarism detection methods building block.

Figure II-3. Arabic plagiarism detection papers published from 2008 to June 2019.

As shown in Figure II-4, almost all the collected publications are papers that describe ...

Figure II-5. Proportion of Arabic plagiarism papers with and without “bad smells”.

Chapter III. Evaluation of Plagiarism Detection on Arabic Documents

Figure III-1. The insertion-based approach of building plagiarism detection evaluation corpora.

... important books of building corpora (McEnery et al. 2006), the copyright-free documents are ...

Figure III-2. Different representations of the same word with and without letters’ diacritics.

... respectively. The symbols |s_act| and |s_det| are, respectively, the lengths of s_act and s_det in ...

For a single actual plagiarism case, s_act, a plagiarism detection method may output multiple detections (separate or overlapping). Thus, granularity is used to average the number of the detected cases for each actual case, as depicted in formula 3. Act_det ⊆ Act is the set of the actual cases that have been detected, and Det_s_act ⊆ Det is the set of the detected cases that intersect with a given actual case s_act. The optimal value of the granularity is 1, which means that for each actual case s_act, no more than a single case has been detected (i.e. not many overlapping or adjacent cases).

... The symbols |Act| and |Det| are the number of actual and detected cases, ...
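The granularity described above (the average number of detections per detected actual case) can be sketched in a few lines. This is an illustrative sketch only: formula 3 itself is not reproduced in this excerpt, the representation of cases as (start, end) character offsets is an assumption, and returning 1.0 when nothing is detected is a convention chosen here for convenience.

```python
from typing import List, Tuple

# Assumed representation: a case is a (start, end) pair of character offsets,
# with the end offset exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def granularity(actual: List[Span], detected: List[Span]) -> float:
    """Average number of detections intersecting each detected actual case."""
    per_case = [sum(overlaps(s, d) for d in detected) for s in actual]
    hit = [count for count in per_case if count > 0]  # actual cases that were detected
    if not hit:
        return 1.0  # assumed convention when no actual case is detected
    return sum(hit) / len(hit)


if __name__ == "__main__":
    actual_cases = [(0, 100), (200, 300)]
    # The first actual case is reported as two fragments, the second as one.
    detections = [(10, 40), (50, 90), (210, 290)]
    print(granularity(actual_cases, detections))  # (2 + 1) / 2 = 1.5
```

A value above 1, as in the usage example, signals that the detector splits single plagiarism cases into several reports.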

Figure IV-2. Taxonomy of the building blocks of intrinsic plagiarism detection methods.

The pre-processing heuristics are called so because they operate before the fragment-level analysis. These heuristics aim to filter out the irrelevant information that may disrupt the style analysis (through cleaning, normalisation and genre analysis) or reduce the computation by taking an early decision on the document (through checking whether the document is ...

Figure IV-3. Feature extraction at fragment and document levels. The symbols s_1, ..., s_n denote the fragments and f_1, ..., f_m denote the features.

... kinds of units: (i) one character, (ii) a sequence of characters, or (iii) a class of characters. See, ...

The IPD methods that use supervised learning are listed in Table IV-9. The pitfall of IPD methods based on supervised learning is that they may suffer from a lack of training data. Even when training data is available, the plagiarised and non-plagiarised examples will be imbalanced, since original texts are naturally more abundant. This issue renders IPD a classification problem with skewed classes, a known problem in machine learning that may lead to training biased classifiers. In (Polydouri et al. 2017, 2018), the authors attempted to mitigate this problem by using sampling techniques on the training corpus, aiming to construct a balanced dataset. This problem can also be tackled by using classification algorithms designed to function with datasets of skewed classes, such as Complement Naive Bayes (Rennie et al. 2003). In that context, we used this algorithm in one of our IPD experiments and it proved its effectiveness in comparison with the original Naive Bayes (Bensalem et al. 2014b).
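To make the class-imbalance remedy concrete, here is a minimal sketch of one possible sampling technique, random oversampling of the minority class. The passage above only says "sampling techniques", so this particular choice is an assumption and is not claimed to be what Polydouri et al. used; the function name `oversample_minority` and the toy feature vectors are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def oversample_minority(
    examples: Sequence, labels: Sequence, seed: int = 0
) -> Tuple[List, List]:
    """Duplicate randomly chosen examples until every class is as large as the biggest one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


if __name__ == "__main__":
    # Toy fragment feature vectors: 8 original fragments (0) and 2 plagiarised ones (1).
    x = [[i / 10] for i in range(10)]
    y = [0] * 8 + [1] * 2
    bx, by = oversample_minority(x, y)
    print(Counter(by))  # both classes now have 8 examples
```

Oversampling should only be applied to the training portion of a corpus, so that evaluation still reflects the natural class distribution.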

Chapter IV. Intrinsic Plagiarism Detection: a Survey

Figure IV-6. Illustration of the density-based outlier detection for intrinsic plagiarism detection. Plagiarised and non-plagiarised sections can be separated if their values of a feature f_i are differently distributed (adapted after (Stein et al. 2011)).

As for the priors, i.e., P(Class = plag_0) and P(Class = plag_1), which are the portions of each class among all the fragments, the authors of this approach stated that they are estimated either by an impurity assessment (meta information on the document) or by the maximum likelihood estimator, which assumes that the classes are uniformly distributed, i.e., half of the fragments are plagiarised and the other half are not. However, it has not been stated in the paper (Stein et al. 2011) which of these two options is adopted in the conducted experiments.

Figure IV-7. Steps of the distance-based outlier detection for intrinsic plagiarism detection. In figure (A), the distance is computed between the fragments and the document; in figure (B), the distance is computed between each pair of fragments.

... are averaged. Hence, for both cases, the document is represented with a vector of distances ...

Chapter V. Character N-grams as the Only Intrinsic Evidence of Plagiarism

Figure V-3. Steps for computing the n-gram classes of a document. The parameter n is the length of n-grams and m is the number of classes. In this example m = 3 (class labels are from 0 to 2).

... following subsections provide further details on these three stages.

3.2. N-gram Classification

... fragment, and its maximum value is the number of fragments in d if ng_i occurs in each ...

Figure V-5. Illustration of two ways of computing the proportion of n-gram classes in a fragment. Panels: computing the NFCP features by considering the repetition of n-grams in the fragment; computing the NFCP features without considering the repetition of n-grams in the fragment.

Figure V-6. Average InfoGain of the features generated by different variants of the extraction method.

... the subsequent experiments, we will adopt the best variant (S1RO) without mentioning that every time.

Figure V-7. F-measure of our method in comparison with the best methods in the PAN intrinsic plagiarism detection competitions.

Figure V-8. The 54 classes obtained from the n-grams of a document by classifying them into different numbers of classes, m. For example, when m = 2 (the top of the figure), the n-grams of the document are classified into 2 classes labelled 0 and 1. The former represents n-grams of low frequency, and the latter represents n-grams of high frequency.

Practically, for each language, a total number of 540 classifiers (in each iteration), corresponding ... documents including 34765 and 5547 plagiarism cases, respectively. Once the 540 features ...

Figure V-9. The distribution of performance of the NFCP features computed on English text (a) and Arabic text (b).

Figure V-11. Sensitivity of NFCP features performance to the n-gram length on English (left) and Arabic (right).

Figure V-12. Performance of combined NFCP features selected using different techniques.

7 Sensitivity Analysis of Stamatatos’ Method Performance to N-grams Frequency and Length

Figure VI-1. Summary of the discussed future works and research prospects. The arrow between some future works means that each one of them implies the other.

* See the conclusion of Chapter V (pp. 122-123) for further details on this future work.

Table II-1. Bad smells that we detected in Arabic plagiarism detection papers.

Instead of a simplistic approach of including/excluding papers from our literature review, we applied the approach proposed in (Menzies and Shepperd 2019), which consists in assessing the quality of papers in terms of twelve criteria. The authors called these criteria “bad smells” and defined them as the surface issues that might be detected in research publications and that can be indications of serious problems. The scope of Menzies and Shepperd’s investigation is software analytics. Still, the authors noted that while some of the proposed “bad smells” are specific to software analytics, others are general and thus applicable to other scientific domains. Hence, we selected four “bad smells” (from twelve) that we judged appropriate for plagiarism detection research. Table II-1 lists them in the first column. In the second column, we determine exactly how these “bad smells” emerged in the examined Arabic plagiarism detection papers.

Footnote 10: See Chapter III (Section 5.3.1) for details on the standardised evaluation measures of plagiarism detection.

Footnote 11: Most of the “bad smells” that we did not consider concern the statistical significance, which is usually not utilised in plagiarism detection studies. Note that this does not mean that this technique is not applicable for plagiarism detection evaluation but rather that its use is uncommon even in the best studies. This fact might be attributed to the lack of practical guidelines on hypothesis testing that may accompany the current plagiarism detection evaluation measures.

Table II-2. Overview of the number of papers considered in our study.

3.2.2 Results

Table II-3. Scope of the examined Arabic plagiarism detection papers.

3.3. Methods and Evaluation Corpora

In this section, we review the 24 selected works (from the previous step) in terms of their ...

Table II-4. Papers on Arabic plagiarism detection using the external approach.

Footnote: This method does not use exactly the principle of creating queries to a search engine to retrieve the candidate document, but it compares the suspicious and the source documents at three levels, starting from the document level, then the paragraph level and finally the sentence level. If no similarity is detected at the document level, the following levels are not considered. Thus, we consider the document-level comparison as the candidate retrieval module.

Table II-5. Description of the corpora used to evaluate plagiarism detection methods on Arabic documents. The character '-' is used when no information is provided.

Table II-7. Performance evaluation.

4.2.2 Results and Discussion

We used three measures to evaluate the performance of discriminators: Precision (equation 1) ...

Table II-8. Combination’s results: baseline vs. the most precise voting schemes.

1.3 Experiment 2: Combining Discriminators

Table III-1. Comparison between approaches to creating suspicious documents. The symbol V indicates an advantage, and * indicates a disadvantage.

Table III-2. AraPlagDet shared task schedule.

* July 16 is the release date of a sample of the training corpus. The complete training corpus was released on August 10.

Table III-3. Statistics of the ExAra corpus.

Table III-4. Source retrieval approaches with their building blocks used in the participants’ methods. Each column describes an approach in terms of its building blocks. The first line provides a concise description of the approach, and the second line indicates the methods that employed each approach. For example, the Magooda_2 method used two approaches: sentence-based and keyword-based indexing.

With respect to Alzahrani’s method, it is suited to an offline scenario, i.e., when the source of plagiarism is local and not too large, as in the case of detecting plagiarism between students’ assignments. This is for two reasons: (i) its retrieval model is not structured to be used with search engines (for example, there is no query formulation, see Table III-4); and (ii) it is based on fingerprinting all the source documents and entails an exhaustive comparison between the n-grams of the suspicious document and those of each source document, which is not workable if the source of plagiarism is extremely large, like the web. Still, even with the intention to be used offline, it would be better to use retrieval techniques that allow for the processing of a large number of documents in a reasonable time, such as inverted indexes. Malcolm and Lane (2009) discuss the importance of scalability even for offline plagiarism detectors.

Table III-5. Text alignment approaches with their building blocks used in the participants’ methods.

Language Dependence

Table III-6. Performance of the external plagiarism detection methods on the test corpus (columns: Method, Precision, Recall, Granularity, Plagdet).

... comparison with what has been achieved by the state-of-the-art methods (see for example the ...

Table III-7. Detailed performance of the participants’ methods. In each measure, the underlined values are the highest per parameter.

Table III-8. Statistics of InAra corpus.

Criterion 1: Each host document must be written by one author only. If the document is multi- ...

Table III-9. Sources of texts used to build InAra corpus.

Table III-11. Performance of the intrinsic plagiarism detection methods.

6.3.2. Detailed Results

Table III-12. Detailed performance of the intrinsic plagiarism detection methods.

... a module that detects and filters out the Quranic citations. Such a module can rely on the external approach, whereby the whole text of the document is compared to the Quran corpus ...

Figure IV-1. Timeline of some milestones related to intrinsic plagiarism detection.

Table IV-1. Intrinsic plagiarism detection and its related research areas.

The major drawback of this perception emerges when the plagiarism constitutes the ...

Intrinsic plagiarism detection in its essence could be seen as an anomaly-of-authorship detection at fragment level (Guthrie et al. 2007), where plagiarism is the anomaly, and the text written in the plagiarist’s own style is the normal part. In fact, most of the current IPD methods deal with IPD as an anomaly detection problem. That is, they are based on the assumption that the normal data (original part) is the majority, and hence can be characterised, and the abnormal data (plagiarised part) is sparse and thus difficult to characterise. Therefore, methods based on this assumption build a writing style model of the whole suspicious document, and consider as plagiarism any fragment deviating from this general style (Mahgoub et al. 2015; Muhr et al. 2010; Oberreuter and Velasquez 2013; Stamatatos 2009a; Suarez et al. 2010; Zechner et al. 2009).

Table IV-2. Pre-processing heuristics in intrinsic plagiarism detection methods.

... document as plagiarism-free if the variance of the style change function is not significant. Practically speaking, this implementation checks the significance of the style variance by comparing the standard deviation σ of the style change function to a predefined threshold τ_s. If σ < τ_s then the heuristic marks the document as plagiarism-free.
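The early-decision heuristic above reduces to a single comparison. The sketch below assumes the style change function is available as a list of per-fragment values and uses the population standard deviation; both choices, the function name `is_plagiarism_free`, and the threshold value in the usage example are illustrative assumptions rather than details from the thesis.

```python
from statistics import pstdev
from typing import Sequence


def is_plagiarism_free(style_change: Sequence[float], threshold: float) -> bool:
    """Mark the document as plagiarism-free when the style change values barely vary.

    style_change: one style-change value per fragment (assumed representation).
    threshold: the predefined threshold on the standard deviation.
    """
    sigma = pstdev(style_change)  # population standard deviation (an assumption)
    return sigma < threshold


if __name__ == "__main__":
    uniform_style = [0.50, 0.50, 0.50, 0.50]
    mixed_style = [0.10, 0.90, 0.10, 0.90]
    print(is_plagiarism_free(uniform_style, threshold=0.02))  # True: no variation at all
    print(is_plagiarism_free(mixed_style, threshold=0.02))    # False: sigma is 0.4
```

Documents that pass this check can skip the fragment-level analysis entirely, which is where the computational saving comes from.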

it in a structured manner. Feature extraction in natural language processing (NLP) structures

Table IV-4. The units from which the character features are extracted, with examples extracted from a sentence.

... features are computed, we classify character features depending on whether the unit is defined ...

Character n-grams are sequences of contiguous characters of a predefined length extracted from the text without considering any linguistic relationship between them. Despite their simplicity, these features have proven their effectiveness in many NLP applications, such as authorship attribution (Cavnar and Trenkle 1994; Stamatatos 2016), native language identification (Kulmizev et al. 2017) and opinion spam detection (Hernandez Fusilier et al. 2015). Based on their reputability as stylistic markers, notably for authorship attribution, they have been employed in intrinsic plagiarism detection. As a matter of fact, Stamatatos (2009a) was the first to develop a character-n-grams-based IPD method. Although it utilises only these language-independent features, Stamatatos’ method was ranked first in the PAN 2009 intrinsic plagiarism detection shared task. This seminal method, by its simplicity, inspired other researchers, who reproduced it partially or fully in their works (Kasprzak and Brandejs 2010; Kestemont et al. 2011; Kuta and Kitowski 2014; Rao et al. 2011).
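As a concrete illustration of the definition above, the following sketch extracts character n-grams and builds a relative-frequency profile. The helper names `char_ngrams` and `ngram_profile` are hypothetical, and the profile is only one simple way to turn n-grams into features; it is not presented as Stamatatos' method.

```python
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character sequences of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text: str, n: int) -> Dict[str, float]:
    """Relative frequency of each character n-gram in the text."""
    grams = char_ngrams(text, n)
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.items()} if total else {}


if __name__ == "__main__":
    word = "\u0643\u062A\u0627\u0628"  # Arabic "kitab" (book), four letters
    print(char_ngrams(word, 2))        # three overlapping bigrams
    print(ngram_profile("abab", 2))    # {'ab': 0.666..., 'ba': 0.333...}
```

Because the extraction works on raw characters, the same code applies unchanged to Arabic and English text, which is the language independence the passage refers to.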

Table IV-6. Some linguistic aspects manipulated to produce different sentence structures.

... detection wherein the writing style is analysed at the fragment level. Nonetheless, these measures are included in numerous intrinsic plagiarism detection methods, which are: (Meyer zu Eißen et al. 2007), (Stein et al. 2011), (Kern et al. 2012), and (Carnahan et al. 2014). On the other hand, Meyer zu Eißen and his colleagues (2007; 2006) proposed a new vocabulary richness measure called Average Word Frequency Class, which is argued to be ideal for IPD due to its stability with different text lengths. Later, this feature has been used in other methods, such as (Stein and Meyer zu Eißen 2007), (Zechner et al. 2009), and (Carnahan et al. 2014). In addition, variants of this measure are used in (Polydouri et al. 2017).

... writing style of a fragment and that of the whole document. Then, all fragments with a ...

Table IV-9. The supervised learning-based methods used for intrinsic plagiarism detection.

4.4.2 Clustering

Clustering is an unsupervised machine learning approach that creates, from a given set of elements, subsets that group together the similar elements. The similarity between the elements is assessed based on their feature vectors. The number of clusters to create should be determined a priori for most of the algorithms. This paradigm is well suited for multi-author document segmentation, wherein each cluster involves the fragments of similar writing style (see, e.g., (Akiva 2012; Kern et al. 2012)), and hence, the number of the clusters represents the number of the authors involved in writing the document. In the existing intrinsic plagiarism methods, the number of the clusters created from the suspicious document fragments is typically two; one of them groups the plagiarism-free fragments and the other one contains the plagiarised fragments.

Table IV-11. Post-processing heuristics in intrinsic plagiarism detection methods.

To complete the picture on intrinsic plagiarism detection, it is necessary to talk about its effectiveness. In fact, despite the variety of heuristics and stylistic features used in the methods (as shown in Section 4), their performance scores are still poor. To the best of our knowledge, few methods, such as (Stamatatos 2009a) and (Oberreuter et al. 2011b; Oberreuter and Velasquez 2013), reached an F-measure greater than 0.3 using a standardised evaluation framework. Other methods, for instance Stein et al. (2011), Tschuggnall and Specht (2013c) and Polydouri et al. (2017, 2018), obtained relatively higher scores. Nonetheless, the two former methods have been evaluated on only subsets of the evaluation corpus, and the evaluation of the latter method is based on a modified version of the performance measures.

Footnote: The performance measures used by Polydouri et al. (2017, 2018) are computed based on the number of sentences and not the number of characters. For example, given a plagiarised fragment composed of 4 sentences, if the software detects 2 of them, the recall measured on this fragment, according to Polydouri et al., would be 0.5. However, the standardised recall score (Potthast et al. 2010c) could be more or less different, since it is the ratio of the length, in characters, of the 2 detected sentences to the length of the full plagiarised fragment.
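The difference between the two recall computations in the footnote can be checked with a small worked example. The sketch below uses made-up sentence lengths and, as a simplifying assumption, ignores whitespace between sentences; the function names are hypothetical.

```python
from typing import List, Set


def sentence_recall(sentences: List[str], detected: Set[int]) -> float:
    """Recall counted in sentences: detected sentences / all plagiarised sentences."""
    return len(detected) / len(sentences)


def character_recall(sentences: List[str], detected: Set[int]) -> float:
    """Recall counted in characters: detected length / full fragment length."""
    total = sum(len(s) for s in sentences)
    found = sum(len(sentences[i]) for i in detected)
    return found / total


if __name__ == "__main__":
    # A plagiarised fragment of 4 sentences with lengths 10, 50, 20 and 20 characters.
    fragment = ["a" * 10, "b" * 50, "c" * 20, "d" * 20]
    detected = {0, 2}  # the software finds the 1st and 3rd sentences
    print(sentence_recall(fragment, detected))   # 2 / 4 = 0.5
    print(character_recall(fragment, detected))  # 30 / 100 = 0.3
```

With these lengths the sentence-based score is noticeably more generous, which illustrates why scores computed under the two schemes are not directly comparable.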

Table V-1. The frequency and length of character n-grams in intrinsic plagiarism detection methods

5 The table lists only the methods that provide information on the used character n-grams.

For example, in (Kestemont et al. 2011), representing the text using only the most frequent n-grams extracted from a corpus was based on an efficiency reason, which is to reduce the computation. However, no experiment has been done to check the impact of this reduction of the number of the used n-grams on performance or to prove that high-frequency n-grams are more effective than the rest of n-grams with lesser frequency. In (Kuznetsov et al. 2016), the frequencies of both rare and frequent n-grams in a sentence were among the features used to quantify the writing style incoherence between this sentence and the rest of the document. However, the rationale behind these choices has not been explained.

Table V-3. Statistics on the evaluation corpora

Chapter V. Character N-grams as the Only Intrinsic Evidence of Plagiarism

1 Datasets and Performance Measures

For our experiments, we used three evaluation corpora in English and one corpus in Arabic, the latter with its two parts, training and test. The English corpora (Potthast et al. 2010c) were developed for the international competition on plagiarism detection (PAN) in 2009, 2010 and 2011 to evaluate IPD methods (Potthast et al. 2009, 2010a, 2011). We used specifically the test part of each corpus. The Arabic corpus (InAra) (Bensalem et al. 2013a, 2013b) was built by ourselves, following PAN annotation standards, and was used in AraPlagDet 2015, the first plagiarism detection competition on Arabic documents (Bensalem et al. 2015).

Table V-4. Evaluation setting of NFCP features

to the 540 NFCP features, have been trained and tested using the five datasets described in Section 5. Explicitly, cross-validation has been performed between each pair of corpora, i.e., each corpus is used separately, on the one hand, for training a classification model and, on the other hand, for testing the models trained on the other corpora of the same language. Consequently, we obtained for each NFCP feature six classification results on the English corpora and two classification results on the Arabic corpus, as illustrated in Table V-4. The F-measure scores are then averaged for each language to be used in our analysis.
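The count of six English and two Arabic results follows from taking every ordered (train, test) pair of distinct corpora within a language. A minimal sketch of that enumeration is given below; the corpus labels are hypothetical shorthand for the five datasets, and `average_f_measure` is an illustrative helper, not the authors' code.

```python
from itertools import permutations

# Hypothetical labels for the five datasets, grouped by language.
corpora = {
    "English": ["PAN-PC-09", "PAN-PC-10", "PAN-PC-11"],
    "Arabic": ["InAra-train", "InAra-test"],
}

def train_test_pairs(names):
    """All ordered (train, test) pairs of distinct corpora."""
    return list(permutations(names, 2))

def average_f_measure(scores):
    """Mean of the F-measure scores obtained for one language."""
    return sum(scores) / len(scores)

pairs = {lang: train_test_pairs(names) for lang, names in corpora.items()}
for lang, lang_pairs in pairs.items():
    print(lang, len(lang_pairs))
    for train, test in lang_pairs:
        print("  train on", train, "-> test on", test)
```

Three English corpora give 3 × 2 = 6 ordered pairs and the two Arabic parts give 2 × 1 = 2, which matches the numbers of classification results reported per feature.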

Table V-5. The configurations that produce the best NFCP features

Table V-6. Cumulative percentages computed on the 3-grams of the suspicious-document01020 of PAN-PC-09

profiles is based on the cumulative percentages that we computed on the frequency distribution
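One way to compute cumulative percentages over a character 3-gram frequency distribution is sketched below. The excerpt does not state how the 3-grams are ordered, so ranking them from most to least frequent is an assumption of this example, and the sample text is hypothetical.

```python
from collections import Counter

def cumulative_percentages(text, n=3):
    """Return (n-gram, count, cumulative %) rows, most frequent n-gram first."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    running, rows = 0, []
    # Assumption: n-grams are ranked by decreasing frequency.
    for gram, count in Counter(grams).most_common():
        running += count
        rows.append((gram, count, 100.0 * running / total))
    return rows

# Hypothetical toy text standing in for a suspicious document.
for gram, count, cumulative in cumulative_percentages("abababc"):
    print(gram, count, cumulative)
```

The last row always reaches 100%, since the running sum then covers every 3-gram occurrence in the text.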

Table VI-1. Assumptions made when building the evaluation corpora of intrinsic plagiarism detection

10 They have been neglected in the context of IPD. However, they have been addressed in the context of EPD.

Conceivably, a plagiarism case becomes invisible to an intrinsic plagiarism detection method if the plagiarist succeeds in obfuscating it by rewriting it in her/his own writing style, so that the contrast between it and the rest of the document fades away. On the other hand, a plagiarism case becomes invisible to an external plagiarism detection method if the plagiarist succeeds in obfuscating it so that the similarity with its source is concealed. Therefore, the obfuscations aiming to defeat the external plagiarism detection systems will not certainly


USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

The Web is considered one of the main sources of customer opinions and reviews, which are represented in two formats: structured data (numeric ratings) and unstructured data (textual comments). Millions of textual comments about goods and...