Arabic Language NLP Research Papers

Influence of NLP Models on Arabic Linguistic Applications

2025

This study investigates the influence of various Natural Language Processing (NLP) models on the accuracy and efficiency of Arabic linguistic applications. Employing a systematic review and comparative analysis, the research evaluates... more


Challenges of Arabic Language Processing in AI Systems

by Daoud Jerab and

2025

This study investigates the multifaceted challenges of Arabic language processing in artificial intelligence (AI) systems, emphasizing linguistic, technical, and ethical dimensions. Employing a qualitative analysis of current research, it... more


Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection

by imène boukhalfa

2024

AraPlagDet is the first shared task that addresses the evaluation of plagiarism detection methods for Arabic texts. It has two subtasks, namely external plagiarism detection and intrinsic plagiarism detection. A total of 8 runs have been submitted... more


The Use of Computerized Parametric Typology in the Generation of Single-Family Housing Designs

by Dhuha Abdulgani Al-Kazzaz

2023, Al-Rafidain Engineering Journal (AREJ)

In the era of digital architecture, parametric design plays a fundamental role in the generative architectural design process. The most important of its benefits are that it allows a visual representation of the design process, a designer... more


Integration of Opinion into Customer Analysis Model

by Dr Abdulmohsen Algarni

2023, 2011 IEEE 8th International Conference on e-Business Engineering

As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. In order to enhance customer satisfaction and their shopping experiences, it has become important to analyze customers... more


The Use of Computerized Parametric Typology in the Generation of Single-Family Housing Designs

by Dhuha Al-kazzaz

2022, Al-Rafidain Engineering Journal (AREJ)

In the era of digital architecture, parametric design plays a fundamental role in the generative architectural design process. The most important of its benefits are that it allows a visual representation of the design process, a designer... more


Part of Speech Tagger for Tunisian Arabic: Comparing manual and ML methods for under-resourced languages

by Karen McNeil

2022

Unpublished paper comparing the hand-rolled parser/POS tagger used in Tunisiya.org with some ML methods. This paper presents a comparison of several different part-of-speech taggers trained on a hand-annotated Tunisian Arabic sample of... more


2L-APD: A Two-Level Plagiarism Detection System for Arabic Documents

by Hadda Cherroun

2022, Cybernetics and Information Technologies

Measuring the amount of shared information between two documents is a key to address a number of Natural Language Processing (NLP) challenges such as Information Retrieval (IR), Semantic Textual Similarity (STS), Sentiment Analysis (SA)... more


Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection

by Imene Bensalem

2022, Forum for Information Retrieval Evaluation

AraPlagDet is the first shared task that addresses the evaluation of plagiarism detection methods for Arabic texts. It has two subtasks, namely external plagiarism detection and intrinsic plagiarism detection. A total of 8 runs have been... more

Fig. 3. Two passages with the same words, but the 2nd passage contains some letters with diacritics (highlighted in green) and a substitution of some interchangeable letters (highlighted in yellow). A simple plagiarism detector may fail to match them.

Regarding this aspect, Magooda et al. reported the use of two language-dependent processing steps in the source retrieval phase: stemming queries before submitting them to the search engine and extracting named entities. In the text alignment phase, words are stemmed in the skip-gram approach. Moreover, their methods pre-process the text by removing diacritics and normalizing letters. Alzahrani's method is nearly language independent. The only reported language-specific process was stop word removal, applied as a pre-processing step on suspicious and source documents.
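The pre-processing described above (removing diacritics and normalizing interchangeable letters) can be illustrated with a minimal Python sketch. The function name `normalize_arabic` and the specific letter mapping are assumptions chosen for illustration; the overview does not state which letters the participants' methods actually normalized.

```python
import re

# Arabic diacritics (tashkeel): U+064B (fathatan) to U+0652 (sukun),
# plus U+0670 (superscript alef).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character, carries no lexical meaning

# Assumed mapping for commonly interchanged letters (illustrative only).
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda      -> bare alef
    "\u0649": "\u064A",  # alef maqsura         -> yeh
    "\u0629": "\u0647",  # teh marbuta          -> heh
})


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, then unify interchangeable letters."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    return text.translate(_LETTER_MAP)
```

After this step, a diacritized passage and its plain counterpart map to the same string, so an exact-match detector no longer misses them for this reason alone.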

See [37] for more information on plagiarism detection evaluation measures. Table 4 provides the performance results of the participants’ methods as well as the baseline on the test corpus.

Fig. 5. Intrinsic plagiarism detection methods building block.

Table 1. Statistics on the external plagiarism detection training and test corpora.

Table 3. Text alignment approaches used in participants' methods.

Table 5. Detailed performance of participants' methods. In each measure, the underlined values are the highest per parameter.

5.2 Method Description

Table 6. Statistics on the intrinsic plagiarism detection training and test corpora.

5.3.3 Detailed Results

Table 8. Performance of the intrinsic plagiarism detection methods.

Unlike the external approach, we think that the performance of the intrinsic approach could be influenced by the document length and the percentage of plagiarism it incorporates. Table 9 presents the performance of Mahgoub et al. and the baseline methods on the test corpus according to the aforementioned parameters in addition to the case length. The segmentation strategy of the baseline does not produce short chunks, therefore the precision is not computed in detected short cases. However, the actual short cases are detected with high recall. For both methods, the best performance is obtained in the medium cases, the short documents and the documents with much plagiarism. Nonetheless, since we have only two methods, we cannot generalize any observed pattern.


Introducing an Automated Technique for Bilingual Plagiarism detection of English-Persian Documents

by soraya enayati

2022

With the easy access that the Internet has provided to vast quantities of electronic data, textual plagiarism has become a major concern, especially in academic documents and in research and scientific institutions. So with increasing rate of amount of... more

Fig. 1. Overall architecture of bilingual plagiarism detection (English - Persian)

The proposed method is shown in Figure 1 with three main stages (database initialization and processing stage, storage stage, execution stage) and with sub-procedures and their relationships.

Fig. 2. The process of the execution stage of the proposed method

The process of the execution stage has been carried out as shown in Figure 2:

Table 1. Assessment results obtained from testing 100 samples of Persian language text.

Fig. 3. Comparison of cosine similarity (cossim) values for test texts with and without morphological analysis


Style Breach Detection with Neural Sentence Embeddings

by Rita Kuznetsova

2022

The paper investigates a method for the style breach detection task. We developed a method based on mapping sentences into high dimensional vector space. Each sentence vector depends on the previous and next sentence vectors. As main... more


SU@PAN'2016: Author Obfuscation

by Ivan Koychev

2022

The anonymity of a text’s writer is an important topic for some domains, such as witness protection and anonymity programs. Stylometry can be used to reveal the true author of a text even if s/he wishes to hide his/her identity. In this... more


An Enhanced Framework for Extrinsic Plagiarism Avoidance for Research Article

by Shamas Imran

2022

Various approaches have been implemented for plagiarism detection in authors' work and academic publications. With the increasing amount of publications, there is a need to create reliable and performant plagiarism detection.... more

COMPARISON OF STATISTICAL AND SEMANTICAL MODELS

EXTRACTED PAPERS BASED ON THE CRITERIA IN LITERATURE


Feature-Based Opinion Summarization for Arabic Reviews

by Alaa El-Halees

2021, 2018 International Arab Conference on Information Technology (ACIT)

Opinion mining applications work with a large number of opinion holders. This means a summary of opinions is important so we can easily interpret holders' opinions. The aim of this paper is to provide a feature-based summarization for... more


Using the Update of Conditional BFGS in Constrained Optimization

by Abbas Y. Al-Bayati

2021, AL-Rafidain Journal of Computer Sciences and Mathematics

In this paper, we have used one of the preconditioned conjugate gradient algorithms with the Quasi-Newton approximation, namely the BFGS preconditioned algorithm which was suggested by (AL-Bayati and Aref, 2001). In this paper we have... more


Plagiarism Alignment Detection by Merging Context Seeds

by Pashutan Modaresi

2021

We describe the algorithm we submitted to the text alignment sub-task of the plagiarism detection task in the PAN2014 challenge, which achieved a plagdet score of 0.855. By extracting contextual features for each document character and grouping... more


Author Clustering using Hierarchical Clustering Analysis

by David Pinto

2021

This paper presents our approach to the Author Clustering task at PAN 2017. We performed a hierarchical clustering analysis of different document features: typed and untyped character n-grams, and word n-grams. We experimented with two... more


CLEU - A Cross-Language English-Urdu Corpus and Benchmark for Text Reuse Experiments

by iqra muneer

2021, Journal of the Association for Information Science and Technology

Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multilingual content on the Web has increased cross-language text reuse to an... more


Arabic and Quranic Computational Linguistics Projects at the University of Leeds

by Bayan Abu Shawar

2021


Evaluating Machine Translation from Arabic into Turkish: Google Translate and Yandex as a Model

by Abdulmuttalip IŞIDAN

2021, Dil ve Edebiyat Araştırmaları


Arabic and Quranic Computational Linguistics Projects at the University of Leeds

by Eric S Atwell

2021


Overview of PAN 2018

by Efstathios Stamatatos

2021, Lecture Notes in Computer Science


PAN 2018 explores several authorship analysis tasks enabling a systematic comparison of competitive approaches and advancing research in digital text forensics. More specifically, this edition of PAN introduces a shared task in cross-domain authorship attribution, where texts of known and unknown authorship belong to distinct domains, and another task in style change detection that distinguishes between single-author and multi-author texts. In addition, a shared task in multimodal author profiling examines, for the first time, a combination of information from both texts and images posted by social media users to estimate their gender. Finally, the author obfuscation task studies how a text by a certain author can be paraphrased so that existing author identification tools are confused and cannot recognize the similarity with other texts of the same author. New corpora have been built to support these shared tasks. A relatively large number of software submissions (41 in total) was received and evaluated. Best paradigms are highlighted while baselines indicate the pros and cons of submitted approaches.

(stylistic fingerprint) but she also shares some properties with other people of similar background (age, gender, education, etc.) It is quite challenging to define or measure both personal style (for each individual author) and collective style (males, females, young people, old people, etc.). In addition, it remains unclear what one should modify in her texts in order to attempt to hide her identity or to mimic the style of another author. This edition of PAN deals with these challenging issues.

Author identification puts emphasis on the personal style of individual authors. The most common task is authorship attribution where there is a set of candidate authors (suspects), with samples of their texts, and one of them is selected as the most likely author of a text of disputed authorship [31]. This can be a closed-set (one of the suspects is surely the true author) or an open-set (the true author may not be among the suspects) attribution case. This edition of PAN focuses on closed-set cross-domain authorship attribution, that is, when the texts unquestionably written by the suspects and the texts of disputed authorship belong to different domains. This is a realistic scenario suitable for several applications. For example, imagine the case of a crime novel published anonymously when all candidate authors have only published fantasy novels [13] or a disputed tweet when the available texts written by the suspects are newspaper articles.

To be able to control the domain of texts, we turned to so-called fanfiction [11]. This term refers to the large body of contemporary fiction that is nowadays created by non-professional authors ('fans'), who write in the tradition of a well-known source work, such as the Harry Potter series by J.K. Rowling, that is sometimes called the 'canon'. These writings or 'fics' within such a 'fandom' heavily borrow characters, motives, settings, etc. from the source fandom. Fanfiction provides excellent material to study cross-domain authorship attribution since most fans are active in multiple fandoms.

Another important dimension in author identification is to intrinsically analyse a document, possibly written by multiple authors and identify the contribution of each co-author. The previous edition of PAN aimed to find the exact border positions within a document where the authorship changes. Taking the respective results into account which have shown that the problem is quite hard [39], we substantially relaxed the task this year and broke it down to the simple question: Given a document, are there any style changes or not? An alternative formulation would thus be to solely predict whether a document is written by a single author or by multiple collaborators, whereby it is irrelevant to the task to identify the exact border positions between authors. While the evaluation of the two preceding tasks relied on the Webis-TRC-12 data set [21], we created a novel data set by utilizing the StackExchange network. Containing millions of publicly available questions and answers regarding several topics and subtopics, it represents a rich source which we exploited to build a comprehensive, but still realistic data set for the style change detection task.

When the collective style of groups of authors is considered, author profiling attempts to predict demographic and social characteristics, like age, gender, education, and personality traits. It is a research area associated with important applications in social media analytics and marketing as well as cyber forensics. In this edition of PAN, for the first time, multimodal information is considered. Both texts and images posted by social media users are used to predict their gender.


Hybrid System for Plagiarism Detection on A Scientific Paper

by Asst. Prof. Dr. Mohammed S. H. Al-Tamimi

2021


Plagiarism Detection Systems are critical in identifying instances of plagiarism, particularly in the educational sector whenever it comes to scientific publications and papers. Plagiarism occurs when any material is copied without the author's consent or attribution. To identify such acts, thorough knowledge of plagiarism types and classes is required. It is feasible to detect several sorts of plagiarism using current tools and methodologies. With the advancement of information and communication technologies (ICT) and the availability of online scientific publications, access to these publications has grown more convenient. Additionally, with the availability of several software text editors, plagiarism detection has become a crucial concern. Numerous scholarly articles have previously examined plagiarism detection and the two most often used datasets for plagiarism detection, WordNet and the PAN Dataset. The researchers described verbatim plagiarism detection as a straightforward case of copying and pasting, and then shed light on clever plagiarism, which is more difficult to detect since it may involve original text alteration and borrowing ideas from other studies. Other scholars have said that plagiarism can obscure the scientific content by substituting terms, deleting or introducing material, rearranging or changing the original publications. The suggested system incorporated natural language processing (NLP) and machine learning (ML) techniques, as well as an external plagiarism detection strategy based on text mining and similarity analysis. The suggested technique employs a mix of Jaccard and cosine similarity. It was examined using the PAN-PC-11 corpus. The proposed system outperforms previous systems on the PAN-PC-11, as demonstrated by the findings. Additionally, the proposed system obtains an accuracy of 0.96, a recall of 0.86, an F-measure of 0.86, and a PlagDet score of 0.86 (0.865). The proposed technique is substantiated by a design application that is used to detect plagiarism in scientific publications and generate notifications in Portable Document Format (PDF).

Figure 8. Plagiarism Algorithms Comparison

As a means of increasing efficiency and performance, cosine similarity is used to calculate the cosine of the angle between document vectors and to construct a hierarchical clustering method for them. The following method demonstrates cosine similarity in the proposed system:

9. Result Threshold

In words, any sentence for which one similarity measure, or both, exceeds the threshold of that measure is considered a plagiarized sentence. For document D, the adaptive threshold works as in the following algorithm:

Algorithm 4. Hybrid system

11. Experiments and Discussion

To facilitate comprehension of the proposed system, the researchers outline its components and compare it to another system's dataset, PAN-PC-2011. The detection of plagiarism between the source and suspect documents in the data set is explained in the same step-by-step fashion as the algorithm above. All the stages are run on each item in each folder of the suspect folder that contains the resource folder, so each step is conducted in a one-to-many fashion. The implementation takes a document from the suspect folder and compares it against all the source folders. This process was used to evaluate the efficiency of the proposed system's algorithm and to extract the error rate for evaluation, in order to obtain an accurate result. The best result of the two measurements used in the proposed system (cosine and Jaccard) is r. When the experiment is conducted on a data set, the hybrid system established on the Jaccard threshold (0.2) and the Cosine
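The sentence-level decision rule described in this excerpt (flag a sentence when Jaccard, cosine, or both exceed their thresholds) can be sketched as follows. This is a minimal illustration and not the authors' Algorithm 4: whitespace tokenization, the helper names, and the cosine threshold of 0.5 used in the usage note are all assumptions, since the excerpt gives only the Jaccard threshold (0.2) and is cut off before the cosine value.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two token lists, computed on token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between term-frequency vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_plagiarized(suspicious, source, jaccard_threshold, cosine_threshold):
    """Flag the sentence if either measure (or both) exceeds its threshold."""
    s = suspicious.lower().split()  # assumed tokenization: lowercase + whitespace
    t = source.lower().split()
    return jaccard(s, t) > jaccard_threshold or cosine(s, t) > cosine_threshold
```

A call such as `is_plagiarized(sentence, candidate, 0.2, 0.5)` uses the reported Jaccard threshold together with a hypothetical cosine threshold; the latter would need to be replaced with the value from the full paper.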

Several PD methods and their findings have been reported in the literature in recent years. The findings of the suggested approach are compared to some of the approaches mentioned in the literature in this section. Table 3.3 compares the detection measurement in the confusion matrix obtained by the proposed method to that obtained in prior investigations. Our proposed strategy appears to be superior to others based on the stated results.

Table 2. Comparison with Previous Studies


Identifying Features in Opinion Mining using Bootstrap Methodology

by Saroj Date

2020, IEEE : 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM)

Many approaches to opinion feature extraction are based only on a single review corpus, ignoring non-trivial disparities in word distribution across different corpora. In the proposed work, a new technique... more


Plagiarism Detection: A focus on the Intrinsic Approach and the Evaluation in the Arabic Language

by Imene Bensalem

2020

This thesis deals with two major topics: plagiarism detection in Arabic documents, and plagiarism detection based on the writing style changes in the suspicious document, which is called intrinsic plagiarism detection. This approach is an... more

Chapter II provides a survey on the current methods of detecting plagiarism in Arabic documents. The survey shows that almost all the methods are based on uncovering plagiarism by comparing the suspicious document to the potential sources of plagiarism (the external approach). This motivates us to conduct the first experiments on Arabic documents that attempt to detect plagiarism by spotting the writing style changes (the intrinsic approach). In the light of these experiments, which utilise a small ad-hoc corpus, we felt the necessity to build a larger evaluation corpus that allows for a better assessment of the task performance on Arabic documents.

Besides the technical aspect of Arabic plagiarism detection, this chapter discusses another

Figure II-1. External plagiarism detection methods building blocks

coining standard terminology. Therefore, we refer the reader to PAN overview papers

Chapter II. Arabic Plagiarism Detection: Critical Review

Figure II-2. Intrinsic plagiarism detection methods building block

Figure II-3. Arabic plagiarism detection papers published from 2008 to June 2019

As shown in Figure II-4, almost all the collected publications are papers that describe

Figure II-5. Proportion of Arabic plagiarism papers with and without “bad smells”

Chapter III. Evaluation of Plagiarism Detection on Arabic Documents

Figure III-1. The insertion-based approach of building plagiarism detection evaluation corpora

important books of building corpora (McEnery et al. 2006), the copyright-free documents are

Figure III-2. Different representations of the same word with and without letters’ diacritics.

The symbols |Act| and |Det| are the number of actual and detected cases, respectively. The symbols |s_act| and |s_det| are, respectively, the lengths of s_act and s_det in

For a single actual plagiarism case, s_act, a plagiarism detection method may output multiple detections (separate or overlapping). Thus, granularity is used to average the number of the detected cases for each actual case, as depicted in formula 3. Act_det ⊆ Act is the set of the actual cases that have been detected, and Det_s_act ⊆ Det is the set of the detected cases that intersect with a given actual case s_act. The optimal value of the granularity is 1, and it means that for each actual case s_act, no more than a single case has been detected (i.e. not many overlapping or adjacent cases).
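Read literally, the description above gives granularity as the sum of |Det_s_act| over the detected actual cases, divided by |Act_det|. A minimal Python sketch of that reading follows. Representing each case as a half-open `(start, end)` character interval and returning 1.0 when nothing is detected are assumptions made for the example, not details taken from the thesis.

```python
def overlaps(a, b):
    """True if half-open character intervals a and b intersect."""
    return a[0] < b[1] and b[0] < a[1]


def granularity(actual, detected):
    """Average number of detections per actual case that was detected.

    `actual` and `detected` are lists of (start, end) intervals.
    Actual cases with no intersecting detection are left out of the
    average, matching the definition of Act_det.
    """
    per_case = [sum(overlaps(s, r) for r in detected) for s in actual]
    hit_counts = [c for c in per_case if c > 0]
    if not hit_counts:
        return 1.0  # assumed convention when no actual case is detected
    return sum(hit_counts) / len(hit_counts)
```

For instance, two actual cases where the first is reported as two separate fragments and the second as one give a granularity of 1.5, which signals fragmented output even if every detected character is correct.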

Figure IV-2. Taxonomy of the building blocks of intrinsic plagiarism detection methods

The pre-processing heuristics are called so because they operate before the fragment-level analysis. These heuristics aim to filter out the irrelevant information that may disrupt the style analysis (through cleaning, normalisation and genre analysis) or reduce the computation by taking an early decision on the document (through checking whether the document

Figure IV-3. Feature extraction at fragment and document levels. The symbols s_1, ..., s_n denote the fragments and f_1, ..., f_m denote the features.

kinds of units: (i) one character, (ii) a sequence of characters, or (iii) a class of characters. See,

The IPD methods that use supervised learning are listed in Table IV-9. It remains to say that the pitfall of IPD methods based on supervised learning is that they may suffer from the lack of training data. And even if it is available, there will be an imbalance in the number of plagiarised and the non-plagiarised examples since naturally the original texts are more abundant. This issue renders the IPD a classification problem with skewed classes, which is a known problem in machine learning that may lead to training biased classifiers. In (Polydouri et al. 2017, 2018), the authors attempted to mitigate this problem by using sampling techniques on the training corpus aiming to construct a balanced dataset. This problem can be also tackled by using classification algorithms designed to function with datasets of skewed classes, such as Complement Naive Bayes (Rennie et al. 2003). In that context, we used this algorithm in one of our IPD experiments and it proved its effectiveness in comparison with the original Naive Bayes (Bensalem et al. 2014b)*.
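To make the skewed-classes remedy concrete, here is a small from-scratch sketch of the core Complement Naive Bayes idea: the weights for a class are estimated from all training examples that do not belong to it, and the predicted class is the one with the lowest complement score. This is an illustration under assumptions, not the setup used in the cited experiments: the token-list representation, the function names, the labels, and the smoothing value are hypothetical, and the weight normalisation step of the full method is omitted.

```python
import math
from collections import Counter


def train_cnb(docs, labels, alpha=1.0):
    """Estimate per-class log weights from the complement of each class."""
    vocab = sorted({t for d in docs for t in d})
    weights = {}
    for c in sorted(set(labels)):
        complement = Counter()
        for d, y in zip(docs, labels):
            if y != c:  # count features only in examples of the other classes
                complement.update(d)
        total = sum(complement.values()) + alpha * len(vocab)
        weights[c] = {t: math.log((complement[t] + alpha) / total) for t in vocab}
    return weights


def predict_cnb(weights, doc):
    """Pick the class whose complement explains the example worst."""
    def score(c):
        # Features unseen in training contribute 0 for every class.
        return sum(weights[c].get(t, 0.0) for t in doc)
    return min(weights, key=score)
```

Because the minority class is scored using statistics from the much larger majority class (and vice versa), the estimates are less sensitive to how few minority examples are available.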

Chapter IV. Intrinsic Plagiarism Detection: a Survey

Figure IV-6. Illustration of the density-based outlier detection for intrinsic plagiarism detection. Plagiarised and non-plagiarised sections can be separated if their values of a feature f_i are differently distributed (adapted after (Stein et al. 2011)).

As for the priors, i.e., P(Class = plag_0) and P(Class = plag_1), which is the portion of each class among all the fragments, the authors of this approach stated that they are estimated either by an impurity assessment (meta information on the document) or by the maximum likelihood estimator which assumes that the classes are uniformly distributed, i.e., half of the fragments is plagiarised and the other half is not. However, it has not been stated in the paper (Stein et al. 2011) which of these two options is adopted in the conducted experiments.

Figure IV-7. Steps of the distance-based outlier detection for intrinsic plagiarism detection. In the figure (A), the distance is computed between the fragments and the document; and in the figure (B), the distance is computed between each pair of fragments.

are averaged. Hence, for both cases, the document is represented with a vector of distances

Chapter V. Character N-grams as the Only Intrinsic Evidence of Plagiarism

Figure V-3. Steps for computing the n-gram classes of a document. The parameter n is the length of n-grams and m is the number of classes. In this example m = 3 (class labels are from 0 to 2)

following subsections provide further details on these three stages.

3.2. N-gram Classification

Computing the NFCP features by considering the repetition of n-grams in the fragment

fragment, and its maximum value is the number of fragments in d if ng_i occurs in each

Computing the NFCP features without considering the repetition of n-grams in the fragment

Figure V-5. Illustration of two ways of computing the proportion of n-gram classes in a fragment

Figure V-6. Average of InfoGain of the features generated by different variants of the extraction method

the subsequent experiments, we will adopt the best variant (S1RO) without mentioning that every time.

Figure V-7. F-measure of our method in comparison with the best methods in the PAN intrinsic plagiarism detection competitions

Figure V-8. The 54 classes obtained from the n-grams of a document by classifying them into different number of classes, m. For example, when m = 2 (the top of the figure), this means that the n-grams of the document are classified into 2 classes labelled 0 and 1. The former represents n-grams of low frequency, and the latter represents n-grams of high frequency

Practically, for each language, a total number of 540 classifiers (in each iteration), corresponding

documents including 34765 and 5547 plagiarism cases, respectively. Once the 540 features

Figure V-9. The distribution of performance of the NFCP features computed on English text (a) and Arabic text (b)

Figure V-11. Sensitivity of NFCP features performance to the n-gram length on English (left) and Arabic (right)

Figure V-12. Performance of combined NFCP features selected using different techniques

7 Sensitivity Analysis of Stamatatos’ Method Performance to N-grams Frequency and Length

* See the conclusion of Chapter V (pp. 122-123) for further details on this future work.

Figure VI-1. Summary of the discussed future works and research prospects. The arrow between some future works means that each one of them implies the other.

Table II-1. Bad smells that we detected in Arabic plagiarism detection papers

Instead of a simplistic approach of including/excluding papers from our literature review, we applied the approach proposed in (Menzies and Shepperd 2019), which consists in assessing the quality of papers in terms of twelve criteria. The authors called these criteria “bad smells” and defined them as the surface issues that might be detected in research publications and that can be indications of serious problems. The scope of Menzies and Shepperd’s investigation is software analytics. Still, the authors noted that while some of the proposed “bad smells” are specific to software analytics, others are general and then applicable to other scientific domains. Hence, we selected four “bad smells” (from twelve) that we judged appropriate for plagiarism detection research (see footnote 11). Table II-1 lists them in the first column. In the second column, we determine exactly how these “bad smells” emerged in the examined Arabic plagiarism detection papers.

10 See Chapter III (Section 5.3.1) for details on the standardised evaluation measures of plagiarism detection.

11 Most of the “bad smells” that we did not consider concern the statistical significance, which is usually not utilised in plagiarism detection studies. Note that this does not mean that this technique is not applicable for plagiarism detection evaluation but rather its use is uncommon even in the best studies. This fact might be attributed to the lack of practical guidelines on hypothesis testing that may accompany the current plagiarism detection evaluation measures.

Table II-2. Overview of the number of papers considered in our study

3.2.2 Results

Table II-3. Scope of the examined Arabic plagiarism detection papers

Chapter II. Arabic Plagiarism Detection: Critical Review

3.3. Methods and Evaluation Corpora

In this section, we review the 24 selected works (from the previous step) in terms of their

Table II-4. Papers on Arabic plagiarism detection using the external approach

15 This method does not exactly follow the principle of creating queries to a search engine to retrieve the candidate documents; instead, it compares the suspicious and the source documents at three levels, starting from the document level, then the paragraph level and finally the sentence level. If no similarity is detected at the document level, the following levels are not considered. Thus, we consider the document-level comparison as the candidate retrieval module.

Table II-5. Description of the corpora used to evaluate plagiarism detection methods on Arabic documents. The character '-' is used when no information is provided.

Table II-7. Performance evaluation

4.2.2 Results and Discussion

We used three measures to evaluate the performance of discriminators: Precision (equation 1)

Table II-8. Combination’s results: baseline vs. the most precise voting schemes

1.3 Experiment 2: Combining Discriminators

Table III-1. Comparison between approaches to creating suspicious documents. The symbol ✓ indicates an advantage, and ✗ indicates a disadvantage.

Chapter III. Evaluation of Plagiarism Detection on Arabic Documents

Table III-2. AraPlagDet shared task schedule

* July 16 is the release date of a sample of the training corpus. The complete training corpus was released on August 10.

Table III-3. Statistics of the ExAra corpus

Table III-4. Source retrieval approaches with their building blocks used in the participants’ methods. Each column describes an approach in terms of its building blocks. The first line provides a concise description of the approach, and the second line indicates the methods that employed each approach. For example, the Magooda_2 method used two approaches: sentence-based and keyword-based indexing.

Alzahrani’s method is suited to an offline scenario, i.e., when the source of plagiarism is local and not too large, as in the case of detecting plagiarism between students’ assignments. This is for two reasons: (i) its retrieval model is not structured to be used with search engines (for example, there is no query formulation, see Table III-4); and (ii) it is based on fingerprinting all the source documents and entails an exhaustive comparison between the n-grams of the suspicious document and those of each source document, which is not workable if the source of plagiarism is extremely large, like the web. Still, even when the method is intended for offline use, it would be better to use retrieval techniques that allow a large number of documents to be processed in a reasonable time, such as inverted indexes. Malcolm and Lane (2009) discuss the importance of scalability even for offline plagiarism detectors.
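The scalability argument can be illustrated with a minimal Python sketch of an inverted index over n-grams. It is not Alzahrani’s method nor any participant’s system: the function names, the choice of word 3-grams and the toy documents are all assumptions made for illustration. The point is that the suspicious document’s n-grams are looked up once in the index instead of being compared exhaustively with every source document.

```python
from collections import defaultdict


def word_ngrams(text, n=3):
    """Return the set of word n-grams of a text (n=3 is an arbitrary choice)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_inverted_index(sources, n=3):
    """Map each n-gram to the ids of the source documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in sources.items():
        for gram in word_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def candidate_sources(suspicious, index, n=3):
    """Count shared n-grams per source by looking up only the suspicious n-grams."""
    hits = defaultdict(int)
    for gram in word_ngrams(suspicious, n):
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return dict(hits)


# Hypothetical toy collection, for illustration only.
sources = {
    "src1": "the quick brown fox jumps over the lazy dog",
    "src2": "a completely different sentence about plagiarism detection",
}
index = build_inverted_index(sources)
shared = candidate_sources("yesterday the quick brown fox jumps again", index)
```

Here `shared` maps each candidate source to the number of n-grams it has in common with the suspicious text; the index is built once and reused for every suspicious document.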

Table III-5. Text alignment approaches with their building blocks used in the participants’ methods

Language Dependence

Table III-6. Performance of the external plagiarism detection methods on the test corpus (columns: Method, Precision, Recall, Granularity, Plagdet)

comparison with what has been achieved by the state-of-the-art methods (see for example the

Table III-7. Detailed performance of the participants’ methods. In each measure, the underlined values are the highest per parameter.

Table III-8. Statistics of the InAra corpus

Criterion 1: Each host document must be written by one author only. If the document is multi-

Table III-9. Sources of texts used to build the InAra corpus

Table III-11. Performance of the intrinsic plagiarism detection methods

6.3.2. Detailed Results

Table III-12. Detailed performance of the intrinsic plagiarism detection methods

a module that detects and filters out the Quranic citations. Such a module can rely on the external approach, whereby the whole text of the document is compared to the Quran corpus

Chapter IV. Intrinsic Plagiarism Detection: a Survey

Figure IV-1. Timeline of some milestones related to intrinsic plagiarism detection

Table IV-1. Intrinsic plagiarism detection and its related research areas

The major drawback of this perception emerges when the plagiarism constitutes the

Intrinsic plagiarism detection (IPD) can in essence be seen as anomaly-of-authorship detection at the fragment level (Guthrie et al. 2007), where plagiarism is the anomaly and the text written in the plagiarist’s own style is the normal part. In fact, most current IPD methods treat IPD as an anomaly detection problem. That is, they assume that the normal data (the original part) is the majority and can hence be characterised, whereas the abnormal data (the plagiarised part) is sparse and thus difficult to characterise. Methods based on this assumption therefore build a writing style model of the whole suspicious document and consider as plagiarism any fragment deviating from this general style (Mahgoub et al. 2015; Muhr et al. 2010; Oberreuter and Velasquez 2013; Stamatatos 2009a; Suarez et al. 2010; Zechner et al. 2009).

Table IV-2. Pre-processing heuristics in intrinsic plagiarism detection methods

document as plagiarism-free if the variance of the style change function is not significant. Practically speaking, this implementation checks the significance of the style variance by comparing the standard deviation $\sigma$ of the style change function to a predefined threshold $\tau_s$. If $\sigma < \tau_s$, the heuristic marks the document as plagiarism-free.
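The heuristic can be sketched in a few lines of Python. This is a minimal illustration, not the implementation discussed above: the function name, the use of the population standard deviation, the sample values and the threshold of 0.02 are all assumptions.

```python
from statistics import pstdev


def is_plagiarism_free(style_change_values, threshold):
    """Return True when the spread of the style change function is below the threshold.

    style_change_values: one style-change score per text window (assumed input).
    threshold: the predefined cut-off; its value here is a free parameter.
    """
    sigma = pstdev(style_change_values)  # population standard deviation (an assumption)
    return sigma < threshold


# Hypothetical scores: a stylistically flat document and one with an outlying window.
flat_document = [0.20, 0.21, 0.19, 0.20]
mixed_document = [0.20, 0.20, 0.60, 0.20]

flat_is_clean = is_plagiarism_free(flat_document, threshold=0.02)
mixed_is_clean = is_plagiarism_free(mixed_document, threshold=0.02)
```

A document that passes this check would be skipped entirely, so the threshold trades missed plagiarism against false alarms on homogeneous documents.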

it in a structured manner. Feature extraction in natural language processing (NLP) structures

Table IV-4. The units from which the character features are extracted with examples extracted from a sentence

features are computed, we classify character features depending on whether the unit is defined

Character n-grams are sequences of contiguous characters of a predefined length extracted from the text without considering any linguistic relationship between them. Despite their simplicity, these features have proven effective in many NLP applications, such as authorship attribution (Cavnar and Trenkle 1994; Stamatatos 2016), native language identification (Kulmizev et al. 2017) and opinion spam detection (Hernandez Fusilier et al. 2015). Given their reputation as stylistic markers, notably for authorship attribution, they have been employed in intrinsic plagiarism detection. As a matter of fact, Stamatatos (2009a) was the first to develop a character-n-grams-based IPD method. Although it utilises only these language-independent features, Stamatatos’ method was ranked first in the PAN 2009 intrinsic plagiarism detection shared task. This seminal method, by its simplicity, inspired other researchers, who reproduced it partially or fully in their works (Kasprzak and Brandejs 2010; Kestemont et al. 2011; Kuta and Kitowski 2014; Rao et al. 2011).
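As a minimal sketch of what extracting character n-grams means in practice, the following Python snippet slides a window of length n over a string and counts the resulting n-grams. The function name and the example strings are illustrative assumptions; no normalisation or pre-processing is applied.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all contiguous character sequences of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Frequency profile of the 3-grams of a toy string.
profile = Counter(char_ngrams("banana", 3))

# The same function works on Arabic text, since it operates on characters only.
arabic_bigrams = char_ngrams("كتاب", 2)
```

Because the function never looks at word boundaries or grammar, it applies unchanged to any language, which is the language independence mentioned above.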

Table IV-6. Some linguistic aspects manipulated to produce different sentence structures

detection wherein the writing style is analysed at the fragment level. Nonetheless, these measures are included in numerous intrinsic plagiarism detection methods, namely (Meyer zu Eißen et al. 2007), (Stein et al. 2011), (Kern et al. 2012), and (Carnahan et al. 2014). On the other hand, Meyer zu Eißen and his colleagues (2007; 2006) proposed a new vocabulary richness measure called Average Word Frequency Class, which is argued to be ideal for IPD due to its stability across different text lengths. Later, this feature was used in other methods, such as (Stein and Meyer zu Eißen 2007), (Zechner et al. 2009), and (Carnahan et al. 2014). In addition, variants of this measure are used in (Polydouri et al. 2017).

writing style of a fragment and that of the whole document. Then, all fragments with a

Table IV-9. The supervised learning-based methods used for intrinsic plagiarism detection

4.4.2 Clustering

Clustering is an unsupervised machine learning approach that creates, from a given set of elements, subsets that group together the similar elements. The similarity between the elements is assessed based on their feature vectors. For most of the algorithms, the number of clusters to create should be determined a priori. This paradigm is well suited to multi-author document segmentation, wherein each cluster contains the fragments of similar writing style (see, e.g., (Akiva 2012; Kern et al. 2012)), and hence the number of clusters represents the number of authors involved in writing the document. In the existing intrinsic plagiarism detection methods, the number of clusters created from the suspicious document fragments is typically two; one of them groups the plagiarism-free fragments and the other one contains the plagiarised fragments.
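A two-cluster split of fragment feature vectors can be sketched with a plain k-means in Python. This is only an illustration under stated assumptions: the surveyed methods use various clustering algorithms, the deterministic initialisation (first vector and the vector farthest from it) is a simplification, and the two-dimensional feature values are made up.

```python
import math


def two_means(vectors, max_iter=100):
    """Split feature vectors into two clusters with a plain k-means (k=2)."""
    first = vectors[0]
    second = max(vectors, key=lambda v: math.dist(v, first))
    centroids = [list(first), list(second)]
    labels = None
    for _ in range(max_iter):
        # Assign each vector to its nearest centroid.
        new_labels = [
            0 if math.dist(v, centroids[0]) <= math.dist(v, centroids[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical style features for five fragments of one suspicious document.
fragments = [(0.10, 0.20), (0.12, 0.18), (0.11, 0.21), (0.90, 0.80), (0.09, 0.19)]
labels = two_means(fragments)
```

Deciding which of the two clusters holds the plagiarised fragments is a separate step; treating the smaller cluster as the suspicious one would be an additional assumption, in line with the anomaly view of plagiarism.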

Table IV-11. Post-processing heuristics in intrinsic plagiarism detection methods

To complete the picture of intrinsic plagiarism detection, it is necessary to discuss its effectiveness. In fact, despite the variety of heuristics and stylistic features used in the methods (as shown in Section 4), their performance scores are still poor. To the best of our knowledge, few methods, such as (Stamatatos 2009a) and (Oberreuter et al. 2011b; Oberreuter and Velasquez 2013), reached an F-measure greater than 0.3 using a standardised evaluation framework. Other methods, for instance Stein et al. (2011), Tschuggnall and Specht (2013c) and Polydouri et al. (2017, 2018), obtained relatively higher scores. Nonetheless, the two former methods have been evaluated on only subsets of the evaluation corpus, and the evaluation of the latter method is based on a modified version of the performance measures (footnote 8).

8 The performance measures used by Polydouri et al. (2017, 2018) are computed based on the number of sentences and not the number of characters. For example, given a plagiarised fragment composed of 4 sentences, if the software detects 2 of them, the recall measured on this fragment, according to Polydouri et al., would be 0.5. However, the standardised recall score (Potthast et al. 2010c) could be more or less different, since it is the ratio of the length, in characters, of the 2 detected sentences to the length of the full plagiarised fragment.
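The difference between the two ways of counting can be made concrete with a small Python sketch for a single plagiarised fragment. The function names and sentence lengths are assumptions, and the character-based function is a simplification of the standardised recall for one plagiarism case only (it ignores multiple cases and the characters between sentences).

```python
def sentence_recall(plagiarised_sentences, detected_indices):
    """Recall counted in sentences: detected sentences / all plagiarised sentences."""
    return len(detected_indices) / len(plagiarised_sentences)


def character_recall(plagiarised_sentences, detected_indices):
    """Recall counted in characters: detected characters / all plagiarised characters."""
    detected_chars = sum(len(plagiarised_sentences[i]) for i in detected_indices)
    total_chars = sum(len(s) for s in plagiarised_sentences)
    return detected_chars / total_chars


# Hypothetical fragment: two short sentences (10 characters each) and two long ones (40 each).
fragment = ["a" * 10, "b" * 10, "c" * 40, "d" * 40]
detected = [0, 1]  # only the two short sentences are detected

by_sentence = sentence_recall(fragment, detected)
by_character = character_recall(fragment, detected)
```

With these made-up lengths the sentence-based score is 0.5 while the character-based score is 0.2, which shows why scores computed under the two conventions cannot be compared directly.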

Table V-1. The frequency and length of character n-grams in intrinsic plagiarism detection methods

5 The table lists only the methods that provide information on the used character n-grams.

For example, in (Kestemont et al. 2011), the text was represented using only the most frequent n-grams extracted from a corpus for an efficiency reason, namely to reduce computation. However, no experiment was done to check the impact of this reduction in the number of n-grams on performance, or to show that high-frequency n-grams are more effective than less frequent ones. In (Kuznetsov et al. 2016), the frequencies of both rare and frequent n-grams in a sentence were among the features used to quantify the writing style incoherence between this sentence and the rest of the document. However, the rationale behind these choices has not been explained.

Table V-3. Statistics on the evaluation corpora

1 Datasets and Performance Measures

For our experiments, we used three evaluation corpora in English and one corpus in Arabic with its two parts, training and test. The English corpora (Potthast et al. 2010c) were developed for the international competition on plagiarism detection (PAN) of the years 2009, 2010 and 2011 to evaluate IPD methods (Potthast et al. 2009, 2010a, 2011). We used specifically the test part of each corpus. The Arabic corpus (InAra) (Bensalem et al. 2013a, 2013b) was built by us, following PAN annotation standards, and was used in AraPlagDet 2015, the first plagiarism detection competition on Arabic documents (Bensalem et al. 2015).

Table V-4. Evaluation setting of NFCP features

to the 540 NFCP features, have been trained and tested using the five datasets described in Section 5. Explicitly, cross-validation has been performed between each pair of corpora, i.e., each corpus is used separately, on the one hand, for training a classification model and, on the other hand, for testing the models trained on the other corpora of the same language. Consequently, we obtained for each NFCP feature six classification results on the English corpora and two classification results on the Arabic corpus, as illustrated in Table V-4. Then, the F-measure scores are averaged for each language to be used in our analysis.
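The counts of six and two follow from enumerating ordered (training, test) pairs within each language, which the following Python sketch makes explicit. The corpus labels are illustrative names for the PAN 2009 to 2011 corpora and the two InAra parts, and the helper names are assumptions; no actual scores are shown.

```python
from itertools import permutations
from statistics import mean

# Illustrative labels for the five datasets (assumed names).
english_corpora = ["PAN-PC-09", "PAN-PC-10", "PAN-PC-11"]
arabic_corpora = ["InAra-training", "InAra-test"]


def train_test_pairs(corpora):
    """Every ordered pair (train on one corpus, test on a different one)."""
    return list(permutations(corpora, 2))


def average_f_measure(scores_by_pair):
    """Average the F-measure over all (train, test) pairs of one language."""
    return mean(scores_by_pair.values())


english_pairs = train_test_pairs(english_corpora)  # 3 corpora give 3 * 2 = 6 pairs
arabic_pairs = train_test_pairs(arabic_corpora)    # 2 parts give 2 * 1 = 2 pairs
```

Each feature would then be evaluated once per pair, and `average_f_measure` would be applied separately to the English and the Arabic results.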

Table V-5. The configurations that produce the best NFCP features

Table V-6. Cumulative percentages computed on the 3-grams of the suspicious-document01020 of PAN-PC-09

profiles is based on the cumulative percentages that we computed on the frequency distribution

Table VI-1. Assumptions made when building the evaluation corpora of intrinsic plagiarism detection

10 They have been neglected in the context of IPD. However, they have been addressed in the context of EPD.

Conceivably, a plagiarism case becomes invisible to an intrinsic plagiarism detection method if the plagiarist succeeds in obfuscating it by rewriting it in her/his own writing style, so that the contrast between it and the rest of the document fades away. On the other hand, a plagiarism case becomes invisible to an external plagiarism detection method if the plagiarist succeeds in obfuscating it so that the similarity with its source is concealed. Therefore, the obfuscations aiming to defeat the external plagiarism detection systems will not certainly

Arabic Language and Artificial Intelligence - The Educational Journal, Sohag University

by Gamal Eldahshan

2020, Arabic Language and Artificial Intelligence: How Can Artificial Intelligence Techniques Be Used to Strengthen the Arabic Language?

Arabic Language and Artificial Intelligence

by Gamal Eldahshan

2020

Automatic Programming of Language

by Issam Tihami

2019, Automatic Programming of Language

Topics: Topic 1: some concepts related to the automatic programming of language. Topic 2: automatic processing of the morphological system. Topic 3: automatic processing of the syntactic system. Topic 4: automatic processing of the writing system. Topic 5: ... more

OPINION MINING AND SENTIC ANALYSIS IN SOCIAL NETWORK BASED ON ELM

by IJRMS Journal

2017

Extreme Learning Machine (ELM) is a new learning algorithm for feed forward neural network for classification or regression with a single layer of hidden nodes where the weights connecting inputs to hidden nodes are randomly assigned.... more

USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS

by Computer Science & Information Technology (CS & IT) Computer Science Conference Proceedings (CSCP)

2017

The Web is considered one of the main sources of customer opinions and reviews, which are represented in two formats: structured data (numeric ratings) and unstructured data (textual comments). Millions of textual comments about goods and... more

Fig 1: Approach Structure

To support our approach and achieve our goal, we collect attributes and adjectives, classify new adjectives while the approach is running, and save them in two main tables: an attributes table and an adjectives table. The attributes table includes both simple and compound attributes; each entry holds a pair of roots representing a certain attribute, and for simple attributes the second root is null. The adjectives table includes the root of each adjective and its classification as either good or bad. We have also collected neglect tools (words) and saved them in a list.
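One way to picture the two tables and the neglect list is the following Python sketch. It is only an assumed in-memory representation: the class name, field names and every sample entry are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple, Optional


class Attribute(NamedTuple):
    """One entry of the attributes table: a pair of roots."""
    first_root: str
    second_root: Optional[str] = None  # None (null) for a simple attribute

    @property
    def is_compound(self) -> bool:
        return self.second_root is not None


# Hypothetical sample entries, for illustration only.
attributes_table = [
    Attribute("screen"),              # simple attribute
    Attribute("picture", "quality"),  # compound attribute
]
adjectives_table = {"sharp": "good", "blurry": "bad"}  # root -> classification
neglect_words = ["the", "a", "very"]                   # words to ignore


def classify_adjective(root):
    """Return 'good' or 'bad' for a known adjective root, or None if it is new."""
    return adjectives_table.get(root)
```

A `None` result from `classify_adjective` would correspond to a new adjective that still has to be classified and added to the table.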

Table 1: Customer Reviews for Samsung LED 4009MS-U7D 40 inch TV

5. CONCLUSION

Thinking arabic translation

by bayan mohammed

2016

Human Translation VS Machine Translation

by Ibrahim Talaat Ibrahim

2016

This article shows the differences between human and machine translation.

Mining Words and Targets using Alignment Model

by IJSTE - International Journal of Science Technology and Engineering

2016

An opinion target is defined as the object about which users express their opinions, typically a noun or noun phrase. Opinion words are the words used to express users' opinions. Constructing an opinion words lexicon is also... more

Opinion Feature Extraction Using Enhanced Opinion Mining Technique and Intrinsic- Extrinsic Domain Relevance

by Ijaems Journal

2016

Mining patterns are the main source of opinion feature extraction techniques, which was individually evaluated corpus mostly belong to evaluated corpus. A measure called Domain Relevance is used to identify candidate features from domain... more

Overview of the AraPlagDet PAN@FIRE2015 Shared Task on Arabic Plagiarism Detection

by Imene Bensalem

2016, FIRE 2015

AraPlagDet is the first shared task that addresses the evaluation of plagiarism detection methods for Arabic texts. It has two subtasks, namely external plagiarism detection and intrinsic plagiarism detection. A total of 8 runs have been... more

Arabic and Quranic Computational Linguistics Projects at the University of Leeds

by Eric S Atwell

2014

Paper abstract: Members of the language research group at the School of Computing, University of Leeds, work on computational research for Arabic language processing. When we carried out a broad survey of freely available tools for Arabic text processing in the past, we found very few such tools, but we realised that machine learning software for language processing in general could be adapted and applied to Arabic. This requires Arabic text to train the software, so we collected the first Arabic text collection (corpus) available for free download, made it open source, and developed a convenient interface program for displaying the text consistently. This corpus became widely known among experts in the computational processing of Arabic, who used it to train other computer programs and to evaluate results. We also developed tools for analysing contemporary Arabic text, such as morphological analysis, stemming and tagging, broad-coverage language resources for Arabic, and a text collection annotated with discourse relations for Arabic, modelled on the discourse relations for English released by the University of Pennsylvania.

Among the pioneering research areas we are proud of at the University of Leeds are computational projects for the linguistic processing of the Holy Quran, which extend text mining techniques for traditional Arabic. This research includes the development of a chatbot and the "Qurany" website, which supports search at the level of predefined concepts, and a framework for Quranic knowledge representation and syntactic annotation. Recently we launched the "Quranic Arabic Corpus" website [http://corpus.quran.com], a freely downloadable electronic resource annotated with detailed linguistic information at the level of morphology and part of speech for each Quranic word. Since its launch, the site has been widely accepted and has attracted a large audience of linguistics researchers as well as non-native speakers of Arabic who wish to learn the language of the Quran; it is the first site for studying Quranic grammar on the Google search engine and draws more than 50 thousand visitors per month. This led us to propose "Understanding the Quran" as a grand challenge for computer science and artificial intelligence for 2010 and beyond.

We are gradually expanding this project: we are now preparing dependency treebanks of Quranic sentence structure (Quranic Arabic Dependency Treebank), derived from books of Quranic grammatical analysis and machine-readable. We are also extracting Quranic concepts, building a network of links between these concepts, and resolving Quranic pronouns to them. Future plans for this Quranic project include building a lexical resource for Quranic vocabulary modelled on WordNet, developing a semantic frame resource for Quranic vocabulary modelled on FrameNet, developing a Quran text collection annotated with discourse relations, and then developing an integrated search engine that supports Quranic search at the levels of vocabulary, style, syntax, morphology and concepts through the aforementioned language resources.

Building on successful projects on the Holy Quran, we aspire to apply the experience gained to the computational processing of traditional Arabic texts, such as the Prophetic Hadith and heritage books, as well as to Modern Standard Arabic. For example, we plan to develop an annotated text collection that serves search and text mining for specialists in Islamic economics, through knowledge representation of the concepts annotated in this field. We have also been invited to extend our computational research to cover sacred books of other religions, such as the Torah and the Gospel, which would allow us to use computational techniques to observe the similarities and differences between these sources.

Arabic and Quranic Computational Linguistics Projects at the University of Leeds

by Eric S Atwell and

2014

Evaluation of the performance of Moses statistical engine adapted to English-Arabic language combination

by Ouafa BENTERKI

2014

Statistical Machine Translation (SMT) is considered a subfield of computational linguistics, which in turn is regarded as a branch of Artificial Intelligence (AI) dedicated to Natural Language Processing (NLP). The main purpose of this... more

Intrinsic Plagiarism Detection for Text Based Features Pattern.

by Ijesrt Journal

2014, International Journal of Engineering Sciences & Research Technology

Plagiarism detection means detecting whether a document has been copied or stolen from another document. The main goal is to detect the word by analyzing the writing style using the intrinsic plagiarism detection technique. Text mining is... more

Building Arabic Corpora from Wikisource

by Imene Bensalem and

2013, 10th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA'13)

This paper describes a new tool that helps extract clean text from the Arabic Wikisource dump in order to build corpora. The tool's purpose is illustrated by the generation of a subcorpus from Wikisource, which is a step towards the... more

believe that the tool described here will save researchers time and effort during the process of building Arabic corpora. The tool (see Fig. 1) is based on a Perl script that applies the following main operations to the Wikisource dump:

Arabic Language NLP

Key research themes

1. How can linguistic lexicons bridging Modern Standard Arabic, Dialectal Arabic, and English improve NLP performance across Arabic varieties?

2. What role do large-scale Arabic text corpora play in advancing NLP applications and linguistic research?

3. How can morphological patterns and multiword expressions enhance Arabic NLP tool development and accuracy?
