Traitement Automatique Du Dialecte Tunisien : Construction De Ressources Linguistiques
This thesis falls within the field of spoken language processing and deals with the creation of linguistic resources for spoken Tunisian Arabic. First, we described a method for creating the STAC corpus (Spoken Tunisian Arabic Corpus). The method starts with the definition of two orthographic transcription conventions for writing dialectal words and annotating the phenomena that arise from spontaneous speech. Then, we used the STAC corpus and a root-pattern lexicon of Modern Standard Arabic to create a lexicon for Tunisian Arabic. This lexicon was used to analyze Tunisian Arabic morphologically. To resolve the ambiguity produced by morphological analysis, we proposed a statistical method that selects a single correct analysis for a word in a given sentence. Finally, we proposed a hybrid method, based on a set of contextual rules and a statistical method, to detect sentence boundaries in Tunisian Arabic. The evaluation results show that the methods proposed for developing resources for the Tunisian dialect are promising and can be exploited to build methods for the automatic detection and correction of disfluencies.
In this paper, we study the problem of syntactic analysis of Dialectal Arabic (DA). Corpora are an important resource for the automatic processing of languages. We therefore propose a method for creating a treebank for Tunisian Arabic (TA), the “Tunisian Treebank”, in order to adapt an Arabic parser to TA, which is considered a variant of the Arabic language.
Sentence boundary detection (SBD) is an essential step for a very large number of natural language processing applications such as parsing, information retrieval, automatic summarization, machine translation, etc. In this paper, we tackle the problem of SBD for dialectal Arabic, specifically the Tunisian dialect. We compare the effectiveness of three learning algorithms, Deep Neural Networks (DNN), Support Vector Machines (SVM) and Conditional Random Fields (CRF), at detecting the boundaries of sentences written in different types of dialect. The best model, based on CRF, achieved an F-measure of 84.37%. CRF is a popular formalism for structured prediction in NLP and has been widely applied to text segmentation.
ACM Transactions on Asian and Low-Resource Language Information Processing
Tokenization is the segmentation of a piece of text into smaller units called tokens. Since Arabic is an agglutinating language by nature, tokenization is a crucial preprocessing step for many Natural Language Processing (NLP) applications such as morphological analysis, parsing, machine translation, information extraction, etc. In this paper, we investigate the word tokenization task together with a rewriting process that rewrites the orthography of the stem. For this task, we use Tunisian Arabic (TA) text. To the best of our knowledge, this is the first study that uses Tunisian Arabic for word tokenization. We therefore start by collecting and preparing various TA corpora from different sources. Then, we present a comparison of three character-based tokenizers based on Conditional Random Fields (CRF), Support Vector Machines (SVM) and Deep Neural Networks (DNN). The best proposed model, using CRF, achieved an F-measure of 88.9%.
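As an illustration of what "character-based" tokenization can mean, the task can be framed as labeling each character as the beginning (B) or inside (I) of a token. The sketch below is an assumption about that framing, not the paper's implementation: the function names, the B/I label set, the context-window features and the transliterated example string are all hypothetical, the classifier itself (CRF, SVM or DNN) is left out, and the stem-rewriting step is not covered.

```python
def encode(segmented, sep="+"):
    """Turn a segmented word such as 'w+ktb+h' into characters and B/I labels."""
    chars, labels = [], []
    for morph in segmented.split(sep):
        for i, ch in enumerate(morph):
            chars.append(ch)
            # First character of each segment opens a new token.
            labels.append("B" if i == 0 else "I")
    return chars, labels


def decode(chars, labels):
    """Rebuild tokens from characters and predicted B/I labels."""
    tokens = []
    for ch, lab in zip(chars, labels):
        if lab == "B" or not tokens:
            tokens.append(ch)
        else:
            tokens[-1] += ch
    return tokens


def char_features(chars, i, window=1):
    """Hypothetical features for character i: itself plus its neighbours."""
    feats = {"c0": chars[i]}
    for k in range(1, window + 1):
        feats[f"c-{k}"] = chars[i - k] if i - k >= 0 else "<pad>"
        feats[f"c+{k}"] = chars[i + k] if i + k < len(chars) else "<pad>"
    return feats


# Illustrative transliterated string, not taken from the paper's corpora.
chars, labels = encode("w+ktb+h")
tokens = decode(chars, labels)
```

A sequence model would be trained to predict `labels` from the per-character features, and `decode` would then turn its predictions back into tokens.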
Arabic dialects have no standard spelling systems. Arbitrary transcription of dialect words produces a variety of orthographic forms, which causes problems for natural language processing (NLP). In this paper, we present an automatic process for normalizing spontaneously spelled Tunisian Arabic (TA) into a conventional orthography, CODA-TA [1]. We show that rule-based and statistical methods can reduce transcription errors by 77.73% over the baseline on an unseen test set.
Improvement of the COTA-Orthography system through language modeling
2021 IEEE/ACS 18th International Conference on Computer Systems and Applications (AICCSA), 2021
The lack of a single standard orthography leads to multiple written forms of the same word. This orthographic inconsistency is a frequent issue for Natural Language Processing (NLP). In this paper, we present a contextual method based on the orthography convention CODA-TUN [34] to improve the semi-automatic normalization tool COTA Orthography [7], [25]. Our method targets words with multiple possible corrections, which this system handles only partially. To this end, we trained and refined a trigram language model on a large corpus. We also introduced a generative algorithm that retrieves candidates for sentences containing the target words. The correct correction is then selected using the trigram model. The evaluation results show that the selection accuracy reaches 79.38%.
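The selection step described above can be sketched roughly as follows: generate every candidate sentence from the possible corrections of each ambiguous word, score each candidate with a trigram language model, and keep the best one. This is a minimal sketch under stated assumptions, not the COTA Orthography implementation: the function names, the add-one smoothing, the exhaustive candidate generation and the placeholder tokens are all hypothetical choices made for illustration.

```python
import itertools
import math
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word contexts over padded token lists."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        vocab.update(toks)
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            ctx[tuple(toks[i - 2:i])] += 1
    return tri, ctx, len(vocab)


def log_prob(sent, model):
    """Add-one smoothed trigram log-probability of a token list."""
    tri, ctx, vocab_size = model
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    total = 0.0
    for i in range(2, len(toks)):
        num = tri[tuple(toks[i - 2:i + 1])] + 1
        den = ctx[tuple(toks[i - 2:i])] + vocab_size
        total += math.log(num / den)
    return total


def generate_candidates(tokens, corrections):
    """Expand each ambiguous token into all of its possible corrections."""
    options = [corrections.get(t, [t]) for t in tokens]
    return [list(c) for c in itertools.product(*options)]


def select_best(tokens, corrections, model):
    """Return the candidate sentence with the highest trigram score."""
    candidates = generate_candidates(tokens, corrections)
    return max(candidates, key=lambda s: log_prob(s, model))


# Placeholder tokens stand in for real words; "x" has two possible corrections.
model = train_trigram([["a", "b", "c"], ["a", "b", "c"], ["a", "d", "e"]])
best = select_best(["a", "x", "c"], {"x": ["b", "d"]}, model)
```

Exhaustive generation grows exponentially with the number of ambiguous words in a sentence, so a real system would likely prune candidates or score them incrementally.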
Treebank Creation and Parser Generation for Tunisian Social Media Text
2020 IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA), 2020
Tunisian Arabic (TA) is a morphologically and syntactically rich dialect, which presents an interesting challenge for Natural Language Processing (NLP) tasks such as part-of-speech tagging, parsing, semantic analysis, etc. It is classified as a low-resource language. Tunisians use it in daily communication, social media exchanges, etc. In this paper, we focus on the creation of linguistic resources and tools for Tunisian Arabic. We present the creation of a Tunisian treebank and the generation of a parser for social media texts. We use an existing state-of-the-art parser to build this treebank. Then, we investigate the effects of different data sizes and different combinations of Tunisian dialect forms on automatic parsing.
Corpora are an important resource for natural language processing (NLP). Currently, Dialectal Arabic corpora are somewhat limited, particularly in the case of Tunisian Arabic. In recent years, since the events of the revolution, the increasing presence of spoken Tunisian Arabic in interviews, news and debate programs, the increasing use of language technologies for many spoken languages (e.g., Siri) [6], and the need for work on speech technologies have called for a large amount of well-designed Tunisian spoken corpora. This paper presents the "STAC" corpus (Spoken Tunisian Arabic Corpus) of spontaneous Tunisian Arabic speech. We present the method used to collect and transcribe this corpus. Then, we detail the stages carried out to enrich the corpus with the linguistic and speech annotations that make it more useful for many NLP applications.
Journal of King Saud University - Computer and Information Sciences, 2017
In this paper, we propose a method to disambiguate the output of a morphological analyzer of the Tunisian dialect. We test three machine-learning techniques that classify each morphological analysis of a word token into one of two classes: true or false. The class label is assigned to each analysis according to the context of the corresponding word in the sentence. In failure cases, we combine the results of the proposed techniques with a bigram classifier to choose only one analysis for a given word. We disambiguate the output of Al-Khalil-TUN, the morphological analyzer of the Tunisian dialect (Zribi et al., 2013b). We use the Spoken Tunisian Arabic Corpus STAC (Zribi et al., 2015) to train and test our method. The evaluation shows that the proposed method achieves an accuracy of 87.32%.
Orthographic Transcription for Spoken Tunisian Arabic
Lecture Notes in Computer Science, 2013
Transcribing spoken Arabic dialects is an important task for building speech corpora. It is therefore necessary to follow a well-defined orthography and a well-defined annotation scheme when transcribing speech data. In this paper, we present OTTA, Orthographic Transcription for Tunisian Arabic. This convention uses rules based on the standard Arabic transcription conventions, and we define a set of additional conventions that preserve the particularities of the Tunisian dialect.
Papers by Inès Zribi