Inès Zribi

TTK: A toolkit for Tunisian linguistic analysis

Computer Speech & Language, Dec 31, 2023

Named Entity Recognition of Tunisian Arabic Using the Bi-LSTM-CRF Model

International Journal on Artificial Intelligence Tools, Sep 27, 2023

Traitement Automatique Du Dialecte Tunisien : Construction De Ressources Linguistiques

This thesis addresses the creation of linguistic resources for spoken Tunisian Arabic, within the field of spoken-language processing. First, we described a method for creating the STAC corpus (Spoken Tunisian Arabic Corpus). The method started with the definition of two orthographic transcription conventions for writing dialectal words and annotating spontaneous speech phenomena. Then, we proposed a method for creating a Tunisian Arabic lexicon based on the STAC corpus and a root-pattern lexicon of Modern Standard Arabic. This lexicon was used for the morphological analysis of Tunisian Arabic. To resolve the ambiguity produced by morphological analysis, we proposed a statistical method that selects a single correct analysis for a word in a given sentence. Finally, we proposed a hybrid method, combining a set of contextual rules with a statistical method, to detect sentence boundaries in the Tunisian dialect. The evaluation results show that the methods proposed for developing resources for the Tunisian dialect are promising and can be exploited to build methods for the automatic detection and correction of disfluencies.

Reconnaissance des entités nommées pour la langue arabe

Automatic processing of Tunisian dialect: construction of linguistic resources

Syntactic Analysis of the Tunisian Arabic

by Asma Mekki and Inès Zribi

In this paper, we study the problem of syntactic analysis of Dialectal Arabic (DA). Corpora are an important resource for the automatic processing of languages. We therefore propose a method for creating a treebank for Tunisian Arabic (TA), the “Tunisian Treebank”, in order to adapt an Arabic parser to TA, which is considered a variant of the Arabic language.

Sentence boundary detection of various forms of Tunisian Arabic

by Asma Mekki and Inès Zribi

Sentence boundary detection (SBD) is an essential step for many natural language processing applications such as parsing, information retrieval, automatic summarization, and machine translation. In this paper, we tackle the problem of SBD for dialectal Arabic, specifically the Tunisian dialect. We compare the efficiency of three learning algorithms, Deep Neural Networks (DNN), Support Vector Machines (SVM) and Conditional Random Fields (CRF), in detecting the boundaries of sentences written in different types of dialect. The best model achieved an F-measure of 84.37% using CRF, a popular formalism for structured prediction in NLP that has been widely applied to text segmentation.
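As a reading aid for the F-measure figures reported in these abstracts, the following is a minimal sketch of how an F-measure over predicted sentence boundaries can be computed. The function name `f_measure`, the representation of boundaries as sets of token positions, and the toy values are assumptions for illustration; the abstract does not describe the authors' evaluation code.

```python
from typing import Set


def f_measure(gold: Set[int], pred: Set[int], beta: float = 1.0) -> float:
    """F-measure between gold and predicted boundary positions.

    Positions are token indices after which a sentence ends
    (a hypothetical representation chosen for this sketch).
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy data: 3 gold boundaries, 4 predicted, 2 of them correct.
gold_boundaries = {4, 9, 15}
predicted_boundaries = {4, 9, 12, 20}
score = f_measure(gold_boundaries, predicted_boundaries)
```

With `beta = 1.0` this is the harmonic mean of precision and recall, which is the usual reading of "F-measure" when no weighting is stated.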

Treebank Creation and Parser Generation for Tunisian Social Media Text

by Asma Mekki and Inès Zribi

2020 IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA), 2020

Tunisian Arabic (TA) is a morphologically and syntactically rich dialect that poses an interesting challenge for Natural Language Processing (NLP) tasks such as part-of-speech tagging, parsing, and semantic analysis. It is classified as a low-resource language. Tunisians use it in daily communication and social media exchanges. In this paper, we focus on creating linguistic resources and tools for Tunisian Arabic. We present the creation of a Tunisian treebank and a parser for social media texts. We use an existing state-of-the-art parser to build this treebank. Then, we investigate the effects of different data sizes and different combinations of Tunisian dialect forms on automatic parsing.

Sarcasm Detection in Tunisian Social Media Comments: Case of COVID-19

by Asma Mekki and Inès Zribi

Lecture Notes in Computer Science, 2022

Tokenization of Tunisian Arabic: a comparison between three Machine Learning models

by Asma Mekki and Inès Zribi

ACM Transactions on Asian and Low-Resource Language Information Processing

Tokenization is the segmentation of a piece of text into smaller units called tokens. Since Arabic is an agglutinative language by nature, tokenization is a crucial preprocessing step for many Natural Language Processing (NLP) applications such as morphological analysis, parsing, machine translation, and information extraction. In this paper, we investigate the word tokenization task together with a rewriting process that rewrites the orthography of the stem. For this task, we use Tunisian Arabic (TA) text. To the best of our knowledge, this is the first study that uses Tunisian Arabic for word tokenization. We therefore start by collecting and preparing various TA corpora from different sources. Then, we present a comparison of three character-based tokenizers based on Conditional Random Fields (CRF), Support Vector Machines (SVM) and Deep Neural Networks (DNN). The best model, based on CRF, achieved an F-measure of 88.9%.
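One common way to frame character-based tokenization is as sequence labeling: each character receives a label saying whether it begins a token or continues one, and a classifier such as a CRF predicts those labels. The sketch below shows only the encoding and decoding steps, under stated assumptions: the `B`/`I` label set, the `+` segmentation marker, the function names, and the transliterated toy string are all hypothetical, and the stem-rewriting step mentioned in the abstract is not modeled.

```python
from typing import List, Sequence, Tuple


def to_char_labels(segmented: str, sep: str = "+") -> Tuple[List[str], List[str]]:
    """Turn a segmented word into per-character B/I labels.

    "B" marks the first character of a token, "I" any other character.
    """
    chars: List[str] = []
    labels: List[str] = []
    for token in segmented.split(sep):
        for i, ch in enumerate(token):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def from_char_labels(chars: Sequence[str], labels: Sequence[str]) -> List[str]:
    """Rebuild tokens from characters and (predicted) B/I labels."""
    tokens: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not tokens:
            tokens.append(ch)  # start a new token
        else:
            tokens[-1] += ch   # extend the current token
    return tokens


# Toy example with a made-up transliterated form: proclitic + stem + enclitic.
chars, labels = to_char_labels("w+ktb+hA")
tokens = from_char_labels(chars, labels)
```

In a full system, the `labels` sequence would come from a trained model rather than from a gold segmentation; the decoding function stays the same.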

Improvement of the COTA-Orthography system through language modeling

by Asma Mekki and Inès Zribi

2021 IEEE/ACS 18th International Conference on Computer Systems and Applications (AICCSA), 2021

The lack of a single standard orthography leads to multiple written forms of the same word. This orthographic inconsistency is a frequent issue for Natural Language Processing (NLP). In this paper, we present a contextual method based on the orthography convention CODA-TUN [34] to improve the semi-automatic normalization tool COTA Orthography [7], [25]. Our method targets words with multiple possible corrections, which this system handles only partially. We therefore trained and improved a trigram language model on a large corpus. We also introduced a generative algorithm that retrieves candidates for sentences containing the target words. The correct correction is selected using the trigram model. The evaluation results show that the selection accuracy reaches 79.38%.
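The selection step described above can be illustrated with a minimal sketch: expand every ambiguous word into its possible corrections, score each resulting candidate sentence with a trigram language model, and keep the best one. Everything here is an assumption for illustration: the class `TrigramLM`, the helpers `expand` and `select`, add-one smoothing, and the English toy corpus stand in for the authors' model, algorithm, and data, none of which are specified in the abstract.

```python
import math
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence

BOS, EOS = "<s>", "</s>"


def pad(tokens: Sequence[str]) -> List[str]:
    """Add two start markers and one end marker for trigram contexts."""
    return [BOS, BOS] + list(tokens) + [EOS]


class TrigramLM:
    """Trigram model with add-one smoothing (a simplification for this sketch)."""

    def __init__(self, corpus: Sequence[Sequence[str]]) -> None:
        self.tri: Counter = Counter()
        self.bi: Counter = Counter()
        self.vocab = {EOS}
        for sent in corpus:
            p = pad(sent)
            self.vocab.update(sent)
            for i in range(2, len(p)):
                self.tri[(p[i - 2], p[i - 1], p[i])] += 1
                self.bi[(p[i - 2], p[i - 1])] += 1

    def logprob(self, sent: Sequence[str]) -> float:
        p = pad(sent)
        v = len(self.vocab)
        total = 0.0
        for i in range(2, len(p)):
            num = self.tri[(p[i - 2], p[i - 1], p[i])] + 1
            den = self.bi[(p[i - 2], p[i - 1])] + v
            total += math.log(num / den)
        return total


def expand(tokens: Sequence[str], corrections: Dict[str, List[str]]) -> List[List[str]]:
    """Generate every candidate sentence from the per-word correction lists."""
    options = [corrections.get(t, [t]) for t in tokens]
    return [list(c) for c in product(*options)]


def select(tokens: Sequence[str], corrections: Dict[str, List[str]], lm: TrigramLM) -> List[str]:
    """Pick the candidate sentence with the highest trigram log-probability."""
    return max(expand(tokens, corrections), key=lm.logprob)


# Toy data only; a real system would train on a large normalized corpus.
lm = TrigramLM([["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]])
best = select(["the", "kat", "sat"], {"kat": ["cat", "cot"]}, lm)
```

Exhaustive expansion grows exponentially with the number of ambiguous words in a sentence, so a practical system would likely prune candidates, for example with beam search.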

Critical description of TA linguistic resources

by Asma Mekki and Inès Zribi

Procedia Computer Science

An Automatic Process for Tunisian Arabic Orthography Normalization

Arabic dialects have no standard spelling systems. Arbitrary transcription of dialect words produces a variety of orthographic forms, which causes problems for natural language processing (NLP). In this paper, we present an automatic process for normalizing spontaneously spelled Tunisian Arabic (TA) into a conventional orthography, CODA-TA [1]. We show that rule-based and statistical methods can reduce transcription errors by 77.73% over the baseline on an unseen test set.

Spoken Tunisian Arabic Corpus “STAC”: Transcription and Annotation

Research in Computing Science, 2015

Corpora are an important resource for natural language processing (NLP). Currently, Dialectal Arabic corpora are somewhat limited, particularly for Tunisian Arabic. In recent years, since the events of the revolution, the increasing presence of spoken Tunisian Arabic in interviews, news and debate programs, the increasing use of language technologies for many spoken languages (e.g., Siri) [6], and the need for work on speech technologies have created demand for a large amount of well-designed Tunisian spoken corpora. This paper presents the "STAC" corpus (Spoken Tunisian Arabic Corpus) of spontaneous Tunisian Arabic speech. We present the method used to collect and transcribe this corpus. Then, we detail the stages carried out to enrich the corpus with the linguistic and speech annotations that make it more useful for many NLP applications.

Morphological disambiguation of Tunisian dialect

Journal of King Saud University - Computer and Information Sciences, 2017

In this paper, we propose a method to disambiguate the output of a morphological analyzer for the Tunisian dialect. We test three machine-learning techniques that classify each morphological analysis of a word token into one of two classes: true or false. The class label is assigned to each analysis according to the context of the corresponding word in the sentence. In failure cases, we combine the results of the proposed techniques with a bigram classifier to choose a single analysis for a given word. We disambiguate the output of the Tunisian dialect morphological analyzer Al-Khalil-TUN (Zribi et al., 2013b). We use the Spoken Tunisian Arabic Corpus STAC (Zribi et al., 2015) to train and test our method. The evaluation shows that the proposed method achieves an accuracy of 87.32%.

Orthographic Transcription for Spoken Tunisian Arabic

Lecture Notes in Computer Science, 2013

Transcribing spoken Arabic dialects is an important task for building speech corpora. It is therefore necessary to follow a well-defined orthography and annotation scheme when transcribing speech data. In this paper, we present OTTA, Orthographic Transcription for Tunisian Arabic. This convention proposes rules based on standard Arabic transcription conventions, and we define a set of conventions that preserve the particularities of the Tunisian dialect.
