Word Representation Models for Arabic Dialect Identification

Mahmoud Sobhy

doi:10.18653/V1/2022.WANLP-1.52

Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Abstract

This paper describes the systems submitted by the BFCAI team to the Nuanced Arabic Dialect Identification (NADI) 2022 shared task. Dialect identification aims to automatically detect the source variety of a given text or speech segment. NADI 2022 has two subtasks: the first covers country-level dialect identification and the second covers sentiment analysis. Our team participated in the first subtask. The proposed systems use Term Frequency/Inverse Document Frequency (TF/IDF) and word embeddings as vectorization models, combined with several machine learning algorithms as classifiers. The systems were evaluated on two test sets, Test-A and Test-B. The proposed models achieved macro-F1 scores of 21.25% and 9.71% on Test-A and Test-B, respectively. By comparison, the best-performing submitted system achieved macro-F1 scores of 36.48% and 18.95% on Test-A and Test-B, respectively.
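To make the two technical ingredients named in the abstract concrete, the following is a minimal sketch of TF/IDF weighting and of the macro-F1 metric, written with the Python standard library only. It is an illustration under stated assumptions, not the team's implementation: the abstract does not give the exact TF/IDF variant, tokenization, or classifier settings, so the whitespace tokenizer, the `tf * log(N / df)` weighting, and the function names `tfidf_vectors` and `macro_f1` are all hypothetical choices.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumptions (not taken from the paper): whitespace tokenization,
    tf = count / document length, idf = log(N / document frequency).
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: placeholder tokens and country codes.
vectors = tfidf_vectors(["a b", "a c"])
score = macro_f1(["EG", "EG", "SA", "SA"], ["EG", "SA", "SA", "SA"])
```

Because macro-F1 averages per-class scores without weighting by class size, poor performance on rare country labels lowers the overall score as much as poor performance on frequent ones, which is one reason scores on this kind of task can be low.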

