Word Representation Models for Arabic Dialect Identification
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
https://doi.org/10.18653/V1/2022.WANLP-1.52
Abstract
This paper describes the systems submitted by the BFCAI team to the Nuanced Arabic Dialect Identification (NADI) 2022 shared task. Dialect identification aims to automatically detect the source variant of a given text or speech segment. NADI 2022 has two subtasks: the first covers country-level dialect identification and the second covers sentiment analysis. Our team participated in the first subtask. The proposed systems use Term Frequency/Inverse Document Frequency (TF/IDF) and word embeddings as vectorization models, combined with several machine learning algorithms as classifiers. The systems were evaluated on two test sets, Test-A and Test-B, where they achieved macro-F1 scores of 21.25% and 9.71%, respectively. By comparison, the best-performing submitted system achieved macro-F1 scores of 36.48% and 18.95% on Test-A and Test-B, respectively.
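As a rough illustration of two ideas named in the abstract, the sketch below computes TF/IDF weights for whitespace-tokenized texts and the macro-F1 score of a set of predictions. It is not the authors' implementation: the function names `tfidf_vectors` and `macro_f1`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for brevity, and the abstract does not specify which variant the submitted systems used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {token: weight} dict per document (hypothetical helper).

    Assumes whitespace tokenization and a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1, which is one common variant.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            tok: (count / len(toks)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1)
            for tok, count in tf.items()
        })
    return vectors


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy usage with made-up tokens and country labels.
vectors = tfidf_vectors(["w1 w2", "w1 w3"])
score = macro_f1(["Egypt", "Egypt", "Iraq", "Iraq"],
                 ["Egypt", "Iraq", "Iraq", "Iraq"])
```

Because macro-F1 averages per-class scores without weighting by class size, poor performance on rarely seen country labels lowers the overall score as much as poor performance on frequent ones.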