Arabic dialects identification: North African dialects case study
2020
Abstract
Arabic is the fourth most used language on the Internet and the official language of more than 20 countries. It has three main varieties: Modern Standard Arabic, which is used in books, news, and education; local dialects, which vary from one region to another; and Classical Arabic, the written language of the Quran. Maghrebi Arabic is the group of dialects spoken in North African countries, where internet users feel more comfortable writing in local slang than in standard Arabic. In this study, we present a large dataset of regional dialects from three countries, namely Algeria, Tunisia, and Morocco, and we investigate the identification of each dialect using machine learning classifiers with TF-IDF features. The approach shows promising results, achieving an accuracy of up to 96%.
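To make the idea of classifying dialects from TF-IDF features concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the abstract does not name the classifiers or the tokenization, so the sketch assumes whitespace word tokens and swaps in a simple nearest-centroid classifier. The training strings, the labels `DZ`, `MA`, and `TN`, and all function names are hypothetical illustrations, not samples from the paper's dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace word tokens; the paper does not specify the unit.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency over the training documents.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    # L2-normalised TF-IDF vector stored as a sparse dict; unseen terms are dropped.
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {term: v / norm for term, v in vec.items()} if norm else {}


def fit_centroids(docs, labels, idf):
    # Average the TF-IDF vectors of each class (nearest-centroid stand-in classifier).
    sums = defaultdict(lambda: defaultdict(float))
    per_class = Counter(labels)
    for doc, label in zip(docs, labels):
        for term, v in tfidf(doc, idf).items():
            sums[label][term] += v
    return {
        label: {term: v / per_class[label] for term, v in vec.items()}
        for label, vec in sums.items()
    }


def predict(text, idf, centroids):
    # Pick the class whose centroid has the largest dot product with the text vector.
    vec = tfidf(text, idf)

    def score(label):
        centroid = centroids[label]
        return sum(v * centroid.get(term, 0.0) for term, v in vec.items())

    return max(centroids, key=score)


# Toy, hand-written transliterated strings (hypothetical, not from the dataset).
train_texts = [
    "wach rak khoya", "wach rak lyoum",   # labelled Algerian for illustration
    "daba nji", "kidayr daba",            # labelled Moroccan for illustration
    "barcha tawa", "chnowa tawa",         # labelled Tunisian for illustration
]
train_labels = ["DZ", "DZ", "MA", "MA", "TN", "TN"]

idf = fit_idf(train_texts)
centroids = fit_centroids(train_texts, train_labels, idf)
print(predict("wach rak", idf, centroids))
```

On real data the same structure applies, with the toy lists replaced by the labelled text sequences and the stand-in classifier replaced by whichever model is being evaluated. Note that a text with no known terms scores zero for every class, so this sketch falls back to the first label.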
Key takeaways
- The study achieves 96% accuracy in identifying Maghrebi dialects using machine learning classifiers and TF-IDF features.
- A dataset of 60,000 text sequences from Algeria, Tunisia, and Morocco supports dialect identification research.
- Maghrebi dialects lack formal grammar rules, complicating their identification compared to Modern Standard Arabic.
- Preprocessing steps, including stop word removal and sentence splitting, enhance dataset quality for model training.
- Future work will explore word embedding features for various NLP tasks beyond dialect identification.
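The preprocessing steps named in the takeaways, sentence splitting and stop word removal, can be sketched as follows. This is an illustrative assumption of how such a step might look, not the authors' code: the delimiter set, the `STOP_WORDS` entries, and the function names are hypothetical placeholders, and a real stop word list would have to be built per dialect.

```python
import re

# Hypothetical placeholder stop words in Latin transliteration.
STOP_WORDS = {"w", "f", "li"}

# Assumed sentence delimiters: Latin and Arabic punctuation plus line breaks.
SENTENCE_END = re.compile(r"[.!?؟\n]+")


def split_sentences(text):
    # Break raw text into non-empty, stripped sentences.
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    # Keep only the tokens that are not in the stop word set.
    return [tok for tok in sentence.split() if tok.lower() not in stop_words]


def preprocess(text):
    # One cleaned text sequence per sentence; empty results are skipped.
    sequences = []
    for sentence in split_sentences(text):
        tokens = remove_stop_words(sentence)
        if tokens:
            sequences.append(" ".join(tokens))
    return sequences


print(preprocess("wach rak w labas? ana f dar."))
```

Each returned string would then count as one text sequence for feature extraction.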