Arabic Dialect Identification for Travel and Twitter Text

Proceedings of the Fourth Arabic Natural Language Processing Workshop

https://doi.org/10.18653/V1/W19-4628

Abstract

This paper presents the results of experiments conducted as part of the MADAR Shared Task at WANLP 2019 on Arabic Fine-Grained Dialect Identification. Dialect identification is a prominent task in natural language processing because its output can be used to improve downstream language-processing modules. We explored features such as character and word n-grams and language model probabilities with different classifiers. The results show that these features help improve dialect classification accuracy. They also show that traditional machine learning classifiers tend to perform better than neural network models on this task in a low-resource setting.
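To make the idea of character n-gram features feeding a traditional classifier concrete, here is a minimal sketch in plain Python. It is not the system described in the paper: the abstract does not say which classifiers were used, so a small multinomial Naive Bayes with add-one smoothing stands in as an assumed example of a traditional model. The names `char_ngrams` and `NgramNaiveBayes`, the n-gram range of 1 to 3, and the two toy sentences with their city labels are all hypothetical and chosen only for illustration; they are not drawn from the MADAR corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of `text` for n in [n_min, n_max]."""
    padded = f" {text.strip()} "  # pad so that word boundaries become features
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (assumed stand-in)."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.label_counts = Counter(labels)  # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def log_scores(self, text):
        """Log prior plus smoothed log likelihood of `text` under each label."""
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values())
            score = math.log(self.label_counts[label] / total_docs)
            for gram in char_ngrams(text):
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((counter[gram] + 1) / (total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy, made-up training data (hypothetical; not from the shared task).
train_texts = ["ازيك عامل ايه", "كيفك شو اخبارك"]
train_labels = ["Cairo", "Beirut"]

clf = NgramNaiveBayes().fit(train_texts, train_labels)
print(clf.predict("ازيك"))
print(clf.predict("كيفك"))
```

The per-label scores returned by `log_scores` behave like character-level language model log probabilities, so in a larger pipeline they could plausibly be passed to another classifier as additional features; that use is an assumption here, not a description of the paper's setup.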
