Parsing Arabic Dialects

2006

Abstract

The Arabic language is a collection of spoken dialects with important phonological, morphological, lexical, and syntactic differences, along with a standard written language, Modern Standard Arabic (MSA). Since the spoken dialects are not officially written, it is very costly to obtain adequate corpora to use for training dialect NLP tools such as parsers. In this paper, we address the problem of parsing transcribed spoken Levantine Arabic (LA). We do not assume the existence of any annotated LA corpus (except for development and testing), nor of a parallel corpus LA-MSA. Instead, we use explicit knowledge about the relation between LA and MSA.
