Digital Language Resources for Albanian
2023, Innovative Paths of Albanology
https://doi.org/10.3726/B19908
Abstract
Albanian remains one of the European languages with the least developed corpus-linguistic resources. This contribution surveys the freely accessible resources currently available and describes their characteristics. It shows that, at present, the main obstacle to the development of usable corpora is the lack of sound training corpora. The paper then turns to the basics of linguistic annotation and, in particular, to an analysis of the tagsets available for Albanian. Four approaches are presented and compared: MULTEXT-East (MTE), Universal Dependencies (UD), Kabashi's tagset, and the tagset of the Russian-built Albanian National Corpus (ANC). The focus is on the difficulties that arise when a concrete technical solution is sought and on how each approach resolves them. The approaches differ in their theoretical foundations and in the uses they are designed for. Nevertheless, they are broadly compatible and mutually translatable in how they capture the morphosyntactic system of Albanian. Any further development should therefore build on the widest possible integration of the existing resources and of the linguistic community. The article thus aims to give an overview of the current state of Albanian corpus linguistics and to show how progress can be made in this field.
FAQs
What are the main challenges in developing Albanian language corpora?
The primary obstacles are the lack of well-annotated training corpora and the small size of the available resources: no corpus exceeds 83 million tokens.
How do the tagsets for Albanian differ in their objectives?
The ANC and MTE tagsets focus on morphological detail while UD emphasizes syntactic dependencies; for instance, ANC lacks syntactic information found in UD.
What role do training corpora play in the development of Albanian language resources?
Manually annotated training corpora are crucial for training taggers; currently, only one small corpus by Kabashi exists, comprising 2,000 sentences and 31,000 tokens.
When were key tagsets for Albanian language developed?
Recent tagsets were developed by Nelda Kote and Marsida Toska around 2018, in conjunction with Universal Dependencies.
Why is the current state of Albanian lexicons limited?
Currently, only two lexicons exist, with 125,000 and 220,000 word forms respectively; both have multiple ambiguities complicating morphological interpretation.
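The contrast drawn above between positional morphosyntactic tags (as in MULTEXT-East) and UD-style feature bundles, and the claim that the tagsets are broadly translatable, can be illustrated with a minimal sketch. The slot table below is a simplified, hypothetical layout for noun tags only; it is not the actual Albanian specification of any of the tagsets discussed, and the function name `msd_to_ud` is an assumption made for this example.

```python
# Illustrative only: a simplified, HYPOTHETICAL slot layout for noun tags.
# It does not reproduce the real MULTEXT-East or ANC specification for Albanian.
NOUN_SLOTS = [
    # (attribute name, {positional code: value})
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "Masc", "f": "Fem", "n": "Neut"}),
    ("Number", {"s": "Sing", "p": "Plur"}),
    ("Case", {"n": "Nom", "g": "Gen", "d": "Dat", "a": "Acc", "b": "Abl"}),
    ("Definite", {"y": "Def", "n": "Ind"}),
]


def msd_to_ud(msd):
    """Translate a positional noun tag into a (UPOS, FEATS) pair.

    The first character names the category; each following character
    fills one slot of NOUN_SLOTS, with '-' meaning 'not specified'.
    """
    if not msd or msd[0] != "N":
        raise ValueError(f"only noun tags are covered in this sketch: {msd!r}")
    codes = msd[1:]
    if len(codes) > len(NOUN_SLOTS):
        raise ValueError(f"too many positions in tag: {msd!r}")

    upos = "NOUN"
    feats = {}
    for (name, table), code in zip(NOUN_SLOTS, codes):
        if code == "-":
            continue  # slot left unspecified
        if code not in table:
            raise ValueError(f"unknown code {code!r} for {name} in {msd!r}")
        value = table[code]
        if name == "Type":
            # In UD the common/proper distinction lives in the part of speech.
            if value == "proper":
                upos = "PROPN"
            continue
        feats[name] = value

    # UD lists features alphabetically, joined by '|'; '_' marks an empty set.
    feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return upos, feats_str


print(msd_to_ud("Ncmsny"))
print(msd_to_ud("Npfs"))
```

The sketch covers only the morphological side of the mapping. Syntactic dependency relations, which UD adds and a purely morphological tagset lacks, cannot be derived from the tag alone.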
References
- Archangel'skij, Timofej A.: Electronic Corpora of the Albanian, Kalmyk, Lezgian, and Ossetic Languages. In: Automatic Documentation and Mathematical Linguistics 46 (2012) 2, 118-123.
- Buchholz, Oda/Wilfried Fiedler: Albanische Grammatik. 1. Aufl. Leipzig: Verlag Enzyklopädie, 1987.
- Caka, Nebi/Ali Caka: Korpusi i gjuhës shqipe - rezultatet e para, problemet dhe detyrat. In: Rexhep Ismajli (Hg.): Shqipja dhe gjuhët e Ballkanit / Albanian and Balkan Languages. Prishtinë / Tiranë: Akademia e Shkencave dhe e Arteve e Kosovës / Akademia e Shkencave e Shqipërisë, 2012, 643-656.
- EAGLES: Expert Advisory Group on Language Engineering Standards, 1996.
- Erjavec, Tomaž: MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In: Proceedings of the seventh international conference on language resources and evaluation (LREC'10), Valletta, 2010.
- Erjavec, Tomaž: MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. In: Language Resources and Evaluation 46 (2012) 1, 131-142.
- Hasanaj, Besmir: A Part of Speech Tagging Model for Albanian. Saarbrücken: Lambert Academic Publishing, 2012.
- Haspelmath, Martin: Pre-established categories don't exist: Consequences for language description and typology. In: Linguistic Typology 11 (2007) 1, 119-132.
- Ide, Nancy/Jean Véronis: Multext (multilingual tools and corpora). In: Proceedings of the 15th international conference on computational linguistics (CoLing'94), 1994, 90-96.
- Kabashi, Besim: Pronominal clitics and Valency in Albanian. A computational linguistics perspective and modelling within the LAG-Framework. In: Thomas Herbst/Katrin Götz-Votteler (Hg.): Valency. Theoretical, descriptive and cognitive issues. Berlin / New York: Mouton de Gruyter, 2007, 339-352.
- Kabashi, Besim: Automatische Verarbeitung der Morphologie des Albanischen. Erlangen: FAU University Press, 2015.
- Kabashi, Besim: AlCo - një korpus tekstesh i gjuhës shqipe me njëqind milionë fjalë. In: Seminari XXXVI Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Prishtinë: Universiteti i Prishtinës, 2017.
- Kabashi, Besim/Thomas Proisl: Albanian Part-of-Speech Tagging: Gold Standard and Evaluation. In: Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, 2018, 2593-2599.
- Kadriu, Arbana: NLTK tagger for Albanian using iterative approach. In: Proceedings of the 35th International Conference on Information Technology Interfaces (ITI), 2013.
- Kote, Nelda/Marenglen Biba/Jenna Kanerva u.a.: Morphological Tagging and Lemmatization of Albanian. A Manually Annotated Corpus and Neural Models. arXiv pre-print, 2019.
- Marneffe, Marie-Catherine de/Christopher Manning/Joakim Nivre u.a.: Universal Dependencies. In: Computational Linguistics 47 (2021) 2, 255-308.
- Morozova, Marija S./Timofej A. Archangel'skij/M. A. Daniel' u.a.: Albanskij nacional'nyj korpus: osnovnye napravlenija raboty. In: N. N. Kazansky (Hg.): Acta linguistica Petropolitana. Trudy Instituta lingvisticheskix issledovaniĭ. Sankt-Peterburg: "Nauka" (2016) 3, 169-189.
- Morozova, Maria/Aleksandër Rusakov: Korpusi elektronik i shqipes: përpunimi, përmbajtja dhe përdorimi. In: Bardh Rugova (Hg.): Seminari XXXII Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Prishtinë: Universiteti i Prishtinës, 2013, 85-96.
- Piton, Odile/Klara Lagji: Morphological study of Albanian words, and processing with NooJ. In: Proceedings of the 2007 International NooJ Conference, 2008, 189-205.
- Tiedemann, Jörg: Parallel Data, Tools and Interfaces in OPUS. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012), 2012.
- Toska, Marsida/Joakim Nivre/Daniel Zeman: Universal Dependencies for Albanian. In: Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020). Barcelona, Spain (Online), 2020, 178-188.
- Trommer, Jochen/Dalina Kallulli: A morphological analyzer for standard Albanian. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), 2004, 1271-1274.