Digital Language Resources for Albanian

Philipp Wasserscheidt

doi:10.3726/B19908

2023, Innovative Paths of Albanology

21 pages

Abstract

Albanian remains one of the European languages with the least developed corpus-linguistic resources. This contribution gives an overview of the freely accessible resources currently available and their characteristics. It shows that, at present, it is mainly the lack of sound training corpora that hinders the development of usable corpora. The paper further focuses on the basics of linguistic annotation and, in particular, on an analysis of the tagsets available for Albanian. Four approaches are presented and compared: MULTEXT-East, Universal Dependencies, Kabashi's tagset, and the tagset of the Russian Albanian National Corpus. The focus is on explaining the difficulties that can arise in the search for concrete technical solutions and on presenting how each approach resolves them. It is shown that the approaches differ in their basic theoretical and usage-oriented outlook. Nevertheless, they are broadly compatible and mutually translatable in how they capture the morphosyntactic system of Albanian. Thus, any further development should build on the widest possible integration of the existing resources as well as of the linguistic community. This article therefore aims to provide an overview of the current state of Albanian corpus linguistics and to show how progress can be made in this field.

FAQs

What are the main challenges in developing Albanian language corpora?

The primary obstacles include the lack of well-annotated training corpora and the small size of available resources, notably with no corpus exceeding 83 million tokens.

How do the tagsets for Albanian differ in their objectives?

The ANC and MTE tagsets focus on morphological detail while UD emphasizes syntactic dependencies; for instance, ANC lacks syntactic information found in UD.

What role do training corpora play in the development of Albanian language resources?

Manually annotated training corpora are crucial for training taggers; currently, only a small corpus by Kabashi exists, comprising 2000 sentences and 31,000 tokens.

When were key tagsets for the Albanian language developed?

Recent tagsets were developed by Nelda Kote and Marsida Toska around 2018, in conjunction with Universal Dependencies.

Why is the current state of Albanian lexicons limited?

Currently, only two lexicons exist, with 125,000 and 220,000 word forms respectively; both have multiple ambiguities complicating morphological interpretation.

Figures (4)

4 Abbreviations:
Type: NE = proper name; Het = noun with heterogeneous/ambiguous gender
Gender: m = Masc = masculine; f = Fem = feminine; n = neuter; NHg = mf = noun with heterogeneous/ambiguous gender
Number: s = singular; p = plural
Case: n = nominative; g = gen = genitive; d = dat = dative; a = acc = accusative; abl = b = ablative; abl2 = ablative II; loc = locative; unmkd = unmarked case
Definiteness: Def = definite; Ind = indefinite
Prearticulated: NA = noun preceded by an article

[…] linguistic reality can certainly pose a challenge. In the MTE tagset there are 42 different attributes (see e.g. the following two tables) with 145 attribute values. This results in a total of 984 different combinations of specifications. However, this number of combinations is greater than the actual variety of forms in Albanian. Based on the specifications, a lexicon with a total of 221,759 word forms has been created. In reality, the lexicon consists of only 36,699 distinct word forms, a ratio of 1:6.0. In comparison, the Romanian tagset has 205 attribute values and 616 combinations, giving a ratio of 1:1.3. The challenge in developing a tagset is therefore to strike a good balance between the individual perspectives. In a relatively small scientific community […]
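
The 1:6.0 ratio quoted above can be checked with simple arithmetic. The following is a minimal sketch: the function name `ambiguity_ratio` is purely illustrative and not part of any MULTEXT-East tooling, and the two counts are the ones stated in the text.

```python
def ambiguity_ratio(lexicon_entries: int, distinct_forms: int) -> float:
    """Average number of lexicon entries per distinct word form."""
    if distinct_forms <= 0:
        raise ValueError("distinct_forms must be positive")
    return lexicon_entries / distinct_forms


# Counts quoted in the text for the Albanian MTE lexicon.
albanian = ambiguity_ratio(221_759, 36_699)
print(f"1:{albanian:.1f}")
```

Read this way, each distinct word form corresponds on average to about six lexicon entries, which is one way to quantify the ambiguity the tagset introduces.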

Two things are noteworthy here. First, it is easy to see that Kabashi's tagset does not mark morphological details but introduces some attributes that indicate syntactic context, especially the presence of clitics before verbs. It should be noted that Kabashi's tagset also differs from the others in that, just like the STTS tagset, it does not systematise the encoded information below the level of parts of speech but consists only of a list of tags. Therefore, only coded features are displayed. All word forms that do not fall into the classes covered by the tags are simply marked as V. In the other tagsets, all possible values are generally indicated and annotated. Secondly, it is noticeable that the division or categorisation of attributes is in […]
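
The contrast between a flat tag list and a systematised tagset can be sketched in code. This is only an illustration under stated assumptions: the positional layout below is loosely modelled on the MULTEXT-East convention of one character per attribute with "-" for "not applicable", but the attribute order, the value codes and the flat labels are hypothetical and are not taken from any of the tagsets discussed.

```python
# Hypothetical positional layout for a noun tag (assumption for illustration):
# one character per attribute, "-" meaning "not applicable".
NOUN_LAYOUT = [
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "b": "ablative"}),
]


def decode_positional(tag: str) -> dict:
    """Decode a structured noun tag such as 'Nfpa' into attribute-value pairs."""
    if not tag.startswith("N"):
        raise ValueError("this sketch only covers noun tags")
    features = {"PoS": "noun"}
    for (attribute, values), code in zip(NOUN_LAYOUT, tag[1:]):
        if code != "-":
            features[attribute] = values[code]
    return features


# A flat tagset is a plain enumeration: a tag has no internal structure,
# so the only possible operation is looking the label up.
FLAT_TAGS = {"N": "noun", "V": "verb"}  # hypothetical labels

print(decode_positional("Nfpa"))
print(FLAT_TAGS["V"])
```

With a structured tag, a query such as "all feminine plural forms" can be answered from the tag alone. With a flat list, the same query requires an external mapping from each tag to its features.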

References (20)

Archangel'skij, Timofej A.: Electronic Corpora of the Albanian, Kalmyk, Lezgian, and Ossetic Languages. In: Automatic Documentation and Mathematical Linguistics 46 (2012) 2, 118-123.
Buchholz, Od/Wilfried Fiedler: Albanische Grammatik. 1. Aufl. Leipzig: Verlag Enzyklopädie, 1987.
Caka, Nebi/Ali Caka: Korpusi i gjuhës shqipe - rezultatet e para, problemet dhe detyrat. In: Rexhep Ismajli (Hg.): Shqipja dhe gjuhët e Ballkanit / Albanian and Balkan Languages. Prishtinë / Tiranë: Akademia e Shkencave dhe e Arteve e Kosovës / Akademia e Shkencave e Shqipërisë, 2012, 643-656.
EAGLES: Expert Advisory Group on Language Engineering Standards, 1996.
Erjavec, Tomaž: MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, 2010.
Erjavec, Tomaž: MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. In: Language Resources and Evaluation 46 (2012) 1, 131-142.
Hasanaj, Besmir: A Part of Speech Tagging Model for Albanian. Saarbrücken: Lambert Academic Publishing, 2012.
Haspelmath, Martin: Pre-established categories don't exist: Consequences for language description and typology. In: Linguistic Typology 11 (2007) 1, 119-132.
Ide, Nancy/Jean Véronis: Multext (multilingual tools and corpora). In: Proceedings of the 15th international conference on computational linguistics (CoLing'94), 1994, 90-96.
Kabashi, Besim: Pronominal clitics and Valency in Albanian. A computational linguistics perspective and modelling within the LAG-Framework. In: Thomas Herbst/Katrin Götz-Votteler (Hg.): Valency. Theoretical, descriptive and cognitive issues. Berlin / New York: Mouton de Gruyter, 2007, 339-352.
Kabashi, Besim: Automatische Verarbeitung der Morphologie des Albanischen. Erlangen: FAU University Press, 2015.
Kabashi, Besim: AlCo - një korpus tekstesh i gjuhës shqipe me njëqind milionë fjalë. In: Seminari XXXVI Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Prishtinë: Universiteti i Prishtinës, 2017.
Kabashi, Besim/Thomas Proisl: Albanian Part-of-Speech Tagging: Gold Standard and Evaluation. In: Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, 2018, 2593-2599.
Kadriu, Arbana: NLTK tagger for Albanian using iterative approach. In: Proceedings of the 35th International Conference on Information Technology Interfaces (ITI), 2013.
Kote, Nelda/Marenglen Biba/Jenna Kanerva u.a.: Morphological Tagging and Lemmatization of Albanian. A Manually Annotated Corpus and Neural Models. arXiv pre-print, 2019.
Marneffe, Marie-Catherine de/Christopher Manning/Joakim Nivre u.a.: Universal Dependencies. In: Computational Linguistics 47 (2021) 2, 255-308.
Morozova, Marija S./Timofej A. Archangel'skij/M. A. Daniel' u.a.: Albanskij nacional'nyj korpus: osnovnye napravlenija raboty. In: N. N. Kazansky (Hg.): Acta linguistica Petropolitana. Trudy Instituta lingvisticheskix issledovaniĭ. Sankt-Peterburg: "Nauka" (2016) 3, 169-189.
Morozova, Maria/Aleksandër Rusakov: Korpusi elektronik i shqipes: përpunimi, përmbajtja dhe përdorimi. In: Bardh Rugova (Hg.): Seminari XXXII Ndërkombëtar për Gjuhën, Letërsinë dhe Kulturën Shqiptare. Prishtinë: Universiteti i Prishtinës, 2013, 85-96.
Piton, Odile/Klara Lagji: Morphological study of Albanian words, and processing with NooJ. In: Proceedings of the 2007 International NooJ Conference, 2008, 189-205.
Tiedemann, Jörg: Parallel Data, Tools and Interfaces in OPUS. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'2012), 2012.
Toska, Marsida/Joakim Nivre/Daniel Zeman: Universal Dependencies for Albanian. In: Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020). Barcelona, Spain (Online), 2020, 178-188.
Trommer, Jochen/Dalina Kallulli: A morphological analyzer for standard Albanian. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), 2004, 1271-1274.
