Papers by Peteris Paikens

Proceedings of, 2007
This paper describes a practical solution for lexicon-based morphological analysis of the Latvian language. As Latvian is an inflective language, the core of the system is an implementation of word inflection based on a stem and its properties as listed in the lexicon. The main advantage of the described solution over similar implementations is that it augments the lexicon with methods for deriving words from related word stems, which significantly increases the recognition rate. The implemented system is able to provide full morphological detail for 96% of the words in unrestricted Latvian texts, even when using a rather limited lexicon of 25,000 word stems. For the remaining unknown words, the system is extended with heuristics for recognising proper names and for determining inflected verb and noun forms based on their endings, allowing a good-quality guess at the linguistic properties of words that are not included in the lexicon. Such wide coverage allows the solution to be used in other linguistic tools as a transparent and robust layer for analysing word properties.

We describe an approach to morphological analysis that combines a rule-based word-level morphological analyzer with statistical tagging, detailing its application to the Latvian language. Latvian is a highly inflective Indo-European language with a rich morphology. The tools described here include an implementation of Latvian inflectional paradigms, a morphological analysis tool with a guessing module for out-of-vocabulary words, and a statistical POS/morphology tagger for disambiguating between multiple possible analyses. The accuracy currently achieved with a training set of only ~40 000 words is 97.9% for part-of-speech tagging and 93.6% for the full morphological feature tag set, which is better than any previously publicly available tagger for Latvian. We also describe the construction and methodology of the necessary linguistic resources (a morphological dictionary and an annotated morphological corpus) and evaluate the effect of resource size on analysis accuracy, showing what results...
The paper describes work in progress on building a catalogue of named entities (people, places and organizations) based on a recently digitized large (4.5 billion tokens) Latvian corpus. The authors propose an annotation standard for marking up named entities within the Latvian corpus, according to which a representative set of documents (150 000 words) is manually annotated. This annotated corpus is used for training and evaluating an automated named entity recognition system based on the Stanford CRF classifier, achieving an F-score of up to 81%. The named entities indexed within the Latvian National Library corpus and the annotated documents are publicly available online for linguistic and historical research. © 2012 The Authors and IOS Press.
We present an open-source Latvian resource grammar implemented in Grammatical Framework (GF), a programming language for multilingual grammar applications. GF differentiates between concrete grammars and abstract grammars: translation among concrete languages is provided via abstract syntax trees. The same concrete grammar is effectively used for both language analysis and language generation. Furthermore, GF differentiates between general-purpose resource grammars and domain-specific application grammars that are built on top of the resource grammars. The GF resource grammar library (RGL) currently supports more than 20 languages that implement a common API. Latvian is the 13th official European Union language that is made available in the RGL.

Lecture Notes in Computer Science, 2012
In this paper we present ongoing research investigating the possibility and potential of integrating frame semantics, particularly FrameNet, into Grammatical Framework (GF) application grammar development. An important component of GF is its Resource Grammar Library (RGL), which encapsulates the low-level linguistic knowledge about the morphology and syntax of currently more than 20 languages, facilitating rapid development of multilingual applications. In the ideal case, porting a GF application grammar to a new language would only require introducing the domain lexicon: translation equivalents that are interlinked via common abstract terms. While this is possible for a highly restricted controlled natural language (CNL), developing and porting a less restricted CNL requires above-average linguistic knowledge of the particular language and above-average GF experience. Specifying a lexicon is mostly straightforward in the case of nouns (incl. multi-word units); verbs, however, are the most complex category (in terms of both inflectional paradigms and argument structure), and adding them to a GF application grammar is not a straightforward task. In this paper we focus on verbs, investigating the possibility of creating a multilingual FrameNet-based GF library. We propose an extension to the current RGL that allows GF application developers to define clauses on the semantic level, leaving the language-specific syntactic mapping to this extension. We demonstrate our approach by reengineering the MOLTO Phrasebook application grammar.
In this paper we describe ongoing work on developing a system (a set of web services) for transliterating the Gothic-based Fraktur script of historical Latvian into the Latin-based script of contemporary Latvian. Currently the system consists of two main components: a generic transliteration engine that can be customized with alternative sets of rules, and a wide-coverage explanatory dictionary of Latvian. The transliteration service also corrects typical OCR errors and uses a morphological analyzer of contemporary Latvian to acquire lemmas – potential headwords in the dictionary. The system is being developed for the National Library of Latvia in order to support advanced reading aids in the web interfaces of their digital collections.