Variants and Homographs: Eternal Problem of Dictionary Makers

Jaroslava Hlaváčová

2008

Abstract

We discuss two types of asymmetry between wordforms and their (morphological) characteristics, namely (morphological) variants and homographs. We introduce the concept of a multiple lemma, which allows for unique identification of wordform variants as well as 'morphologically based' identification of homographic lexemes. A deeper insight into these concepts allows further refinement of morphological dictionaries and, subsequently, better performance in NLP tasks. We demonstrate our approach on the morphological dictionary of Czech.

Figures (3)

Fig. 1. Relations among basic concepts. A lexeme is a set of lexical units that share the same paradigm. We are aware that this term in particular is simplified, but the simplification is sufficient for dictionaries that must contain all necessary information about words and, at the same time, remain easy to use.

Fig. 2. Schema of implementation of multiple lemma.

Implementation of Multiple Lemmas. In the morphological dictionary of Czech [7], the wordforms are not listed separately; they are clustered according to their lemmas. The lemma represents the whole paradigm. However, the multiple lemma cannot represent the extended paradigm straightforwardly, because a set cannot serve as a unique identifier. Thus, we keep all lemma variants separately, but we connect them with pointers (see Fig. 2).
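The pointer-based layout described for Fig. 2 can be illustrated with a minimal sketch. It is not the dictionary's actual implementation: the class `LemmaEntry`, the helper functions, the simplified tags and the choice of the Czech spelling pair filozofie/filosofie are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class LemmaEntry:
    """One lemma variant stored separately, with its own paradigm (tag -> wordform)."""
    lemma: str
    paradigm: dict[str, str]
    # Pointers to the other variants belonging to the same multiple lemma.
    variants: list["LemmaEntry"] = field(default_factory=list, repr=False)


def link_variants(*entries: LemmaEntry) -> None:
    """Connect the entries so that each one points to all the others."""
    for entry in entries:
        entry.variants = [other for other in entries if other is not entry]


def multiple_lemma(entry: LemmaEntry) -> frozenset[str]:
    """Recover the multiple lemma (a set of lemmas) starting from any one variant."""
    return frozenset([entry.lemma] + [v.lemma for v in entry.variants])


def extended_paradigm(entry: LemmaEntry) -> dict[str, set[str]]:
    """Merge the paradigms of all linked variants: tag -> set of wordforms."""
    merged: dict[str, set[str]] = {}
    for e in [entry] + entry.variants:
        for tag, form in e.paradigm.items():
            merged.setdefault(tag, set()).add(form)
    return merged


# Hypothetical entries with simplified, made-up tags.
filozofie = LemmaEntry("filozofie", {"nom.sg": "filozofie", "ins.sg": "filozofií"})
filosofie = LemmaEntry("filosofie", {"nom.sg": "filosofie", "ins.sg": "filosofií"})
link_variants(filozofie, filosofie)
```

Each entry keeps a plain string as its identifier, so the set-valued multiple lemma never has to act as a key; it is reconstructed on demand by following the pointers from whichever variant was looked up.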

Fig. 3. Schema of variants and homographs. Parts in ellipses concern polysemy.

The basic difference between the two concepts is illustrated by the schemas in Fig. 3. For variants, the shape of the schema resembles the letter A, while for homographs it resembles the letter Y. Polysemy appears only at the syntactic (if applicable) or semantic levels of the schema (see the right schema). It is not surprising that these schemas resemble those introduced in [8], where they illustrate synonymy and homonymy as relations between separate layers of language description.
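The A and Y shapes can also be read as two groupings of the same analysis data. The sketch below assumes a flat list of (wordform, lexeme identifier, tag) triples; the identifiers, the tags and the example words are hypothetical and serve only to show the two directions of the asymmetry.

```python
from collections import defaultdict

# (wordform, lexeme id, tag) triples; ids and tags are made up for illustration.
ANALYSES = [
    ("filozofie", "FILOZOFIE", "noun.nom.sg"),
    ("filosofie", "FILOZOFIE", "noun.nom.sg"),
    ("žena", "ŽENA", "noun.nom.sg"),
    ("žena", "HNÁT", "verb.transgressive"),
]


def find_variants(analyses):
    """A shape: several wordforms converge on one (lexeme, tag) description."""
    by_description = defaultdict(set)
    for form, lexeme, tag in analyses:
        by_description[(lexeme, tag)].add(form)
    return {desc: forms for desc, forms in by_description.items() if len(forms) > 1}


def find_homographs(analyses):
    """Y shape: one wordform diverges into several lexemes."""
    by_form = defaultdict(set)
    for form, lexeme, _tag in analyses:
        by_form[form].add(lexeme)
    return {form: lexemes for form, lexemes in by_form.items() if len(lexemes) > 1}
```

Grouping by description exposes variants, while grouping by wordform exposes homographs; polysemy does not show up in either grouping because it lies above the morphological level.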

Graeme Ritchie

1986

GéDériF: Automatic Generation and Analysis of Morphologically Constructed Lexical Resources

Fiammetta Namer

A frequent problem in text retrieval is the large number of words encountered that are not listed in general language dictionaries. However, these words are very often morphologically complex, and as such have a meaning that is predictable from their structure. Furthermore, such words typically belong to specialized language uses (e.g. scientific, philosophical or media technolects). Consequently, tools for listing and analysing such words can help enrich a terminological database. The purpose of this paper is to present a system that automatically generates morphologically complex French lexical items that are not listed in dictionaries, and that furthermore provides a structural and semantic analysis of these items. The output of this system is a morphological database (currently in progress) that forms a powerful lexical resource. It will be very useful in Natural Language Processing (NLP) and Information Retrieval (IR) applications. Indeed, the system generates a potentially infinite set of complex (derived) lexical units (henceforth CLUs) automatically associated with a rich array of morpho-semantic features, and is thus capable of dealing with morphologically complex structures that are unlisted in dictionaries. Technologie français [French National Ministry of Education, Research and Technology], as part of the program Actions Concertées Incitatives 1999 [Concerted Incitement Actions]. 4 This number corresponds to an estimate of the number of derivatives produced by the affixes -(a)tion, -(at)eur, -able, -age, -aire, -al, dé-, -et(te), -eux, -ifi(er), -is(er), -ité and -oir(e). 5 In French, NLP pays little attention to constructional information (Bouillon P. 1998: 48), which is considered less adapted to the field than inflectional information (Sproat R.W. 1992; Fradin B. 1994).

Production of morphological dictionaries of multi-word units using a multipurpose tool

Ranka Stanković, Cvetana Krstev

2011

In this paper we outline the use of the multipurpose software tool LeXimir in our approach to the automated production of lemmas for e-dictionaries of multi-word units (MWUs). Developing morphological dictionaries of MWUs is a tedious task, especially for Serbian and other languages with complex morphological structures. After realizing that developing such a dictionary manually is an extremely slow process, we set out to devise a procedure for the automated production of MWU dictionary lemmas, which is also outlined in this paper. The procedure was subsequently implemented as a new functionality of LeXimir, and makes use of our comprehensive e-dictionaries of Serbian simple words. We present an evaluation of the performance of this functionality, and hence of our procedure, obtained from experiments on two types of data. Finally, we discuss some further possible applications of our procedure and LeXimir in language processing tasks.

Derivational morphology in an e-dictionary of Serbian

Duško Vitas

2005

In this paper we explore the relation between derivational morphology and synonymy in connection with an electronic dictionary, inspired by the work of Maurice Gross. The characteristics of this relation are illustrated by derivation in Serbian, which produces new lemmas with predictable meaning. We call this regular derivation. We then demonstrate how this kind of derivation is handled in text processing using a morphological e-dictionary of Serbian and a collection of transducers with lexical constraints. Finally, we analyze the cases of synonymy that include regular derivation in one aligned text.

Derivational Morphology in E-Dictionaries of Serbian *

Cvetana Krstev

2013

This paper discusses the problem of derived words in automatic text processing based on morphological electronic dictionaries. The problem of unknown words (words that are not in e-dictionaries) produced by derivational patterns is discussed first, and a mechanism for processing them, which uses morphological grammars as an enhancement of regular expressions, is presented. Further, possible enhancements of the lemma description that would cover the derivatives of a given lemma are analyzed. Two such enhancements are suggested, and their characteristics are presented together with directions for implementation.

Language independent morphological analysis

Yuji Matsumoto

Proceedings of the sixth conference on Applied natural language processing, 2000

This paper proposes a framework for language-independent morphological analysis and concentrates mainly on tokenization, the first process of morphological analysis. Although tokenization is usually not regarded as a difficult task in most segmented languages such as English, there are a number of problems in achieving precise treatment of lexical entries. We first introduce the concept of morpho-fragments, which are intermediate units between characters and lexical entries. We then describe our approach to resolving problems that arise in tokenization so as to attain a language-independent morphological analyzer.

Lexicon-Based Morphological Analysis of Latvian Language

Peteris Paikens

Proceedings of, 2007

This paper describes a practical solution for lexicon-based morphological analysis of the Latvian language. As Latvian is a flexive language, the core of this system is an implementation of word inflection based on a stem and its properties as listed in the lexicon. The main advantage of the described solution over similar implementations is that it augments the lexicon with methods for deriving words from related word stems, which significantly increases the recognition rate. The implemented system is able to provide full morphological detail for 96% of the words in unrestricted Latvian texts, even when using a rather limited lexicon of 25,000 word stems. For the remaining unknown words, the system is extended with heuristics for recognising proper names and for determining verb and noun flexive forms based on their endings, allowing a good-quality guess at the linguistic properties of words that are not included in the lexicon. Such wide coverage allows the solution to be used in other linguistic tools as a transparent and robust layer for analysing word properties.

Linguistic problems in multilingual morphological decomposition

Gregor Thurmair

1984

An algorithm for the morphological decomposition of words into morphemes is presented. The application area is information retrieval, and the purpose is to find morphologically related terms to a given search term. First, the parsing framework is presented, then several linguistic decisions are discussed: morpheme selection and segmentation, morpheme classes, morpheme grammar, allomorph handling, etc. Since the system works in several languages, language-specific phenomena are mentioned.

Morphological Meanings in the Prague Dependency Treebank 2.0

Magda Sevcikova

Lecture Notes in Computer Science, 2005

In this paper we report on our work on the system of grammatemes (mostly semantically oriented counterparts of morphological categories such as number, degree of comparison, or tense), the concept of which was introduced in Functional Generative Description and is now further elaborated in the context of the Prague Dependency Treebank 2.0. We also present a new hierarchical typology of tectogrammatical nodes. We would like to thank Professor Jarmila Panevová for extensive linguistic advice. The research reported in this paper has been supported by the projects 1ET101120503, GA-UK 352/2005 and GAČR 201/05/H014. 1 Just for curiosity: almost the same term 'grammemes' is used for the same notion in the Meaning-Text Theory ([3]), although to a large extent the two approaches were created independently.

Morphological annotation of the Lithuanian corpus

Andrius Utka

Proceedings of the Workshop on …, 2007

There are over 400 million speakers of Balto-Slavonic languages world-wide (synonymously used: Balto-Slavic). As of 2007, almost a third of the 23 official European Union languages are Balto-Slavonic, i.e. Bulgarian, Czech, Latvian, Lithuanian, Polish, Slovak and Slovene. The two most recent rounds of the EU Enlargement fundamentally raised the interest in these languages: translators and interpreters for new language pairs need to be found, the interest in Machine (Aided) Translation systems has risen and tools that help language specialists and information-seeking individuals are now highly sought after.

References (8)

[1] ISO/TC 37/SC 4: Language Resources Management - Lexical Markup Framework (LMF). http://www.lexicalmarkupframework.org/ (2007) Rev. 14, date 2007-06-03.
[2] Matthews, H.: The Concise Oxford Dictionary of Linguistics. Oxford University Press, Oxford (1997)
[3] Cruse, D.A.: Lexical Semantics. Cambridge University Press, Cambridge (1986)
[4] Filipec, J.: Lexicology and Lexicography: Development and State of the Research. In Luelsdorff, P.A., ed.: The Prague School of Structural and Functional Linguistics, Amsterdam-Philadelphia, John Benjamins (1994) 163-183
[5] Žabokrtský, Z.: Valency Lexicon of Czech Verbs. PhD thesis, Charles University, Prague (2005)
[6] Hlaváčová, J.: Pravopisné varianty a morfologická anotace korpusů. In Štícha, F., ed.: Proceedings of 2nd International Conference Grammar and Corpora 2007. (2008) In press.
[7] Hajič, J.: Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Charles University Press, Prague (2004)
[8] Panevová, J.: Formy a funkce ve stavbě české věty. Academia, Praha (1980)

Jaroslava Hlaváčová

Journal of Linguistics/Jazykovedný časopis

We describe systematic changes made to the Czech morphological dictionary in connection with the annotation of new data within the Prague Dependency Treebank (PDT) project. We bring new solutions to several complicated morphological features that occur in Czech texts. We introduced two new parts of speech, namely foreign word and segment. We adopted new principles for the morphological analysis of global and inflectional variants, homonymous lemmas, abbreviations and aggregates. The changes were motivated by the need for consistency between the data and the dictionary, and within the dictionary itself.

Collapsing Morphological Information in Lexical Databases for NLP Applications

Juan Alberto Alonso

The morphology of inflectional languages poses specific problems in the processing of morphological alternations. Regular alternations at morpheme boundaries can be elegantly captured by the use of rule formalisms based on the two-level morphology model. Stem alternations and completely irregular alternations at morpheme boundaries, however, need to be captured in some way in the lexicon. This paper presents four possible solutions to the problem and makes a claim in favor of one of them. The proposed approach makes use of feature bundles that contain the necessary linguistic information to uniquely identify allomorphic variations of stems in the lexicon. The proposal is an improvement in that it simplifies the representation of allomorphic variations in the lexicon by avoiding duplication of stem allomorphs to capture cross-combination of several morphosyntactic features in stem+flex sequences.

Restricted inflectional form generation in management of morphological keyword variation

Kimmo Kettunen

Information Retrieval, 2007

Word form normalization through lemmatization or stemming is a standard procedure in information retrieval because morphological variation needs to be accounted for and several languages are morphologically non-trivial. Lemmatization is effective but often requires expensive resources. Stemming is also effective in most contexts, generally almost as good as lemmatization and typically much less expensive; besides it also has a query expansion effect. However, in both approaches the idea is to turn many inflectional word forms to a single lemma or stem both in the database index and in queries. This means extra effort in creating database indexes. In this paper we take an opposite approach: we leave the database index un-normalized and enrich the queries to cover for surface form variation of keywords. A potential penalty of the approach would be long queries and slow processing. However, we show that it only matters to cover a negligible number of possible surface forms even in morphologically complex languages to arrive at a performance that is almost as good as that delivered by stemming or lemmatization. Moreover, we show that, at least for typical test collections, it only matters to cover nouns and adjectives in queries. Furthermore, we show that our findings are particularly good for short queries that resemble normal searches of web users. Our approach is called FCG (for Frequent Case (form) Generation). It can be relatively easily implemented for Latin/Greek/Cyrillic alphabet languages by examining their (typically very skewed) nominal form statistics in a small text sample and by creating surface form generators for the 3-9 most frequent forms. We demonstrate the potential of our FCG approach for several languages of varying morphological complexity: Swedish, German, Russian, and Finnish in well-known test collections. Applications include in particular Web IR in languages poor in morphological resources.

Golden Rule of Morphology and Variants of Word forms

Jaroslava Hlaváčová

Journal of Linguistics/Jazykovedný časopis

In many languages, some words can be written in several ways. We call them variants. The values of all their morphological categories are identical, which leads to an identical morphological tag. Together with the identical lemma, this gives two or more wordforms with the same morphological description. This ambiguity may cause problems in various NLP applications. There are two types of variants: those affecting the whole paradigm (global variants) and those affecting only wordforms that share some combinations of morphological values (inflectional variants). In the paper, we propose a means of tagging all wordforms, including their variants, unambiguously. We call this requirement the "Golden rule of morphology". The paper deals mainly with Czech, but the ideas can be applied to other languages as well.

PoliMorf: a (not so) new open morphological dictionary for Polish

Maciej Ogrodniczuk, Łukasz Szałkiewicz, Marcin Miłkowski

This paper presents preliminary results of an effort aiming at the creation of a morphological dictionary of Polish, PoliMorf, available under a very liberal BSD-style license. The dictionary is a result of a merger of two existing resources, SGJP and Morfologik and was prepared within the CESAR/META-NET initiative. The work completed so far includes re-licensing of the two dictionaries and filling the new resource with the morphological data semi-automatically unified from both sources. The merging process is controlled by the collaborative dictionary development web application Kuźnia, also implemented within the project. The tool involves several advanced features such as using SGJP inflectional patterns for form generation, possibility of attaching dictionary labels and classification schemes to lexemes, dictionary source record and change tracking. Since SGJP and Morfologik are already used in a significant number of Natural Language Processing projects in Poland, we expect PoliMorf to become the Polish morphological dictionary of choice for many years to come.

Towards Czech Morphological Guesser

Petr Sojka

This paper presents a morphological guesser for Czech based on data from Czech morphological analyzer ajka [1]. The idea behind the presented concept lies in a presumption that the new (and therefore unknown to the analyzer) words in a language behave quite regularly and that a description of this regular behaviour can be extracted from the existing data of the morphological analyzer. The paper describes both the construction of guesser data and the architecture of the guesser itself.

Improbable morphological forms in a computational lexicon

Kristin Hagen

2005

In the construction of a computational lexicon, one of the problems is how to handle cases where words have a partial morphological paradigm. In this paper we will describe this problem and sketch how we implemented a system for capturing the degree to which forms should be considered improbable. Also, we will describe how our results can be used in language applications.

Automatic morphological processing of Bulgarian proper nouns

Hristo Krushkov

TAL. Traitement automatique des langues, 2000

This paper presents (i) a classification of Bulgarian proper nouns, (ii) a methodology for automatic morphological analysis and generation of proper nouns, (iii) some approaches to automatically build a dictionary of proper nouns. Bulgarian proper nouns are divided into classes. Every class comprises rules for generation of the paradigm. The pattern is a lexical representation, which matches all forms of the paradigm. The morphological analysis is based on the pattern matching process between the proper noun and the pattern. The pattern and the class incorporate information about the whole paradigm of a particular proper noun. An electronic dictionary of proper nouns has been created. It consists of pairs <pattern, class number>.

Morphisto: Service-Oriented Open Source Morphology for German

Andrea Zielinski

Communications in Computer and Information Science, 2009

This book starts with a theory-oriented paper by Thomas Hanneforth ("Using Ranked Semirings for Representing Morphology Automata"), proposing complex weight structures to represent morphological analyzers.

The Role of Morphological Analysis in Natural Language Processing

Zeynep Altan

2002

Traditionally, the analysis of word structure (morphology) is divided into two basic fields: inflection and derivation. Therefore, the morphological structure of each word may include elements such as prefix, suffix, infix, or even a separate root, and these elements can modify the meaning of the basic root or stem of the word. If the consequent word is only a paradigmatic application of its base form, this variation of the word is called inflection; but if the resulting word is an entirely different word or a compound, which is formed of two or more roots, it is called derivation. While derivation is a word-creating process, inflection constitutes different forms of any word. The model developed in this study, which analyses the morphology of Turkish verbs, can recognize all of the inflectional categories. The computational tool consists of a Java applet that can run on every machine, and a database that has been extracted from Turkish Dictionary published by Turkish La...
