Although Latvia is a CLARIN member supported only by the government of Latvia, it actively participates in CLARIN project activities. The paper presents the current situation in Latvia: the existing infrastructure (both LRT and technical), the activities undertaken so far, and further work and possible co-operation with NEALT countries.
When phrase-based statistical machine translation (SMT) is applied to languages with rather free word order and rich morphology, translated texts are often not fluent because of misused inflectional forms and wrong word order between phrases or even inside a phrase. One possible way to improve translation quality is to apply factored models. The paper presents work on English-Latvian phrase-based and factored SMT systems and, using evaluation results, demonstrates that although factored models seem more appropriate for highly inflected languages, they have a rather small influence on translation results, whereas a phrase-based model trained on more data can achieve better translation quality.
This paper reports on the development of the annotated Latvian language error corpus designed for grammar checker development and evaluation. We describe the error classification system introduced for this purpose, the annotation process, and guidelines. Two corpora (the corpus of student papers and the balanced text corpus) consisting of a total of 20,877 sentences have been created and annotated. A general characterisation of the corpora and a summary of the annotation results are presented.
This paper aims to contribute to an in-depth understanding of computer-based word alignment processes in machine translation (MT). The performance of word alignment, based on IBM models and incorporated in GIZA++, has been widely discussed in machine translation literature. The debate has led towards a general consensus that GIZA++ does not provide sufficiently good results for word alignments. In this paper, we analyse the performance of GIZA++ and Fast Align for the Latvian-English pair against the manually aligned Gold Standard. Experiments showed that Fast Align was approximately 2-3% more accurate and three times faster than GIZA++ in the alignment task. As for pre-processing, the removal of articles has a small but positive influence on alignment quality and machine translation output. We also present a Word Alignment Visualisation tool for analysis and editing of word alignments.
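Scoring a system alignment against a manually aligned gold standard can be illustrated with the alignment error rate (AER), a commonly used metric. The abstract does not say which metric the paper reports, so the following is only a minimal sketch under that assumption; the function name `alignment_error_rate` and the toy link sets are hypothetical.

```python
def alignment_error_rate(sure, possible, predicted):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), where S is a subset of P."""
    sure = set(sure)
    possible = set(possible) | sure  # every sure link also counts as possible
    predicted = set(predicted)
    if not predicted and not sure:
        return 0.0  # nothing to align and nothing predicted
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


# Toy data (assumed): links are (source_index, target_index) pairs.
gold_sure = {(0, 0), (1, 1)}
gold_possible = {(2, 1)}
system = {(0, 0), (1, 1), (2, 2)}
aer = alignment_error_rate(gold_sure, gold_possible, system)
```

Lower values are better; a system alignment identical to the sure links scores 0.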
Although Machine Translation is very popular for personal tasks, its use in localization and other business applications is still very limited. The paper presents an experiment on the evaluation of an English-Latvian SMT system integrated into SDL Trados which has been used in an actual localization assignment by a professional localization company. We show that such an integrated localization environment can increase the productivity of localization by 32.9% without a critical reduction in quality.
Toponyms are studied by toponymy; they are names of places and comprise the following types: hydronyms (names of bodies of water: bays, or other building); cosmonyms or astronyms (names of stars, constellations or other heavenly bodies). The paper investigates a complicated task in machine translation (MT) and cross-language information retrieval (CLIR): automated translation of toponyms. Most toponym translation approaches are data-driven (see, e.g., Meng et al., 2001; Al-Onaizan and Knight, 2002; Sproat et al., 2006; Alegria et al., 2006; Wentland et al., 2008) since they deal with widely used languages that have enough linguistic resources for development. Given the under-resourced status of the Latvian language, with few available corpus resources, especially parallel bilingual corpora, a rule-based approach is proposed for English-Latvian toponym translation. There are several commonly used translation strategies for toponyms (Babych and Hartley, 2004): the transference strategy (i.e., do-not-translate), the transliteration strategy (i.e., phonetic or spelling rendering), the translation strategy (i.e., translation itself) and the combined strategy. The transference strategy with a do-not-translate list is often used for toponyms that do not need any rendering at all and are often left untranslated, e.g., organization names (Babych and Hartley, 2003) or names of hotels in our system. The most common transliteration techniques are phoneme-based and grapheme-based (Zhang et al., 2004). The phoneme-based approach (Knight and Graehl, 1998; Meng et al., 2001; Oh and Choi, 2002; Lee and Chang, 2003) converts a source language word into a target language word via its phonemic representation, i.e., grapheme-phoneme-grapheme conversion.
The grapheme-based technique converts a source language word into a target language word without any phonemic representation (grapheme-grapheme conversion) (Stalls and Knight, 1998; Li et al., 2004). Although Geoffrey Leech (1981) grants toponyms a special status as proper names without a conceptual meaning, since no componential analysis can be performed for them, many toponyms are at least etymologically meaningful, e.g., Cambridge, the bridge over the river Cam (Leidner, 2007). Toponyms are also ambiguous. Leidner (2007) describes three types of toponymical ambiguity: morpho-syntactic ambiguity: a word may be a toponym or a non-toponym, e.g., Liepa as a populated place in Latvia versus liepa (lime-tree) as a common noun; referential ambiguity: a toponym may refer to more than one place of the same type, e.g., Riga as a populated place and the capital of Latvia and Riga as a populated place in the state of Michigan, USA; feature type ambiguity: a toponym may refer to more than one place of a different type, e.g., Ogre as a populated place and a river in Latvia. Another type of toponymical ambiguity is eponymical ambiguity, when places are named after people or deities, e.g., Vancouver after George Vancouver. Sometimes the same place is known by different names: endonyms (names of places used by inhabitants, self-assigned names) and exonyms (names of places used by other groups, not locals), e.g., Firenze for its inhabitants and Florence for English speakers.
Although human language technologies have a long history in Latvia, Latvian still belongs to the under-resourced languages, as there are many gaps in basic language technologies and tools. However, despite difficulties, some of these gaps in both resources and tools have been filled in the last five years. The main goal of this paper is to report on recent achievements in language resources and technologies (LRT) for Latvian and to describe the current situation.
The modern electronic dictionary that always gives an answer
This paper presents the Tilde Dictionary Browser (TDB) – an innovative dictionary browsing environment for a wide range of users – language learners, language teachers, translators, and casual users. We describe several techniques to maximise the likelihood of providing users with a useful result even when searched items do not have a direct match in the dictionary due to misspellings, inflected words, multi-word items or phrase fragments, or there is a lack of data in the main dictionary. TDB is targeted for broad use on multiple platforms and is implemented as desktop software, a Web application, and a mobile application. The desktop version of TDB currently contains dictionaries for more than 20 language pairs, including the languages of the Baltic countries, and is easily extendable to other languages. Besides the data from translation dictionaries, TDB also provides information from different on-line resources, such as terminology dictionaries, as well as integrates the machine t...
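The idea of always returning something useful when a query has no direct match can be sketched as a chain of fallbacks. This is not TDB's implementation, which the abstract does not detail; the `lookup` function, the two-entry `ENTRIES` table and the 0.8 similarity cutoff are assumptions made for illustration, with Python's standard `difflib` standing in for whatever spelling correction TDB uses.

```python
import difflib

# Hypothetical toy English-Latvian dictionary.
ENTRIES = {"house": "māja", "dog": "suns"}


def lookup(query):
    """Return (strategy, headword, translation) tuples using successive fallbacks."""
    query = query.strip().lower()
    # 1. Direct match.
    if query in ENTRIES:
        return [("exact", query, ENTRIES[query])]
    # 2. Likely misspelling: closest headword above an assumed similarity cutoff.
    close = difflib.get_close_matches(query, list(ENTRIES), n=1, cutoff=0.8)
    if close:
        return [("fuzzy", close[0], ENTRIES[close[0]])]
    # 3. Multi-word item or phrase fragment: look up each word separately.
    return [("word", w, ENTRIES[w]) for w in query.split() if w in ENTRIES]
```

A real system would add morphological analysis for inflected forms before the fuzzy step; that stage is omitted here because it needs language-specific resources.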
Language Resources and Technology for the Humanities in Latvia (2004-2010)
The last six years have been very important for research and development of language technologies in Latvia. Several large projects have been funded by the government of Latvia, important tools and resources have been created by the industry, and since 2006 Latvia has participated in the CLARIN initiative. Although there is still a gap between the language resources and technology (LRT) for Latvian and those for the more widely used languages, the current LRT for Latvian can already serve as a basic research infrastructure for the Humanities. The paper presents an overview and the current status of LRT in Latvia. Special attention is paid to the CLARIN project and its role for the humanities in Latvia.
Georg Rehm, Hans Uszkoreit (editors)
PREFACE
This white paper is part of a series that promotes knowledge about language technology and its potential. It addresses educators, journalists, politicians, language communities, and others. The availability and use of language technology in Europe varies among languages. Consequently, the actions that are required to further support research and development of language technologies also differ for each language. The actions depend on many factors, such as the complexity of a given language and the size of its community.
META-NET, a Network of Excellence funded by the European Commission, has conducted an analysis of current language resources and technologies in this white paper series (p. 91). The analysis focused on the 23 official European languages as well as other important national and regional languages in Europe. The results of this analysis suggest that there are many significant research gaps for each language. A more detailed expert analysis and assessment of the current situation will help maximise the impact of additional research and minimise any risks.
META-NET consists of 54 research centres from 33 countries [1] (p. 87) that are working with stakeholders from commercial businesses, government agencies, industry, research organisations, software companies, technology providers, and European universities. Together, they are creating a common technology vision while developing a strategic research agenda that shows how language technology applications can address any research gaps by 2020.
META-NET: office@meta-net.eu, http://www.meta-net.eu
The authors of this document thank the authors of the white paper on German for permission to reuse part of their document's language-independent material [2].
We are delighted to present the proceedings of CHAT 2012. Altogether, 7 papers have been selected for presentation (4 regular papers and 3 short papers). The workshop papers cover various topics on automated approaches to terminology extraction and creation of terminology resources, compiling multilingual terminology, ensuring interoperability and harmonization of terminology resources, integrating these resources in language processing applications, distributing and sharing terminology data, and other topics. Electronically published at Linköping University Electronic Press (Sweden) http://www.ep.liu.se/ecp_home/index.en.aspx?issue=072
Due to their linguistic and extra-linguistic nature, toponyms deserve special treatment when they are translated. The paper deals with issues related to automated translation of toponyms from English into Latvian. The translation process handles not only toponyms found in a dictionary but out-of-vocabulary toponyms as well. Translation of out-of-vocabulary toponyms is divided into three steps: source string normalization, translation, and target string normalization. The translation step applies translation strategies and linguistic toponym translation patterns. 10,000 UK-related toponyms from Geonames were used as a development set. The developed methods have been evaluated on a test set: the accuracy of translation is 67% for the whole test set, 58% for one-word toponymic units, and 81% for multiword toponyms.
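The dictionary lookup followed by the three-step treatment of out-of-vocabulary toponyms can be sketched as a small pipeline. The abstract does not give the paper's actual rules, so everything below is a hypothetical illustration: the `DICTIONARY` and `GENERIC_TERMS` tables, the function names, and the single reordering pattern are assumptions, and the sketch deliberately ignores transliteration and Latvian inflection, so its output for out-of-vocabulary input is not correct Latvian.

```python
import re

# Hypothetical toy resources; the paper's dictionary and patterns are not given.
DICTIONARY = {"london": "Londona"}
GENERIC_TERMS = {"river": "upe", "lake": "ezers"}


def normalize_source(text):
    """Step 1: collapse runs of whitespace and trim the source string."""
    return re.sub(r"\s+", " ", text).strip()


def translate_tokens(tokens):
    """Step 2: translate known generic terms, transfer all other tokens unchanged."""
    return [GENERIC_TERMS.get(tok.lower(), tok) for tok in tokens]


def normalize_target(tokens):
    """Step 3: move a leading generic term to the end (one toy ordering pattern)."""
    if len(tokens) > 1 and tokens[0] in GENERIC_TERMS.values():
        tokens = tokens[1:] + tokens[:1]
    return " ".join(tokens)


def translate_toponym(text):
    source = normalize_source(text)
    if source.lower() in DICTIONARY:  # in-vocabulary: plain dictionary lookup
        return DICTIONARY[source.lower()]
    return normalize_target(translate_tokens(source.split(" ")))
```

Keeping the three steps as separate functions mirrors the division described in the abstract and lets each stage be tested on its own.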
The paper describes an English(Russian)-Latvian Machine Translation (MT) system which allows users with limited knowledge of English or Russian to understand text. The developed system has an architecture typical of transfer MT systems. It includes language identification, parsing, processing of multiword expressions (MWE), syntactic and lexical transfer, disambiguation and generation modules. Most of the system's constituents are rule-based; however, a statistical approach is used for language identification and disambiguation. One of the most complicated issues in MT is the translation of MWEs. The paper presents a detailed analysis of the problem and provides a technique for dealing with MWEs in the context of MT. Finally, the paper presents automatic evaluation results and outlines directions for further work.
This paper evaluates the impact of machine translation on the software localization process and the daily work of professional translators when SMT is applied to low-resourced languages with rich morphology. Translation from English into six low-resourced languages (Czech, Estonian, Hungarian, Latvian, Lithuanian and Polish) from different language groups is examined. Quality, usability and applicability of SMT for professional translation were evaluated. The building of domain and project tailored SMT systems for localization purposes was evaluated in two setups. The results of the first evaluation were used to improve the SMT systems and the MT platform. The second evaluation analysed a more complex situation, considering tag translation and its effects on the translator's productivity.
Papers by Inguna Skadina