Papers by Andrejs Vasiļjevs
HAL (Le Centre pour la Communication Scientifique Directe), Jan 17, 2013
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Language Resources and Evaluation, May 1, 2016
This paper describes corpora collection activities for building large machine translation systems for the Latvian e-Government platform. We describe the requirements for corpora, the selection and assessment of data sources, the collection of public corpora, and the creation of new corpora from miscellaneous sources. Methodology, tools and assessment methods are also presented, along with the results achieved, challenges faced and conclusions drawn. Several approaches to addressing data scarcity are discussed. We summarize the volume of the corpora obtained and provide quality metrics for MT systems trained on this data. The resulting MT systems for English-Latvian, Latvian-English and Latvian-Russian are integrated in the Latvian e-service portal and are freely available on the website HUGO.LV. This paper can serve as guidance for similar activities initiated in other countries, particularly in the context of the European Language Resource Coordination action.

Language Resources and Evaluation, May 1, 2014
This paper presents the concept of the innovative platform TaaS ("Terminology as a Service"). TaaS brings the benefits of cloud services to the user in order to foster the creation of terminology resources and keep them up to date by integrating automated data extraction, user-supported clean-up of raw terminological data, and sharing of user-validated terminology. The platform is based on cutting-edge technologies, provides single-access-point terminology services, and facilitates the establishment of emerging trends beyond conventional praxis and static models in terminology work. This cloud-based, user-oriented, collaborative, portable, interoperable, and multilingual platform offers terminology services such as terminology project creation and sharing, data collection for translation lookup, user document upload and management, terminology extraction customisation and execution, raw terminological data management, and validated terminological data export and reuse.

Service model for semi-automatic generation of multilingual terminology resources
Terminology and Knowledge Engineering, Jun 19, 2014
The authors present a service-based model for semi-automatic generation of multilingual terminology resources, a task that is very time consuming when performed manually. In this model, the automation of individual terminology work tasks is rendered as a set of interoperable cloud-based services integrated into workflows. These services automate the identification of term candidates in user documents, and the lookup of translation equivalents in online terminology resources and on the Web by automatically extracting multilingual terminology from comparable and parallel online resources. Collaborative involvement of users contributes to the refinement and enrichment of the raw terminological data. Finally, we present the TaaS platform, which implements this service-based model, particularly focusing on the processing of Web content.
Language Resources and Evaluation, May 1, 2018
In this paper, we present Tilde MT, a custom machine translation (MT) platform that provides linguistic data storage (parallel and monolingual corpora, multilingual term collections), data cleaning and normalisation, statistical and neural machine translation system training and hosting functionality, as well as wide integration capabilities (a machine user API and plugins for popular computer-assisted translation tools). We provide details on the most important features of the platform and elaborate on typical MT system training workflows for client-specific MT solution development.

Baltic Journal of Modern Computing, 2022
Ten years ago, when the META-NET Network of Excellence conducted a study on language technology support for European languages, Latvian was included in the category of languages with little or no support. During the last decade, notable progress has been made in the development of language resources and tools for Latvian, particularly regarding the creation of advanced datasets like speech corpora and treebanks, state-of-the-art neural language models, machine translation systems, speech technology, and technologies for natural language understanding and human-computer interaction. This paper provides an overview of the most recent activities in the language technology field in Latvia: national and international initiatives, key language resources and tools, key projects and initiatives. We summarize both the recent activities and the most significant achievements after the publication of the META-NET White Paper on Latvian.
TaaS: Terminology as a Service
In this system demonstration paper we present a cloud-based platform providing online terminology services for human and machine users. We focus on the use case for the application of online terminology services in statistical machine translation and describe the applied methods for monolingual and bilingual terminology integration into statistical machine translation during training and translation phases.
HAL (Le Centre pour la Communication Scientifique Directe), Jan 17, 2013
This paper presents the concept of cloud-based terminology services for acquiring, sharing and reusing multilingual terminology for human and machine users. The ongoing "Terminology as a Service" project was initiated to establish the TaaS platform, which addresses user needs and provides online core terminology services for key terminology tasks. The paper describes the main target user groups of the platform. The problems that language workers (technical writers, translators, interpreters, terminologists, editors etc.) encounter when working with terminology are analysed on the basis of the results of a survey performed within the project.

Baltic Journal of Modern Computing
Information and Communication Technology (ICT) terms are mainly formed in English and then secondarily formed in other languages. Because of the differences in the morphological and term-formation traditions of various languages, the results of secondary term formation tend to be somewhat chaotic. Latvia's ICT terminologists and linguists have developed a rather rigorous, semi-algorithmic approach to term formation that has been applied and tested for over thirty years. This paper aims to describe this approach and show its viability using the example of the most commonly used terms. We also analyse the usage of the officially approved terms in texts and the possible reasons why they sometimes encounter resistance from everyday users. In conclusion, we summarise the research regarding the current situation in secondary ICT terminology in Latvian and provide insight into possibilities for further development.

This thesis addresses the issues and solutions involved in consolidating heterogeneous multilingual terminology resources that are dispersed throughout numerous collections, publications and databases, in order to provide a single access point for both human users and web services. Online availability of consolidated terminology resources from diverse sources is of utmost importance in translation practice and domain-specific communication. One of the major goals of consolidation is to provide single, unified web-based access to distributed multilingual terminology resources. A unified methodology has been developed covering all major aspects, from scenario-based requirements analysis to data modeling, data storage, exchange and representation. The federation approach proposed in this work allows the consolidation of various existing terminology databases and centrally stored resources. This thesis introduces a new concept of terminology entry compounding for the identification and unification of matching multilingual entries from different collections. The application of international standards is discussed to ensure global interoperability of terminology resources and integration into the global language resource infrastructure. The practical results from using these approaches in the development of the EuroTermBank terminology databank are described. For the first time, heterogeneous multilingual terminology resources are integrated and a database federation is established with a unified online interface, serving as a proof of concept for the approaches described in this work.

In order to help improve the quality, coverage and performance of automated translation solutions for current and future Connecting Europe Facility (CEF) digital services, the European Language Resource Coordination (ELRC) was set up in 2015 through a service contract operating under the European Commission's CEF SMART 2014/1074 programme. Since then, ELRC has initiated a number of actions to support the collection of Language Resources (LRs) within the public sector in EU member and CEF-affiliated countries. All resources shared by the contributors were gathered and curated in the ELRC-SHARE Repository after passing the validation process developed by ELRC. This paper provides insights into the overall data collection and curation process (including both technical and legal validation of resources) employed within ELRC. The ELRC Helpdesk provides both technical and legal guidance (e.g. Intellectual Property Rights (IPR) clearance support) to potential data contributors, thus enabling the sustainable sharing of language data.
This paper describes the scientific, technical, and legal work done on the creation of a linguistic infrastructure for the Nordic and Baltic countries. The paper describes the research on assessing language technology support for the languages of the Baltic and Nordic countries, the work on establishing a language resource sharing infrastructure, and the collection and description of linguistic resources. We present the improvements necessary to ensure usability and interoperability of language resources, discuss issues related to intellectual property rights for complex resources, and describe the extension of the infrastructure through integration of language-resource-specific repositories. Work on treebanks, wordnets, terminology resources, and finite-state technology is described in more detail. Finally, our approach to ensuring the sustainability of the infrastructure is discussed.

Electronic Lexicography in the 21st Century: Thinking Outside the Paper. Proceedings of the eLex 2013 Conference, 17-19 October 2013, Tallinn, Estonia, pp. 421-434, 2013
This paper presents the Tilde Dictionary Browser (TDB), an innovative dictionary browsing environment for a wide range of users: language learners, language teachers, translators, and casual users. We describe several techniques to maximise the likelihood of providing users with a useful result even when searched items do not have a direct match in the dictionary due to misspellings, inflected words, multi-word items or phrase fragments, or where there is a lack of data in the main dictionary. TDB is targeted for broad use on multiple platforms and is implemented as desktop software and as Web and mobile applications. The desktop version of TDB currently contains dictionaries for more than 20 language pairs, including the languages of the Baltic countries, and is easily extendable to other languages. Besides the data from translation dictionaries, TDB also provides information from different online resources, such as terminology dictionaries, and integrates a machine translation facility.
Language Resources and Evaluation, 2016
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014. We describe its impact at the regional, national and international levels, mainly with regard to politics and the funding situation for language technology (LT) topics. The article documents the initiative's work throughout Europe to boost progress and innovation in our field.

This paper presents a set of principles and practical guidelines for terminology work in a national setting to ensure a harmonized approach to term localization. These linguistic principles and guidelines have been elaborated by the Terminology Commission in Latvia in the domain of Information and Communication Technology (ICT). We also present a novel approach to the corpus-based selection and evaluation of the most frequently used terms. Analysis of the terms shows that, in general, localized terms in normative terminology work in Latvia are coined according to these guidelines. We further evaluate how terms included in the database of official terminology are adopted in general use, such as newspaper articles, blogs, forums and websites. Our evaluation shows that in a non-normative context the official terminology faces strong competition from other variants of localized terms. Conclusions and recommendations from the lexical analysis of localized terms are provided. We hope that the presented guidelines and evaluation approach will be useful to terminology institutions, regulatory authorities and researchers in different countries who are involved in national terminology work.
NEALT Proceedings Series, Proceedings of the NODALIDA 2011 workshop "CHAT 2011: Creation, Harmonization and Application of Terminology Resources"
This paper evaluates the impact of machine translation on the software localization process and the daily work of professional translators when statistical machine translation (SMT) is applied to low-resourced languages with rich morphology. Translation from English into six low-resourced languages (Czech, Estonian, Hungarian, Latvian, Lithuanian and Polish) from different language groups is examined. The quality, usability and applicability of SMT for professional translation were evaluated. The building of domain- and project-tailored SMT systems for localization purposes was evaluated in two setups. The results of the first evaluation were used to improve the SMT systems and the MT platform. The second evaluation analysed a more complex situation, considering tag translation and its effects on the translator's productivity.

White Paper Series, 2012
This white paper is part of a series that promotes knowledge about language technology and its potential. It addresses educators, journalists, politicians, language communities, and others. The availability and use of language technology in Europe varies among languages. Consequently, the actions that are required to further support research and development of language technologies also differ for each language. The actions depend on many factors, such as the complexity of a given language and the size of its community. META-NET, a Network of Excellence funded by the European Commission, has conducted an analysis of current language resources and technologies in this white paper series (p. 91). The analysis focused on the 23 official European languages as well as other important national and regional languages in Europe. The results of this analysis suggest that there are many significant research gaps for each language. A more detailed expert analysis and assessment of the current situation will help maximise the impact of additional research and minimise any risks. META-NET consists of 54 research centres from 33 countries [1] (p. 87) that are working with stakeholders from commercial businesses, government agencies, industry, research organisations, software companies, technology providers, and European universities. Together, they are creating a common technology vision while developing a strategic research agenda that shows how language technology applications can address any research gaps by 2020.
LUEP Proceedings Series, Proceedings of the TKE 2012 workshop "CHAT 2012: The 2nd Workshop on the Creation, Harmonization and Application of Terminology Resources"
We are delighted to present the proceedings of CHAT 2012. Altogether, 7 papers have been selected for presentation (4 regular papers and 3 short papers). The workshop papers cover various topics, including automated approaches to terminology extraction and the creation of terminology resources, compiling multilingual terminology, ensuring interoperability and harmonization of terminology resources, integrating these resources in language processing applications, and distributing and sharing terminology data. Electronically published at Linköping University Electronic Press (Sweden): http://www.ep.liu.se/ecp_home/index.en.aspx?issue=072