Papers by Josef Van Genabith

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2021
For many use cases, MT must not just translate raw text, but complex formatted documents (e.g. websites, slides, spreadsheets), and the result of the translation should reflect the formatting. This is challenging, as markup can be nested, can apply to spans that are contiguous in the source but non-contiguous in the target, etc. Here we present TransIns, a system for non-plain-text document translation that builds on the Okapi framework and MT models trained with Marian NMT. We develop, implement and evaluate different strategies for reinserting markup into translated sentences using token alignments between source and target sentences. We propose a simple and effective strategy that compiles down all markup to single source tokens and transfers them to aligned target tokens. Our evaluation shows that this strategy yields highly accurate markup in the translated documents, outperforming the markup quality found in documents translated with popular translation services. We release TransIns under the MIT License as open-source software at https://github.com/DFKI-MLT/TransIns. An online demonstrator is available at https://transins.dfki.de.
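
As a rough illustration of the single-token markup strategy described above, here is a minimal Python sketch (all names and data structures are ours, not TransIns code): markup compiled down to individual source tokens is carried over to whichever target tokens the alignment links them to.

```python
# Sketch of the core markup-transfer idea: markup compiled down to single
# source tokens is moved to the aligned target tokens. Alignments map
# source token indices to target token indices.

def transfer_markup(source_markup, alignments, num_target_tokens):
    """source_markup: dict source_index -> list of markup tags;
    alignments: set of (source_index, target_index) pairs."""
    target_markup = {i: [] for i in range(num_target_tokens)}
    for src_idx, tags in source_markup.items():
        targets = sorted(tgt for s, tgt in alignments if s == src_idx)
        for tgt_idx in targets:
            target_markup[tgt_idx].extend(tags)
    return target_markup

# Example: "<b>Hello</b> world" -> "Hallo Welt" with alignment {(0,0),(1,1)}
markup = transfer_markup({0: ["<b>", "</b>"]}, {(0, 0), (1, 1)}, 2)
print(markup)  # {0: ['<b>', '</b>'], 1: []}
```
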
This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing in terms of both performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.
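
A toy sketch of the color-coding idea, assuming a simple token-level diff as the similarity signal (CATaLog's actual scheme may differ): tokens of the retrieved TM segment that match the input are marked as reusable, the rest as likely post-editing targets.

```python
# Hedged sketch of per-token color coding between an input sentence and a
# TM-retrieved segment, using difflib; the tool's real scheme may differ.
import difflib

def color_code(input_tokens, tm_tokens):
    """Mark each TM token 'match' (likely reusable) or 'edit'
    (likely to need post-editing)."""
    matcher = difflib.SequenceMatcher(a=input_tokens, b=tm_tokens)
    colors = ["edit"] * len(tm_tokens)
    for block in matcher.get_matching_blocks():
        for j in range(block.b, block.b + block.size):
            colors[j] = "match"
    return list(zip(tm_tokens, colors))

print(color_code("the cat sat".split(), "the dog sat".split()))
# [('the', 'match'), ('dog', 'edit'), ('sat', 'match')]
```
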

We present the first study on the post-editing (PE) effort required to build a parallel dataset for English-Manipuri and English-Mizo, in the context of a project on creating data for machine translation (MT). English source text from a local daily newspaper is machine translated into Manipuri and Mizo using PBSMT systems built in-house. A Computer Assisted Translation (CAT) tool is used to record the time, keystrokes and other indicators to measure PE effort in terms of temporal and technical effort. A positive correlation between the technical effort and the number of function words is seen for English-Manipuri and English-Mizo, but a negative correlation between the technical effort and the number of noun words for English-Mizo. However, average time spent per token in PE English-Mizo text is negatively correlated with the temporal effort. The main reasons for these results are (i) English and Mizo using the same script, while Manipuri uses a different script, and (ii) the agglu...
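
For illustration, the kind of correlation analysis the study reports could be computed as follows; the numbers below are invented, not the paper's data.

```python
# Sketch of the correlation analysis described: Pearson's r between
# per-sentence technical effort (keystrokes) and function-word counts.
from statistics import mean

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vary = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (varx * vary)

keystrokes = [120, 95, 150, 80]      # technical effort per sentence (made up)
function_words = [14, 10, 18, 8]     # function-word counts (made up)
print(round(pearson_r(keystrokes, function_words), 3))  # positive correlation
```
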
META-NET is a European network of excellence, founded in 2010, that consists of 60 research centres in 34 European countries. One of the key visions and goals of META-NET is a truly multilingual Europe, substantially supported and realised through language technologies. In this article we provide an overview of recent developments around the multilingual Europe topic, and we describe recent and upcoming events as well as recent and upcoming strategy papers. Furthermore, we provide overviews of two new emerging initiatives: the CEF.AT and ELRC activity on the one hand and the Cracking the Language Barrier federation on the other. The paper closes with several suggested next steps to address the current challenges and to open up new opportunities.

KI - Künstliche Intelligenz, 2021
This contribution describes the German EU Council Presidency Translator (EUC PT), a machine translation service created for the German EU Council Presidency in the second half of 2020, which is open to the general public. Following a series of earlier presidency translators, the German version exhibits important extensions and improvements. The German EUC PT is the first to integrate systems from commercial vendors, public services, and a research center, using a mix of custom and generic translation engines, and to introduce a new webpage translation widget. A further important feature is the close collaboration with human translators from the German ministries, who were provided with computer-assisted translation tool plugins integrating machine translation services into their daily work environments. Uptake by the public reflects a huge interest in the service, showing the need for breaking language barriers.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
Due to its effectiveness and performance, the Transformer translation model has attracted wide attention, most recently in terms of probing-based approaches. Previous work focuses on using or probing source linguistic features in the encoder. To date, the way word translation evolves in Transformer layers has not yet been investigated. Naively, one might assume that encoder layers capture source information while decoder layers translate. In this work, we show that this is not quite the case: translation already happens progressively in encoder layers and even in the input embeddings. More surprisingly, we find that some of the lower decoder layers do not actually do that much decoding. We show all of this in terms of a probing approach where we project representations of the layer analyzed to the final trained and frozen classifier level of the Transformer decoder to measure word translation accuracy. Our findings motivate and explain a Transformer configuration change: if translation already happens in the encoder layers, perhaps we can increase the number of encoder layers while decreasing the number of decoder layers, boosting decoding speed without loss in translation quality. Our experiments show that this is indeed the case: we can increase speed by up to a factor of 2.3 with small gains in translation quality, while an 18-4 deep encoder configuration boosts translation quality by +1.42 BLEU (En-De) at a speed-up of 1.4.
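
A compact numpy sketch of the probing setup as we read it: representations from each layer are pushed through the frozen output classifier, and per-layer word-translation accuracy is measured. All shapes and data below are random stand-ins, not the paper's models.

```python
# Minimal numpy sketch of the probing idea: project the hidden states of any
# layer through the *frozen* output classifier and measure word-translation
# accuracy per layer. Shapes and names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, tokens, layers = 100, 16, 50, 6

W_out = rng.normal(size=(dim, vocab))            # frozen output projection
hidden = rng.normal(size=(layers, tokens, dim))  # per-layer representations
gold = rng.integers(0, vocab, size=tokens)       # reference target tokens

for layer in range(layers):
    logits = hidden[layer] @ W_out
    acc = (logits.argmax(axis=-1) == gold).mean()
    print(f"layer {layer}: word translation accuracy {acc:.2f}")
```
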

Machine Translation, 2017
The problem of a total absence of parallel data is present for a large number of language pairs and can severely harm the quality of machine translation. We describe a language-independent method to enable machine translation between a low-resource language (LRL) and a third language, e.g. English. We deal with cases of LRLs for which there is no readily available parallel data between the low-resource language and any other language, but for which there is ample training data between a closely related high-resource language (HRL) and the third language. We take advantage of the similarities between the HRL and the LRL to transform the HRL data into data similar to the LRL using transliteration. The transliteration models are trained on transliteration pairs extracted from Wikipedia article titles. Then, we automatically back-translate monolingual LRL data with the models trained on the transliterated HRL data and use the resulting parallel corpus to train our final models. Our method achieves significant improvements in translation quality, close to the results that can be achieved by a general-purpose neural machine translation system trained on a significant amount of parallel data. Moreover, the method does not rely on the existence of any parallel data for training, but attempts to bootstrap already existing resources in a related language.
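
A toy illustration of the HRL-to-LRL transformation step: in the paper the character correspondences are learned from Wikipedia title pairs, but a hand-made mapping shows the mechanics.

```python
# Toy illustration of the HRL->LRL transformation: a character-level
# transliteration model (here a hand-made mapping; in the paper it is
# learned) applied to HRL-side training data.

translit_map = {"sh": "š", "ch": "č"}  # hypothetical learned correspondences

def transliterate(text, mapping):
    # Greedily apply longest-first substring correspondences.
    for src, tgt in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(src, tgt)
    return text

hrl_corpus = ["shola", "chitat"]
pseudo_lrl = [transliterate(w, translit_map) for w in hrl_corpus]
print(pseudo_lrl)  # ['šola', 'čitat'] -- pseudo-LRL data for training
```
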

IEEE Journal of Selected Topics in Signal Processing, 2017
End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words or sentences which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present paper is twofold. First, we systematically study the neural machine translation (NMT) context vectors, i.e., the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 of 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%.
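
A sketch of the extrinsic use of context vectors described above, under the simplifying assumption that a sentence vector is the mean-pooled encoder output (the vectors below are random stand-ins for real encoder states):

```python
# Sketch: mean-pool the encoder outputs of two sentences and flag them as
# parallel when cosine similarity exceeds a threshold.
import numpy as np

def sentence_vector(encoder_states):
    return encoder_states.mean(axis=0)  # mean-pool over source positions

def is_parallel(states_a, states_b, threshold=0.9):
    va, vb = sentence_vector(states_a), sentence_vector(states_b)
    cos = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
    return cos >= threshold, cos

rng = np.random.default_rng(1)
src = rng.normal(size=(12, 512))                   # states of one sentence
tgt = src + rng.normal(scale=0.1, size=src.shape)  # near-duplicate sentence
print(is_parallel(src, tgt))                       # (True, ~0.99)
```
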
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015
This paper describes the USAAR-WLV taxonomy induction system that participated in the Taxonomy Extraction Evaluation task of SemEval-2015. We extend prior work on using vector space word embedding models for hypernym-hyponym extraction by simplifying the means to extract a projection matrix that transforms any hyponym to its hypernym. This is done by making use of function words, which are usually overlooked in vector space approaches to NLP. Our system performs best in the chemical domain and has achieved competitive results in the overall evaluations.
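
The generic form of such a projection matrix can be learned by least squares over training pairs; here is a minimal sketch with toy vectors (not the USAAR-WLV training setup):

```python
# Sketch of learning a projection matrix P that maps hyponym embeddings to
# hypernym embeddings, via least squares over training pairs (toy data).
import numpy as np

rng = np.random.default_rng(2)
dim, pairs = 8, 100
X = rng.normal(size=(pairs, dim))                           # hyponym vectors
P_true = rng.normal(size=(dim, dim))
Y = X @ P_true + rng.normal(scale=0.01, size=(pairs, dim))  # hypernym vectors

P, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solve min ||XP - Y||
hypernym_vec = X[0] @ P                    # predicted hypernym for X[0]
print(np.allclose(hypernym_vec, Y[0], atol=0.1))  # True
```
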

This paper presents experiments on the human ranking task performed during WMT2013. The goal of these experiments is to rerun the human evaluation task with translation studies students and to compare the results with the human rankings performed by the WMT development teams during WMT2013. More specifically, we test whether, and if so to what extent, we can reproduce the WMT2013 ranking task, and whether specialised knowledge from translation studies influences the results in terms of intra- and inter-annotator agreement as well as in terms of system ranking. We present two experiments on the English-German WMT2013 machine translation output. Analysis of the data follows the methods described in the official WMT2013 report. The results indicate higher inter- and intra-annotator agreement, fewer ties and slight differences in ranking for the translation studies students as compared to the WMT development teams.
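
For readers unfamiliar with agreement measures of this kind, here is a self-contained sketch of Cohen's kappa over two annotators' pairwise ranking judgements (toy labels, not the study's data; the official WMT methodology may use a different agreement variant):

```python
# Cohen's kappa for two annotators' pairwise ranking judgements.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / n ** 2
    return (observed - expected) / (1 - expected)

ann1 = ["A>B", "A<B", "A=B", "A>B", "A<B"]
ann2 = ["A>B", "A<B", "A>B", "A>B", "A=B"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.375
```
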

This paper investigates to what extent the use of paraphrasing in translation memory (TM) matching and retrieval is useful for human translators. Current translation memories lack semantic knowledge such as paraphrasing in matching and retrieval. Because of this, paraphrased segments are often not retrieved. The lack of semantic knowledge also results in inappropriate ranking of the retrieved segments. Gupta and Orăsan (2014) proposed an improved matching algorithm which incorporates paraphrasing. Its automatic evaluation suggested that it could be beneficial to translators. In this paper we perform an extensive human evaluation of the use of paraphrasing in the TM matching and retrieval process. We measure post-editing time, keystrokes, two subjective evaluations, and HTER and HMETEOR to assess the impact on human performance. Our results show that paraphrasing improves TM matching and retrieval, resulting in translation performance increases when translators use paraphrase-enhanced TMs.
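
A hedged sketch of paraphrase-aware matching (not Gupta and Orăsan's exact algorithm): paraphrase substitutions are tried on the TM segment, and the variant most similar to the input determines the match score.

```python
# Paraphrase-aware TM matching sketch: expand the TM segment with paraphrase
# variants and keep the highest-scoring variant against the query.
import difflib

PARAPHRASES = {"passed away": "died"}  # toy paraphrase table

def variants(segment):
    yield segment
    for src, tgt in PARAPHRASES.items():
        if src in segment:
            yield segment.replace(src, tgt)

def match_score(query, segment):
    return max(difflib.SequenceMatcher(a=query, b=v).ratio()
               for v in variants(segment))

print(match_score("the king died yesterday",
                  "the king passed away yesterday"))  # 1.0 after paraphrasing
```
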
We present a number of semi-supervised parsing experiments on the Irish language carried out using a small seed set of manually parsed trees and a larger, yet still relatively small, set of unlabelled sentences. We take two popular dependency parsers, one graph-based and one transition-based, and compare results for both. Results show that using semi-supervised learning in the form of self-training and co-training yields only very modest improvements in parsing accuracy. We also try to use morphological information in a targeted way and fail to see any improvements.
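
A parser-agnostic skeleton of the self-training loop described, in runnable form; `train` and `parse` below are minimal stand-ins for the actual graph-based or transition-based parsers.

```python
# Self-training skeleton: parse unlabelled data with the current model and
# retrain on seed trees plus automatically parsed trees.

def self_train(seed_trees, unlabelled, rounds=3, batch=100):
    model = train(seed_trees)          # initial model from the manual seed set
    auto_trees = []
    for _ in range(rounds):
        chunk, unlabelled = unlabelled[:batch], unlabelled[batch:]
        auto_trees += [parse(model, s) for s in chunk]
        model = train(seed_trees + auto_trees)  # retrain on seed + auto trees
    return model

# Minimal stand-ins so the sketch runs; a real setup calls a dependency parser.
def train(trees):
    return {"size": len(trees)}

def parse(model, sentence):
    return (sentence, "auto-tree")

print(self_train(["t1", "t2"], [f"s{i}" for i in range(300)]))  # {'size': 302}
```
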

We invent referential translation machines (RTMs), a computational model for identifying the translation acts between any two data sets with respect to a reference corpus selected in the same domain, which can be used for judging the semantic similarity between texts. RTMs make quality and semantic similarity judgments possible by using retrieved relevant training data as interpretants for reaching shared semantics. An MTPP (machine translation performance predictor) model derives features measuring the closeness of the test sentences to the training data, the difficulty of translating them, and the presence of acts of translation involved. We view semantic similarity as paraphrasing between any two given texts. Each view is modeled by an RTM model, giving us a new perspective on the binary relationship between the two. Our prediction model ranks 15th on some tasks and 30th overall out of 89 submissions according to the official results of the Semantic Textual Similarity (STS 2013) challenge.

Treebanks are important resources in descriptive, theoretical and computational linguistic research, development and teaching. This paper presents a treebank tool suite (TTS) for and derived from the Penn-II treebank resource (Marcus et al., 1993). The tools include treebank inspection and viewing options which support search for CF-PSG rule tokens extracted from the treebank, graphical display of complete trees containing the rule instance, display of subtrees rooted by the rule instance, and display of the yield of the subtree (with or without context). The search can be further restricted by constraining the yield to contain particular strings. Rules can be ordered by frequency and the user can set frequency thresholds. To process new text, the tool suite provides a PCFG chart parser (based on the CYK algorithm) operating on CFG grammars extracted from the treebank following the method of (Charniak, 1996), as well as an HMM bi-/trigram tagger trained on the tagged version of the tree...
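
In the spirit of the chart parser mentioned above, here is a minimal CYK sketch for a PCFG in Chomsky normal form (toy grammar and probabilities, not the treebank-extracted grammar):

```python
# Minimal CYK chart parser for a CNF PCFG; returns the best probability of
# an S spanning the whole input.
from collections import defaultdict

binary = {("NP", "VP"): [("S", 1.0)], ("DT", "NN"): [("NP", 1.0)]}
lexical = {"the": [("DT", 1.0)], "dog": [("NN", 0.6), ("VP", 0.1)],
           "barks": [("VP", 0.9)]}

def cyk(tokens):
    n = len(tokens)
    chart = defaultdict(dict)  # (i, j) -> {label: best probability}
    for i, tok in enumerate(tokens):
        for label, p in lexical.get(tok, []):
            chart[i, i + 1][label] = max(chart[i, i + 1].get(label, 0), p)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lhs_l, p_l in chart[i, k].items():
                    for lhs_r, p_r in chart[k, j].items():
                        for label, p in binary.get((lhs_l, lhs_r), []):
                            prob = p * p_l * p_r
                            if prob > chart[i, j].get(label, 0):
                                chart[i, j][label] = prob
    return chart[0, n].get("S", 0.0)

print(cyk("the dog barks".split()))  # 0.54
```
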

In this paper we review existing approaches to semantics construction in LTAG (Lexicalised Tree Adjoining Grammar), which are all based on the notion of derivation (trees). We argue that derivation structures in LTAG are not appropriate to guide semantic composition, due to a non-isomorphism, in LTAG, between the syntactic operation of adjunction on the one hand, and the semantic operations of complementation and modification on the other. Linear Logic based "glue" semantics, by now the classical approach to semantics construction within the LFG framework (cf. Dalrymple (1999)), allows for flexible coupling of syntactic and semantic structure. We investigate the application of "glue semantics" to LTAG syntax, using the derived tree as underlying structure, which is more appropriate for principle-based semantics construction. We show how linear logic semantics construction helps to bridge the non-isomorphism between syntactic and semantic operations in LTAG. The glue approach makes it possible to capture non-tree-local dependencies in control and modification structures, and extends to the treatment of scope ambiguity with quantified NPs and VP adverbials. Finally, glue semantics applies successfully to the adjunction-based analysis of long-distance dependencies in LTAG, which differs significantly from the f-structure based analysis in LFG.
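
For readers unfamiliar with glue semantics, here is a minimal textbook-style derivation (a standard LFG-flavoured example for "John sleeps", not the paper's LTAG analysis): lexical premises pair lambda-term meanings with linear-logic types over syntactic resources, and implication elimination composes them.

```latex
% Lexical premises for ``John sleeps'':
\[
\mathit{john} : j \qquad
\lambda x.\,\mathit{sleep}(x) : j \multimap s
\]
% One implication-elimination step consumes the resource j to produce s:
\[
\frac{\mathit{john} : j \qquad \lambda x.\,\mathit{sleep}(x) : j \multimap s}
     {\mathit{sleep}(\mathit{john}) : s}\;{\multimap}\mathcal{E}
\]
```
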

We investigate how morphological features in the form of part-of-speech tags impact parsing performance, using Arabic as our test case. The large, fine-grained tagset of the Penn Arabic Treebank (498 tags) is difficult for parsers to handle, ultimately due to data sparsity. However, ad-hoc conflation of treebank tags runs the risk of discarding potentially useful parsing information. The main contribution of this paper is to describe several automated, language-independent methods that search for the optimal feature combination to help parsing. We first identify 15 individual features from the Penn Arabic Treebank tagset. Including or excluding each of these features results in 32,768 combinations, so we then apply heuristic techniques to identify the combination achieving the highest parsing performance. Our results show a statistically significant improvement of 2.86% for vocalized text and 1.88% for unvocalized text, compared with the baseline provided by the Bikel-Bies Arabic POS mapping (and an improvement of 2.14% using product models for vocalized text, 1.65% for unvocalized text), giving state-of-the-art results for Arabic constituency parsing.
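
One heuristic alternative to testing all 2^15 = 32,768 combinations is greedy forward selection; the sketch below illustrates the search shape (the `evaluate` function is a stand-in for training and scoring a parser, not the paper's exact heuristic):

```python
# Greedy hill-climbing over binary feature inclusion: repeatedly add the
# single feature that most improves the score, stopping when none helps.
import random

FEATURES = [f"feat{i}" for i in range(15)]

def evaluate(combo):
    # Stand-in objective; a real run would return parsing F-score.
    random.seed(hash(combo) % 10_000)
    return random.random()

def greedy_search(features):
    selected, best = frozenset(), evaluate(frozenset())
    improved = True
    while improved:
        improved = False
        for f in features:
            cand = selected | {f}
            score = evaluate(cand)
            if score > best:
                selected, best, improved = cand, score, True
    return selected, best

print(greedy_search(FEATURES))
```
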

With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, in particular focusing on the use of high-quality Statistical Machine Translation (SMT) supplementing the established Translation Memory (TM) technology. In this paper, we present a novel modular approach that utilises state-of-the-art sub-tree alignment and SMT techniques to turn the fuzzy matches from a TM into near-perfect translations. Rather than relegate SMT to a last-resort status where it is only used should the TM system fail to produce the desired output, for us SMT is an integral part of the translation process that we rely on to obtain high-quality results. We show that the presented system consistently produces better-quality output than the TM and performs on par with or better than the standalone SMT system.

In this paper we present a partial dependency parser for Irish, in which Constraint Grammar (CG) rules are used to annotate dependency relations and grammatical functions in unrestricted Irish text. Chunking is performed using a regular-expression grammar which operates on the dependency tagged sentences. As this is the first implementation of a parser for unrestricted Irish text (to our knowledge), there were no guidelines or precedents available. Therefore deciding what constitutes a syntactic unit, and how it should be annotated, accounts for a major part of the early development effort. Currently, all tokens in a sentence are tagged for grammatical function and local dependency. Long-distance dependencies, prepositional attachments or coordination are not handled, resulting in a partial dependency analysis. Evaluations show that the partial dependency analysis achieves an f-score of 93.60% on development data and 94.28% on unseen test data, while the chunker achieves an f-score of 97.20% on development data and 93.50% on unseen test data.
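
A toy version of the chunking step described above: a regular-expression grammar applied to the tagged token sequence (the tag names and NP pattern below are invented for illustration):

```python
# Regex-grammar chunking over a POS-tagged token sequence.
import re

tags = ["DET", "NOUN", "ADJ", "VERB", "DET", "NOUN"]
tag_string = " ".join(tags)

# A toy NP chunk: determiner, noun, optional following adjectives.
NP = re.compile(r"DET NOUN( ADJ)*")
print([m.group() for m in NP.finditer(tag_string)])
# ['DET NOUN ADJ', 'DET NOUN']
```
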
Statistical post-editing (SPE) techniques have been successfully applied to the output of Rule-Based MT (RBMT) systems. In this paper we investigate the impact of SPE on a standard Phrase-Based Statistical Machine Translation (PB-SMT) system, using PB-SMT both for the first-stage MT and for the second-stage SPE system. Our results show that, while a naive approach to using SPE in a PB-SMT pipeline produces no or only modest improvements, a novel combination of source context modelling and thresholding can produce statistically significant improvements of 2 BLEU points over the baseline, using technical translation data for French to English.
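
A hedged sketch of the thresholding idea as we read it: keep the statistical post-edit only when the second-stage model prefers it by a sufficient score margin, otherwise keep the first-stage output. The decision function and scores below are stand-ins, not the paper's model.

```python
# Thresholded post-editing: accept the SPE output only above a score margin.

def choose_output(mt_output, spe_output, score, threshold=0.5):
    """score: how much the SPE model prefers its edit (model-internal units);
    a real system would derive this from translation model scores."""
    return spe_output if score >= threshold else mt_output

print(choose_output("house the red", "the red house", score=0.8))  # SPE kept
print(choose_output("a red house", "the red house", score=0.2))    # MT kept
```
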
With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM) technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and seed an SMT system with them to produce a final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user, with markup highlighting the segments of the translation that need to be checked manually for correctness.
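
A sketch of the seeding idea: target segments of the TM match whose aligned source segments are unchanged in the new input are fixed, and only the remaining gaps are left for SMT. All names and the flat span-based alignment below are our illustration; the actual system uses syntactic sub-tree alignment.

```python
# Fix pre-translated TM target segments whose source side matches the input.

def seed_translation(input_tokens, tm_source, tm_target, alignment):
    """alignment: list of (source_span, target_span) pairs from the TM,
    each span a (start, end) token range."""
    fixed = []
    for (s0, s1), (t0, t1) in alignment:
        if input_tokens[s0:s1] == tm_source[s0:s1]:  # segment unchanged
            fixed.append(((t0, t1), tm_target[t0:t1]))
    return fixed  # pre-translated segments to constrain SMT decoding

inp = "the black tiger sleeps".split()
tm_src = "the black dog sleeps".split()
tm_tgt = "der schwarze Hund schläft".split()
align = [((0, 2), (0, 2)), ((2, 3), (2, 3)), ((3, 4), (3, 4))]
print(seed_translation(inp, tm_src, tm_tgt, align))
# [((0, 2), ['der', 'schwarze']), ((3, 4), ['schläft'])]
```
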