Cascaded Phrase-Based Statistical Machine Translation Systems
2012
Abstract
Statistical methods are currently the prevalent approach to implementing machine translation systems. However, the resulting translations are usually flawed to some degree. We assume that a statistical baseline system can be reused to automatically learn how to (partially) correct translation errors, i.e., to turn a "broken" target translation into a better one. By training and testing on the initial bilingual data, we constructed a system S1, which was then used to translate the source-language part of the training corpus. The newly translated corpus and its reference translation were used to train and test a second, similar system S2. Without any additional data, the chain S1+S2 shows a noticeable quality increase over S1 in terms of BLEU scores, for both translation directions (English to Romanian and Romanian to English).
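The data flow of the cascade can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `train` and `translate` functions are hypothetical toy stand-ins (a word-for-word lookup over position-aligned sentences) for a real phrase-based SMT system, and all function names are assumptions made for illustration. The sketch only shows how S2 is trained on the output of S1 paired with the original references, and it makes no claim about translation quality.

```python
from collections import Counter, defaultdict


def train(source_sentences, target_sentences):
    """Toy stand-in for SMT training (assumption, not a real SMT system).

    Learns the most frequent target word for each source word, assuming
    the sentence pairs are position-aligned and of equal length.
    """
    counts = defaultdict(Counter)
    for src, tgt in zip(source_sentences, target_sentences):
        for s_word, t_word in zip(src.split(), tgt.split()):
            counts[s_word][t_word] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def translate(model, sentence):
    """Translate word by word; unknown words are copied unchanged."""
    return " ".join(model.get(w, w) for w in sentence.split())


def train_cascade(source_sentences, reference_sentences):
    """Train S1 on (source, reference), then S2 on (S1 output, reference)."""
    s1 = train(source_sentences, reference_sentences)
    # Re-translate the source side of the training corpus with S1.
    s1_output = [translate(s1, s) for s in source_sentences]
    # S2 learns to map S1's output to the reference translation.
    s2 = train(s1_output, reference_sentences)
    return s1, s2


def translate_cascade(s1, s2, sentence):
    """Apply the chain S1+S2: translate with S1, then correct with S2."""
    return translate(s2, translate(s1, sentence))


if __name__ == "__main__":
    # Hypothetical toy corpus with abstract tokens instead of real text.
    source = ["a b", "a c"]
    reference = ["x y", "x z"]
    s1, s2 = train_cascade(source, reference)
    print(translate_cascade(s1, s2, "a c"))
```

Note that no data beyond the initial bilingual corpus enters the second training step; the only new input to S2 is S1's own output, which matches the setup described in the abstract.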