Machine translation with source-predicted target morphology

Khalil Sima'an

2015

Abstract

We propose a novel pipeline for translation into morphologically rich languages which consists of two steps: initially, the source string is enriched with target morphological features and then fed into a translation model which takes care of reordering and lexical choice that matches the provided morphological features. As a proof of concept we first show improved translation performance for a phrase-based model translating source strings enriched with morphological features projected through the word alignments from target words to source words. Given this potential, we present a model for predicting target morphological features on the source string and its predicate-argument structure, and tackle two major technical challenges: (1) How to fit the morphological feature set to training data? and (2) How to integrate the morphology into the back-end phrase-based model such that it can also be trained on projected (rather than predicted) features for a more efficient pipeline? For the ...
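The projection step mentioned in the proof of concept, where morphological features are carried from target words to source words through the word alignments, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function name `project_morphology`, the `word|tag` output format, the `NULL` placeholder for unaligned source words, and the choice of the leftmost aligned target word when a source word has several alignment links.

```python
from typing import Dict, List, Tuple


def project_morphology(
    source_tokens: List[str],
    target_tags: List[str],
    alignment: List[Tuple[int, int]],
    null_tag: str = "NULL",
) -> List[str]:
    """Enrich each source token with the tag of its aligned target word.

    alignment holds (source_index, target_index) pairs.
    All names and conventions here are hypothetical.
    """
    # Collect, for every source position, the target positions linked to it.
    aligned: Dict[int, List[int]] = {}
    for src_idx, tgt_idx in alignment:
        aligned.setdefault(src_idx, []).append(tgt_idx)

    enriched = []
    for i, token in enumerate(source_tokens):
        if i in aligned:
            # Assumption: with several links, take the leftmost target word.
            tag = target_tags[min(aligned[i])]
        else:
            # Assumption: unaligned source words get a placeholder tag.
            tag = null_tag
        enriched.append(f"{token}|{tag}")
    return enriched


if __name__ == "__main__":
    # Toy English source with made-up tags for a German target sentence.
    source = ["she", "sees", "the", "houses"]
    tags = ["Nom.Sg", "3.Sg.Pres", "Acc.Pl", "Acc.Pl"]
    links = [(0, 0), (1, 1), (2, 2), (3, 3)]
    print(" ".join(project_morphology(source, tags, links)))
```

In the pipeline the abstract describes, strings enriched this way would be the input to the phrase-based model during training, while at test time the projected tags would be replaced by tags predicted from the source side alone.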

