
LAYERED: Metric for Machine Translation Evaluation

Abstract

This paper describes the LAYERED metric, submitted to the WMT'14 shared metrics task. Various metrics exist for MT evaluation, such as BLEU (Papineni, 2002), METEOR (Lavie, 2007) and TER (Snover, 2006), but they are found inadequate in quite a few language settings, for example in the case of free-word-order languages. In this paper, we propose an MT evaluation scheme that is based on the NLP layers: lexical, syntactic and semantic. We contend that metrics at the higher layers are, after all, needed. Results are presented on the ACL-WMT 2013 and 2014 corpora. We end with a metric composed of weighted metrics at the individual layers, which correlates very well with human judgment.
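
The abstract describes the final LAYERED score as a weighted combination of metrics computed at the lexical, syntactic and semantic layers. As a rough illustration, the Python sketch below shows what such a combination could look like; the unigram-precision lexical scorer, the placeholder syntactic and semantic scores, and the weights are hypothetical stand-ins, not the components or weights used in the paper.

```python
# Illustrative sketch only: the layer scorers and weights below are
# placeholders, not the actual components of the LAYERED metric.

def lexical_score(hyp, ref):
    """Unigram precision as a stand-in for a lexical-layer metric (e.g. BLEU)."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not hyp_tokens:
        return 0.0
    matches = sum(1 for tok in hyp_tokens if tok in ref_tokens)
    return matches / len(hyp_tokens)

def layered_score(layer_scores, weights):
    """Weighted combination of per-layer scores (lexical, syntactic, semantic)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6, "weights should sum to 1"
    return sum(weights[layer] * score for layer, score in layer_scores.items())

if __name__ == "__main__":
    hyp = "the cat sat on the mat"
    ref = "the cat is on the mat"
    scores = {
        "lexical": lexical_score(hyp, ref),
        "syntactic": 0.72,   # placeholder: e.g. a dependency-overlap score
        "semantic": 0.65,    # placeholder: e.g. a shallow semantic-match score
    }
    weights = {"lexical": 0.4, "syntactic": 0.35, "semantic": 0.25}  # illustrative
    print(f"Combined layered score: {layered_score(scores, weights):.3f}")
```

In the paper's setting, the weights would be tuned so that the combined score correlates well with human judgments on the WMT data.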

References

  1. Alexandra Birch and Miles Osborne. Reordering Metrics for MT. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (HLT 2011).
  2. Alon Lavie and Abhaya Agarwal. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT 2007).
  3. Ananthakrishnan R, Pushpak Bhattacharyya, M Sasikumar and Ritesh M Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON, 2007.
  4. George Doddington. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. Proceedings of the 2nd International Conference on Human Language Technology Research (HLT 2002).
  5. Ding Liu and Daniel Gildea. Syntactic Features for Evaluation of Machine Translation. Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
  6. Findings of the 2013 Workshop on Statistical Machine Translation. ACL-WMT 2013.
  7. Jesús Giménez and Lluís Màrquez. Linguistic Measures for Automatic Machine Translation Evaluation. Machine Translation, December 2010.
  8. K. Owczarzak, J. van Genabith and A. Way. Evaluating Machine Translation with LFG Dependencies. Machine Translation 21(2):95-119.
  9. Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating Typed Dependency Parses from Phrase Structure Parses. LREC 2006.
  10. Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the Association for Machine Translation in the Americas, 2006.
  11. Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002).
  12. Results of the WMT13 Metrics Shared Task. ACL-WMT 2013.
  13. Results of the WMT14 Metrics Shared Task. ACL-WMT 2014.
  14. Sebastian Padó, Michel Galley, Dan Jurafsky and Chris Manning. Robust Machine Translation Evaluation with Entailment Features. Proceedings of ACL-IJCNLP 2009.
  15. Y. Zhang, S. Vogel and A. Waibel. Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System? Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal.