LAYERED: Metric for Machine Translation Evaluation
Abstract
This paper describes the LAYERED metric submitted to the WMT'14 metrics shared task. Various metrics exist for MT evaluation: BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), TER (Snover et al., 2006), etc., but they are found inadequate in quite a few language settings, for example, for free-word-order languages. In this paper, we propose an MT evaluation scheme based on the NLP layers: lexical, syntactic, and semantic. We contend that higher-layer metrics are, after all, needed. Results are presented on the ACL-WMT 2013 and 2014 corpora. We end with a metric composed of weighted metrics at the individual layers, which correlates very well with human judgment.
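As a purely illustrative sketch of such a layered combination (the weight and score symbols below are placeholders, not the paper's exact published formulation), the final score can be viewed as a weighted sum of layer-level scores:

LAYERED(c, r) = w_lex · S_lex(c, r) + w_syn · S_syn(c, r) + w_sem · S_sem(c, r)

where c is a candidate translation, r is a reference, S_lex, S_syn, and S_sem are the lexical, syntactic, and semantic layer scores, and the non-negative weights w_lex, w_syn, and w_sem sum to 1.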
References
- Alexandra Birch and Miles Osborne. Reordering Metrics for MT. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (HLT 2011).
- Alon Lavie and Abhaya Agarwal. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. Proceedings of the Second Workshop on Statistical Machine Translation (StatMT 2007).
- Ananthakrishnan R, Pushpak Bhattacharyya, M Sasikumar and Ritesh M Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON, 2007.
- George Doddington. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. Proceedings of the 2nd International Conference on Human Language Technology Research (HLT 2002).
- Ding Liu and Daniel Gildea. Syntactic Features for Evaluation of Machine Translation. ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
- Findings of the 2013 Workshop on Statistical Machine Translation. ACL-WMT 2013.
- Jesús Giménez and Lluís Màrquez. Linguistic Measures for Automatic Machine Translation Evaluation. Machine Translation, December 2010.
- Karolina Owczarzak, Josef van Genabith and Andy Way. Evaluating Machine Translation with LFG Dependencies. Machine Translation, 21(2):95-119.
- Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. Generating Typed Dependency Parses from Phrase Structure Parses. LREC 2006.
- Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla and John Makhoul. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the Association for Machine Translation in the Americas (AMTA), 2006.
- Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002).
- Results of the WMT13 Metrics Shared Task. ACL-WMT 2013.
- Results of the WMT14 Metrics Shared Task. ACL-WMT 2014.
- Sebastian Padó, Michel Galley, Dan Jurafsky and Chris Manning. Robust Machine Translation Evaluation with Entailment Features. Proceedings of ACL-IJCNLP 2009.
- Ying Zhang, Stephan Vogel and Alex Waibel. Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System? Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), Lisbon, Portugal.