CobaltF: A Fluent Metric for MT Evaluation
2016, Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
https://doi.org/10.18653/V1/W16-2339

Abstract
The vast majority of Machine Translation (MT) evaluation approaches are based on the idea that the closer the MT output is to a human reference translation, the higher its quality. While translation quality has two important aspects, adequacy and fluency, the existing reference-based metrics are largely focused on the former. In this work we combine our metric UPF-Cobalt, originally presented at the WMT15 Metrics Task, with a number of features intended to capture translation fluency. Experiments show that the integration of fluency-oriented features significantly improves the results, rivalling the best-performing evaluation metrics on the WMT15 data.
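To make the idea of the combination concrete, below is a minimal, hypothetical sketch in Python: a toy reference-based adequacy score (standing in for UPF-Cobalt) is combined with two simple fluency indicators (average language-model log-probability and out-of-vocabulary rate) through a weighted sum. The feature set, the `adequacy_score` stand-in, and the weights are illustrative assumptions only, not the features or learning setup actually used in CobaltF, where the combination would be tuned on human judgements.

```python
import math
from collections import Counter

OOV_LOGPROB = -20.0  # floor assigned to tokens the language model has never seen


def adequacy_score(candidate: str, reference: str) -> float:
    """Toy stand-in for a reference-based adequacy metric (unigram F1).
    UPF-Cobalt itself relies on word alignment and contextual similarity."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def fluency_features(candidate: str, lm_logprob) -> dict:
    """Illustrative fluency indicators: average LM log-probability and
    out-of-vocabulary rate. lm_logprob returns None for unknown tokens."""
    tokens = candidate.split()
    logprobs = [lm_logprob(tok) for tok in tokens]
    oov = [lp is None for lp in logprobs]
    filled = [OOV_LOGPROB if o else lp for lp, o in zip(logprobs, oov)]
    n = max(len(tokens), 1)
    return {"avg_logprob": sum(filled) / n, "oov_rate": sum(oov) / n}


def combined_score(candidate, reference, lm_logprob, weights) -> float:
    """Weighted combination of adequacy and fluency features; in practice
    the weights would be learned from human judgements (e.g., WMT rankings)."""
    feats = fluency_features(candidate, lm_logprob)
    return (
        weights["adequacy"] * adequacy_score(candidate, reference)
        + weights["avg_logprob"] * feats["avg_logprob"]
        + weights["oov_rate"] * feats["oov_rate"]
    )


if __name__ == "__main__":
    # Tiny unigram "language model", purely to keep the example self-contained.
    counts = Counter("the cat sat on the mat the dog sat on the rug".split())
    total = sum(counts.values())
    lm = lambda tok: math.log(counts[tok] / total) if tok in counts else None

    weights = {"adequacy": 1.0, "avg_logprob": 0.05, "oov_rate": -0.5}
    print(combined_score("the cat sat on the mat", "a cat is sitting on the mat", lm, weights))
```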
References
- Nguyen Bach, Fei Huang, and Yaser Al-Onaizan. 2011. Goodness: A Method for Measuring Machine Translation Confidence. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 211-219. Association for Computational Linguistics (ACL).
- John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence Estimation for Machine Translation. In Proceedings of the 20th International Conference on Computational Linguistics, pages 315-321. ACL.
- Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1-46, Lisbon, Portugal, September. ACL.
- Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to Rank Using Gradient Descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89-96. ACM.
- Chris Callison-Burch and Miles Osborne. 2006. Re-evaluating the Role of BLEU in Machine Translation Research. In Proceedings of the European Association for Computational Linguistics (EACL), pages 249-256. ACL.
- Elisabet Comelles, Jordi Atserias, Victoria Arranz, and Irene Castellón. 2012. VERTa: Linguistic Features in MT Evaluation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 3944-3950.
- Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376-380.
- Mariano Felice and Lucia Specia. 2012. Linguistic Features for Quality Estimation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 96-103. ACL.
- Marina Fomicheva and Núria Bel. 2016. Using Contextual Information for Machine Translation Evaluation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 2755-2761.
- Marina Fomicheva, Núria Bel, and Iria da Cunha. 2015a. Neutralizing the Effect of Translation Shifts on Automatic Machine Translation Evaluation. In Computational Linguistics and Intelligent Text Processing, pages 596-607.
- Marina Fomicheva, Núria Bel, Iria da Cunha, and An- ton Malinovskiy. 2015b. UPF-Cobalt Submission to WMT15 Metrics Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 373-379.
- Jesús Giménez and Lluís Màrquez. 2010a. Asiya: An Open Toolkit for Automatic Machine Translation (Meta-)Evaluation. The Prague Bulletin of Mathematical Linguistics, (94):77-86.
- Jesús Giménez and Lluís Màrquez. 2010b. Linguistic Measures for Automatic Machine Translation Evaluation. Machine Translation, 24(3):209-240.
- Francisco Guzmán, Shafiq Joty, Lluís Màrquez, and Preslav Nakov. 2014. Using Discourse Structure Improves Machine Translation Evaluation. In ACL (1), pages 687-698.
- Ding Liu and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 25-32.
- Chi-Kiu Lo, Anand Karthik Tumuluru, and Dekai Wu. 2012. Fully Automatic Semantic MT Evaluation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 243-252. ACL.
- Ngoc-Quang Luong, Laurent Besacier, and Benjamin Lecouteux. 2015. Towards Accurate Predictors of Word Quality for Machine Translation: Lessons Learned on French-English and English-Spanish Systems. Data & Knowledge Engineering, 96:32-42.
- Matouš Macháček and Ondřej Bojar. 2014. Results of the WMT14 Metrics Shared Task. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 293-301.
- Benjamin Marie and Marianna Apidianaki. 2015. Alignment-based Sense Selection in METEOR and the RATATOUILLE Recipe. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 385-391.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the ACL, pages 311-318. ACL.
- Sylvain Raybaud, David Langlois, and Kamel Smaïli. 2011. This Sentence is Wrong. Detecting Errors in Machine-translated Sentences. Machine Translation, 25(1):1-34.
- Helmut Schmid. 1999. Improvements in Part-of-Speech Tagging with an Application to German. In Natural Language Processing Using Very Large Corpora, pages 13-25. Springer.
- Kashif Shah, Trevor Cohn, and Lucia Specia. 2013. An Investigation on the Effectiveness of Features for Translation Quality Estimation. In Proceedings of the Machine Translation Summit, volume 14, pages 167-174.
- Lucia Specia and Jesús Giménez. 2010. Combining Confidence Estimation and Reference-based Metrics for Segment-level MT Evaluation. In The Ninth Conference of the Association for Machine Translation in the Americas.
- Lucia Specia, Marco Turchi, Nicola Cancedda, Marc Dymetman, and Nello Cristianini. 2009. Estimating the Sentence-level Quality of Machine Translation Systems. In 13th Conference of the European Association for Machine Translation, pages 28-37.
- Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine Translation Evaluation versus Quality Estimation. Machine Translation, 24(1):39-50.
- Lucia Specia, Gustavo Paetzold, and Carolina Scarton. 2015. Multi-level Translation Quality Prediction with QuEst++. In 53rd Annual Meeting of the Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: System Demonstrations, pages 115-120.
- Miloš Stanojevic and Khalil Sima'an. 2015. BEER 1.1: ILLC UvA Submission to Metrics and Tuning Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 396-401.
- Md Arafat Sultan, Steven Bethard, and Tamara Sumner. 2014. Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence. Transactions of the ACL, 2:219-230.
- Gideon Toury. 2012. Descriptive Translation Studies and beyond: Revised edition, volume 100. John Benjamins Publishing.
- C. Uhrik and W. Ward. 1997. Confidence Metrics Based on N-gram Language Model Backoff Behaviors. In Proceedings of Fifth European Conference on Speech Communication and Technology, pages 2771-2774.
- Hui Yu, Qingsong Ma, Xiaofeng Wu, and Qun Liu. 2015. CASICT-DCU Participation in WMT2015 Metrics Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 417-421.