Machine Translation Evaluation Resources and Methods: A Survey

Lifeng Han

doi:10.48550/ARXIV.1605.04515

2018, Cornell University - arXiv

https://doi.org/10.48550/ARXIV.1605.04515

18 pages

Abstract

We present a survey of Machine Translation (MT) evaluation that covers both manual and automatic evaluation methodologies. The traditional human evaluation criteria mainly include intelligibility, fidelity, fluency, adequacy, comprehension, and informativeness. The advanced human assessments include task-oriented measures, post-editing, segment ranking, direct assessment, and other extended criteria. We classify the automatic evaluation methods into two categories: those based on lexical similarity and those that apply linguistic features. The lexical similarity methods cover edit distance, precision, recall, F-measure, and word order. The linguistic features can be divided into syntactic features and semantic features. The syntactic features include part-of-speech tags, phrase types, and sentence structures, and the semantic features include named entities, synonyms, textual entailment, paraphrase, semantic roles, and language models. Deep learning models for evaluation have been proposed only very recently, following the popularity of word embeddings. We then introduce the methodology for evaluating MT evaluation itself, including different correlation scores, and the recent quality estimation (QE) tasks for MT. This paper differs from previous works (Dorr et al., 2009; EuroMatrix, 2007) in several respects: it introduces some recent developments in MT evaluation measures, a different classification running from manual to automatic evaluation measures, the recent QE tasks for MT, and a concise organization of the content. We hope this work will help MT researchers to easily pick the metrics best suited to their specific MT model development, and help MT evaluation researchers to get a general picture of how MT evaluation research has developed. Furthermore, we hope this work can also shed some light on evaluation tasks other than translation in natural language processing (NLP) fields.

Figures (3)

Figure 1: Human Evaluation Methods

where c is the total length of the candidate translation corpus, and r refers to the sum of the effective reference sentence lengths in the corpus. If there are multiple references for each candidate sentence, then the reference whose length is nearest to that of the candidate sentence is selected as the effective one. In the BLEU metric, the n-gram precision weight λ_n is usually selected as a uniform weight. However, the 4-gram precision value is usually very low or even zero when the test corpus is small. To weight more heavily those n-grams that are more informative, Doddington (2002) proposes the NIST metric with the information weight added.
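The definitions of c and r above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's own implementation: the function names are hypothetical, the tie-breaking rule (prefer the shorter reference when two are equally close) is an assumption, and the penalty formula used is the standard one from Papineni et al. (2002), namely 1 if c > r and exp(1 - r/c) otherwise.

```python
import math


def effective_reference_length(candidate_len, reference_lens):
    """Return the reference length nearest to the candidate length.

    Ties are broken toward the shorter reference; this tie-breaking
    rule is an assumption and is not stated in the text above.
    """
    return min(reference_lens,
               key=lambda ref_len: (abs(ref_len - candidate_len), ref_len))


def brevity_penalty(candidates, references):
    """Corpus-level brevity penalty.

    candidates: list of tokenized candidate sentences.
    references: list (parallel to candidates) of lists of tokenized references.
    """
    # c: total length of the candidate translation corpus
    c = sum(len(cand) for cand in candidates)
    # r: sum of the effective reference lengths over the corpus
    r = sum(
        effective_reference_length(len(cand), [len(ref) for ref in refs])
        for cand, refs in zip(candidates, references)
    )
    if c == 0:
        return 0.0  # an empty candidate corpus gets the maximum penalty
    return 1.0 if c > r else math.exp(1.0 - r / c)


# Toy corpus: one candidate sentence with two references.
candidates = [["the", "cat", "sat"]]
references = [[["the", "cat", "sat", "down"],
               ["a", "cat", "sat", "on", "the", "mat"]]]
bp = brevity_penalty(candidates, references)
```

In this toy corpus the candidate has 3 tokens and the references have 4 and 6, so the 4-token reference is the effective one and the candidate is penalized for being shorter than it.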

References (128)

J. Albrecht and R. Hwa. 2007. A re-examination of machine learning approaches for sentence-level mt evaluation. In The Proceedings of the 45th Annual Meeting of the ACL, Prague, Czech Republic.
Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38:135-187.
D. Arnold. 2003. Computers and Translation: A translator's guide-Chap8 Why translation is difficult for computers. Benjamins Translation Library.
Eleftherios Avramidis, Maja Popović, David Vilar, and Aljoscha Burchardt. 2011. Evaluate with confidence estimation: Machine ranking of translation outputs using grammatical features. In Proceedings of WMT.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL.
Srinivas Bangalore, Owen Rambow, and Steven Whittaker. 2000. Evaluation metrics for generation. In Proceedings of INLG.
Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of NAACL.
Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 workshop on statistical machine translation. In Proceedings of WMT.
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12-58, Baltimore, Maryland, USA, June. Association for Computational Linguistics.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1-46, Lisbon, Portugal, September. Association for Computational Linguistics.
Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.
Christian Buck. 2012. Black box features for the wmt 2012 quality estimation shared task. In Proceedings of WMT.
Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006a. Improved statistical machine translation using paraphrases. In Proceedings of HLT-NAACL.
Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006b. Re-evaluating the role of bleu in machine translation research. In Proceedings of EACL, volume 2006, pages 249-256.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007a. (meta-) evaluation of machine translation. In Proceedings of WMT.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007b. (meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 64-71. Association for Computational Linguistics.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2008. Further meta-evaluation of machine translation. In Proceedings of WMT.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 workshop on statistical machine translation. In Proceedings of the 4th WMT.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the WMT.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of WMT.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of WMT.
Michael Carl and Andy Way. 2003. Recent advances in example-based machine translation.
John B. Carroll. 1966. An experiment in evaluating the quality of translation. Mechanical Translation and Computational Linguistics, 9(3-4):67-75.
Boxing Chen, Roland Kuhn, and Samuel Larkin. 2012. Port: a precision-order-recall mt evaluation metric for tuning. In Proceedings of the ACL.
David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 263-270.
KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.
Kenneth Church and Eduard Hovy. 1991. Good applications for crummy machine translation. In Proceedings of the Natural Language Processing Systems Evaluation Workshop.
Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.
Elisabet Comelles, Jordi Atserias, Victoria Arranz, and Irene Castellón. 2012. Verta: Linguistic features in mt evaluation. In LREC, pages 3944-3950.
Ido Dagan and Oren Glickman. 2004. Probabilistic textual entailment: Generic applied modeling of language variability. In Learning Methods for Text Understanding and Mining workshop.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. Machine Learning Challenges: LNCS, 3944:177-190.
Daniel Dahlmeier, Chang Liu, and Hwee Tou Ng. 2011. Tesla at wmt2011: Translation evaluation and tunable metric. In Proceedings of WMT.
George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In HLT Proceedings.
Bonnie Dorr, Matt Snover, Nitin Madnani, et al. 2009. Part 5: Machine translation evaluation. In Bonnie Dorr, editor, DARPA GALE program report.
Jennifer B. Doyon, John S. White, and Kathryn B. Taylor. 1999. Task-based evaluation for machine translation. In Proceedings of MT Summit 7.
H. Echizen-ya and K. Araki. 2010. Automatic evaluation method for machine translation using noun-phrase chunking. In Proceedings of the ACL.
Matthias Eck and Chiori Hori. 2005. Overview of the iwslt 2005 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).
Project EuroMatrix. 2007. 1.3: Survey of machine translation evaluation. In EuroMatrix Project Report, Statistical and Hybrid MT between All European Languages, co-ordinator: Prof. Hans Uszkoreit.
Marcello Federico, Luisa Bentivogli, Michael Paul, and Sebastian Stüker. 2011. Overview of the iwslt 2011 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT).
Alexander Fraser and Daniel Marcu. 2007. Measuring word alignment quality for statistical machine translation. Computational Linguistics.
Michael Gamon, Anthony Aue, and Martine Smets. 2005. Sentence-level mt evaluation without reference translations beyond language modelling. In Proceedings of EAMT, pages 103-112.
Daniel Gildea, Giorgio Satta, and Hao Zhang. 2006. Factoring synchronous grammars by sorting. In Proceedings of ACL.
Jesús Giménez and Lluís Màrquez. 2008. A smorgasbord of features for automatic mt evaluation. In Proceedings of WMT, pages 195-198.
Jesús Giménez and Lluís Màrquez. 2007. Linguistic features for automatic evaluation of heterogenous mt systems. In Proceedings of WMT.
Yvette Graham and Qun Liu. 2016. Achieving accurate conclusions in evaluation of automatic machine translation metrics. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1-10.
Yvette Graham, Timothy Baldwin, and Nitika Mathur. 2015. Accurate evaluation of segment-level machine translation metrics. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1183-1191.
Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In Proceeding of SIGIR.
Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015a. Machine translation evaluation using recurrent neural networks. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 380-384, Lisbon, Portugal, September. Association for Computational Linguistics.
Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015b. Reval: A simple and effective machine translation evaluation metric based on recurrent neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1066-1072. Association for Computational Linguistics.
Francisco Guzmán, Shafiq Joty, Lluís Màrquez, and Preslav Nakov. 2015. Pairwise neural machine translation evaluation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and The 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL'15), pages 805-814, Beijing, China, July. Association for Computational Linguistics.
Francisco Guzmán, Shafiq Joty, Lluís Màrquez, and Preslav Nakov. 2017. Machine translation evaluation with neural networks. Comput. Speech Lang., 45(C):180-200, September.
Anders Hald. 1998. A History of Mathematical Statistics from 1750 to 1930. ISBN-10: 0471179124. Wiley-Interscience; 1 edition.
Lifeng Han, Derek Fai Wong, and Lidia Sam Chao. 2012. Lepor: A robust evaluation metric for machine translation with augmented factors. In Proceedings of COLING.
Aaron Li Feng Han, Derek Fai Wong, Lidia Sam Chao, Liangye He, Shuo Li, and Ling Zhu. 2013a. Phrase tagset mapping for french and english treebanks and its application in machine translation evaluation. In International Conference of the German Society for Computational Linguistics and Language Technology, LNAI Vol. 8105, pages 119-131.
Aaron Li Feng Han, Derek Fai Wong, Lidia Sam Chao, Yi Lu, Liangye He, Yiming Wang, and Jiaji Zhou. 2013b. A description of tunable machine translation evaluation systems in wmt13 metrics task. In Proceedings of WMT, pages 414-421.
Aaron Li Feng Han, Derek Fai Wong, Lidia Sam Chao, Liangye He, and Yi Lu. 2014. Unsupervised quality estimation model for english to german translation and its application in extensive supervised evaluation. In The Scientific World Journal. Issue: Recent Advances in Information Technology, pages 1-12.
Lifeng Han, Gareth Jones, and Alan Smeaton. 2020. MultiMWE: Building a multi-lingual multi-word expression (MWE) parallel corpora. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2970-2979, Marseille, France, May. European Language Resources Association.
Lifeng Han. 2014. LEPOR: An Augmented Machine Translation Evaluation Metric. University of Macau, Macao.
Young Sook Hwang, Andrew Finch, and Yutaka Sasaki. 2007. Improving statistical machine translation using shallow linguistic knowledge. Computer Speech and Language, 21(2):350-372.
Maurice G. Kendall and Jean Dickinson Gibbons. 1990. Rank Correlation Methods. Oxford University Press, New York.
Maurice G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30:81-93.
Maxim Khalilov and José A. R. Fonollosa. 2011. Syntax-based reordering for statistical machine translation. Computer Speech and Language, 25(4):761-788.
Margaret King, Andrei Popescu-Belis, and Eduard Hovy. 2003. Femti: Creating and using a framework for mt evaluation. In Proceedings of the Machine Translation Summit IX.
Philipp Koehn and Kevin Knight. 2009. Statistical machine translation, November 24. US Patent 7,624,005.
Philipp Koehn and Christof Monz. 2005. Shared task: Statistical machine translation between european languages. In Proceedings of the ACL Workshop on Building and Using Parallel Texts.
Philipp Koehn and Christof Monz. 2006a. Manual and automatic evaluation of machine translation between european languages. In Proceedings on the Workshop on Statistical Machine Translation, pages 102-121, New York City, June. Association for Computational Linguistics.
Philipp Koehn and Christof Monz. 2006b. Manual and automatic evaluation of machine translation between european languages. In Proceedings of WMT.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007a. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007b. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177-180. Association for Computational Linguistics.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.
Philipp Koehn. 2010. Statistical Machine Translation. Cambridge University Press.
J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159-174.
Jamal Laoudi, Ra R. Tate, and Clare R. Voss. 2006. Task-based mt evaluation: From who/when/where extraction to event understanding. In Proceedings of LREC06, pages 2048-2053.
Mirella Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of ACL.
Alon Lavie. 2013. Automated metrics for mt evaluation. Machine Translation, 11:731.
Guy Lebanon and John Lafferty. 2002. Combining rankings using conditional probability models on permutations. In Proceeding of the ICML.
Gregor Leusch and Hermann Ney. 2009. Edit distances with block movements and error rate confidence estimates. Machine Translation, 23(2-3).
Liang You Li, Zheng Xian Gong, and Guo Dong Zhou. 2012. Phrase-based evaluation for machine translation. In Proceedings of COLING, pages 663-672.
A. LI. 2005. Results of the 2005 nist machine translation evaluation. In Proceedings of WMT.
Chin-Yew Lin and E. H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of NAACL.
Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL.
Ding Liu and Daniel Gildea. 2005. Syntactic features for evaluation of machine translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
Chang Liu, Daniel Dahlmeier, and Hwee Tou Ng. 2011. Better evaluation metrics lead to better machine translation. In Proceedings of EMNLP.
Chi Kiu Lo and Dekai Wu. 2011a. Meant: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. In Proceedings of ACL.
Chi Kiu Lo and Dekai Wu. 2011b. Structured vs. flat semantic role representations for machine translation evaluation. In Proceedings of the 5th Workshop on Syntax and Structure in Statistical Translation (SSST-5).
Chi Kiu Lo, Anand Karthik Tumuluru, and Dekai Wu. 2012. Fully automatic semantic mt evaluation. In Proceedings of WMT.
Qingsong Ma, Fandong Meng, Daqi Zheng, Mingxuan Wang, Yvette Graham, Wenbin Jiang, and Qun Liu. 2016. Maxsd: A neural machine translation evaluation metric optimized by maximizing similarity distance. In Natural Language Understanding and Intelligent Applications - 5th CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2016, and 24th International Conference on Computer Processing of Oriental Languages, ICCPOL 2016, Kunming, China, December 2-6, 2016, Proceedings, pages 153-161.
Gideon Maillette de Buy Wenniger and Khalil Sima'an. 2015. Labeling hierarchical phrase-based models without linguistic resources. Machine Translation, 29(3):225-265.
José B. Mariño, Rafael E. Banchs, Josep M. Crego, Adrià de Gispert, Patrik Lambert, José A. R. Fonollosa, and Marta R. Costa-jussà. 2006. N-gram based machine translation. Computational Linguistics, 32(4):527-549.
Elaine Marsh and Dennis Perzanowski. 1998. Muc-7 evaluation of ie technology: Overview of results. In Proceedings of Message Understanding Conference (MUC-7).
Kathleen R. McKeown. 1979. Paraphrasing using given and new information in a question-answer system. In Proceedings of ACL.
Arul Menezes, Kristina Toutanova, and Chris Quirk. 2006. Microsoft research treelet translation system: Naacl 2006 europarl evaluation. In Proceedings of WMT.
Marie Meteer and Varda Shaked. 1988. Strategies for effective paraphrasing. In Proceedings of COLING.
G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Wordnet: an on-line lexical database. International Journal of Lexicography, 3(4):235-244.
Douglas C. Montgomery and George C. Runger. 2003. Applied statistics and probability for engineers. John Wiley and Sons, New York, third edition.
John Moran and David Lewis. 2012. Unobtrusive methods for low-cost manual assessment of machine translation. Tralogy I [Online], Session 5.
L. Màrquez. 2013. Automatic evaluation of machine translation quality. Dialogue 2013 invited talk, extended.
Sergei Nirenburg. 1989. Knowledge-based machine translation. Machine Translation, 4(1):5-24.
Franz Josef Och. 2003. Minimum error rate training for statistical machine translation. In Proceedings of ACL.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL.
Kristen Parton, Joel Tetreault, Nitin Madnani, and Martin Chodorow. 2011. E-rating machine translation. In Proceedings of WMT.
Michael Paul, Marcello Federico, and Sebastian Stüker. 2010. Overview of the iwslt 2010 evaluation campaign. In Proceeding of IWSLT.
M. Paul. 2009. Overview of the iwslt 2009 evaluation campaign. In Proceeding of IWSLT.
Karl Pearson. 1900. On the criterion that a given sys- tem of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50(5):157-175.
Maja Popović, David Vilar, Eleftherios Avramidis, and Aljoscha Burchardt. 2011. Evaluation without references: Ibm1 scores as evaluation metrics. In Proceedings of WMT.
M. Popovic and Hermann Ney. 2007. Word error rates: Decomposition over pos classes and applications for error analysis. In Proceedings of WMT.
Claus Povlsen, Nancy Underwood, Bradley Music, and Anne Neville. 1998. Evaluating text-type suitability for machine translation a case study on an english-danish system. In Proceeding LREC.
Sylvain Raybaud, David Langlois, and Kamel Smaïli. 2011. "this sentence is wrong." detecting errors in machine-translated sentences. Machine Translation, 25(1):1-34.
F. Sánchez-Martínez and M. L. Forcada. 2009. Inferring shallow-transfer machine translation rules from small parallel corpora. Journal of Artificial Intelligence Research, 34:605-635.
Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for nlp. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, pages 1-15, Berlin, Heidelberg. Springer Berlin Heidelberg.
Bahar Salehi, Nitika Mathur, Paul Cook, and Timothy Baldwin. 2015. The impact of multiword expression compositionality on machine translation evaluation. In Proceedings of the 11th Workshop on Multiword Expressions, pages 54-59, Denver, Colorado, June. Association for Computational Linguistics.
Matthew Snover, Bonnie J. Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceeding of AMTA.
Xingyi Song and Trevor Cohn. 2011. Regression and ranking based optimisation for sentence level mt evaluation. In Proceedings of WMT.
L. Specia and J. Giménez. 2010. Combining confidence estimation and reference-based metrics for segment-level mt evaluation. In The Ninth Conference of the Association for Machine Translation in the Americas (AMTA).
Lucia Specia, Najeh Hajlaoui, Catalina Hallett, and Wilker Aziz. 2011. Predicting machine translation adequacy. In Machine Translation Summit XIII.
Miloš Stanojević and Khalil Sima'an. 2014a. Beer: Better evaluation as ranking. In Proceedings of the Ninth Workshop on Statistical Machine Translation.
Miloš Stanojević and Khalil Sima'an. 2014b. Evaluating word order recursively over permutation-forests. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
Miloš Stanojević and Khalil Sima'an. 2014c. Fitting sentence level translation evaluation with many dense features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
Keh-Yih Su, Wu Ming-Wen, and Chang Jing-Shin. 1992. A new quantitative quality measure for machine translation systems. In Proceeding of COLING.
Christoph Tillmann, Stephan Vogel, Hermann Ney, Arkaitz Zubiaga, and Hassan Sawaf. 1997. Accelerated dp based search for statistical translation. In Proceeding of EUROSPEECH.
Joseph P Turian, Luke Shea, and I Dan Melamed. 2006. Evaluation of machine translation and its evaluation. Technical report, DTIC Document.
Clare R. Voss and Ra R. Tate. 2006. Task-based evaluation of machine translation (mt) engines: Measuring how well people extract who, when, where-type elements in mt output. In Proceedings of 11th Annual Conference of the European Association for Machine Translation (EAMT-2006), pages 203-212.
Warren Weaver. 1955. Translation. Machine Translation of Languages: Fourteen Essays.
John S. White and Kathryn B. Taylor. 1998. A task-oriented evaluation metric for machine translation. In Proceeding LREC.
John S. White, Theresa O'Connell, and Francis O'Mara. 1994. The arpa mt evaluation methodologies: Evolution, lessons, and future approaches. In Proceeding of AMTA.
Billy Wong and Chun yu Kit. 2009. Atec: automatic evaluation of machine translation via word choice and word order. Machine Translation, 23(2-3):141-155.
Hui Yu, Xiaofeng Wu, Jun Xie, Wenbin Jiang, Qun Liu, and Shouxun Lin. 2014. RED: A reference dependency based MT evaluation metric. In COLING 2014, 25th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, August 23-29, 2014, Dublin, Ireland, pages 2042-2051.
Jiajun Zhang and Chengqing Zong. 2015. Deep neural networks in machine translation: An overview. IEEE Intelligent Systems, (5):16-25.
