Automating Text Naturalness Evaluation of NLG Systems
2020
Abstract
Automatic methods and metrics that assess quality criteria of automatically generated texts are important for developing NLG systems because they produce repeatable results and allow for a fast development cycle. We present an attempt to automate the evaluation of text naturalness, which is an important characteristic of natural language generation methods. Instead of relying on human participants to score or label text samples, we propose to automate the process using a human likeliness metric that we define and a discrimination procedure based on large pretrained language models and their probability distributions. We analyze the text probability fractions and observe how they are influenced by the size of the generative and discriminative models involved in the process. Our results suggest that larger generators and larger pretrained discriminators are better suited to evaluating text naturalness. A comprehensive validation with human participants is still required as a follow-up to check how well this automatic evaluation scheme correlates with human judgments.
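The abstract does not spell out how the text probability fractions are computed. As a rough illustration only, one plausible reading (an assumption, in the spirit of rank-based analyses of generated text) is to take the rank that a discriminator language model assigns to each observed token and report the fraction of tokens falling into each rank bucket. The function name `rank_bucket_fractions` and the bucket edges 10, 100, and 1000 are hypothetical choices for this sketch, not values taken from the paper, and the per-token ranks are assumed to have been obtained elsewhere.

```python
from bisect import bisect_left


def rank_bucket_fractions(ranks, edges=(10, 100, 1000)):
    """Return the fraction of tokens whose rank falls in each bucket.

    `ranks` holds 1-based ranks of the observed tokens under some
    language model (assumed to be computed elsewhere).
    `edges` are hypothetical, ascending bucket boundaries: the buckets are
    rank <= edges[0], edges[0] < rank <= edges[1], ..., rank > edges[-1].
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    counts = [0] * (len(edges) + 1)
    for rank in ranks:
        # bisect_left maps a rank equal to an edge into the lower bucket.
        counts[bisect_left(edges, rank)] += 1
    return [count / len(ranks) for count in counts]


if __name__ == "__main__":
    # Toy ranks for an eight-token text; not real model output.
    toy_ranks = [1, 5, 10, 11, 100, 500, 1000, 2000]
    print(rank_bucket_fractions(toy_ranks))
```

Under this reading, comparing such fractions between human-written and generated samples would give one crude signal of how human-like a text looks to the discriminator; the metric actually defined in the paper may differ.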