
Automating Text Naturalness Evaluation of NLG Systems

2020

Abstract

Automatic methods and metrics that assess various quality criteria of automatically generated texts are important for developing NLG systems because they produce repeatable results and allow for a fast development cycle. We present here an attempt to automate the evaluation of text naturalness, a key characteristic of natural language generation methods. Instead of relying on human participants to score or label the text samples, we propose to automate the process by using a human likeliness metric we define and a discrimination procedure based on large pretrained language models and their probability distributions. We analyze the text probability fractions and observe how they are influenced by the size of the generative and discriminative models involved in the process. Based on our results, bigger generators and larger pretrained discriminators are more appropriate for a better evaluation of text naturalness. A comprehensive validation procedure with human participants is required as a follow-up to check how well this automatic evaluation scheme correlates with human judgments.
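
The discrimination procedure summarized above can be illustrated with a small script. The snippet below is a minimal sketch, not the paper's exact metric: it assumes GPT-2 (via the Hugging Face transformers library) as the pretrained discriminator and, in the spirit of GLTR-style detection, uses the fraction of tokens that the model ranks among its top-k most probable continuations as a rough probability-fraction signal; the choice of GPT-2, k = 100, and the direction of the score are all assumptions made for illustration.

```python
# Minimal sketch of scoring a text with a pretrained LM discriminator.
# Assumptions (not from the paper): GPT-2 as the discriminator and a
# GLTR-style top-k membership fraction as the probability-fraction signal.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_k_fraction(text: str, k: int = 100) -> float:
    """Return the fraction of tokens whose true id is among the model's
    k most probable next tokens (an illustrative, hypothetical proxy)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits               # (1, seq_len, vocab_size)
    preds = logits[0, :-1, :]                    # predictions for positions 1..n-1
    targets = ids[0, 1:]                         # tokens actually observed
    topk_ids = preds.topk(k, dim=-1).indices     # (seq_len - 1, k)
    hits = (topk_ids == targets.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

if __name__ == "__main__":
    sample = "The committee approved the proposal after a short discussion."
    print(f"top-100 fraction: {top_k_fraction(sample):.2f}")
```

A text whose tokens consistently fall inside the discriminator's top-k predictions is easy for the model to anticipate; aggregating such per-token probability fractions over a set of outputs is one way an automatic human likeliness score of the kind described above could be computed.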
