
What Kind of Language Is Hard to Language-Model?

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

https://doi.org/10.18653/V1/P19-1491

Abstract

How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that "translationese" is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.
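The abstract only names the paired-sample multiplicative mixed-effects model; the paper itself fits its model with Stan (Carpenter et al., 2017). Purely as an illustration of the idea as read from the abstract, the sketch below assumes per-sentence effects n_i shared across languages, per-language difficulty coefficients d_j, a multiplicative combination (additive in log space), and tolerance of missing (sentence, language) cells. The function name fit_difficulties and the simulated data are invented for this example and are not the authors' code or inference procedure.

    import numpy as np

    def fit_difficulties(log_y, n_iters=200):
        """Alternating-means fit of log_y[i, j] ~ n[i] + d[j].

        log_y: array of shape (n_sentences, n_languages) holding log per-sentence
        modeling costs, with np.nan marking missing (sentence, language) cells.
        Returns per-sentence effects n and mean-centered language difficulties d.
        """
        mask = ~np.isnan(log_y)
        n = np.zeros(log_y.shape[0])   # shared "inherent difficulty" of each sentence
        d = np.zeros(log_y.shape[1])   # difficulty coefficient of each language
        for _ in range(n_iters):
            # Re-estimate each sentence effect from the languages that observe it.
            n = np.nanmean(np.where(mask, log_y - d[None, :], np.nan), axis=1)
            # Re-estimate each language difficulty from its observed sentences.
            d = np.nanmean(np.where(mask, log_y - n[:, None], np.nan), axis=0)
            d -= d.mean()              # pin the scale so languages are comparable
        return n, d

    # Tiny synthetic check: 100 "verses", 4 "languages", some verses missing.
    rng = np.random.default_rng(0)
    true_n = rng.normal(5.0, 1.0, size=100)
    true_d = np.array([0.0, 0.2, -0.1, 0.3])
    log_y = true_n[:, None] + true_d[None, :] + rng.normal(0.0, 0.05, size=(100, 4))
    log_y[:, 1:][rng.random((100, 3)) < 0.2] = np.nan   # language 0 stays complete
    print(np.round(fit_difficulties(log_y)[1], 2))       # roughly true_d minus its mean

In the paper's setting the observations would be per-verse modeling costs under each language's model; the point of the multiplicative form is that a hard verse raises costs in every language, so subtracting the shared sentence effect isolates the per-language difficulty even when some verses are untranslated. The authors' Bayesian formulation additionally models per-cell noise rather than taking point estimates as above.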

References (46)

  1. Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics, 4:301-312.
  2. Mona Baker. 1993. Corpus linguistics and translation studies: Implications and applications. Text and Technology: In Honour of John Sinclair, pages 233-250.
  3. Emily M. Bender. 2009. Linguistically naïve != language independent: Why NLP needs linguistic typology. In EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics, pages 26-32.
  4. Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289-300.
  5. Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A probabilistic programming language. Journal of Statistical Software, Articles, 76(1):1-32.
  6. Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of NAACL, pages 536-541.
  7. Mathieu Dehouck and Pascal Denis. 2018. A framework for understanding the role of morphology in universal dependency parsing. In Proceedings of EMNLP, pages 2864-2870.
  8. Chris Drummond. 2009. Replicability is not reproducibility: Nor is it good science. In Proceedings of the Evaluation Methods for Machine Learning Workshop at the 26th ICML.
  9. Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  10. Lawrence Fenton. 1960. The sum of log-normal probability distributions in scatter transmission systems. IRE Transactions on Communications Systems, 8(1):57-67.
  11. Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015. Large-scale evidence of dependency length minimization in 37 languages. Proceedings of the National Academy of Sciences, 112(33):10336-10341.
  12. Johannes Graën, Dolores Batinic, and Martin Volk. 2014. Cleaning the Europarl corpus for linguistic applications. In Konvens, pages 222-227.
  13. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.
  14. Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Arya McCarthy, Sabrina J. Mielke, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC). European Language Resources Association (ELRA).
  15. Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, pages 79-86.
  16. Charles J. Kowalski. 1972. On the effects of non-normality on the distribution of the sample product-moment correlation coefficient. Journal of the Royal Statistical Society. Series C (Applied Statistics), 21(1):1-12.
  17. Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66-75, Melbourne, Australia.
  18. Gennadi Lembersky, Noam Ordan, and Shuly Wintner. 2012. Adapting translation models to translationese improves SMT. In Proceedings of EACL, pages 255-265.
  19. Haitao Liu. 2008. Dependency distance as a metric of language comprehension difficulty. Journal of Cognitive Science, 9(2):159-191.
  20. Haitao Liu, Chunshan Xu, and Junying Liang. 2017. Dependency distance: A new perspective on syntactic patterns in natural languages. Physics of Life Reviews, 21:171-193.
  21. Thomas Mayer and Michael Cysouw. 2014. Creating a massively parallel Bible corpus. In Proceedings of LREC, pages 3158-3163.
  22. Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.
  23. Bart van Merriënboer, Amartya Sanyal, Hugo Larochelle, and Yoshua Bengio. 2017. Multiscale sequence modeling with a learned dictionary. arXiv preprint arXiv:1707.00762.
  24. Sabrina J. Mielke and Jason Eisner. 2018. Spell once, summon anywhere: A two-level open-vocabulary language model. arXiv preprint arXiv:1804.08205.
  25. NIST Multimodal Information Group. 2010a. NIST 2002 Open Machine Translation (OpenMT) evaluation LDC2010T10.
  26. NIST Multimodal Information Group. 2010b. NIST 2003 Open Machine Translation (OpenMT) evaluation LDC2010T11.
  27. NIST Multimodal Information Group. 2010c. NIST 2004 Open Machine Translation (OpenMT) evaluation LDC2010T12.
  28. NIST Multimodal Information Group. 2010d. NIST 2005 Open Machine Translation (OpenMT) evaluation LDC2010T14.
  29. NIST Multimodal Information Group. 2010e. NIST 2006 Open Machine Translation (OpenMT) evaluation LDC2010T17.
  30. NIST Multimodal Information Group. 2010f. NIST 2008 Open Machine Translation (OpenMT) evaluation LDC2010T21.
  31. NIST Multimodal Information Group. 2010g. NIST 2009 Open Machine Translation (OpenMT) evaluation LDC2010T23.
  32. NIST Multimodal Information Group. 2013a. NIST 2008-2012 Open Machine Translation (OpenMT) progress test sets LDC2013T07.
  33. NIST Multimodal Information Group. 2013b. NIST 2012 Open Machine Translation (OpenMT) evaluation LDC2013T03.
  34. Lewis M. Paul, Gary F. Simons, Charles D. Fennig, et al. 2009. Ethnologue: Languages of the world, 19th edition. SIL International, Dallas.
  35. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL, pages 2227-2237.
  36. Ella Rabinovich and Shuly Wintner. 2015. Unsupervised identification of translationese. Transactions of the Association for Computational Linguistics, 3:419-432.
  37. Ella Rabinovich, Shuly Wintner, and Ofek Luis Lewinsohn. 2016. A parallel corpus of translationese. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 140-155. Springer.
  38. Philip Resnik, Mari Broman Olsen, and Mona Diab. 1999. The Bible as a parallel corpus: Annotating the 'book of 2000 tongues'. Computers and the Humanities, 33(1):129-153.
  39. Benoît Sagot. 2013. Comparing complexity measures. In Computational Approaches to Morphological Complexity.
  40. S. C. Schwartz and Y. S. Yeh. 1982. On the distribution function and moments of power sums with log-normal components. The Bell System Technical Journal, 61(7):1441-1462.
  41. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of ACL, pages 1715-1725.
  42. Claude E. Shannon. 1951. Prediction and entropy of printed English. Bell Labs Technical Journal, 30(1):50-64.
  43. Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of LREC, pages 4290-4297.
  44. Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88-99.
  45. Ilya Sutskever, James Martens, and Geoffrey Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of ICML, pages 1017-1024.
  46. David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research, pages 1-8.