Enriching Word Vectors with Subword Information
Abstract
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model, in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly. We evaluate the resulting word representations on five different languages, on word similarity and analogy tasks.
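To make the construction concrete, the sketch below shows how a word vector could be assembled from character n-gram vectors. It is a minimal illustration, not the reference implementation: the n-gram range of 3 to 6 characters, the `<` and `>` boundary markers, and the `ngram_vectors` lookup table of learned n-gram representations are all assumptions made for the example, as the abstract does not fix these details.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Extract the character n-grams of a word, with '<' and '>'
    marking word boundaries (assumed range: 3 to 6 characters)."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)  # also keep the full word as its own unit
    return grams

def word_vector(word, ngram_vectors, dim=100):
    """Represent a word as the sum of its character n-gram vectors."""
    vec = np.zeros(dim)
    for gram in char_ngrams(word):
        # Unseen n-grams contribute nothing in this toy sketch.
        vec += ngram_vectors.get(gram, np.zeros(dim))
    return vec

# Toy usage: random vectors stand in for learned n-gram representations.
rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=100) for g in char_ngrams("where")}
v = word_vector("where", ngram_vectors)
```

Because rare or unseen words share n-grams with frequent ones, such a sum can still produce a meaningful vector for words that never appear in the training corpus.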