GloVe: Global Vectors for Word Representation
Abstract
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
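For concreteness, the model described above amounts to a weighted least-squares problem over the nonzero co-occurrence counts X_ij: each word vector w_i and context vector w̃_j is fit so that w_i·w̃_j + b_i + b̃_j ≈ log X_ij, with a weighting function f(X_ij) that down-weights rare pairs and caps very frequent ones. The following is a minimal sketch of that objective on a toy vocabulary; the constants x_max = 100 and α = 3/4 follow the paper's suggested values, while the toy counts, dimensions, and initialization are purely illustrative.

```python
import numpy as np

# Sketch of the GloVe weighted least-squares cost
#   J = sum over nonzero (i, j) of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
# Only the nonzero entries of the co-occurrence matrix are stored and visited.

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): (x / x_max)^alpha for x < x_max, and 1 otherwise."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, cooc):
    """cooc maps (i, j) -> X_ij over the *nonzero* co-occurrence counts only."""
    loss = 0.0
    for (i, j), x_ij in cooc.items():
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x_ij)
        loss += weight(x_ij) * diff ** 2
    return loss

# Toy usage: 5-word vocabulary, 3-dimensional vectors (illustrative values).
rng = np.random.default_rng(0)
V, d = 5, 3
W, W_tilde = rng.normal(size=(V, d)) * 0.1, rng.normal(size=(V, d)) * 0.1
b, b_tilde = np.zeros(V), np.zeros(V)
cooc = {(0, 1): 4.0, (1, 2): 1.0, (3, 4): 12.0}  # sparse co-occurrence counts
print(glove_loss(W, W_tilde, b, b_tilde, cooc))
```

Because the sum ranges only over nonzero entries, the training cost scales with the number of observed word pairs rather than with the size of the full V × V matrix, which is what makes the global approach tractable on large corpora.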