Using Dynamic Embeddings to Improve Static Embeddings
2019, ArXiv
Abstract
How to build high-quality word embeddings is a fundamental research question in the field of natural language processing. Traditional methods such as Skip-Gram and Continuous Bag-of-Words learn *static* embeddings by training lookup tables that translate words into dense vectors. Static embeddings are directly useful for solving lexical semantics tasks, and can be used as input representations for downstream problems. Recently, contextualized embeddings such as BERT have been shown to be more effective than static embeddings as NLP input embeddings. Such embeddings are *dynamic*, calculated according to a sentential context using a network structure. One limitation of dynamic embeddings, however, is that they cannot be used without a sentence-level context. We explore the advantages of dynamic embeddings for training static embeddings, using contextualized embeddings to facilitate the training of static embedding lookup tables. Results show that the resulting embeddings outperform ...
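For concreteness, below is a minimal sketch of one way dynamic embeddings can yield a static lookup table: run BERT over a corpus and average each word type's contextual vectors across all of its occurrences. This is an illustration of the general idea only, not the paper's training procedure; the `bert-base-uncased` checkpoint, the `transformers` library, and the two-sentence toy corpus are illustrative assumptions.

```python
# Illustrative sketch (not the paper's method): derive a static embedding
# table by averaging BERT's contextual subword vectors per word type.
from collections import defaultdict

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Toy corpus; in practice this would be a large unlabeled text collection.
corpus = [
    "The bank raised interest rates yesterday .",
    "They sat on the bank of the river .",
]

sums = defaultdict(lambda: torch.zeros(model.config.hidden_size))
counts = defaultdict(int)

with torch.no_grad():
    for sentence in corpus:
        words = sentence.split()
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]  # (num_subwords, hidden_size)
        for idx, word_id in enumerate(enc.word_ids(batch_index=0)):
            if word_id is None:  # skip special tokens like [CLS] / [SEP]
                continue
            sums[words[word_id]] += hidden[idx]
            counts[words[word_id]] += 1

# Static table: one fixed vector per word type, usable without any context.
static_table = {w: sums[w] / counts[w] for w in sums}
print(static_table["bank"].shape)  # torch.Size([768])
```

Unlike the dynamic vectors it is built from, the resulting table can be queried for a word in isolation, which is exactly the setting where contextualized models alone cannot be applied.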