Analysis of Italian Word Embeddings
Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), 2017
Abstract
English. In this work we analyze the performance of two of the most widely used word embedding algorithms, skip-gram and continuous bag of words (CBOW), on the Italian language. These algorithms have many hyper-parameters that must be carefully tuned to obtain accurate word representations in vector space. We provide an extensive analysis and evaluation, showing the best parameter configurations for specific analogy tasks. Italian. [Translated] In this work we analyze the performance of two of the most widely used word embedding algorithms: skip-gram and continuous bag of words. These algorithms have several hyper-parameters that must be set carefully to obtain accurate representations of words within vector spaces. We present a thorough analysis and an evaluation of the two algorithms, showing which parameter configurations are best for specific applications.
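As a minimal illustration of the analogy task the abstract refers to, the sketch below solves "a is to b as c is to ?" by the standard vector-offset method (3CosAdd): the candidate closest in cosine similarity to b - a + c, excluding the query words. The embedding table and the Italian word pair are toy values chosen for illustration, not trained vectors.

```python
import numpy as np

# Toy embedding table; vectors are hand-crafted for illustration,
# not produced by skip-gram or CBOW training.
emb = {
    "re":     np.array([1.0, 1.0, 0.0]),  # "king"
    "uomo":   np.array([1.0, 0.0, 0.0]),  # "man"
    "donna":  np.array([0.0, 1.0, 1.0]),  # "woman"
    "regina": np.array([0.0, 2.0, 1.0]),  # "queen"
}

def analogy(a, b, c, emb):
    """Solve a : b = c : ? via the vector offset b - a + c,
    ranking candidates by cosine similarity and excluding the inputs."""
    target = emb[b] - emb[a] + emb[c]

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("uomo", "re", "donna", emb))  # → regina
```

In a real evaluation the table would come from a trained model (e.g. word2vec embeddings over an Italian corpus), and accuracy would be measured over a benchmark set of such analogy quadruples.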