Efficient Vector Representation for Documents through Corruption

2017, arXiv preprint

Abstract

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of word embeddings, and its training objective ensures that a representation generated this way captures the semantic meaning of the document. The framework includes a corruption model, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common, non-discriminative words toward zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. The simple model architecture introduced by Doc2VecC matches or outperforms the state-of-the-art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine. At the same time, the model is very efficient in generating representations of unseen documents at test time.
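
The abstract's two key mechanisms are straightforward to sketch: at test time a document is simply the average of its word embeddings, while during training the document is corrupted by randomly dropping words and rescaling the survivors, which is what induces the data-dependent regularization. The NumPy sketch below is illustrative only; the names (U for the word-embedding matrix, q for the corruption rate, doc_vector, corrupted_doc_vector) are assumptions for exposition, and the snippet shows only how the document vector is formed, not the full training objective.

import numpy as np

rng = np.random.default_rng(0)

def doc_vector(word_ids, U):
    # Test-time representation: a simple average of the word embeddings.
    return U[np.asarray(word_ids)].mean(axis=0)

def corrupted_doc_vector(word_ids, U, q=0.9):
    # Training-time sketch of the corruption step: each word is dropped with
    # probability q and the survivors are rescaled by 1/(1-q), so in
    # expectation the corrupted vector equals the plain average over all
    # positions in the document.
    word_ids = np.asarray(word_ids)
    keep = rng.random(len(word_ids)) >= q
    scaled = U[word_ids[keep]] / (1.0 - q)
    return scaled.sum(axis=0) / len(word_ids)

# Toy usage with an assumed 1000-word vocabulary and 100-dimensional embeddings.
U = rng.normal(scale=0.1, size=(1000, 100))
doc = [3, 17, 42, 42, 7, 99]
print(doc_vector(doc, U).shape)                   # (100,)
print(corrupted_doc_vector(doc, U, q=0.5).shape)  # (100,)

Because the regularization pushes embeddings of common, non-discriminative words toward zero, the plain average at test time is dominated by informative words, which is also why producing representations for unseen documents is essentially a single averaging pass.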
