Cross-Modal Attribute Recognition in Fashion

Abstract

In this paper we focus on cross-modal (visual and textual) attribute recognition within the fashion domain. In particular, we investigate two tasks: 1) given a query image, we retrieve textual descriptions that correspond to the visual attributes in the query; and 2) given a textual query that may express visual characteristics, we retrieve relevant images that exhibit the required visual attributes. To this end, we collected a dataset of 53,689 images coupled with textual descriptions in natural language. The images contain fashion garments that display a great variety of visual attributes, colors being one example; the text provides a rough and noisy description of the item in the image. We extensively analyze this dataset in the context of cross-modal attribute recognition. We investigate two latent variable models to bridge between textual and visual data: bilingual latent Dirichlet allocation and canonical correlation analysis. We use visual and textual features and report promising results.
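The abstract names canonical correlation analysis (CCA) as one of the two models used to bridge the visual and textual modalities. As a rough illustration of that idea, not the authors' implementation, the sketch below fits scikit-learn's CCA on paired image/text feature vectors and ranks items of the opposite modality by cosine similarity in the shared latent space. The random placeholder features, the dimensions, and the retrieve helper are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code): cross-modal retrieval via CCA.
# Random features stand in for real descriptors (e.g., CNN features for
# images, bag-of-words vectors for text).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_pairs, img_dim, txt_dim, latent_dim = 200, 128, 64, 10

# Paired training data: row i of X_img and row i of X_txt describe the same item.
X_img = rng.normal(size=(n_pairs, img_dim))
X_txt = rng.normal(size=(n_pairs, txt_dim))

# Fit CCA to find maximally correlated linear projections of the two views.
cca = CCA(n_components=latent_dim)
cca.fit(X_img, X_txt)

# Project both modalities into the shared latent space.
Z_img, Z_txt = cca.transform(X_img, X_txt)

def retrieve(query_z, gallery_z, k=5):
    """Rank gallery items by cosine similarity to the query in latent space."""
    q = query_z / np.linalg.norm(query_z)
    g = gallery_z / np.linalg.norm(gallery_z, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]

# Task 1: image query -> nearest textual descriptions.
print(retrieve(Z_img[0], Z_txt))
# Task 2: textual query -> nearest images.
print(retrieve(Z_txt[0], Z_img))
```

The same projection supports both query directions the abstract describes: an image query is matched against projected text descriptions (task 1), and a textual query against projected images (task 2).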
