Fashion Meets Computer Vision and NLP at e-Commerce Search
https://doi.org/10.17706/IJCEE.2016.8.1.31-43

Abstract
In this paper, we focus on cross-modal (visual and textual) e-commerce search within the fashion domain. In particular, we investigate two tasks: 1) given a query image, we retrieve textual descriptions that correspond to the visual attributes in the query; and 2) given a textual query that may express an interest in specific visual product characteristics, we retrieve relevant images that exhibit the required visual attributes. To this end, we introduce a new dataset that consists of 53,689 images coupled with textual descriptions. The images contain fashion garments that display a great variety of visual attributes, such as different shapes, colors, and textures, which the accompanying text describes in natural language. Unlike previous datasets, the text provides a rough and noisy description of the item in the image. We extensively analyze this dataset in the context of cross-modal e-commerce search. We investigate two state-of-the-art latent variable models to bridge between textual and visual data: bilingual latent Dirichlet allocation and canonical correlation analysis. We use state-of-the-art visual and textual features and report promising results.
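One of the two latent-variable models named in the abstract is canonical correlation analysis (CCA), which learns linear projections of the visual and textual feature spaces into a shared space where matched image-description pairs are maximally correlated; retrieval then reduces to nearest-neighbor search in that space. The following is only an illustrative NumPy sketch of regularized linear CCA, not the paper's implementation: the toy random matrices below stand in for the real visual (e.g., SIFT or CNN) and textual features, and all dimensions and names are invented for the example.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-3):
    """Fit regularized linear CCA between two views (rows are paired samples).
    Returns projection matrices A (for X) and B (for Y) onto k shared dims."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = len(Xc)
    # Regularized within-view and cross-view covariances.
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten each view via Cholesky factors; the SVD of the whitened
    # cross-covariance yields the canonical directions and correlations.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    K = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(K)
    A = np.linalg.solve(Lx.T, U[:, :k])
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return A, B

def retrieve(query_proj, gallery_proj):
    """Rank gallery items by cosine similarity to the query in the shared space."""
    q = query_proj / np.linalg.norm(query_proj)
    g = gallery_proj / np.linalg.norm(gallery_proj, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

# Toy "visual" and "textual" features driven by a shared latent cause.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 5))
X = Z @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(500, 40))  # images
Y = Z @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(500, 30))  # texts

A, B = fit_cca(X, Y, k=5)
Xp = (X - X.mean(axis=0)) @ A
Yp = (Y - Y.mean(axis=0)) @ B

# Text-to-image retrieval: on this toy data the query's own image
# should come back at (or near) the top of the ranking.
ranks = retrieve(Yp[0], Xp)
print(ranks[0])
```

The same projections support both directions of the task: querying `Xp` rows against the `Yp` gallery gives image-to-text retrieval. The regularization term `reg` keeps the covariance matrices well conditioned, which matters when the feature dimensionality approaches the number of training pairs.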
References (33)
- Chen, H., Gallagher, A., & Girod, B. (2012). Describing clothing by semantic attributes. Proceedings of the 12th European Conference on Computer Vision. Berlin, Heidelberg: Springer-Verlag.
- Yamaguchi, K., Kiapour, M. H., & Berg, T. L. (2013). Paper doll parsing: Retrieving similar styles to parse clothing items. Proceedings of the IEEE International Conference on Computer Vision.
- Bossard, L., Dantone, M., Leistner, C., Wengert, C., Quack, T., & Van Gool, L. (2013). Apparel classification with style. Proceedings of the 11th Asian Conference on Computer Vision (pp. 321-335). Springer-Verlag.
- Mason, R., & Charniak, E. (2014). Domain-specific image captioning. Proceedings of the Eighteenth Conference on Computational Natural Language Learning (pp. 11-20). Ann Arbor, Michigan: ACL.
- Choi, T.-M., Hui, C.-L., & Yu, Y. (2013). Intelligent fashion forecasting systems: Models and applications. Springer Publishing Company, Incorporated.
- Chen, Q., Wang, G., & Tan, C. L. (2013). Modeling fashion. Proceedings of IEEE International Conference on Multimedia and Expo.
- Mori, Y., Takahashi, H., & Oka, R. (1999). Image-to-word transformation based on dividing and vector quantizing images with words. Proceedings of First International Workshop on Multimedia Intelligent Storage and Retrieval Management.
- Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853-899.
- Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., & Ng, A. Y. (2014). Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2, 207-218.
- Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3128-3137).
- Vulić, I., Zoghbi, S., & Moens, M.-F. (2014). Learning to bridge colloquial and formal language applied to linking and search of e-commerce data. Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1195-1198). New York, NY, USA: ACM.
- Yu, J., Mohan, S., Putthividhya, D., & Wong, W.-K. (2014). Latent Dirichlet allocation based diversified retrieval for e-commerce search. Proceedings of the 7th ACM International Conference on Web Search and Data Mining (pp. 463-472). New York, NY, USA: ACM.
- Lin, K., Yang, H.-F., Liu, K.-H., Hsiao, J.-H., & Chen, C.-S. (2015). Rapid clothing retrieval via deep learning of binary codes and hierarchical search. Proceedings of the 5th ACM International Conference on Multimedia Retrieval (pp. 499-502). New York, NY, USA: ACM.
- Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft COCO: Common objects in context. Proceedings of the European Conference on Computer Vision (ECCV).
- Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67-78.
- Mason, R., & Charniak, E. (2013). Annotation of online shopping images without labeled training examples. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing (pp. 44-49).
- Schmid, H. (1995). Improvements in part-of-speech tagging with an application to German. Proceedings of the ACL SIGDAT-Workshop (pp. 47-50).
- Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91-110.
- Vedaldi, A., & Fulkerson, B. (2010). VLFeat: An open and portable library of computer vision algorithms. Proceedings of the ACM International Conference on Multimedia (pp. 1469-1472).
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (pp. 1097-1105). Curran Associates, Inc.
- Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R. B., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia (pp. 675-678).
- Rasiwasia, N., Pereira, J. C., Coviello, E., Doyle, G., Lanckriet, G. R. G., Levy, R., et al. (2010). A new approach to cross-modal multimedia retrieval. Proceedings of the 18th International ACM Conference on Multimedia (pp. 251-260).
- Hardoon, D. R., Szedmák, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639-2664.
- Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (pp. 462-471).
- De Smet, W., & Moens, M.-F. (2009). Cross-language linking of news stories on the Web using interlingual topic modeling. Proceedings of the CIKM 2009 Workshop on Social Web Search and Mining.
- De Smet, W., Tang, J., & Moens, M.-F. (2011). Knowledge transfer across multilingual corpora via latent topics. Proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 549-560).
- Zoghbi, S., Heyman, G., Carranza, J. C. G., & Moens, M.-F. (2015). Cross-modal fashion search. Proceedings of the 22nd International Conference on Multimedia Modelling.
- Zoghbi, S., Heyman, G., Carranza, J. C. G., & Moens, M.-F. (2015). Cross-modal attribute recognition in fashion. Proceedings of the NIPS Multimodal Machine Learning Workshop.
Susana Zoghbi is a PhD student in computer science at KU Leuven. She obtained a master's degree from the University of British Columbia in 2011. Her research interests lie at the boundary of computer vision and natural language processing, and include deep learning, topic modeling and graphical models.

Geert Heyman is a doctoral researcher in the Department of Computer Science, KU Leuven, Belgium. He completed his undergraduate studies and his master's thesis at the Faculty of Engineering Science at KU Leuven in July 2014. His research interests are statistical models (such as neural networks and graphical models) for natural language processing, in particular language modeling and machine translation.