Fashion Meets Computer Vision and NLP at e-Commerce Search
https://doi.org/10.17706/IJCEE.2016.8.1.31-43

Abstract
In this paper, we focus on cross-modal (visual and textual) e-commerce search within the fashion domain. In particular, we investigate two tasks: 1) given a query image, we retrieve textual descriptions that correspond to the visual attributes in the query; and 2) given a textual query that may express an interest in specific visual product characteristics, we retrieve relevant images that exhibit the required visual attributes. To this end, we introduce a new dataset that consists of 53,689 images coupled with textual descriptions. The images contain fashion garments that display a great variety of visual attributes, such as different shapes, colors, and textures, which the accompanying text describes in natural language. Unlike previous datasets, the text provides a rough and noisy description of the item in the image. We extensively analyze this dataset in the context of cross-modal e-commerce search. We investigate two state-of-the-art latent variable models to bridge between textual and visual data: bilingual latent Dirichlet allocation and canonical correlation analysis. We use state-of-the-art visual and textual features and report promising results.
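One of the two latent-variable models named in the abstract is canonical correlation analysis (CCA), which learns linear projections of the visual and textual feature spaces into a shared space where matched image-description pairs are maximally correlated; retrieval then reduces to nearest-neighbor search in that space. The following is only an illustrative NumPy sketch of regularized linear CCA, not the paper's implementation: the toy random matrices below stand in for the real visual (e.g., SIFT or CNN) and textual features, and all dimensions and names are invented for the example.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-3):
    """Fit regularized linear CCA between two views (rows are paired samples).
    Returns projection matrices A (for X) and B (for Y) onto k shared dims."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = len(Xc)
    # Regularized within-view and cross-view covariances.
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten each view via Cholesky factors; the SVD of the whitened
    # cross-covariance yields the canonical directions and correlations.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    K = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(K)
    A = np.linalg.solve(Lx.T, U[:, :k])
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return A, B

def retrieve(query_proj, gallery_proj):
    """Rank gallery items by cosine similarity to the query in the shared space."""
    q = query_proj / np.linalg.norm(query_proj)
    g = gallery_proj / np.linalg.norm(gallery_proj, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

# Toy "visual" and "textual" features driven by a shared latent cause.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 5))
X = Z @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(500, 40))  # images
Y = Z @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(500, 30))  # texts

A, B = fit_cca(X, Y, k=5)
Xp = (X - X.mean(axis=0)) @ A
Yp = (Y - Y.mean(axis=0)) @ B

# Text-to-image retrieval: on this toy data the query's own image
# should come back at (or near) the top of the ranking.
ranks = retrieve(Yp[0], Xp)
print(ranks[0])
```

The same projections support both directions of the task: querying `Xp` rows against the `Yp` gallery gives image-to-text retrieval. The regularization term `reg` keeps the covariance matrices well conditioned, which matters when the feature dimensionality approaches the number of training pairs.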
References (33)
- Chen, H., Gallagher, A., & Girod, B. (2012). Describing clothing by semantic attributes. Proceedings of the 12th European Conference on Computer Vision. Berlin, Heidelberg: Springer-Verlag.
- Yamaguchi, K., Kiapour, M. H., & Berg, T. L. (2013). Paper doll parsing: Retrieving similar styles to parse clothing items. Proceedings of the IEEE International Conference on Computer Vision.
- Bossard, L., Dantone, M., Leistner, C., Wengert, C., Quack, T., & Van Gool, L. (2013). Apparel classification with style. Proceedings of the 11th Asian Conference on Computer Vision (pp. 321-335). Springer-Verlag.
- Mason, R., & Charniak, E. (2014). Domain-specific image captioning. Proceedings of the Eighteenth Conference on Computational Natural Language Learning (pp. 11-20). Ann Arbor, Michigan: ACL.
- Choi, T.-M., Hui, C.-L., & Yu, Y. (2013). Intelligent fashion forecasting systems: Models and applications. Springer Publishing Company, Incorporated.
- Chen, Q., Wang, G., & Tan, C. L. (2013). Modeling fashion. Proceedings of IEEE International Conference on Multimedia and Expo.
- Mori, Y., Takahashi, H., & Oka, R. (1999). Image-to-word transformation based on dividing and vector quantizing images with words. Proceedings of First International Workshop on Multimedia Intelligent Storage and Retrieval Management.
- Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853-899.
- Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., & Ng, A. Y. (2014). Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2, 207-218.
- Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3128-3137).
- Vulić, I., Zoghbi, S., & Moens, M.-F. (2014). Learning to bridge colloquial and formal language applied to linking and search of e-commerce data. Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1195-1198). New York, NY, USA: ACM.
- Yu, J., Mohan, S., Putthividhya, D., & Wong, W.-K. (2014). Latent Dirichlet allocation based diversified retrieval for e-commerce search. Proceedings of the 7th ACM International Conference on Web Search and Data Mining (pp. 463-472). New York, NY, USA: ACM.
- Lin, K., Yang, H.-F., Liu, K.-H., Hsiao, J.-H., & Chen, C.-S. (2015). Rapid clothing retrieval via deep learning of binary codes and hierarchical search. Proceedings of the 5th ACM International Conference on Multimedia Retrieval (pp. 499-502). New York, NY, USA: ACM.
- Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft COCO: Common objects in context. Proceedings of the European Conference on Computer Vision (ECCV).
- Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67-78.
- Mason, R., & Charniak, E. (2013). Annotation of online shopping images without labeled training examples. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing (pp. 44-49).
- Schmid, H. (1995). Improvements in part-of-speech tagging with an application to German. Proceedings of the ACL SIGDAT-Workshop (pp. 47-50).
- Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91-110.
- Vedaldi, A., & Fulkerson, B. (2010). VLFeat: An open and portable library of computer vision algorithms. Proceedings of the ACM International Conference on Multimedia (pp. 1469-1472).
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (pp. 1097-1105). Curran Associates, Inc.
- Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R. B., et al. (2014). Caffe: Convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia (pp. 675-678).
- Rasiwasia, N., Pereira, J. C., Coviello, E., Doyle, G., Lanckriet, G. R. G., Levy, R., et al. (2010). A new approach to cross-modal multimedia retrieval. Proceedings of the 18th International ACM Conference on Multimedia (pp. 251-260).
- Hardoon, D. R., Szedmák, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639-2664.
- Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (pp. 462-471).
- De Smet, W., & Moens, M.-F. (2009). Cross-language linking of news stories on the Web using interlingual topic modeling. Proceedings of the CIKM 2009 Workshop on Social Web Search and Mining.
- De Smet, W., Tang, J., & Moens, M.-F. (2011). Knowledge transfer across multilingual corpora via latent topics. Proceedings of the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 549-560).
- Zoghbi, S., Heyman, G., Carranza, J. C. G., & Moens, M.-F. (2015). Cross-modal fashion search. Proceedings of the 22nd International Conference on Multimedia Modelling.
- Zoghbi, S., Heyman, G., Carranza, J. C. G., & Moens, M.-F. (2015). Cross-modal attribute recognition in fashion. Proceedings of the NIPS Multimodal Machine Learning Workshop.
Susana Zoghbi is a PhD student in computer science at KU Leuven. She obtained a master's degree from the University of British Columbia in 2011. Her research interests lie at the boundary of computer vision and natural language processing, and include deep learning, topic modeling and graphical models.

Geert Heyman is a doctoral researcher in the Department of Computer Science, KU Leuven, Belgium. He completed his undergraduate studies and his master's thesis at the Faculty of Engineering Science at KU Leuven in July 2014. His research interests are statistical models (such as neural networks and graphical models) for natural language processing, in particular language modeling and machine translation.