Zero-shot Learning for Visual Recognition Problems
2015
Abstract
In this thesis we discuss different aspects of zero-shot learning and propose solutions for three challenging visual recognition problems: 1) unknown object recognition from images, 2) novel action recognition from videos, and 3) unseen object segmentation. In all three problems, we have two different sets of classes: the “known classes”, which are used in the training phase, and the “unknown classes”, for which there are no training instances. Our proposed approach exploits the available semantic relationships between known and unknown object classes and uses them to transfer appearance models from known object classes to unknown object classes in order to recognize unknown objects. We also propose an approach to recognize novel actions from videos by learning a joint model that links videos and text. Finally, we present a ranking-based approach for zero-shot object segmentation. We represent each unknown object class as a semantic ranking of all the known classes and use this semanti...