Modeling Video Evolution For Action Recognition
Abstract
We present a method for capturing video-wide temporal information for action recognition. We postulate that a function able to order the frames of a video temporally, based on their appearance, captures how that appearance evolves over the course of the video. We learn one such ranking function per video with a ranking machine and use its parameters as a new video representation. The proposed method is easy to interpret and implement, fast to compute, and effective at recognizing a wide variety of actions. We evaluate it extensively on datasets for generic action recognition (Hollywood2 and HMDB51), fine-grained actions (MPII cooking activities), and gestures (Chalearn). Results show that the proposed method brings an absolute improvement of 7-10%, while remaining compatible with and complementary to further improvements in appearance-based and local-motion-based methods.
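The core idea, using the parameters of a per-video frame-ordering function as the video's descriptor, can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes ridge regression onto the frame index as a simple stand-in for the ranking machine, and it assumes the frame features are smoothed with a running mean and centered before fitting. The names `running_mean` and `rank_pool` and the parameter `reg` are hypothetical.

```python
import numpy as np


def running_mean(frames):
    """Running mean of per-frame features (smoothing step, an assumption here)."""
    frames = np.asarray(frames, dtype=float)
    counts = np.arange(1, frames.shape[0] + 1)[:, None]
    return np.cumsum(frames, axis=0) / counts


def rank_pool(frames, reg=1e-3):
    """Return a (D,) vector u such that smoothed_frame @ u tends to grow with time.

    frames: (T, D) array, one appearance feature vector per frame.
    Ridge regression onto the frame index stands in for a ranking machine.
    """
    x = running_mean(frames)
    t = np.arange(1, x.shape[0] + 1, dtype=float)
    # Center features and targets, which is equivalent to fitting an intercept.
    xc = x - x.mean(axis=0)
    tc = t - t.mean()
    # Closed-form ridge solution: u = (Xc^T Xc + reg * I)^-1 Xc^T tc
    a = xc.T @ xc + reg * np.eye(x.shape[1])
    return np.linalg.solve(a, xc.T @ tc)
```

In this sketch, `u` has the same dimensionality as a single frame feature, so videos of different lengths map to fixed-size vectors that a standard classifier can consume. Reversing the frame order changes `u`, which is what lets the representation encode the direction of appearance change rather than only its average.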