Modeling Video Evolution For Action Recognition

Abstract

In this paper we present a method to capture video-wide temporal information for action recognition. We postulate that a function capable of ordering the frames of a video temporally, based on their appearance, captures the evolution of appearance within the video well. We learn one such ranking function per video via a ranking machine and use its parameters as a new video representation. The proposed method is easy to interpret and implement, fast to compute, and effective in recognizing a wide variety of actions. We perform a large number of evaluations on datasets for generic action recognition (Hollywood2 and HMDB51), fine-grained actions (MPII cooking activities) and gestures (ChaLearn). Results show that the proposed method brings an absolute improvement of 7-10%, while being compatible with and complementary to further improvements in appearance-based and local-motion-based methods.
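To make the central idea concrete, here is a minimal sketch of how the parameters of a per-video ordering function can serve as a fixed-length descriptor. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the ranking machine, so ridge regression onto the frame index is swapped in as a simple stand-in, and the names `video_representation`, `fraction_correctly_ordered`, and `reg`, as well as the synthetic "video", are hypothetical.

```python
import numpy as np


def video_representation(frames, reg=1.0):
    """Fit a linear function w so that frames[t] @ w tends to grow with t.

    Stand-in for a ranking machine (assumption): ridge regression of the
    centered frame index on the centered per-frame features.
    The returned w is used as the descriptor of the whole video.
    """
    X = np.asarray(frames, dtype=float)
    n_frames, dim = X.shape
    X = X - X.mean(axis=0)
    t = np.arange(n_frames, dtype=float)
    t = t - t.mean()
    # Closed-form ridge solution: (X^T X + reg * I) w = X^T t
    return np.linalg.solve(X.T @ X + reg * np.eye(dim), X.T @ t)


def fraction_correctly_ordered(frames, w):
    """Share of frame pairs (i < j) that w scores in the correct temporal order."""
    scores = np.asarray(frames, dtype=float) @ w
    i, j = np.triu_indices(len(scores), k=1)
    return float(np.mean(scores[i] < scores[j]))


# Toy "video" (assumption): appearance drifts along one direction, plus noise.
rng = np.random.default_rng(0)
n_frames, dim = 40, 8
direction = rng.normal(size=dim)
frames = np.outer(np.linspace(0.0, 1.0, n_frames), direction)
frames = frames + 0.01 * rng.normal(size=(n_frames, dim))

w = video_representation(frames)                 # descriptor of the video
w_reversed = video_representation(frames[::-1])  # same frames, played backwards
```

In this sketch the descriptor has the same dimensionality as a single frame's feature vector regardless of video length, and playing the video backwards flips its sign, which suggests how such parameters can encode the direction in which appearance evolves. A classifier would then be trained on one such vector per video.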
