Modeling Video Evolution For Action Recognition

Basura Fernando; Efstratios Gavves; Jose Oramas M

Abstract

In this paper we present a method to capture video-wide temporal information for action recognition. We postulate that a function capable of ordering the frames of a video temporally (based on their appearance) captures the evolution of the appearance within the video well. We learn such ranking functions per video via a ranking machine and use their parameters as a new video representation. The proposed method is easy to interpret and implement, fast to compute and effective in recognizing a wide variety of actions. We perform a large number of evaluations on datasets for generic action recognition (Hollywood2 and HMDB51), fine-grained actions (MPII Cooking Activities) and gestures (ChaLearn). Results show that the proposed method brings an absolute improvement of 7-10%, while being compatible with and complementary to further improvements in appearance-based and local-motion-based methods.

Figures (12)

Figure 1: Illustration of how VideoDarwin works. In this video, as Emma moves out of the house, the appearance of the frames evolves with time. A ranking machine learns this evolution of the appearance over time and returns a ranking function. We use the parameters of this ranking function as a new video representation which captures vital information about the action.

firstname.lastname@esat.kuleuven.be

Figure 2: Processing steps of VideoDarwin for action recognition. First, we extract frames $x_1, \dots, x_n$ from each video. Then we generate the feature $v_t$ for frame $t$ by processing frames $x_1$ to $x_t$, as explained in Section 3.2. Afterwards, using ranking machines, we learn the video representation $u$ for each video. Finally, the video-specific $u$ vectors are used as the representation for action classification.
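The steps in Figure 2 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the per-frame features are random toy vectors, and a ridge regression onto the frame index stands in for the ranking machine (the paper's actual solver is not reproduced here). The names `time_varying_mean` and `video_representation` are hypothetical.

```python
import numpy as np


def time_varying_mean(frames):
    """Return v_t = mean(x_1, ..., x_t) for every t, one row per frame."""
    frames = np.asarray(frames, dtype=float)
    counts = np.arange(1, frames.shape[0] + 1)[:, None]
    return np.cumsum(frames, axis=0) / counts


def video_representation(frames, reg=1e-2):
    """Fit u so that the score u . v_t tends to grow with the frame index t.

    Assumption: ridge regression onto the frame index stands in for the
    ranking machine; it is not the solver used in the paper.
    """
    v = time_varying_mean(frames)
    t = np.arange(1, v.shape[0] + 1, dtype=float)
    # Centre both sides: only the ordering of the scores matters, not their offset.
    v_c = v - v.mean(axis=0)
    t_c = t - t.mean()
    gram = v_c.T @ v_c + reg * np.eye(v.shape[1])
    return np.linalg.solve(gram, v_c.T @ t_c)


# Toy "video": 30 frames of 4-D features that drift steadily over time.
rng = np.random.default_rng(0)
n_frames, dim = 30, 4
drift = np.linspace(0.0, 1.0, n_frames)[:, None] * np.array([1.0, -0.5, 0.0, 2.0])
frames = drift + 0.01 * rng.standard_normal((n_frames, dim))

u = video_representation(frames)          # one vector per video
scores = time_varying_mean(frames) @ u    # ranking score u . v_t per frame
print(u.shape, np.corrcoef(scores, np.arange(n_frames))[0, 1])
```

In a full pipeline, one such `u` per video would be passed to an ordinary classifier; that last step is omitted here.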

Figure 3: Vector value representations for VideoDarwin. For a random video, we see the signal for (a) the original independent frames, (b) the moving average and (c) the time varying mean vectors. Each colour represents a dimension. In (d), (e) and (f), the y-axis shows the predicted ranking score of each frame obtained from signals (a), (b) and (c), respectively, after applying the ranking function (predicted ranking value at $t$ is $u^\top v_t$).
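To make the three signals in panels (a) to (c) concrete, the sketch below computes them for a toy feature sequence and reports how much each one changes from frame to frame. The trailing window used for the moving average, the window length of 5, and the `roughness` measure are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def moving_average(x, window):
    """Mean of the last `window` frames (shorter at the start of the video)."""
    x = np.asarray(x, dtype=float)
    return np.stack([x[max(0, t - window + 1): t + 1].mean(axis=0)
                     for t in range(x.shape[0])])


def time_varying_mean(x):
    """Mean of all frames seen so far: v_t = (x_1 + ... + x_t) / t."""
    x = np.asarray(x, dtype=float)
    return np.cumsum(x, axis=0) / np.arange(1, x.shape[0] + 1)[:, None]


def roughness(signal):
    """Average absolute change between consecutive frames."""
    return float(np.abs(np.diff(signal, axis=0)).mean())


# Toy data: a slow trend buried in per-frame noise.
rng = np.random.default_rng(1)
n_frames, dim = 200, 3
trend = np.linspace(0.0, 1.0, n_frames)[:, None] * np.ones(dim)
raw = trend + 0.5 * rng.standard_normal((n_frames, dim))   # (a) independent frames

signals = {
    "independent": raw,
    "moving_average": moving_average(raw, window=5),        # (b)
    "time_varying_mean": time_varying_mean(raw),            # (c)
}
for name, sig in signals.items():
    print(f"{name:>18}: roughness = {roughness(sig):.4f}")
```

On noisy toy data like this, the time varying mean is the smoothest of the three, which fits the caption's point that it exposes the slow, video-wide evolution rather than frame-level fluctuation.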

Table 1: Comparison of different video representations for VideoDarwin. Results reported in mAP on the Hollywood2 dataset using FDVD with Fisher vectors. As also motivated in Sec. 3.2, the time varying mean vector representation better captures the video-wide temporal information present in a video.

Table 2: One-vs-all accuracy on the HMDB51 dataset [12].

Table 3: Results in mAP on the Hollywood2 dataset [19].

Table 4: Results in mAP on the MPII Cooking fine-grained action dataset [25].

Figure 4: Per-class AP on the Hollywood2 dataset. The AP is improved significantly for all classes, with the exception of “Drive car”, where context already provides useful information.

Figure 5: Mean class similarity obtained with (left) max-pooling and (right) VideoDarwin on the MPII Cooking activities dataset using BOW-based MBH features extracted on dense trajectories. Non-linear forward VideoDarwin is used for our method.

Table 5: Detailed analysis of precision and recall on the ChaLearn gesture recognition dataset [2].

Table 6: Comparison of the proposed approach with the state-of-the-art methods, sorted in reverse chronological order. Results reported in mAP for the Hollywood2 and Cooking datasets. For HMDB51 we report one-vs-all classification accuracy.

Table 7: Comparison of the proposed approach with the state-of-the-art methods on the ChaLearn gesture recognition dataset, sorted in reverse chronological order.

Figure 6: Comparison of action recognition performance after removing some frames from each video randomly on Hollywood2. VideoDarwin appears to be stable even when up to 20% of the frames are missing.
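The frame-removal protocol behind Figure 6 can be sketched as follows. Everything here is a toy setup under stated assumptions: the features are synthetic, `rank_pool` is a hypothetical ridge-regression stand-in for the ranking machine, and cosine similarity between the two resulting vectors is used as a rough proxy for stability. It does not reproduce the recognition results reported in the figure.

```python
import numpy as np


def rank_pool(frames, reg=1e-2):
    """Hypothetical stand-in for the ranking machine: ridge fit of the frame
    index on time-varying mean features; returns the weight vector u."""
    frames = np.asarray(frames, dtype=float)
    n = frames.shape[0]
    v = np.cumsum(frames, axis=0) / np.arange(1, n + 1)[:, None]
    v_c = v - v.mean(axis=0)
    t_c = np.arange(n, dtype=float) - (n - 1) / 2.0
    return np.linalg.solve(v_c.T @ v_c + reg * np.eye(v.shape[1]), v_c.T @ t_c)


def drop_frames(frames, fraction, rng):
    """Remove a random `fraction` of frames, keeping the survivors in order."""
    n = frames.shape[0]
    n_keep = n - int(round(fraction * n))
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    return frames[keep]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy video: 100 frames of 3-D features drifting along one direction.
rng = np.random.default_rng(2)
n_frames = 100
direction = np.array([1.0, -2.0, 0.5])
frames = (np.arange(1, n_frames + 1)[:, None] / n_frames) * direction
frames = frames + 0.001 * rng.standard_normal(frames.shape)

reduced = drop_frames(frames, fraction=0.2, rng=rng)   # 20% of frames removed
similarity = cosine(rank_pool(frames), rank_pool(reduced))
print(reduced.shape, similarity)
```

Sorting the kept indices matters: the representation depends on temporal order, so the surviving frames must stay in their original sequence.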

References (44)

[1] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2:499-526, 2002.
[2] S. Escalera, J. Gonzàlez, X. Baró, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, and H. Escalante. Multi-modal gesture recognition challenge 2013: Dataset and results. In ICMI, 2013.
[3] A. Gaidon, Z. Harchaoui, and C. Schmid. Actom sequence models for efficient action detection. In CVPR, 2011.
[4] A. Gaidon, Z. Harchaoui, C. Schmid, et al. Recognizing activities with cluster-trees of tracklets. In BMVC, 2012.
[5] M. Hoai and A. Zisserman. Improving human action recognition using score distribution and ranking. In ACCV, 2014.
[6] H. Izadinia and M. Shah. Recognizing complex events using large margin joint low-level event model. In ECCV, 2012.
[7] M. Jain, H. Jégou, and P. Bouthemy. Better exploiting motion for better action recognition. In CVPR, 2013.
[8] M. Jain, J. van Gemert, H. Jégou, P. Bouthemy, and C. G. M. Snoek. Action localization with tubelets from motion. In CVPR, 2014.
[9] M. Jain, J. van Gemert, and C. G. M. Snoek. What do 15,000 object categories tell us about classifying and localizing actions? In CVPR, 2015.
[10] T. Joachims. Training linear SVMs in linear time. In KDD, 2006.
[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[12] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
[13] I. Laptev. On space-time interest points. IJCV, 64:107-123, 2005.
[14] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[16] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
[17] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225-331, 2009.
[18] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, pages 89-96, 2011.
[19] M. Marszałek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
[20] A. Pasko, V. Adzhiev, A. Sourin, and V. Savchenko. Function representation in geometric modeling: concepts, implementation and applications. The Visual Computer, 11(8):429-446, 1995.
[21] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked Fisher vectors. In ECCV, 2014.
[22] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed Fisher vectors. In CVPR, 2010.
[23] T. Pfister, J. Charles, and A. Zisserman. Domain-adaptive discriminative one-shot learning of gestures. In ECCV, 2014.
[24] M. Raptis, I. Kokkinos, and S. Soatto. Discovering discriminative action parts from mid-level video representations. In CVPR, 2012.
[25] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele. A database for fine grained activity detection of cooking activities. In CVPR, 2012.
[26] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele. Script data for attribute-based recognition of composite activities. In ECCV, 2012.
[27] M. S. Ryoo and J. K. Aggarwal. Recognition of composite human activities through context-free grammar based representation. In CVPR, 2006.
[28] J. Shotton, A. W. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.
[29] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. CoRR, abs/1406.2199:1-8, 2014.
[30] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14:199-222, 2004.
[31] Y. Song, L.-P. Morency, and R. Davis. Action recognition by hierarchical sequence summarization. In CVPR, 2013.
[32] C. Sun and R. Nevatia. ACTIVE: Activity concept transitions in video event classification. In ICCV, 2013.
[33] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.
[34] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, 2010.
[35] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
[36] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. PAMI, 34:480-492, 2012.
[37] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103:60-79, 2013.
[38] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[39] Y. Wang and G. Mori. Hidden part models for human action recognition: Probabilistic versus max margin. PAMI, 33:1310-1323, 2011.
[40] D. Wu and L. Shao. Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In CVPR, 2014.
[41] J. Wu, J. Cheng, C. Zhao, and H. Lu. Fusing multi-modal features for gesture recognition. In ICMI, 2013.
[42] J. Wu, Y. Zhang, and W. Lin. Towards good practices for action video encoding. In CVPR, 2014.
[43] A. Yao, L. Van Gool, and P. Kohli. Gesture recognition portfolios for personalization. In CVPR, 2014.
[44] Y. Zhou, B. Ni, S. Yan, P. Moulin, and Q. Tian. Pipelining localized semantic features for fine-grained action recognition. In ECCV, 2014.
