Rank Pooling for Action Recognition

Abstract

We propose a function-based temporal pooling method that captures the latent structure of video sequence data, e.g., how frame-level features evolve over time in a video. We show that the parameters of a function fit to the video data can serve as a robust new video representation. As a specific example, we learn a pooling function via ranking machines. By learning to rank the frame-level features of a video in chronological order, we obtain a new representation that captures video-wide temporal dynamics, suitable for action recognition. Beyond ranking functions, we explore other parametric models that could also explain the temporal changes in videos. The proposed functional pooling methods, and rank pooling in particular, are easy to interpret and implement, fast to compute, and effective in recognizing a wide variety of actions. We evaluate our method on various benchmarks for generic action, fine-grained action, and gesture recognition. Results show that rank pooling brings an absolute improvement of 7-10% over an average pooling baseline. At the same time, rank pooling is compatible with and complementary to several appearance- and local-motion-based methods and features, such as improved trajectories and deep learning features.
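To make the idea concrete, here is a minimal rank pooling sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn's LinearSVR as the ranking machine and cumulative-mean smoothing of the frame features, and the function name rank_pool is hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVR  # linear support vector regression

def rank_pool(frames, C=1.0):
    """Pool a video into a single vector (illustrative sketch).

    frames: (T, D) array of frame-level features in chronological order.
    Returns the learned weight vector u in R^D, used as the video descriptor.
    """
    T = frames.shape[0]
    # Time-varying mean vectors: running average of features up to each frame.
    smoothed = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]
    # L2-normalize each smoothed vector.
    smoothed /= np.linalg.norm(smoothed, axis=1, keepdims=True) + 1e-8
    # Fit a linear SVR that regresses the frame index from the features;
    # its weight vector orders the frames chronologically and thus encodes
    # how the features evolve over time.
    svr = LinearSVR(C=C, fit_intercept=True)
    svr.fit(smoothed, np.arange(1, T + 1, dtype=float))
    return svr.coef_

# Example: pool 120 frames of 512-dimensional features.
u = rank_pool(np.random.rand(120, 512))
```

The resulting vector u can then be fed to any standard classifier (e.g., a linear SVM) in place of an average-pooled descriptor.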

FAQs

What does rank pooling reveal about video temporal evolution in action recognition?

Rank pooling captures video-wide temporal evolution by using the parameters of ranking functions as the feature representation, achieving state-of-the-art action classification on datasets such as Hollywood2 and HMDB51.

How does rank pooling compare to traditional temporal recognition methods?

The study finds that rank pooling, unlike traditional sequence models such as HMMs and CRFs, models temporal dynamics effectively without requiring labeled data, improving classification accuracy by up to 10%.

What evidence supports the effectiveness of time-varying mean vectors in video representation?

The research shows that time-varying mean vectors outperform independent-frame and moving-average representations by capturing video-wide temporal information, as verified in experiments on the Hollywood2 dataset.
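For illustration, a minimal sketch of the two smoothing schemes being contrasted, assuming NumPy; the function names are hypothetical and this is not the authors' code:

```python
import numpy as np

def moving_average(frames, window=5):
    """Local smoothing: each output is the mean over a sliding window,
    so information outside the window is forgotten."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda series: np.convolve(series, kernel, mode="same"), 0, frames)

def time_varying_mean(frames):
    """Video-wide smoothing: m_t = mean of all frames up to time t,
    so every output retains the history of the whole prefix."""
    return np.cumsum(frames, axis=0) / np.arange(1, len(frames) + 1)[:, None]
```

The time-varying mean keeps accumulating evidence from the start of the video, which is what lets the subsequent ranking step capture video-wide rather than merely local temporal structure.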

What factors contribute to the stability of rank pooling against dropped frames?

Rank pooling maintains high accuracy even with up to 20% of frames removed, showing remarkable stability compared to average pooling and temporal pyramids, which suffer significant performance drops.

How are features combined for optimal performance in action classification?

Combining rank pooling with local motion features and CNN-based max-pooled features led to accuracy improvements of 6.6% on HMDB51 and 7.1% on Hollywood2, illustrating rank pooling's complementary nature.
