High-level event recognition in unconstrained videos

2012, International Journal of Multimedia Information Retrieval

https://doi.org/10.1007/S13735-012-0024-2

Abstract

The goal of high-level event recognition is to automatically detect complex high-level events in a given video sequence. This is a difficult task, especially when videos are captured under unconstrained conditions by nonprofessionals. Such videos depicting complex events have limited quality control and may therefore include severe camera motion, poor lighting, heavy background clutter, and occlusion. However, due to the fast-growing popularity of such videos, especially on the Web, solutions to this problem are in high demand and have attracted great interest from researchers. In this paper, we review current technologies for complex event recognition in unconstrained videos. While the existing solutions vary, we identify common key modules and provide detailed descriptions along with some insights for each of them, including extraction and representation of low-level features across different modalities, classification strategies, and fusion techniques. Publicly available benchmark datasets, performance metrics, and related research forums are also described. Finally, we discuss promising directions for future research.
