High-level event recognition in unconstrained videos
2012, International Journal of Multimedia Information Retrieval
https://doi.org/10.1007/S13735-012-0024-2
Abstract
The goal of high-level event recognition is to automatically detect complex high-level events in a given video sequence. This is a difficult task, especially when videos are captured under unconstrained conditions by nonprofessionals. Such videos have limited quality control and may therefore include severe camera motion, poor lighting, heavy background clutter, and occlusion. However, due to the fast-growing popularity of such videos, especially on the Web, solutions to this problem are in high demand and have attracted great interest from researchers. In this paper, we review current technologies for complex event recognition in unconstrained videos. While the existing solutions vary, we identify common key modules and provide detailed descriptions along with some insights for each of them, including extraction and representation of low-level features across different modalities, classification strategies, and fusion techniques. Publicly available benchmark datasets, performance metrics, and related research forums are also described. Finally, we discuss promising directions for future research.