High-level event recognition in unconstrained videos

Subhabrata Bhattacharya

doi:10.1007/S13735-012-0024-2

2012, International Journal of Multimedia Information Retrieval

29 pages

Abstract

The goal of high-level event recognition is to automatically detect complex high-level events in a given video sequence. This is a difficult task, especially when videos are captured under unconstrained conditions by nonprofessionals. Such videos have limited quality control and may therefore include severe camera motion, poor lighting, heavy background clutter, and occlusion. However, due to the fast-growing popularity of such videos, especially on the Web, solutions to this problem are in high demand and have attracted great interest from researchers. In this paper, we review current technologies for complex event recognition in unconstrained videos. While the existing solutions vary, we identify common key modules and provide detailed descriptions along with some insights for each of them, including extraction and representation of low-level features across different modalities, classification strategies, and fusion techniques. Publicly available benchmark datasets, performance metrics, and related research forums are also described. Finally, we discuss promising directions for future research.
