A Review of Human Activity Recognition Methods
2015, Frontiers in Robotics and AI
https://doi.org/10.3389/frobt.2015.00028

Abstract
Recognizing human activities from video sequences or still images is a challenging task due to problems such as background clutter, partial occlusion, and changes in scale, viewpoint, lighting, and appearance. Many applications, including video surveillance systems, human-computer interaction, and robotics for human behavior characterization, require a multiple activity recognition system. In this work, we provide a detailed review of recent and state-of-the-art research advances in the field of human activity classification. We propose a categorization of human activity methodologies and discuss their advantages and limitations. In particular, we divide human activity classification methods into two large categories according to whether they use data from a single modality or from multiple modalities. Each of these categories is then further divided into sub-categories that reflect how the methods model human activities and what types of activities they target. Moreover, we provide a comprehensive analysis of the existing, publicly available human activity classification datasets and examine the requirements for an ideal human activity recognition dataset. Finally, we outline future research directions and present some open issues in human activity recognition.
Key takeaways
- Human activity recognition faces challenges like background clutter and occlusion, complicating accurate classification.
- The review categorizes recognition methods into unimodal and multimodal approaches, detailing their advantages and limitations.
- Addressing complex activities often requires decomposing them into simpler actions for better recognition accuracy.
- An ideal human activity recognition dataset must be diverse and reflect real-world scenarios for effective training.
- Future research should focus on generalization and robustness in varied environments, including handling occlusions and missing data.
References (288)
- Aggarwal, J. K., and Cai, Q. (1999). Human motion analysis: a review. Comput. Vis. Image Understand. 73, 428-440. doi:10.1006/cviu.1998.0744
- Aggarwal, J. K., and Ryoo, M. S. (2011). Human activity analysis: a review. ACM Comput. Surv. 43, 1-43. doi:10.1145/1922649.1922653
- Aggarwal, J. K., and Xia, L. (2014). Human activity recognition from 3D data: a review. Pattern Recognit. Lett. 48, 70-80. doi:10.1016/j.patrec.2014.04.011
- Akata, Z., Perronnin, F., Harchaoui, Z., and Schmid, C. (2013). "Label-embedding for attribute-based classification, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 819-826.
- Alahi, A., Ramanathan, V., and Fei-Fei, L. (2014). "Socially-aware large-scale crowd forecasting, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 2211-2218.
- AlZoubi, O., Fossati, D., D'Mello, S. K., and Calvo, R. A. (2013). "Affect detec- tion and classification from the non-stationary physiological data, " in Proc. International Conference on Machine Learning and Applications (Portland, OR), 240-245.
- Amer, M. R., and Todorovic, S. (2012). "Sum-product networks for modeling activities with stochastic structure, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1314-1321.
- Amin, S., Andriluka, M., Rohrbach, M., and Schiele, B. (2013). "Multi-view pictorial structures for 3D human pose estimation, " in Proc. British Machine Vision Conference (Bristol), 1-12.
- Andriluka, M., Pishchulin, L., Gehler, P. V., and Schiele, B. (2014). "2D human pose estimation: new benchmark and state of the art analysis, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 3686-3693.
- Andriluka, M., and Sigal, L. (2012). "Human context: modeling human-human interactions for monocular 3D pose estimation, " in Proc. International Confer- ence on Articulated Motion and Deformable Objects (Mallorca: Springer-Verlag), 260-272.
- Anirudh, R., Turaga, P., Su, J., and Srivastava, A. (2015). "Elastic functional coding of human actions: from vector-fields to latent variables, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3147-3155.
- Atrey, P. K., Hossain, M. A., El-Saddik, A., and Kankanhalli, M. S. (2010). Mul- timodal fusion for multimedia analysis: a survey. Multimed. Syst. 16, 345-379. doi:10.1007/s00530-010-0182-0
- Bandla, S., and Grauman, K. (2013). "Active learning of an action detector from untrimmed videos, " in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 1833-1840.
- Baxter, R. H., Robertson, N. M., and Lane, D. M. (2015). Human behaviour recog- nition in data-scarce domains. Pattern Recognit. 48, 2377-2393. doi:10.1016/j. patcog.2015.02.019
- Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., and Ilic, S. (2014). "3D pictorial structures for multiple human pose estimation, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 1669-1676.
- Bilakhia, S., Petridis, S., and Pantic, M. (2013). "Audiovisual detection of behavioural mimicry, " in Proc. 2013 Humaine Association Conference on Affec- tive Computing and Intelligent Interaction (Geneva), 123-128.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Secaucus, NJ: Springer.
- Blank, M., Gorelick, L., Shechtman, E., Irani, M., and Basri, R. (2005). "Actions as space-time shapes, " in Proc. IEEE International Conference on Computer Vision (Beijing), 1395-1402.
- Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., and Sivic, J. (2013). "Finding actors and actions in movies, " in Proc. IEEE International Conference on Computer Vision (Sydney), 2280-2287.
- Bousmalis, K., Mehu, M., and Pantic, M. (2013a). Towards the automatic detection of spontaneous agreement and disagreement based on nonverbal behaviour: a survey of related cues, databases, and tools. Image Vis. Comput. 31, 203-221. doi:10.1016/j.imavis.2012.07.003
- Bousmalis, K., Zafeiriou, S., Morency, L. P., and Pantic, M. (2013b). Infinite hidden conditional random fields for human behavior analysis. IEEE Trans. Neural Networks Learn. Syst. 24, 170-177. doi:10.1109/TNNLS.2012.2224882
- Bousmalis, K., Morency, L., and Pantic, M. (2011). "Modeling hidden dynamics of multimodal cues for spontaneous agreement and disagreement recognition, " in Proc. IEEE International Conference on Automatic Face and Gesture Recognition (Santa Barbara, CA), 746-752.
- Burenius, M., Sullivan, J., and Carlsson, S. (2013). "3D pictorial structures for multi- ple view articulated pose estimation, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 3618-3625.
- Burgos-Artizzu, X. P., Dollár, P., Lin, D., Anderson, D. J., and Perona, P. (2012). "Social behavior recognition in continuous video, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1322-1329.
- Candamo, J., Shreve, M., Goldgof, D. B., Sapper, D. B., and Kasturi, R. (2010). Understanding transit scenes: a survey on human behavior-recognition algo- rithms. IEEE Trans. Intell. Transp. Syst. 11, 206-224. doi:10.1109/TITS.2009. 2030963
- Castellano, G., Villalba, S. D., and Camurri, A. (2007). "Recognising human emo- tions from body movement and gesture dynamics, " in Proc. Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science, Vol. 4738 (Lisbon), 71-82.
- Chakraborty, B., Holte, M. B., Moeslund, T. B., and Gonzàlez, J. (2012). Selective spatio-temporal interest points. Comput. Vis. Image Understand. 116, 396-410. doi:10.1016/j.cviu.2011.09.010
- Chaquet, J. M., Carmona, E. J., and Fernández-Caballero, A. (2013). A survey of video datasets for human action and activity recognition. Comput. Vis. Image Understand. 117, 633-659. doi:10.1016/j.cviu.2013.01.013
- Chaudhry, R., Ravichandran, A., Hager, G. D., and Vidal, R. (2009). "Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1932-1939.
- Chen, C. Y., and Grauman, K. (2012). "Efficient activity detection with max- subgraph search, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1274-1281.
- Chen, H., Li, J., Zhang, F., Li, Y., and Wang, H. (2015). "3D model-based continuous emotion recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1836-1845.
- Chen, L., Duan, L., and Xu, D. (2013a). "Event recognition in videos by learning from heterogeneous web sources, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 2666-2673.
- Chen, L., Wei, H., and Ferryman, J. (2013b). A survey of human motion analysis using depth imagery. Pattern Recognit. Lett. 34, 1995-2006. doi:10.1016/j.patrec. 2013.02.006
- Chen, W., Xiong, C., Xu, R., and Corso, J. J. (2014). "Actionness ranking with lattice conditional ordinal random fields, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 748-755.
- Cherian, A., Mairal, J., Alahari, K., and Schmid, C. (2014). "Mixing body-part sequences for human pose estimation, " in Proc. IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition (Columbus, OH), 2361-2368.
- Choi, W., Shahid, K., and Savarese, S. (2011). "Learning context for collective activity recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3273-3280.
- Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2011). "Flexible, high performance convolutional neural networks for image classification, " in Proc. International Joint Conference on Artificial Intelligence (Barcelona), 1237-1242.
- Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012). "Multi-column deep neural networks for image classification, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 3642-3649.
- Cui, X., Liu, Q., Gao, M., and Metaxas, D. N. (2011). "Abnormal detection using interaction energy potentials, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3161-3167.
- Dalal, N., and Triggs, B. (2005). "Histograms of oriented gradients for human detection, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 886-893.
- Dalal, N., Triggs, B., and Schmid, C. (2006). "Human detection using oriented histograms of flow and appearance, " in Proc. European Conference on Computer Vision (Graz), 428-441.
- Dollár, P., Rabaud, V., Cottrell, G., and Belongie, S. (2005). "Behavior recognition via sparse spatio-temporal features, " in Proc. International Conference on Com- puter Communications and Networks (Beijing), 65-72.
- Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., et al. (2015). "Long-term recurrent convolutional networks for visual recognition and description, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2625-2634.
- Du, Y., Wang, W., and Wang, L. (2015). "Hierarchical recurrent neural network for skeleton based action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1110-1118.
- Efros, A. A., Berg, A. C., Mori, G., and Malik, J. (2003). "Recognizing action at a distance, " in Proc. IEEE International Conference on Computer Vision, Vol. 2 (Nice), 726-733.
- Ekman, P., Friesen, W. V., and Hager, J. C. (2002). Facial Action Coding System (FACS): Manual. Salt Lake City: A Human Face.
- Elgammal, A., Duraiswami, R., Harwood, D., and Davis, L. S. (2002). Background and foreground modeling using nonparametric kernel density for visual surveil- lance. Proc. IEEE 90, 1151-1163. doi:10.1109/JPROC.2002.801448
- Escalera, S., Baró, X., Vitrià, J., Radeva, P., and Raducanu, B. (2012). Social network extraction and analysis based on multimodal dyadic interaction. Sensors 12, 1702-1719. doi:10.3390/s120201702
- Evangelopoulos, G., Zlatintsi, A., Potamianos, A., Maragos, P., Rapantzikos, K., Skoumas, G., et al. (2013). Multimodal saliency and fusion for movie summa- rization based on aural, visual, and textual attention. IEEE Trans. Multimedia 15, 1553-1568. doi:10.1109/TMM.2013.2267205
- Evgeniou, T., and Pontil, M. (2004). "Regularized multi-task learning, " in Proc. ACM International Conference on Knowledge Discovery and Data Mining (Seattle, WA), 109-117.
- Eweiwi, A., Cheema, M. S., Bauckhage, C., and Gall, J. (2014). "Efficient pose-based action recognition, " in Proc. Asian Conference on Computer Vision (Singapore), 428-443.
- Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. A. (2009). "Describing objects by their attributes, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1778-1785.
- Fathi, A., Hodgins, J. K., and Rehg, J. M. (2012). "Social interactions: a first-person perspective, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1226-1233.
- Fathi, A., and Mori, G. (2008). "Action recognition by learning mid-level motion features, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Fergie, M., and Galata, A. (2013). Mixtures of Gaussian process models for human pose estimation. Image Vis. Comput. 31, 949-957. doi:10.1016/j.imavis.2013.09. 007
- Fernando, B., Gavves, E., Oramas, J. M., Ghodrati, A., and Tuytelaars, T. (2015). "Modeling video evolution for action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 5378-5387.
- Ferrari, V., Marin-Jimenez, M., and Zisserman, A. (2009). "Pose search: retrieving people using their pose, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1-8.
- Fisher, R. B. (2004). PETS04 Surveillance Ground Truth Dataset. Available at: http: //www-prima.inrialpes.fr/PETS04/
- Fisher, R. B. (2007a). Behave: Computer-Assisted Prescreening of Video Streams for Unusual Activities. Available at: http://homepages.inf.ed.ac.uk/rbf/BEHAVE/
- Fisher, R. B. (2007b). PETS07 Benchmark Dataset. Available at: http://www.cvg. reading.ac.uk/PETS2007/data.html
- Fogel, I., and Sagi, D. (1989). Gabor filters as texture discriminator. Biol. Cybern. 61, 103-113. doi:10.1007/BF00204594
- Fothergill, S., Mentis, H. M., Kohli, P., and Nowozin, S. (2012). "Instructing people for training gestural interactive systems, " in Proc. Conference on Human Factors in Computing Systems (Austin, TX), 1737-1746.
- Fouhey, D. F., Delaitre, V., Gupta, A., Efros, A. A., Laptev, I., and Sivic, J. (2014). People watching: human actions as a cue for single view geometry. Int. J. Comput. Vis. 110, 259-274. doi:10.1007/s11263-014-0710-z
- Fu, Y., Hospedales, T. M., Xiang, T., and Gong, S. (2012). "Attribute learning for understanding unstructured social activity, " in Proc. European Conference on Computer Vision, Lecture Notes in Computer Science, Vol. 7575 (Florence), 530-543.
- Fu, Y., Hospedales, T. M., Xiang, T., and Gong, S. (2014). Learning multimodal latent attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36, 303-316. doi:10. 1109/TPAMI.2013.128
- Gaidon, A., Harchaoui, Z., and Schmid, C. (2014). Activity representation with motion hierarchies. Int. J. Comput. Vis. 107, 219-238. doi:10.1007/s11263-013- 0677-1
- Gan, C., Wang, N., Yang, Y., Yeung, D. Y., and Hauptmann, A. G. (2015). "DevNet: a deep event network for multimedia event detection and evidence recounting, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2568-2577.
- Gao, Z., Zhang, H., Xu, G. P., and Xue, Y. B. (2015). Multi-perspective and multi- modality joint representation and recognition model for 3D action recognition. Neurocomputing 151, 554-564. doi:10.1016/j.neucom.2014.06.085
- Gavrila, D. M. (1999). The visual analysis of human movement: a survey. Comput. Vis. Image Understand. 73, 82-98. doi:10.1006/cviu.1998.0716
- Gorelick, L., Blank, M., Shechtman, E., Irani, M., and Basri, R. (2007). Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell. 29, 2247-2253. doi:10.1109/TPAMI.2007.70711
- Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R. J., Darrell, T., et al. (2013). "Youtube2text: recognizing and describing arbi- trary activities using semantic hierarchies and zero-shot recognition, " in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 2712-2719.
- Guha, T., and Ward, R. K. (2012). Learning sparse representations for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1576-1588. doi:10.1109/ TPAMI.2011.253
- Guo, G., and Lai, A. (2014). A survey on still image based human action recognition. Pattern Recognit. 47, 3343-3361. doi:10.1016/j.patcog.2014.04.018
- Gupta, A., and Davis, L. S. (2007). "Objects in action: an approach for combining action understanding and object perception, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Minneapolis, MN), 1-8.
- Gupta, A., Kembhavi, A., and Davis, L. S. (2009). Observing human-object inter- actions: using spatial and functional compatibility for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31, 1775-1789. doi:10.1109/TPAMI.2009.83
- Haralick, R. M., and Watson, L. (1981). A facet model for image data. Comput. Graph. Image Process. 15, 113-129. doi:10.1016/0146-664X(81)90073-3
- Hardoon, D. R., Szedmak, S. R., and Shawe-Taylor, J. R. (2004). Canonical correla- tion analysis: an overview with application to learning methods. Neural Comput. 16, 2639-2664. doi:10.1162/0899766042321814
- Healey, J. (2011). "Recording affect in the field: towards methods and metrics for improving ground truth labels, " in Proc. International Conference on Affective Computing and Intelligent Interaction (Memphis, TN), 107-116.
- Heilbron, F. C., Escorcia, V., Ghanem, B., and Niebles, J. C. (2015). "ActivityNet: a large-scale video benchmark for human activity understanding, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 961-970.
- Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527-1554. doi:10.1162/neco.2006.18.7.1527
- Ho, T. K. (1995). "Random decision forests, " in Proc. International Conference on Document Analysis and Recognition, Vol. 1 (Washington, DC: IEEE Computer Society), 278-282.
- Hoai, M., Lan, Z. Z., and Torre, F. (2011). "Joint segmentation and classifi- cation of human actions in video, " in Proc. IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3265-3272.
- Hoai, M., and Zisserman, A. (2014). "Talking heads: detecting humans and rec- ognizing their interactions, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 875-882.
- Holte, M. B., Chakraborty, B., Gonzàlez, J., and Moeslund, T. B. (2012a). A local 3- D motion descriptor for multi-view human action recognition from 4-D spatio- temporal interest points. IEEE J. Sel. Top. Signal Process. 6, 553-565. doi:10.1109/ JSTSP.2012.2193556
- Holte, M. B., Tran, C., Trivedi, M. M., and Moeslund, T. B. (2012b). Human pose estimation and activity recognition from multi-view videos: comparative explorations of recent developments. IEEE J. Sel. Top. Signal Process. 6, 538-552. doi:10.1109/JSTSP.2012.2196975
- Huang, Z. F., Yang, W., Wang, Y., and Mori, G. (2011). "Latent boosting for action recognition, " in Proc. British Machine Vision Conference (Dundee), 1-11.
- Hussain, M. S., Calvo, R. A., and Pour, P. A. (2011). "Hybrid fusion approach for detecting affects from multichannel physiology, " in Proc. International Confer- ence on Affective Computing and Intelligent Interaction, Lecture Notes in Com- puter Science, Vol. 6974 (Memphis, TN), 568-577.
- Ikizler, N., and Duygulu, P. (2007). "Human action recognition using distribution of oriented rectangular patches, " in Proc. Conference on Human Motion: Under- standing, Modeling, Capture and Animation (Rio de Janeiro), 271-284.
- Ikizler-Cinbis, N., and Sclaroff, S. (2010). "Object, scene and actions: combining multiple features for human action recognition, " in Proc. European Conference on Computer Vision, Lecture Notes in Computer Science, Vol. 6311 (Hersonissos, Heraclion, Crete, greece: Springer), 494-507.
- Iosifidis, A., Tefas, A., and Pitas, I. (2012a). Activity-based person identification using fuzzy representation and discriminant learning. IEEE Trans. Inform. Forensics Secur. 7, 530-542. doi:10.1109/TIFS.2011.2175921
- Iosifidis, A., Tefas, A., and Pitas, I. (2012b). View-invariant action recognition based on artificial neural networks. IEEE Trans. Neural Networks Learn. Syst. 23, 412-424. doi:10.1109/TNNLS.2011.2181865
- Jaimes, A., and Sebe, N. (2007). "Multimodal human-computer interaction: a sur- vey, " in Computer Vision and Image Understanding, Vol. 108 (Special Issue on Vision for Human-Computer Interaction), 116-134.
- Jain, M., Gemert, J., Jégou, H., Bouthemy, P., and Snoek, C. G. M. (2014). "Action localization with tubelets from motion, " in Proc. IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition (Columbus, OH), 740-747.
- Jain, M., Jegou, H., and Bouthemy, P. (2013). "Better exploiting motion for better action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 2555-2562.
- Jainy, M., Gemerty, J. C., and Snoek, C. G. M. (2015). "What do 15,000 object cate- gories tell us about classifying and localizing actions?, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 46-55.
- Jayaraman, D., and Grauman, K. (2014). "Zero-shot recognition with unreliable attributes, " in Proc. Annual Conference on Neural Information Processing Systems (Montreal, QC), 3464-3472.
- Jhuang, H., Gall, J., Zuffi, S., Schmid, C., and Black, M. J. (2013). "Towards understanding action recognition, " in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 3192-3199.
- Jhuang, H., Serre, T., Wolf, L., and Poggio, T. (2007). "A biologically inspired system for action recognition, " in Proc. IEEE International Conference on Computer Vision (Rio de Janeiro), 1-8.
- Jiang, B., Martínez, B., Valstar, M. F., and Pantic, M. (2014). "Decision level fusion of domain specific regions for facial action recognition, " in Proc. International Conference on Pattern Recognition (Stockholm), 1776-1781.
- Jiang, Y. G., Ye, G., Chang, S. F., Ellis, D. P. W., and Loui, A. C. (2011). "Con- sumer video understanding: a benchmark database and an evaluation of human and machine performance, " in Proc. International Conference on Multimedia Retrieval (Trento), 29-36.
- Jiang, Z., Lin, Z., and Davis, L. S. (2013). A unified tree-based framework for joint action localization, recognition and segmentation. Comput. Vis. Image Understand. 117, 1345-1355. doi:10.1016/j.cviu.2012.09.008
- Jung, H. Y., Lee, S., Heo, Y. S., and Yun, I. D. (2015). "Random treewalk toward instantaneous 3D human pose estimation, " in Proc. IEEE Computer Society Con- ference on Computer Vision and Pattern Recognition (Boston, MA), 2467-2474.
- Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). "Large-scale video classification with convolutional neural networks, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 1725-1732.
- Khamis, S., Morariu, V. I., and Davis, L. S. (2012). "A flow model for joint action recognition and identity maintenance, " in Proc. IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition (Providence, RI), 1218-1225.
- Kim, Y., Lee, H., and Provost, E. M. (2013). "Deep learning for robust feature generation in audiovisual emotion recognition, " in Proc. IEEE International Con- ference on Acoustics, Speech and Signal Processing (Vancouver, BC), 3687-3691.
- Klami, A., and Kaski, S. (2008). Probabilistic approach to detecting dependencies between data sets. Neurocomputing 72, 39-46. doi:10.1016/j.neucom.2007.12. 044
- Kläser, A., Marszałek, M., and Schmid, C. (2008). "A spatio-temporal descriptor based on 3D-gradients, " in Proc. British Machine Vision Conference (Leeds: University of Leeds), 995-1004.
- Kohonen, T., Schroeder, M. R., and Huang, T. S. (eds) (2001). Self-Organizing Maps, Third Edn. New York, NY.: Springer-Verlag Inc.
- Kong, Y., and Fu, Y. (2014). "Modeling supporting regions for close human inter- action recognition, " in Proc. European Conference on Computer Vision (Zurich), 29-44.
- Kong, Y., Jia, Y., and Fu, Y. (2014a). Interactive phrases: semantic descriptions for human interaction recognition. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1775-1788. doi:10.1109/TPAMI.2014.2303090
- Kong, Y., Kit, D., and Fu, Y. (2014b). "A discriminative model with multiple tem- poral scales for action prediction, " in Proc. European Conference on Computer Vision (Zurich), 596-611.
- Kovashka, A., and Grauman, K. (2010). "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (San Francisco, CA), 2046-2053.
- Kuehne, H., Arslan, A., and Serre, T. (2014). "The language of actions: recov- ering the syntax and semantics of goal-directed human activities, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 780-787.
- Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011). "HMDB: a large video database for human motion recognition, " in Proc. IEEE International Conference on Computer Vision (Barcelona), 2556-2563.
- Kulkarni, K., Evangelidis, G., Cech, J., and Horaud, R. (2015). Continuous action recognition based on sequence alignment. Int. J. Comput. Vis. 112, 90-114. doi:10.1007/s11263-014-0758-9
- Kulkarni, P., Sharma, G., Zepeda, J., and Chevallier, L. (2014). "Transfer learning via attributes for improved on-the-fly classification, " in Proc. IEEE Winter Con- ference on Applications of Computer Vision (Steamboat Springs, CO), 220-226.
- Kviatkovsky, I., Rivlin, E., and Shimshoni, I. (2014). Online action recognition using covariance of shape and motion. Comput. Vis. Image Understand. 129, 15-26. doi:10.1016/j.cviu.2014.08.001
- Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). "Conditional random fields: probabilistic models for segmenting and labeling sequence data, " in Proc. International Conference on Machine Learning (Williamstown, MA: Williams College), 282-289.
- Lampert, C. H., Nickisch, H., and Harmeling, S. (2009). "Learning to detect unseen object classes by between-class attribute transfer, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 951-958.
- Lan, T., Chen, T. C., and Savarese, S. (2014). "A hierarchical representation for future action prediction, " in Proc. European Conference on Computer Vision (Zurich), 689-704.
- Lan, T., Sigal, L., and Mori, G. (2012a). "Social roles in hierarchical models for human activity recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1354-1361.
- Lan, T., Wang, Y., Yang, W., Robinovitch, S. N., and Mori, G. (2012b). Discrim- inative latent models for recognizing contextual group activities. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1549-1562. doi:10.1109/TPAMI.2011.228
- Lan, T., Wang, Y., and Mori, G. (2011). "Discriminative figure-centric models for joint action localization and recognition, " in Proc. IEEE International Conference on Computer Vision (Barcelona), 2003-2010.
- Laptev, I. (2005). On space-time interest points. Int. J. Comput. Vis. 64, 107-123. doi:10.1007/s11263-005-1838-7
- Laptev, I., Marszałek, M., Schmid, C., and Rozenfeld, B. (2008). "Learning realistic human actions from movies, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. (2011). "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3361-3368.
- Li, B., Ayazoglu, M., Mao, T., Camps, O. I., and Sznaier, M. (2011). "Activity recognition using dynamic subspace angles, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3193-3200.
- Li, B., Camps, O. I., and Sznaier, M. (2012). "Cross-view activity recognition using hankelets, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1362-1369.
- Li, R., and Zickler, T. (2012). "Discriminative virtual views for cross-view action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 2855-2862.
- Lichtenauer, J., Valstar, J. S. M., and Pantic, M. (2011). Cost-effective solution to synchronised audio-visual data capture using multiple sensors. Image Vis. Comput. 29, 666-680. doi:10.1016/j.imavis.2011.07.004
- Lillo, I., Soto, A., and Niebles, J. C. (2014). "Discriminative hierarchical modeling of spatio-temporally composable human activities, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 812-819.
- Lin, Z., Jiang, Z., and Davis, L. S. (2009). "Recognizing actions by shape-motion prototype trees, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 444-451.
- Liu, J., Kuipers, B., and Savarese, S. (2011a). "Recognizing human actions by attributes, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3337-3344.
- Liu, N., Dellandréa, E., Tellez, B., and Chen, L. (2011b). "Associating textual features with visual ones to improve affective image classification, " in Proc. International Conference on Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science, Vol. 6974 (Memphis, TN), 195-204.
- Liu, J., Luo, J., and Shah, M. (2009). "Recognizing realistic actions from videos in the wild, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1-8.
- Liu, J., Yan, J., Tong, M., and Liu, Y. (2010). "A Bayesian framework for 3D human motion tracking from monocular image, " in IEEE International Conference on Acoustics, Speech and Signal Processing (Dallas, TX: IEEE), 1398-1401.
- Livne, M., Sigal, L., Troje, N. F., and Fleet, D. J. (2012). Human attributes from 3D pose tracking. Comput. Vis. Image Understanding 116, 648-660. doi:10.1016/j. cviu.2012.01.003
- Lu, J., Xu, R., and Corso, J. J. (2015). "Human action segmentation with hierar- chical supervoxel consistency, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3762-3771.
- Lu, W. L., Ting, J. A., Murphy, K. P., and Little, J. J. (2011). "Identifying players in broadcast sports videos using conditional random fields," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3249-3256.
- Ma, S., Sigal, L., and Sclaroff, S. (2015). "Space-time tree ensemble for action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 5024-5032.
- Maji, S., Bourdev, L. D., and Malik, J. (2011). "Action recognition from a distributed representation of pose and appearance," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3177-3184.
- Marín-Jiménez, M. J., Muñoz-Salinas, R., Yeguas-Bolivar, E., and de la Blanca, N. P. (2014). Human interaction categorization by using audio-visual cues. Mach. Vis. Appl. 25, 71-84. doi:10.1007/s00138-013-0521-1
- Marszałek, M., Laptev, I., and Schmid, C. (2009). "Actions in context," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 2929-2936.
- Martinez, H. P., Bengio, Y., and Yannakakis, G. N. (2013). Learning deep physiological models of affect. IEEE Comput. Intell. Mag. 8, 20-33. doi:10.1109/MCI.2013.2247823
- Martinez, H. P., Yannakakis, G. N., and Hallam, J. (2014). Don't classify ratings of affect; rank them! IEEE Trans. Affective Comput. 5, 314-326. doi:10.1109/TAFFC.2014.2352268
- Matikainen, P., Hebert, M., and Sukthankar, R. (2009). "Trajectons: action recognition through the motion analysis of tracked features," in Workshop on Video-Oriented Object and Event Classification, in Conjunction with ICCV (Kyoto: IEEE), 514-521.
- Messing, R., Pal, C. J., and Kautz, H. A. (2009). "Activity recognition using the velocity histories of tracked keypoints," in Proc. IEEE International Conference on Computer Vision (Kyoto), 104-111.
- Metallinou, A., Katsamanis, A., and Narayanan, S. (2013). Tracking continuous emotional trends of participants during affective dyadic interactions using body language and speech information. Image Vis. Comput. 31, 137-152. doi:10.1016/j.imavis.2012.08.018
- Metallinou, A., Lee, C. C., Busso, C., Carnicke, S. M., and Narayanan, S. (2010). "The USC creative IT database: a multimodal database of theatrical improvisation," in Proc. Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (Malta: Springer), 1-4.
- Metallinou, A., Lee, S., and Narayanan, S. (2008). "Audio-visual emotion recognition using Gaussian mixture models for face and voice," in Proc. IEEE International Symposium on Multimedia (Berkeley, CA), 250-257.
- Metallinou, A., and Narayanan, S. (2013). "Annotation and processing of continuous emotional attributes: challenges and opportunities," in Proc. IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (Shanghai), 1-8.
- Metallinou, A., Wollmer, M., Katsamanis, A., Eyben, F., Schuller, B., and Narayanan, S. (2012). Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Trans. Affective Comput. 3, 184-198. doi:10.1109/T-AFFC.2011.40
- Mikolajczyk, K., and Uemura, H. (2008). "Action recognition with motion-appearance vocabulary forest," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Moeslund, T. B., Hilton, A., and Krüger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Understand. 104, 90-126. doi:10.1016/j.cviu.2006.08.002
- Morariu, V. I., and Davis, L. S. (2011). "Multi-agent event recognition in structured scenarios," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3289-3296.
- Morris, B. T., and Trivedi, M. M. (2011). Trajectory learning for activity understanding: unsupervised, multilevel, and long-term adaptive approach. IEEE Trans. Pattern Anal. Mach. Intell. 33, 2287-2301. doi:10.1109/TPAMI.2011.64
- Moutzouris, A., del Rincon, J. M., Nebel, J. C., and Makris, D. (2015). Efficient tracking of human poses using a manifold hierarchy. Comput. Vis. Image Understand. 132, 75-86. doi:10.1016/j.cviu.2014.10.005
- Mumtaz, A., Zhang, W., and Chan, A. B. (2014). "Joint motion segmentation and background estimation in dynamic scenes," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 368-375.
- Murray, R. M., Li, Z., and Sastry, S. S. (1994). A Mathematical Introduction to Robotic Manipulation, First Edn. Boca Raton, FL: CRC Press, Inc.
- Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. (2011). "Multimodal deep learning," in Proc. International Conference on Machine Learning (Bellevue, WA), 689-696.
- Ni, B., Moulin, P., Yang, X., and Yan, S. (2015). "Motion part regularization: improving action recognition via trajectory group selection," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3698-3706.
- Ni, B., Paramathayalan, V. R., and Moulin, P. (2014). "Multiple granularity analysis for fine-grained action detection," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 756-763.
- Nicolaou, M. A., Gunes, H., and Pantic, M. (2011). Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Trans. Affective Comput. 2, 92-105. doi:10.1109/T-AFFC.2011.9
- Nicolaou, M. A., Pavlovic, V., and Pantic, M. (2014). Dynamic probabilistic CCA for analysis of affective behavior and fusion of continuous annotations. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1299-1311. doi:10.1109/TPAMI.2014.16
- Nie, B. X., Xiong, C., and Zhu, S. C. (2015). "Joint action recognition and pose estimation from video," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1293-1301.
- Niebles, J. C., Wang, H., and Fei-Fei, L. (2008). Unsupervised learning of human action categories using spatial-temporal words. Int. J. Comput. Vis. 79, 299-318. doi:10.1007/s11263-007-0122-4
- Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C., Lee, J. T., et al. (2011). "A large-scale benchmark dataset for event recognition in surveillance video," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3153-3160.
- Oikonomopoulos, A., Pantic, M., and Patras, I. (2009). Sparse B-spline polynomial descriptors for human activity recognition. Image Vis. Comput. 27, 1814-1825. doi:10.1016/j.imavis.2009.05.010
- Oliver, N. M., Rosario, B., and Pentland, A. P. (2000). A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 22, 831-843. doi:10.1109/34.868684
- Ouyang, W., Chu, X., and Wang, X. (2014). "Multi-source deep learning for human pose estimation," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 2337-2344.
- Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). "Zero-shot learning with semantic output codes," in Proc. Annual Conference on Neural Information Processing Systems (Vancouver, BC), 1410-1418.
- Pantic, M., Pentland, A., Nijholt, A., and Huang, T. (2006). "Human computing and machine understanding of human behavior: a survey," in Proc. International Conference on Multimodal Interfaces (New York, NY), 239-248.
- Pantic, M., and Rothkrantz, L. (2003). "Towards an affect-sensitive multimodal human-computer interaction," in Proc. IEEE, Special Issue on Multimodal Human-Computer Interaction, Invited Paper, Vol. 91 (IEEE), 1370-1390.
- Park, H. S., and Shi, J. (2015). "Social saliency prediction," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 4777-4785.
- Patron-Perez, A., Marszalek, M., Reid, I., and Zisserman, A. (2012). Structured learning of human interactions in TV shows. IEEE Trans. Pattern Anal. Mach. Intell. 34, 2441-2453. doi:10.1109/TPAMI.2012.24
- Perez, P., Vermaak, J., and Blake, A. (2004). Data fusion for visual tracking with particles. Proc. IEEE 92, 495-513. doi:10.1109/JPROC.2003.823147
- Perronnin, F., and Dance, C. R. (2007). "Fisher kernels on visual vocabularies for image categorization," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Minneapolis, MN), 1-8.
- Picard, R. W. (1997). Affective Computing. Cambridge, MA: MIT Press.
- Pirsiavash, H., and Ramanan, D. (2012). "Detecting activities of daily living in first-person camera views," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 2847-2854.
- Pirsiavash, H., and Ramanan, D. (2014). "Parsing videos of actions with segmental grammars," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 612-619.
- Pishchulin, L., Andriluka, M., Gehler, P. V., and Schiele, B. (2013). "Strong appearance and expressive spatial models for human pose estimation," in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 3487-3494.
- Poppe, R. (2010). A survey on vision-based human action recognition. Image Vis. Comput. 28, 976-990. doi:10.1016/j.imavis.2009.11.014
- Prince, S. J. D. (2012). Computer Vision: Models, Learning, and Inference. New York, NY: Cambridge University Press.
- Quattoni, A., Wang, S., Morency, L. P., Collins, M., and Darrell, T. (2007). Hidden conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1848-1852. doi:10.1109/TPAMI.2007.1124
- Rahmani, H., Mahmood, A., Huynh, D. Q., and Mian, A. S. (2014). "Real time action recognition using histograms of depth gradients and random decision forests," in Proc. IEEE Winter Conference on Applications of Computer Vision (Steamboat Springs, CO), 626-633.
- Rahmani, H., and Mian, A. (2015). "Learning a non-linear knowledge transfer model for cross-view action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2458-2466.
- Ramanathan, V., Li, C., Deng, J., Han, W., Li, Z., Gu, K., et al. (2015). "Learning semantic relationships for better action retrieval in images," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1100-1109.
- Ramanathan, V., Liang, P., and Fei-Fei, L. (2013). "Video event understanding using natural language descriptions," in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 905-912.
- Raptis, M., Kokkinos, I., and Soatto, S. (2012). "Discovering discriminative action parts from mid-level video representations," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1242-1249.
- Rawlinson, G. (2007). The significance of letter position in word recognition. IEEE Aerosp. Electron. Syst. Mag. 22, 26-27. doi:10.1109/MAES.2007.327521
- Reddy, K. K., and Shah, M. (2013). Recognizing 50 human action categories of web videos. Mach. Vis. Appl. 24, 971-981. doi:10.1007/s00138-012-0450-4
- Robertson, N., and Reid, I. (2006). A general method for human activity recognition in video. Comput. Vis. Image Understand. 104, 232-248. doi:10.1016/j.cviu.2006.07.006
- Rodriguez, M. D., Ahmed, J., and Shah, M. (2008). "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Rodríguez, N. D., Cuéllar, M. P., Lilius, J., and Calvo-Flores, M. D. (2014). A survey on ontologies for human behavior recognition. ACM Comput. Surv. 46, 1-33. doi:10.1145/2523819
- Rohrbach, M., Amin, S., Andriluka, M., and Schiele, B. (2012). "A database for fine grained activity detection of cooking activities," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1194-1201.
- Roshtkhari, M. J., and Levine, M. D. (2013). Human activity recognition in videos using a single example. Image Vis. Comput. 31, 864-876. doi:10.1016/j.imavis.2013.08.005
- Rudovic, O., Petridis, S., and Pantic, M. (2013). "Bimodal log-linear regression for fusion of audio and visual features," in Proc. ACM Multimedia Conference (Barcelona), 789-792.
- Sadanand, S., and Corso, J. J. (2012). "Action bank: a high-level representation of activity in video," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1234-1241.
- Salakhutdinov, R., Torralba, A., and Tenenbaum, J. B. (2011). "Learning to share visual appearance for multiclass object detection," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 1481-1488.
- Samanta, S., and Chanda, B. (2014). Space-time facet model for human activity classification. IEEE Trans. Multimedia 16, 1525-1535. doi:10.1109/TMM.2014.2326734
- Sanchez-Riera, J., Cech, J., and Horaud, R. (2012). "Action recognition robust to background clutter by using stereo vision," in Proc. European Conference on Computer Vision (Florence), 332-341.
- Sapienza, M., Cuzzolin, F., and Torr, P. H. S. (2014). Learning discriminative space-time action parts from weakly labelled videos. Int. J. Comput. Vis. 110, 30-47. doi:10.1007/s11263-013-0662-8
- Sargin, M. E., Yemez, Y., Erzin, E., and Tekalp, A. M. (2007). Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Trans. Multimedia 9, 1396-1403. doi:10.1109/TMM.2007.906583
- Satkin, S., and Hebert, M. (2010). "Modeling the temporal extent of actions," in Proc. European Conference on Computer Vision (Heraklion), 536-548.
- Schindler, K., and Gool, L. V. (2008). "Action snippets: how many frames does human action recognition require?," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Schuldt, C., Laptev, I., and Caputo, B. (2004). "Recognizing human actions: a local SVM approach," in Proc. International Conference on Pattern Recognition (Cambridge), 32-36.
- Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., and Pantic, M. (2011). "AVEC 2011 - the first international audio/visual emotion challenge," in Proc. International Audio/Visual Emotion Challenge and Workshop, Lecture Notes in Computer Science, Vol. 6975 (Memphis, TN), 415-424.
- Sedai, S., Bennamoun, M., and Huynh, D. Q. (2013a). Discriminative fusion of shape and appearance features for human pose estimation. Pattern Recognit. 46, 3223-3237. doi:10.1016/j.patcog.2013.05.019
- Sedai, S., Bennamoun, M., and Huynh, D. Q. (2013b). A Gaussian process guided particle filter for tracking 3D human pose in video. IEEE Trans. Image Process. 22, 4286-4300. doi:10.1109/TIP.2013.2271850
- Seo, H. J., and Milanfar, P. (2011). Action recognition from one example. IEEE Trans. Pattern Anal. Mach. Intell. 33, 867-882. doi:10.1109/TPAMI.2010.156
- Shabani, A. H., Clausi, D., and Zelek, J. S. (2011). "Improved spatio-temporal salient feature detection for action recognition," in Proc. British Machine Vision Conference (Dundee), 1-12.
- Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton, NJ: Princeton University Press.
- Shao, J., Kang, K., Loy, C. C., and Wang, X. (2015). "Deeply learned attributes for crowded scene understanding," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 4657-4666.
- Shivappa, S., Trivedi, M. M., and Rao, B. D. (2010). Audiovisual information fusion in human-computer interfaces and intelligent environments: a survey. Proc. IEEE 98, 1692-1715. doi:10.1109/JPROC.2010.2057231
- Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., et al. (2011). "Real-time human pose recognition in parts from single depth images," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 1297-1304.
- Shu, T., Xie, D., Rothrock, B., Todorovic, S., and Zhu, S. C. (2015). "Joint inference of groups, events and human roles in aerial videos," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 4576-4584.
- Siddiquie, B., Khan, S. M., Divakaran, A., and Sawhney, H. S. (2013). "Affect analysis in natural human interaction using joint hidden conditional random fields," in Proc. IEEE International Conference on Multimedia and Expo (San Jose, CA), 1-6.
- Sigal, L., Isard, M., Haussecker, H. W., and Black, M. J. (2012a). Loose-limbed people: estimating 3D human pose and motion using non-parametric belief propagation. Int. J. Comput. Vis. 98, 15-48. doi:10.1007/s11263-011-0493-4
- Sigal, L., Isard, M., Haussecker, H., and Black, M. J. (2012b). Loose-limbed people: estimating 3D human pose and motion using non-parametric belief propagation. Int. J. Comput. Vis. 98, 15-48. doi:10.1007/s11263-011-0493-4
- Singh, S., Velastin, S. A., and Ragheb, H. (2010). "MuHAVi: a multicamera human action video dataset for the evaluation of action recognition methods," in Proc. IEEE International Conference on Advanced Video and Signal Based Surveillance (Boston, MA), 48-55.
- Singh, V. K., and Nevatia, R. (2011). "Action recognition in cluttered dynamic scenes using pose-specific part models," in Proc. IEEE International Conference on Computer Vision (Barcelona), 113-120.
- Smola, A. J., and Schölkopf, B. (2004). A tutorial on support vector regression. Stat. Comput. 14, 199-222. doi:10.1023/B:STCO.0000035301.49549.88
- Snoek, C. G. M., Worring, M., and Smeulders, A. W. M. (2005). "Early versus late fusion in semantic video analysis," in Proc. Annual ACM International Conference on Multimedia (Singapore), 399-402.
- Soleymani, M., Pantic, M., and Pun, T. (2012). Multimodal emotion recognition in response to videos. IEEE Trans. Affective Comput. 3, 211-223. doi:10.1109/T-AFFC.2011.37
- Song, Y., Morency, L. P., and Davis, R. (2012a). "Multimodal human behavior analysis: learning correlation and interaction across modalities," in Proc. ACM International Conference on Multimodal Interaction (Santa Monica, CA), 27-30.
- Song, Y., Morency, L. P., and Davis, R. (2012b). "Multi-view latent variable discriminative models for action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 2120-2127.
- Song, Y., Morency, L. P., and Davis, R. (2013). "Action recognition by hierarchical sequence summarization," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 3562-3569.
- Soomro, K., Zamir, A. R., and Shah, M. (2012). UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild. CoRR abs/1212.0402.
- Sun, C., and Nevatia, R. (2013). "ACTIVE: activity concept transitions in video event classification," in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 913-920.
- Sun, Q. S., Zeng, S. G., Liu, Y., Heng, P. A., and Xia, D. S. (2005). A new method of feature fusion and its application in image recognition. Pattern Recognit. 38, 2437-2448. doi:10.1016/j.patcog.2004.12.013
- Sun, X., Chen, M., and Hauptmann, A. (2009). "Action recognition via local descriptors and holistic features," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (Los Alamitos, CA), 58-65.
- Tang, K. D., Yao, B., Fei-Fei, L., and Koller, D. (2013). "Combining the right features for complex event recognition," in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 2696-2703.
- Tenorth, M., Bandouch, J., and Beetz, M. (2009). "The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition," in Proc. IEEE International Workshop on Tracking Humans for the Evaluation of Their Motion in Image Sequences (THEMIS) (Kyoto), 1089-1096.
- Theodorakopoulos, I., Kastaniotis, D., Economou, G., and Fotopoulos, S. (2014). Pose-based human action recognition via sparse representation in dissimilarity space. J. Vis. Commun. Image Represent. 25, 12-23. doi:10.1016/j.jvcir.2013.03.008
- Theodoridis, S., and Koutroumbas, K. (2008). Pattern Recognition, Fourth Edn. Boston: Academic Press.
- Thurau, C., and Hlavac, V. (2008). "Pose primitive based human action recognition in videos or still images," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Tian, Y., Sukthankar, R., and Shah, M. (2013). "Spatiotemporal deformable part models for action detection," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 2642-2649.
- Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211-244. doi:10.1162/15324430152748236
- Toshev, A., and Szegedy, C. (2014). "DeepPose: human pose estimation via deep neural networks," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 1653-1660.
- Tran, D., Yuan, J., and Forsyth, D. (2014a). Video event detection: from subvolume localization to spatiotemporal path search. IEEE Trans. Pattern Anal. Mach. Intell. 36, 404-416. doi:10.1109/TPAMI.2013.137
- Tran, K. N., Gala, A., Kakadiaris, I. A., and Shah, S. K. (2014b). Activity analysis in crowded environments using social cues for group discovery and human interaction modeling. Pattern Recognit. Lett. 44, 49-57. doi:10.1016/j.patrec.2013.09.015
- Tran, K. N., Kakadiaris, I. A., and Shah, S. K. (2012). Part-based motion descriptor image for human action recognition. Pattern Recognit. 45, 2562-2572. doi:10.1016/j.patcog.2011.12.028
- Turaga, P. K., Chellappa, R., Subrahmanian, V. S., and Udrea, O. (2008). Machine recognition of human activities: a survey. IEEE Trans. Circuits Syst. Video Technol. 18, 1473-1488. doi:10.1109/TCSVT.2008.2005594
- Urtasun, R., and Darrell, T. (2008). "Sparse probabilistic regression for activity-independent human pose inference," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Vemulapalli, R., Arrate, F., and Chellappa, R. (2014). "Human action recognition by representing 3D skeletons as points in a Lie group," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 588-595.
- Vinciarelli, A., Dielmann, A., Favre, S., and Salamin, H. (2009). "Canal9: a database of political debates for analysis of social interactions," in Proc. International Conference on Affective Computing and Intelligent Interaction and Workshops (Amsterdam: De Rode Hoed), 1-4.
- Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). "Show and tell: a neural image caption generator," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3156-3164.
- Vrigkas, M., Karavasilis, V., Nikou, C., and Kakadiaris, I. A. (2013). "Action recognition by matching clustered trajectories of motion vectors," in Proc. International Conference on Computer Vision Theory and Applications (Barcelona), 112-117.
- Vrigkas, M., Karavasilis, V., Nikou, C., and Kakadiaris, I. A. (2014a). Matching mixtures of curves for human action recognition. Comput. Vis. Image Understand. 119, 27-40. doi:10.1016/j.cviu.2013.11.007
- Vrigkas, M., Nikou, C., and Kakadiaris, I. A. (2014b). "Classifying behavioral attributes using conditional random fields," in Proc. 8th Hellenic Conference on Artificial Intelligence, Lecture Notes in Computer Science, Vol. 8445 (Ioannina), 95-104.
- Wang, H., Kläser, A., Schmid, C., and Liu, C. L. (2011a). "Action recognition by dense trajectories," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3169-3176.
- Wang, J., Chen, Z., and Wu, Y. (2011b). "Action recognition with multiscale spatio-temporal contexts," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3185-3192.
- Wang, Y., Guan, L., and Venetsanopoulos, A. N. (2011c). "Kernel cross-modal factor analysis for multimodal information fusion," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (Prague), 2384-2387.
- Wang, H., Kläser, A., Schmid, C., and Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103, 60-79. doi:10.1007/s11263-012-0594-8
- Wang, J., Liu, Z., Wu, Y., and Yuan, J. (2012a). "Mining actionlet ensemble for action recognition with depth cameras," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1290-1297.
- Wang, S., Yang, Y., Ma, Z., Li, X., Pang, C., and Hauptmann, A. G. (2012b). "Action recognition by exploring data distribution and feature correlation," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1370-1377.
- Wang, Z., Wang, J., Xiao, J., Lin, K. H., and Huang, T. S. (2012c). "Substructure and boundary modeling for continuous action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1330-1337.
- Wang, L., Hu, W., and Tan, T. (2003). Recent developments in human motion analysis. Pattern Recognit. 36, 585-601. doi:10.1016/S0031-3203(02)00100-0
- Wang, S., Ma, Z., Yang, Y., Li, X., Pang, C., and Hauptmann, A. G. (2014). Semi-supervised multiple feature analysis for action recognition. IEEE Trans. Multimedia 16, 289-298. doi:10.1109/TMM.2013.2293060
- Wang, Y., and Mori, G. (2008). "Learning a discriminative hidden part model for human action recognition," in Proc. Annual Conference on Neural Information Processing Systems (Vancouver, BC), 1721-1728.
- Wang, Y., and Mori, G. (2010). "A discriminative latent model of object classes and attributes," in Proc. European Conference on Computer Vision (Heraklion), 155-168.
- Wang, Y., and Mori, G. (2011). Hidden part models for human action recognition: probabilistic versus max margin. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1310-1323. doi:10.1109/TPAMI.2010.214
- Westerveld, T., de Vries, A. P., van Ballegooij, A., de Jong, F., and Hiemstra, D. (2003). A probabilistic multimedia retrieval model and its evaluation. EURASIP J. Appl. Signal Process. 2003, 186-198. doi:10.1155/S111086570321101X
- Wu, C., Zhang, J., Savarese, S., and Saxena, A. (2015). "Watch-n-patch: unsupervised understanding of actions and relations," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 4362-4370.
- Wu, Q., Wang, Z., Deng, F., Chi, Z., and Feng, D. D. (2013). Realistic human action recognition with multimodal feature selection and fusion. IEEE Trans. Syst. Man Cybern. Syst. 43, 875-885. doi:10.1109/TSMCA.2012.2226575
- Wu, Q., Wang, Z., Deng, F., and Feng, D. D. (2010). "Realistic human action recognition with audio context," in Proc. International Conference on Digital Image Computing: Techniques and Applications (Sydney, NSW), 288-293.
- Wu, X., Xu, D., Duan, L., and Luo, J. (2011). "Action recognition using context and appearance distribution features," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 489-496.
- Xiong, Y., Zhu, K., Lin, D., and Tang, X. (2015). "Recognize complex events from static images by fusing deep channels," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1600-1609.
- Xu, C., Hsieh, S. H., Xiong, C., and Corso, J. J. (2015). "Can humans fly? Action understanding with multiple classes of actors," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2264-2273.
- Xu, R., Agarwal, P., Kumar, S., Krovi, V. N., and Corso, J. J. (2012). "Combining skeletal pose with local motion for human activity recognition," in Proc. International Conference on Articulated Motion and Deformable Objects (Mallorca), 114-123.
- Yan, X., Kakadiaris, I. A., and Shah, S. K. (2014). Modeling local behavior for predicting social interactions towards human tracking. Pattern Recognit. 47, 1626-1641. doi:10.1016/j.patcog.2013.10.019
- Yan, X., and Luo, Y. (2012). Recognizing human actions using a new descriptor based on spatial-temporal interest points and weighted-output classifier. Neurocomputing 87, 51-61. doi:10.1016/j.neucom.2012.02.002
- Yang, W., Wang, Y., and Mori, G. (2010). "Recognizing human actions from still images with latent poses," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (San Francisco, CA), 2030-2037.
- Yang, Y., Saleemi, I., and Shah, M. (2013). Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1635-1648. doi:10.1109/TPAMI.2012.253
- Yang, Z., Metallinou, A., and Narayanan, S. (2014). Analysis and predictive modeling of body language behavior in dyadic interactions from multimodal interlocutor cues. IEEE Trans. Multimedia 16, 1766-1778. doi:10.1109/TMM.2014.2328311
- Yao, A., Gall, J., and Gool, L. V. (2010). "A Hough transform-based voting framework for action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (San Francisco, CA), 2061-2068.
- Yao, B., and Fei-Fei, L. (2010). "Modeling mutual context of object and human pose in human-object interaction activities," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (San Francisco, CA), 17-24.
- Yao, B., and Fei-Fei, L. (2012). Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1691-1703. doi:10.1109/TPAMI.2012.67
- Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L. J., and Fei-Fei, L. (2011). "Human action recognition by learning bases of action attributes and parts," in Proc. IEEE International Conference on Computer Vision (Barcelona), 1331-1338.
- Ye, M., Zhang, Q., Wang, L., Zhu, J., Yang, R., and Gall, J. (2013). "A survey on human motion analysis from depth data," in Time-of-Flight and Depth Imaging, Lecture Notes in Computer Science, Vol. 8200, eds M. Grzegorzek, C. Theobalt, R. Koch, and A. Kolb (Berlin, Heidelberg: Springer), 149-187.
- Yi, S., Krim, H., and Norris, L. K. (2012). Human activity as a manifold-valued random process. IEEE Trans. Image Process. 21, 3416-3428. doi:10.1109/TIP.2012.2197008
- Yu, G., and Yuan, J. (2015). "Fast action proposals for human action detection and search," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1302-1311.
- Yu, G., Yuan, J., and Liu, Z. (2012). "Propagative Hough voting for human activity recognition," in Proc. European Conference on Computer Vision (Florence), 693-706.
- Yun, K., Honorio, J., Chattopadhyay, D., Berg, T. L., and Samaras, D. (2012). "Two-person interaction detection using body-pose features and multiple instance learning," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (Providence, RI), 28-35.
- Zeng, Z., Pantic, M., Roisman, G. I., and Huang, T. S. (2009). A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31, 39-58. doi:10.1109/TPAMI.2008.52
- Zhang, Z., Wang, C., Xiao, B., Zhou, W., and Liu, S. (2013). Attribute regularization based human action recognition. IEEE Trans. Inform. Forensics Secur. 8, 1600-1609. doi:10.1109/TIFS.2013.2258152
- Zhang, Z., Wang, C., Xiao, B., Zhou, W., and Liu, S. (2015). Robust relative attributes for human action recognition. Pattern Anal. Appl. 18, 157-171. doi:10.1007/s10044-013-0349-3
- Zhou, Q., and Wang, G. (2012). "Atomic action features: a new feature for action recognition," in Proc. European Conference on Computer Vision (Florence), 291-300.
- Zhou, W., and Zhang, Z. (2014). Human action recognition with multiple-instance Markov model. IEEE Trans. Inform. Forensics Secur. 9, 1581-1591. doi:10.1109/TIFS.2014.2344448