A Review of Human Activity Recognition Methods
2015, Frontiers in Robotics and AI
https://doi.org/10.3389/frobt.2015.00028

Abstract
Recognizing human activities from video sequences or still images is a challenging task due to problems such as background clutter, partial occlusion, and changes in scale, viewpoint, lighting, and appearance. Many applications, including video surveillance systems, human-computer interaction, and robotics for human behavior characterization, require a multiple activity recognition system. In this work, we provide a detailed review of recent and state-of-the-art research advances in the field of human activity classification. We propose a categorization of human activity methodologies and discuss their advantages and limitations. In particular, we divide human activity classification methods into two large categories according to whether they use data from a single modality or from multiple modalities. Each of these categories is then further divided into sub-categories that reflect how the methods model human activities and what types of activities they target. Moreover, we provide a comprehensive analysis of the existing, publicly available human activity classification datasets and examine the requirements for an ideal human activity recognition dataset. Finally, we outline future research directions and present some open issues in human activity recognition.
Key takeaways
- Human activity recognition faces challenges like background clutter and occlusion, complicating accurate classification.
- The review categorizes recognition methods into unimodal and multimodal approaches, detailing their advantages and limitations.
- Addressing complex activities often requires decomposing them into simpler actions for better recognition accuracy.
- An ideal human activity recognition dataset must be diverse and reflect real-world scenarios for effective training.
- Future research should focus on generalization and robustness in varied environments, including handling occlusions and missing data.
References (288)
- Aggarwal, J. K., and Cai, Q. (1999). Human motion analysis: a review. Comput. Vis. Image Understand. 73, 428-440. doi:10.1006/cviu.1998.0744
- Aggarwal, J. K., and Ryoo, M. S. (2011). Human activity analysis: a review. ACM Comput. Surv. 43, 1-43. doi:10.1145/1922649.1922653
- Aggarwal, J. K., and Xia, L. (2014). Human activity recognition from 3D data: a review. Pattern Recognit. Lett. 48, 70-80. doi:10.1016/j.patrec.2014.04.011
- Akata, Z., Perronnin, F., Harchaoui, Z., and Schmid, C. (2013). "Label-embedding for attribute-based classification, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 819-826.
- Alahi, A., Ramanathan, V., and Fei-Fei, L. (2014). "Socially-aware large-scale crowd forecasting, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 2211-2218.
- AlZoubi, O., Fossati, D., D'Mello, S. K., and Calvo, R. A. (2013). "Affect detec- tion and classification from the non-stationary physiological data, " in Proc. International Conference on Machine Learning and Applications (Portland, OR), 240-245.
- Amer, M. R., and Todorovic, S. (2012). "Sum-product networks for modeling activities with stochastic structure, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1314-1321.
- Amin, S., Andriluka, M., Rohrbach, M., and Schiele, B. (2013). "Multi-view pictorial structures for 3D human pose estimation, " in Proc. British Machine Vision Conference (Bristol), 1-12.
- Andriluka, M., Pishchulin, L., Gehler, P. V., and Schiele, B. (2014). "2D human pose estimation: new benchmark and state of the art analysis, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 3686-3693.
- Andriluka, M., and Sigal, L. (2012). "Human context: modeling human-human interactions for monocular 3D pose estimation, " in Proc. International Confer- ence on Articulated Motion and Deformable Objects (Mallorca: Springer-Verlag), 260-272.
- Anirudh, R., Turaga, P., Su, J., and Srivastava, A. (2015). "Elastic functional coding of human actions: from vector-fields to latent variables, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3147-3155.
- Atrey, P. K., Hossain, M. A., El-Saddik, A., and Kankanhalli, M. S. (2010). Mul- timodal fusion for multimedia analysis: a survey. Multimed. Syst. 16, 345-379. doi:10.1007/s00530-010-0182-0
- Bandla, S., and Grauman, K. (2013). "Active learning of an action detector from untrimmed videos, " in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 1833-1840.
- Baxter, R. H., Robertson, N. M., and Lane, D. M. (2015). Human behaviour recog- nition in data-scarce domains. Pattern Recognit. 48, 2377-2393. doi:10.1016/j. patcog.2015.02.019
- Belagiannis, V., Amin, S., Andriluka, M., Schiele, B., Navab, N., and Ilic, S. (2014). "3D pictorial structures for multiple human pose estimation, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 1669-1676.
- Bilakhia, S., Petridis, S., and Pantic, M. (2013). "Audiovisual detection of behavioural mimicry, " in Proc. 2013 Humaine Association Conference on Affec- tive Computing and Intelligent Interaction (Geneva), 123-128.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Secaucus, NJ: Springer.
- Blank, M., Gorelick, L., Shechtman, E., Irani, M., and Basri, R. (2005). "Actions as space-time shapes, " in Proc. IEEE International Conference on Computer Vision (Beijing), 1395-1402.
- Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., and Sivic, J. (2013). "Finding actors and actions in movies, " in Proc. IEEE International Conference on Computer Vision (Sydney), 2280-2287.
- Bousmalis, K., Mehu, M., and Pantic, M. (2013a). Towards the automatic detection of spontaneous agreement and disagreement based on nonverbal behaviour: a survey of related cues, databases, and tools. Image Vis. Comput. 31, 203-221. doi:10.1016/j.imavis.2012.07.003
- Bousmalis, K., Zafeiriou, S., Morency, L. P., and Pantic, M. (2013b). Infinite hidden conditional random fields for human behavior analysis. IEEE Trans. Neural Networks Learn. Syst. 24, 170-177. doi:10.1109/TNNLS.2012.2224882
- Bousmalis, K., Morency, L., and Pantic, M. (2011). "Modeling hidden dynamics of multimodal cues for spontaneous agreement and disagreement recognition, " in Proc. IEEE International Conference on Automatic Face and Gesture Recognition (Santa Barbara, CA), 746-752.
- Burenius, M., Sullivan, J., and Carlsson, S. (2013). "3D pictorial structures for multi- ple view articulated pose estimation, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 3618-3625.
- Burgos-Artizzu, X. P., Dollár, P., Lin, D., Anderson, D. J., and Perona, P. (2012). "Social behavior recognition in continuous video, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1322-1329.
- Candamo, J., Shreve, M., Goldgof, D. B., Sapper, D. B., and Kasturi, R. (2010). Understanding transit scenes: a survey on human behavior-recognition algo- rithms. IEEE Trans. Intell. Transp. Syst. 11, 206-224. doi:10.1109/TITS.2009. 2030963
- Castellano, G., Villalba, S. D., and Camurri, A. (2007). "Recognising human emo- tions from body movement and gesture dynamics, " in Proc. Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science, Vol. 4738 (Lisbon), 71-82.
- Chakraborty, B., Holte, M. B., Moeslund, T. B., and Gonzàlez, J. (2012). Selective spatio-temporal interest points. Comput. Vis. Image Understand. 116, 396-410. doi:10.1016/j.cviu.2011.09.010
- Chaquet, J. M., Carmona, E. J., and Fernández-Caballero, A. (2013). A survey of video datasets for human action and activity recognition. Comput. Vis. Image Understand. 117, 633-659. doi:10.1016/j.cviu.2013.01.013
- Chaudhry, R., Ravichandran, A., Hager, G. D., and Vidal, R. (2009). "Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1932-1939.
- Chen, C. Y., and Grauman, K. (2012). "Efficient activity detection with max- subgraph search, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1274-1281.
- Chen, H., Li, J., Zhang, F., Li, Y., and Wang, H. (2015). "3D model-based continuous emotion recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1836-1845.
- Chen, L., Duan, L., and Xu, D. (2013a). "Event recognition in videos by learning from heterogeneous web sources, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 2666-2673.
- Chen, L., Wei, H., and Ferryman, J. (2013b). A survey of human motion analysis using depth imagery. Pattern Recognit. Lett. 34, 1995-2006. doi:10.1016/j.patrec. 2013.02.006
- Chen, W., Xiong, C., Xu, R., and Corso, J. J. (2014). "Actionness ranking with lattice conditional ordinal random fields, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 748-755.
- Cherian, A., Mairal, J., Alahari, K., and Schmid, C. (2014). "Mixing body-part sequences for human pose estimation, " in Proc. IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition (Columbus, OH), 2361-2368.
- Choi, W., Shahid, K., and Savarese, S. (2011). "Learning context for collective activity recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3273-3280.
- Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2011). "Flexible, high performance convolutional neural networks for image classification, " in Proc. International Joint Conference on Artificial Intelligence (Barcelona), 1237-1242.
- Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012). "Multi-column deep neural networks for image classification, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 3642-3649.
- Cui, X., Liu, Q., Gao, M., and Metaxas, D. N. (2011). "Abnormal detection using interaction energy potentials, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3161-3167.
- Dalal, N., and Triggs, B. (2005). "Histograms of oriented gradients for human detection, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 886-893.
- Dalal, N., Triggs, B., and Schmid, C. (2006). "Human detection using oriented histograms of flow and appearance, " in Proc. European Conference on Computer Vision (Graz), 428-441.
- Dollár, P., Rabaud, V., Cottrell, G., and Belongie, S. (2005). "Behavior recognition via sparse spatio-temporal features, " in Proc. International Conference on Com- puter Communications and Networks (Beijing), 65-72.
- Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., et al. (2015). "Long-term recurrent convolutional networks for visual recognition and description, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2625-2634.
- Du, Y., Wang, W., and Wang, L. (2015). "Hierarchical recurrent neural network for skeleton based action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1110-1118.
- Efros, A. A., Berg, A. C., Mori, G., and Malik, J. (2003). "Recognizing action at a distance, " in Proc. IEEE International Conference on Computer Vision, Vol. 2 (Nice), 726-733.
- Ekman, P., Friesen, W. V., and Hager, J. C. (2002). Facial Action Coding System (FACS): Manual. Salt Lake City: A Human Face.
- Elgammal, A., Duraiswami, R., Harwood, D., and Davis, L. S. (2002). Background and foreground modeling using nonparametric kernel density for visual surveil- lance. Proc. IEEE 90, 1151-1163. doi:10.1109/JPROC.2002.801448
- Escalera, S., Baró, X., Vitrià, J., Radeva, P., and Raducanu, B. (2012). Social network extraction and analysis based on multimodal dyadic interaction. Sensors 12, 1702-1719. doi:10.3390/s120201702
- Evangelopoulos, G., Zlatintsi, A., Potamianos, A., Maragos, P., Rapantzikos, K., Skoumas, G., et al. (2013). Multimodal saliency and fusion for movie summa- rization based on aural, visual, and textual attention. IEEE Trans. Multimedia 15, 1553-1568. doi:10.1109/TMM.2013.2267205
- Evgeniou, T., and Pontil, M. (2004). "Regularized multi-task learning, " in Proc. ACM International Conference on Knowledge Discovery and Data Mining (Seattle, WA), 109-117.
- Eweiwi, A., Cheema, M. S., Bauckhage, C., and Gall, J. (2014). "Efficient pose-based action recognition, " in Proc. Asian Conference on Computer Vision (Singapore), 428-443.
- Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. A. (2009). "Describing objects by their attributes, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1778-1785.
- Fathi, A., Hodgins, J. K., and Rehg, J. M. (2012). "Social interactions: a first-person perspective, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1226-1233.
- Fathi, A., and Mori, G. (2008). "Action recognition by learning mid-level motion features, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Fergie, M., and Galata, A. (2013). Mixtures of Gaussian process models for human pose estimation. Image Vis. Comput. 31, 949-957. doi:10.1016/j.imavis.2013.09. 007
- Fernando, B., Gavves, E., Oramas, J. M., Ghodrati, A., and Tuytelaars, T. (2015). "Modeling video evolution for action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 5378-5387.
- Ferrari, V., Marin-Jimenez, M., and Zisserman, A. (2009). "Pose search: retrieving people using their pose, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1-8.
- Fisher, R. B. (2004). PETS04 Surveillance Ground Truth Dataset. Available at: http: //www-prima.inrialpes.fr/PETS04/
- Fisher, R. B. (2007a). Behave: Computer-Assisted Prescreening of Video Streams for Unusual Activities. Available at: http://homepages.inf.ed.ac.uk/rbf/BEHAVE/
- Fisher, R. B. (2007b). PETS07 Benchmark Dataset. Available at: http://www.cvg. reading.ac.uk/PETS2007/data.html
- Fogel, I., and Sagi, D. (1989). Gabor filters as texture discriminator. Biol. Cybern. 61, 103-113. doi:10.1007/BF00204594
- Fothergill, S., Mentis, H. M., Kohli, P., and Nowozin, S. (2012). "Instructing people for training gestural interactive systems, " in Proc. Conference on Human Factors in Computing Systems (Austin, TX), 1737-1746.
- Fouhey, D. F., Delaitre, V., Gupta, A., Efros, A. A., Laptev, I., and Sivic, J. (2014). People watching: human actions as a cue for single view geometry. Int. J. Comput. Vis. 110, 259-274. doi:10.1007/s11263-014-0710-z
- Fu, Y., Hospedales, T. M., Xiang, T., and Gong, S. (2012). "Attribute learning for understanding unstructured social activity, " in Proc. European Conference on Computer Vision, Lecture Notes in Computer Science, Vol. 7575 (Florence), 530-543.
- Fu, Y., Hospedales, T. M., Xiang, T., and Gong, S. (2014). Learning multimodal latent attributes. IEEE Trans. Pattern Anal. Mach. Intell. 36, 303-316. doi:10. 1109/TPAMI.2013.128
- Gaidon, A., Harchaoui, Z., and Schmid, C. (2014). Activity representation with motion hierarchies. Int. J. Comput. Vis. 107, 219-238. doi:10.1007/s11263-013- 0677-1
- Gan, C., Wang, N., Yang, Y., Yeung, D. Y., and Hauptmann, A. G. (2015). "DevNet: a deep event network for multimedia event detection and evidence recounting, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2568-2577.
- Gao, Z., Zhang, H., Xu, G. P., and Xue, Y. B. (2015). Multi-perspective and multi- modality joint representation and recognition model for 3D action recognition. Neurocomputing 151, 554-564. doi:10.1016/j.neucom.2014.06.085
- Gavrila, D. M. (1999). The visual analysis of human movement: a survey. Comput. Vis. Image Understand. 73, 82-98. doi:10.1006/cviu.1998.0716
- Gorelick, L., Blank, M., Shechtman, E., Irani, M., and Basri, R. (2007). Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell. 29, 2247-2253. doi:10.1109/TPAMI.2007.70711
- Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R. J., Darrell, T., et al. (2013). "Youtube2text: recognizing and describing arbi- trary activities using semantic hierarchies and zero-shot recognition, " in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 2712-2719.
- Guha, T., and Ward, R. K. (2012). Learning sparse representations for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1576-1588. doi:10.1109/ TPAMI.2011.253
- Guo, G., and Lai, A. (2014). A survey on still image based human action recognition. Pattern Recognit. 47, 3343-3361. doi:10.1016/j.patcog.2014.04.018
- Gupta, A., and Davis, L. S. (2007). "Objects in action: an approach for combining action understanding and object perception, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Minneapolis, MN), 1-8.
- Gupta, A., Kembhavi, A., and Davis, L. S. (2009). Observing human-object inter- actions: using spatial and functional compatibility for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31, 1775-1789. doi:10.1109/TPAMI.2009.83
- Haralick, R. M., and Watson, L. (1981). A facet model for image data. Comput. Graph. Image Process. 15, 113-129. doi:10.1016/0146-664X(81)90073-3
- Hardoon, D. R., Szedmak, S. R., and Shawe-Taylor, J. R. (2004). Canonical correla- tion analysis: an overview with application to learning methods. Neural Comput. 16, 2639-2664. doi:10.1162/0899766042321814
- Healey, J. (2011). "Recording affect in the field: towards methods and metrics for improving ground truth labels, " in Proc. International Conference on Affective Computing and Intelligent Interaction (Memphis, TN), 107-116.
- Heilbron, F. C., Escorcia, V., Ghanem, B., and Niebles, J. C. (2015). "ActivityNet: a large-scale video benchmark for human activity understanding, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 961-970.
- Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527-1554. doi:10.1162/neco.2006.18.7.1527
- Ho, T. K. (1995). "Random decision forests, " in Proc. International Conference on Document Analysis and Recognition, Vol. 1 (Washington, DC: IEEE Computer Society), 278-282.
- Hoai, M., Lan, Z. Z., and Torre, F. (2011). "Joint segmentation and classifi- cation of human actions in video, " in Proc. IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3265-3272.
- Hoai, M., and Zisserman, A. (2014). "Talking heads: detecting humans and rec- ognizing their interactions, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 875-882.
- Holte, M. B., Chakraborty, B., Gonzàlez, J., and Moeslund, T. B. (2012a). A local 3- D motion descriptor for multi-view human action recognition from 4-D spatio- temporal interest points. IEEE J. Sel. Top. Signal Process. 6, 553-565. doi:10.1109/ JSTSP.2012.2193556
- Holte, M. B., Tran, C., Trivedi, M. M., and Moeslund, T. B. (2012b). Human pose estimation and activity recognition from multi-view videos: comparative explorations of recent developments. IEEE J. Sel. Top. Signal Process. 6, 538-552. doi:10.1109/JSTSP.2012.2196975
- Huang, Z. F., Yang, W., Wang, Y., and Mori, G. (2011). "Latent boosting for action recognition, " in Proc. British Machine Vision Conference (Dundee), 1-11.
- Hussain, M. S., Calvo, R. A., and Pour, P. A. (2011). "Hybrid fusion approach for detecting affects from multichannel physiology, " in Proc. International Confer- ence on Affective Computing and Intelligent Interaction, Lecture Notes in Com- puter Science, Vol. 6974 (Memphis, TN), 568-577.
- Ikizler, N., and Duygulu, P. (2007). "Human action recognition using distribution of oriented rectangular patches, " in Proc. Conference on Human Motion: Under- standing, Modeling, Capture and Animation (Rio de Janeiro), 271-284.
- Ikizler-Cinbis, N., and Sclaroff, S. (2010). "Object, scene and actions: combining multiple features for human action recognition, " in Proc. European Conference on Computer Vision, Lecture Notes in Computer Science, Vol. 6311 (Hersonissos, Heraclion, Crete, greece: Springer), 494-507.
- Iosifidis, A., Tefas, A., and Pitas, I. (2012a). Activity-based person identification using fuzzy representation and discriminant learning. IEEE Trans. Inform. Forensics Secur. 7, 530-542. doi:10.1109/TIFS.2011.2175921
- Iosifidis, A., Tefas, A., and Pitas, I. (2012b). View-invariant action recognition based on artificial neural networks. IEEE Trans. Neural Networks Learn. Syst. 23, 412-424. doi:10.1109/TNNLS.2011.2181865
- Jaimes, A., and Sebe, N. (2007). "Multimodal human-computer interaction: a sur- vey, " in Computer Vision and Image Understanding, Vol. 108 (Special Issue on Vision for Human-Computer Interaction), 116-134.
- Jain, M., Gemert, J., Jégou, H., Bouthemy, P., and Snoek, C. G. M. (2014). "Action localization with tubelets from motion, " in Proc. IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition (Columbus, OH), 740-747.
- Jain, M., Jegou, H., and Bouthemy, P. (2013). "Better exploiting motion for better action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 2555-2562.
- Jainy, M., Gemerty, J. C., and Snoek, C. G. M. (2015). "What do 15,000 object cate- gories tell us about classifying and localizing actions?, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 46-55.
- Jayaraman, D., and Grauman, K. (2014). "Zero-shot recognition with unreliable attributes, " in Proc. Annual Conference on Neural Information Processing Systems (Montreal, QC), 3464-3472.
- Jhuang, H., Gall, J., Zuffi, S., Schmid, C., and Black, M. J. (2013). "Towards understanding action recognition, " in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 3192-3199.
- Jhuang, H., Serre, T., Wolf, L., and Poggio, T. (2007). "A biologically inspired system for action recognition, " in Proc. IEEE International Conference on Computer Vision (Rio de Janeiro), 1-8.
- Jiang, B., Martínez, B., Valstar, M. F., and Pantic, M. (2014). "Decision level fusion of domain specific regions for facial action recognition, " in Proc. International Conference on Pattern Recognition (Stockholm), 1776-1781.
- Jiang, Y. G., Ye, G., Chang, S. F., Ellis, D. P. W., and Loui, A. C. (2011). "Con- sumer video understanding: a benchmark database and an evaluation of human and machine performance, " in Proc. International Conference on Multimedia Retrieval (Trento), 29-36.
- Jiang, Z., Lin, Z., and Davis, L. S. (2013). A unified tree-based framework for joint action localization, recognition and segmentation. Comput. Vis. Image Understand. 117, 1345-1355. doi:10.1016/j.cviu.2012.09.008
- Jung, H. Y., Lee, S., Heo, Y. S., and Yun, I. D. (2015). "Random treewalk toward instantaneous 3D human pose estimation, " in Proc. IEEE Computer Society Con- ference on Computer Vision and Pattern Recognition (Boston, MA), 2467-2474.
- Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). "Large-scale video classification with convolutional neural networks, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 1725-1732.
- Khamis, S., Morariu, V. I., and Davis, L. S. (2012). "A flow model for joint action recognition and identity maintenance, " in Proc. IEEE Computer Society Confer- ence on Computer Vision and Pattern Recognition (Providence, RI), 1218-1225.
- Kim, Y., Lee, H., and Provost, E. M. (2013). "Deep learning for robust feature generation in audiovisual emotion recognition, " in Proc. IEEE International Con- ference on Acoustics, Speech and Signal Processing (Vancouver, BC), 3687-3691.
- Klami, A., and Kaski, S. (2008). Probabilistic approach to detecting dependencies between data sets. Neurocomputing 72, 39-46. doi:10.1016/j.neucom.2007.12. 044
- Kläser, A., Marszałek, M., and Schmid, C. (2008). "A spatio-temporal descriptor based on 3D-gradients, " in Proc. British Machine Vision Conference (Leeds: University of Leeds), 995-1004.
- Kohonen, T., Schroeder, M. R., and Huang, T. S. (eds) (2001). Self-Organizing Maps, Third Edn. New York, NY.: Springer-Verlag Inc.
- Kong, Y., and Fu, Y. (2014). "Modeling supporting regions for close human inter- action recognition, " in Proc. European Conference on Computer Vision (Zurich), 29-44.
- Kong, Y., Jia, Y., and Fu, Y. (2014a). Interactive phrases: semantic descriptions for human interaction recognition. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1775-1788. doi:10.1109/TPAMI.2014.2303090
- Kong, Y., Kit, D., and Fu, Y. (2014b). "A discriminative model with multiple tem- poral scales for action prediction, " in Proc. European Conference on Computer Vision (Zurich), 596-611.
- Kovashka, A., and Grauman, K. (2010). "Learning a hierarchy of discriminative space-time neighborhood features for human action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (San Francisco, CA), 2046-2053.
- Kuehne, H., Arslan, A., and Serre, T. (2014). "The language of actions: recov- ering the syntax and semantics of goal-directed human activities, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 780-787.
- Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011). "HMDB: a large video database for human motion recognition, " in Proc. IEEE International Conference on Computer Vision (Barcelona), 2556-2563.
- Kulkarni, K., Evangelidis, G., Cech, J., and Horaud, R. (2015). Continuous action recognition based on sequence alignment. Int. J. Comput. Vis. 112, 90-114. doi:10.1007/s11263-014-0758-9
- Kulkarni, P., Sharma, G., Zepeda, J., and Chevallier, L. (2014). "Transfer learning via attributes for improved on-the-fly classification, " in Proc. IEEE Winter Con- ference on Applications of Computer Vision (Steamboat Springs, CO), 220-226.
- Kviatkovsky, I., Rivlin, E., and Shimshoni, I. (2014). Online action recognition using covariance of shape and motion. Comput. Vis. Image Understand. 129, 15-26. doi:10.1016/j.cviu.2014.08.001
- Lafferty, J. D., McCallum, A., and Pereira, F. C. N. (2001). "Conditional random fields: probabilistic models for segmenting and labeling sequence data, " in Proc. International Conference on Machine Learning (Williamstown, MA: Williams College), 282-289.
- Lampert, C. H., Nickisch, H., and Harmeling, S. (2009). "Learning to detect unseen object classes by between-class attribute transfer, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 951-958.
- Lan, T., Chen, T. C., and Savarese, S. (2014). "A hierarchical representation for future action prediction, " in Proc. European Conference on Computer Vision (Zurich), 689-704.
- Lan, T., Sigal, L., and Mori, G. (2012a). "Social roles in hierarchical models for human activity recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1354-1361.
- Lan, T., Wang, Y., Yang, W., Robinovitch, S. N., and Mori, G. (2012b). Discrim- inative latent models for recognizing contextual group activities. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1549-1562. doi:10.1109/TPAMI.2011.228
- Lan, T., Wang, Y., and Mori, G. (2011). "Discriminative figure-centric models for joint action localization and recognition, " in Proc. IEEE International Conference on Computer Vision (Barcelona), 2003-2010.
- Laptev, I. (2005). On space-time interest points. Int. J. Comput. Vis. 64, 107-123. doi:10.1007/s11263-005-1838-7
- Laptev, I., Marszałek, M., Schmid, C., and Rozenfeld, B. (2008). "Learning realistic human actions from movies, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. (2011). "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3361-3368.
- Li, B., Ayazoglu, M., Mao, T., Camps, O. I., and Sznaier, M. (2011). "Activity recognition using dynamic subspace angles, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3193-3200.
- Li, B., Camps, O. I., and Sznaier, M. (2012). "Cross-view activity recognition using hankelets, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1362-1369.
- Li, R., and Zickler, T. (2012). "Discriminative virtual views for cross-view action recognition, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 2855-2862.
- Lichtenauer, J., Valstar, J. S. M., and Pantic, M. (2011). Cost-effective solution to synchronised audio-visual data capture using multiple sensors. Image Vis. Comput. 29, 666-680. doi:10.1016/j.imavis.2011.07.004
- Lillo, I., Soto, A., and Niebles, J. C. (2014). "Discriminative hierarchical modeling of spatio-temporally composable human activities, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 812-819.
- Lin, Z., Jiang, Z., and Davis, L. S. (2009). "Recognizing actions by shape-motion prototype trees, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 444-451.
- Liu, J., Kuipers, B., and Savarese, S. (2011a). "Recognizing human actions by attributes, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3337-3344.
- Liu, N., Dellandréa, E., Tellez, B., and Chen, L. (2011b). "Associating textual features with visual ones to improve affective image classification, " in Proc. International Conference on Affective Computing and Intelligent Interaction, Lecture Notes in Computer Science, Vol. 6974 (Memphis, TN), 195-204.
- Liu, J., Luo, J., and Shah, M. (2009). "Recognizing realistic actions from videos in the wild, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 1-8.
- Liu, J., Yan, J., Tong, M., and Liu, Y. (2010). "A Bayesian framework for 3D human motion tracking from monocular image, " in IEEE International Conference on Acoustics, Speech and Signal Processing (Dallas, TX: IEEE), 1398-1401.
- Livne, M., Sigal, L., Troje, N. F., and Fleet, D. J. (2012). Human attributes from 3D pose tracking. Comput. Vis. Image Understanding 116, 648-660. doi:10.1016/j. cviu.2012.01.003
- Lu, J., Xu, R., and Corso, J. J. (2015). "Human action segmentation with hierar- chical supervoxel consistency, " in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3762-3771.
- Lu, W. L., Ting, J. A., Murphy, K. P., and Little, J. J. (2011). "Identifying players in broadcast sports videos using conditional random fields," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3249-3256.
- Ma, S., Sigal, L., and Sclaroff, S. (2015). "Space-time tree ensemble for action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 5024-5032.
- Maji, S., Bourdev, L. D., and Malik, J. (2011). "Action recognition from a distributed representation of pose and appearance," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3177-3184.
- Marín-Jiménez, M. J., Muñoz-Salinas, R., Yeguas-Bolivar, E., and de la Blanca, N. P. (2014). Human interaction categorization by using audio-visual cues. Mach. Vis. Appl. 25, 71-84. doi:10.1007/s00138-013-0521-1
- Marszałek, M., Laptev, I., and Schmid, C. (2009). "Actions in context," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Miami Beach, FL), 2929-2936.
- Martinez, H. P., Bengio, Y., and Yannakakis, G. N. (2013). Learning deep physiological models of affect. IEEE Comput. Intell. Mag. 8, 20-33. doi:10.1109/MCI.2013.2247823
- Martinez, H. P., Yannakakis, G. N., and Hallam, J. (2014). Don't classify ratings of affect; rank them! IEEE Trans. Affective Comput. 5, 314-326. doi:10.1109/TAFFC.2014.2352268
- Matikainen, P., Hebert, M., and Sukthankar, R. (2009). "Trajectons: action recognition through the motion analysis of tracked features," in Workshop on Video-Oriented Object and Event Classification, in Conjunction with ICCV (Kyoto: IEEE), 514-521.
- Messing, R., Pal, C. J., and Kautz, H. A. (2009). "Activity recognition using the velocity histories of tracked keypoints," in Proc. IEEE International Conference on Computer Vision (Kyoto), 104-111.
- Metallinou, A., Katsamanis, A., and Narayanan, S. (2013). Tracking continuous emotional trends of participants during affective dyadic interactions using body language and speech information. Image Vis. Comput. 31, 137-152. doi:10.1016/j.imavis.2012.08.018
- Metallinou, A., Lee, C. C., Busso, C., Carnicke, S. M., and Narayanan, S. (2010). "The USC creative IT database: a multimodal database of theatrical improvisation," in Proc. Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (Malta: Springer), 1-4.
- Metallinou, A., Lee, S., and Narayanan, S. (2008). "Audio-visual emotion recognition using Gaussian mixture models for face and voice," in Proc. IEEE International Symposium on Multimedia (Berkeley, CA), 250-257.
- Metallinou, A., and Narayanan, S. (2013). "Annotation and processing of continuous emotional attributes: challenges and opportunities," in Proc. IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (Shanghai), 1-8.
- Metallinou, A., Wollmer, M., Katsamanis, A., Eyben, F., Schuller, B., and Narayanan, S. (2012). Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Trans. Affective Comput. 3, 184-198. doi:10.1109/T-AFFC.2011.40
- Mikolajczyk, K., and Uemura, H. (2008). "Action recognition with motion-appearance vocabulary forest," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Moeslund, T. B., Hilton, A., and Krüger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Understand. 104, 90-126. doi:10.1016/j.cviu.2006.08.002
- Morariu, V. I., and Davis, L. S. (2011). "Multi-agent event recognition in structured scenarios," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3289-3296.
- Morris, B. T., and Trivedi, M. M. (2011). Trajectory learning for activity understanding: unsupervised, multilevel, and long-term adaptive approach. IEEE Trans. Pattern Anal. Mach. Intell. 33, 2287-2301. doi:10.1109/TPAMI.2011.64
- Moutzouris, A., del Rincon, J. M., Nebel, J. C., and Makris, D. (2015). Efficient tracking of human poses using a manifold hierarchy. Comput. Vis. Image Understand. 132, 75-86. doi:10.1016/j.cviu.2014.10.005
- Mumtaz, A., Zhang, W., and Chan, A. B. (2014). "Joint motion segmentation and background estimation in dynamic scenes," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 368-375.
- Murray, R. M., Li, Z., and Sastry, S. S. (1994). A Mathematical Introduction to Robotic Manipulation, First Edn. Boca Raton, FL: CRC Press, Inc.
- Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. (2011). "Multimodal deep learning," in Proc. International Conference on Machine Learning (Bellevue, WA), 689-696.
- Ni, B., Moulin, P., Yang, X., and Yan, S. (2015). "Motion part regularization: improving action recognition via trajectory group selection," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3698-3706.
- Ni, B., Paramathayalan, V. R., and Moulin, P. (2014). "Multiple granularity analysis for fine-grained action detection," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 756-763.
- Nicolaou, M. A., Gunes, H., and Pantic, M. (2011). Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Trans. Affective Comput. 2, 92-105. doi:10.1109/T-AFFC.2011.9
- Nicolaou, M. A., Pavlovic, V., and Pantic, M. (2014). Dynamic probabilistic CCA for analysis of affective behavior and fusion of continuous annotations. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1299-1311. doi:10.1109/TPAMI.2014.16
- Nie, B. X., Xiong, C., and Zhu, S. C. (2015). "Joint action recognition and pose estimation from video," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1293-1301.
- Niebles, J. C., Wang, H., and Fei-Fei, L. (2008). Unsupervised learning of human action categories using spatial-temporal words. Int. J. Comput. Vis. 79, 299-318. doi:10.1007/s11263-007-0122-4
- Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C., Lee, J. T., et al. (2011). "A large-scale benchmark dataset for event recognition in surveillance video," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3153-3160.
- Oikonomopoulos, A., Pantic, M., and Patras, I. (2009). Sparse B-spline polynomial descriptors for human activity recognition. Image Vis. Comput. 27, 1814-1825. doi:10.1016/j.imavis.2009.05.010
- Oliver, N. M., Rosario, B., and Pentland, A. P. (2000). A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 22, 831-843. doi:10.1109/34.868684
- Ouyang, W., Chu, X., and Wang, X. (2014). "Multi-source deep learning for human pose estimation," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 2337-2344.
- Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). "Zero-shot learning with semantic output codes," in Proc. Annual Conference on Neural Information Processing Systems (Vancouver, BC), 1410-1418.
- Pantic, M., Pentland, A., Nijholt, A., and Huang, T. (2006). "Human computing and machine understanding of human behavior: a survey," in Proc. International Conference on Multimodal Interfaces (New York, NY), 239-248.
- Pantic, M., and Rothkrantz, L. (2003). "Towards an affect-sensitive multimodal human-computer interaction," in Proc. IEEE, Special Issue on Multimodal Human-Computer Interaction, Invited Paper, Vol. 91 (IEEE), 1370-1390.
- Park, H. S., and Shi, J. (2015). "Social saliency prediction," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 4777-4785.
- Patron-Perez, A., Marszalek, M., Reid, I., and Zisserman, A. (2012). Structured learning of human interactions in TV shows. IEEE Trans. Pattern Anal. Mach. Intell. 34, 2441-2453. doi:10.1109/TPAMI.2012.24
- Perez, P., Vermaak, J., and Blake, A. (2004). Data fusion for visual tracking with particles. Proc. IEEE 92, 495-513. doi:10.1109/JPROC.2003.823147
- Perronnin, F., and Dance, C. R. (2007). "Fisher kernels on visual vocabularies for image categorization," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Minneapolis, MN), 1-8.
- Picard, R. W. (1997). Affective Computing. Cambridge, MA: MIT Press.
- Pirsiavash, H., and Ramanan, D. (2012). "Detecting activities of daily living in first-person camera views," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 2847-2854.
- Pirsiavash, H., and Ramanan, D. (2014). "Parsing videos of actions with segmental grammars," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 612-619.
- Pishchulin, L., Andriluka, M., Gehler, P. V., and Schiele, B. (2013). "Strong appearance and expressive spatial models for human pose estimation," in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 3487-3494.
- Poppe, R. (2010). A survey on vision-based human action recognition. Image Vis. Comput. 28, 976-990. doi:10.1016/j.imavis.2009.11.014
- Prince, S. J. D. (2012). Computer Vision: Models, Learning, and Inference. New York, NY: Cambridge University Press.
- Quattoni, A., Wang, S., Morency, L. P., Collins, M., and Darrell, T. (2007). Hidden conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell. 29, 1848-1852. doi:10.1109/TPAMI.2007.1124
- Rahmani, H., Mahmood, A., Huynh, D. Q., and Mian, A. S. (2014). "Real time action recognition using histograms of depth gradients and random decision forests," in Proc. IEEE Winter Conference on Applications of Computer Vision (Steamboat Springs, CO), 626-633.
- Rahmani, H., and Mian, A. (2015). "Learning a non-linear knowledge transfer model for cross-view action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2458-2466.
- Ramanathan, V., Li, C., Deng, J., Han, W., Li, Z., Gu, K., et al. (2015). "Learning semantic relationships for better action retrieval in images," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1100-1109.
- Ramanathan, V., Liang, P., and Fei-Fei, L. (2013). "Video event understanding using natural language descriptions," in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 905-912.
- Raptis, M., Kokkinos, I., and Soatto, S. (2012). "Discovering discriminative action parts from mid-level video representations," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1242-1249.
- Rawlinson, G. (2007). The significance of letter position in word recognition. IEEE Aerosp. Electron. Syst. Mag. 22, 26-27. doi:10.1109/MAES.2007.327521
- Reddy, K. K., and Shah, M. (2013). Recognizing 50 human action categories of web videos. Mach. Vis. Appl. 24, 971-981. doi:10.1007/s00138-012-0450-4
- Robertson, N., and Reid, I. (2006). A general method for human activity recognition in video. Comput. Vis. Image Understand. 104, 232-248. doi:10.1016/j.cviu.2006.07.006
- Rodriguez, M. D., Ahmed, J., and Shah, M. (2008). "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Rodríguez, N. D., Cuéllar, M. P., Lilius, J., and Calvo-Flores, M. D. (2014). A survey on ontologies for human behavior recognition. ACM Comput. Surv. 46, 1-33. doi:10.1145/2523819
- Rohrbach, M., Amin, S., Andriluka, M., and Schiele, B. (2012). "A database for fine grained activity detection of cooking activities," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1194-1201.
- Roshtkhari, M. J., and Levine, M. D. (2013). Human activity recognition in videos using a single example. Image Vis. Comput. 31, 864-876. doi:10.1016/j.imavis.2013.08.005
- Rudovic, O., Petridis, S., and Pantic, M. (2013). "Bimodal log-linear regression for fusion of audio and visual features," in Proc. ACM Multimedia Conference (Barcelona), 789-792.
- Sadanand, S., and Corso, J. J. (2012). "Action bank: a high-level representation of activity in video," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1234-1241.
- Salakhutdinov, R., Torralba, A., and Tenenbaum, J. B. (2011). "Learning to share visual appearance for multiclass object detection," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 1481-1488.
- Samanta, S., and Chanda, B. (2014). Space-time facet model for human activity classification. IEEE Trans. Multimedia 16, 1525-1535. doi:10.1109/TMM.2014.2326734
- Sanchez-Riera, J., Cech, J., and Horaud, R. (2012). "Action recognition robust to background clutter by using stereo vision," in Proc. European Conference on Computer Vision (Florence), 332-341.
- Sapienza, M., Cuzzolin, F., and Torr, P. H. S. (2014). Learning discriminative space-time action parts from weakly labelled videos. Int. J. Comput. Vis. 110, 30-47. doi:10.1007/s11263-013-0662-8
- Sargin, M. E., Yemez, Y., Erzin, E., and Tekalp, A. M. (2007). Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Trans. Multimedia 9, 1396-1403. doi:10.1109/TMM.2007.906583
- Satkin, S., and Hebert, M. (2010). "Modeling the temporal extent of actions," in Proc. European Conference on Computer Vision (Heraklion), 536-548.
- Schindler, K., and Gool, L. V. (2008). "Action snippets: how many frames does human action recognition require?," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Schuldt, C., Laptev, I., and Caputo, B. (2004). "Recognizing human actions: a local SVM approach," in Proc. International Conference on Pattern Recognition (Cambridge), 32-36.
- Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., and Pantic, M. (2011). "AVEC 2011 - the first international audio/visual emotion challenge," in Proc. International Audio/Visual Emotion Challenge and Workshop, Lecture Notes in Computer Science, Vol. 6975 (Memphis, TN), 415-424.
- Sedai, S., Bennamoun, M., and Huynh, D. Q. (2013a). Discriminative fusion of shape and appearance features for human pose estimation. Pattern Recognit. 46, 3223-3237. doi:10.1016/j.patcog.2013.05.019
- Sedai, S., Bennamoun, M., and Huynh, D. Q. (2013b). A Gaussian process guided particle filter for tracking 3D human pose in video. IEEE Trans. Image Process. 22, 4286-4300. doi:10.1109/TIP.2013.2271850
- Seo, H. J., and Milanfar, P. (2011). Action recognition from one example. IEEE Trans. Pattern Anal. Mach. Intell. 33, 867-882. doi:10.1109/TPAMI.2010.156
- Shabani, A. H., Clausi, D., and Zelek, J. S. (2011). "Improved spatio-temporal salient feature detection for action recognition," in Proc. British Machine Vision Conference (Dundee), 1-12.
- Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton, NJ: Princeton University Press.
- Shao, J., Kang, K., Loy, C. C., and Wang, X. (2015). "Deeply learned attributes for crowded scene understanding," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 4657-4666.
- Shivappa, S., Trivedi, M. M., and Rao, B. D. (2010). Audiovisual information fusion in human-computer interfaces and intelligent environments: a survey. Proc. IEEE 98, 1692-1715. doi:10.1109/JPROC.2010.2057231
- Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., et al. (2011). "Real-time human pose recognition in parts from single depth images," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 1297-1304.
- Shu, T., Xie, D., Rothrock, B., Todorovic, S., and Zhu, S. C. (2015). "Joint inference of groups, events and human roles in aerial videos," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 4576-4584.
- Siddiquie, B., Khan, S. M., Divakaran, A., and Sawhney, H. S. (2013). "Affect analysis in natural human interaction using joint hidden conditional random fields," in Proc. IEEE International Conference on Multimedia and Expo (San Jose, CA), 1-6.
- Sigal, L., Isard, M., Haussecker, H. W., and Black, M. J. (2012a). Loose-limbed people: estimating 3D human pose and motion using non-parametric belief propagation. Int. J. Comput. Vis. 98, 15-48. doi:10.1007/s11263-011-0493-4
- Sigal, L., Isard, M., Haussecker, H., and Black, M. J. (2012b). Loose-limbed people: estimating 3D human pose and motion using non-parametric belief propagation. Int. J. Comput. Vis. 98, 15-48. doi:10.1007/s11263-011-0493-4
- Singh, S., Velastin, S. A., and Ragheb, H. (2010). "MuHAVi: a multicamera human action video dataset for the evaluation of action recognition methods," in Proc. IEEE International Conference on Advanced Video and Signal Based Surveillance (Boston, MA), 48-55.
- Singh, V. K., and Nevatia, R. (2011). "Action recognition in cluttered dynamic scenes using pose-specific part models," in Proc. IEEE International Conference on Computer Vision (Barcelona), 113-120.
- Smola, A. J., and Schölkopf, B. (2004). A tutorial on support vector regression. Stat. Comput. 14, 199-222. doi:10.1023/B:STCO.0000035301.49549.88
- Snoek, C. G. M., Worring, M., and Smeulders, A. W. M. (2005). "Early versus late fusion in semantic video analysis," in Proc. Annual ACM International Conference on Multimedia (Singapore), 399-402.
- Soleymani, M., Pantic, M., and Pun, T. (2012). Multimodal emotion recognition in response to videos. IEEE Trans. Affective Comput. 3, 211-223. doi:10.1109/T-AFFC.2011.37
- Song, Y., Morency, L. P., and Davis, R. (2012a). "Multimodal human behavior analysis: learning correlation and interaction across modalities," in Proc. ACM International Conference on Multimodal Interaction (Santa Monica, CA), 27-30.
- Song, Y., Morency, L. P., and Davis, R. (2012b). "Multi-view latent variable discriminative models for action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 2120-2127.
- Song, Y., Morency, L. P., and Davis, R. (2013). "Action recognition by hierarchical sequence summarization," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 3562-3569.
- Soomro, K., Zamir, A. R., and Shah, M. (2012). UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild. CoRR abs/1212.0402.
- Sun, C., and Nevatia, R. (2013). "ACTIVE: activity concept transitions in video event classification," in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 913-920.
- Sun, Q. S., Zeng, S. G., Liu, Y., Heng, P. A., and Xia, D. S. (2005). A new method of feature fusion and its application in image recognition. Pattern Recognit. 38, 2437-2448. doi:10.1016/j.patcog.2004.12.013
- Sun, X., Chen, M., and Hauptmann, A. (2009). "Action recognition via local descriptors and holistic features," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (Los Alamitos, CA), 58-65.
- Tang, K. D., Yao, B., Fei-Fei, L., and Koller, D. (2013). "Combining the right features for complex event recognition," in Proc. IEEE International Conference on Computer Vision (Sydney, NSW), 2696-2703.
- Tenorth, M., Bandouch, J., and Beetz, M. (2009). "The TUM kitchen data set of everyday manipulation activities for motion tracking and action recognition," in Proc. IEEE International Workshop on Tracking Humans for the Evaluation of Their Motion in Image Sequences (THEMIS) (Kyoto), 1089-1096.
- Theodorakopoulos, I., Kastaniotis, D., Economou, G., and Fotopoulos, S. (2014). Pose-based human action recognition via sparse representation in dissimilarity space. J. Vis. Commun. Image Represent. 25, 12-23. doi:10.1016/j.jvcir.2013.03.008
- Theodoridis, S., and Koutroumbas, K. (2008). Pattern Recognition, Fourth Edn. Boston: Academic Press.
- Thurau, C., and Hlavac, V. (2008). "Pose primitive based human action recognition in videos or still images," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Tian, Y., Sukthankar, R., and Shah, M. (2013). "Spatiotemporal deformable part models for action detection," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Portland, OR), 2642-2649.
- Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211-244. doi:10.1162/15324430152748236
- Toshev, A., and Szegedy, C. (2014). "DeepPose: human pose estimation via deep neural networks," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 1653-1660.
- Tran, D., Yuan, J., and Forsyth, D. (2014a). Video event detection: from subvolume localization to spatiotemporal path search. IEEE Trans. Pattern Anal. Mach. Intell. 36, 404-416. doi:10.1109/TPAMI.2013.137
- Tran, K. N., Gala, A., Kakadiaris, I. A., and Shah, S. K. (2014b). Activity analysis in crowded environments using social cues for group discovery and human interaction modeling. Pattern Recognit. Lett. 44, 49-57. doi:10.1016/j.patrec.2013.09.015
- Tran, K. N., Kakadiaris, I. A., and Shah, S. K. (2012). Part-based motion descriptor image for human action recognition. Pattern Recognit. 45, 2562-2572. doi:10.1016/j.patcog.2011.12.028
- Turaga, P. K., Chellappa, R., Subrahmanian, V. S., and Udrea, O. (2008). Machine recognition of human activities: a survey. IEEE Trans. Circuits Syst. Video Technol. 18, 1473-1488. doi:10.1109/TCSVT.2008.2005594
- Urtasun, R., and Darrell, T. (2008). "Sparse probabilistic regression for activity-independent human pose inference," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Anchorage, AK), 1-8.
- Vemulapalli, R., Arrate, F., and Chellappa, R. (2014). "Human action recognition by representing 3D skeletons as points in a Lie group," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Columbus, OH), 588-595.
- Vinciarelli, A., Dielmann, A., Favre, S., and Salamin, H. (2009). "Canal9: a database of political debates for analysis of social interactions," in Proc. International Conference on Affective Computing and Intelligent Interaction and Workshops (Amsterdam: De Rode Hoed), 1-4.
- Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). "Show and tell: a neural image caption generator," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 3156-3164.
- Vrigkas, M., Karavasilis, V., Nikou, C., and Kakadiaris, I. A. (2013). "Action recognition by matching clustered trajectories of motion vectors," in Proc. International Conference on Computer Vision Theory and Applications (Barcelona), 112-117.
- Vrigkas, M., Karavasilis, V., Nikou, C., and Kakadiaris, I. A. (2014a). Matching mixtures of curves for human action recognition. Comput. Vis. Image Understand. 119, 27-40. doi:10.1016/j.cviu.2013.11.007
- Vrigkas, M., Nikou, C., and Kakadiaris, I. A. (2014b). "Classifying behavioral attributes using conditional random fields," in Proc. 8th Hellenic Conference on Artificial Intelligence, Lecture Notes in Computer Science, Vol. 8445 (Ioannina), 95-104.
- Wang, H., Kläser, A., Schmid, C., and Liu, C. L. (2011a). "Action recognition by dense trajectories," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3169-3176.
- Wang, J., Chen, Z., and Wu, Y. (2011b). "Action recognition with multiscale spatio-temporal contexts," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 3185-3192.
- Wang, Y., Guan, L., and Venetsanopoulos, A. N. (2011c). "Kernel cross-modal factor analysis for multimodal information fusion," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (Prague), 2384-2387.
- Wang, H., Kläser, A., Schmid, C., and Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103, 60-79. doi:10.1007/s11263-012-0594-8
- Wang, J., Liu, Z., Wu, Y., and Yuan, J. (2012a). "Mining actionlet ensemble for action recognition with depth cameras," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1290-1297.
- Wang, S., Yang, Y., Ma, Z., Li, X., Pang, C., and Hauptmann, A. G. (2012b). "Action recognition by exploring data distribution and feature correlation," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1370-1377.
- Wang, Z., Wang, J., Xiao, J., Lin, K. H., and Huang, T. S. (2012c). "Substructure and boundary modeling for continuous action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Providence, RI), 1330-1337.
- Wang, L., Hu, W., and Tan, T. (2003). Recent developments in human motion analysis. Pattern Recognit. 36, 585-601. doi:10.1016/S0031-3203(02)00100-0
- Wang, S., Ma, Z., Yang, Y., Li, X., Pang, C., and Hauptmann, A. G. (2014). Semi-supervised multiple feature analysis for action recognition. IEEE Trans. Multimedia 16, 289-298. doi:10.1109/TMM.2013.2293060
- Wang, Y., and Mori, G. (2008). "Learning a discriminative hidden part model for human action recognition," in Proc. Annual Conference on Neural Information Processing Systems (Vancouver, BC), 1721-1728.
- Wang, Y., and Mori, G. (2010). "A discriminative latent model of object classes and attributes," in Proc. European Conference on Computer Vision (Heraklion), 155-168.
- Wang, Y., and Mori, G. (2011). Hidden part models for human action recognition: probabilistic versus max margin. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1310-1323. doi:10.1109/TPAMI.2010.214
- Westerveld, T., de Vries, A. P., van Ballegooij, A., de Jong, F., and Hiemstra, D. (2003). A probabilistic multimedia retrieval model and its evaluation. EURASIP J. Appl. Signal Process. 2003, 186-198. doi:10.1155/S111086570321101X
- Wu, C., Zhang, J., Savarese, S., and Saxena, A. (2015). "Watch-n-patch: unsupervised understanding of actions and relations," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 4362-4370.
- Wu, Q., Wang, Z., Deng, F., Chi, Z., and Feng, D. D. (2013). Realistic human action recognition with multimodal feature selection and fusion. IEEE Trans. Syst. Man Cybern. Syst. 43, 875-885. doi:10.1109/TSMCA.2012.2226575
- Wu, Q., Wang, Z., Deng, F., and Feng, D. D. (2010). "Realistic human action recognition with audio context," in Proc. International Conference on Digital Image Computing: Techniques and Applications (Sydney, NSW), 288-293.
- Wu, X., Xu, D., Duan, L., and Luo, J. (2011). "Action recognition using context and appearance distribution features," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Colorado Springs, CO), 489-496.
- Xiong, Y., Zhu, K., Lin, D., and Tang, X. (2015). "Recognize complex events from static images by fusing deep channels," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1600-1609.
- Xu, C., Hsieh, S. H., Xiong, C., and Corso, J. J. (2015). "Can humans fly? Action understanding with multiple classes of actors," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 2264-2273.
- Xu, R., Agarwal, P., Kumar, S., Krovi, V. N., and Corso, J. J. (2012). "Combining skeletal pose with local motion for human activity recognition," in Proc. International Conference on Articulated Motion and Deformable Objects (Mallorca), 114-123.
- Yan, X., Kakadiaris, I. A., and Shah, S. K. (2014). Modeling local behavior for predicting social interactions towards human tracking. Pattern Recognit. 47, 1626-1641. doi:10.1016/j.patcog.2013.10.019
- Yan, X., and Luo, Y. (2012). Recognizing human actions using a new descriptor based on spatial-temporal interest points and weighted-output classifier. Neurocomputing 87, 51-61. doi:10.1016/j.neucom.2012.02.002
- Yang, W., Wang, Y., and Mori, G. (2010). "Recognizing human actions from still images with latent poses," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (San Francisco, CA), 2030-2037.
- Yang, Y., Saleemi, I., and Shah, M. (2013). Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1635-1648. doi:10.1109/TPAMI.2012.253
- Yang, Z., Metallinou, A., and Narayanan, S. (2014). Analysis and predictive modeling of body language behavior in dyadic interactions from multimodal interlocutor cues. IEEE Trans. Multimedia 16, 1766-1778. doi:10.1109/TMM.2014.2328311
- Yao, A., Gall, J., and Gool, L. V. (2010). "A Hough transform-based voting framework for action recognition," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (San Francisco, CA), 2061-2068.
- Yao, B., and Fei-Fei, L. (2010). "Modeling mutual context of object and human pose in human-object interaction activities," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (San Francisco, CA), 17-24.
- Yao, B., and Fei-Fei, L. (2012). Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1691-1703. doi:10.1109/TPAMI.2012.67
- Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L. J., and Fei-Fei, L. (2011). "Human action recognition by learning bases of action attributes and parts," in Proc. IEEE International Conference on Computer Vision (Barcelona), 1331-1338.
- Ye, M., Zhang, Q., Wang, L., Zhu, J., Yang, R., and Gall, J. (2013). "A survey on human motion analysis from depth data," in Time-of-Flight and Depth Imaging, Lecture Notes in Computer Science, Vol. 8200, eds M. Grzegorzek, C. Theobalt, R. Koch, and A. Kolb (Berlin, Heidelberg: Springer), 149-187.
- Yi, S., Krim, H., and Norris, L. K. (2012). Human activity as a manifold-valued random process. IEEE Trans. Image Process. 21, 3416-3428. doi:10.1109/TIP.2012.2197008
- Yu, G., and Yuan, J. (2015). "Fast action proposals for human action detection and search," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Boston, MA), 1302-1311.
- Yu, G., Yuan, J., and Liu, Z. (2012). "Propagative Hough voting for human activity recognition," in Proc. European Conference on Computer Vision (Florence), 693-706.
- Yun, K., Honorio, J., Chattopadhyay, D., Berg, T. L., and Samaras, D. (2012). "Two-person interaction detection using body-pose features and multiple instance learning," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (Providence, RI), 28-35.
- Zeng, Z., Pantic, M., Roisman, G. I., and Huang, T. S. (2009). A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31, 39-58. doi:10.1109/TPAMI.2008.52
- Zhang, Z., Wang, C., Xiao, B., Zhou, W., and Liu, S. (2013). Attribute regularization based human action recognition. IEEE Trans. Inform. Forensics Secur. 8, 1600-1609. doi:10.1109/TIFS.2013.2258152
- Zhang, Z., Wang, C., Xiao, B., Zhou, W., and Liu, S. (2015). Robust relative attributes for human action recognition. Pattern Anal. Appl. 18, 157-171. doi:10.1007/s10044-013-0349-3
- Zhou, Q., and Wang, G. (2012). "Atomic action features: a new feature for action recognition," in Proc. European Conference on Computer Vision (Florence), 291-300.
- Zhou, W., and Zhang, Z. (2014). Human action recognition with multiple-instance Markov model. IEEE Trans. Inform. Forensics Secur. 9, 1581-1591. doi:10.1109/TIFS.2014.2344448