Human Action Recognition

390 papers

2,499 followers

About this topic

Human Action Recognition is a field of computer vision and machine learning focused on identifying and classifying human actions in video sequences or images. It involves analyzing motion patterns and contextual information to enable systems to understand and interpret human behavior in various environments.

Key research themes

1. How can spatial-temporal feature extraction and data fusion improve accuracy and robustness in human action recognition?

This research theme investigates the development and integration of spatial and temporal features for human action recognition (HAR), focusing on methods that combine multiple feature types or modalities to capture intricate motion and appearance cues. The goal is to enhance recognition accuracy and robustness across varied environmental conditions and datasets by effectively modeling both static pose and dynamic movement patterns.

STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition

by Byoung Chul Ko

2024, 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Key finding: Proposes the STAR-transformer model which aggregates cross-modal data (video frames and skeleton sequences) into multi-class tokens using novel spatio-temporal attention mechanisms (zigzag and binary attention) to efficiently... Read more

Histogram of Oriented Gradient-Based Fusion of Features for Human Action Recognition in Action Video Sequences

by Sharnil Pandya

2021, Sensors

Key finding: Develops a feature descriptor fusing Histogram of Oriented Gradient (HOG) features with displacement and velocity to capture spatial gradient and motion information in video sequences. The fusion technique reduces descriptor... Read more

Human action recognition using trajectory-based representation

by Elsayed Hemayed

2022, Egyptian Informatics Journal

Key finding: Improves temporal relationship modeling by extracting trajectories via tracking spatio-temporal interest points (cuboids) using SIFT descriptor matching. The approach represents human actions by volumes around trajectory... Read more

Improving Skeleton-Based Action Recognition Using Part-Aware Graphs in a Multi-Stream Fusion Context

by Zois Tsakiris

2024, IEEE Access

Key finding: Introduces part-aware graphs to improve skeleton-based HAR by segregating skeleton data into semantically meaningful parts emphasizing motion-relevant areas. The multi-stream fusion aggregates different part-based graph... Read more

Multi-Sensor-Based Action Monitoring and Recognition via Hybrid Descriptors and Logistic Regression

by ABDULWAHAB ALAZEB

2025, IEEE Access

Key finding: Integrates multi-modal sensor data—combining inertial (accelerometers, gyroscopes) and computer vision inputs (RGB, skeleton data)—to extract time-frequency and geometric features. Their fusion using logistic regression... Read more

2. What are the effective dimensionality reduction strategies for handling high-dimensional features in large-scale human action recognition datasets?

This area focuses on addressing the computational and storage challenges posed by increasingly high-dimensional feature vectors, especially those derived from Fisher vectors and Bag-of-Words models on large-scale datasets. The studies explore how dimensionality reduction techniques such as principal component analysis (PCA) or learned projections can unearth latent structures in feature spaces, reduce redundancy, and facilitate efficient and accurate classification in expansive HAR datasets comprising numerous action classes and real-world variability.
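As a rough illustration of the PCA option mentioned above, the following sketch projects high-dimensional descriptors onto their top principal components using NumPy's SVD. The function name `pca_reduce`, the random stand-in data, and the target dimension are assumptions chosen for illustration; they are not taken from any of the listed papers.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project row-vector features onto their top principal components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Thin SVD: rows of vt are principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components


rng = np.random.default_rng(0)
# Hypothetical stand-in for Fisher vectors: 50 videos, 2,000 dimensions each.
fisher_like = rng.normal(size=(50, 2000))
reduced, mean, components = pca_reduce(fisher_like, n_components=16)
print(reduced.shape)  # (50, 16)
```

New samples would be projected with the stored `mean` and `components`, so that training and test descriptors share the same low-dimensional space.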

Dimensionality reduction of Fisher vectors for human action recognition

by Roland Goecke

2023, IET Computer Vision

Key finding: Demonstrates that reducing the dimension of high-dimensional Fisher vector features (up to ~500K dimensions) using projection techniques can maintain or improve classification performance on large-scale unconstrained datasets... Read more

3. How can skeletal data and body part representations be leveraged for efficient and interpretable human action recognition?

This theme explores techniques that utilize human skeleton-based features and body part models to improve interpretability, reduce feature size, and increase recognition accuracy. Approaches include representing body dimensions variations, part-based graph models, and compact skeleton descriptors to capture meaningful and discriminative motion patterns. Such methods offer the advantage of robustness to occlusion and viewpoint changes and facilitate lightweight, explainable HAR systems.

Human Action Recognition Utilizing Variations in Skeleton Dimensions

by Heba Elnemr and

2018

Key finding: Proposes an action recognition method exploiting global variations in skeleton-derived human body dimensions during motion, using both 2D and 3D data. Achieves high accuracy (above 94%) across Weizmann, Berkeley MHAD, and... Read more

Improving Skeleton-Based Action Recognition Using Part-Aware Graphs in a Multi-Stream Fusion Context

by Zois Tsakiris

2024, IEEE Access

Key finding: Uses part-aware graph convolutional networks to isolate and emphasize dominant skeleton sub-parts across actions, improving feature discrimination. Fusion of streams trained on different parts yields substantial performance... Read more

All papers in Human Action Recognition

A survey of depth and inertial sensor fusion for human action recognition

by Chen Chen

A number of review or survey articles have previously appeared on human action recognition where either vision sensors or inertial sensors are used individually. Considering that each sensor modality has its own limitations, in a number... more

Video-Based Abnormal Human Behavior Recognition—A Review

by Oluwatoyin Popoola

Modeling human behaviors and activity patterns for recognition or detection of special event has attracted significant research interest in recent years. Diverse methods that are abound for building intelligent vision systems aimed at... more

Fig. 1. Frequency of publications in the area of human abnormal behavior detection between 2002 and 2011 as reviewed in this survey. Note the rising trend in research interest in the area particularly spiking to new levels from 2008.

Fig. 2. General process of feature-based modeling and detection of anomalies in video sequences.

Fig. 3. Different types of anomalies in various contexts. (i) UMN dataset: anomaly in crowd movement caused by panic [16]. (ii) UCSD dataset: anomalies are (a) biker, (b) skater, and (c) vehicle in a pedestrian walkway [138]. (iii) Suspicious behavior: different from walking or jogging [102]. (iv) Unusual speed: running in a shopping mall [109]. (v) Subway: red masks in the yellow rectangle indicate the location of anomaly such as going in wrong directions of exit (top row) and entry (bottom row) areas. E and F show “no-payment” events [118].

Oluwatoyin P. Popoola (M’09) received the B.Sc. and M.Sc. degrees in industrial and production engineering from the University of Ibadan, Ibadan, Nigeria, in 1999 and 2003, respectively. He is currently working toward the Ph.D. degree with the Pattern Recognition and Intelligent Systems Laboratory, College of Automation, Harbin Engineering University, Harbin, China. He joined the faculty at Systems Engineering De-

KEY POINTS OF PREVIOUS RELATED SURVEYS

behavior abstraction and representation. Sections V-VII contain our intuitive characterization of the research to highlight peculiar issues under different contexts, to which the researcher needs to pay attention. Finally, the concluding part (Section VIII) points out important observations and areas that need further research.

POSSIBLE THEMES FOR GROUPING THE RESEARCH PUBLICATIONS

number of clusters that have unique cluster structures and possible semantic meaning. Local features are extracted and clustered using low-level abstraction to describe activities (e.g., either pixel- or object-level features). A statistical model is built, and a clustering-based outlier detection algorithm is trained using labeled or unlabeled data of either only normal or both normal and abnormal event features. Objects that are not located ... approaches have been widely adapted and extended to solve anomaly-detection problems in video-based applications.

Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos

by Fabio Cuzzolin and

In this work we propose a new approach to the spatiotemporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. Our framework is composed of three stages. In stage 1, a cascade... more

Figure 2: Action tube detection in a ‘biking’ video taken from UCF-101 [25]. (a) Side view of the detected action tubes where each colour represents a particular instance. The detection boxes in each frame are linked up to form space-time action tubes. (b) Illustration of the ground-truth temporal duration for comparison. (c) Viewing the video as a 3D volume with selected image frames; notice that we are able to detect multiple action instances in both space and time. (d) Top-down view.

Performance comparison on UCF-101. Table 1 presents the results we obtained on UCF-101, and compares them to the previous state-of-the-art [32, 37]. We achieve an mAP of 66.75% compared to 46.77% reported by [32] (a 20% gain), at the standard threshold of δ = 0.2. At a threshold of δ = 0.4 we still get a high score of 46.35% (comparable to 46.77% [32] at δ = 0.2). Note that we are the first to report results on UCF-101 up to δ = 0.6, attesting to the robustness of our approach to more accurate localisation requirements. Although our separate ...

Table 1: Quantitative action detection results (mAP) on the UCF-101 dataset.

instances of the same action class. To achieve a broader comparison with the state-of-the-art, we also ran tests on the J-HMDB-21 [12] dataset. The latter is a subset of HMDB-51 [18] with 21 action categories and 928 videos, each containing a single action instance and trimmed to the action’s duration. Finally we conducted experiments on the more challenging LIRIS-HARL dataset, which contains 10 action categories, including human-human interactions and human-object interactions (e.g., ‘discussion of two or several people’, and ‘a person types on a keyboard’). In addition to containing multiple space-time actions, some of which occur concurrently, the dataset contains scenes where relevant human actions take place amidst other irrelevant human motion. For all datasets we used the exact same evaluation metrics and data splits as in the original

Table 2: Quantitative action detection results (mAP) on the J-HMDB-21 dataset.

Table 3: Classification accuracy on the J-HMDB-21 dataset.

Performance comparison on LIRIS-HARL. LIRIS HARL allows us to demonstrate the efficacy of our approach on temporally un-trimmed videos with co-occurring actions. For this purpose we use LIRIS-HARL’s specific evaluation tool - the results are shown in Table 4. Our results are compared with those of i) VPULABUAM-13 [22] and ii) IACAS-51 [1( from the original LIRIS HARL detection challenge. In this case, our method outperforms the competitors by an even larger margin. We report space-time detection results by fixing the threshold quality level to 10% for the four thresholds [33] and measuring temporal precision and recall along with spatial precision and recall, to produce an integrated score. We refer the readers to [33] for more details on LIRIS HARL’s evaluation metrics.

Table 4: Quantitative action detection results on the LIRIS-HARL dataset.

We also report in Table 5 the mAP scores obtained by the spatial, flow and the fusion detection models, respectively (note that there is no prior state of the art to report in this case). Again, we can observe an improvement of 7% mAP at δ = 0.2 due to our fusion strategy. To demonstrate the advantage of our 2nd pass of DP (§ 3.4), we also generate results (mAP) using only the first DP pass (§ 3.4). Without the 2nd pass, performance decreases by 20%, highlighting the importance of temporal trimming in the construction of action tubes.

Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos

by Chen Chen

Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis (e.g. action detection and recognition) has been limited due to... more

Enhanced skeleton visualization for view invariant human action recognition

by Chen Chen

Human action recognition based on skeletons has wide applications in human-computer interaction and intelligent surveillance. However, view variations and noisy data bring challenges to this task. What's more, it remains a problem to... more

A review on vision techniques applied to Human Behaviour Analysis for Ambient-Assisted Living

by Francisco Flórez-Revuelta

2012, Expert Systems with …

Human Behaviour Analysis (HBA) is more and more being of interest for computer vision and artificial intelligence researchers. Its main application areas, like Video Surveillance and Ambient-Assisted Living (AAL), have been in great... more

A Multiclass ELM Strategy in Pose-Based 3D Human Motion Analysis

by Mohamad Ivan Fanany

This paper pursues the best multiclass classification strategy for pose-based 3D human motion recognition using Extreme Learning Machines (ELM). Such classification task is one of the most difficult classification problems because the pose... more

Human activity recognition for domestic robots

by lasitha piyathilaka and

Capabilities of domestic service robots could be further improved, if the robot is equipped with an ability to recognize activities performed by humans in its sensory range. For example in a simple scenario a floor cleaning robot can... more

Coupled Generative Adversarial Network for Continuous Fine-grained Action Segmentation

by Harshala Gammulle

2019, Winter Conference on Applications of Computer Vision

We propose a novel conditional GAN (cGAN) model for continuous fine-grained human action segmentation, that utilises multi-modal data and learned scene context information. The proposed approach utilises two GANs: termed Action GAN and... more

Human Action Recognition: TT1S new criterion to evaluate the complexity

by Bashar Wannous

2017, irjet

In this paper, we propose a new Human Action Recognition algorithm depending on hybrid features extraction from silhouettes and Neural Networks for classification. The hybrid features include contour history images which is a new method,... more

Action recognition by saliency-based dense sampling

by Chen Chen

Action recognition, aiming to automatically classify actions from a series of observations, has attracted more attention in the computer vision community. The state-of-the-art action recognition methods utilize dense... more

Spatio-temporal Human Action Localisation and Instance Segmentation in Temporally Untrimmed Videos

by Fabio Cuzzolin

Current state-of-the-art human action recognition is focused on the classification of temporally trimmed videos in which only one action occurs per frame. In this work we address the problem of action localisation and instance... more

Delay reduction in real-time recognition of human activity for stroke rehabilitation

by Maryam Najafian and

Assisting patients to perform activities of daily living (ADLs) is a challenging task for both human and machine. Hence, developing a computer-based rehabilitation system to re-train patients to carry out daily activities is an essential... more

Fine-grained Action Segmentation using the Semi-Supervised Action GAN

by Harshala Gammulle

2020, Pattern Recognition Journal

In this paper we address the problem of continuous fine-grained action segmentation, in which multiple actions are present in an unsegmented video stream. The challenge for this task lies in the need to represent the hierarchical nature... more

Selecting active frames for action recognition with vote fusion method

by Binh Tieu

Convolutional Neural Networks, especially 3-Dimensional Convolutional Neural Networks (3DCNNs), have recently been widely used for human action recognition (HAR) in videos. In this paper, we use a multi-stream framework which is a... more

A Survey on Human Action Recognition

by International Journal of Advance Research in Computer Science and Management Studies [IJARCSMS] ijarcsms.com

2020, International Journal of Advance Research in Computer Science and Management Studies [IJARCSMS] ijarcsms.com

Gotten from fast advances in computer vision and AI, video investigation errands have been moving from inferring the present state to anticipating the future state. Vision-based activity acknowledgment and forecast from... more

Kinematic Features For Human Action Recognition Using Restricted Boltzmann Machines

by Mohamad Ivan Fanany

Human Action recognition research is an interesting and active field of research in the current years. Human Action Recognition (HAR) has many potential and promising applications, in such fields as security, surveillance, professional... more

3D Action Recognition Using Multi-scale Energy-based Global Ternary Image

by Chen Chen

This paper presents an effective multi-scale energy-based Global Ternary Image (GTI) representation for action recognition from depth sequences. The unique property of our representation is that it takes the spatial-temporal... more

Figure 19: Confusion matrix on MSRGesture3D dataset with protocol of work

Figure 20: Original action snaps from SKIG dataset.

Figure 21: Action snaps from SKIG dataset, where backgrounds are removed.

Figure 1: Comparison between DMM and GTI. (a) is a depth sequence of action “bend”. (b) shows the DMM of the front view projection generated using all depth frames [5]. (c) shows the GTIs of the front view projection generated using consecutive depth frames. Pixels in red, green and blue colors denote positive, negative and neutral states, respectively. (Best viewed in color)

Figure 3: Extraction of GBI and GTI. (c) shows GBIs which denote motion regions (colored in pink). (d) shows GTIs which denote both motion regions and motion directions (pixels in green stand for negative motion and pixels in red stand for positive motion). (e) shows that each GTI can be described by two GBIs. (Best viewed in color)
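The captions above describe a GTI as marking each pixel as positive, negative, or neutral motion, and note that each GTI can be described by two GBIs. A minimal sketch of that idea follows, assuming the ternary state comes from thresholding the difference between two consecutive depth frames. The function names, the threshold value, and the toy frames are hypothetical, and the view projection used in the paper is omitted.

```python
import numpy as np


def ternary_image(prev_frame, next_frame, threshold=10):
    """Label each pixel +1 (positive), -1 (negative) or 0 (neutral) motion."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = next_frame.astype(np.int32) - prev_frame.astype(np.int32)
    gti = np.zeros(diff.shape, dtype=np.int8)
    gti[diff > threshold] = 1
    gti[diff < -threshold] = -1
    return gti


def split_into_binary(gti):
    """Describe one ternary image by two binary images."""
    return gti == 1, gti == -1


# Toy 2x3 depth frames (assumed values, not from any dataset).
prev_frame = np.array([[100, 100, 100], [100, 100, 100]], dtype=np.uint16)
next_frame = np.array([[100, 150, 100], [40, 100, 105]], dtype=np.uint16)

gti = ternary_image(prev_frame, next_frame)
positive, negative = split_into_binary(gti)
print(gti.tolist())  # [[0, 1, 0], [-1, 0, 0]]
```

The two binary masks together carry the same information as the ternary image, which is why a descriptor can be computed on each mask separately and then concatenated.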

Figure 4: Comparison between GBI and GTI. Two depth frames are used to denote action #1 (a) and action #2 (d). GBI (b) and GTI (c) of the side view projection are generated for action #1. GBI (e) and GTI (f) of the side view projection are generated for action #2. (Best viewed in color)

Figure 5: The effect of Radon Transform. Radon Transform describes GBIs of the front view projection generated using (a) original depth frames, (b) depth frames added by 20% pepper noise, (c) depth frames partially occluded by occlusion #1 (simulated by ignoring pixels located in the green region), (d) depth frames partially occluded by occlusion #2. Number in (e) and (f) denotes correlation coefficient. (Best viewed in color)

Figure 6: Feature extraction from GTI. (b) shows two GBIs, denoting positive and negative motion regions of the GTI in (a). (c) shows feature maps generated by describing GBIs using Radon Transform. Feature maps in (c) are converted to feature vectors in (d), which are concatenated to describe the GTI in (a). (Best viewed in color)

Figure 7: Illustration of energy-based sampling method. Action “bend” performed by person #1 and person #2 are shown in (a) and (d), respectively. (b) and (e) are the accumulated motion energy curves, calculated by accumulating frame-to-frame motion energy. Depth frames in (c) and (f) are sampled from (a) and (d), respectively. The sampling criterion is to keep frame-to-frame motion energy of the sampled sequence nearly the same. (Best viewed in color)

Algorithm 1: Energy-based sampling method

motion energies between consecutive frames are nearly the same. In this way, the sampled sequence suffers less from the effect of speed variations. Correspondingly, an energy-based sampling method is proposed to sample frames from original depth sequences. As shown in Fig. 7 (c) and (f), the sampled sequences are similar to each other, indicating slight effect from speed variations. The GTI extracted from sampled sequences is termed as Energy-based GTI (E-GTI), which inherits the merits of GTI and shows robustness to speed variations. Following paragraphs mainly focus on the energy-based sampling method.
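The body of Algorithm 1 is not reproduced on this page, so the following is only one plausible reading of the sampling criterion: accumulate frame-to-frame motion energy and pick frames at equally spaced levels of the accumulated curve. The energy definition (count of pixels whose depth changes by more than a threshold), the function name, and the toy sequence are all assumptions.

```python
import numpy as np


def energy_based_sample(frames, num_samples, threshold=10):
    """Pick frame indices so that motion energy between picks is roughly equal."""
    frames = np.asarray(frames, dtype=np.int32)
    # Assumed energy: number of pixels that change by more than `threshold`.
    diffs = np.abs(np.diff(frames, axis=0))
    step_energy = (diffs > threshold).reshape(len(frames) - 1, -1).sum(axis=1)
    cumulative = np.concatenate(([0], np.cumsum(step_energy)))
    # Equally spaced targets on the accumulated motion energy curve.
    targets = np.linspace(0, cumulative[-1], num_samples)
    indices = np.searchsorted(cumulative, targets, side="left")
    return np.clip(indices, 0, len(frames) - 1)


# Toy sequence: 7 frames of 1x4 pixels, slow motion first, then fast motion.
changed_pixels = [0, 0, 0, 1, 2, 4, 4]
frames = np.zeros((7, 1, 4))
for t, count in enumerate(changed_pixels):
    frames[t, 0, :count] = 100

print(energy_based_sample(frames, num_samples=3))  # [0 4 5]
```

In the toy sequence the accumulated energy between the selected frames is 2 in both intervals, even though the first interval spans four frame steps and the second only one. A sequence with no motion at all would return index 0 repeatedly under this sketch.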

Figure 8: Illustration of multi-scale sampled sequences. (a) is an original sequence. (b), (c) and (d) are sampled depth sequences, which record different scales of motions. (Best viewed in color)

Figure 9: Action snaps from MSRAction3D dataset.

sequence $S_M$ with $M$ frames, which can be represented by a set of low-level E-GTIs. Similarly, a representation is formed to describe $S_M$. However, a sampled sequence $S_M$ only preserves one certain scale of motion from the original sequence Z. As shown in Fig. 8, three sampled sequences, i.e., $S_3$, $S_6$ and $S_8$, are sampled from an action “bend”. It is noted that the parameter $M$ is set to 3, 6 and 8 for example. As can be seen, three sampled sequences record different scales of motions. Specifically, motions in larger scale are captured by sequence #1 and sequence #2, meanwhile motions in smaller scale are captured by sequence #3. To capture different scales of motion information, we sample multi-scale depth sequences to give a detailed description of Z. We set parameter $M$ in Algorithm 1 to $M_1, \ldots, M_L$, which produces a number of $L$ sampled sequences, i.e., $S_{M_1}, \ldots, S_{M_L}$, from sequence Z. Each sampled sequence $S_{M_l}$ has its own representation. By concatenating representations of all sampled sequences, we obtain representation-level fused representation.

Figure 11: Confusion matrix on MSRAction3D with protocol of work

Table IV: Comparison between our method and related works on the MSRAction3D dataset with protocol of work [31]. The original work is colored in blue. parameters achieve optimal performances in certain scopes.

Figure 12: Evaluation of robustness to depth noise. (a) Depth frames affected by 0% and 20% percentage of pepper noise. (b) Recognition results on MSRAction3D dataset with different percentages of pepper noise.

Figure 13: Eight types of occluded depth sequences. Each sequence is denoted as six frames for example. Table V: Evaluation of the robustness to partial occlusions.

values of GBIs are directly concatenated as the descriptor of GTI, which is named as Bag of GTIs (without RT).

Table VI: Evaluation of the robustness of speed variations.

Table VII: Comparison between our method and related works on the DHA dataset with protocol of work. Original work is colored in blue.

... curl”, “leg-kick”, “one-hand-wave”, “pitch”, “p-jump”, “rod-swing”, “run”, “skip”, “side”, “side-box”, “side-clap”, “tai-chi”, “two-hand-wave”, “walk”. Each action is performed by 21 people (12 males and 9 females), resulting in 483 depth sequences. We use an extended version of DHA dataset where six additional action categories are involved. In Fig. 16, “golf-swing” and “rod-swing” share similar motions that involve moving hands from one side up to the other side. A few more such similar pairs can be found, like “leg-curl” and “leg-kick”, “run” and “walk”, etc. Bounding boxes of front, side and top views are resized to fixed sizes of 102 x 54, 102 x 75, 75 x 54. Other parameters are the same with MSRAction3D dataset.

Figure 16: Action snaps from DHA dataset.

Table VIII: Comparison between our method and related works on the MSRGesture3D dataset with protocol of work [14].

Table IX: Comparison between our method and related works on the SKIG dataset with protocol of work [68]. Original work is colored in blue.

Figure 22: Confusion matrix on SKIG dataset with protocol of work [68].

Table X: Evaluation of single scale and multi-scale structures. worse on MSRAction3D-Order dataset than on MSRAction3D dataset. This is because one action type and its opposite type contain similar motion regions, which bring extra challenges to the task of classification. BZ, achieves an accuracy of 71.37%, which is higher than that of both BE and shape BE pr: This improvement is justifiable because GTI captures directional information, which is essential to distinguish one action type from its opposite type.

Figure 24: Evaluation of single scale and multi-scale structures.

Figure 25: Difficult cases in MSRAction3D dataset.

Table I: Selection of kr, P on the training samples of MSRAction3D dataset with 10-fold procedure. Training samples are defined in previous work.

Table II: Selection of L on the training samples of MSRAction3D dataset with 10-fold procedure. Training samples are defined in previous work [42].

various speeds on an MSRAction3D-Speed dataset, which contains totally different speeds between training and testing sets. Specifically, we reserve all the sequences performed by subjects #1, 3, 5, 7, 9 and randomly select half the number of frames for sequences performed by subjects #2, 4, 6, 8, 10. Based on the original time order, the selected frames are concatenated to form new sequences. Fig. 14 shows that the difference in average frames between the training and testing sets of the new dataset has been enlarged. Since random sampling method is used, many key frames may be ignored in new sequences, which makes action recognition more challenging. Comparing linear sampling method with our random sampling method (see Fig. 15), we infer that action speeds in MSRAction3D-Speed dataset may change dramatically in a non-linear manner.

Figure 15: Comparison between linear sampling and random sampling.

Stacked Denoising Autoencoder for feature representation learning in pose-based action recognition

by Mohamad Ivan Fanany

In this paper, we studied Stacked Denoising Autoencoder (SDA) model for Human pose-based action recognition. We used public dataset Chalearn 2013 which contains Italian body language actions from 27 persons. We studied two models of SDA... more

Action Recognition Using 3D histograms of Texture and A Multi-class Boosting Classifier

by Chen Chen

Human action recognition is an important yet challenging task. This paper presents a low-cost descriptor called 3D Histograms of Texture (3DHoTs) to extract discriminant features from a sequence of depth maps. 3DHoTs are derived from... more

Fig. 1. Salient Information (SI) maps. From the left to the right: front (f) view, side (s) view and top (t) view.

orthogonal Cartesian planes. After obtaining each projected map, its motion energy is computed by thresholding the difference between consecutive maps. The binary map of motion energy provides a strong clue of the action category being performed and indicates motion regions or where movement happens in each temporal interval.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2017.2718189, IEEE Transactions on Image Processing

Fig. 2. Sign and magnitude components extracted from a sample block. (a) 3x3 sample block; (b) the local differences; (c) the sign component of block; and (d) the magnitude component of block.
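The decomposition named in this caption can be sketched as follows, assuming the local differences are taken between each of the eight neighbours and the centre pixel of the 3x3 block, as in completed-LBP-style descriptors. The function name, the neighbour ordering, and the sample values are hypothetical.

```python
import numpy as np


def sign_magnitude(block):
    """Split the local differences of a 3x3 block into sign and magnitude."""
    block = np.asarray(block, dtype=np.int32)
    center = block[1, 1]
    # Eight neighbours in row-major order, with the centre pixel removed.
    neighbours = np.delete(block.reshape(-1), 4)
    diff = neighbours - center
    sign = np.where(diff >= 0, 1, -1)
    magnitude = np.abs(diff)
    return diff, sign, magnitude


# Assumed sample block, not taken from the paper's figure.
block = [[9, 12, 34],
         [10, 25, 28],
         [99, 64, 56]]
diff, sign, magnitude = sign_magnitude(block)
print(diff.tolist())       # [-16, -13, 9, -15, 3, 74, 39, 31]
print(sign.tolist())       # [-1, -1, 1, -1, 1, 1, 1, 1]
print(magnitude.tolist())  # [16, 13, 9, 15, 3, 74, 39, 31]
```

Multiplying the sign and magnitude components element by element recovers the local differences, so the two components together lose no information.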

Fig. 4. An example of basketball-shoot action from UTD-MHAD dataset. The first row shows the color images, the second row shows the depth images.

Fig. 5. KELM performance w.r.t. parameter on the MSRAction3D dataset

Finally, the execution time of our system is calculated, intending to reveal the feasibility of our system for a real-time application. To this end, we have set up a simulation platform using MATLAB on an Intel i5 Quadcore 3.2 GHz desktop computer with 8GB of RAM. It can be seen that the proposed method is able to process over 120 frames per second.

Fig. 6. System performance w.r.t. parameter A on two datasets

RECOGNITION ACCURACY (%) OF DIFFERENT FEATURE AND CLASSIFIER COMBINATIONS ON MSRACTION3D DATASET

RECOGNITION ACCURACY (%) OF DIFFERENT FEATURE AND CLASSIFIER COMBINATIONS ON MSRGESTURE3D DATASET

TABLE III RECOGNITION ACCURACY (%) OF DIFFERENT FEATURE AND CLASSIFIER COMBINATIONS ON UTD-MHAD DATASET

TABLE IV THREE SUBSETS OF ACTIONS USED FOR MSRACTION3D DATASET

Setting 1 - The experimental setting reported in [11] is adopted. Specifically, the actions are divided into three subsets as listed in Table IV. For each subset, three different tests are carried out. In the first test, 1/3 of the samples are used for training and the rest for testing; in the second test, 2/3 of the samples are used for training and the rest for testing; in the cross-subject test, one half of the subjects (1, 3, 5, 7, 9) are used for training and the rest for testing.

Setting 2 - The experimental setup suggested by [46] is used. A total of 20 actions are employed and one half of the subjects (1, 3, 5, 7, 9) are used for training and the remaining subjects are used for testing.

COMPARISON OF RECOGNITION ACCURACIES (%) OF OUR METHOD AND EXISTING METHODS ON MSRACTION3D DATASET USING SETTING 1 TABLE V

RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON MSRACTION3D DATASET

RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON MSRGESTURE3D DATASET

RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON UTD-MHAD DATASET

As seen from the results reported in Table IX, our algorithm outperforms all the prior arts including several recent ones except for [22]. It reveals that our MBC framework indeed works well even if feeding two different types of features. The major reason that our performance is worse than that of [22] lies in the fact that we are mainly based on the depth features extracted from the raw depth signal but the work in [22] employs more sophisticated skeleton-based features, which can better interpret the human actions when a challenging dataset is given. Though we have integrated the skeleton information here in order to verify whether our multi-class boosting framework can handle two different types of features, our skeleton features encoding only the joint position differences are very simple, in contrast to [22] that uses group sparsity and geometry constrained dictionary learning to further enhance the skeleton feature representation. According to their results, the

RECOGNITION ACCURACIES (%) OF OUR METHOD AND DEEP LEARNING METHODS ON MSRACTION3D DATASET USING SETTING 1

RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON MSRACTIVITY3D DATASET

The baseline methods mentioned above deploy traditional handcrafted features. In contrast, deep learning models learn the feature representation from raw data and generate a high-level semantic representation [26], [27], which represents the latest development in action recognition. Here, we compare our method with two deep models: one is SMF-BDL [26] and the other is a DMM-Pyramid approach based on both traditional 2D CNN and 3D CNN for action recognition. Similar to MBC, the decision-level fusion method is used to combine different deep CNN models. To validate the proposed 3DHoT-MBC method, we conduct the same experiment as those of the two methods. Note that the comparative results are all taken from their reference papers. The results in Table X and Table XI show that 3DHoT-MBC is even superior to the two deep learning methods.

TABLE IX

RECOGNITION ACCURACY (%) OF DIFFERENT FEATURE AND CLASSIFIER COMBINATIONS ON ACTION -MHAD DATASET

We also verify our algorithm on the DHA dataset [61]. DHA contains 23 action categories where the first 10 categories follow the same definitions in the Weizmann action dataset [65] and the 11th to 16th actions are extended categories. The 17th to 23rd are the categories of selected sport actions. Each of the 23 actions was performed by 21 different individuals (12 males and 9 females), resulting in 483 action samples. Table XIII shows the recognition results of our method against existing algorithms on the DHA dataset. Again, our method achieves the best recognition performance.

RECOGNITION ACCURACY (%) COMPARED WITH EXISTING METHODS ON DHA DATASET


Online Real-time Multiple Spatiotemporal Action Localisation and Prediction

by Fabio Cuzzolin and

2017, Proceedings of the International Conference on Computer Vision (ICCV)

We present a deep-learning framework for real-time multiple spatio-temporal (S/T) action localisation and classification. Current state-of-the-art approaches work offline, and are too slow to be useful in real-world settings. To overcome... more


Human Action Recognition Utilizing Variations in Skeleton Dimensions

by Heba Elnemr and

This paper presents a human action recognition system that distinguishes between different actions using a new set of features based on global variation in the visual appearance of the subject body. The proposed technique utilizes the... more

Fig. 3 Parameters extracted from 2D videos

Fig. 1 A block diagram of the proposed system

Fig. 2 Skeleton extraction. The first row is frames for a waving two hands action, and the second row is the corresponding detected skeleton

Wang et al. [15] perform action recognition through modeling the spatiotemporal configurations of human poses. The approach constructs five body parts by grouping the esti-

Evangelidis et al. [13] use skeletal quads as the skeletal features to describe an action sequence. A Fisher vector (which is more discriminating than the popular bag of words approach in a recognition context [14]) is used to encode skeletal features with the help of a trained Gaussian mixture model. Finally, classification is performed by a multi-class SVM. The main concern of the work is gesture recognition. This is done by a labeling framework where each frame is assigned a label that denotes the gesture ID that the frame belongs to.

3.1 Skeleton Extraction

In case of 2D videos, the skeleton of the subject person is extracted to represent the movement as shown in Fig. 2. The skeleton extraction is implemented as follows:

Fig. 4 Parameters extracted from 3D videos

Furthermore, a sketch is classified as a cross if the number of intersections and end points is limited (i.e., it is a writing not a mess). Besides, several conditions must be confirmed. These conditions are demonstrated in Fig. 7. First, an intersection in the middle of the image is detected (see Fig. 7a). In addition, a significantly large closed area must be found (Fig. 7b). Finally, vertically in both directions top and bot-

Fig. 5 Some sketches extracted using free hand writing-tracking

Fig. 6 A circle drawing detection. a False detected circle. b True detected circle. c Snapshots for a circle drawing action

Fig. 7 A cross drawing detection (a-c). d Snapshots for a cross drawing action

Fig. 8 A tick mark detection (a-c). d Snapshots for a tick mark drawing action

A parameter is encoded as bell shape if there exists one global maximum and the value of the parameter is increasing at the first third of the action and decreasing at the last third of the action. The middle third is not taken into account so as to allow a steady state in the middle of the action. For the U shape encoding, the inverse of the above conditions is taken. During the encoding process of the parameters, the very beginning and ending frames of the video are neglected to avoid the extra movements at the beginning and the end of the action performance.

A parameter is encoded decreasing if its value at the beginning of the action is more than that at the end of the action and if its value is decreasing during performing the action more than 75% of the action duration. On the other hand, a parameter is encoded increasing if the opposite of the above conditions is found. The percentage 75% is taken to overcome noise due to extra movements.

Figure 9 demonstrates the parameter and the video encoding process. Each parameter is computed for each frame (Fig. 9a) where k is a parameter and L is the number of frames in the video, and then it is encoded to be a vector of six elements as shown in Fig. 9b. All encoded parameters are
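The encoding rules above can be illustrated with a small sketch. It is an interpretation rather than the authors' implementation: "increasing at the first third" is read here as a net rise across that third, the function name `encode_trend` and its `trim` and `ratio` parameters are hypothetical, and the six-element vector of Fig. 9b is not reproduced because its layout is not given in the text.

```python
def encode_trend(values, trim=0, ratio=0.75):
    """Label a per-frame parameter as increasing, decreasing, bell, u_shape or none.

    trim  : frames dropped at each end to ignore extra movements (assumption).
    ratio : fraction of frame-to-frame steps that must agree (75% in the text).
    """
    v = list(values)[trim:len(values) - trim]
    n = len(v)
    if n < 3:
        return "none"

    diffs = [b - a for a, b in zip(v, v[1:])]
    falling = sum(d < 0 for d in diffs) / len(diffs)
    rising = sum(d > 0 for d in diffs) / len(diffs)

    # Monotonic trends: compare the end points and the share of agreeing steps.
    if v[0] > v[-1] and falling > ratio:
        return "decreasing"
    if v[0] < v[-1] and rising > ratio:
        return "increasing"

    # Bell / U shape: only the first and last thirds are inspected,
    # the middle third is free to hold a steady state.
    third = n // 3
    head_rises = v[third - 1] > v[0]
    head_falls = v[third - 1] < v[0]
    tail_falls = v[-1] < v[n - third]
    tail_rises = v[-1] > v[n - third]

    if v.count(max(v)) == 1 and head_rises and tail_falls:
        return "bell"
    if v.count(min(v)) == 1 and head_falls and tail_rises:
        return "u_shape"
    return "none"
```

For example, `encode_trend([0, 1, 2, 3, 4, 3, 2, 1, 0])` yields `"bell"`, while a flat signal yields `"none"`.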

Fig. 9 Parameter and video encoding. a Parameter ‘k’ value during the video. b Encoded parameter k. c An encoded video

Fig. 10 Skeleton extraction evaluation: a is frames for the video, b shows extracted skeleton by the proposed method, and c presents extracted skeleton for the foreground videos given by the Weizmann Dataset

4.1 Weizmann Dataset

Weizmann dataset was presented by Blank et al. [21] in 2005; it includes ten action classes performed by nine actors. The actions are bending, jumping jacks, jumping, jumping in place, running, skipping, walking, galloping sideways, one-hand waving, and two-hands waving. Table 1 shows the actions of the Weizmann dataset (the rows) and the extracted

Fig. 14 Bell shape of change for some parameters for non-moving actions

Table 1 The relation between the features and the different actions for Weizmann dataset

features (the columns). The symbols used are the same as in Fig. 3. The table reflects which parameters affect each action; these relations are conducted through experimental evaluation. Leave-one-person out experimental setup is used; in each run, eight persons are used for training, and one person for testing. Then the average of the results, 98.9% is taken as a

MSR-Action3D Dataset [23] was collected in 2010 using depth sensor similar to the Microsoft Kinect. It is composed

Table 4 The three action subsets of MSR-action 3D dataset of 20 actions performed by ten subjects, and each is repeated two or three times. The state-of-the-art methods that used this dataset divided the 20 action classes into three action subsets [23] AS1, AS2, and AS3 as shown in Table 4. AS1 and AS2 are actions of similar movements, whereas AS3 is a group of complex actions.

Table 3 Confusion matrix of Berkeley MHAD dataset

Table 5 The relation between the features and the different actions for MSR-Action3D and Berkeley datasets

Table 6 Confusion matrix of subset AS1 of MSR-Action3D dataset

Table 7 Confusion matrix of subset AS2 of MSR-Action3D dataset

Table 9 Comparison with other methods for Weizmann dataset Table 10 Comparison with other methods for Berkeley dataset

Table 11 Comparison with other methods for MSR-Action3D dataset

Table 8 Confusion matrix of subset AS3 of MSR-Action3D dataset nition, the classifier is trained on a dataset and tested on another dataset. The purpose of this experiment is to examine the generalization of the proposed technique across different situations of the data acquisition. The training stage is done using Berkeley MHAD dataset, and testing is performed on all the subjects of the MSR-Action3D dataset using the four common actions between the two datasets, which are forward punch, wave two hands, wave one hand, and clap.


Classifying Abnormal Activities in Exam Using Multi-class Markov Chain LDA Based on MODEC Features

by Mohamad Ivan Fanany

In this paper, we apply MCMCLDA (Multi-class Markov Chain Latent Dirichlet Allocation) model to classify abnormal activity of students in an examination. Abnormal activity in exams is defined as a cheating activity. We compare the usage... more

NUMBER OF VIDEOS OF EACH ACTIVITY IN THE DATASET Fig. 2. Examples of annotated frames for training our own MODEC model TABLE I

We have presented the usage of the MCMCLDA model to classify abnormal activities in exams. The comparison of MODEC and Harris3D as interest point detectors results in better performance for MODEC in both accuracy and computational time, although the classification performance is not satisfactory enough. While MCMCLDA using MODEC suffers low performance because of the difficulties in detecting

For future works, we can focus on improving the performance of MODEC when detecting human joints in indistinguishable foreground and background. In addition to that, the cheating surveillance system is not complete without point-in-time detection. The idea of point-in-time detection is to

AVERAGE ACCURACY, PRECISION, RECALL, AND F1-SCORE FROM 10 EXPERIMENTS OF MCMCLDA USING HARRIS3D. THE HIGHEST ACCURACY IS IN EXPERIMENT #2.

AVERAGE ACCURACY, PRECISION, RECALL, AND F1-SCORE FROM 10 EXPERIMENTS OF MCMCLDA USING MODEC. THE HIGHEST ACCURACY IS IN EXPERIMENT #5.

B. MCMCLDA


Encouraging LSTMs to Anticipate Actions Very Early

by Basura Fernando

In contrast to the widely studied problem of recognizing an action given a complete sequence, action anticipation aims to identify the action from only partially available videos. As such, it is therefore key to the success of computer... more


Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features

by Chen Chen

This paper presents a new framework for human action recognition from depth sequences. An effective depth feature representation is developed based on the fusion of 2D and 3D auto-correlation of gradients features. Specifically, depth... more


UTD-MHAD: A Multimodal Dataset for Human Action Recognition Utilizing a Depth Camera and a Wearable Inertial Sensor

by Chen Chen

IEEE International Conference on Image Processing, 2015

Human action recognition has a wide range of applications including biometrics, surveillance, and human computer interaction. The use of multimodal sensors for human action recognition is steadily increasing. However, there are limited... more


AMTnet: Action-Micro-Tube Regression by End-to-end Trainable Deep Architecture

by Fabio Cuzzolin and

2017, Proceedings of the International Conference on Computer Vision (ICCV)

Dominant approaches to action detection can only provide sub-optimal solutions to the problem, as they rely on seeking frame-level detections, to later compose them into 'action tubes' in a post-processing step. With this paper we... more


Robust 3D Action Recognition through Sampling Local Appearances and Global Distributions

by Chen Chen

2017, IEEE Transactions on Multimedia

3D action recognition has broad applications in human-computer interaction and intelligent surveillance. However, recognizing similar actions remains challenging since previous literature fails to capture motion and shape cues... more


Real-time Video Action Recognition via Hidden Two-Stream Networks

by Yi Zhu

In this work, we implement a real-time human action recognition framework, termed hidden two-stream networks [1]. This method only takes raw video frames as input and directly predicts action classes without explicitly computing optical... more


Improving Human Action Recognition Using Fusion of Depth Camera and Inertial Sensors

by Chen Chen

IEEE Transactions on Human-Machine Systems

This paper presents a fusion approach for improving human action recognition based on two differing modality sensors consisting of a depth camera and an inertial body sensor. Computationally efficient action features are extracted from... more

Fig. 1. Example depth images of the actions (left to right) jumping jacks, punching, and throwing a ball.

Fig. 2. Depth image foreground extraction: (left) original depth image, (right) foreground extracted depth image.

Fig. 3. Three projection views of a depth image.

Fig. 4. A DMM generated from a waving two hands depth video sequence.

Fig. 5. Body placement of the six accelerometers in the Berkeley MHAD.

static foot movements in the actions.

Fig. 6. Recognition rates (%) using different number of segments for accelerometer features: (a) SVM classifier. (b) CRC classifier.

The number of segments N for the acceleration data was determined via experimentation using the first 6 subjects for training and the rest for testing. SVM and CRC were employed as the classifiers and the performance was tested using different N; see Fig. 6. In this figure, A1 denotes only using the accelerometer A1, A4 denotes only using the accelerometer A4, and A1&A4 denotes using both of the accelerometers A1 and A4 together where the features from the two accelerometers are stacked. Average denotes the mean accuracy of using the three accelerometer settings: A1, A4, and A1&A4. The setting N ∈ [13, 17] produced a consistent recognition performance under the three accelerometer settings. Thus, N = 15 was chosen for the experiments. Each feature vector had the dimension of 180 and 360 for the single-accelerometer setting and the two-accelerometer setting, respectively.
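The stated dimensions are consistent with 3 axes × 15 segments × 4 values per segment = 180. The sketch below shows one way such a vector could be built; the choice of four statistics (mean, variance, standard deviation, root mean square) is an assumption made only so the arithmetic works out, and `segment_features` is a hypothetical helper, not the paper's code.

```python
import math
from statistics import fmean, pvariance


def segment_features(signal, n_segments=15):
    """Per-segment statistics of a 3-axis acceleration signal.

    signal: list of (ax, ay, az) samples. The signal is cut into n_segments
    contiguous windows; each window contributes 4 statistics per axis
    (assumed: mean, variance, standard deviation, RMS).
    """
    n = len(signal)
    if n < n_segments:
        raise ValueError("signal is shorter than the number of segments")

    features = []
    for k in range(n_segments):
        # Integer boundaries keep every window non-empty when n >= n_segments.
        start = k * n // n_segments
        stop = (k + 1) * n // n_segments
        window = signal[start:stop]
        for axis in range(3):
            x = [sample[axis] for sample in window]
            mean = fmean(x)
            var = pvariance(x, mean)
            rms = math.sqrt(fmean([value * value for value in x]))
            features.extend([mean, var, math.sqrt(var), rms])
    return features


# Toy signal: 150 samples with a constant second axis.
signal = [(math.sin(0.1 * i), 1.0, 0.5 * i) for i in range(150)]
single = segment_features(signal)
stacked = single + segment_features(signal)  # two accelerometers stacked
```

Stacking the vectors of two accelerometers simply concatenates them, which doubles the dimension from 180 to 360 as in the text.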

Fig. 7. Depth motion maps for the actions (left to right) sit down and stand up, sit down, and stand up. example, the overall recognition rate for the action sit was improved by 13% over the Kinect alone and the accuracy for the action punch was improved by 23% over the accelerometer alone.

Fig. 8. Three-axis acceleration signals corresponding to the actions: (a) sit down, and (b) stand up.

Fig. 9. Features generated from three-axis acceleration data for the actions punch and clap.

COMPARISON OF RECOGNITION RATES (%) BETWEEN OUR FEATURE-LEVEL FUSION (SVM) AND THE MKL METHOD IN [19] Fig. 10. Real-time action recognition timeline of our fusion framework

RECOGNITION RATES (%) WHEN USING DIFFERENT ACCELEROMETERS TABLE I

RECOGNITION RATES (%) FOR THE LEAVE-ONE-SUBJECT-OUT CROSS-VALIDATION TEST TABLE III CONFUSION MATRIX WHEN USING KINECT ONLY FOR THE LEAVE-ONE-SUBJECT-OUT CROSS-VALIDATION TEST

CONFUSION MATRIX WHEN USING ACCELEROMETER A1 ONLY FOR THE LEAVE-ONE-SUBJECT-OUT CROSS-VALIDATION TEST

CONFUSION MATRIX WHEN USING KINECT AND ACCELEROMETER A1 FUSION FOR THE LEAVE-ONE-SUBJECT-OUT CROSS-VALIDATION TEST

RECOGNITION RATES (%) COMPARISON BETWEEN FEATURE LEVEL FUSION (CRC) AND DECISION LEVEL FUSION (CRC)


Human Action Recognition Using Temporal Segmentation and Accordion Representation

by manel sekma

In this paper, we propose a novel motion descriptor Seg-SIFT-ACC for human action recognition. The proposed descriptor is based both on the accordion representation of the video and its temporal segmentation into elementary motion... more


Self-Supervised Video Representation Learning With Odd-One-Out Networks

by Basura Fernando

We propose a new self-supervised CNN pre-training technique based on a novel auxiliary task called odd-one-out learning. In this task, the machine is asked to identify the unrelated or odd element from a set of otherwise related elements.... more


Pose-based 3D Human Motion Analysis Using Extreme Learning Machine

by Mohamad Ivan Fanany

In 3D human motion pose-based analysis, the main problem is how to classify multi-class label activities based on primitive action (pose) inputs efficiently for both accuracy and processing time. Because, pose is not unique and the same... more


3D Action Recognition Using Multi-temporal Depth Motion Maps and Fisher Vector

by Chen Chen

This paper presents an effective local spatio-temporal descriptor for action recognition from depth video sequences. The unique property of our descriptor is that it takes the shape discrimination and action speed variations into account,... more


A Real-Time Human Action Recognition System Using Depth and Inertial Sensor Fusion

by Chen Chen

This paper presents a human action recognition system that runs in real-time and uses a depth camera and an inertial sensor simultaneously based on a previously developed sensor fusion method. Computationally efficient depth image... more

Fig. 1. Microsoft Kinect depth sensor. Kinect is a low-cost RGB-Depth camera sensor introduced by Microsoft for human-computer interface applications. It comprises a color camera, an infrared (IR) emitter, an IR depth sensor, a tilt motor, a microphone array, and an LED light. A picture of the Kinect sensor or depth camera is shown in Fig. 1. This sensor can capture 16-bit depth images with a resolution of 320x240 pixels. Two example depth images are depicted in Fig. 2. The frame rate is approximately 30 frames per second. In addition, the Kinect SDK [14] is a publicly available software package which can be used to track 20 body skeleton joints (see Fig. 3) and their 3D spatial positions.

Fig. 2. Example depth images from Kinect depth sensor.

Fig. 3. Skeleton joints provided by Kinect depth sensor.

The wearable inertial sensor used in this work is a small size (1" x 1.5") wireless inertial sensor built in the Embedded Signal Processing (ESP) Laboratory at Texas A&M University [15]. This sensor captures 3-axis acceleration, 3-axis angular velocity and 3-axis magnetic strength, which are transmitted wirelessly via a Bluetooth link to a laptop/PC. This wearable inertial sensor is shown in Fig. 4. The sampling rate of the sensor is 50 Hz and its measuring range is ±8g for acceleration and ±1000 degrees/second for rotation. It is worth mentioning that other commercially available inertial sensors can also be used in place of this inertial sensor. For practicality reasons or to avoid the intrusiveness associated with asking subjects to wear multiple inertial sensors, only one inertial sensor is considered in our work, either worn on the right wrist (similar to a watch) or the right thigh as depicted in Fig. 5, depending on the action of interest to be recognized in a particular application. More explanations about the placement of the sensor for different actions are stated in Section IV. Fig. 6 shows the inertial sensor signals (3-axis accelerations and 3-axis angular velocities) for the action right hand wave.

Fig. 4. Wearable inertial sensor developed in the ESP Lab.

Fig. 5. Inertial sensor placements: right wrist or right thigh. Fig. 6. Inertial sensor signals (3-axis accelerations and 3-axis angular velocities) for the action right hand wave.

Fig. 7. DMMs generated from a sample video of the action one hand wave.

Fig. 8. Action segmentation illustration using skeleton joint positions.

$d = \sqrt{(x - x_s)^2 + (y - y_s)^2 + (z - z_s)^2}$. (6)

If for $m_s$ consecutive skeleton frames, all the distances are greater than a specified sensitivity $\sigma_d$, the start of an action is triggered. If for $m_s$ consecutive skeleton frames, all the distances are less than or equal to the specified sensitivity $\sigma_d$, the end of an action is triggered. Fig. 8 illustrates the procedure of using skeleton joint positions to indicate the start and end of an action. The use of $m_s$ consecutive skeleton frames avoids responding to possible signal jitters.
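A minimal sketch of this start/end trigger follows, assuming a single tracked joint, a fixed reference position, and the Euclidean distance of Eq. (6). The function name `detect_action_boundaries` and its parameter names are hypothetical; how the reference position is obtained is not covered here.

```python
import math


def detect_action_boundaries(positions, reference, sigma, m):
    """Return ("start" | "end", frame_index) events for one joint trajectory.

    positions : list of (x, y, z) joint positions, one per skeleton frame.
    reference : (x, y, z) position the distance is measured from (assumption).
    sigma     : sensitivity threshold on the distance.
    m         : number of consecutive frames required to trigger an event.
    """
    events = []
    active = False  # True while an action is considered in progress
    run = 0         # consecutive frames that contradict the current state

    for i, p in enumerate(positions):
        moving = math.dist(p, reference) > sigma
        if moving != active:
            run += 1
            if run == m:
                active = not active
                # Report the first frame of the run that caused the switch.
                events.append(("start" if active else "end", i - m + 1))
                run = 0
        else:
            run = 0  # a single contradicting frame is treated as jitter
    return events


rest, away = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
frames = [rest] * 4 + [away] + [rest] * 2 + [away] * 5 + [rest] * 4
events = detect_action_boundaries(frames, rest, sigma=0.1, m=3)
```

In the toy sequence, the isolated frame at index 4 never reaches `m` consecutive frames and is ignored, which is the jitter suppression the text describes.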

The flowchart of the real-time operation of the system is shown in Fig. 11. The detection of an action start and end is continuously performed. After detecting an action start, the fusion classification method is activated while monitoring for the action end. Note that the DMM gets computed frame by frame. The DMM feature computation is completed when the end of an action is detected.

Fig. 11. Flowchart of the real-time human action recognition system.

Fig. 12. Sample shots of the 27 actions in the UTD-MHAD database. corrupted sequences, the dataset includes 861 data sequences. Sample shots of the 27 actions in the UTD-MHAD are shown in Fig. 12. The wearable inertial sensor was placed on the subjects’ right wrists for actions (1) through (21) which were hand type movements, and on the subjects’ right thigh for actions (22) through (27) which were leg type movements.

Fig. 13. Classification performance (recognition rates per action class and overall recognition rate) when using Kinect sensor only, inertial sensor only, and Kinect and inertial sensors fusion for the subject-generic experiment.

Fig. 14. Classification performance (recognition rates per action class and overall recognition rate) when using Kinect sensor only, inertial sensor only, and Kinect and inertial sensors fusion for the subject-specific experiment.

Fig. 15. A misclassification case using Kinect and inertial sensor fusion.

achieved higher overall classification performance compared to using each sensor individually. Fig. 16. Recognition confusion matrix when using Kinect sensor only.

Fig. 17. Recognition confusion matrix when using inertial sensor only. Fig. 18. Recognition confusion matrix when using Kinect and inertial sensors fusion.

PROCESSING TIMES TAKEN BY THE MAJOR COMPONENTS OF THE REAL-TIME SYSTEM

Fig. 19. Real-time recognition confusion matrix.


Silhouette History and Energy Image Information for Human Movement Recognition

by Mohiuddin Ahmad

2010, Journal of Multimedia


Hollywood 3D: What are the best 3D features for Action Recognition?

by Simon Hadfield and

Action recognition "in the wild" is extremely challenging, particularly when complex 3D actions are projected down to the image plane, losing a great deal of information. The recent growth of 3D data in broadcast content and commercial... more

Fig. 2: The appearance and disparity (top row) for a Drive action from the Hollywood 3D dataset. The 3D motion field is also shown on the bottom row. Note that the primary motion occurs on the foreground regions of the car, with secondary x and y motion on the passengers.

Fig. 1: The appearance and disparity (top row) for a Eat action from the Hollywood 3D dataset. The 3D motion field is also shown on the bottom row. The primary motion is concentrated on the arm and head, which move towards each other.

Fig. 3: “In the wild” action recognition pipeline, making use of depth information at various stages. The green elements refer to dataset pre-processing, while blue elements relate to the recognition pipeline.

Fig. 4: Distribution of estimated focal lengths over 20000 repetitions, on the 2 different sequences pairs shown (wide-angle, close-up Eat shot, and extreme zoom Drive shot).

Fig. 5: Detection of correspondences between the two cameras, and between two points in time (shown in black and green). This illustrates the scene flow estimation task, and its relation to optical flow (OF) and stereo matching (SM).

Fig. 6: (a) Orientation bins visualised with alternating white and black squares. y is rotation around the w axis. 6 is rotation around the wu axis. (b) a scene divided into a 3 by 3 grid of subregions, with the motion of each subregion aggregated.

Fig. 7: H® The subregions of the encoded motion field are re-arranged such that the region of maximum motion occurs first. This provides some degree of invariance to camera roll.

Fig. 8: HP The orientation of the strongest motion vector in the scene is used to normalise the histograms, providing robustness to camera pitch and roll.

Fig. 9: H’ A new set of 3D axes is chosen using PCA, relating to the dominant 3D motion orientations in the scene. This provides complete invariance to camera viewpoint change.

Table 2: Average precision per class, on the 3D action dataset, for a range of sparse interest point detectors, including simple spatio-temporal interest points, depth aware extensions and Dense Trajectory encoding. The Bag of Visual Words (HoG/HoF/HoDG) feature encoding was used. Classes are shown in bold, for schemes outperforming both of the simple spatio-temporal interest point schemes.

depth; no information from the depth stream is encoded in the descriptors, and depth information cannot be used by the classifier to distinguish actions. Instead, the depth stream is used only to make more informed decisions about which regions of the appearance stream to encode and which to discard. This is particularly true after the encoded regions are accumulated into a single holistic descriptor.

SVM kernels including linear, χ² and multi-χ². However, linear kernels were found to perform poorly while χ² kernels greatly increased computation time and had little effect on performance. Thus for clarity we only present the results using RBF kernels. To facilitate comparisons with the Hollywood 1 and 2 datasets, the Average Precision (AP) measure was used, as explained in PASCAL VOC [9]. Relevant source code is available online along with the data [13]. RD tests were performed with 4 binary comparisons per histogram (N = 4), concatenating 10 descriptor histograms. BOW tests were performed with 4,000 cluster centres (as suggested in [30]), with the local histogram descriptors calculated using a block size of 3 by 3, with 8 orientation bins. For the 3D motion (HOS) features, each subregion histogram uses 4 x 4 bins in the y and 6 orientations.

Table 3: Correct Classification rate and Average Precision for different local features using the 2 top performing saliency measures. The best feature for each saliency measure is shown in bold.

Table 4: Per class Average Precision scores using various types of motion features, including 2D motions, uncalibrated 3D motions, unnormalised 3D motions, and calibrated motions encoding varying levels of invariance to camera viewpoint change.

The results also show that the unnormalised features (H), which are not scale invariant, perform uniformly worse than their normalised counterparts. It is worth noting, however, that Hollywood 3D does not contain the Run/Jog/Walk ambiguities of some datasets. Instead the wide range of viewpoints and zooms present in the data favour the more consistent H features.


A review on vision techniques applied to Human Behaviour Analysis for Ambient-Assisted Living

by Pau Climent-Pérez and


Real-Time Human Action Recognition Based on Depth Motion Maps

by Chen Chen

This paper presents a human action recognition method by using depth motion maps. Each depth frame in a depth video sequence is projected onto three orthogonal Cartesian planes. Under each projection view, the absolute difference between... more


WHO IS DOING WHAT? SIMULTANEOUS RECOGNITION OF ACTIONS AND ACTORS

by Shahzad Cheema

Recognizing human actions in videos has become a rapidly growing area of research. Most existing research has focused only on a single aspect i.e. recognition of actions. However, humans tend to perform different actions in their own... more


FUSION OF DEPTH, SKELETON, AND INERTIAL DATA FOR HUMAN ACTION RECOGNITION

by Chen Chen

This paper presents a human action recognition approach by the simultaneous deployment of a second generation Kinect depth sensor and a wearable inertial sensor. Three data modalities consisting of depth images, skeleton joint positions,... more


Human action recognition using Dynamic Time Warping

by Peb Ruswono Aryan

2011, … and Informatics (ICEEI), …

Human action recognition is gaining interest from many computer vision researchers because of its wide variety of potential applications. For instance: surveillance, advanced human computer interaction, content-based video retrieval, or... more

Figure 1. Stick figure representation with 15 joints of body part

We represent human pose using a stick figure consisting of 15 body-part joints, which is shown in Figure 1.

Figure 3. Sequences of depth maps overlayed with segmented human region and skeleton tracking result for some human actions: clap, wave, smash - The feature vector was formed by concatenating the 15 quaternions of the respective body parts to form a column vector of 60 elements.

Compared to 3-by-3 rotational matrices, quaternions are also more compact, requiring only 4 storage units instead of 9. These properties make quaternions favourable for representing rotations.

Figure 2. Graphical representation of quaternion units product as 90 degree rotation in 4D-space

graph shows the mapping function from the index of A to the index of B. Figure 4. Matching on similar points on signal As a typical NN algorithm, there is no specific learning phase. Our system stores a list of multivariate time series of known activities and their corresponding labels in a database. When an unknown action is presented to the system, the system takes the unknown time series, performs a sequential search with lower bounding DTW.
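The nearest-neighbour search with a DTW lower bound can be sketched as follows. The source does not say which lower bound is used, so this example swaps in a simple end-point bound (any warping path must match the first frames and the last frames of both series); the function names and the toy one-dimensional data are illustrative only, whereas the paper's feature vectors have 60 elements per frame.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two multivariate series."""
    m = len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, len(a) + 1):
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


def endpoint_lower_bound(a, b):
    """Cheap bound: the first and last frames are always aligned by DTW."""
    first = math.dist(a[0], b[0])
    if len(a) == 1 and len(b) == 1:
        return first  # first and last cell coincide, count it once
    return first + math.dist(a[-1], b[-1])


def classify(query, database):
    """Sequential 1-NN search; skips DTW when the bound cannot beat the best."""
    best_label, best = None, math.inf
    for series, label in database:
        if endpoint_lower_bound(query, series) >= best:
            continue
        d = dtw_distance(query, series)
        if d < best:
            best_label, best = label, d
    return best_label, best


database = [
    ([(0.0,), (1.0,), (2.0,), (1.0,), (0.0,)], "wave"),
    ([(5.0,), (5.0,), (5.0,)], "clap"),
]
# Same shape as the "wave" template, but performed more slowly.
query = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,), (0.0,)]
label, distance = classify(query, database)
```

Because the bound never exceeds the true DTW cost, pruning with it cannot change the nearest neighbour; it only avoids some full DTW computations.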

Figure 8. Motion path of the action “smash”

C. Result

Below are comparison paths of the clap, punch, smash, wave, run, and kick actions. Each was performed five times. In the current results, actions generated by the upper body (clap, punch, smash, and wave) are recognized quite well, but more data must be collected for benchmarking. Recognition of actions generated by the lower body still needs to be improved.


Multi-Level Sequence GAN for Group Activity Recognition

by Harshala Gammulle

2018, Asian Conference on Computer Vision (ACCV)

We propose a novel semi supervised, Multi-Level Sequential Generative Adversarial Network (MLS-GAN) architecture for group activity recognition. In contrast to previous works which utilise manually annotated individual human action... more

Fig. 1. The proposed Multi-Level Sequence GAN (MLS-GAN) architecture: (a) G is trained with sequences of person-level and scene-level features to learn an intermediate action representation, an ‘action code’. (b) The model D performs group activity classification while discriminating real/fake data from scene level sequences and ground truth/generated action codes.

Fig. 2. Sample ground truth action codes, with k = 7 (i.e. we have 7 actions). For the code in (a), y = 5 and for the code shown in (b) y = 3. Note that a green border is shown around the codes for clarity; this is not part of the code and is only included to aid display. Codes are of size 1 x k pixels.

Fig. 3. Sample frames from 4 example sequences (in columns) from the collective ac- tivity dataset with the ‘Crossing scene level activity’. The colour of the bounding box indicates the activity class of each individual where yellow denotes ‘Crossing’, greer denotes ‘Waiting’ and blue denotes ‘Walking’. The sequences illustrate the challenges due to view point changes and visual similarity between the transition frames and the action frames (i.e 3rd column. transitions from ‘Crossing’ to ‘Walking’).

Fig. 4. Visualisations of the predicted group activities for the Volleyball dataset using the proposed MLS-GAN model. Figure 4 visualises qualitative results from the proposed MLS-GAN model for the Volleyball dataset. Irrespective of the level of clutter and camera motion, the proposed model correctly recognises the group activity.

Table 1. Comparison of the results on Collective Activity dataset [3] using MCA and MPCA. NA refers to unavailability of that evaluation.
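As a rough illustration of the two metrics named in the caption, the sketch below assumes that MCA denotes overall multi-class accuracy and MPCA denotes mean per-class accuracy, which is how these abbreviations are commonly used; the excerpt itself does not define them. The function names and the label sequences are made up for the example.

```python
from collections import defaultdict


def mca(y_true, y_pred):
    """Overall accuracy: fraction of samples whose prediction matches the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mpca(y_true, y_pred):
    """Mean per-class accuracy: average of each class's own accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Imbalanced toy labels: the single 'waiting' sample is misclassified.
y_true = ["crossing"] * 4 + ["waiting", "walking"]
y_pred = ["crossing"] * 4 + ["crossing", "walking"]

overall = mca(y_true, y_pred)     # 5 of 6 samples correct
per_class = mpca(y_true, y_pred)  # per-class accuracies 1, 0 and 1, averaged
```

On imbalanced data the two numbers can differ noticeably, which is one reason for reporting both.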

Table 2. Comparisons with the state-of-the-art for Volleyball Dataset [6]. The first block of results (1 group) are for the methods considering all the players as one group and the second block is for dividing players into two groups (i.e. each team) first and extracting features from them separately. NA refers to unavailability of results.


Self-organizing neural integration of pose-motion features for human action recognition

by German I. Parisi and

The visual recognition of complex, articulated human movements is fundamental for a wide range of artificial systems oriented toward human-robot communication, action classification, and action-driven perception. These challenging tasks... more


A Skeleton Descriptor for Kinesthetic Element Recognition of Bali Traditional Dances

by Yaya Heryadi and

Bali traditional dance has gained an international reputation thanks to its highly articulated body-part motions, fascinating eye movements, facial expressions, and colorful costumes. Although the motions are viewed as the main aesthetic... more


Silhouette-based Human Action Recognition using Sequences of Key Poses

by Pau Climent-Pérez and

In this paper, a human action recognition method is presented in which pose representation is based on the contour points of the human silhouette and actions are learned by making use of sequences of multi-view key poses. Our contribution... more


Action Recognition Using Completed Local Binary Patterns and Multiple-class Boosting Classifier

by Chen Chen

This paper, for the first time, introduces a multiple-class boosting scheme (MBS) to combine depth motion maps (DMMs) and completed local binary patterns (CLBP) for action recognition. DMMs derive from projecting depth frames onto three... more


TraMNet - Transition Matrix Network for Efficient Action Tube Proposals

by Fabio Cuzzolin

2018, Asian Conference on Computer Vision (ACCV 2018)

Current state-of-the-art methods solve spatio-temporal action localisation by extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate sets of temporally connected bounding boxes called action micro-tubes. However,... more

Fig. 1. Illustrating the key limitation of anchor cuboids using a “dynamic” action like “horse riding”. (a) A horse rider changes its location from frame f_t to f_{t+Δ} as shown by the ground truth bounding boxes (in green). As the anchor cuboid generation [1,2] is constrained by the spatial location of the anchor box in the first frame f_t, the overall spatiotemporal IoU overlap between the ground-truth micro-tube and the anchor cuboid is relatively low. (b) In contrast, our anchor micro-tube proposal generator is much more flexible, as it efficiently explores the video search space via an approximate transition matrix estimated based on a hidden Markov model (HMM) formulation. As a result, the anchor micro-tube proposal (in blue) generated by the proposed model exhibits higher overlap with the ground-truth. (c) For “static” actions (such as “clap”) in which the actor does not change location over time, anchor cuboid and anchor micro-tubes have the same spatiotemporal bounds.
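The overlap argument in this caption can be illustrated with a small numeric sketch. It assumes, as a simplification not stated in the excerpt, that the spatiotemporal IoU of two micro-tubes is the mean of the per-frame box IoUs; the function names and box coordinates are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(tube_a, tube_b):
    """Mean per-frame IoU of two tubes with the same number of frames."""
    return sum(box_iou(a, b) for a, b in zip(tube_a, tube_b)) / len(tube_a)


# Ground-truth micro-tube for a "dynamic" action: the actor moves right.
ground_truth = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0)]
# Anchor cuboid: the first-frame box is simply repeated in the second frame.
anchor_cuboid = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
# Anchor micro-tube: the second box is allowed to shift with the actor.
anchor_micro_tube = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0)]

cuboid_overlap = tube_iou(ground_truth, anchor_cuboid)
micro_tube_overlap = tube_iou(ground_truth, anchor_micro_tube)
```

For a static action the two anchors would coincide and give the same overlap, matching case (c) of the caption.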

Fig. 3. Base network architecture. (a) SSD convolutional layers; (b) the corresponding conv feature maps outputted by each conv layer; (c) r anchor boxes with different aspect ratios assigned to cell location cs of the 3 x 3 feature map grid; (d) transition matrices for the P feature map grids in the pyramid, where P = 6.

proposals could be further improved by learning transitions between anchors across different levels of the pyramid. As the feature dimension of each map varies in SSD, e.g. 1024 for p = 2 and 512 for p = 1, a more consistent network such as FPN [29] with Resnet [30] would be a better choice as base architecture. Here we stick to SSD to produce a fair comparison with [2, 11, 1], and leave this extension to future work.

report our TraMNet’s performance. 3D as the base network. As in our TraMNet network, we also replaced SSD’s convolutional heads with new linear layers. The same tube generation [11] and data augmentation [8] methods were adopted, and the same hyperparameters were used for training the networks, including TraMNet. The only difference is that the anchor micro-tubes used in [9,2] were cuboidal, whereas TraMNet’s anchor micro-tubes are generated using transition matrices. We refer to these approaches as SSD-L (SSD-linear-heads) [11], AMTnet-L (AMTnet-linear-heads) [1] and as ACT-L (ACT-detector-linear-heads) [2].

Network training and implementation details. We used the established training settings for all the above methods. While training on the UCF101-24 dataset, we used a batch size of 16 and an initial learning rate of 0.0005, with the learning rate dropping after 100K iterations for the appearance stream and 140K for the flow stream. Whereas the appearance stream is only trained for 180K iterations, the flow stream is trained for […]0K iterations. In all cases, the input image size was 3 x 300 x 300 for the appearance stream, while a stack of five optical flow images [35] (15 x 300 x 300) was used for flow. Each network was trained on 2 1080Ti GPUs. More details about parameters and training are given in the supplementary material.


Predicting Action Tubes

by Fabio Cuzzolin

2018, ECCV 2018 Workshop on Anticipating Human Behaviour (AHB 2018)

In this work, we present a method to predict an entire 'action tube' (a set of temporally linked bounding boxes) in a trimmed video just by observing a smaller subset of it. Predicting where an action is going to take place in the near... more

Fig. 1. An illustration of the action tube prediction problem using an example in which a “pickup” action is being performed on a sidewalk. As an ideal case, we want the system to predict an action tube as shown in (c) (i.e. when 100% of the video has been processed) just by observing 25% of the entire clip (a). We want the tube predictor to predict the action class label (shown in red) alongside predicting the spatial location of the tube. The red shaded bounding boxes denote the detected tube in the observed portion of the input video, whereas the blue coloured bounding boxes represent the future predicted action tube for the unobserved part of the clip.

Abstract. In this work, we present a method to predict an entire ‘action tube’ (a set of temporally linked bounding boxes) in a trimmed video just by observing a smaller subset of it. Predicting where an action is going to take place in the near future is essential to many computer vision based applications such as autonomous driving or surgical robotics. Importantly, it has to be done in real-time and in an online fashion. We propose a Tube Prediction network (TPnet) which jointly predicts the past, present and future bounding boxes along with their action classification scores. At test time TPnet is used in a (temporal) sliding window setting, and its predictions are put into a tube estimation framework to construct/predict the video long action tubes not only for the observed part of the video but also for the unobserved part. Additionally, the proposed action tube predictor helps in completing action tubes for unobserved segments of the video. We quantitatively demonstrate the latter ability, and the fact that TPnet improves state-of-the-art detection performance, on one of the standard action detection benchmarks - J-HMDB-21 dataset.

Fig. 2. Workflow illustrating the application of TPnet to a test video at a time instant t. The network takes frames f_t and f_{t+Δ} as input and generates classification scores, the micro-tube (in red) for frames f_t and f_{t+Δ}, and prediction bounding boxes (in blue) for frames f_{t-Δ_p}, f_{t+Δ_f} up to f_{t+nΔ_f}. All bounding boxes are considered to be linked to the micro-tube. Note that predictions also span the past: a setting called smoothing in the estimation literature. Δ_p, Δ_f and n are network parameters that we cross-validate during training.

Fig. 7. Future action tube prediction results (a) (prediction-mAP (p-mAP)) for predicting the tube in the unobserved part of the video. Action tube prediction results (b) (completion-mAP (c-mAP)) for predicting video long tubes as early as possible on the J-HMDB-21 dataset. We use p-mAP (a) and c-mAP (b) with detection threshold δ = 0.5 as evaluation metrics on the J-HMDB-21 dataset. TPnet_abc represents our TPnet where a = Δ_p, b = Δ_f and c = n.

Table 1. Action localisation results on JHMDB dataset. The table is divided into four parts. The first part lists approaches which have single frame as input; the second part presents approaches which take multiple frames as input; the third part contemplates different fusion strategies of our feature-level fusion (based on AMTnet); lastly, we report the detection performance of our TPnet by ignoring the future and past predictions and only use the detected micro-tubes to produce the final action tubes.

Implementation details. We train all of our networks with the same set of hyper-parameters to ensure fair comparison and consistency, including TPnet and AMTnet. We use an initial learning rate of 0.0005, and the learning rate drops by a factor of 10 after 5K and 7K iterations. All the networks are trained up to 10K iterations. We implemented AMTnet using pytorch (https://pytorch.org/). We initialise AMTnet and TPnet models using the pretrained SSD network on J-HMDB-21 dataset on its respective train splits. The SSD network training is initialised using image-net trained VGG network. For optical flow images, we used the optical flow algorithm of Brox et al. [41]. Optical flow output is put into a three channel image: two channels are made of the flow vector and the third channel is the magnitude of the flow vector. The training parameters of our TPnet are used to define the name of the ...
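The three-channel flow encoding mentioned in the implementation details could be sketched as follows. This is an illustration under assumptions, not the authors' code: the flow field is taken to be a nested list of (u, v) displacement pairs, the function name is hypothetical, and no rescaling to an image value range is applied because the excerpt does not describe one.

```python
import math


def flow_to_three_channels(flow):
    """Map an H x W grid of (u, v) flow vectors to (u, v, magnitude) triples."""
    return [[(u, v, math.hypot(u, v)) for (u, v) in row] for row in flow]


# A tiny 2 x 2 flow field with made-up displacements.
flow = [
    [(3.0, 4.0), (0.0, 0.0)],
    [(-1.0, 0.0), (0.0, 2.0)],
]
flow_image = flow_to_three_channels(flow)
```

Each output pixel keeps the two flow components and adds their Euclidean norm as the third channel, so the result has the same channel count as an RGB frame.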


Gradient Local Auto-Correlations and Extreme Learning Machine for Depth-Based Activity Recognition

by Chen Chen

This paper presents a new method for human activity recognition using depth sequences. Each depth sequence is represented by three depth motion maps (DMMs) from three projection views (front, side and top) to capture motion cues. A... more


Human Action Recognition

Key research themes

1. How can spatial-temporal feature extraction and data fusion improve accuracy and robustness in human action recognition?

2. What are the effective dimensionality reduction strategies for handling high-dimensional features in large-scale human action recognition datasets?

3. How can skeletal data and body part representations be leveraged for efficient and interpretable human action recognition?
