Papers by nour el houda khadidja SLIMANI
Human Interaction Recognition Based on the Co-occurrence of Visual Words
HAL (Le Centre pour la Communication Scientifique Directe), Jun 28, 2014
This paper describes a novel methodology for the automated recognition of high-level activities. A key aspect of our framework relies on the concept of co-occurring visual words for describing interactions between several persons. Motivated by the numerous successes of human activity recognition methods using bag-of-words, this paradigm is extended. A 3-D XYT spatio-temporal volume is generated for each interacting person, and a set of visual words is extracted to represent that person's activity. The interaction is then represented by the frequency of co-occurring visual words between persons. For our experiments, we used the UT-Interaction dataset, which contains several complex human-human interactions.
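To make the representation concrete, below is a minimal Python sketch of the co-occurrence descriptor, under the assumption that each person's spatio-temporal features have already been quantized into visual word IDs and grouped by frame; all function and variable names are illustrative, not taken from the paper.

import numpy as np

# Hypothetical sketch: counts how often word i (person 1) and word j
# (person 2) occur in the same frame, then flattens the normalized
# co-occurrence matrix into a descriptor for the whole interaction.
def cooccurrence_histogram(words_p1, words_p2, vocab_size):
    hist = np.zeros((vocab_size, vocab_size))
    for frame_words_1, frame_words_2 in zip(words_p1, words_p2):
        for i in frame_words_1:
            for j in frame_words_2:
                hist[i, j] += 1
    total = hist.sum()
    if total > 0:
        hist /= total      # convert counts to frequencies
    return hist.ravel()

# Toy usage: 3 frames, vocabulary of 4 words per person.
p1 = [[0, 2], [1], [3, 3]]
p2 = [[1], [1, 2], [0]]
descriptor = cooccurrence_histogram(p1, p2, vocab_size=4)
print(descriptor.shape)    # (16,)

A classifier can then be trained on these fixed-length descriptors, one per interaction video.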

Learning bag of spatio-temporal features for human interaction recognition
Twelfth International Conference on Machine Vision (ICMV 2019)
The Bag of Visual Words (BoVW) model has achieved impressive performance on human activity recognition. However, it is extremely difficult to capture the high-level semantic meaning behind video features with this method, as the spatiotemporal distribution of visual words is ignored, preventing localization of the interactions within a video. In this paper, we propose a supervised learning framework that automatically recognizes high-level human interactions based on a bag of spatiotemporal visual features. First, a representative baseline keyframe that captures the major body parts of the interacting persons is selected, and the bounding boxes containing the persons are extracted to parse the poses of all persons in the interaction. Based on this keyframe, features are detected by combining edge features and Maximally Stable Extremal Regions (MSER) features for each interacting person and are tracked backward and forward over the entire video sequence. From these feature tracks, 3D XYT spatiotemporal volumes are generated for each interacting target. The K-means algorithm is then used to build a codebook of visual features to represent a given interaction, and the interaction is represented by the summed occurrence frequencies of visual words across persons. Extensive experimental evaluations on the UT-Interaction dataset demonstrate the ability of our method to recognize ongoing interactions in videos with a simple implementation.
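The detection and codebook stages of this pipeline can be sketched in Python with OpenCV and scikit-learn. This is a simplified illustration under stated assumptions, not the authors' implementation: the paper's edge features, backward-forward tracking, and XYT volume construction are omitted, and the raw-patch descriptor is a stand-in.

import cv2
import numpy as np
from sklearn.cluster import KMeans

def detect_mser_descriptors(gray_frame):
    """Detect MSER regions and describe each by a small resized patch
    (a stand-in for the paper's combined edge + MSER features)."""
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray_frame)
    descriptors = []
    for pts in regions:
        x, y, w, h = cv2.boundingRect(pts)
        patch = cv2.resize(gray_frame[y:y + h, x:x + w], (8, 8))
        descriptors.append(patch.ravel().astype(np.float32))
    return descriptors

def build_codebook(pooled_descriptors, vocab_size=100):
    """K-means codebook over descriptors pooled from all training videos."""
    return KMeans(n_clusters=vocab_size, n_init=10).fit(np.vstack(pooled_descriptors))

def interaction_histogram(video_descriptors, codebook):
    """Quantize every descriptor from all interacting persons to its nearest
    codeword and sum the occurrence frequencies, mirroring the bag-of-words
    representation of an interaction."""
    words = codebook.predict(np.vstack(video_descriptors))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

Because the resulting per-video histograms are fixed-length, any standard supervised classifier can be trained on them to label the interaction class.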

Human Interaction Recognition Based on the Co-occurrence of Visual Words
2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014