Table 2 RECOGNITION ACCURACY COMPARISON ON CK+ DATASET

wearing different artifacts etc. We also evaluate the performance of the proposed network on the 7-class expression task, where the datasets also include neutral expressions. Furthermore, to extract the facial region, we use the Viola-Jones [30] face detection algorithm instead of the manual face cropping employed by existing methods [11, 12]. This better reflects a real-life scenario and improves the adaptability of the model to real-life conditions, as faces in images do not have a fixed position or background. Moreover, to validate the outcomes, the experimental setup includes an N-fold cross-validation scheme: the image set is partitioned into N equal-sized folds, of which N-1 folds are used for training and the remaining one as the test set. Further, to prepare the datasets, we extract the most expressive image frames and divide them in an 80:20 ratio, 80% for training and the remaining 20% for testing. To train the network, the training set is again divided in a 70:30 ratio, with 70% of the images used to train the network and 30% to validate the accuracy outcomes. The recognition rate of the network is calculated using Eq. 10.
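The splitting scheme described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names (`split_dataset`, `n_fold`, `recognition_rate`) are hypothetical, frame extraction and the network itself are outside its scope, and Eq. 10 is assumed here to be standard accuracy (correctly classified samples over total samples).

```python
import random

def split_dataset(frames, seed=0):
    """Partition the extracted expressive frames as described:
    80:20 into a training pool and a test set, then 70:30 of the
    training pool into training and validation subsets.
    (Illustrative sketch; names and seed are assumptions.)"""
    rng = random.Random(seed)
    frames = frames[:]
    rng.shuffle(frames)
    n_test = int(round(0.20 * len(frames)))   # 20% held-out test set
    test, pool = frames[:n_test], frames[n_test:]
    n_val = int(round(0.30 * len(pool)))      # 30% of training pool for validation
    val, train = pool[:n_val], pool[n_val:]
    return train, val, test

def n_fold(frames, n):
    """N-fold cross-validation: partition into N (near-)equal folds;
    each fold serves once as the test set while the other N-1 train."""
    folds = [frames[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

def recognition_rate(predicted, actual):
    """Recognition rate, assumed to match Eq. 10 as plain accuracy:
    correctly classified samples / total samples."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For example, with 100 frames the 80:20 then 70:30 scheme yields 56 training, 24 validation, and 20 test images, and `n_fold` with N=10 produces ten 90/10 train/test partitions.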