Efficient facial emotion recognition model using deep convolutional neural network and modified joint trilateral filter

Facial emotion recognition extracts human emotions from images and videos. As such, it requires an algorithm to understand and model the relationships between faces and facial expressions and to recognize human emotions. Recently, deep learning models have been utilized to improve the performance of facial emotion recognition. However, deep learning models suffer from overfitting. Moreover, they perform poorly on images with poor visibility and noise. Therefore, in this paper, an efficient deep learning-based facial emotion recognition model is proposed. Initially, contrast-limited adaptive histogram equalization (CLAHE) is applied to improve the visibility of the input images. Thereafter, a modified joint trilateral filter is applied to the enhanced images to remove the impact of impulsive noise. Finally, an efficient deep convolutional neural network is designed, and the Adam optimizer is utilized to optimize its cost function. Experiments are conducted using a benchmark dataset and competitive human emotion recognition models. Comparative analysis demonstrates that the proposed facial emotion recognition model performs considerably better than the competitive models.

The objective of recreating a similar level of intellectual ability in artificial intelligence has motivated researchers from the computer vision and natural language communities to design automatic emotion recognition systems. Facial emotion recognition represents the content of an input image in the form of human emotions by using various machine and deep learning models (Ferreira et al. 2018). Thus, it initially extracts the face information and thereafter provides a descriptive emotion (Alam et al. 2018). Recently, many convolutional neural network (CNN) and recurrent neural network (RNN)-based emotion recognition models have been designed and implemented. Li and Deng (2019) proposed an artificial neural network (ANN) model to recognize facial emotions. Zhang et al. designed a spatial-temporal recurrent neural network to recognize facial emotions. Hierarchical deep learning (HDL) has also been used to extract adaptive facial features and obtain better results.
From the existing literature, it has been found that the decision tree (DT) (Lee et al. 2011; Sun et al. 2019), support vector machine (SVM) (Varma et al. 2020; Kar et al. 2019), random forest (RF) (Valstar et al. 2016; Pu et al. 2015), and artificial neural network (ANN) (Li and Deng 2019) are commonly utilized to recognize human emotions. Jain et al. (2019) designed deep neural networks (DNNs) using deep residual blocks. Wang et al. (2020) combined a CNN and an RNN (CCNNRNN) to classify human emotions. Gupta et al. (2020) implemented a ResNet and attention block (CRAB)-based human emotion recognition model. Lakshmi and Ponnusamy (2021) implemented a modified histogram of oriented gradients (HOG) and local binary pattern (LBP), i.e., HOGLBP, to extract features. Although these methods achieve good performance, they suffer from overfitting. Moreover, these models perform poorly on images with poor visibility and noise.
The main contributions of this paper are as follows:
1. An efficient deep learning-based facial emotion recognition model is proposed.
2. Contrast-limited adaptive histogram equalization (CLAHE) is applied to the input images to improve their visibility.
3. A modified joint trilateral filter is applied to the enhanced images to remove impulsive noise.
4. The Adam optimizer is utilized to optimize the cost function of the deep convolutional neural network.
5. Experiments are conducted using a benchmark dataset and competitive human emotion recognition models.
The remainder of the paper is structured as follows: Section 2 discusses the related work. Section 3 mathematically defines the proposed model. Comparative results are discussed in Sect. 4. Concluding remarks are presented in Sect. 5.

Related work
Hung and Chang (2021) used multilevel transfer learning based on a fine-tuning approach to recognize facial emotions. Vijaya Lakshmi and Mohanaiah (2021) recognized facial emotions by utilizing the whale optimization algorithm (WOA) and teaching-learning-based optimization (TLBO); a multi-support vector neural network (MultiSVNN) was used to build the model based on WOA and TLBO. Liu and Fu (2021) captured human facial emotions using multi-channel electroencephalography (EEG) signals and a textual feature fusion method; features were extracted in both the frequency and spatial domains, and an SVM was trained to recognize the facial emotions. Ngai et al. (2022) enhanced facial recognition using two-channel EEG and an eye modality, with a CNN model used to recognize and classify facial emotions. Zhang et al. (2020) proposed a model to recognize facial emotions using correlation emotion label distribution learning, in which the six basic facial expressions were learned using a constructed convolutional neural network. Chen et al. (2018) recognized facial emotions by applying a softmax regression model, with facial expressions learned through a deep sparse autoencoder network; this overcomes issues such as gradient diffusion and local extrema during model training. Tan et al. (2021) recognized facial expressions using EEG and a multimodal emotion recognition method; the issue of a small dataset was resolved using a Monte Carlo strategy that further improved the results, and a recognition rate of approximately 83.33% was achieved. Wang et al. (2020) fused speech features and facial expressions using a bimodal fusion method to recognize human emotion: facial emotions were recognized by combining an RNN and a CNN, speech emotions were captured using a CNN and an LSTM, and the facial and speech emotions were then fused using a weighted decision fusion method.

Li and Lima (2021) applied ResNet-50 to capture human facial emotions, which helped to improve the robustness and generalization ability of the recognition models. Lakshmi and Ponnusamy (2021) proposed a hybrid model to recognize facial emotions: face regions were selected through the Viola-Jones method, the required features were extracted from the selected regions using the local binary pattern and the histogram of oriented gradients, a deep stacked autoencoder was utilized to reduce the dimensionality of the extracted features, and, lastly, a multi-class SVM was applied to recognize and classify the emotions. Deng et al. (2019) recognized facial emotions using a conditional generative adversarial network-based approach (cGAN). Du et al. (2020) detected the emotions of players using facial expressions and heartbeat signals; heart rate signals were learned through bidirectional long short-term memory (Bi-LSTM) networks, while facial features were learned via a CNN. Continuous emotions have also been recognized by combining facial expressions and EEG. Choi and Song (2020a) used metric learning to recognize continuous emotions. Hua et al. (2019) used an ensembled deep learning model to recognize facial expressions. Choi and Song (2020b) utilized a 2-D landmark feature map to recognize facial micro-expressions.

From the related work, it has been observed that the majority of the existing methods achieve good performance but suffer from overfitting. Additionally, the existing models perform poorly for images with poor visibility and noise.

Proposed model
In this paper, an efficient deep convolutional neural network model is proposed to recognize human emotions from facial images. The proposed model can dynamically focus on the salient features in the images during the training process. Figure 1 represents the flow of the proposed emotion recognition model.
Step 1: Initially, the emotion recognition dataset is considered and loaded into the MATLAB workspace.
Step 2: CLAHE is then applied by using Eq. (1) on the input images to improve the visibility.
Step 3: Modified joint trilateral filter is then applied by using Eq. (2) on the obtained images to remove the effect of impulsive noise.
Step 4: Divide the dataset into the training and testing fractions for building the emotion recognition model.
Step 5: Deep convolutional neural network is applied on the training dataset for training the emotion recognition model.
Step 5.1: The convolution operator along with batch normalization and ReLU is applied on the training dataset.
Step 5.2: Max pooling is then applied on the extracted features obtained from Step 5.1.
Step 5.3: Repeat Steps 5.1 and 5.2 with successive convolution layers until the maximum number of convolution layers is reached. In the proposed model, five convolution layers are utilized (see Fig. 1).
Step 6: Apply fully connected layer on the extracted features.
Step 7: To optimize the coefficients of the deep convolutional neural network, the Adam optimizer is used in the proposed model.
Step 8: Apply softmax and classification layer to build the proposed model.
Step 9: Evaluate and return the performance of the proposed model by applying the built model on the testing dataset.
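To make the flow concrete, the following Python sketch wires Steps 1-9 together. Here `load_ck_plus`, `mjtf_filter`, and `build_cnn` are hypothetical placeholders (concrete sketches for the individual stages are given in the following subsections), and the split ratio, epoch count, and batch size are illustrative rather than the paper's settings.

```python
import cv2
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess(image):
    """Steps 2-3: CLAHE enhancement, then edge-preserving noise removal."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(image)        # Step 2: improve visibility
    filtered = mjtf_filter(enhanced)     # Step 3: suppress impulsive noise
    return filtered.astype(np.float32) / 255.0

# Step 1: load 48 x 48 grayscale face images and integer labels 0-6
# (load_ck_plus is a hypothetical dataset loader).
images, labels = load_ck_plus()

# Steps 2-3 applied image-wise; add a trailing channel axis: (N, 48, 48, 1).
X = np.stack([preprocess(im) for im in images])[..., None]

# Step 4: split into training and testing fractions (ratio illustrative).
X_tr, X_te, y_tr, y_te = train_test_split(X, np.asarray(labels), test_size=0.2)

# Steps 5-8: build and train the CNN (see the architecture sketch below);
# build_cnn is assumed to compile the model with the Adam optimizer.
model = build_cnn()
model.fit(X_tr, y_tr, epochs=50, batch_size=32)

# Step 9: evaluate the trained model on the testing fraction.
print(model.evaluate(X_te, y_te))
```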

Contrast-limited adaptive histogram equalization
Reducing the effect of numerous lighting circumstances is a nontrivial issue in the field of image processing. The precision of emotion recognition is generally low if the visibility of the face changes. However, standard histogram equalization-based approaches are not effective under uneven lighting conditions, as they may result in over-enhanced images (Reza 2004). CLAHE overcomes the excessive amplification of noise by limiting the contrast. It can be implemented as follows: for every pixel, the four adjacent values of the histogram cumulative distribution function (CDF) are mapped to the pixel as

$$I'(D) = (1-\omega_y)\big[(1-\omega_x)\, f_{ul}(D) + \omega_x\, f_{ur}(D)\big] + \omega_y\big[(1-\omega_x)\, f_{bl}(D) + \omega_x\, f_{br}(D)\big] \quad (1)$$

Here, $\omega_x$ and $\omega_y$ denote the normalized distances between the pixel and the center of the upper-left mask, $D$ denotes the pixel coordinates, $f(\cdot)$ denotes the CDF, and $f_{bl}$, $f_{br}$, $f_{ur}$, and $f_{ul}$ denote the below-left, below-right, upper-right, and upper-left values in the current mask (window), respectively.
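As a brief illustration, OpenCV ships a CLAHE implementation that performs exactly this tile-wise clipped equalization with bilinear interpolation of the neighboring CDF mappings. The clip limit and tile grid size below are illustrative, as the paper does not report its CLAHE parameters.

```python
import cv2

# Read a face image in grayscale (the path is illustrative).
img = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)

# Tile-wise equalization: each tile's histogram is clipped at clipLimit
# before its CDF is computed, and every pixel is remapped by bilinearly
# interpolating the four neighboring tile CDFs, as in Eq. (1).
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)
```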

Modified joint trilateral filter
There are many edge-preserving Gaussian noise removal filters, such as the fourth-order partial differential equations-based trilateral filter, the modified joint trilateral filter (Singh and Kumar 2017), the adaptive joint trilateral filter (Jung 2012), and the gain coefficient-based trilateral filter. Trilateral filters (TF) (Choudhury and Tumblin 2003) can achieve edge preservation in only a few iterations without over-smoothing structures such as ridges, and they avoid shifting edge locations, thus reducing noise to a greater extent. These filters perform smoothing by considering three similarity structures based on geometric, photometric, and local neighborhood features (in non-homogeneous areas). The guided trilateral filter (GTF) (He et al. 2013) is an improvement on the TF which produces the output image by considering the contents of a guidance image, which may or may not be the same as the input image; it removes noise from the input image while preserving clear edges. Joint trilateral filters (JTF) (Lo et al. 2013) improve the TF by overcoming gradient reversal artifacts and removing overly dark regions. Modified joint trilateral filters (MJTF) (Singh and Kumar 2017) further improve the JTF by using a threshold value to protect the edges as well as to improve the reliability of the filling values. All these filters come with their own advantages, but a filter is required that can preserve both the edges and the texture details of the images. Therefore, to remove noise from the images, the MJTF (Singh and Kumar 2017) is utilized. It is an edge-preserving filter which does not introduce artifacts into the filtered images. The modified joint trilateral filter (Xiang 2016) keeps edges smooth and authentic, whereas the Gaussian filter and the bilateral filter cause jagged edges and pseudo-shadows. The filtering procedure is prepared under the guidance of an image G_d, the so-called reference or guided image, which is the actual image I_κ itself. The guided image filter is mathematically based on a linear combination; the output image is consistent with the gradient direction of the guidance image, so the problem of gradient reversal does not occur.
Initially, a guided image $G_d$, i.e., the actual image $I_\kappa$ itself, is considered. Assume that $I_q$ and $G_d$ are the illumination values at pixel $k$ of the minimum-channel object and the guided image, respectively. Let $k_r$ be the kernel mask centered at pixel $r$, dependent on the bilateral filter. The modified joint trilateral filter kernel can then be defined as

$$W_{pq}(G_d) = \frac{1}{|n|^2} \sum_{r:(p,q) \in k_r} \left( 1 + \frac{(G_{d_p} - \mu_n)(G_{d_q} - \mu_n)}{\sigma_n^2 + \epsilon} \right) \quad (2)$$

Here, $\mu_n$ and $\sigma_n^2$ denote the mean and variance of $G_d$ in the local window $k_r$, and $|n|$ is the total number of pixels in the window. When $G_{d_p}$ and $G_{d_q}$ are on identical sides of an edge, the weight assigned to pixel $q$ is maximum; when they are on diverse sides, a minimum weight is assigned to pixel $q$.
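Equation (2) matches the self-guided form of the guided filter kernel (He et al. 2013), so a minimal NumPy/OpenCV sketch of that underlying filter is given below. The threshold-based edge-protection modification that distinguishes the MJTF (Singh and Kumar 2017) is omitted, so this is an approximation rather than the full filter; the window radius and regularization constant are illustrative.

```python
import cv2
import numpy as np

def box_mean(x, r):
    """Mean over the (2r + 1) x (2r + 1) window k_r around each pixel."""
    return cv2.blur(x, (2 * r + 1, 2 * r + 1))

def guided_filter(I, G, r=2, eps=1e-3):
    """Edge-preserving smoothing of I under guidance G.

    With G = I (self-guidance, as in the paper), this realizes the kernel
    of Eq. (2): pixels on the same side of an edge as the window mean get
    large weights, while pixels across the edge get small ones.
    """
    I = I.astype(np.float64)
    G = G.astype(np.float64)
    mu_G, mu_I = box_mean(G, r), box_mean(I, r)
    var_G = box_mean(G * G, r) - mu_G * mu_G      # sigma_n^2 per window
    cov_GI = box_mean(G * I, r) - mu_G * mu_I
    a = cov_GI / (var_G + eps)                    # per-window linear model
    b = mu_I - a * mu_G
    return box_mean(a, r) * G + box_mean(b, r)    # average overlapping models

denoised = guided_filter(enhanced, enhanced)      # guided image = input image
```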

Detailed architecture of CNN model
A convolutional neural network systematically applies filters to an input and creates output feature maps. In the present study, a total of twenty-two CNN layers have been used to train and classify seven facial emotions from the CK+ dataset. Of these 22 layers, 1 input layer, 1 output layer, 1 fully connected layer, 1 softmax layer, 5 convolutional layers, 5 ReLU layers, 5 batch normalization layers, and 3 max pooling layers have been utilized. Figure 2 shows the description of the various layers used in the proposed model, and Fig. 3 represents the detailed architecture of the proposed CNN model for facial emotion recognition.

Batch Normalization:
A batch normalization layer independently normalizes a mini-batch of data across all observations for the single channel used in the CK+ dataset for grayscale images. This layer has 48 × 48 × 8 activations. Both the offset and scale matrices have size 1 × 1 × 8.

ReLU:
The rectified linear unit (ReLU) layer performs a nonlinear threshold operation, where any input value less than 0, received from the previous batch normalization layer, is set to zero. This layer has 48 × 48 × 8 activation units.

Max Pooling:
The maximum pooling operation downsamples the image feature matrix by dividing the input matrix into pooling regions and selecting the maximum value from each region. In this work, 2 × 2 max pooling with stride [2 2] and padding [0 0 0 0] has been utilized. This layer has 24 × 24 × 8 activations.

Convolution:
In the second convolution layer, 16 3 × 3 convolutions with stride [1 1] and padding 'same' have been utilized. This means the network pads evenly at the left and right edges; if the number of columns to be added is odd, the extra column of zeros is added at the right, and if the number of rows to be added is odd, the extra row of zeros is added at the bottom. This layer has 24 × 24 × 16 activation units. It also has 3 × 3 × 8 × 16 learnable weights, and the bias size is 1 × 1 × 16.

Fully Connected:
A fully connected layer multiplies the ReLU output received from the previous step by a weight matrix and then adds a bias vector. In this work, the fully connected layer maps the 4608 extracted features to the seven emotion classes, so a 7 × 4608 weight matrix is learned. This layer has 1 × 1 × 7 activations.

Softmax:
The softmax function normalizes the output received from the channel dimension of the fully connected layer so that it sums to one. Its output can therefore be regarded as a probability distribution over the seven emotion classes.

Classification Output:
A classification layer has been utilized to compute the cross-entropy loss for classification into the seven mutually exclusive facial emotion classes. The size of the output matrix is 1 × 1 × 7.
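Pulling the layer descriptions together, the following Keras sketch reproduces the 22-layer stack. The filter counts of the first two convolution layers (8 and 16) and the 4608-unit input to the fully connected layer come from the text; the filter counts of the last three convolution layers (32, 64, 128) and the placement of the third max-pooling layer are assumptions chosen so that the final feature map is 6 × 6 × 128 = 4608.

```python
from tensorflow.keras import layers, models

def conv_block(filters):
    """Convolution + batch normalization + ReLU (Step 5.1)."""
    return [
        layers.Conv2D(filters, 3, strides=1, padding="same"),
        layers.BatchNormalization(),
        layers.ReLU(),
    ]

model = models.Sequential(
    [layers.Input(shape=(48, 48, 1))]                 # grayscale CK+ images
    + conv_block(8)  + [layers.MaxPooling2D(2, 2)]    # 48x48x8  -> 24x24x8
    + conv_block(16) + [layers.MaxPooling2D(2, 2)]    # 24x24x16 -> 12x12x16
    + conv_block(32)                                  # 12x12x32 (assumed)
    + conv_block(64)                                  # 12x12x64 (assumed)
    + conv_block(128) + [layers.MaxPooling2D(2, 2)]   # 12x12x128 -> 6x6x128
    + [
        layers.Flatten(),                             # 4608 features
        layers.Dense(7),                              # fully connected, 7 classes
        layers.Softmax(),                             # probability distribution
    ]
)

# Cross-entropy classification output; Adam settings are discussed below.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```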

Adam optimizer
To optimize the coefficients of the deep convolutional neural network, the Adam optimizer is used in the proposed model. Adam (Kingma and Ba 2015) is an extension of stochastic gradient descent that can be used in place of the classical procedure to update network weights more efficiently. The name Adam is derived from adaptive moment estimation. Adam is an adaptive learning rate optimization algorithm designed specifically for training deep neural networks, and it combines the advantages of two other extensions of stochastic gradient descent: like RMSprop, it uses the squared gradients to scale the learning rate, and, like SGD with momentum, it uses a moving average of the gradient instead of the gradient itself. Adam uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network. Some of Adam's advantages are that the magnitudes of the parameter updates are invariant to rescaling of the gradient, its step sizes are approximately bounded by the step size hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing. In Algorithm 1, α is the step size, β_1 and β_2 ∈ [0, 1) are the exponential decay rates for the moment estimates, f(θ) is the stochastic objective function with parameter vector θ, θ_0 is the initial parameter vector, m_0 and v_0 are the first and second moment vectors, and t is the timestep. In this paper, α = 0.002, β_1 = 0.85, β_2 = 0.9, and ε = 10^−9.
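As a minimal NumPy sketch of the update formalized in Algorithm 1 below, using the hyperparameter values reported above (the function is called with t = 1, 2, ... for each mini-batch gradient):

```python
import numpy as np

def adam_step(theta, grad, m, v, t,
              alpha=0.002, beta1=0.85, beta2=0.9, eps=1e-9):
    """One Adam update with the hyperparameter values used in this paper."""
    m = beta1 * m + (1 - beta1) * grad        # update biased 1st moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # update biased 2nd moment
    m_hat = m / (1 - beta1 ** t)              # bias-corrected 1st moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected 2nd moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```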

Algorithm 1: Adam optimizer
Require: α, β_1, β_2 ∈ [0, 1), f(θ), and θ_0.
Ensure: Set the values of m_0, v_0, and t to 0.
1: while θ_t not converged do
2:   t ← t + 1
3:   g_t ← ∇_θ f_t(θ_{t−1})
4:   m_t ← β_1 · m_{t−1} + (1 − β_1) · g_t
5:   v_t ← β_2 · v_{t−1} + (1 − β_2) · g_t²
6:   m̂_t ← m_t / (1 − β_1^t)
7:   v̂_t ← v_t / (1 − β_2^t)
8:   θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ε)
9: end while
10: return θ_t

Experimental results
Experiments are conducted on the extended Cohn-Kanade (CK+) dataset, which is the most commonly used test bed for emotion recognition. The universally recognized human emotions (Kumari and Bhatia 2020) are as follows: anger, contempt, disgust, happiness, fear, surprise, and sadness. Some examples of the CK+ dataset are shown in Fig. 4. Table 1 shows the number of images that belong to each class. There are seven emotions, i.e., target classes, for training and testing the proposed model; these classes are labeled from 0 to 6, respectively. The experiments are performed in MATLAB 2020b on an Intel Core i7 processor with 16 GB RAM and a single GPU. Table 2 shows the hyperparameters of the proposed model along with the values selected in this paper for experimental purposes; the values are selected on a trial-and-error basis.

Performance metrics for proposed model
In this section, various confusion matrix-based performance metrics (Kumari and Rekha 2021) are discussed and used to evaluate the performance of the proposed facial emotion recognition model.

Accuracy
It evaluates the ratio of the total number of correctly recognized emotions to the actual number of emotions. The accuracy (A_c) can be defined as

$$A_c = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \times 100 \quad (4)$$

Here, $T_p$, $T_n$, $F_p$, and $F_n$ define the true positive, true negative, false positive, and false negative values, respectively. $A_c \in [0, 100]$, and a value of 100 is desirable.

Precision
Precision quantifies the number of positive class predictions that actually belong to the positive class. Precision (p) can be evaluated as

$$p = \frac{T_p}{T_p + F_p} \quad (5)$$

Recall
Recall quantifies the number of positive class predictions made out of all the positive examples. Recall (r) can be computed as

$$r = \frac{T_p}{T_p + F_n} \quad (6)$$

F-measure analysis
F-measure is utilized to compute the weighted harmonic mean of precision (p) and recall (r); therefore, it accounts for both false positives and false negatives. Mathematically, the F-score can be computed as

$$F = \frac{2 \times p \times r}{p + r} \quad (7)$$

Table 3 shows the performance analysis of the existing and proposed models with and without the CLAHE and modified joint trilateral filter (MJTF) preprocessing. It clearly shows that the proposed model outperforms the existing model (i.e., without CLAHE and MJTF) in terms of accuracy, precision, recall, and F-measure. The existing model achieves accuracy, precision, recall, and F-measure values of 94.81%, 93.22%, 93.04%, and 93.12%, respectively, whereas the proposed model achieves 98.01%, 97.08%, 97.06%, and 97.03%, respectively, on the test dataset. In the confusion matrix of the proposed model, the diagonal values represent correctly recognized emotions, while the remaining vertical values represent the corresponding false values. For example, taking anger as the target class, the confusion matrix has false positives (F_p) = 0 and false negatives (F_n) = 0. Overall, the analysis indicates that the proposed model with CLAHE and the modified joint trilateral filter obtains better performance than the model without them.
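For reference, the following NumPy sketch derives Eqs. (4)-(7) from a 7 × 7 confusion matrix. Macro-averaging across the seven classes is an assumption here, since the paper reports single aggregate values without stating its averaging scheme.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Accuracy, precision, recall, and F-measure from a confusion matrix
    (rows = actual class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                      # correctly recognized per class
    fp = cm.sum(axis=0) - tp              # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp              # belongs to the class, but missed
    a_c = 100.0 * tp.sum() / cm.sum()     # Eq. (4), multi-class form
    p = np.mean(tp / (tp + fp))           # Eq. (5), macro-averaged
    r = np.mean(tp / (tp + fn))           # Eq. (6), macro-averaged
    f = 2 * p * r / (p + r)               # Eq. (7)
    return a_c, 100 * p, 100 * r, 100 * f
```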

Comparative analysis
From Table 4, it is found that the proposed facial emotion recognition model achieves better performance than the existing emotion recognition models; in particular, comparative analysis reveals that it outperforms the competitive models in terms of accuracy.

Future scope
Further improvement in the network performance can be achieved by using a larger dataset for training as well as for validation and testing. In this paper, standard hyperparameters of the deep convolutional neural network are selected for building the facial emotion recognition model; the effect of hyperparameter tuning is not considered (Jiang et al. 2021; Basavegowda and Dagnew 2020; Xu and Qiu 2021). Therefore, in the future, the hyperparameters of the proposed model will be optimized using various optimization approaches. The proposed model can also be applied to other kinds of applications, such as human behavior identification (Ghosh et al. 2020) and speaker-aware information logging. Besides, other visibility restoration and segmentation techniques (Gupta et al. 2019) can be used to enhance the images. Several different datasets can also be considered for feature engineering by carefully normalizing the images contained in them. Researchers can extend this work using deep transfer learning models (Zakraoui et al. 2019).

Conclusion
It has been observed that the existing facial emotion recognition models perform poorly for images with poor visibility and noise. To handle these problems, an efficient deep learning-based facial emotion recognition model was designed. Contrast-limited adaptive histogram equalization (CLAHE) was applied to improve the visibility of the input images, and the modified joint trilateral filter (MJTF) was applied to the enhanced images to remove the impact of noise. Finally, a deep convolutional neural network was applied to the CLAHE- and MJTF-based feature matrix, and the Adam optimizer was used to optimize its cost function. Extensive experiments were conducted using the CK+ facial emotions dataset. Comparative analysis has demonstrated that the proposed facial emotion recognition model achieves better results than the existing models in terms of various performance metrics.
Funding This research received no specific grant from any funding agency.

Conflict of interest
The authors declare no conflict of interest regarding the publication of this paper.