Real emotion seeker: recalibrating annotation for facial expression recognition

Facial expression recognition (FER) is a challenging classification task. Owing to the subjectivity and ambiguity of both performers and spectators, compound facial expressions are hard to represent with a one-hot label. In this paper, a simple but efficient method named Real Emotion Seeker (RES) is proposed to recalibrate the annotation of each sample to a latent expression distribution in addition to the one-hot label. In particular, subjective implicit knowledge is transformed through Bayesian inference into a posterior distribution specific to each FER data set, thus enhancing universality and authenticity. The posterior distribution is then combined with the one-hot label to form the recalibrated annotation, which serves as an additional supervision guiding the model toward more realistic predictions. Our proposed method is independent of the backbone network and improves accuracy significantly, by an average of 3.16%, with no extra burden on training or inference. Extensive experiments show that RES obtains predictions consistent with human subjective intuition. Results on three in-the-wild data sets demonstrate that our approach achieves advanced results with 90.38% on RAF-DB, 90.34% on FERPlus and 62.63% on AffectNet.


Introduction
In the process of human communication, non-verbal components convey around two-thirds of the information [1]. As the most important form of non-verbal communication, facial expression plays an important role in many applications, such as human-computer interaction, safe driving and criminal investigation. Therefore, Facial Expression Recognition (FER) is an interesting and important task in computer vision. Currently, deep learning-based FER algorithms have achieved great performance on lab-controlled data sets (e.g., CK+ [2], MMI [3], Oulu-CASIA [4]), but many challenges remain on large-scale in-the-wild data sets (e.g., RAF-DB [5], FERPlus [6], AffectNet [7]).
Although there are only a few categories in FER, expressions are hard to distinguish clearly for several reasons. First, most expressions are mixtures and combinations of basic emotions [8]. As shown in Fig. 1a, b, expressions are compound in different circumstances: the expression in Fig. 1a is a mixture of surprise and happiness, while the expression in Fig. 1b is a mixture of surprise and fear. It is therefore unreasonable to annotate an image with a single basic expression. Second, the transition between different expressions is gradual rather than abrupt, which makes expressions ambiguous and without a clear boundary. The image sequence in Fig. 1c shows how an expression changes from surprise to anger.
Each image in the sequence presents a mixture of surprise and anger at different intensities, indicating that the intermediate stages of an expression change are not easy to identify. Third, people hold diverse opinions on facial expressions, as their intuitions are subjective and influenced by many factors (e.g., psychology, gender, age, race). For these reasons, it is hard to precisely describe a facial expression with a single discrete annotation, which limits algorithm performance, especially on in-the-wild data sets.
To prevent the overfitting and overconfident predictions caused by one-hot labels, several attempts have been made to mine real distributions for FER. Label Distribution Learning (LDL) [9][10][11] methods applied to FER rely on a small data set with label distributions [10] or on the distributions of similar expressions in auxiliary tasks [11]. She et al. [12] mine the latent distributions of expressions with multiple branches trained on samples from negative classes. The methods mentioned above introduce massive computation during training, which greatly increases training cost.
In this paper, we propose a simple but efficient method named Real Emotion Seeker (RES) for effective FER. RES utilizes subjective implicit knowledge and Bayesian inference to recalibrate each category of samples in the data set, expanding the one-hot label to a latent expression distribution that serves as a new constraint during training. In particular, to decouple compound expressions in the label space, we use professional knowledge that includes the degree of coupling between expressions and the probability of latent expressions. Bayesian inference is applied to transfer this implicit knowledge to different data sets, obtaining a data set-specific posterior distribution with the help of the sample distribution of the data set. The posterior distribution is then fused with the one-hot label to recalibrate the annotation, acting as an additional constraint on model training. Benefiting from RES, existing backbones (e.g., ResNet-18, ResNet-50IBN) can learn better representations to recognize facial expressions more accurately and realistically. Experimental results show that the proposed RES improves recognition accuracy significantly, by an average of 3.16% over the baseline on three in-the-wild data sets with different architectures as backbones. We also demonstrate that RES yields more realistic predictions of expressions and achieves competitive performance with 90.38% on RAF-DB, 90.34% on FERPlus and 62.63% on AffectNet, without extra cost for training and inference.
Overall, our contributions can be summarized as follows: 1. We introduce a novel method named RES to mine the real emotion of expressions, which recalibrates the one-hot label of each sample to a latent expression distribution and significantly improves performance without increasing the burden on training and inference. 2. Benefiting from Bayesian inference, the implicit knowledge [13] can be transformed into a posterior distribution specific to each data set, obtaining predictions consistent with human subjective intuition. 3. Our proposed RES achieves leading performance on three in-the-wild FER data sets.
Related work

Facial expression recognition
Deep learning-based methods [11, 17-22] are currently the mainstream choice. Most of them use Convolutional Neural Networks (CNNs) [11, 17-19], while the introduction of Generative Adversarial Networks (GANs) can simultaneously solve multiple face-related tasks [20-22], such as facial image synthesis. Recent efforts to handle compound expressions fall into two branches: one based on artificially annotated information, and the other based on innovative algorithms. For the former, Vo et al. [13] take advantage of the professional voting results provided by a FER data set and convert them into implicit knowledge that can be integrated with label smoothing. For the latter, Chen et al. [11] mine the realistic information of Action Units (AUs) and facial landmarks beyond one-hot labels by constructing K-Nearest-Neighbor graphs. Wang et al. [23] introduce a novel framework in which low-importance samples identified by a self-attention module are relabeled as the class with the maximum predicted probability. On this issue, She et al. [12] exploit the latent distribution with the help of an auxiliary multi-branch framework in the label space, while the extent of ambiguity is estimated by the pairwise relationship of semantic features in the instance space. TransFER [24] applies a Transformer to FER for the first time, where a multi-attention dropping module randomly discards attention maps so that the model can better learn the correlations between different local blocks.
Compared with the above methods, our RES can convert implicit knowledge to data set-specific posterior distribution, avoiding the need for manual annotations on each data set. Moreover, RES brings no burden for training and inference, which can improve performance in a simple and efficient way.

Label distribution learning
Owing to the ambiguity of one-hot labels, Label Distribution Learning (LDL) [9] is a novel and effective paradigm for learning with ambiguity by mapping samples to label distributions. A label distribution describes the degree to which a sample belongs to each possible label in a fine-grained way, which is more general than a one-hot label. Geng et al. [9] divide LDL into three branches: (1) problem transformation (PT) transforms the LDL problem into weighted single-label learning; a Bayes classifier and SVM are used to predict the label distribution.
(2) Algorithm adaptation (AA) extends K-Nearest Neighbor and backpropagation neural networks to discover the label distribution. (3) Specialized algorithms (SA) learn the label distribution using the maximum entropy model with Kullback-Leibler divergence. Benefiting from these methods, LDL has practical applications such as head pose estimation [25], facial landmark detection [26] and age estimation [27,28].
Label enhancement (LE) [29-32] mines the label distribution from the given one-hot labels in the training set, addressing the difficulty of obtaining label distributions directly. Graph Laplacian label enhancement (GLLE) [29] combines the topological information in the feature space with the relationships among labels in the label space to recover label distributions from one-hot labels. Xu et al. [30] introduce a more generative model with variational inference to obtain the label distribution.
However, FER differs from traditional classification tasks such as ImageNet in that it has few categories but large intra-class variations. Algorithms should distinguish different expressions on the same face and find similarities of the same expression on different faces, so it is hard to extract features that are sensitive to expression change yet robust to different faces. Consequently, several works [10,11] have begun to explore the emotion distribution with LDL. Zhou et al. [10] introduce EDL to discover expression distributions from one-hot labels by learning the correlation between expressions. Chen et al. [11] construct a similarity graph and utilize the relationship between the predictions of a central image and its neighbors.
In our work, we recalibrate the annotation of each sample to a latent expression distribution besides the one-hot label, which includes information about human subjective intuition and the associations between expressions. In addition, the recalibration process avoids complicated and massive calculations during training.

Methods
Notation Given a FER data set (X, Y) with C classes, we denote x ∈ X as an input image and y ∈ Y = {1, ..., C} as its corresponding one-hot label.
For each training sample x, the softmax function generates the prediction p(ŷ|x). The i-th element of the prediction after sharpening [33][34][35] is p̃_i = p_i^{1/T} / Σ_{j=1}^{C} p_j^{1/T}, where T is the temperature factor. We omit x in the distributions for simplicity in the following.
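As a concrete illustration, the sharpening operation can be sketched in a few lines (a minimal NumPy sketch under our reading of the formula; the variable names are ours):

```python
import numpy as np

def sharpen(p, T=5.0):
    """Temperature-adjust a probability distribution:
    p̃_i = p_i^(1/T) / Σ_j p_j^(1/T).
    With T > 1 (as in the paper, T = 5) the result is softer."""
    p = np.asarray(p, dtype=np.float64)
    q = p ** (1.0 / T)
    return q / q.sum()

# A confident 7-class prediction becomes softer with T = 5.
pred = np.array([0.80, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01])
soft = sharpen(pred, T=5.0)
```

The softened prediction keeps the same argmax but spreads probability mass across the latent expressions.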

Overview of RES
As shown in Fig. 1, expressions are compound and other latent expressions exist, so one-hot labels cannot reflect the real emotions of images. To solve this problem, we propose a simple and effective method named RES, which recalibrates the annotation of samples to latent expression distributions besides one-hot labels. An overview of RES is depicted in Fig. 2. First, the implicit knowledge of human intuition on expressions is extracted from the professional voting results on one given data set. Second, for a given image in any FER data set annotated with a one-hot label, the data set-specific posterior probability of the latent expression is calculated by Bayesian inference. Third, the posterior distribution is fused with the one-hot label to recalibrate the annotation, which serves as a constraint for training the backbone network. With the help of RES, the predictions are more consistent with real emotions.

Fig. 2 Overview of our RES. Given an expression image x annotated with y, the prediction of this image and the prediction after sharpening are p(ŷ) and p̃(ŷ). The implicit knowledge of human intuition on expressions p_imp(y|z) is extracted from the professional voting results. Then the posterior distribution p_post(z|y) is calculated by Bayesian inference, which is specific to each data set. Furthermore, the posterior distribution is combined with the one-hot label distribution q(z) to form the recalibrated annotation p_f(z), which acts as a constraint for backbone training. Both L_CE and L_f are terms of the overall loss function

Fig. 3 Proportions of each expression on three FER training data sets. The total numbers and standard deviations of the three training sets are provided for reference. The sample distributions in different data sets are quite different; AffectNet has an even more serious imbalance problem than the other data sets

Recalibrate annotation by Bayesian inference
Implicit knowledge Implicit knowledge determines what we learn from experience and can greatly improve the efficiency of learning [36]. For FER, since expressions are strongly subjective, implicit knowledge of human intuition can significantly improve performance; the more manual voting results it is inferred from, the closer the implicit knowledge is to the real emotion. To find appropriate knowledge for FER, inspired by [13], we regard the subjective voting results officially provided by FERPlus as implicit knowledge. These results are voted by 10 professionally trained annotators, ensuring the rationality of the knowledge. In the FERPlus training set S, the approximate distribution v_s ∈ ℝ^C over a sample s is calculated from the voting results, with Σ v_s = 1. The average label distribution of the z-class is d_z = (1/|S_z|) Σ_{s∈S_z} v_s, where |S_z| is the size of the subset S_z ⊂ S that contains the images annotated with expression z. In this paper, d_z denotes the implicit knowledge p_imp(y|z), i.e., the probability of an image being labeled as y when its latent expression is most likely z.
The specific values of d_z are given as rows in Table 1.
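The averaging that produces d_z can be sketched as follows (the voting matrix here is an invented toy stand-in, not the real FERPlus statistics):

```python
import numpy as np

# Toy stand-in for FERPlus voting results: each row is one image's
# normalized annotator vote distribution over C classes (values are
# invented for illustration only).
C = 3
votes_for_class_z = np.array([
    [0.7, 0.2, 0.1],
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
])  # subset S_z: images whose one-hot label is class z

# d_z = (1/|S_z|) * Σ_{s∈S_z} v_s  -- the implicit knowledge p_imp(y|z)
d_z = votes_for_class_z.mean(axis=0)
```

Each row of Table 1 would be one such averaged vector, one per labeled class.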
Bayesian recalibration As most data sets do not provide a large number of manual annotations, the latent expression distribution is obtained by calculating the posterior probability according to Bayesian inference, which can be formulated as p_post(z = i|y = j) = p(z = i) p_imp(y = j|z = i) / Σ_{k=1}^{C} p(z = k) p_imp(y = j|z = k), where i, j ∈ {1, ..., C}. p(z = i) denotes the proportion of expression i in the FER data set, as shown in Fig. 3. p_imp(y|z = i) denotes the probability of an image being labeled as expression y when its latent expression is i, in other words, d_iy in Table 1. p_post(z = i|y) denotes the probability that the latent expression of an image is i when it is labeled as expression y.
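The Bayesian inference step can be sketched as follows (the implicit-knowledge matrix and class prior below are invented toy values for a 2-class illustration; variable names are ours):

```python
import numpy as np

# d[i, y] = p_imp(y | z = i): toy implicit-knowledge matrix.
# Rows are latent expressions z, columns are observed labels y.
d = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
])
prior = np.array([0.8, 0.2])  # p(z = i): class proportions of the data set

def posterior(y):
    """p_post(z = i | y) = p(z = i) * p_imp(y | z = i) / Σ_k p(z = k) * p_imp(y | z = k)."""
    joint = prior * d[:, y]
    return joint / joint.sum()

p_post = posterior(0)  # latent-expression distribution for label y = 0
```

Because the prior comes from the target data set's sample proportions, the same implicit knowledge yields a different posterior on each data set.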
Then we fuse the posterior distribution with the one-hot label to recalibrate the annotation, which is used as a constraint for training the backbone network. This constraint includes information about human subjective judgments and the associations between expressions: p_f(z) = β q(z) + (1 - β) p_post(z|y), where β is a trade-off ratio, p_post(z|y) denotes the posterior distribution, and q(z) denotes the one-hot label distribution, i.e., q(z) = 1 if z = y and 0 otherwise.
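The fusion with the one-hot label can be sketched as follows (a minimal sketch; the value of beta, the label y and the posterior values are illustrative only):

```python
import numpy as np

def recalibrate(y, p_post, beta, C):
    """p_f(z) = beta * q(z) + (1 - beta) * p_post(z | y),
    where q is the one-hot distribution of label y."""
    q = np.zeros(C)
    q[y] = 1.0
    return beta * q + (1.0 - beta) * p_post

p_post = np.array([0.6, 0.3, 0.1])   # toy posterior for illustration
p_f = recalibrate(y=0, p_post=p_post, beta=0.25, C=3)
```

A smaller beta lets the posterior dominate, while a larger beta keeps the annotation closer to the original one-hot label.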

Training strategy
Given an expression image x whose label is y in the FER data set, we feed it into the backbone network (e.g., ResNet-18, ResNet-50IBN, ShuffleNetV1, MobileNetV2) and obtain the prediction p(ŷ). We choose these four networks as backbones for the following reasons. First, the existing SOTA methods in FER almost all adopt the classical ResNet, so we choose it for a fair comparison. Second, MobileNet, as a lightweight network, fits the characteristics of our approach, indicating that good performance can be achieved even with few parameters. Finally, ShuffleNet further verifies the universality and versatility of our method.
To transfer the knowledge in the recalibrated annotation to the backbone network, we use the Kullback-Leibler divergence, written as L_f = KL(p_f(z) ‖ p̃(ŷ)), where p_f(z) and p̃(ŷ) are the recalibrated annotation and the prediction after sharpening [33][34][35].
The overall loss function is L = L_CE + λ L_f, where L_CE represents the cross-entropy loss and λ denotes a trade-off ratio. Under the guidance of this loss function, the backbone network acts as a real emotion seeker to obtain predictions consistent with human subjective intuition.
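Putting the pieces together, the overall objective can be sketched as follows (a NumPy sketch under our reading of the loss; in practice this would be computed on logits inside the training loop, and the variable names are ours):

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def res_loss(pred, pred_sharpened, y, p_f, lam):
    """L = L_CE + lam * L_f, with L_f = KL(p_f || sharpened prediction)."""
    l_ce = -np.log(pred[y] + 1e-12)    # cross-entropy with one-hot label y
    l_f = kl_div(p_f, pred_sharpened)  # constraint from recalibrated annotation
    return l_ce + lam * l_f

# When the sharpened prediction matches p_f, only the CE term remains.
pred = np.array([0.7, 0.2, 0.1])
loss = res_loss(pred, pred_sharpened=pred, y=0,
                p_f=np.array([0.7, 0.2, 0.1]), lam=0.55)
```

The KL term pulls the (softened) prediction toward the recalibrated annotation while the cross-entropy term keeps the one-hot supervision.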
It is worth mentioning that Bayesian inference can transfer the implicit knowledge based on FERPlus to other data sets, and it only needs to be calculated once in the migration process for each data set. Moreover, RES has the significant advantage that it brings no extra burden for training and inference, which proves the simplicity and effectiveness of the proposed RES.

Experiment
In this section, we first introduce three publicly available in-the-wild FER data sets which researchers are authorized to use for research work. Then we compare different backbone networks with RES to the baseline. In addition, we explore real emotions and demonstrate the superiority of the proposed RES for better estimating the real distribution. Finally, we compare our approach with SOTA methods on the three in-the-wild FER data sets and conduct extensive ablation studies.

RAF-DB [5] includes about 30,000 facial expression images downloaded from the Internet, which are annotated by 40 professional annotators. The data set is divided into two subsets: a basic expressions subset and a compound expressions subset. In our experiment, we only use the former subset with 7 basic expressions (surprise, fear, disgust, happy, sad, anger and neutral), in which 12,271 images are used as the training set and 3068 images as the test set.
FERPlus [6] is extended from FER2013 [37], which was used in the ICML 2013 Workshop on Challenges in Representation Learning. The data set, collected with the Google search engine, consists of 28,709 training images, 3589 validation images and 3589 test images, all of which are aligned and resized to 48 × 48. The difference between FER2013 and FERPlus is the number of categories: in FERPlus, each image has been annotated by 10 annotators with 8 classes, where contempt is added, and the most voted class is regarded as the one-hot label. Besides, the voting results of each annotator for each image are given and can be used as implicit information to estimate the distribution of each image.
AffectNet [7] is the largest FER data set so far and provides both Valence-Arousal annotations and one-hot labels. There are about 450,000 available images with 8 classes tagged by trained annotators, out of more than one million images collected by three search engines with expression-related keywords. The available images provide 280K images as the training set and 4K images as the test set. Existing methods mostly adopt the seven classes of AffectNet, excluding contempt.

Implementation details
By default, the backbones are all pre-trained on the MS-Celeb-1M face recognition data set. In pre-processing, cropped facial images are aligned and then resized to 256×256 pixels; random crop and horizontal flip are applied to obtain 224×224 pixel images. We train our model with a single Nvidia Tesla P40 GPU and a batch size of 72, where every batch is guaranteed to contain every class of image. The Adam optimization algorithm [38] with a weight decay of 10^-4 is used as the optimizer. The learning rate is set to 10^-3 initially and then divided by 10 after 10 and 20 epochs. The training process lasts for 40 epochs.
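The step learning-rate schedule described above can be expressed as a small helper (a sketch of the stated schedule; the function name is ours):

```python
def learning_rate(epoch, base_lr=1e-3):
    """Step schedule from the paper: base LR 1e-3, divided by 10
    after epochs 10 and 20, over 40 training epochs in total."""
    lr = base_lr
    if epoch >= 10:
        lr /= 10
    if epoch >= 20:
        lr /= 10
    return lr
```

Equivalent behavior is available via a framework scheduler such as a multi-step decay with milestones at epochs 10 and 20.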

Performance evaluation
Compound facial expressions are difficult to represent with one-hot labels owing to the subjectivity and ambiguity of performers and spectators. Therefore, mining the real emotion in facial expressions can significantly improve performance in FER.
Evaluation on different backbone networks To verify the effectiveness of our method, we train different backbone architectures with the proposed RES and compare them with the baseline. ResNet-18, ResNet-50IBN, ShuffleNetV1 and MobileNetV2 are used as backbones for comparison.
The baseline is trained with the cross-entropy loss and has the same architecture as the backbones. As shown in Table 2, the proposed RES stably improves the performance of all the backbones by an average of 3.10%, 2.76% and 3.62% on RAF-DB, FERPlus and AffectNet, respectively. Besides, ResNet-50IBN achieves the best results owing to its large number of parameters and the IBN module. We can conclude that our RES is independent of the backbone architecture. It is worth mentioning that the network structure of RES is the same as the baseline, yet it achieves significant performance improvement without bells and whistles.
The effect of Bayesian inference To verify the validity of Bayesian inference, we compare the recognition accuracy of different methods. The baseline is ResNet-18 trained with the cross-entropy loss. Vo et al. [13] combine subjective knowledge obtained from FERPlus [6] with label smoothing. The proposed RES utilizes Bayesian inference to convert implicit knowledge into a posterior distribution specific to each data set. RES (w/o Bayesian inference) denotes a variant that does not use Bayesian inference and directly uses the implicit knowledge as guidance.
As shown in Table 3, both RES (w/o Bayesian inference) and RES outperform PSR and the baseline by a large margin, demonstrating the effectiveness of the proposed method. When combining Bayesian inference, we achieve a considerable performance improvement. Since the implicit knowledge is calculated on FERPlus, the improvement on FERPlus is limited. However, for the other two data sets, as the knowledge of the data set sample distribution is added, we achieve performance improvements of 0.30% and 0.65%, respectively. This improvement is attributed to the fact that Bayesian inference can combine implicit knowledge and the data set sample distribution to infer the data set-specific posterior distribution. Therefore, RES is more versatile than PSR, and the predictions obtained by RES are closer to real emotions.
2D feature visualization We utilize t-SNE [39] to analyze the feature embeddings of ResNet-18 trained on the three in-the-wild data sets. As shown in Fig. 4, RES yields smaller intra-class variations than the baseline and enlarges the inter-class distances to better distinguish the differences between expressions, demonstrating the effectiveness of RES. Owing to the serious ambiguity of AffectNet, it is still difficult to distinguish some expressions, such as disgust and anger; however, RES alleviates this problem to a certain extent and achieves better results.
Prediction and correlation To prove the feasibility and reliability of the proposed method, we conducted a subjective experiment on human intuition about expressions with 141 annotators instructed through an online facial expression annotation assignment. Each annotator signed a confidentiality agreement and was told that the experiment was completely anonymous and that only the voting results would be used to support our views. Note that we take the subjective voting results as the real distributions of facial expressions. The real distributions and the predictions of the baseline and RES are shown in Fig. 5. We observe that the baseline tends to make overconfident predictions, which contradicts the fact that expressions are compound. Besides, RES can revise mispredictions of the baseline to the correct label. We also calculate the Kullback-Leibler divergence between the real distributions and the predictions as a correlation index for reference. The results quantitatively show that this divergence declines greatly after applying RES, which is sufficient to prove that our RES obtains more realistic estimations. But for expressions that are too ambiguous, such as the one in the lower right corner, the proposed method gives a prediction that is more likely to be the real emotion rather than the label in the data set, which leads to a slight decrease in accuracy (Fig. 6).
Confusion matrices As shown in Fig. 6, the confusion matrices present the per-category accuracy on the three in-the-wild data sets, obtained by ResNet-50IBN with RES applied. To better prove the effectiveness of RES, we compare our method with PSR [13]. Figure 6a shows the confusion matrix for RAF-DB, where happy has the best accuracy of 95%; fear and disgust are less accurate because the similarity between fear and surprise is high. As shown in Fig. 6c, happy, neutral and surprise have accuracy above 90% on FERPlus, while contempt has the lowest accuracy of 46% due to its small class size. As for AffectNet, the accuracy of each category is not as high as on RAF-DB and FERPlus, because AffectNet is severely imbalanced; therefore, existing methods tend to exclude contempt and calculate 7-class accuracy for AffectNet. However, an advantage of our method is that RES achieves considerable accuracy on classes with small sample sizes, especially contempt with an accuracy of 60%, which proves that RES can weaken the effect of imbalance to a certain extent and obtain more accurate results.

Comparison with SOTA methods Table 4 compares our approach with SOTA methods on RAF-DB, FERPlus and AffectNet. Both DDA [40] and DACL [41] apply deep metric learning (DML) approaches to FER, and the latter targets the problem of class imbalance. LDL-ALSG [11], PSR [13], SCN [23] and DMUE [12] focus on the issues of ambiguity and real emotion, while FDRL [43], MA-Net [42] and TransFER [24] concentrate on learning better representations to handle real-world scenarios. Following [13], we apply two metrics to fully illustrate the superiority of our RES. The first and most widely used metric is accuracy, or weighted accuracy, i.e., the number of correctly classified samples divided by the total number of test samples. Since most FER data sets are imbalanced, the second metric is the unweighted accuracy, also known as mean accuracy, which is the average accuracy over categories.
In our work, we utilize knowledge of human subjective intuition and the high correlation between expressions, thus seeking more realistic distributions and greatly improving performance. Without increasing the burden on training and inference, RES obtains 90.38% and 90.34% on RAF-DB and FERPlus, respectively. It is worth mentioning that, for mean accuracy with ResNet-18, our method achieves 2.22% and 5.99% improvements on RAF-DB and FERPlus over the SOTA methods. For AffectNet with 7 classes and 8 classes, RES also has impressive results. DMUE has leading results due to its auxiliary multiple branches, which require complex training and the joint tuning of multiple modules. Besides, TransFER achieves the best results with an additional Transformer that needs a large number of ImageNet samples to train, thus introducing massive training costs. In contrast, our method obtains comparable results with a much simpler training process.

Ablation study
Temperature T T is the temperature parameter that adjusts the distribution to help the model learn more implicit knowledge. A higher T generates a softer probability distribution. In Table 5, we evaluate the effect of different temperatures in our approach. T > 1 can reduce the sensitivity of the model to incorrect predictions, while accuracy decreases when T is too large. Therefore, T = 5 effectively smooths the prediction and obtains the best performance.
Trade-off weight λ λ balances the cross-entropy loss and the Kullback-Leibler divergence between the recalibrated annotation and the prediction of the backbone. In a sense, a bigger λ means the cross-entropy plays a relatively smaller role, indicating a more serious ambiguity problem on that FER data set. Figure 7a shows that λ = 0.55 obtains the highest accuracy on RAF-DB, while λ = 0.85 achieves the best performance on AffectNet in Fig. 7b, which indicates that AffectNet suffers from a more serious ambiguity problem.
Trade-off weight β β balances the posterior distribution and the one-hot label. When β is smaller, the posterior distribution plays a more important guiding role, but there is no denying that the one-hot label is also a good supervisory signal. Figure 8a shows that β = 0.25 obtains the highest accuracy on FERPlus, while Fig. 8b shows that β = 0.60 achieves the best performance on AffectNet, which means that human intuitions and the associations between expressions both play a good supervisory role in performance improvement.

Conclusions
To learn the real emotion for FER, we propose a simple but effective framework named Real Emotion Seeker, which recalibrates the annotation of each sample to a latent expression distribution besides the one-hot label. Specifically, implicit knowledge of human intuition on expressions is extracted from professional voting results. Then, Bayesian inference is applied to obtain a posterior distribution specific to each data set. In the end, the posterior distribution is combined with the one-hot label to recalibrate the annotation, acting as an additional supervision.
The proposed RES yields predictions more consistent with real emotions and significantly improves FER performance with no extra burden on training and inference. RES is not limited to a specific backbone or data set and can be easily reproduced without complex tricks. However, it is highly dependent on appropriate implicit knowledge, which requires extensive manual annotation. In addition, the proposed method transfers well to new data sets but poorly to new expression categories. In the future, we plan to explore more convincing implicit knowledge, drawn not only from human experience but also from biological characteristics of human faces.