Emotion Categorization from Faces of People with Sunglasses and Facemasks


Emotional information is considered to convey much meaning in communication. Hence, artificial emotion categorization methods are being developed to meet the increasing demand to introduce intelligent systems, such as robots, into shared workspaces. Deep learning algorithms have demonstrated considerable competency in categorizing images from posed datasets in which the main features of the face are visible. However, the use of sunglasses and facemasks is common in daily life, especially during outbreaks of communicable diseases such as the recent coronavirus. Anecdotally, partial coverings of the face reduce the effectiveness of human communication. Would they also hamper computer vision, and if so, would the different emotion categories be affected equally? Here, we use a modern deep learning algorithm (VGG19) to categorize emotion from faces of people obscured with simulated sunglasses and facemasks. We found that face coverings reduce emotion categorization accuracy by up to 74%, and that emotion categories are affected differently by different coverings: clear mouth coverings, for example, have little effect on categorizing happiness, but sadness is badly affected. While an overall accuracy of up to 97% was achieved with nothing added to the face, accuracy decreased in all cases in which the face was obscured. Notably, clear visors had only a small effect across all emotions: the classifier achieved an accuracy of up to 89.0%, compared with less than 36% for the other types of facemasks.


Introduction
The categorization of human facial expression by artificial systems has received significant attention in recent years (Huang et al., 2019). One reason is its diverse applications, such as in human-computer interaction (Picard, 2003; Yuan and Ip, 2018; Brave and Nass, 2009; Arunnehru and Geetha, 2017). Recently, coverings that obscure faces, such as facemasks and sunglasses, have become widely used for various purposes. This may have a great impact on the performance of artificial systems in emotion categorization.
The distinction between emotion recognition and emotion categorization is made here because we affirm that understanding the underlying emotion of a person from their face is only an approximation (Shehu et al., 2020b). This is because emotion expression can vary across individuals. For instance, while scowling and crying are perceived as expressions of anger and sadness, certain people scowl when concentrating hard on a task (Barrett et al., 2019; Boehner et al., 2007) and some people cry when they are happy.
Currently, hospitals and healthcare systems are motivated to introduce robots to help doctors in critical healthcare conditions (Shehu et al., 2020a), especially with the ongoing challenge of a global pandemic. However, these robots need to identify human emotions in order to interact with humans in an intuitive way.
Faces can be covered for different reasons. For example, facemasks are widely used (Jones, 2020) to prevent the spread of infectious diseases such as tuberculosis, coronavirus, and swine flu. Sunglasses are used to protect the eyes from sunlight, improve appearance, and obscure the face. However, while it is anticipated that different emotions will be affected differently by different face coverings, it is not known to what extent these coverings affect the ability of artificial systems to categorize emotion, since the effect of obscured faces on emotion categorization systems has not been tested.
Recently, manufacturers have addressed the problem that a fully covered facemask reduces lip-reading opportunities by designing a facemask with a transparent window that allows the mouth to be visible (Coleman, 2020). However, while a transparent mask might help lip reading, and perhaps also emotion classification by humans, it is unknown whether adding only the transparent window is enough for artificial systems to categorize human emotion, as emotion is considered to be conveyed by the full face rather than only by the mouth or eyes (Coleman, 1949).
The novelty of this research is to identify how wearing sunglasses or different versions of facemasks affects emotion classification systems. In other words, the paper raises this important question: To what extent can the performance of artificial emotion categorization systems be affected when people wear sunglasses or facemasks?
This question is addressed by implementing a strategy to manually add sunglasses and facemasks to the images of the CK+ database. The research analyzes this effect using the VGG19 model (Simonyan and Zisserman, 2014) as an example of a state-of-the-art deep learning classifier. In addition to a mask that is either full or has a transparent window, this research also proposes the use of a clear (fully transparent) facemask, which has similar properties to a visor. The performance of an emotion classification system is compared across manipulated images of people wearing this visor-like mask, sunglasses, a fully covered facemask, and a partially covered facemask.
The research uses the CK+ database because it is an example of emotion video-frame images and uses the last-half frames of each sequence in order to have more data to be used in the experiments.
The rest of the paper is organized as follows. Section 2 describes how certain researchers have used machine learning and deep learning algorithms to classify emotion. Section 3 explains the properties of the CK+ database and its emotion categories, as well as how and why it is used in this research. This section also explains how computer vision techniques are used to create and add sunglasses and facemasks to emotion images, and describes the properties of the deep model used, how it is set up, and why it was chosen. Section 4 presents the obtained results. Section 5 provides a further discussion of the findings and, finally, Section 6 concludes the paper with hints at future studies.

[Figure caption fragment: (Shehu et al., 2020a). Note that the majority are grayscale images.]

Related Work
A number of methods can be used to categorize emotion from images. For instance, there is a single frame-based method that categorizes emotion from a single frame and a multi frame-based method that categorizes emotion from multiple frames.
The multi frame-based method with reference frames was used to analyze emotion from the CK+ dataset (Otroshi-Shahreza, 2017). First, landmark coordinates of the face in each frame were derived using the dlib library (Dlib, 2017), followed by a normalization process. Vector movements of the normalized landmark coordinates were calculated from the initial frame, where the posed emotion is neutral, to the last frame, where the emotion is expressed at its peak. The same was done for each of the six basic emotions (see Fig. 1), whereas the landmarks detected on the initial frames were used for the neutral emotion. The calculated vector movements were used to determine the facial expression using Random Forests (RF), Decision Tree (DT), and Linear Discriminant Analysis (LDA), which led to accuracies of 93.47%, 89.29%, and 96.08% respectively. However, there are concerns that the proposed method is sensitive to the choice of landmarks, as using dlib to detect coordinates automatically decreases the accuracy of the method compared to when landmarks are detected manually.
Majumder et al. (2018) proposed an automatic facial expression recognition system (AFERS) that used four different layers of a neural network to classify emotion using a deep network framework. The first two layers used geometric and appearance features for better representation of the facial expression. The third layer used a classifier based on Kohonen's self-organizing maps (SOM). The SOM-based classifier used an improved learning algorithm and a soft-threshold logic to give higher accuracy. Several experiments were performed to demonstrate how performance varies with the number of nodes in the last layer. The performance of the proposed deep network was tested on the CK+ (Lucey et al., 2010) and MMI (Pantic et al., 2005) databases, which led to accuracies of 98.95% and 97.55%. The first two layers of the proposed AFERS system work by first detecting the faces in the image using the Viola-Jones algorithm (Viola and Jones, 2004). However, as the Viola-Jones algorithm only detects faces on fully displayed frontal face images, it is anticipated that the method will not function on images displaying only part of the face. This will limit the usability of the method to frontal face images.
Almowallad and Sanchez (2020) proposed a deep learning framework for label-distribution learning (EDL-LBCNN) to classify emotion from images. The proposed method enhances the features extracted by the convolution neural network by forming a local binary convolutional (LBC) layer that acquires texture information from face images, so as to improve the generalization of the trained model. The proposed EDL-LBCNN was evaluated on the Japanese Female Facial Expression (JAFFE) (Lyons et al., 1999) [...] state-of-the-art label distribution-learning methods when the evaluation was made on straight facial images. However, the result was not very encouraging when the analysis was made on tilted face images.
Harshitha et al. (2019) proposed a convolution neural network (CNN) architecture to classify the six basic expressions (anger, disgust, fear, happy, sad, surprise) of the JAFFE database. The proposed CNN was developed to have two convolution, pooling, and fully connected layers, with a rectified linear unit (ReLU) activation function at each layer. The proposed approach achieved an accuracy of 91.6% when tested with images from the JAFFE database. As the proposed CNN has only two layers, questions remain as to whether the method will perform better than current state-of-the-art CNN architectures that have 50 or more layers (e.g. ResNet50 (He et al., 2016)).
Fathallah et al. (2017) proposed an architecture based on a convolution neural network (CNN) to recognize facial expression from images. Initially, the proposed CNN was trained with fine-tuning from the Visual Geometry Group (VGG) model to improve results. In the second step, training was repeated, but fine-tuning was carried out starting from the first model to obtain the final model. The method was evaluated on three state-of-the-art databases, CK+, MUG (Aifanti et al., 2010), and RaFD (Langner et al., 2010), which led to accuracies of 99.33%, 87.65%, and 93.33% respectively. The developed model was trained only with fully displayed faces located at the center of the images. As such, it is anticipated that the method might perform badly when tested with tilted face images, faces located in different regions of the image, or obscured faces, i.e. faces that are partially covered with external devices such as sunglasses or facemasks.
Deep learning algorithms have been used to classify emotion from images (Majumder et al., 2018; Almowallad and Sanchez, 2020; Harshitha et al., 2019; Fathallah et al., 2017). However, emotion classification models developed on fully displayed facial configurations might perform badly in classifying the emotion of people from tilted face images or when a certain portion of the face is obscured, e.g. by sunglasses or a facemask. This research is needed to analyze the effect of facemasks and sunglasses, applied systematically to an emotion dataset, on artificial emotion classification systems.
We know from our previous work (Shehu et al., 2020b) that using the last-half frames of each sequence of the CK+ database gives a more accurate result than using only the last few frames, where the emotion is expressed at its peak. Consequently, the last-half frames of each sequence are assigned the emotion label of the sequence and the first two frames of each sequence are labeled as neutral expression. For instance, in a sequence of 20 frames whose emotion label is happy, we assigned the neutral expression to frames 1 and 2 and the happy expression to frames 11 to 20. In total, 3,368 images are used.
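The labelling rule above can be sketched in a few lines of Python. This is only an illustration: the function name `label_sequence` is hypothetical, frames are assumed to be numbered from 1, and the handling of sequences with an odd number of frames (rounding the half down) is an assumption, since the text only gives the 20-frame example.

```python
def label_sequence(num_frames, emotion):
    """Return (frame_number, label) pairs for one sequence.

    The first two frames are labelled 'neutral' and the last half of the
    frames is labelled with the sequence emotion. Frames in between are
    left out. Rounding for odd lengths is an assumption.
    """
    # First two frames (or fewer, for very short sequences) are neutral.
    labelled = [(i, "neutral") for i in range(1, min(2, num_frames) + 1)]
    # First frame of the last half, e.g. frame 11 of 20.
    first_peak = num_frames - num_frames // 2 + 1
    # Never reuse a frame that was already labelled neutral.
    labelled += [(i, emotion) for i in range(max(first_peak, 3), num_frames + 1)]
    return labelled


if __name__ == "__main__":
    for frame, label in label_sequence(20, "happy"):
        print(frame, label)
```

For the 20-frame example in the text, this yields frames 1 and 2 as neutral and frames 11 to 20 as happy.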
Since deep learning algorithms require large amounts of data for training, the CK+ dataset is chosen in order to obtain more data for the experiment.

Train test split
A total of 350 images, consisting of 50 images from each class were randomly selected as the test set and the remaining 3,018 images were used for training. The database is split to have the same number of images from each test class to avoid bias in the performance estimate.
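A class-balanced random split of this kind might look like the following sketch. The function name `balanced_split`, the fixed seed, and the toy label list are assumptions made for illustration; the source does not state how the random selection was implemented.

```python
import random
from collections import defaultdict


def balanced_split(labels, per_class=50, seed=0):
    """Pick `per_class` random indices per label for the test set.

    Returns (train_indices, test_indices). All remaining indices form
    the training set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    test = []
    for label in sorted(by_class):
        # Sample without replacement inside each class.
        test.extend(rng.sample(by_class[label], per_class))
    test_set = set(test)
    train = [i for i in range(len(labels)) if i not in test_set]
    return train, sorted(test)


if __name__ == "__main__":
    # Toy data only: 7 classes with 60 items each (not the CK+ counts).
    classes = ["An", "Di", "Fe", "Ha", "Neu", "Sa", "Su"]
    toy_labels = [c for c in classes for _ in range(60)]
    train_idx, test_idx = balanced_split(toy_labels)
    print(len(train_idx), len(test_idx))
```

With seven classes and 50 test images per class, the test set always has 350 items, matching the count given above.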

Pre-processing
Image pixels were converted to an array, and the pixel values of the raw images were normalized to the range [0, 1] to enable fast computation. Labels were converted to integers and one-hot encoded.
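As a rough sketch of these two steps, the snippet below scales pixel values and one-hot encodes labels in plain Python so that it runs without dependencies. Dividing by 255 assumes 8-bit images, which the source does not state, and the helper names `normalize` and `one_hot` are hypothetical. In practice an array library would normally be used.

```python
def normalize(pixels):
    """Scale 8-bit pixel values (0..255) to floats in [0, 1].

    The divisor 255 is an assumption about the image bit depth.
    """
    return [[value / 255.0 for value in row] for row in pixels]


def one_hot(labels, classes):
    """Map each label to an integer index, then to a one-hot vector."""
    index = {name: i for i, name in enumerate(classes)}
    return [
        [1 if index[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]


if __name__ == "__main__":
    print(normalize([[0, 128, 255]]))
    print(one_hot(["Ha", "An"], ["An", "Ha", "Sa"]))
```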

Deep learning model
Deep learning models are a type of artificial neural network model that performs end-to-end learning. These algorithms are designed to recognize patterns in data, inspired by neurons in the human brain. They use successive layers of a neural network to extract progressively higher-level information from the raw input data, which in this case is an image.
In this research, VGG19 (Simonyan and Zisserman, 2014), which is an improved version of VGG16, is used as an example of a deep model to classify emotion labels from the CK+ database. VGG19 is chosen because it is a well-tested standard model that has achieved high performance in many studies (Ullah et al., 2019; Rassadin et al., 2017; Knyazev et al., 2017; Oloko-Oba and Viriri, 2020; Dua et al., 2020). Another reason is that it is deeper and has more weight layers than its counterpart VGG16 (Zheng et al., 2018), which could lead to more flexible feature extraction.
The same VGG19 model, with the same number of layers as proposed in the original paper, is used from the Keras API (Keras API, 2021) with the following setup. The model is set to run for 200 epochs with a learning rate starting from 0.001 and reduced by 10% after 80, 120, and 160 epochs. The learning rate is set to reduce by only 5% after 180 epochs. 10% of the training data is reserved for validation, and data augmentation is used to improve the diversity of the data during training. The validation patience is set to five, meaning that training stops if the loss on the validation set is larger than or equal to the previously smallest loss for five consecutive times.
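The patience rule described above can be illustrated with a small stand-alone function. This is a sketch of the stopping logic only, not the training code used in the study; the name `stopping_epoch` and the example loss values are made up. It mirrors the kind of behaviour that a `patience` setting on an early-stopping callback is meant to provide.

```python
def stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has been larger than or
    equal to the smallest loss seen so far for `patience` consecutive
    epochs. Returns None if that never happens.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


if __name__ == "__main__":
    # Hypothetical validation losses for illustration.
    losses = [1.0, 0.8, 0.9, 0.8, 0.85, 0.9, 0.95, 0.7]
    print(stopping_epoch(losses))
```

In this made-up example the best loss is reached at epoch 2, and the following five epochs do not improve on it, so training would stop at epoch 7 and the later, lower loss would never be seen.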

Dataset Modification
The following procedure is used to add an obscuring artifact, in this case sunglasses or a facemask, to a face. Initially, the foreground of the glasses or mask is placed on top of an overlay image. The overlay is blank and is resized to the width and height of the input image. The alpha channel, which controls transparency in a given region, is added and resized to the same size as the input image. The alpha channel contains only the foreground mask. Alpha blending is then performed to merge the alpha channel, foreground, and background, which returns the output image.
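The alpha-blending step can be written per pixel as output = alpha * foreground + (1 - alpha) * background. The sketch below applies this to small greyscale images stored as nested lists. It is a minimal illustration that assumes alpha values in [0, 1] and images of equal size; the function name `alpha_blend` is hypothetical, and an image library would normally be used for real images.

```python
def alpha_blend(background, foreground, alpha):
    """Blend same-sized greyscale images pixel by pixel.

    alpha is 1.0 where the covering is opaque and 0.0 where the
    original face should remain fully visible.
    """
    return [
        [a * f + (1.0 - a) * b for b, f, a in zip(bg_row, fg_row, a_row)]
        for bg_row, fg_row, a_row in zip(background, foreground, alpha)
    ]


if __name__ == "__main__":
    face = [[100.0, 100.0, 100.0]]     # background (input image)
    covering = [[0.0, 0.0, 0.0]]       # foreground on a blank overlay
    mask = [[1.0, 0.5, 0.0]]           # opaque, semi-transparent, clear
    print(alpha_blend(face, covering, mask))
```

A fully transparent covering corresponds to alpha values close to 0, which leaves the face almost unchanged, while an opaque mask uses alpha values of 1 over the covered region.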

Sunglasses
Sunglasses are created to cover the two eyes of the participants in the image (see Fig. 2d). They are created by converting a particular glasses image sourced from the internet into a transparent mask (Rosebrock, 2018).
Algorithm 1 adds sunglasses to each test image of the CK+ database as well as generates a prediction for each image.

Facemasks
Three different facemasks are applied to the images to analyze how different obscured faces affect the performance of an emotion classification system. The first is a fully transparent design, which has similar properties to a visor in that it adds only a small amount of visual noise to the image (see Fig. 2a). The second has a transparent window: it obscures the nose, chin, and cheeks, but leaves the eyes, eyebrows, mouth, and forehead visible (see Fig. 2b). Finally, we included the common mask used by individuals in many public settings, which leaves only the eyebrows, eyes, and forehead visible (see Fig. 2c).

Application
This section provides a step-by-step explanation of how the artificial sunglasses and facemasks are applied to the face images.
Algorithm 1: Procedure adopted to add sunglasses and generate prediction
The face is first detected using a pre-trained Convolutional Architecture for Fast Feature Embedding (Caffe) model (Jia et al., 2014), res10_300x300_ssd_iter_140000, followed by constructing a dlib (Dlib, 2017) rectangle object, which is used to detect facial landmarks within the face. If the Caffe model fails to detect a face in an image, the Haar cascade classifier (Viola and Jones, 2001) is used as an alternative. The coordinates of the left and right eye are extracted from the detected landmarks, followed by computation of the centre of mass of each eye and the angle between the eye centroids. Since certain faces of participants in the CK+ database are tilted (see Fig. 3), the computed angle is used to rotate the sunglasses so that they are aligned with the tilt of the face in the image. Furthermore, the width of the glasses is reduced to 90% to make sure that the glasses do not cover the entire face before they are added to the face according to the detected landmarks of the left and right eye.
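The geometric part of this procedure (eye centroids, tilt angle, and reduced width) can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: landmark points are given as (x, y) pixel coordinates with y increasing downwards, `left_eye` is the eye that appears on the left of the image, and the 90% factor is applied to a hypothetical `face_width` argument because the source does not say which reference width is scaled.

```python
import math


def centroid(points):
    """Centre of mass of a list of (x, y) landmark points."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def glasses_geometry(left_eye, right_eye, face_width, scale=0.9):
    """Return (angle_degrees, centre, width) for placing the sunglasses.

    angle_degrees is the tilt of the line joining the eye centroids,
    centre is the midpoint between the eyes, and width is the scaled
    reference width (the choice of reference is an assumption).
    """
    lx, ly = centroid(left_eye)
    rx, ry = centroid(right_eye)
    angle = math.degrees(math.atan2(ry - ly, rx - lx))
    centre = ((lx + rx) / 2, (ly + ry) / 2)
    return angle, centre, scale * face_width


if __name__ == "__main__":
    # Hypothetical landmark coordinates for a slightly tilted face.
    left = [(30, 50), (40, 50)]
    right = [(70, 60), (80, 60)]
    print(glasses_geometry(left, right, face_width=100))
```

The returned angle would then be passed to whatever image rotation routine is used, so that the glasses follow the tilt of the face before they are blended onto it.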
A similar procedure is performed to add a facemask onto the face, except that we are more interested in where the chin is located in the detected face than in where the eyes are located. The landmarks of the jaw are first extracted using dlib, followed by the landmark of the chin. The mask is also rotated according to how a particular face is tilted. Since the mask should cover the entire face, it is resized to the width of the image before it is added to the face.
In the Expressions section of Tables 1 to 5, An represents anger, Di represents disgust, Fe represents fear, Neu represents neutral, Ha represents happy, Sa represents sadness, and Su represents surprise. Also, * refers to the average accuracy achieved over all emotion categories. We chose to analyze the six basic expressions plus neutral as they are considered to be universal expressions that can be understood by many people (Nummenmaa et al., 2007).
The VGG19 model is trained from scratch and evaluated on the validation set. As the model is not deterministic, the program is run 30 times and the performance of each run is assessed by testing the model on the test set. For that reason, results are provided in two forms: * represents the average accuracy obtained from testing the best model on the test set, which can be read from the confusion matrix, whereas ** represents the overall average accuracy obtained from the 30 runs, with the upper and lower bounds of a 95% confidence interval.
Table 1 shows the confusion matrix obtained from testing images of the CK+ database with no changes made to them. The model achieved an accuracy of 100% in three classes (disgust, sad, and surprise) and very high accuracy across all classes, with the lowest accuracy of 94.0% for neutral class images.
Table 2 presents the confusion matrix obtained from predicting images of people with sunglasses. While an accuracy of 100% is achieved for anger class images, the average accuracy achieved by the best model is 84.86%. This is because the model did not achieve a classification accuracy of more than 92% in any class other than anger, compared to the minimum accuracy of 94% with nothing added to the image.
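The two kinds of summary figures used in the tables, per-class accuracy read from a confusion matrix and an average over repeated runs with a 95% interval, can be sketched as below. The source does not say how the interval was computed, so the normal approximation with z = 1.96 is an assumption, as is the convention that rows of the confusion matrix hold the true class. The numbers in the usage example are made up.

```python
import math
import statistics


def per_class_accuracy(confusion):
    """Accuracy per class from a square confusion matrix.

    Assumes rows are the true class and columns the predicted class.
    """
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def mean_with_ci(accuracies, z=1.96):
    """Mean accuracy over runs with a normal-approximation interval.

    Returns (mean, lower_bound, upper_bound). Using z = 1.96 for a
    95% interval is an assumption, not taken from the source.
    """
    mean = statistics.mean(accuracies)
    half = z * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, mean - half, mean + half


if __name__ == "__main__":
    # Hypothetical two-class matrix with 50 test images per class.
    print(per_class_accuracy([[49, 1], [5, 45]]))
    # Hypothetical accuracies from three runs.
    print(mean_with_ci([0.8, 0.9, 1.0]))
```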
There is some evidence that the eyebrows, eyes, and forehead make a very strong contribution to the frowning of the face (Russell, 1994). As such, one reason why the anger class in Table 3 might have achieved a higher accuracy than the other classes could be that the fully covered facemask still leaves the eyebrows, eyes, and forehead visible.
Table 4 presents the confusion matrix obtained from testing images of people wearing a facemask designed with a transparent window. Interestingly, the model achieved an accuracy of 0% for disgust and sad class images. This has a great impact on the average accuracy achieved by the best model (48.86%) compared to the accuracy achieved when nothing is added to the image.
Table 4: Confusion matrix obtained from testing images of people wearing a mask with transparent window

What is striking in this table is the accuracy achieved for images of the surprise class. While only 12% of images from the surprise class are classified correctly with a fully covered mask, the achieved accuracy is up to 98% (an increase of 86 percentage points) for images of people wearing a mask with a transparent window. It can be inferred from this result that the mouth makes a very strong contribution to categorizing surprise expressions.
Table 5 shows the confusion matrix obtained from the prediction of images of people wearing a fully transparent visor. Apart from the anger and sad classes, which achieved accuracies of 86% and 88%, all other classes achieved an accuracy of more than 93%. An overall accuracy of up to 94% was achieved from the prediction of images of people wearing a fully transparent visor. There is a large difference between the accuracy achieved for images with a fully transparent visor and the accuracy achieved when sunglasses, a fully covered facemask, or a partially covered facemask with a transparent window is used.

Discussion
Taken together, these results suggest that a facemask with a transparent window might help artificial systems to categorize the emotional facial expressions of people: its overall accuracy is almost twice that achieved for images of people with a fully covered facemask. However, the achieved accuracy is still not very promising (< 50%).
Conversely, an overall average accuracy of up to 89% achieved from predicting images of people wearing a fully transparent visor suggests that the artificial systems have a better chance of categorizing people's emotional facial expressions correctly while wearing a fully transparent visor than a partially or a fully covered facemask. In addition, two-sample unpaired t-tests showed that the overall accuracy achieved when wearing a fully transparent visor is significantly better than the achieved accuracy when wearing other types of coverings.
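For readers unfamiliar with the test mentioned above, the sketch below computes the statistic of an unpaired two-sample t-test from two lists of per-run accuracies. The source does not say which variant was used; Welch's unequal-variance form is assumed here, and only the t statistic and degrees of freedom are computed (a p-value would additionally need the t distribution, which the Python standard library does not provide). The sample values are hypothetical.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for Welch's unpaired t-test.

    Welch's variant is an assumption; the source only states that
    two-sample unpaired t-tests were used.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = statistics.variance(sample_a) / na
    vb = statistics.variance(sample_b) / nb
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical per-run accuracies for two coverings.
    visor = [0.90, 0.88, 0.89, 0.91]
    full_mask = [0.35, 0.33, 0.36, 0.34]
    print(welch_t(visor, full_mask))
```

A large positive t value for the first sample relative to the second would be consistent with the first covering giving higher accuracy across runs.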
Hence, the use of a fully transparent visor is beneficial if such visors are considered protective, as it will not only provide protection against infectious disease but will also help to improve the interaction between artificial systems and humans, since people prefer to interact with social robots that can identify emotion (Breazeal et al., 2008; Onyeulo and Gandhi, 2020) rather than with robots that cannot.
While it is understandable to see a decrease in the accuracy of all or a particular class when sunglasses or facemasks are added to the images, the accuracy achieved for images of the anger class after sunglasses have been added is somewhat counter-intuitive. The accuracy increases from 98% with nothing added to the image to 100% after sunglasses have been added. This possibly occurred due to the stochastic nature of training the deep learning algorithm, or because noisy features were covered by the sunglasses.
It is also worth mentioning that the aim of this research is to use artificial intelligence techniques to analyze changes in the performance of an emotion categorization model when the face is covered with sunglasses or facemasks compared to when it is not. The research does not aim to develop a new technique to improve the performance of an emotion categorization model on obscured faces. VGG19 was chosen as an example of a deep model because it is one of the most commonly used networks nowadays and has been extensively tested.

Conclusion
To the best of our knowledge, this is the first study that analyzes emotion categorization from faces of people wearing simulated sunglasses and facemasks. The research adds sunglasses as well as different kinds of masks to the images of the CK+ database and compares the performance of an emotion categorization system on faces of people wearing sunglasses, a fully covered facemask, a facemask with a transparent window, and a fully transparent visor. The accuracy achieved with the fully transparent visor is higher than the accuracy achieved when sunglasses, a fully covered facemask, or a facemask with a transparent window is used.
Before this study, evidence that the use of a fully covered facemask obscures communication was purely anecdotal. After the investigation made in this study, we can now conclude that not only does the fully covered facemask obscure communication, but the newly produced partially covered facemask with a transparent window also affects the performance of an artificial emotion categorization system by a significant amount. In addition, the empirical findings in this study provide evidence that the fully transparent facemask, which has similar properties to a visor, is the covering most easily handled by the artificial emotion categorization system (see Section 4), as the accuracy achieved with the fully transparent visor is significantly better than the accuracy achieved with all other types of coverings.
Despite the promising result obtained when emotion is analyzed from faces of people wearing a fully transparent visor, questions remain unanswered as to whether adding the fully transparent visor on images from a different database will also lead to a higher accuracy result. In addition, it is unknown as to how human classifiers will perform when categorizing these images. As such, future work should analyze the performance of human classifiers in categorizing these data as well as apply the same approach to a different dataset.