Soft Biometrics and Deep Learning: Detecting Facial Soft Biometrics Features Using Ocular and Forehead Region for Masked Face Images

— Soft biometrics is a growing field that has been shown to improve recognition systems over the past decade. When combined with hard biometrics such as iris, gait, or fingerprint recognition, the efficiency of the system increases many fold. The pandemic brought the need to recognise mask-covered faces efficiently, and soft biometrics proved to be an aid in this. While recent advances in computer vision have helped in the estimation of age and gender, the system can be improved by extending its scope to several other soft biometric attributes that help identify a person, including but not limited to eyeglasses, hair type and color, mustache, and eyebrows. In this paper we propose an identification system that uses the ocular and forehead regions of the face as modalities, training models with transfer learning techniques to detect 12 soft biometric attributes (FFHQ dataset) and 25 soft biometric attributes (CelebA dataset) for masked faces. We compare the results with those for unmasked faces to measure the variation in efficiency across these datasets. Throughout the paper we implement four enhanced models, namely enhanced AlexNet, enhanced ResNet50, enhanced MobileNetV2, and enhanced SqueezeNet. The enhanced models apply transfer learning to the base models, which aids in improving accuracy. Finally, we compare the results and examine how accuracy varies with the model used and with whether the images are masked or unmasked. We conclude that for images containing facial masks, the enhanced MobileNetV2 gives an excellent accuracy of 92.5% (FFHQ dataset) and 87% (CelebA dataset).


I. INTRODUCTION
The concept of biometric identification is not new; in fact it dates back to the 1970s, when it was used to identify people on the basis of their fingerprints. Over the years many new forms of biometrics were developed, ranging from facial recognition to recognition on the basis of DNA, gait, hand geometry, keystroke dynamics, and more. All of these constitute hard biometrics. In the past two decades there has also been an increase in research on soft biometrics. Soft biometric recognition is based on how humans recognise each other, i.e. on the basis of physical and behavioral characteristics. Traits such as height, weight, body geometry, scars, facial hair, hair color, and baldness all constitute soft biometric attributes. Research over the years has made it evident that integrating these soft biometric features with hard biometric features significantly improves recognition systems. While the existing system of hard biometrics has reached a level of saturation, soft biometrics remains a newer field of identification that is capable of unraveling solutions to many problems that currently exist in biometrics.
With the pandemic came the need for a system able to identify people while they wore facial masks, i.e. when only the area above the nose is accessible for recognition purposes. This posed a difficulty for existing recognition systems that needed to be resolved. To tackle this problem we propose using the ocular as well as the forehead region of the face to train our models. Using Haar cascades, we process the images in each dataset to retrieve only the portion that is important to us, and train our models on those images to recognise the soft biometric attributes of a person. Another problem identified by researchers is working with low-resolution devices. With the increasing use of mobile phones to take pictures and selfies, and the spread of such pictures on social media, it has become evident that a system is needed that can recognise faces captured with low-power devices. Such images can be blurry, poorly lit, and of poor resolution in general, which hinders recognition. We worked with different models (Table 2), fine-tuned them to increase accuracy, and tested them on images captured via mobile phones to identify which enhanced model gives the best accuracy for low-powered devices or even for blurry images. Soft biometrics has found a niche in the facial recognition systems [37] used nowadays.
It is increasingly used in situations where knowing the exact identity is not crucial and finding a particular segment of people matters more. Examples include websites that target a portion of the population for selling their products (say, according to age or gender). Sometimes, for studying the demographics of an area, knowing the ethnicity, age, and gender of the communities living there can be more useful than finding the identity of each individual. Looking for only certain semantics in video surveillance, say an old man with black hair and a beard, can reduce the search considerably; even a low-quality surveillance camera can then satisfy the need to identify a person (or simply narrow the search, together with other available information) without knowing their exact identity.
In the paper, the following information can be found in each section. Section II introduces the datasets used in the project. In Section III we compile a literature study of the work done in this area of soft biometrics and deep learning techniques. In Section IV we present our contributions and how they differ from earlier work in this field. Section V discusses the methodology applied in our study, introduces the various models used, and explains how they are implemented to obtain the results. Section VI compiles the results obtained from running our models on the datasets and discusses their significance. In Section VII we conclude the paper by bringing out the key points covered, their significance, and their impact on the real world; we also discuss the future scope of the project and how it could be applied to real-world scenarios, improving existing systems or enabling new ones. We have also included Compliance with Ethical Standards (Section VIII) followed by Authorship Contribution (Section IX). Lastly, Section X contains the references to previous studies used throughout the paper.

II. DATASETS USED
There are two datasets that have been used in this paper on which our models are trained.
The first dataset used is CelebFaces Attributes (CelebA). The CelebA dataset is commonly used to train and test models for face detection, especially for recognition of facial soft biometric attributes such as hair, smile, and eyeglasses. It contains diverse faces with many soft attributes and a large number of pictures:
• 202,599 face images
• 10,177 individuals
• 40 attribute labels per image giving the soft biometric attributes
We preprocess the CelebA images to retain only the ocular and forehead parts of the face, representing images containing facial masks. For unmasked images we preprocess the images to contain the whole frontal face. All the enhanced models were then applied to these processed images to measure the detection accuracy for 25 facial soft biometric attributes selected out of the 40 available in the dataset.
The other dataset used in the paper is Flickr-Faces-HQ (FFHQ), which consists of PNG images of human faces and is generally used in work revolving around GANs. It has the following features:
• 70,000 high-quality images at 1024×1024 resolution, with considerable variation in age, ethnicity, and image background
• Many secondary attributes, such as eyeglasses
We preprocess the FFHQ images in the same way as CelebA. All the enhanced models were applied to the processed masked and unmasked images to detect 12 facial soft biometric attributes, a subset of those used for the CelebA dataset, so that the two sets of results are comparable. An added advantage of this dataset is that it contains images not only of adults but also of infants and children, a feature unavailable in other datasets, including CelebA.
The preprocessing that extracts the required facial features for both datasets is done using OpenCV and Haar cascade classification. The required feature is the frontal face for unmasked images and the ocular and forehead region for masked images. Haar cascade classification has two parts. The first is a classification task that takes an image and outputs a binary value (1 or 0) indicating whether the feature is present in the image or not. The second is face localization, which takes an image as input and outputs the location of the facial feature as a bounding box with a width and height.
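As an illustration, the crop that simulates a masked face can be sketched as follows. This is a minimal sketch that assumes a face bounding box has already been obtained (e.g. from OpenCV's `haarcascade_frontalface_default.xml` via `detectMultiScale`); the 0.55 crop fraction is an illustrative assumption, not a value taken from the paper.

```python
def masked_crop_box(x, y, w, h, frac=0.55):
    """Given a frontal-face bounding box (x, y, w, h), e.g. one returned
    by cv2.CascadeClassifier(...).detectMultiScale(), return the sub-box
    covering only the forehead and ocular region: the top `frac` of the
    detected face.  frac=0.55 is an assumed, illustrative value."""
    return (x, y, w, max(1, int(h * frac)))

# The cropped "masked" image would then be img[y:y+h2, x:x+w2]
# for (x, y, w2, h2) = masked_crop_box(x, y, w, h).
```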

III. RELATED WORK
To date, the majority of research in the field of soft biometrics focuses on estimating age [29,30,31], gender [32,33], and race [34] from face images. Earlier, techniques such as infrared iris imaging were used to estimate a person's age, made possible by already existing iris datasets [20,21]. These studies used geometric or textural information and obtained an accuracy of ∼64%. In recent years, facial age estimation approaches have shifted to computationally efficient CNN designs and multitask learning. Numerous recent works focus on developing small, efficient neural networks suitable for resource-limited systems such as mobile devices. A common approach is reducing the number of parameters in the convolutions, with the MobileNet [24,25], ShuffleNet [26,27], and Xception [28] models utilizing depth-wise separable convolutions. Rattani et al. [16,18,19] were among the pioneers in recognizing age and gender from RGB ocular images captured with mobile devices.

Fernando Alonso-Fernandez et al. (2021) [1] addressed how selfie images captured with mobile phones contain the ocular region required to estimate soft biometrics such as age and gender. They used lightweight CNNs (of a few megabytes [53-56], to cope with low-quality eye images) proposed within the context of the ImageNet Challenge, as well as additional architectures designed for face recognition on mobile devices. The experiments employed 11,299 images of the Adience benchmark, which includes images taken in the wild with smartphones and uploaded to Flickr. M. Vasileiadis et al. [5] proposed a computationally efficient CNN architecture, appropriate for real-time implementation on low-power devices, which concurrently performs gender, age, race, eye-state, eyewear, smile, beard, and mustache estimation from unconstrained face images; the architecture employs MobileNet and exploits the correlation between the individual biometric features. Shervin Minaee et al. (2021) [2] presented a comprehensive review of the advances between 2014 and 2019 employing deep learning networks in biometric recognition. For each modality, they provide a thorough overview of the latest contributions, loss functions, and network architectures developed to achieve state-of-the-art performance, surveying more than one hundred and twenty promising works on biometric recognition (including fingerprint, face, iris, palmprint, ear, signature, gait, and voice). [17] suggested a light-CNN based on an updated VGG16 model to identify faces with a small dataset. VGG16 has very deep layers, with many narrow convolutional layers with separate kernel numbers followed by max pooling, optimized for large-scale classification. The proposed architecture uses 120 × 120 pixel input images and has two types of convolutional layers, followed by max pooling.
Every convolutional layer is followed by a rectified linear unit (ReLU) activation. The proposed light-CNN is small but delivers good efficiency, with 94.4% accuracy. [14] presented a face-recognition system using a ConvNet, namely AlexNet. Liu (2019) [15] proposed a Facial Expression Recognition (FER) system using the fer2013 dataset and an effective deep convolutional neural network to train an efficient model, then used Tkinter, a Graphical User Interface (GUI) tool, to evaluate expression images and achieve realistic performance; AlexNet, VGG16, VGG19, and ResNet152 were also used. Yingying Wang (2019) [22] proposed a method that combines multiple sub-regions and the entire face image by weighting, capturing more important feature information that helps improve recognition accuracy. The proposed method was evaluated on four well-known publicly available facial expression datasets.

IV. CONTRIBUTIONS
As seen above, research done previously on soft biometrics is being used extensively in facial recognition systems. We can further notice that identifying people wearing masks became an urgent need after the pandemic, and soft biometrics has been playing a substantial role in meeting it. The most recent ongoing studies use ocular data for soft biometric attribute recognition, i.e. images of the eyes are evaluated to identify a person's characteristics, mainly limited to age, gender, and ethnicity. In addition, most earlier work makes use of classic models such as CNN and MobileNet to identify facial features, including soft biometric attributes in images taken with low-powered devices.
• In our paper we extend this idea further by introducing an additional modality. We preprocess the images available in the datasets (namely CelebA and FFHQ) to contain not only the ocular region but also the forehead region of the face. We do not use databases containing only these parts; rather, we start from the full image, so that we can understand whether we need the face in its entirety or just the selected portions, and the difference this makes. This expands the scope of recognition and lets us train our fine-tuned models on these preprocessed images. Using large datasets containing up to 200,000 images proved an additional advantage in training and testing our models.
• Further, we have applied transfer learning techniques to the basic models available via Python libraries, such as CNN (AlexNet), ResNet50, MobileNetV2, and SqueezeNet. The results show that an enhanced model containing two extra fully connected layers (beyond which the results stay more or less stable) can significantly improve accuracy and F-beta, making our overall results better.
• In recent years many researchers have worked on methods to identify soft biometric attributes, but this has mostly been restricted to three labels, namely age, gender, and ethnicity. Additionally, in works that consider a few more attributes, the images are taken in unconstrained environments, i.e. without facial masks being considered. We devise a method for identifying around 25 soft biometric attributes (CelebA dataset) and 12 soft biometric attributes (FFHQ dataset) for people wearing a facial mask. We further compare the accuracy and F-beta results with images that do not contain facial masks to gauge how well the system works.
Since this type of work is new, and there is no previous work using the ocular and forehead regions to identify people wearing facial masks with models fine-tuned by transfer learning, we compare the results obtained by running our models on two datasets, both for masked and unmasked faces. Furthermore, we use four kinds of enhanced models to compare how the results vary and which model is best suited for our purpose.

V. PROPOSED METHODOLOGY
We use Convolutional Neural Networks (CNNs) to detect the presence of soft biometric features such as gender, bald, beard, blurry, eyeglasses, and smile. We train the CNNs on two different types of data: the original (unmasked) CelebA and FFHQ, and the modified (masked) CelebA and FFHQ. The modified CelebA and FFHQ have the same labels as the originals, but the images are cropped to contain only the ocular and forehead region. The required facial features are extracted using OpenCV and Haar cascade classification. Our work employs deep network architectures (AlexNet, ResNet50, MobileNetV2, SqueezeNet), combined with two fully connected multi-label classification layers, to identify the presence of a number of facial soft biometric traits. We use pre-trained versions of the above models and apply transfer learning. The basic premise of transfer learning is simple: take a model trained on a large dataset and transfer its knowledge to a smaller dataset.
• AlexNet is a convolutional neural network with eight layers: the first five are convolutional layers, some followed by max-pooling layers, and the last three are fully connected layers. The activation function used is ReLU, which has been shown to outperform the tanh and sigmoid activations. The inputs are RGB images (h × w × 3, where h and w are the height and width of the image), which pass through five layers of filters, each producing an image as its output; after each filter the image size changes. The output of the last layer of filters can be seen as a 4,096-element one-dimensional vector, used to classify and produce up to 1,000 output probabilities, one for each required output class.

• ResNet50 is a deep convolutional network whose building blocks are residual (identity) blocks. A residual block is shown in Fig. 4: the activation of a layer is fast-forwarded to a deeper layer using skip connections in the neural network.

Fig. 4 Architecture of ResNet50 model
• SqueezeNet is a convolutional neural network designed to reduce the number of parameters, chiefly through fire modules, which "squeeze" parameters using 1×1 convolutions. It is a small and compact network: it has 50 times fewer parameters than AlexNet yet runs about 3× faster. A fire module consists of a squeeze convolution layer (containing only 1×1 convolutions) feeding into an expand layer with a mix of 1×1 and 3×3 convolutions. With fewer parameters, such networks fit easily in a computer's memory and can be transmitted over a computer network.
We have added two fully connected layers to each of the above convolutional neural networks: AlexNet, SqueezeNet, MobileNetV2, and ResNet50.
The input to our models is either a frontal face image or an ocular-and-forehead image; the latter is used to measure accuracy for masked images. Our models require input images normalized in the same way, i.e. 3-channel RGB images of shape 3 × 128 × 128. The data loaders convert each input image to a tensor of size [3, 128, 128] for our neural network model to use.
The output of the last convolutional layer passes through an average pooling layer, which reduces the spatial dimensions of the feature maps to 1 × 1 and transforms them into one-dimensional feature vectors. The classification stage encodes the soft biometric and image-specific data: it receives the feature vector generated by the feature extraction stage and maps it to 12 labels for the FFHQ dataset or 25 labels for the CelebA dataset, one for each target facial soft biometric. The output is therefore a one-dimensional vector of size 12 or 25, depending on the dataset used, whose entries are one or zero, signifying the presence or absence of each label.
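The final 0/1 vector can be produced by thresholding per-label scores. This is a minimal sketch, assuming a sigmoid is applied to each raw output and a 0.5 threshold; both are conventional defaults for multilabel classification rather than values stated in the paper.

```python
import math

def labels_from_scores(scores, threshold=0.5):
    """Map the model's raw 12- or 25-element output vector to the 0/1
    presence/absence vector described above: sigmoid each score, then
    threshold.  The 0.5 threshold is an assumed default."""
    return [1 if 1.0 / (1.0 + math.exp(-s)) >= threshold else 0
            for s in scores]
```

For example, `labels_from_scores([2.3, -1.7, 0.4])` marks the first and third labels as present.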
The loss function used in our CNNs is a flattened loss, i.e. a loss function whose input from the previous layer is flattened first: the flatten operation turns multi-dimensional input tensors into one-dimensional outputs.
The metrics used in our paper are accuracy and F-beta.
Accuracy: a very useful metric for assessing classification models, accuracy can be thought of as the portion of predictions the model got right. It is expressed as:
Accuracy = (True Positives + True Negatives) / (Total Predictions) × 100
F-beta: an extension of the F-measure, which balances precision and recall through their harmonic mean. The balance is controlled by a coefficient beta, which in our case is 0.5:
Fbeta = ((1 + beta²) × Precision × Recall) / (beta² × Precision + Recall)
These metrics help us understand the system better and act as a basis for comparing the different models used. Comparing the accuracies of the individual labels in our multilabel classification also shows the role each label plays in the final overall accuracy. The table above shows the overall performance of the system in terms of accuracy and F-beta and how it varies with the model used. Moreover, there is quite a gap between the values for the two datasets. However, the accuracy and F-beta values for masked and unmasked images are quite similar, with minimal difference between the two, indicating that the recognition system would work comparably well for both masked and unmasked face images.
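Both metrics can be computed directly from confusion-matrix counts; a small self-contained sketch of the formulas above.

```python
def accuracy(tp, tn, fp, fn):
    """Portion of predictions the model got right, as a percentage."""
    return (tp + tn) / (tp + tn + fp + fn) * 100

def fbeta(tp, fp, fn, beta=0.5):
    """F-beta as defined above; beta = 0.5 weights precision more
    heavily than recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With tp = 8, fp = 2, fn = 2, both precision and recall are 0.8, so F-beta is 0.8 for any beta.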

VI. RESULTS AND DISCUSSION
We can further evaluate the accuracies of the individual labels that make up the overall accuracy of the recognition system, and how they contribute individually to the results, in the following tables for all the models implemented.
Table 4: Masked CelebA, accuracy per label
Table 5: Unmasked CelebA, accuracy per label
Our results indicate that it is possible to provide soft biometric labels for masked images, i.e. using images containing only the ocular and forehead regions. The difference in overall accuracy between unmasked and masked data is no more than 5.27% for any of the models: the maximum difference is 5.27% for the CelebA dataset and 2.60% for the FFHQ dataset. The difference in F-beta (for beta = 0.5) between unmasked and masked data is no more than 6.47% for any of the models: the maximum difference is 6.47% for the CelebA dataset and 2.78% for the FFHQ dataset.
These overall accuracies are calculated by averaging the per-label accuracies, i.e. over 25 labels for the CelebA dataset; these labels are listed in Table 8. For the FFHQ dataset, the labels are listed in Table 9. The figure above helps us understand the difference in accuracy across labels, and which model and dataset work best for each label. For both datasets and both kinds of input images, the model that provides the best results is MobileNetV2, which attains the maximum value of both F-beta and accuracy.
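The averaging described above amounts to a one-liner; the per-label values below are hypothetical, for illustration only, not figures from the paper.

```python
def overall_accuracy(per_label_acc):
    """Overall accuracy as the plain mean of the per-label accuracies
    (25 values for CelebA, 12 for FFHQ)."""
    return sum(per_label_acc) / len(per_label_acc)

# Hypothetical per-label accuracies, for illustration only:
print(overall_accuracy([91.0, 88.5, 95.0, 89.5]))  # 91.0
```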
As the number of soft biometric attributes predicted in one go increases, the overall accuracy tends to decrease. This is expected, since the overall result is the average of all the per-label accuracies; there is thus a fair trade-off between the number of soft biometrics detected and the accuracy obtained.
Coming to the number of parameters in the different models we used, a pictorial representation of the differences between the four is given in Table 1.
However, the other models used in our paper should not be disregarded: even when they do not provide the best accuracy or F-beta score, the difference in accuracy is within experimental limits and stays under 1.5% in all cases. Hence, other models can be used too, depending on time and system constraints, while compromising only a little on the efficiency of the model.

VII. CONCLUSION
In this study we have seen that the enhanced MobileNetV2 gives the best accuracy in detecting facial soft biometric features for identifying people wearing a facial mask. We compared results for different models enhanced using transfer learning techniques, for images both with and without facial masks. We calculated the accuracy and F-beta values (an F-beta value closest to 1 is considered best) for all the cases and conclude the following. MobileNetV2 with two extra fully connected layers (enhanced) gives the best accuracy for both masked and unmasked images and for both datasets. This can be attributed to the model having fewer parameters while retaining high representational power, and to its use of techniques such as point-wise convolutions, bottleneck layers, residual connections, and depth-wise separable convolutions.
The difference between the results on the two datasets can be attributed to the differing numbers of labels as well as the type and number of images. For the FFHQ dataset, which gave the better result of 92.5 percent, only 12 labels were taken into account, while the CelebA dataset, which gives 87 percent accuracy, is used to predict 25 soft biometric attributes. Hence we can conclude that there is a trade-off between accuracy and the number of labels.
We also notice that the accuracy on unmasked faces is higher than on masked faces, which is expected, since the unconstrained frontal face images used for unmasked faces offer a larger region of interest than the eyes and forehead portion alone in masked face images. However, it is noteworthy that the difference between the two accuracies is small (a few percent at most), so the system achieves approximately similar results for both masked and unmasked images.
Therefore, we conclude that this methodology could prove quite beneficial in the current context, where face engines are forced to work with images of people wearing masks; we evaluate the feasibility of using partial images containing only the ocular and forehead regions (Figure 1). This work can have a wide array of applications, especially as it takes into account images from low-power devices: in the real world it can be beneficial in security surveillance systems [38], in census systems studying the demographics of an area, and in identifying a particular segment of people (e.g. an age group) for a targeted commercial product.
Further, we have considered the overall accuracy of the models, which aggregates many sub-classes (multilabel classification); when considering individual accuracies, the values can vary per sub-class due to class imbalance, i.e. the number of samples per label may vary. As future scope, we envision a number of areas into which this project could extend, for example creating clusters based on the labels, or integrating it with a hard biometric such as gait or iris recognition, which would further enhance the face recognition system.

I. Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.

II. Funding details (In case of Funding)
This study is not funded by any agency.

III. Conflict of interest
The authors declare that they have no conflict of interest.

IV. Informed Consent
The work has been done completely on datasets available publicly and does not involve any other research on humans and/or animals.