LOW RESOLUTION FACE RECOGNITION USING GENERATIVE ADVERSARIAL NETWORK (GAN)

Although face recognition systems have achieved very good performance in recent years, Low Resolution Face Recognition (LRFR) remains challenging because low resolution images decrease the accuracy. This research aimed to find the best SR method to solve the LRFR problem. The YTF dataset was used for fine-tuning the SR methods, while the LFW dataset was used for fine-tuning and evaluating the FaceNet model. The resolution of the images was increased using Res-Net GAN and RRDB GAN, and the resulting images were then recognized using FaceNet. The images enhanced by RRDB GAN reached the highest accuracy, 98.96%.


Introduction
A face recognition system is a system designed to recognize faces based on a database of faces enrolled beforehand [1]. In recent years, face recognition systems have improved considerably, especially with deep learning architectures. One state-of-the-art deep learning method [2] achieved an accuracy above 97% on public face datasets such as LFW [3]. Although this algorithm performs well on faces with good conditions and poses, these faces generally need to be in high resolution. Designs that perform well on High Resolution (HR) images cannot be used directly on Low Resolution (LR) images. Surveillance cameras are widely used in public places such as streets, offices, and even indoor rooms (usually placed in a corner). This creates a challenge for face recognition systems, because the detected faces will be low in resolution. This research is motivated by that challenge.
This work is based on prior work published in [4], but improves some methods and compares which method increases the accuracy for Low Resolution Face Recognition (LRFR). In this work, the resolution of the LR images is increased and compared using the state-of-the-art super resolution methods of [5] and [6]. Several methods can be used for super resolution, such as Bicubic Interpolation [7], Convolutional Neural Networks (CNN), and Generative Adversarial Networks (GAN) [4]. GAN is currently the state-of-the-art method for super resolution compared with the other methods [6]. This work compares two super resolution methods: Residual in Residual Dense Block Deep Convolutional GAN (RRDB DCGAN) and Residual Network Deep Convolutional GAN (ResNet DCGAN). The face dataset used in this research is Megaface Challenge 2.

Related Works
There are three ways to solve LRFR [7]. The first is downsampling the HR images to LR images. The second is increasing the resolution of the LR images using interpolation or super resolution methods [8]; after increasing the resolution, the usual face recognition methods for HR images can be used to increase the accuracy. The third is combining several methods, the most common being Canonical Correlation Analysis (CCA) [9].
In 2015, [2] used a CNN to build a face recognition system. The datasets used in that research are Labelled Faces in the Wild (LFW) [3] and the YouTube Faces Database (YTF) [10]. The accuracy is 98.95% on LFW and 97.3% on YTF. [2] concluded that the error keeps getting lower as the CNN uses more layers.
CNNs have reached very good results in face recognition, especially on HR images [2], [11], [12]. Since then, CNNs have been widely used to solve face recognition problems, with modifications such as the Multi Resolution CNN of [7]. For the pre-processing stage, [7] uses a bicubic interpolation method to increase the value of cosine similarity. Cosine similarity is a method to measure the similarity between two objects, with values ranging from -1, meaning exactly opposite, to 1, meaning exactly similar. [13] also researched low resolution face recognition, recognizing faces from video using a CNN modified with a manifold-based track comparison strategy for low resolution video recognition. The training datasets are FaceScrub [16] and MSRA-CFW (a dataset of celebrity faces on the web) [17], with YTF used for testing; this research reached 80.3% accuracy on the YTF dataset.
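The cosine similarity measure described above can be sketched in a few lines. This is a minimal NumPy illustration of the standard formula, not code from the cited works:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors.

    Ranges from -1 (exactly opposite direction) to 1 (same direction),
    matching the definition used when comparing face features.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical vectors give 1.0, opposite vectors give -1.0, and orthogonal vectors give 0.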
[14] also researched low resolution face recognition using a Deep Coupled ResNet, which consists of one trunk network and two branch networks. The trunk network is trained using three different resolutions of each face image, and the branch networks are trained on the HR and LR images. The objective of this training is to make the distance between HR and LR images as small as possible. The branch network parameters are optimized using Coupled Mapping. The research used two datasets, LFW and the SCface Database [18]. LFW is used for testing at two resolutions: 8 x 8 pixels reached 93.6% accuracy and 112 x 96 pixels reached 98.7% accuracy.
[15] researched face recognition using FaceNet with several image sizes, using the LFW dataset. Every image in the dataset was resized to 10 x 10, 20 x 20, 40 x 40, 60 x 60, 80 x 80, 100 x 100, and 160 x 160. FaceNet accuracy on 160 x 160 pixel images was 99.2% with a validation rate of 97.1%, but the accuracy gradually decreases with smaller images: on 10 x 10 pixels the accuracy was 58.2% with a validation rate of 1.6%. The size compared in this research is 40 x 40 pixels, where FaceNet accuracy was 98.5% with a validation rate of 88.5%.
Super resolution methods are used to increase the value of cosine similarity. There are several super resolution methods, such as bicubic interpolation [19], CNN [20], and GAN [9][21]. Recently, GANs have achieved impressive results on Single Image Super Resolution (SISR) problems. Basically, a GAN uses CNNs in its training phase. The Deep Convolutional GAN architecture was first proposed by Radford in 2015 [22]. The two modified GANs used in this work are Residual Network GAN (ResNet GAN) and Residual in Residual Dense Block GAN (RRDB GAN). ResNet, or skip connections, is used in the GAN so that the CNN can have more than 1,000 layers [23] with very small error. These two GAN methods were proposed by [5] and [6], but had never been applied directly to LRFR problems.
The face recognition method used is FaceNet [24]. FaceNet learns to map face images to a compact Euclidean space where distance measures face similarity. FaceNet embeddings can be used as feature vectors by standard methods to solve tasks like face recognition, verification, and clustering [25] [26]. For example, k-NN [27] can be used for face recognition by using the embeddings as feature vectors. All previous works are summarized in Table 1.
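As an illustration of k-NN over embeddings, the sketch below identifies a probe against a gallery of hypothetical embedding vectors. The function `knn_identify` and the toy data are ours, not from [24] or [27]:

```python
import numpy as np

def knn_identify(embedding, gallery, labels, k=1):
    """Identify a face: majority label among the k gallery embeddings
    closest (in L2 distance) to the probe embedding."""
    gallery = np.asarray(gallery, dtype=float)
    probe = np.asarray(embedding, dtype=float)
    dists = np.linalg.norm(gallery - probe, axis=1)  # L2 to each gallery face
    nearest = np.argsort(dists)[:k]                  # indices of k nearest
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)          # majority vote
```

In practice the gallery would hold FaceNet embeddings of enrolled identities rather than 2-D toy vectors.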

Low Resolution
Low resolution is one of the main issues in face recognition [28], which showed that even state-of-the-art FR methods reportedly reaching 99.63% accuracy on LFW [3] and 99.087% on the MegaFace Challenge [29] degrade significantly on low resolution images. Low resolution images are typically captured by surveillance cameras; the great distance between camera and subject lowers the resolution of the captured face. Such images lack the visual information a deep learning model needs to learn feature representations, compared with HR images. [30] also showed that down-sampled images degrade the accuracy of face recognition.
Generally, there are two ways to solve LRFR problems. The first is to decrease the distance between the LR and HR images, either by applying SR methods or by downsampling the HR image. The second is to use a CNN directly to learn the features shared between LR and HR images. Based on [4], images smaller than 64 x 64 pixels are classified as low resolution; this research uses 32 x 32 pixel images.

MTCNN
MTCNN is a method used for face detection and alignment; CNNs have achieved impressive results on these two tasks, especially in unconstrained environments [31]. MTCNN uses a CNN architecture in three stages. The first stage (P-Net) obtains candidate windows through a shallow CNN, producing candidate facial windows and bounding box regression vectors. The candidates are calibrated based on the estimated bounding box regression vectors, and non-maximum suppression (NMS) is then used to merge highly overlapping candidates. The second stage (R-Net) evaluates the windows and rejects the large number of candidates categorized as non-faces, then again calibrates with bounding box regression and conducts NMS. The third stage (O-Net) outputs five facial landmark positions.
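The NMS step used in all three stages can be sketched as follows. This is a generic greedy NMS in NumPy; MTCNN's actual implementation details may differ:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    Keep the highest-scoring box, drop remaining boxes whose IoU with it
    exceeds iou_thresh, and repeat. Boxes are (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with every remaining candidate
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # keep only low-overlap boxes
    return keep
```

Two heavily overlapping face candidates collapse to the higher-scoring one, while distant candidates survive.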

Super Resolution
Super resolution is a method used to increase the resolution of an LR image to an HR image. It is widely used to keep the quality of images when they are stretched to a bigger size. Super resolution is not the same as merely upsizing the image: it ensures that the upsized image does not lose its details and visual appearance. The main goal of super resolution is to find the missing pixel values of the HR image [32].
Super resolution can be divided into two classes: dynamic resolution (multiple image super resolution) and static resolution (single image super resolution). Dynamic resolution methods are based on the assumption that the multiple images are geometric transforms and misaligned versions of each other; combining and evaluating them together recovers more detail and fills the missing pixels of the HR image. But this approach is difficult because multiple images are not always available, so static resolution is more applicable and is used in this research. Many methods can be used for super resolution, such as bicubic interpolation, nearest neighbor, and Generative Adversarial Networks (GAN); recently GAN has reached very good performance and become the state of the art for super resolution.
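Of the simple baselines listed, nearest-neighbor upsampling is the easiest to illustrate. The minimal NumPy sketch below only repeats pixels and recovers no detail, which is precisely why learned SR methods are needed:

```python
import numpy as np

def upsample_nearest(img, scale):
    """Nearest-neighbor upsampling: each pixel is repeated `scale` times
    along both axes, enlarging an (h, w) image to (h*scale, w*scale)."""
    return np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)
```

A 32 x 32 face upsampled this way to 128 x 128 has the right size but still only 32 x 32 pixels of information.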

Generative Adversarial Network (GAN)
GAN is a better method for super resolution because the generated image looks more real. GAN was first proposed by Ian Goodfellow through the concept of game theory in 2014 [21]. Basically, a GAN consists of two neural networks: a Generator (G) and a Discriminator (D). G is used to generate a fake image and D learns which features make an image real. The main principle of GAN is to make G and D compete with each other: G tries to generate an image such that D cannot recognize that it is not a real image, while D keeps learning to decide whether the input image is real or not. The GAN objective is

min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 - D(G(z)))]   (1)
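Equation (1) can be evaluated empirically on a batch of discriminator outputs by replacing the expectations with batch means. The sketch below is our illustration, not code from [21]:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Empirical estimate of the GAN value function V(D, G) in equation (1).

    d_real: discriminator outputs D(x) on a batch of real images.
    d_fake: discriminator outputs D(G(z)) on a batch of generated images.
    The two expectations are approximated by batch means.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
```

When D is maximally confused and outputs 0.5 everywhere, V(D, G) equals 2 log(0.5), the theoretical equilibrium value.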

Dataset
This research uses two face datasets for the training and evaluation phases: the YouTube Faces Database (YTF) [10] as the training dataset for the SR neural networks, and Labelled Faces in the Wild (LFW) [3] for training the FaceNet neural network and for the evaluation phase. The first dataset, YTF, was designed especially for solving face recognition problems in video. It contains 3,425 videos of 1,595 subjects, with an average of 2.15 videos per subject. For each subject, 10 sample images are used for training the neural network.
The second dataset, LFW, is a standard benchmark for automatic face verification. It contains 13,233 images of 5,749 subjects and provides 6,000 pairs of sample data for the evaluation phase. Images in this database also feature various challenges such as variations in pose, expression, and illumination.
The YTF and LFW datasets are prepared by cropping and aligning. Every image in the dataset is checked for a detected face; if a face is detected, the image is cropped precisely to the face only, so that the background and other attributes do not affect recognition. The face is also aligned to a straight position. MTCNN is used for this task and ensures that all the images are still detected as faces. The goal of this process is to obtain cropped and aligned faces. A sample output of MTCNN is shown in Figure 1.

Super Resolution
The super resolution takes the cropped and aligned images as input. The ground truth (GT) image is obtained by cropping the face image to 128 x 128 pixels, while the LR image is the GT image resized down to 32 x 32 pixels. The preparation of the LR and GT images is shown in Figure 2.

Figure 2 Preparation for GT and LR dataset
The 32 x 32 pixel images are upsampled using both SR methods, ResNet GAN and RRDB GAN. Both methods enlarge the images 4x, so the output size of the SR is 128 x 128. The images are then fed to FaceNet for recognition. Both SR methods are explained further in sections 4.3.1 and 4.3.2.
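The LR/GT pair preparation can be imitated with block averaging, as in the sketch below. This is a simple stand-in for the resize operation; the exact resampling kernel used in the paper is not specified here:

```python
import numpy as np

def make_lr(hr, factor=4):
    """Simulate an LR image from a GT crop by averaging factor x factor
    pixel blocks: a 128 x 128 face becomes 32 x 32 when factor = 4.
    Works for grayscale (h, w) and color (h, w, c) arrays."""
    h, w = hr.shape[:2]
    blocks = hr.reshape(h // factor, factor, w // factor, factor, *hr.shape[2:])
    return blocks.mean(axis=(1, 3))
```

The SR networks are then trained to map `make_lr(gt)` back to `gt`.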

Residual Network GAN (Res-Net GAN)
The ResNet GAN used in this research is also called SRGAN, proposed by [6]. SRGAN basically uses the ResNet GAN architecture, but with several modifications in how the perceptual loss function is calculated.
In this single image super resolution (SISR) task, SRGAN aims to estimate the HR image and produce the super-resolved image I^SR from the LR image I^LR; in the dataset, I^LR and I^HR are a pair, and I^SR is the image predicted by SRGAN. As in the basic GAN, the ultimate goal of this method is to train a generator network G and a discriminator network D. The generator is a feed-forward CNN G_θG, where θ_G = {W_1:L; b_1:L} denotes the weights and biases of the L-layer deep network. The parameters are obtained by optimizing an SR-specific loss function l^SR. From the pre-processing step we get training images I_n^HR, n = 1, 2, ..., N and I_n^LR, n = 1, 2, ..., N, so the generator solves

θ̂_G = argmin_θG (1/N) Σ_{n=1}^{N} l^SR(G_θG(I_n^LR), I_n^HR)

According to the original formulation of GAN [21], the adversarial objective is to solve equation 1. As explained in section 3.4, the goal of this equation is for G to generate a fake image that can fool D, while D is trained to classify whether the input image is fake or real, so that in the end the generator produces images that D can hardly distinguish from real ones.
The generator architecture consists of 16 residual blocks with the same layout: two convolutional layers with 3x3 kernels and 64 feature maps, batch normalization layers, and Parametric ReLU as the activation function. At the end, two sub-pixel convolutional layers increase the resolution of the input image.
To differentiate fake images from real images, the discriminator D needs to be trained. The architecture of D uses LeakyReLU with α = 0.2 as the activation function and contains eight convolutional layers with 3x3 filter kernels, the number of kernels increasing by a factor of 2 from 64 to 512, as in the VGG network. Strided convolutions are used to reduce the image resolution each time the number of features is doubled. The resulting 512 feature maps are followed by two dense layers and a sigmoid activation function giving the probability used to classify the image.

Residual in Residual Dense Block GAN (RRDB GAN)
Basically, RRDB GAN also uses a residual network GAN, but modifies the network architecture. In the basic ResNet GAN, the generator blocks contain a convolutional layer, a batch normalization (BN) layer, and an activation layer. RRDB GAN removes all BN layers and adds a Dense Block (DB) to every block. Removing the BN layer from the basic block has been shown to increase performance and reduce complexity in SR [33] and deblurring [34] tasks. A BN layer normalizes features using the mean and variance of the current batch during training, while during testing it normalizes using the estimated mean and variance of the whole training dataset. When the training and testing datasets differ a lot, the BN layer reduces performance and limits generalization ability, so [5] removed it to produce stable and consistent performance during training.
In CNNs, more layers and connections generally give better performance [35] [36]. The RRDB network uses dense blocks on the main path, which increases network capacity through the dense connections. Residual scaling and smaller initialization are also used to improve the architecture: scaling the residual by β, where 0 < β < 1, before adding it to the main path stabilizes the process, and residual architectures train more easily when the initial parameter variance is smaller.
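The residual scaling trick reads directly as code. Below is a toy sketch with an arbitrary block function; the value β = 0.2 is an assumption here, not taken from this paper:

```python
import numpy as np

def residual_block_scaled(x, block_fn, beta=0.2):
    """Residual connection with scaling: the block output is multiplied
    by beta (0 < beta < 1) before being added back to the main path,
    which damps the residual contribution and stabilizes deep training."""
    return x + beta * block_fn(x)
```

With β = 1 this reduces to a plain residual connection; smaller β shrinks the update each block contributes.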
There is also an improvement to the discriminator based on the Relativistic GAN. The standard discriminator D estimates whether an image is real or fake, but the relativistic discriminator estimates whether a real image x_r is relatively more realistic than a fake one x_f. The relativistic discriminator can be calculated as

D_Ra(x_r, x_f) = σ(C(x_r) − E_{x_f}[C(x_f)])

where σ is the sigmoid function, C is the raw discriminator output, and E_{x_f}[C(x_f)] is the average over all the fake images in a single batch.
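The relativistic formula can be checked numerically with the sketch below; the raw scores standing in for C are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relativistic_d(c_real, c_fake):
    """Relativistic discriminator: sigmoid(C(x_r) - E[C(x_f)]),
    the probability that a real image is more realistic than the
    average fake image of the batch."""
    return sigmoid(np.asarray(c_real, dtype=float) - np.mean(c_fake))
```

When real and fake scores are equal the output is 0.5; a real score far above the fake average pushes it toward 1.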

Facenet
FaceNet is a method proposed by [24]. It uses a deep CNN architecture with a triplet loss to obtain an embedding that can be used for face clustering, verification, and recognition. FaceNet provides a unified embedding f(x) that maps a face image x into a d-dimensional Euclidean space, f(x) ∈ ℝ^d. The triplet loss is inspired by nearest-neighbor classification: it ensures that an image of a person's face (anchor) is closer to all other images (positive) of the same person than to any image (negative) of another person. The triplet loss is illustrated in Figure 3. Figure 3: Triplet loss minimizes the distance between anchor and positive (same person) and maximizes the distance between anchor and negative (different person). The triplet loss can be defined as

L = Σ_{i=1}^{N} [ ||f(x_i^a) − f(x_i^p)||² − ||f(x_i^a) − f(x_i^n)||² + α ]_+

where N is the number of triplets in the set of all possible triplet pairs in the training set and α is a margin enforced between the anchor-positive and anchor-negative distances.
FaceNet trains the CNN using Stochastic Gradient Descent (SGD) with backpropagation and AdaGrad. The initial learning rate is 0.05, the margin α is 0.2, and ReLU is the activation function.
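With the margin α = 0.2, the triplet loss defined above can be computed as in the following NumPy sketch of the standard formula (not FaceNet's actual training code):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss over a batch of embedding triplets:
    sum_i max(||a_i - p_i||^2 - ||a_i - n_i||^2 + alpha, 0)."""
    anchor, positive, negative = (
        np.asarray(v, dtype=float) for v in (anchor, positive, negative)
    )
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # anchor-positive distances
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # anchor-negative distances
    return float(np.sum(np.maximum(d_pos - d_neg + alpha, 0.0)))
```

A triplet whose negative is already more than α farther away than the positive contributes zero loss.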
The LFW dataset output from SR is used in the training and evaluation phases. Every identity is split into a training set, a validation set, and a testing set, with 90% for training, 5% for validation, and 5% for testing. The minimum number of images for each identity is 4: 2 for training, 1 for validation, and 1 for testing. Validation is used to prevent the training process from underfitting or overfitting.
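The 90/5/5 identity split can be sketched as below. The helper is ours and assumes each identity's images are already collected in a list; the paper does not state its exact rounding rules:

```python
def split_identity(images, val_frac=0.05, test_frac=0.05):
    """Split one identity's images into train/val/test (~90/5/5),
    keeping at least one image each in val and test, which requires
    the identity to have at least 4 images."""
    n = len(images)
    if n < 4:
        raise ValueError("each identity needs at least 4 images")
    n_val = max(1, round(n * val_frac))
    n_test = max(1, round(n * test_frac))
    train = images[: n - n_val - n_test]
    val = images[n - n_val - n_test : n - n_test]
    test = images[n - n_test :]
    return train, val, test
```

An identity with exactly 4 images yields the 2/1/1 minimum split described above.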
During training and testing, the squared L2 distance threshold is set to 1.242, based on [24].
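Verification with this threshold is a one-line decision, as in the sketch below. Note that in [24] the 1.242 threshold applies to squared L2 distances between unit-normalized embeddings; the toy vectors here are for illustration only:

```python
import numpy as np

def same_person(emb_a, emb_b, threshold=1.242):
    """Face verification: two embeddings are declared the same person
    when their squared L2 distance falls below the threshold."""
    diff = np.asarray(emb_a, dtype=float) - np.asarray(emb_b, dtype=float)
    return bool(np.sum(diff ** 2) < threshold)
```

Pairs closer than the threshold verify as a match; more distant pairs are rejected.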

Results and Discussion
After training the FaceNet model on the training and validation sets of the datasets enhanced by Res-Net GAN and RRDB GAN, the model was tested using the testing set. The testing comparison can be seen in Table 2. The accuracy reached by FaceNet using Res-Net GAN was 95.53% and the validation rate was 94.667%; this result is better than [15] by 0.03% in accuracy and by 6.167% in validation rate. The accuracy of FaceNet using RRDB GAN reached 98.96% with a validation rate of 96.757%. This is the best result in this research, better than [15] by 0.46% in accuracy and by 8.257% in validation rate.

Conclusion
This research proposed a model to solve the LRFR problem: SR using Res-Net GAN or RRDB GAN before recognition with FaceNet. The model was evaluated using the LFW dataset. The results showed that applying SR before face recognition increases the accuracy, and that using RRDB GAN to enhance the low resolution images gave the best result. For further research, features such as gender and age category could be added to the face recognition model, narrowing down the possibilities for FaceNet to predict falsely.

Table 2. Comparison result on FaceNet

Face Recognition         Size       Accuracy   VAL
FaceNet [15]             40 x 40    98.5%      88.5%
Res-Net GAN + FaceNet    32 x 32    95.53%     94.667%
RRDB GAN + FaceNet       32 x 32    98.96%     96.757%