Discriminative center loss for face photo-sketch recognition

Face photo-sketch recognition refers to the process of matching sketches to photos. Recently, there has been growing interest in using convolutional neural networks to learn discriminative deep features. However, due to the large domain discrepancy and the high cost of acquiring sketches, the discriminative power of the deeply learned features is inevitably reduced. In this paper, we propose a discriminative center loss to learn domain invariant features for face photo-sketch recognition. Specifically, two Mahalanobis distance matrices are introduced to enhance intra-class compactness while preserving inter-class separability. Moreover, a regularization technique is adopted on the Mahalanobis matrices to alleviate the small sample problem. Extensive experimental results on the e-PRIP dataset verify the effectiveness of the proposed discriminative center loss.


Introduction
Face recognition technology has been widely used in public security [1], enterprise management [2] and video surveillance [3]. As one of the most important problems in face recognition, heterogeneous face recognition (HFR) [4] has attracted much attention in security scenarios. It aims to match face images captured by different sensors, which is more challenging than traditional face recognition. In HFR, the gallery set consists of commonly available visible light images, while images in the probe set come from other modalities, such as sketch images [8], infrared images [9], and aging facial images [10].
One of the most difficult HFR problems is to match photos with sketches obtained from eyewitness descriptions of criminals, namely face photo-sketch recognition. Face photo-sketch recognition [11] was first applied in the criminal investigation field. It can automatically retrieve the photos of suspects from sketches, so that the police can narrow down the list of suspects and handle cases more efficiently. Figure 1 shows some examples of sketch-photo pairs.
Figure 1 Different sketch-photo pairs from (a) the CUHK [12], (b) the PRIP-HDC [8], and (c) the e-PRIP [13] datasets.

Face photo-sketch recognition methods are usually divided into intra-modality methods and inter-modality methods. For intra-modality methods [14,15,16,17,18], a synthetic photo or sketch is first generated using an advanced synthesis method, and then traditional face recognition methods are used to match the pseudo-photo to the photo (or the sketch to the pseudo-sketch). Intra-modality methods rely heavily on the quality of the generated images and are usually computationally expensive [18]. Inter-modality methods mainly extract domain invariant features to represent the images. Recently, convolutional neural networks (CNNs) have been widely utilized for face recognition and other recognition problems [19], [20], [21]. Because of their expressive power, there has been growing interest in using CNNs to learn domain invariant features for face photo-sketch recognition [22,23,24,13].
Usually, a CNN is trained with a softmax loss function. However, the softmax loss only encourages separation among different categories; it does not make deep features of the same category more compact. To address this problem, Wen et al. [25] proposed a center loss to constrain the intra-class distance, where the distance used is the Euclidean distance. The center loss significantly improves the performance of traditional face recognition. However, in face photo-sketch recognition, sketches and photos come from different modalities, so intra-class differences are larger and inter-class differences are smaller than in traditional face recognition. Moreover, due to the high cost of acquiring sketches, the publicly available sketch face datasets are very small, resulting in a small sample problem.
In this paper, we propose a discriminative center loss to obtain highly discriminative deep features for face photo-sketch recognition. Specifically, two Mahalanobis distance matrices are introduced to improve the discriminative power of the deeply learned features. The Mahalanobis distance takes into account the relationships among features and can eliminate the correlation between disturbance variables. The key element of the proposed loss is the estimation of the intra-class and inter-class covariance matrices of the deep features. A regularization technique is adopted on the dual Mahalanobis matrices to alleviate the small sample problem. The estimated Mahalanobis distances are updated during network training, which guides the network to learn more scale-invariant, discriminative sketch face features.
Based on the discriminative center loss, we propose a method for face photo-sketch recognition. The steps of the method are summarized as follows: 1) Perform face detection and alignment on the sketch face dataset; 2) Pre-train the Resnet18 model on the ImageNet dataset; 3) Use the discriminative center loss to guide Resnet18 to learn discriminative deep features, initializing the model with the ImageNet pre-trained parameters and then fine-tuning it on the sketch dataset; 4) In the testing phase, compute the matching rate based on the features extracted by the trained model.
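The following is a minimal sketch of steps 2)-4), assuming PyTorch and torchvision; the helper names (`softmax_loss`, `dc_loss`, `lam`, `train_loader`) and hyper-parameters are illustrative placeholders rather than the paper's exact configuration:

```python
# A hedged sketch of the training pipeline (steps 2-4); helper names
# and hyper-parameters are illustrative, not the paper's exact values.
import torch
import torchvision

# Step 2: Resnet18 pre-trained on ImageNet.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()  # expose the 512-d deep features

# Step 3: fine-tune on sketch-photo data with the overall loss,
# i.e., softmax loss plus the discriminative center loss (Eq. (15) below).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# for images, labels in train_loader:          # hypothetical data loader
#     features = model(images)
#     loss = softmax_loss(features, labels) + lam * dc_loss(features, labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()

# Step 4: at test time, extract features and rank the gallery by distance
# (see the matching sketch in the experiment section).
```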

Related work
In this section, intra-modality and inter-modality face photo-sketch recognition methods are reviewed. Since Fisher Linear Discriminant Analysis is related to our proposed method, methods based on it are also reviewed.
For intra-modality methods, images from one modality are first transformed into the other modality, and then a matching algorithm is applied to the synthesized images. For example, the Generative Adversarial Network (GAN) [26] has been used for face synthesis within this category and has achieved good performance. Gao et al. [14] synthesized sketches using Embedded Hidden Markov Models (E-HMM) to model the nonlinear relationship between sketches and photos. Gao et al. [17] proposed Sparse Neighbor Selection (SNS) to render an initial pseudo-image and then used Sparse Representation based Enhancement (SRE) to improve the quality of the synthesized images. With the development of convolutional neural networks, face synthesis methods based on deep learning [18] have also achieved good performance. However, synthesis-based face photo-sketch recognition heavily depends on the quality of the synthesized images, which can lead to inaccurate recognition results.
Inter-modality methods directly extract domain invariant features from different modalities. Bhatt et al. [27] used extended uniform circular local binary descriptors to characterize sketches and photos; they decompose sketches and photos into multi-resolution pyramids and extract high-frequency information from both. Galoogahi et al. [28] proposed a feature extraction method termed Histogram of Averaged Oriented Gradients (HAOG) to describe the shape and local gradients of the images. With the rapid development of deep learning, Saxena et al. [32] used a CNN trained on optical face images to perform heterogeneous recognition tasks. Galea et al. [23] utilized the VGG-Face network [33] to match composites with photos and achieved competitive performance. Zhang et al. [43] proposed an inter-modality face recognition approach that reduces the modality gap at the feature extraction stage, based on coupled information-theoretic encoding to capture discriminative local face structures and effectively match photos and sketches. Wang et al. [44] proposed a face photo-sketch synthesis and recognition method using a multiscale Markov Random Fields (MRF) model. Galoogahi et al. [45] proposed the Local Radon Binary Pattern (LRBP), a face descriptor that directly matches face photos and sketches across modalities; it is inspired by the fact that the shapes of a face photo and its corresponding sketch are similar, even when the sketch is exaggerated by the artist.
A related line of work is Fisher Linear Discriminant Analysis, which considers intra-class and inter-class variances. For example, Sang et al. [7] introduced a Fisher loss to extract more discriminative features by adjusting the features of each class to satisfy the Fisher criterion. Cheng et al. [5] introduced a rotation-invariant layer to learn rotation-invariant features and a Fisher discriminative layer to encourage features with small intra-class scatter and large inter-class separation. Ye et al. [6] proposed a discriminative feature learning method to increase the separability of deep features. These methods rely on the Euclidean metric to extract discriminative features. In contrast, our method introduces a new Mahalanobis distance to further learn modality invariant features.

Proposed method
In this section, we first revisit the softmax loss and the center loss. Then, inspired by the center loss [25], we introduce the proposed discriminative center loss, which improves the discriminative power of the learned features for face photo-sketch recognition. Finally, we detail the network architecture.

Softmax loss and center loss
The softmax loss is one of the most widely used loss functions in deep learning. It can be written as

$$L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}}, \qquad (1)$$

where $x_i \in \mathbb{R}^d$ denotes the deep feature of the $i$-th sample, belonging to the $y_i$-th class, $W_j$ and $b_j$ denote the $j$-th column of the weights and the bias of the last fully connected layer, $m$ is the batch size, and $n$ is the number of classes. The softmax loss encourages separability between classes; however, it does not make the features within a class more compact. To overcome this problem, the center loss was proposed in [25] to effectively characterize the intra-class variations:

$$L_C = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2, \qquad (2)$$

where $c_{y_i} \in \mathbb{R}^d$ denotes the $y_i$-th class center of the deep features.
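As a reference point, a minimal PyTorch sketch of the center loss of Eq. (2) might look as follows; the class and variable names are ours, and averaging over the batch is an implementation choice:

```python
# A minimal sketch of the center loss (Eq. 2), assuming PyTorch.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # One learnable center per class, as in Wen et al. [25].
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Gather the center c_{y_i} of each sample and penalize the
        # squared Euclidean distance to it (averaged over the batch).
        diff = features - self.centers[labels]   # (m, d)
        return 0.5 * (diff ** 2).sum(dim=1).mean()
```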

Discriminative center loss
The center loss significantly improves traditional face recognition. However, in face photo-sketch recognition, sketches and photos come from different modalities, so intra-class differences are larger and inter-class differences are smaller than in traditional face recognition. Moreover, due to the high cost of acquiring sketches, the publicly available sketch face datasets are very small, resulting in a small sample problem. To further reduce the modality gap, a discriminative center loss is proposed in this section, and a regularization technique is adopted on the dual Mahalanobis matrices to alleviate the small sample problem.
The class center $c_{y_i}$ is first set to the weight $W_{y_i}$, and the modified center loss can be rewritten as

$$L_C' = \frac{1}{2} \sum_{i=1}^{m} (x_i - W_{y_i})^T I (x_i - W_{y_i}), \qquad (3)$$

where $I \in \mathbb{R}^{d \times d}$ denotes the identity matrix.
Since sketches and photos come from different modalities, there is a large intra-class difference and a small inter-class difference in sketch face datasets. In equation (4) of reference [34], a specific Mahalanobis distance matrix is introduced to learn class-dependent distance features in the embedding space. Here, we instead introduce a new Mahalanobis distance matrix to learn modality invariant features in the embedding space. Hence, to further reduce the modality gap, inspired by [34], a Mahalanobis distance matrix $M$ is introduced to replace the identity matrix $I$ in (3), and the corresponding Mahalanobis loss can be written as

$$L_M = \frac{1}{2} \sum_{i=1}^{m} (x_i - W_{y_i})^T M (x_i - W_{y_i}), \qquad (4)$$

where $M$ is a symmetric positive semi-definite matrix. The Mahalanobis distance is an efficient way to measure the similarity between two sample sets. Unlike the center loss, it takes into account the relationships among features and is scale-invariant; therefore, it can learn a class- and direction-dependent distance metric in the embedding space.
A commonly used Mahalanobis distance matrix is the inverse of the class covariance matrix, $M = S^{-1}$, where the class covariance matrix $S$ can be estimated as

$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T, \qquad (5)$$

where $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$ denotes the center of the batch features. Since label information is critical for exploring discriminative information, we introduce it into the class covariance matrix (5) to consider the intra-class and inter-class variances in the embedding space, and thus the domain gap is bridged. Specifically, $S$ is replaced by an intra-class covariance matrix and an inter-class covariance matrix.
The intra-class covariance matrix can be estimated as

$$S_a = \frac{1}{n} \sum_{i=1}^{n} (x_i - W_{y_i})(x_i - W_{y_i})^T, \qquad (6)$$

where $x_i$ is the $i$-th sample and $W_{y_i}$ is the class center corresponding to that sample.
The inter-class covariance matrix can be estimated as

$$S_e = \frac{1}{n(k-1)} \sum_{i=1}^{n} \sum_{j \neq y_i} (x_i - c_j)(x_i - c_j)^T, \qquad (7)$$

where $c_j$ is the center of another class (here $c_j = W_j$) and $k$ is the number of classes.
Since the intra-class covariance matrix $S_a$ and the inter-class covariance matrix $S_e$ are positive semi-definite, they can be decomposed as

$$S_a = U_a \Lambda_a U_a^T, \qquad (8)$$
$$S_e = U_e \Lambda_e U_e^T, \qquad (9)$$

where $U_a$ and $U_e$ are orthogonal matrices, $\Lambda_a = \mathrm{diag}[\lambda_1, \cdots, \lambda_d]$ with $\lambda_i$ being the eigenvalues of $S_a$, and $\Lambda_e = \mathrm{diag}[\gamma_1, \cdots, \gamma_d]$ with $\gamma_i$ being the eigenvalues of $S_e$.
Given a small training sample, the larger eigenvalues of the true covariance matrix are always highly overestimated in the estimated covariance matrix; that is, the large values of $\Lambda_a$ and $\Lambda_e$ are highly biased. Following [35], a regularization technique is applied to the two covariance matrices. Specifically, the covariance matrices in (8) and (9) are modified as follows:

$$\hat{S}_a = (1 - t_a) S_a + t_a \alpha_a I, \qquad (10)$$
$$\hat{S}_e = (1 - t_e) S_e + t_e \alpha_e I, \qquad (11)$$

where $\alpha_a = \frac{1}{d} \mathrm{tr}(S_a)$, $\alpha_e = \frac{1}{d} \mathrm{tr}(S_e)$, $0 \le t_a \le 1$, and $0 \le t_e \le 1$. Correspondingly, the diagonal matrices $\Lambda_a$ and $\Lambda_e$ in (8) and (9) are modified as follows:

$$\hat{\Lambda}_a = (1 - t_a) \Lambda_a + t_a \alpha_a I, \qquad (12)$$
$$\hat{\Lambda}_e = (1 - t_e) \Lambda_e + t_e \alpha_e I. \qquad (13)$$

The two parameters $t_a$ and $t_e$ shrink $\hat{\Lambda}_a$ and $\hat{\Lambda}_e$ toward the identity matrix; hence, the shrunken $\hat{S}_a$ and $\hat{S}_e$ suppress the overestimation of the large eigenvalues. In this way, performance is improved in practice.
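A minimal sketch of this shrinkage step, assuming PyTorch; `t` stands for either $t_a$ or $t_e$:

```python
# Shrinkage regularization of Eqs. (10)-(11): S_hat = (1-t)*S + t*alpha*I,
# with alpha = tr(S)/d, which pulls the eigenvalues toward their mean.
import torch

def shrink(S: torch.Tensor, t: float) -> torch.Tensor:
    d = S.shape[0]
    alpha = torch.trace(S) / d
    return (1.0 - t) * S + t * alpha * torch.eye(d, device=S.device)
```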
Combining (10) and (11), the Mahalanobis loss (4) can be reformulated as

$$L_{DC} = \frac{1}{2} \sum_{i=1}^{m} (x_i - W_{y_i})^T \left( \hat{S}_a^{-1} + \beta \hat{S}_e^{-1} \right) (x_i - W_{y_i}), \qquad (14)$$

where $\beta$ is a tradeoff parameter. Since the Mahalanobis distance matrix in (14) contains both the intra-class matrix $\hat{S}_a$ and the inter-class matrix $\hat{S}_e$, we term the loss function (14) the discriminative center loss.
Similar to [25], the overall loss function is

$$L = L_S + \lambda L_{DC}, \qquad (15)$$

where $\lambda$ is a tradeoff parameter.
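Putting Eqs. (6)-(14) together, a hedged PyTorch sketch of the discriminative center loss could look as follows; the combination $\hat{S}_a^{-1} + \beta \hat{S}_e^{-1}$ reflects our reconstruction of Eq. (14), and all names and default values are illustrative:

```python
# A hedged end-to-end sketch of the discriminative center loss (Eq. 14);
# the exact Mahalanobis combination is our reading of the text.
import torch
import torch.nn as nn

class DiscriminativeCenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, beta=1.0, t_a=0.5, t_e=0.5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.beta, self.t_a, self.t_e = beta, t_a, t_e

    def forward(self, x, labels):
        n, d = x.shape
        k = self.W.shape[0]
        diff = x - self.W[labels]                  # intra-class deviations
        S_a = diff.t() @ diff / n                  # Eq. (6)
        S_e = torch.zeros(d, d, device=x.device)
        for j in range(k):                         # Eq. (7): other-class centers
            dj = x[labels != j] - self.W[j]
            S_e += dj.t() @ dj
        S_e /= n * (k - 1)
        # Shrinkage regularization, Eqs. (10)-(11).
        I = torch.eye(d, device=x.device)
        S_a = (1 - self.t_a) * S_a + self.t_a * (torch.trace(S_a) / d) * I
        S_e = (1 - self.t_e) * S_e + self.t_e * (torch.trace(S_e) / d) * I
        # Mahalanobis matrix combining intra- and inter-class statistics.
        M = torch.linalg.inv(S_a) + self.beta * torch.linalg.inv(S_e)
        # Per-sample quadratic form (x - W_y)^T M (x - W_y), Eq. (14).
        return 0.5 * torch.einsum('ni,ij,nj->n', diff, M, diff).mean()
```

The shrinkage step keeps both covariance matrices well conditioned, so the inverses in the Mahalanobis matrix remain numerically stable even with small batches.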

Network structure
Resnet18 [36] has achieved significant performance improvements in image classification and related tasks, such as object detection and face recognition, by using residual blocks. Figure 2 illustrates the residual learning building block and the network architecture. Instead of hoping that every few stacked layers directly fit a desired underlying mapping, residual blocks explicitly let these layers fit a residual mapping. Formally, as shown in Figure 2(a), denoting the desired underlying mapping as H(x), the stacked nonlinear layers fit another mapping F(x) := H(x) − x, and the original mapping is recast into F(x) + x. It has been shown that the residual mapping is easier to optimize than the original mapping [36]. In this paper, Resnet18 is used as the backbone network.
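For illustration, a basic residual block in the spirit of Figure 2(a) can be sketched as follows; this is a simplified version of the blocks in [36], assuming PyTorch:

```python
# A minimal residual block: the stacked layers fit F(x),
# and the block outputs F(x) + x.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # F(x) + x
```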

Experiment
In this section, we first introduce the experimental settings and the sketch face dataset (e-PRIP). Then we compare the proposed discriminative center loss with different loss functions to verify its effectiveness. Finally, we compare our method with other state-of-the-art methods on the e-PRIP dataset.

Details of the experiment
Images used in this section are first detected and aligned using MTCNN [22], then resized to 256 × 256 and converted to gray-scale. The Adam optimizer [37] is used for supervised training. The model is trained with a batch size of 64, and the initial learning rate is set to 0.001.
In the testing phase, the outputs of the last layer of the model are extracted as the deeply learned features. When comparing two deep features, the Euclidean distance is utilized as the similarity measurement.
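A minimal sketch of this matching step, assuming probe and gallery feature matrices have already been extracted:

```python
# Rank the gallery for each probe by Euclidean distance (sketch).
import torch

def rank_gallery(probe_feats: torch.Tensor, gallery_feats: torch.Tensor):
    # Pairwise Euclidean distances between probes and gallery images;
    # a smaller distance means a better match.
    dists = torch.cdist(probe_feats, gallery_feats)   # (p, g)
    return dists.argsort(dim=1)                       # ranked gallery indices
```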

Dataset
The dataset used in this section is the extended-PRIP (e-PRIP) dataset. The original e-PRIP dataset contains 4 different composite sketch sets; the photos come from the AR face dataset [38]. However, only the sketches drawn by an Asian artist using the Identi-Kit tool are available for this experiment. Hence, the e-PRIP dataset used in this experiment has 123 pairs, where each pair contains one sketch and one photo.
Since the e-PRIP dataset is small, there are few images for training. We therefore generate a synthetic sketch dataset called PRIP-CUFSF to train the model. To reduce the domain gap between the training set and the testing set, the PRIP-CUFSF dataset is synthesized by a CycleGAN [39] trained on the CUFSF photos and the e-PRIP composite sketches. After pre-processing, the 1187 sketch-photo pairs of the PRIP-CUFSF dataset are used in this experiment. Figure 3 shows examples from the PRIP-CUFSF dataset. To reduce the over-fitting of Resnet18 on small training datasets, we employ multiple augmentation techniques on the sketch face dataset in the training phase. The augmentation techniques are explained as follows (a code sketch is given after this list):
1 Deformation: Transform sketches and photos to compensate for the difference in shape between the sketch image and its corresponding photo. The deformation is performed by translation and rotation of random amplitude and direction.
2 Scale and crop: Upscale the sketches and photos to several random sizes, and then cut a 256 × 256 crop from the center of the scaled image.
3 Flipping: The images are randomly flipped horizontally.
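A hedged torchvision sketch of the three augmentations; the rotation/translation amplitudes and the candidate sizes are illustrative choices, not the paper's exact values:

```python
# Illustrative training-time augmentations; amplitudes are assumptions.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    # 1) Deformation: random translation and rotation of small amplitude.
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05)),
    # 2) Scale and crop: upscale to one of several sizes, then center-crop 256x256.
    transforms.RandomChoice([transforms.Resize(s) for s in (264, 280, 296)]),
    transforms.CenterCrop(256),
    # 3) Flipping: random horizontal flip.
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```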

Experimental evaluation
In this section, several experimental setups are performed. In the first experimental setup (S1 setup), the e-PRIP dataset is divided into two parts: the training set contains 48 identities and the test set contains 75 subjects. During testing, the photos form the gallery set and the sketches form the probe set. In the second experimental setup (S2 setup), the PRIP-CUFSF dataset is used as the training set, and the test set is the same as in the S1 setup. The Cumulative Match Characteristic (CMC) curve is drawn to evaluate the performance. Table 1 details the experimental setups.

To verify the effectiveness of the proposed discriminative center loss, we compare it with the softmax loss and the center loss. Note that each method uses the same network architecture (Resnet18), parameter settings, and training set to make the comparison fair. Figure 4 shows the results of the softmax loss, the center loss and the discriminative center loss on the S1 setup. The results show that our method outperforms the other two loss functions: the matching rate at rank 10 is nearly 3% higher than that of the other two losses, which demonstrates the effectiveness of our loss function. Table 2 and Table 3 show the results of the three loss functions on the S1 and S2 setups, respectively. As can be seen from Table 2 and Table 3, our proposed method achieves the best results, with the center loss second only to ours. Figure 5 visualizes the effect of the proposed loss on the top five ranks on the S2 setup: the first two rows are the results of the discriminative center loss, the next two rows show the top ranks for the same sketch probe using the center loss, and the last two rows show the top ranks using the softmax loss. The discriminative center loss removes many of the false matches from the ranked list, and the correct subject moves to a higher rank.

Finally, we compare our method with the state-of-the-art methods SSD [24], Auto+DBN+SVM, Transfer [13], and Attribute [41] on the S1 setup. SSD and Attribute are traditional methods, while Auto+DBN+SVM and Transfer are deep learning methods. Table 4 reports the matching rate at rank 10 on the S1 setup. It shows that our method outperforms the other methods at rank 10, which demonstrates the robustness of our method.

Discussion
In this paper, we proposed a discriminative center loss for face photo-sketch recognition. Our method aims to guide a deep CNN to learn highly discriminative sketch face features. Specifically, an intra-class and an inter-class covariance matrix are estimated to extract discriminative features, and a regularization technique is adopted to alleviate the small sample problem. We conducted several experiments on the e-PRIP dataset. The comparison among the softmax loss, the center loss and our loss shows the effectiveness of the discriminative center loss. The comparison with the SSD, Auto+DBN+SVM, Transfer, Attribute and SGR-DA methods shows that our method outperforms other state-of-the-art face photo-sketch recognition methods.

Abbreviations
Not applicable