Person re-identification using Convolutional Neural Network and Autoencoder embedded on frameworks based on Siamese and Triplet networks

The person re-identification problem addresses the task of identifying whether a person observed by security cameras in a surveillance environment has appeared in the scene before. This problem is considered challenging, since the images obtained by the cameras are subject to many variations, such as lighting, perspective and occlusions. This work aims to develop two robust approaches for person re-identification based on deep learning techniques, taking these variations into account. The first approach uses a Siamese neural network composed of two identical subnets. This model receives two input images that may or may not be from the same person. The second approach consists of a triplet neural network, with three identical subnets, which receives a reference image of a certain person, a second image of the same person and another image of a different person. Both approaches use the same subnets, composed of a convolutional neural network that extracts general features from each image and an autoencoder model, responsible for handling the strong variations that the input images may undergo. To compare the developed networks, three datasets were used, and accuracy and the CMC curve were applied as evaluation metrics. The experiments showed an improvement in the results when the autoencoder is used in the subnets. Moreover, the Triplet Neural Network presented promising results in comparison with the Siamese Neural Network and state-of-the-art methods.


Introduction
An intelligent security system based on video cameras serves not only for environment surveillance but also for the detection of events and people and for scene interpretation. One advantage of such systems is that they handle large amounts of data and information easily and automatically, instead of relying on a human operator to monitor the video, thus avoiding registration errors and operator fatigue. There are many problems in intelligent security systems, such as detection of anomalous events, people or objects, tracking and person re-identification, among others [1].
This work addresses the person re-identification problem, which consists of identifying a person that has already appeared in images from the same camera or in images obtained from different cameras. This problem must consider several characteristics and appearance changes of the analyzed person that may occur as he/she moves among the cameras, caused by environmental and geometric variations and/or by partial occlusions [2].
In general, the person re-identification problem can be defined as a process intended to establish a correspondence between different images of the same person, so that a person who has been previously seen may be identified again [3].
There are several techniques for person re-identification in the literature. However, this problem does not have a definitive solution. The analyzed images can present numerous variations of lighting, point of view, the person's pose, partial occlusions and low resolution, among other problems. These issues make person re-identification complex, since the same person can look different from one image to another, while different people can look very similar [4].
The main purpose of this work is to develop two approaches for person re-identification in digital images using deep learning techniques. The first approach is a Siamese Neural Network and the second a Triplet Neural Network, both built on a subnet composed of a convolutional neural network (CNN) and an autoencoder (AE) model, which performs the re-identification of people who were previously detected. The Siamese Neural Network and the Triplet Network are composed of two and three identical subnets, respectively.
The proposed methods take two input images and compare them to verify whether they are from the same person. This is possible because the subnets generate a feature vector for each image, extracted by the CNN. This feature vector is reconstructed by the AE, which minimizes the noise that could compromise the comparison with the other image, while maintaining the features most relevant for re-identification. Details of how each implemented network is trained are described in Section 3. The methods are validated using accuracy and the CMC curve. Experiments were also carried out to compare our neural networks against state-of-the-art methods available in the literature.

Related Work
This section presents other works that deal with the person re-identification problem in digital images, using feature descriptors and deep learning.
Gray and Tao developed a method called ELF for the person re-identification problem. The technique is based on the combination of histograms of the RGB, YCbCr and HSV color channels with image texture. A model of the most significant features of each image is learned by the AdaBoost algorithm during training. It was observed that the most significant features were hue and saturation, due to the luminosity variations between the images of the two cameras. This method does not solve the problem of automatic person re-identification; however, by means of its similarity function it assists a human operator in the recognition of pedestrians in images from different cameras, reducing the manual search effort by about 82% [5].
Another work that also considers the most representative histogram and texture features of each image was developed by Prosser [6]. In this method, a vector containing the features of each image is obtained and, during training, associated with a set of relevant features and a vector of irrelevant features, through positive and negative correspondences. Each image was divided into 6 rectangles to represent body parts. For the classification, the ranking method RankSVM was used, which assigns scores to the positive matches between the vectors. Better results were obtained than with the method proposed by Gray and Tao [5]. This method can become computationally very costly, since it can generate a large number of negative samples.
In the method proposed by Zhao [7], features are extracted from salient regions, described as regions that make one person different from another and that should remain present in the images even under variations in the point of view. This information is extracted by a LAB color histogram and a SIFT descriptor for each 10x10 pixel patch of the image. The correspondence between these patches is determined on the basis of the Euclidean distance, using the k-nearest neighbor algorithm (KNN). The method may aid person re-identification under variations in point of view; however, in some cases the salient regions can be treated as outliers by the algorithm.
Wu [8] developed a method for extracting image features, combining a convolutional neural network with a modification of the method proposed by Gray and Tao [5]. The method consists of a convolutional neural network formed by two parts, the first containing convolution and pooling layers to extract features from an image. The second part uses a modification of the ELF [5] method that divides the input image into 16 bands, so that feature extraction is performed on each band, producing a 16-dimensional histogram for each color channel; these histograms are concatenated at the end to form a single vector.
Li [9] proposed a method called DeepReID: Deep Filter Pairing Neural Network, which consists of a network that automatically learns features for identifying people. The network takes into account possible transformations that may occur in the images, as well as background interference. A dataset with 13,164 images of 1,360 different pedestrians obtained by 6 cameras, until then the largest dataset for person re-identification in the literature, was also created for training the method.
Ahmed [2] used a Siamese neural network to learn similarity metrics between two images. The network is composed of two subnets containing two convolution layers used to obtain the feature maps of the images. To verify the matching between feature maps, the network calculates the neighborhood differences between them. This operation results in 25 neighborhood difference maps, which then pass through a CNN to learn spatial relationships between the maps.
A Siamese neural network was used by Yi [10] to find the similarity between two images. The proposed network has two convolutional neural networks, formed by convolution layers, followed by a normalization layer, max-pooling and a fully connected layer. The cosine function was used to compute the similarity of the images. Yi's method obtained better results than the methods developed in [5] and [7]. However, the subnets do not share the same parameters, making the comparison between the feature vectors imprecise.
A Siamese neural network composed of a convolutional neural network and a recurrent neural network for video-based person re-identification was proposed in [11]. The convolutional neural network produces a feature map of each frame and sends it as input to the recurrent neural network. The information from each frame is maintained and processed by the recurrent neural network as new features are obtained. A temporal pooling layer is used to combine all these features. The similarity function used in this method is the Euclidean distance.
Wang [12] proposed a method for person re-identification using unsupervised learning. This is achieved by learning knowledge about an individual through attributes from labeled source data, and then transferring that knowledge to unlabeled target data by jointly learning an identity transfer between domains. For this, each image of the individual goes through two feature extractors: the first aims to extract information sensitive to re-identification, and the second aims to extract the semantic knowledge of the attribute labels. After that, a knowledge fusion channel is used to integrate the information obtained by the two channels. In this work, Wang used an autoencoder in the fusion channel, justified by its great capacity to capture the most important information from the input and by producing a more concise representation of the features, facilitating the transfer of information between tasks.

Method
This section presents the proposed approaches to person re-identification in digital images obtained from different points of view and cameras. As mentioned in Section 1, this paper introduces two methods for person re-identification, using a Siamese Neural Network and a Triplet Network. The first method is formed by two identical subnets that are joined by a Contrastive Loss function. The second method consists of three identical subnets joined by a Triplet Loss function. In both methods the subnets are the same, formed by a convolutional neural network and an autoencoder. Figure 1 shows the subnet model proposed in this work. Each subnet receives an input image, which goes through a feature extraction process; the similarity between images is subsequently verified based on the features found. Each of these subnet steps is described in Section 3.1 and Section 3.2.

Convolutional Neural Network - CNN
The first stage of the subnet consists of a CNN with 4 convolution layers, responsible for producing a feature map of the input image.
In general, a Convolutional Neural Network (CNN) uses convolution operations in at least one of its layers to learn patterns from a given set of data. For images as input, a convolution layer is formed by a set of filters and produces as output a feature map of the image. A convolution operation can be denoted as in Equation 1,

s(i, j) = (x * w)(i, j) = Σ_m Σ_n x(m, n) w(i − m, j − n),    (1)

where the first argument, represented by x, is the input and the kernel w represents a filter to be applied to the image [13,14].
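As a concrete illustration, the operation in Equation 1 can be sketched in NumPy. As in most deep learning libraries, it is implemented here as cross-correlation, i.e. without flipping the kernel; function and variable names are illustrative only:

```python
import numpy as np

def conv2d(x, w):
    """Slide the filter w over the input x and sum the elementwise
    products at each position ('valid' padding, stride 1)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out
```

A 4-layer CNN such as the one in this subnet stacks this operation (with learned filters) several times, interleaved with nonlinearities and pooling.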
In addition, the proposed CNN uses max-pooling and dropout layers. The pooling layer is used to reduce the size of the feature map produced by the convolution layer, making the number of parameters in the network smaller and decreasing the computational cost, since the width and height of the feature map also decrease. In the proposed subnet, the max-pooling operation takes each 2x2 region of the feature map and outputs the highest value of that region [13,14].
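The 2x2 max-pooling step described above can be sketched as follows (illustrative NumPy, assuming even spatial dimensions):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Replace each non-overlapping 2x2 region of the feature map
    by its maximum value, halving width and height."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```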
Dropout is a technique developed to avoid overfitting during the training of a neural network. The method consists of randomly discarding activations, preventing the network from over-adapting by breaking patterns that are not significant for the learning process. For example, if a layer of a neural network returns, for a given training sample, the vector (0.2, 0.5, 1.3, 0.8, 1.1), after applying dropout the new vector could be (0, 0.5, 1.3, 0, 1.1) [15,16].
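The dropout example above can be reproduced with a binary mask. This is only a sketch: the standard "inverted dropout" used in practice would additionally rescale the kept values by 1/(1 − rate), which is omitted here to match the text's example:

```python
import numpy as np

v = np.array([0.2, 0.5, 1.3, 0.8, 1.1])
mask = np.array([0, 1, 1, 0, 1])  # 0 = dropped unit; drawn at random during training
dropped = v * mask                # yields (0, 0.5, 1.3, 0, 1.1), as in the text

def dropout(v, rate, rng):
    """Randomly zero each entry of v with probability `rate`."""
    return v * (rng.random(v.shape) >= rate)
```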

Autoencoder - AE
An autoencoder (AE) neural network uses unsupervised learning to learn patterns from a set of data. The architecture of an AE consists of at least 3 layers: an input, an output and a hidden layer, where the input and output layers have the same size.
The purpose of an AE is to reproduce the input data at its output. This is done by means of an encoder function, which extracts features from the input and produces a compressed feature vector. The decompression of this vector is performed by a decoder function that reconstructs the data as closely as possible to the input, minimizing the reconstruction error. In this way, the network learns to prioritize the most important properties of the input data [13].
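In a minimal form, the encoder/decoder pair can be sketched as follows (dimensions, activations and weights are illustrative, not those of the trained model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dimensions: compress an 8-d feature vector into a 3-d code.
W_enc, b_enc = 0.1 * rng.standard_normal((3, 8)), np.zeros(3)
W_dec, b_dec = 0.1 * rng.standard_normal((8, 3)), np.zeros(8)

def encoder(x):
    return np.tanh(W_enc @ x + b_enc)   # compressed feature vector

def decoder(h):
    return W_dec @ h + b_dec            # reconstruction of the input

x = rng.standard_normal(8)
x_hat = decoder(encoder(x))
reconstruction_error = np.mean((x - x_hat) ** 2)  # quantity minimized during training
```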
As in Wang [12], the autoencoder is used here to learn the most important information of a given input representation, generating a feature vector with more concise data. This work takes into account that each image presents several types of variation, such as lighting and perspective, since the images were not obtained by the same camera and the pedestrians move.
After being reconstructed, the vector goes through a normalization layer, which aims to make the samples more similar to each other, helping the network to learn and to generalize better to new data. This is accomplished by rescaling the data values to a common scale, without distorting the differences between value ranges and without losing information.
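The exact normalization scheme is not detailed above; one common choice consistent with this description is an L2 normalization of the reconstructed feature vector, which puts all vectors on a common scale while preserving their direction (a sketch under that assumption):

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Rescale v to unit Euclidean norm (eps avoids division by zero)."""
    return v / (np.linalg.norm(v) + eps)
```

After this step, the Euclidean or cosine distance between two feature vectors depends only on their relative orientation, not their magnitude.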
Each feature vector produced by one subnet is compared with the vector produced by another subnet of the neural network. To associate them, that is, to check whether these vectors correspond to images of the same person, two architectures are used in this work: the Siamese Neural Network and the Triplet Neural Network, described in the next sections.

Siamese Neural Network
A Siamese network consists of two identical subnets that are joined at their outputs. Each subnet receives a different input that is mapped to a feature descriptor. The two descriptors obtained are compared to estimate the similarity between them, resulting in the output of the network. The subnets need to share the same parameters and weights to make their outputs comparable [17].
Figure 2 Architecture of the proposed network using the Contrastive Loss function: a Siamese neural network formed by two identical subnets, each receiving a different image, used to verify the similarity between the images. Each image passes through the CNN and the AE that form the subnet, resulting in a feature vector per image. To verify the similarity, the Euclidean distance between the vectors is computed and passed to a Contrastive Loss function.
Figure 2 presents the proposed Siamese Neural Network model. The pair of input images must be associated with a binary label Y that identifies whether the images are from the same person: if they are, the network receives the label Y = 0; otherwise, Y = 1. The network input therefore contains two images and a binary value. During training, the network adjusts its weights according to the pairs of input images, using a Contrastive Loss function that seeks to separate, by at least a predefined margin, entries belonging to different people.
During training, the network modifies its parameters so that the distance between the inputs is consistent with the label given to them. For this, a function E_w that measures this difference is used, given by Equation 2, whose value is passed to the Contrastive Loss function L [17]:

E_w(X_1, X_2) = ||G_w(X_1) − G_w(X_2)||,    (2)

where X_1 and X_2 are an input pair of images and G_w is the mapping computed by the network. The Contrastive Loss function (Equation 3) is used to measure the network's ability to find similarities between images:

L(W, Y, X_1, X_2) = (1 − Y) (1/2) E_w(X_1, X_2)^2 + Y (1/2) max(0, m − E_w(X_1, X_2))^2.    (3)

The function learns parameters such that the most similar examples get closer and different ones are pushed apart, where m > 0 is the margin: if (X_1, X_2) is a positive input pair of images and (X'_1, X'_2) is a negative input pair, then E_w(X_1, X_2) + m < E_w(X'_1, X'_2) [18].
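Equations 2 and 3 can be sketched directly in NumPy. Here the vector inputs stand in for the subnet outputs G_w(X), and the margin value is illustrative:

```python
import numpy as np

def contrastive_loss(g1, g2, y, m=1.0):
    """g1, g2: feature vectors G_w(X1), G_w(X2); y = 0 for a pair of
    images of the same person, y = 1 otherwise; m is the margin."""
    e_w = np.linalg.norm(g1 - g2)                                    # Equation 2
    return (1 - y) * 0.5 * e_w**2 + y * 0.5 * max(0.0, m - e_w)**2   # Equation 3
```

For a positive pair the loss grows with the distance between the vectors, while for a negative pair it drops to zero once the vectors are at least m apart.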

Triplet Network
A Triplet Network is inspired by the Siamese Network, but is composed of three subnets with shared parameters, where each subnet is fed with one sample. Thereby, a triplet of samples feeds the network input: an anchor x_i, a positive x_i^+ and a negative x_i^−. The anchor is the reference sample; the positive sample must be from the same class as the anchor, and the negative sample from a different class [19]. In the context of person re-identification, each different person represents a class; therefore, the images representing the anchor sample and the positive sample are different images of the same person, and the negative sample is of another person, as can be observed in Figure 3.
Figure 3 Architecture of the proposed network using the Triplet Loss function. The network receives as input a triplet given by anchor, positive and negative: the anchor is a reference image, the positive is a different image of the same person as the anchor, and the negative is an image of a different person. In this model, each subnet processes one of the input images, and the network uses a Triplet Loss function that attempts to bring the anchor closer to the positive and away from the negative, so that re-identification can be performed.
The purpose of a triplet network is to ensure that an anchor sample of one person is closer to all positive samples of the same person than to any negative sample of anyone else, reducing the distance between the anchor and the positive sample as much as possible (Figure 4). For this, the triplet loss is used, which is based on nearest neighbor classification. The network generates two intermediate distance values from the input samples, encoding the distances of the positive and negative samples relative to the anchor [19,20].
With the embedded representation of the network denoted by f(x), the Triplet Loss is given by

L = Σ_{(x_i^a, x_i^p, x_i^n) ∈ τ} max(0, ||f(x_i^a) − f(x_i^p)||^2 − ||f(x_i^a) − f(x_i^n)||^2 + α),

where τ is the set of all possible triplets in the training set and α is the margin enforced between positive and negative pairs [20].
Figure 4 The Triplet Loss minimizes the distance between an anchor and a positive, both of the same class, and maximizes the distance between the anchor and a negative of a different class [20].
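For a single triplet, the loss above reduces to a hinge on the squared distances, which can be sketched as follows (the vectors stand in for the embeddings f(x); the value of α is illustrative):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """f_a, f_p, f_n: embeddings of anchor, positive and negative.
    The loss is zero once the negative is at least alpha farther
    (in squared distance) from the anchor than the positive."""
    d_pos = np.sum((f_a - f_p) ** 2)
    d_neg = np.sum((f_a - f_n) ** 2)
    return max(0.0, d_pos - d_neg + alpha)
```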

Experiments and Results
This section presents the results obtained with the training and testing of the proposed Siamese and Triplet Neural Networks. The networks were implemented in Python, through the neural network library Keras [1], running on TensorFlow [2]. Three public datasets were used for network training and testing: VIPeR [21], i-LIDSVID [22] and CUHK03 [9]. The evaluation metrics used to analyze the performance of the implemented models were accuracy and the Cumulative Match Characteristic (CMC) curve. Accuracy is one of the most used metrics in machine learning to measure the performance of a classifier. It is obtained as the fraction of correct predictions p of a given model over the total number of predictions T_p, that is, A = p / T_p [23]. The CMC curve is considered a classification metric based on the concept of scores. Performance is measured for each sample by determining a set of correspondence candidates, the idea being that these candidates belong to samples with the same identity as the evaluated sample [24]. Bolle [25] described two subsets of samples used to measure a CMC curve. The first is a set of biometric identifiers G = {B_1, B_2, ..., B_m} for m different identities. The second is a set of samples Q = {B_l, l = 1, ..., n} that can be from any individual in the set G, possibly containing more than one sample per identity. Thus, m is the total number of identities and n is the total number of samples per identity.
Considering the total number of samples given by N = n * m, each sample is compared with the remaining (N − 1) samples, resulting in a set S of scores of two types: 1) genuine scores, when both samples belong to the same individual; 2) impostor scores, when the samples belong to different individuals [24,25].
When comparing a set Q with the set of identities G, a set of scores S_l = {s(B_l, B_1), s(B_l, B_2), ..., s(B_l, B_m)} is generated. Thus, for each sample B_l, the scores s are ordered from highest to lowest, generating a list C. The CMC curve represents the probability that a genuine score is among the first k positions of C [24,25].
[1] Available at https://keras.io
[2] Available at https://www.tensorflow.org
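Given a matrix of scores between query samples and gallery identities, the CMC curve described above can be sketched as follows (function names and the toy data in the usage note are illustrative):

```python
import numpy as np

def cmc_curve(scores, gallery_ids, query_ids):
    """scores[q, g]: score s(B_q, B_g) of query q against gallery identity g
    (higher = more similar). Returns, for each rank k, the fraction of
    queries whose true identity appears among the k best-scored candidates."""
    n_queries, n_gallery = scores.shape
    hits_at_rank = np.zeros(n_gallery)
    for q in range(n_queries):
        order = np.argsort(-scores[q])                    # the ordered list C, best first
        rank = int(np.where(gallery_ids[order] == query_ids[q])[0][0])
        hits_at_rank[rank] += 1
    return np.cumsum(hits_at_rank) / n_queries
```

Rank-1 of the resulting curve is the fraction of queries identified correctly at the first attempt; the curve is non-decreasing and reaches 1.0 at the last rank.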

VIPeR dataset
The ViewPoint Invariant Pedestrian Recognition (VIPeR) dataset was developed to simulate real-world surveillance environments, where pedestrians appear in large open areas and can be viewed from any angle. It contains a total of 1264 images of 632 pedestrians; for each pedestrian there are two images captured by different cameras, containing variation in point of view as well as changes in lighting and in the poses of the individuals [21]. Figure 5 shows samples of 10 different pedestrian image pairs.
Figure 5 VIPeR: examples of pairs of images of 10 different pedestrians taken by two cameras [21].
To increase the number of training images and reduce overfitting, the data augmentation technique was applied to the images of this dataset. Eleven transformations were applied to each original image, obtaining a total of 15144 images.
The networks were trained with 100, 200, 400, 600 and 800 epochs. Table 1 shows the accuracy results obtained for each number of epochs on the test set of images, and Figure 6 presents the corresponding graph. To validate the use of the AE in the subnets, experiments were carried out with and without the AE. As can be seen in Table 1 and Figure 6, the tests without the AE had the worst accuracy rates in all cases, compared to the tests with the AE.
The Triplet Network with AE started the experiment with better accuracy than the Siamese Network: 64.27% against 51.33%. As the number of epochs increased, the accuracy of the Siamese Neural Network also increased, but remained consistently lower, ending the experiment at 89.54%. The Triplet Network achieved its best result in this experiment, reaching 96.29% with 800 epochs. Figure 7 shows the CMC curves for the four networks implemented and tested with the VIPeR dataset. It can be seen that the Triplet Network with AE obtained the best results in all cases. Table 2 shows the hit rates for each rank.

i-LIDSVID dataset
The i-LIDS Video Re-Identification (i-LIDSVID) dataset contains images of 300 different pedestrians obtained through two non-overlapping cameras. There are two sets of images for each individual: one of still images and the other of sequential images. The sequential images were obtained by pedestrian tracking, with 23 to 192 frames for each person in the dataset. The images present many variations between the different cameras, such as lighting, point-of-view variations and occlusions [22]. Figure 8 presents an example of sequential images of the same pedestrian captured by different cameras. There will not always be frame-by-frame correspondence, as a person may not appear in both camera views at the same time.
The accuracy on the test set for different numbers of training epochs of the implemented networks can be seen in Table 3 and in the graph in Figure 9. The networks without AE in the subnets obtained lower accuracy rates in all cases, compared to the networks with AE.
Analyzing the networks with AE, the Triplet Network obtained the highest hit percentages, with 97.48% at 800 epochs. The accuracy curves of these networks remained more consistent and kept growing as the number of epochs increased. In this case, the best result of the Siamese Neural Network was 92.48% at 800 epochs. The CMC curves of the experiments performed with this dataset can be seen in Figure 10, with the corresponding data in Table 4.

CUHK03 dataset
The CUHK03 dataset was developed taking into account that training deep neural networks well requires a large number of images. Thus, the CUHK03 dataset was built with 13164 images of 1360 different people, obtained through six surveillance cameras. Examples of images from this dataset can be seen in Figure 11. In addition to the original frames, images of people cropped manually and images obtained with a pedestrian detector are also available [9]. This dataset presents several problems, such as misalignment, occlusions and missing body parts, among others. Moreover, the cameras monitor an open area, which subjects even a single camera view to lighting variations caused by various weather factors. Other transformations may also occur due to the different directions in which pedestrians walk [9]. Table 5 presents the results obtained in the experiments with the four implemented network models, and Figure 12 illustrates a graph of the accuracy results as a function of the number of training epochs. As can be seen from the graph in Figure 12, in these experiments the networks implemented with the AE also obtained better results. In this case, the Triplet Network obtained the best accuracy on this dataset, 97.62% at 800 epochs. In both networks the accuracy increased with the number of epochs. The best result of the Siamese Neural Network was 94.09% at 800 epochs. Figure 13 and Table 6 show the results for the CMC curve. As in the experiments carried out with the other datasets, the Triplet Network with AE obtained the best hit rates in all cases: compared with the Siamese Neural Network with AE for ranks 1 to 15, and with the neural networks without AE.

Discussion of Results
Given the experiments carried out, the use of the AE provided better results for both implemented approaches. From the experiments with the networks that use the AE, it is concluded that the Triplet Neural Network obtained better results in all cases, compared to the Siamese Neural Network, using the same datasets and subnets. The accuracy of each case can be seen in Figure 14. As the Triplet Neural Network receives a positive and a negative pair per input, the network may have had an easier time separating images with different identities and bringing together images with identical identities, compared to the Siamese network, which receives only one positive or one negative image pair per input.
The CMC curves generated for the networks with AE were also compared, and the Triplet Neural Network again obtained better results than the Siamese Neural Network. From the analysis of Figure 15, we can highlight that the CMC curve of the Siamese Neural Network using the i-LIDSVID dataset obtained identification rates very close to those obtained by the other network. This may be due to the fact that this dataset is of the multi-shot type, presenting more images with the same identity. The CMC curves generated by the experiments with the CUHK03 dataset started with lower identification rates than the other datasets for the same rank. One explanation is that this dataset has a much larger number of images and identities than the other two. Throughout the experiments, the CMC curve of the Triplet Neural Network obtained the highest identification rates from rank-12 onwards, and in the last two ranks the network reached 100% correct identification on CUHK03. The total number of candidates in this case was about 30 images for each identity.

Comparison with State of the Art
This section presents a comparison of the CMC curves obtained for each dataset by the proposed methods with AE against some methods available in the state of the art.
The methods used to compare the performance of the networks on the VIPeR dataset were: ELF [5], KISSME [26], SDALF [27], DGD [28], NullSpace [29] and the method developed by Ahmed [2]. As can be seen in Table 7 and Figure 16, our Triplet Neural Network obtained the best performance at rank-2 and, from rank-6 onwards, it was the best among the methods. In the other cases, our network was among the top three methods. The i-LIDSVID dataset was used in experiments to compare the following state-of-the-art methods: KISSME [26], SDALF [27], DGD [28] and Ahmed [2]. In the experiments with i-LIDSVID, the Triplet Neural Network had the best results from rank-2 onwards. At rank-1 this network was in second place, and the DGD method obtained the best result. The Siamese Neural Network was in second place in all cases from rank-11 onwards. The results are shown in Table 8 and Figure 17.
For the CUHK03 dataset, the approaches developed in this work were compared with the following methods: KISSME [26], SDALF [27], DGD [28], NullSpace [29], NormXcorr [30] and Ahmed [2]. The results obtained can be seen in Table 9 and Figure 18. The DGD method obtained a better identification rate in the first ten ranks. From rank-11 onwards, our Triplet Neural Network obtained the best results, reaching a rate of 100% at rank-14 and rank-15. It is worth mentioning that from rank-6 onwards our network was among the two best methods.
Figure 17 Comparison of CMC curves with state-of-the-art methods, using the i-LIDSVID dataset.
Figure 18 Comparison of CMC curves with state-of-the-art methods, using the CUHK03 dataset.

Conclusion and Future Works
This work implemented two neural network models: a Siamese Neural Network and a Triplet Network. Both models contain identical subnets composed of a Convolutional Neural Network (CNN), responsible for the extraction of image features, and an Autoencoder, which encodes the features produced by the CNN and then decodes them, maintaining the information most relevant for identifying the image during network training. Experiments were performed with three different datasets, with and without the AE at the end of the subnets. The results showed that the use of the AE generated a significant gain, both for the Siamese Neural Network and for the Triplet Neural Network. With the AE, the accuracy rates increased in all cases, as did the CMC curves, and it was found that the AE provided a gain of up to 71.05%.
From the experiments carried out using the networks with the AE, it was also found that the Triplet Neural Network obtained better performance in person re-identification than the Siamese Neural Network, in all tested cases.
The Triplet Neural Network also showed potential for person re-identification compared to other state-of-the-art methods, obtaining strong results in the CMC curves for the three considered datasets.
As future work, it is suggested to carry out experiments using other public datasets, including video datasets. Besides, a Quadruplet Network model could be implemented using the same subnet as in this work. Such a network could test whether increasing the number of subnets also increases the potential to re-identify people in images.