Staged Encoder Training for Cross-Camera Person Re-Identification

As a cross-camera retrieval problem, person re-identification (ReID) suffers from image style variations caused by differing camera parameters, lighting, and other factors, which seriously degrade model recognition accuracy. To address this problem, this paper proposes a two-stage contrastive learning method that gradually reduces the impact of camera variations. In the first stage, we train an encoder for each camera using only images from that camera. This ensures that each encoder achieves good recognition performance on images from its own camera while remaining unaffected by camera variations. In the second stage, we encode the same image with all trained encoders to generate a new combination coding that is robust to camera variations. We also use the Cross-Camera Encouragement distance [12], which complements the advantages of the combination coding, to further mitigate the impact of camera variations. Our method achieves high accuracy on several commonly used person ReID datasets, e.g., 90.8% rank-1 accuracy and 85.2% mAP on Market-1501, outperforming recent unsupervised works by more than 12%. Code is available at https://github.com/yjwyuanwu/SET.


Introduction
Given a query image, person re-identification (ReID) aims to match the person across multiple non-overlapping cameras [12,20]. In ReID scenarios, each identity may be recorded by multiple cameras with different parameters and environments; these factors change the appearance of the image, making it challenging to recognize the same identity.

Fig. 1 An illustration of the generation of the combination coding in the inter-contrast learning stage, using the encoders trained in the intra-contrast learning stage.
In previous studies, researchers have addressed the above challenges with supervised methods, mainly focusing on finding appropriate mapping functions based on the data distribution of images captured by different cameras [14,9]. However, such approaches require annotated training samples to learn the camera transfer model and are only applicable to small datasets. In recent years, researchers have focused on unsupervised domain adaptation (UDA) methods [3,18,28,10,5,23] and purely unsupervised methods [11,17,19,27,1] to address this issue. UDA is complex to train and requires that the difference between the source and target domains is not significant. In this paper, we focus on the fully unsupervised approach, which uses only unlabeled data in the target domain and trains with generated pseudo-labels.
In research on fully unsupervised methods, it is common to use data augmentation to make the model robust to camera variations [27]. Alternatively, in the training step, samples are clustered and pseudo-labelled, and a model is then designed to extract features that are robust to camera variations [11,17,19,1]. Unlike previous methods, this paper focuses on the pseudo-label prediction step in the fully unsupervised setting. Most pseudo-label prediction algorithms follow a similar process: feature extraction, similarity computation, and assigning the same label to similar samples for training. The feature similarity calculation is a crucial step in this process. However, camera variations increase the distance between samples of the same identity, which significantly affects the reliability of the similarity results.
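The generic pseudo-label pipeline just described (feature extraction, pairwise similarity, shared labels for similar samples) can be sketched as a toy snippet. This is not the paper's algorithm: it replaces proper clustering with a naive distance threshold and connected-component merging, purely for illustration.

```python
import numpy as np

def pseudo_labels(feats, thresh=0.5):
    """Toy pseudo-label pipeline: compute pairwise Euclidean distances and
    give samples the same pseudo-label when they are closer than `thresh`
    (naive connected components; real methods use proper clustering)."""
    n = len(feats)
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    labels = list(range(n))                      # each sample starts alone
    for i in range(n):
        for j in range(n):
            if dists[i, j] < thresh and labels[i] != labels[j]:
                old = labels[j]                  # merge j's group into i's
                labels = [labels[i] if l == old else l for l in labels]
    return labels

# Two nearby samples share a label; the distant one keeps its own.
feats = np.array([[0.0], [0.1], [5.0]])
print(pseudo_labels(feats))  # [0, 0, 2]
```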
In this paper, we address the above issues by investigating a more reasonable distance computation for generating pseudo-labels. Since it is easier to identify pedestrians with the same identity within the same camera than across different cameras, as shown in Fig. 2, we decompose the distance calculation between sample encodings into two stages, gradually searching for reliable pseudo-labels. The stages are trained alternately to jointly optimize the backbone network. In the first stage, i.e., the intra-contrast learning stage, multiple branches are trained together, with branch k using samples from camera k for training. Since the samples of each branch come from a single camera and are not affected by camera variations, the similarity computation in this stage is performed directly on the encodings obtained from the backbone network and the encoder. The contrastive learning method used for training is discussed in detail in Sec. 3.2.
In the second stage, i.e., the inter-contrast learning stage, we use all samples in the training set to jointly train an additional encoder. Since the samples in the training set come from different cameras, we must take camera variations into account during this stage. Inspired by studies such as [19,4], which show that the classification probability is more robust to the domain gap than raw features, we treat the feature obtained from the backbone as the "raw feature". As shown in Fig. 1, the encoders trained in the first stage for each camera are used to obtain the combined encoding of the samples as the "classification". Furthermore, to avoid misidentifying samples from different identities as the same identity when their combined encodings are close, we further explicitly reduce the sample distance between different cameras using Cross-Camera Encouragement [12]. The distance between sample encodings in the second stage is composed of the original encoding distance (d_1), the combination coding distance (d_2), and the Cross-Camera Encouragement distance (d_3). We also employ contrastive learning for training in this stage. d_2 and d_3 are introduced in Sec. 3.5.
The proposed method decomposes the distance calculation between sample encodings into two stages and gradually finds reliable pseudo-labels. This is more reliable than directly predicting pseudo-labels across cameras and effectively alleviates the impact of camera variations.

Related work
The proposed method is inspired by domain adaptation methods and effectively mitigates the impact of camera variations in a fully unsupervised setting. Work on these two topics is introduced in the following two subsections.

Domain adaptation
Domain adaptation methods can be summarized into three categories: GAN-based style transfer, finding features that are robust to camera variations, and mutual training. Zhong et al. [26] proposed a triplet training-sample construction method using style transfer and non-overlapping person ReID datasets. Wei et al. [18] introduced a GAN-based approach that transfers task images to match the style of the target-domain dataset while preserving the label information from the source domain. For research on finding robust features, Zheng et al. [25] proposed a method to separate features into appearance and structural features, and Zou et al. [28] explored domain adaptation using appearance features as domain-invariant features. There are also studies [19,4] showing that the classification probability is more robust to the domain gap than raw features, and our work is inspired by this result. Other methods, such as MMT [5] and NRMT [23], focus on reducing the impact of low-quality pseudo-labels through mutual training [22] to improve recognition accuracy.

Fully unsupervised person ReID
Fully unsupervised methods for mitigating camera variations mainly focus on three aspects: data augmentation, extracting features that are robust to camera variations, and generating reliable pseudo-labels. Zhong et al. [27] proposed improving model accuracy through data augmentation and a label smoothing regularization (LSR) loss. Chen et al. [1] extracted features from the statistical information of different camera images and performed feature fusion to generate cross-camera-invariant features. For research on generating reliable pseudo-labels, Lin et al. [11] considered each image an individual sample and gradually grouped them based on sample similarity. Wang et al. [17] formulated ReID as a multi-classification problem and employed an optimized similarity computation to enhance the accuracy of pseudo-label prediction. The work most similar to ours is [19], which produces feature vectors that withstand camera differences by utilizing the classification outcomes of various camera classifiers. In contrast, we use image encodings to produce camera-robust combination codings directly. Additionally, we use d_3 (the Cross-Camera Encouragement distance) to compensate for the shortcomings of d_2 (the combination coding distance), and we improve the model's optimization by using a memory dictionary rather than a classifier, better reducing the intra-class distance of the samples while expanding the inter-class distance. Our method proves more effective on multiple datasets.
Proposed Method

Formulation
Given an unlabelled dataset χ, we consider it to consist of multiple sub-datasets, denoted χ = {χ^c}, c = 1 : C, where the superscript c indicates that all images in the sub-dataset come from camera c and C is the total number of cameras. Our task is to train a model on χ such that, for each query image q, the ReID model generates a feature encoding that retrieves the pedestrian images with the same identity from the gallery set G. In other words, the feature encoding of q should have a smaller distance to the encoding of a gallery image g with the same identity than to the encodings of the other images in G. The task can be defined as

dist(r_q, r_g) < dist(r_q, r_g'), for every g' ∈ G whose identity differs from that of q, (1)

where r denotes the image encoding extracted by the ReID model and dist(•) is the distance metric. This paper aims to generate more accurate pseudo-labels by reducing the effect of camera variations on the sample distance calculation during training, so as to guide model training and enable the model to extract encodings that satisfy Eq. 1. The model comprises two stages. As shown in Fig. 2, the first stage, the intra-contrast learning stage, uses multiple branches for joint training; each branch uses only one sub-dataset, and the loss of branch c is the sum of the contrastive losses of all samples from camera c (Eq. 2), where f represents the feature of image I extracted by the backbone, m is the corresponding pseudo-label, and H represents the set of pseudo-labels generated by clustering.
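The retrieval criterion in Eq. 1 can be illustrated with a minimal sketch: rank gallery encodings by their distance to the query encoding. The function name and toy data are our own, and Euclidean distance is assumed for dist(•).

```python
import numpy as np

def retrieve(query_code, gallery_codes):
    """Rank gallery images by Euclidean distance to the query encoding.

    `query_code` is the encoding r_q of the query; `gallery_codes` is an
    (N, D) array of gallery encodings. Returns gallery indices sorted
    from most to least similar, so index 0 is the best match.
    """
    dists = np.linalg.norm(gallery_codes - query_code, axis=1)
    return np.argsort(dists)

# Toy example: gallery image 2 is closest to the query.
q = np.array([1.0, 0.0])
g = np.array([[0.0, 1.0], [3.0, 3.0], [0.9, 0.1]])
print(retrieve(q, g)[0])  # 2
```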
In the second stage, inter-contrast learning, we share the backbone parameters from the first stage and train an additional encoder. This stage uses the whole training set, including images from different cameras. To minimize the impact of camera variations, we propose a combination coding that is more robust to camera variations. As shown in Fig. 1, we use all encoders trained in the first stage to encode each image separately, and then use these encodings to generate the combination coding R. The combination coding R_i for image x_i is formed from the codings r_i^1, ..., r_i^C, where r_i^k is the coding of image x_i obtained by the encoder corresponding to the k-th camera. We use d_2 (the combination coding distance) and d_3 (the Cross-Camera Encouragement distance [12]) to reduce the impact of camera variations: the distance between any two images x_i and x_j in the inter-contrast learning stage (Eq. 4) combines d_1(•), the Euclidean distance between image codings, with d_2 and d_3. We use the clustering result H to calculate the loss of the inter-contrast learning stage and optimize the extraction of the coding r. In summary, the two stages share the backbone network while having their own encoders with identical structure, and the two stages are trained alternately. d_2(•) and d_3(•) are explained in detail in Sec. 3.5.
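The generation of the combination coding can be sketched as follows. Note that the precise combination rule is not fully specified in this excerpt, so this snippet simply concatenates the per-camera encodings; treat that choice, and the toy encoders, as assumptions for illustration only.

```python
import numpy as np

def combination_coding(image_feat, encoders):
    """Build the combination coding R_i for one image.

    `encoders` is a list of C callables, one per camera, each mapping the
    backbone feature f to a per-camera encoding r^k. We assume here that
    the codings are combined by concatenation.
    """
    codes = [enc(image_feat) for enc in encoders]  # r_i^1 ... r_i^C
    return np.concatenate(codes)

# Toy stand-ins for two trained camera encoders.
enc1 = lambda f: f * 1.0
enc2 = lambda f: f * 2.0
f = np.array([1.0, 2.0])
R = combination_coding(f, [enc1, enc2])
print(R)  # [1. 2. 2. 4.]
```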

Contrast learning
Both stages of the model are trained using contrastive learning, comprising an initialization stage and a training stage. In the initialization stage, samples are passed through the backbone network and the encoder to obtain sample encodings. Similarity is then computed to perform clustering, assigning the same pseudo-label to samples belonging to the same cluster. After Real-ranking, described in Sec. 3.3, the average encoding of the samples sharing a pseudo-label is used to initialize the memory dictionary; each cluster centroid stored in the memory dictionary can be represented as

φ_k = (1 / |H_k|) Σ_{r ∈ H_k} r, (6)

where H_k represents the set of sample encodings of the k-th cluster. During training, the cluster centroid encoding φ_k in the memory dictionary is updated with the sample encoding r using Eq. 7:

φ_k ← λφ_k + (1 − λ)r, (7)

where λ ∈ [0, 1) is the momentum update factor. λ controls the consistency between the sample coding r and the corresponding cluster mean; when λ approaches 0, the cluster mean φ_k is closest to the coding r of the latest training sample. The contrastive loss for a sample coded as r with pseudo-label m can be expressed as

L(r, m) = −log ( exp(r · φ_m / τ) / Σ_{k=1}^{K} exp(r · φ_k / τ) ), (8)

where τ is a temperature hyperparameter, {φ_1, φ_2, ..., φ_K} are the cluster centroids stored in the memory dictionary, and K is the number of clusters. The contrastive loss reduces the intra-class distance while increasing the inter-class distance, improving the discriminative ability of the model. The losses of all samples are used to update the backbone and the encoder.
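The memory-dictionary mechanics described above can be sketched compactly. The update and loss below follow the standard momentum/InfoNCE forms consistent with the text (λ near 0 pulls the centroid toward the latest sample); exact details may differ from the paper's equations, and the function names are ours.

```python
import numpy as np

def update_centroid(phi_k, r, lam=0.99):
    """Momentum update of a cluster centroid: phi_k <- lam*phi_k + (1-lam)*r.
    With lam near 0 the centroid tracks the latest sample r."""
    return lam * phi_k + (1.0 - lam) * r

def contrast_loss(r, m, centroids, tau=0.05):
    """InfoNCE-style contrastive loss of a sample coded r with pseudo-label
    m against the K centroids stored in the memory dictionary."""
    sims = centroids @ r / tau          # similarity to every centroid
    sims = sims - sims.max()            # numerical stability
    return -(sims[m] - np.log(np.exp(sims).sum()))

phi = update_centroid(np.zeros(2), np.ones(2), lam=0.0)
print(phi)  # tracks the sample exactly when lam = 0
cents = np.array([[1.0, 0.0], [0.0, 1.0]])
r = np.array([1.0, 0.0])
# Loss is lower when the sample matches its own centroid.
print(contrast_loss(r, 0, cents) < contrast_loss(r, 1, cents))  # True
```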

Real-ranking
We use a top-down hierarchical clustering method, which requires the number of clusters M to be specified at the outset. When the number of samples is small, the resulting number of clusters K may be smaller than the specified quantity. Nevertheless, the cluster labels m are assigned arbitrarily by the clustering algorithm within a range smaller than M, which may produce an over-labelling problem: a label m may exceed the actual number of clusters K. Since the cluster means in the memory dictionary are stored in order of label, when over-labelling occurs the cluster centroid corresponding to a label m that exceeds the actual number of clusters cannot be found in the memory dictionary, making it impossible to compute the loss with Eq. 8. To solve this problem, as shown in Fig. 2, we propose a method called Real-ranking that redistributes the pseudo-labels by ranking them after clustering. The rank of a sample's pseudo-label is then used as its final pseudo-label, guaranteeing that no pseudo-label exceeds the actual number of clusters.
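The Real-ranking step can be sketched directly: map each cluster label to its rank among the labels actually produced, so the final pseudo-labels are consecutive and never exceed the real number of clusters. The function name is ours; the snippet is an illustrative sketch of the idea, not the paper's code.

```python
def real_ranking(labels):
    """Remap arbitrary cluster labels to consecutive ranks 0..K-1.

    The final pseudo-label of a sample is the rank of its original
    cluster label, so no pseudo-label exceeds the actual cluster count K.
    """
    ranks = {lab: rank for rank, lab in enumerate(sorted(set(labels)))}
    return [ranks[lab] for lab in labels]

# Clustering asked for M = 6 clusters but produced K = 3, with a label 5.
print(real_ranking([0, 5, 3, 5, 0]))  # [0, 2, 1, 2, 0]
```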

Intra-contrast learning
As illustrated in Fig. 2, we employ multiple branches for joint training in the intra-contrast learning stage. According to Eq. 8, we derive the contrastive loss for the branch c mentioned in Sec. 3.1 by encoding each sample with E(θ_c, •), the encoder with parameters θ_c. The loss of the intra-contrast learning stage is the sum of the losses of all branches (Eq. 11). Eq. 11 effectively improves the discriminative ability of the encodings extracted by each camera encoder. In addition, the optimization of multiple branches also improves the discriminative ability of the model for images from different cameras.

Inter-contrast learning
In the inter-contrast learning stage, the encoding distance between samples is determined using Eq. 4. Due to camera variations, the encoding distance between different samples of the same identity tends to increase. Therefore, we subtract d_2 from the encoding distance of samples from distinct cameras during the distance calculation. d_2 is computed from the Jaccard distance J(•) of the combination codings; the Jaccard distance between two samples is smaller when their combination codings are more similar. The generalized Jaccard distance of the combination codings is calculated as

J(R_i, R_j) = 1 − Σ_k min(R_i[k], R_j[k]) / Σ_k max(R_i[k], R_j[k]),

where the intersection takes the smaller value of the combination codings at each corresponding location and the union takes the larger value. To prevent samples with different identities but similar combination codings from being mistakenly recognized as the same identity, we use d_3, the Cross-Camera Encouragement distance [12], to further reduce the effect of camera variations.
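The generalized Jaccard distance over combination codings can be sketched as follows, assuming non-negative codings, with elementwise minimum as the intersection and elementwise maximum as the union. The function name is ours.

```python
import numpy as np

def jaccard_distance(Ri, Rj):
    """Generalized Jaccard distance between two combination codings:
    J = 1 - sum(min(Ri, Rj)) / sum(max(Ri, Rj)), for non-negative codes.
    Smaller values mean more similar codings."""
    inter = np.minimum(Ri, Rj).sum()   # elementwise "intersection"
    union = np.maximum(Ri, Rj).sum()   # elementwise "union"
    return 1.0 - inter / union

# Identical codings give distance 0; partial overlap gives a value in (0, 1).
a = np.array([0.5, 0.5, 0.0])
b = np.array([0.0, 0.5, 0.5])
print(jaccard_distance(a, a))  # 0.0
print(jaccard_distance(a, b))  # about 0.667
```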

Experiment

Datasets and Evaluation Protocols
We evaluate our method on three widely used person ReID datasets: Market-1501 [24], PersonX [16], and DukeMTMC-ReID [15]. The details of these three datasets are summarized in Table 1. During training, we only use the images and camera information from the training sets of each dataset, without any other annotation information. Note that the camera ID is obtained automatically at capture time and requires no human labeling. Performance is evaluated by the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP).

Implementation details
To ensure a fair comparison with other methods, we use a ResNet50 [7] pre-trained on ImageNet [2] as the backbone network for feature extraction. We remove all sub-module layers after layer 5 and add a batch normalization layer [8]; the combination of these two layers serves as the encoder and produces a 2048-dimensional coding. During testing and clustering, we calculate the similarity between samples using the encodings obtained after passing through the backbone and the encoder.
During training, the input images are resized to 256 × 128. In each round, we perform intra-contrast learning and inter-contrast learning in sequence, for 50 rounds in total. We use the Adam optimizer to train both stages of the ReID model with a weight decay of 0.0005. The initial learning rate is lr = 0.00035 and decays to 1/10 of its previous value every 20 rounds. The momentum update factor is λ = 0.99. Every mini-batch contains 256 images of 16 pseudo person identities (16 images per identity). In every round of training, we train the model for two epochs in each stage. We use the standard hierarchical clustering method [13]; as in [19], we set the number of clusters for each camera to 600 in the intra-contrast learning stage and to 800 in the inter-contrast learning stage.
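The step-decay schedule described above can be written as a one-line rule; the function name is ours, and the snippet is just a sketch of the stated hyperparameters.

```python
def learning_rate(round_idx, base_lr=0.00035, decay_every=20, factor=0.1):
    """Step decay: start at 3.5e-4 and divide by 10 every 20 rounds."""
    return base_lr * factor ** (round_idx // decay_every)

for r in (0, 19, 20, 40):
    print(r, learning_rate(r))
```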

Comparison with State-of-the-arts
We compare against recent fully unsupervised methods and domain adaptation methods on Market-1501 [24], PersonX [16], and DukeMTMC-ReID [15]. The results are summarized in Table 2 and Table 3. First, we compare with domain adaptation methods, including methods that perform style transfer via GANs (e.g., SPGAN [3]), methods that reduce the effect of the domain gap by disentangling features (e.g., DGNet++ [28]), and methods that reduce the effect of low-quality pseudo-labels via mutual training (e.g., NRMT [23]).
These domain adaptation techniques depend on manually annotated labels from the source domain, whereas our method achieves better results without such reliance. We also compare with fully unsupervised methods (e.g., BUC [11]); our approach outperforms most of them on various metrics, owing to the more reliable calculation of the sample encoding distances used in the clustering process.

Ablation Studies
The impact of individual components. In this section we evaluate the effectiveness of the intra-contrast and inter-contrast learning stages of our method. The experimental results are summarized in Table 5. As shown in the table, relying solely on inter-contrast learning leads to poor performance, indicating that the distance calculations between samples from different cameras are unreliable. On the other hand, when only intra-contrast learning is used, the rank-1 accuracy on the Market-1501 and PersonX datasets reaches 86.9% and 93.0%, respectively. This shows that the distance calculation over the sample codings is more accurate when it is not influenced by camera variations. However, without considering the distribution gap between cameras, adding the inter-contrast learning stage decreases performance on PersonX. This shows that although the sample codings produced by the model improve after the intra-contrast learning stage, the calculation of distances between samples from different cameras remains unreliable.

Influence of hyper-parameters. In this section, we investigate the effect of two important hyperparameters, µ and λ_c, as shown in Fig. 3. The parameter µ regulates the importance of d_2. Increasing µ from 0 to 0.02 increases both mAP and rank-1, while raising µ further decreases mAP and rank-1 to varying extents; we therefore set µ = 0.02. The parameter λ_c explicitly decreases the encoding distance between samples from different cameras. When λ_c increases to 0.04, both mAP and rank-1 reach their optimal values, and further increasing λ_c has a negative effect.

Conclusion
This paper introduces a two-stage contrastive learning approach for unsupervised person ReID that mitigates the impact of camera variations by improving the encoding distance calculation across cameras. First, in the intra-contrast learning stage, multiple branches are used to train an individual encoder for each camera. Subsequently, in the inter-contrast learning stage, the encodings of all encoders are combined to generate a combination coding that is more robust to camera variations. The sample encoding distance is calculated by considering d_1 (the original distance), d_2 (the complementary combination coding distance), and d_3 (the Cross-Camera Encouragement distance). Extensive experiments demonstrate the effectiveness of our proposed method on unsupervised person ReID tasks.

Fig. 2
Fig. 2 Overall flowchart. The whole training process is divided into two parts, intra-contrast learning and inter-contrast learning, which share the backbone structure and are represented by the upper and lower parts of the box, respectively. Both parts undergo two stages of training sequentially: (1) Initialization stage (indicated by the red line): the clustering results of the image encodings are used for dictionary feature initialization and pseudo-label initialization of the samples. (2) Training stage (indicated by the thin arrow lines): the thin solid arrow lines update features in the dictionary, while the thin dashed arrow lines calculate the loss of the current stage and update the backbone and encoder. During testing, we encode the images using the backbone and encoder from the inter-contrast learning stage and compute the Euclidean distance between the encodings to obtain the final query results.

Funding
This work was supported by Guangxi Natural Science Foundation (No. 2020GXNSFAA297186), Jiangsu Province Agricultural Science and Technology Innovation and Promotion Special Project (No. NJ2021-21), Guilin Key Research and Development Program (No. 20210206-1), Guangxi Key Laboratory of Precision Navigation Technology and Application (No. DH202227), and Guangxi Key Laboratory of Image and Graphic Intelligent Processing (No. GIIP2301). There are no financial conflicts of interest to disclose.

Table 1
Statistics of the datasets used in the experimental section

Table 2
Experiments on the Market-1501 and DukeMTMC-ReID datasets: comparison with recent person ReID methods, including domain adaptation methods and fully unsupervised methods, where "None" denotes a fully unsupervised method and other values give the source-domain dataset of a domain adaptation method. Bold font marks the optimal value of each metric

Table 3
Experiments on the PersonX dataset, where "None" denotes a fully unsupervised method and other values give the source-domain dataset of a domain adaptation method. Bold font marks the optimal value of each metric

Table 4
Investigation of the effect of using different parts of Eq. 4 in stage 2

Fig. 3 Parameter analysis on Market-1501

When we use Eq. 4 to calculate the sample coding distance in the inter-contrast learning stage, accuracy improves significantly, demonstrating that our proposed distance calculation successfully mitigates the effect of camera variations on sample distance calculations.

The impact of different partial distances. In this section, we investigate the effectiveness of the d_2 and d_3 distances in Eq. 4. The experimental results are summarized in Table 4. Taking the results on Market-1501 as an example, when we use d_1 alone to calculate the sample coding distance, the rank-1 accuracy is only 87.4%. When we additionally use d_2 or d_3 in the distance calculation, the rank-1 accuracy improves to 90.1% and 89.7%, respectively, indicating that both reduce the effect of camera variations on the distance calculation. Furthermore, when we calculate the sample encoding distance using d_1, d_2, and d_3 simultaneously, the rank-1 accuracy further improves to 90.8%. This suggests that d_2 and d_3 improve accuracy individually and that their advantages are complementary.

Table 5
Ablation study on individual components. Stage 1 denotes the intra-contrast learning stage; Stage 2 denotes the inter-contrast learning stage. * denotes that only d_1 in Eq. 4 is used in stage 2