Multi-local Feature and Attention Fused Person Re-identification Method

Abstract: Person re-identification (ReID) is widely used in intelligent security, surveillance, criminal investigation, and other fields. To address the problems of local occlusion, scale misalignment, and pose variation in pedestrian images captured in real scenes, we propose a Multi-local Feature and Attention fused network (MFA) for the person re-identification task. First, a Channel Point Affinity Attention (CPAA) module is embedded in the backbone network to enhance its ability to extract local details. Then, the feature map output by the backbone is horizontally segmented into four local feature maps, and four branch networks are attached to the backbone feature map. The four local feature maps guide the four branch networks to attend to different pedestrian regions through the Global Local Aligned (GLA) loss function. Finally, a pedestrian feature vector containing multiple local features is obtained. The mAP of the network on the Market-1501 and DukeMTMC-reID datasets is 88.6% and 81.4%, and the Rank-1 is 95.8% and 90.1%, respectively. In addition, the model obtains Rank-1 of 73.2% and 68.1% on the partial datasets Partial-REID and Partial-iLIDS, respectively. Compared with other ReID methods, the proposed method achieves competitive performance for the ReID task. The code is available at git@github.com:ISCLab-Bistu/MFA.git.


Introduction
Person Re-identification (ReID) [1] is a technology that can be widely used in video surveillance, intelligent security, and other fields by combining it with pedestrian detection, target tracking, and other technologies. It plays an important role in criminal investigation and specific person-search scenarios. The ReID task essentially belongs to the image retrieval problem. In recent years, with the breakthroughs of deep learning in the image field, deep learning-based ReID methods have also received widespread attention.
Unlike pedestrian detection, the ReID task needs to attend to finer-grained pedestrian features. In an open scene, combined pedestrian features such as clothing, accessories, and personal belongings are ever-changing. In addition, due to factors such as viewpoint, pose, occlusion, and lighting, the same pedestrian appears differently across time periods, which makes the ReID task very difficult.
To better address challenging scenarios such as scale variation and occlusion, the contributions of this paper are summarized as follows. We propose a person re-identification method that fuses local features and attention mechanisms (Multi-local Feature and Attention fused network, MFA). We enhance the ability of the backbone network to extract fine-grained features through the proposed Channel Point Affinity Attention (CPAA) module, while guiding the information interaction between global and local features with the proposed Global Local Aligned (GLA) loss function. Tests on the commonly used datasets Market-1501 and DukeMTMC-reID as well as the occlusion datasets Partial-REID and Partial-iLIDS show that the network achieves competitive performance. Visualization analysis with Grad-CAM [27] shows that the network not only focuses on multiple local features but also covers the entire pedestrian silhouette, filtering out background and occluded regions; the network retains these properties even under occlusion and scale variation.

Related work
Currently, the ReID task combines representation learning [2] and metric learning [3]: representation learning enables the network to extract more discriminative features, while metric learning maps features to specific subspaces, increasing the distance between sample classes so that different pedestrians can be distinguished.
In the ReID task, representation-learning methods mainly include approaches based on global features, generative adversarial networks (GANs), pose estimation, masks, image blocks, and attention mechanisms. Among them, networks based on global features, such as OSNet [4], BagTricks [5], and SVDNet [6], only let the network learn partial discriminative features from the dataset and do not consider local features; they perform only moderately in scenes where pedestrians are occluded. GAN-based methods such as Camstyle [7] and PN-GAN [8] improve the model's generalization ability by further expanding the training data. Pose-estimation-based methods such as CPM [9], GLAD [10], PIE [11], and SpindleNet [12] utilize human key-point information to alleviate local misalignment; however, they increase the model's computational complexity and require key-point-annotated data during training. Mask-segmentation-based methods such as SPReID [13] and MaskReID [14] use image segmentation to separate human and background information and generate binary masks, thus suppressing background interference. Image-block-based methods such as AlignedReID [15], SCPNet [16], PCB [17], Pyramid [18], and BFE [19] force the network to focus on features in different regions during learning, enhancing feature robustness; however, if the pedestrian alignment problem is not solved, they introduce background interference. Attention-based methods such as AGW [20], Mancs [21], DuATM [22], and HA-CNN [23] can selectively improve the network's feature extraction ability; however, they utilize only the global features and the most discriminative local features of the human body and ignore other minor features, leading to mediocre performance under occlusion and viewpoint changes. Currently, in supervised ReID tasks, more attention is given to the fusion of global and local information to extract more discriminative pedestrian features.
After discriminative pedestrian features are extracted, recognition performance can be further improved with metric-learning methods that design loss functions to increase inter-class distance and decrease intra-class distance. Commonly used metric-learning losses include the classification loss [24], triplet loss [25], and quadruplet loss [26]. The classification loss sets the number of IDs in the dataset as the number of classes, feeds the feature vector into a fully connected layer, and computes the cross-entropy after Softmax. The triplet loss constrains the distance between positive sample pairs to be smaller than that between negative sample pairs by a set margin during training, thereby clustering samples of the same class. The quadruplet loss adds another set of samples from a different class to the triplet loss and constrains the distance between positive pairs and these additional negative pairs to be below a set margin, thus further reducing intra-class distance and increasing inter-class distance.
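As an illustration of the metric-learning objectives above, here is a minimal triplet-loss sketch in PyTorch (the margin value and the toy embeddings are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss: require d(anchor, positive) + margin <= d(anchor, negative)."""
    d_ap = F.pairwise_distance(anchor, positive)  # positive-pair distance
    d_an = F.pairwise_distance(anchor, negative)  # negative-pair distance
    return F.relu(d_ap - d_an + margin).mean()

# Toy 2-D embeddings: the anchor is near the positive and far from the negative,
# so the margin constraint is satisfied and the loss is zero.
a = torch.tensor([[0.0, 0.0]])
p = torch.tensor([[0.1, 0.0]])
n = torch.tensor([[5.0, 0.0]])
loss = triplet_loss(a, p, n)
```

Swapping the positive and negative violates the margin and yields a positive loss, which is what drives same-class samples together during training.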

Method
The ReID network proposed in this article, which integrates local features and attention mechanisms, is shown in Fig. 1. The ReID task requires deep features with rich semantic information, but increasing the depth of the network can lead to vanishing gradients, and the training error rises accordingly. ResNet, proposed by He et al. [28], solved the degradation problem of deep networks, so ResNet50 is used as the backbone network in this article. To better utilize the associated information in neighboring regions and extract discriminative local semantic features, a Channel Point Affinity Attention (CPAA) module is embedded in the last 3 layers of the network. Meanwhile, to better handle occlusion, scale changes, and similar situations, the output feature map of the backbone is horizontally divided into multiple regions; each horizontal region is turned into a feature vector through global average pooling (GAP), and a corresponding local branch is added. The feature map generated by each local branch passes through a GeM [29] layer to produce a feature vector, which is paired with the region feature vector obtained by horizontal segmentation; the proposed Global Local Aligned (GLA) loss function then guides each local branch to learn the features of a different pedestrian region. To reduce overfitting, better focus on local detail features, and increase feature robustness, this article uses cross-entropy loss with label smoothing as the identity loss (ID loss), and adds the Soft-Triplet loss and Center Loss for joint training. Note that batch normalization (BN) is applied to the features used by the ID loss.

Channel Point Affinity Attention
In ReID networks, attention mechanisms can generally help the network learn more pedestrian features and suppress irrelevant background information. The Channel Point Affinity Attention (CPAA) proposed in this article better enables the network to learn associated information in neighboring regions and thus extract more discriminative local semantic features. In Fig. 2, the input feature map of the network is $A \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the channel number, height, and width of the feature map, respectively. Let $A_i$ and $A_j$ represent the feature maps of the $i$-th and $j$-th channels; the channel correlation matrix is shown in Fig. 2, where the correlation degree $x_{ij}$ between channels $i$ and $j$ is

$$x_{ij} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)} \quad (1)$$

Let $g$ be the hyperparameter that controls the correlation degree between two channels; each channel of the resulting feature map $B$ can then be represented as

$$B_j = g \sum_{i=1}^{C} x_{ij} A_i + A_j \quad (2)$$

Subsequently, the feature map $B$ is passed through three 1×1 convolutions to generate feature maps $P_1$, $P_2$, and $P_3$, respectively. To reduce the computational cost, the channel number of $P_1$ and $P_2$ is reduced to $C_1$ by the 1×1 convolution. Similar to the calculation above, the spatial correlation matrix $S$ is obtained, where $N = H \times W$. Then the transposed matrix $S$ is multiplied with $P_3$ to obtain the feature map $F$ with high semantic correlation at the spatial level. Finally, to alleviate the problem of gradient vanishing, the feature map $G$ is obtained through a residual structure.
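The channel-affinity computation described above can be sketched in PyTorch as follows. This is a minimal illustration assuming a softmax-normalized affinity formulation; the class name `ChannelAffinity` and the use of a learnable scalar for g are our own choices, not the paper's code:

```python
import torch
import torch.nn as nn

class ChannelAffinity(nn.Module):
    """Channel-affinity branch of a CPAA-style block: each output channel is a
    weighted sum of all channels, weighted by their pairwise correlation."""
    def __init__(self, g=1.0):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(g))  # scales the affinity response

    def forward(self, a):                        # a: (B, C, H, W)
        b, c, h, w = a.shape
        flat = a.view(b, c, -1)                  # (B, C, N) with N = H*W
        corr = torch.bmm(flat, flat.transpose(1, 2))  # (B, C, C) channel affinities
        x = torch.softmax(corr, dim=-1)          # normalized correlation x_ij
        out = torch.bmm(x, flat).view(b, c, h, w)
        return self.g * out + a                  # residual connection

att = ChannelAffinity()
y = att(torch.randn(2, 8, 4, 4))
```

The spatial branch of the module follows the same pattern with the roles of the channel and spatial axes swapped, operating on the N×N correlation matrix instead.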

Local Branch Structure
In this paper, the output feature map of the backbone network is horizontally divided into R regions, and the backbone feature map is fed to R parallel local branches through R groups of 1×1 convolution layers.
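The horizontal partitioning step can be sketched as follows (a minimal version in which each stripe is reduced to a region vector by global average pooling, as described for the region side of the pairing):

```python
import torch

def horizontal_regions(fmap, r=4):
    """Split a (B, C, H, W) feature map into r horizontal stripes along the
    height axis and average-pool each stripe into a (B, C) region vector."""
    stripes = torch.chunk(fmap, r, dim=2)        # r stripes of height H/r
    return [s.mean(dim=(2, 3)) for s in stripes]

# A 16x8 backbone feature map split into 4 body regions.
region_vecs = horizontal_regions(torch.randn(2, 256, 16, 8), r=4)
```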
Each local branch is jointly trained with the corresponding horizontally divided feature vector, and the GLA loss function guides each branch network to focus on different local features of the image; the multiple local feature vectors are finally fused. The local branch in Fig. 1 uses the GeM module when generating feature vectors. Compared with global max pooling and global average pooling, the GeM module can adapt through learning, which better retains discriminative fine texture features. The feature vector output by GeM can be represented as

$$f = [f_1, \dots, f_k, \dots, f_K] \quad (3)$$

where $K$ is the number of features and $f_k$ is the $k$-th generated feature, which can be written as

$$f_k = \left( \frac{1}{|X_k|} \sum_{x \in X_k} x^{p_k} \right)^{1/p_k} \quad (4)$$

where $X_k$ is the $k$-th input feature map and $p_k$ is a learnable parameter. The partial derivatives of $f_k$ with respect to $x_i$ and $p_k$ are

$$\frac{\partial f_k}{\partial x_i} = \frac{1}{|X_k|}\, f_k^{\,1-p_k}\, x_i^{\,p_k-1} \quad (5)$$

$$\frac{\partial f_k}{\partial p_k} = \frac{f_k}{p_k}\left( \frac{\sum_{x \in X_k} x^{p_k} \ln x}{\sum_{x \in X_k} x^{p_k}} - \ln f_k \right) \quad (6)$$

Different from global average pooling, which performs only a fixed pooling operation on the feature map, GeM continuously updates $p_k$ toward its best value during back-propagation; when $p_k \to \infty$, GeM is equivalent to global max pooling, and when $p_k = 1$, GeM is equivalent to global average pooling. GeM can thus perform the best pooling operation through adaptive learning.
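A minimal GeM pooling layer consistent with the formulation above (a sketch; the paper does not specify the initial value of p here, so p = 3 is just a common default, and the clamp epsilon is our own safeguard):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-mean pooling with a learnable exponent p.
    p -> infinity approaches global max pooling; p = 1 is global average pooling."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))   # learned by back-propagation
        self.eps = eps                           # keeps the power well-defined

    def forward(self, x):                        # x: (B, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        return F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p).flatten(1)

# With p = 1, GeM reduces to plain global average pooling.
gem = GeM(p=1.0)
x = torch.rand(2, 8, 4, 4) + 0.1                 # strictly positive activations
```

Because the whole expression is differentiable in p, the pooling behavior itself is tuned during training rather than fixed in advance.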
After the feature vector is generated in each local branch, a batch normalization (BN) [5] module is applied to normalize the features. Batch normalization makes the features of different input samples follow a normal distribution, which helps the ID loss converge faster and also facilitates the convergence of the other auxiliary losses.

MFA Network loss function
The overall loss function of the MFA network can be represented as

$$L = L_{ID} + \lambda L_{Tri} + \mu L_{C} + \gamma L_{GLA} \quad (7)$$

where $\lambda$, $\mu$, and $\gamma$ are weight factors that can be learned by the network. The BN module in Fig. 1 only optimizes the features for the classification loss; therefore, during training, only $L_{ID}$ uses the feature vector after the BN module, while the other loss functions use the feature vector before BN. During testing, the concatenated feature vector from the local branches is used.
First, the model uses cross-entropy loss with label smoothing as the basic classification loss:

$$L_{ID} = -\sum_{i=1}^{N} q_i \log p_i, \qquad q_i = \begin{cases} 1-\varepsilon, & i = y \\ \varepsilon/(N-1), & i \neq y \end{cases} \quad (8)$$

where $p_i$ is the predicted probability of class $i$, $y$ is the true class, $N$ is the total number of ID categories, and $\varepsilon$ is a small value (0.1 in this paper). Label smoothing reduces the network's confidence on the correct class while assigning a small probability to the incorrect classes, thereby reducing overfitting on the training set.
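A sketch of the label-smoothed cross-entropy described above (assuming the ε/(N−1) variant of smoothing; implementations also exist that spread ε over all N classes):

```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits, target, eps=0.1):
    """Cross-entropy where the true class keeps probability 1 - eps and the
    remaining eps is spread evenly over the other classes."""
    n = logits.size(1)                               # number of ID classes
    logp = F.log_softmax(logits, dim=1)
    q = torch.full_like(logp, eps / (n - 1))         # smoothed target distribution
    q.scatter_(1, target.unsqueeze(1), 1.0 - eps)    # true class gets 1 - eps
    return -(q * logp).sum(dim=1).mean()

logits = torch.randn(4, 10)
target = torch.tensor([0, 3, 7, 9])
```

With eps = 0 this reduces exactly to the standard cross-entropy loss.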
In addition, to improve the network's learning of finer-grained features, the Soft-Triplet loss [30] is added. Let $d_{a,p}$ and $d_{a,n}$ denote the distances between positive sample pairs and negative sample pairs, respectively; the Soft-Triplet loss can be written as

$$L_{Tri} = \log\left(1 + \exp(d_{a,p} - d_{a,n})\right) \quad (10)$$

The problem with triplet-based losses, including the Soft-Triplet loss, is that they only constrain the difference between $d_{a,p}$ and $d_{a,n}$, not the magnitude of $d_{a,p}$ itself. To address this issue, the Center Loss [31] is introduced, which measures the distance between samples of a class and their intra-class center:

$$L_{C} = \frac{1}{2} \sum_{j=1}^{B} \left\| f_{t_j} - c_{y_j} \right\|_2^2 \quad (11)$$

where $B$ is the mini-batch size, $c_{y_j}$ is the feature center of class $y_j$, $y_j$ is the label of the $j$-th image in the mini-batch, and $f_{t_j}$ is the feature vector of the $j$-th image. Minimizing the Center Loss reduces the intra-class variance. Finally, each local branch generates its own features from the backbone output through a 1×1 convolution layer.
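The Center Loss term can be sketched as follows (here the centres are learnable parameters updated by back-propagation, a common simplification of the running-mean update in the original formulation [31]):

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Squared distance between each feature and its class centre, Eq. (11) style."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.zeros(num_classes, feat_dim))

    def forward(self, feats, labels):            # feats: (B, D), labels: (B,)
        diff = feats - self.centers[labels]      # distance to each sample's centre
        return 0.5 * diff.pow(2).sum(dim=1).mean()

center_loss = CenterLoss(num_classes=751, feat_dim=4)   # Market-1501 has 751 train IDs
feats = torch.ones(2, 4)
labels = torch.tensor([0, 1])
loss = center_loss(feats, labels)
```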
The network cannot automatically focus on different local features of pedestrians. Therefore, the GLA loss function is constructed, and the R horizontally segmented region feature maps are used to guide the local branches to learn different local features. For a local branch feature vector and its corresponding region feature vector, the closer the two are, the closer the features they attend to. The loss function is therefore constructed as

$$L_{GLA} = \sum_{i=1}^{R} \left\| f_{g_i} - f_{l_i} \right\|_2 \quad (12)$$

where $R$ is the number of horizontally divided regions, $f_{g_i}$ is the semantic feature produced by the $i$-th local branch, and $f_{l_i}$ is the $i$-th split-region feature. $f_{g_i}$ is thus supervised by $f_{l_i}$, making the $i$-th local branch tend to focus on the local information of the $i$-th region. At the same time, since the output features of the local branches contain global semantic information, when the pedestrian scale is not aligned or the pedestrian is occluded, i.e., the $i$-th region does not contain the pedestrian features to be focused on but only background interference, the semantic features cannot match the corresponding local features, so the local features of this region are ignored. In this way, the different local branches focus only on pedestrian local features and eliminate the influence of background interference.

Experimental Dataset Introduction
To demonstrate the effectiveness of the proposed method, we conducted experiments on two widely used datasets, Market-1501 and DukeMTMC-reID, and on two occlusion datasets, Partial-REID and Partial-iLIDS, to evaluate the model's generalization under occlusion. Fig. 3 shows examples from each dataset. DukeMTMC-reID and Market-1501 were collected on university campuses and contain a certain amount of occlusion; Partial-REID was captured outdoors, where pedestrians are occluded by trees, buildings, and other objects; Partial-iLIDS was captured by airport cameras, where pedestrians are occluded by luggage, signs, and other pedestrians. Table 1 lists the basic parameters of each dataset, where #identities denotes the number of identities, #imgs the number of images, and #camera the number of cameras. Partial-REID and Partial-iLIDS are relatively small and contain no training set; they are used only for testing.

Experimental Configuration
The experiments were conducted on the Ubuntu 20.04 operating system, with a Core i9-11900 CPU and an NVIDIA GeForce 3090 graphics card with 24 GB of memory. The model was built with the PyTorch framework, and a ResNet50 network pre-trained on the ImageNet dataset was used as the backbone. Since the pedestrian images in the datasets are generally small and the feature-map resolution is low after 32× downsampling, the stride of the last layer of the backbone was set to 1. During training, the Adam optimizer was used with a batch size of 64 (16×4), and RandomIdentitySampler was used for sampling. The initial learning rate was set to 3.5×10⁻⁴ and the weight decay to 5×10⁻⁴. A total of 120 epochs were trained, with a warm-up mechanism for the first 40 epochs that gradually increases the learning rate to 3.5×10⁻⁴; the learning rate was then decreased to 3.5×10⁻⁵ and 3.5×10⁻⁶ at epochs 40 and 70, respectively. Images were scaled to a resolution of 256×128, and data augmentation such as horizontal flipping and random cropping was applied, each with probability 0.5. With an input resolution of 256×128 and 16× downsampling, the output feature map has size 16×8. To preserve the semantic information of each region and let the local branches take full effect, the feature map was divided into 4 parts covering four body regions (above the chest, chest to waist, waist to knee, below the knee), which effectively captures each local feature of pedestrians on most ReID datasets. During testing, the pedestrian feature vectors were L2-normalized, and model performance was evaluated with the CMC (Cumulative Matching Characteristics) and mAP (mean Average Precision) metrics. For each query image, the similarity to all gallery images is computed. The CMC metric records the probability of a successful match within the top k images, with Rank-1 denoting the probability of a match at the first position. The average precision of each query is computed from its PR curve, and mAP is the mean of the average precisions over all queries.
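The warm-up and step decay described above can be sketched as a plain schedule function (values taken from the text: linear warm-up to 3.5e-4 over the first 40 epochs, then 3.5e-5 until epoch 70 and 3.5e-6 afterwards):

```python
def learning_rate(epoch, base_lr=3.5e-4, warmup_epochs=40):
    """Per-epoch learning rate: linear warm-up, then two /10 step decays."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs  # linear warm-up
    if epoch < 70:
        return base_lr * 0.1                          # 3.5e-5 after epoch 40
    return base_lr * 0.01                             # 3.5e-6 after epoch 70
```

In practice this is the kind of function passed to an optimizer scheduler such as PyTorch's `LambdaLR`.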

Ablation Experiment
To investigate the optimal location of the CPAA module in the network, four groups of control experiments, shown in Table 2, were designed. Baseline uses only the ResNet50 backbone network to extract features, without the multiple local branches or the CPAA attention structure, and its loss function does not contain GLA. Layer2, Layer3, and Layer4 denote adding the CPAA module at the corresponding positions. To ensure a fair comparison, the basic experimental parameters were set according to Section 4.2, and each method was trained and tested on the Market-1501 and DukeMTMC-reID datasets separately. The data in Table 2 show that with the Baseline method 1 alone, the Rank-1 indicators on Market-1501 and DukeMTMC-reID are 93.6% and 85.1%, respectively. Methods 2, 3, and 4 all improve the Rank-1 indicator to some degree, because the CPAA module aggregates channel features and related information from neighboring regions, yielding feature maps with better representation capability. Method 3 improves more than Method 2, since the CPAA module in Method 2 sits at a shallower position in the network, where it suffers more interference from low-level features. In Method 3, the feature map output after Layer3 already contains rich semantic features, and adding the module at this point better aggregates deep semantic features and improves network performance. Method 4 shows some improvement over Method 3, but the margin is limited, because the important semantic information has already been attended to through multiple feature aggregations in the earlier stages and the output features tend to be stable.
To further verify the effectiveness of each module, six control experiments were designed, as shown in Table 3, where Localbranches denotes the multiple local branches. A comparative analysis of the data in Table 3 reveals the following. (1) Adding CPAA to Baseline increases Rank-1 by 1.6% and 1.7% on the Market-1501 and DukeMTMC-reID datasets, respectively; similar improvements are observed in the other combinations that include CPAA, indicating that CPAA effectively aggregates semantic features from neighboring regions of the feature map and enhances the representation ability of the backbone network. (2) Compared with Baseline, adding only Localbranches (without CPAA or GLA) has no significant impact on Rank-1 on either dataset, and this observation holds for the other combinations that include only Localbranches, indicating that Localbranches alone cannot diversify the local features. (3) Adding GLA to Localbranches (Baseline+Localbranches+GLA) increases Rank-1 by 1.5% and 3.3% on Market-1501 and DukeMTMC-reID, respectively, compared with Baseline+Localbranches, suggesting that Localbranches can extract multiple local features and obtain feature vectors with more detailed information only when trained jointly with GLA. Based on the above analysis, CPAA and Localbranches combined with GLA can improve the model's performance. Finally, the MFA model is built by stacking these modules and achieves Rank-1 of 95.8% and 90.1% on Market-1501 and DukeMTMC-reID, respectively, a 2.2% and 4.7% improvement over Baseline.

Network Visualization Analysis
To visualize the image regions the network attends to, Grad-CAM [27] was used for visual analysis. Four representative scene images were randomly selected from the Market-1501 test set, as shown in Fig. 4. The leftmost image in each group is the original image, and the middle four images visualize the attention regions of the four local branches overlaid on the image. The rightmost image visualizes the fused attention region after combining the four local branches.
In Fig. 4, (a) shows a normally cropped image, as described in Section 4.2. The different branches of the network attend in turn to four regions of the pedestrian's body: above the chest, chest to waist, waist to knee, and below the knee. The fused attention regions essentially cover the entire outline of the pedestrian. At the same time, the fused features tend to focus on more discriminative regions, such as distinctive texture patterns on the upper body and the transition area between the trousers and the legs. (b) shows a scale change, where the pedestrian occupies only a local area of the image. In this case, the local branch responsible for the upper part of the image does not attend to the background area at the top, because under the influence of the global semantic features there is no qualifying semantic information in that region; this branch defaults to outputting the most discriminative global feature, i.e., the upper-body region of the pedestrian. The other branches continue to focus on different local regions. (c) shows a partially occluded pedestrian. Because the lower part of the body is covered, the branch responsible for the area below the knees defaults to outputting the most prominent global feature, and the other branches focus only on the upper-body region. (d) shows a difficult case with both background interference and scale change. Many other pedestrians and objects appear in the background, but the attention region of each local branch still meets the requirements.
In all four situations, the network's final attention regions cover the entire pedestrian and ignore interference such as occlusion and background, so the network outputs rich and clean pedestrian features. To examine the generalization ability of the network on other occlusion datasets, we selected four representative images from the Partial-REID dataset, as shown in Fig. 5, covering half-body, side-view, heavily occluded, and partially occluded situations. Grad-CAM visualization shows that the model trained on Market-1501 still generalizes well to Partial-REID: the network's attention areas cover the entire person and focus on multiple discriminative local features while ignoring the interference of background and occlusions.

Comparison with other methods
The proposed method is compared with current state-of-the-art (SOTA) methods, including block-based methods such as AlignedReID [15], SCP [16], PCB [17], MGN [32], GCP [33], and MSCN [34]; global-feature-based methods such as OSNet [4], BagTricks [5], and AEFLN [35]; mask-segmentation-based methods such as SPReID [13] and GASM [36]; and attention-based methods such as AGW [20], ABD-Net [37], and SONA [38]. To ensure fairness, re-ranking [39], multi-query [40], and similar techniques are not used in this study. Table 4 presents the performance of these networks on the Market-1501 and DukeMTMC-reID datasets. OSNet, AGW, and MGN, the strongest of the methods above, were selected for analysis. OSNet incorporates a fusion mechanism of convolution layers with different receptive fields, which can extract the most discriminative features, but it ignores other secondary features and has limited generalization ability. AGW further enhances the backbone's feature extraction by adding an attention module on top of BagTricks, but it does not use local feature information. MGN uses global and block-based local features and can extract fine-grained features, but it does not address scale transformation and occlusion, and may introduce more interference information in such situations. In this paper, with CPAA and the multi-branch structure, the network extracts diverse fine-grained features, and with GLA it considers global semantic information while focusing on local features, handling scale transformation and occlusion scenarios. The indicators in Table 4 show that the proposed MFA network performs best. To visually reflect the effectiveness of the proposed method, the retrieval results of AGW, OSNet, MGN, and the proposed method were compared. Fig. 6 shows the Rank-10 retrieval results of each method, with dotted boxes indicating incorrect matches. Although AGW and OSNet produce correct Rank-1 matches, they cannot attend to more detailed information because they lack local features, leading to more incorrect matches. MGN fails to solve the scale-variation problem and can misidentify in such cases. MFA, which extracts more diverse local fine-grained features and addresses scale variation, achieves better Rank-10 matching.
Finally, to verify the model's generalization ability under various occlusion scenarios, tests were conducted on the Partial-REID and Partial-iLIDS datasets using the single-shot protocol: the occluded image serves as the query and the complete pedestrian image as the gallery. Table 5 shows the indicators for each method. On Partial-REID, MFA achieves a Rank-1 of 73.2%, which is 3.5% higher than the better-performing AGW. On Partial-iLIDS, the Rank-1 of MFA reaches 68.1%, a 0.8% improvement over the better-performing PGFA [41].

Conclusion
We propose a pedestrian re-identification network that integrates local features with attention mechanisms. The CPAA attention module embedded in the ResNet50 backbone effectively aggregates semantic features from different channels and neighboring regions; meanwhile, through the joint action of the local branches and the GLA loss, the model can focus on multiple local features of pedestrians. Visual analysis shows that the fused pedestrian features are rich and clean. Compared with similar networks, the proposed method achieves competitive performance, demonstrating its effectiveness. In future work, we will investigate how to streamline the model while maintaining the current performance, enabling lightweight and edge deployment of the network.

Fig. 1
Fig. 1 Overall network architecture of MFA

Fig. 3
Fig. 3 Examples of images on different datasets

Fig. 6
Fig. 6 The comparison of Rank-10 retrieval results by different networks