Multitask Model for Person Re-identification by Attribute-aware

The image-based person re-identification problem can be cast as a similar-image retrieval problem. Most current identity-based methods do not consider pedestrian attributes, and many methods that consider both pedestrian attributes and identities fail to fully model the relationship between them. In this article, we propose a new attribute-aware method for image-based person re-identification. Building on instance batch normalization, an attention-based non-local module is used to modify the ResNet architecture and improve feature extraction. After generalized mean pooling aggregates the features, an identity-based and an attribute-based two-stream network module attend to the relationship between identities and pedestrian features and to the relationship between attributes and pedestrian features, respectively, so as to fully exploit the relationship between pedestrian attributes and identities. Experiments on two classic attribute-annotated person re-identification datasets, Market-1501 and DukeMTMC-reID, demonstrate the effectiveness of the method, which achieves the best performance on some evaluation metrics.

Given a pedestrian image, the purpose of person re-identification is to retrieve images of the same pedestrian across different camera devices. Person re-identification is generally regarded as an image retrieval technology designed to compensate for the visual limitations of fixed cameras. It can be combined with pedestrian detection and pedestrian tracking and applied in industrial fields such as intelligent pedestrian detection and intelligent security [1,2].
Camera devices differ from one another, pedestrians have both rigid and flexible characteristics, and their appearance is easily affected by factors such as clothing, posture, weather, and occlusion. Together these make person re-identification one of the most challenging research topics in computer vision.
The main idea of traditional image-based person re-identification is to compare the similarity of two identities: the similarity between different identities should be small, and the similarity within the same identity should be large. Supervised person re-identification relies on labeled information, but identity labels are coarse. Pedestrians with different identities share many attributes, and images of the same identity can differ considerably. As a result, the features extracted by the network often cannot measure similarity accurately, and relying on identity labels alone to determine the distance between pedestrians can easily bias the features the network attends to. For example, consider three pedestrians with different identities, $(x_1, y_1)$, $(x_2, y_2)$ and $(x_3, y_3)$, with $y_1 \neq y_2 \neq y_3$. Even though $x_1$ and $x_2$ are very similar, training on identity labels alone pushes all three pairs apart, so the similarities $S(x_1, x_2)$, $S(x_1, x_3)$ and $S(x_2, x_3)$ are all reduced. As a result, the network attends to feature information from other regions. When attributes are introduced, $(x_1, y_1)$ can be expressed as $(x_1, y_1, A_1)$. When calculating the similarity, both the label and the attribute information are considered, which guides the network to attend to the corresponding features, so that $S(x_1, x_2) > \max(S(x_1, x_3), S(x_2, x_3))$. Attributes therefore help the network learn the relationship between attributes and features and obtain features with more semantic information. Attributes also help speed up training, because they can filter out pedestrian images that are not compatible with the query attributes.
Of course, more attributes do not necessarily mean better performance. Among image-based person re-identification datasets, the two classic datasets annotated with attribute information are Market-1501 and DukeMTMC-reID. The Market-1501 dataset has 30 attributes for each identity, and the DukeMTMC-reID dataset has 23 attributes for each identity. In many cases, pedestrians with different identities have very similar attributes. That is, for two pedestrians with different identities, $(x_1, y_1, A_1)$ and $(x_2, y_2, A_2)$, it can happen that $A_1 = A_2$ although $y_1 \neq y_2$. Attribute labels alone therefore cannot be used directly to train the network; discriminative attributes must be chosen. For this reason, this paper uses identity labels and attribute labels to train the network jointly.
Regarding person re-identification with multi-modal attributes: although current single-modal algorithms have achieved good performance on some standard datasets, RGB images are well known to be sensitive to bad weather such as rain, snow, and fog, and intelligent monitoring must meet practical needs. The intuitive idea is to mine useful information from other modalities (such as thermal sensors and depth sensors) and fuse it with the RGB data. Accordingly, much work has explored deep fusion of multi-modal data to improve performance [3]. In recent years, some scholars have also proposed introducing attributes into person re-identification tasks. Lin et al. [4] proposed an attribute person recognition (APR) network, a multi-task network that learns a pedestrian ID embedding and predicts pedestrian attributes at the same time. They manually annotated attribute labels for two person re-identification datasets and systematically studied the correlation between pedestrian ID and attribute recognition. Yin et al. [5] proposed an Identity Recognition Network (IRN) and an Attribute Recognition Network (ARN). The identity recognition network extracts partial information of the pedestrian, and the attribute recognition network calculates the attribute similarity of pedestrians. Because of the role of attributes in detection and recognition, some scholars have also introduced them into video-based person re-identification.
However, person re-identification with attributes still faces several problems, which this paper addresses as follows.
1. The accuracy of person re-identification with attributes is currently low. This paper proposes a new method to improve it.
2. Person re-identification performs very poorly on cross-domain problems. The method proposed in this paper improves performance on cross-domain problems.
3. Compared with the latest person re-identification methods, the method proposed in this paper achieves better performance.
Lin et al. [4,5] proposed network frameworks combining pedestrian ID labels and pedestrian attributes. These frameworks break through the traditional limitation of learning only from pedestrian ID labels. By introducing pedestrian attribute labels, they construct a multi-task network that learns pedestrian ID labels and predicts pedestrian attribute labels at the same time. Because pedestrians are both flexible and rigid, and because camera devices differ and are strongly affected by the environment, person re-identification is extremely challenging. These multi-task networks achieve high accuracy in attribute recognition but lack reliable performance in person re-identification. In contrast, to improve the accuracy of person re-identification, this paper uses pedestrian attribute labels as an auxiliary signal and discusses their role in pedestrian recognition.
Zhu et al. [14] used attributes to assist the person re-identification network. They fused a low-level feature distance and an attribute-based distance into the final distance used to decide whether a given image shows the same identity.
Because of the role of attributes in detection and recognition, attributes have also been introduced into video-based person re-identification. Zhao et al. [15] proposed an attribute-driven method for feature decomposition and frame weighting. The sub-features are re-weighted by the confidence of attribute recognition and then integrated along the time dimension to form the final representation. With this strategy, the most informative area in each frame is enhanced, which contributes to a more discriminative sequence representation. Song et al. [16] proposed the partial attribute-driven network (PADNet). Whereas most current methods are based on global-level feature representations, PADNet automatically divides pedestrians into multiple body parts and uses a four-branch multi-label network to explore the spatio-temporal cues of the video.
Most work on person re-identification is based on static images. Although the approaches differ, the main idea is to transform person re-identification into a most-similar-image retrieval problem. That is, in the training phase, images of the same class should be as close as possible and images of different classes should be as far apart as possible. In the testing phase, the query is compared with all pedestrian images in the gallery, and the identity of the closest image is taken as the identity of the query. Once the task is cast as most-similar-image retrieval, the construction of features becomes particularly important. From a traditional human perspective, we would judge the identity of pedestrians according to clothing, age, body shape, and so on. In this paper, we also add an attention-like non-local module [17] and an instance batch normalization (IBN) module [18] to learn more robust features.
The generalized non-local operation can be defined as:
$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$
where $i$ is the index of the output location, $j$ enumerates all possible locations, and $y_i$ is the output at location $i$. $x$ represents the input, $f$ is the affinity function, and $f(x_i, x_j)$ computes the affinity between locations $i$ and $j$. $g$ is a unary function, so that $g(x_j)$ is a representation of the input at location $j$. $C(x)$ is the normalization function.
Here, we apply non-local operations to the ResNet with instance batch normalization, with the following adaptations.
We set $C(x) = N$, where $N$ is the number of locations in $x$, and $f(x_i, x_j)$ computes the affinity between locations $i$ and $j$ as a dot product. In this paper, the non-local block is applied to Layer 2, Layer 3 and Layer 4 of the ResNet with instance batch normalization.
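The non-local operation with a dot-product affinity and $C(x) = N$ can be illustrated with a minimal sketch. It assumes a plain dot product on the raw feature vectors and an identity $g$ by default; any learned embeddings applied before the dot product, and the function name `non_local`, are not taken from the text.

```python
def non_local(x, g=None):
    """Compute y_i = (1/N) * sum_j f(x_i, x_j) * g(x_j) with f as a dot product.

    x: list of N feature vectors, one per location.
    g: unary function applied to each x_j (identity by default, an assumption).
    """
    if g is None:
        g = lambda v: v
    n = len(x)
    gx = [g(v) for v in x]
    dim = len(gx[0])
    out = []
    for xi in x:
        yi = [0.0] * dim
        for xj, gj in zip(x, gx):
            # f(x_i, x_j): dot-product affinity between locations i and j
            affinity = sum(a * b for a, b in zip(xi, xj))
            for d in range(dim):
                yi[d] += affinity * gj[d]
        # C(x) = N: normalize by the number of locations
        out.append([v / n for v in yi])
    return out
```

Each output location is a weighted sum over all locations, which is what lets the block capture long-range dependencies that a local convolution cannot.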

Optimization
In the person re-identification dataset, we can use $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ to represent the set with identity labels. Here, $x_i$ represents the input image of identity $i$, $y_i$ represents the label of identity $i$, and $n$ represents the total number of input images. In a person re-identification dataset containing attributes, we can use $A$ to represent the attribute set of all identities.
$A_i = \{A_i^1, A_i^2, \ldots, A_i^m\}$ is the attribute subset of identity $i$, and $m$ is the number of attributes of each identity. Therefore, we can use $E = \{(x_1, A_1), (x_2, A_2), \ldots, (x_n, A_n)\}$ to represent the set with attribute labels. For the two sets $D$ and $E$, we use a bidirectional parallel approach to solve the person re-identification problem, and define the following three functions.
For the set $D$ with identity labels, we define two functions, both of which are objective functions based on identity labels. The first is a classification function based on identity labels. The second is a metric function $F_{Tri}$, in which $L_{Tri}(\varphi, y_i)$ represents the metric loss of the feature embedding for identity $i$. The purpose of $F_{Tri}$ is to find an image feature embedding such that images with the same label are as close as possible and images with different labels are as far apart as possible.
We combine classification learning and metric learning to find a suitable image feature embedding that better solves the person re-identification problem. For the set $E$ with attribute labels, we define a set of functions.
Here, $f_{Att_j}(w_{Att_j}, \varphi)$ represents the classification function applied to the feature embedding for the $j$-th attribute, and $L_{Att}(f_{Att_j}, A_i^j)$ is the classification loss for the $j$-th attribute of identity $i$. Combining all attributes of identity $i$ gives the attribute set of identity $i$. The purpose of $F_{Att}$ is to find a suitable image feature embedding such that the predicted attribute set is as consistent as possible with the real attribute set.
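Using only the symbols defined in this section, the three objective functions can be sketched as follows. This is an assumed form rather than a quotation: the symbols $w_{Id}$ and $L_{Id}$ for the identity classifier parameters and loss are introduced here for illustration, and whether the sums are averaged over $n$ or weighted against each other is not specified.

```latex
% Identity classification objective (assumed form)
F_{Id}(\theta, w_{Id}) = \sum_{i=1}^{n} L_{Id}\bigl(f_{Id}(w_{Id}, \varphi(\theta, x_i)),\, y_i\bigr)

% Identity metric objective (assumed form)
F_{Tri}(\theta) = \sum_{i=1}^{n} L_{Tri}\bigl(\varphi(\theta, x_i),\, y_i\bigr)

% Attribute classification objective, summed over the m attributes (assumed form)
F_{Att}(\theta, w_{Att}) = \sum_{i=1}^{n} \sum_{j=1}^{m}
    L_{Att}\bigl(f_{Att_j}(w_{Att_j}, \varphi(\theta, x_i)),\, A_i^{j}\bigr)
```

All three objectives share the embedding $\varphi(\theta, x_i)$, which is what couples identity learning and attribute learning in a multi-task setting.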
In the testing phase, we use the feature embedding function $\varphi(\theta, x_i)$ to embed the query set and all gallery images into the feature space. The identity label of each query image is then determined from the Euclidean distance between that query image and all gallery images, and $f_{Att_j}(w_{Att_j}, \varphi(\theta, x_i))$ predicts the attributes of each query image.
To better suit the classifiers ($f_{Id}$ and $f_{Att}$), we normalize the features produced by the embedding function $\varphi(\theta, x_i)$ before passing them on. In the testing phase, we likewise normalize the features before calculating the Euclidean distance between each query image and all gallery images.
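The test-time matching step can be illustrated with a short sketch. It assumes that "normalize" means L2 normalization, which the text does not state, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit length (assumed form of normalization)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_gallery(query, gallery):
    """Return gallery indices sorted from nearest to farthest from the query."""
    q = l2_normalize(query)
    dists = [euclidean(q, l2_normalize(g)) for g in gallery]
    return sorted(range(len(gallery)), key=lambda k: dists[k])
```

The identity of the first index returned would be taken as the predicted identity of the query. With unit-length features, ranking by Euclidean distance gives the same order as ranking by cosine similarity.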

Network structure
In this section, we describe the network structure in detail. Figure 1 shows the network framework proposed in this paper. The framework includes two parts, attribute recognition and identity recognition, corresponding to "A" and "B" in the framework diagram respectively. In the training phase, we first preprocess the input images, and this preprocessing serves as data augmentation. It includes five modules: resize, random horizontal flip, pad, random crop and random erasing. These modules help prevent the network from falling into local extrema during training, which would lead to overfitting.
It also helps diversify the input images and helps the network train better. The data are then fed into the ResNet backbone used in this paper. In particular, this paper introduces two modules, instance batch normalization and the non-local network, to modify ResNet so that it has stronger feature extraction performance.
This paper also introduces generalized mean pooling, whose function is similar to adaptive average pooling. It is responsible for aggregating the feature embedding after the backbone. Its mathematical formula is as follows:
$$f_i = \left( \frac{1}{|X_i|} \sum_{x \in X_i} x^{p_i} \right)^{1/p_i}$$
where $X_i$ denotes the set of activations of the $i$-th feature map and $p_i$ is the pooling parameter. In particular, as $p_i \to \infty$, generalized mean pooling becomes max pooling, and when $p_i = 1$ it becomes average pooling. In this paper, we set $p_i = 3$.
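Generalized mean pooling over a single feature map can be sketched in a few lines. The sketch assumes non-negative activations (for example, after a ReLU), and the function name `gem_pool` is hypothetical.

```python
def gem_pool(activations, p=3.0):
    """Generalized mean of the activations of one feature map.

    p = 1 gives average pooling; large p approaches max pooling.
    Assumes all activations are non-negative.
    """
    n = len(activations)
    return (sum(a ** p for a in activations) / n) ** (1.0 / p)
```

With `p=3.0`, the value used in this paper, the pooled result lies between the mean and the maximum of the activations, so strong responses are emphasized without discarding the rest.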
After feature aggregation, the aggregated features are sent to module A and module B respectively. Module A focuses on pedestrian attributes and is responsible for learning the relationship between pedestrian attributes and features. It can learn the correlation between each attribute of the pedestrian image and the features of the pedestrian image. Module B focuses on the identity of pedestrians and is responsible for learning the relationship between pedestrian identity and features. It can learn the global features of pedestrian images. Combining module A and module B makes the network attend to pedestrian attributes and pedestrian identity at the same time, which helps prevent overfitting.
In module A, in order to obtain the relationship between each attribute and the aggregated features, a batch normalization (BN-1) module is set for each attribute. The BN-1 module is shown in Figure 2 (a). It is followed by a binary classifier that determines whether the current pedestrian feature contains this attribute.
The "c" in module A represents a concatenation function, which combines the classification results of all attributes. Finally, the BCE loss is used to measure the difference between the predicted attributes and the real attributes, and the relationship between pedestrian attributes and aggregated features is learned through backpropagation.
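The attribute loss in module A can be illustrated with a minimal sketch of binary cross-entropy over the concatenated attribute predictions. It assumes each attribute classifier outputs a probability in [0, 1] and that the loss is averaged over attributes; both are assumptions, as is the function name.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over the concatenated attribute predictions.

    probs:   predicted probability that each attribute is present.
    targets: ground-truth attribute labels (0 or 1).
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)
```

Because every attribute is scored independently, a pedestrian can carry any combination of attributes, which a single softmax over all attributes could not express.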
In module B, in order to obtain the relationship between each identity and the aggregated features, a batch normalization (BN-2) module is also designed. This module has a similar function to BN-1 in module A, but only a simple one-dimensional batch normalization operation is used here. Finally, triplet loss, softmax loss and center loss are used to measure the difference between the predicted identity and the real identity. By learning the relationship between pedestrian identity and aggregated features through backpropagation, the network pulls pedestrians of the same identity as close together as possible and pushes pedestrians of different identities farther apart.
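The metric part of module B can be illustrated with a triplet loss. The following is a minimal sketch, assuming batch-hard mining and a margin of 0.3, neither of which is stated in the text; the softmax and center losses are omitted, and the function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_hard_triplet_loss(features, labels, margin=0.3):
    """For each anchor, use its farthest positive and its nearest negative."""
    losses = []
    for i, (fi, yi) in enumerate(zip(features, labels)):
        pos = [euclidean(fi, fj)
               for j, (fj, yj) in enumerate(zip(features, labels))
               if yj == yi and j != i]
        neg = [euclidean(fi, fj)
               for fj, yj in zip(features, labels) if yj != yi]
        if not pos or not neg:
            continue  # this anchor has no valid triplet in the batch
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

The loss is zero once every anchor is closer to all images of its own identity than to any image of another identity by at least the margin.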
The testing phase differs from the training phase. The network does not apply data augmentation to the input images; it only resizes them to fit the network input. After the aggregated features are obtained, the attributes of the current input images (query and gallery) can be predicted by the attribute classifiers. The network can also output the embedded features of each input image (query) through the BN-2 module, and find the images with the best rank score in the gallery by distance matching. "1", "2" and "3" in module A and module B represent the outputs of the network in the testing phase. The metrics in "2" are commonly used evaluation metrics for person re-identification. The embedded features in "1" are mainly used for output visualization. The metrics in "3" are commonly used evaluation metrics for person re-identification with attributes.

Experiment
In this section, we conduct extensive experiments to verify the effectiveness of the algorithm proposed in this paper. For convenience, we refer to the proposed method as the multi-task model for person re-identification by attribute-aware (MMAA).

Evaluation metrics
To measure the performance of the algorithm, this paper uses standard metrics commonly used in person re-identification, including the cumulative matching characteristic (CMC) curve, mean average precision (mAP), mean inverse negative penalty (mINP), and the receiver operating characteristic (ROC) curve.
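The per-query quantities behind CMC, mAP and mINP can be sketched as follows. The sketch assumes each query's ranked gallery has already been reduced to a list of 0/1 flags marking correct matches, and the function names are hypothetical; averaging each quantity over all queries gives the reported metric.

```python
def rank_k_hit(matches, k):
    """CMC at rank k for one query: 1.0 if a correct match is in the top k."""
    return 1.0 if any(matches[:k]) else 0.0

def average_precision(matches):
    """AP for one query; `matches` is the ranked list of 0/1 correctness flags."""
    hits, precisions = 0, []
    for rank, m in enumerate(matches, start=1):
        if m:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def inverse_negative_penalty(matches):
    """INP: number of correct matches divided by the rank of the last one."""
    positions = [rank for rank, m in enumerate(matches, start=1) if m]
    return len(positions) / positions[-1] if positions else 0.0
```

Rank1 only looks at the easiest correct match, whereas INP is driven by the hardest one, which is why mINP values are typically much lower than Rank1 on the same model.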

Datasets and settings
The algorithm in this paper uses data augmentation methods such as random horizontal flip, pad, random crop, and random erasing to preprocess the input images. For the triplet loss function, in the training phase, 4 different identities are fed into the network in each batch, and each identity has 8 different images, so a total of 32 pedestrian images are fed into the network in each batch.
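The batch composition described above (4 identities with 8 images each) can be sketched as an identity-balanced sampler. The data layout, the function name, and the fallback of sampling with replacement when an identity has fewer than 8 images are assumptions made for illustration.

```python
import random

def sample_pk_batch(images_by_id, num_ids=4, num_images=8, rng=random):
    """Pick `num_ids` identities and `num_images` images of each identity.

    images_by_id: dict mapping an identity label to its list of images.
    Returns a list of (image, identity) pairs of length num_ids * num_images.
    """
    batch = []
    for pid in rng.sample(sorted(images_by_id), num_ids):
        pool = images_by_id[pid]
        if len(pool) >= num_images:
            chosen = rng.sample(pool, num_images)
        else:
            # Assumption: sample with replacement when too few images exist.
            chosen = [rng.choice(pool) for _ in range(num_images)]
        batch.extend((img, pid) for img in chosen)
    return batch
```

Sampling this way guarantees that every anchor in the batch has both positives and negatives available, which the triplet loss needs.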
The table lists the results obtained by the latest methods in recent years. In Table 2, "-" means no record. For the DukeMTMC-reID dataset, the method proposed in this paper achieves the best performance among the compared methods on the evaluation metrics Rank1, Rank5, Rank10 and mAP.

Ablation study
To better illustrate the effectiveness of the proposed method, we carried out ablation experiments on three modules: non-local, instance batch normalization, and attributes. We evaluate the results with and without each of these modules using the Rank1, mAP and mINP metrics. In Table 6, "√" means the corresponding module is used, and a blank means it is not. Without any of the three modules, the model obtained 94.1% Rank1, 85.0% mAP and 57.1% mINP on the Market-1501 dataset. When the attribute module is applied, Rank1 rises by 0.1%, mAP rises by 1%, and mINP rises by 2.1%. When all three modules are applied, the results are much better than those of the model without any modules: Rank1 improves by 2%, mAP by 5.3%, and mINP by 13.9%. For the DukeMTMC-reID dataset, without the three modules, Rank1 is 85.9%, mAP is 74.8%, and mINP is 36.4%. After applying the three modules, Rank1 increases by 5.5%, mAP by 6.6%, and mINP by 11.5%. The ablation experiments show that these three modules improve network performance, which also verifies the effectiveness of the algorithm proposed in this paper.

Visualization
In this section, we use a variety of visualization experiments to analyze the performance of the method proposed in this paper.
To better verify the effectiveness of the proposed algorithm, this paper compares the visualization results of the two networks. "ID" in these figures represents the query label, and the numbers 1 to 10 give the ranking from most to least similar. A number in red indicates an incorrect match, and a number in green indicates a correct match. The first row shows the results of the baseline without attributes, and the second row shows the results of the method proposed in this paper.
Figure 5 shows that the baseline, which does not use attributes, produces more mismatches. For example, for the input image with ID 94, the Rank1 image is a wrong match: the baseline makes an error on the "backpack" attribute and regards a pedestrian without a backpack as an exact match. It also makes a matching error on the "clothing" attribute. For the pedestrian with ID 934, the baseline does not correctly match the "hair" attribute. The method proposed in this paper accurately recognizes the key attributes of the pedestrian. The network has learned the relationship between the key attributes of pedestrians and pedestrian features, as well as the relationship between pedestrian identity labels and pedestrian features.
In Figure 6, we select pedestrian images with IDs 47 and 288 from the query set. The baseline, which does not use attributes, produces more incorrect matches. For the pedestrian with ID 47, the baseline makes errors on the "backpack" and "bag" attributes. For the pedestrian with ID 288, the baseline does not correctly match the "hair" attribute.
We used Grad-CAM to generate heat maps of the input pedestrians for comparison. For two pedestrians from the Market-1501 dataset, the baseline without attributes is compared with the method proposed in this paper. Figure 7 shows that the proposed method focuses on more body parts and does so more accurately. The proposed method accurately recognizes the key attributes of pedestrians, which illustrates the important role these attributes play in learning the network parameters.
In Figure 8, we list two examples for each dataset, one positive and one negative. The positive examples show cases in which the proposed method predicts all attributes correctly. The negative examples show cases in which the method correctly predicts the attributes a pedestrian has but is wrong about an attribute the pedestrian does not have. For example, for ID 94, our network correctly predicts that the pedestrian has the attributes "teenager", "backpack", "clothes", "up", "upblack" and "downblack", but it wrongly predicts the "downblue" attribute, which ID 94 does not have. For ID 329 of Market-1501 and ID 98 of DukeMTMC-reID, our network predicts all attributes accurately. In these cases the network not only accurately predicts the attributes that the pedestrian image has, but also correctly identifies the attributes that the pedestrian image does not have.

Conclusions
To address the low recognition accuracy of current attribute-based person re-identification and the lack of discussion of cross-domain issues, this paper proposes a new attribute-aware person re-identification network. The network introduces the instance batch normalization module and uses an attention-based non-local method to modify the backbone, which improves feature extraction. Through bidirectional training of the attribute-based and identity-based networks, the network learns the relationship between attribute labels and pedestrian features, as well as the relationship between identity labels and pedestrian features. Our method is tested on two standard attribute-annotated person re-identification datasets, and its effectiveness is verified by comparison with the latest methods.

Compliance with ethical standards
Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.