View-aware attribute-guided network for vehicle re-identification

Vehicle re-identification is one of the essential applications of urban surveillance. The enormous intra-class variation and strong inter-class resemblance make it challenging for methods to distinguish between vehicles. Additionally, varying illumination and complex environments create significant hurdles for existing methods. In this paper, we present a multi-guided learning method that uses multi-attribute and viewpoint information while also enhancing the robustness of feature extraction. The multi-attribute sub-network learns discriminative attributes, i.e., the color and type of the vehicle, and the view predictor network adds complementary viewpoint information to the feature embedding. To validate the effectiveness of our framework, experiments are conducted on two benchmark datasets, VeRi-776 and VehicleID. Experimental results illustrate that our framework achieves competitive performance.


Introduction
Vehicle re-identification (re-id) has recently received immense attention for building advanced monitoring and surveillance systems. It seeks to identify specific vehicles across multiple non-overlapping cameras. Vehicle re-id [1][2][3][4] is a critical visual task in surveillance systems since it allows a single vehicle to be pinpointed precisely. Even though surveillance cameras are standard in many places, their coverage is restricted to limited viewpoints due to their fixed placements. Although license plates are vital for differentiating between vehicles, they cannot always be relied upon and can be forged or damaged. In urban surveillance, license plate recognition [5][6][7] techniques struggle under extreme conditions, i.e., motion blur, low lighting, inconsistent viewpoints, and low image quality. As a result, vehicle re-id techniques focus primarily on examining visual vehicle features. Moreover, vehicle re-id encounters a unique challenge: particular vehicles may have near or even identical appearances from different viewpoints, particularly vehicles of the same model and brand. Large intra-class differences imply that images of the same vehicle look very different under various conditions of distance, light, occlusion, viewpoint, etc. Small inter-class differences mean that different vehicles with the same model, color, and brand may appear very similar. Therefore, the most significant factors for vehicle re-identification are reducing intra-class variations and raising inter-class differences. Some scholars [8,9] divided vehicles into parts in various ways to maximize the utilization of local attributes in key locations of the vehicle in order to reduce the distance between images of the same vehicle.
Deep learning has lately been used to solve a wide variety of computer vision problems; researchers primarily focus on either building new network architectures to learn discriminative features or adding data to improve re-id model performance. In vehicle re-id, local features are momentous: as illustrated in Fig. 1, vehicles with identical global appearances can be recognized more easily based on their local appearance cues. Global feature representation alone is insufficient for the vehicle re-id task since vehicles in a fine-grained categorization have identical appearances. Vehicles from the same or separate manufacturers might have identical shapes and colors, making it tough to determine whether two images show the same vehicle. Vehicle re-id systems that focus on vehicle features and visual characteristics such as color, shape, and texture are gaining ground. However, these techniques are imperfect, and therefore the research emphasis in vehicle re-identification is on solving the hurdles associated with existing research issues and improving accuracy through robust and reliable methods.
Multi-view information is a challenging issue, specifically in multi-view matching. When a vehicle's appearance varies across viewpoints, data from a single viewpoint may be insufficient: it is challenging to identify the same vehicle seen from the rear based on its front-view features during the ranking stage. Multiple approaches [2,10,11,12] were proposed to deal with training problems caused by intense variations in viewpoint. A few previous techniques used vehicle key-point annotations [13,14], while others utilized pose estimation networks [1,4,15] to learn viewpoint information. However, in practical applications, data annotation is incredibly expensive, and pose estimation is subject to various constraints, including motion blur and occlusion. Whereas precise semantic views are hard to achieve, the number of views may be specified using prior information and camera positions. Furthermore, the overall appearance differs drastically across viewpoints, leading to inconsistency of global characteristics. On the contrary, local attributes offer stable discriminatory cues, and researchers claim that local regions yield more discriminative features.
To solve the challenges stated above, we utilize multi-attribute and view-aware features and propose a deep learning model to advance vehicle re-identification. Vehicle color and model are invariant to occlusion and illumination conditions; therefore, we associate these attribute features to gain a more robust representation of the vehicles. Different views typically show different aspects of a vehicle in the real world. By utilizing these additional features, our model can gain a more discriminative description of a vehicle. However, the same vehicle may appear vastly different across views, which poses a major challenge for combining these varied features efficiently. The main contributions are as follows.
• We utilize attribute information, i.e., color and model, which are invariant to illumination and camera position, helping to extract more promising features for re-identification.
• To deal with the challenges associated with viewpoints in vehicle re-id, we introduce a view-aware model that enhances feature alignment and exploits visible views by adding a multi-view attention mechanism, helping the model learn more discriminating and robust features.
• Comprehensive experiments are carried out on two benchmark re-id datasets, i.e., VeRi-776 [16] and VehicleID [17], employing the proposed network, and competitive results have been achieved. Ablation studies of the proposed model are performed, and the findings show that the proposed network achieves promising results.

Related work
Vehicles play a crucial role in advanced surveillance systems. With the rising significance of public security, many researchers have proposed various image-based re-id algorithms, and advances in neural networks have helped researchers achieve robust and discriminative data representations. Liu et al. [16,18,19] constructed the VeRi-776 dataset and tackled the re-id challenge as a progressive process based on spatial-temporal data, license plates, and visual features. Liu et al. [17] published a massive surveillance dataset named VehicleID and developed a coupled clusters loss to calculate the distance between two identical vehicles. Liu et al. [20] combined handcrafted features with features extracted by CNNs to gain a robust feature representation. Wang et al. [21] initially investigated vehicle structure and extracted key points accordingly using CNN feature extraction. Li et al. [22] combined verification, attribute recognition, and identification in one deep network. However, due to the massive visual appearance variations produced by multiple cameras, vehicle position, illumination variations, and partial occlusion, real-world vehicle re-identification remains a complex issue. Several works [3,9,23-29] employ attribute information, e.g., color and model/type, to increase the capacity of vehicle re-identification. Liu et al. [30] developed a model (RAM) that retrieved features from local regions rather than global regions and embedded the complex visual information of local regions, as each respective area conveys increasingly unique visual information. Additionally, they developed a novel training method that included vehicle IDs, types, and colors, leading to more discriminative local and global characteristics. Yan et al. [31] investigated the multi-grain associations between vehicles with different attributes. Yang et al. [32] dealt with the occlusion problem using a long short-term memory network. Liu et al. [17] proposed a multi-branch network to extract features such as instance differences and model information.
We further utilize viewpoint cues to improve feature stability. In recent works, Zhou and Shao [10] introduced a method that focuses on certain regions of vehicles from several views to predict a multi-view feature vector from a single-view input. It enhances the visual features according to viewpoint variations, but it is complex and hard to learn. Zhou et al. [2] introduced a model to tackle the multi-view problem, where unseen views were generated from available views, yet auxiliary discriminative features were ignored. Chu et al. [12] proposed a model that learns two metrics in two feature spaces for similar and distinct viewpoints separately; the viewpoint is determined first, and the appropriate metric is then used. Zheng et al. [33] combined the type, color, and views of a vehicle for robust feature representation. Meng et al. [11] parsed vehicles into various views and adapted the parsed masks to align feature representations of multiple views. Khorramshahi et al. [34] introduced a key-point selection technique to locate critical features and emphasize highly relevant ones. Teng et al. [35] introduced a network to counter the negative impact of viewpoint variations. They developed a multi-view branch network in which each branch learns view-specific cues without shared parameters; their architecture combines spatial attention learning and multi-view feature learning to distinguish visually similar vehicles. In our proposed model, we associate vehicle attribute information, e.g., color and type, with viewpoint features to enhance the visual feature representation.

Proposed methodology
The key to solving the challenge of vehicle re-identification is to discover how to extract features under adverse circumstances such as changes in ambient light and viewpoint. We concentrate on feature extraction with multiple views and attributes. Our model, as shown in Fig. 2, is composed of two branches: a type and color (attribute) feature extraction branch, and a view feature extraction branch with a multi-view attention module. Color and type feature extraction in the main branch is done using ResNet50. The viewpoints are extracted in the second branch, where a multi-view attention module is added to determine distinct view features. Triplet loss and cross-entropy loss are used to compute the losses.

Attribute-guided branch
Current vehicle re-id methods typically use a single-stream convolutional neural network to generate discriminative feature representations of the vehicle image while ignoring attribute features. Two distinct vehicles can be confused due to their similar type and color, and the same vehicle may appear very different from various viewpoints, yet its model and color remain the same. To obtain vehicle color and model information, we extract the vehicle color and type using the attribute branch. The vehicle's appearance is a low-level feature that can be retrieved using a shallower neural network; unlike semantics, color properties are easily represented in pixel values. Vehicle appearance changes with different viewpoints, occlusion, and illumination conditions, so we propose a framework guided by attribute features, i.e., color and type, that are robust to appearance fluctuations caused by different viewpoints and occlusion. The main branch is based on the ResNet50 architecture [37], initially trained on ImageNet [36]; we remove the last down-sampling stage, and the features extracted from the fully connected layer are 2048-dimensional. ResNet50 generates feature maps f, which are fed to the main branch to obtain the attribute-guided features $f_{color}$ and $f_{type}$. The final concatenated feature representation $f_t$ is used for the vehicle re-id problem. We use cross-entropy loss to train the color and type classification sub-branches, and this loss is aggregated with the triplet loss and the identity cross-entropy loss. The loss function for the attribute branch is calculated as
$$L_{attribute} = \lambda_{color} L_{color} + \lambda_{type} L_{type}, \qquad (1)$$
where $\lambda_{color}$ and $\lambda_{type}$ are the corresponding weights for $f_{color}$ and $f_{type}$; both weights are set to 1.
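To make the branch concrete, below is a minimal PyTorch sketch of an attribute-guided branch following the description above; the class counts, layer names, and head design are illustrative assumptions rather than the released implementation (e.g., the removed down-sampling stage is omitted for brevity).

```python
# Minimal sketch of the attribute-guided branch (illustrative, not the released code).
# Assumptions: number of color/type classes, plain linear heads on pooled features.
import torch
import torch.nn as nn
import torchvision


class AttributeBranch(nn.Module):
    def __init__(self, num_colors=10, num_types=9, feat_dim=2048):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Keep everything up to global average pooling; drop the ImageNet classifier.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.color_head = nn.Linear(feat_dim, num_colors)   # color classification logits
        self.type_head = nn.Linear(feat_dim, num_types)     # type classification logits

    def forward(self, x):
        f = self.backbone(x).flatten(1)      # 2048-dim global feature f
        return f, self.color_head(f), self.type_head(f)


def attribute_loss(color_logits, type_logits, color_labels, type_labels):
    # Eq. (1): L_attribute = lambda_color * L_color + lambda_type * L_type, weights set to 1
    ce = nn.CrossEntropyLoss()
    return ce(color_logits, color_labels) + ce(type_logits, type_labels)
```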

Multi-view/head attention mechanism
It is hard to distinguish large numbers of vehicles from various viewpoints. Multi-view/head attention enhances the capacity of the model to learn more distinct features from different viewpoints. Therefore, we adopt the multi-head attention architecture initially introduced in [38] to improve the view features for re-id. The 2048-dimensional features extracted from the ResNet are fed to several fully connected layers. Each fully connected layer is regarded as a view and learns different view features. Moreover, the attention mechanism decides which view features are important for the final embedding.
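A minimal sketch of this view branch is given below, assuming a fixed number of view-specific fully connected layers and a simple learned score that weights each view in the final embedding; the number of views and the dimensions are illustrative assumptions, not the paper's exact design.

```python
# Sketch of the multi-view attention module (assumed dimensions and view count).
import torch
import torch.nn as nn


class MultiViewAttention(nn.Module):
    def __init__(self, in_dim=2048, view_dim=512, num_views=4):
        super().__init__()
        # One fully connected layer per view, each learning view-specific features.
        self.view_fcs = nn.ModuleList(
            [nn.Linear(in_dim, view_dim) for _ in range(num_views)]
        )
        # Attention scoring layer: decides how important each view feature is.
        self.score = nn.Linear(view_dim, 1)

    def forward(self, f):                                             # f: [B, 2048]
        views = torch.stack([fc(f) for fc in self.view_fcs], dim=1)  # [B, V, D]
        weights = torch.softmax(self.score(views), dim=1)            # [B, V, 1]
        return (weights * views).sum(dim=1)                          # [B, D] view feature
```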

Loss function
Triplet loss and cross-entropy loss are widely used in re-id tasks, and we use both in the training phase. The goal of the triplet loss is to minimize the distance between features of the same vehicle while maximizing the distance between features of different vehicles. Given a query image of a vehicle with ground-truth label distribution $q_y$ and predicted class probability $p_y$, the cross-entropy loss is formulated as
$$L_{id} = -\sum_{y} q_y \log(p_y).$$
Concurrently, the triplet loss is computed as
$$L_{triplet} = \left[ d_p - d_n + m \right]_+,$$
where the distance between the positive pair is denoted by $d_p$ and the distance between the negative pair is denoted by $d_n$. The margin $m$ of the triplet loss is set to 0.5, and $[z]_+$ equals $\max(z, 0)$. Lastly, together with the attribute loss of Eq. (1), the total loss is computed as
$$L_{total} = L_{id} + L_{attribute} + L_{triplet}.$$
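As a concrete illustration of how these terms combine, here is a minimal PyTorch sketch assuming pre-computed identity logits and anchor/positive/negative embeddings; the function names and batching are ours, not the authors' code.

```python
# Sketch of the training objective: identity cross-entropy + margin triplet loss
# + the attribute loss of Eq. (1). The margin value follows the text (0.5).
import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin=0.5):
    d_p = F.pairwise_distance(anchor, positive)            # positive-pair distance d_p
    d_n = F.pairwise_distance(anchor, negative)            # negative-pair distance d_n
    return torch.clamp(d_p - d_n + margin, min=0).mean()   # [z]_+ = max(z, 0)


def total_loss(id_logits, id_labels, anchor, positive, negative, loss_attribute):
    loss_id = F.cross_entropy(id_logits, id_labels)        # L_id
    loss_tri = triplet_loss(anchor, positive, negative)    # L_triplet
    return loss_id + loss_attribute + loss_tri             # L_total = L_id + L_attribute + L_triplet
```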

Datasets
Two benchmark datasets, VeRi-776 [16] and VehicleID [17], are used in our experiments. In VehicleID, the vehicles are annotated with six independent colors, including red, yellow, and white. The training set includes 13,164 vehicle images. The test set is grouped into complexity levels: a small test set containing 800 vehicles, a medium test set containing 1,600 vehicles, a large test set containing 2,400 vehicles, and the largest test set containing 13,164 vehicles. Although the dataset is enormous, it lacks multi-view information, since a single point of view was utilized to construct the dataset, providing only the front and rear views of the vehicles. The number of images is not consistent for every vehicle, with certain vehicles having a limited number of images available and others having a more significant number of images in the dataset (Table 1).

Implementation details
We used two benchmark datasets, VeRi-776 [16] and VehicleID [17], to train our re-id network. The architecture is implemented in PyTorch and trained on an NVIDIA TITAN RTX GPU. In our model, the backbone for feature extraction is ResNet [37]. Input images are resized to 320 × 320 and padded with 10 pixels. All feature extractors are trained for 120 epochs using the Adam optimizer with a weight decay of 5e-4. The learning rate is set to 3.5e-5 and increased to 3.5e-4 after 60 epochs. The source code for the framework is available at https://github.com/saiftumrani/VAAG.
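The stated hyper-parameters can be wired up roughly as follows; the transform pipeline (random crop back to 320 × 320 after padding) and the placeholder model are assumptions for illustration, not the released training script.

```python
# Illustrative training configuration (assumed details are marked in comments).
import torch
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((320, 320)),
    transforms.Pad(10),
    transforms.RandomCrop((320, 320)),   # assumption: crop back to 320x320 after padding
    transforms.ToTensor(),
])

model = torch.nn.Linear(2048, 576)       # placeholder; replace with the two-branch network
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-5, weight_decay=5e-4)

for epoch in range(120):
    if epoch == 60:                      # raise the learning rate after 60 epochs
        for group in optimizer.param_groups:
            group["lr"] = 3.5e-4
    # ... one training pass over VeRi-776 / VehicleID batches goes here ...
```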

Evaluation metrics
The protocol provided in the original studies is followed during the testing stage. On VehicleID [17], one image of each vehicle is chosen as the query sample, while the rest are considered gallery targets. On VeRi-776 [16], we use the original retrieval protocol, which requires the query image and the matching gallery images to be captured by separate cameras. For both datasets, we evaluate performance using the Cumulative Matching Characteristic (CMC) curve and the mean Average Precision (mAP), which are commonly used for the re-identification task. The CMC curve reports the proportion of correct target vehicle images within ranking lists of varying sizes. The mAP takes both precision and recall into account and is a standard metric for evaluating the performance of re-id models.
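For reference, a compact sketch of how CMC and mAP can be computed from a query-gallery distance matrix is shown below; this is a generic re-id evaluation routine (it omits, e.g., the same-camera filtering used on VeRi-776) and not the authors' evaluation script.

```python
# Generic CMC / mAP computation from a distance matrix (sketch, not the official protocol).
import numpy as np


def evaluate(dist, q_ids, g_ids, max_rank=5):
    num_q = dist.shape[0]
    cmc, aps = np.zeros(max_rank), []
    for i in range(num_q):
        order = np.argsort(dist[i])                      # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(int)
        if not matches.any():
            continue
        first_hit = int(np.argmax(matches))              # rank of the first correct match
        if first_hit < max_rank:
            cmc[first_hit:] += 1                         # counts toward Rank-k for k >= hit
        hits = np.where(matches == 1)[0]
        precisions = [(k + 1) / (pos + 1) for k, pos in enumerate(hits)]
        aps.append(np.mean(precisions))                  # average precision for this query
    return cmc / num_q, float(np.mean(aps))              # CMC curve, mAP
```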

Comparison with other methods
We used VeRi-776 [16] and VehicleID to validate our method and compared it with recent state-of-the-art methods; the results are listed in Tables 2 and 3, respectively. LOMO [39] is a handcrafted-feature technique that focuses on local features and handles illumination and viewpoint challenges in person re-id. BOW-CN [40] also targets local features. GoogLeNet [41] is a classification neural network. FACT [16] integrates semantic, texture, and color information extracted using deep neural networks. DGD [42] uses multiple domains to learn robust and generic feature representations. XVGAN [2] tackled re-id by generating realistic images of the same vehicle in different views and integrated features from real and generated images to compute distance metrics. In [17], two methods, CCL and Mixed Difference+CCL, are proposed, where a coupled clusters loss replaces the triplet loss and the L2 distance is used for similarity estimation. VAMI [10] introduced a view-aware attentive model for selecting primary regions. In DenseNet121 [43], every layer is connected to every other layer in a feed-forward manner; it is used for classification tasks. ABLN-Ft-16 [44] is also included in the comparison. In [21], to establish an orientation-invariant feature embedding module for vehicle feature representation, the OIFE model aligns local regions using key-points. PAMAL [3] utilized multi-attribute features, i.e., color and type, together with vehicle key-points to solve the re-id task. PROVID [19] is a deep re-id framework utilizing coarse-to-fine search in the feature domain and near-to-distant search in physical space. VRSDnet [45] introduced a short and dense convolutional network for vehicle re-id. EALN [46] generates hard negative samples localized in the embedding space to improve discrimination. MSA [48] introduced a multiscale attention-based method to fuse discriminative features and global information. QD-DFL [49] uses DenseNet for basic features and quadruple average pooling to obtain quadruple-directional deep features for vehicle re-id. The query images and ranking lists obtained by the final model are visualized in Fig. 4. The vehicles in the first and second rows may appear different because of different views and illumination conditions, whereas the samples in the last row have very low resolution. Our model, on the other hand, outperforms the state-of-the-art and can retrieve a high-quality ranking list, as shown in Fig. 4.

Evaluation on VeRi-776
The results of our proposed model on the VeRi-776 dataset, along with a comparison with recent state-of-the-art methods, are listed in Table 2. The compared methods include: LOMO [39]; BOW-CN [40]; GoogLeNet [41]; FACT [16]; DGD [42]; XVGAN [2]; the two methods CCL and Mixed Difference + CCL [17]; VAMI [10]; DenseNet121 [43]; ABLN-Ft-16 [44]; OIFE [21]; PAMAL [3]; PROVID [19]; VRSDnet [45]; EALN [46]; MDLSTM [47]; MSA [48]; and QD-DFL [49]. Our proposed approach achieves the best performance. Table 2 illustrates that our proposed model achieves Rank-1 = 92.2%, Rank-5 = 96.67%, and mAP = 63.01% on the VeRi-776 dataset. Firstly, when compared to traditional techniques such as LOMO and BOW-CN, our proposed method achieves a noteworthy boost in mAP of 53.37% and 50.81%, respectively. This proves that, for the vehicle re-id task, deep features are more appropriate and robust than handcrafted feature extraction methods. Secondly, approaches such as FACT, PROVID, OIFE, and VAMI used spatio-temporal cues as additional information to increase re-id accuracy; even though our method does not use spatial-temporal information, it achieved significant improvements compared to these methods. Thirdly, when compared to approaches that learn features from various views of vehicles, such as VAMI, our strategy improves mAP by 12.88%, which proves that our technique can learn more robust and discriminative features for the re-id task. Finally, when compared to other approaches that use local features with numerous branches, such as RAM, EALN, MDLSTM, MSA, and QD-DFL, our method also performs favorably.
Fig. 3 shows the results of the various methods on the VeRi-776 and VehicleID datasets. The first subfigure depicts the results on the VeRi dataset, with mAP highlighted in cyan. The following three subfigures present the results on the three subsets of the VehicleID dataset, namely the small, medium, and large sets, which are separated by level of difficulty; the large subset is the most challenging. The graphs demonstrate that our proposed model produces results comparable to those obtained by other state-of-the-art approaches.

Evaluation on VehicleID
LOMO, FACT, DRDL, OIFE, VAMI, PROVID, RAM, EALN, MSA, and QD-DFL are compared on VehicleID; Table 3 shows the Rank-1, Rank-5, and mAP of our method and the compared methods. As on VeRi-776, our method delivers the best results and beats all other tested methods, which clearly proves the efficacy of our strategy. Although each sub-network in our architecture is not particularly complex, the multiple features are learned and merged in a unique way. When compared to other approaches that learn multiple features similar to our concept, our method has significant advantages. This also demonstrates the effectiveness of combining attribute and view information for vehicle re-id.

Ablation study
In this section, we investigate the attribute feature extraction branch of the network. The effect of employing attributes in vehicle re-id is evaluated first. ResNet-50 generates features f, which are fed to the main branch to extract attribute features; for the attribute features, we use $f_{color}$ and $f_{type}$. Our proposed method enhances discrimination based on vehicle color and type and encourages the model to distinguish between similar vehicles.
For the color and type classification branches in multi-task learning, we use cross-entropy loss. The results illustrate the effectiveness of the attribute features in gaining prominent outcomes for the vehicle re-id task; the accuracy of the attribute feature extraction module is shown in Table 4. The first branch extracts color attribute features. Since color is relatively stable across viewpoints, we include the color branch to obtain robust visual features, although multiple vehicles with the same color may look alike under different lighting conditions. Color is a crucial attribute of the vehicle that helps improve the effectiveness of the re-id task; by utilizing color attributes, our model achieves an accuracy of 67.1% on VeRi-776 and 62.51% on VehicleID. The second branch is the vehicle type branch; this attribute is also vital for capturing visual appearance differences. Due to the variation in inter-class and intra-class appearance, type attributes help distinguish between similar-looking vehicles. The type attribute learning branch achieves an accuracy of 58.4% on VeRi-776 and 54.24% on VehicleID. The third branch is the view feature extraction branch, which extracts view features of the vehicle. Due to a change in viewpoint, the same vehicle may look different, or it may look identical to another vehicle because of the vehicle's structure. Our view feature extraction module embraces the multi-view attention mechanism to improve the effectiveness of the model. By applying the attention mechanism, we improve the network's capability, achieving 83.55% on VeRi-776 and 61.92% on VehicleID, respectively. This shows that the visual representation features gained using multi-head attention are more robust (Fig. 4).

Conclusion
In this work, we propose a deep convolutional network based framework for vehicle re-id that simultaneously learns robust discriminative features such as color, type, and camera view. We add sub-networks to a ResNet backbone, combining view, type, and color cues, respectively. These three tasks complement one another and provide a discriminative representation for vehicle re-id that is informative and useful. In comparison to state-of-the-art approaches, our system achieves competitive performance by learning robust discriminative features from camera views, vehicle types, and vehicle colors. The contribution of the deep model and its information representation capability for vehicle re-id are demonstrated in a comprehensive evaluation on two benchmark datasets. In future work, we are interested in using capsule networks to develop more robust feature extraction models for better learning of visual features and spatial information.