SiamADT: Siamese Attention and Deformable Features Fusion Network for Visual Object Tracking

To date, existing Siamese-based trackers have achieved excellent performance. However, in some complex scenarios, using deep convolutional layers alone cannot effectively capture powerful representative features. To solve this problem, we propose a Siamese Attention and Deformable features fusion network for visual object Tracking (SiamADT). The proposed SiamADT consists of three modules: a Siamese attention network module for attention feature extraction, a deformable features fusion module, and a classification-regression module for bounding box prediction. Our framework uses ResNet-50 as the backbone for anchor-free tracking. Free from tricky anchor hyperparameter tuning and manual intervention, SiamADT is more flexible and versatile. We conduct extensive experiments on four challenging benchmark datasets. The results demonstrate that SiamADT achieves competitive performance among state-of-the-art methods at a real-time speed of about 30 frames per second.


Introduction
Visual object tracking is a fundamental task in computer vision [1][2][3]. It aims to estimate the position, shape or occupied area of the tracked target from video sequences. Although visual object tracking has wide practical applications, such as automatic driving [4], modern military [5], human-computer interaction [6], virtual reality [7] and so on, it remains a challenging task in the computer vision community.
In the past ten years, correlation filter [8][9][10] and Siamese network based trackers [11][12][13] have attracted a lot of attention and occupied top positions in the community. Siamese network based trackers formulate visual object tracking as a similarity learning problem, determining the position of the object in the current frame by calculating the similarity between the search region and the target template. To address the scale variation problem, SiamFC [11], CFNet [10] and SA-Siam [14] adopt a multi-scale test to predict the target scale. Concretely, by rescaling the search region into several scales and assembling a mini-batch of scaled images, these trackers select the scale with the highest score as the predicted object scale in the current frame. This strategy easily introduces background information into the similarity measurement step, and it cannot fundamentally capture the pose of objects. SiamRPN [15] introduces the Region Proposal Network (RPN) into the Siamese network architecture, using predefined anchors to regress boundaries. However, SiamRPN is very sensitive to the hyperparameters of the anchors. Moreover, since the size and aspect ratio of the anchor boxes are fixed, even with heuristic tuning [16], such trackers struggle when the objects of interest undergo large shape deformations and pose variations. To address this challenge, anchor-free trackers such as SiamBAN [17] and SiamCAR [18] have been proposed. Instead of adopting the multi-scale strategy or pre-defined candidate anchors, they treat object tracking as a parallel classification and regression problem, directly classifying the object and regressing its boundaries in a fully convolutional network. Anchor-free trackers can estimate the scale and aspect ratio of the target more accurately than other Siamese based trackers.
A high-performance generic object tracker should not only ensure tracking efficiency without prior knowledge, but also extract features that focus more on the target of interest. Recently, the attention mechanism was introduced to visual object tracking in RASNet [19] and SiamAttn [20], which inspired the current work. However, RASNet [19] adopts the multi-scale strategy to estimate the object scale, which limits the potential performance of the Siamese network. SiamAttn [20] uses anchors for box regression, which relies on prior knowledge and hinders its generalization ability.
The core purpose of the attention mechanism is to make the model pay more attention to useful foreground information and ignore background information. During tracking, the target usually undergoes different degrees of deformation, and the features extracted by ordinary convolution often contain many background features due to the fixed sampling positions, which prevents the network from learning the various geometric deformations the target may undergo. In this paper, we propose a Siamese attention and deformable feature fusion network for object tracking, named SiamADT. By combining channel attention with spatial attention, we design a robust Siamese attention mechanism. In addition, to improve the robustness of the tracker to scale variations, deformation and similar objects, we design a deformable feature fusion module. Through deformable convolution, the features of targets with complex deformations can be flexibly extracted and the geometric modeling ability can be improved. The combination of attention and deformable features focuses more accurately on the geometric region of the target object, and the two complement each other to effectively improve the performance of the tracker. Therefore, SiamADT can accurately capture the features of the geometric target regions while achieving anchor-free tracking. It achieves better performance compared to other trackers, as shown in Fig. 1.
The main contributions of this work are:


Related Work
Wang et al. [21] first introduced deep learning methods to object tracking. They proposed the "offline pre-training + online fine-tuning" paradigm, providing a feasible new direction for subsequent object tracking research. Subsequently, MDNet [22], C-COT [23] and GFS-DCF [24] attempted to incorporate the correlation filter framework with deep learning methods.
Pang et al. [25] proposed a target tracking method based on a deep learning architecture and a supervised ranking algorithm (DPL2). At the same time, deep learning methods have been applied with great success in a variety of applications, such as human face pose estimation [26], fine-grained image recognition [27], and human pose recovery [28]. Significant advances have been made in both deep learning methods and object tracking.
With the continuous efforts of researchers, tracking algorithms based on Siamese networks have come to occupy the dominant position among deep learning based trackers. At first, Siamese networks were applied to signature authentication [29] and face verification [30,31]. In 2016, SINT [32] and SiamFC [11] proposed to learn similarity measures between the target patch and the search image using a Siamese network, thus modeling tracking as a search for the target over the whole image. Among them, SiamFC [11] not only achieved good accuracy but also a tracking speed of 58 frames per second (FPS), which drew much attention in the community. To improve tracking performance, SiamRPN [15] abandoned multi-scale testing. It introduced the region proposal network (RPN) [33] and a classification-regression task branch, which improved the tracking accuracy. Wang et al. [34] proposed the SiamMask tracker, which is able to produce binary segmentation masks that describe the target object more accurately. SiamRPN++ [35] solved the problem of spatial invariance destruction when using deep networks. It makes full use of features from different layers for prediction, and significantly improves the comprehensive performance of the tracker. However, in the above algorithms, both multi-scale strategies and methods with predefined anchors rely on prior knowledge. To avoid introducing unnecessary background information and hyperparameters, SiamBAN [17] proposed anchor-free tracking by removing the predefined anchors, so that the overall number of model parameters is reduced and the speed is further improved. SiamCAR [18] and SiamCPN [36] added a center-ness branch to the anchor-free tracker to better determine the location of the target center point. SiamCAR outputs the center-ness branch in parallel with the classification branch, while SiamCPN outputs it in parallel with the regression branch.
Recently, the attention mechanism has been successfully applied to visual object tracking. DensSiam [37] added a self-attention module to the target branch, allowing the network to focus more on non-local features during offline training. SA-Siam [14] designed a twofold Siamese network for visual object tracking, and computed channel-wise weights by adding a channel attention mechanism to the semantic branch. Although the introduction of the channel attention mechanism significantly improved tracking performance, it ignored the importance of spatial attention. Different from these methods, SiamAttn [20] proposed the deformable Siamese attention network, introducing a novel Siamese attention mechanism to compute deformable self-attention and cross-attention. However, SiamAttn still uses anchors to predict the bounding box.

Proposed Method
We describe the details of the proposed SiamADT, which consists of three main components (Fig. 2): a Siamese attention network module that aims to improve the model's perception of the targets, a deformable feature fusion (DFF) module to enhance the geometric deformation modeling capability of our tracker, and a target localization module for bounding box prediction.

Fig. 2 An overview of the proposed Siamese Attention and Deformable Features Fusion Network (SiamADT).
It consists of a Siamese attention network module, a deformable feature fusion (DFF) module and a target localization module. "DCN.Conv" represents a deformable convolutional layer, "⊕" represents element-wise addition, "⊗" is element-wise multiplication and "★" represents the cross-correlation operation

Feature Extraction with Siamese Attention Network
There are two branches in this module: a target branch that takes the template patch Z (127 × 127) as input, and a search branch that takes the search region X (255 × 255) as input. The two branches share CNN parameters to ensure that the same transformation is applied to inputs X and Z. In [35], ResNet-50 has been proven to be effective in Siamese network based trackers, so we adopt ResNet-50 as the Siamese backbone. ResNet-50 uses successive convolutions to extract features. The features extracted from the lower layers have higher resolution and contain more low-level information, which is essential for localization. The features extracted from the higher layers have lower resolution and rich semantic information that can be beneficial in some challenging scenarios such as occlusion and deformation. Therefore, multi-level features can provide richer semantic information. To allow our tracker to perform dense Siamese prediction, we modify the last two blocks. Specifically, we reduce the effective strides at the conv4 and conv5 blocks from 16 and 32 pixels to 8 pixels. Furthermore, to enlarge the receptive field, we adopt atrous convolution. We set the atrous rate to 2 in the conv4 block, and 4 in the conv5 block. After modifying ResNet-50, we can aggregate the multi-level features extracted from the last three blocks for prediction:
φ(X) = Cat(F_3(X), F_4(X), F_5(X)),

where F_{3:5}(X) represents the features extracted from layers 3 to 5, each of which has 256 channels, and φ(X) represents the fused features, which contain 3 × 256 channels. Apart from the above modifications, we integrate ResNet-50 with the attention mechanism. Attention can not only focus on important features but also suppress unnecessary background features. SENet [38] explicitly models the interdependencies between the channels of its features and adaptively improves the quality of the representations, but ignores important spatial information. To emphasize the meaningful features in both the spatial dimension and the channel dimension, we adopt channel and spatial attention modules. These two attention modules compute complementary attentions, determining "what" and "where" to focus on, respectively. The two modules can be integrated in a parallel [39] or sequential manner [40]. To select the optimal protocol, we perform ablation experiments, and finally extract attention features in a sequential manner. Details of the ablation experiment are given in Sect. 4.2.
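As a concrete illustration, the aggregation step can be sketched in PyTorch (a minimal sketch, not the authors' code; the 31 × 31 spatial size is a hypothetical value):

```python
import torch

# After the stride/atrous modifications, the last three blocks produce maps
# of the same spatial size; each is assumed here to have 256 channels, so
# channel-wise concatenation yields 3 x 256 channels.
f3 = torch.randn(1, 256, 31, 31)  # conv3 output
f4 = torch.randn(1, 256, 31, 31)  # conv4 output (stride 8, atrous rate 2)
f5 = torch.randn(1, 256, 31, 31)  # conv5 output (stride 8, atrous rate 4)

phi = torch.cat([f3, f4, f5], dim=1)  # phi(X) with 3 x 256 = 768 channels
print(phi.shape)
```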
In our Siamese attention network, we apply channel and spatial attention modules in each ResBlock, ensuring that the output of each of the last three layers contains channel attention and spatial attention. First, given a feature map f ∈ R^{c×h×w} as input to one block of our Siamese attention network, the feature map sequentially passes through three consecutive convolution layers, namely 1 × 1, 3 × 3, and 1 × 1 convolutions. The 1 × 1 convolutions are used to reduce and then restore the channel dimensions, while the 3 × 3 convolution can change or maintain the feature map size. Next, the feature map F ∈ R^{C×H×W} is input into the channel attention module. As shown in Fig. 3, the channel attention module consists of a maximum pooling layer, an average pooling layer and a shared network. It has been confirmed that exploiting both average-pooled and max-pooled features improves the representation power of networks far more than using either independently. After the pooling operations, we obtain the average-pooled features F^c_avg and the max-pooled features F^c_max, which contain different spatial context descriptions. Both are then forwarded to a shared network that consists of a multi-layer perceptron (MLP) with one hidden layer, and the two outputs are merged by element-wise summation. The channel attention map M_c ∈ R^{C×1×1} can be computed as:

M_c(F) = σ(MLP(F^c_avg) + MLP(F^c_max)),

where MLP(F^c_avg) and MLP(F^c_max) represent the average-pooled and max-pooled features processed by the multi-layer perceptron, respectively, σ represents the sigmoid function, and M_c(F) represents the channel attention features.
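The channel attention described above can be sketched as follows (a CBAM-style PyTorch sketch under our own assumptions; the reduction ratio of 16 is not specified in the text):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Average- and max-pooled descriptors pass through a shared
    one-hidden-layer MLP and are merged by element-wise summation."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(  # shared MLP (1x1 convs act per channel)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))  # F^c_avg
        mx = self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))   # F^c_max
        return torch.sigmoid(avg + mx)  # M_c(F) in R^{C x 1 x 1}

f = torch.randn(2, 256, 31, 31)
m_c = ChannelAttention(256)(f)
refined = f * m_c  # multiply the attention map back onto the input feature
```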
M_c(F) is multiplied by the input feature F before being sent to the spatial attention module.
As shown in Fig. 4, the spatial attention module consists of a maximum pooling layer, an average pooling layer and a standard convolution layer. We first apply average pooling and maximum pooling operations along the channel axis to generate the average-pooled features F^s_avg ∈ R^{1×H×W} and max-pooled features F^s_max ∈ R^{1×H×W}. F^s_avg and F^s_max are then concatenated and convolved by the convolution layer, producing a 2D spatial attention map M_s(F) ∈ R^{H×W}. The spatial attention can be computed as:

M_s(F) = σ(f([F^s_avg; F^s_max])),

where f denotes the standard convolution layer, [·; ·] denotes concatenation and σ is the sigmoid function. The values of F are then propagated along the spatial dimension and multiplied with the spatial attention features to obtain the refined output F′.
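A corresponding sketch of the spatial attention module (the 7 × 7 kernel size is an assumption; the text only states a standard convolution layer):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Channel-wise average and max maps are concatenated and convolved
    to a single-channel 2D attention map."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        avg = torch.mean(f, dim=1, keepdim=True)  # F^s_avg in R^{1 x H x W}
        mx = torch.amax(f, dim=1, keepdim=True)   # F^s_max in R^{1 x H x W}
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s(F)

f = torch.randn(2, 256, 31, 31)
m_s = SpatialAttention()(f)
refined = f * m_s  # refined output, broadcast over the channel dimension
```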
After adding the residual connection to F′, the final output of a ResBlock is obtained. As shown in Fig. 5, we visualize the class activation maps of several sample images generated by ResNet-50 and by our Siamese attention network. These images are selected from the test set of LaSOT [41]. Our Siamese attention network captures the target object area more accurately. Even in challenging scenarios with occlusion, similar objects and background clutter, our Siamese attention network still shows superior perception capability over ResNet-50.

Deformable Feature Fusion Module
As illustrated in Fig. 2, the proposed deformable feature fusion (DFF) module takes a pair of convolutional features computed by the Siamese attention network as inputs. The DFF module consists of two convolutional layers: a deformable convolutional layer and an ordinary convolutional layer. This module reduces the number of channels for subsequent calculation, and greatly enhances the framework's capability of modeling geometric deformations.
Inspired by [42], our DFF module contains two components, namely a 3 × 3 deformable convolution layer f_{3×3} and a 1 × 1 ordinary convolution layer f_{1×1}. Convolution and pooling units in CNNs usually have fixed geometric structures. However, for visual object tracking, it is critical to model complex geometric deformations. Tracked objects often undergo large deformations due to various factors, especially non-rigid objects, which affects tracking performance. The regular lattice sampling of standard convolution is responsible for the network's difficulty in adapting to geometric deformation. To weaken this limitation, we add deformable convolution to extract the features. Deformable convolution adds 2D offsets to the position of each sampling location in the convolution kernel. With these offsets, the convolution kernel can sample freely around the current position and is no longer limited to the regular lattice.
The output resolutions of the deformable convolution and the ordinary convolution are the same. To obtain richer features, we add the two output feature maps directly:

Φ(X) = f_{3×3}(X) ⊕ f_{1×1}(X),

where f_{3×3}(·) is the deformable convolution, f_{1×1}(·) is the ordinary convolution, and ⊕ denotes element-wise summation. Φ(X) is the output of the deformable feature fusion module, which is adopted as the input to the target localization module. The DFF module not only allows the sampling grid to deform freely but also reduces the number of channels to 256. Without extensive data augmentation or complex model parameters, the DFF module enhances the geometric deformation modeling capability of our tracker. Through dimension reduction, the number of parameters is significantly reduced, and the computing speed increases accordingly. The module is therefore particularly suitable for tracking scenarios where the visual appearance of a target changes severely over time.

Target Localization Module
In our end-to-end framework, each position (i, j) in the response map R can be mapped back to a position (x, y) in the input search region. Unlike RPN-based trackers, our network directly classifies and regresses the target bounding box at each position. Such an anchor-free tracker avoids complex parameter tuning and reduces manual intervention. The target localization module has three branches: a classification branch, a regression branch, and a center-ness branch.
The classification branch predicts the label for each position. It outputs a classification feature map A^cls_{w×h×2}. As shown in Fig. 2, each point (i, j, :) in A^cls_{w×h×2} contains a 2D vector, which represents the foreground and background scores of the corresponding location in the input search region. The regression branch computes the target bounding box at each location. Similarly, each point (i, j, :) in A^reg_{w×h×4} contains a 4D vector t(i, j) = (l, t, r, b), which represents the distances from the corresponding location to the four sides of the bounding box in the input search region. The center-ness branch, in parallel with the classification branch, is used to remove outliers. As shown in Fig. 2, the center-ness branch outputs a feature map A^cen_{w×h×1}, where each element gives the center-ness score of the corresponding location.
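The three branches can be sketched as follows (the single-convolution heads and the exponential applied to the regression outputs are illustrative assumptions, not the authors' exact design):

```python
import torch
import torch.nn as nn

class LocalizationHead(nn.Module):
    """Three parallel output branches over the response map R; the paper
    specifies only the output shapes A^cls, A^reg and A^cen."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.cls = nn.Conv2d(channels, 2, 3, padding=1)  # A^cls: fg/bg scores
        self.reg = nn.Conv2d(channels, 4, 3, padding=1)  # A^reg: (l, t, r, b)
        self.cen = nn.Conv2d(channels, 1, 3, padding=1)  # A^cen: center-ness

    def forward(self, r: torch.Tensor):
        # the four distances must be non-negative; exp is one common choice
        return self.cls(r), torch.exp(self.reg(r)), self.cen(r)

r = torch.randn(1, 256, 25, 25)               # response map R
a_cls, a_reg, a_cen = LocalizationHead()(r)
```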
Since the ratio between the target and background areas in the search region is not very large, we do not need to deal with the problem of sample imbalance. We therefore simply use cross-entropy loss for classification and DIOU [43] loss for regression. IOU [44] loss and generalized IOU (GIOU) [45] loss suffer from slow convergence and inaccurate regression. When two boxes do not intersect, the IOU loss is always 1 and cannot provide an optimization direction. GIOU addresses this by adding a penalty term. Although GIOU relieves the gradient vanishing problem for non-overlapping cases, it ignores the central point distance. Distance-IOU (DIOU) loss adds a penalty term to the IOU loss that minimizes the normalized distance between the predicted box and the target box, and converges much faster in training than the IOU and GIOU losses. Thus we use DIOU loss instead of IOU loss. The DIOU and its loss function can be expressed as:

DIOU = IOU − ρ²(b, b^gt) / c²,
L_DIOU = 1 − DIOU,

where b and b^gt respectively represent the center points of the predicted bounding box and the ground-truth bounding box, ρ represents the Euclidean distance between the two center points, and c represents the diagonal length of the smallest enclosing box covering the two boxes.
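The DIOU loss above can be implemented directly; the following sketch (our own, for corner-format boxes) illustrates both properties: zero loss for identical boxes, and a nonzero, informative loss even for disjoint boxes where plain IOU gives no signal:

```python
import torch

def diou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L_DIOU = 1 - IOU + rho^2(b, b_gt) / c^2 for (x1, y1, x2, y2) boxes."""
    # intersection area
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    # union area and IOU
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter)
    # squared distance between the two box centers: rho^2(b, b_gt)
    centers_p = (pred[..., :2] + pred[..., 2:]) / 2
    centers_t = (target[..., :2] + target[..., 2:]) / 2
    rho2 = ((centers_p - centers_t) ** 2).sum(-1)
    # squared diagonal of the smallest enclosing box: c^2
    enc_lt = torch.min(pred[..., :2], target[..., :2])
    enc_rb = torch.max(pred[..., 2:], target[..., 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(-1)
    return 1 - iou + rho2 / c2

box = torch.tensor([[0.0, 0.0, 4.0, 4.0]])
loss_same = diou_loss(box, box)  # identical boxes: IOU = 1, distance 0
loss_far = diou_loss(box, torch.tensor([[8.0, 8.0, 12.0, 12.0]]))  # disjoint
```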

Training Loss
Our model is trained in an end-to-end manner, and the overall loss is obtained by adding up the losses of the three branches:

L = λ1 L_cls + λ2 L_cen + λ3 L_reg,

where L_cls represents the cross-entropy loss for classification, L_cen represents the center-ness loss, and L_reg represents the regression loss calculated by the DIOU loss. The weight parameters λ1, λ2 and λ3 balance the three branches. In our implementation, we empirically set λ1 = 1, λ2 = 1 and λ3 = 3.
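As a worked example of the weighting (the per-branch loss values here are hypothetical placeholders; only the λ weights come from the paper):

```python
# Overall training objective: L = l1*L_cls + l2*L_cen + l3*L_reg
lambda1, lambda2, lambda3 = 1.0, 1.0, 3.0  # weights used in the paper
l_cls, l_cen, l_reg = 0.40, 0.25, 0.10     # hypothetical branch losses
total = lambda1 * l_cls + lambda2 * l_cen + lambda3 * l_reg
print(total)  # approximately 0.95; the regression term dominates via lambda3
```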

Implementation Details
We adopt the modified ResNet-50 as the backbone Siamese subnetwork, and initialize it with weights pre-trained on ImageNet. During training, the batch size is set to 32 and 20 epochs are performed using stochastic gradient descent (SGD) with an initial learning rate of 0.001. For the first 10 epochs, the parameters of the Siamese subnetwork are frozen while training the classification and regression subnetworks. For the last 10 epochs, the last 3 blocks of ResNet-50 are unfrozen and trained together. We train our SiamADT using public datasets: COCO, ImageNet DET, ImageNet VID, YouTube-BB, GOT-10K and
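The two-stage freeze/unfreeze schedule can be sketched as follows (the module names and the tiny stand-in layers are hypothetical; only the schedule itself comes from the text):

```python
import torch

# Stand-in model: "backbone" plays the role of the modified ResNet-50,
# "head" the classification/regression subnetworks.
model = torch.nn.ModuleDict({
    "backbone": torch.nn.Linear(8, 8),
    "head": torch.nn.Linear(8, 6),
})

# Epochs 1-10: backbone frozen, only the head is trained.
for p in model["backbone"].parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.001)

# ... after epoch 10: unfreeze the last blocks (here, the whole stand-in
# backbone) and continue training; a real setup would rebuild or extend
# the optimizer's parameter groups at this point.
for p in model["backbone"].parameters():
    p.requires_grad = True
```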

Ablation Study
We first study the effectiveness of the individual components of SiamADT by conducting an ablation study on OTB100, using the AUC score and precision score as evaluation measures as in Sect. 4.3. We use SiamCAR as the baseline. As shown in Table 1, the baseline achieves an AUC of 0.646. By adding attention modules to the baseline in a parallel (Attention_1) or sequential manner (Attention_2), the AUC is improved to 0.655 and 0.670, respectively. However, the tracker with Attention_1 achieves better precision than the tracker with Attention_2, so we added the DFF module for a further comparison. From the last 3 rows of Table 1, we find that incorporating DFF improves the performance of the tracker with Attention_2 but deteriorates the performance of the tracker with Attention_1. This indicates that the combination of Attention_2 and DFF achieves more significant results and better improves the performance of the tracker. Moreover, to address the disadvantages of IOU loss, we replace it with DIOU loss. Compared with the baseline, this further improves the AUC by +3.1%, which demonstrates that DIOU loss is necessary and effective.

Comparison with State-of-the-Art Trackers
We compare our SiamADT tracker with state-of-the-art trackers on four benchmark datasets: OTB100, UAV123, LaSOT and TC128. The SiamADT tracker achieves state-of-the-art results and runs at a real-time speed of about 30 FPS. All trackers are ranked using the precision score in precision plots and the area under curve (AUC) in success plots.
On OTB100. OTB100 consists of 100 sequences with both gray-scale and color images, involving 11 challenging attributes of object tracking. Our SiamADT tracker is compared with several deep Siamese network based trackers, including TADT [46], SiamCAR [18], SiamRPN [15], SiamMask [34], SiamDW [47], UDT [48], UDT+ [48] and SiamTri [13]. We also select two advanced Transformer-based trackers, SiamTPN [49] and TransT [50]. Figure 6 shows the quantitative results. Owing to the powerful representation capabilities of the Transformer, SiamTPN achieves the best precision and AUC scores. Our SiamADT tracker is the runner-up, and it surpasses TransT by 1.6% and 0.8% in terms of precision and success score, respectively. Compared to the baseline tracker SiamCAR, our SiamADT improves the precision and AUC scores by 5.9% and 5.7%, respectively, which clearly demonstrates the effectiveness of our proposed modules. Moreover, we compare the average overlap precision of SiamADT and other trackers under 11 challenging attributes in Table 2. Owing to the Siamese attention network and the DFF module, SiamADT achieves excellent performance, especially for deformation (DEF, 0.894), scale variation (SV, 0.906) and background clutter (BC, 0.833). Table 3 shows the tracking speed of each algorithm on OTB100. Our framework incurs a certain computational cost, running at about 30 FPS, which is lower than SiamRPN, SiamRPN++ and SiamFC++. However, our algorithm is more accurate and robust than most Siamese network based tracking algorithms, and meets the requirements of real-time tracking.
On UAV123. UAV123 contains 123 sequences captured from a low-altitude aerial perspective. We compare our tracker with 12 other trackers, including SiamTPN [49], SiamMask [34], HiFT [51], SiamCAR [18], SiamAPN++ [12], SiamAPN [52], SiamRPN [15] and SiamDW [47]. Among these, HiFT, SiamTPN, SiamAPN and SiamAPN++ are UAV object trackers. Figure 7 shows the success and precision plots. Our proposed SiamADT ranks second behind SiamTPN, but compared with the baseline tracker SiamCAR, it improves the precision score by 3.1% and the success AUC score by 2.3%. At the same time, SiamADT surpasses three UAV object trackers (HiFT, SiamAPN, SiamAPN++). UAV object tracking mainly focuses on small-scale objects. While these UAV object trackers also use different attention extraction modules, the presence of anchors undoubtedly hinders their overall performance. We therefore conclude that the anchor-free strategy is well suited to UAV tracking.
On LaSOT. LaSOT is a high-quality, large-scale dataset with an average sequence length of 2,500 frames. It covers different challenges from the wild, where the object may disappear and reappear in the view, which tests the ability of a tracker to re-capture the object. We evaluate our SiamADT tracker on the test set consisting of 280 sequences, and compare it with 15 Siamese network based trackers including SiamRPN++ [35], SiamMask [34], UpdateNet [53], SiamAPN++ [12], C-RPN [54], SiamAPN [52], SiamRPN [15], SiamCAR [18], SiamFC [11], StructSiam [55], SiamDW [47], DSiam [56] and SINT [32]. The success and precision plots are illustrated in Fig. 8. Our SiamADT ranks first in terms of both AUC and precision, surpassing the other Siamese network based trackers. The results show that the attention module improves the model's perception ability in complex scenes, and that the combination of the anchor-free strategy and the DFF module enables the model to handle changes of target scale more flexibly.
On TC128. TC128 is a color tracking benchmark with 128 color sequences. We compare our method with 15 state-of-the-art approaches including SiamBAN [17], TransT [50], SiamRPN [15], SiamCAR [18] and so on. As shown in Fig. 9, SiamADT attains the best precision (0.817) and AUC (0.606) scores, which are 1.4% and 1.0% higher than the runner-ups (SiamBAN and Ocean). Compared to the baseline tracker SiamCAR, SiamADT achieves 4.2% and 2.3% improvements in precision and AUC scores, respectively. The results demonstrate that our tracker can use the rich color information to improve object tracking performance.

Conclusion
In this paper, we have presented a Siamese attention and deformable features fusion network for visual object tracking. We introduce a convolutional block attention module consisting of both channel and spatial attention into the Siamese network. Our Siamese attention network module can capture the target object area more accurately, making the objects more discriminative against distractors and background. Additionally, a deformable features fusion module is designed to further improve the robustness of the extracted features against challenging factors such as deformation, rotation and scale variation. Extensive experiments conducted on four benchmarks demonstrate that our method obtains competitive results at real-time running speed. Considering the remarkable achievements of Transformer-based trackers, we will continue our research using Transformers as the backbone network in future work.

Fig. 1 Comparisons of the proposed SiamADT with four state-of-the-art trackers on three challenging sequences of OTB100. Our SiamADT can accurately predict the bounding boxes even when the objects suffer from large scale variation, deformation, rotation and similar distractors, while TransT, SiamCAR and SiamMask give much rougher results, and UDT drifts to the background

Fig. 6 Success and precision plots on OTB100

Fig. 7 Success and precision plots on UAV123

Fig. 8 Success and precision plots on LaSOT

Fig. 9 Success and precision plots on TC128

Table 2 Comparisons on OTB100 with 11 challenging attributes