SiamLight: lightweight networks for object tracking via attention mechanisms and pixel-level cross-correlation

Although Siamese-based trackers have achieved great success in recent years, researchers have focused more on tracking accuracy than on model complexity, which makes such trackers inapplicable in some scenarios and can severely limit their real-time speed. In this work, we propose a lightweight tracking network called SiamLight. MobileNet-V3 is selected as the backbone network. The PG-corr module is added as the feature fusion module; this strategy decomposes the template feature into spatial and channel kernels, reducing the matching region and suppressing the effect of similar distractors. In addition, we add the CSM module, which applies channel and spatial attention simultaneously. The CSM module not only introduces few parameters but can also be integrated into existing network architectures as a plug-and-play component. Finally, multiple separable convolution blocks are added to the classification and regression branches to meet our lightweight requirements on parameters and Flops. Experiments on the LaSOT, VOT2018, VOT2019, OTB100, and UAV123 benchmarks show that the method requires fewer Flops and parameters than state-of-the-art trackers.


Introduction
Visual object tracking is one of the most fundamental yet challenging tasks in computer vision, with a wide range of applications such as intelligent transportation, video surveillance, and human-computer interaction. In the past few years, object tracking has made significant progress due to the rise of deep neural networks. However, as tracking performance has improved, state-of-the-art trackers have become increasingly heavy and expensive. For example, the seminal SiamFC [1] uses 2.7 G Flops and 25.9 M parameters, while the more recent SiamBAN [2] and LightTrack [3] require 6.19 G and 0.53 G Flops, with 10.85 M and 1.97 M parameters, respectively. Figure 1 compares the performance and real-time speed of SiamLight on the VOT2019 benchmark. The size of the circles in the left and right plots indicates model complexity (Flops) and the number of parameters, respectively, with bigger circles denoting larger values. Our method has the fewest Flops and parameters among the compared trackers while maintaining good performance at real-time speed.
Two main approaches address the problem of model complexity: model compression (pruning, quantization, distillation) and compact model design. The former reduces a model's deployment cost by shrinking the number of parameters and increasing inference speed, but such methods discard part of the learned information and inevitably degrade performance. The latter includes networks such as MobileNet [4] and ShuffleNet [5], but these designs rely heavily on human expertise and experience. In this paper, a more lightweight network is designed using MobileNet-V3 from the MobileNet series as the backbone for feature extraction.
Meanwhile, Siamese-based trackers have recently gained much attention for their balance of speed and performance. In Siamese networks, cross-correlation is the core operation that embeds the information of the two branches. SiamFC [1] uses a naive correlation to obtain a single-channel response map for target localization. Later, SiamRPN++ [6] applied depth-wise correlation, which provides a non-Siamese feature that allows the template and search branches to focus on different content. Both naive correlation and depth-wise correlation (DW-corr) [7] correlate the whole template feature, as a kernel, with the search-region feature to produce the response map, which blurs spatial information; neither method is conducive to preserving edge information. Unlike these two methods, PG-corr [8] matches each part of the template feature, as a kernel, against the search feature, which preserves edge information well and largely avoids the window-blurring effect. However, experiments show that if pixel-level matching alone is added to our method, the matching region becomes much larger than the target region, introducing a large amount of background noise and leading to inaccurate matches or matching failures.
To address the above limitations of cross-correlation, we propose a new Siamese architecture, inspired by [8], that decomposes the template features into 1 × 1 spatial and channel kernels to reduce the matching region. In the feature fusion module, we therefore incorporate the Pixel to Global Correlation (PG-corr) module, which effectively resists the interference of background information and matches more accurate targets. In addition, before the PG-corr module, inspired by [9], we add the Channel and Spatial Module (CSM), which applies attention along the channel and spatial dimensions respectively. The CSM module consists of the Channel Attention Module (CAM) and the Spatial Attention Module (SAM); it introduces few additional parameters while improving the performance of the model.
In summary, our main contributions can be summarized as follows:
• We design a new tracking architecture with fewer Flops and parameters than some state-of-the-art trackers, which not only reduces the computational burden but also improves the performance of the tracker.
• To suppress the interference of background clutter, we adopt the PG-corr module instead of the usual cross-correlation operation and use the CSM module to reduce model parameters and computation.
• Using multiple depth-wise separable convolution blocks in the classification and regression heads, the improved tracking head meets our requirements for lightweight parameters and Flops.

Siamese network-based trackers
The task of object tracking was first formulated as a similarity learning problem in SINT [10] and SiamFC [1]. SiamFC constructs a Siamese network consisting of feature extraction and a cross-correlation layer (Xcorr) that embeds the template and search features. It uses the template features as kernels and performs a convolution operation on the search region to obtain a single-channel response map. With the development of region proposal networks (RPN), Bo Li et al. [11] successfully applied them to Siamese networks to form SiamRPN, which outputs a multi-channel response map through an up-channel cross-correlation layer (Up-Xcorr) cascading multiple independent cross-correlation layers. Ocean [12] designed an anchor-free tracking network that avoids the complex hyper-parameter settings of anchors. The anchor-free tracker SiamBAN [2] solves the problem of inconsistent classification and regression by directly classifying objects and regressing their bounding boxes in a unified FCN [13], which makes the method more flexible and general. STMTrack [14] is a tracking framework based on spatio-temporal memory networks that can make full use of the historical information associated with a target, and can therefore better adapt to changes in the target's appearance during tracking. SiamCAR [15] considers all pixels in the search region for prediction and estimates their classes as well as their distances to the four edges of the target bounding box. Unlike Siamese-style feature extraction, SBT [16] embeds cross-image feature correlation deep into multiple layers of the feature network. However, these advanced trackers have become increasingly complex, requiring ever more parameters and Flops. LightTrack [3] designed a more lightweight and efficient tracker using neural architecture search (NAS).
Inspired by SiamBAN, this paper designs a new, more lightweight anchor-free network architecture that avoids a large number of parameters and Flops.

Object tracking based on attention mechanisms
SiamAttn [17] proposed a deformable Siamese attention network to improve the feature learning capability of the Siamese tracker. It consists of two parts, namely, deformable self-attention and cross-attention. The former learns more powerful contextual information through spatial and channel attention mechanisms; the latter is dedicated to the problem of information interaction between the template and search branches during feature extraction. SiamGAT [18] applied graph attention to Siamese networks, using a target-aware region selection mechanism instead of prefixed regions to select template features, adapting to variations in object size and aspect ratio. TAPL [19] designed an attention-oriented part localization network to directly predict target locations and determine the final bounding box from the distribution of the targets. Building on these studies, a lightweight CSM attention module is added to our network; experiments show that it improves tracking performance while adding few parameters and little computation.

Proposed method
In this section, we describe the proposed SiamLight network, shown in Fig. 2. SiamLight consists of a Siamese backbone, a feature fusion module, and a box adaptive head. The feature fusion module includes the CSM and PG-corr modules, which add little complexity in parameters while improving the accuracy of the tracker. Before the classification and regression branches, we incorporate multiple depth-wise separable convolution blocks. The adaptive head consists of a classification branch, which predicts whether each point on the response map corresponds to the foreground or the background, and a regression branch, which predicts the offset between each point mapped into the search region and the ground-truth bounding box.
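To make the anchor-free head concrete, the sketch below decodes one regression prediction back into a bounding box. The stride value, the map origin, and the (l, t, r, b) offset layout are illustrative assumptions, not values taken from the paper.

```python
def decode_box(i, j, offsets, stride=8, origin=0):
    """Map a response-map point (i, j) back into the search image and apply
    the predicted (left, top, right, bottom) offsets to recover a box.

    `stride` and `origin` are hypothetical values for illustration."""
    # center of the receptive field of point (i, j) in the search image
    x = origin + j * stride
    y = origin + i * stride
    l, t, r, b = offsets
    return (x - l, y - t, x + r, y + b)  # (x1, y1, x2, y2)
```

The classification branch would then keep only boxes decoded at points scored as foreground.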

Siamese network architecture
To design a lightweight tracker, this work uses MobileNet-V3 as the backbone network, which converts the input images into feature maps. The backbone alleviates the problem of feature degradation and reduces computation while making better use of the input information. However, shallow features are not necessarily suitable for tracking: their low-level semantics may introduce a lot of noise during feature extraction. To solve this problem, the last block of features of the backbone is extracted and used as the output layer.
As can be seen in Fig. 2, the Siamese backbone consists of two identical branches. The template patch is denoted as z and the search patch as x. The two branches share the same CNN architecture and map the original images into feature space through the same feature mapping operation φ(·); the output feature maps are denoted by φ(z) and φ(x), respectively. To embed the information from the two branches, φ(z) is used as a kernel in a cross-correlation operation with φ(x) to obtain a response map S. The response map S must carry rich information, because it is decoded in the subsequent prediction sub-network to obtain the location and scale of the target. However, the popular naive or depth-wise correlations commonly used by previous researchers are not suitable here. Naive correlation generates only a single-channel response map, which lacks useful feature information. Depth-wise correlation lets each kernel convolve only one channel, so little information is exchanged between channels and some channel information is missing in the subsequent process. We therefore use the PG-corr correlation layer to generate multiple semantic similarity maps:

S = φ(z) ⋆ φ(x),

where ⋆ computes the similarity of each pixel between the template feature and the search feature. The template branch φ(z) is decomposed into channel and spatial kernels to suppress the interference of background information by reducing the matching region; the details are described in Sect. 3.2. The response map S has the same number of channels, height, and width as φ(x), and it contains a large amount of information for classification and regression.
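To make the contrast with PG-corr concrete, here is a minimal NumPy sketch of depth-wise correlation: each template channel is used as a kernel only on its matching search channel (a naive correlation would instead sum this output over channels into a single map). The shapes and the absence of padding are simplifying assumptions.

```python
import numpy as np

def dw_corr(z, x):
    """Depth-wise cross-correlation.
    z: template feature (C, hz, wz); x: search feature (C, hx, wx),
    with hx >= hz and wx >= wz. Returns a (C, hx-hz+1, wx-wz+1) map."""
    C, hz, wz = z.shape
    _, hx, wx = x.shape
    out = np.zeros((C, hx - hz + 1, wx - wz + 1))
    for c in range(C):                      # one kernel per channel
        for i in range(hx - hz + 1):
            for j in range(wx - wz + 1):
                out[c, i, j] = np.sum(z[c] * x[c, i:i + hz, j:j + wz])
    return out
```

Because channels never mix, information exchange between channels is limited, which is the drawback noted above.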

Feature fusion module
The CSM module is a simple and effective feed-forward convolutional attention module and is plug-and-play: the number of channels and the aspect ratio of the features extracted from the two branches are unchanged after the CSM module. Most Siamese structures aggregate template and search-region features using naive correlation, depth-wise correlation, or PG-corr, but these methods may introduce a lot of background noise and eventually lead to incorrect target matching. To address these issues, this paper adds PG-corr to perform similarity matching between the template and search regions in the proposed Siamese network.
Channel and spatial module The CSM sequentially infers attention maps along two separate dimensions, channel and spatial, and multiplies each attention map with the input feature map for adaptive feature refinement. Figure 3 shows the general structure of the CSM. Given an input feature F ∈ ℝ^(C×H×W) (the operation is applied sequentially to the template and search features, both denoted by F), the channel attention module infers X_c ∈ ℝ^(C×1×1) and yields the weighted result F′; the spatial attention module then infers X_s ∈ ℝ^(1×H×W), and F′′ is the final refined output. In contrast to [9], we add a one-dimensional convolution before F′′ so that the number of channels of the final output F′′ equals that of the input feature. As a result, the module requires less computation and few model parameters.
The CSM consists of the CAM and SAM modules, as shown in Fig. 4. The channel attention module feeds the input features into two parallel average-pooling and max-pooling layers, generating two different channel descriptors, F^c_avg and F^c_max, whose size changes from C × H × W to C × 1 × 1. A one-dimensional convolution with kernel size k then aggregates the information of k neighbouring channels, reducing the number of channels to 1/r of the input features (r is the reduction rate). Finally, the two resulting features are summed element-wise and passed through a sigmoid, introducing non-linearity while leaving the scale of the feature map unchanged. The channel attention is computed as follows:

X_c = σ(f^k_c(AvgPool(F)) + f^k_c(MaxPool(F))),

where σ denotes the sigmoid function, f^k_c denotes the one-dimensional convolution with kernel size k, and MaxPool and AvgPool denote the max-pooled and average-pooled features, respectively.
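A minimal NumPy sketch of this channel attention, assuming the 1-D convolution weights are shared between the two pooled descriptors; the kernel values and shapes are illustrative, not the trained parameters.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def conv1d_same(v, kernel):
    """1-D convolution over a channel descriptor with 'same' padding."""
    k = len(kernel)
    pad = k // 2
    vp = np.pad(v, pad)
    return np.array([np.dot(vp[i:i + k], kernel) for i in range(len(v))])

def channel_attention(F, kernel):
    """X_c = sigmoid(f_k(AvgPool(F)) + f_k(MaxPool(F))), applied to F.
    F: (C, H, W); `kernel` is the shared 1-D conv of size k."""
    avg = F.mean(axis=(1, 2))        # (C,) average-pooled descriptor
    mx = F.max(axis=(1, 2))          # (C,) max-pooled descriptor
    xc = sigmoid(conv1d_same(avg, kernel) + conv1d_same(mx, kernel))
    return F * xc[:, None, None]     # re-weight each channel of F
```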
Similarly, the output of the CAM is processed by two sequential average-pooling and max-pooling operations along the channel axis, generating two different spatial descriptors, F^s_avg and F^s_max. A convolution layer and a sigmoid function are then applied to generate the spatial attention map X_s(F′). Finally, the resulting spatial attention is broadcast along the channel dimension to C × H × W, and the input feature map is multiplied element-wise to obtain the feature map with spatial attention injected. The spatial attention is computed as follows:

X_s = σ(f^(7×7)([AvgPool(F′); MaxPool(F′)])),

where σ denotes the sigmoid function, F′ represents the output feature map of the channel attention module, and f^(7×7) denotes the convolution operation with kernel size 7 × 7.
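The matching NumPy sketch of the spatial attention: pool along the channel axis, stack the two maps, convolve, and gate with a sigmoid. The 2-D kernel here stands in for the 7 × 7 convolution and its values are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def spatial_attention(Fp, kernel):
    """X_s = sigmoid(f([AvgPool(F'); MaxPool(F')])), applied to F'.
    Fp: (C, H, W); `kernel`: (2, k, k) weights over the stacked avg/max
    maps, standing in for the 7x7 convolution of the SAM."""
    avg = Fp.mean(axis=0)                 # (H, W) channel-wise average
    mx = Fp.max(axis=0)                   # (H, W) channel-wise max
    desc = np.stack([avg, mx])            # (2, H, W) spatial descriptors
    _, k, _ = kernel.shape
    pad = k // 2
    dp = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))
    H, W = avg.shape
    xs = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            xs[i, j] = np.sum(dp[:, i:i + k, j:j + k] * kernel)
    xs = sigmoid(xs)                      # (H, W) attention map
    return Fp * xs[None]                  # broadcast over channels
```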
Pixel to global correlation PG-corr has a strong ability to suppress background interference. In this paper, the template features are decomposed into 1 × 1 spatial kernels and channel kernels, narrowing the matching region in every search operation and thus suppressing background interference. At the same time, the response points on the target region are obtained accurately, further improving the accuracy of the predicted bounding box and outperforming existing cross-correlation operations. The architecture is shown in Fig. 5.
Firstly, the template feature Z_f ∈ ℝ^(c×h_z×w_z) is decomposed in the spatial dimension into n_z = h_z × w_z kernels of size 1 × 1 × c, where w_z and h_z denote the width and height of the template feature. To enhance channel correlation, the template feature is also decomposed along the channel dimension into c kernels of size 1 × 1 × n_z. In Fig. 5, w_x and h_x denote the width and height of the search feature X_f, respectively. Each spatial kernel is matched against every position of X_f, producing a pixel-level similarity map

S_1 = Z^s_f ⋆ X_f, with S_1 ∈ ℝ^(n_z×h_x×w_x).     (5)

The spatial kernels tend to focus on the local information of the template feature. Thus, we use the channel kernels Z^c_f to obtain the overall information and unify the similarities of the local positions. The result of Eq. (5) is then correlated with the channel kernels Z^c_f:

S_2 = Z^c_f ⋆ S_1,

where the output feature S_2 has the same size as X_f.
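The two decomposition steps above reduce to two matrix products, sketched below in NumPy under the assumption that "similarity" means an inner product between feature vectors (the paper's exact normalization is not specified here).

```python
import numpy as np

def pg_corr(Zf, Xf):
    """Pixel-to-global correlation sketch.
    Zf: template feature (c, hz, wz); Xf: search feature (c, hx, wx).
    Step 1: n_z spatial kernels of size 1x1xc match every search
    position, giving S1 of shape (n_z, hx, wx).
    Step 2: c channel kernels of size 1x1xn_z aggregate S1 into S2,
    which has the same size as Xf."""
    c, hz, wz = Zf.shape
    nz = hz * wz
    _, hx, wx = Xf.shape
    Zs = Zf.reshape(c, nz).T                    # (nz, c) spatial kernels
    X = Xf.reshape(c, hx * wx)                  # flatten search positions
    S1 = (Zs @ X).reshape(nz, hx, wx)           # pixel-level similarity
    Zc = Zf.reshape(c, nz)                      # (c, nz) channel kernels
    S2 = (Zc @ S1.reshape(nz, hx * wx)).reshape(c, hx, wx)
    return S2
```

Note that each kernel covers a single 1 × 1 location, so the matching region never exceeds one search position, which is what suppresses background interference.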

Box adaptive head
The box adaptive head consists of a classification branch and a regression branch. Unlike previous methods, we use multiple depth-wise separable convolution blocks (DSConv) after the cross-correlation operation; these have fewer parameters and lower computational cost than conventional convolutions while improving model accuracy. The classification and regression heads use at most 8 searchable layers, with kernel choices of {3, 5}. As shown in Fig. 2, the number of DSConv blocks used in the classification head is smaller than in the regression head. The reason is that the response map can already roughly identify the center and boundary of the target after the cross-correlation operation, so the classification branch can identify the target without many additional operations, whereas the regression branch must estimate the distance between each point on the feature map and the four sides of the original image and therefore needs more DSConv blocks. In the regression head, to satisfy the requirement of using different DSConv blocks for classification and regression, 4 DSConv blocks with 3 × 3 and 5 × 5 kernels are used, while the classification head has one fewer block than the regression head. After the PG-corr operation, the number of channels for classification is larger than that for regression, so the classification head first applies a DSConv with a 5 × 5 kernel to keep the number of subsequent channels constant. Experiments show that the improved tracking head meets the computational and parameter requirements of our lightweight model.
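The savings from DSConv blocks follow directly from the parameter counts. The short sketch below compares a standard convolution with its depth-wise separable counterpart; the channel sizes are illustrative, not the head's actual widths.

```python
def conv_params(cin, cout, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return k * k * cin * cout

def dsconv_params(cin, cout, k):
    """Depth-wise separable convolution: one k x k depth-wise kernel per
    input channel, followed by a 1 x 1 point-wise convolution."""
    return k * k * cin + cin * cout

# e.g. a 5x5 block mapping 256 -> 256 channels (illustrative sizes)
standard = conv_params(256, 256, 5)     # 25 * 256 * 256 = 1,638,400
separable = dsconv_params(256, 256, 5)  # 25 * 256 + 256 * 256 = 71,936
```

For this configuration the separable block uses over 20× fewer parameters, which is why stacking several DSConv blocks still keeps the head lightweight.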
In summary, the PG-corr module and box adaptive head algorithm process are shown in Algorithm 1.
Training details During training, we use ImageNet VID [28], COCO [29], GOT10k [30], and ImageNet DET [28] as our training datasets, with a batch size of 28. The network is trained end-to-end on this large-scale data with stochastic gradient descent (SGD) for 50 epochs in total. For the first 5 epochs, a warm-up learning rate rising from 0.001 to 0.005 is used; for the remaining epochs, the learning rate decays from 0.005 to 0.00005. In addition, we obtain two networks, large and small, with different numbers of layers and parameters. Both are trained with the same number of epochs and the same batch size, but the large network takes longer to train than the small one.
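The schedule above can be sketched as a small function. The paper does not state the interpolation shapes, so the linear warm-up and log-space decay below are assumptions.

```python
import math

def lr_at(epoch, total=50, warmup=5, warm_start=0.001,
          warm_end=0.005, end_lr=5e-5):
    """Learning rate at a given epoch: linear warm-up from 0.001 to 0.005
    over the first 5 epochs, then log-space decay down to 0.00005.
    The interpolation shapes are illustrative assumptions."""
    if epoch < warmup:
        t = epoch / max(warmup - 1, 1)
        return warm_start + t * (warm_end - warm_start)
    t = (epoch - warmup) / (total - warmup - 1)
    return math.exp(math.log(warm_end)
                    + t * (math.log(end_lr) - math.log(warm_end)))
```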

Results and comparisons
We compare SiamLight with state-of-the-art trackers on five tracking benchmarks. To visualize the improvement of PG-corr in background suppression, we show the classification score maps generated by different similarity-matching methods. The comparison is illustrated in Fig. 6, where the top row is generated by DW-corr and the bottom row by PG-corr, with the frame index shown in the top-left corner of each map. Figure 6a and b show that the response regions of the PG-corr score maps are concentrated on the target itself, with weaker responses in non-target areas, while the DW-corr maps respond strongly in non-target areas. In Fig. 6c and d, DW-corr is confused when similar objects are present in the background, while PG-corr is still able to distinguish the target. PG-corr is thus far more capable than depth-wise correlation of distinguishing the target from similar background objects, mainly because its pixel-level similarity matching reduces the matching area to achieve an exact match.
VOT2018 The benchmark consists of 60 challenging sequences. Its most important metric is EAO (Expected Average Overlap), which accounts for both accuracy (average overlap during successful tracking) and robustness (failure rate). We perform a visual comparison over the challenging attributes of camera motion, illumination change, occlusion, size change, and motion change; frames that correspond to none of these five attributes are marked as unassigned. The values in parentheses indicate the EAO range for each attribute and for the tracker as a whole, as shown in Fig. 7. Our SiamLight-Large performs well under illumination change, camera motion, and size change, which shows that our tracker is robust to target motion and camera motion. The results in terms of EAO, robustness, and accuracy are presented in Table 1, with our SiamLight-Small achieving EAO values 0.6% and 2.2% higher than SiamRPN++ and ATOM, respectively.
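To give intuition for how EAO combines accuracy and robustness, here is a deliberately simplified toy version: overlaps are averaged over the first Ns frames of each run (zero-padded after a failure), then averaged over a range of sequence lengths. The full VOT definition weights lengths by their empirical distribution, which this sketch omits.

```python
import numpy as np

def expected_average_overlap(runs, lengths):
    """Toy EAO sketch. `runs` is a list of per-frame IoU lists; after a
    failure a run simply ends, and missing frames count as zero overlap.
    For each length Ns, average the mean overlap of the first Ns frames
    over all runs, then average those values over `lengths`."""
    phis = []
    for ns in lengths:
        per_run = []
        for ious in runs:
            padded = list(ious[:ns]) + [0.0] * max(0, ns - len(ious))
            per_run.append(np.mean(padded))
        phis.append(np.mean(per_run))
    return float(np.mean(phis))
```

Accurate trackers raise the per-frame overlaps, while robust trackers avoid the zero-padding penalty, so both qualities raise the score.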
VOT2019 This benchmark also consists of 60 challenging sequences, 20% of which differ from VOT2018, and is more challenging in terms of fast motion and similar distractors. Table 2 shows the results in terms of EAO, accuracy, and robustness, with our SiamLight-Small using approximately 85.8% fewer Flops and parameters than the latest lightweight tracker, LightTrack.
UAV123 UAV123 is an aerial video benchmark containing 123 sequences captured from a low-altitude aerial perspective, all labeled with upright bounding boxes. Targets in this dataset undergo rapid movement, illumination changes, and occlusion, which makes it more challenging. Table 3 shows that our SiamLight-Small has a 9.8% and 5.4% higher success rate than ECO and SiamRPN, respectively. However, SiamLight-Large is still 0.1% below the baseline used in this paper, and the performance of the tracker would improve further if the computational constraints were relaxed.
OTB100 OTB100 is the most widely used public tracking benchmark, consisting of 100 well-annotated sequences; Fig. 8 shows the success and accuracy plots. Our SiamLight-Large improves the success rate by 0.7% over the baseline, and SiamLight-Small outperforms Ocean and SiamFC++ in accuracy by 1.3% and 1.4%, respectively. In addition, our method has a 0.4% higher success rate than SiamRPN++-RBO.

LaSOT LaSOT is by far the largest single-object tracking benchmark, with high-quality frame-level annotations. As shown in Table 4, SiamLight-Large outperforms the baseline in both accuracy and robustness by 1.1% and 0.9%, while using about 30 times fewer Flops (0.201 G vs. 6.2 G) and 20 times fewer parameters (0.54 M vs. 10.8 M). In Fig. 9, the success rate and accuracy of SiamLight-Large are 1.8% and 0.2% higher than the baseline, respectively.

Ablation studies
Comparison on PG-corr and CSM modules We evaluate the impact of the different components of SiamLight on VOT2019 and report the results in Table 5. With the conventional cross-correlation operation after feature extraction, the EAO on VOT2019 is only 0.302; adding the PG-corr and CSM modules in turn increases the EAO by 1.9% for SiamLight-Small and by 1.6% for SiamLight-Large. The last row of the table shows that the numbers of Flops and parameters of both SiamLight-Large and SiamLight-Small are much lower than without any of these modules.
Comparison on different tracking heads We set the number of DSConv blocks according to the size of the obtained feature channels and the requirements of the classification and regression heads. Table 6 shows that the parameters and Flops of SiamLight-Small are both about 1.3 times lower than without DSConv, while those of SiamLight-Large are about 1.2 and 1.1 times lower, respectively. At the same time, the success rates on the LaSOT benchmark are 0.8% and 0.6% higher for SiamLight-Small and SiamLight-Large, respectively, than without changing the head. Thus, our method achieves better performance than a tracking head using only plain classification and regression.

Conclusions
In this paper, we design a lightweight object tracker called SiamLight, which consists of three parts: feature extraction, feature fusion, and a tracking head. SiamLight adopts MobileNet-V3 as the backbone for feature extraction. The feature fusion part includes two modules, PG-corr and CSM. The tracking head of SiamLight differs from other methods by adding multiple separable convolution blocks. Extensive experiments on five popular benchmarks show that our method uses very few Flops and parameters compared with popular trackers, although it is not state-of-the-art on some benchmarks. In future work, we hope to propose a better approach to balance the number of model parameters and performance.