Fisher pruning for developing real-time UAV trackers

Unmanned aerial vehicle (UAV)-based tracking has shown great potential in various domains such as transportation, logistics, public safety, and more. However, deploying deep learning (DL)-based tracking algorithms on UAVs is challenging because of limitations in computing resources, battery capacity, and payload. Discriminative correlation filter (DCF)-based trackers have become a popular choice in the UAV tracking community owing to their ability to provide superior efficiency while consuming fewer resources. However, the limited representation learning ability of DCF-based trackers leads to lower precision in complex scenarios compared to DL-based methods. Filter pruning is a prevalent practice for deploying deep neural networks on edge devices with constrained resources, and it may be an effective way to solve the problems encountered when deploying deep learning trackers on UAVs. However, the application of filter pruning to UAV tracking is underexplored, and a straightforward and useful pruning standard is desirable. This paper proposes using Fisher pruning to reduce the SiamFC++ model for UAV tracking, resulting in the F-SiamFC++ tracker. The proposed tracker achieves a remarkable balance between precision and efficiency, as demonstrated through exhaustive experiments on four popular UAV benchmarks: UAVDT, DTB70, UAV123@10fps, and VisDrone2018, showing state-of-the-art performance.


Introduction
With the widespread use of UAVs, UAV-based tracking has become a new research hotspot, attracting increasing interest in visual tracking. It has broad potential applications in fields such as navigation, agriculture, transportation, aerial photography, and emergency response [1][2][3][4]. While tracking in general scenes is already challenging, UAV tracking faces even more onerous challenges. On the one hand, the motion of the UAV itself introduces great challenges to tracking accuracy, such as scale changes, motion blur, and severe occlusion; on the other hand, constrained computing capabilities, the demand for minimal power consumption, and limited battery endurance place tremendous demands on efficiency as well [1,5,6]. The present state of UAV tracking places great emphasis on efficiency, which is why DCF-based trackers are often used instead of DL-based ones [3,4]. Despite great improvements in tracking precision, DCF-based trackers still do not reach the precision of most state-of-the-art DL-based trackers. A DL-based UAV tracking method with superior speed and precision was proposed recently in [2]; it applies a lightweight backbone for efficiency and fuses features of shallow and deep layers with a hierarchical feature transformer for robust representation learning. Regrettably, despite attaining a striking trade-off between precision and efficiency and accomplishing top performance in UAV tracking, this tracker is incapable of real-time tracking on a single CPU. More importantly, this indicates that an effective and lightweight DL-based tracker may be a viable alternative to DCF-based trackers, as it can balance precision and speed. Therefore, we are motivated to develop lightweight DL-based trackers for UAV tracking and apply model compression techniques to trade precision for speed. Our goal is to develop a real-time DL-based tracker whose precision stays close to that of the original model.
Model compression aims to reduce the complexity of deep neural network models using methods such as parameter pruning and knowledge distillation while maintaining accuracy. The goal is to deploy cutting-edge deep network models in resource-constrained environments, such as UAVs and embedded devices [8]. The methods extensively researched and widely used for model compression include low-rank approximation, parameter pruning, knowledge distillation, quantization, etc. [9]. It is impractical to expect a universal compression method to compress all DL-based trackers so as to fulfill real-time demands while preserving high precision; the choice of DL-based tracker and compression technique can significantly impact both real-time performance and tracking precision. In this study, we propose using Fisher pruning [10] to reduce the SiamFC++ [11] model in order to achieve real-time UAV target tracking. This pruning method does not require additional constraints or retraining of the model, making it simple and efficient. The SiamFC++ tracker is an extension of the efficient SiamFC [12] tracker that adds a regression branch and a center-ness branch, aimed at achieving better precision and speed. Our exploration of this combination proved effective, yielding an outstanding trade-off between efficiency and precision in comparison to previous DCF- and DL-based trackers, as shown in Fig. 1. We expect that our research will promote the study and application of model compression in the field of UAV tracking. Our contributions are summarized as follows:

- We introduce Fisher pruning as a method to narrow the performance disparity between DCF-based and DL-based trackers in UAV tracking, which is a previously unexplored approach.
- We present the F-SiamFC++ tracker, which obtains an excellent balance between precision and efficiency by utilizing Fisher information as the pruning criterion to reduce the deep model SiamFC++.
- We evaluate the proposed approach on four established UAV datasets: UAVDT, DTB70, UAV123@10fps, and VisDrone2018. The experimental results show that the proposed F-SiamFC++ tracker achieves state-of-the-art performance.
The remainder of this article is organized as follows. Section 2 summarizes the related work. Section 3 provides an overview of the proposed approach. Section 4 describes the experiments and the results obtained. The final section presents the conclusions drawn from this research.

Related works
This paper improves and extends our earlier work [13] with a thorough analysis of Fisher pruning in real-time UAV tracking. In this paper, we additionally explore block-wise pruning ratios, which enable a better balance between tracking efficiency and precision. Thanks to this improvement, we achieve outstanding real-time tracking performance on a single CPU, with an average speed of over 90 FPS. Note that the previous version is denoted F-SiamFC++(v1) and the enhanced version F-SiamFC++(v2).

Visual tracking approaches
Tracking techniques have advanced rapidly with the emergence of modern visual trackers. Modern trackers fall into two major categories: DCF-based trackers, which locate targets by learning the correlation between templates and search areas, and DL-based trackers, which automatically learn features using powerful neural networks. DCF-based trackers originated from the minimum output sum of squared error (MOSSE) filter [14], an early representative in the field of visual tracking. Since then, DCF-based trackers have continuously improved their correlation learning methods and update mechanisms within the correlation filtering framework, achieving notable progress [5]. DCF-based trackers can deliver competitive performance while maintaining relatively high efficiency, thanks to their use of handcrafted features and the ability to be computed in the Fourier domain, which is why they have become popular in UAV tracking. However, handcrafted features struggle to maintain tracking stability and accuracy under complex and challenging conditions.

Recently, visual tracking has seen significant improvements in precision and robustness owing to the widespread adoption of deep learning techniques. SiamFC [12] was the first to formulate the visual tracking task as a generalized similarity learning problem, employing a Siamese network [15] to measure the similarity between target and search images. Siamese-based trackers can be further classified into two main types: anchor-based and anchor-free trackers [16]. Among anchor-based methods, SiamRPN [17] introduced a region proposal network (RPN) into Siamese networks and treated tracking as two sub-tasks accomplished by a classification branch and a regression branch. DaSiamRPN [18] incorporated an effective sampling strategy and a distractor-aware module. SiamMask [19] added a new branch to produce a pixel-wise binary mask. More recently, to leverage the powerful representation ability of deep features, researchers have increasingly studied deeper architectures for visual tracking, such as SiamRPN++ [20] and SiamDW [21]; however, these methods often come at a significant cost in efficiency. With regard to anchor-free trackers, SiamFC++ [11] proposed a novel quality assessment branch for classification, constituting a simple yet effective framework for visual tracking. After that, SiamCAR [22] harnessed this framework to reengineer the anchor-free structure and integrate multiple layers of features, delivering impressive performance gains. Besides, SiamBAN [23] proposed a new strategy for generating classification labels and regression targets. Apart from Siamese-based trackers, there exist multiple DL-based trackers that extend online discriminative frameworks with deep networks for end-to-end training, e.g., ATOM [24], DiMP [25], KYS [26], and KeepTrack [27]. Unfortunately, the efficiency of these methods is too low for real-time UAV tracking.
In summary, the development of deeper architectures in recent years has indeed substantially enhanced tracking precision, but typically at the expense of efficiency. In contrast, SiamFC++ [11] is a simple yet powerful DL-based tracking framework with a lightweight backbone and an effective quality assessment branch. Regrettably, while it exhibits remarkable GPU speed, its CPU speed is insufficient to satisfy rigorous real-time requirements (i.e., a speed of ≫ 30 FPS). Our purpose in this work is to enhance the efficiency of SiamFC++ for real-time UAV tracking by employing model compression methods while preserving its precision to the greatest extent.

Methods of filter pruning
Pruning is a widely applied method for compressing neural networks, whose pipeline conventionally includes three stages: pretraining, pruning, and fine-tuning. Four topics typically arise in this pipeline: the pruning structure, the pruning ratio, the pruning criterion, and the pruning schedule [9]. Pruning structures are commonly classified into two categories: weight pruning and filter pruning. The first eliminates individual neurons or weights, which is difficult to exploit for acceleration on general-purpose hardware [28]. The second deletes entire filters or channels; it is simpler due to the regular weight arrangement and can achieve significant acceleration [29]. The pruning ratio determines the percentage of weights to be removed. There are typically two ways to set it: directly, by specifying one global ratio or numerous layer-wise ratios, or indirectly, for example through regularization-based pruning procedures, which however require a significant amount of technical adjustment to hit a given ratio [30]. The pruning criterion indicates which weights should be removed; weight magnitude is the most fundamental criterion for weight pruning, while popular criteria for filter pruning include the Frobenius norm, filter response sparsity, and selecting the weights whose removal causes the smallest increase in loss [9]. The pruning schedule outlines how network sparsity is increased from zero to a target value, with two available approaches [9]: (1) in a single step, namely one-shot, followed by fine-tuning, or (2) progressively, with pruning and training interleaved. While the progressive approach may yield better results when more training time is available, the one-shot method is more efficient and reduces the burden of developing sophisticated training techniques. Overall, despite extensive research, there is still much room for further exploration of filter pruning methods. Fisher pruning, presented recently in [10] and developed into Group Fisher pruning in [31], has proved to be an efficient and effective filter pruning approach: it follows a one-shot schedule and adopts Fisher information as the pruning criterion, obviating the need to impose additional constraints or retrain and thereby simplifying the pruning procedure substantially. We employ this approach in our work to achieve our objective of model compression. The difference is that we use global and block-wise pruning ratios rather than layer-wise ratios: determining layer-wise pruning ratios is a tedious and time-consuming task, and our use of global and block-wise ratios significantly simplifies the pruning process.
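To make the one-shot filter-pruning pipeline concrete, the following is a minimal PyTorch sketch (not the authors' implementation): given an importance score per output filter, the lowest-scoring filters of one layer are removed together with the matching input channels of the next layer, after which the network would be fine-tuned. The function name and the magnitude-based criterion in the closing comment are illustrative assumptions.

```python
import torch
import torch.nn as nn

def one_shot_filter_prune(conv, next_conv, ratio, scores):
    """Drop the lowest-scoring output filters of `conv` and the matching
    input channels of `next_conv`; fine-tuning follows in a one-shot schedule."""
    n_keep = conv.out_channels - int(ratio * conv.out_channels)
    keep = torch.argsort(scores, descending=True)[:n_keep]   # indices of survivors

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()      # keep selected filters
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()

    nxt = nn.Conv2d(n_keep, next_conv.out_channels, next_conv.kernel_size,
                    stride=next_conv.stride, padding=next_conv.padding,
                    bias=next_conv.bias is not None)
    nxt.weight.data = next_conv.weight.data[:, keep].clone() # drop matching inputs
    if next_conv.bias is not None:
        nxt.bias.data = next_conv.bias.data.clone()
    return pruned, nxt

# Example criterion (filter magnitude): scores = conv.weight.abs().sum(dim=(1, 2, 3))
```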

Proposed method
Our F-SiamFC++ is built by pruning SiamFC++ [11] using the Fisher pruning technique introduced in [10]. Unlike previous methods, we employ both global and block-wise pruning ratios to identify the optimal pruning ratios. The details are described as follows.

F-SiamFC++ overview
Our proposed F-SiamFC++ tracker undergoes offline training and is subsequently used for online prediction. Its architecture consists of two branches: the first holds the template, and the second is responsible for searching; refer to Fig. 2 for a depiction. The input to the former is the tracking target patch Z, whereas the input to the latter is the search patch X. The two branches share a common backbone for feature extraction, represented by the mapping $\phi(\cdot)$. Before being used for the subsequent classification and regression tasks, the features from both branches interact through cross-correlation. The coupled features are defined as

$$f_i(Z, X) = \psi_i(\phi(Z)) \star \psi_i(\phi(X)), \quad i \in \{\mathrm{cls}, \mathrm{reg}\},$$

where $\psi_i(\cdot)$ denotes the task-specific layer (abbreviated as 'cls' for classification and 'reg' for regression) and $\star$ denotes the cross-correlation operation. The outputs of cls and reg are equal in size, and a center-ness branch runs parallel to the classification branch to assess the quality of the classification and ultimately reweight the classification scores. The entire training loss is summarized as

$$L = \frac{1}{N} \sum_{z} \Big( L_{\mathrm{cls}}(p_z, p^*_z) + \lambda_1 \, \mathbb{I}_{\{p^*_z > 0\}} \, L_{\mathrm{qual}}(q_z, q^*_z) + \lambda_2 \, \mathbb{I}_{\{p^*_z > 0\}} \, L_{\mathrm{reg}}(t_z, t^*_z) \Big),$$

where $z$ denotes a location on a feature map represented by its coordinates; $p_z$, $q_z$, and $t_z$ are the predicted values, while $p^*_z$, $q^*_z$, and $t^*_z$ are the corresponding target labels; $\mathbb{I}_{\{\cdot\}}$ is the indicator function; and $L_{\mathrm{cls}}$, $L_{\mathrm{qual}}$, and $L_{\mathrm{reg}}$ denote the focal loss, the binary cross-entropy loss, and the IoU loss for classification, quality assessment, and regression, respectively. Refer to [11] for more details. The constants $\lambda_1$ and $\lambda_2$ balance the losses. Note that $p^*_z$ is set to 1 if $z$ is regarded as a positive sample and 0 if it is regarded as a negative one. Our F-SiamFC++ follows the same pipeline as SiamFC++ but differs in the filters retained after filter pruning; this difference is explained in detail later on.
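For clarity, the $\star$ operation can be sketched in code. The snippet below shows the depthwise formulation commonly used in Siamese trackers, where the template features act as per-channel convolution kernels slid over the search features; whether SiamFC++ uses exactly this variant or a plain channel-summed correlation is an implementation detail, so treat this as an illustrative assumption rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(z_feat: torch.Tensor, x_feat: torch.Tensor) -> torch.Tensor:
    """Depthwise cross-correlation: correlate template features z (B, C, Hz, Wz)
    with search features x (B, C, Hx, Wx), one channel at a time."""
    b, c, hx, wx = x_feat.shape
    x = x_feat.reshape(1, b * c, hx, wx)                  # fold batch into channels
    kernel = z_feat.reshape(b * c, 1, *z_feat.shape[2:])  # one kernel per channel
    out = F.conv2d(x, kernel, groups=b * c)               # grouped conv = per-channel corr
    return out.reshape(b, c, out.shape[2], out.shape[3])
```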

Fisher pruning
Fisher information quantifies the amount of information that an observable random variable carries about an unknown parameter of the distribution modeling the variable [32]. Formally, it is the variance of the score, or the expected value of the observed information. Fisher pruning is a filter pruning technique that builds its pruning criterion on Fisher information: it uses the Fisher information measure [10] to discard feature maps that have little impact on the model's overall performance. Let $Q_\theta(z \mid I)$ denote the model, where $I$, $z$, and $\theta$ correspond to the input, the output, and the parameters to be trained, respectively. Without loss of generality, assume the objective of training is to minimize the following loss function [10]:

$$L(\theta) = \mathbb{E}_{P(I)}\left[ -\log Q_\theta(z \mid I) \right],$$

where $P(I)$ is a specific data distribution with respect to which the expectation is taken. Fisher pruning uses a second-order approximation of the change in $L$ for a small change $d$ in the parameters:

$$L(\theta + d) - L(\theta) \approx g^\top d + \frac{1}{2} d^\top H d,$$

where the gradient and the Hessian matrix are denoted by $g = \nabla L(\theta)$ and $H = \nabla^2 L(\theta)$, respectively. Dropping the $k$-th parameter $\theta_k$ corresponds to $d = -\theta_k e_k$, where $e_k$ denotes the one-hot vector whose $k$-th element is 1, and results in the change in loss

$$\Delta L_k \approx -g_k \theta_k + \frac{1}{2} H_{kk} \theta_k^2.$$

If the model has converged during training, we have $\nabla L(\theta) \approx 0$, which simplifies the expression above to

$$\Delta L_k \approx \frac{1}{2} H_{kk} \theta_k^2.$$

For the diagonal entries of the Hessian matrix, it follows that

$$H_{kk} = \mathbb{E}_{P(I)}\left[ -\frac{\partial^2}{\partial \theta_k^2} \log Q_\theta(z \mid I) \right].$$

If $Q_\theta(z \mid I)$ has been trained to be close to the ideal distribution $P(z \mid I)$, then

$$H_{kk} \approx \mathbb{E}\left[ \left( \frac{\partial}{\partial \theta_k} \log Q_\theta(z \mid I) \right)^2 \right].$$

Moreover, the Hessian matrix reduces exactly to the Fisher information matrix if $Q_\theta(z \mid I)$ and $P(z \mid I)$ are equivalent and $Q_\theta(z \mid I)$ is twice differentiable with respect to $\theta$ [10]. In accordance with Fisher pruning, the change in loss resulting from pruning an individual parameter $\theta_k$ can be estimated with the help of $N$ samples as

$$\Delta L_k \approx \frac{\theta_k^2}{2N} \sum_{n=1}^{N} \left( \frac{\partial L_n}{\partial \theta_k} \right)^2,$$

where $L_n$ stands for the loss of the $n$-th sample. As our goal is to achieve speedups by pruning entire feature channels rather than individual parameters, we denote the $k$-th filter by $\theta_k$ and its parameter at position $(i, j)$ by $\theta_{kij}$; the resulting change in loss when the $k$-th filter is removed is then

$$\Delta L_k \approx \frac{1}{2N} \sum_{i,j} \theta_{kij}^2 \sum_{n=1}^{N} \left( \frac{\partial L_n}{\partial \theta_{kij}} \right)^2,$$

which Fisher pruning seeks to minimize.
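In practice, the per-filter estimate above can be accumulated from gradients during a few forward/backward passes. The sketch below is one plausible PyTorch realization under the stated assumptions (per-sample gradients, i.e., batch size 1, and a converged model); it is not the authors' code, and `loader` and `loss_fn` are placeholders.

```python
import torch
import torch.nn as nn

def fisher_filter_scores(model: nn.Module, conv: nn.Conv2d,
                         loader, loss_fn, n_samples: int = 500) -> torch.Tensor:
    """Estimate Delta L_k for each output filter of `conv` via the empirical
    Fisher: (1/2N) * sum_{i,j} theta_kij^2 * sum_n (dL_n/dtheta_kij)^2.
    Assumes `loader` yields (input, target) pairs with batch size 1."""
    sq_grad = torch.zeros_like(conv.weight)
    n = 0
    for inputs, targets in loader:
        if n >= n_samples:
            break
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()   # per-sample gradient
        sq_grad += conv.weight.grad.detach() ** 2    # accumulate g_{n,kij}^2
        n += 1
    # Sum the per-parameter estimates over each filter's (in-channel, i, j) axes.
    return (conv.weight.detach() ** 2 * sq_grad).sum(dim=(1, 2, 3)) / (2 * n)
```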

Fisher pruning schedule
Let a group of 3-D filters represent the $i$-th ($i \in [1, K]$) convolutional layer $C_i$ of SiamFC++. Let $n_i$ be the number of filters in $C_i$, $k_i$ the kernel size, and $w_i^j \in \mathbb{R}^{n_{i-1} \times k_i \times k_i}$ the $j$-th filter; the set of 3-D filters is then $W_i = \{ w_i^j \in \mathbb{R}^{n_{i-1} \times k_i \times k_i},\; 1 \le j \le n_i \}$. The procedure of Fisher pruning can be described in detail as follows. First, a Fisher information set $\{ F_i = \{ f_i^j \}_{j=1}^{n_i} \}_{i=1}^{K}$ is created by calculating the Fisher information of every filter in each layer. Second, each $F_i$ is sorted from highest to lowest, resulting in $\tilde{F}_i = \{ f_i^{s_i^j} \}$, where $s_i^j$ is the index of the $j$-th highest value in $F_i$. Third, we derive a pruned SiamFC++ model, denoted F-SiamFC++, by experimentally determining the number of pruned filters $n_i^p$ of each layer in accordance with a global pruning ratio or block-wise ratios; after pruning, $n_i^p$ specifies the number of filters removed from $C_i$. In the final step, the filters that are kept are initialized with the original weights of the trained SiamFC++ model, and the resulting compressed model, F-SiamFC++, is fine-tuned to adjust its parameters for better performance.
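The schedule itself reduces to ranking and slicing. A minimal sketch is given below, assuming the Fisher scores from the previous step; the `block_of_layer` mapping and the dictionary layout are hypothetical helpers, not the authors' interface.

```python
import torch

def plan_kept_filters(fisher_sets, block_of_layer, block_ratios):
    """Steps 2-3 of the schedule: sort each layer's Fisher scores F_i and keep
    the n_i - n_i^p filters with the highest values, where n_i^p is set by the
    block-wise ratio (rho_B, rho_N, or rho_H) of the block the layer belongs to.
    fisher_sets: {layer_name: 1-D tensor of f_i^j};
    block_of_layer: {layer_name: 'backbone' | 'neck' | 'head'} (assumed mapping)."""
    kept = {}
    for name, f in fisher_sets.items():
        n_pruned = int(block_ratios[block_of_layer[name]] * f.numel())  # n_i^p
        order = torch.argsort(f, descending=True)                       # s_i^j
        kept[name] = order[: f.numel() - n_pruned]   # survivors, by Fisher info
    return kept

# e.g., block_ratios = {"backbone": 0.5, "neck": 0.6, "head": 0.4}
# (the block-wise setting described in the experimental environment)
```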

Experiments
To validate the performance of the proposed F-SiamFC++ tracker, we conducted a comprehensive evaluation on four challenging UAV tracking benchmarks: UAVDT [7], DTB70 [33], UAV123@10fps [34], and VisDrone2018 [35]. UAVDT is mainly designed for tracking vehicles under various weather conditions, flying altitudes, and camera perspectives. DTB70 consists of 70 sequences captured by drones, containing cluttered scenes and objects of various sizes; it is designed to evaluate the robustness of tracking algorithms in complex scenarios. UAV123@10fps is constructed by down-sampling the UAV123 [34] benchmark from the original 30 FPS to 10 FPS; its purpose is to investigate the effect of camera capture rate on tracking performance. The VisDrone2018 dataset is derived from a single-object tracking challenge held in conjunction with the European Conference on Computer Vision (ECCV 2018), which focused on evaluating drone tracking algorithms.

Experimental environment
We conducted all evaluation experiments on a PC equipped with an Intel Core i9-10850K processor (3.6 GHz), an NVIDIA TitanX GPU, and 16 GB of RAM. Note that there are two pruning-ratio settings for F-SiamFC++, corresponding to two different implementations denoted F-SiamFC++(v1) and F-SiamFC++(v2). F-SiamFC++(v1) employs a global pruning ratio and is the realization in our previous work [13]; specifically, it utilizes a global pruning ratio of 0.2. F-SiamFC++(v2) is implemented here and uses block-wise pruning ratios for enhancement: three blocks (backbone, neck, and head) are pruned, with pruning ratios of 0.5, 0.6, and 0.4, respectively. Other parameters for training and inference follow SiamFC++ [11]. It is worth noting that the real-time performance discussed in this paper is defined relative to our platform and only applies to platforms with equal or greater computing resources; with lower computing resources, the speed may fall below our experimental results.
Overall performance evaluation: The overall performance of F-SiamFC++ compared to other trackers on the four benchmarks is illustrated in Fig. 3. It can be observed that F-SiamFC++(v1) and F-SiamFC++(v2) outperform all other trackers on all four benchmarks except VisDrone2018. In particular, on the UAVDT, DTB70, and UAV123@10fps benchmarks, F-SiamFC++(v1) surpasses the second-best tracker, RACF, by significant margins in terms of (PRC, AUC), with improvements of (2.1%, 6.1%), (8.9%, 10.0%), and (2.7%, 5.9%), respectively. Meanwhile, F-SiamFC++(v2) is superior to RACF with gains of (3.4%, 7.2%), (6.9%, 8.8%), and (1.6%, 5.3%) on the same three benchmarks. On VisDrone2018, although F-SiamFC++(v1) and F-SiamFC++(v2) trail the first-ranked tracker RACF in terms of (PRC, AUC), with respective gaps of (2.7%, 0.4%) and (2.2%, 0.9%), they surpass all other trackers; note that these gaps on VisDrone2018 are much smaller than the margins on the other three benchmarks. Although the more efficient version F-SiamFC++(v2) is inferior to F-SiamFC++(v1) in terms of (PRC, AUC) on DTB70 and UAV123@10fps, with respective margins of (2.0%, 1.2%) and (1.0%, 0.6%), F-SiamFC++(v2) performs better on UAVDT and VisDrone2018, resulting in an average precision difference between the two versions of only 0.3% over the four benchmarks. In terms of speed, we assess the average FPS on a single CPU for the competing trackers across the four benchmarks. Table 1 displays the average FPS, average precision rates (PRC), and success rates (AUC) of the competing trackers on a single CPU. As can be seen, F-SiamFC++(v1) and F-SiamFC++(v2) outperform all the other competing trackers in precision, with average PRCs of 78.4% and 78.1%, respectively, and in AUC, with average AUCs of 55.5% and 56.6%. They are also the best real-time trackers (with a speed of >30 FPS) on a single CPU, achieving speeds of 51.9 FPS and 93.9 FPS, respectively. Note that although F-SiamFC++(v1) has a slight advantage in average precision over F-SiamFC++(v2), F-SiamFC++(v2) achieves a higher average pruning ratio and a faster speed, close to 1.81 times that of F-SiamFC++(v1). In summary, F-SiamFC++(v2) achieves a better balance between efficiency and precision than F-SiamFC++(v1).
Attribute-based evaluation: On the four benchmarks, our F-SiamFC++(v1) and F-SiamFC++(v2) are superior to the other DCF-based trackers in the majority of the defined attributes. Figure 4 illustrates examples of success plots. As can be seen, in the situations of object blur and object motion on UAVDT, scale variation and out-of-plane rotation on DTB70, aspect ratio change and viewpoint change on UAV123@10fps, and fast motion and low resolution on VisDrone2018, F-SiamFC++(v1) and F-SiamFC++(v2) demonstrate significant improvements over the other trackers, which can be attributed to the effectiveness of feature representation achieved through deep learning. For example, F-SiamFC++(v1) and F-SiamFC++(v2) surpass the second-ranked tracker RACF on the scale variation subset of DTB70 by gaps of 15.3% and 15.9%, and on the object motion subset of UAVDT by gaps of 8.3% and 9.0%, respectively. This justifies the effectiveness of developing lightweight deeper trackers for UAV tracking. Meanwhile, we can observe that F-SiamFC++(v2) performs better than F-SiamFC++(v1) in the cases of scale variation and out-of-plane rotation on DTB70, object blur and object motion on UAVDT, and low resolution on VisDrone2018, for instance, even though the latter contains more network parameters. This supports the idea that block-wise pruning ratios can provide a better network architecture than a global pruning ratio, so that a more efficient compressed model can be obtained for a given level of tracking precision. Moreover, thanks to the relatively small number of ratios involved, block-wise ratios are easier to determine than layer-wise ones, justifying our choice of block-wise pruning ratios to enhance the previous version based on a global pruning ratio.
Qualitative evaluation: Fig. 5 shows qualitative tracking results of our F-SiamFC++ and four top DCF-based trackers, i.e., ECO-HC [40], ARCF-HC [4], AutoTrack [3], and RACF [6]. We selected a total of eight video sequences from the four benchmarks (two from each benchmark) for demonstration: S1607, S0304, BMX5, Surfing06, person16, boat3, uav0000180_00050_s, and uav0000207_00675_s. As can be observed, the four DCF-based trackers cannot remain robust in challenging scenarios with significant deformation, pose change, or partial occlusion, whereas our F-SiamFC++ performs better and generates visually more satisfactory results by virtue of its deep representation learning. Specifically, all trackers except F-SiamFC++(v1) and F-SiamFC++(v2) fail to track the person in the sequence person16; only F-SiamFC++(v1), F-SiamFC++(v2), RACF, and ECO-HC succeed in tracking the target in BMX5, with our F-SiamFC++(v1) and F-SiamFC++(v2) being more accurate; only F-SiamFC++(v1), F-SiamFC++(v2), and AutoTrack succeed in tracking the surfer in Surfing06, and again our F-SiamFC++(v1) and F-SiamFC++(v2) are more accurate; all trackers track the targets in S1607, S0304, boat3, uav0000180_00050_s, and uav0000207_00675_s successfully, yet our F-SiamFC++(v1) and F-SiamFC++(v2) still perform better on these sequences than the DCF-based trackers. These results suggest that developing lightweight DL-based trackers might be a more effective means of enhancing tracking precision for UAV tracking.

Table 1
The average precision (PRC), AUC, and speed (FPS) of F-SiamFC++ and trackers based on hand-crafted features, compared on the UAVDT [7], DTB70 [33], UAV123@10fps [34], and VisDrone2018 [35] datasets. The first, second, and third places are indicated by bold, bold italic, and underline, respectively. All FPS values reported were evaluated on a single CPU.

Impact of layer-wise and global pruning ratios:
We trained F-SiamFC++ with both layer-wise and global pruning ratios to investigate their impact on the model's precision. In layer-wise pruning, we prune a specific layer of SiamFC++ with a pruning ratio ranging from 0.1 to 0.9 while keeping the other layers fixed. In global pruning, we prune each convolutional layer in the backbone, neck, and head with a global ratio ranging from 0.1 to 0.9. The precision of F-SiamFC++ on the DTB70 benchmark under different pruning-ratio settings is shown in Table 3. Note that the larger the pruning ratio, the greater the number of filters pruned. There is a significant drop in precision at the global pruning ratio of 0.9, with a decrease of 14.0% with respect to the ratio of 0.8, suggesting that if the model is too small, its tracking performance is severely limited. It is also worth noting that the correlation between the number of model parameters and the global pruning ratio is not linear, because the actual pruning ratio of each convolutional layer also depends on the target pruning ratio of its previous layer. This non-linear correlation can be seen clearly in Fig. 6, which plots how the number of parameters of F-SiamFC++ varies with the global pruning ratio. As we can see, in the layer-wise setting, the best precision is reached when the pruning ratio is smaller than 0.7, but the distribution of the best pruning ratios for each layer provides no guidance on how to find better layer-wise pruning ratios; in other words, finding good layer-wise pruning ratios is laborious and time-consuming in view of the huge number of possible combinations. In the global setting, the highest precision is achieved with a pruning ratio of 0.2; however, if the global pruning ratio exceeds 0.7, there is a significant drop in precision. Since fine-tuning is required after pruning, Table 4 shows the additional training time incurred by this fine-tuning to clarify the extra training cost. Note that a pruning ratio of 0 represents the training time of the original model without pruning. As can be observed, the time for fine-tuning is very close to that of training the original model, and a large pruning ratio does not lead to a faster fine-tuning process; this may be explained by the fact that data access dominates the training process. Last but not least, these results imply that filter pruning is not only beneficial for simplifying the model and improving efficiency but can also increase precision, as it may enhance the model's generalization ability if appropriate pruning ratios are chosen. However, a global pruning ratio overlooks the variations among layers, which makes it difficult to achieve optimal pruning effects for every layer simultaneously, while determining optimal layer-wise pruning ratios is too cumbersome and time-consuming. This is why we explore block-wise pruning ratios in this work.

Table 3
The precision (PRC) of F-SiamFC++ on DTB70 as the pruning ratio varies from 0.1 to 0.9 in steps of 0.1. The first place of each row is marked in bold, while the best among all is also marked with a star. 'L1 (Backbone)' means that only the first convolutional layer of the backbone is pruned, and 'Backbone + Neck + Head' means all the convolutional layers in the backbone, neck, and head are pruned with the same pruning ratio.

Impact of block-wise pruning ratios:
The process of finding the best layer-wise pruning ratios through exhaustive search is time-consuming. Even if the same pruning ratios are applied to the corresponding layers of the head and neck branches, there are still 10 layer-wise pruning ratios to determine. Therefore, we utilize block-wise pruning ratios to simplify the process, dividing the model into three blocks (backbone, neck, and head) and defining a separate pruning ratio for each block, denoted $\rho_B$, $\rho_N$, and $\rho_H$, respectively. The combinations cover the following ranges: 0.2 to 0.5 for the backbone, 0.3 to 0.6 for the neck, and 0.4 to 0.7 for the head, all with a step size of 0.1. Note that these ranges are set empirically. As a result, the total number of combinations is reduced from $8^{10}$ to $4^3$, a manageable amount given the time and computation resources available. Since there is no consensus on the optimal balance between precision and efficiency, we define a measure to quantify the trade-off between them:

$$I_b = s \cdot \frac{PRC - PRC_0}{PRC_0} \cdot \frac{S - S_0}{S_0},$$

where $PRC_0$ and $PRC$ represent the precision before and after pruning, $S_0$ and $S$ are the GPU inference speed before and after pruning, and $s$ is a fixed scaling constant. Intuitively, $I_b$ denotes the scaled product of the relative increase in precision and the relative increase in inference speed: the larger the two relative increments, the larger $I_b$. Table 5 shows the average precision (PRC), the average GPU speed on the four benchmarks, the trade-off measure $I_b$, and the model size of F-SiamFC++ with different block-wise pruning ratios. From the table, it is evident that the model achieves the highest average precision of 78.8% with a model size of around 3.24M when $(\rho_B, \rho_N, \rho_H)$ is set to (0.3, 0.4, 0.7), with $I_b = 1.44$. With $(\rho_B, \rho_N, \rho_H)$ set to (0.5, 0.6, 0.7), F-SiamFC++ has the smallest size of around 1.76M, but it only achieves an average precision of 74.0%, significantly lower than the highest average precision (a gap of 4.8%), and it obtains a negative $I_b$ of -1.45. The second-highest precision of 78.1% is achieved by the setting (0.5, 0.4, 0.6), with a model size of around 2.31M; this is the default setting of pruning ratios for our tracker F-SiamFC++(v2), considering its optimal trade-off between precision and efficiency in terms of the proposed measure, specifically $I_b = 2.43$. We can also observe that there is no simple correlation between model size and average precision. For example, the largest model has 5.04M parameters when $(\rho_B, \rho_N, \rho_H)$ is set to (0.2, 0.3, 0.4), but its average precision of 76.7% is lower than that of F-SiamFC++(v2), with a gap of 1.4%. These results also reinforce the idea that Fisher pruning can enhance both efficiency and accuracy if the pruning ratios are properly determined, as previously suggested.
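Because the grid is tiny, the search can be expressed directly. The snippet below enumerates the $4^3$ combinations and ranks them by $I_b$; the scaling constant `s` defaults to a placeholder value and `evaluate` is a hypothetical helper standing in for the prune/fine-tune/benchmark loop, since neither is specified here.

```python
from itertools import product

RATIOS = {"backbone": [0.2, 0.3, 0.4, 0.5],   # rho_B candidates
          "neck":     [0.3, 0.4, 0.5, 0.6],   # rho_N candidates
          "head":     [0.4, 0.5, 0.6, 0.7]}   # rho_H candidates

def trade_off(prc, prc0, speed, speed0, s=1.0):
    """I_b = s * relative precision gain * relative speed gain (s: scale const)."""
    return s * ((prc - prc0) / prc0) * ((speed - speed0) / speed0)

def search(evaluate, prc0, speed0):
    """Try all 4**3 block-wise settings; `evaluate` prunes, fine-tunes, and
    returns (PRC, GPU FPS) for a given (rho_B, rho_N, rho_H) -- a placeholder."""
    results = []
    for combo in product(RATIOS["backbone"], RATIOS["neck"], RATIOS["head"]):
        prc, speed = evaluate(*combo)
        results.append((trade_off(prc, prc0, speed, speed0), combo))
    return max(results)   # best I_b and its ratio setting
```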
Effect of Fisher pruning: We compared the proposed F-SiamFC++(v1) and F-SiamFC++(v2) with the baseline tracker SiamFC++ on all four UAV benchmarks, aiming to investigate the impact of Fisher filter pruning on the model size, multiply-accumulate operations (MACs), precision, and tracking speed of the baseline. The comparison in terms of model size, MACs, precision (PRC), and speed (on both CPU and GPU) is shown in Table 6. We can observe that the model sizes of F-SiamFC++(v1) and F-SiamFC++(v2) are reduced to 80.0% (≈ 7.73/9.66) and 23.9% (≈ 2.31/9.66) of the original, respectively. Significant decreases in MACs can also be observed, from SiamFC++'s 297.98G to F-SiamFC++(v1)'s 193.58G and F-SiamFC++(v2)'s 80.53G. Both CPU and GPU speeds improve. Because the parallel computing units on our GPU significantly exceed the size of these models, the average GPU speeds grow by just 12.8% and 70.5% for F-SiamFC++(v1) and F-SiamFC++(v2), respectively; however, their average CPU speeds increase by 42.2% to 51.9 FPS and by 157.2% to 93.9 FPS from SiamFC++'s 36.5 FPS. Although F-SiamFC++(v1) performs slightly worse than the baseline SiamFC++ on UAV123@10fps, with a gap of 0.7%, it achieves significant precision improvements on the UAVDT and VisDrone2018 benchmarks, with gains of 4.2% and 11.3%, respectively. F-SiamFC++(v2) improves the precision of F-SiamFC++(v1) on UAVDT and VisDrone2018 while significantly enhancing its speed and reducing its model size. Specifically, F-SiamFC++(v2) improves the precision of F-SiamFC++(v1) by 1.6% and 0.6% on UAVDT and VisDrone2018, respectively, reduces its model size by nearly 70% (from 7.73M to 2.31M), and raises the CPU and GPU speeds by 42.0 FPS and 131.7 FPS, respectively, despite slight decreases of 1.0% and 1.9% on UAV123@10fps and DTB70. Overall, F-SiamFC++(v2) achieves a better trade-off between precision and efficiency than F-SiamFC++(v1), and both variants enhance the efficiency and precision of SiamFC++ across the benchmarks. These results confirm that the proposed method is effective for real-time UAV tracking and may encourage more work on developing lightweight DL-based trackers with filter pruning.
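As a quick sanity check, parameter counts such as the 9.66M and 2.31M figures above can be reproduced with a one-liner; MAC counts require a profiler or per-layer accounting, which we omit here. A minimal helper (ours, not part of the paper's code):

```python
import torch.nn as nn

def params_in_millions(model: nn.Module) -> float:
    """Total trainable parameter count in millions (compare with Table 6)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```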

Conclusions
In this paper, we are the first to employ Fisher pruning to narrow the gap between DCF- and DL-based trackers in UAV tracking. The proposed F-SiamFC++(v1) and F-SiamFC++(v2) strike an impressive balance between precision and efficiency while showing state-of-the-art performance on four widely used UAV benchmarks: UAVDT, DTB70, UAV123@10fps, and VisDrone2018. Remarkably, the proposed approach not only enhances efficiency but also improves tracking precision. Specifically, compared with the baseline tracker SiamFC++, which runs at 36.5 FPS on a single CPU, F-SiamFC++(v2) reaches 93.9 FPS while also improving average precision. In future work, we will look into alternative filter pruning methods, superior pruning criteria, and different baseline trackers. Another avenue of research is to explore discriminative representation learning to enhance the discriminative power of the compressed model, in view of the significant decrease in model parameters.

Table 6
Comparison between SiamFC++, F-SiamFC++(v1), and F-SiamFC++(v2) in terms of model size, MACs, precision (PRC), and speed (on both CPU and GPU) on the four UAV benchmarks. The first place is indicated by bold. Note that only the precision on CPU is shown here, since the difference between the precision on CPU and on GPU is very small.

Fig. 2
Fig. 2 An illustration of the network structure of the proposed F-SiamFC++ method, which is similar to that of SiamFC++ except for differences in the pruned feature maps and filters

Fig. 4
Fig. 4 Comparison based on attributes such as object blur, object motion, scale variation, out-of-plane rotation, aspect ratio change, viewpoint change, fast motion, and low resolution

Fig. 6
Fig. 6 Illustration of the non-linear correlation between the number of parameters of F-SiamFC++ and the global pruning ratio. Note that the purple dashed line depicts a linear correlation for comparison

Table 2
This table presents a comparison between F-SiamFC++ and other deep-learning-based trackers on UAVDT [7] in terms of precision and speed (FPS). All FPS values reported are evaluated on a single GPU, and the first, second, and third places are indicated by bold, bold italic, and underline, respectively.

Table 4
Training time (Time) of the models with the global pruning ratio ranging from 0 to 0.9.

Table 5
Illustration of how the average precision (PRC), FPS, $I_b$, and model size of F-SiamFC++ change as the block-wise pruning ratios $(\rho_B, \rho_N, \rho_H)$ vary. We constructed a range of combinations for each ratio using values of 0.2 to 0.5 for the backbone, 0.3 to 0.6 for the neck, and 0.4 to 0.7 for the head, with a step size of 0.1, on all four benchmarks. The first, second, and third places are indicated by bold, bold italic, and underline, respectively.