Learning Target Point Seeking Weights Spatial–Temporal Regularized Correlation Filters for Visual Tracking

Spatial–temporal regularized correlation filtering is an effective tracking approach, yet problems such as target occlusion, background clutter, and out-of-view motion caused by target movement are inevitable during tracking. This paper proposes a target point seeking weights spatial–temporal regularized correlation filter (TP-STRCF) to tackle target loss. The alternating direction method of multipliers is used to simplify the spatial–temporal regularized terms in the objective function, obtain the optimal solutions of the filter and the auxiliary factor, reduce the algorithm's complexity, and achieve spatial and temporal adaptability. A target weight response model is obtained through the target point seeking weights: target motion information is extracted, the target's motion state and trajectory are predicted, and the algorithm's ability to identify the target and distinguish it from interference is enhanced. With the help of the weighted least squares method, the maximum weighted response value is obtained and the target position is determined. Experiments show that, compared with other mainstream correlation filtering algorithms, the proposed algorithm achieves higher tracking accuracy and stronger robustness under complex conditions.


Introduction
In recent years, discriminative correlation filters (DCFs) have been widely used in visual tracking and have many practical applications, such as autonomous driving [1], security monitoring [2], and digital military systems [3]. Given the initial state of the target in the first frame, the tracker converts the correlation operation in the spatial domain into element-wise multiplication in the frequency domain using periodically shifted training samples, automatically locates the target, reduces the computational complexity, and improves the tracking speed.
Since 2010, correlation filters have been introduced into target tracking, improving both tracking accuracy and speed. Danelljan et al. proposed SRDCF [4] (Spatially Regularized Correlation Filters), which expands the target search area, drives the filter template coefficients near the boundary close to 0, increases the filter coefficients in the target area, and improves tracking accuracy. However, because it trains on large-scale sample sets, its running speed is slow. Li et al. [5] added a temporal regularization term to SRDCF and proposed STRCF, which reduces the number of stored training images and improves running speed. Li et al. [6] proposed AutoTrack, which automatically controls the spatial and temporal regularization terms by monitoring a global response variation. Xu et al. [7] proposed ACSDCF, which adaptively selects about 10% of the deep channel features in the learning stage, reducing the filter dimensionality and enhancing discrimination. Zhang et al. [8] proposed EMCF, which uses regularization terms to learn the environmental residuals between two adjacent frames, enhancing the filter's discrimination and insensitivity in tracking scenes with frequent changes and improving the running speed. Ma et al. [9] proposed a color saliency aware correlation filter, which uses color statistics as a model of image boundary connectivity cues.
However, early DCF-based methods suffer from two significant drawbacks. First, the cyclic-shift sampling process is always subject to periodic repetition at the boundaries, which forces the DCF model to train on a fraction of phantom (synthetic) samples. This dilemma is alleviated to some extent by attaching pre-defined spatial constraints to the filter coefficients.
These constraints are usually fixed for different objects and do not change during tracking, so they cannot fully exploit the diverse information of different objects at different times. Second, object localization and scale estimation are usually performed in the same feature space, which requires extracting multi-scale feature maps during tracking. This strategy significantly increases the computational load and slows down the tracker when powerful but complex features are used (such as features extracted from deep networks). This is why CF trackers are usually slow (e.g., DeepSTRCF [5], SRDCF [4] and RPCF [10]). STRCF [5] achieves a 5x speedup in real-time visual tracking and is faster and more accurate than SRDCF [4]. However, it still has room for improvement under background clutter, occlusion, and in-plane rotation.
Due to target rotation, occlusion, and complex background interference, the tracking performance of STRCF is limited. Aiming at the above single-target tracking drift problem, this paper proposes a spatial–temporal regularized target tracking algorithm based on a dynamically updated weight distribution map of target points (target point seeking weights). The proposed algorithm can predict the target's appearance and movement direction in all directions and improve the accuracy of the tracking algorithm.
The contributions of this paper are as follows: (1) We predict the target state, immediately update the target points and their weights, and select the tracking target by considering the weights of candidate samples; this reduces the risk of tracker noise interference and addresses the half-occlusion and full-occlusion problems caused by target movement during the operation of STRCF, improving robustness and tracking performance. (2) We apply ADMM iterations to the objective function to obtain sub-problems whose optimal solutions are easy to compute, achieving real-time performance and reducing computational complexity. (3) We obtain the target weight distribution through the target point seeking weights, extract target motion information, predict the target's motion state and direction, and use weighted least squares to find the maximum response value, so that the filter can obtain the moving target's direction in real time.
We conduct comparative experiments on the OTB-2015 [11], UAV20L [12], LaSoT [13] and VOT2018 [14] datasets. The experimental results show that TP-STRCF performs well in terms of accuracy, robustness, and speed compared with state-of-the-art CF-based trackers. Under fast target motion, illumination change, occlusion, low resolution, and similar conditions, it maintains the excellent real-time performance of STRCF while improving tracking precision and success rate.

Related Work
This section briefly introduces DCF trackers and then focuses on the relevant theories and formulas of STRCF.

Discriminative Correlation Filters Trackers
Since its inception in 2010, the DCF has received extensive attention from target tracking researchers worldwide. Bolme et al. first introduced correlation filtering ideas from signal processing into target tracking and proposed MOSSE (Minimum Output Sum of Squared Error filter) [15], which uses multiple samples of the target as training samples to generate a better filter and improve robustness. To mitigate the boundary effect, SRDCF expands the search area while constraining the effective scope of the filter template: the filter coefficients near the boundary are driven close to 0 and the coefficients in the target area are increased. The algorithm is optimized with Gauss–Seidel iterations, but its running speed is slow due to the large-scale training set. Li et al. added temporal regularization on top of SRDCF and used the Alternating Direction Method of Multipliers (ADMM) [16] to solve the problem iteratively, reducing the computational complexity; the resulting STRCF adapts well to large appearance changes and also runs faster.

Improved STRCF
SRDCF adds a spatial regularization term to the DCF whose weights penalize the correlation filter coefficients during learning and prevent model corruption. STRCF, like SRDCF, retains more information about the target and expands the search area, and it further integrates temporal regularization into SRDCF, giving strong robustness to appearance changes and target occlusion. The STRCF authors show that the algorithm integrates temporal and spatial adjustments into the DCF tracker, providing efficient real-time object tracking. Multiple trackers use prior knowledge to estimate motion models and improve sampling efficiency.
Although STRCF achieves real-time tracking five times faster than SRDCF, it is still unstable when the target is severely deformed or occluded. Moreover, spatial overfitting occurs when the target response map is abnormal, causing filter degradation, since STRCF lacks a regularization term on the spatial response weights. In this paper, a spatial regularization coefficient is added to the STRCF objective to prevent spatial overfitting after adding target point seeking weights. In STRCF, each sample $x_t = \{x_t^1, x_t^2, \ldots, x_t^S\}$ is composed of $S$ feature channel maps of size $M \times N$ at discrete sampling time $t$, and $y_t$ is the desired Gaussian response. The STRCF model is obtained by minimizing the objective function:

$$\arg\min_{f_t} \frac{1}{2}\left\| \sum_{s=1}^{S} x_t^s * f_t^s - y_t \right\|^2 + \frac{\theta}{2} \sum_{s=1}^{S} \left\| w_t \cdot f_t^s \right\|^2 + \frac{\delta}{2} \left\| f_t - f_{t-1} \right\|^2 \tag{1}$$

where $\|w_t \cdot f_t^s\|^2$ is the spatial regularization term, $\theta$ is the newly added spatial regularization coefficient, $w_t$ is the spatial regularization matrix, $f_t^s$ is the filter trained for the current frame, $\|f_t - f_{t-1}\|^2$ is the temporal regularization term, $\delta$ is the temporal regularization coefficient, and $f_{t-1}$ is the filter trained for the previous frame.
Because Eq. (1) is a convex function, an auxiliary variable $g$ is introduced and the problem is minimized with the ADMM algorithm, transformed into three sub-problems, with the constraint $f_t = g_t$ and step-size parameter $\rho$. The augmented Lagrangian expression [17] of Eq. (1) is:

$$L(f_t, g_t, h_t) = \frac{1}{2}\left\| \sum_{s=1}^{S} x_t^s * f_t^s - y_t \right\|^2 + \frac{\theta}{2} \sum_{s=1}^{S} \left\| w_t \cdot g_t^s \right\|^2 + \frac{\delta}{2} \left\| f_t - f_{t-1} \right\|^2 + \frac{\rho}{2} \left\| f_t - g_t + h_t \right\|^2 \tag{2}$$

where $h_t = \frac{1}{\rho} s_t$ and $s_t$ is the Lagrange multiplier. The ADMM algorithm decomposes Eq. (2) into the sub-problems shown in Eq. (3):

$$\begin{cases} f_t^{(i+1)} = \arg\min_{f_t} \left\| \sum_{s=1}^{S} x_t^s * f_t^s - y_t \right\|^2 + \delta \left\| f_t - f_{t-1} \right\|^2 + \rho \left\| f_t - g_t^{(i)} + h_t^{(i)} \right\|^2 \\ g_t^{(i+1)} = \arg\min_{g_t} \theta \sum_{s=1}^{S} \left\| w_t \cdot g_t^s \right\|^2 + \rho \left\| f_t^{(i+1)} - g_t + h_t^{(i)} \right\|^2 \\ h_t^{(i+1)} = h_t^{(i)} + f_t^{(i+1)} - g_t^{(i+1)} \end{cases} \tag{3}$$

According to Eq. (3), $f_t$ and $g_t$ can be obtained. The algorithm complexity of STRCF is $O(SMN \log(MN) N_M)$, where $N_M$ is the maximum number of iterations.
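For illustration, the sketch below implements a simplified single-channel version of the ADMM iterations in Eq. (3) with NumPy. The variable names (`x_hat`-style spectra, `f_prev`, the synthetic Gaussian label `y`) and the closed-form frequency-domain update of the $f$ sub-problem are our assumptions, not the authors' released code; this is a minimal sketch of the technique, not the paper's implementation.

```python
import numpy as np

def strcf_admm(x, y, f_prev, w, theta=1.0, delta=16.0, rho=10.0, iters=2):
    """Simplified single-channel STRCF ADMM solver (Eq. 3), a sketch only.

    x: M x N training patch feature, y: desired Gaussian response,
    f_prev: filter from the previous frame, w: spatial weight map.
    """
    X, Y, Fp = np.fft.fft2(x), np.fft.fft2(y), np.fft.fft2(f_prev)
    g = f_prev.copy()                      # auxiliary variable g_t
    h = np.zeros_like(f_prev)              # scaled Lagrange multiplier h_t
    for _ in range(iters):
        # f sub-problem: per-frequency closed form of
        # ||X*F - Y||^2 + delta*||F - Fp||^2 + rho*||F - G + H||^2
        G, H = np.fft.fft2(g), np.fft.fft2(h)
        F = (np.conj(X) * Y + delta * Fp + rho * (G - H)) / \
            (np.conj(X) * X + delta + rho)
        f = np.real(np.fft.ifft2(F))
        # g sub-problem: element-wise closed form in the spatial domain
        g = rho * (f + h) / (theta * w ** 2 + rho)
        # multiplier update
        h = h + f - g
    return f

# usage sketch: random single-channel feature and a Gaussian label
M, N = 64, 64
x = np.random.rand(M, N)
yy, xx = np.meshgrid(np.arange(M) - M // 2, np.arange(N) - N // 2, indexing="ij")
y = np.roll(np.exp(-(xx**2 + yy**2) / (2 * 2.0**2)), (M // 2, N // 2), (0, 1))
w = np.ones((M, N))                        # uniform spatial weights for the demo
f = strcf_admm(x, y, np.zeros((M, N)), w)
```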
In response to the above problems, this paper proposes a target point seeking weights spatial–temporal regularized correlation filter tracking algorithm (TP-STRCF).

Target Point Seeking Weights Description
Target point seeking weights is a feature fusion framework based on Bayesian filtering [18]. It consists of the target points, the weights of the target points, and the target point prediction and tracking process, i.e., the quantification process. Traditional Bayesian filtering establishes state variables based on a Gaussian random distribution. The algorithm in this paper differs from traditional Bayesian filtering: it takes the results of multiple trackers as the prior probability and the elements of the decision vector computed by the proposed target point seeker as the posterior probability. To clearly describe the target point seeking method and tracking steps, some new concepts and related definitions are given as follows.
Definition 1 (Target point). In the initialized target image, a Cartesian coordinate system is established with the initial geometric center of the target as the origin, and $\gamma$ points are distributed around the origin on concentric rings according to a Gaussian-uniform pattern, where $R$ is the number of rings, $u \in [1, R]$, $S$ is the number of sampling points per ring, and $\mu \in [0, S]$. Figure 1a shows a schematic diagram of the distribution for $R = 4$, $S = 5$ target points. Let the coordinates of any point $\alpha$ in the target distribution be $(M, N)$. A $9\,\text{px} \times 9\,\text{px}$ block $A$ is formed with $\alpha$ as its center; $A$ consists of $3 \times 3$ cells, and the descriptor of each cell within the block is computed. The descriptors of all cells are concatenated to obtain the HOG feature [19] descriptor of $\alpha$. This point $\alpha$ with its HOG feature descriptor is called a target point, as shown in Fig. 1b. TP-STRCF sets $R = 11$, $S = 16$, takes the center of the target as the origin, and draws the target points according to the Gaussian-uniform distribution; the HOG features of the target points are then computed. The results of the target point distribution operation are shown in Fig. 2.
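A minimal sketch of Definition 1 in Python follows, assuming the "Gaussian-uniform" pattern means equally spaced angles per ring with Gaussian-perturbed radii; the ring-radius spacing and the use of `skimage.feature.hog` for the 9 px cell descriptors are our assumptions, not details given in the paper.

```python
import numpy as np
from skimage.feature import hog

def draw_target_points(R=11, S=16, ring_gap=4.0, sigma=1.0, rng=None):
    """Place R rings of S points around the origin (Definition 1).

    Angles are uniform on each ring; radii get a small Gaussian perturbation
    (our reading of the paper's 'Gaussian-uniform' distribution).
    """
    rng = rng or np.random.default_rng(0)
    pts = []
    for u in range(1, R + 1):
        for mu in range(S):
            r = u * ring_gap + rng.normal(0.0, sigma)
            ang = 2.0 * np.pi * mu / S
            pts.append((r * np.cos(ang), r * np.sin(ang)))
    return np.array(pts)            # shape (R*S, 2), gamma = R*S points

def target_point_descriptor(gray, cx, cy):
    """HOG descriptor of the 9x9 block A centered on a target point:
    3x3 cells of 3x3 pixels, one block covering the whole patch."""
    patch = gray[cy - 4:cy + 5, cx - 4:cx + 5]      # 9x9 neighborhood
    return hog(patch, orientations=9,
               pixels_per_cell=(3, 3), cells_per_block=(3, 3))
```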

Definition 2 (Target point seeking weights). The target points that conform to the Gaussian-uniform distribution in the initial frame are assigned weights, and the weight distribution conforms to a normal distribution, recorded as

$$W_k = \{ w_k^i \}_{i=1}^{n}, \qquad w_k^i \sim \mathcal{N}(\mu_w, \sigma_w^2)$$

where $k$ is the current frame number, $n$ is the number of target points, and $w_k^i$ is the weight of the $i$th target point in the current frame; the target point immediately updates the motion variable $w_{k+1}^i$ before the subsequent frame runs. During the search, if the HOG feature of a target point from the $k$th frame is not retrieved in the $(k+1)$th frame, the target point and its weight are redrawn according to the Gaussian-uniform distribution so that the allocation process remains centripetal. This process of finding target weights is called target point seeking weights.
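A one-line sketch of the weight initialization in Definition 2, using the prior $w_0^i \sim \mathcal{N}(100, 15)$ stated later in the text; the function name is ours.

```python
import numpy as np

def init_weights(n, mu=100.0, sigma=15.0, seed=0):
    """Normally distributed target point weights (Definition 2)."""
    return np.random.default_rng(seed).normal(mu, sigma, size=n)

w0 = init_weights(200)        # the paper sets the number of target points to 200
```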
During the movement, to improve the accuracy of trajectory prediction, the target point prediction and tracking process follows the principle of continuity and uses the target point seeking weights to quantify the initial target state $P(X_0)$ obtained by the multiple trackers.
The target motion state is contained in the posterior probability density $P(X_{0:k} \mid Y_{1:k})$, where $X$ is the motion information and $Y$ is the observation information. Target tracking can therefore be described as finding the motion information $X$ of the target from the observations $Y_{1:k}$, where $Y_{1:k}$ denotes the observations from frame 1 to frame $k$ (see Fig. 1 for the target point concepts). Through maximum a posteriori (MAP) estimation, the target state is the motion information $X_{0:k}$ that maximizes $P(X_{0:k} \mid Y_{1:k})$:

$$MAP(X_{0:k}) = \arg\max_{X_{0:k}} P(X_{0:k} \mid Y_{1:k})$$

where $MAP(X_{0:k})$ is the estimated value of $X_{0:k}$ and $P(X_{0:k} \mid Y_{1:k})$ is the posterior probability of the candidate states given the observations from frame 1 to frame $k$. The recursive relation of Bayesian estimation for target tracking, i.e., the prior probability density of the current target state, is:

$$P(X_k \mid Y_{1:k-1}) = \int P(X_k \mid X_{k-1}) \, P(X_{k-1} \mid Y_{1:k-1}) \, dX_{k-1}$$

Using the Bayesian formula to compute the posterior probability density over all time steps of the target state, we get:

$$P(X_{0:k} \mid Y_{1:k}) = \frac{P(Y_k \mid X_k) \, P(X_k \mid X_{k-1})}{P(Y_k \mid Y_{1:k-1})} \, P(X_{0:k-1} \mid Y_{1:k-1})$$

where $P(X_{0:k} \mid Y_{1:k})$ is the posterior probability at time $k$, determined by the posterior probability $P(X_{0:k-1} \mid Y_{1:k-1})$ at time $k-1$ and the motion model. Given the state vector $X_{k-1}$ at time $k-1$, at the current time $k$ there exists $P(X_k \mid X_{k-1})$, the observation equation of the target, which is related to the samples tracked by the multiple trackers.
Here $Csp(k)$ is the tracking frame and $N_{xc}$ is the number of samples. After normalizing the constant, Eq. (7) becomes Eq. (8). According to the law of large numbers, if a random experiment on a variable $X$ is repeated $n$ times with results $X = \{x_1, x_2, x_3, \ldots, x_n\}$, then:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} x_i = E(X)$$

where $E(X)$ is the mathematical expectation of the variable, $x_n < +\infty$. The posterior probability density of the variable $X$ can then be approximated as:

$$P(X) \approx \frac{1}{n} \sum_{i=1}^{n} \delta(X - x_i)$$

where $\delta$ is the Dirac function. From Eq. (12), for any expectation $E(X)$ of $X$, there is an estimate $\hat{E}(X)$ that approximates $E(X)$, and because the law of large numbers applies, this process is guaranteed to converge. Features are extracted by the target point seeking weights in the sample feature extraction stage, and the weight distribution model is acquired. The target point distribution, the histogram of oriented gradients at that moment, and the initial weight distribution model before the target point seeking weights operation are shown in Fig. 3a, while the histogram of oriented gradients and the weight distribution model after the operation are presented in Fig. 3b. By drawing the weight distribution model, an effective discrimination model can still be built when the target and background are similar.
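The weighted Monte Carlo approximation above is easy to demonstrate. A small sketch, assuming unit-normalized weights over candidate states; names like `states` and `weights` are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# n candidate target states (e.g., 2-D positions) from the multiple trackers
states = rng.normal(loc=[50.0, 80.0], scale=5.0, size=(200, 2))
weights = rng.normal(100.0, 15.0, size=200)
weights = np.clip(weights, 0, None)
weights /= weights.sum()                  # normalize the constant

# Dirac-mixture approximation of the posterior: E(X) ~ sum_i w_i * x_i
estimate = (weights[:, None] * states).sum(axis=0)
print("estimated target state:", estimate)   # converges by the law of large numbers
```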

Integration of TP and STRCF
The algorithm in this paper is a multi-feature tracking framework integrating TP and STRCF. The tracking results of multiple trackers are used as candidate samples of the motion model, the optimal result is selected from the candidate samples through prediction of the target motion trajectory, the computed decision vector is used as the observation model, and the improved posterior probability estimate of the target is described. The improved system model for the state transition from $k$ to $k+1$ can be expressed as:

$$X_{k+1} = \Phi X_k + \Gamma U_k \tag{15}$$

where $\Phi$ is an $n \times n$ state transition matrix and $\Gamma$ is an $n \times n$ discrete-time input coupling matrix, both time-invariant. $X_{0:k}$ is the $n \times 1$ target motion state and $U_{0:k}$ is the $n \times 1$ deterministic input vector. The target position $(l_x, l_y)$ is used as the discrete-time input, the target velocity $v$ is used to construct the state transition matrix $\Phi$, and the $n \times n$ identity matrix serves as the discrete-time input coupling matrix $\Gamma$. The measurement model is defined by the measurement at sampling time $t$; the tracking result $x_t$ updates the target after time $t$, and the linear relationship between the observation variable $Z$ and $x_t$ is:

$$Z_t = v^{T} x_t$$

where $v$ is a constant parameter representing noise, $v = [1\ 0]^{T}$. Based on Eq. (1), the correlation filter is the $M \times N$ convolution filter of a feature layer $f_t^s$ at sampling time $t$, and the convolution response of filter $f_t$ to sample $x_t$ is defined as:

$$R_{f_t}(x_t) = \sum_{s=1}^{S} x_t^s * f_t^s$$

Therefore, the discrete Fourier transform $\zeta\{f_t^s\}$ of the function $f_t^s$ after the response is:

$$F(u) = \zeta\{f_t^s\}(u) = \sum_{m=0}^{MN-1} f_t^s(m)\, e^{-j 2\pi u m / (MN)} \tag{20}$$

where $u$ is a continuous variable. Because STRCF classifies all cyclic shifts of a frame in a sliding-window fashion, we use the convolution property of the discrete Fourier transform to obtain the classification score $RS_{f_t}(x_t)$ of all pixels in frame $t$ as:

$$RS_{f_t}(x_t) = \zeta^{-1}\left\{ \sum_{s=1}^{S} \zeta\{x_t^s\} \odot \zeta\{f_t^s\} \right\} \tag{21}$$

where $\zeta^{-1}$ is the inverse Fourier transform. The time complexity of this method is $O(SMN \log(MN))$. Because $f_t^s$ has a limited number of maxima, minima and discontinuities in the interval, $F(u)$ and $RS_{f_t}(x_t)$ converge.
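A sketch of the frequency-domain score computation of Eq. (21), assuming multi-channel features stacked on the first axis; the channel layout and function name are ours.

```python
import numpy as np

def classification_score(x, f):
    """RS_{f_t}(x_t): sum channel-wise spectra products, then inverse FFT.

    x, f: arrays of shape (S, M, N); returns an M x N real response map.
    """
    X = np.fft.fft2(x, axes=(-2, -1))
    F = np.fft.fft2(f, axes=(-2, -1))
    return np.real(np.fft.ifft2((X * F).sum(axis=0)))
```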
In the detection stage, the input model used in this paper treats $(l_x, l_y)$ as discrete variables, so the discrete Fourier transform of the updated classification score $RS_{f_t}(x_t)$ at time $t$ yields $R$, the value the sampled feature produces at the location of the two-dimensional impulse. For the discrete variables $l_x$ and $l_y$, the two-dimensional discrete impulse is defined as:

$$\delta(l_x, l_y) = \begin{cases} 1, & l_x = l_y = 0 \\ 0, & \text{otherwise} \end{cases} \tag{22}$$

The inverse Fourier transform then gives the detection scores of the sample sets over the discrete variables $l_x$ and $l_y$:

$$F(l_x, l_y) = \zeta^{-1}\{ R \}(l_x, l_y) \tag{23}$$

Finally, the maximum of the detection scores over all pixels is computed:

$$(\hat{l}_x, \hat{l}_y) = \arg\max_{(l_x, l_y)} F(l_x, l_y) \tag{24}$$

where $(l_x, l_y)$ traverses all pixel locations in the sample, $(l_x, l_y) \in [0, M) \times [0, N)$. Also, because the number of maxima, minima and discontinuities of $RS_{f_t}(x_t)$ in the interval is limited, $F(l_x, l_y)$ converges. The STRCF algorithm is applied to the tracking result $F(\hat{l}_x, \hat{l}_y)$, integrating STRCF with the discrete TP samples.
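Locating the peak of the detection score map in Eq. (24) is then a single argmax. A minimal sketch, building on the hypothetical `classification_score` above:

```python
import numpy as np

def locate_target(score_map):
    """Return (l_x, l_y) of the maximum detection score in an M x N map."""
    idx = np.argmax(score_map)
    ly, lx = np.unravel_index(idx, score_map.shape)   # row = y, col = x
    return lx, ly
```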
A weighted target point set is obtained from the prior information $\{X_0^i, w_0^i\}_{i=1}^{A}$, where $X_0^i \sim \mathcal{N}(X(0), P(0))$ and $w_0^i \sim \mathcal{N}(100, 15)$. The prior probability is selected as the importance probability density function, i.e., $q(X_k \mid X_{k-1}, Y_k) = P(X_k \mid X_{k-1})$, according to the objective Eq. (13). After the above process has been iterated several times, the weights of many target points become small and can no longer play a role in identifying samples. The degree of degradation of the target point weights is judged by the number of effective target points:

$$N_{eff} = \frac{1}{\sum_{i=1}^{A} (w_k^i)^2}$$

When $N_{eff} < \frac{2}{3} A$, the degraded target points are deleted, new target points are drawn according to the Gaussian distribution, and weights are assigned to them at the same time.
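A sketch of this degradation test and redraw step, with the standard effective-sample-size formula assumed for $N_{eff}$; the specific redraw policy shown (replacing points whose normalized weight falls below $1/A$) is illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def refresh_degraded_points(points, weights):
    """Redraw target points when the effective count drops below 2/3 * A."""
    A = len(weights)
    w = weights / weights.sum()
    n_eff = 1.0 / np.sum(w ** 2)               # effective number of points
    if n_eff < 2.0 * A / 3.0:
        dead = w < 1.0 / A                      # points too weak to identify samples
        points[dead] = rng.normal(0.0, points.std(), size=(dead.sum(), 2))
        weights[dead] = rng.normal(100.0, 15.0, size=dead.sum())
    return points, weights
```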

Target Trajectory Model Update
The tracked target can freely change speed and direction. When sudden acceleration or steering occurs, the tracking algorithm may fail to adjust correctly, resulting in loss of the tracked target. To solve this problem and effectively predict the target trajectory, this paper proposes a method for correcting and predicting the target motion trajectory. With the help of the weighted least squares method [20], the discrete target point coordinates $(l_x, l_y)$ and weights $w_k^i$ input in each frame are fitted to the following model:

$$Y_i = f_i(X, \beta) + \eta_i, \qquad \eta_w \sim \mathcal{N}(0, \operatorname{diag}(w))$$

where $\eta_i, \eta_{i+1}, \ldots$ are mutually uncorrelated, and $Y_i = f_i(X, \beta) + \eta_i$ is a nonlinear model containing the target point coordinates in the $i$th frame. The state estimate output at time $k$ is:

$$\hat{X}_k = \arg\min_{\beta} \sum_{i} w_k^i \left( Y_i - f_i(X, \beta) \right)^2 \tag{29}$$

For the nonlinear model, the existence of optimal weighting is proved by taking the minimum mean square error as the criterion.
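As an illustration, a weighted least squares fit of recent target centers can be done with NumPy's `polyfit`, which accepts per-point weights. The quadratic motion model and window length here are our choices, not the paper's.

```python
import numpy as np

def predict_next_center(centers, weights, deg=2):
    """Fit weighted polynomials to recent (x, y) centers; predict frame k+1.

    centers: (K, 2) target centers of the last K frames,
    weights: (K,) per-frame confidence weights from target point seeking.
    """
    k = np.arange(len(centers), dtype=float)          # frame index
    px = np.polyfit(k, centers[:, 0], deg, w=weights) # weighted LS in x
    py = np.polyfit(k, centers[:, 1], deg, w=weights) # weighted LS in y
    k_next = len(centers)
    return np.polyval(px, k_next), np.polyval(py, k_next)

# usage sketch: five past centers drifting right with rising confidence
centers = np.array([[10., 20.], [12., 21.], [15., 21.5], [19., 22.], [24., 22.5]])
weights = np.array([0.5, 0.7, 0.9, 1.0, 1.0])
print(predict_next_center(centers, weights))
```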
To illustrate TP-STRCF more clearly, Fig. 4 shows the flow chart of the framework. In the dataset experiments, we obtain $RS_{f_t}(x_t)$ for the current frame by computing Eqs. (20) and (21) based on the prediction equation established for the discrete system state in Eq. (15). Then the detection scores $F(l_x, l_y)$ of the sample sets of discrete variables are acquired through Eq. (23). Further, the maximum detection score $F(\hat{l}_x, \hat{l}_y)$ is computed over all target points according to Eq. (24). The weight distribution model is then obtained after the target center is determined according to $F(\hat{l}_x, \hat{l}_y)$. Finally, the target center and weight distribution model for the next frame are updated according to the target motion trajectory predicted by Eq. (29).
Fig. 6 Overlap success plots of the competing trackers on four attributes of OTB-2015

Steps of TP-STRCF Algorithm
TP-STRCF uses the current output state of STRCF as the initial state of the target point seeking weights algorithm to obtain an immediately updated weight distribution map of the target points and predict the target motion. The main steps are as follows (a sketch of the loop follows the steps): Step 1 Modeling stage. Crop the target area of the first frame, determine the target's initial position, mark the target geometric center, and place the target points conforming to the Gaussian-uniform distribution with the target geometric center as the coordinate origin. Then perform feature extraction on the target appearance, train the algorithm model, assign initial weights to the target points found by the target point seeking weights algorithm, and obtain the target point HOG values and weight distribution map.
Step 2 Target trajectory prediction. As subsequent frames of the video sequence arrive, the target points obtained in the previous frame are moved randomly to obtain the $HOG_{t+1}$ values at the new coordinate positions, and whether $HOG_{t+1}$ falls within the threshold range is determined.
Step 3 Target positioning. The prediction result is correlated with the current frame's search area to obtain the target position.
Step 4 Model update. Use Eq. (29) to obtain the target point center, i.e., the center of the new weight response map, and train the filter to update the training model.
Step 5 Output the update result, go to Step 2, continue to predict the target position of the next frame, train the correlation filter, and update the model.
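The toy loop below illustrates the detect-then-correct structure of Steps 2–5 on synthetic data: a Gaussian bump with clutter noise stands in for the filter response of Step 3, and a confidence-weighted linear fit stands in for the Eq. (29) trajectory correction of Step 4. It is a schematic of the procedure, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def tp_strcf_demo(n_frames=10):
    """Toy Steps 2-5 loop on a point target drifting right."""
    true = np.array([20.0, 20.0])
    centers, confs, preds = [], [], []
    for _ in range(n_frames):
        true = true + np.array([2.0, 0.5])            # target motion
        # Step 3: localize the peak of the (synthetic) response map
        yy, xx = np.mgrid[0:64, 0:64]
        score = np.exp(-((xx - true[0])**2 + (yy - true[1])**2) / 8.0)
        score += 0.05 * rng.random(score.shape)       # background clutter
        ly, lx = np.unravel_index(np.argmax(score), score.shape)
        centers.append([lx, ly]); confs.append(float(score.max()))
        # Step 4: weighted least squares prediction of the next center
        if len(centers) >= 3:
            k = np.arange(len(centers), dtype=float)
            c, w = np.array(centers, float), np.array(confs)
            preds.append([np.polyval(np.polyfit(k, c[:, 0], 1, w=w), k[-1] + 1),
                          np.polyval(np.polyfit(k, c[:, 1], 1, w=w), k[-1] + 1)])
    return preds                                       # Step 5 outputs

print(np.round(tp_strcf_demo(), 1))
```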

Framework Parameters
The experiments are performed using GNU Octave. The computer's operating system is macOS Monterey 12.4 (64-bit), the CPU is a 2.9 GHz Intel Core i5 with 2 cores and 4 threads, and the RAM is 8 GB.
In the experiments, the temporal regularization coefficient is $\delta = 16$, the spatial regularization matrix is $w_t = 1$ in the target area and $w_t = 0$ elsewhere, and the step size is $\rho = 10$. The maximum number of ADMM iterations per scale is set to 2, and the number of target points is set to 200. The remaining parameters are the same as in STRCF.
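Collected as a configuration sketch; the key names are ours, while the values are the ones stated above.

```python
# hypothetical parameter block mirroring the settings reported in the paper
TP_STRCF_PARAMS = {
    "temporal_reg_delta": 16,        # temporal regularization coefficient
    "spatial_weight_in_target": 1,   # w_t inside the target area
    "spatial_weight_outside": 0,     # w_t elsewhere
    "admm_step_rho": 10,             # ADMM step size
    "admm_max_iters": 2,             # maximum ADMM iterations per scale
    "num_target_points": 200,        # target points per frame
}
```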

The OTB-2015 Benchmark
The OTB-2015 benchmark [11] is a popular tracking dataset that contains 100 fully annotated video sequences, both short-term and long-term, with a total of 58,897 frames. The sequences cover 11 different attributes such as Illumination Variation, Low Resolution and Background Clutter. We evaluate the trackers with the one-pass evaluation (OPE), temporal robustness evaluation (TRE) and spatial robustness evaluation (SRE) protocols provided in OTB-2015 [11].

Quantitative Analysis in Comparison with Hand-Crafted Features Algorithm
We compare the proposed TP-STRCF with 8 other state-of-the-art trackers using hand-crafted features. Table 1 shows the average FPS results over 20 random video sequences in OTB-2015. Figure 5 shows the overlap success rates of the compared algorithms under OPE, TRE, and SRE based on the area under the curve (AUC) in OTB-2015. As shown in Fig. 5 and Table 1, although TP-STRCF has no obvious advantage in speed, it significantly outperforms the other trackers in overlap success rate. Our TP-STRCF outperforms STRCF by 3.56% in OPE score, 1.85% in TRE score, and 1.87% in SRE score. The improvement comes from the addition of target point seeking weights: TP-STRCF obtains the latest sample position by continuously tracking the target points' weights, solving the sample-loss problem. Furthermore, our method outperforms other DCF-based trackers, such as BACF, ECO and ACSDCF.

Quantitative Analysis Based on Video Attributes
In this section, we quantitatively analyze the 11 video attributes of the OTB-2015 dataset. Figure 6 shows the overlap success rate plots for only four attributes: background clutter (BC), occlusion (OCC), in-plane rotation (IPR) and out-of-plane rotation (OPR). As shown in Fig. 6, our TP-STRCF outperforms the other 8 DCF trackers. For example, under long-lasting BC and OCC, the sample model is updated abnormally, and the target point seeking weights recalculate the weights of the target points, so the tracker remains robust after the target reappears. On the four attributes in Fig. 6, TP-STRCF exceeds STRCF by 8.46%, 4.9%, 3.4%, and 4.92%, respectively.

Qualitative Analysis Based on Video Sequences
To further verify the robustness of TP-STRCF, Fig. 7 shows the tracking results of 9 algorithms on 6 video sequences in OTB-2015, covering the attributes OCC, IPR, OPR, FM, Illumination Variation (IV) and Low Resolution (LR). As shown in Fig. 7, TP-STRCF performs well on video sequences with these attributes. Under occlusion and fast target motion, STRCF fails to adjust in time after losing the target, and its result deviates from the target center. When the illumination changes in a video sequence, the features of the target area are affected, and algorithms such as KCF, ECO, and SRDCF drift.
For multi-scale blurred images, such as MountainBike in Fig. 8, the closer the target is to the target point center during prediction, the smaller the impact of scale change on tracking; the farther the target is from the target point center, the greater the impact. In the experimental results, our algorithm outperforms STRCF in short-term multi-scale blurred image tracking, but in long-term multi-scale blurred image tracking, a target deviating too far causes tracking drift.

The UAV20L Benchmark
UAV20L is a dataset specially prepared for the long-term tracking of UAVs. It contains 20 long sequences in which the target leaves the field of view many times and returns after a certain period, testing whether the detection component of a long-term tracker can complete the task well. We again evaluate according to the OPE criterion. Figure 8 shows that TP-STRCF still performs outstandingly on the multi-view long video test sequences and has good robustness.

The LaSoT Benchmark
The LaSoT dataset contains long-term tracking scenes with frequent uncertainties such as similar targets moving in similar directions, making continuous tracking difficult; it therefore reflects the practical performance of a tracking algorithm. In this section, comparative experiments were carried out against STRCF, EMCF, BACF and ECO. As shown in Fig. 9, the AUC success rate of our algorithm is 0.446, 1.7% higher than that of STRCF. By updating the predicted target trajectory, our algorithm reduces the interference of abnormal conditions and achieves a better tracking effect.

The VOT2018 Benchmark
The VOT2018 dataset is a benchmark of 60 video sequences that uses a labeling method different from OTB-2015: results are annotated with the minimum bounding rectangle of the target, which increases the tracking difficulty. Our algorithm was compared with STRCF, EMCF, ACSDCF and SiamFC on the VOT2018 dataset. As can be seen from Table 2, EAO is improved by 5.1% and the accuracy score is 6.2% higher compared with STRCF, and the robustness score is also improved, showing the effectiveness of the improved module.

Conclusions
In this paper, a target tracking algorithm based on target point seeking weights and spatial–temporal regularization is proposed. Under the STRCF framework, target point seeking weights are used to estimate sample motion trajectories and predict target motion while ensuring spatial and temporal adaptability. The prediction result is fed into the filter to enhance its resolving power and improve tracking accuracy. The alternating direction method of multipliers is used to solve the model, transforming it into three sub-problems and reducing the computational complexity.

Fig. 2 The first-frame target point distribution map

Fig. 3 Before and after the HOG operation

Table 1 Average FPS results of 20 random video sequences in OTB-2015

Fig. 9 Overlap success rates for OPE on the LaSoT dataset