Robust Probabilistic Discriminative Model Prediction Tracker via Improved Model Update Strategy

In visual object tracking, algorithms based on discriminative model prediction have shown favorable performance in recent years. Probabilistic discriminative model prediction (PrDiMP) is a typical tracker of this kind. PrDiMP evaluates its tracking results from the output of the tracker itself in order to guide the online update of the model. However, the tracker output is not always reliable, especially in cases of fast motion, occlusion or background clutter, so relying on it alone to guide the model update can easily lead to drift. In this paper, we present a robust model update strategy that integrates maximum response, multi-peaks and detector cues to guide the model update of PrDiMP. Furthermore, we analyze the impact of different model update strategies on the performance of PrDiMP. Extensive experiments and comparisons with state-of-the-art trackers on four benchmarks (VOT2018, VOT2019, NFS and OTB100) demonstrate the effectiveness of our algorithm. These results indicate that our method can effectively adapt the object model online.

In the past few years, deep learning has achieved milestones in the field of computer vision [7,8], and many object tracking algorithms based on deep learning have been proposed [9,10,11]. These algorithms can be divided into those with offline model update [12,13,14,15,16] and those with online model update [17,18,19]. In general, online model update methods achieve higher accuracy and better robustness than offline ones, and have therefore attracted wide attention in recent research [18,19]. However, online model update is a double-edged sword: it can adapt to appearance changes of the object and background, but it is also easily contaminated by noisy samples, which leads to tracking drift. PrDiMP [19] is a typical online model update tracker. It combines the maximum response and the second-maximum response ratio into a criterion for evaluating the tracking results, and uses that evaluation to guide the online update of the model. Because the evaluation is based on the tracker's own output, it is not always reliable, especially in cases of fast motion, occlusion or background clutter.
In this paper, we present a robust model update strategy that integrates maximum response, multi-peaks and detector cues to guide the model update of the PrDiMP tracker. Furthermore, we analyze the impact of different model update strategies on the performance of PrDiMP. Extensive experiments and comparisons with state-of-the-art trackers on four benchmarks (VOT2018, VOT2019, NFS and OTB100) demonstrate the effectiveness of our algorithm. The strategy not only significantly improves the tracking performance of PrDiMP, but can also be easily embedded into other online model update trackers.

Contributions:
1. We propose a robust model update strategy that integrates maximum response, multi-peaks and detector cues to guide the model update of PrDiMP.
2. We analyze in detail the impact of different model update strategies on the performance of PrDiMP.
3. Our method outperforms not only the corresponding baseline but also other strong object tracking methods on OTB100, NFS, VOT2018 and VOT2019.

Visual object tracking
In the past few years, deep learning has achieved milestones in the field of computer vision, and many object tracking algorithms based on deep learning have been proposed [9,10,11]. The Siamese architecture [12,13,14,15,24,25,26] offers end-to-end training and high efficiency. However, methods based on the Siamese architecture cannot integrate background information, so their discriminative ability is limited. To overcome this, DiMP [18] and PrDiMP [19] develop an end-to-end tracking architecture that makes full use of the appearance information of both object and background for object model prediction. This framework is built on an object model prediction network, which is derived from a discriminative learning loss by applying an iterative optimization procedure. It achieves effective end-to-end training while maximizing the discriminative ability of the predicted model.

Online update for visual object tracking
In visual object tracking, online model update plays an important role because it enables the model to adapt to changes in the appearance of the object and its surrounding background. However, online model update is a double-edged sword: it can adapt to appearance changes of the object and background, but it is also easily contaminated by noisy samples, which leads to tracking drift. To improve the model's ability to adapt to such changes without contaminating it, researchers have designed criteria that evaluate the reliability of the current tracking result and then delete unreliable samples or reject inappropriate updates. Examples include the confidence score [27], the maximum response [17], the peak-to-sidelobe ratio [17], the average peak-to-correlation energy [28], and MAX-PSR [29]. These methods usually evaluate the tracking results from the output of the tracker itself in order to guide the online update of the model. However, the tracker output is not always reliable, especially in cases of fast motion, occlusion or background clutter [30,31]. In this paper, we present a robust model update strategy that integrates maximum response, multi-peaks and detector cues to guide the model update of PrDiMP. The strategy not only significantly improves the tracking performance of PrDiMP, but can also be easily embedded into other online model update trackers.
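As an illustration of the kind of reliability criteria listed above, the following minimal sketch computes two of them on a toy response map, using their commonly cited definitions: the peak-to-sidelobe ratio, (peak - mean of sidelobe) / (standard deviation of sidelobe), and the average peak-to-correlation energy, |F_max - F_min|^2 / mean((F - F_min)^2). The function names, the `exclude` window half-width and the small constant added to the denominators are assumptions made for this example and are not taken from the cited trackers.

```python
from statistics import mean, pstdev


def psr(response, exclude=1):
    """Peak-to-sidelobe ratio of a 2-D response map given as nested lists.

    `exclude` is the half-width of the square window around the peak that is
    left out of the sidelobe (an illustrative default). The map is assumed to
    be larger than that window, so the sidelobe is never empty.
    """
    rows, cols = len(response), len(response[0])
    peak, pr, pc = max(
        (response[r][c], r, c) for r in range(rows) for c in range(cols)
    )
    sidelobe = [
        response[r][c]
        for r in range(rows)
        for c in range(cols)
        if abs(r - pr) > exclude or abs(c - pc) > exclude
    ]
    # The small constant avoids division by zero for a perfectly flat sidelobe.
    return (peak - mean(sidelobe)) / (pstdev(sidelobe) + 1e-12)


def apce(response):
    """Average peak-to-correlation energy of a 2-D response map."""
    flat = [v for row in response for v in row]
    f_max, f_min = max(flat), min(flat)
    energy = mean((v - f_min) ** 2 for v in flat)
    return (f_max - f_min) ** 2 / (energy + 1e-12)
```

Both scores drop when a second strong peak appears in the map, which is why they are used as confidence measures. Both are also computed from the tracker's own response, which is the limitation discussed in the text.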

Methods
In this section, we introduce our robust model update strategy, which integrates maximum response, multi-peaks and detector cues to guide the online model update of PrDiMP. First, we analyze the problems of PrDiMP's model update strategy (Section 3.1). Then we introduce our robust model update strategy in detail (Section 3.2).

Analysis of the online model update strategy of PrDiMP
PrDiMP [19] is a tracker that integrates a probabilistic regression method into DiMP [18]. It consists of two parts: (I) an object estimation module that is learned offline, and (II) an object classification module that is learned online. In this section, we mainly analyze the online model update strategy of PrDiMP; for more details about PrDiMP, see [18,19].

PrDiMP integrates the maximum response (MAX) and the second-maximum response ratio (SMR) to guide the model update. MAX is defined as the maximum value of the classification network response map $R_t$,

$$\mathrm{MAX} = \max(R_t),$$

where the subscript $t$ denotes the $t$-th frame. To calculate SMR, the response map is divided into the maximum and the sidelobe, where the sidelobe consists of the pixels that remain after an $M \times M$ window around the maximum is excluded. The second response $\mathrm{MAX}_{sl}$ is defined as the maximum response of the sidelobe. Then SMR is defined as

$$\mathrm{SMR} = \frac{\mathrm{MAX}_{sl}}{\mathrm{MAX}}.$$

Based on MAX and SMR, PrDiMP divides the tracking results into four types: The tracking result is 'Not found'; else if $\mathrm{SMR} > \mathrm{threshold}_2$, calculate the distance $D_c$ between the MAX prediction position and the prediction position of the previous frame; calculate the distance $D_{sl}$ between the $\mathrm{MAX}_{sl}$ prediction position and the prediction position of the previous frame; if $D_c < \mathrm{threshold}_3$ [...]

In PrDiMP, when $\mathrm{SMR} > \mathrm{threshold}_2$, the tracking result is evaluated according to the distance between each of the two current peak positions and the prediction position of the previous frame. The peak whose distance is less than the threshold is selected as the prediction position of the current frame. However, similar objects and other interference usually appear at random, so the distance between the interference and the previous prediction position is not necessarily greater than the distance between the object and the previous prediction position. Updating the model in this situation may contaminate it.
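The quantities used in this analysis can be sketched in a few lines. The example below is a minimal illustration and not the PrDiMP implementation: it assumes that SMR is the ratio of the sidelobe maximum to the global maximum, that the response map is a nested list with a positive maximum, and that positions are expressed in response-map cells. The names `two_peaks`, `smr` and `window` (which plays the role of M) are hypothetical.

```python
import math


def two_peaks(response, window=3):
    """Return ((MAX, pos), (MAX_sl, pos_sl)) for a 2-D response map.

    An odd `window` x `window` neighbourhood centred on the global maximum
    is excluded before the sidelobe maximum MAX_sl is searched for.
    """
    rows, cols = len(response), len(response[0])
    cells = [(response[r][c], r, c) for r in range(rows) for c in range(cols)]
    max_val, pr, pc = max(cells)
    half = window // 2
    # Sidelobe: every cell outside the excluded window around the maximum.
    sidelobe = [
        (v, r, c) for v, r, c in cells if abs(r - pr) > half or abs(c - pc) > half
    ]
    sl_val, sr, sc = max(sidelobe)
    return (max_val, (pr, pc)), (sl_val, (sr, sc))


def smr(max_val, max_sl):
    """Second-maximum response ratio, assumed here to be MAX_sl / MAX."""
    return max_sl / max_val


# Toy usage: D_c and D_sl are the distances of the two peaks to the
# previous prediction (all values below are made up for illustration).
resp = [[0.0] * 5 for _ in range(5)]
resp[1][1], resp[1][2], resp[4][4] = 1.0, 0.95, 0.6
(max_val, pos), (max_sl, pos_sl) = two_peaks(resp, window=3)
prev_pos = (1, 0)
d_c = math.dist(pos, prev_pos)
d_sl = math.dist(pos_sl, prev_pos)
```

Note that the value 0.95 next to the maximum is ignored because it lies inside the excluded window, so the sidelobe maximum is the more distant peak of 0.6.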
In addition, when $\mathrm{SMR} > \mathrm{threshold}_4$ and $\mathrm{MAX}_{sl} > \mathrm{threshold}_1$, the authors of PrDiMP regard the maximum response position as the prediction position of the current frame, use it as a hard sample and increase its weight. However, when the difference between the two peaks is not large, what is considered to be interference is likely to be the real object. Updating the model in this situation may contaminate it.

To verify our analysis, we define the proportions of erroneous updates $R_{distance}$ and $R_{disturb}$ as

$$R_{distance} = \frac{n_{distance\_error}}{n_{distance}}, \qquad R_{disturb} = \frac{n_{disturb\_error}}{n_{disturb}}.$$

Here, $n_{distance}$ represents the total number of frames in which the current frame is used as a 'Hard negative' in the case of $\mathrm{SMR} > \mathrm{threshold}_2$, and $n_{distance\_error}$ represents the number of these frames in which the distance between the current predicted position and the actual object position is greater than 20 pixels. Similarly, $n_{disturb}$ represents the total number of frames in which the current frame is used as a 'Hard negative' in the case of $\mathrm{SMR} > \mathrm{threshold}_4$ and $\mathrm{MAX}_{sl} > \mathrm{threshold}_1$, and $n_{disturb\_error}$ represents the number of these frames in which the distance between the current predicted position and the actual object position is greater than 20 pixels.

The results are shown in Table 1 and Table 2. Both $R_{distance}$ and $R_{disturb}$ of PrDiMP are high, which shows that the proportion of erroneous updates is high and that the model is easily contaminated, leading to tracking drift. The $R_{distance}$ and $R_{disturb}$ of our method are both lower than those of PrDiMP. Our method therefore has a lower proportion of erroneous updates and is more robust than PrDiMP.
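The error-update proportion described above is a simple ratio over the flagged frames, and the sketch below shows one way to compute it. It is an illustration only: the function name, the argument layout (per-frame predicted and ground-truth centres plus a boolean flag marking the frames counted in the denominator) and the toy values are assumptions, while the 20-pixel tolerance comes from the text.

```python
import math


def error_update_ratio(pred_centers, gt_centers, flagged, tol=20.0):
    """Fraction of flagged frames whose prediction is more than `tol` pixels
    away from the ground-truth object centre.

    `flagged[i]` is True when frame i was used as a 'Hard negative' under the
    condition being studied (for example SMR > threshold_2).
    """
    n_total = 0
    n_error = 0
    for pred, gt, is_flagged in zip(pred_centers, gt_centers, flagged):
        if not is_flagged:
            continue
        n_total += 1
        if math.dist(pred, gt) > tol:
            n_error += 1
    # With no flagged frames the ratio is reported as 0 by convention here.
    return n_error / n_total if n_total else 0.0


# Toy usage with made-up centres: frames 0, 1 and 3 are flagged,
# and two of them are more than 20 pixels off.
preds = [(0, 0), (100, 100), (50, 50), (10, 10)]
gts = [(0, 5), (100, 130), (50, 50), (40, 10)]
flags = [True, True, False, True]
ratio = error_update_ratio(preds, gts, flags)
```

The same function covers both proportions; only the flag sequence changes with the condition under which a frame is counted.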

Robust model update strategy
To address the problems of the PrDiMP model update strategy, we integrate maximum response, multi-peaks and detector cues to guide the update of the tracker. The overall framework of the algorithm is shown in Fig. 1.
Maximum response cue. The goal of the classification network is to distinguish the object from the surrounding background. MAX is defined as the maximum value of the classification network response map $R_t$.
Multi-peaks cue. MAX may be disturbed by similar objects or noise, leading to inaccurate detection, and an inaccurate detection further contaminates the model through incorrect training samples. The peaks located at similar objects or background noise in the response map may approach, or even surpass, the peak at the object. Following the above analysis, the object may be located at any one of multiple peaks, so all of them should be taken into consideration. The ratio of each of these peaks to the maximum peak, $\mathrm{PMR}_i$ (the subscript $i$ denotes the $i$-th peak), is calculated as

$$\mathrm{PMR}_i = \frac{s_t^i}{\mathrm{MAX}},$$

where $s_t^i$ represents the peak value of the $i$-th peak.

Fig. 1 The overall workflow of the proposed method.

Detector cue. These peaks are verified by the detector to determine the object location. Specifically, we use the SiamBAN tracker [24] as a detector and select the peak closest to the position predicted by the detector as the object position.
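The multi-peaks and detector cues can be illustrated with the following minimal sketch. It rests on several assumptions that are not stated in the paper: peaks are taken to be strict local maxima in an 8-neighbourhood, PMR is taken to be the peak value divided by the largest peak, a hypothetical cut-off `pmr_threshold` keeps only sufficiently strong peaks, and the detector position is assumed to be already mapped into response-map coordinates. The response map is assumed to contain at least one strict local maximum.

```python
import math


def local_peaks(response):
    """Return [(value, (row, col)), ...] for strict 8-neighbourhood maxima."""
    rows, cols = len(response), len(response[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            v = response[r][c]
            neighbours = [
                response[i][j]
                for i in range(max(r - 1, 0), min(r + 2, rows))
                for j in range(max(c - 1, 0), min(c + 2, cols))
                if (i, j) != (r, c)
            ]
            if all(v > n for n in neighbours):
                peaks.append((v, (r, c)))
    return peaks


def candidate_peaks(response, pmr_threshold=0.5):
    """Return [(PMR_i, position), ...] for peaks with PMR_i >= pmr_threshold.

    The threshold value is hypothetical and only serves this example.
    """
    peaks = local_peaks(response)
    max_val = max(v for v, _ in peaks)
    return [(v / max_val, pos) for v, pos in peaks if v / max_val >= pmr_threshold]


def select_by_detector(candidates, detector_pos):
    """Pick the candidate peak closest to the detector's predicted position."""
    return min(candidates, key=lambda cand: math.dist(cand[1], detector_pos))[1]


# Toy usage: the strongest peak (1, 1) is a distractor, and the detector
# points near the weaker peak (4, 4), which is therefore selected.
resp = [[0.0] * 6 for _ in range(6)]
resp[1][1], resp[4][4], resp[1][4] = 1.0, 0.8, 0.2
cands = candidate_peaks(resp, pmr_threshold=0.5)
chosen = select_by_detector(cands, detector_pos=(4, 3))
```

In this toy case a rule based on the maximum response alone would pick the distractor, whereas the detector cue resolves the ambiguity between the two strong peaks.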
Based on maximum response, multi-peaks and detector cues, our method divides the tracking results into three types. The detailed model update strategy is shown in Algorithm 2.

Results and Discussion
We conduct extensive experiments and compare our method with state-of-the-art trackers on four benchmarks: OTB100, NFS, VOT2018 and VOT2019.

Implementation details
Our algorithm is implemented in Python with PyTorch and runs on an RTX 3090 GPU. For a fair comparison, we use the same parameters as PrDiMP [19]. ATOM [17], DiMP [18], PrDiMP [19] and our method were each run 5 times on OTB100 and NFS, and 15 times on VOT2018 and VOT2019. As shown in Table 3, the AUC of our method is significantly improved compared with other methods. These results indicate that our method can effectively adapt the object model online.
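For readers unfamiliar with the AUC metric mentioned above, the following sketch shows how a success-plot AUC is commonly computed on OTB-style benchmarks and how scores from repeated runs can be averaged. It is an illustration under stated assumptions and not the evaluation code used in the paper: boxes are assumed to be (x, y, w, h) tuples, the success rate is evaluated at 21 evenly spaced overlap thresholds in [0, 1], and a frame counts as a success when its overlap is strictly greater than the threshold.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, n_thresholds=21):
    """Area under the success plot: mean success rate over the thresholds."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


def mean_over_runs(run_scores):
    """Average a metric over repeated runs of a stochastic tracker."""
    return sum(run_scores) / len(run_scores)
```

Averaging over several runs matters for trackers with online optimization, because their results can vary slightly from run to run.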

Conclusions
We improve PrDiMP by integrating maximum response, multi-peaks and detector cues to guide its model update. This strategy greatly reduces the risk of online model update. We compare our method with state-of-the-art trackers on four benchmarks: OTB100, NFS, VOT2018 and VOT2019. The results show that our approach outperforms not only the corresponding baseline but also other strong object tracking methods. The strategy not only significantly improves the tracking performance of PrDiMP, but can also be easily embedded into other online model update trackers.

List of abbreviations
PrDiMP: probabilistic discriminative model prediction; MAX: maximum response; SMR: second-maximum response ratio.

Competing interests
The authors declare that they have no competing interests.

Funding
This work is not supported by funding.