4.1 Problem Formulation
Existing deep-learning trackers for surveillance analysis are typically trained offline and fine-tuned online, and they use only the object information in the first frame to fine-tune the learned network parameters. However, it is difficult to capture previously unseen object features from one or a few examples. In addition, the positive samples in each frame overlap heavily in space and therefore fail to capture rich appearance variations. Moreover, the large numbers of positive and negative samples needed to train a deep network model are nearly impossible to obtain in real-world settings.
In this section, we first present a novel attention generative adversarial network and describe the overall training architecture. The proposed generative model takes VGG network features as input and is mainly used to capture the object appearance variations across consecutive video frames. The discriminative model is introduced as a supervisor that provides guidance on the quality of the generated object appearance details. To stabilize the training of the generative adversarial network, we adopt a mean squared loss that penalizes the classification error of each pixel. To improve tracking performance, a novel spatial attention mechanism is developed to adapt the offline-learned deep model to online object tracking. The VGG network is used to sense the tracked object and decode the object features into attention response maps. Finally, we describe online tracking, which consists of model updating and scale handling. The object attention maps are obtained by feeding the object appearance information from the first ten frames and from the remaining video frames into the generative model. The candidate with the maximum response score is regarded as the tracking result. This process is repeated frame by frame until the end of the video sequence. Fig. 2 shows the flowchart of the proposed tracker.
In Fig. 2, the generative model of the GAN follows an encoder-decoder framework, which encodes the input object appearance into a feature representation and decodes it into the corresponding outputs. The discriminative model is a standard convolutional neural network.
4.2 Network Architecture
The network includes two branches. The lower branch of the architecture, called the prediction network, takes the first ten frames of a video sequence as input. We use the prediction network to track the object in frames 1 to 10 and obtain its position in each frame. The features extracted from the predicted object locations are used to fine-tune the fully connected layers of the network in the upper half of the architecture. From frame 11 to the end of the video, the object feature of each frame is taken as the input of this network. Weight masks are applied to adaptively drop out input features. Adversarial learning identifies the weight mask that maintains the most robust features over a long temporal span while removing the discriminative features from individual frames.
It is worth noting that the deep learning model is initialized with the weights of a VGG-16 model pre-trained on the ImageNet benchmark for object classification. Most deep-learning-based trackers use such an offline-learned network and then use the first frame to fine-tune the network parameters during tracking. However, it is difficult to obtain the object-specific features of a video by training the deep network model with samples from the first frame only. On the other hand, if the network is fine-tuned using the first n frames of a video, manually labeling the object position becomes expensive and impractical. Therefore, a prediction network is introduced into the deep learning framework to automatically predict the position of the object in the video sequence. The prediction network, depicted in the lower part of Fig. 2, has three convolution layers and two fully connected layers. We directly use a VGG-M model pre-trained on the ImageNet classification task; the parameters of the convolution layers are fixed, and only the fully connected layers are fine-tuned online. The prediction network is fine-tuned online by minimizing the cross-entropy loss function with SGD as follows: (see Equation 3 in the Supplementary Files)
where p and q denote training samples and corresponding labels, respectively; N is the number of training samples.
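As a rough illustration of this fine-tuning step, the sketch below minimizes a mean binary cross-entropy loss over N samples with plain SGD on a single logistic unit. Equation 3 itself is only available in the Supplementary Files, so the standard averaged form used here, the learning rate, the toy data, and the names `cross_entropy` and `sgd_epoch` are assumptions made for illustration and not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(preds, labels, eps=1e-12):
    """Mean binary cross-entropy over N samples (assumed standard form)."""
    total = 0.0
    for pred, label in zip(preds, labels):
        pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(label * math.log(pred) + (1 - label) * math.log(1 - pred))
    return total / len(preds)

def predict(weights, features):
    """Foreground probability of each sample under a single logistic unit."""
    return [sigmoid(sum(w * x for w, x in zip(weights, f))) for f in features]

def sgd_epoch(weights, features, labels, lr=0.1):
    """One pass of per-sample SGD; the gradient for a logistic unit is (pred - label) * x."""
    for f, label in zip(features, labels):
        pred = sigmoid(sum(w * x for w, x in zip(weights, f)))
        weights = [w - lr * (pred - label) * x for w, x in zip(weights, f)]
    return weights

# Toy samples: a bias term plus one feature; label 1 = object, 0 = background.
features = [[1.0, 2.0], [1.0, -1.5], [1.0, 1.0], [1.0, -2.0]]
labels = [1, 0, 1, 0]

weights = [0.0, 0.0]
loss_before = cross_entropy(predict(weights, features), labels)
for _ in range(50):
    weights = sgd_epoch(weights, features, labels)
loss_after = cross_entropy(predict(weights, features), labels)
```

In the tracker, the inputs would be convolutional features of sampled patches and only the fully connected layers would receive updates; the single logistic unit here merely stands in for those layers.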
The object features are extracted from the convolution layer and fed to the fully connected layer for classification. Fig. 3 shows the foreground response maps predicted from different VGG feature maps. It indicates that the shallow-layer feature (Conv4-1) focuses on object details, whereas the deeper-layer features (Conv4-2 and Conv4-3) capture semantic information.
Finally, the sample with the highest response score in each frame is regarded as the tracking result. The prediction network is interpreted as the generative network in the generative adversarial network framework, and the samples drawn from the predicted location are used to fine-tune the fully connected layers of the generative model.
The discriminative model is employed to make the generative model produce attention response maps that are robust to occlusion, deformation, background clutter, and similar challenges. In this work, the attention response map and the corresponding RGB frame of a video sequence are taken as the input of the discriminative model.
In our work, the mean squared error (MSE) is used to measure the difference between the estimated attention response map and the ground-truth map. Given an image I of dimension N = W × H, the mean squared loss can be formulated as: (see Equation 4 in the Supplemental Files)
where S denotes the ground-truth map and the other map in Equation 4 denotes the estimated attention response map.
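A minimal sketch of such a pixel-wise mean squared loss is given below. It assumes that the squared differences are averaged over all N = W × H pixels, which is the usual definition; the exact form of Equation 4 is in the Supplemental Files, and the names `mse_loss`, `pred_map` and `gt_map` are illustrative only.

```python
def mse_loss(pred_map, gt_map):
    """Mean squared error between two H x W maps stored as nested lists."""
    height = len(gt_map)
    width = len(gt_map[0])
    n = width * height  # N = W x H
    total = 0.0
    for y in range(height):
        for x in range(width):
            diff = pred_map[y][x] - gt_map[y][x]
            total += diff * diff
    return total / n

# Toy 2 x 2 maps: the estimate is close to the ground truth except at one pixel.
pred_map = [[0.9, 0.1],
            [0.4, 0.0]]
gt_map = [[1.0, 0.0],
          [0.0, 0.0]]

loss = mse_loss(pred_map, gt_map)  # (0.01 + 0.01 + 0.16 + 0.0) / 4
```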
However, the mean squared loss function focuses on pixel-level features, so the learned deep network may produce only coarse attention response maps. Therefore, training the network with an adversarial loss can further improve the tracking performance. We iteratively train G and D, and the adversarial loss function is written as: (see Equation 5 in the Supplemental Files)
where C is the input image feature, G(C) is the mask generated by the G network, and M is the actual mask identifying the discriminative features. The dot denotes the dropout operation on the feature C. As described in Eq. (5), G is used to predict a weight mask G(C) that operates on the extracted features. The mask is randomly initialized at the beginning, and each mask represents a specific type of appearance variation. Through the adversarial learning process, G gradually identifies the mask that degrades the performance of the classifier.
In each training iteration, object features of the input frames are extracted from the convolutional layers and fed into the G network to obtain the predicted mask m*. The deep features are then multiplied by m* and sent to the D network. We keep the labels unchanged and train D in a supervised manner. D is thus trained to rely on features that remain robust over a long temporal span rather than on the discriminative features of individual frames, which mitigates overfitting. G predicts different masks for different input deep features, which enables D to focus on temporally robust features without interference from the discriminative features of a single frame. To train G, given an input image, we create multiple output features by applying several random masks through the dropout operation. These diversified features are sent to D for classification, and we choose the one with the highest loss. The mask of the selected feature is the most effective at decreasing the impact of the discriminative features. We set this mask as M in Eq. (5) and update G accordingly.
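The mask-selection step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the classifier D is replaced by a hypothetical single logistic unit (`classifier_loss`), features are a short per-channel vector, and the candidate masks are a fixed set that each drop one channel (the text uses random masks; fixed ones keep the example reproducible).

```python
import math

def apply_mask(features, mask):
    """Dropout-style masking: element-wise product of features and mask."""
    return [f * m for f, m in zip(features, mask)]

def classifier_loss(features, label, weights, eps=1e-12):
    """Cross-entropy of a hypothetical logistic classifier standing in for D."""
    z = sum(w * f for w, f in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-z))
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def select_hardest_mask(features, label, weights, masks):
    """Return the mask whose masked features give the highest classifier loss."""
    losses = [classifier_loss(apply_mask(features, m), label, weights)
              for m in masks]
    hardest = max(range(len(masks)), key=lambda i: losses[i])
    return masks[hardest], losses

# Channel 0 is the dominant (most discriminative) feature of this positive sample.
features = [2.0, 0.1, 0.1, 0.1]
weights = [1.5, 0.5, 0.5, 0.5]
label = 1

# Candidate masks: each one zeroes out a single channel.
masks = [[0 if j == i else 1 for j in range(4)] for i in range(4)]

M, losses = select_hardest_mask(features, label, weights, masks)
```

In this toy case the selected mask is the one that removes channel 0, since hiding the most discriminative channel hurts the classifier most; this is the mask that would then serve as the target M when updating G.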
Finally, we combine the MSE loss with the adversarial loss to obtain more stable and faster convergence for the GAN model. The final loss function for the adversarial training can be formulated as: (see Equation 6 in the Supplemental Files)
where λ is a trade-off parameter, which we experimentally set to 1/20 in our implementation.
4.4 Spatial Attention
Attention can be learned from the training samples so that they share a common attention map. In practical scenarios, some attention maps are obtained by initializing them with a matrix of ones. However, constraining all samples and the object to share a single deep network structure is too restrictive. Therefore, we propose a spatial attention scheme to model the attention response map, as shown in Fig. 4.
The proposed attention mechanism captures general features and distinguishes the object from the background in the video. It encodes the global information of the object at a low computational cost. The output of the attention module is passed through a global pooling layer to produce a channel-wise descriptor. Three fully connected (FC) layers are then added, in which a weight is learned for each channel by a self-gating mechanism based on channel dependence. The original feature maps are then reweighted to generate the output of the attention module. Cosine similarity is used to measure the similarity between the current frame features φt(p) and the features φt-1(p) extracted from frame t-1. (see Equation 7 in the Supplemental Files)
If the features of the current frame are close to those of the previous frame, the pixel is likely to belong to the foreground object and is assigned a larger weight; otherwise, it is treated as a background pixel and assigned a smaller weight.
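The weighting rule can be illustrated with a short sketch that computes the cosine similarity between the feature vectors of the current and previous frames at each position p and uses it directly as the weight. Using the raw similarity as the weight is an assumption made here for illustration (Equation 7 is in the Supplemental Files), as are the names `cosine_similarity` and `attention_weights`.

```python
import math

def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)  # eps guards against zero vectors

def attention_weights(curr_features, prev_features):
    """One weight per position p: similarity of frame t to frame t-1 features."""
    return [cosine_similarity(c, p)
            for c, p in zip(curr_features, prev_features)]

# Two positions, three channels each (toy values).
prev_features = [[1.0, 0.0, 2.0],   # position 0 in frame t-1
                 [0.5, 0.5, 0.0]]   # position 1 in frame t-1
curr_features = [[1.1, 0.1, 1.9],   # nearly unchanged: likely foreground
                 [0.0, 0.1, 1.0]]   # changed strongly: likely background

weights = attention_weights(curr_features, prev_features)
```

Position 0 receives a weight close to 1 and position 1 a weight close to 0, which matches the rule that stable features are treated as foreground.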
4.5 Online Tracking
In this subsection, we describe how our tracker works for visual object tracking. The generative model is involved during training and removed in the tracking stage.
We first draw samples from the first ten frames of a video sequence to fine-tune the generative model online. Then, we track the object in the subsequent frames. Given an input frame, we generate multiple candidate proposals and extract their deep features. The deep features of the candidate proposals are fed into the classifier to obtain their probability scores. During the online update, we use these training samples to jointly train the generative model and the discriminative model. The object tracking result is obtained by finding the maximum response score in the attention map.
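The candidate-generation and selection step can be sketched as below. The Gaussian sampling around the previous box, its standard deviations, and the toy scoring function are assumptions for illustration; in the tracker, the score would come from the classifier applied to the deep features of each candidate.

```python
import random

def sample_candidates(prev_box, count, rng, pos_sigma=0.1, scale_sigma=0.05):
    """Draw candidate boxes (x, y, w, h) around the previous object state."""
    x, y, w, h = prev_box
    candidates = []
    for _ in range(count):
        dx = rng.gauss(0.0, pos_sigma) * w
        dy = rng.gauss(0.0, pos_sigma) * h
        scale = max(0.1, 1.0 + rng.gauss(0.0, scale_sigma))  # keep sizes positive
        candidates.append((x + dx, y + dy, w * scale, h * scale))
    return candidates

def pick_best(candidates, score_fn):
    """Return the candidate with the highest score, together with that score."""
    scores = [score_fn(c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]

# Hypothetical scorer standing in for the classifier: higher when the box
# is closer to a known toy target state.
target = (52.0, 48.0, 20.0, 30.0)

def toy_score(box):
    return -sum((a - b) ** 2 for a, b in zip(box, target))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
candidates = sample_candidates((50.0, 50.0, 20.0, 30.0), 64, rng)
best_box, best_score = pick_best(candidates, toy_score)
```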
Object appearance model updating plays a critical role in object tracking, and most trackers update their appearance model in every frame or at a fixed interval. However, this updating strategy may introduce background information into the object appearance model when the tracking result is inaccurate due to occlusion or illumination variations. In this paper, we instead update the object appearance model with the recently obtained tracking results. First, we define a fixed-length sequence L to store the tracking result of each frame. When L reaches a fixed number of elements, we update the object appearance model once. In addition, model updating is performed when the condition on the number of iterations or on the maximum value of the response map is satisfied. The tracking result with the maximum response score in L is used to update the object appearance model. Therefore, the new object appearance model is written as (see Equation 8 in the Supplemental Files).
where β is a learning parameter and set empirically; Tu is the updated object appearance model, which is represented by a linear combination of the initial object template Tf and the last updated object appearance model Tp. To alleviate the drift problem during the tracking, the initial template is incorporated into the new observation template.
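A minimal sketch of such a template update is shown below. It assumes a convex combination in which β weights the initial template Tf and (1 − β) weights the last updated model Tp; the source only states that the update is a linear combination, so the exact placement of β, the value used in the example, and the name `update_template` are assumptions.

```python
def update_template(t_first, t_prev, beta):
    """Blend the initial template with the last updated appearance model.

    Assumed form: T_u = beta * T_f + (1 - beta) * T_p, element-wise.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is assumed to lie in [0, 1]")
    return [beta * f + (1.0 - beta) * p for f, p in zip(t_first, t_prev)]

# Toy two-element templates; beta = 0.25 is an arbitrary illustrative value.
t_first = [1.0, 0.0]
t_prev = [0.0, 1.0]
t_updated = update_template(t_first, t_prev, 0.25)
```

Keeping a nonzero share of the initial template in every update is what anchors the model against drift, because the first-frame appearance is the only one known to be correct.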
To handle the scale change, we follow the approach in  and use patch pyramid with the scale factors. The proposed object tracking algorithm can be summarized as Algorithm 1.
Algorithm 1: The Proposed Tracking Algorithm
1. Input: Initial object bounding box (x0, y0, w0, h0); video frame sequence Ii (1,2,3...);
2. Output: Estimated object state (xi, yi, wi, hi);
3. Use the prediction network to get the object position of the first ten frames and fine-tune the generative adversarial network with the obtained object position;
4. Tracking process:
5. for i = 11 : end
6. Sample the object candidate states in frame Ii according to the previous object position (xi-1, yi-1, wi-1, hi-1);
7. Extract deep features of these candidates using convolution layers;
8. Feed the extracted deep features into the discriminative model for classification and predict the probability score of each candidate;
9. Estimate new object position (xi, yi, wi, hi) by the candidate with the highest score S;
10. if max(S)0 or i%10 == 0 then
11. Extract features from the object location of the successfully tracked frame;
12. Use adversarial learning to update the discriminative model by Eq.(5);