End-to-End Joint Multi-Object Detection and Tracking for Intelligent Transportation Systems

Environment perception is one of the most critical technologies of intelligent transportation systems (ITS). The motion interaction between multiple vehicles in ITS makes multi-object tracking (MOT) important. However, most existing MOT algorithms follow the tracking-by-detection framework, which separates detection and tracking into two independent stages and limits overall efficiency. Recently, a few algorithms have combined feature extraction into a single network; however, the tracking portion still relies on data association and requires complex post-processing for life-cycle management, so these methods do not combine detection and tracking efficiently. This paper presents a novel network, the global correlation network (GCNet), that realizes joint multi-object detection and tracking in an end-to-end manner for ITS. Unlike most object detection methods, GCNet introduces a global correlation layer to regress the absolute size and coordinates of bounding boxes, instead of predicting offsets. The detection and tracking pipeline of GCNet is conceptually simple and does not require complicated tracking strategies such as non-maximum suppression and data association. GCNet was evaluated on a multi-vehicle tracking dataset, UA-DETRAC, demonstrating promising performance compared to state-of-the-art detectors and trackers.


Introduction
Environment perception is one of the most critical technologies of intelligent transportation systems (ITS), because its performance has an important impact on the subsequent processes of decision making and vehicle control [1][2][3][4]. The complex motion interaction between multiple vehicles in ITS makes it important to perform multi-object tracking (MOT) from the view of both vehicles and the roadside [5,6]. MOT is a basic problem in environment perception, whose goal is to compute the trajectories of all objects of interest from consecutive frames of images. It has a wide range of application scenarios, such as autonomous driving, motion attitude analysis, and traffic monitoring. Recently, MOT has been receiving increasing attention.
Traditional MOT algorithms follow the tracking-by-detection framework, which is split into two modules: detection and tracking. With the development of object detection, these algorithms have achieved excellent performance and approximately dominate the entire MOT domain. The tracking module in a tracking-by-detection framework generally contains three parts: feature extraction, data association, and life-cycle management. Early tracking methods used simple features, such as location, shape, and velocity, to accomplish data association; however, these features have evident deficiencies. Later methods utilize appearance features, especially high-level features from deep neural networks. These appearance features can significantly improve association accuracy and robustness, but at the cost of increased computation. Currently, a few MOT algorithms integrate feature extraction into the detection module by adding a ReID head to obtain instance-level features for data association. Although these algorithms require less computation, data association is still required to perform motion prediction and to set complex tracking strategies, resulting in surplus hyperparameters and a cumbersome inference pipeline. This paper presents a novel network for end-to-end joint detection and tracking. The network realizes bounding box regression and tracking in the same manner, called global correlation. Notably, bounding box regression generally uses local features to estimate the offsets between the anchor and the ground truth, or to estimate the box size and the offset between the key point and the feature location. In this paper, the proposed framework regresses the absolute coordinates and size of the bounding box, rather than relative coordinates or offsets. However, in traditional convolutional neural networks, a local feature cannot contain global information when the receptive field is small. The self-attention mechanism allows the features of each location to
contain global information; however, its computational complexity is too large for high-resolution feature maps. Hence, this paper introduces the global correlation layer to encode global information into the features at each location. The correlation vectors generated by the global correlation layer encode the correlation between a local feature vector Q and the global feature map K. Q and K from the image of the same frame are used when performing object detection; conversely, Q from the image of the previous frame and K from the image of the current frame are used when performing object tracking. In this manner, this paper unifies detection and tracking under the same framework.
This paper performs the algorithm evaluation on a vehicle tracking dataset, UA-DETRAC, which is captured from a roadside view and can be seen as a typical application of environment perception in ITS. GCNet demonstrates competitive performance, with 74.04% average precision (AP) at 36 frame/s in detection and 19.10% PR-MOTA at 34 frame/s in tracking. Figure 1 shows some examples of tracking results. To summarize, the main contributions of this paper are as follows: (1) It proposes a novel network, GCNet, to realize end-to-end joint multi-object detection and tracking, serving both onboard and roadside perception of ITS. (2) It develops the global correlation layer of GCNet, which encodes the correlation between local feature vectors and the global feature map at low computational cost. (3) It demonstrates the competitive performance of GCNet through comparative experiments on the UA-DETRAC dataset; the results show the advantages of the proposed framework in both detection and tracking. The remainder of this paper is organized as follows. Section 2 introduces existing research related to this paper. Section 3 provides the methodology, including network components and implementation details. Section 4 presents the experiments, and Section 5 gives the conclusions.

Object Detection
With the advancements in deep learning, object detection technology has developed rapidly. Existing object detection algorithms can be divided into two categories: anchor-based [7][8][9] and anchor-free [10][11][12]. Anchor-based algorithms set a series of anchor boxes and regress the offsets between the anchor boxes and the ground truth using local features. Methods based on region convolutional neural networks (R-CNN) utilize heuristic algorithms [13] and region proposal networks (RPN) [7,14,15] to generate region proposals as anchors. Most anchor-free algorithms use fully convolutional networks to estimate the key points of targets, and further obtain the bounding boxes through the key points. These algorithms use local features for bounding box regression, such that they only obtain the offsets between the anchor boxes or key points and the ground truth, rather than absolute bounding box coordinates. The detection transformer (DETR) [12] adopts an encoder-decoder architecture based on transformers to achieve object detection. A transformer can integrate global information into the features at each position; however, its self-attention mechanism requires a considerable amount of computation and GPU memory, which makes it difficult to apply to high-resolution feature maps. In the proposed joint detection and tracking framework, the network detects objects in a single image and tracks objects across different images. However, offsets for the same object in different images are hard to define. Hence, instead of a transformer, this paper introduces a global correlation layer to embed global information into the features at each position for absolute coordinate regression, which can be applied to higher-resolution feature maps.

Tracking-by-Detection
With the improvement in detection accuracy, tracking-by-detection methods [16][17][18] have become mainstream in the field of MOT. Tracking is treated as a data association problem in tracking-by-detection frameworks. Features such as motion [19], shape [20], and appearance [21,22] are used to describe the correlation between detections and tracks, from which a correlation matrix is established. Algorithms such as the Hungarian algorithm [23], JPDA [16], and MHT [24] take the correlation matrix as input to complete data association. Although these methods have made significant progress, they have certain drawbacks. First, they do not combine the detector and tracker efficiently, and most of them need to perform feature extraction separately, which involves unnecessary computation. Second, they often rely on complicated tracking rules for life-cycle management, resulting in numerous hyperparameters and difficult tuning. In the proposed approach, detection and tracking are performed in the same manner, so they are well combined and the computation of feature extraction is reduced. Additionally, the proposed approach eliminates the complex tracking rules.

Joint Detection and Tracking
In the field of MOT, combining detection and tracking is an important research direction. With the rapid maturation of multi-task learning in deep learning, many methods use a single network to complete detection and tracking by adding ReID feature extraction to existing object detection networks [25][26][27]. Wang et al. [28] proposed the joint detection and embedding (JDE) method, which allows target detection and appearance embedding to be learned in a shared model. Bergmann et al. [29] proposed a JDT method that adopts the Faster R-CNN framework and accomplishes tracking through region of interest (RoI) pooling and bounding box regression without data association. Zhou et al. [10] take the current and previous frames, together with a heatmap rendered from tracked object centers, as inputs, and produce an offset map, which simplifies data association considerably. Peng et al. [30] converted the MOT problem into a pair-wise object detection problem and proposed the chained-tracker method, realizing end-to-end joint object detection and tracking. Similarly, this study also provides a new idea for joint detection and tracking. Compared with Trackformer [31], which formulates the MOT task as a frame-to-frame set prediction problem and proposes a tracking-by-attention network based on DETR [12], the network structure of GCNet is simpler and reaches a higher inference speed.

Methodology of Global Correlation Network
The proposed network is designed to solve the online MOT problem. At time step t, the network has obtained the object trajectories {T_1, T_2, ..., T_n} from time 0 to time t − 1, where T_i = {B_{i,j}} and B_{i,j} is the bounding box of object i at time j. Given the image of the current frame I_t ∈ R^{h×w×3}, the network assigns the bounding boxes B_{x,t} of objects in the current frame to historical trajectories, or generates new trajectories. The following sections introduce the proposed algorithm in detail.

Global Correlation Network
In this part, the global correlation layer and its application principle in end-to-end joint detection and tracking framework are introduced.Furthermore, the specific implementation of detection module and tracking module in the proposed GCNet are described.
Global correlation layer: The global correlation layer in GCNet encodes global information to generate correlation vectors, which are utilized in both the detection module and the tracking module. Given a feature map F ∈ R^{h×w×c}, two feature maps Q and K are obtained from two linear transformations:

Q_{ij} = W_q F_{ij},  K_{ij} = W_k F_{ij},   (1)

where F_{ij} ∈ R^c denotes the feature vector at the i-th row and j-th column of F. Further, for each feature vector Q_{ij}, the cosine similarity between it and every K_{mn} is calculated. Following another linear transformation Ẇ, the correlation vector C_{ij} ∈ R^{c′} is obtained:

C_{ij} = Ẇ [cos(Q_{ij}, K_{11}), cos(Q_{ij}, K_{12}), ..., cos(Q_{ij}, K_{hw})]^T,   (2)

where cos(q, k) = q·k / (‖q‖‖k‖). Each correlation vector C_{ij} encodes the correlation between the local feature vector Q_{ij} and the global feature map K, such that it can be used to regress the absolute bounding box of the object at the corresponding position in the image. All the correlation vectors C_{ij} form a correlation map C ∈ R^{h×w×c′}, from which bounding boxes B ∈ R^{h×w×4} are obtained using a convolution layer with 1 × 1 kernel size. K and Q from the image of the same frame are used when performing object detection; conversely, Q from the image of the previous frame and K from the image of the current frame are used when performing object tracking. In this manner, detection and tracking are unified under the same framework.
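The computation in Eqs. (1) and (2) can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's implementation: the weight matrices `Wq`, `Wk`, and `Wc` (standing in for W_q, W_k, and Ẇ) are assumed dense matrices, and the cosine similarities for all locations are computed with a single normalized matrix product.

```python
import numpy as np

def global_correlation(F, Wq, Wk, Wc):
    """Sketch of the global correlation layer.

    F      : (h, w, c) feature map.
    Wq, Wk : (c, c) linear maps producing Q and K (Eq. (1)).
    Wc     : (h*w, c') linear map producing correlation vectors (Eq. (2)).
    Returns the correlation map C of shape (h, w, c').
    """
    h, w, c = F.shape
    flat = F.reshape(h * w, c)
    Q = flat @ Wq                                        # (h*w, c)
    K = flat @ Wk                                        # (h*w, c)
    # Cosine similarity between every Q_ij and every K location.
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-8)
    Kn = K / (np.linalg.norm(K, axis=1, keepdims=True) + 1e-8)
    S = Qn @ Kn.T                                        # (h*w, h*w) similarities
    C = S @ Wc                                           # linear transform W'
    return C.reshape(h, w, -1)
```

For detection, `F` comes from the current frame so Q and K share the same image; for tracking, Q would instead be computed from the previous frame's features, as described above.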
Compared with the traditional self-attention layer, the global correlation layer has an advantage in computation. The computation of a traditional self-attention layer includes three parts: computing the attention weights QK^T, normalizing them with softmax, and applying them to V, for a total of about (2c + 1)h²w² operations. As shown in Eq. (2), the computation of the global correlation layer is c × (h × w) × (h × w) = ch²w², which is significantly less than the total computation of the self-attention layer.
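The operation counts above can be checked with a small back-of-the-envelope helper; the softmax term is an approximate constant, and the function names are purely illustrative.

```python
def correlation_ops(h, w, c):
    # Cosine similarities between each of the h*w query vectors (length c)
    # and all h*w key vectors: c * (h*w) * (h*w) multiply-accumulates.
    return c * (h * w) ** 2

def self_attention_ops(h, w, c):
    # Attention weights Q K^T: c*(hw)^2; weighted sum over V: c*(hw)^2;
    # softmax normalization: about (hw)^2. Total: (2c + 1) * (hw)^2.
    return (2 * c + 1) * (h * w) ** 2
```

For any realistic channel count (e.g., c = 256), the self-attention layer requires roughly (2c + 1)/c ≈ 2 times the operations of the global correlation layer at the same resolution.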
For the object classification branch, this study uses the same network structure and training strategy as CenterNet. At inference, a detection heatmap Y_d and a tracking heatmap Y_t are obtained for each frame. The detection heatmap Y_d denotes the detection confidence of the object centers in the current frame, while the tracking heatmap Y_t denotes the tracking confidence between the current and next frames. The peaks in the heatmaps correspond to the detection and tracking key points, and max-pooling is used to obtain the final bounding boxes, without applying bounding box non-maximum suppression (NMS).
The key points are the locations where the heatmap equals its local maximum, i.e., where Y = maxpool(Y, 3, 1), where maxpool(H, a, b) represents a max-pooling layer with kernel size a and stride b. Hence, GCNet can realize joint multi-object detection (MOD) and MOT with a concise pipeline, without complicated post-processing such as NMS and data association.
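The max-pooling-based peak extraction can be sketched as follows. This is a minimal NumPy sketch, assuming a 3 × 3 neighbourhood and a confidence threshold; it is not the paper's code, and the explicit loops stand in for a strided pooling layer.

```python
import numpy as np

def extract_peaks(Y, k=3, thresh=0.5):
    """Keypoint extraction by max-pooling, used in place of box NMS.

    Y : (h, w) confidence heatmap. A location is a peak if it equals the
    maximum of its k x k neighbourhood (stride 1, 'same' padding) and its
    confidence exceeds `thresh`. Returns an (n, 2) array of (row, col).
    """
    h, w = Y.shape
    pad = k // 2
    Yp = np.pad(Y, pad, constant_values=-np.inf)
    pooled = np.empty_like(Y)
    for i in range(h):
        for j in range(w):
            pooled[i, j] = Yp[i:i + k, j:j + k].max()
    return np.argwhere((Y == pooled) & (Y > thresh))
```

Because a non-peak location is suppressed by its stronger neighbour, this replaces box-level NMS with a single pooling comparison over the heatmap.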
Detection module: The detection module architecture is depicted in Figure 2 and contains three parts: backbone, classification branch, and regression branch. The backbone performs high-level feature extraction. Because the classification is identical to CenterNet, each location of the feature map corresponds to an object center point, and the resolution of the feature map crucially affects the network performance. To obtain high resolution while retaining a large receptive field, the same skip-connection structure as a feature pyramid network (FPN) is adopted; however, only the finest-level feature map F is output. The size of the feature map F is h′ × w′ × c, which is equivalent to h/8 × w/8 × c, where h and w are the height and width of the original image, respectively. This resolution is 4 times that of DETR. The classification branch is a fully convolutional network that outputs a confidence map Y_d ∈ R^{h′×w′×n} with values between 0 and 1. The peaks of the i-th channel of Y_d correspond to the centers of the objects belonging to the i-th category. The regression branch is used to calculate the bounding boxes {[x, y, h, w]_i | 1 ≤ i ≤ N}. First, F and Y_d are taken as inputs to generate three feature maps K, Q, and V.
Here, K, Q, and V are produced by convolution and gating operations over F, where Conv(F, a, b) denotes a convolution layer with kernel size a, stride b, and kernel number c, and BN denotes a batch normalization layer. Gate(X, Y), depicted in Figure 3, is a form of spatial attention. P is a position embedding with the same shape as F, constructed such that two embedding vectors that are close in position have a large cosine similarity, while two that are farther apart have a smaller cosine similarity. This attribute reduces the negative influence of similar-looking objects during tracking. Further, the correlation vectors C_{ij} between Q_{ij} and K are calculated using Eq. (2). The final bounding boxes B_{d,ij} = [x_{ij}, y_{ij}, h_{ij}, w_{ij}] can then be obtained using Eq. (6). Here, the absolute coordinates and size of the bounding box are directly regressed, which differs from most existing methods, especially anchor-based ones.
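The exact form of the position embedding P is not reproduced here, but the property described (cosine similarity decaying with spatial distance) is satisfied by a standard 2D sinusoidal embedding, which serves as one plausible illustration. The base frequency of 100 and the channel split are assumptions of this sketch, not the paper's design.

```python
import numpy as np

def position_embedding(h, w, c):
    """A 2D sinusoidal position embedding with the property the text
    describes: nearby locations have large cosine similarity, and the
    similarity decays with distance. Illustrative, not the paper's Eq. (3).
    """
    assert c % 4 == 0
    d = c // 4
    freqs = 1.0 / (100.0 ** (np.arange(d) / d))   # assumed frequency schedule
    ys = np.arange(h)[:, None] * freqs[None, :]   # (h, d) row phases
    xs = np.arange(w)[:, None] * freqs[None, :]   # (w, d) column phases
    P = np.zeros((h, w, c))
    P[:, :, 0 * d:1 * d] = np.sin(ys)[:, None, :]
    P[:, :, 1 * d:2 * d] = np.cos(ys)[:, None, :]
    P[:, :, 2 * d:3 * d] = np.sin(xs)[None, :, :]
    P[:, :, 3 * d:4 * d] = np.cos(xs)[None, :, :]
    return P

def cos_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because the sin/cos pairs satisfy sin(a)sin(b) + cos(a)cos(b) = cos(a − b), the similarity between two locations depends only on their offset and shrinks as the offset grows, which is the property used to disambiguate similar-looking objects.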
Tracking module: Tracking is the process of assigning objects in the current frame to historical tracks, or generating new tracks. The architecture of the tracking module is depicted in Figure 4. The inputs of the tracking module are: (1) the feature map K of the current frame, (2) the detection confidence map of the current frame, and (3) the feature vectors of historical tracks. The tracking module outputs a tracking confidence and a bounding box for each historical track. As can be observed, this architecture is almost identical to that of the detection module. Most of its network parameters are shared with the detection module, except for the fully connected layer used for calculating the tracking confidence (the green block in Figure 4). The tracked bounding boxes are consistent with the detected boxes in their expression, B_i = [x_i, y_i, h_i, w_i], with absolute coordinates and size. The tracking confidences indicate whether the objects are still present in the image of the current frame. The tracking module functions in an object-wise manner, such that it naturally passes the ID of each object to the next frame, similar to parallel single-object tracking.

Training
Although the proposed model can be trained end-to-end, GCNet is trained in two stages in this study. First, the detection module is trained, and then the entire network is fine-tuned. The training strategy of the classification branch is consistent with CornerNet. A heatmap Y_gt ∈ R^{h′×w′×n} with 2D Gaussian kernels is defined as

Y_gt,xyk = max_{n=1,...,N_k} exp(−((x − x_n)² / (2σ_x²) + (y − y_n)² / (2σ_y²))),   (7)

where N_k is the number of objects of class k, [x_n, y_n] is the center of object n, and the variances are relative to the object size. σ_x and σ_y are expressed as shown in Eq. (8), and η_IoU is set to 0.3.
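Rendering such a ground-truth heatmap can be sketched as follows, for a single class. The element-wise maximum for overlapping Gaussians follows the CenterNet-style convention assumed in the equation above; the size-dependent computation of σ_x and σ_y (Eq. (8)) is not reproduced and is passed in directly.

```python
import numpy as np

def render_heatmap(h, w, centers, sigmas):
    """Render a ground-truth heatmap with 2D Gaussian kernels (one class).

    centers : list of (x, y) object centers on the feature-map grid.
    sigmas  : list of (sigma_x, sigma_y) per object, size-dependent in the
              paper (Eq. (8)); supplied directly here.
    Overlapping Gaussians are combined with an element-wise maximum.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    Y = np.zeros((h, w))
    for (cx, cy), (sx, sy) in zip(centers, sigmas):
        g = np.exp(-((xs - cx) ** 2 / (2 * sx ** 2) + (ys - cy) ** 2 / (2 * sy ** 2)))
        Y = np.maximum(Y, g)
    return Y
```

Each object center receives a value of exactly 1, and the response decays smoothly around it, which the penalty-reduced focal loss below exploits.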

The classification loss is a penalty-reduced pixel-wise focal loss.
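A common form of this loss, following CornerNet/CenterNet, can be sketched as below. The hyperparameters α = 2 and β = 4 are the conventional defaults from those works and are assumed here rather than taken from this paper.

```python
import numpy as np

def focal_loss(Y_pred, Y_gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss (CornerNet/CenterNet form).

    Positive locations (Y_gt == 1) contribute (1 - p)^alpha * log(p); all
    other locations are down-weighted by (1 - Y_gt)^beta, which reduces the
    penalty near object centers. Normalized by the number of positives.
    """
    p = np.clip(Y_pred, eps, 1 - eps)
    pos = (Y_gt == 1.0)
    n_pos = max(pos.sum(), 1)
    pos_loss = ((1 - p) ** alpha * np.log(p))[pos].sum()
    neg_loss = ((1 - Y_gt) ** beta * p ** alpha * np.log(1 - p))[~pos].sum()
    return -(pos_loss + neg_loss) / n_pos
```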
The regression branch is trained using the CIoU loss. A bounding box B_{d,ij} is assigned to a ground truth if G_{ijn} > 0.3 and Σ_n G_{ijn} − max_n G_{ijn} < 0.3. Furthermore, for B_{ij} with max_n G_{ijn} = 1, the weight w_{ij} of the regression loss is set to 2, and the other weights are set to 1. This is done to enhance the precision of the bounding boxes at the center points.
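The standard CIoU loss named above can be sketched for a single box pair as follows; this is the usual formulation (1 − IoU plus a center-distance term and an aspect-ratio term), not the paper's exact implementation, and the `(x, y, h, w)` center-size box format follows the text.

```python
import numpy as np

def ciou_loss(b1, b2, eps=1e-9):
    """CIoU loss between two boxes given as (x, y, h, w), center + size.

    L = 1 - IoU + rho^2 / c^2 + alpha * v, where rho is the center distance,
    c the diagonal of the smallest enclosing box, and v penalizes aspect
    ratio mismatch.
    """
    x1, y1, h1, w1 = b1
    x2, y2, h2, w2 = b2
    # Corner coordinates (left, right, top, bottom).
    l1, r1, t1, d1 = x1 - w1 / 2, x1 + w1 / 2, y1 - h1 / 2, y1 + h1 / 2
    l2, r2, t2, d2 = x2 - w2 / 2, x2 + w2 / 2, y2 - h2 / 2, y2 + h2 / 2
    inter = max(0.0, min(r1, r2) - max(l1, l2)) * max(0.0, min(d1, d2) - max(t1, t2))
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / (union + eps)
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    cw, ch = max(r1, r2) - min(l1, l2), max(d1, d2) - min(t1, t2)
    c2 = cw ** 2 + ch ** 2 + eps
    v = (4 / np.pi ** 2) * (np.arctan(w1 / (h1 + eps)) - np.arctan(w2 / (h2 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```

Unlike a plain IoU loss, the center-distance term still provides a gradient when the predicted and ground-truth boxes do not overlap, which suits absolute-coordinate regression.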
The entire network is fine-tuned from a pretrained detection module. In this training stage, two images I_{t−i} and I_t are taken as inputs simultaneously, where i lies between 1 and 5. The loss contains two parts: the detection loss on I_{t−i} and the tracking loss between the two images. The tracking loss also comprises two terms: a regression loss and a classification loss. The tracking ground truth is determined by object ID: B_{t,ij} and Y_{t,ij} are positive if the ground-truth heatmap value at location [i, j] in I_{t−i} equals 1 and the corresponding object still exists in I_t. The total training loss is the weighted sum of these detection and tracking terms.

Inference Pipeline
The inference pipeline for joint MOD and MOT is described in Algorithm 1. The inputs of the algorithm are the consecutive frames of images I_1 − I_t. The trajectory T_i, confidence Y_i, and feature vectors [V_i, Q_i] of all tracks and candidates are recorded in four collections: T, O, Y, and C. At each time step, object detection is performed on the current frame I, and the existing tracks T and candidates C are tracked. The tracking confidences are used to update all confidences in the sets Y and C via Y_i = min(2 × Y_i × Y_{t,i}, 1.5). Tracks and candidates with a confidence lower than p_2 are deleted, and the other trajectories, candidates, and corresponding features are updated. This update strategy gives reliably tracked objects a certain trust margin, as their confidence can exceed 1. Detections with an IoU (with an existing track) greater than p_3, or a confidence less than p_2, are ignored. Among the remaining detections, those with a detection confidence greater than p_1 are used to generate new tracks, and the rest are added to the candidate set C. The entire detection and tracking process can be performed in sparse mode, such that the overall computational complexity of the algorithm is extremely low.
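The confidence update and pruning step can be sketched as a small helper. The update rule Y_i = min(2 × Y_i × Y_{t,i}, 1.5) and the threshold p_2 come from the text; the dict-based track store is an illustrative assumption, not the paper's data structure.

```python
def update_confidences(tracks, track_conf, p2=0.3):
    """Confidence update and pruning step from the inference pipeline.

    tracks     : dict id -> confidence Y_i carried over from previous frames.
    track_conf : dict id -> tracking confidence Y_t,i for the current frame.
    Confidences are updated as Y_i = min(2 * Y_i * Y_t,i, 1.5), giving
    reliably tracked objects a trust margin (values may exceed 1); any track
    whose updated confidence falls below p2 is deleted.
    """
    updated = {}
    for tid, y in tracks.items():
        y_new = min(2.0 * y * track_conf.get(tid, 0.0), 1.5)
        if y_new >= p2:
            updated[tid] = y_new
    return updated
```

A confidently tracked object (Y_t,i close to 1) roughly doubles its confidence up to the 1.5 cap, so it can survive a few weak frames before dropping below p_2 and being removed.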

Experiments of the Algorithm
In this section, experiments are carried out to validate the performance of GCNet. A comparison and an ablation study are conducted, and the results indicate the advantages of the proposed method.

Benchmark and Implementation Details
The experiments of this study are conducted on the vehicle detection and tracking dataset UA-DETRAC, which is captured from a roadside view and can be seen as a typical application of environment perception in ITS. This dataset contains 100 sequences; 60 are used for training, and the remaining 40 for testing. The training and test sets are derived from different traffic scenarios, which makes the test more difficult. The UA-DETRAC benchmark employs AP to rank the performance of detectors, and PR-MOTA, PR-MOTP, PR-MT, PR-ML, PR-IDS, PR-FM, PR-FP, and PR-FN scores for tracking evaluation. Refer to Ref. [32] for further details on the metrics.
All experiments are performed using TensorFlow 2.0. The proposed model is trained with Adam on the complete training dataset of UA-DETRAC. The size of the input images is 512 × 896. Three commonly used data augmentation methods are employed: random horizontal flip, random brightness adjustment, and scale adjustment. The hyperparameters p_1, p_2, and p_3 for inference are set to 0.5, 0.3, and 0.5, respectively.

Ablation Study
In the proposed joint detection and tracking framework, three main components influence the performance: (1) the gate by the confidence map Y_d; (2) the concatenated feature vector in V for bounding box regression; and (3) the specially designed position embedding P. To demonstrate the effectiveness of these components, three ablated models are compared with the full GCNet. Table 1 shows the results of the comparison. The full version of GCNet exhibits the best performance, with 74.04% AP on UA-DETRAC. The gate and the feature vector of V each contribute about 2% AP. The gate step explicitly merges the classification result into the regression branch, which plays the role of spatial attention and is conducive to the training of the regression branch. The concatenated feature vectors of V for regression introduce more texture and local information, which is not included in the correlation vectors; this information is beneficial for inferring the size of the objects. To demonstrate the role of the position embedding, it is replaced with a plain explicit position embedding, where P_{ijk} equals i when 0 ≤ k < c/2, and equals j when c/2 ≤ k < c. Notably, the specially designed position embedding attains a 5.80% increase in AP.
The ablation study is conducted only on the detection benchmark.This is because the tracking module shares most of its parameters with the detection module, and the tracking performance is highly correlated with the detection performance.The results of the ablation study can thus be extended to the tracking module.

Benchmark Evaluation
Table 2 shows the results on the UA-DETRAC detection benchmark. GCNet demonstrates promising performance and outperforms most detection algorithms on this benchmark. It attains a high AP at full and medium difficulty as well as on the night and rainy images of the test set. Figure 5 shows the PR curves of GCNet and the other algorithms reported on the UA-DETRAC dataset. It can be observed that the proposed model is far more effective than the baselines in each scenario. Notably, the proposed model does not employ any additional components for better precision, and the backbone network is the original version of ResNet50. Compared with other methods, the performance improvement of GCNet benefits from its global correlation mechanism. Complex traffic scenarios contain many non-critical areas, such as trees and buildings, as well as many traffic participants with similar appearances. When correlation convolution is used for object detection, the correlation between different objects decreases as the distance increases, which effectively reduces false and missed detections. When only the detection module of GCNet is used, it runs at 36 frame/s on a single Nvidia 2080Ti. GCNet is designed for both MOD and MOT; this is the real purpose of introducing the global correlation layer to regress absolute coordinates. The tracking results are shown in Table 3. The MOT metrics prefixed with "PR-" evaluate the overall effect of detection and tracking. EB and KIoU are the UA-DETRAC challenge winners. In multi-object tracking, the pixel-coordinate distance of the same target between consecutive frames is generally small. Benefiting from the position embedding and global correlation, the proposed method implicitly encodes the spatiotemporal motion of the tracked targets, which improves the matching accuracy between trajectories in the previous frames and detection results in the current frame. Additionally, a significant
PR-MOTA score and an excellent PR-MOTP score are obtained, approximately twice as high as those of EB and KIoU combined. Moreover, leading scores are obtained in PR-ML and PR-FN on the UA-DETRAC tracking benchmark. Because the detection and tracking modules share most of their features, the computation of the entire joint detection and tracking pipeline is approximately the same as that of detection alone, and it achieves a speed of approximately 34 frame/s.

Conclusions
This paper proposes a novel joint MOD and MOT network called GCNet. A global correlation layer is introduced to achieve absolute coordinate and size regression, which performs object detection on a single image and naturally propagates the IDs of objects to subsequent consecutive frames. Compared to existing tracking-by-detection methods, GCNet computes object trajectories end-to-end without bounding box NMS, data association, or other complex tracking strategies. The proposed method is evaluated on UA-DETRAC, a vehicle detection and tracking dataset. The results of the experiments indicate that: (1) the proposed approach outperforms the existing methods in both detection and tracking; and (2) the approach runs at 36 frame/s for detection and 34 frame/s for joint detection and tracking, thereby meeting the real-time requirements of most application scenarios, such as onboard environment perception of autonomous vehicles and roadside perception of ITS.


Figure 1 Examples of tracking results on the UA-DETRAC dataset

Figure 2 Detection module architecture
Figure 3 Structure of the Gate operation
Figure 4 Tracking module architecture

Table 1 Ablation study results. Bold values indicate the best score for each item

Table 2 Results on the UA-DETRAC detection benchmark. Bold values indicate the best score for each item