Cross-domain learning using optimized pseudo labels: toward adaptive car detection in different weather conditions and urban cities

Convolutional neural network-based object detection usually assumes that training and test data share the same distribution, which does not always hold in real-world applications. In autonomous vehicles, the driving scene (target domain) consists of unconstrained road environments that cannot all be observed in the training data (source domain), leading to a sharp drop in detector accuracy. In this paper, we propose a domain adaptation framework based on pseudo-labels to address this domain shift. First, pseudo-labels for the target domain images are generated by the baseline detector (BD) and refined by our data optimization module to correct errors. Then, the hard samples in each image are labeled based on the optimization results of the pseudo-labels. An adaptive sampling module is proposed to sample target domain data according to the number of hard samples per image, selecting more effective data. Finally, a modified knowledge distillation loss is applied in the retraining module, and we investigate two ways of assigning soft labels to the training examples from the target domain to retrain the detector. We evaluate the average precision of our approach on various source/target domain pairs and demonstrate that the framework improves the average precision of the BD by over 10% in multiple domain adaptation scenarios on the Cityscapes, KITTI, and Apollo datasets.


Introduction
Research on autonomous vehicles (AVs) has developed rapidly in recent years. However, as explored in our previous work [40,41,45], adaptability across different cities is becoming a major challenge for the development of AVs. Detection is a critical technology in AVs, providing information about the driving scene ahead [40,42,43,44]. Likewise, domain shift cannot be neglected when a detector is applied to different weather conditions or different types of urban scenes. Although much research is devoted to resolving domain shift, some challenges remain for AVs: (1) domain adaptation methods or frameworks should take into account the real-time detection requirements of AVs; (2) it is difficult to select the samples that have the largest impact on domain adaptation from the massive driving scene (target domain) data (Fig. 1).
CNN-based object detection has made remarkable progress in recent years. RCNN [9] and Fast RCNN [8] use CNNs for feature extraction and classification. In Faster RCNN (FRCNN) [31], the region proposal network (RPN) was introduced to refine the position and size of pre-defined anchor boxes. In SSD [26], anchor boxes with different aspect ratios are used to classify feature maps and predict bounding boxes at different scales. YOLO [29] divides the image into grids and predicts the bounding boxes and class labels of objects centered on each grid cell. CASNet [15] uses a cross-attention Siamese model to improve video salient object detection. Both FRCNN and YOLOv3 [30] are widely used today. However, as discussed in [4,21], the performance of these algorithms degrades when they are deployed in diverse scenarios.
In the initial phase of domain adaptation research, methods were based on shallow architectures [10,14], mainly using feature-based and instance-based techniques to align the domain distributions. [5] proposed a subspace alignment technique to directly reduce the discrepancy between domains, [28] used the maximum mean discrepancy criterion to learn a domain-invariant feature transformation, [36] attempted to discover a common low-rank subspace between domains so that target samples can be reconstructed from source samples, and [11] estimated the weights between domains by kernel mean matching to reduce the disparity. However, compared with deep architectures, the improvement brought by these methods is very limited.
In recent years, the development of deep learning has drawn researchers' attention to deep domain adaptation [16,39]. This research includes fine-tuning supervised models in the target domain and unsupervised cross-domain representation learning. For example, [17,32] adapt to the target domain by using noisy labels and robust learning methods, while [2,12,47,48] explore feature differences and use adversarial training for adaptation. Some methods reduce feature differences by reconstructing source domain images, such as image-to-image translation [1,23]. Others combine image reconstruction and domain-discrepancy techniques, for example using image translation and pseudo-labels to fine-tune the model [13], adapting at the pixel and feature levels [35], or using weak self-training and adversarial background score regularization to reduce the negative impact of pseudo-labels and domain feature shift [18]. These algorithms are based on FRCNN, which considers detection accuracy but not efficiency; consequently, they are not suitable for AVs that require high detection speed.
In this paper, we address domain shift through pseudo-labels and robust training. We divide domain adaptation into a real-time inference phase and an optimization and retraining phase. In the real-time inference phase, the BD mounted on the AV needs fast detection speed and high detection accuracy. YOLOv3 is a widely used generic detector satisfying these conditions [25], so we choose it as the BD. In principle, the more recent YOLOv4 and YOLOv5 and other detectors can also easily serve as the BD. We observed that although the BD performs poorly in the target domain, it can still be used for low-accuracy detection, and the resulting low-accuracy pseudo-labels are saved in this phase. In the optimization and retraining phase, we correct the pseudo-labels by combining them with the results of our designed detector, then retrain the BD on the optimized data mixed with the source domain data. In addition, we propose a more robust training method to accommodate pseudo-label noise. Since the car is one of the main participants in urban traffic, this paper verifies the effectiveness of our framework through car detection in different driving scenarios, but the general framework can, in principle, also be extended to the detection of other objects (e.g., pedestrians and traffic lights).
The contributions of this paper are summarized as follows: (1) A pseudo-label optimization method is proposed to correct the low-accuracy results generated by the BD. (2) An adaptive sampling method is introduced to extract more effective data, which significantly improves the adaptation of the detector to the target domain. (3) A modified knowledge distillation loss and a label smoothing method are applied to reduce the adverse effects of label noise in the retraining stage.
The remainder of this paper is organized as follows: the details of our method are explained in Sect. 2; Sect. 3 describes the datasets and the results of various cross-domain experiments; conclusions and future work are summarized in Sect. 4.

Domain adaptation framework
We denote the training data space as the source domain (S) and the test data space as the target domain (T). Figure 2 shows our domain adaptation framework. The BD is initialized on source domain data (D_S) and applied on T, generating pseudo-labels. To make use of these noisy data, we designed the following modules: 1. Pseudo data optimization module: corrects the errors of the pseudo-labels. We developed an FRCNN-based module to correct three common errors in object detection: (1) inaccurate bounding box location and size; (2) mistaken object labels (false positives); (3) missed objects (false negatives). 2. Data sampling module: counts the number of hard samples in each image from T and adaptively samples the data according to the hard-sample count, so as to preserve and utilize the data that has the greatest impact on the detector. 3. Retraining module: employs a modified distillation loss to retrain the detector. We propose two different ways of assigning soft labels to reduce the effect of noisy labels.

Low accuracy data generated by BD
We assume the vehicle-mounted BD is initialized on D_S, and the environment captured by the on-board camera is T. When the confidence of an object in an image is greater than the threshold, the detection is saved as a pseudo-label.
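As a minimal sketch, the pseudo-label saving step above amounts to a confidence filter over the BD's detections (the detection tuple format, the function name, and the default threshold of 0.5 are illustrative assumptions):

```python
def save_pseudo_labels(detections, threshold=0.5):
    """Keep BD detections with confidence >= threshold as pseudo-labels.

    detections: list of (box, score) tuples, box = (x1, y1, x2, y2).
    Returns the retained (box, score) pairs.
    """
    return [(box, score) for box, score in detections if score >= threshold]
```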

Module structure
The pseudo data optimization module, trained on D_S, is proposed to reduce the errors contained in the pseudo-labels. The module is based on FRCNN and adopts LocNet [7] as the localization method. In the pseudo-label optimization phase, the proposal region is replaced by the pseudo-label region. Figure 3 shows the details of the module structure.
LocNet-based localization method In this part, In-Out probabilities are adopted as the conditional probabilities. In stage 1, taking the center point of the proposal as the origin, the proposal area is enlarged by factors S_x and S_y along the X-axis and Y-axis to generate the search region. The search region is then divided into M horizontal regions (rows) and M vertical regions (columns). The vectors p_x = {p_x(i)}_{i=1..M} and p_y = {p_y(i)}_{i=1..M} represent the conditional probabilities of each column and row of the search region being inside the bounding box; a row or column is considered to be within the box if at least part of its area lies inside the bounding box. Denoting the ground truth bounding box as g = (g_l, g_t, g_r, g_b), where (g_l, g_t) is the left-top and (g_r, g_b) is the right-bottom coordinates, the object candidate probability vector T = {T_x, T_y} can be defined as

T_x(i) = 1 if g_l <= i <= g_r, else 0,
T_y(i) = 1 if g_t <= i <= g_b, else 0,  i = 1, ..., M.

In stage 2, the input search region is projected onto the feature map of the conv5_3 layer of VGG16 [37] for ROI pooling, and then passed through a convolutional layer that performs max pooling along the X-axis and Y-axis, respectively. The prediction vector p = {p_x, p_y} predicts the In-Out probabilities. The location loss is replaced by

L^D_loc(p, T) = - Σ_{a ∈ {x, y}} Σ_{i=1..M} [ T_a(i) log p_a(i) + (1 - T_a(i)) log(1 - p_a(i)) ],

a binary cross-entropy over the In-Out probabilities.
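The In-Out target vectors and the location loss above can be sketched in pure Python as follows (the 0-based indexing convention for rows/columns is an assumption for illustration):

```python
import math

def inout_targets(box_extent, M):
    """Target vector T over M columns (or rows) of the search region:
    T(i) = 1 if column i overlaps the ground-truth box extent, else 0.
    box_extent = (first, last) column indices covered by the box (0-based)."""
    first, last = box_extent
    return [1.0 if first <= i <= last else 0.0 for i in range(M)]

def location_loss(p, T, eps=1e-7):
    """Binary cross-entropy over the In-Out probabilities:
    L = -sum_i [ T(i) log p(i) + (1 - T(i)) log(1 - p(i)) ]."""
    return -sum(t * math.log(max(q, eps)) + (1 - t) * math.log(max(1 - q, eps))
                for q, t in zip(p, T))
```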

Pseudo label refinement
Bounding box refinement As shown in Fig. 3, in the optimization phase we replace the proposal with the pseudo-label region in stage 1 to refine the bounding box. We denote by d_D the confidence of the proposal in stage 2. The box from the LocNet-based localization method is applied as the new bounding box when d_D ≥ θ.
False positive (FP) correction An area in the image may be incorrectly identified as a true sample (e.g., a building mislabeled as a car). For each pseudo-label region to be optimized, we define a score evaluation function d_optim from the score d_pseudo predicted by the BD and the weight parameters a and b. We estimate whether the area is an FP area through the value of d_optim; the details are given in Sect. 2.3.

False negative (FN) correction Some objects in the image may be missed by the BD. We adopt a model pre-trained on the COCO dataset [24] to mine these samples. ResNeXt-based [46] FRCNN is an ideal choice, being easily available and widely used; although its detection speed is slow, real-time inference is not required when optimizing pseudo-labels. We therefore employ as the FN model an FRCNN with ResNeXt-101 as the backbone, pre-trained on COCO. Denoting the confidence of the FN model by d_N, we add a region to our set of pseudo-labels when d_N ≥ θ.

Fig. 3 Network structure of the pseudo data optimization module. In the training phase, the RPN generates a rectangular proposal area, which is then expanded by S_x and S_y along the X-axis and Y-axis of the image to generate a search region for the next stage. In the optimization phase, to refine the bounding box, the proposal is replaced by the pseudo-label region, which is then expanded into the search region.
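A rough sketch of the FP/FN correction logic described above follows. The weighted-sum form of d_optim (alpha * d_pseudo + beta * d_D, clipped to 1) is an assumption for illustration, since the exact form of the score evaluation function is not reproduced in this text; all names are hypothetical:

```python
def refine_pseudo_labels(bd_dets, frcnn_scores, fn_dets, theta=0.5,
                         alpha=0.5, beta=0.5):
    """Sketch of FP/FN correction under stated assumptions.

    - FP correction: d_optim is assumed here to be alpha*d_pseudo +
      beta*d_D, clipped to 1; regions with d_optim < theta are dropped.
    - FN correction: regions found by the COCO-pretrained FN model with
      confidence d_N >= theta are added.
    """
    kept = []
    for (box, d_pseudo), d_D in zip(bd_dets, frcnn_scores):
        d_optim = min(alpha * d_pseudo + beta * d_D, 1.0)
        if d_optim >= theta:          # otherwise treated as a false positive
            kept.append((box, d_optim))
    # add missed objects recovered by the FN model
    kept.extend((box, d_N) for box, d_N in fn_dets if d_N >= theta)
    return kept
```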

Retraining module
The target domain data (D_T) still contain noise after the pseudo data optimization module. To accommodate this noise when retraining the BD, we build a more robust retraining module.

Training on pseudo labels
When retraining the BD with D_T and D_S, the i-th training sample X_i is considered a positive sample and labeled 1 if X_i comes from the ground truth of S or from the optimized pseudo-labels of Sect. 2.2. The label y_i is defined as

y_i = 1 if X_i is a positive sample, and y_i = 0 otherwise.

The binary cross-entropy loss is used to retrain the classification branch. For the i-th training sample, the loss is

L_cls(X_i) = -[ y_i log p_i + (1 - y_i) log(1 - p_i) ],

where the label y_i ∈ {0, 1} and p_i ∈ [0, 1] is the model's predicted posterior.

Pseudo label smoothing
Previous research [22] has shown that label smoothing can mitigate the adverse effects of noisy labels. Therefore, we further smooth the pseudo-labels. The positive labels from T consist of two parts: (1) the pseudo-label regions from the BD; (2) the regions from FN correction. We exploit two different ways to generate soft labels.

Soft labels by cross-domain score remapping (remap) Following the practice in [32], we assume that the soft score s̃_i in T has the same distribution as the scores in S. Let the score distribution of the BD on T have p.d.f. f(x) and the score distribution on S have p.d.f. g(x). Their cumulative distribution functions are F(x) = ∫_0^x f(t) dt and G(x) = ∫_0^x g(r) dr, respectively. We use a histogram to remap the score distribution of T to match that of S, i.e., each target domain score x is replaced with G^{-1}(F(x)), and the scores are mapped by linear interpolation. For the regions from FN correction, the soft score is the confidence threshold θ. The soft label after remapping is

s̃_i = d_remap for pseudo-label regions from the BD, and s̃_i = θ for regions from FN correction,

where d_remap is the score after remapping.
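The cross-domain score remapping x → G^{-1}(F(x)) with linear interpolation can be sketched using empirical CDFs as follows (the exact histogram binning used in the paper is an assumption; empirical quantiles stand in for the histograms here):

```python
import numpy as np

def remap_scores(target_scores, source_scores):
    """Remap target-domain scores so their distribution matches the
    source-domain one: x -> G^{-1}(F(x)), where F and G are the empirical
    CDFs of the target and source scores. Linear interpolation is applied
    between the empirical source quantiles."""
    target_scores = np.asarray(target_scores, dtype=float)
    tgt_sorted = np.sort(target_scores)
    src_sorted = np.sort(source_scores)
    # F(x): empirical CDF value of each target score within T
    F = np.searchsorted(tgt_sorted, target_scores, side="right") / len(tgt_sorted)
    # G^{-1}(F): interpolate into the sorted source scores
    quantiles = np.linspace(1.0 / len(src_sorted), 1.0, len(src_sorted))
    return np.interp(F, quantiles, src_sorted)
```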
Soft labels with optimization (optim) In FN correction, the soft score d^FN_optim is set to the confidence threshold θ. After FP correction, the pseudo-label regions from the BD are verified in FN correction again. The soft labels with optimization are then defined piecewise: s̃_i = d^FP_optim for pseudo-label regions from the BD that pass verification; s̃_i = d^FN_optim = θ for regions from FN correction; and s̃_i = 0 if d^FP_optim = θ and d_N < θ, i.e., if X_i comes only from the BD and d_pseudo < θ_h. The distillation loss for X_i ∈ {S, T} after label smoothing is then

L_dist(X_i) = -[ s̃_i log p_i + (1 - s̃_i) log(1 - p_i) ].
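Assuming the modified distillation loss takes the soft-label binary cross-entropy form given above, a minimal sketch is:

```python
import math

def distillation_loss(p, s_soft, eps=1e-7):
    """Soft-label binary cross-entropy for retraining:
    L = -[ s~ log p + (1 - s~) log(1 - p) ],
    where s~ is the smoothed soft label and p the predicted posterior."""
    p = min(max(p, eps), 1 - eps)  # clamp for numerical stability
    return -(s_soft * math.log(p) + (1 - s_soft) * math.log(1 - p))
```

With s_soft ∈ {0, 1} this reduces to the hard-label cross-entropy of the previous subsection, which is the design intent of label smoothing here: a confident pseudo-label contributes like a hard label, while an uncertain one contributes a weaker gradient.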

Data sampling module
Large amounts of data are generated when a car is driving.
More data may bring greater improvements, but it also consumes more computing resources. Hence, to collect more effective data from T, we adopt and improve the adaptive sampling method of [27]. We define a soft-label region that is not detected by the BD as a suspected FN region, and a region detected only by the BD as a suspected FP region; the regions among them are labeled as hard samples (HS). The images are arranged in descending order of the number of HS per image. For the ordered total sample of K images, our sampling number is s, and we have a sampling adaptation parameter l; when l = 1 the method reduces to hard sampling. For the adaptive sampling method, random sampling is applied to select s samples from the data in the sampling area (1, ls), i.e., the top ls images of the ordering. We apply random sampling on S and adaptive sampling on T to select samples; the details are given in Sect. 3.3. These samples compose the mixed domain used as the retraining data (Fig. 4).
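The adaptive sampling procedure can be sketched as follows (function and argument names are illustrative; the sampling adaptation parameter l is written as mu here to avoid an ambiguous variable name):

```python
import random

def adaptive_sample(hs_counts, s, mu, seed=0):
    """Adaptive sampling sketch: sort images by hard-sample (HS) count in
    descending order, then draw s images uniformly at random from the top
    mu*s images of that ordering. With mu = 1 this reduces to pure hard
    sampling (the top-s images).

    hs_counts: HS count per image; returns the sampled image indices."""
    order = sorted(range(len(hs_counts)), key=lambda i: -hs_counts[i])
    pool = order[:min(mu * s, len(order))]
    rng = random.Random(seed)
    return rng.sample(pool, s)
```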

Results and discussion
In this section, we use YOLOv3 trained on D_S as the BD and test it on T; thus the BD has no ground truth information about T. We evaluated our method in experiments on cross-weather adaptation, cross-platform adaptation, adaptation to a city with more complex scenes, and adaptation to a local urban environment. Remap and optim denote soft labels by cross-domain score remapping and soft labels with optimization, respectively. To verify the effect of pseudo-label smoothing, we also record the results obtained without it (denoted as original). Note that there is no FP correction for the remap and original labels. The results on the validation sets are shown in the tables. Additionally, the results of training directly on T (oracle) are reported to illustrate the gap between the domains. The experiments were run on an Nvidia Titan XP with 12 GB of memory.

YOLOv3
In all experiments, we train the detector for 150 epochs with Darknet-53 pre-trained on ImageNet [20]. We resize the image's largest side to 416 pixels during both training and inference. We report the average precision (AP) at a threshold of 0.5 for evaluation. All other settings follow the original YOLOv3.

Pseudo data optimization module
The shorter side of the image is resized to 600 pixels during training and inference. Following the practice in [7], S_x and S_y are set to 1.2, expanding the length and width of the area by a factor of 1.2, respectively. M is set to 28, and the threshold for p_x and p_y is 0.5. The stochastic gradient descent (SGD) optimization algorithm is used during training, with a learning rate of 0.0001 and weight decay of 0.00005. Training is stopped after 25 epochs.

Retraining module
Through small-scale test experiments, we determined the following parameter values. We set the confidence threshold θ = 0.5 and the high confidence threshold θ_h = 0.7. In Eq. (4), we also set the weight parameters a and b; when d_optim > 1, d_optim is clipped to 1, so the range of d_optim is [0.5, 1] (Fig. 5).

Domain adaptation for detection
In this section, we evaluate our method through the following experiments: (1) cross weather adaptation; (2) cross platform adaptation; (3) cross cities adaptation; (4) adaptation to local urban. In Fig. 6, we show the improvement of our method over the BD and the remaining gap to the ground truth.

Cross weather adaptation
Changes in weather conditions significantly affect detection accuracy, and a basic requirement for an AV is that the detector maintains high accuracy in all weather conditions. Considering the cost of annotation, it is not possible to collect data for every weather condition; models must therefore adapt to various weather conditions. In the weather adaptation experiment, we evaluate our method on Cityscapes and Foggy Cityscapes, using Cityscapes as S and Foggy Cityscapes as T (denoted ''Cityscapes → Foggy Cityscapes''). We use adaptive sampling to select s = 2500 images from the pseudo-label set of 8925 images (the three fog levels of the training set), with sampling adaptation parameter l = 2. At the same time, we use random sampling to select the same number of images from Cityscapes. Table 1 compares our method with the baseline and other methods; the AP of the car class is reported. Our results show a large improvement on the target domain validation set: the remap label improves AP by 21.3% and the optim label by 23.3% compared with the BD. Moreover, with label smoothing, the optim label improves AP by 5.3% compared with no label smoothing. The AP of our results is very close to the oracle result.

Cross platform adaptation
Datasets captured by different platforms have different characteristics, such as scenes and viewpoints; camera settings and frame sizes also influence visual appearance and image quality. These discrepancies are causes of domain shift. In cross platform adaptation, we report results of adaptation across different cameras. The KITTI dataset is used as S and Cityscapes as T (denoted ''KITTI → Cityscapes''); the two datasets differ greatly in image size, scene, and perspective. In this experiment, we sample s = 2000 images from T with sampling adaptation parameter l = 2 by adaptive sampling, and the same number of images are drawn by random sampling from S. Table 2 shows the results of cross platform adaptation. Compared with the BD, the remap label improves AP by 34.4% and the optim label by 32.7%; label smoothing contributes up to 1.2% AP. The results are all close to the oracle result, and our method also outperforms the FRCNN-based method.

Cross cities adaptation
In the above two experiments, the traffic scenes of the KITTI and Cityscapes datasets are relatively simple. In many real-world scenes, however, a car faces much more crowded and complicated traffic conditions. We therefore conducted the cross cities adaptation experiment to verify the effectiveness of our method in more complex scenarios. Since ApolloCar3D contains much more complicated city environments than Cityscapes, we use Cityscapes as S and ApolloCar3D as T (denoted ''Cityscapes → Apollo''). There are significant differences in image size, traffic complexity, and perspective between the two datasets. In this experiment, we apply the same sampling method and sample the same number of images as in the cross-platform adaptation. Table 3 reports the results. Our method still performs well in these more complex and crowded scenes: the remap label improves AP by 10.6% and the optim label by 14.4% on the validation set compared with the BD. The optim label also outperforms the original label by 3.2% AP.

Adaptation to local urban
Researchers usually use random sampling to create datasets from video data. However, the samples obtained by random sampling are not necessarily the most effective for improving the detector. This experiment explores the effect of the data sampling module. In this section, LUDV is used as T and Cityscapes as S. In LUDV, 19719 images with pseudo-labels were saved by the BD; the validation set is randomly sampled from them, and the remaining images are used as D_T. All 2975 images in D_S and the same number of images from D_T compose the mixed domain. As shown in Table 4, the samples from D_T are collected by different sampling methods. Note that this experiment only uses the optim label, because the other labels have no FP correction. Moreover, Fig. 7 shows the label statistics for this experiment, including the number of hard samples (HS) and the other samples (ES).
As shown in Fig. 7, for the same number of sampled images, hard sampling yields the largest number of labels, with 20038 HS and 45978 ES, and the highest HS proportion (30.4%). However, most of the collected samples come from similar scenes, so its result (54.6%) is lower than that of random sampling (54.8%). Balancing label amounts and scene diversity, adaptive sampling performs better than random sampling: its result (55.9%) is 1.1% higher than random sampling when l = 3.
In addition, using all of D_T gives the best result (59.0%), but this is impractical for much larger volumes of video data given the computing resources required. Table 4 also shows the precision (P) and recall (R) of each experiment to explore the impact of the different sampling methods. The table shows that, for the same sample size, the number of HS is negatively correlated with precision and positively correlated with recall. A possible reason is that the label noise in HS limits the improvement in precision, while the rich scene information improves recall. This is also a direction for our further research and improvement.

Influence of IoU threshold
The intersection over union (IoU) threshold controls which predicted bounding boxes are accepted and thus also affects the test results. In the previous experiments, the threshold was set to 0.5. In this section, we vary the IoU threshold at test time to study its influence. We found that the IoU threshold greatly affects the results when the camera parameters differ substantially, so we chose KITTI and Cityscapes for this experiment. Figure 8 shows the results. As the IoU threshold increases, the AP of the car class drops for all models, because a higher IoU threshold filters out many less accurate bounding boxes, decreasing precision and recall. From Fig. 8, the AP of the baseline is close to 0 when the IoU threshold reaches 0.8, meaning the baseline model can hardly locate objects accurately. After adaptation, our methods (remap label and optim label) still maintain good performance at a threshold of 0.8. Figure 8 also shows that our methods are stable under different IoU thresholds, with the gap to the oracle kept within a limited range. This demonstrates the effectiveness of our pseudo data optimization module.
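For reference, the IoU used for this thresholding is the standard ratio of overlap area to union area for axis-aligned boxes (a generic implementation, not the paper's code):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A prediction counts as correct only if its IoU with a ground-truth box exceeds the chosen threshold, which is why raising the threshold from 0.5 to 0.8 removes loosely localized detections and lowers AP.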

Conclusions
To improve the adaptability of vehicle-mounted object detectors in various urban environments, this article employs a pseudo-label-based domain adaptation method to retrain the detector. With our method, the detection model can adapt to the target domain without annotations. In the real-time inference phase, the BD generates low-accuracy pseudo-labels; in the optimization and retraining phase, the pseudo-labels are corrected and used to retrain the detector. We evaluated our method through weather adaptation, cross-platform adaptation, and cross cities adaptation experiments. In weather adaptation and cross-platform adaptation, the maximum improvements of our method over the BD are 23.3% AP and 34.4% AP, respectively. We compared our results, which are close to the oracle, with other FRCNN-based methods; the adapted detector achieves higher accuracy while maintaining detection speed. In the cross cities adaptation experiment, our method achieves a maximum improvement of 14.4% AP over the BD. Moreover, we explored the effect of each module through several designed experiments. Future work will focus on three directions: (1) applying the framework to a system-on-chip and deploying it on a real vehicle; (2) transferring pseudo-label data via 5G, so that cloud computing can be used to optimize the pseudo-labels and retrain the detector; (3) employing more advanced algorithms to explore the gap between domains, reinforcement learning (RL) being a promising way to minimize the domain gap.