Dim Target Detection Method Based on Deep Learning in Complex Traffic Environment

Although current vehicle detection and recognition frameworks based on deep learning have their own characteristics and advantages, it is difficult for them to effectively combine multi-scale and multi-category vehicle features, and there is still room for improvement in vehicle detection and recognition performance. Based on this, an improved Faster R-CNN convolutional neural network is proposed to detect dim targets in complex traffic environments. The deep learning model of the Faster R-CNN convolutional neural network is introduced into the image recognition of complex traffic environments, and a structure optimization method is proposed, which replaces VGG16 in Faster R-CNN with ResNet to make it suitable for small target recognition in complex backgrounds. Max pooling is used as the down-sampling method, and a feature pyramid network is introduced into the RPN to generate target candidate boxes, which optimizes the structure of the convolutional neural network. After training with 1497 images, complex traffic environment images are identified and tested. The results show that the accuracy of the proposed method is better than that of the other comparison methods, and the highest accuracy is 94.7%.

time, the precise detection and recognition of various targets by target detection and recognition technology also laid the foundation for the development of video surveillance, unmanned driving, scene semantic understanding, Internet mobile terminals, image retrieval and other fields [3][4][5].
When target detection and recognition technology is used in the transportation field, it can detect and recognize the various types of targets that are common in road traffic scenes, and it can accurately and promptly determine the route of the vehicle ahead. This not only improves the intelligence of traffic, but also supports the safety of intelligent driving [6,7]. However, in actual situations, due to the complex road conditions and the large number of targets, there are still some practical problems: (1) There are many similar targets. In the actual detection scene, when the shooting angle is fixed, light and non-rigid deformation produce many targets with small intra-class similarity and large inter-class similarity. This easily interferes with detection and reduces its accuracy [8,9]. (2) There is much redundant information, and effective information is insufficiently used. In the target detection and recognition of 2D images, insufficient utilization of effective information causes targets in the actual scene to be missed, which makes it difficult to improve the detection effect. Therefore, target detection and recognition based on the deep learning framework still has great research significance and room for development.
In summary, transportation is a scene closely related to human travel and life. Combining with target detection and recognition technology is the basis for the development of intelligent transportation. It can not only improve the convenience of travel, but also play a very important role in the safety of human travel. At the same time, the existing research work has made some progress, but there are still many difficulties unsolved. There is still a certain gap with the use in real life. Therefore, further detection and recognition of various targets in actual traffic scenes still have certain theoretical research significance and practical application value. Based on this, a method for dim target detection using improved Faster R-CNN convolutional neural network in a complex traffic environment is proposed. The deep learning model of Faster R-CNN convolutional neural network is introduced into the image recognition of complex traffic environment, and a structure optimization method is proposed. The experimental results prove that this method can obtain higher accuracy and real-time performance.
The rest of this paper is arranged as follows. The second section introduces some representative related studies from recent years. The third section describes in detail the target detection method based on the improved Faster R-CNN network model. The fourth section presents the experimental analysis that verifies the effectiveness of the proposed method. The fifth section summarizes this paper.

Related Works
The intelligent transportation system collects road vehicle driving information through cameras, and the central computer then processes the information so as to realize the tracking and identification of vehicles and the identification of illegal traffic vehicles, and to assist in handling various traffic violations. While reducing the work pressure of traffic police, it also increases the utilization rate of roads and reduces the accident rate. In recent years, intelligent driving technology has also made great progress, which is inseparable from the rapid development of target detection technology [10]. Regular functions of modern vehicles such as cruise control, adaptive cruise, lane keeping and lane departure warning all rely on target detection technology. Information about the vehicle's surroundings is collected through cameras installed at various positions on the vehicle, and the current operating environment of the vehicle is analyzed to realize the driving assistance function. This reduces the fatigue of the driver and improves the safety of the vehicle.
In the traditional intelligent transportation system, vehicle detection is mainly realized by special sensors. Reference [11] proposed an algorithm for vehicle detection using ultrasonic data after analyzing the optimal energy-saving method of sensors. Reference [12] built a wireless sensor network to estimate the speed of the vehicle. These methods are not affected by the weather and can quickly detect passing vehicles. However, the installation of sensors tends to temporarily close the traffic there, and the maintenance cost of these sensors is also a considerable expense.
Nowadays, with the development of the economy and technology, camera technology has been integrated into every corner of social production and life. With the development of camera technology, video storage, playback and processing have also made considerable progress, laying a solid foundation for the development of computer vision technology. At the same time, with the rapid improvement of computing power, the use of computer vision to achieve target detection has become a main direction of modern research, which provides effective tools and methods for the progress of target detection. Vision-based vehicle detection systems came into being as a result. A visual vehicle detection system can detect vehicle type, traffic volume and vehicle speed, and can even predict traffic accidents. Reference [13] proposes a vehicle detection algorithm that can adaptively distinguish the foreground from the background. However, as a background difference method, it shares a common disadvantage: it is difficult to detect a stationary or slow-moving vehicle. This limits the final detection accuracy of this method.
The development of deep learning has promoted the research of target detection, such as YOLO [14], RCNN [15] and SSD [16]. Reference [14] proposed a road image vehicle detection algorithm based on an improved YOLOv3 network. In order to improve the detection efficiency, a new and improved YOLOv3 network structure with only 16 layers is constructed. Reference [17] proposed a hybrid deep neural network to divide the convolutional layer and the maximum pooling layer of the network into multiple blocks by dividing the final mapping of the two, including the receptive domain and the maximum pooling domain. The network can extract and learn multi-scale features of pictures. Reference [18] improves vehicle detection performance through a well-designed convolutional feature map neural network. In the traffic scene of vehicle congestion, reference [19] detects occluded vehicles by training two sets of paired support vector machines. Reference [20] studied a vehicle detection algorithm based on convolutional neural networks that fused color images and depth images. The algorithm is mainly researched by convolutional network multi-scale forward-looking depth imaging positioning model, forward-looking variable-scale vehicle detection pre-positioning algorithm, and typical model recognition algorithm based on transfer learning.
According to research hotspots in recent years, the detection and recognition rate of the above methods for vehicles with small pixel sizes in images is generally low, and it is difficult to meet the accuracy requirements of actual traffic applications [21,22]. The current vehicle detection and recognition frameworks based on deep learning have their own characteristics and advantages, but it is difficult for them to effectively combine multi-scale and multi-category vehicle features. There is still room for improvement in vehicle detection and recognition performance [23]. Based on this, a method for dim target detection using an improved Faster R-CNN convolutional neural network in a complex traffic environment is proposed. The main contributions of the method in this paper are as follows:
1) Aiming at the problem that it is difficult to accurately identify the common dim targets in road traffic scenes, a similar target detection and recognition method based on an improved residual network is proposed. The residual network with stronger learning ability is used to obtain more effective feature expression and rich semantic information.
2) The feature pyramid network is introduced into the RPN to generate target region proposals, and the convolutional neural network structure is optimized. Region proposals with more effective information are obtained, which enhances the ability to express feature regions containing important information and makes efficient use of the target area in the image.

Proposed Target Detection Network Structure
Faster R-CNN unifies feature extraction, candidate region extraction and box regression in one network to improve the efficiency of the network. It is mainly divided into four stages: (1) Feature extraction stage. A convolutional neural network is used as the basis to perform deep feature extraction on the sample data and obtain a feature map. This feature map is reused by the subsequent region proposal generation network, so it can be called a shared feature map, and a sample image only needs to pass through the convolutional neural network once. (2) Region proposal generation stage. The Region Proposal Network (RPN) generates multiple anchors for each pixel in the feature map from the previous step. SoftMax determines whether an anchor belongs to the foreground target or the background and outputs a category confidence probability for each anchor. Finally, bounding box regression corrects the position of the anchors containing a target to obtain a more accurate target region. This paper first normalizes complex traffic environment images of any size to 1000 × 600 pixels. Then, the feature map is generated through the convolutional and pooling layers of the CNN. The Faster R-CNN algorithm uses deep learning methods for feature extraction, which can effectively reduce the time and space complexity while meeting the accuracy requirements. VGGNet is one of the feature extraction networks used in Faster R-CNN. VGGNet is a classic convolutional neural network that emerged during the development of deep learning. It does not use a large convolution window, but instead uses small convolution kernels to gradually extract multiple layers of features from the original image. Cascading several such small convolution kernels covers the same receptive field as a single convolution with a larger window, and the generalization ability of the VGG model is stronger.
The superposition of multiple small-scale convolutional layers and pooling layers effectively improves the learning ability of the network structure for image features.
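The claim about cascaded small kernels can be checked with a short calculation. The following is a minimal sketch, not taken from the paper: the helper names `stacked_receptive_field` and `conv_weights` are hypothetical, and the 64-channel width is an arbitrary illustrative choice.

```python
def stacked_receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input pixels) of one output unit of a conv stack."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1  # jump: input-pixel distance between adjacent outputs
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


def conv_weights(kernel_sizes, channels):
    """Weight count of square convs with `channels` inputs and outputs (no bias)."""
    return sum(k * k * channels * channels for k in kernel_sizes)


# Two cascaded 3x3 convolutions see the same 5x5 window as one 5x5 convolution,
# but with fewer weights (64 channels is an arbitrary illustrative width).
print(stacked_receptive_field([3, 3]), stacked_receptive_field([5]))
print(conv_weights([3, 3], 64), conv_weights([5], 64))
```

Under these assumptions the cascade matches the larger window's receptive field while using roughly 28% fewer weights, which is one common explanation for the design.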
In this study, in order to improve the recognition accuracy of vehicles and dim targets in the image, the VGG16 network was not selected as the basic feature network to extract image features. In other fields such as target detection, image segmentation, and video analysis and recognition, replacing VGG16 in Faster R-CNN with a residual network (ResNet) can improve system performance. On the PASCAL VOC2007 data set, replacing VGG16 with ResNet101 increased the mAP from 73.2% to 76.4%, and on PASCAL VOC2012 from 70.4% to 73.8% [24]. Since there are not many target categories and instances in traffic environment images, the ResNet50 network is selected to extract image features. Foreground regions of interest (ROI) and region scores are extracted on all feature maps through the RPN and the feature pyramid network (FPN). The area with the highest score is used as the final vehicle and dim target candidate region (Fig. 1).
Any target region proposal is mapped to the corresponding position of the feature map through the ROI Pooling layer, and the area is down-sampled into a 7 × 7 feature map. Then each input feature map is extracted into a 7 × 7 × 256 dimension feature vector through a fully connected layer. Finally, the feature vector is input to two output layers of the same level: One is the classification layer, which determines whether the target is a vehicle or other dim target. The other is the boundary regression layer, which mainly fine-tunes the position and size of the ROI border.

RPN (Region Proposal Network)
RPN is a fully convolutional network. After end-to-end training, it generates high-quality foreground target region proposals for complex traffic environments and simultaneously predicts the target boundary and target score of vehicles and dim targets at each location. The network shares the convolutional features of the image with the vehicle target detection network. The residual network based on ResNet50 and the Faster R-CNN model share the convolutional layers from C2 to C5.

FPN (Feature Pyramid Network)
The FPN algorithm uses both the high-resolution of low-level features and the high-semantic information of high-level features. The prediction effect is achieved by fusing the features of these different layers. In order to improve the accuracy of target detection in traffic environment images, this paper uses FPN to fuse features of different layers in the RPN network to generate target region proposal of interest.
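The fusion of high-level and low-level features described above can be illustrated on toy data. This is a simplified sketch under stated assumptions: feature maps are single-channel nested lists, the fusion is a nearest-neighbour 2x upsampling followed by element-wise addition, and the 1 × 1 lateral and 3 × 3 smoothing convolutions of a full FPN are omitted. The names `upsample2x` and `fuse` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def fuse(top, lateral):
    """Add an upsampled coarse (semantic) map to a finer (detailed) map."""
    up = upsample2x(top)
    if len(up) != len(lateral) or len(up[0]) != len(lateral[0]):
        raise ValueError("lateral map must be exactly twice the size of top")
    return [[u + l for u, l in zip(ur, lr)] for ur, lr in zip(up, lateral)]


# Toy single-channel backbone outputs, coarse to fine.
c5 = [[1.0]]
c4 = [[0.1, 0.2], [0.3, 0.4]]
c3 = [[0.0] * 4 for _ in range(4)]

p5 = c5
p4 = fuse(p5, c4)  # 2x2: semantic signal from c5 plus detail from c4
p3 = fuse(p4, c3)  # 4x4: the fused signal propagates to the finest level
print(p4)
```

Each finer level therefore carries the coarse level's signal in addition to its own detail, which is the property the RPN relies on when proposing small targets.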
FPN designs the feature map as a multi-scale pyramid structure, and each layer of the pyramid uses a single-scale anchor. With ResNet50, the pyramid layers {P2, P3, P4, P5, P6} correspond to the anchor scales {32 × 32, 64 × 64, 128 × 128, 256 × 256, 512 × 512}, respectively. Three aspect ratios {1:2, 1:1, 2:1} are used, giving a total of 15 anchor types to predict the target objects and background in the traffic environment image and to generate region proposals for the targets of interest (vehicles, dim targets). The RPN framework is shown in Fig. 2.
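The anchor count can be verified by enumerating the shapes. The sketch below assumes that each ratio is height to width and that all anchors on a level keep the area of that level's scale, which is a common convention rather than something stated in the text. The names `SCALES`, `RATIOS` and `anchor_shapes` are hypothetical.

```python
import math

# Anchor side length per pyramid level, as listed in the text.
SCALES = {"P2": 32, "P3": 64, "P4": 128, "P5": 256, "P6": 512}
# Aspect ratios 1:2, 1:1, 2:1, read here as height / width (assumed convention).
RATIOS = [0.5, 1.0, 2.0]


def anchor_shapes(scales=SCALES, ratios=RATIOS):
    """Return {level: [(width, height), ...]} with area fixed per level."""
    shapes = {}
    for level, side in scales.items():
        area = side * side
        shapes[level] = []
        for r in ratios:
            w = math.sqrt(area / r)
            h = w * r
            shapes[level].append((w, h))
    return shapes


shapes = anchor_shapes()
total = sum(len(v) for v in shapes.values())
print(total)  # 5 levels x 3 ratios = 15 anchor types
for level, wh in shapes.items():
    print(level, [(round(w, 1), round(h, 1)) for w, h in wh])
```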

ROI Pooling Module
ROI Pooling maps each ROI from the input image to the corresponding position on the feature map. The mapped area is divided into sections of the same size, the number of which equals the output dimension, and a max pooling operation is performed on each section. Two quantization operations are carried out in this process. The result of the RPN is input into the ROI Pooling layer and mapped into 7 × 7 features. All of the output then passes through two fully connected layers, followed by the classification layer and the boundary regression layer, to get the final result. The classification layer gives the probability that the object in the region proposal is a vehicle or a dim target, and the boundary regression layer gives the coordinates of the vehicle and dim target region proposals.
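The two quantization steps can be made concrete with a small sketch. It assumes a single-channel feature map stored as a nested list, a backbone stride of 16, and floor/ceil rounding. These are illustrative choices, and the function names are hypothetical. A 2 × 2 output is used instead of 7 × 7 so the result is easy to read.

```python
import math


def image_roi_to_feature(roi, stride=16):
    """First quantization: image-space box -> integer feature-map cells."""
    x0, y0, x1, y1 = roi
    return (math.floor(x0 / stride), math.floor(y0 / stride),
            math.ceil(x1 / stride), math.ceil(y1 / stride))


def roi_max_pool(fmap, roi, out_size):
    """Second quantization: split the ROI into out_size x out_size bins
    with integer borders and take the maximum of each bin.
    roi = (x0, y0, x1, y1) in feature-map cells, x1 and y1 exclusive."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    out = []
    for i in range(out_size):
        ys = y0 + math.floor(i * h / out_size)
        ye = y0 + math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            xs = x0 + math.floor(j * w / out_size)
            xe = x0 + math.ceil((j + 1) * w / out_size)
            row.append(max(fmap[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        out.append(row)
    return out


# 6x6 toy feature map whose value encodes its position: fmap[y][x] = 6*y + x.
fmap = [[6 * y + x for x in range(6)] for y in range(6)]
feat_roi = image_roi_to_feature((16, 16, 80, 80))  # -> (1, 1, 5, 5)
pooled = roi_max_pool(fmap, feat_roi, out_size=2)
print(feat_roi, pooled)
```

Whatever the ROI size, the output grid has a fixed shape, which is what allows the fully connected layers to follow.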

Network Model Training
Fig. 1 The structure of the proposed target detection network
Before training the RPN, each anchor is assigned a binary label indicating whether it is background or target. A positive label is assigned to: (1) the anchor that has the largest intersection over union (IoU) with a ground truth (GT) bounding box of a target's true position; (2) any anchor whose IoU with the true position of any target is greater than 0.7. A negative label is assigned to an anchor whose IoU with the bounding boxes of all target real positions is less than 0.3.
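The labelling rules can be sketched as follows. Boxes are assumed to be `(x0, y0, x1, y1)` corner tuples, and anchors that satisfy neither rule are marked `-1` and ignored during training, which follows the usual Faster R-CNN practice and is an assumption here. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gts, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 target, 0 background, -1 ignored."""
    ious = [[iou(a, g) for g in gts] for a in anchors]
    labels = []
    for row in ious:
        best = max(row) if row else 0.0
        if best > pos_thr:
            labels.append(1)       # positive rule (2): IoU above 0.7 with some GT
        elif best < neg_thr:
            labels.append(0)       # negative rule: IoU below 0.3 with every GT
        else:
            labels.append(-1)      # neither rule applies (assumed: ignored)
    # Positive rule (1): the best-matching anchor of every GT box is positive.
    for j in range(len(gts)):
        best_i = max(range(len(anchors)), key=lambda i: ious[i][j])
        if ious[best_i][j] > 0:
            labels[best_i] = 1
    return labels


gts = [(0, 0, 10, 10), (50, 50, 60, 60)]
anchors = [(0, 0, 10, 10), (0, 0, 10, 20), (100, 100, 110, 110), (50, 50, 62, 62)]
labels = label_anchors(anchors, gts)
print(labels)
```

In this toy case the last anchor has an IoU of about 0.69 with the second GT box, so it fails rule (2) but is still made positive by rule (1), which guarantees every target at least one positive anchor.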
The process of bounding box regression is the process of fine-tuning anchors. Although 9 different sizes of anchors are used to cover all the targets in the image, they can only cover roughly. It is also necessary to modify each anchor within a certain range to make the anchors containing the target closer to the real target position.
In the process of bounding box regression, F represents a certain anchor, and G represents the true position of a certain target. The position coordinates, width and height of the proposal window are (x, y, w, h). The process of bounding box regression is then to find a mapping f that transforms F into a predicted box G′ = f(F) that is as close as possible to the real target position G.
The transformation can be approximately regarded as a linear transformation, so the four offset values can be obtained by linear regression. That is, given the input feature vector $X$ and the parameters $W$ to be learned, the regression output should be as close as possible to the real position $Y$:

$$Y \approx WX$$

For the problem of target detection and recognition:

$$d_*(A) = W_*^{T}\,\Phi(A)$$

where $\Phi(A)$ is the feature vector mapped from the anchor to the feature map output by the convolutional neural network, $*$ represents one of $(x, y, w, h)$, and $d_*(A)$ is the final predicted value. The goal is to minimize the difference between the final predicted value and the GT offset, denoted $t_*$. The loss function is:

$$\mathrm{Loss} = \sum_{i} \left(t_*^{i} - W_*^{T}\,\Phi(A^{i})\right)^{2}$$

By minimizing this loss, a set of transformation parameters $W_*$ is learned, which gives the final offset. During the training process, the network is fine-tuned by minimizing the multi-task loss function:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \frac{1}{N_{reg}} \sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$

where $i$ indexes the candidate frames selected during the training process, $p_i$ represents the predicted probability that the $i$-th candidate frame is a target, and $p_i^*$ represents the true label of the $i$-th candidate frame. If the $i$-th proposal box has a positive label, it belongs to the target and $p_i^* = 1$. If the $i$-th proposal box has a negative label, it belongs to the background and $p_i^* = 0$. $t_i$ represents the position of the predicted bounding box, and $t_i^*$ represents the position of the real box corresponding to the prediction box. $L_{cls}$ is the classification loss, which in the standard Faster R-CNN formulation is the log loss over the two classes:

$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^*\, p_i + (1 - p_i^*)(1 - p_i)\right]$$

$L_{reg}$ is the regression loss, which in the standard Faster R-CNN formulation is the smooth L1 loss of the offset difference:

$$L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L1}(t_i - t_i^*)$$

$$t_x = \frac{x - x_a}{w_a},\quad t_y = \frac{y - y_a}{h_a},\quad t_w = \log\frac{w}{w_a},\quad t_h = \log\frac{h}{h_a}$$

$$t_x^* = \frac{x^* - x_a}{w_a},\quad t_y^* = \frac{y^* - y_a}{h_a},\quad t_w^* = \log\frac{w^*}{w_a},\quad t_h^* = \log\frac{h^*}{h_a}$$

where $(x, y)$ is the coordinate of the predicted bounding box, $(x_a, y_a)$ is the coordinate of the proposal box, and $(x^*, y^*)$ is the real GT coordinate. $w$ and $h$ indicate the width and height of the bounding box. $N_{cls}$ and $N_{reg}$ are the normalization parameters of the classification and regression terms, respectively.
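The box parameterization and multi-task loss described above can be sketched in a few lines. This sketch follows the standard Faster R-CNN formulation (log loss for classification, smooth L1 for regression). Boxes are assumed to be `(x, y, w, h)` with centre coordinates, and both normalizers default to the number of samples, which is an illustrative choice. All function names are hypothetical.

```python
import math


def encode(box, anchor):
    """Offsets (tx, ty, tw, th) of a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(t, anchor):
    """Inverse of encode: apply offsets to an anchor to get a box."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


def smooth_l1(d):
    """Quadratic near zero, linear elsewhere."""
    d = abs(d)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def multitask_loss(p, p_star, t, t_star, n_cls=None, n_reg=None):
    """Classification log loss plus regression loss on positive samples only."""
    n_cls = n_cls or len(p)  # assumed normalizer: number of samples
    n_reg = n_reg or len(p)  # assumed normalizer: number of samples
    cls = sum(-math.log(pi if ps == 1 else 1.0 - pi)
              for pi, ps in zip(p, p_star))
    reg = sum(ps * sum(smooth_l1(a - b) for a, b in zip(ti, tsi))
              for ps, ti, tsi in zip(p_star, t, t_star))
    return cls / n_cls + reg / n_reg


anchor = (50.0, 50.0, 20.0, 40.0)
gt_box = (55.0, 60.0, 30.0, 40.0)
t_star = encode(gt_box, anchor)
print(t_star)
print(decode(t_star, anchor))
# One positive sample predicted with probability 0.5 and perfect offsets:
print(multitask_loss([0.5], [1], [t_star], [t_star]))
```

Because the regression term is multiplied by the true label, background anchors contribute only to the classification term.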

Experimental Setup
This article uses PyTorch and the MMDetection open source framework provided by the Chinese University of Hong Kong to run experiments on Ubuntu 16.04. The data set used in the experiment is the KITTI public dataset, which is currently one of the most commonly used datasets in the field of autonomous driving, and is also one of the internationally common visual  Fig. 3.
In this paper, the 7481 images in the data set are divided into a training and validation set and a test set at a ratio of 2:8, that is, 1497 training and validation images and 5984 test images.
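The stated counts are consistent with a 2:8 split in which the smaller part is rounded up. The sketch below assumes that rounding, together with a seeded random shuffle. Neither detail is given in the text, and `split_dataset` is a hypothetical helper.

```python
import math
import random


def split_dataset(items, train_fraction=0.2, seed=0):
    """Shuffle and split; the training part is rounded up (assumed)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = math.ceil(len(items) * train_fraction)
    return items[:n_train], items[n_train:]


# Integer indices stand in for the 7481 image files.
train_val, test = split_dataset(range(7481))
print(len(train_val), len(test))
```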
This paper chooses the stochastic gradient descent method to train Faster R-CNN in an end-to-end joint manner. A Gaussian distribution with a mean of 0 and a standard deviation of 0.01 is used to randomly initialize the weights of all newly added layers. The remaining layers are initialized with the parameters of the pre-trained ImageNet classification model. Set the learning rate to 0.005, momentum to 0.9, weight decay coefficient to 0.0001, epoch to 1500, and number of iterations to 550,000. The model is saved every epoch, and finally the model with the highest accuracy is selected.
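The initialization and optimizer settings above can be illustrated in plain Python. This is a sketch, not the training code: the update rule follows the common convention of folding L2 weight decay into the gradient before applying momentum, which is an assumption about the optimizer variant, and the function names are hypothetical.

```python
import random


def init_new_layer(n, std=0.01, seed=0):
    """Zero-mean Gaussian initialization for the weights of a new layer."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


def sgd_momentum_step(w, grad, velocity,
                      lr=0.005, momentum=0.9, weight_decay=0.0001):
    """One SGD step with momentum and L2 weight decay (assumed variant)."""
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, velocity):
        g = gi + weight_decay * wi   # weight decay added to the gradient
        v = momentum * vi + g        # momentum accumulates past gradients
        new_v.append(v)
        new_w.append(wi - lr * v)
    return new_w, new_v


weights = init_new_layer(10000)
w1, v1 = sgd_momentum_step([1.0], [2.0], [0.0])
w2, v2 = sgd_momentum_step(w1, [2.0], v1)
print(w1, v1)
print(w2, v2)
```

With a constant gradient the velocity grows from step to step, so the second update moves the weight further than the first.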
For one traffic scene image, the RPN produces about 20,000 anchors. The NMS algorithm is used to select the 2000 RoIs with the highest probability, which correspond to regions of different sizes in the feature map. The Proposal Target Creator then selects 128 RoIs, and ROI Pooling pools all these regions of different sizes to the same scale (7 × 7).
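The NMS selection step can be sketched as a greedy loop. The IoU threshold of 0.7 is the value commonly used for RPN proposals and is an assumption here, since the text only gives the number of RoIs kept. Boxes are `(x0, y0, x1, y1)` tuples and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.7, top_k=2000):
    """Greedy non-maximum suppression; returns indices of kept boxes,
    highest score first, at most top_k of them."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_thr=0.5))  # the second box overlaps the first
print(nms(boxes, scores))               # looser threshold keeps all three
```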

Evaluation Index
In order to evaluate the effectiveness of the proposed method, two indicators, precision and recall, are used for model evaluation, both of which range over [0, 1]. At the same time, the F1 value is introduced as their harmonic mean. The specific evaluation formulas are:

$$P = \frac{n_{TP}}{n_{TP} + n_{FP}} \times 100\%$$

$$R = \frac{n_{TP}}{n_{TP} + n_{FN}} \times 100\%$$

$$F1 = \frac{2PR}{P + R}$$

where $P$ represents precision; $R$ represents recall; $F1$ represents the harmonic mean of precision and recall; $n_{TP}$ represents the number of correctly identified vehicles and dim targets; $n_{FP}$ represents the number of misidentified vehicles and dim targets; $n_{FN}$ represents the number of unidentified vehicles and dim targets.
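As a worked example of these indicators, the following sketch computes them from raw counts. The counts are made up for illustration and are not results from the paper. `precision_recall_f1` is a hypothetical helper, and values are returned as fractions rather than percentages.

```python
def precision_recall_f1(n_tp, n_fp, n_fn):
    """Precision, recall and F1 from detection counts (as fractions)."""
    p = n_tp / (n_tp + n_fp) if n_tp + n_fp else 0.0
    r = n_tp / (n_tp + n_fn) if n_tp + n_fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Illustrative counts: 90 correct detections, 10 false alarms, 30 misses.
p, r, f1 = precision_recall_f1(90, 10, 30)
print(p, r, f1)
```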

Training Loss
Using the Faster R-CNN structure described above, 1497 training set sample data are used for training. Performing 1500 iterations on the above training set took 20 h. The training accuracy loss curve is shown in Fig. 4.
It can be seen from Fig. 4 that as the number of iterations continues to increase, the accuracy loss produced by the training set gradually decreases. When the iteration reaches 1200 times, the accuracy loss drops to 3%, indicating that the model training effect is good. The training loss basically converges to a stable value, indicating that the improved Faster R-CNN achieves the expected training effect.

Model Performance Verification
In order to verify the reliability and stability of the model, after the training is completed, the 1497 images in the test set are identified. The mean average precision (mAP), average recall, and average precision (AP) are chosen as the evaluation indices of the validity of the test results. The average processing time is used to evaluate the speed of recognition:

$$\text{Average processing time} = \frac{\text{Test run time}}{\text{Number of test pictures}}$$

The experimental results show that the average time for the method in this paper to recognize a single image is 1.55 s. Moreover, the experiment showed that occlusion and background similarity are the main factors that affect target recognition. The recognition effect is shown in Fig. 5. Under normal conditions, the algorithm can detect and recognize each of the objects to be inspected in the road traffic scene, and output the category and specific location of each target. It can also distinguish objects that are far away, partially obscured, or blurred. It can be seen that the detection and recognition of common traffic targets is improved. In actual situations, there is more redundant information in the road traffic scene and less effective information about the target, so the target to be inspected in the image is often interfered with by other factors, such as insufficient lighting and shadow occlusion. In addition, some target boundaries are unclear and fuzzy after shooting, which makes it difficult to accurately detect and recognize the road traffic targets to be inspected and reduces the accuracy of the trained road traffic target detection and recognition model. Replacing VGG16 in Faster R-CNN with a residual network (ResNet) can simultaneously improve the utilization of effective information from the channel and spatial dimensions. The shallow detail information in the feature map is used to achieve better detection and recognition results.

Model Precision Comparison
In order to demonstrate the performance of the proposed method in terms of precision indicators, it is compared with the methods in reference [13,14,20]. The result is shown in Fig. 6.
It can be seen from Fig. 6 that as the number of iterations increases, the precision of various methods is also increasing. The precision of the proposed method is better than other comparison methods, the highest precision reaches 94.7%, showing certain advantages. Because the improved Faster R-CNN deep network model deeply integrates RPN, it can generate high-quality region proposal boxes and improve the precision of recognition. Reference [20] model is used for global feature extraction and classification, but it is difficult to adapt to the complex road traffic environment. Therefore, the recognition accuracy needs to be further improved. However, the accuracy of reference [13,14] is low, and it is difficult to identify dim targets.

Model Recall Comparison
In order to demonstrate the performance of the proposed method in terms of recall index, it is compared and analyzed with the methods in reference [13,14,20]. The result is shown in Fig. 7.
As can be seen from Fig. 7, as the number of iterations increases, the recall of various methods is also rising. The recall of the proposed method is better than other comparison methods. Because the improved Faster R-CNN network can adapt to various complex environments, the highest recall reaches 85.6%. The recall of reference [20] and reference [14] are similar, and both are lower than the proposed method. Among them, the reference [14] is optimized through the YOLOv3 network and used for vehicle target recognition, but the influence of factors such as illumination is not considered, and the recall is low. Reference [13] uses the background difference method for recognition. Because the target feature extraction is not complete, it is difficult to detect a stationary or slow-moving vehicle. This limits the final detection accuracy of the method, so the recognition performance is poor.

Comparison of Precision Recall Curves
When performing target object detection, IoU is used to define the matching degree between the real object and the predicted object, and the precision-recall (PR) curve is drawn through calculation. The PR curves of vehicles and dim targets for the different methods are shown in Fig. 8. It can be seen from Fig. 8 that the PR curve of vehicles is better than that of dim targets. Because there are more vehicle targets, their characteristics are clearer. There are many types of dim targets, and some of them are similar in type, so it is difficult to identify them. In addition, the PR curve of the proposed method is better than those of the other comparison methods for the recognition of both vehicles and dim targets.
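A PR curve of this kind is usually obtained by sorting detections by confidence and accumulating true and false positives. The sketch below assumes that each detection has already been marked as a true or false positive by IoU matching against the ground truth (the matching threshold is not stated in the text). The function name and the toy data are hypothetical.

```python
def pr_curve(scores, is_tp, n_gt):
    """Return (recall, precision) points, one per detection,
    visiting detections from highest to lowest confidence."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_gt, tp / (tp + fp)))
    return points


# Four detections against four ground-truth objects (toy data).
scores = [0.9, 0.8, 0.7, 0.6]
is_tp = [True, False, True, True]
points = pr_curve(scores, is_tp, n_gt=4)
for recall, precision in points:
    print(round(recall, 2), round(precision, 2))
```

A curve that stays closer to the top-right corner indicates a detector that keeps precision high as recall grows.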

Conclusion
Aiming at the problem that it is difficult to accurately identify similar dim targets in road traffic scenes, an improved detection and identification method based on a residual network is proposed. The residual network with stronger learning ability is used to obtain more effective feature expression and rich semantic information. This paper establishes a ResNet50 network model to extract features of vehicles and dim targets from the original image. The model does not rely on image preprocessing and data conversion, and can autonomously extract feature expressions of vehicles and dim targets through learning. Compared with manually designed features, these can more accurately reflect the effective information of the identified target. The test results show that the target recognition accuracy of this method reaches 94.7%. It has excellent generalization performance in practice and obtains a stable, high recognition accuracy. The disadvantage is that the Faster R-CNN model requires a long training time, and training requires a GPU with more than 8 GB of memory. After training, however, this does not affect the recognition speed in actual tests. In the next

Data Availability Statement
The data included in this paper are available without any restriction.

Author's Contribution
The main idea of this paper is proposed by Jianfang Liu. The algorithm design and experimental environment construction are jointly completed by Hao Zheng and Xiaogang Ren. The experimental verification was completed by all the three authors. The writing of the article is jointly completed by Hao Zheng and Xiaogang Ren. And the writing guidance, English polish and funding project are completed by Jianfang Liu.

Declarations
Ethical Approval All authors have read "Ethical Responsibilities of Authors" of Soft Computing, and they all promise to strictly abide by the publication ethics.