Scan From the Sky: A Path Planning Method with Perception Optimization for UAVs

Abstract—Unmanned aerial vehicles (UAVs) are frequently adopted in disaster management. The vision they provide is extremely valuable for rescuers. However, they face severe stability problems in actual disaster scenarios, as the images captured by the on-board sensors cannot consistently provide enough information for deep learning models to make accurate decisions. In many cases, UAVs have to capture multiple images from different views before a final recognition result can be produced. In this paper, we formulate the flight path task for UAVs with the actual perception needs in mind. A new convolutional neural network (CNN) model is proposed to detect and localize objects such as buildings, together with an optimization method that finds the optimal flight path to accurately recognize as many objects as possible at a minimum time cost. The simulation results demonstrate that the proposed method is effective and efficient, and can address real-world scene understanding and path planning problems for UAVs.


INTRODUCTION
The unmanned aerial vehicle (UAV) is often used in disaster management. For example, after a disaster, rescuers can use a UAV to survey the affected area and locate damaged buildings and stranded residents; in the recovery stage, a UAV can assess the severity of the damage and the feasibility and cost of reconstruction, assisting people in reconstruction planning, as shown in Figure 1. However, a serious limitation of the UAV is that its battery capacity is small and cannot support long-term, large-scale survey tasks. Therefore, how to plan a feasible and effective flight path that detects and identifies objects efficiently and accurately has become a very important issue.
This task mainly includes two aspects, namely path planning [1] and scene understanding [2]. Although much research has been done on both and some results have been achieved, existing methods all have shortcomings. For example, most of them treat the task as two entirely separate problems, i.e., path planning and scene understanding, rather than as a whole, and therefore may not transfer well to actual detection tasks in disaster management. In fact, the result of scene understanding is very important feedback for path planning. In a single shot, some of the buildings or objects in the picture are not easily discernible, which is likely to cause the scene understanding model to fail. The unclear zones of such an area can then be treated as the key regions for the next shot, and the flight path of the UAV can be planned accordingly.
For the flight planning of drones, previous methods mainly considered coverage and total flight time. In our approach, we also need to consider the confidence of the recognition, so that the deep learning network [3] can accurately and reliably identify buildings, pedestrians and other objects in the video data collected by the drone. In the optimization process, traditional methods mainly use heuristic algorithms based on the traveling salesman problem, which optimize the overall performance by minimizing the cost of reaching the current state plus the cost of moving from the current state to the next state. We choose a similar approach to optimize the path, but since we add the element of confidence, the overall optimization algorithm is more complex.
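The idea of augmenting a traveling-salesman-style greedy step with a recognition-confidence term can be sketched as follows. This is a minimal illustration, not the exact formulation of our algorithm: the helper names, the linear score and the weights alpha and beta are all illustrative assumptions.

```python
import math

def travel_time(p, q, speed=5.0):
    # Euclidean flight time between two 2-D waypoints; speed in m/s (assumed).
    return math.hypot(q[0] - p[0], q[1] - p[1]) / speed

def next_waypoint(current, candidates, confidence, alpha=1.0, beta=10.0):
    """Greedy TSP-style step with a perception term: prefer views whose
    objects are still recognized with low confidence, penalize long hops."""
    def score(c):
        # Expected confidence gain (1 - current confidence) minus travel cost.
        return beta * (1.0 - confidence[c]) - alpha * travel_time(current, c)
    return max(candidates, key=score)
```

Under this score, a nearby view of a poorly recognized object outranks a distant view of similar uncertainty, which is exactly the trade-off the confidence term introduces.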
In this article, we present a UAV path planning approach for target detection and localization tasks using deep models. The main contributions of our work include:
• A drone path planning algorithm. In this algorithm, the flight time of the drone, the coverage of the area to be surveyed and the confidence of object detection are jointly guaranteed and optimized.
• An object recognition deep learning model. This model can simultaneously detect and locate pre-selected objects, such as buildings, pedestrians, etc., in video frames captured by drones.
• A drone image dataset. The dataset was collected from real drones, and every building in each video frame was manually tagged with its exact location. It can be used for performance measurement and comparison of drone detection and localization methods.
The rest of the paper is organized as follows. Section 2 introduces existing research in related areas. Section 3 presents some notation and the problem definition, and gives an overview of the proposed system. In Section 4, we present a scene understanding method and propose the core path planning algorithm. Experimental evaluation is shown in Section 5. Finally, conclusions are drawn in Section 6.

RELATED WORK

Scene Understanding for UAVs
Liu et al. propose a framework [4] that combines trajectory detection based on a support vector machine with a tracking-by-detection tracker to realize direction estimation and tracking in real time with low computational cost. In addition, their framework adopts a simple linear iterative clustering super-pixel segmentation algorithm to ensure the accuracy of scene segmentation, while the visual detection of important objects or people is realized by a single-shot multi-box detector. Minaeian et al. propose a novel vision-based target detection and localization scheme [5] that assigns each vehicle a different function within a cooperative team. The authors build a team consisting of a UAV and multiple unmanned ground vehicles (UGVs) that tracks and controls crowds in a border patrol field. A custom motion detection algorithm is used for crowd detection by the moving camera installed on the UAV. Since the UAV has lower analysis ability but a wider detection range, while the UGVs have better analysis ability and higher accuracy, a separate body detector is used on the UGVs, and landmarks are moved to localize the independently moving crowds detected at each point in time. The UAV localization algorithm proposed in the paper uses a perspective transformation to translate the crowd position in an image into a position in the real world. Moreover, a landmark positioning method determined by the UGVs is introduced, which predicts the geographic location of the detections. Zhang et al. [6] propose a new vision-based positioning method that works together with the low-quality attitude-heading reference system (AHRS) on UAVs to determine the three-dimensional position of a target. Usual positioning methods rely on many requirements (e.g., a geo-referenced terrain database and precise attitude sensors); if the drone system does not meet these requirements, the target cannot be geolocated.
On the contrary, the geolocation method proposed in that paper only uses computer vision techniques to accurately estimate the target height and the yaw angle measurement bias. The purpose of this method is to eliminate these requirements from current systems while maintaining high target positioning accuracy. Zhu et al. present an estimation method for urban traffic density [7] which can efficiently process ultra-high-resolution video obtained from UAVs. They first shot nearly one hour of ultra-high-resolution traffic video at five crowded areas in big cities by flying drones during peak hours. They then randomly sample pixel patches and annotate the vehicles to form the dataset for their research, which can also be used in other studies. Their urban traffic estimation method uses a deep neural network to detect vehicles and obtain information such as the location and identity of each vehicle. In addition, they argue that ultra-high-resolution video carries extra information that makes vehicle detection and recognition more accurate than low-resolution content. Fan et al. [8] propose a plant detection method using UAVs which consists of three phases. In the first phase, candidate tobacco plant areas are extracted from the drone image by morphological operations and watershed segmentation; each candidate area contains either tobacco plants or non-tobacco plants. In the second phase, a deep convolutional neural network is built and trained to classify candidate areas as tobacco or non-tobacco plant areas. In the third phase, post-processing is carried out to further remove non-tobacco plant areas. There are also many other existing approaches in the area of scene understanding [9], [10], [11], [12], [13], [14], [15], [16], [17].
However, most of them are neither designed for UAVs nor optimized for detection and localization tasks performed during flight.

Path Planning of UAVs
Yu et al. [18] introduce a solution named collaborative path planning, which utilizes a UAV and a UGV to track moving targets in urban environments. The most significant advantage of this algorithm is that it considers the visual occlusion caused by obstacles in the environment. The algorithm models the target state using a dynamically occupied grid, which is refreshed according to the data obtained from a Bayesian filter. Hence, the current behavior and its prediction can be analyzed, and based on this a single-vehicle path planning method is presented. The method maximizes the sum of detection probabilities and has been applied to various scenarios due to its portability. In this scenario, an auction-based decentralized programming algorithm is designed to plan a limited forward-looking path, maximizing the joint probability of detection across vehicles. Kumar et al. [19] make statistical comparisons between existing UAV path planning methods to determine the best benchmark function. Using the approximate optimization technique determined in this first step, namely the multi-verse optimizer (MVO), they mathematically formulate the path planning problem of finding the minimum-deviation, minimum-collision trajectory of the drone, and compare against existing approaches to verify the proposed path planning method. Yang et al. [20] propose a new approach to individual evaluation and evolution methods. By using this new idea, people can take advantage of high-quality waypoints.
In the evaluation phase, a new set of evaluation functions is derived from the existing objective and constraint functions to evaluate each waypoint. Basically, the derivation can only be made if the original function is separable over the waypoints. To further improve the performance of the proposed planner, the waypoints are encoded in a rotating coordinate system with external constraints. In order to test the ability of the new planner to plan obstacle-free paths, five scenarios with an increasing number of obstacles were constructed. Comparisons against 3 existing planners and 4 alternatives show that the planner executes efficiently and effectively. Wen et al. [21] propose a new way to obtain a feasible and safe path. First, static threats (STs) are modeled based on the intuitionistic fuzzy set (IFS) to represent the uncertainty in STs, and an evaluation and synthesis method for STs is introduced. Based on the rapidly-exploring random tree (RRT), a reachability set (RS) is used to predict the threat value. Second, a sub-target selector is put into the planning scheme, with the main purpose of reducing planning cost and improving search efficiency. Furthermore, a receding horizon (RH) is introduced to deal with online path planning in more complex environments; local planners are therefore designed based on the dynamic-domain rapidly-exploring random tree (DDRRT), and RRT is adopted to optimize the path in the planning program. Yin et al. [22] introduce a scheme for multi-objective path planning (MOPP), which searches for an appropriate path in complex urban scenarios while taking the safety level into account to ensure the security of drones. In particular, safety index maps (SIMs) are first utilized to describe the various obstacles in geographic maps, and offline and online search methods based on the static SIM are then proposed.
Offline search is meant to handle static obstacles, thereby reducing travel time, while online search deals with the remaining dynamic obstacles. There are also many other existing approaches in the area of path planning for UAVs [23], [24], [25], [26], [27], [28], [29], [30]. However, most of them do not consider the prediction confidence when choosing the future path; therefore, their prediction performance is not satisfactory during actual tasks.

PATH PLANNING
The path planning problem of UAVs in 3D environments is a hotspot in the development of robotics. In particular, path planning is one of the basic links in the autonomous navigation of drones: given an environment with obstacles and certain evaluation criteria, the optimal path should be established. Generally, path planning can be summarized in three steps: modeling the environment, executing the path search, and building the optimal path. Modeling the environment means abstracting the actual spatial environment into a mathematical model so that it can be understood by computer algorithms; it is therefore the basis of path planning, as shown in Figure 2, and rational environmental modeling has an important impact on the eventual results. The modeling methods for UAVs in 3D scenarios mainly include geometric modeling and unit decomposition modeling, which treats the flight area of the drone as a collection of units. Among the various modeling methods, the traditional grid method plays the most basic role. When modeling the environment, the scenario is transformed into traditional grids, whose information is easy to preserve in a computer. In addition, the adjacency relationship among grids is quite intuitive, so a program implementation is easy to write in practical applications, and better planning results can be obtained when the representation is combined with the planning algorithm. In the field of robot path planning, the traditional grid method has received wide attention from researchers as a typical environment modeling method. However, as environmental information becomes more complex and robot autonomy keeps improving, the traditional grid method still exhibits some problems when applied to three-dimensional environment modeling.
Compared with the two-dimensional case, directly modeling a three-dimensional environment with the traditional grid method requires small grid cells to preserve accuracy, which greatly increases the number of grids; the planning algorithm then has to process a large amount of raster data, and planning efficiency drops significantly. This paper mainly considers three planning indicators, namely the total information density of a sub-area, the coverage time of a sub-area, and the transition time. Suppose a drone or unmanned boat i is assigned K_i sub-areas; the total information density of the k-th sub-area is the sum of the information densities over that sub-area. The sub-area coverage time represents the ideal time required for the drone or unmanned boat to completely cover the sub-area, determined by the sub-area area S and the sensor coverage C of the drone. The transition time represents the ideal flight time required from the initial point P_0 to the center of the first sub-region (k = 1), or from the center of the (k-1)-th sub-region to the center of the k-th sub-region (k ≥ 2), where V represents the movement speed of the drone. From these quantities, the expected observation benefit of drone i for sub-area k is computed, and the K_i sub-regions are iteratively sorted by it to determine the optimal observation order. It is generally expected that the larger the observation benefit, i.e., the higher the information density of the region or the shorter the ideal observation time, the higher the priority.
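Given the variable definitions above, the three indicators can be sketched as follows. The original equations were not reproduced here, so the concrete forms below (area divided by a coverage rate, straight-line transit time, and a density-over-time ratio for the benefit) are hedged assumptions consistent with the surrounding description, not the paper's exact formulas.

```python
import math

def coverage_time(S, C):
    # Ideal sub-area coverage time: area S divided by the sensor
    # coverage rate C (area covered per unit time) -- assumed form.
    return S / C

def transition_time(p, q, V):
    # Ideal flight time between two sub-area centers at speed V.
    return math.hypot(q[0] - p[0], q[1] - p[1]) / V

def observation_benefit(density, t_cov, t_trans):
    # Higher information density and shorter ideal observation time
    # both raise a sub-area's priority (assumed ratio form).
    return density / (t_cov + t_trans)
```

Sorting the assigned sub-areas by `observation_benefit` in descending order then yields the observation order described in the text.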
After determining the optimal observation order for each sub-area, the expected observation benefit of drone or unmanned boat i is obtained by summing over its sorted sub-areas, where k′ represents the sub-area number after sorting. In addition, the ideal observation time of each drone is defined as the total time it spends covering its sub-areas and transiting between them. The total allocation indicator EA for the drones consists of two parts: the first part represents the total expected observation gain, and the second part is used to balance the task execution cost among the various observation platforms such as drones, with weight coefficients λ1 and λ2. The larger the EA, the higher the regional observation efficiency, so the optimal allocation objective is to maximize EA. After the sub-areas are allocated, each drone or unmanned boat adopts a parallel receding horizon control (RHC) route planning method that maximizes the observation gain, so that the planned route meets the task time constraint. Assuming that the centers of the sub-areas assigned to the drone are {S′_1, ..., S′_{K_i}}, the route planning procedure based on parallel RHC is as follows.
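The two-part allocation indicator EA can be sketched as a weighted total gain minus a workload-imbalance penalty. The imbalance measure (spread of ideal observation times) and the default weights are our illustrative assumptions; the paper's exact balance term may differ.

```python
def allocation_indicator(gains, times, lam1=1.0, lam2=0.5):
    """EA = lam1 * total expected observation gain
          - lam2 * spread of ideal observation times across drones,
    so maximizing EA rewards gain while balancing workload."""
    total_gain = sum(gains)
    imbalance = max(times) - min(times)
    return lam1 * total_gain - lam2 * imbalance
```

With equal observation times the penalty vanishes and EA reduces to the weighted total gain, which matches the intent that balanced allocations score higher.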
First, each route segment is initialized, including the shortest route segment φ_0 (with time t_0) from the starting point P_0 to the center of the first sub-region S′_1, the coverage observation route φ_j = {S′_j} of each sub-region (j = 1, ..., K_i) (with time t_j = 1), and the shortest transition route φ_{j,j+1} (j < K_i) (with time t_{j,j+1}) between adjacent sub-areas.
Then, if the sum of the times of the above route segments is less than the task time T, a route segment is selected and a new waypoint is added to it. The specific strategy is as follows: the RHC method pre-plans a new candidate waypoint for each route segment φ_j; among these candidates, the point P_j with the largest single-step observation gain is added to the route segment φ_j inside the corresponding sub-area, and its time t_j is updated accordingly; if j < K_i, the shortest transition route φ_{j→j+1} from the point P_j to the center of the next sub-area S′_{j+1}, together with its flight time, is also updated. The above steps are repeated to gradually expand each coverage route segment until the total time equals the mission time T.
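The expansion loop above can be sketched as follows. Here `plan_candidate` is a hypothetical callback standing in for the RHC pre-planning step; it returns a candidate waypoint for a segment, its single-step observation gain, and the time it adds. The transition-route update is omitted to keep the sketch short.

```python
def expand_routes(segments, times, T, plan_candidate):
    """Parallel receding-horizon expansion: while the total route time is
    below the mission time T, add the candidate waypoint with the largest
    single-step observation gain to its route segment."""
    while sum(times) < T:
        best = None
        for j, seg in enumerate(segments):
            point, gain, dt = plan_candidate(seg)
            if best is None or gain > best[1]:
                best = (j, gain, point, dt)
        j, _, point, dt = best
        if sum(times) + dt > T:
            break  # adding this waypoint would exceed the mission time
        segments[j].append(point)
        times[j] += dt
    return segments, times
```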

Deep Learning Model
After obtaining frames of captured video, the system can work on the scene understanding, as shown in Figure 3. Compared to other perception methods, which manually design the feature map for the deep model, we directly map the input frames to object categories and localization, which has the potential to achieve higher network precision.
A multi-layer detection network is designed to extract different object characteristics. The input is the captured images, and the main objective is to detect specific objects and their locations. Our model contains 7 hidden layers and one output layer: five common convolutional layers and two fully-connected layers. We demonstrate that this structure performs well in capturing the relationship between video frames and object information. Its distinctive feature is a two-branch structure: two different sub-models are adopted to compute the category and the localization information respectively. Since the features extracted in the lower part of a deep model do not differ much between tasks, the first 5 convolutional layers are shared between the two sub-networks for training efficiency. The remaining layers in each sub-network compute the higher-level features, because such features are usually task-related and do not show similarity among different artificial intelligence tasks.
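The paper does not specify the kernel sizes of the five shared convolutional layers, but the spatial bookkeeping of such a trunk follows the standard convolution output-size formula. The layer parameters and input resolution below are purely illustrative (AlexNet-like), not the model's actual configuration.

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Spatial size after one convolution: floor((n - k + 2p) / s) + 1.
    return (size - kernel + 2 * pad) // stride + 1

# Hypothetical 5-layer shared trunk on a 224x224 input frame.
size = 224
for k, s, p in [(11, 4, 2), (5, 1, 2), (3, 1, 1), (3, 1, 1), (3, 1, 1)]:
    size = conv_out(size, k, s, p)
# The resulting feature map would feed both the category branch and the
# localization branch of the two-branch head.
```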
In order to learn the two objectives at the same time, a multi-task loss is adopted as the cost function of our deep model:

L = λ1 · L_category + λ2 · L_localization + λ3 · L_dec,

where L_category, L_localization and L_dec are the cost functions of the different objectives, and λ1, λ2 and λ3 decide their importance in the whole cost value. L_category represents the loss of judging whether the selected data samples belong to the correct categories, i.e., whether the specific kind of object is present in the video frames. L_category is indeed a softmax cost function:
L_category = -(1/n) Σ_{i=1}^{n} Σ_{j=1}^{m} 1{y_i = j} · log h^(j)_category(x_i),

where h^(j)_category(x_i) ∈ [0, 1] represents the confidence that one specific area is a true object, e.g., a building, car or human; n is the number of input samples, i is the index of one input sample, m is the total number of categories, j is the category index, h_category(x_i) is the softmax output, and h_localization(x_i) is the smooth L1 output.
L_localization is utilized to find the location of one object. Following the smooth L1 loss of Fast R-CNN, L_localization can be written as the smooth L1 distance between the predicted and ground-truth box offsets, with smooth_L1(x) = 0.5x² if |x| < 1 and |x| - 0.5 otherwise.
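The loss terms can be sketched in plain Python as follows. This is a pedagogical sketch rather than the training code; the logits, labels and box offsets are illustrative inputs, and L_dec is left as an opaque third term.

```python
import math

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def category_loss(logits, labels):
    # Softmax cross-entropy (L_category), averaged over the n samples.
    return -sum(math.log(softmax(z)[y])
                for z, y in zip(logits, labels)) / len(logits)

def smooth_l1(x):
    # Smooth L1 from Fast R-CNN: quadratic near zero, linear in the tails.
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def localization_loss(pred, target):
    # L_localization: smooth L1 summed over box coordinate offsets.
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))

def multitask_loss(l_cat, l_loc, l_dec, lam1=1.0, lam2=1.0, lam3=1.0):
    # Weighted sum L = lam1*L_category + lam2*L_localization + lam3*L_dec.
    return lam1 * l_cat + lam2 * l_loc + lam3 * l_dec
```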

UAV Image Dataset
We created a new drone image dataset ourselves to train our deep neural network. The training data comes from video actually captured by a drone. We extract keyframes from it, and then perform image calibration and area marking on each image. The marking tool we chose is LabelMe, an annotation tool written in Python with a Qt GUI, which greatly improves the efficiency of image marking.
We marked a total of 1000 drone images, each containing dozens of buildings to be marked. The entire marking process took three person-months and covered more than 30,000 buildings of different types and appearances.
A marked image is shown in Figure 4. Each building has clear boundaries and tag values. In the experimental part, we used this dataset for training and testing, and some of the results are shown in the next section.

PERFORMANCE EVALUATION
The training process is performed with a state-of-the-art deep learning framework on an NVIDIA GTX 1080 GPU. The learning procedure takes about 2 hours due to the large dataset. We present some instances of the resulting confidence values (shown in Fig. 5) and visualizations (shown in Fig. 6).
As mentioned above, each image is divided into different grids, which are detected separately. We analyze the data obtained from the images and predict whether each grid contains a building. In Fig. 5, we show 9 results comparing the ground truth grids with their predictions. Some examples have 25 grids while others have 9, 8 or another number; the number of grids depends on the number of buildings in the background: the more ground truth buildings, the more grids are generated. A grid colored blue indicates that there is a building at that location in the real world, and any other grid indicates there is not. The value in each grid is the confidence of the prediction. In our method, we predict that there is a building if the confidence value exceeds 0.6, and the higher the confidence value, the more likely a building is present. For example, in Fig. 5 (a), the confidence value of the first grid in line 1 is 0, which is lower than 0.6, so we predict that there is no building in this grid; the confidence value of the second grid in line 1 is 0.854, which is larger than 0.6, so we predict that there is a building in this grid, which is confirmed by the ground truth. In particular, a high confidence value in a grid that actually contains a building indicates good performance of the method. Fig. 6 illustrates the visualization of the building predictions. In each instance the background is the ground truth of the real scenario, and the identified buildings are labeled with different colors. We show the visualization outcomes for 9 scenarios, including a commercial circle, a playground and a road area. The buildings are dense in some scenarios and sparse in others, which is reflected in the number of grids generated by the method.
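The per-grid decision rule described above reduces to a simple threshold on the confidence values (0.6, as in the text); function and variable names are ours:

```python
def grid_predictions(confidences, threshold=0.6):
    # A cell is predicted to contain a building iff its
    # confidence exceeds the threshold.
    return [[c > threshold for c in row] for row in confidences]
```

For instance, a row with confidences 0 and 0.854 (as in Fig. 5 (a)) yields a no-building and a building prediction respectively.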
Among the 9 visualization outcomes, it can be seen that the size and shape of the buildings are captured by the colored areas. In general, the method recognizes all the buildings with high performance; however, some points remain to be improved. For example, about 10 buildings are recognized in Fig. 6 (a) and labeled with blue, purple, green and so on, but on close inspection some corners and boundaries are incomplete in the visualization. Moreover, the method sometimes recognizes the wrong object: in Fig. 6 (g), for example, the playground has been colored purple, but the playground is not actually a building. And in Fig. 6 (i), the ring-shaped building colored blue is not recognized accurately enough, because the center area of the ring is not part of the building and should not have been colored. In a word, our method can recognize the buildings in various scenarios with high accuracy, but we should further improve recognition accuracy and avoid such errors in order to better apply it in real applications.
According to the comparison between ground truth and predictions on the grids, as well as the visualization of the building predictions, our method achieves high accuracy, efficiency and reliability.

CONCLUSION
In this paper, we formulate the flight path task for UAVs, considering the actual perception needs. A new convolutional neural network (CNN) model is proposed to detect and localize objects such as buildings, together with an optimization method that finds the optimal flight path to accurately recognize as many objects as possible at a minimum time cost. The simulation results demonstrate that the proposed method is effective and efficient, and can address real-world scene understanding and path planning problems for UAVs.

ACKNOWLEDGMENT
This work is partially supported by Jiangsu Provincial Key Research and Development Program, BE2017007-1.