Deep learning-based framework for vegetation hazard monitoring near powerlines

The increasing popularity of drones has led to their adoption by electric utility companies to monitor intrusive vegetation near powerlines. This study proposes a drone-compatible, deep learning-based detection framework for monitoring vegetation encroachment near powerlines that both estimates vegetation health and detects powerlines. Aerial image pairs from a drone camera and a commercial-grade multispectral sensor were captured and processed into training and validation datasets, which were used to train a Generative Adversarial Network (Pix2Pix model) and a Convolutional Neural Network (YOLOv5 model). The Pix2Pix model generated satisfactory synthetic image translations from colour images to Look-Up Table (LUT) maps, while the YOLOv5 model performed well at detecting powerlines in aerial images, with precision, recall, mean Average Precision (mAP) @0.5 and mAP@0.5:0.95 values of 0.82, 0.76, 0.79 and 0.56 respectively. The proposed framework detected the locations of powerlines and generated NDVI estimates, represented as LUT maps, directly from RGB aerial images. It could thus serve as a preliminary and affordable alternative to relatively expensive multispectral sensors, which are not readily available in developing countries, for monitoring and managing the presence and health of trees and dense vegetation within powerline corridors.


Introduction
The increasing popularity of drones has also led to their adoption by electric utility companies to monitor intrusive vegetation near powerlines, owing to their ability to provide reliable and cost-effective inspections, minimising downtime and improving the efficiency of such companies' monitoring operations [1]. Besides the lines themselves, the monitoring also covers surrounding objects, most notably vegetation, to ensure that electrical transmission grids operate safely. Monitoring the vegetation along powerline corridors is important to prevent nearby trees from damaging the equipment, causing short circuits [2] and blackouts [3]. In dry conditions, an abundance of scorched vegetation may also fuel forest fires that destroy powerlines [4]. During storms, trees can fall across power lines, causing outages and severe damage to the equipment. Foot patrols are among the traditional methods used by utility companies to manage the vegetation around powerlines, but they are infrequent, slow to deploy, expensive and offer low accuracy [5]. Foot patrol inspection has nevertheless been used extensively due to its high detection rate despite these drawbacks [6]. An evaluation of an unmanned aerial vehicle revealed that it could be faster than traditional foot patrols at inspecting and generating models of high-voltage power lines [3,7,8]. In addition to reducing the overall cost of maintaining the power lines, proper management of the vegetation can also help maintain the electricity supply.

Various studies have been presented by [2,3,9,10] on the importance of monitoring power lines. Several studies on the usage of drones for vegetation encroachment near powerlines have focused on the segmentation, classification and extraction of trees [6]. Classification of tree species can be beneficial as it allows for the identification of undesirable species that are most likely to cause problems in the future, while encouraging the planting of low-growing trees as healthy competition [11]. Multicopter platforms equipped with various miniaturised sensors usually employ remote sensing techniques to gather data for vegetation health determination, detection and classification. Multispectral (MSI) sensors are preferred to colour or Red-Green-Blue (RGB) imagers for certain applications because they provide spectral information that is not visible to the human eye. Studies by [12] and [4] used multispectral images to automatically extract and identify trees near powerlines. Both studies based their methodology on the ratio of the red band to Near Infrared (NIR) reflectance as input to a pulse-coupled neural network. The results revealed that the system was able to improve its segmentation output by applying various morphological processing operations, with detection rates higher than the segmentation accuracy of tree crowns. The paper by [13] aimed to verify and identify the algorithm most suitable for processing vegetation images. The study employed a consumer camera attached to a drone to collect data from various areas near the powerline and analysed them using the Visible Atmospherically Resistant Index (VARI), achieving good detection results for vegetative and non-vegetative areas. According to [14], VARI can lessen sensitivity to atmospheric effects and enable estimation of the vegetation fraction. Only the camera's red, green and blue (RGB) bands were needed to produce the result based on the VARI formula. Although colour models can provide the necessary information, they may not be able to identify all the vegetation hazards found within a power line track, hence the need for multi- or hyperspectral sensors [15]. However, high-end remote sensing equipment such as multispectral sensors, thermal sensors and hyperspectral cameras is costly and not readily available to the general public [16].
Deep learning models have exhibited promising results in visual applications such as object detection, image classification and segmentation [17], and in the reconstruction of multispectral images from RGB images [18-21]. Recently, studies by [18,19,22-24] proposed deep learning-based models to reconstruct multispectral images from colour images. The availability of low-cost sensors and electronic devices has the potential to stimulate new applications and discoveries in participatory research and to provide frequent, effective monitoring of the environment [25]. An intelligent AI-based framework that leverages existing knowledge in deep learning and miniaturised high-performance computers has therefore become necessary to promote the frequent and sustainable integration of remote sensing solutions into the continuous monitoring and management of vegetation encroaching on powerlines. The study therefore proposes a deep learning-based detection framework compatible with UAVs for monitoring vegetation encroachment near powerlines. The framework leverages the computing capability of NVIDIA's Jetson Nano and/or Graphics Processing Unit (GPU)-enabled machines to integrate a generative adversarial network (Pix2Pix) for estimating vegetative indices and YOLOv5 for detecting powerlines in RGB images captured with drones.

Drone surveys
The flight-planning stage is the preliminary and most critical step to ensure safe flights and to minimise post-processing times and costs. The unique characteristics of the vegetation imagery captured during mapping make it strenuous to match these images. To minimise this issue, the drone was programmed to capture images with very high overlap values of at least 75% along the flight directions, in both the longitudinal and transversal directions. Meteorological conditions were also considered during the surveys. All flights were performed under favourable weather conditions during the central hours of the day, when the sun was close to its zenith, to minimise shadows and bidirectional reflectance distribution function (BRDF) effects. The BRDF describes how light is reflected given an illumination orientation and an observation orientation; in aerial photography, this effect frequently appears as hotspots or darkened corners [26]. As part of the pre-flight preparations, multispectral images of a reflectance calibration target were captured to perform radiometric calibration. A calibration target is used to convert values expressed as Digital Numbers (DN) into real reflectance values, as recommended for vegetation analysis, to allow the comparison of images acquired at different times.

Data
The data collection process commenced with aerial image acquisition using a drone and a commercial-grade multispectral camera over several random areas within the Sunyani Municipality of the Bono Region of Ghana. Except for the town centre, Sunyani is situated in a lightly forested region with a significant amount of vegetation growing within the city limits. There are 42 small towns and villages in the study area surrounding the city of Sunyani. Data samples available from the manufacturers of the multispectral camera were also included to augment the data captured locally. A DJI Mavic 2 drone equipped with a multispectral camera was used. Images were captured without ground control points (GCPs); however, all images were geotagged in EXIF format with embedded latitude, longitude and altitude information. The multispectral images were captured using a 12-megapixel MAPIR Survey3W OCN camera [27]. This camera captures Orange, Cyan and Near Infrared (NIR) channels, which sense light at 615 nm, 490 nm and 808 nm respectively. These three bands differ from the usual Red, Green and NIR channels used for Normalized Difference Vegetation Index (NDVI) calculations: instead of red and green, the camera captures orange and cyan (blue-green), and it captures NIR at 808 nm instead of 850 nm, a shift intended to provide enhanced contrast. One of the biggest issues with using such a camera to compute a vegetation difference index is image noise due to the presence of red light. To minimise this issue, the manufacturer chose to capture orange light instead of red, which significantly suppresses soil noise and cross-talk between pixels. The images were saved in JPG format, with the RGB and multispectral images having resolutions of 5472 × 3078 pixels and 4000 × 3000 pixels respectively. A single UAV flight took between 5 and 10 min to complete.
The quality of the RGB images collected was estimated using the image quality tool provided by the Agisoft Metashape software [28], based on each image's sharpness. For 3D reconstruction purposes, only images with a quality score above 0.5 were retained for potential use [29]. The images captured with the multispectral camera were calibrated using the MAPIR Camera Control application. The images were loaded into the application together with the image of the reflectance calibration ground target taken before each survey. The reflectance calibration ground target contains four (4) felt-like sheet targets of known reflectance, measured using multiple Shimadzu spectrophotometers with an integrating sphere [27]. After measuring the target pixel values, calibration formulas were computed by the software, and reflectance calibration was then applied to the remaining drone photographs. The next step involved using photogrammetry techniques to create an orthomosaic. The various stages and settings used for the RGB and MSI orthomosaic generation are shown in Table 1.
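Conceptually, this per-band calibration amounts to fitting a mapping between the panel pixel values measured in the image and the panels' known reflectances. The sketch below illustrates a simple linear version of this idea; the panel reflectances and DN values are hypothetical placeholders, and the actual model fitted by the MAPIR Camera Control software may differ.

```python
import numpy as np

# Known reflectances of the four calibration panels for one band.
# Hypothetical values for illustration; the real values are supplied
# with the MAPIR calibration target.
PANEL_REFLECTANCE = np.array([0.87, 0.51, 0.23, 0.02])

def fit_calibration(panel_dns: np.ndarray) -> tuple[float, float]:
    """Fit a linear DN -> reflectance model from the target panels.

    panel_dns: mean digital numbers sampled over each panel in one band.
    Returns (gain, offset) such that reflectance ~ gain * DN + offset.
    """
    gain, offset = np.polyfit(panel_dns, PANEL_REFLECTANCE, deg=1)
    return float(gain), float(offset)

def calibrate_band(band_dn: np.ndarray, gain: float, offset: float) -> np.ndarray:
    """Apply the fitted model to a whole image band and clip to [0, 1]."""
    return np.clip(gain * band_dn.astype(np.float32) + offset, 0.0, 1.0)

# Example with made-up panel measurements for the NIR band
gain, offset = fit_calibration(np.array([61200.0, 35900.0, 16200.0, 1400.0]))
nir = calibrate_band(np.random.randint(0, 65535, (3000, 4000)), gain, offset)
```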
The Vegetative Index used was the NDVI, a simple measure of plants' photosynthetically active biomass that can be used to determine, visualise and delineate vegetated areas on a map, allowing users to monitor the growth process and identify areas of concern. The NDVI is expressed as shown in Eq. 1:

$$\mathrm{NDVI} = \frac{Y - X}{Y + X} \tag{1}$$

where Y is the NIR light @Band 1 (Red Channel) and X is the orange light @Band 3 (Blue Channel). The images taken with the Survey3 camera were then processed to produce an index image, and a coloured Look-Up Table (LUT) was applied to show the contrast between the different vegetation. To normalise all the generated LUT images, the min and max values were set to -1 and 1 respectively. To generate the training, validation and test datasets for the Generative Adversarial Network (GAN) models, the RGB and LUT images were aligned and cropped into equal squares with resolutions of 256 × 256 pixels and 1024 × 1024 pixels. The LUT-RGB image pairs were then split into train, validation and test sets using a ratio of 0.6:0.3:0.1. The dataset contains a total of 3,656 image pairs.
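As an illustration, Eq. 1 and the fixed [-1, 1] LUT rendering could be implemented as in the sketch below; the "RdYlGn" colormap and the epsilon guard against division by zero are assumptions, not details taken from the study.

```python
import numpy as np
import matplotlib

def ndvi(nir: np.ndarray, orange: np.ndarray) -> np.ndarray:
    """Eq. 1: NDVI = (Y - X) / (Y + X), with Y the NIR band and X the orange band."""
    y, x = nir.astype(np.float32), orange.astype(np.float32)
    return (y - x) / (y + x + 1e-8)  # epsilon avoids division by zero

def to_lut(index: np.ndarray, vmin: float = -1.0, vmax: float = 1.0) -> np.ndarray:
    """Render an index image as an RGB LUT map on the fixed [-1, 1] scale."""
    normalised = np.clip((index - vmin) / (vmax - vmin), 0.0, 1.0)
    rgba = matplotlib.colormaps["RdYlGn"](normalised)  # colormap is an assumption
    return (rgba[..., :3] * 255).astype(np.uint8)

# Example on synthetic reflectance bands
rng = np.random.default_rng(0)
lut_image = to_lut(ndvi(rng.random((256, 256)), rng.random((256, 256))))
```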
During the drone surveys, several images of powerlines were taken with the drone camera pointing 90° downward at different heights. An open-source image annotation tool, LabelMe [30], was used to annotate the powerline regions in each image, with the annotations exported in COCO format.
A set of training and validation datasets was then generated for the YOLOv5 model. Several augmentation steps (i.e., flipping, rotation, blurring, exposure adjustment and noise) were applied to the annotated datasets to increase the training sample size. Out of 2,859 annotated images, 2,500 (88%), 239 (8%) and 120 (4%) were used as the train, validation and test datasets respectively. The images were also resized to a uniform size of 415 × 415 pixels.
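The augmentation steps named above could be sketched as follows; the probabilities and parameter ranges are illustrative assumptions, and in a real pipeline the bounding-box annotations would have to be transformed alongside each image.

```python
import cv2
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply flip, rotation, blur, exposure and noise augmentations (a sketch)."""
    if rng.random() < 0.5:
        image = cv2.flip(image, 1)                         # horizontal flip
    k = int(rng.integers(0, 4))
    image = np.ascontiguousarray(np.rot90(image, k))       # 90-degree rotations
    if rng.random() < 0.3:
        image = cv2.GaussianBlur(image, (5, 5), 0)         # mild blur
    gain = float(rng.uniform(0.8, 1.2))                    # exposure jitter
    image = np.clip(image * gain, 0, 255).astype(np.uint8)
    noise = rng.normal(0.0, 5.0, image.shape)              # additive Gaussian noise
    return np.clip(image + noise, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, (3078, 5472, 3), dtype=np.uint8)  # stand-in aerial frame
resized = cv2.resize(augment(frame, rng), (415, 415))          # uniform training size
```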

YOLO model
The YOLOv5 algorithm is based on the detection architecture of the YOLO family and uses several optimisation strategies widely adopted in Convolutional Neural Networks (CNNs). The YOLOv5 algorithm features four main parts: the input terminal, the backbone, the neck, and the output.
The input terminal is mainly used to pre-process the data. The YOLOv5 algorithm can adapt to different datasets by calculating the initial anchor-frame size automatically, a feature useful for tasks such as auto-learning and data augmentation. The backbone network mainly uses a combination of spatial pyramid pooling (SPP) and a cross-stage partial network (CSPNet) to extract feature maps from the input image. The advantage of this strategy is that it allows the network to perform several tasks at once while reducing the amount of computation; the bottleneck CSP likewise increases detection speed by reducing computation. The spatial pyramid pooling structure is additionally used to generate three-scale feature maps. The neck network utilizes the structures of the Path Aggregation Network (PAN) and the Feature Pyramid Network (FPN) to aggregate the extracted feature maps. The FPN [31] structure provides strong semantic features to the top feature maps, while the PAN [32] structure delivers strong localisation features to the lower feature maps. Together, the two feature pyramid structures of the neck improve detection performance by providing a strong representation of the various backbone layers being fused. As the final step, the head of the network produces predictions at the different feature-map scales. YOLOv5 comes in four architecture variants, namely YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x, the main difference being the number of convolution kernels and feature extraction modules at specific locations within the network.

The YOLOv5 model's performance is evaluated in two phases: the first focuses on detection performance, the second on object classification. The metric used to evaluate the system was the mean Average Precision (mAP), which measures the accuracy of an object detection algorithm and is calculated from the Intersection over Union (IoU), recall, and Average Precision (AP) values; a higher mAP score indicates better model performance. Average precision and mean average precision are the most popular metrics for evaluating object detection algorithms. Additionally, the model was evaluated on localising the powerlines using the IoU metric, which determines how correctly the model predicts the bounding boxes of object categories. A detection is typically deemed successful if there is an overlap of more than 60% [33]. The model's false positives (FP), false negatives (FN) and true positives (TP) were analysed at a specified confidence score (conf_sc) according to Table 2 to compute the precision and recall values as shown in Eqs. 2 and 3:

$$P = \frac{TP}{TP + FP} \tag{2}$$

$$R = \frac{TP}{TP + FN} \tag{3}$$
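For illustration, a minimal sketch of the IoU test and of Eqs. 2 and 3 is given below; the (x1, y1, x2, y2) box format and the epsilon guards are assumptions.

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Eqs. 2 and 3: P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

# A prediction counts as a true positive when IoU exceeds the chosen
# threshold (0.6 in this study)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.33, below the 0.6 threshold
```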

Pix2Pix model
The GAN algorithm, based on a generative modelling approach, utilizes a CNN architecture proposed by [34]. GANs are generative models that map a random noise vector (z) to an output image, while conditional GANs generate new outputs (y) from both random noise (z) and an observed image (x). The goal is to train a generator model (G) to produce outputs that closely resemble real images and then pass these "fakes" through an adversarial discriminator model (D), which learns to detect the fakes produced by the generator. The GAN model known as Pix2Pix, presented by [35], is designed for general-purpose image-to-image translation. The model is built on the conditional GAN framework, which allows it to perform various image-to-image translation tasks. It utilizes a discriminator known as PatchGAN [35] and a generator known as U-Net [36]. The U-Net generator produces the translated image, which is then passed to the PatchGAN discriminator, whose resulting statistics are analysed to learn various feature representations. The loss function of the model is the conditional GAN objective shown in Eq. 4:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] \tag{4}$$

which in [35] is combined with an L1 distance term that keeps the generated output close to the ground truth. The model learns to map the noise (z) and input image (x) to an output image; the discriminator tries to maximise this loss function while the generator tries to minimise it. The study adopted the GAN Compression method proposed by [37], a general-purpose method that can reduce the computational cost and time associated with developing conditional GANs, and the pix2pixHD architecture proposed by [38], an improved pix2pix framework that uses a coarse-to-fine generator, a multi-scale discriminator architecture and a robust adversarial learning objective function.
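A simplified PyTorch sketch of this objective, including the L1 term from [35], might look as follows; the λ = 100 weighting follows the original pix2pix paper, D and G are assumed to be a PatchGAN discriminator returning logits and a U-Net generator, and the surrounding training loop is omitted.

```python
import torch
import torch.nn.functional as F

def pix2pix_losses(D, G, x, y, lambda_l1: float = 100.0):
    """Compute discriminator and generator losses for one batch (a sketch).

    x: input RGB batch, y: target LUT batch. D takes the channel-wise
    concatenation of input and (real or fake) output and returns logits.
    """
    fake = G(x)

    # Discriminator: push real pairs towards 1 and fake pairs towards 0
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake.detach()], dim=1))
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # Generator: fool the discriminator while staying close to the ground truth
    g_logits = D(torch.cat([x, fake], dim=1))
    g_adv = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    g_loss = g_adv + lambda_l1 * F.l1_loss(fake, y)
    return d_loss, g_loss
```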
Generative adversarial networks are commonly used to generate high-quality synthetic images and have been shown to perform remarkably well in various problem domains. The generator model is trained via the discriminator model, which learns to distinguish between fake and real images; this adversarial setup means there is no direct objective measure or loss function for evaluating the generator model [39]. Although various measures have been introduced to assess generator performance, there is currently no consensus on which measure should be used to compare the strengths and limitations of models [40]. Developing GAN models can therefore be complex, and manual inspection of generated outputs remains central to the lengthy process of testing, refining and implementing model configurations. The evaluation of GAN generator models is usually performed in the context of a target problem domain, taking into account the quality of the images the models generate. The generator model was saved iteratively over many epochs during training, and each saved model was used to generate synthetic images for post-hoc evaluation to select an acceptable model for use. Quantitatively, the Fréchet Inception Distance (FID) score [41] was used to evaluate the GAN models by summarising the quality of the images they generated. To compute the FID score, a trained Inception v3 model is first loaded with its output layer removed, so that the model's output is the set of activations from the last global spatial pooling layer. This layer has 2,048 activations, so each image is represented by a 2,048-element feature vector, known as its coding vector or feature vector. Feature vectors are then computed for a set of real images from the problem domain, serving as a reference for how real images are represented, and likewise for the synthetic images, yielding two collections of 2,048-element feature vectors, one for the real and one for the generated images. The FID score is then computed as shown in Eq. 5 [41]:

$$d^2 = \lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\left(C_1 + C_2 - 2\sqrt{C_1 C_2}\right) \tag{5}$$

where d² is the FID score, μ₁ and μ₂ represent the feature-wise means of the real and synthetic images respectively, C₁ and C₂ are the covariance matrices of the real and synthetic image features respectively, and Tr is the trace operation from linear algebra. The FID score thus takes into account both the mean and the covariance statistics of the two feature distributions.
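Given the two sets of feature vectors, Eq. 5 can be computed directly; the sketch below assumes NumPy arrays of shape (n, 2048) and uses SciPy's matrix square root.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Eq. 5: d^2 = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * sqrt(C1 @ C2)).

    Inputs are (n, 2048) Inception-v3 pooling activations for real and
    generated images respectively.
    """
    mu1, mu2 = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    c1 = np.cov(real_feats, rowvar=False)
    c2 = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):   # numerical error can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))
```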

Powerline detection model
The evaluation of the model was done on 239 test images unseen by the trained model, with an IoU threshold of 0.6. The model obtained good performance for detecting powerlines in aerial images: the precision (P), recall (R), mAP@0.5 and mAP@0.5:0.95 values were 0.821, 0.762, 0.798 and 0.563 respectively. The true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) make up the confusion matrix (Fig. 1a), defined at the IoU threshold of 0.6. The F1 curve (Fig. 1b) measures the balance between the precision and recall for the given object: a high F1 value implies that both recall and precision are high, while a lower F1 score means that both are lower.
The precision-recall curve (Fig. 2) shows the relationship between the number of positive samples and the accuracy of the model. Precision decreases as recall increases: as more samples are recovered, the model's accuracy in classifying each sample correctly drops. This trade-off is expected, since the model becomes more prone to errors as it attempts to detect more instances.
A mosaic of the ground truth against predicted results on the test samples is shown in Fig. 3. The images in columns a, b, c, d under the validation batch labels correspond to images in columns e, f, g, h under the model predictions respectively. The model missed a few powerlines in the test samples, as shown in Fig. 3 (f7, f11 and f12).

RGB to LUT image translation model
Some visual results of the original and compressed GAN are shown in Fig. 4, which presents the input RGB image, the ground-truth LUT image, the generated LUT of the full GAN and the generated LUT of the compressed GAN. Additional visual results are shown in Fig. 5, which presents the input RGB image, the ground truth and the generated LUT. Quantitatively, the FID scores (the lower the better) were 61.488 and 208.562 for the original and compressed GAN respectively. It can also be observed that the full GAN model generated satisfactory-looking LUTs compared to the compressed version, which shows persistent artefacts bordering the output on the top and left sides. The comparatively high FID scores could be due to irregularities and misalignments in the generation of the RGB-LUT image pairs, arising from the two different lenses of the RGB and MSI cameras and from the image manipulation software used.

Outline of the proposed framework
The proposed deep learning-based detection framework compatible with UAVs for monitoring vegetation encroachment near powerlines is illustrated in Fig. 6. The hardware of the sensor platform consists of (1) an NVIDIA Jetson Nano, (2) a UPS power module and (3) a GNSS module.
The Jetson Nano was announced by NVIDIA in 2019 as a development kit. This small but powerful computer delivers 472 GFLOPS for running modern AI algorithms across multiple neural networks in parallel for image classification, object detection and other AI applications, using as little as 5 watts. The UPS power module provides a 5 V uninterruptible power supply to the sensor. The 4G/GNSS module provides GPS positioning information, which is tagged onto captured images. As shown in Fig. 6, the sensor platform can be used in three configurations: (1) standalone, where batches of images captured with a drone are processed post-flight on the sensor platform to assess vegetation encroachment automatically; (2) workstation, where the same post-flight processing runs on a GPU-enabled machine; and (3) in flight, where the sensor platform is attached to a drone and vegetation hazards are monitored in real time via a streaming platform. Images containing potential vegetation hazards detected by the platform are tagged for further visual inspection and assessment. Figure 7 shows the flowchart of the proposed framework.
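As an illustration only, the standalone post-flight configuration might be organised as in the sketch below; detect_powerlines and rgb_to_lut are hypothetical stand-ins for the trained YOLOv5 and Pix2Pix models, and the folder layout is assumed.

```python
from pathlib import Path
import numpy as np
import cv2

# Hypothetical stand-ins for the trained models; real code would load
# YOLOv5 and Pix2Pix weights exported for the Jetson Nano.
def detect_powerlines(rgb: np.ndarray) -> list:
    return []  # would return (x1, y1, x2, y2, confidence) boxes

def rgb_to_lut(rgb: np.ndarray) -> np.ndarray:
    return np.zeros_like(rgb)  # would return the generated NDVI LUT map

flagged = []
for image_path in Path("flight_images").glob("*.jpg"):  # assumed folder layout
    rgb = cv2.imread(str(image_path))
    boxes = detect_powerlines(rgb)
    lut = rgb_to_lut(rgb)
    if boxes:
        # Powerlines present: keep the frame (with its EXIF GPS tag) so the
        # vegetation around the lines can be inspected on the LUT map.
        flagged.append((image_path, boxes, lut))

print(f"{len(flagged)} frames tagged for inspection")
```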

Conclusion
The detection framework for vegetation encroachment near powerlines was developed using two deep learning methods. The study employed a Pix2Pix GAN for image-to-image translation from RGB to LUT maps representing NDVI, and the YOLOv5 model for detecting powerlines in images captured from a UAV. The study revealed that: (1) the CNN model, i.e., YOLOv5, was able to successfully detect the presence of powerlines of various sizes in aerial images, obtaining good performance with precision (P), recall (R), mAP@0.5 and mAP@0.5:0.95 values of 0.821, 0.762, 0.798 and 0.563 respectively. (2) The original pix2pix GAN model generated satisfactory synthetic image translations from RGB to LUT compared to the compressed version, which visibly showed several artefacts in the generated LUT images and was not close to the ground truth; quantitatively, the FID scores (the lower the better) were 61.488 and 208.562 for the original and compressed GAN respectively, corroborating the visual results. The pix2pixHD model also generated very good synthetic LUT images. (3) The deep learning models, i.e., YOLOv5 and the Pix2Pix GAN, were capable of running on the Jetson Nano; the compressed version of Pix2Pix had much better computing performance but produced unsatisfactory-looking synthetic images. (4) The proposed vegetation detection framework was able to generate NDVI estimates, represented as LUT maps, directly from RGB aerial images, and could serve as a preliminary and affordable alternative to relatively expensive multispectral sensors, which are not readily available in developing countries, for monitoring and managing the presence and health of trees and dense vegetation within powerline corridors.
The research highlights an approach that applies generative adversarial networks (GANs), specifically the pix2pix models, to convert RGB images captured from drones directly into NDVI estimates represented as LUT images, serving as a viable and affordable alternative to expensive multispectral sensors. It also proposes a vegetation detection framework that partly leverages a miniaturised GPU-enabled embedded system (the Jetson Nano) to provide a drone-integrated platform for preliminary vegetation monitoring.