As outlined in the introduction, our purpose is to create a dataset that supports the development and evaluation of methods for detecting water bodies in Unmanned Aerial Vehicle (UAV) imagery. The dataset must therefore satisfy several conditions. First, the images should include various water body types, especially potential arbovirus vector breeding sites, and should capture the environmental characteristics and diverse habitats of the Vietnamese region, including suburban and rural areas. Second, the backgrounds should vary widely, and the associated ground-truth annotations should be comprehensive enough to facilitate developing, training, and testing different computer vision algorithms. Lastly, the dataset should be copyright-free or permissible for use within the research community, considering the high cost usually associated with aerial imagery.
Drone and Camera
This research used the DJI Phantom 4 Multispectral (P4M) UAV and DJI GS Pro software for agricultural and environmental analysis. Figure 1 gives an overview of the drone and its multispectral camera. This UAV carries a high-quality multispectral camera system comprising five spectral bands: Red, Green, Blue, NIR, and Red-edge. Integration with a Real-Time Kinematic positioning system enables highly accurate positioning, streamlining automated flight planning and control. DJI GS Pro software is essential for efficient multispectral image acquisition, contributing to highly accurate mapping and data collection. Before each flight, we calibrated the spectral imagery by capturing images of MicaSense Calibrated Reflectance Panels; after the flight, these reference images were used to calibrate the collected imagery so that the spectral values are as accurate as possible.
Collection, Annotation, and Construction
The study was conducted in Ben Cat town, Binh Duong province, an area in southern Vietnam with high exposure to and prevalence of arbovirus diseases. We purposively selected regions to maximize the variety of water body conditions in rural and peri-urban areas, including reservoirs, ponds, temporary water pools, and road puddles. The dataset was collected across 16 flights from August to November 2023, during the wet season, which is crucial for maintaining consistency between successive images, and we meticulously planned the flight paths. The detailed map of these locations is marked by map pins in Fig. 1: four locations are on the Vietnamese-German University campus, two are in a residential quarter of Thoi Hoa town, and ten are in the rural area of An Dien town, consisting of fields and rural regions.
In this study, we implemented a comprehensive process for collecting and annotating unmanned aerial imagery, primarily to facilitate the training of different algorithms. Our aerial coverage typically encompassed areas of 100 m x 100 m to 200 m x 200 m (up to 4 ha), using a flight speed of 5 m/s at an altitude of 120 m above the ground. We also ensured sufficient image overlap (around 70%) for orthomosaic construction. In the annotation stage, we employed two methods tailored to the specific requirements of each algorithm. Bounding-box annotations, essential for object detection tasks, were created with the LabelImg tool [15] by manually drawing rectangular boxes around objects of interest in the images. For segmentation tasks, which require a more granular approach, we used the CVAT tool [16] to brush precise masks over every object in the images.
Our final dataset contains 1,013 images across five spectral bands: RGB (three bands), NIR, and Red Edge. Each image has a resolution of 0.0265 cm per pixel and a size of 1600 x 1300 pixels. Figure 2 visualizes some RGB and NIR image examples from the WaterMAI dataset. The color images, consisting of three 8-bit channels (R-G-B), were used exclusively for training the RGB model, while the additional NIR spectral band from the P4M camera was pivotal for training multispectral models. We also utilize the NDWI band, with the methodologies and experimental setups detailed in later sections of this article. Furthermore, to maintain the quality of our dataset, we filtered out images of subpar quality, particularly those affected by poor weather conditions such as darkness or cloud cover, as these factors significantly hinder the clarity of manual annotations.
After filtering out low-quality and poorly exposed images, the WaterMAI dataset contains 870 six-band images for training and validation. For the testing dataset, we collected images in two distinct areas: one similar to the training set and the other a completely new area around Ben Cat town. We nevertheless kept the distribution of the testing data similar to that of the training dataset. The number of images in each split is given in Table 1.
Table 1
Distribution for training, validation, and testing purposes

| Purpose | Number | Modality | Wavelength range | Resolution |
| --- | --- | --- | --- | --- |
| Training and validation | 870 images per band | RGB | 450 nm − 590 nm | 0.0265 cm/pixel |
| | | NIR | 790 nm | |
| Testing | 143 images per band | RGB | 450 nm − 590 nm | 0.0353 cm/pixel |
| | | NIR | 790 nm | |
To process the multispectral image data from the P4M, we employed Pix4Dmapper to generate orthomosaic images from the UAV imagery. Initially, raw images are imported into Pix4Dmapper with key settings such as the WGS 1984 coordinate system. The process involves several stages: initial image alignment, point cloud creation, 3D mesh creation, and development of a digital surface model from which the orthomosaic images are constructed. This method allows seamless integration with Geographic Information Systems, enhancing spatial analysis capabilities. Figure 3 shows a sample orthomosaic image, demonstrating the detailed and geographically accurate representation of the surveyed area, which is useful in applications such as environmental monitoring.
One of the primary objectives of this paper is to provide the computer vision community with resources for water-body detection from multispectral aerial imagery. This section benchmarks several deep learning architectures, described below, by training and evaluating them on our collected WaterMAI dataset to facilitate a comprehensive comparison.
Water Bodies Detection Algorithm
You Only Look Once version 7 (Yolov7) [17] is an advanced object detection model known for its speed and accuracy, applying a trainable bag-of-freebies approach. Yolov7 combines innovative techniques such as mosaic augmentation, self-adversarial training, and a CSPNet backbone to improve performance during training without significantly increasing computational complexity or inference time. These strategies aim to enhance the model's robustness, generalization, and efficiency in detecting objects in images or videos. Similarly, DocF, proposed by Fang Qingyun [18], maximizes the potential of the different modalities of multispectral images by utilizing a Cross-Modality Fusion Transformer, a straightforward yet powerful method. Unlike previous approaches relying on Convolutional Neural Networks (CNNs), the Transformer-inspired network captures long-distance dependencies and incorporates broader contextual details during feature extraction. The network seamlessly combines information within and between modalities, effectively capturing interactions between RGB and other multispectral domains. Comprehensive experiments on multiple datasets confirm the effectiveness of the proposed scheme in achieving state-of-the-art detection results.
In contrast, the Multispectral Semantic Segmentation Network (MSNet) [19] is a deep convolutional segmentation network that has made significant strides in remote sensing. The model splits multispectral bands into visible and invisible groups, namely RGB and NIR, to fully exploit the multispectral information when distinguishing features such as water and vegetation. MSNet uses ResNet-50 for feature extraction and cascaded upsampling to increase resolution, fusing multi-scale image and spectral features through a feature pyramid structure. Experiments carried out by MSNet's authors showed competitive performance compared with similar methods. Yuxiang Sun's RTFNet [20] likewise leverages visible and invisible spectral bands, combining an encoder-decoder design that uses ResNet for feature extraction with a new decoder that restores feature map resolution. Finally, the U-Net architecture [21], a traditional yet powerful semantic segmentation model, is also evaluated on the collected WaterMAI dataset for benchmarking purposes.
Experimental Setup
Multispectral imagery is a powerful remote sensing tool that leverages diverse channel combinations to extract features from the Earth's surface. Each combination serves a unique purpose, offering insights into various environmental and geographical phenomena. In this experiment, we utilize the three channel combinations described below.
RGB + NIR
RGB provides basic color information for distinguishing vegetation, soil, and water. NIR, on the other hand, is strongly reflected by vegetation but absorbed by water, making it excellent for differentiating water bodies from vegetated areas. This combination can enhance the model's ability to detect water bodies by contrasting them against land and vegetation.
NDWI + NIR + Green
NDWI is specifically designed to enhance the presence of water bodies in multispectral imagery. Combining the NDWI and NIR bands has the potential to maximize the response of water while minimizing the response of vegetation. Moreover, adding the green band to this combination can further improve water body detection, as water reflects relatively strongly at green wavelengths.
Red + NDWI + Blue
This combination can effectively identify water bodies, even in areas with diverse land features, since NDWI is specifically designed to detect water and the red and blue bands offer additional distinct water signatures. Furthermore, research by Bulent Ayhan [22] indicates that a similar combination (NDVI, Green, and Blue) achieved the highest accuracy for chlorophyll-rich vegetation detection on a public dataset. This finding encourages the exploration of similar spectral band combinations in this paper.
The NDWI highlights open water features in a satellite image, allowing a water body to stand out against soil and vegetation. It is calculated from the Green-NIR (visible Green and Near-Infrared) combination, with a small constant ε added to the denominator to prevent division by zero. The NDWI is defined as
NDWI = \(\frac{Green - NIR}{Green + NIR + \epsilon }\) (1)
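As a minimal sketch of how Eq. (1) and the three channel combinations above can be prepared as model inputs (assuming the calibrated bands are available as NumPy arrays; the function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def compute_ndwi(green: np.ndarray, nir: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """NDWI = (Green - NIR) / (Green + NIR + eps), as in Eq. (1)."""
    green = green.astype(np.float32)
    nir = nir.astype(np.float32)
    return (green - nir) / (green + nir + eps)

def build_inputs(red, green, blue, nir):
    """Stack the three channel combinations used in the experiments.

    Returns arrays of shape (H, W, C); band names follow the paper,
    but this helper itself is an illustrative assumption.
    """
    ndwi = compute_ndwi(green, nir)
    rgb_nir = np.stack([red, green, blue, nir], axis=-1)   # RGB + NIR
    ndwi_nir_g = np.stack([ndwi, nir, green], axis=-1)     # NDWI + NIR + Green
    r_ndwi_b = np.stack([red, ndwi, blue], axis=-1)        # Red + NDWI + Blue
    return rgb_nir, ndwi_nir_g, r_ndwi_b
```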
The deep learning experiments are conducted on a GPU server with one CPU (12th Gen Intel i9-12900K), 64 GB of RAM, and one GPU (Nvidia RTX 3090) with 24 GB of VRAM. For the primary training parameters, the batch size is 8 to maximize memory utilization, and the initial learning rate is 1e-2. We apply a dynamic linear learning rate scheduler from [23], defined by a linear decay function that decreases the learning rate linearly from the initial value to a set minimum as the epochs progress. This scheduling strategy has been shown to improve training accuracy and lead to better learning performance. The optimizer is momentum-based Stochastic Gradient Descent [24], the Binary Cross-Entropy loss is selected for the binary water body detection task, and the number of training epochs is 200 to ensure convergence. Additionally, we apply mixed precision training [25], which uses a combination of single-precision and half-precision floating-point numbers to speed up training while maintaining accuracy and reducing memory consumption.
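A minimal PyTorch sketch of this training configuration is shown below; `model`, `train_loader`, the momentum value, and the minimum learning rate `lr_min` are illustrative placeholders, as the paper does not specify its exact implementation.

```python
import torch
from torch import nn

def train(model: nn.Module, train_loader, epochs: int = 200,
          lr0: float = 1e-2, lr_min: float = 1e-4, device: str = "cuda"):
    # Illustrative placeholders: `model` and `train_loader` stand in for the
    # actual network and the WaterMAI dataloader.
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.9)
    # Linear decay from lr0 down to lr_min over the training epochs.
    schedule = lambda e: (1 - e / epochs) + (e / epochs) * (lr_min / lr0)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=schedule)
    criterion = nn.BCEWithLogitsLoss()    # binary water / non-water task
    scaler = torch.cuda.amp.GradScaler()  # mixed precision training

    for epoch in range(epochs):
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():  # half-precision forward pass
                loss = criterion(model(images), masks)
            scaler.scale(loss).backward()    # scaled backward pass
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()                     # per-epoch linear LR decay
```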
Evaluation metrics
The evaluation algorithm analyses a set of ground truth images and predictions from a test fold, where the predictions represent the anticipated positions of the target within the test images. Given an image I and a chosen threshold t, detections in I whose score surpasses t are considered valid detections, while others are disregarded. In practice, each prediction of a computer vision model falls into one of four outcomes: the number of correct positive predictions matching the ground truth is denoted TP(I,t); detections wrongly classified as positive are counted as FP(I,t); instances improperly classified as negative are designated FN(I,t); and the number of correct negative predictions matching the ground truth is TN(I,t). To assess the performance of the models in identifying water bodies within a specific test fold, precision and recall are defined as
precision(t) = \(\frac{\sum_{I \in fold} TP(I,t)}{\sum_{I \in fold} TP(I,t) + \sum_{I \in fold} FP(I,t)}\) (2)

recall(t) = \(\frac{\sum_{I \in fold} TP(I,t)}{\sum_{I \in fold} TP(I,t) + \sum_{I \in fold} FN(I,t)}\) (3)
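A minimal sketch of these fold-level metrics for binary water masks follows; it assumes predictions are per-pixel scores in [0, 1] and ground truths are binary masks, and all names are illustrative.

```python
import numpy as np

def confusion_counts(pred_scores: np.ndarray, gt_mask: np.ndarray, t: float):
    """Per-image TP, FP, FN at threshold t, per the definitions above."""
    pred = pred_scores > t  # keep only detections scoring above threshold t
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return tp, fp, fn

def precision_recall(fold, t: float):
    """Fold-level precision and recall, Eqs. (2) and (3).

    `fold` is an illustrative iterable of (pred_scores, gt_mask) pairs.
    """
    tps, fps, fns = 0, 0, 0
    for pred_scores, gt_mask in fold:
        tp, fp, fn = confusion_counts(pred_scores, gt_mask, t)
        tps, fps, fns = tps + tp, fps + fp, fns + fn
    precision = tps / (tps + fps) if (tps + fps) else 0.0
    recall = tps / (tps + fns) if (tps + fns) else 0.0
    return precision, recall
```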
The F1 score is computed as the harmonic mean of the precision and recall scores, reflecting their relative contributions. It is defined as
F1 = \(2 \times \frac{precision \times recall}{precision + recall}\) (4)
The F1 score ranges from 0 to 1, where 0 signifies a complete inability to detect any observation correctly and 1 indicates a perfect match of every observation with the ground truth. In addition, the Dice Score at threshold t, DSC(t), is a widely used metric in segmentation tasks. It measures the overlap between the predicted segmentation X and the ground truth Y, normalized by the total size of both the predicted and actual segmentations. A Dice Score of 1 indicates perfect detection, whereas a score of 0 indicates no overlap.
DSC(t) = \(\frac{2 \times \left|X\cap Y\right|}{\left|X\right| + \left|Y\right|}\) (5)
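Building on the sketch above, F1 (Eq. 4) and the Dice score (Eq. 5) can be computed as follows, again with illustrative names:

```python
import numpy as np

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, Eq. (4)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def dice_score(pred_scores: np.ndarray, gt_mask: np.ndarray, t: float) -> float:
    """DSC(t) = 2|X ∩ Y| / (|X| + |Y|), Eq. (5), for binary masks (illustrative)."""
    x = pred_scores > t        # predicted segmentation X at threshold t
    y = gt_mask.astype(bool)   # ground truth segmentation Y
    denom = x.sum() + y.sum()
    return 2 * np.logical_and(x, y).sum() / denom if denom else 0.0
```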