Landscape-scale Remote Sensing and Classification of Pond and Lake Habitat Diversity in a Tropical City


 Ponds and lakes are common freshwater habitats in urban landscapes, and often have a high biodiversity and conservation value. The importance of landscape-scale conservation of pond networks has recently been recognised, yet the categorisation and classification of pond network spatial structures is missing. Developing spatial methods and tools to characterise and understand pond networks is a critical first step to accurately conserve pond habitats and inhabiting species. This paper presents an inventory of ponds and lakes in Greater Kuala Lumpur, Malaysia, characterising their distribution, abundance and type. Remote sensing was first employed to map and characterise these habitats, followed by multivariate cluster analysis to classify and develop a typology of the ponds identified. Physicochemical data was collected from a sample (n=60) of ponds to compare with the remotely sensed pond classification. Results demonstrated that multi-source remote sensing can be highly accurate and effective in inventorying ponds and lakes of varying sizes. A total of 1013 ponds and lakes were identified within the Greater Kuala Lumpur region and were found to be highly environmentally heterogeneous. Typology clusters were driven by land cover rather than local physicochemical variables demonstrating that specific remotely-sensed variables may be sufficient proxies for certain chemical variables. Landscape-scale conservation and management of pond networks should utilise remote sensing tools, to establish pond network structure, and to maintain wide environmental heterogeneity among pond habitats. Incorporating remote sensing tools into pond conservation will ensure that effective pond conservation can be achieved and biodiversity protection can be maximised.

With urban land cover expected to increase in the future (Seto et al., 2012), the recognition of urban 88 landscapes, particularly urban ponds and lakes, for biodiversity conservation is growing. This study 89 identifies and maps ponds at the landscape-scale to characterise their abundance, distribution and type. Open access multispectral satellite imagery such as Landsat and Sentinel have sufficient radiometric 95 resolution to identify ponds and lakes by spectrally separating water from other land covers; 96 identification this way is typically straightforward and can by highly accurate (Verpoorter et al., 2014). 97 High temporal resolution is only necessary if ephemeral ponds are prevalent, reflecting their 98 hydrological variability. In this paper, ponds are defined as standing waterbodies between 1 m 2 to 2 ha 99 (Biggs et al., 2005), with any water body larger than 2 ha considered to be a lake. Spatial resolution is 100 a key challenge in freshwater pond habitat identification, as larger pixels will increase the minimum 101 mappable unit (MMU) at which an object can be reliably identified. The MMU is theoretically 4 pixels 102 but in practice can be 11-12 pixels (Lechner et al., 2009). At pixels 30m or larger, small or medium-7 step involved collecting field data for a sample of ponds from each of the types identified in step 2. For 157 these sample ponds, satellite and field data were used together to help understand how individual ponds 158 and the pond types differed in their environmental condition. These steps are outlined in detail below. 159 160

Remote sensing mapping of ponds 161
Google Earth Engine (GEE) was used to generate Sentinel-2 visual and infrared (IR) bands for the study 162 area for 2018. The images were cloud masked and 20m bands (B5, 6, 7, 8A, 11, 12) pansharpened to 163 10m, using the average of the 10m bands (B2, 3, 4, 8) as a simulated panchromatic band (Kaplan, 2018). ) were generated and stacked together with the IR 171 bands. Finally, high-resolution Airbus Pleiades imagery for the study area was downloaded from the 172 online mapping service, Yandex Maps, available at 2.4m, were then imported into GEE to further 173 pansharpen the Sentinel-2 image to 2.4m resolution using the method described above. 174

175
Training areas were selected with the aid of higher resolution (50 cm) World View 3 reference data 176 from Google Earth, Street View and personal expertise. In GEE, a random forest classifier was trained 177 and used to classify land cover (Table 1). A roads mask was added using data from OpenStreetMap. 178 Standing water bodies were manually identified by eliminating rivers, then manually edited to reflect 179 the correct spatial extent. Aquaculture ponds were excluded from the analysis as they are highly and are stocked with a high densities of fish. Finally, a raster land cover map and a vector ponds 182 shapefile were produced. 183 184 An accuracy assessment was conducted for the GKL land cover map using 50 random points per class. 185 Furthermore, to provide an estimate for the number of smaller ponds either too small to be reliably 186 detected or completely undetected by remote sensing classification, 50cm resolution World View 3 187 images in Google Earth were used to manually count the number of undetected ponds over a sample of 188 areas across GKL. This sample of areas consisted of 24 evenly spaced points across GKL each buffered 189 by a 2 km radius. 190 191

Pond metric selection 192
Using the land cover map and ponds areal extents, various environmental and spatial metrics were 193 extracted for each pond. Eight explanatory variables, representing five environmental characteristics of 194 ponds, which could be mapped with remote sensing were assessed: dimension, shape, proximity, 195 surrounding land cover and sediment (Table 2). A correlation matrix ( Fig. 1 and Table 3) was used to 196 investigate relationships between metrics. Area and perimeter were highly correlated (Pearson's r = 197 0.96, p < 0.01) and as a result only area was selected for further analysis. Perimeter-Area Ratio (PAR), 198 shape index and fractal dimension are all measures of shape complexity, so using only one of these 199 helped reduce metric redundancy (Cushman et al., 2008). Since PAR and fractal dimension proved 200 highly correlated (Pearson's r = 0.79, p < 0.01), and the Shape Index is merely a normalised PAR, to 201 avoid the size-dependency problem of PAR (Forman and Godron, 1986), the Shape Index was selected 202 to represent pond shape complexity. A measure of pond proximity is typically recorded as the number 203 of water bodies within 500 m for smaller-scale studies (Waterkeyn et al., 2008;Hill, Heino et al., 2017). 204 However, in the present study the mean distance to the nearest three ponds was used as the analysis is 205 at a larger scale and many water bodies are located further than 500 m from their neighbours. To these, the characteristics at the 100m and 1km buffer distances proved least correlated (Table 3) and so 209 these two were used for further analyses; these are useful as representing characteristics and possible 210 anthropogenic influences over more at proximate and larger areas. The variables were reduced using a 211 PCA and are represented by component 1. The final list of metrics used for clustering is presented in 212 Pond clustering (n=1013) was based on the final set of remotely-sensed metrics and was performed 216 using three methodshierarchical clustering, k-means and partitioning around medoids (PAM). Three 217 clustering validation measures (connectivity, Dunn index, silhouette width) were generated for cluster 218 numbers from 5 to 10 to identify the clustering method and number of clusters that would best 219 differentiate between ponds and generate distinct clusters. Since connectivity should be minimised, and 220 both Dunn index and silhouette width should be maximised, connectivity was multiplied by -1, and 221 each clustering validation measure was normalised (Table 5). PAM with 6 clusters delivered the highest 222 values and was used for the main clustering. Using a Bray-Curtis dissimilarity matrix, an analysis of 223 similarities (ANOSIM) was conducted to determine whether similarity within clusters is greater than 224 similarity between clusters. In addition, similarity percentages (SIMPER) analysis was undertaken to 225 determine the most important contributors to differences between clusters. ANOSIM and SIMPER 226 In total, 60 ponds (10 from each cluster) were sampled between 18 Dec 2019 and 22 Jan 2020. The 230 choice of which pond to sample was governed by the need to include ponds from across the whole of 231 GKL, so that a wide range of environmental conditions across different areascould be incorporated. 232 Water samples were taken from three different locations in each pond, at around 5m from the shore.
From the water samples, turbidity (NTU) was recorded using a turbidity meter (Hach 2100Q), while 234 conductivity (µS/cm), pH and dissolved oxygen (mg/L) were recorded using a multiparameter probe 235 (YSI Pro Plus). Average values for turbidity, conductivity, pH and dissolved oxygen were calculated 236 from the three samples. The following variables were visually estimated for each pond: percentage of Cluster analysis was conducted on the sampled ponds (n=60). Clustering validation measures were 251 computed as described in Section 2.4, but using only 3-6 clusters due to the smaller number of ponds. 252 Three cluster scenarios with different variable combinations were run. Firstly, for remote sensing 253 variables and fieldwork variables of sampled ponds, a 4-cluster k-means model performed best (Table  254 6a). Secondly, for remote sensing variables of sampled ponds only, a 3-cluster hierarchical model 255 performed best (Table 6b). Thirdly, for fieldwork variables of sampled ponds only, a 4-cluster k-means 256 model performed best (Table 6c). Bray-Curtis dissimilarity matrices, ANOSIM and SIMPER analyses 257 were conducted for each clustering scenario, following the procedure outlined in Section 2.4.

Pond and lake numbers 261
The land cover map of GKL (2,950 km 2 ) derived from remote sensing achieved an overall classification 262 accuracy of 82.0% ( Fig. 2; see Table B1 in Supplementary Material B for confusion matrix). In total, 263 1271 standing waterbodies were identified; of these, 258 aquaculture ponds were excluded, leaving 264 1,013 ponds and lakes (Fig. 3a). The smallest pond was 0.0819 ha (819 m 2 ) and the largest lake was 265 418 ha (4.18 km 2 ). Based on the number of smaller undetected ponds per km 2 within the 24 sampled 266 areas, it is estimated that another 440 ± 243 (90% CI) smaller undetected ponds may exist over the 267 entire GKL (Fig. 3b). These values suggest that the satellite-based analysis detected 74.3% (90% CI:

Classification of ponds 272
A 6-cluster PAM model was found to perform best for clustering all 1013 ponds using the remote 273 sensing physical variables. Overall, similarity within clusters was greater than similarity between 274 clusters (ANOSIM R=0.64, p=0.001), indicating that the clusters are distinct. The clusters were mostly 275 separated based on land cover, as supported by the PCA ordination biplot (Fig. 4) and SIMPER analysis 276 (Table 7). between clusters, one-way ANOVAs showed significant differences between clusters for all remote 282 sensing variables, as well as PMS, temperature and DO (Table 8) The spatial distribution and abundance of each cluster is shown in Fig. 5 and Table 10. Peri-urban and 292 suburban areas of Selayang, Shah Alam and Sepang have a greater diversity of pond types than the 293 urban core of the city (Kuala Lumpur, Petaling, Subang), natural forested areas (northern Kajang), rural 294 areas (northern Klang) or the industrial and port areas (southern Klang). Cluster 1 and 2 are mostly 295 absent from the urban core and largely recorded near peri-urban and suburban areas. Cluster 3 can be 296 found in the urban core, suburban and peri-urban areas. Cluster 4 is predominantly found in the peri-297 urban areas. Clusters 5 and 6 are mostly found in the urban core and suburban areas. Clusters 3, 5 and 298 6 are found in all districts. 299 300

Comparison of remote sensing and fieldwork data 301
Using the same clusters assigned to each pond from the 6-cluster PAM model for remote sensing 302 variables, similarity within clusters for the fieldwork variables were not found to be greater than 303 similarity between clusters (ANOSIM R=0.019, p=0.25). Co-inertia analysis between the PCA 304 ordinations of remote sensing variables and fieldwork variables resulted in an RV coefficient of 0.22 305 ( Fig. 6), indicating that the fieldwork variables did not cluster into as similar groups as the remote-306 sensing variables. However, the remote sensing-derived red band values are positively correlated with 307 turbidity and appear to be an acceptable proxy (Table 11). pH is positively correlated with area and 308 shape index, and negatively correlated with the red band value. DO is negatively correlated with mean 309 distance to the 3 nearest ponds. 310

Ordination and clustering of all variables for sampled ponds 313
Using the same clusters assigned to each pond from the 6-cluster PAM model for remote sensing 314 variables, similarity within clusters for all (remote sensing and fieldwork) variables were not found to 315 be greater than similarity between clusters (ANOSIM R=-0.033, p=0.84). However, the PCA biplot 316 suggests that although clusters overlap significantly, clusters 1-3 are generally separated from clusters 317 4-6, with this separation occurring along land cover, median(RED), turbidity and the percentage of pond 318 margin shaded (Fig. 7a). A new clustering of all variables (remotely sensed and fieldwork) for sampled 319 ponds (n=60) using the best model, a 4-cluster k-means model (Fig. 7b), generated significantly distinct 320 clusters (ANOSIM R=0.65, p=0.001). A SIMPER analysis suggested that the k-means clusters are 321 mostly separated by turbidity and conductivity (Table 12). All variables were subjected to an ANOVA 322 to determine whether clusters were significantly different; clusters only displayed statistically 323 significant differences for median(RED), turbidity and conductivity (Table 13). 324 325 Based on the above results, a new clustering of remote sensing variables for sampled ponds (n=60) 326 using the best model, a 3-cluster hierarchical model (Fig. 8a), generated clusters more similar within 327 than between clusters (ANOSIM R=0.86, p=0.001). A SIMPER analysis suggested that the clusters are 328 now mostly separated by land cover (Table 14), which is comparable to the separation of clusters of 329 remote sensing variables for all ponds (Table 9). Clustering of fieldwork variables for sampled ponds 330 (n=60) using the best model, a 4-cluster k-means model (Fig. 8b), generated clusters more similar within 331 than between clusters (ANOSIM R=0.66, p=0.001). A SIMPER analysis suggested that the clusters are 332 now mostly separated by turbidity and conductivity (Table 15), just as the clusters of all variables for 333 sampled ponds (Table 9). 334 335 Surveying and assessing natural habitats in urban landscapes is challenging as they tend to be 338 fragmented and small in size (Holgerson and Raymond, 2016). Landscape-scale urban pond 339 conservation is increasingly recognised as important, but this requires methods for spatial data 340 acquisition and analysis suited to this scale. To date, almost no attention has been given to how best to 341 address the challenge of surveying urban wetlands. Furthermore, our methods can address data and Landscape-scale pond management also requires analytical methods for large datasets. Cluster analysis 377 spatial data can provide a classification of pond types, which can yield insights into pond environmental 378 characteristics at landscape-scale and guide pond management. Remote sensing variables provide 379 clusters separated primarily by land cover rather than patch metrics, suggesting that land cover is a 380 stronger driver of clustering than patch metrics. As a whole, physicochemical variables derived from 381 fieldwork do not group into the same clusters as remote sensing variables, but specific chemical 382 variables appear to be moderately correlated with specific remote sensing variables, which may be used suburban areas can support more diverse pond types, so pond management needs to consider the 406 opportunities and constraints arising from urban structure and development dynamics. In GKL, one 407 cluster (3) with more natural influence can be found in all districts and across the urban core, suburban 408 and peri-urban areas. This suggests that there are blue-green spaces across all areas of GKL that can 409 provide natural spaces and corridors for dispersal, although this is heavily dependent on species and 410 obstacles (e.g., high rise developments) present.

Conclusion 435
This study has highlighted the application of multi-source remote sensing for the accurate mapping and 436 typing of lentic habitats across an urban landscape. It is one of very few studies of its kind, and has two 437 important outcomes. The first is that the approach provides a robust way of identifying pond network 438 spatial structure, and hence provides the evidence-base needed for the maintenance and management of 439 environmental heterogeneity at large spatial scales. The second it that the methods used in this study 440 provides an accurate inventory of urban lentic habitats. The satellite and follow-up map-based analysis 441 suggest that approximately 1700 wetlands exist across GKL, along with another 258 aquaculture ponds. 442 By quantifying the number and distribution of ponds and lakes in major urban cities, an accurate 443 assessment of the freshwater resource can be determined, enabling practitioners to ensure that ecosystem services (e.g., wellbeing and stormwater collection) can continue to be provided by these 445 freshwater habitats to urban communities, urban biodiversity can be maximised (through targeted 446 conservation and management), and areas can be identified where the creation of urban freshwater 447 habitats are needed to support communities and biodiversity.

Conflicts of interest 453
No conflicts of interest are declared. 454

Availability of data and material 455
Data and material are available in the supplementary. Additional data available upon request from the 456 authors. 457

Code availability 458
Code is available in the supplementary. 459

Author's contributions 460
HCT, MJH, AML, FYT and CNG conceptualised the study, designed the methods, reviewed and 461 edited the manuscript. HCT performed the data collection and wrote the initial manuscript. CNG, 462 AML and MJH supervised the study. All authors read and approved the final manuscript. 463

Consent to participate 466
Not applicable. 467

Consent for publication 468
Not applicable.                   (colour indicates cluster generated by a 3-cluster hierarchical model from remotely-sensed variables) and (b) fieldwork variables (colour indicates cluster generated by a 3-cluster hierarchical model from fieldwork variables). The variables 'pca100' and 'pca1k' refer to land cover within 100 m and 1 km respectively reduced to the first principal component; 'pca_mac' and 'pca_bank' refer to macrophyte coverage and bank material respectively reduced to the first principal component.