Constructing ecological indices for urban environments using species distribution models

Ariel Levi Simons (  levisimons@gmail.com ) University of California Los Angeles https://orcid.org/0000-0003-0492-5492 Stevie Caldwell University of California Los Angeles Michelle Fu University of California Los Angeles Jose Gallegos University of California Los Angeles Michael Gatheru University of California Los Angeles Laura Riccardelli University of California Los Angeles Nhi Truong University of California Los Angeles Valeria Viera University of California Los Angeles


Introduction
The expansion of cities, both in their geographic scope as well as their use of resources, often result in ecological degradation (Elmqvist et al., 2016). However, it is not inevitable that increased urbanization will lead to ecological destruction and collapse. Densely populated landscapes can contain resilient ecosystems (Beller et al., 2019), prompting the study and management of urban environments as ecosystems in their own right (Montero, 2020).
As a result of ecological transformations brought on by urbanization, many municipalities and regional agencies have begun quantifying the impacts of urbanization on ecological metrics such as biodiversity (von der Lippe et al., 2020) and ecological functions (Eötvös et al., 2018). These efforts have led to the popularization of environmental report cards: a means of assessing and summarizing the environmental performance of urban systems based on metrics such as the proportion of native to non-native species and area of habitat dedicated to pollinators (Logan et al., 2020).
One city actively involved in the development of urban ecological assessments is Los Angeles (Rauser, 2021). The Los Angeles area is situated within the California Floristic Province, one of the 36 most biodiverse terrestrial ecosystems in the world (Myers et al., 2000). Being a heavily urbanized area within a designated biodiversity 'hotspot' (Gillespie et al., 2018), Los Angeles served as a model city for evaluating a framework for assessing ecological conditions in an urban context. Although various urban environmental report cards use measures of taxonomic and functional richness of local communities as part of their assessments, it remains an open question as to how indicative these values are of underlying environmental quality. Though various groups of species are used as ecological indicators, the determination of which species to use in constructing such assessments is often not well de ned (Siddig et al., 2016). As urban environments are traditionally described as disturbed (Johnson and Munshi-South, 2017), there is the additional di culty in performing ecological assessments as these often depend on the use of undisturbed reference conditions (Utz et al., 2009).
Here we propose the use of machine learning to bypass the issues of de ning reference conditions in urban environments, as well as select species to use as ecological indicators. The use of machine learning has been used to select species which are sensitive to a number of environmental gradients, and which could act as ecological indicators (Moraitis et al., 2018). Such methods can also predict-given a set of species locations and maps of environmental conditions-the likely spatial distribution of a given species. These species distribution models (SDMs), representing suitable habits for a given species, have increasingly been used in conservation biology and environmental management (Angelieri et al., 2016).
We used the machine learning technique of Maximum Entropy (MaxEnt) in order to select indicator species and construct indices of urban ecological conditions from their SDMs. We selected MaxEnt for constructing our SDMs as it can work with presence-only data for species observations (Elith et al., 2011), which are commonly found in databases such as the Global Biodiversity Information Facility (GBIF) (Edwards, 2004), and which has been used in assessing species as ecological indicators (Jose V, 2020).
Here we will combine SDMs from multiple indicator species in order to construct maps of their predicted richness as our measure of ecological assessment. In order to evaluate our indices we then used a random forest model, which has been used to evaluate the reliability of other ecological assessments (Rodriguez-Galiano et al., 2012).

Scope of data
Our initial set of species observational data was obtained from the GBIF with the following query: (1) observations made within the period 2010 -2020, (2) within the spatial extent of Los Angeles County, and (3) with a spatial uncertainty of less than 30 m. This initial set of 120,713 observations was then ltered using the R package sp (Pebesma et al., 2012) to contain only observations within the city boundaries of Los Angeles and with a spatial uncertainty of less than 10 m. To ensure su cient data to run a SDM for each species we then only retained those with at least 30 records. This ltering produced a list of 179 species (SI: File 1) containing 18,504 observations. This initial list was then split into native and nonnative species using the CalFlora list of plant species native to Los Angeles County (http://www.cal ora.org/), and a corresponding list for animals provided by the Los Angeles Sanitation and Environment (LASAN) (SI: File 2), as well as a list of invasive plant species provided by Cal-IPC (https://www.cal-ipc.org/plants/inventory/). Following this split, we found 120 native and 59 non-native spaces with 12,320 and 6,184 observations respectively (SI: Fig. 4).
We obtained our initial set of 32 environmental layers from a variety of sources (https://doi.org/10.5068/D1W988), which were then clipped and aligned to the city boundaries of Los Angeles using GDAL (Warmerdam, 2008) and QGIS 3.4 (QGIS Development Team, 2018) with a 30 ft resolution. We then removed 9 of these layers with a collinearity of r > 0.5 (Meier et al., 2010) using the function removeCollinearity within the R package virtualspecies (Leroy et al., 2016) with 100,000 randomly selected points (SI: Fig. 4).

Generating ecological indices
To select species for our ecological indices we rst generated 100 SDMs, hereafter referred to as initial SDMs, for each species from both our non-native and native lists. Each SDM was created using the maxent function within the R package dismo (Hijmans et al., 2017). As input, we randomly selected 25 presence and 500 background points using the function spsample within the R package sp. To account for the geographic clustering in species observations we generated a bias le for both our native and nonnative species lists (SI: Fig. 5). These two les were generated using the function kde2d from the R package MASS (Ripley et al., 2013), with species density rasters generated using the function rasterize within R package raster (Hijmans and van Etten., 2015) as input.
The rates of true and false positives and negatives were found for these initial SDMs using the function evaluate within dismo. These rates were then used to calculate the Symmetric Extremal Dependence Index (SEDI) (Wunderlich et al., 2019) for each SDM, along with the mean SEDI per species. We selected the SEDI, as opposed to other metrics such as the Area Under the Curve (AUC), to evaluate our initial SDMs as it has been found to provide more consistent evaluation of the sensitivity and speci city of models over a wider range of study extents and species prevalence (Wunderlich et al., 2019). We then selected the 25 species from both lists with the highest mean SEDIs as indicator species (SI: Fig. 4). As indicator species are often selected based on their sensitivity to environmental conditions (Caro, 2010), our selections of SDMs with a greater SEDI score are expected to be associated with species whose occupancy is more strongly associated with a given set of environmental variables.
To further reduce the effects of spatial clustering of our observational data we then spatially thinned our data, using the function thin within the R package spThin (Aiello-Lammens et al., 2015) to set our minimum separation distance to 500 m between observations within species (SI: Fig. 4).
The relative importance of the 23 environmental variables used to generate these SDMs was calculated as a percent contribution to each model using the function var.importance within the R package ENMeval (Muscarella et al., 2014). The mean relative importance of each environmental variable was then calculated for both the native and non-native lists (Table 2). We then removed variables with a mean relative importance of less than 3% for the generation of our habitat suitability maps, leaving us with two lists of variables for further modeling with native and non-native species (Table 3, SI: Fig. 4).
For both lists, we then generated habitat suitability maps for each species using the function predict within the R package raster. Each species' habitat suitability map was generated using the maxent function with all available spatially thinned presence points, 20 times that number of background points, and a model threshold set as the maximum sum of the speci city and sensitivity. The habitat suitability maps for the 25 non-native ( Fig. 1A) and native ( Fig. 1B) indicator species were then summarized using the calc function within the R package raster. These two summary maps are the Los Angeles Ecological Index (LAEI) and the native Los Angeles Ecological Index (nLAEI), which predict non-native and native species richness respectively (SI: Fig. 4).

Evaluating ecological indices
To evaluate our indices we built and evaluated 100 random forest models of the LAEI and nLAEI as a function of our environmental parameters (SI: Fig. 4). Each model was constructed using 20 randomly selected locations within our study area. We chose 200 locations as this was found to be the minimum number needed to consistently model signi cant (p < 10 -4 ) predictions of both indices. Each of these groups of sampled points was split into training and testing sets using a 5-fold partitioning with the function kfold within dismo (Hijmans et al., 2017). We then constructed random forest models using our training data and the function tuneRF, within the R package randomForest (Liaw and Wiener, 2002), with stepFactor set to a value of 1 and doBest set as 'true'. Predicted index values were calculated using the function predict within randomForest, which were then compared against their actual counterparts with a Pearson correlation coe cient.
We calculated the mean and standard deviation on our 100 Pearson correlation coe cients under a Fisher transformation using the FisherZ function within the R package DescTools (Signorelli, 2020). We used a Fisher transformation as it has been found to produce less biased summary statistics for a set of Pearson correlation coe cients (Corey et al., 1998). The average and standard deviation were then inverse Fisher transformed using the function FisherZInv within the DescTools package.
To compare the relative importance of environmental variables within our models we used the function partial within the R package pdp (Greenwell, 2017). The relative importance of our model variables was calculated as their mean decrease in node impurity as quanti ed by their Gini indices, with a larger value denoting greater relative importance. To generate individual partial dependence plots we used the function response within the R package dismo.
To visualize a heat map of the 100 iterations of the partial dependence plots for our continuous variables we then used the function geom_bin2d within the R package ggplot2 (Wickham et al., 2016). For all of our partial dependence heat maps, we divided our axes into 20 bins and generated a best-t curve using the function stat_smooth within the R package ggplot2. For our categorical variables, we visualized the partial dependences of our models with violin plots generated using the function geom_violin within the R packages ggplot2. Using this procedure, we were able to visualize the partial dependences of our models against their constituent variables.

Ecological indices and their responses to environmental gradients
We identi ed 25 separate native and non-native species as ecological indicators (Table 1), the SDMs of which we used to construct our ecological indices (Fig. 1). We found our SDMs for both sets of species to be strongly in uenced by a number of the same environmental variables (Tables 1 and SI: Table 4 We also observed a number of differences in the responses of both indices to various environmental conditions. For the nLAEI we tended to nd an increase in the richness of indicator species associated with an increase in elevation (Fig. 3G), proximity to lakes (Fig. 3J), or areas dominated by shrubland or canopy forming tree cover (Fig. 3I), while the LAEI has no signi cant responses to these variables. With the LAEI we found lower indicator species richness associated with an increase in either the density of sites with threatened groundwater quality (Fig. 2H), or requiring a cleanup of hazardous materials (Fig.  2I), while nLAEI was largely unaffected by these variables.

Evaluation of ecological indices
We found a small, but signi cant (p < 0.05, Mann-Whitney-Wilcoxon test), higher set of mean SEDI scores for the MaxEnt models of native versus non-native indicator species (SI: Fig. 6). This difference suggests a greater degree of sensitivity of our native over our non-native SDMs to our environmental gradients. In evaluating both indices as random forest models of their constituent environmental variables, we found the Pearson correlations between the predicted and actual ecological index values (SI: Fig. 7) were slightly, but signi cantly (p < 0.05, Mann-Whitney-Wilcoxon test), higher for the nLAEI versus LAEI.

Indicator species
As with previous research using community science observations of species (Petersen et al., 2021), we found geographic clustering in the locations of both our native and non-native indicators (SI: Fig. 5).
While we were able to account for these spatial biases in generating our indices, we do note a geographic bias in our observations of our native indicator species towards large open spaces (SI: Fig. 5B) versus the more heavily urbanized areas where we found observations of our non-native indicator species (SI: Fig.   5A).
We found that membership in our list of indicator species, particularly those native to Los Angeles, were biased towards birds (Table 1). This re ects a bias in the community science data sets found in GBIF towards observations of birds (Petersen et al., 2021). Despite this bias, birds have been useful as ecological indicators in a variety of studies (Mekonen, 2017), including assessments of urban environments (Callaghan et al., 2021).

Response of indices to environmental variables
For both indices, we found similar responses to seven of our environmental variables: land use category, housing unit density, slope, proximity to streams, predominant tree leaf type, climatic zone, and land cover category. We also observed that our indices were responsive to unique sets of environmental variables. For the LAEI this included the density of sites threatening groundwater quality and those requiring the cleanup of hazardous materials. For the nLAEI we observed signi cant responses to the proximity to lakes, elevation, as well as the dominant type of canopy forming plant cover.

Response to land use
We tended to nd our indices predicting an elevated richness of indicator species in areas with land use categorized as either open space or containing public facilities ( Figs. 2A and 3A). Prior observations have indicated greater species richness in open spaces, such as parks and unbuilt land, with comparatively lower levels of human activity (Paker et al., 2014). For both indices, we found land use to be the most important variable (Table 2), re ected in elevated indicator species richness in areas designated as parks or open space reserves such as the western Santa Monica mountains or Gri th Park (Fig. 1).

Response to housing density
Although species vary in their tolerances to the built density of urban environments (Godefroid and Ricotta, 2018), previous studies have indicated a broadly negative relationship between the richness of species and housing density (Clarke et al., 2013). We also found our indices to predict a lower richness of native, as compared to non-native indicator species richness, in response to housing density (Figs. 2B and 3B). These results may re ect the apparent role of cities acting as ecological lters for more urbantolerant non-native species (Blouin et al., 2019).

Response to slope and elevation
We found both indices had a clear negative response to slope (Figs. 2C and 3E). Across various landscapes in the Los Angeles area, it has been found that rates of soil erosion tend to increase with slope (DiBiase et al., 2011). This observation indicates that steep hillsides will tend to have lower rates of soil stability, in turn supporting a reduced species richness (Hubbert et al., 2006). This relationship appears to be particularly strong in semi-arid Mediterranean climates, such as Los Angeles (Smith et al., 2007).
The nLAEI predicted a trend of greater native indicator species richness with increased elevation (Fig. 3G).
This may in part re ect a historic trend towards greater plant species richness at higher elevations in California (Seabloom et al., 2002), as well as the effects of greater human disturbance at lower elevations in the Los Angeles area (Syphard et al., 2018).

Response to stream and lake proximity
We found both our indices tended to decline with distance from streams, although this decline was sharper in native versus non-native species (Figs. 2D and 3F). This relationship between species richness and stream proximity has been observed in previous studies of urban habitats, at least with regards to bird diversity (Chin and Kupfer, 2020). For plants, especially those in areas with a Mediterranean climate such as Los Angeles, proximity to water has also been found to be a signi cant contributor to local diversity (Hawkins et al., 2003). This decline in indicator species richness with distance from water also appears to be the case with proximity to lakes, although this only appears to be signi cant for our native index (Fig. 3J).

Response to predominant tree leaf type
For both indices, the same dominant leaf cover type determined the same maximum and minimum indicators for species richness (Figs. 2E and 3C). In areas dominated by deciduous vegetation, our predicted indicator species richness was at its maximum. The native vegetation of Los Angeles was dominated by the coastal sage scrub and oak woodlands which are mainly seasonally dry, deciduous plants (Avolio et al., 2020). As the Los Angeles area urbanized this pattern in vegetation held, with many of the introduced tree species were deciduous as well (Avolio et al., 2020), especially in heavily managed landscapes such as parks and backyards (Clarke et al., 2013). Though these landscapes are heavily altered, they do tend to enhance local biodiversity of both native and non-native species (Clarke et al., 2013;Avolio et al., 2020).
In areas dominated by built structure, as opposed to tree cover, both our indices predicted a reduction in indicator species richness (Figs. 2E and 3C). This result may be due to urban areas containing large tracts of transformed, fragmented habitat (Elmqvist et al., 2016). Heavily built areas in urban environments also tend to experience biotic homogenization, tending to create landscapes containing a few remaining common native species, along with a relatively small number of ubiquitous non-native species (Concepción et al., 2016). As a result, urban species richness tends to be signi cantly lower in densely developed areas than in green spaces (Kondratyeva et al., 2020).

Response to climatic zones
Both indices had the same minimum indicator species richness within climatic zone 21 (Figs. 2F and 3H), designated as 'Thermal Belts of SoCal w/Occasional Ocean In uence', which tends to be further inland on steeply sloped land. Within the landscapes of Los Angeles, this climatic zone tends to be drier, and with lower minimum temperatures, than other climatic zones such as those dominated by marine in uence in zone 24. Across the landscapes of southern California there tends to be greater species richness in such moderate, and relatively wetter environments (Schoenherr, 2017), which may in part re ect the minima of our indices in drier areas with larger temperature ranges.
Response to land and predominant canopy cover As with prior observations of urban landscapes, we found the differences in vegetation described by land cover to be in uential in predicting species richness (Roy et al., 1999;Petersen et al., 2020). Both indices predict indicator species richness minima and maxima associated with the same categories of land cover (Figs. 2G and 3D). We predict indicator species richness maxima in areas with land cover types associated with either chaparral or dune vegetation, both associated with the pre-urbanized landscapes of Los Angeles (Anderson et al., 2009), and which support local diversity even when adjacent to built environments (Chace and Walsh, 2006). We found a similar response, although only for the nLAEI, for indicator species richness and the predominant canopy cover in an area (Fig. 3I). Similar to the responses of indicator species to land cover, we predicted a maxima in species richness in areas with shrubland, and a minima where no signi cant canopy forming vegetation is present.
With areas where the vegetative cover was replaced with buildings, or dominated by cropland and pasture, we predicted minima in indicator species richness (Figs. 2G and 3D). While human activity does not impact native and non-native diversity equally (Wania et al., 2006), anthropogenically dominated environments such as built and agricultural landscapes tend to have the reduced species richness we observed in the response of our indices (Simmonds et al., 2019).

Responses to environmental contamination
With the LAEI we predicted declines in non-native indicator species richness associated with increases in the density of contaminated sites involving both groundwater (Fig. 2H) and soil (Fig. 2I). Both measures of contamination are associated with degraded landscapes in Los Angeles (Campbell et al., 2021), which have been found to reduce local biodiversity (Guilland et al., 2018).
We found no strong responses from the nLAEI to either measure of environmental contamination. Independent of the pollution tolerances of our native versus non-native species, we propose that this aspect of the behavior of the nLAEI to be due in part to the preponderance of highly mobile native indicator species, which can readily avoid contaminated areas. Of our 25 non-native indicator species 10 were plants and 8 birds, while for our native indicator species these values were 3 and 18 respectively, potentially re ecting a taxonomic bias towards birds in community science (Troudet et al., 2017). Successful non-native species also tend to be generalists, with relatively low sensitivities to a variety of environmental gradients (Callaghan et al., 2019). With a reduced sensitivity to various environmental gradients, we propose that our non-native indicator species are also more likely to be found in all urban areas, and thus encounter additional anthropogenic stressors than their native counterparts.

Evaluation of ecological indices
Using random forest, we found a small, but signi cant, greater level of reliability in predicting the nLAEI over the LAEI (SI: Fig. 7). We propose that this difference is due in part to the lower sensitivity of our nonnative to native indicator species to environmental conditions (SI: Fig. 6), which has been observed in other ecosystems (Davey et al., 2012). This reduced sensitivity, especially to anthropogenic stressors, tends to select species that are tolerant of various disturbances across an urban landscape (Callaghan et al., 2019). Various human activities in Los Angeles, such as the importation of large amounts of water to cultivate ornamental plants, further support the widespread distribution of non-native species (Avolio et al., 2020). These differences in the environmental responses of native and non-native indicator species are likely re ected in both the more uniform spatial distributions of the LAEI (Fig. 1A) over the nLAEI (Fig.  1B), as well as greater reliability of the latter (SI: Fig. 7).

Conclusions
Our results support the potential use of SDMs in building an assessment framework for urban ecosystems. Using MaxEnt, we selected a native and non-native set of indicator species based on the sensitivity of their SDMs. We then used these sets of SDMs to generate a native and non-native assessment of ecological conditions in Los Angeles, based on the predicted richness of our indicator species. While both indices can be reliably predicted using a relatively small number of environmental variables, we did nd our index utilizing only native species performed better than its non-native counterpart.
However, the method by which we generated our indices remains constrained by the current limitations of community science-based observations of species. These observations have a well-established taxonomic bias towards vertebrates, and in particular birds, which is likely re ected in the number of avian species we selected as indicators. For future work we therefore recommend the addition of other sources of data on species observations, ranging from camera traps to the sequencing of DNA from environmental samples, in order to build more robust assessments of ecological conditions in urban environments.   Slope (nLAEI/LAEI): % land slope (Maune, 2006). Figure 1 The LAEI (A) and nLAEI (B).

Figure 2
Density of 100 partial dependence plots for random forest models of the LAEI. The pink line represents a nonparametric loess curve with an associated 95% con dence interval.

Figure 3
Density of 100 partial dependence plots for random forest models of the nLAEI. The pink line represents a nonparametric loess curve with associated 95% con dence interval.

Supplementary Files
Page 26/26 This is a list of supplementary les associated with this preprint. Click to download.