Testing areas. We used three countries as testing areas to evaluate representativeness of the unseen diversity in their PA networks: Costa Rica, Mexico, and USA. These countries meet three criteria that allow exploring a richer portfolio of contexts relevant to the research question: (1) high availability of georeferenced data on megadiverse taxa, something uncommon for most countries, (2) follow a latitudinal gradient from tropical to temperate climates associated with a decrease in species diversity, and (3) PA networks have different sizes (Table 2).
Unseen diversity taxa selection. Deciding about what taxa of unseen diversity could be used was difficult. Almost by definition, unseen diversity groups are at most poorly represented in public databases (e.g. GBIF), and thus the challenge was to find highly diverse taxa with information on distribution for all of the testing areas. For some groups (e.g., bryophytes) the information was good for an area, but too poor for other(s), and information for most groups was poor worldwide. We explored insects as suitable candidates, as they include 5–6 of the 11 millions of described animals 19, are found in almost any terrestrial ecosystem worldwide, and they provide fundamental ecosystem services 29. Although it is estimated that less than 30% of existent insect species have been formally described 58, certain insect orders are reasonably well known as well as represented in public databases. Moreover, studies have shown that higher taxa insect richness, such as genera or families, are good surrogates for total insect diversity at species level 44. Thus, we concluded that insect might be appropriate to guide conservation assessments and suitable for testing extrinsic representativeness coverage in existing PA networks. Finally, insect diversity is reasonably well known for the three testing areas, and they follow the latitudinal gradient criteria we seek (Table 2).
Table 2
Comparative metrics of the three testing areas.
Sources for insect species richness: Costa Rica 59, Mexico 60 and USA 61.
| Costa Rica | Mexico | USA |
Current Protected Area (% of total area) | 27.60% | 14.16% | 12.99% |
Area (km2) | 51,100 | 1.97 * 106 | 9.83 * 106 |
Insect richness (Described / Estimated) Number of described insect species / total area | 68,500 / 365,000 1.341 | 47,800 / 97,000 0.024 | 91,000 / 164,000 0.009 |
We evaluated representation of 1,002 species for each testing area. Species were randomly selected from three of the most diverse orders of insects, Coleoptera, Hymenoptera and Lepidoptera, excluding exotic species listed in national catalogues 62,63 and the IUCN online database (iucngisd.org). These orders were selected because they have the largest and most comprehensive availability of georeferenced data among insect groups, and because they are diverse, allowing us for representing the greatest heterogeneity of niches and ecoregions. For each order, we downloaded occurrence data from different sources (see below), and discarded species with less than 15 unique occurrences, as these would render poor-performing distribution models 64. From the remaining pool of species, 334 were randomly selected for each order, adding up to a total of 1,002 species per study area (Supplementary Table 1). In recognition that data availability for the chosen taxonomic classes and countries is highly heterogeneous, we chose to use a random, constant and relatively high number of species across testing areas to counterbalance the impact of biases in our results.
Species data. We built a database of georeferenced occurrence points, obtained mainly from the Global Biodiversity Information Facility, for Costa Rica (DOI: 10.15468/dl.zropvg), Mexico (DOI: 10.15468/dl.kgaeqh) and USA (DOI: 10.15468/dl.7vadxq), including records within a buffer of 100 km around the border of each country to avoid artifacts in the SDM related to political borders. The database was complemented with additional records from diverse sources: Heliconiine 65, insect occurrences from the Mexican “Datos Abiertos” database (datos.gob.mx), a database for dung beetles from Costa Rica and Mexico 66, and data for the USA accessed through Biodiversity Information Serving Our Nation (BISON, bison.usgs.gov). To minimize the number of wrong records which typically occur in open databases 67, we used the package CoordinateCleaner 68 on the R 3.4 environment 69 to remove records located at country centroid, natural history museums or research facilities, with a coordinate uncertainty higher than 1 km, duplicates with identical coordinate values, or records unidentified at the species level.
Species Distribution Models. We built SDM to estimate the macroclimatic niche of the species 33 using the ‘biomod2’ package 70 in R environment. We generated models as ensembles of three different techniques considered to have higher prediction accuracy 71: Generalized Boosted Models 72, Random Forests 73, and Maxent 74. An ensemble approach using these three techniques was preferred to avoid problems related to the selection of a single modelling algorithm, which can influence results 75. A total of 21 variables were considered as potential predictors, the 19 climatic from Worldclim 2.0 76, and 2 related to vegetation: the Normalized Difference Vegetation Index (NDVI) calculated from MODIS (modis.gsfc.nasa.gov) by averaging data from 10-day periods, and the height of the tree canopy 77. To eliminate multicollinearity between predictors, we calculated the Variance Inflation Factor using the package ‘VIF’ 78, discarding variables with VIF equal or higher than 10. When possible, we decided which variable was kept considering its ecological relevance for the study group. Eight variables remained in the final set of predictors for each country (Table 3). Due to computational limitations, variables resolution was proportional to the area of the country of study: 1 km for Costa Rica, 5 km for Mexico and 10 km for USA.
Table 3
Variables used in the species distribution models for each country. In parenthesis the code of Worldclim variables (www.worldclim.org)
| Costa Rica | Mexico | USA |
Vegetation | NDVI | NDVI | NDVI |
Canopy height | Canopy height | Canopy height |
Temperature | Annual mean (bio 1) | Annual mean (bio 1) | Annual mean (bio 1) |
Seasonality (bio 4) | Mean diurnal range (bio 2) | Mean diurnal range (bio 2) |
Annual range (bio 7) | Isothermality (bio 3) | Seasonality (bio 4) |
Precipitation | Annual mean (bio 12) | Annual mean (bio 12) | Annual mean (bio 12) |
Mean wettest month (bio 13) | Mean wettest month (bio 13) | Mean wettest month (bio 13) |
Mean driest month (bio 14) | Mean driest month (bio 14) | Seasonality (bio 15) |
We built SDM using ten thousand background points randomly selected from within occurrence records of other species of the same order, a method that has shown potential to reduce biases resulting from non-systematic occurrence sampling 79. For each species, models were fitted with 70% of the presence data and validated with the remaining 30%. To increase robustness, we run 10 replicates for each modelling technique and the resulting mean was used. A final ensemble model was obtained for each species by a weighted averaging of models in each replica. Weights were calculated from each model TSS, using only those with TSS > 0.7 80. Only species whose ensemble model had a TSS > 0.7 were kept; species with lower TSS values were discarded. When this happened, a new species was analyzed keeping the final number of analyzed species at 1,002 per country (334 per order).
To limit extrapolations of the model too far from species known occurrences, we modified model outputs to restrict them to areas nearby presences using an exponential decay function as the approach in 57. This method is intended at reduce overprediction at areas far away from known occurrences, simulating the effect of geographic barriers and species dispersal limitations.
Finally, continuous models were transformed into binary maps (presence/absence) using maximum TSS as threshold.
Representativeness assessment. To assess representativeness of PA networks, a conservation target must be defined for each conservation feature (here species). The conservation target is the proportion of distribution area of a given conservation feature that should be included in the PA network in order to be considered represented. In the case of species, it is considered to be the minimum fraction of the distribution area necessary for it to thrive 46. Here, targets were set inversely proportional to the distribution area size of each species, i.e. species with narrower distributions got higher targets than widespread species. Building on the approach in 81, conservation targets ranged between 80% for species with a distribution area less than 100 km2, and 5% for those with ranges larger than 50,000 km2. The lower threshold was set following the B1 criteria (geographical distribution threshold) to declare a species as critically endangered 82. Conservation targets of species between those thresholds were scaled using a loglinear function 83.
Once species were assigned a target, we evaluated its achievement in current PA by comparison with the amount of SDM included in PA. Species with a distribution inside PA greater than their target were considered as represented. We classified species not achieving targets in two categories: underrepresented, when an area equal to 50–99% of their target was included in the solution, and neglected, when the area included in the solution was less than 50% of its target.
Efficiency assessment. We tested the efficiency of PA networks to represent unseen diversity from two perspectives: representation completeness, to measure the amount of area to be added to the existing PA network to represent all species, and representation specificity to measure the overlap between priority areas newly generated to optimize unseen diversity representation and the existing PA network.
The proposed framework to test efficiency requires contrasting current PA networks against “optimal” areas, that were obtained using the R package ‘prioritizr’ 4.1.1 84. Prioritizr uses the integer linear programming solver Gurobi Optimizer 85 to find the combination of sites that maximize the number of conservation targets achieved while minimizing costs (e.g. the amount of area or the cost of sites).
Representation completeness. To assess networks’ representation completeness, we executed Prioritizr and forced existing PA into the solution, so the algorithm selects additional complementary areas as necessary until all species’ targets were achieved. These solutions can inform about how much additional area is needed if targets are not met, allowing for measuring the efficiency of networks as a function of the amount of area needed to complete the species representation sought.
Representation specificity. To assess networks’ representation specificity, a second prioritization was conducted, but with existing PA not forced into the solution. With this setup, the algorithm identifies the optimal set of sites that achieve targets without any prior constraint, which implies that current PA (or portions of them) might or might not be part of the solution. The overlap between current PA and resulting areas allows measuring how existing PA networks are efficient at representing unseen diversity.