Studied species and presence points
In this study, 23 species which met two criteria were selected: 1) the species had to be widely traded globally and 2) an adequate number of presence points had to be accessible for the species (more than 50 presence points). We initially found 89 species that were traded on social media, unofficial platforms, and on the black market, but only 23 species of the initial set had enough presence points to be used for modeling. We compiled presence points for the 23 species from the Global Biodiversity Information Facility (GBIF), VertNet, iNaturalist, and Berkeley Ecoinformatics Engine. GBIF is an international platform aiming to provide publicly-available data on life on Earth. GBIF focuses on providing templates, standards, and tools for sharing information between institutions and individuals. Data available on GBIF come from a number of sources including field observations by experts and citizen scientists, museum records, and surveys. VertNet is an NSF-funded project created through the combination of four taxon-specific repositories (FishNet, MaNIS, HerpNET, ORNIS) to make biodiversity data available online. iNaturalist, jointly funded by California Academy of Sciences and National Geographic is a crowdsourced species identification system and a platform to record presence points. Berkeley Ecoinformatics Engine allows access to several biodiversity information repositories (https://holos.berkeley.edu/LearnMore/datasources/) through its R package. Each database was accessed through their respective packages in R (“rgbif”, “rvertnet”, “rinat”, and “ecoengine” respectively). Queries were made using the scientific name of the species, and georeferenced presence points from January 1, 1998 to January 1, 2018 were extracted. In this study, we used data from the last 20 years to minimize the error in the data (Alhajeri and Fourcade, 2019). Here, we used the datasets for both native and non-native geographical ranges of the specie. In order to obtain more accurate results in modelling, presence points were compared with the native range of the species. In the next step, only the none-native range in countries where established populations of the species had been reported was included (Cordier et al., 2020). In cases where the same data point was recorded more than once, duplicates were omitted in R. After removing duplicates, presence points were used to model the potential distribution of the species. Moran's correlograms were created in Spatial Analysis in Macroecology (SAM; Rangel et al., 2006) to determine spatial autocorrelation for the species. The significance of Moran’s I was tested using a randomization test with 9,999 Monte Carlo permutations, adjusted for multiple testing. In instances where spatial autocorrelation was detected, we restricted the testing and training data as follows: first, a distance threshold was set using distance lags which showed positive spatial autocorrelation (from 10 to 25 km); in the next step, the distance between data pairs was compared with the threshold, and pairs which had a distance smaller than the threshold were placed in the same partition (Parolo et al., 2008). Given that we only had access to species presence points, a number of pseudo-absences were randomly generated (Elith et al., 2011).
Environmental variables
Environmental variables include climatic, topographic, and land cover/land use variables. The variables were selected according to species ecology, the factors determining the distribution of reptiles, and availability of data (e.g. Block et al., 2016; Ribeiro-Júnior & Amaral, 2016; Sanchooli, 2017). An initial set of 19 climatic variables were downloaded from WorldClim 1.4 (Hijmans et al., 2005) at 1-km resolution (https://www.worldclim.org/). Four topographic variables (mean and standard deviation (SD) of elevation, and slope of all raster cells) were derived from the digital elevation model provided by the Shuttle Radar Topography Mission (SRTM) digital elevation model at 90-m resolution (http://srtm.csi.cgiar.org). 16-day composite normalized difference vegetation index (NDVI) data collected by the Moderate Resolution Imaging Spectrometer for the year 2019 at 500-m resolution (MODIS; http://lpdaac.usgs.gov). The land cover map (including 17 classes: 11 natural vegetation classes, three human-altered classes, and three non-vegetated classes) were derived from the combination of MODIS Terra and Aqua data at 500-m resolution (https://modis.gsfc.nasa.gov). To evaluate human impact on the reptile, we calculated the human footprint index using data on settlements, land transformation, accessibility and infrastructure.
To exclude highly-correlated inputs, all pairs of variables were checked for correlation using Pearson’s correlation coefficient (r) in SDMToolbox (Brown, 2014) for ArcGIS. r > 0.70 and r < -0.75 were considered as the threshold values, and variables with stronger correlation were excluded (Kalboussi & Achour, 2018), resulting in a final set of 14 variables (including: 9 climatic, 2 topographic, and 3 land cover/land use variables) (Table 1).
Modelling techniques and ensemble forecasting
Four categories of modelling techniques, encompassing 10 algorithms, were implemented in the Biomod2 package with 80/20 calibration and evaluation, bootstrapping with 10 cycles, 0.5 prevalence, and a high 0.70 quality threshold (Thuiller et al., 2009; 2014) for R version 3.1.25 (R Core Team, 2014). Algorithms in the regression category included generalized linear models (GLMs) and generalized additive models (GAMs), which respectively calculate linear and non-linear correlation between input variables and species presence. Algorithms in the machine learning category included, maximum entropy (MaxEnt), boosted regression trees (BRT), random forest (RF), artificial neural networks (ANN), and multivariate adaptive regression splines (MARS). These algorithms directly predict the environmental space of species according to training data. The classification algorithms included classification and regression trees (CART) and flexible discriminate analyses (FDA), both of which divide data into homogeneous groups in a series of consecutive steps. Finally, the surface range envelope (SRE) algorithm attempts to first describe the ecological conditions in which a species is found, and then find areas with similar conditions (Elith and Leathwick, 2009; Franklin, 2010; Guisan and Thuiller, 2005; Merow et al., 2014).
Only algorithms that met the 0.7 quality threshold were included in the ensemble calculations; those that do not meet this standard were discarded (Thuiller et al., 2009, 2014). To compensate for variation between the number of pseudo-absences (PA), replicates, and model runs based on which algorithm is used, we grouped the 10 algorithms into three PA subsets to allow for optimal ensemble model performance Groups 1, 2, and 3 (Table 3; Barbet-Massin et al., 2012; Brown and Yoder, 2015; Bevan et al., 2019). Group results were compared for accuracy using the true skill statistic (TSS; scaled -1 to +1, where performance ≤ 0 means the model output is no better than random and > 0 means the proposed model successfully distinguishes between suitable and unsuitable habitat). Models were also evaluated for sensitivity and specificity. Sensitivity is the accuracy rate for true positive outcomes (i.e., the probability that the model correctly predicted presence). Specificity is the accuracy rate for true negative outcomes (i.e., the probability that the model correctly predicted absence) (Allouche et al., 2006). Finally, we used a quality threshold to accept models and TSS, sensitivity and specificity values to compare and select among alternative ensemble SDMs (Table 4).
Variable importance is determined in Biomod2 package using permutation as follows: after models are calibrated, a baseline prediction is generated. In the next step, the values of one variable are randomized and a new prediction is generated using the randomized values of the randomized variable and the unchanged values of other variables. The correlation between the baseline prediction and the new prediction (using one randomized variable) indicates the relative importance of the variable that has been randomized (Thuiller et al., 2009). This procedure does not depend on the modelling algorithms employed.