Study area
The results of the models were restricted to the northern portion of the South American continent (Fig. 1 and Supplementary Material S2), but to generate the models, the Neotropical region was chosen (Leroy et al. 2018).
B.rubescens is a canopy species belonging to the Moraceae family (Berg 1972) that can reach 90 cm of diameter at breast height with heights of up to 35 m (Laurance et al. 2004). In Brazil, the main biomes where the species is reported are Amazonia, Cerrado, and Atlantic Forest, with spatial interruptions that coincide with the Caatinga Biome and monodominant forests registered only in Amazonia-Cerrado transition zone (Marimon et al. 2014, 2020). The region is characterized by a high spatial interconnection and continuity of the Amazonia Biome associated with an extensive hydrographic network (Wiens and Donoghue 2004; Senior et al. 2019). In Amazonia and Atlantic biomes the soils are dystrophic but shallow for the former and humid and deep for the latter (Quesada et al. 2010; Joly et al. 2012). On the other hand, Cerrado soils are highly weathered and deep, with acidic pH, low fertility, and high levels of iron and aluminum (Ribeiro and Walter 2008).
OCCs: Obtaining and preprocessing
We obtained the occurrence data (OCCs: Fig. 2a) for B. rubescens from the virtual platforms GBIF (http://www.gbif.org) and speciesLink (http://splink.cria.org.br), complemented with data from the Laboratório de Ecologia Vegetal (Universidade do Estado de Mato Grosso, Nova Xavantina). We incorporate all OCCs data regardless of the sampling date. The initial database had 256 records that we filtered spatially to identify and eliminate data that were outside the Neotropical region. Subsequently, we eliminated all repeated, null, or dubious taxonomic data.
A common problem in data obtained from virtual platforms is the spatial biases associated with the frequency, intensity and, heterogeneity of the sampling efforts (de Oliveira et al. 2014; De Marco and Nóbrega 2018). Using KDE (Kernel Density Estimation) spatial heterogeneity (Silverman 2018), was evidenced in the OCCs of B. rubescens (Fig. 1). To obtain the KDE, at the QGIS software (version 3.14; Quantum GIS Geographic Information System; QGIS.org 2020), we used the interpolation extension ´Heatmap´ with Epanechnikov Kernel function and 10,000 km as bandwidth value. We evaluated the spatial autocorrelation (SAC) of the OCCs using Moran index (Fig. 2a). As the data presented SAC (values greater than |0.2|) and spatial heterogeneity, we applied the “spThin” package with 10 km as the cutoff value (Aiello-Lammens et al., 2015) from the R platform (R Core Team 2020).
Preparation of environmental predictors
To better approximate the fundamental niche of B. rubescens we used predictive environmental variables (Supplementary Material S1) known to present an effect at different scales and magnitudes (Soberon 2010).
Environmental variables act with spatial connections that follow a spatial hierarchical dependence pattern, with climate acting on the global scale, geomorphology on the regional scale, and hydrological variables on the local scale (Wiens and Donoghue 2004; Domisch et al. 2015). In this sense, despite the system emergence properties, evaluating each set of variables in blocks (climate, edaphic and hydric) may promote a better understanding of the effect of each set on the spatial patterns (Velazco et al. 2017).
We established three subsets of environmental predictors: climatic, edaphic, and hydric variables. In the climatic subset, we used 19 predictor variables obtained from the WorldClim platform (http://www.worldclim.org, Hijmans et al. 2005). For the edaphic characterization, we used 17 variables (Supplementary Material Fig. 1) obtained from the SoilGrids project (http://www.soildgrids.org).
As described in different studies, life cycles, biomass dynamics, growth, stature, and species distribution are conditioned by the seasonality of climatic variables related to hydric stress (Bruijnzeel 2004; Feldpausch et al. 2011, 2016). In this sense, we incorporated 23 variables in the hydric subset (Supplementary Material Fig. 1) obtained from different platforms: six variables from the Potential EvapoTranspiration (http://www.cgiar-csi.org), one from Actual EvapoTranspiration, and three from Soil Water Stress (http://www.cgiar-csi.org/data), being six variables of Relative Humidity (https://www.climond.org), three of Cloud Cover (http://www.cgiar-csi.org/data), three of Water Vapor Pressure (http://worldclim.org/version2) and one Topographic Wetness Index variable (http://envirem.github.io).
Considering the relative temporal homogeneity and importance for the characterization and physical description of the landscape of the five following variables: Global Relief Model (http://geodata.grid.unep.ch), Elevation, Slope and, Aspect (http://srtm.csi.cgiar.org) and Terrain Roughness Index (http://envirem.github.io), we included them to each one of the three subsets described above.
To generate the Species Distribution Models (SDMs) of past and future projections, we selected three scenarios: Last Glacial Maximum (LGM: 22,000 years ago), Middle Holocene (MH: 6000 years ago), and Future (F: 2041–2060). We also used three models of global circulation for the selected periods: CCSM4, MPI-ESM-P and, MIROC-ESM; these models, from the WorldClim platform, were chosen because they generate robust temporal projections (Hijmans et al. 2005; Carnicer et al. 2020). Due to the increase in greenhouse gas emissions and trends expected for the future (Riahi et al. 2011; IPCC 2013), we considered the most extreme emission scenario [RCPs (Representative Concentration Paths) = 8.5].
We standardized all environmental predictor variables (μ=0.1, SD=1) to avoid different weights (Fig. 2). Subsequently, we adjusted all variables to 10 km of resolution and cropped them using a Neotropical mask. Finally, we established different treatments from the subsets of environmental predictor variables performing PCAs to reduce data collinearity (De Marco and Nóbrega 2018): PCA1 (C: 24 climatic variables), PCA2 (E: 22 edaphic variables), PCA3 (H: 28 hydric variables), PCA4 (CEH-1: all variables together, 69), and the PCAs for the LGM (Last Glacial Maximum), MH (Middle Holocene) and F (FUTURE) scenarios. In each PCA, we retained the first axes that together explained more than 95% of the original environmental variation. We also obtained the independent contributions of each variable considered in the axes of all PCAs (Supplementary Material Fig. 1).
SDM methods configuration and evaluation
We applied the SDMs that present greater predictive performances and flexibility with presence-only data (Fig. 2c), using the following algorithms: BIOCLIM (BIO), GOWER (GWR), MAXENT (ME), Generalized Linear Models (GLM), Random Forest (RF) and Support Vector Machine (SVM).
To meet the requirements of the algorithms, we generated 10,000 background points randomly distributed within the study area, without spatially coinciding with known occurrence points. Then, we divided the data into 70% for test and 30% for training and validation. We generated the first models for the present scenario (PCA1, PCA2, PCA3, and PCA4) and later for the LGM, MH, and F scenarios. To obtain the potential distribution of B. rubescens, we used the most straightforward configurations of the models (Varela et al. 2011), leaving the results to depend on the interactions between the algorithms and the input data (Fig. 2d).
We executed and built all SDMs using the R “dismo” package (Hijmans et al. 2015) using the specific functions of each algorithm. With some exceptions, we configured the SVM models from the “ksvm” function of the “kernlab” package (Karatzoglou et al. 2004) and RF from the “randomForest” package (Liaw and Wiener 2002). Moreover, we defined the randomization process to 30 iterations for each algorithm (using a self-initialization analysis) to decrease the spatial structure between the training and test data (subsets of the same OCC points), to reduce possible correlation errors, and to reduce biased evaluations. Thus, we generated 180 models for each set of predictive variables, totaling 1,260 models.
We used as a threshold the lowest adequacy value within the incorporated OCCs (LPT: Lowest Presence Training; Pearson et al. 2007), allowing the generation of presence/absence matrices (binary models) from the adequacy of continuous matrices projected for South America (environmental models). Applying the consensus of the weighted average we assembled all the suitability matrices obtained for the different temporal scenarios. In this way, to generate a final distribution matrix for B. rubescens in the temporal scenarios, we applied a consensus method, assembling all the predictions considering only those models with TSS values greater than the mean TSS value.
The performance of the SDMs generated was evaluated by two families of statistical metrics: dependent and independent thresholds (Liu et al. 2005). In this sense, we applied the TSS (True Skill Statistic estimator) (Allouche et al. 2006) in which the response values vary between -1 to 1 where SDMs > 0.5 are considered acceptable or adequate (Allouche et al. 2006). To complement, we applied the independent threshold AUC (Area Under the receiver-operator Curve) (Fielding and Bell 1997), which discriminates areas of omission from areas with known occurrence. Values range from 0 to 1, where the closer to 1 the better is the model's ability to represent reality (Elith 2000). To verify the differences in AUC and TSS values among the PCAs and the applied algorithms we performed Kruskal-Wallis non-parametric tests since the data didn’t meet the assumption of normality.