Species occurrence dataset
This genus Quercus includes 91 species widely distributed across U.S. and showing a marked longitudinal species diversity gradient, with high species concentration in south-eastern North America. Our main occurrence dataset was assembled in a previous study 66. We completed the main dataset using the Integrated Digitized Biocollections (iDigBio) and the Global Biodiversity Information Facility (GBIF) (data downloaded between 15 and 18 December, 2020) and collections from the second author (JCB) 67. All occurrence data were visually examined and any localities that were outside the known geographical range of the species, in unrealistic locations (e.g., water bodies, crop fields) or in botanical gardens were discarded for accuracy. In addition, to avoid problems of spatial sampling bias and spatial autocorrelation we thinned the occurrence records of each species using a spatial thinning algorithm 68 with thinning distance of 1 km for species with less than 100 occurrences up to 5 km for species with more than 10000 records.
Species and assemblage composition evaluation data
Ecological niche models represent abstractions of the environmentally suitable areas for species to maintain long-term viable populations 41. The presence of a species in a locality or a grid cell informs us about the areas that are environmentally suitable for that species 41,69. Its absences inform us of the opposite pattern—namely those areas that are either not environmentally suitable for the species or are the result of historical contingences, biotic interactions that constraint the species presence even if the physical environment is suitable, recent extirpation events such as those caused by land use change, or simply because the species was not detected 41,69. Here, using an independent dataset of true presences and absences from NEON 70, we evaluate the assemblages’ diversity and composition predictions. NEON data follows a nested structure in which subplots are nested in larger plots and these plots in turn are nested within large areas or NEON Sites. More specifically, the presence of species is recorded at 10- and 100-m2 subplots. The recorded species within the subplots are then combined to obtain a complete species list for a plot of 400-m2 70. To obtain a complete list of oak species for each NEON site, we combined the species lists from the plots embedded in each site.
Here we used the species presence and absence at plot and site scale for assemblage predictions. This dataset includes a total of 277 plots of 400-m2 embedded in 19 NEON sites, spanning 8 out of the 17 NEON ecoclimatic domains in the continental United States. In addition, 36 of the 91 oak species were found in the NEON dataset; thus, we restricted all our analyses to this set of species (Fig. 1).
Environmental data derived from Remote sensing
Remote sensing products used as environmental inputs to our models included a set of covariates that allow the description of distribution and environments occupied by oak species (EO-bioclimatic covariates) and covariates that allow the discrimination of local features not captured by bioclimatic information (biophysical covariates) 30. Bioclimatic covariates were constructed by combining monthly Land Surface Temperature and Emissivity (LSTE) from MODIS (MOD11C3) and monthly precipitation from Climate Hazards group Infrared Precipitation with Stations (CHIRPS) 71. Specifically, monthly LSTE and CHRIPS data were used as input to construct 19 bioclimatic covariates based on Earth observations 71, similar to those of WorldClim, using the function ‘biovars’ in the R package dismo 72. We named each bioclimatic variable derived from Earth observation as EO-bioclimatic variables (e.g., EO-BIO1 or EO-MAT, EO-BIO12 or EO-AP) in order to avoid confusion with those bioclimatic variables derived from interpolated climate surfaces. EO-bioclimatic covariates used or modeling oak distributions include mean annual temperature (EO-BIO1), temperature seasonality (EO-BIO4), minimum temperature of coldest month (EO-BIO6), mean temperature of warmest quarter (EO-BIO10), annual precipitation (EO-BIO12), and precipitation seasonality (EO-BIO15), all critical for the distribution and differentiation of oak species 66,67.
Biophysical covariates include Leaf Area Index (LAI) composites 30. LAI data were obtained from MODIS Terra/Aqua MOD15A2 over a 15-year period (2001–2015) using the interface EOSDIS Earthdata. The 15-year LAI composite provides a representation of the spatial variation of biophysical variables of different components of vegetation and ecosystems over the course of this time frame 28,30. LAI strongly co-varies with the physical environment; for instance, higher LAI is associated with warmer, wetter and stable environments, whereas lower LAI with cooler, drier and less stable environments 73–75. In particular, LAI has a strong relationship with climate and captures the dynamics of the growing season, including vegetation seasonality—important for characterizing plant species geographical ranges 30,34. We posit, therefore, that the spatial and temporal variation in LAI—half of the total green leaf area per unit of horizontal ground surface area 76—represents an important variable for determining species distributions that integrates across biotic and abiotic factors 41,77. Finally, we also included mean elevation or altitude from Shuttle Radar Topography Mission (SRTM). SRTM is a high-resolution digital elevation model of the Earth that has been used for mapping and monitoring the earth’s surface 78 and an important variable for predicting species distributions and community composition 20,79.
Modeling framework
To obtain reliable SDM predictions it is necessary to define the accessible area for a species in geographical space, i.e., the area that has been accessible to the species within a given period of time 41,80. We defined the accessible area for each oak species as a bounding box around the known species occurrences, i.e., the occurrence records from the occurrence dataset, plus ~ 300 km beyond each bound. This procedure accounts for approximate species dispersal within a geographical domain and has been shown to improve model performance 80,81. In other words, this procedure provides a conservative spatial representation of the environmental space in which a given species has potentially dispersed and is detectable. The individual accessible areas or species-specific accessible areas were then used to mask the covariates for each oak species and to constraint the random generation of pseudoabsences or background points before modeling species potential distributions. The number of pseudoabsences generated was similar to the number of presences 82,83.
We modeled species potential distributions using Bayesian additive regression trees (BART) 84. BART is a classification tree method defined by a prior distribution and a likelihood for returning occurrence predictions that enables the quantification of uncertainty around the predictions and the estimation of the marginal effects of the covariates 84–86. BART models were run with the default parameters as implemented in dbarts 87 through the R package embarcadero 86. More specifically, BART models were run using 200 trees and 1000 back-fitting Marcov Chain Monte Carlo (MCMC) 88 iterations, discarding 20% as burn-ins. Model performance or predictive ability was evaluated using two measures, the area under the receiver operating characteristic curve (AUC) and true skill statistics (TSS) 89. To estimate the potential distribution of oak species, the resulting predictions (i.e., probability of species presences) under BART were converted to binary predictions (presence-absence maps) using TSS-maximization thresholds 21,36.
To obtain assemblage composition and species richness predictions we applied three procedures for stacking species distribution models (S-SDM) (i.e., probability, binary and constrained binary). These procedures return the predicted species composition and richness within each assemblage (i.e., grid cell) across a geographical domain 13,14. More specifically, probability S-SDM (pS-SDM) and binary S-SDM (bS-SDM) were obtained by stacking the probability of species presence and the binary prediction layers, respectively. Constrained binary S-SDM (cS-SDM) predictions were obtained by constraining each assemblage applying a probability ranking rule (PRR). PRR emulates ecological assembly rules by ranking the species in each assemblage based on the occurrence probability obtained from each species and the number of species per assemblage. The species with the highest probabilities in an assemblage is selected until the number of species in an assemblage, based on observed data, is reached 19,36. We used the maximum number of species per assemblage from the NEON dataset as the assemblage-level constraint for cS-SDM estimations.
Evaluating species and assemblage predictions
To evaluate the performance of species assemblage predictions, we used four different metrics that are commonly used for this purpose 20,21,36. The metrics used were: a) the deviation of the predicted species richness to the observed (SR deviation); b) species richness change (SR change); c) the proportion of correctly predicted as present or absent (Prediction success); d) true skill statistic (TSS) and e) similarity between the observed and predicted community composition (Sørensen index). Assemblage metric evaluations were performed using modified functions from the R package ecospat 90 using the matrices of the three assemblage predictions, i.e., probability, binary and PRR, against the observed assemblage composition from NEON. Note that the SR deviation and SR change were estimated only for the pS-SDM and bS-SDM predictions given that we used the NEON dataset in constraining the number of species per assemblage in the cS-SDM construction.
We also investigated the phylogenetic structure of oak assemblages for both the NEON dataset and the predicted assemblages, in order to explore the performance of predicted oak assemblages in recovering similar patterns of phylogenetic structure as the observed assemblages. To do so, we first obtained the latest phylogenetic hypothesis for oak lineages 91. This phylogeny was constructed using restriction-site associated DNA sequencing (RAD-seq) and fossil data for node calibration and represent most highly resolved phylogenetic hypothesis for the clade globally 91. We defined phylogenetic structure as the mean phylogenetic distance (MPD) 92. To facilitate comparison between the two datasets, we summarized the results using standardized effects sizes (SES), which compare the observed value of an assemblage (MPD) to the mean expected value under a null model, correcting for their standard deviation. SES values > 0 and < 0 indicate phylogenetic clustering and overdispersion, respectively 47,92. In SES calculations we randomized the tips of the phylogeny to generate random communities (taxon shuffle null model). All phylogenetic structure calculations were conducted using customized scripts and core functions from the picante 93 package in R.
Statistical analysis
Using a Bayesian counterpart of Pearson’s correlation test, we evaluated the relationship between species richness and phylogenetic assemblage structure obtained from both NEON and predicted datasets. We chose this Bayesian alternative because it allows robust parameter estimations and accounts for outliers in the data 94. We also used Bayesian-ANOVAs to test for differences in species assemblage predictions between different modelling procedures (i.e., stacking procedures), using plots as a random variable to correct for potential repeated measures. Using Maximum A Posteriori (MAP) p-values 95, we then evaluated the evidence for those differences. Note that all analyses were performed for each scale separately, i.e., plot and site scales. Both Bayesian-ANOVAs and the robust Bayesian Pearson’s correlations were implemented in the probabilistic programming language Stan 96 through the R packages rstanarm 97 and brms 98, respectively. All analyses were performed using 4 sampling chains for 10,000 generations and discarding 20% as burn-ins. MAP-based p-values was estimated as implemented in the R package bayestestR 99.