Occurrence and variable data
Georeferenced occurrence data for vascular (Tracheophyta) plants in SEA were obtained from the Global Biodiversity Information Facility and Singapore Botanical Gardens and filtered to records from 1960 onwards (Supplementary Section 1). Species were taxonomically verified using the World Checklist of Vascular Plants34 and filtered to trees in the GlobalTreeSearch database35. Records with coordinates within 10 km of capital cities, 5 km of country centroids or known biodiversity institutions, or with equal longitudes and latitudes were removed36. Imprecise or dubious coordinates were manually removed. To reduce spatial autocorrelation, species occurrences were thinned using a 10 km distance buffer. Because we intended to use temporally dynamic ecological niche models (ENMs), occurrence thinning was applied using a 5-year moving window from 2019 to 1960 (i.e., prioritising more recent occurrences). Species with fewer than 10 occurrences were excluded to prevent spurious estimates of species distributions.
As variables, we obtained static (soil and distance from sea) and dynamic (bioclimatic and land-use) raster maps (Supplementary Section 2). Soil maps (250 m) containing continuous data on soil physical and chemical properties (e.g., clay content and soil pH) were sourced from SoilGrid250m (v2)37. To better estimate coastal and mangrove species distributions, the logarithmic distance from the sea was calculated and mapped as a static variable. Past and future bioclimatic variables (2.5 arcmin) were sourced from WorldClim (v2.1)38. Monthly climate data from 1960 to 2018 were sourced and converted to annual bioclimatic variables following methods in the ‘dismo’ R package’s biovars function39. Future bioclimatic variables under SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5 were directly sourced for 2030, 2050, 2070, and 2090, which represented values averaged over 20-year periods (e.g., 2030 variables were averaged from 2021-2040). Future bioclimatic variables were averaged across five Coupled Model Inter-Comparison Project 6 Global Circulation Models: CanESM5, CMCC-ESM2, CNRM-CM6-1, MIROC6, and MPI-ESM1-2-HR. Precipitation variables of temperature defined periods and vice versa were removed due to spatial artefacts (e.g., Bio 19, precipitation of coldest quarter). Annual land-use maps from 1901 to 2019 were obtained from the HILDA+ Global Land Use Change dataset (HLU; 0.01°), while 5-year interval maps of future land-use under SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5 from 2015 to 2090 were obtained from the Land Use Harmonisation (v2) dataset (LUH; 0.25°) and the model means of a newly generated land-use dataset by Chen et al. (2020) (CLU; 0.05°)21,22,40.
To reduce correlation between variables, a principal component analysis (PCA) was applied to soil and bioclimatic variables separately; because soil variables were static while bioclimatic variables were temporally dynamic. Soil variables were reduced to three principal component (PC) axes (75.8% cumulative variance) and bioclimatic variables to four (87.2% cumulative variance). Note, the PCA for bioclimatic variables accounted for past and future variances. All land-use maps were reclassified to binary forest/non-forest land-use classes (HLU, ‘forest’ land-use class; LUH, primary and secondary forests, class ‘primf’ and ‘secdf’; and CLU, tree land types, class ‘PFT’ 1 to 8). To keep past and future land-use maps consistent, we applied future land-use transitions (LUH and CLU) onto the most recent land-use map (HLU: 2019) to project future land-use changes, instead of using future land-use maps directly (Supplementary Section 2). Future land-use transitions were done separately for each SSP.
A terrestrial ecoregions map was obtained to constrain projections of species distributions41 (Supplementary Section 4). We focused our study on the biogeographical regions of Indo-Burma and Sundaland inclusive of subtropical East Asia. We delineated nine regions: mainland SEA; the island of Hainan; Andaman and Nicobar islands; the Malay peninsula; Riau islands; Borneo; the main island of Sumatra; islands south-west of Sumatra; and Java. The islands of Hainan, Andaman, Nicobar, and Riau, and islands south-west of Sumatra were included in model training but excluded from our analysis. Species distributions were limited to regions of known occurrence, based on existing occurrence records, distribution descriptors from Kew’s Plants of the World Online (POWO) database searched using the ‘taxize’ R package, and regions of likely occurrence given known plant distribution patterns42. Information on species’ native and introduced distributions from the POWO databased was also used to exclude non-native tree species.
All data preparation, subsequent modelling, and result analyses were done in the R programming software43. The ‘terra’ R package44 was used to crop variables to our study area and—where necessary—resample (cubic interpolation) or aggregate (mean, e.g., soil variables) rasters to a 5 km by 5 km grid cell resolution. All rasters (variables and model estimates) were projected onto the recommended Transverse cylindrical equal-area projection with a central meridian of 105° 00' 02'' E <projectionwizard.org>.
Ecological niche models
We developed species-specific ENMs using the PC axes of soil and bioclimatic variables, logarithmic distance from the sea variable, and binary land-use variable as predictors. Most occurrences, however, occurred in non-forested plots; largely within forest fragments and along forest edges considered non-forested at our spatial scale. This resulted in unrealistic point-predictor relationships, i.e., non-forested landscapes were projected to be suitable (Supplementary Section 3). However, the land-use variable reflected a key driver of anthropogenic range contractions in SEA, whose inclusion as a predictor was central to overcoming anthropogenic niche truncation (for details, see31). Hence, for species with sufficient occurrence data (n ≥ 10 within forest), occurrences within non-forested landscapes were excluded (exc.n) from model training (later used to test for niche truncation). For species with fewer occurrence data (n < 10 within forest), all occurrences were kept but model training ignored the land-use variable. The land-use predictor was kept where possible to account for anthropogenic niche truncations, but our models do not estimate species response to anthropogenic land-use (deforested/non-forested)31.
To estimate environmentally suitable habitats, a temporally dynamic ENM protocol was used, such that occurrences and background points were associated with bioclimatic and land-use variables from the year they were collected31; records from years with missing predictor data were associated with the predictor data of the closest year (e.g., occurrences from 2019 lacking bioclimatic data were associated with data from 2018). Soil and distance from the sea variables remained static. To correct for temporal as well as geographical sampling biases, we generated a Gaussian kernel density estimate (i.e., bias maps indicating sampling probability) for each year using occurrences across species for that given year, ±3 years. This approach adopted the concept for correcting geographical sampling biases and applied it for temporal sampling biases (e.g., bias maps reflect sampling density across space for a given year). For each species, we sampled 10,000 background points, where sampling probability across years matched the temporal sampling bias of occurrences (e.g., if 10 out of 100 occurrences were sampled from 1970, 1,000 background points were sampled from 1970) while sampling probability across grid cells was based on the bias map for that year. The sampling of background points was also limited to regions of known occurrence and within 500 km distance from any occurrence point of the species.
For the modelling procedure, we first withheld a random 30% of occurrences to test model performance (test.n). The remaining 70% (trg.n) were separated into k-folds using two partitioning techniques to cross-validated models during the tunning phase; model is trained using k – 1 folds and cross-validated using the remaining fold. If trg.n > 10, occurrences were separated into k = 4 equal folds using the spatial ‘block’ partitioning technique; if trg.n ≤ 10, occurrences were separated into k = trg.n folds (leave-one-out partitioning technique). We used and tuned three well-performing algorithms45: Maximum Entropy (MaxEnt, v3.4.4), using the ‘ENMeval’ (v2) R package46,47; Boosted Regression Tree (BRT) down-weighted, using the ‘dismo’ R package39; and Random Forest (RF) down-sampled, using the ‘caret’ R package48 (for model details, see Supplementary Section 3). We repeated this entire process ten times, generating a total of 30 models for each species.
We evaluated models with the previously withheld data (test.n) using a combination of evaluation metrics: Area Under the Curve (AUC) for continuous estimates and Omission Rate (OR) or True Skill Statistic (TSS) for binary estimates (depending on sample size). Models with AUC < 0.7, or OR > 0.2 or TSS < 0.4 were rejected; OR was preferred and used when test.n ≥ 549. Where available, excluded occurrences (exc.n) were also used to test for residual niche truncations31; only for OR (model rejected if OR > 0.2) and when exc.n ≥ 5. Species with less than five accepted models were excluded from subsequent analyses.
Distribution outcomes
Accepted models for each species were projected onto PC axes of past and future bioclimatic variables (holding soil and distance from the sea variables constant) and a zero-anthropogenic land-use variable (i.e., all forested) to estimate environmental suitability under a hypothetical undisturbed land-use scenario31. Model estimates were converted to binary presence–absence maps using model-specific ‘maximising the sum of sensitivity and specificity’ thresholds50, where the consensus among models was used. Reclassified land-use maps were used to determine intact habitats—forest and non-forest land-use classes as suitable and unsuitable, respectively. Reclassified land-use maps were overlaid onto model projections to determine areas of potentially suitable habitats (environmentally suitable intact habitats). In total, we produced maps of potentially suitable habitat annually from 1960 to 2019 and in 20-year intervals from 2030 to 2090 under four separate SSPs.
To quantify potential changes in tree species’ potential distribution under global changes, we first used projections of suitable habitats in 1960 to determine baseline distributions. Second, we use projected changes in suitable habitat overtime to estimate distribution changes. Distribution loss occurs when a grid cell previously occupied becomes unsuitable for at least five consecutive years. Distribution gain occurs when species disperse to a suitable but previously unoccupied grid cell and matures. Following Warren et al. (2013), a realistic fixed rate of tree migration was used, assuming a dispersal distance of 5 km (or 1 grid cell) and maturation time of 50 years (grid cell must remain suitable for the full 50-year duration)18. A more optimistic rate of migration was also tested (5 km every 10 years). Dispersal may occur for newly suitable (because of global changes) or newly accessible (because of prior range expansions) grid cells or both, and from grid cells for which trees have reached hypothetical maturity (50 years after dispersal). Third, to disaggregate potential distribution changes due to either global change, we also quantified changes in tree distribution when only climate or land-use suitability changed while the other was held constant. For disaggregating future distribution changes (after 2019), species’ 2019 distributions (after past global changes) were used as baselines, and climate or land-use variables were held constant at 2019 values. Finally, we quantified distribution changes as a percentage: past changes from 1960 to 2019, using 1960 distributions as the baseline; future changes from 2019 to 2090, using 2019 distributions as the baseline; and overall changes from 1960 to 2090, using 1960 distributions as the baseline.
Species threat statuses
To assess species threat statuses, we categorised species based on the decline of their distribution area as a percentage of their projected distribution area in 1960 (Supplementary Section 6). Net changes in species distributions as quantified in the previous section was used to calculate the percentage decline in a species’ area of distribution. We used thresholds for delineating threat statuses following the IUCN threatened species criteria A2/A419, i.e., the percentage decline in a species’ area of distribution for this study. Criteria A2 assesses past declines only (1960–2019) whereas criteria A4 assesses the accumulation of past and future declines (1960–2090). Three threat categories were considered, with values in parentheses indicating the threshold of net distribution loss for each category: vulnerable (>30% loss), endangered (>50% loss), and critically endangered (>80% loss). The number of species in each category was then tabulated and presented as a percentage of all 1476 tree species (across species), or as a percentage of species within each species group (among species groups). Our approach for assessing threat statuses do not account for declines prior to 1960 and are based entirely on model projections of species distribution area within the boundaries of our study region. Species-specific threat statuses are thus unlikely to be of sufficient accuracy to support changes in their actual threat status under the IUCN. Assessments of species threat statuses in this study was intended to provide macroecological insights on the potentially most threatened species groups across four separate SSPs.
Species groupings
To determine species “natural” associations, i.e., unmarred by recent climate changes and anthropogenic land-use impacts, we used projected binary distributions of species under 1960 climate conditions and a zero-anthropogenic land-use variable. Species groups were derived using a methodological framework for clustering spatially associated species 11. We first quantified interspecific spatial associations using five binary indices: probability of disassociation (derived from Forbes’ coefficient of association), scaled C-score, simple matching coefficient, Jaccard’s index of association, and Pearson’s tetrachoric correlation coefficient. The resulting five matrices of species’ pairwise associations were then used to inform four hierarchical clustering algorithms: average, Mcquitty, complete, and Ward’s. This produced 20 candidate dendrograms which were evaluated using three dendrogram performance metrics: co-phenetic faithfulness, agglomerative strength, and tree balance. Metric scores were then combined using a Euclidean formula to quantify the overall performance of each dendrogram. For each dendrogram, we explored probable optimal number of clusters using three stopping rules: average Silhouette, C-score, and bifurcation pairwise T-test (Supplementary Section 7).
The intent of delineating species groupings was to separate geographical variations among species to better understand global change impacts on species with different distributions11,12. Hence, the final dendrogram and number of clusters were selected based on a mixture of quantitative (Euclidean scores and stopping rules) and qualitative criteria (visual assessment of association matrix data structure using non-metric multidimensional scaling and aggregated distributions of resulting species groupings) (Supplementary Section 7). Ultimately, we selected species groupings that 1) could be easily interpreted (i.e., less than 15 groups), 2) were reasonably distributed (i.e., no super large grouping of over 300 species), and 3) reflected known broad scale patterns of tree distributions in Southeast Asia. Biodiversity outcomes of distribution changes and threat statuses were then summarised across species and for each group.