Study area
Taiwan is a subtropical island located at the western edge of the Pacific Ocean, with coordinates ranging from 21° 55’ to 25° 20’ N and 119° 30’ to 122° 00’ E (Fig. 1). The subtropical island is located 150 km off the southeast coast of Mainland China and is characterized by a monsoon climate (Chen & Tsai 1983; Su 1984). The northeast monsoon during winter and the southwest monsoon during summer influence the weather conditions of Taiwan Island. Particularly, the northeast monsoon during winter prevails in Taiwan for six months, bringing heavy rainfall and strong winds to the northern and eastern slopes of the Central Mountain Range.
In northern Taiwan (NTWN), a steep precipitation gradient extends from coast to inland areas that is significantly influencing the distribution of plant species (Liao & Chen 2022; Liao et al. 2023). The annual precipitation decreased from more than 6,000 mm at the northeastern slope to 1,900 mm at the southwestern slope of the mountain ridge in NTWN (Liao et al. 2023). The mean monthly temperatures at the mountain ridge range from 11.3 ℃ in winter to 20.5 ℃ in summer, while those at the coastal area range from 17.9 ℃ in winter to 26.6 ℃ in summer (Liao & Chen 2022; Liao et al. 2023).
The study area in NTWN ranges from 24° 57’ to 25° 17’ N and 121° 24’ to 122° 00’ E (Fig. 1), covering an area of approximately 1,031 square kilometers (103,100 hectares). The highest mountain peak in the study area is Qixingshan, which stands at an elevation of 1,120 meters above sea level (asl.). The major vegetation type in NTWN is evergreen broad-leaved forest (Hsieh et al. 1997; Li et al. 2013; Liao et al. 2012). The forests in NTWN are dominated by species such as Castanopsis, Cleyera, Cyclobalanopsis, Dendropanax, Elaeocarpus, Engelhardia, Gordonia, Helicia, Ilex, Keteleeria, Limlia, Litsea, Machilus, Meliosma, Michelia, Pinus, Schefflera, Symplocos, and Trochodendron (Li et al. 2013). The mean canopy height of these forests is approximately 15 meters. There is no deciduous forest in NTWN, while native deciduous tree species are scattered within the forests of the region. Natural grasslands are commonly found along the mountain ridges spanning from the coast to the inland regions within the NTWN, while the elevations of natural grassland vary across these ridges within the study area (Liao et al. 2023). The predominant species of natural grassland in these mountain areas are Miscanthus sinensis and Pseudosasa usawai (Liao et al. 2014; Liao et al. 2023). It is worth noting that the climatic niches of these two species are similar and may demonstrate convergence of climatic niches (Liao et al. 2023).
Downscaling of the current climate dataset
In this study, a gridded climate dataset with spatial resolution of 50 × 50 m² was created to present historical climate environments of the study area. This dataset was generated by utilizing daily meteorological data downloaded from the Central Weather Bureau’s website (CWB, https://e-service.cwa.gov.tw/HistoryDataQuery/index.jsp). The 50 × 50 m² gridded climate dataset was adopted to accurately capture the heterogeneous climate environments along the mountain slopes. To construct this dataset, we downloaded daily meteorological data from CWB’s website. The detailed process of downscaling the historical climate dataset described in Liao et al. (2023) includes the following steps: (1) interpolation of the meteorological climate dataset to generate smooth climate variable surfaces; (2) creation of gridded cells, each with a spatial resolution of 50 × 50 m², to extract data from the smooth climate variable surfaces; (3) altitudinal adjustment of the extracted climate data.
Climate data from 30 meteorological stations in and around the study area were downloaded from the CWB website. Mean monthly temperature and total monthly precipitation obtained from the 30 meteorological stations were imported into ArcInfo software (ESRI, Redlands, California, USA) to generate smooth surfaces of climate variables using the Kriging method, resulting in the generation of .tif files for the climate variables. Gridded cells with a spatial resolution of 50 × 50 m² were also created using ArcInfo software. A total of over 0.4 million gridded cells were generated within the study area. For each gridded cell, the longitude, latitude, and elevation data were extracted from a digital terrain model (DTM) developed by the Department of Geography, Chinese Culture University. The DTM had a resolution of 20 by 20 meters. The elevation data obtained from the DTM was named DElev. Subsequently, the 50 × 50 m2 gridded cells were mapped and overlapped with the .tif files of the meteorological climate surfaces to extract climate data. The meteorological climate dataset with a spatial resolution of 50 × 50 m² was named MCD50. Furthermore, the elevations of the meteorological stations were also interpolated using the Kriging method in ArcInfo software, resulting in the generation of a smooth elevation surface named MElev. The gridded cells of MCD50 were overlapped with MElev to extract the elevation data. The differences between DElev and MElev were calculated for the altitudinal adjustment of MCD50. The altitudinal adjustment function is: AdjMCD50 = slope × (DElev – MElev) + MCD50. The abbreviation AdjMCD50 represents the altitudinally adjusted meteorological climate data with a spatial resolution of 50 × 50 m². The slope of the function, also known as the empirical lapse rate, was calculated as the slope of the linear correlation between the elevation and climate data of the nearest 12 meteorological stations. The linear regression model was implemented using the "stats" package within the R environment (Chambers & Hastie 1992). The detailed methodology for generating the current climate dataset followed the study conducted by Liao et al. (2023).
The altitudinally adjusted climate data, AdjMCD50, was utilized as the historical climate dataset for modeling species distributions. The climate dataset was generated by considering the following nine variables: mean annual temperature (Tmean), mean maximum temperature of the warmest month (Twrm), mean minimum temperature of the coldest month (Tcld), mean temperature in summer (Tsmr) and winter (Twnt), temperature differences between warmest and coldest months (Tdif), annual total precipitation (Pann), total precipitation in summer (Psmr) and winter (Pwnt).
Downscaling of future climate projections
The history of the Intergovernmental Panel on Climate Change’s (IPCC) assessment reports covers several generations of emissions scenarios (IPCC 2013, 2022; Pedersen et al. 2021). Emission scenarios are generally developed to describe different socio-economic and policy choices, allowing for the assessment of different potential futures and their implications for the long-term climate systems (Kebede et al. 2018). In 2022, the global Shared Socioeconomic Pathways (SSPs) were introduced in the Sixth Assessment Report (AR6) most recently published by the IPCC, integrating socioeconomic developments into future climate scenarios (IPCC 2022). There are five SSPs (SSP1 to SSP5), and these SSPs are used in conjunction with climate models to generate a range of possible future climate and environmental conditions based on different societal choices and policy directions (IPCC 2022; Pedersen et al. 2021). Among the five pathways, SSP1 is the most optimistic scenario and emphasizes sustainable development, while SSP2 represents a middle pathway (Pu et al. 2020). SSP3 and SSP4 are the most undesirable pathway, assuming unsustainable development trends. SSP5 assumes an energy intensive, fossil-fuel-based economy, but also relatively optimistic development (Pu et al. 2020).
Regarding the future climate scenarios, the working group of Taiwan Climate Change Information and Adaptation Knowledge Platform (TCCIP) has generated 5 × 5 km2 gridded climate datasets to present future climate projections for the Taiwan island (Wang et al. 2021). The future climate projections fort the Taiwan island were downscaled from 49 General Circulation models (GCMs) during the 6th phase of the Coupled Model Intercomparison Project (CMIP6). Among the 49 available GCMs, five climate system models were used in this study: ACCESS-CM2 (Meucci et al. 2023), FGOALS-g3 (Pu et al. 2020), GFDL-ESM4 (Dunne et al. 2020), MIROC6 (Kataoka et al. 2020), and TaiESM1 (Wang et al. 2021). The five GCMs were used to generate downscaled future climate datasets at 50 × 50 m2 resolution for the study area under different climate scenarios.
The TCCIP published 5 × 5 km2 gridded climate datasets to present climate data from 1960 to 2100. Since the climate dataset of AdjMCD50 was at a spatial resolution of 50 × 50 m2, the gridded climate dataset projecting future climate at 50 × 50 m2 was recalculated based on the relative changes observed in the TCCIP’s historical and future climate datasets. We selected three time periods of climate data from TCCIP’s climate datasets to present the climate datasets for the early (2000–2020), mid (2045–2055), and end (2091–2100) of the 21th century. The TCCIP’s climate datasets for the three time periods were used to calculate the mean monthly temperature and total monthly precipitation. Subsequently, differences in monthly temperature and total monthly precipitation between the early and mid, as well as the early and end, of the 21th century were calculated to generate relative changes in climate data. The relative changes in temperature and precipitation were also represented as 5 × 5 km2 gridded data, which were used to generate .tif files of smooth climate surfaces using the Kriging method in ArcInfo software. The gridded cells with a spatial resolution of 50 × 50 m2 were overlapped with the .tif files of smooth climate surfaces to extract the relative changes in climate data. The historical climate datasets, AdjMCD50, were overlapped with the relative changes in 50 × 50 m2 gridded climate datasets to project future climates.
Simulations of virtual species
The gridded cells with historical climate data (2000–2020) were analyzed using principal components analysis (PCA). PCA was employed to diminish dimensionality and mitigate collinearity among environmental variables, enabling a clearer quantification of environmental overlap (Journé et al. 2020; Meynard et al. 2019; Qiao et al. 2016). Nine climate variables were initially included in the PCA. Notably, significant collinearity was observed among Pwnt and Pann, Tsmr and Twrm, and Twnt and Tcld. Thus, Pann, Twrm, and Tcld were excluded from the PCA. The remaining climate variables were Tmean, Tsmr, Twnt, Tdif, Psmr, and Pwnt. Retaining of the first two principal components (PC1 and PC2) explained a combined 84.1% of the overall variation.
To create a range of spatial patterns in habitat suitability, the function “generateSpFromPCA” implemented by R package “virtualspecies” was employed to simulate virtual species (Leroy et al. 2016). When using the function, the parameter nb.points was set to 5000. The other two parameters, means and sds, were configured to simulate five virtual species located in various regions of the PCA diagram, representing distinct niches and suitable environments of different virtual species. The performance of the function resulted in distinct geographical distribution patterns in the study area (Fig. 1 and Supplement 1). The first virtual species was designed to be located at the center of the PCA diagram, resulting in a geographic distribution range covering the windward slopes near mountain ridge (Fig. 1). The other four virtual species were designed to be located at the lower, left, right, and upper sides of the PCA diagram (Supplement 1). The simulation of the five virtual species aims to ensure that they occupied different environments and that their geographical distribution ranges widely across the study area.
Inventory of real species
In this study, we selected eleven real plant species and grasslands for modeling analysis. Among them, five species are rarely observed in the study area: Maackia taiwanensis, Benthamidia japonica, Lilium speciosum, Rhododendron pseudochrysanthum, and Bretschneidera sinensis. Their rarity is attributed to their limited geographic range on this island, mainly confined to the mountainous regions of NTWN, wtih only a few occurrences of these species recorded in the fieldwork. The remaining seven species include four woody angiosperms: Rhododendron nakaharai, Euscaphis japonica, Ficus fistulosa, Saurauia tristyla var. oldhamii, as well as two fern species, Dipteris conjugata and Sphaeropteris lepifera. Occurrences of these eleven plant species and grasslands were collected along the roads and mountain trails within the study area. Coordination of occurrences collected in the fieldwork were spatially verified to delete duplicated occurrence records.
Modelling technique
In this study, 11 algorithms were employed to predict the potential distribution ranges of virtual and real species in mountainous areas. The 11 algorithms include artificial neural network (ANN), classification tree analysis (CTA), flexible discriminant analysis (FDA), generalized additive model (GAM), generalized boosting model (GBM, or usually called boosted regression trees), general linear model (GLM), multiple adaptive regression splines (MARS), Maximum Entropy (MAXENT), random forest (RF), surface range envelop (SRE, or usually called BIOCLIM), and extreme gradient boosting training (XGBOOST). In addition, ensemble modeling, which includes 11 algorithms, was employed to predict virtual and real species in mountainous areas. The 11 algorithms and ensemble modeling were implemented using the “biomod2” package in R software (Thuiller et al. 2016).
For each virtual species, occurrence and background points were integrated to generate a modeling dataset. The coordinates of the occurrences and background points were used to extract climate data from the climate surfaces. The occurrences were randomly sampled from the geographic distribution range of the virtual species, as shown in Fig. 1 and Supplement 1. Various numbers of sample points were used in the model prediction to assess the impacts of sample size on the model performances. The numbers of sample points imported into the model predictions were 5, 20, 50, 100, and 200. The number of background points was 100 times the number of sample points randomly selected in the study area. When the sample and background points were used for the model predictions, a random set comprising 80% of the occurrence and background data was selected to train the model, and the remaining 20% was used for evaluation.
Model performance
To assess accuracy of algorithms and ensemble modeling, the training dataset was resampled and modeled 10 times to quantify uncertainties in predictions. True skill statistics (TSS) and the area under receiver operating characteristic curve (AUC) were commonly used to assess the accuracy of species distribution models (Fois et al. 2015; Lannuzel et al. 2021; Qiao et al. 2019; Xu et al. 2021). For creating the final ensemble models, only those models with a TSS score greater than 0.8 were used (Khan & Verma 2022).
Notably, a previous document proposed that the error indices, such as TSS and AUC, do not imply accuracy of suitability, since these indices provide a single-number discrimination measure across all possible ranges of thresholds (Lobo et al. 2008). High values of TSS and AUC do not guarantee accurate model performance (Liao & Chen 2022). Thus, a similarity index called expected fraction of shared presences (ESP) was introduced to evaluate model performance in this study. The ESP was modified from the Sorenson similarity index to compare the similarity of potential ranges between two species (Godsoe 2014; Inman et al. 2021). In this study, the ESP was revised to compare the suitable range of a virtual species simulated by PCA with its potential ranges predicted by 11 algorithms and ensemble modeling. The function of the ESP is:
$$\text{E}\text{S}\text{P}=\frac{{2{\Sigma }}_{1}^{j}{P}_{s\left(j\right)}{P}_{p\left(j\right)}}{{{\Sigma }}_{1}^{j}({P}_{s\left(j\right)}+ {P}_{p\left(j\right)})}$$
where Ps(j) denotes the presence of suitable range at a given cell j, and Pp(j) denotes the presence of potential range at a given cell j. Meanwhile, Ps(j)Pp(j) denotes that a given cell j is both presence of suitable range (Ps(j)) and potential range (Pp(j)). An ESP value of 1 indicates perfect agreement between the suitable and potential ranges of a virtual species, while a value of 0 indicates complete geographic separation (Godsoe 2014; Inman et al. 2021).