Forest migration outpaces tree species range shift across North America

: 29 Mounting evidence suggests that geographic ranges of tree species worldwide are shifting under 30 global environmental change, but little is known about forest migration—the shift in the 31 geographic ranges of forest types—and how it differs from individual tree species migration. 32 Here, based on in situ records of more than 9 million trees from 596,282 sample plots, we 33 quantified and compared the migration patterns of forests and tree species across North America 34 between 1970 and 2019. On average, forests migrated at a mean velocity of 205.2 km · decade -1 , 35 which is twice as fast as species-level migration (95.6 km · decade -1 ), and 12 times faster than the 36 average of previous estimates (16.3 km · decade -1 ). Our findings suggest that as subtle 37 perturbations in species abundance can aggregate to change an entire forest from one type to 38 another, failing to see the forest for the trees may result in a gross underestimation of the impacts 39 of global change on forest ecosystem functioning and services. With the first forest classification 40 and quantification of forest migration patterns at a continental level, this study provides an 41 urgently needed scientific basis for a new paradigm of adaptive forest management and 42 conservation under a rapid forest migration. Our findings depict the first continentally consistent and locally relevant record of forest classification and forest migration patterns. These results contribute fundamental insights into the rapid shifts in tree species assemblage distribution under global environmental change, and their underlying drivers. Our machine-learning analyses reveal strong effects of climate change and species evenness on forest migration patterns, and pinpointed forest communities with an extreme migration velocity, where assisted migration and other adaptive forest management efforts 17 are critical in mitigating biodiversity loss, climate change, and associated socioeconomic 217 impacts. Overall, this study provides an urgently needed scientific basis for a new paradigm of adaptive forest management and conservation, so that effective mitigation and intervention efforts can be developed in response to the rapid forest migration.

species move to higher latitudes, tracking warming climate [1][2][3][4] , while some move towards lower 48 latitudes 4,5 , longitudinally 6 , or altitudinally 7,8 . Collectively, these changes can alter the relative 49 abundance and dominance of tree species, causing a complete change in the type of local forest 50 3 communities. To differentiate from the term tree species migration which refers to the shift of 51 tree species ranges 3-6 , we call the shift in the geographic range of a forest type forest migration. 52 Quantifying forest migration is crucial for the understanding of the impacts of global 53 change on forest ecosystem functioning and services. A forest constitutes a foundational entity 54 supporting most ecosystem services as well as human culture, customs, economies, and identity 9-55 12 . In addition, a forest is a fundamental unit of sustainable forest management 11 . By shifting 56 local forest types, forest migration can extensively change ecosystem functioning and services 13-57 17 , causing massive ecological and socioeconomic impacts worldwide 18 . For instance, in the 58 central United States, a diminishing supply of Quercus alba, Q. macrocarpa, and other white oak 59 species caused by the shifting and shrinking ranges of oak-dominated forests is threatening the 60 bourbon industry 19 , a staple of American culture and tradition. Meanwhile, the migration of 61 maple-dominated forests has raised concerns over the sustainability of the maple syrup industry 62 in North America 20 . 63 To see the forest for the trees is a major challenge in quantifying forest migration. 64 Previous studies found that the geographic ranges of some tree species in North America shifted 65 at a mean velocity of 16.3 km·decade -1 , with a range of 0.03 -100.20 km·decade -1 (see 66 Supplementary Table 1). However, because these studies were limited to a local or regional scale 67 with inconsistent migration measurements (e.g., some use marginal shifts, but some use centroid 68 or latitudinal shifts), the patterns of forest migration at a continental scale still remain largely 69 unknown 21 . Moreover, as most reported migration velocities were calculated from species-level 70 shifts, how forest migration differs from tree species migration also remains largely unknown. 71 Here, we systematically quantified, for the first time, forest and tree species migration 72 patterns at a continental scale, based on more than 9 million ground-surveyed tree records from 73 4 596,282 sample plots. Using these in situ data, we classified North American forests into a 74 hierarchical system consisting of eight forest biomes and 51 underlying forest types (Table 1, 75 sans forests in Mexico, Central America, and the Caribbean due to a lack of data). We then 76 quantified the azimuth and velocity of forest migration between 1970-1999 and 2000-2019. 77 Similarly, we quantified the azimuth and velocity of tree species migration across the continent 78 during the same time periods. 79 To quantify forest migration, we first used an established machine learning algorithm to 80 consistently classify all forested areas across the study region into 51 forest types (Table 1, see 81 §Forest Classification in Methods). There are 49 forest types in eight biomes in the 82 conterminous United States and Alaska, and 35 forest types in six biomes throughout Canada 83 ( Fig. 1, Extended Data Figs. 1, 2). The two countries share a total of six forest biomes. The 84 Boreal Forest (total area 2,462,924 km 2 ) is the largest forest biome shared by the two countries, 85 followed by the Eastern Mixed Forest (644,011 km 2 ). Mediterranean California (59,849 km 2 ) 86 and Southern Plains (547,118 km 2 ) are only distributed in the United States. At the forest type 87 level, black spruce-balsam fir (B-E, 750,121 km 2 ) is the largest forest type shared by the United 88 States and Canada, followed by quaking aspen-balsam fir-paper birch (B-A, 349,949 km 2 ) and 89 jack pine-black spruce (B-C, 294,291 km 2 ). The largest non-boreal forest types shared by the 90 two countries are balsam fir-maple-yellow birch (E-I, 229,088 km 2 ) and subalpine fir- 91 Engelmann spruce (W-K, 203,700 km 2 ). 92 Based on the temporal differences of the range of forest types classified above, we  (Table 1), quaking aspen-balsam fir-paper birch forest (B-A) migrated 96 5 with the highest velocity at 683.3 km·decade -1 , moving eastward (Table 1, Supplementary Table   97 2). Among the twelve forest types that migrated at a speed between 100 and 440 km·decade -1 , 98 five are in the Eastern Mixed Forest biome (E-A, E-C, E-H, E-J, and E-K), three in the Pacific  (Table 1). Across the continent, forests 104 migrated at a mean velocity of 205.2 km·decade -1 (Fig. 2a).

105
At the tree species level, we estimated the geographic range of 150 tree species in North 106 America for the same time period to quantify tree species migration. We found that tree species 107 on average migrated at 95.6±1.7 km·decade -1 (Fig. 2b). Picea sitchensis had the greatest 108 migration velocity of all the tree species (504.8 km·decade -1 ), followed by Abies balsamea 109 (502.0 km·decade -1 ) and Alnus incana (359.4 km·decade -1 ). In contrast, Platanus occidentalis 110 had the lowest migration velocity (4.3 km·decade -1 ), followed by Quercus macrocarpa (4.9 111 km·decade -1 ) and Celtis laevigata (5.3 km·decade -1 ) (Supplementary Table 3 Fig. 3). The ratio of forest migration velocity to tree species migration 132 velocity was positively associated with climate change, an aggregated indicator of temporal 133 changes in the top nine bioclimate variables. In contrast, the ratio was negatively associated with 134 tree species evenness (Fig. 3b). 135 The substantial difference in the velocity of migration between forest type and individual 136 tree species therein can also be attributed to the high sensitivity of forest type classification to 137 changes in the abundance and dominance of underlying tree species. A small, local perturbation 138 in species abundance and/or dominance, which has little impact on the overall shift of the species 139 range, can potentially alter the local forest type and the overall forest migration pattern. 140 Our findings suggest that the impacts of global environmental change on forest 141 ecosystem functioning and services may have been grossly underestimated. Since the mean 142 velocity of forest migration (205.2 km·decade -1 ) estimated here is more than 12 times greater 143 than the average of previous estimates (16.3 km·decade -1 ), the associated impacts of on forest 144 ecosystem functioning and services can be much more profound than previously thought.

145
Because forest ecosystem functioning 22,23 , productivity 24 , as well as phenology and population 146 turnover 25,26 are very sensitive to tree species composition and tree species diversity, subtle 147 changes in relative abundance or relative dominance of tree species can aggregate to affect  scales 6 , we found that climate change accelerates forest migration more than it accelerates tree 188 9 species migration. In contrast, tree species evenness was found to reduce the difference between 189 forest migration velocity and tree species migration velocity. This complements previous 190 findings that biodiversity and species evenness in particular make forest communities more 191 resilient to climate change 43,44 . 192 The differed migration patterns between forests and tree species observed here represent  Besides climate forcing which is generally seen as the main cause of these changes,   Table 1

. Summary of forest types and biomes classified based on the present (2000-2019) and past (1970-1999) forest inventories.
Only the top dominant species for each forest type are listed to save space. Forest types with "W-" belongs to West region, "E-" to East region, and "B-" to Boreal region.  Fig. 1. Map of present (2000-2019) forest type classification, as well as the azimuth and velocity of each forest type. Forest migration was quantified based on the movement of weighted geographic centroids of each forest type. Forest type code corresponds to Table 1.

Fig. 2. Comparison of migration velocity (km·decade -1 ) between forest types (a) and tree species (b)
, assessed at the 0.025° grid level. Migration velocity was measured by distance shift in kilometers per decade. Violin plots show the distribution of grid-level velocity by region and type (left: forest migration, right: tree species migration). Solid lines and surrounding bands represent the mean and 95% confidence interval, respectively (red represents forest migration velocity, and blue tree species migration velocity). Competing interests: Authors declare that they have no competing interests.

Data and data integration
For this study, we compiled and integrated in situ forest-tree data from independent and standard forest inventories. Data for the United States came from the Forest Inventory and Analysis (FIA) 46  We derived the following data integration protocol to harmonize the different forest inventory datasets described above into consistent continental data frames. From each dataset, we obtained tree-level information for all the trees with a minimum diameter at breast height (DBH) of 1 cm. We grouped these tree-level records by the year of inventory, and compiled one data frame for 2000-2019 and another data frame for 1970-1999. For each period, we then summarized tree-level information into a plot-level species abundance matrix, which contained the percentage of the number of stems by species (i.e., relative abundance), as well as the percent basal area by species (i.e., relative dominance). Based on the species abundance matrix, we calculated the importance value of each tree species present on a sample plot, which equally weights relative abundance and relative dominance of a particular species [51][52][53] . Our forest classification consists of two steps: defining forest types and mapping them.
The definition of forest types was determined by the combination of autoencoder neural network and K-means cluster analysis. Autoencoder neural networks create a compressed representation of the original data, which is more suitable for K-means cluster analysis than the original data.
Then, we mapped forest types determined by the K-means cluster analysis to the forested area using machine learning imputation models. Due to the random nature of K-means cluster analysis, we repeated the whole process 20 times to derive the final classification results.
For each region, based on the continental data frame and aggregated grid data described above, we first used an autoencoder neural network to calculate a latent space representation of the original input features 61  encode the input into its reduced dimensional representation, which was then inputted into the Kmeans clustering algorithm.
To avoid potential bias caused by insufficient sample sizes, we excluded the species that are present in less than 60 grids (Supplementary Table 3). Based on the reduced dimensional representation, we conducted a K-means cluster analysis to classify forests across North America. We conducted K-means cluster analysis in R (version 4.0.4) using the built-in function "kmeans" 64 . We set the number of starts to 50 and the maximum iterations to 100. Choosing the number of dimensions from the autoencoder neural network and the number of clusters, as well as the evaluation methods are described in §Model evaluation below.
With the defined forest types (i.e., clusters) from the 20 repetitions, we manually matched the same forest type by calculating the Euclidean distance in terms of species importance value between all the combinations of forest types generated from the 20 repetitions. When 10 or more repetitions identified the given forest type, we recognized the forest type as a final forest type.
Since we classified forest types for three regions separately (West, East, and Boreal), there were potential overlaps of forest types between regions. To identify and merge potential overlaps, we calculated the Euclidean distance of all combinations of the final forest types in terms of species importance value. If a Euclidean distance was less than 60, across-region forest types were merged. One exception was that western aspen-mixed conifer (W-J) and boreal quaking aspenbalsam fir-paper birch (B-A) remained separated due to the large expanse of Populus tremuloides.

Mapping of forest types
To map the distribution of forest types across the 4.9 million-km 2 study region, we considered two candidate imputation models to estimate the underlying forest type of unsampled 28 grids based on 38 predictor variables. The two candidate models were random forests and support-vector machines. Random forests are a non-parametric ensemble learning approach 65 , which combines a variant of classification trees and an additional level of randomness by bootstrapping sub-data and different sets of predictor variables to mitigate the multicollinearity issues that most statistical models face 66 . We used the "randomForest" package in R with the default hyperparameter setting 67 . Support-vector machines are supervised learning models which construct a hyperplane or set of hyperplanes in a high-or infinite-dimensional space to help analyze data for classification and regression analysis 68 . We used the "e1071" package in R with the default hyperparameter setting 69 . We compared the performance of these two candidate  Table 1 and Extended Data Fig. 6. We used "Hmisc" package in R to impute missing data in those predictor variables 76 .

Model evaluation
To maximize the clustering performance, we calculated the silhouette width to determine the number of dimensions from the autoencoder neural network and the number of clusters.
Silhouette width is an indicator of between-cluster heterogeneity 77 . With a range between -1 and 29 1, positive silhouette width values indicate that a given member of a cluster is closer to its own cluster's centroid than to the nearest cluster's centroid. Negative values indicate that a given member is closer to the nearest cluster's centroid than to the centroid of its own cluster.
Generally, higher silhouette width values indicate greater between-cluster heterogeneity. We used the silhouette width to fine-tune hyperparameters for the autoencoder (the number of dimensions) (Extended Data Fig. 7) and K-means cluster analysis (the number of clusters). We calculated silhouette width using the "cluster" package in R 77 . The mean silhouette widths from our K-means cluster analyses were significantly greater than zero for all forest types (p < 0.001) in the West for the present dataset. Eighteen out of 19 forest types in the West, 22 out of 26 in the East, and all six forest types in the Boreal region were significantly greater than zero in the mean silhouette width. In summary, 90% of the forest types classified here were significantly distinct from one another in terms of species composition (Supplementary Table 2).
To assess the performance of the imputation model in mapping forest types across the continent, we conducted a rigorous 80/20 cross-validation using bootstrapping. In each iteration, we used stratified sampling to split the entire dataset into the training (80%) and testing (20%) set, and conducted the combination of under-sampling and oversampling of the training set for both random forests and support-vector machines. Stratified sampling was conducted using the "caret" package in R 78 , and under-sampling and oversampling were conducted using the "UBL" package 79 . Based on five random iterations with sample replacement in each of the 20 repetitions, we calculated the 95% confidence interval of classification accuracy, the Kappa statistic, and elements of confusion matrix, as well as predictor variable importance. For each candidate imputation model, the output was a matrix of class probability from five iterations. We 30 chose the forest type of majority vote from the five iterations, and thus, our final output was a matrix of class probability from the 20 repetitions.
The classification accuracy, Kappa statistic, and elements of confusion matrices were calculated based on the prediction on the testing set in each iteration. Compared with the support vector machine model, random forests model was 10-17% more accurate in terms of overall accuracy and 11-20% more precise in terms of the Kappa statistic (Extended Data Fig. 5).
Therefore, we selected random forests as the final imputation model. The confusion matrices based on random forests models were based on the number of cases in class prediction, standardized in percentage (Extended Data Figs. 8, 9). For the present dataset, the coastal redwood-tanoak forest (W-P) had the highest classification accuracy (88%, Extended Data Fig.   8), and the red maple-hardwood forest (E-F) had the lowest accuracy (18%, Extended Data Fig.   8) among all the classes (i.e., forest types).

Quantifying forest migration patterns
We quantified migration patterns of forest type in terms of velocity (km·decade -1 ) and azimuth (°) of the mean geographic movement, as well as changes in area, based on in situ forest inventory data aggregated into 0.025 by 0.025-degree grids. is Y, and past forest type Y's closest present forest type is also X, they were considered matching.
For each matching pair of past and present forest types, we determined its mean geographic movement and associated 95% confidence interval using a bootstrapping approach with 1,000 iterations. In each iteration, we randomly sampled 80% of past and present data with replacement and quantified the velocity and azimuth of forest migration, based on the past and present centroids of the geographic range of that forest type. The geographic centroid was calculated by weighting the grid geographic coordinates with percent forest type. After mapping forest types across the continent using the imputation random forest models, all the 1,004,358 grids contain a percentage for each forest type as well as the geographic coordinates (latitude and longitude) of that grid's centroid. Percent forest type was determined by how many repetitions, out of 20 repetitions, returned the particular forest type. Geographic centroids for each forest type were then calculated by weighting the geographic coordinates and percentage in that grid with the following equations: where Q R S is the weighted mean longitude of forest type j, \ R S is the weighted mean latitude of forest type j, Q N and \ N are the longitude and latitude for the centroid of grid i, and 8 NS is the grid cell level percentage of forest type j.
This geographic distance was calculated using the "sp" package in R 80 , while the associated azimuth was determined using the "sfsmisc" package 81 . The velocity of forest migration (km·decade -1 ) was then calculated as the average distance of movement for each forest type (Table 1) per decade. We also determined area coverage of each forest type by weighting grid area by percent presence of the forest type. Grid area was estimated using the "raster" package in R 82 .

Comparison of forest migration and tree species migration
To directly compare the geographic shift of forest types and tree species, we calculated grid-level velocity for each entity. For forest type, we quantified grid-level velocity of forest migration by weighting the forest type velocity by percent presence of the forest type in each grid. Percent presence of the forest type here was based on how many models, out of five models, returned the given forest type. Therefore, the output was a matrix of grid-level velocity from the 20 repetitions.
We estimated tree species migration in a similar manner, using the same grid-level foresttree data for identical time periods. For each tree species and each time period, we estimated tree species distribution range based on random forests models and the 38 predictor variables (Extended Data  53 . We calculated weighted mean geographic centroids using predicted importance value, and determined the species' mean geographic shift using the identical method to the one stated above. We then repeated this process 20 times to derive the mean and 95% confidence interval of tree species migration velocity. To maximize the model performance while minimizing computational time, we selected the number of trees = 100 after fine-tuning using the West present dataset as an example. Specifically, we calculated root mean square error (RMSE) for different number of trees with 10 iterations, and chose the number of trees where RMSE values converged.

Modeling the ratio of forest migration velocity to tree species migration velocity
Based on the grid-level velocities of forest types and tree species, we took the ratio of forest migration velocity to tree species migration velocity for each grid. We then trained a random forests regression model with the ratio being the response variables, and 18 predictor variables (Extended Data Fig. 3). Based on grid-level tree species abundance data, we calculated three biodiversity measures: species richness, Shannon's index, and species evenness. Species richness (S) represents the total number of tree species present in the grid. Shannon's diversity index (H) 83 was calculated using the formula: where pi is the proportion of importance value of species i relative to the sum of importance value of all species present in that grid. Species evenness (E) was calculated using the measure (3) In addition, we calculated the temporal changes of 15 bioclimate variables (C1-C15, Extended Data Table 1) between the past and present surveys, and added these variables (∆C1, ∆C2,…, ∆C15) as predictor variables. With the total of 18 predictor variables, we conducted a bootstrapping of 100 random forests regression models, each trained with a random 80% subset of the full dataset with replacement. Variable importance was determined based on the Gini impurity, a measure that represents the probability of incorrect classification of randomly selected sample due to its distribution.

Data and Code availability
All data, code, and materials used in the analysis will be deposited to Figshare and Purdue University Research Repository (PURR) upon the publishing of this paper.