Predicting high resolution total phosphorus concentrations for soils of the Upper Mississippi River Basin using machine learning

The spatial distribution of soil phosphorus (P) is important to both biogeochemical processes and the management of agricultural landscapes, where it is critical for both crop production and conservation planning. Recent advances in the availability of large environmental datasets together with big data analytical tools like machine learning have created opportunities for evaluating and predicting spatial patterns in complex environmental variables like soil P. Here, we apply a random forest machine learning model to publicly available soil P datasets together with nearly 300 geospatial attributes summarizing aspects of soil type, land cover, land use, topography, nutrient inputs, and climate to predict total soil P at a 100 m grid scale for the Upper Mississippi River Basin (UMRB), USA. The UMRB is one of the most intensively farmed regions in the world and is characterized by widespread water quality degradation arising from P-associated eutrophication. Although potentially complex interacting drivers determine total soil P, the predictive accuracy of our random forest model was relatively high (R2 = 0.58 and RMSE = 129.3 for an independent validation dataset). At the regional scale represented by our model, the variables with the greatest comparative importance for predicting soil P included a combination of soil sample depth, land use/land cover, underlying soil physical and geochemical properties, landscape features (such as slope, elevation and proximity to the stream network), nutrient inputs, and climate-related factors. An important product of this research is a fine-scale (100 m) raster data layer of predicted total soil P values for the UMRB for public use. This dataset can be used to improve conservation planning and modeling efforts to improve water quality in the region.

total soil P at a 100 m grid scale for the Upper Mississippi River Basin (UMRB), USA. The UMRB is one of the most intensively farmed regions in the world and is characterized by widespread water quality degradation arising from P-associated eutrophication. Although potentially complex interacting drivers determine total soil P, the predictive accuracy of our random forest model was relatively high (R2 = 0.58 and RMSE = 129.3 for an independent validation dataset). At the regional scale represented by our model, the variables with the greatest comparative importance for predicting soil P included a combination of soil sample depth, land use/land cover, underlying soil physical and geochemical properties, landscape features (such as slope, elevation and proximity to the stream network), nutrient inputs, and climaterelated factors. An important product of this research is a fine-scale (100 m) raster data layer of predicted total soil P values for the UMRB for public use. This

Introduction
Phosphorus (P) is an essential nutrient in terrestrial and aquatic systems, with strong influences on plant productivity, element cycling and many other aspects of ecosystems (Vitousek et al. 2010). In excess, P has strong negative impacts on water quality and aquatic ecosystems arising from diverse effects of eutrophication (Schindler 2006;Clark and Longo 2018). Because soils are the main source of phosphorus for terrestrial and freshwater ecosystems, understanding of ecosystem dynamics and management of phosphorus enrichment due to human activities requires knowledge of distributions of soil P. While a large majority of total soil phosphorus exists in diverse forms that are not immediately available to plants (i.e., primary and secondary minerals, organic P), these pools of soil P determine general P availability and influence losses to aquatic ecosystems (Hou et al. 2016;Vitousek et al. 2010). Thus, knowledge of total soil P distributions can provide fundamental information for research and management in terrestrial and aquatic ecosystems.
In the Cornbelt and Great Plains regions of the United States and Canada, decades of agricultural practices including the conversion of perennial vegetation into row crops, intensive inputs of synthetic fertilizers, and the spreading of manure from concentrated animal feeding operations (CAFOs), have led to an accumulation of legacy phosphorus in soils well beyond their retention capacity (Goyette et al. 2018). Wastewater sources also contribute to P export, although their impact is comparatively small for most industrially managed agricultural watersheds (Boardman et al. 2019). In many areas across the region, legacy P continues to be compounded by ongoing fertilizer and manure inputs, exacerbating the challenges of addressing nonpoint source pollution (Stackpoole et al. 2019;Boardman et al. 2019;Van Meter et al. 2021). These anthropogenic P inputs are overlaid on complex underlying controls on soil P content and transport, including soil parent material, soil texture, and landscape geomorphology (Records et al. 2016).
Field work efforts to measure soil P at high spatial resolution can be time intensive and costly, if not infeasible, across large spatial extents. Although soil P is predicted quite well at small spatial scales by coincident soil properties such as pH, texture, particle size and soil organic matter (e.g., Hosseini et al. 2017), such microscale soil attributes are generally not available at regional scales. Moreover, P heterogeneity at regional scales may also be affected by broader scale drivers such as climate, parent material, and glacial history. Thus, efforts to predict soil P from existing geospatial and climatic datasets have considerable utility within larger biophysical modeling efforts designed to support conservation planning.
Recent advances in the availability of large environmental datasets together with analytical tools like machine learning models have created new opportunities for predicting spatial patterning in complex environmental outcomes (Zhong et al. 2021). Total P at the soil surface represents a critical pool of P that is potentially erodible, reactive, and eventually bioavailable and thus highly relevant to both crop production and water quality. Machine learning represents a potentially useful tool to help predict spatial variability in total soil P, which is driven by complex, interacting drivers. Previously, machine learning has been used together with detailed geospatial data to predict soil P across smaller geographic areas, ranging from a single field to small catchments to subregions with areal extents of up to a few hundred km 2 (Matos-Moreira et al. 2017;Jeong et al. 2017;Sahabiev et al. 2021;Kaya and Başayığıt 2022). Machine learning has also been used to predict soil attributes such as % carbon and total nitrogen content at continental scales (Ramcharan et al. 2018), as well as nutrient concentrations in river networks (e.g., Shen et al. 2020;Sadayappan et al. 2022). In this study, we develop a random forest machine learning model from large, publicly available datasets of soil P and nearly 300 predictive variables including soil type, land cover, topography, nutrient inputs, hydrology and climate to predict total surface soil P at a 100 m grid scale for the entire Upper Mississippi River Basin (UMRB), USA. The UMRB is one of the most intensively farmed regions in the world and is characterized by widespread water quality degradation arising from P-associated eutrophication (Jacobson et al. 2011). Predicting the spatial variability of total soil P for this landscape has direct implications for conservation planning for water quality improvement and for understanding the resilience of soils for crop production (Qiao et al. 2022). Our objectives for this research were to (1) highlight the potential of underutilized publiclyfunded and available datasets to support conservation planning; (2) predict soil P at a fine (100 m) resolution across the UMRB and provide results in an open access data framework for use in future conservation planning and analysis; (3) identify geospatial attributes important in predicting soil P at regional scales.

Study area
The focus of this study is the Upper Mississippi River Basin (UMRB), which comprises the uppermost tributaries to the Mississippi River, from the headwaters at Lake Itasca in Minnesota down to the confluence of the Mississippi and Missouri Rivers at St. Louis, Missouri (Fig. 1).
The UMRB covers nearly half a million square kilometers (492,027 km 2 ) of the upper Midwestern Fig. 1 Map of the study region showing a) locations of soil P samples from the USGS National Geochemical Survey database (purple dots, n = 6,664) and from the National Cooperative Soil Survey (NCSS) Soil Characterization Database (in orange, n = 558) across the Midwestern regional extent used in this study, with the Upper Mississippi River Basin (UMRB) outlined in bold and. b) close up of the UMRB showing land cover classes from the 2019 National Land Cover Dataset (Dewitz, 2021) and locations of major rivers and cities United States, including large parts of Minnesota, Wisconsin, Iowa, Missouri, Illinois, and smaller areas of Indiana, North Dakota, and South Dakota, and corresponds to the HUC2-scale watershed (HUC ID = 07) in the USGS Watershed Boundary Dataset (2021). The UMRB is intensively farmed for large scale monocultures of primarily corn and soybeans. CAFOs are also prolific in the basin, particularly across Iowa and southern Minnesota (Harun and Ogneva-Himmelberger 2013). Overall, cultivated crops comprise 49% of total basin land area (Dewitz, 2019); nearly 90% of those crops are corn and soybeans (USDA 2022). The intensity of agricultural land use varies across a gradient from north to south, with the uppermost parts of the basin (in Minnesota and Wisconsin) characterized by relatively higher amounts of forest and wetland cover compared to the southern regions of the basin, where crop land use can exceed 85%. However, even the northern areas of the basin are undergoing rapid land use change from wetland and forest cover towards more agriculture (Green et al. 2018). The precipitation gradient varies from comparatively drier in the north and west to wetter in the south and east. Soil textures in the basin are largely silty loam and loam, but also include areas of poorly drained clays as well as areas of well drained sands. Topography in the UMRB includes widespread flat and gently rolling areas, as well as steeper terrain and large bluffs near many rivers and streams. Elevation across the basin ranges from 85 to 640 m above sea level.
Data processing and analysis overview Data processing and analysis was done in R Studio v. 2022.07.1 (RStudio Team 2022). All of the input data and methods used to generate the machine learning model for predicting soil P-including data acquisition, data preprocessing, data splitting, model training, evaluation and interpretation-are publicly available at https:// github. com/ cldol ph/ SoilP. The workflow used to develop the model is summarized in Fig. 2.
Soil phosphorus data For model training and testing, we used nearly 7,000 soil P measurements from two spatially extensive datasets: (1) the USGS National Geochemical Survey (NGS) Database (USGS 2004), and (2) the National Cooperative Soil Survey (NCSS) Characterization Dataset (NCSS 2021). Both of these datasets are publicly available 1 and contain information about soil P, among other soil attributes. We focused on total soil P rather than soil test P in our model development for several reasons. First, the NGS does not contain information about soil test P. Although the NCSS does contain information about various soil test P measures (e.g., Bray 1 P, Olsen P, etc.), soil test P is not measured consistently across the geography of the dataset. Because different soil test P measures extract variable amounts of P, they may not be readily comparable to one another (Wuenscher et al. 2015), making it difficult to compile and interpret multiple soil test P measures into one cohesive dataset. By contrast, total soil P data across the NGS and NCSS was considerably more extensive compared to any one soil test P measure, creating the opportunity for stronger machine learning models. Second, total soil P is a fundamental ecosystem property with broad applications in conservation, biogeochemistry, and agronomy (Correll 1998;Sharpley et al. 2009;Shen et al. 2011). While soil test P may more directly indicate the crop availability and risk of loss of soluble P to downstream water bodies (e.g., Vadas et al. 2005), total P provides a more consistent and integrative measurement of soil P status. Moreover, previous work has shown total P to correlate with soil test P measures (especially when other soil properties are also accounted for; Allen and Mallarino 2006), and thus total P can be related to the potential for both erosive and soluble P losses. Finally, water quality monitoring and management efforts are often pegged to estimates of total P in aquatic systems, rather than estimates of 'bioavailable' P (Correll 1998). The complexity of P cycling and transformation in terrestrial and aquatic environments is a good reason to track total P in our initial machine learning efforts. Future applications of this approach may be applied to soil test P, particularly as more data is available from high frequency sampling to account for high variability in these forms, as the bioavailability of different components of total P may change over time.
Soil P estimates from these two datasets provided only partial spatial cover of the UMRB (Fig. 1). To expand our potential to predict soil P for different types of soils and landcover conditions present across the UMRB, we expanded the regional range of data used to build our predictive model to include locations that were outside of the UMRB boundary but that shared potentially similar soils, climate and/ or land management. This expanded study region included the following 13 U.S. states: Arkansas, Illinois, Indiana, Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin (Fig. 1).

National geochemical survey dataset (NGS)
The NGS contains 284 geochemical attributes for soil and stream sediment samples across the United States (n = 77,212). While the primary focus of the dataset is stream sediments, the dataset contains a considerable number of soil samples (n = 19,992). Of these soil samples, 19,574 samples contain information about total soil P. Total soil P estimates (% dry weight) were obtained using inductively coupled plasma spectrometry after acid dissolution (USGS 2004). We converted soil P estimates to mg/kg by multiplying % weight estimates by 10,000.
Samples in the NGS dataset were collected between 1900 and 2008. Since we were most interested in developing modeling approaches capable of predicting relatively current soil P conditions from concurrent spatial attribute data, we subsampled the dataset to those collected between 2000 and 2008, leaving 9,589 samples. We further restricted samples to those found in our geographic study region, leaving a total of 6,664 soil P samples (Fig. 1).
For most samples, the NGS contains information about soil depth. However, depth was represented inconsistently across the dataset, with some samples associated with minimum and maximum numeric depth estimates (in inches), some samples associated with a soil horizon category (i.e., "A", "B", etc.), and some samples lacking any kind of information about depth or horizon.
For samples where numeric depth information was available, we designated a predictor variable 'depth' as the maximum depth pertaining to that sample. For soil samples without numeric depth measurements, we assumed a maximum depth of 30 cm if samples were listed as occurring in the soil A horizon (corresponding to the surface or plowable layer). We also assumed a maximum depth of 30 cm for samples with no depth or horizon information provided (that is, we assumed samples without depth information were surface samples). If samples occurred in any other soil horizon (and if no numeric depth information was provided) we designated samples as 'subsurface', and assigned them a maximum depth of 76 cm based on information from samples where depth data were present (and corresponding primarily to the soil B horizon).

National cooperative soil survey
The NCSS contains data commonly requested for agronomic and biogeochemical purposes for 404,080 soil samples across the United States analyzed by the Kellogg Soil Survey Laboratory and cooperating universities. Of these, 10,678 samples contained information about total P, with 961 samples occurring in our study region. As for the NGS dataset, we filtered samples to include only those collected since 2000, leaving 558 samples in the remaining NCSS dataset. Total soil P estimates (mg/kg) were obtained using inductively coupled plasma spectrometry after acid dissolution (Soil Survey Staff 2014). All samples included an estimate of the top and bottom depth of the soil horizon sampled in centimeters. We used the estimate of horizon bottom depth as the estimate of depth associated with each sample.
After subsampling the data, the combined NGS and NCSS datasets yielded a total of 7,222 soil P samples for model development and testing across the Midwestern United States (Fig. 1).

Predictive variables
To predict soil P across the UMRB, we sought attributes that had previously been identified as known drivers of P abundance, including soil properties such as parent material, mineral and organic content, grain size and texture, as well as attributes describing landscape position, land use, climate and anthropogenic inputs (Records et al. 2016;Deiss et al. 2018;He et al. 2021). We used geospatial attributes from multiple publicly available datasets to summarize aspects of these potential drivers of soil P. These datasets included: (i) the National Hydrography Dataset Plus v2 (NHDPlusv2) 2 ; (ii) the National Land Cover Database (NLCD) for 2006 3 ; (iii) the National Wetlands Inventory 4 ; (iv) the gridded Soil Survey Geographic (gSSURGO) Database (10 m resolution) 5 ; and v) the U.S. EPA StreamCat dataset 6 . Overall, we included 268 attributes derived from these datasets at multiple spatial scales as predictors in soil P models. Each of these datasets are described in more detail below and specific attributes used in soil P models are listed in Supplementary Information (SI) Table S1. Once soil samples were joined to all predictors, 6,683 soil samples had complete attribute information available for use in modeling.

National hydrograph dataset plus v2 (NHDPlusv2)
We sought to include information about near channel areas (i.e., the 100 m buffer surrounding the stream network) in our predictor dataset, because of the possibility for these environments to accumulate P relative to upland areas. For example, in a review of the effects of conservation practices on water quality, Dodd and Sharpley (2015) found that near channel areas such as riparian buffers, including grass and vegetative filter strips and wetlands, accumulated labile forms of soil P over time. We derived a categorical attribute indicating whether soil samples were collected in the 100 m riparian buffer of various stream categories defined in the NHDPlusv2 layer, including ditches, intermittent streams, and perennial streams. The NHDPlusv2 includes a digital stream network layer for the conterminous United States, divided into reaches (i.e., unique links in the network) as well as catchment boundaries associated with each reach. Catchments (i.e., local drainage areas) include the immediate land area draining into each individual stream reach excluding areas draining to upstream reaches. Buffering of the stream layer was conducted in ArcMap version 10.8.1. We also used the NHDPlusv2 layer to assign soil P samples to NHD catchments, which were subsequently used to link soil samples to predictive variables in the StreamCat database (see below).

National land cover database (NLCD)
Land use is a likely driver of soil P dynamics through multiple potential pathways, including alteration of P inputs and outputs associated with different land cover regimes, and through management practices that may alter biological, chemical and physical processes that affect P composition (Liu et al. 2018). We used the NLCD to summarize aspects of land cover for the UMRB. NLCD layers contain information about 16 land cover classes across the United States at 30 m resolution. We chose NLCD 2006 because most soil P samples in our dataset (85%) were collected between 2000 and 2006, and all were collected before 2011. We used the spatial join function in ArcMap to link land cover classes to soil P sample locations.

National wetlands inventory
Similar to riparian buffers, wetlands can accumulate P by intercepting P-rich runoff from surrounding drainage areas (Dodds and Sharpley 2015); thus we sought to include wetland presence and type as predictors in machine learning models for soil P. The NWI is a publicly available dataset that captures distribution and attribute information for wetlands in all 50 U.S. states. We derived a categorical attribute indicating whether soil samples were collected from a series of possible wetland types included in the NWI, including freshwater emergent wetlands, freshwater forested/shrub wetlands, freshwater ponds, lakes, or riverine wetlands, or from non-wetland areas.
gSSURGO A large body of previous work has identified the importance of soil properties such as mineral and organic content, parent material, and soil texture to P abundance (e.g., Records et al. 2016). We used the 10 m resolution raster gSSURGO dataset (Soil Survey Staff 2021) to assign soil properties to P sample locations. The gSSURGO database is organized by map unit soil polygons where each map unit is linked to hundreds of detailed soil attributes. We selected 81 attributes from the gSSURGO dataset with the potential to control soil P abundance -including aspects of parent material, grain size, organic matter content, and mineral content (Records et al. 2016)-and linked them by location to soil P sample locations.
For gSSURGO attributes that are mapped at scales smaller than the map unit, we preprocessed numeric SSURGO attributes by taking the weighted average of values for all components in a map unit, where weights were the proportional contribution by area of each component to the map unit. For categorical SSURGO attributes, we selected the attribute value for the component comprising the highest percent area of the map unit. If two attribute values comprised the same percent area of the map unit, we preferentially selected the component attribute with the higher slope value because surface P runoff is likely to be higher for soils with greater slope.

StreamCat dataset
U.S. EPA's StreamCat dataset contains information for over 600 different environmental metrics linked to individual stream reaches in the NHDv2Plus dataset (Hill et al. 2015). These metrics summarize diverse geospatial attributes including land use, impervious surfaces and road density, soil type, point source and nutrient inputs, and climatic factors (temperature and precipitation) at the catchment and watershed scale draining into each reach. In contrast to the 30 m pixel gridded NLCD 2006 data, StreamCat attributes are summarized at the catchment (local drainage area) and watershed scales and thus offered the potential to capture land use trends at a slightly broader, but still local, geographic scale that might be relevant to soil P concentrations. We included attributes from Stream-Cat variables potentially relevant to soil P across our study region (Table S1). These tables included 220 attributes summarized at the watershed and catchment in our dataset.

Random forest modeling
Random forest regression is a nonparametric ensemble learning method that utilizes predictions from multiple decision trees to improve model accuracy. Each tree is composed of branches ("nodes") representing yes-no questions where features (i.e., predictive variables) are used to split the dependent variable into two groups that minimize in-group variability and maximize between group variability. We selected random forest as our modeling method because these models can be highly accurate, are relatively fast to develop, can handle categorical and numerical attributes, are robust to outliers, and can handle both non-linear and unbalanced data, all of which were pertinent criteria for our dataset. Random forest models also require very few assumptions of the input datasets. At the same time, some limitations of random forest modeling methods should be noted: mainly, the machine-learning techniques are characterized by the data mechanisms as 'black box' or 'gray box' algorithmic approaches (Jeong et al. 2016). Particularly, random forest is built on nonparametric advanced classification and regression tree (CART) analysis methods and models may not be fully described mechanistically (Breiman 2001). Also, the random forest algorithm has the potential to overfit data, and may become impractical for making predictions beyond the training data range (Jeong et al. 2016). Given the inherent random forest structure, permutation of many trees may make the algorithm slow for real-time prediction (Breiman 2001).
Random forest model development followed the general scheme for machine learning models articulated by Zhong et al. 2021 (Fig. 2). We used a tidymodels framework (Kuhn and Wickham 2020) in R (R Core Team 2022) to develop random forest models for predicting soil P from the assembled predictors. Models were evaluated based on R 2 and RMSE values (Chicco et al. 2021). Within this framework, we used both the ranger package (Wright and Ziegler 2017) and the randomForest (Liaw and Wiener 2002) packages at different points in the modeling process. Both ranger and randomForest have been shown to perform as well or better than other machine learning approaches for predictive purposes (Hagenauer et al. 2019) and have been shown to produce very similar model results to one another (Wright and Ziegler 2017). We used the ranger package for initial model tuning, due to its considerably faster implementation relative to randomForest (Wright and Ziegler 2017). Once optimal hyperparameter values had been identified using this approach, we applied these hyperparameter values to a randomForest implementation to create our final model, because randomForest model objects (but not ranger model objects) are compatible with conditional permutation methods for estimating the relative importance of predictors to model outcomes (see rationale for the approach we took to evaluate predictor importance below). Prior to switching from ranger to randomForest implementations, we confirmed that the two approaches produced nearly the exact same results in terms of model performance (as measured by R 2 and RMSE for actual vs predicted values on an independent test dataset), when the same hyperparameter values were used. A complete R script detailing our approach can be found at https:// github. com/ cldol ph/ SoilP.

Additional data curation
After pre-processing soil predictors as described above, we linked attributes to soil P samples and removed variables that did not contain useful information (e.g., all rows = 0). We also excluded attributes where information was missing ('NA') for > 20% of soil P samples. Finally, we also excluded three attributes from the gSSURGO dataset (pmkind, taxpartsize, and texcl) with category levels that did not encompass the entire set of the categories contained within the UMRB grid dataset (see below). Many of the remaining attributes still contained some missing values. Because random forest models cannot handle missing values in predictor variables, we used the missRanger package in R (Mayer 2021) to impute the remaining missing values for the training and testing datasets. missRanger imputes missing values for each variable by building a random forest model for each variable that uses all other variables in the dataset as covariables. Prior to random forest modeling, we normalized (i.e., centered and scaled) numeric attributes to have a mean of zero and a standard deviation of one.

Model tuning
For the soil P random forest model, we used 90% of the data for model training, and the remaining 10% for model testing. Using the training dataset and the ranger implementation for random forest modeling, we applied tenfold cross validation to tune model hyperparameters across a range of possible values. The hyperparameters selected for tuning were: mtry (i.e., number of variables randomly sampled as candidates at each split) and min_n (i.e., the minimum number of data points in a node). The trees hyperparameter (i.e., number of trees) was set to 1000 across all models. K-fold cross validation can assist in avoiding model over-fitting and works by partitioning training data into K equal sized "folds" (in our case 10). The model is iteratively trained on various combinations of tuning hyperparameters across K-1 folds, leaving the remaining fold to evaluate model performance for each combination. We defined a grid of 20 potential combinations of hyper-parameters using the tune_grid () function from the tidymodels collection of packages in R. This approach draws hyperparameter values semi-randomly from parameter space such that the various combinations cover the whole space of potential values. We selected hyperparameter values using out-of-bag (OOB) RMSE and R 2 for the associated models.
Once hyperparameter values were tuned, we reran the random forest model using the randomForest package, to create a randomForest object that was compatible with our selected measure of predictive variable importance (see next section). We evaluated overall model performance based on the independent test dataset (comprising 10% of the original dataset).

Predictive variable importance
Many implementations of random forest models come with default measures used to quantify the relative importance of individual predictive variables to model performance. However, recent work has shown that many of these default strategies, including those based on measures of impurity or permutation-based metrics, can produce biased results when model predictors (1) vary in their scale of measurement; (2) vary in their number of categories; or (3) are highly correlated to one another (Hooker et al. 2021). When these conditions apply, as they do for predictors in our soil P dataset, alternative measures of importance such as conditional permutation importance (CPI) may be more appropriate, and less biased towards collinear predictive variables (Debeer and Strobl 2020;Hooker et al. 2021). CPI aims to capture the dependence between a predictor and the response variable, conditionally on the values of all other predictors. That is, CPI is a measure that can be used to assess how much each variable contributes to accurately predicting the response variable, given what we know from all other predictive variables. The tradeoff is that CPI methods can be computationally expensive. To derive less biased estimates of variable importance to the random forest model performance, we implemented the CPI approach from the permimp package in R (Debeer et al. 2021). This approach builds upon the widely used party package for CPI, while offering improved computational speed, accounting for nonlinear dependence between predictors, less sensitivity to dataset sample size, and greater stability of results (Debeer and Strobl 2020). To date, permimp can be applied to randomForest or cforest model objects in R (but not to ranger objects). In permimp, a threshold value, equal to 1-the p-value for the association between predictor variables, is used to determine whether to include a predictor in the conditioning for the predictor of interest. We used the default value for the threshold parameter in permimp (0.95) because Debeer and Strobl (2020) advised utilizing threshold values close to 1 for larger datasets.

Predicting soil P across the Upper Mississippi River Basin
One of our goals of this effort was to predict soil P values for unsampled locations across the UMRB. To do this, we created a grid of points spaced 100 m apart across the entire UMRB (UMRB grid), using the Create Fishnet tool in ArcMap version 10.8. We linked these locations to the same set of attributes used in our random forest model (i.e., attributes from the NHDPlusv2, NLCD 2006, NWI, gSSURGO, and StreamCat datasets). We excluded grid locations that coincided with locations of open water. We used the High-Performance Cluster at the University of Minnesota's Supercomputing Institute (https:// www. msi. umn. edu/) to assign attributes to the UMRB grid points, and to generate predictions for the grid locations using the random forest model we developed. For this effort, we assigned all predictions to the soil surface (max depth = 30 cm). Lastly, we used the Inverse Distance Weighting (IDW) tool in ArcMap version 10.8 to interpolate a raster surface of soil P values at the 100 m grid scale for the UMRB.

Soil phosphorus
Mean soil total P for the broader Midwestern region included in the study was 580 mg/kg (range 17-4370 mg/kg). The distribution of soil P values showed a slight right skew (Fig. 3). The dataset was comprised of 1,761 samples that could be considered to occur in the "plowable layer" at the soil surface (i.e., < = 30 cm deep), and 5,128 samples that occurred at depths > 30 cm (i.e., 'subsurface' samples). Surface soil samples had a mean total P concentration of 585.6 mg/kg (range = 80-4370 mg/kg). Subsurface soil samples had a mean total P concentration of 592.6 mg/kg (range = 17-4280 mg/kg). The dataset contained 1,085 samples that were located in "near channel" or riparian areas (i.e., within a 100 m buffer of the NHDv2Plus stream network), and 5,798 samples that were non riparian or located outside of the stream buffer zone (Fig. S5). The dataset also contained 309 samples that were located within wetland soils, and 6,574 that were located in nonwetland soils.
Across all samples, 352 samples (5%) had soil P > 1000 mg/kg and were distributed throughout the study region ( Figure S1). A very small number of samples (n = 5) had soil P > 3000 mg/kg. All of these samples occurred in the National Geochemical Survey dataset. We examined the original field notes for these samples to ascertain if these values could be attributed to error or other specific site features, but there was no obvious information pointing to sampling or analytical error.
Relative to samples with soil P < 1000 mg/kg, samples with very high soil P were located in areas with higher median crop productivity index for corn, lower available water storage in the root zone, relatively higher contribution of groundwater to streams in the local catchment and watershed, relatively higher rates of N deposition (as NO3 − , NH4 + and inorganic N), higher pesticide and fertilizer use, higher precipitation and runoff, and higher cropland cover in the catchment, watershed and riparian zone (See SI Table S2). Higher contributions of groundwater to stream and river flow could indicate high permeability (like karst), a large contribution of drain tile (Schilling and Libra 2003), or a lower contribution of surface water such as in the drier western part of the basin. Overall, predictor values for samples with high soil P suggest that very high soil P samples occur on average in areas with greater precipitation and well drained soils, which are also associated with areas of more intense agriculture in the UMRB. This rationale makes sense for samples located in parts of the study region (e.g., Iowa), where these conditions are likely to occur. However, the spatial distribution of samples with high soil P ( Figure S1) indicates that many of them also occurred in northern areas of the basin characterized by relatively little agriculture (e.g., northern Minnesota).

Model performance
The original model included all soil P samples in model training and testing; however, this resulted in a model that tended to underpredict soil P for samples > 1000 mg/kg (SI Figure S2). As a result, we decided to increase model prediction accuracy for the vast majority (95%) of soil samples in our dataset by excluding the 5% of samples with soil P > 1000 mg/ kg from the training and testing datasets (see Discussion). The final selected hyperparameters for the random forest model based on model tuning with tenfold cross validation for this dataset were mtry = 50, trees = 1000, min_n = 3. Evaluation of predicted vs actual soil P values for the independent test dataset indicated a model RMSE of 0129.3, and an R 2 of 0.58, for soil P samples < 1000 mg/kg (Fig. 4).

Predictive variable importance
Using the permimp function to estimate conditional permutation importance (CPI) for predictors in our model indicated that soil sample depth had by far the strongest relative impact on model performance, followed by land use in the immediate area (i.e., the 30 m grid cell) in which the soil sample was collected (Fig. 5). The remaining top predictive variables are shown ranked by importance in Fig. 5 and Table 1. In addition to sample depth and local land use, the top predictors ranked for importance included aspects of catchment-and watershed-scale land use (including aspects of urban land use, the extent of impervious surfaces, open water cover, barren land cover, and urban and agricultural land use in riparian areas), underlying soil properties (e.g., soil characteristics such as organic matter and mineral content, water storage and permeability), landscape properties (such as slope, elevation, and whether sites were located adjacent to the stream network), inputs (including atmospheric deposition as well as fertilizer and manure inputs), and climate factors (including temperature and precipitation; Table 1). Additional predictors beyond this set did not appear to contribute strongly to model accuracy once all other variables had been accounted for.
Examining some of these attributes in relation to soil P in more detail, we find that the highest soil P concentrations (> 750 mg/kg) were found almost exclusively in samples collected at a depth of 100 cm or shallower, i.e., comparatively closer to the surface. Nearly all samples deeper than 100 cm had soil P < 750 mg/kg (SI Figure S3). Among NLCD land use categories, samples collected in areas designated as open water, shrub/scrub, cultivated crops, developed/open space and grassland/ herbaceous had higher mean soil P concentrations than the total dataset average, whereas other categories had comparatively lower soil P ( Figure S4). Riparian soils (i.e., soils located within 100 m of the NHDv2Plus stream network) had significantly higher soil P (mean soil P = 646 mg/kg) compared to non-riparian soils (mean soil P = 580 mg/kg; unpaired t test t = − 7.04, df = 1456., p < 0.00001; Fig. 4 Actual vs predicted total soil P (mg/kg) for soil samples in the independent test dataset (i.e., not used to build the model). R2 = 0.58, RMSE = 0.129.3, p < 0.0001. Note that the random forest model was run using centered and scaled values of P, but unscaled values are shown here for easier interpretation Figure S5). However, caution should be used when applying univariate relationships to aid in the interpretation of importance values for the random forest model. Because the random forest model is based a series of successive data splitting events (i.e., decision trees) where data is binned into branches of groups based on predictive variable split points, the predictor values useful in parsing these groups may not be readily apparent from univariate relationships.
Predicting soil P for the Upper Mississippi River Basin Figure 6 shows predicted total soil P values for the soil surface (max depth = 30 cm) of the UMRB at the 100 m grid scale, based on the random forest model. The predictions indicate the highest concentrations of total soil P in northern Iowa, southern Minnesota, and north-central Illinois. The lowest predicted concentrations occurred in northern and north/north-central Conditional Permutation Importance (CPI) values of covariates in the random forest model for soil P samples with soil P < 1000 mg/kg. CPI is a measure that can be used to assess how much each variable 'adds' to accurately predicting the response variable, given what we know from all other covariates. Additional covariates beyond those shown did not show a strong impact on model performance, once all other covariates were accounted for. Full covariate descriptions are listed in Table 1  Table 1 Top covariates, as ranked by conditional permutation importance (CPI), to the performance of the random forest model  Wisconsin, and in southern Illinois and southern Missouri.

Discussion
Soil P across the study region Attributes are designated by major categories including 'land use', 'underlying soil properties', landscape properties', 'inputs', and 'climate'. More detailed descriptions of each attribute are available in SI Table S1. 'Riparian zones' refers to the 100 m buffer surrounding the NHD stream network Here, we have assembled a large dataset of total P concentrations for soils across the Midwestern United States from publicly available datasets and used them to predict total soil P at fine resolution across the UMRB, as well as to identify predictors that were comparatively important to model accuracy. These soil datasets, collected by federal agencies over relatively long time periods and large spatial extents, have great potential utility in the analysis of dynamics in soil and water chemistry, but the availability of these datasets is not widely known and they are not uniformly organized (and require considerable pre-processing), possibly leading to their underutilization in the study of soil nutrient dynamics. Our study highlights the potential of existing publicly-funded datasets to support ongoing studies and management design related to soil and water health. For this reason, we provide the collated dataset together with the metadata and code in an open science framework available in the open access Github repository (https:// github. com/ cldol ph/ SoilP). The range of total P concentrations reported here for soils of the Midwestern United States (mean = 580 mg/kg, range = 17-4370 mg/kg) is similar to that reported by Schilling et al. (2021) for riparian soils across Iowa (mean = 460 mg/kg; range = 109-1569 mg/kg; though it should be noted that the latter study used the aqua regia digestion method for measuring total P which can sometimes result in slightly lower measured P concentrations (J. Kovar, personal communication), rather than the HCl digestion method used for the samples in this study. Riparian soils in our study averaged slightly higher total P (646 mg/kg) compared to those measured by Schilling et al. (2021). Ringeval et al. (2017) developed models to simulate total P for cropland soils globally and estimated the global average for total P in cropland soils to be 567 mg/ kg; however, their simulated total P values for soils in North America ranged considerably higher, typically above 1000 mg/kg.

Model performance
Heterogeneity in total soil P is the outcome of potentially complex interacting drivers including geology, hydrology, climate, biogeochemical processes, and specific land use practices. Given this complexity, we view the predictive accuracy of our random forest model as relatively high (R 2 = 0.58 and RMSE = 129.3 for an independent validation dataset). The RMSE for our model represents 24% of mean soil P for the validation dataset, which is comparatively more accurate than P prediction models developed at smaller scales in Europe and Russia (Matos-Moreira et al. 2017 and references therein;Sahabiev et al. 2021). Our model performance is also considerably stronger than that for continental-scale predictive models developed using similar methods for soil organic carbon and total nitrogen by Ramcharan et al. (2018). However, the model developed here underpredicts total soil P for samples with very high soil P (> 1000 mg/kg). Although such samples comprised a relatively small proportion of our total dataset (~ 5%), underestimating soil P content for these samples may be substantively important if they represent hotspots for P accumulation and ultimately transport. Understanding the drivers of very high soil P samples in this study region is an important area for future research. Knowledge of specific land use practices at finer scales than those used in our study (e.g., fertilizer or manure application at field scales smaller than the catchment or watershed scales available for land use practices in the StreamCat dataset), could also potentially help with predictive accuracy. Intensive animal agriculture (e.g., CAFOs) and associated manure inputs in particular have been shown to coincide with areas of P accumulation (Stackpoole et al. 2019). Future research could also explore whether a classification-based approach (i.e., applying expert judgment to group numeric soil P estimates into biochemically meaningful categories such as 'very high ', 'high', moderate', 'low' etc.) rather than the regression approach used here might enable random forest models to more accurately identify hotspots or areas of very high soil P.

Predictive variable importance
Our analysis indicates that, at the regional scale represented by our model, the variables with the greatest comparative importance for predicting soil P included sample depth, land use, underlying soil properties, landscape properties, P inputs, and climate. Soil sample depth was the most comparatively important variable to model performance, with soil P tending lower with increased depth ( Figure S3). This finding echoes Ramcharan et al. (2018), who found depth to be consistently ranked as one of the most important predictors for random forest models developed for soil properties including total nitrogen and soil organic carbon.
Land use in the immediate areas (30 m grid) where soil samples were collected was the second most important variable to model accuracy, according to the conditional permutation importance measure we used. The comparatively high ranking of land use/ land cover differs somewhat from that of Ringeval et al. (2017), who found, using a different modeling approach, that the main driver of variability for total soil P at a global scale was the soil biogeochemical background corresponding to P inherited from natural soils. However, at smaller (continental) scales, Ringeval et al. (2017) found farming practices were also important in driving heterogeneity in total soil P, though still less important than native soil properties. At a smaller scale,  found that for an alluvial plain (~ 100 km 2 ) characterized by intensive agriculture, land cover classes were the most important attributes to predicting soil P. For a small watershed in Brittany, France, Matos-Moreira et al. (2017) found that information about agricultural practices contributed more to machine learning models for extractable P than for total P, whereas models for total P relied more strongly on soil and parent material attributes.
In our study, land use in the immediate area of a soil sample was more important to accurately predicting soil P than soil properties at the gSSURGO map unit scale and more important than measures of land use at catchment or watershed scales. Given this finding, a question for future research would be whether more specificity in soil properties and land use practices at the hyper local scale (e.g., onfield practices such as fertilizer and manure input, specific crop type, tillage practices at the field or 30 m grid scale, etc.) could lead to improvements in models for predicting soil P across the Midwest. Matos-Moreira et al. (2017) suggested that detailed information about crop rotation in particular may improve the accuracy of soil P prediction models, as it correlates with a number of agricultural practices, including fertilizer and pesticide application and soil tillage.
Model variables associated with urban or highdensity areas, including extent of impervious surfaces, urban land use, and road density, were all ranked as relatively important to model performance. This finding suggests that soil P in urban or highdensity areas may be distinct from other landscapes. However, several ranked "urban" attributes describe the extent of low density or "open" urban land use, which are indicative of suburban and exurban landscapes and may indicate that these landscapes are also distinct in terms of soil P. Although soil P dynamics in urban areas are relatively under-studied, they have been identified as hotspots for P cycling, with distinct drivers that influence P stocks and flows (Metson et al. 2015). For example, Hobbie et al. (2017) observed that urban watersheds in our study region retained comparatively little of their P inputs, and that the majority of annual P inputs were exported via stormwater flows. Additional research is needed to understand drivers of total soil P in urban and suburban landscapes and how these may differ from those in agricultural settings.
Riparian attributes summarized at the catchment or watershed scale (e.g., the % of crop, hay or urban land use cover in catchment or watershed-scale riparian zones) were also identified as relatively important to model prediction accuracy. This finding could result from the fact that land use in riparian zones may be indicative of a particularly intensive form of land use-i.e., if riparian zones are used for crops, hay or dominated by urban land use, this may indicate particularly intensive landscapes for these uses in a way that affects soil P. Conversely, these drivers may be strongly affecting soil P in riparian areas, and therefore could have been important for model accuracy for riparian samples. A categorical attribute ('StreamType') describing whether a sample was collected in the riparian buffer of the stream network (and which kind of stream reach it was adjacent to such as a ditch, stream, river etc.) also appeared in the list of important variables to model performance. Post hoc analysis indicated that riparian areas (soils within 100 m of the stream network) had significantly higher total P (646 mg/kg) compared to non-riparian areas (580 mg/kg). This echoes the finding by previous studies which have noted riparian and wetland areas as potential sinks, and ultimately sources, of P in agricultural landscapes (Kleinman et al. 2022).
Soil properties that were ranked as important to model performance included the native P content (as phosphorus oxide) and mineral content (aluminum, iron, calcium, and sodium oxides) of soils, along with aspects of soil texture, water storage and depth, soil organic carbon/organic matter, glacial history, hydraulic conductivity, erodibility, and suitability of soils for growth of various crop types. Extensive research has documented the importance of soil compounds such as iron and aluminum oxides and calcium carbonate on soil P retention and release (Records et al. 2016). Likewise, soil organic matter content has been shown to have direct and indirect effects on P sorption capacity (e.g., Kang et al. 2009). And as documented recently by Plach et al. (2018), the parental origins of soil and their glacial history have a potentially strong effect on the physical and geochemical properties of soil which can subsequently affect P retention and transport. Our findings likewise indicate that many of these properties are important for accurately predicting soil P across Midwestern landscapes.
Nitrate and ammonium deposition at the catchment scale, as well as fertilizer and manure inputs at the catchment scale, were all identified as important to model performance. Atmospheric deposition of nitrogen could be an indicator of the proximity of fossil fuel combustion, fertilizer or manure application or livestock emissions (Russell et al. 1998), which may also affect soil P.
Predictors summarizing aspects of climate, landscape properties, and their interaction-including precipitation, temperature, runoff, slope, and the contribution of groundwater to baseflow-also appeared to affect model performance. Recent work has shown that both temperature and precipitation are important to soil P availability at a global scale and may have contrasting effects (Hou et al. 2018). Slope and resulting erosional processes (whether driven by human land use, climate, or other factors) also have a profound effect on erosional and subsequent biogeochemical processes that may affect soil P (Berhe et al. 2018).
In addition to land cover, a major driver of hydrology and water-soil interactions in the UMRB is the vast network of subsurface tile drainage that characterizes much of the region (Schottler et al. 2013). The predictor datasets we used did not include information about subsurface drainage. Although tile drainage has important ramifications for downstream transport of P (Smith et al. 2015), it may have less of an impact on total soil P. For example, based on soil P data assembled here, average total soil P in the UMRB is 580 mg/kg. If we assume a soil bulk density of 1.4 (Jacobson et al. 2011), this average soil P concentration equates to ~ 8,120 kg/ha of P in the top meter of the soil. By comparison, previous studies have shown yearly P losses from tile drains to be ~ 0.01% of this amount (e.g., 0.5-1.0 kg/ha/yr; King et al. 2015;Dialameh and Ghane 2022). Meanwhile, P inputs may be ~ 12-25 kg/ha/yr, depending on the crop rotation (Potter et al. 2010). Thus, while accounting for tile drainage impacts is critical to understanding losses of P to downstream water bodies, it may be less important to accurately estimating total soil P. In addition, it is likely that the presence and density of tile drainage co-varies with other predictors already included in our model. For example, attributes included in the model that relate to land use, slope, soil texture and soil drainage class correlate with tile drainage (Valayamkunnath et al. 2020). Future research in this area may consider using tile drainage as a proxy for a more complex suite of soil and management parameters if detailed soils data are not available.
It is important to note that methods for the estimation of variable importance within random forest models is an active area of research (Debeer and Strobl 2020), with rapid development of prospective improvements to "opening up" the black box of random forest models. We applied a recently developed approach that is aimed at examining the conditional importance of predictors, which may give different results than more commonly applied importance measures. Care should be taken when interpreting these results relative to other modeling approaches if different methods for importance estimation are used.

Limitations and future directions
Our initial goal in developing the machine learning model described here was to determine if we could make accurate predictions for soil P across a broad geographic area, for use in future biophysical modeling and conservation planning applications in the UMRB. Because the UMRB contains considerable variation in underlying soil properties, land cover and soil P values, we included a large number of potential predictor variables in our machine learning framework to improve model accuracy as much as possible. Complex models often perform more accurately than simpler ones (Wadoux et al. 2020). Indeed, our model performed relatively highly in terms of accuracy compared with other machine learning models for soil attributes, possibly because of the large number of predictors included. The downsides of this approach include the considerable time and effort required to assemble and curate such a large dataset of predictors, as well as increased difficulty in interpreting the role of so many variables in predicting soil P. A third challenge stemming from the use of hundreds of heterogeneous predictors is the way it complicates predictions for soil P at unknown locations. Because we assembled predictors from existing disparate public datasets, these variables were not summarized at consistent spatial resolution and extent. For example, data from the StreamCat dataset was summarized at catchment and watershed scales, whereas gSSURGO data is summarized at the SSURGO map unit scale, etc. This variability presents an obstacle to the application of recently developed approaches to spatial prediction; e.g., where machine learning is performed using stacks of raster data organized at the same spatial extent and resolution (Hengl and MacMillan 2019). We chose to predict soil P at the scale of a 100 m grid to provide fairly fine resolution data for a large geographic extent given the hundreds of heterogeneous predictors we had access to. However, the disadvantage of this approach is that important fine scale spatial variation in soil P could still be obscured. Future work could further investigate a soil P model in the context of parsimonious model design and evaluate whether simpler machine learning models with fewer predicting variables could perform as well as the more complex one developed here (Wadoux et al. 2020).

Conclusion
Here we have used existing, large public soil chemistry and geospatial datasets together with a random forest model to generate predictions of total soil P at fine spatial resolution across the Upper Mississippi River Basin in an open science framework. Predicted soil P values are publicly available as a raster data layer at https:// github. com/ cldol ph/ SoilP. The combination of large existing datasets with powerful analytical tools like machine learning in a high-performance computing environment offers new possibilities for understanding and predicting complex biophysical phenomena such as soil P at fine scales. Such predictions could be useful when combined with watershed models to improve estimates of P loss from this largely agricultural landscape. Improved knowledge of critical source areas of soil P to include undersampled locations is necessary to develop effective regional conservation plans and prioritize resources such that they can reduce the accumulation and transport of P across landscapes.