2.1 Study area
The focus of this study is the Upper Mississippi River Basin (UMRB), which comprises the uppermost tributaries to the Mississippi River, from the headwaters at Lake Itasca in Minnesota down to the confluence of the Mississippi and Missouri Rivers at St. Louis, Missouri (Figure 1).
The UMRB covers nearly half a million square kilometers (492,027 km2) of the upper Midwestern United States, including large parts of Minnesota, Wisconsin, Iowa, Missouri, Illinois, and smaller areas of Indiana, North Dakota, and South Dakota, and corresponds to the HUC2-scale watershed (HUC ID = 07) in the USGS Watershed Boundary Dataset (2021). The UMRB is intensively farmed for large scale monocultures of primarily corn and soybeans. CAFOs are also prolific in the basin, particularly across Iowa and southern Minnesota (US EPA, 2013). Overall, cultivated crops comprise 49% of total basin land area (Dewitz, 2019); ~ 85% of those crops are corn and soybeans (USDA, 2010). The intensity of agricultural land use varies across a gradient from north to south, with the uppermost parts of the basin (in Minnesota and Wisconsin) characterized by relatively higher amounts of forest and wetland cover compared to the southern regions of the basin, where crop land use can exceed 85%. However, even the northern areas of the basin are undergoing rapid land use change from wetland and forest cover towards more agriculture (MPCA, 2017; Green et al., 2018). The precipitation gradient varies from comparatively drier in the north and west to wetter in the south and east. Soils in the basin are largely silty loam and loam, but also include areas of poorly drained clays as well as areas of well drained sands. Topography in the UMRB is generally flat or gently rolling; however some areas, particularly in the Minnesota River Basin, are characterized by steep bluffs created by an incisional process initiated by the draining of glacial Lake Agassiz (Gran et al., 2019).
2.2 Data processing and analysis overview
Data processing and analysis was done in R Studio v. 2022.07.1. All of the input data and methods used to generate the machine learning model for predicting soil P – including data acquisition, data preprocessing, data splitting, model training, evaluation and interpretation – are publicly available at https://github.com/cldolph/SoilP. The workflow used to develop the model is summarized in Figure 2.
2.3 Soil phosphorus data
For model training and testing, we used nearly 7,000 soil P measurements from two spatially extensive datasets: 1) the USGS National Geochemical Survey (NGS) Database (USGS, 2004), and 2) the National Cooperative Soil Survey (NCSS) Characterization Dataset (NCSS, 2021). Both of these datasets are publicly available[1] and contain information about soil P, among other soil attributes. We focused on total soil P rather than soil test P in our model development for several reasons. First, the NGS does not contain information about soil test P. Although the NCSS does contain information about various soil test P measures (e.g., Bray 1 P, Olsen P, etc.), soil test P is not measured consistently across the geography of the dataset. Because different soil test P measures extract variable amounts of P, they may not be readily comparable to one another (Wuenscher et al., 2015), making it difficult to compile and interpret multiple soil test P measures into one cohesive dataset. By contrast, total soil P data across the NGS and NCSS was considerably more extensive compared to any one soil test P measure, creating the opportunity for stronger machine learning models. Second, total soil P is a fundamental ecosystem property with broad applications in conservation, biogeochemistry, and agronomy. While soil test P may more directly indicate the crop availability and risk of loss of soluble P to downstream water bodies (e.g., Vadas et al., 2005), total P provides a more consistent and integrative measurement of soil P status. Moreover, previous work has shown total P to correlate with soil test P measures (especially when other soil properties are also accounted for; Allen and Mallarino, 2006), and thus total P can be related to the potential for both erosive and soluble P losses. Finally, water quality monitoring and management efforts are often pegged to estimates of total P in aquatic systems, rather than estimates of ‘bioavailable’ P. The complexity of P cycling and transformation in terrestrial and aquatic environments is a good reason to track total P in our initial machine learning efforts. Future applications of this approach may be applied to soil test P, particularly as more data is available from high frequency sampling to account for high variability in these forms, as the bioavailability of different components of total P may change over time.
Soil P estimates from these two datasets provided only partial spatial cover of the UMRB (Figure 1). To expand our potential to predict soil P for different types of soils and landcover conditions present across the UMRB, we expanded the regional range of data used to build our predictive model to include locations that were outside of the UMRB boundary but that shared potentially similar soils, climate and/or land management. This expanded study region included the following 13 U.S. states: Arkansas, Illinois, Indiana, Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin (Figure 1).
2.3.1 National Geochemical Survey Dataset (NGS)
The NGS contains 284 geochemical attributes for soil and stream sediment samples across the United States (n=77,212). While the primary focus of the dataset is stream sediments, the dataset contains a considerable number of soil samples (n=19,992). Of these soil samples, 19,574 samples contain information about total soil P. Total soil P estimates (% dry weight) were obtained using inductively coupled plasma spectrometry after acid dissolution (USGS, 2004). We converted soil P estimates to mg/kg by multiplying % weight estimates by 10,000.
Samples in the NGS dataset were collected between 1900 and 2008. Since we were most interested in developing modeling approaches capable of predicting relatively current soil P conditions from concurrent spatial attribute data, we subsampled the dataset to those collected between 2000-2008, leaving 9,589 samples. We further restricted samples to those found in our geographic study region, leaving a total of 6,664 soil P samples (Figure 1).
For most samples, the NGS contains information about soil depth. However, depth was represented inconsistently across the dataset, with some samples associated with minimum and maximum numeric depth estimates (in inches), some samples associated with a soil horizon category (i.e., “A”, “B”, etc), and some samples lacking any kind of information about depth or horizon.
For samples where numeric depth information was available, we designated a predictor variable ‘depth’ as the maximum depth pertaining to that sample. For soil samples without numeric depth measurements, we assumed a maximum depth of 30cm if samples were listed as occurring in the soil A horizon (corresponding to the surface or plowable layer). We also assumed a maximum depth of 30 cm for samples with no depth or horizon information provided (that is, we assumed samples without depth information were surface samples). If samples occurred in any other soil horizon (and if no numeric depth information was provided) we designated samples as ‘subsurface’, and assigned them a maximum depth of 76cm based on information from samples where depth data were present (and corresponding primarily to the soil B horizon).
2.3.2 National Cooperative Soil Survey
The NCSS contains data commonly requested for agronomic and biogeochemical purposes for 404,080 soil samples across the United States analyzed by the Kellogg Soil Survey Laboratory and cooperating universities. Of these, 10,678 samples contained information about total P, with 961 samples occurring in our study region. As for the NGS dataset, we filtered samples to include only those collected since 2000, leaving 558 samples in the remaining NCSS dataset. Total soil P estimates (mg/kg) were obtained using inductively coupled plasma spectrometry after acid dissolution (Soil Survey Staff, 2014). All samples included an estimate of the top and bottom depth of the soil horizon sampled in centimeters. We used the estimate of horizon bottom depth as the estimate of depth associated with each sample.
After subsampling the data, the combined NGS and NCSS datasets yielded a total of 7,222 soil P samples for model development and testing across the Midwestern United States (Figure 1).
2.4 Predictive variables
To predict soil P across the UMRB, we sought attributes that had previously been identified as known drivers of P abundance, including soil properties such as parent material, mineral and organic content, grain size and texture, as well as attributes describing landscape position, land use, climate and anthropogenic inputs (Records et al., 2016; Deiss et al., 2018; He et al., 2021) We used geospatial attributes from multiple publicly available datasets to summarize aspects of these potential drivers of soil P. These datasets included: i) the National Hydrography Dataset Plus v2 (NHDPlusv2)[2]; ii) the National Land Cover Database (NLCD) for 2006[3]; iii) the National Wetlands Inventory[4]; iv) the gridded Soil Survey Geographic (gSSURGO) Database (10m resolution)[5]; and v) the U.S. EPA StreamCat dataset[6]. Overall, we included 268 attributes derived from these datasets at multiple spatial scales as predictors in soil P models. Each of these datasets are described in more detail below and specific attributes used in soil P models are listed in Appendix Table S1. Once soil samples were joined to all predictors, 6,683 soil samples had complete attribute information available for use in modeling.
2.4.1 National Hydrograph Dataset Plus v2 (NHDPlusv2)
We sought to include information about near channel areas (i.e., the 100m buffer surrounding the stream network) in our predictor dataset, because of the possibility for these environments to accumulate P relative to upland areas. For example, in a review of the effects of conservation practices on water quality, Dodd and Sharpley (2015) found that near channel areas such as riparian buffers, including grass and vegetative filter strips and wetlands, accumulated labile forms of soil P over time. We derived a categorical attribute indicating whether soil samples were collected in the 100m riparian buffer of various stream categories defined in the NHDPlusv2 layer, including ditches, intermittent streams, and perennial streams. The NHDPlusv2 includes a digital stream network layer for the conterminous United States, divided into reaches (i.e., unique links in the network) as well as catchment boundaries associated with each reach. Catchments (i.e., local drainage areas) include the immediate land area draining into each individual stream reach excluding areas draining to upstream reaches. Our study region overlaps with the NHDPlusv2 network for the following drainage basins: Great Lakes, Ohio, Upper Mississippi, Lower Mississippi, Souris-Red-Rainy, Upper Missouri, Lower Missouri, and Arkansas-Red-White. Buffering of the stream layer was conducted in ArcMap version 10.8.1. We also used the NHDPlusv2 layer to assign soil P samples to NHD catchments, which were subsequently used to link soil samples to predictive variables in the StreamCat database (see below).
2.4.2 National Land Cover Database (NLCD)
Although the processes connecting land use to soil P content have yet to be fully elucidated, land use is a likely driver of soil P dynamics through multiple potential pathways, including alteration of P inputs and outputs associated with different land cover regimes, and through management practices that may alter biological, chemical and physical processes that affect P composition (Liu et al., 2018). We used the NLCD to summarize aspects of land cover for the UMRB. NLCD layers contain information about 16 land cover classes across the United States at 30m resolution. Of the available NLCD layers (available every 5 years from 1992-2016), we chose NLCD 2006 because most soil P samples in our dataset (85%) were collected between 2000 and 2006, and all were collected before 2011. We used the spatial join function in ArcMap to link land cover classes to soil P sample locations.
2.4.3 National Wetlands Inventory
Similar to riparian buffers, wetlands can accumulate P by intercepting P-rich runoff from surrounding drainage areas (Dodds and Sharpley, 2015); thus we sought to include wetland presence and type as predictors in machine learning models for soil P. The NWI is a publicly available dataset that captures distribution and attribute information for wetlands in all 50 U.S. states. We downloaded NWI spatial data for all 13 states included in our study region. We derived a categorical attribute indicating whether soil samples were collected from a series of possible wetland types included in the NWI, including freshwater emergent wetlands, freshwater forested/shrub wetlands, freshwater ponds, lakes, or riverine wetlands, or from non-wetland areas. We used spatial join function in ArcMap to link wetlands in the NWI layer to soil P sample locations.
2.4.4 gSSURGO
A large body of previous work has identified the importance of soil properties such as mineral and organic content, parent material, and soil texture to P abundance (e.g., Records et al., 2016). We used the gSSURGO dataset to assign soil properties to P sample locations. gSSURGO is a detailed soil geographic geodatabase developed by the USDA National Cooperative Soil Survey, containing information for more than 36 million mapped soil features. The gSSURGO database is organized by map unit soil polygons where each map unit is linked to hundreds of detailed soil attributes. gSSURGO data are available at the 10m scale when downloaded as a raster file for individual states. We downloaded gSSURGO raster files for each of the 13 states in our study region (Soil Survey Staff, 2021), and used Extract by Location function in ArcMap to link unique SSURGO map unit identifying information (i.e., the map unit key) to soil P sample locations. We selected 81 attributes from the gSSURGO dataset with the potential to control soil P abundance, including aspects of parent material, grain size, organic matter content, and mineral content (Records et al., 2016). Some of these attributes are mapped at scales smaller than the map unit scale (e.g., the component scale). To reconcile this spatial discrepancy, we preprocessed numeric SSURGO attributes by taking the weighted average of values for all components in a map unit, where weights were the proportional contribution by area of each component to the map unit. For categorical SSURGO attributes, we selected the attribute value for the component comprising the highest percent area of the map unit. If two attribute values comprised the same percent area of the map unit, we selected the component attribute with the higher slope value as the tie breaker, because surface P runoff is likely to be higher for soils with greater slope.
2.4.5 StreamCat Dataset
U.S. EPA’s StreamCat dataset contains information for over 600 different environmental metrics linked to individual stream reaches in the NHDv2Plus dataset (Hill et al., 2015). These metrics summarize diverse geospatial attributes -- including aspects of land cover including land use, impervious surfaces and road density, soil type, point source and nutrient inputs, and climatic factors (temperature and precipitation), among others – at the catchment and watershed scale draining into each reach. In contrast to the NLCD 2006 data, which allowed us to capture site-specific land cover information for soil P samples within a 30 m pixel, i.e. StreamCat attributes are already summarized at the catchment (local drainage area) and watershed scales and thus offered the potential to capture land use trends at a slightly broader, but still local, geographic scale that might be relevant to soil P concentrations.
We downloaded the StreamCat dataset from the US EPA data repository, and linked StreamCat variables to each stream catchment for which soil P samples were available. We included attributes from StreamCat variables potentially relevant to soil P across our study region (Table S1). These tables included 220 attributes summarized at the watershed and catchment in our dataset. StreamCat contains National Land Cover Dataset (NLCD) attributes for multiple years (2001, 2006, 2011, 2016); we used land cover attributes only from the NLCD 2006 dataset, in keeping with our temporal window in which most soil P samples were collected.
2.5 Random Forest Modeling
The primary motivation behind the work described here is to enable quantification of how much sediment-associated P might be contributed from different areas of the landscape to river networks, for use in water quality conservation and planning in the UMRB. Thus, we sought to generate the most accurate predictions for soil P possible for our study region, using all of the different types of geospatial predictive variables readily available to us. Random forest regression is a nonparametric ensemble learning method that utilizes predictions from multiple decision trees to improve model accuracy. Each tree is composed of branches (“nodes”) representing yes-no questions where features (i.e., predictive variables) are used to split the dependent variable into two groups that minimize in-group variability and maximize between group variability. We selected random forest as our modeling method because these models can be highly accurate, are relatively fast to develop, can handle categorical and numerical attributes, are robust to outliers, and can handle both non-linear and unbalanced data, all of which were pertinent criteria for our dataset. Random forest models also require very few assumptions of the input datasets. At the same time, some limitations of random forest modeling methods should be noted: mainly, the machine-learning techniques are characterized by the data mechanisms as ‘black box’ or ‘gray box’ algorithmic approaches (Jeong et al., 2016). Particularly, RF is built on non-parametric advanced classification and regression tree (CART) analysis methods and models may not be fully described mechanistically (Breiman, 2001). Also, the RF algorithm has the potential to overfit data, and may become impractical for making predictions beyond the training data range (Jeong et al., 2016). Given the inherent RF structure, permutation of many trees may make the algorithm slow for real-time prediction (Breiman, 2001).
Random forest model development followed the general scheme for machine learning models articulated by Zhong et al., 2021 (Figure 2). We used a tidymodels framework (Kuhn and Wickham, 2020) in R to develop random forest models for predicting soil P from the assembled predictors. Within this framework, we used both the ranger package (Wright and Ziegler, 2017) and the randomForest (Breiman et al., 2022) packages at different points in the modeling process. Both ranger and randomForest have been shown to perform as well or better than other machine learning approaches for predictive purposes (Hagenauer et al., 2019), and have been shown to produce very similar model results to one another (Wright and Ziegler, 2017). We used the ranger package for initial model tuning, due to its considerably faster implementation relative to randomForest (Wright and Ziegler, 2017). Once optimal hyperparameter values had been identified using this approach, we applied these hyperparameter values to a randomForest implementation to create our final model, because randomForest model objects (but not ranger model objects) are compatible with conditional permutation methods for estimating the relative importance of predictors to model outcomes (see rationale for the approach we took to evaluate predictor importance below). Prior to switching from ranger to randomForest implementations, we confirmed that the two approaches produced nearly the exact same results in terms of model performance (as measured by R2 and RMSE for actual vs predicted values on an independent test dataset), when the same hyperparameter values were used.
2.5.1 Model tuning
A complete R script detailing our approach can be found at https://github.com/cldolph/SoilP. After pre-processing soil predictors as described above, we linked attributes to soil P samples and removed variables that did not contain useful information (e.g., all rows = 0). We also excluded attributes where attributes information was missing (‘NA’) for >20% of soil P samples. Finally, we also excluded three attributes from the gSSURGO dataset (pmkind, taxpartsize, and texcl) with category levels that did not encompass the entire set of the categories contained within the UMRB grid dataset (see below). Many of the remaining attributes still contained some missing values. Because random forest models cannot handle missing values in predictor variables, we used the missRanger package in R (Mayer, 2021) to impute the remaining missing values for the training and testing datasets. missRanger imputes missing values for each variable by building a random forest model for each variable that uses all other variables in the dataset as covariables. Prior to random forest modeling, we normalized (i.e., centered and scaled) numeric attributes to have a mean of zero and a standard deviation of one.
For the soil P random forest model, we used 90% of the data for model training, and the remaining 10% for model testing. Using the training dataset and the ranger implementation for random forest modeling, we applied 10-fold cross validation to tune model hyperparameters across a range of possible values. The hyperparameters selected for tuning were: mtry (i.e., number of variables randomly sampled as candidates at each split) and min_n (i.e., the minimum number of data points in a node). The trees hyperparameter (i.e., number of trees) was set to 1000 across all models. K-fold cross validation can assist in avoiding model over-fitting and works by partitioning training data into K equal sized “folds” (in our case 10). The model is iteratively trained on various combinations of tuning hyperparameters across K-1 folds, leaving the remaining fold to evaluate model performance for each combination. We defined a grid of 20 potential combinations of hyper-parameters using the tune_grid () function from the tidymodels collection of packages in R. This approach draws hyperparameter values semi-randomly from parameter space such that the various combinations cover the whole space of potential values. We selected hyperparameter values using out-of-bag (OOB) rmse and R2 for the associated models.
Once hyperparameter values were tuned, we re-ran the random forest model using the randomForest package, to create a randomForest object that was compatible with our selected measure of predictive variable importance (see next section). We evaluated overall model performance based on the independent test dataset (comprising 10% of the original dataset).
2.6 Predictive Variable Importance
Many implementations of random forest models come with default measures used to quantify the relative importance of individual predictive variables to model performance. However, recent work has shown that many of these default strategies, including those based on measures of impurity or permutation-based metrics, can produce biased results when model predictors 1) vary in their scale of measurement; 2) vary in their number of categories; or 3) are highly correlated to one another (Hooker et al., 2021). When these conditions apply, as they do for predictors in our soil P dataset, alternative measures of importance such as conditional permutation importance (CPI) may be more appropriate, and less biased towards collinear predictive variables ( Debeer and Strobl, 2020; Hooker et al., 2021). CPI aims to capture the dependence between a predictor and the response variable, conditionally on the values of all other predictors. That is, CPI is a measure that can be used to assess how much each variable ‘adds’ to accurately predicting the response variable, given what we know from all other predictive variables. The tradeoff is that CPI methods can be computationally expensive. To derive less biased estimates of variable importance to the random forest model performance, we implemented the CPI approach from the permimp package in R (Debeer et al., 2021). This approach builds upon the widely used party package for CPI, while offering improved computational speed, accounting for non-linear dependence between predictors, less sensitivity to dataset sample size, and greater stability of results (Debeer and Strobl, 2020). To date, permimp can be applied to randomForest or cforest model objects in R (but not to ranger objects). In permimp, a threshold value, equal to 1 – the p-value for the association between predictor variables, is used to determine whether to include a predictor in the conditioning for the predictor of interest. We used the default value for the threshold parameter in permimp (0.95) because Debeer and Strobl (2020) advised utilizing threshold values close to 1 for larger datasets.
2.7 Predicting soil P across the Upper Mississippi River Basin
One of our goals of this effort was to predict soil P values for unsampled locations across the UMRB. To do this, we created a grid of points spaced 100m apart across the entire UMRB (UMRB grid), using the Create Fishnet tool in ArcMap version 10.8. We linked these locations to the same set of attributes used in our random forest model (i.e., attributes from the NHDPlusv2, NLCD 2006, NWI, gSSURGO, and StreamCat datasets). We excluded grid locations that coincided with locations of open water. We used the High Performance Cluster at the University of Minnesota’s Supercomputing Institute (https://www.msi.umn.edu/) to assign attributes to the UMRB grid points, and to generate predictions for the grid locations using the random forest model we developed. For this effort, we assigned all predictions to the soil surface (max depth = 30 cm). Lastly, we used the Inverse Distance Weighting (IDW) tool in ArcMap version 10.8 to interpolate a raster surface of soil P values at the 100m grid scale for the UMRB.
[1] NGS: https://mrdata.usgs.gov/geochem/; NCSS: https://ncsslabdatamart.sc.egov.usda.gov/