Predicting high resolution total phosphorus concentrations for soils of the Upper Mississippi River Basin using machine learning

doi:10.21203/rs.3.rs-2285751/v1


Research Article



This work is licensed under a CC BY 4.0 License

Journal Publication

published 28 Mar, 2023

Read the published version in Biogeochemistry →

You are reading this latest preprint version

The spatial distribution of soil phosphorus (P) is important to both biogeochemical processes and the management of agricultural landscapes, where it is critical for both crop production and conservation planning. Recent advances in the availability of large environmental datasets together with big data analytical tools like machine learning have created opportunities for evaluating and predicting spatial patterns in complex environmental variables like soil P. Here, we apply a random forest machine learning model to publicly available soil P datasets together with nearly 300 geospatial attributes summarizing aspects of soil type, land cover, land use, topography, nutrient inputs, and climate to predict total soil P at a 100 m grid scale for the Upper Mississippi River Basin (UMRB), USA. The UMRB is one of the most intensively farmed regions in the world and is characterized by widespread water quality degradation arising from P-associated eutrophication. At the regional scale represented by our model, the variables with the greatest comparative importance for predicting soil P included a combination of soil sample depth, land use/land cover, underlying soil physical and geochemical properties, landscape features (such as slope, elevation and proximity to the stream network), nutrient inputs, and climate-related factors. An important product of this research is a fine-scale (100 m) raster data layer of predicted total soil P values for the UMRB for public use. This dataset can support conservation planning and modeling efforts aimed at improving water quality in the region.

soil phosphorus

modeling

random forest

conservation

data mining

water quality

Phosphorus (P) is an essential nutrient in terrestrial and aquatic systems, with strong influences on plant productivity, element cycling and many other aspects of ecosystems (Vitousek et al., 2010). In excess, P has strong negative impacts on water quality and aquatic ecosystems arising from diverse effects of eutrophication (Schindler 2006; Clark and Longo, 2018). Because soils are the main source of phosphorus for terrestrial and freshwater ecosystems, understanding of ecosystem dynamics and management of phosphorus enrichment due to human activities requires knowledge of distributions of soil P. While a large majority of total soil phosphorus exists in diverse forms that are not immediately available to plants (i.e., primary and secondary minerals, organic P), these pools of soil P determine general P availability and influence losses to aquatic ecosystems (Hou et al., 2016; Vitousek et al., 2010). Thus, knowledge of total soil P distributions can provide fundamental information for research and management in terrestrial and aquatic ecosystems.

In the Cornbelt and Great Plains regions of the United States and Canada, decades of agricultural practices including the conversion of perennial vegetation into row crops, intensive inputs of synthetic fertilizers, and the spreading of manure from concentrated animal feeding operations (CAFOs), have led to an accumulation of legacy phosphorus in soils well beyond their retention capacity (Goyette et al., 2018). Wastewater sources also contribute to P export, although their impact is comparatively small for most industrially managed agricultural watersheds (Boardman et al., 2019). In many areas across the region, legacy P continues to be compounded by ongoing fertilizer and manure inputs, exacerbating the challenges of addressing nonpoint source pollution (Stackpoole et al., 2019; Boardman et al. 2019; Van Meter et al., 2021). Regional studies indicate that levels of soil test P (i.e., P that is readily available for crop uptake) are often well above concentrations needed to maintain crop production (e.g., Fernandez et al., 2012). These anthropogenic P inputs are overlaid on complex underlying controls on soil P content and transport, including soil parent material, soil texture, and landscape geomorphology (Records et al., 2016).

While eroded soil from agricultural fields has long been known to contribute to watershed P export, recent efforts have also highlighted the potential for other landscape components, such as lakes and near channel environments (i.e., riparian corridors, stream banks and bluffs, and riparian wetlands), to modify the transport of P to riverine environments. For example, lakes, riparian buffers and wetlands can trap P as net sinks, but after long term, high enrichment have the potential to become sources of dissolved P to river networks (Dodds and Sharpley, 2015; Wu et al., 2022). Stream bank and bluff erosion can also be an important source of particulate P to river networks (Schilling et al. 2021). The contribution of streambanks to riverine P export can depend on soil P content, bank erosion rates, bank and bluff area, and soil bulk density (Boardman et al. 2019; Schilling et al., 2021).

Field work efforts to measure soil P at high spatial resolution can be time intensive and costly, if not infeasible, across large spatial extents. Although soil P is predicted quite well at small spatial scales by coincident soil properties such as pH, texture, particle size and soil organic matter (e.g., Hosseini et al., 2017), such microscale soil attributes are generally not available at regional scales. Moreover, P heterogeneity at regional scales may also be affected by broader scale drivers such as climate, parent material, and glacial history. Thus, efforts to predict soil P from existing geospatial and climatic datasets have considerable utility within larger biophysical modeling efforts designed to support conservation planning.

Recent advances in the availability of large environmental datasets together with analytical tools like machine learning models have created new opportunities for predicting spatial patterning in complex environmental outcomes (Zhong et al., 2021). Total P at the soil surface represents a critical pool of P that is potentially erodible, reactive, and eventually bioavailable and thus highly relevant to both crop production and water quality. Machine learning represents a potentially useful tool to help predict spatial variability in total soil P, which is driven by complex, interacting drivers. Previously, machine learning has been used together with detailed geospatial data to predict soil attributes including % carbon and total nitrogen content (Ramcharan et al., 2018), and with remote sensing data to predict the nutrient content of soils at the scale of a single agricultural field (Sahabiev et al., 2021). Machine learning has also been used to predict nutrient concentration in river networks (e.g., Shen et al., 2020; Sadayappan et al., 2022). In this study, we develop a random forest machine learning model from large, publicly available datasets of soil P and nearly 300 predictive variables including soil type, land cover, topography, nutrient inputs, hydrology and climate to predict total surface soil P at a 100m grid scale for the Upper Mississippi River Basin (UMRB), USA. The UMRB is one of the most intensively farmed regions in the world and is characterized by widespread water quality degradation arising from P-associated eutrophication (Jacobson et al., 2011). Predicting the spatial variability of total soil P for this landscape has direct implications for conservation planning for water quality improvement and for understanding the resilience of soils for crop production (Qiao et al., 2022). 
Our objectives for this research were to 1) highlight the potential of existing, underutilized publicly-funded datasets to support conservation planning; 2) predict soil P at a fine (100 m) resolution across the UMRB and provide results in an open access data framework for use in future conservation planning and analysis; and 3) identify geospatial attributes important in predicting soil P at regional scales.

2.1 Study area

The focus of this study is the Upper Mississippi River Basin (UMRB), which comprises the uppermost tributaries to the Mississippi River, from the headwaters at Lake Itasca in Minnesota down to the confluence of the Mississippi and Missouri Rivers at St. Louis, Missouri (Figure 1).

The UMRB covers nearly half a million square kilometers (492,027 km²) of the upper Midwestern United States, including large parts of Minnesota, Wisconsin, Iowa, Missouri, Illinois, and smaller areas of Indiana, North Dakota, and South Dakota, and corresponds to the HUC2-scale watershed (HUC ID = 07) in the USGS Watershed Boundary Dataset (2021). The UMRB is intensively farmed for large scale monocultures of primarily corn and soybeans. CAFOs are also prolific in the basin, particularly across Iowa and southern Minnesota (US EPA, 2013). Overall, cultivated crops comprise 49% of total basin land area (Dewitz, 2019); ~ 85% of those crops are corn and soybeans (USDA, 2010). The intensity of agricultural land use varies across a gradient from north to south, with the uppermost parts of the basin (in Minnesota and Wisconsin) characterized by relatively higher amounts of forest and wetland cover compared to the southern regions of the basin, where crop land use can exceed 85%. However, even the northern areas of the basin are undergoing rapid land use change from wetland and forest cover towards more agriculture (MPCA, 2017; Green et al., 2018). The precipitation gradient varies from comparatively drier in the north and west to wetter in the south and east. Soils in the basin are largely silty loam and loam, but also include areas of poorly drained clays as well as areas of well drained sands. Topography in the UMRB is generally flat or gently rolling; however some areas, particularly in the Minnesota River Basin, are characterized by steep bluffs created by an incisional process initiated by the draining of glacial Lake Agassiz (Gran et al., 2019).

2.2 Data processing and analysis overview

Data processing and analysis were done in RStudio v. 2022.07.1. All of the input data and methods used to generate the machine learning model for predicting soil P (including data acquisition, data preprocessing, data splitting, model training, evaluation and interpretation) are publicly available at https://github.com/cldolph/SoilP. The workflow used to develop the model is summarized in Figure 2.

2.3 Soil phosphorus data

For model training and testing, we used nearly 7,000 soil P measurements from two spatially extensive datasets: 1) the USGS National Geochemical Survey (NGS) Database (USGS, 2004), and 2) the National Cooperative Soil Survey (NCSS) Characterization Dataset (NCSS, 2021). Both of these datasets are publicly available [1] and contain information about soil P, among other soil attributes. We focused on total soil P rather than soil test P in our model development for several reasons. First, the NGS does not contain information about soil test P. Although the NCSS does contain information about various soil test P measures (e.g., Bray 1 P, Olsen P, etc.), soil test P is not measured consistently across the geography of the dataset. Because different soil test P measures extract variable amounts of P, they may not be readily comparable to one another (Wuenscher et al., 2015), making it difficult to compile and interpret multiple soil test P measures into one cohesive dataset. By contrast, total soil P data across the NGS and NCSS were considerably more extensive than those for any one soil test P measure, creating the opportunity for stronger machine learning models. Second, total soil P is a fundamental ecosystem property with broad applications in conservation, biogeochemistry, and agronomy. While soil test P may more directly indicate the crop availability and risk of loss of soluble P to downstream water bodies (e.g., Vadas et al., 2005), total P provides a more consistent and integrative measurement of soil P status. Moreover, previous work has shown total P to correlate with soil test P measures (especially when other soil properties are also accounted for; Allen and Mallarino, 2006), and thus total P can be related to the potential for both erosive and soluble P losses. Finally, water quality monitoring and management efforts are often pegged to estimates of total P in aquatic systems, rather than estimates of ‘bioavailable’ P.
The complexity of P cycling and transformation in terrestrial and aquatic environments is a good reason to track total P in our initial machine learning efforts. Future applications of this approach may be extended to soil test P, particularly as more data become available from high frequency sampling to account for high variability in these forms, as the bioavailability of different components of total P may change over time.

Soil P estimates from these two datasets provided only partial spatial cover of the UMRB (Figure 1). To expand our potential to predict soil P for different types of soils and landcover conditions present across the UMRB, we expanded the regional range of data used to build our predictive model to include locations that were outside of the UMRB boundary but that shared potentially similar soils, climate and/or land management. This expanded study region included the following 13 U.S. states: Arkansas, Illinois, Indiana, Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin (Figure 1).

2.3.1 National Geochemical Survey Dataset (NGS)

The NGS contains 284 geochemical attributes for soil and stream sediment samples across the United States (n=77,212). While the primary focus of the dataset is stream sediments, the dataset contains a considerable number of soil samples (n=19,992). Of these soil samples, 19,574 samples contain information about total soil P. Total soil P estimates (% dry weight) were obtained using inductively coupled plasma spectrometry after acid dissolution (USGS, 2004). We converted soil P estimates to mg/kg by multiplying % weight estimates by 10,000.
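The unit conversion described above follows from 1% by weight being 10,000 parts per million (mg/kg). A minimal sketch in Python is given here for illustration only; the authors' workflow is in R, and the function name is hypothetical.

```python
# Illustrative sketch only: the function and constant names are hypothetical.
PERCENT_TO_MG_PER_KG = 10_000  # 1 % by weight = 10,000 mg/kg


def percent_to_mg_per_kg(percent_dry_weight: float) -> float:
    """Convert a concentration from % dry weight to mg/kg."""
    return percent_dry_weight * PERCENT_TO_MG_PER_KG


if __name__ == "__main__":
    # A sample reported as 0.06 % total P by dry weight.
    print(round(percent_to_mg_per_kg(0.06), 1))
```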

Samples in the NGS dataset were collected between 1900 and 2008. Since we were most interested in developing modeling approaches capable of predicting relatively current soil P conditions from concurrent spatial attribute data, we subsampled the dataset to those samples collected between 2000 and 2008, leaving 9,589 samples. We further restricted samples to those found in our geographic study region, leaving a total of 6,664 soil P samples (Figure 1).

For most samples, the NGS contains information about soil depth. However, depth was represented inconsistently across the dataset, with some samples associated with minimum and maximum numeric depth estimates (in inches), some samples associated with a soil horizon category (i.e., “A”, “B”, etc), and some samples lacking any kind of information about depth or horizon.

For samples where numeric depth information was available, we designated a predictor variable ‘depth’ as the maximum depth pertaining to that sample. For soil samples without numeric depth measurements, we assumed a maximum depth of 30 cm if samples were listed as occurring in the soil A horizon (corresponding to the surface or plowable layer). We also assumed a maximum depth of 30 cm for samples with no depth or horizon information provided (that is, we assumed samples without depth information were surface samples). If samples occurred in any other soil horizon (and if no numeric depth information was provided) we designated samples as ‘subsurface’, and assigned them a maximum depth of 76 cm based on information from samples where depth data were present (and corresponding primarily to the soil B horizon).
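The depth rules above can be expressed as a small decision function. The sketch below is an illustration in Python, not the authors' R code. It assumes, beyond what the text states, that numeric depths recorded in inches are converted to centimeters and that an A horizon is recognized by a label beginning with "A"; the function and constant names are hypothetical.

```python
from typing import Optional

SURFACE_MAX_DEPTH_CM = 30.0     # A horizon, or no depth/horizon information
SUBSURFACE_MAX_DEPTH_CM = 76.0  # any other horizon without numeric depth
CM_PER_INCH = 2.54              # assumption: inches are converted to cm


def assign_max_depth_cm(max_depth_in: Optional[float],
                        horizon: Optional[str]) -> float:
    """Return the 'depth' predictor (cm) for one soil sample."""
    # Rule 1: use the recorded maximum depth when it exists.
    if max_depth_in is not None:
        return max_depth_in * CM_PER_INCH
    # Rule 2: no depth and no horizon -> treat as a surface sample.
    if horizon is None or horizon.strip() == "":
        return SURFACE_MAX_DEPTH_CM
    # Rule 3: A horizon -> surface (assumed label convention).
    if horizon.strip().upper().startswith("A"):
        return SURFACE_MAX_DEPTH_CM
    # Rule 4: any other horizon -> subsurface.
    return SUBSURFACE_MAX_DEPTH_CM
```

Checking for a numeric depth first mirrors the order of the rules in the text, so the horizon-based defaults apply only when no measurement exists.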

2.3.2 National Cooperative Soil Survey

The NCSS contains data commonly requested for agronomic and biogeochemical purposes for 404,080 soil samples across the United States analyzed by the Kellogg Soil Survey Laboratory and cooperating universities. Of these, 10,678 samples contained information about total P, with 961 samples occurring in our study region. As for the NGS dataset, we filtered samples to include only those collected since 2000, leaving 558 samples in the remaining NCSS dataset. Total soil P estimates (mg/kg) were obtained using inductively coupled plasma spectrometry after acid dissolution (Soil Survey Staff, 2014). All samples included an estimate of the top and bottom depth of the soil horizon sampled in centimeters. We used the estimate of horizon bottom depth as the estimate of depth associated with each sample.

After subsampling the data, the combined NGS and NCSS datasets yielded a total of 7,222 soil P samples for model development and testing across the Midwestern United States (Figure 1).

2.4 Predictive variables

To predict soil P across the UMRB, we sought attributes that had previously been identified as known drivers of P abundance, including soil properties such as parent material, mineral and organic content, grain size and texture, as well as attributes describing landscape position, land use, climate and anthropogenic inputs (Records et al., 2016; Deiss et al., 2018; He et al., 2021). We used geospatial attributes from multiple publicly available datasets to summarize aspects of these potential drivers of soil P. These datasets included: i) the National Hydrography Dataset Plus v2 (NHDPlusv2) [2]; ii) the National Land Cover Database (NLCD) for 2006 [3]; iii) the National Wetlands Inventory [4]; iv) the gridded Soil Survey Geographic (gSSURGO) Database (10 m resolution) [5]; and v) the U.S. EPA StreamCat dataset [6]. Overall, we included 268 attributes derived from these datasets at multiple spatial scales as predictors in soil P models. Each of these datasets is described in more detail below, and specific attributes used in soil P models are listed in Appendix Table S1. Once soil samples were joined to all predictors, 6,683 soil samples had complete attribute information available for use in modeling.

2.4.1 National Hydrography Dataset Plus v2 (NHDPlusv2)

We sought to include information about near channel areas (i.e., the 100m buffer surrounding the stream network) in our predictor dataset, because of the possibility for these environments to accumulate P relative to upland areas. For example, in a review of the effects of conservation practices on water quality, Dodd and Sharpley (2015) found that near channel areas such as riparian buffers, including grass and vegetative filter strips and wetlands, accumulated labile forms of soil P over time. We derived a categorical attribute indicating whether soil samples were collected in the 100m riparian buffer of various stream categories defined in the NHDPlusv2 layer, including ditches, intermittent streams, and perennial streams. The NHDPlusv2 includes a digital stream network layer for the conterminous United States, divided into reaches (i.e., unique links in the network) as well as catchment boundaries associated with each reach. Catchments (i.e., local drainage areas) include the immediate land area draining into each individual stream reach excluding areas draining to upstream reaches. Our study region overlaps with the NHDPlusv2 network for the following drainage basins: Great Lakes, Ohio, Upper Mississippi, Lower Mississippi, Souris-Red-Rainy, Upper Missouri, Lower Missouri, and Arkansas-Red-White. Buffering of the stream layer was conducted in ArcMap version 10.8.1. We also used the NHDPlusv2 layer to assign soil P samples to NHD catchments, which were subsequently used to link soil samples to predictive variables in the StreamCat database (see below).

2.4.2 National Land Cover Database (NLCD)

Although the processes connecting land use to soil P content have yet to be fully elucidated, land use is a likely driver of soil P dynamics through multiple potential pathways, including alteration of P inputs and outputs associated with different land cover regimes, and through management practices that may alter biological, chemical and physical processes that affect P composition (Liu et al., 2018). We used the NLCD to summarize aspects of land cover for the UMRB. NLCD layers contain information about 16 land cover classes across the United States at 30m resolution. Of the available NLCD layers (available every 5 years from 1992-2016), we chose NLCD 2006 because most soil P samples in our dataset (85%) were collected between 2000 and 2006, and all were collected before 2011. We used the spatial join function in ArcMap to link land cover classes to soil P sample locations.

2.4.3 National Wetlands Inventory

Similar to riparian buffers, wetlands can accumulate P by intercepting P-rich runoff from surrounding drainage areas (Dodds and Sharpley, 2015); thus we sought to include wetland presence and type as predictors in machine learning models for soil P. The National Wetlands Inventory (NWI) is a publicly available dataset that captures distribution and attribute information for wetlands in all 50 U.S. states. We downloaded NWI spatial data for all 13 states included in our study region. We derived a categorical attribute indicating whether soil samples were collected from a series of possible wetland types included in the NWI, including freshwater emergent wetlands, freshwater forested/shrub wetlands, freshwater ponds, lakes, or riverine wetlands, or from non-wetland areas. We used the spatial join function in ArcMap to link wetlands in the NWI layer to soil P sample locations.

2.4.4 gSSURGO

A large body of previous work has identified the importance of soil properties such as mineral and organic content, parent material, and soil texture to P abundance (e.g., Records et al., 2016). We used the gSSURGO dataset to assign soil properties to P sample locations. gSSURGO is a detailed soil geographic geodatabase developed by the USDA National Cooperative Soil Survey, containing information for more than 36 million mapped soil features. The gSSURGO database is organized by map unit soil polygons where each map unit is linked to hundreds of detailed soil attributes. gSSURGO data are available at the 10 m scale when downloaded as a raster file for individual states. We downloaded gSSURGO raster files for each of the 13 states in our study region (Soil Survey Staff, 2021), and used the Extract by Location function in ArcMap to link unique SSURGO map unit identifying information (i.e., the map unit key) to soil P sample locations. We selected 81 attributes from the gSSURGO dataset with the potential to control soil P abundance, including aspects of parent material, grain size, organic matter content, and mineral content (Records et al., 2016). Some of these attributes are mapped at scales smaller than the map unit scale (e.g., the component scale). To reconcile this spatial discrepancy, we preprocessed numeric SSURGO attributes by taking the weighted average of values for all components in a map unit, where weights were the proportional contribution by area of each component to the map unit. For categorical SSURGO attributes, we selected the attribute value for the component comprising the highest percent area of the map unit. If two components comprised the same percent area of the map unit, we selected the attribute value of the component with the higher slope value as the tie breaker, because surface P runoff is likely to be higher for soils with greater slope.
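The component-to-map-unit aggregation described above can be sketched as follows. This is an illustrative Python example rather than the authors' R code. The `Component` fields are hypothetical stand-ins for SSURGO attributes, and dividing by the summed weights is an assumption made so the average remains valid when component percentages do not total 100.

```python
from typing import NamedTuple, Sequence


class Component(NamedTuple):
    percent_area: float   # share of the map unit covered by this component
    slope: float          # used only as the tie breaker
    clay_percent: float   # example numeric attribute (hypothetical)
    texture_class: str    # example categorical attribute (hypothetical)


def weighted_numeric(components: Sequence[Component], attr: str) -> float:
    """Area-weighted average of a numeric attribute across components."""
    total_weight = sum(c.percent_area for c in components)
    weighted_sum = sum(getattr(c, attr) * c.percent_area for c in components)
    return weighted_sum / total_weight


def dominant_categorical(components: Sequence[Component], attr: str) -> str:
    """Value from the largest component; ties go to the steeper slope."""
    dominant = max(components, key=lambda c: (c.percent_area, c.slope))
    return getattr(dominant, attr)


if __name__ == "__main__":
    unit = [Component(60, 2, 20, "loam"), Component(40, 5, 30, "clay")]
    print(weighted_numeric(unit, "clay_percent"))
    print(dominant_categorical(unit, "texture_class"))
```

Sorting on the tuple `(percent_area, slope)` applies the slope rule only when areas are equal, which matches the tie-breaking order in the text.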

2.4.5 StreamCat Dataset

U.S. EPA’s StreamCat dataset contains information for over 600 different environmental metrics linked to individual stream reaches in the NHDPlusv2 dataset (Hill et al., 2015). These metrics summarize diverse geospatial attributes at the catchment and watershed scales draining into each reach, including land cover and land use, impervious surfaces and road density, soil type, point source and nutrient inputs, and climatic factors (temperature and precipitation), among others. The NLCD 2006 data allowed us to capture site-specific land cover information for soil P samples within a 30 m pixel. By contrast, StreamCat attributes are already summarized at the catchment (local drainage area) and watershed scales, and thus offered the potential to capture land use trends at a slightly broader, but still local, geographic scale that might be relevant to soil P concentrations.

We downloaded the StreamCat dataset from the US EPA data repository, and linked StreamCat variables to each stream catchment for which soil P samples were available. We included StreamCat attributes potentially relevant to soil P across our study region (Table S1), comprising 220 attributes summarized at the watershed and catchment scales. StreamCat contains National Land Cover Database (NLCD) attributes for multiple years (2001, 2006, 2011, 2016); we used land cover attributes only from the NLCD 2006 dataset, in keeping with the temporal window in which most soil P samples were collected.

2.5 Random Forest Modeling

The primary motivation behind the work described here is to enable quantification of how much sediment-associated P might be contributed from different areas of the landscape to river networks, for use in water quality conservation and planning in the UMRB. Thus, we sought to generate the most accurate predictions for soil P possible for our study region, using all of the different types of geospatial predictive variables readily available to us. Random forest (RF) regression is a nonparametric ensemble learning method that utilizes predictions from multiple decision trees to improve model accuracy. Each tree is composed of branches (“nodes”) representing yes-no questions where features (i.e., predictive variables) are used to split the dependent variable into two groups that minimize in-group variability and maximize between-group variability. We selected random forest as our modeling method because these models can be highly accurate, are relatively fast to develop, can handle categorical and numerical attributes, are robust to outliers, and can handle both non-linear and unbalanced data, all of which were pertinent criteria for our dataset. Random forest models also require very few assumptions of the input datasets. At the same time, some limitations of random forest modeling methods should be noted. Chiefly, machine learning techniques of this kind are ‘black box’ or ‘gray box’ algorithmic approaches (Jeong et al., 2016): RF is built on non-parametric classification and regression tree (CART) analysis methods, and the resulting models may not be fully described mechanistically (Breiman, 2001). The RF algorithm also has the potential to overfit data, and may become impractical for making predictions beyond the training data range (Jeong et al., 2016). Finally, given the inherent RF structure, evaluating many trees may make the algorithm slow for real-time prediction (Breiman, 2001).
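To make the splitting idea concrete, the sketch below searches one numeric predictor for the threshold that minimizes the summed within-group variability (sum of squared deviations) of the response, which is the criterion a single regression-tree node applies. It is a simplified Python illustration with hypothetical names. It omits the bootstrap sampling and random feature subsets that a random forest adds, and it is not the ranger or randomForest implementation.

```python
from typing import List, Optional, Sequence, Tuple


def sum_of_squares(values: Sequence[float]) -> float:
    """Sum of squared deviations from the group mean (0 for empty groups)."""
    if not values:
        return 0.0
    center = sum(values) / len(values)
    return sum((v - center) ** 2 for v in values)


def best_split(feature: Sequence[float],
               response: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (threshold, within-group sum of squares) for the best split."""
    unique_values: List[float] = sorted(set(feature))
    best: Optional[Tuple[float, float]] = None
    # Candidate thresholds are midpoints between adjacent unique values.
    for low, high in zip(unique_values, unique_values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(feature, response) if x <= threshold]
        right = [y for x, y in zip(feature, response) if x > threshold]
        cost = sum_of_squares(left) + sum_of_squares(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best  # None when the feature has fewer than two unique values
```

Because the total variability of the response is fixed, minimizing the within-group term is equivalent to maximizing the between-group term.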

Random forest model development followed the general scheme for machine learning models articulated by Zhong et al., 2021 (Figure 2). We used a tidymodels framework (Kuhn and Wickham, 2020) in R to develop random forest models for predicting soil P from the assembled predictors. Within this framework, we used both the ranger package (Wright and Ziegler, 2017) and the randomForest package (Breiman et al., 2022) at different points in the modeling process. Both ranger and randomForest have been shown to perform as well as or better than other machine learning approaches for predictive purposes (Hagenauer et al., 2019), and have been shown to produce very similar model results to one another (Wright and Ziegler, 2017). We used the ranger package for initial model tuning, due to its considerably faster implementation relative to randomForest (Wright and Ziegler, 2017). Once optimal hyperparameter values had been identified using this approach, we applied these hyperparameter values to a randomForest implementation to create our final model, because randomForest model objects (but not ranger model objects) are compatible with conditional permutation methods for estimating the relative importance of predictors to model outcomes (see rationale for the approach we took to evaluate predictor importance below). Prior to switching from ranger to randomForest implementations, we confirmed that the two approaches produced nearly identical results in terms of model performance (as measured by R² and RMSE for actual vs predicted values on an independent test dataset) when the same hyperparameter values were used.

2.5.1 Model tuning

A complete R script detailing our approach can be found at https://github.com/cldolph/SoilP. After pre-processing soil predictors as described above, we linked attributes to soil P samples and removed variables that did not contain useful information (e.g., all rows = 0). We also excluded attributes where attribute information was missing (‘NA’) for >20% of soil P samples. Finally, we excluded three attributes from the gSSURGO dataset (pmkind, taxpartsize, and texcl) with category levels that did not encompass the entire set of the categories contained within the UMRB grid dataset (see below). Many of the remaining attributes still contained some missing values. Because random forest models cannot handle missing values in predictor variables, we used the missRanger package in R (Mayer, 2021) to impute the remaining missing values for the training and testing datasets. missRanger imputes missing values for each variable by building a random forest model for each variable that uses all other variables in the dataset as covariables. Prior to random forest modeling, we normalized (i.e., centered and scaled) numeric attributes to have a mean of zero and a standard deviation of one.
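Two of the preprocessing steps above, screening attributes and normalizing numeric attributes, can be sketched in a few lines. The following Python example is illustrative only and is not the authors' R script. It treats any constant column as uninformative (the text gives "all rows = 0" as one example) and assumes the sample standard deviation is used for scaling; both are assumptions. Imputation with missRanger is not reproduced here.

```python
from statistics import mean, stdev
from typing import List, Optional, Sequence


def keep_attribute(values: Sequence[Optional[float]],
                   max_missing_fraction: float = 0.20) -> bool:
    """Screen one attribute column; None marks a missing value."""
    missing_fraction = sum(v is None for v in values) / len(values)
    if missing_fraction > max_missing_fraction:
        return False  # missing for more than 20% of samples
    observed = {v for v in values if v is not None}
    return len(observed) > 1  # constant columns carry no information


def standardize(values: Sequence[float]) -> List[float]:
    """Center to mean 0 and scale to standard deviation 1."""
    center = mean(values)
    spread = stdev(values)  # assumption: sample (n - 1) standard deviation
    return [(v - center) / spread for v in values]
```

Screening before scaling matters in this sketch, because a constant column has zero spread and would otherwise cause a division by zero.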

For the soil P random forest model, we used 90% of the data for model training, and the remaining 10% for model testing. Using the training dataset and the ranger implementation for random forest modeling, we applied 10-fold cross validation to tune model hyperparameters across a range of possible values. The hyperparameters selected for tuning were mtry (i.e., the number of variables randomly sampled as candidates at each split) and min_n (i.e., the minimum number of data points in a node required for it to be split further). The trees hyperparameter (i.e., number of trees) was set to 1000 across all models. K-fold cross validation can assist in avoiding model over-fitting and works by partitioning training data into K equal sized “folds” (in our case 10). The model is iteratively trained on various combinations of tuning hyperparameters across K-1 folds, leaving the remaining fold to evaluate model performance for each combination. We defined a grid of 20 potential combinations of hyperparameters using the tune_grid() function from the tidymodels collection of packages in R. This approach draws hyperparameter values semi-randomly from parameter space such that the various combinations cover the whole space of potential values. We selected hyperparameter values using out-of-bag (OOB) RMSE and R² for the associated models.
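The data partitioning behind this tuning scheme can be illustrated without any modeling library. The Python sketch below shows a 90/10 train/test split followed by 10-fold partitioning of the training indices. It is not the tidymodels code the authors used; the function names, the seeds, and the shuffling strategy are assumptions made for illustration.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def train_test_split(n_samples: int, test_fraction: float = 0.10,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and hold out a fraction for final testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices: Sequence[int], k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, held_out) index lists, one pair per fold."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # near-equal sized folds
    for held_out_id, held_out in enumerate(folds):
        training = [idx for fold_id, fold in enumerate(folds)
                    if fold_id != held_out_id for idx in fold]
        yield training, held_out
```

In a tuning loop, each candidate hyperparameter combination would be fitted on every `training` list and scored on the matching `held_out` list, with scores averaged across the 10 folds; the test indices stay untouched until the final evaluation.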

Once hyperparameter values were tuned, we re-ran the random forest model using the randomForest package, to create a randomForest object that was compatible with our selected measure of predictive variable importance (see next section). We evaluated overall model performance based on the independent test dataset (comprising 10% of the original dataset).
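For reference, the two performance measures named in this section can be computed from observed and predicted values as sketched below. This Python example is illustrative. It uses the coefficient-of-determination form of R² (one minus the ratio of residual to total sum of squares), which is one common definition; the text does not state which definition the authors' software applied, so that choice is an assumption.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    center = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - center) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total  # undefined if observed is constant
```

RMSE is expressed in the units of the response (here mg/kg), while R² is unitless, so the two measures are complementary.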

2.6 Predictive Variable Importance

Many implementations of random forest models come with default measures used to quantify the relative importance of individual predictive variables to model performance. However, recent work has shown that many of these default strategies, including those based on measures of impurity or permutation-based metrics, can produce biased results when model predictors 1) vary in their scale of measurement; 2) vary in their number of categories; or 3) are highly correlated to one another (Hooker et al., 2021). When these conditions apply, as they do for predictors in our soil P dataset, alternative measures of importance such as conditional permutation importance (CPI) may be more appropriate, and less biased towards collinear predictive variables (Debeer and Strobl, 2020; Hooker et al., 2021). CPI aims to capture the dependence between a predictor and the response variable, conditionally on the values of all other predictors. That is, CPI is a measure that can be used to assess how much each variable ‘adds’ to accurately predicting the response variable, given what we know from all other predictive variables. The tradeoff is that CPI methods can be computationally expensive. To derive less biased estimates of variable importance to the random forest model performance, we implemented the CPI approach from the permimp package in R (Debeer et al., 2021). This approach builds upon the widely used party package for CPI, while offering improved computational speed, accounting for non-linear dependence between predictors, less sensitivity to dataset sample size, and greater stability of results (Debeer and Strobl, 2020). To date, permimp can be applied to randomForest or cforest model objects in R (but not to ranger objects). In permimp, a threshold value, equal to 1 minus the p-value for the association between predictor variables, is used to determine whether to include a predictor in the conditioning for the predictor of interest.
We used the default value for the threshold parameter in permimp (0.95) because Debeer and Strobl (2020) advised utilizing threshold values close to 1 for larger datasets.
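The difference between marginal and conditional permutation importance can be illustrated with a small sketch. This is not the permimp algorithm, which derives its conditioning groups from the split points of the fitted trees; it is a simplified, assumed version in Python that shuffles one predictor either across all rows (marginal) or only within groups of rows sharing the same value of a correlated predictor (conditional). The toy data, the `model` function, and all names are hypothetical.

```python
import random
from collections import defaultdict


def mse(model, rows, y):
    """Mean squared error of `model` over a list of predictor rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)


def permutation_importance(model, rows, y, col, strata=None, seed=0):
    """Increase in MSE after shuffling column `col`.

    strata=None  -> marginal importance (shuffle across all rows).
    strata=[...] -> conditional importance (shuffle only within rows
                    that share the same stratum label).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i in range(len(rows)):
        groups[strata[i] if strata is not None else 0].append(i)

    permuted = [list(r) for r in rows]
    for idx in groups.values():
        values = [rows[i][col] for i in idx]
        rng.shuffle(values)
        for i, v in zip(idx, values):
            permuted[i][col] = v
    return mse(model, permuted, y) - mse(model, rows, y)


# Toy data: x1 takes values 0..3; x2 is a noisy copy of x1 (highly correlated).
rng = random.Random(42)
x1 = [rng.randrange(4) for _ in range(400)]
rows = [(a, a + rng.gauss(0, 0.1)) for a in x1]
y = [2 * a + rng.gauss(0, 0.1) for a in x1]


def model(row):
    # Hypothetical fitted model that happens to rely on x2 alone.
    return 2 * row[1]


marginal = permutation_importance(model, rows, y, col=1)
conditional = permutation_importance(model, rows, y, col=1, strata=x1)
print(f"marginal: {marginal:.3f}  conditional: {conditional:.3f}")
```

In this toy case the marginal importance of x2 is large, because shuffling it breaks its link to x1, while the conditional importance is close to zero: once x1 is known, x2 adds almost nothing. That is the sense in which CPI measures what a predictor contributes given the others.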

2.7 Predicting soil P across the Upper Mississippi River Basin

One goal of this effort was to predict soil P values for unsampled locations across the UMRB. To do this, we created a grid of points spaced 100 m apart across the entire UMRB (UMRB grid), using the Create Fishnet tool in ArcMap version 10.8. We linked these locations to the same set of attributes used in our random forest model (i.e., attributes from the NHDPlusv2, NLCD 2006, NWI, gSSURGO, and StreamCat datasets). We excluded grid locations that coincided with open water. We used the High Performance Cluster at the University of Minnesota’s Supercomputing Institute (https://www.msi.umn.edu/) to assign attributes to the UMRB grid points, and to generate predictions for the grid locations using the random forest model we developed. For this effort, we assigned all predictions to the soil surface (max depth = 30 cm). Lastly, we used the Inverse Distance Weighting (IDW) tool in ArcMap version 10.8 to interpolate a raster surface of soil P values at the 100 m grid scale for the UMRB.
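As a rough illustration of the gridding and interpolation steps, the following Python sketch builds a regularly spaced point grid and computes an inverse-distance-weighted estimate. It is not the ArcMap implementation: the function names are hypothetical, the power of 2 is an assumed setting (the text does not report the power or search neighborhood used), and every known point contributes to each estimate.

```python
import math


def make_grid(xmin, ymin, xmax, ymax, spacing=100.0):
    """Regularly spaced (x, y) points covering a bounding box, in metres."""
    nx = int((xmax - xmin) // spacing) + 1
    ny = int((ymax - ymin) // spacing) + 1
    return [(xmin + i * spacing, ymin + j * spacing)
            for j in range(ny) for i in range(nx)]


def idw(known, query, power=2.0):
    """Inverse-distance-weighted estimate at `query`.

    known: list of ((x, y), value) pairs, e.g. predicted soil P in mg/kg.
    """
    num = den = 0.0
    for (x, y), value in known:
        d = math.hypot(x - query[0], y - query[1])
        if d == 0.0:
            return value  # query coincides with a known point
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den


# Made-up values: two predicted points 200 m apart, queried at the midpoint.
known = [((0.0, 0.0), 500.0), ((200.0, 0.0), 700.0)]
grid = make_grid(0.0, 0.0, 200.0, 100.0)
midpoint_estimate = idw(known, (100.0, 0.0))
```

Because the two known points are equidistant from the query location, the estimate at the midpoint is simply their average.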

[1] NGS: https://mrdata.usgs.gov/geochem/; NCSS: https://ncsslabdatamart.sc.egov.usda.gov/

[2] https://nhdplus.com/NHDPlus/NHDPlusV2_data.php

[3] https://www.mrlc.gov/data/nlcd-2006-land-cover-conus

[4] https://www.fws.gov/wetlands/data/data-download.html

[5] https://nrcs.app.box.com/v/soils

[6] https://www.epa.gov/national-aquatic-resource-surveys/streamcat-dataset

3.1 Soil phosphorus

Mean soil total P for the broader Midwestern region included in the study was 580 mg/kg (range = 17–4370 mg/kg). The distribution of soil P values showed a slight right skew (Fig. 3). The dataset comprised 1,761 samples that could be considered to occur in the “plowable layer” at the soil surface (i.e., ≤ 30 cm deep), and 5,128 samples that occurred at depths > 30 cm (i.e., ‘subsurface’ samples). Surface soil samples had a mean total P concentration of 585.6 mg/kg (range = 80–4370 mg/kg). Subsurface soil samples had a mean total P concentration of 592.6 mg/kg (range = 17–4280 mg/kg). The dataset contained 1,085 samples that were located in “near channel” or riparian areas (i.e., within a 100 m buffer of the NHDPlusv2 stream network), and 5,798 samples that were non-riparian, i.e., located outside of the stream buffer zone (Figure S5). The dataset also contained 309 samples that were located within wetland soils, and 6,574 that were located in non-wetland soils.

Across all samples, 352 samples (5%) had soil P > 1000 mg/kg and were distributed throughout the study region (Figure S1). A very small number of samples (n = 5) had soil P > 3000 mg/kg. All of these samples occurred in the National Geochemical Survey dataset. We examined the original field notes for these samples to ascertain if these values could be attributed to error or other specific site features, but there was no obvious information pointing to sampling or analytical error.

Relative to samples with soil P < 1000 mg/kg, samples with very high soil P were located in areas with higher median crop productivity index for corn, lower available water storage in the root zone, relatively higher contribution of groundwater to streams in the local catchment and watershed, relatively higher rates of N deposition (as NO3⁻, NH4⁺ and inorganic N), higher pesticide and fertilizer use, higher precipitation and runoff, and higher cropland cover in the catchment, watershed and riparian zone (see Appendix Table S2). Higher contributions of groundwater to stream and river flow could indicate high permeability (as in karst terrain), a large contribution of drain tile (Schilling and Libra, 2003), or a lower contribution of surface water such as in the drier western part of the basin. Overall, predictor values for samples with high soil P suggest that very high soil P samples occur on average in areas with greater precipitation and well-drained soils, which are also associated with areas of more intense agriculture in the UMRB. This explanation is plausible for samples located in parts of the study region (e.g., Iowa) where these conditions are likely to occur. However, the spatial distribution of samples with high soil P (Figure S1) indicates that many of them also occurred in northern areas of the basin characterized by relatively little agriculture (e.g., northern Minnesota).

3.2 Model Performance

The original model included all soil P samples in model training and testing; however, this resulted in a model that tended to underpredict soil P for samples > 1000 mg/kg (Appendix Figure S2). As a result, we decided to increase model prediction accuracy for the vast majority (95%) of soil samples in our dataset by excluding the 5% of samples with soil P > 1000 mg/kg from the training and testing datasets (see Discussion). The final selected hyperparameters for the random forest model based on model tuning with 10-fold cross validation for this dataset were mtry = 50, trees = 1000, min_n = 3. Evaluation of predicted vs actual soil P values for the independent test dataset indicated a model RMSE of 0.661, and an R² of 0.58, for soil P samples < 1000 mg/kg (Fig. 4).
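For readers less familiar with the two performance metrics, the sketch below shows how RMSE and R² can be computed from observed and predicted values on a held-out test set. It is written in Python with made-up numbers; R² is computed here as the coefficient of determination (1 − SS_res/SS_tot), which is an assumption, since the text does not state which definition was used and some toolkits instead report the squared correlation between observed and predicted values.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the response."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed vs. predicted soil P values (mg/kg) for four test samples.
actual = [400.0, 500.0, 600.0, 700.0]
predicted = [450.0, 480.0, 610.0, 660.0]

test_rmse = rmse(actual, predicted)
test_r2 = r_squared(actual, predicted)
```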

3.3 Predictive Variable Importance

Conditional permutation importance (CPI), estimated with the permimp function, indicated that soil sample depth had by far the strongest relative impact on model performance, followed by land use in the immediate area (i.e., the 30 m grid cell) in which the soil sample was collected (Fig. 5). The remaining top predictive variables are shown ranked by importance in Fig. 5 and Table 1. In addition to sample depth and local land use, the top predictors ranked for importance included aspects of catchment- and watershed-scale land use (including aspects of urban land use, the extent of impervious surfaces, open water cover, barren land cover, and urban and agricultural land use in riparian areas), underlying soil properties (e.g., soil characteristics such as organic matter and mineral content, water storage and permeability), landscape properties (such as slope, elevation, and whether sites were located adjacent to the stream network), inputs (including atmospheric deposition as well as fertilizer and manure inputs), and climate factors (including temperature and precipitation; Table 1). Additional predictors beyond this set did not appear to contribute strongly to model accuracy once all other variables had been accounted for.

Examining some of these attributes in relation to soil P in more detail, we find that the highest soil P concentrations (> 750 mg/kg) were found almost exclusively in samples collected at a depth of 100 cm or shallower, i.e., comparatively closer to the surface. Nearly all samples deeper than 100 cm had soil P < 750 mg/kg (Appendix Figure S3). Among NLCD land use categories, samples collected in areas designated as open water, shrub/scrub, cultivated crops, developed/open space and grassland/herbaceous had higher mean soil P concentrations than the total dataset average, whereas other categories had comparatively lower soil P (Figure S4). Riparian soils (i.e., soils located within 100 m of the NHDPlusv2 stream network) had significantly higher soil P (mean soil P = 646 mg/kg) compared to non-riparian soils (mean soil P = 580 mg/kg; unpaired t test, t = -7.04, df = 1456, p < 0.00001; Figure S5). However, caution should be used when applying univariate relationships to aid in the interpretation of importance values for the random forest model. Because the random forest model is based on a series of successive data splitting events (i.e., decision trees) in which data are binned into branches based on predictive variable split points, the predictor values useful in parsing these groups may not be readily apparent from univariate relationships.
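The riparian versus non-riparian comparison above is an unpaired t test. The text does not say which variant was used; the reported degrees of freedom (1456) are well below n₁ + n₂ − 2, which would be consistent with an unequal-variance (Welch) test, but that is our reading rather than something stated in the source. Under that assumption, the statistic and its approximate degrees of freedom could be computed as in this Python sketch, shown with made-up data.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's unequal-variance t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical example: group b has the higher mean, so t is negative.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
```

A negative t value, as reported in the text, simply reflects the order in which the two groups were entered.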

3.4 Predicting Soil P for the Upper Mississippi River Basin

Figure 6 shows total soil P values for the UMRB at the 100 m grid scale, predicted from the random forest model. The predictions indicate the highest concentrations of total soil P in northern Iowa, southern Minnesota, and north-central Illinois. The lowest predicted concentrations occurred in northern and north/north-central Wisconsin, and in southern Illinois and southern Missouri.

4.1 Soil P across the study region

Here, we have assembled a large dataset of total P concentrations for soils across the Midwestern United States from publicly available datasets and used them to predict total soil P at fine resolution across the UMRB, as well as to identify predictors that were comparatively important to model accuracy. These soil datasets, collected by federal agencies over relatively long time periods and large spatial extents, have great potential utility in the analysis of dynamics in soil and water chemistry. However, their availability is not widely known, and they are not uniformly organized and require considerable pre-processing, which may lead to their underutilization in the study of soil nutrient dynamics. Our study highlights the potential of existing publicly funded datasets to support ongoing studies and management design related to soil and water health. For this reason, we provide the collated dataset together with the metadata and code in an open science framework, available in an open access GitHub repository (https://github.com/cldolph/SoilP).

The range of total P concentrations reported here for soils of the Midwestern United States (mean = 580 mg/kg, range = 17–4370 mg/kg) is similar to that reported by Schilling et al. (2021) for riparian soils across Iowa (mean = 460 mg/kg; range = 109–1569 mg/kg), though it should be noted that the latter study used the aqua regia digestion method for measuring total P, which can sometimes result in slightly lower measured P concentrations (J. Kovar, personal communication), rather than the HCl digestion method used for the samples in this study. Riparian soils in our study averaged slightly higher total P (646 mg/kg) than those measured by Schilling et al. (2021). Ringeval et al. (2017) developed models to simulate total P for cropland soils globally and estimated the global average for total P in cropland soils to be 567 mg/kg; however, their simulated total P values for soils in North America ranged considerably higher, typically above 1000 mg/kg.

4.2 Model Performance

Heterogeneity in total soil P is the outcome of potentially complex interacting drivers including geology, hydrology, climate, biogeochemical processes, and specific land use practices. Given this complexity, we view the predictive accuracy of our random forest model (R² = 0.58 for an independent validation dataset) for total soil P as relatively high. This performance is considerably stronger than that for continental-scale predictive models developed using similar methods for soil organic carbon and total nitrogen by Ramcharan et al. (2018). Our model also performed more accurately on an independent test dataset than a model developed for soil P at the scale of a single field by Sahabiev et al. (2021). However, the model developed here underpredicts total soil P for samples with very high soil P (> 1000 mg/kg). Although such samples comprised a relatively small proportion of our total dataset (~5%), underestimating soil P content for these samples may be substantively important if they represent hotspots for P accumulation and ultimately transport. Understanding the drivers of very high soil P samples in this study region is an important area for future research. Knowledge of specific land use practices at finer scales than those used in our study (e.g., fertilizer or manure application at location-specific scales smaller than the catchment or watershed scales available for land use practices in the StreamCat dataset) could also potentially improve predictive accuracy.

4.3 Predictive Variable Importance

Our analysis indicates that, at the regional scale represented by our model, the variables with the greatest comparative importance for predicting soil P included sample depth, land use, underlying soil properties, landscape properties, inputs, and climate. Soil sample depth was by far the most important variable to model performance, with soil P tending lower with increased depth (Figure S3). This finding echoes Ramcharan et al. (2018), who found depth to be consistently ranked as one of the most important predictors for random forest models developed for soil properties including total nitrogen and soil organic carbon.

Land use in the immediate area (30 m grid cell) where each soil sample was collected was the second most important variable to model accuracy, according to the conditional permutation importance measure we used. The comparatively high ranking of land use/land cover differs somewhat from that of Ringeval et al. (2017), who found, using a different modeling approach, that the main driver of variability for total soil P at a global scale was the soil biogeochemical background corresponding to P inherited from natural soils. However, at smaller (continental) scales, Ringeval et al. (2017) found farming practices were also important in driving heterogeneity in total soil P, though still less important than native soil properties. In our study, land use in the immediate area of a soil sample appeared more important to accurately predicting soil P than soil properties at the gSSURGO map unit scale and more important than measures of land use at catchment or watershed scales. Given this finding, a question for future research would be whether more specificity in soil properties and land use practices at the hyper-local scale (e.g., on-field practices such as fertilizer and manure input, specific crop type, tillage practices at the field or 30 m grid scale, etc.) could lead to improvements in models for predicting soil P across the Midwest.

Model variables associated with urban or high density areas, including extent of impervious surfaces, urban land use, and road density, were all ranked as relatively important to model performance. This finding suggests that soil P in urban or high density areas may be distinct from other landscapes. However, several ranked “urban” attributes describe the extent of low density or “open” urban land use, which are indicative of suburban and exurban landscapes and may indicate that these landscapes are also distinct in terms of soil P. Although soil P dynamics in urban areas are relatively under-studied, they have been identified as hotspots for P cycling, with distinct drivers that influence P stocks and flows (Metson et al., 2015). For example, Hobbie et al. (2017) observed that urban watersheds in our study region retained comparatively little of their P inputs, and that the majority of annual P inputs were exported via stormwater flows. Additional research is needed to understand drivers of total soil P in urban and suburban landscapes and how these may differ from those in agricultural settings.

Riparian attributes summarized at the catchment or watershed scale (e.g., the % of crop, hay or urban land use cover in catchment or watershed-scale riparian zones) were also identified as relatively important to model prediction accuracy. This finding could result from the fact that land use in riparian zones may be indicative of a particularly intensive form of land use - i.e., if riparian zones are used for crops, hay or dominated by urban land use, this may indicate particularly intensive landscapes for these uses in a way that affects soil P. Alternatively, these drivers may be strongly affecting soil P in riparian areas, and therefore could have been important for model accuracy for riparian samples. A categorical attribute (‘StreamType’) describing whether a sample was collected in the riparian buffer of the stream network (and which kind of stream reach it was adjacent to, such as a ditch, stream, river, etc.) also appeared in the list of important variables to model performance. Post hoc analysis indicated that riparian areas (soils within 100 m of the stream network) had significantly higher total P (646 mg/kg) compared to non-riparian areas (580 mg/kg). This echoes previous studies, which have noted riparian and wetland areas as potential sinks, and ultimately sources, of P in agricultural landscapes (Kleinman et al., 2022).

Soil properties that were ranked as important to model performance included the native P content (as phosphorus oxide) and mineral content (aluminum, iron, calcium, and sodium oxides) of soils, along with aspects of soil texture, water storage and depth, soil organic carbon/organic matter, glacial history, hydraulic conductivity, erodibility, and suitability of soils for growth of various crop types. Extensive research has documented the importance of soil compounds such as iron and aluminum oxides and calcium carbonate on soil P retention and release (Records et al., 2016). Likewise, soil organic matter content has been shown to have direct and indirect effects on P sorption capacity (e.g., Kang et al., 2009). And as documented recently by Plach et al. (2018), the parental origins of soil and their glacial history have a potentially strong effect on the physical and geochemical properties of soil which can subsequently affect P retention and transport. Our findings likewise indicate that many of these properties are important for accurately predicting soil P across Midwestern landscapes.

Nitrate and ammonium deposition at the catchment scale, as well as fertilizer and manure inputs at the catchment scale, were all identified as important to model performance. Atmospheric deposition of nitrogen could be an indicator of the proximity of fossil fuel combustion, fertilizer or manure application or livestock emissions (Russell et al., 1998), which may also affect soil P.

Predictors summarizing aspects of climate, landscape properties, and their interaction – including precipitation, temperature, runoff, slope, and the contribution of groundwater to baseflow – also appeared to affect model performance. Recent work has shown that both temperature and precipitation are important to soil P availability at a global scale, and may have contrasting effects (Hou et al., 2018). Slope also has a profound effect on erosion (whether driven by human land use, climate or other factors) and on subsequent biogeochemical processes that may affect soil P (Berhe et al., 2018).

It is important to note that the estimation of variable importance within random forest models is an active area of research (Debeer and Strobl, 2020), with rapid development of prospective improvements to “opening up” the black box of random forest models. We applied a recently developed approach aimed at examining the conditional importance of predictors, which may give different results than more commonly applied importance measures. Care should be taken when interpreting these results relative to other modeling approaches if different methods for importance estimation are used.

Here we have used existing, large public soil chemistry and geospatial datasets together with a random forest model to generate predictions of total soil P at fine spatial resolution across the Upper Mississippi River Basin in an open science framework. Predicted soil P values are publicly available as a raster data layer at https://github.com/cldolph/SoilP. The combination of large existing datasets with powerful analytical tools like machine learning in a high performance computing environment offers new possibilities for understanding and predicting complex biophysical phenomena such as soil P at fine scales. Such predictions could be useful when combined with watershed models to improve estimates of P loss from this largely agricultural landscape. Improved knowledge of critical source areas of soil P to include undersampled locations is necessary to develop effective regional conservation plans and prioritize resources such that they can reduce the accumulation and transport of P across landscapes.

Acknowledgements

We would like to thank Zhen Xu for alerting us to the availability of the USGS soil dataset. Random forest model tuning and predictions were performed on the High Performance Cluster at the University of Minnesota’s Supercomputing Institute, https://www.msi.umn.edu/. This project was funded by a U.S. Department of Agriculture Conservation Effects Assessment Project (CEAP) grant #NR203A750023C023.

Funding

This project was funded by a U.S. Department of Agriculture Conservation Effects Assessment Project (CEAP) grant #NR203A750023C023.

Competing Interests

The authors have no relevant financial or non-financial interests to disclose.

Author Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Christine Dolph. R script quality control was performed by Se Jong Cho, who also designed Figure 2. The first draft of the manuscript was written by Christine Dolph and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Data Availability

The datasets and R scripts generated during and/or analysed during the current study are available in the Soil P repository, https://github.com/cldolph/SoilP.

Boardman, E., Danesh-Yazdi, M., Foufoula-Georgiou, E., Dolph, C.L., Finlay, J.C., 2019. Fertilizer, landscape features and climate regulate phosphorus retention and river export in diverse Midwestern watersheds. Biogeochemistry 146, 293–309. https://doi.org/10.1007/s10533-019-00623-z
Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324
Breiman, L., Cutler, A., Liaw, A., Wiener, M., 2022. Package ‘randomForest’. https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
Clark, B., Longo, S.B., 2018. Land–Sea Ecological Rifts. Monthly Review 108–121. https://doi.org/10.14452/mr-070-03-2018-07_5
Debeer, D., Hothorn, T., Strobl, C., 2021. Package ‘permimp’. https://cran.r-project.org/web/packages/permimp/permimp.pdf
Debeer, D., Strobl, C., 2020. Conditional permutation importance revisited. BMC Bioinformatics 21. https://doi.org/10.1186/s12859-020-03622-2
Deiss, L., de Moraes, A., Maire, V., 2018. Environmental drivers of soil phosphorus composition in natural ecosystems. Biogeosciences 15, 4575–4592. https://doi.org/10.5194/bg-15-4575-2018
Dodd, R.J., Sharpley, A.N., 2015. Conservation practice effectiveness and adoption: unintended consequences and implications for sustainable phosphorus management. Nutrient Cycling in Agroecosystems 104, 373–392. https://doi.org/10.1007/s10705-015-9748-8
Fernández, F.G., Farmaha, B.S., Nafziger, E.D., 2012. Soil Fertility Status of Soils in Illinois. Communications in Soil Science and Plant Analysis 43, 2897–2914. https://doi.org/10.1080/00103624.2012.728268
Goyette, J.-O., Bennett, E.M., Maranger, R., 2018. Low buffering capacity and slow recovery of anthropogenic phosphorus pollution in watersheds. Nature Geoscience 11, 921–925. https://doi.org/10.1038/s41561-018-0238-x
Gran, K.B., Dolph, C., Baker, A., Bevis, M., Cho, S.J., Czuba, J.A., Dalzell, B., Danesh‐Yazdi, M., Hansen, A.T., Kelly, S., Lang, Z., Schwenk, J., Belmont, P., Finlay, J.C., Kumar, P., Rabotyagov, S., Roehrig, G., Wilcock, P., Foufoula‐Georgiou, E., 2019. The Power of Environmental Observatories for Advancing Multidisciplinary Research, Outreach, and Decision Support: The Case of the Minnesota River Basin. Water Resources Research 55, 3576–3592. https://doi.org/10.1029/2018wr024211
Green, T.R., Kipka, H., David, O., McMaster, G.S., 2018. Where is the USA Corn Belt, and how is it changing? Science of The Total Environment 618, 1613–1618. https://doi.org/10.1016/j.scitotenv.2017.09.325
Hagenauer, J., Omrani, H., Helbich, M., 2019. Assessing the performance of 38 machine learning models: the case of land consumption rates in Bavaria, Germany. International Journal of Geographical Information Science 33, 1399–1419. https://doi.org/10.1080/13658816.2019.1579333
He, X., Augusto, L., Goll, D.S., Ringeval, B., Wang, Y., Helfenstein, J., Huang, Y., Yu, K., Wang, Z., Yang, Y., Hou, E., 2021. Global patterns and drivers of soil total phosphorus concentration. Earth System Science Data 13, 5831–5846. https://doi.org/10.5194/essd-13-5831-2021
Hill, R.A., Weber, M.H., Leibowitz, S.G., Olsen, A.R., Thornbrugh, D.J., 2015. The Stream-Catchment (StreamCat) Dataset: A Database of Watershed Metrics for the Conterminous United States. JAWRA Journal of the American Water Resources Association 52, 120–128. https://doi.org/10.1111/1752-1688.12372
Hobbie, S.E., Finlay, J.C., Janke, B.D., Nidzgorski, D.A., Millet, D.B., Baker, L.A., 2017. Contrasting nitrogen and phosphorus budgets in urban watersheds and implications for managing urban water pollution. Proceedings of the National Academy of Sciences 114, 4177–4182. https://doi.org/10.1073/pnas.1618536114
Hooker, G., Mentch, L., Zhou, S., 2021. Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Statistics and Computing 31. https://doi.org/10.1007/s11222-021-10057-z
Hosseini, M., Rajabi Agereh, S., Khaledian, Y., Jafarzadeh Zoghalchali, H., Brevik, E.C., Movahedi Naeini, S.A.R., 2017. Comparison of multiple statistical techniques to predict soil phosphorus. Applied Soil Ecology 114, 123–131. https://doi.org/10.1016/j.apsoil.2017.02.011
Hou, E., Chen, C., Kuang, Y., Zhang, Y., Heenan, M., Wen, D., 2016. A structural equation model analysis of phosphorus transformations in global unfertilized and uncultivated soils. Global Biogeochemical Cycles 30, 1300–1309. https://doi.org/10.1002/2016gb005371
Hou, E., Chen, C., Luo, Y., Zhou, G., Kuang, Y., Zhang, Y., Heenan, M., Lu, X., Wen, D., 2018. Effects of climate on soil phosphorus cycle and availability in natural terrestrial ecosystems. Global Change Biology 24, 3344–3356.
Jacobson, L.M., David, M.B., Drinkwater, L.E., 2011. A Spatial Analysis of Phosphorus in the Mississippi River Basin. Journal of Environmental Quality 40, 931–941. https://doi.org/10.2134/jeq2010.0386
Jeong, J.H., Resop, J.P., Mueller, N.D., Fleisher, D.H., Yun, K., Butler, E.E., Timlin, D.J., Shim, K.-M., Gerber, J.S., Reddy, V.R., Kim, S.-H., 2016. Random Forests for Global and Regional Crop Yield Predictions. PLOS ONE 11, e0156571. https://doi.org/10.1371/journal.pone.0156571
Kang, J., Hesterberg, D., Osmond, D.L., 2009. Soil Organic Matter Effects on Phosphorus Sorption: A Path Analysis. Soil Science Society of America Journal 73, 360–366. https://doi.org/10.2136/sssaj2008.0113
Kleinman, P.J.A., Osmond, D.L., Christianson, L.E., Flaten, D.N., Ippolito, J.A., Jarvie, H.P., Kaye, J.P., King, K.W., Leytem, A.B., McGrath, J.M., Nelson, N.O., Shober, A.L., Smith, D.R., Staver, K.W., Sharpley, A.N., 2022. Addressing conservation practice limitations and trade‐offs for reducing phosphorus loss from agricultural fields. Agricultural & Environmental Letters 7. https://doi.org/10.1002/ael2.20084
Kuhn, M., Wickham, H., 2020. Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
Liu, J., Cade-Menun, B.J., Yang, J., Hu, Y., Liu, C.W., Tremblay, J., LaForge, K., Schellenberg, M., Hamel, C., Bainard, L.D., 2018. Long-Term Land Use Affects Phosphorus Speciation and the Composition of Phosphorus Cycling Genes in Agricultural Soils. Frontiers in Microbiology 9. https://doi.org/10.3389/fmicb.2018.01643
Metson, G.S., Iwaniec, D.M., Baker, L.A., Bennett, E.M., Childers, D.L., Cordell, D., Grimm, N.B., Grove, J.M., Nidzgorski, D.A., White, S., 2015. Urban phosphorus sustainability: Systemically incorporating social, ecological, and technological factors into phosphorus flow analysis. Environmental Science & Policy 47, 1–11. https://doi.org/10.1016/j.envsci.2014.10.005
NCSS, 2021. National Cooperative Soil Survey, National Cooperative Soil Survey Soil Characterization Database, Accessed online September 10, 2021. http://ncsslabdatamart.sc.egov.usda.gov/
Plach, J.M., Macrae, M.L., Williams, M.R., Lee, B.D., King, K.W., 2018. Dominant glacial landforms of the lower Great Lakes region exhibit different soil phosphorus chemistry and potential risk for phosphorus loss. Journal of Great Lakes Research 44, 1057–1067. https://doi.org/10.1016/j.jglr.2018.07.005
Qiao, L., Wang, X., Smith, P., Fan, J., Lu, Y., Emmett, B., Li, R., Dorling, S., Chen, H., Liu, S., Benton, T.G., Wang, Y., Ma, Y., Jiang, R., Zhang, F., Piao, S., Müller, C., Yang, H., Hao, Y., Li, W., Fan, M., 2022. Soil quality both increases crop production and improves resilience to climate change. Nature Climate Change 12, 574–580. https://doi.org/10.1038/s41558-022-01376-8
Ramcharan, A., Hengl, T., Nauman, T., Brungard, C., Waltman, S., Wills, S., Thompson, J., 2018. Soil Property and Class Maps of the Conterminous United States at 100‐Meter Spatial Resolution. Soil Science Society of America Journal 82, 186–201. https://doi.org/10.2136/sssaj2017.04.0122
Records, R.M., Wohl, E., Arabi, M., 2016. Phosphorus in the river corridor. Earth-Science Reviews 158, 65–88. https://doi.org/10.1016/j.earscirev.2016.04.010
Ringeval, B., Augusto, L., Monod, H., van Apeldoorn, D., Bouwman, L., Yang, X., Achat, D.L., Chini, L.P., Van Oost, K., Guenet, B., Wang, R., Decharme, B., Nesme, T., Pellerin, S., 2017. Phosphorus in agricultural soils: drivers of its distribution at the global scale. Global Change Biology 23, 3418–3432. https://doi.org/10.1111/gcb.13618
Russell, K.M., Galloway, J.N., Macko, S.A., Moody, J.L., Scudlark, J.R., 1998. Sources of nitrogen in wet deposition to the Chesapeake Bay region. Atmospheric Environment 32, 2453–2465. https://doi.org/10.1016/s1352-2310(98)00044-2
Sadayappan, K., Kerins, D., Shen, C., Li, L., 2022. Nitrate concentrations predominantly driven by human, climate, and soil properties in US rivers. Water Research 226, 119295. https://doi.org/10.1016/j.watres.2022.119295
Sahabiev, I., Smirnova, E., Giniyatullin, K., 2021. Spatial Prediction of Agrochemical Properties on the Scale of a Single Field Using Machine Learning Methods Based on Remote Sensing Data. Agronomy 11, 2266. https://doi.org/10.3390/agronomy11112266
Schilling, K.E., Isenhart, T.M., Wolter, C.F., Streeter, M.T., Kovar, J.L., 2021. Contribution of streambanks to phosphorus export from Iowa. Journal of Soil and Water Conservation 77, 103–112. https://doi.org/10.2489/jswc.2022.00036
Schilling, K.E., Libra, R.D., 2003. INCREASED BASEFLOW IN IOWA OVER THE SECOND HALF OF THE 20TH CENTURY. Journal of the American Water Resources Association 39, 851–860. https://doi.org/10.1111/j.1752-1688.2003.tb04410.x
Schindler, D.W., 2006. Recent advances in the understanding and management of eutrophication. Limnology and Oceanography 51, 356–363. https://doi.org/10.4319/lo.2006.51.1_part_2.0356
Shen, L.Q., Amatulli, G., Sethi, T., Raymond, P., Domisch, S., 2020. Estimating nitrogen and phosphorus concentrations in streams and rivers, within a machine learning framework. Scientific Data 7. https://doi.org/10.1038/s41597-020-0478-7
Soil Survey Staff, 2014. Kellogg Soil Survey Laboratory Methods Manual. Soil Survey Investigations Report No. 42, Version 5.0. R. Burt and Soil Survey Staff (ed.). U.S. Department of Agriculture, Natural Resources Conservation Service., p. 457. https://www.nrcs.usda.gov/Internet/FSE_DOCUMENTS/stelprdb1253872.pdf
Soil Survey Staff, 2021. Gridded Soil Survey Geographic (gSSURGO) Database for . United States Department of Agriculture, Natural Resources Conservation Service. Available online at https://gdg.sc.egov.usda.gov/ Accessed online September 10, 2021.
Stackpoole, S.M., Stets, E.G., Sprague, L.A., 2019. Variable impacts of contemporary versus legacy agricultural phosphorus on US river water quality. Proceedings of the National Academy of Sciences 116, 20562–20567. https://doi.org/10.1073/pnas.1903226116
USGS, 2004. The National Geochemical Survey - database and documentation: U.S. Geological Survey Open-File Report 2004-1001, U.S. Geological Survey, Reston VA. 2021. https://doi.org/10.3133/ofr20041001 Accessed online June 22, 2022.
Vadas, P.A., Kleinman, P.J.A., Sharpley, A.N., Turner, B.L., 2005. Relating Soil Phosphorus to Dissolved Phosphorus in Runoff: A Single Extraction Coefficient for Water Quality Modeling. Journal of Environmental Quality 34, 572–580. https://doi.org/10.2134/jeq2005.0572
Van Meter, K.J., McLeod, M.M., Liu, J., Tenkouano, G.T., Hall, R.I., Van Cappellen, P., Basu, N.B., 2021. Beyond the Mass Balance: Watershed Phosphorus Legacies and the Evolution of the Current Water Quality Policy Challenge. Water Resources Research 57. https://doi.org/10.1029/2020wr029316
Vitousek, P.M., Porder, S., Houlton, B.Z., Chadwick, O.A., 2010. Terrestrial phosphorus limitation: mechanisms, implications, and nitrogen–phosphorus interactions. Ecological Applications 20, 5–15. https://doi.org/10.1890/08-0127.1
Watershed Boundary Dataset for HUC07. 2021. Available URL: http://datagateway.nrcs.usda.gov [Accessed September 21, 2021].
Wright, M.N., Ziegler, A., 2017. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77. https://doi.org/10.18637/jss.v077.i01
Wu, Z., Li, J., Sun, Y., Peñuelas, J., Huang, J., Sardans, J., Jiang, Q., Finlay, J.C., Britten, G.L., Follows, M.J., Gao, W., Qin, B., Ni, J., Huo, S., Liu, Y., 2022. Imbalance of global nutrient cycles exacerbated by the greater retention of phosphorus over nitrogen in lakes. Nature Geoscience 15, 464–468. https://doi.org/10.1038/s41561-022-00958-7
Wuenscher, R., Unterfrauner, H., Peticzka, R., Zehetner, F., 2016. A comparison of 14 soil phosphorus extraction methods applied to 50 agricultural soils from Central Europe. Plant, Soil and Environment 61, 86–96. https://doi.org/10.17221/932/2014-pse
Zhong, S., Zhang, K., Bagheri, M., Burken, J.G., Gu, A., Li, B., Ma, X., Marrone, B.L., Ren, Z.J., Schrier, J., Shi, W., Tan, H., Wang, T., Wang, X., Wong, B.M., Xiao, X., Yu, X., Zhu, J.-J., Zhang, H., 2021. Machine Learning: New Ideas and Tools in Environmental Science and Engineering. Environmental Science & Technology. https://doi.org/10.1021/acs.est.1c01339

Table 1

Top covariates for the performance of the random forest model, as ranked by conditional permutation importance (CPI). Attributes are grouped into the major categories ‘land use’, ‘underlying soil properties’, ‘landscape properties’, ‘inputs’, and ‘climate’. More detailed descriptions of each attribute are available in Appendix Table S1. ‘Riparian zones’ refers to the 100 m buffer surrounding the NHD stream network.
Abbreviation	Category	Short description	Source layer
Depth_cm	Underlying soil properties	Soil sample depth (cm)	NGS/NCSS
NLC06	Land use	Land use category in the 30 m grid cell where soil sample was collected.	NLCD 2006
PctImp2006CatRp100	Land use	% Impervious in riparian zones within catchment	StreamCat
PctImp2006Cat	Land use	% Impervious within catchment	StreamCat
PctBl2006Ws	Land use	% Barren land within watershed	StreamCat
P2O5Cat	Underlying soil properties	Mean % lithological phosphorus oxide (P2O5) content in surface geology within catchment	StreamCat
PctOw2006WsRp100	Land use	% Open water in riparian zones within watershed	StreamCat
slope_r	Landscape properties	Mean slope in gSSURGO map unit	gSSURGO
RdDensCatRp100	Land use	Road density in riparian zones within catchment	StreamCat
nccpi3corn	Underlying soil properties	National commodity crop productivity index for corn in gSSURGO map unit	gSSURGO
PctGlacLakeCrsCat	Underlying soil properties	% of catchment area classified as coarse-textured glacial outwash & glacial lake sediment	StreamCat
NO3_2008Cat	Inputs	Mean NO3 wet deposition within catchment	StreamCat
HydrlCondCat	Underlying soil properties	Mean lithological hydraulic conductivity in surface geology within catchment	StreamCat
sieveno200_r	Underlying soil properties	Mean soil fraction passing a #200 sieve (0.074 mm square opening) for gSSURGO map unit	gSSURGO
PctUrbOp2006CatRp100	Land use	% Developed, open space in riparian zones within catchment	StreamCat
NO3_2008Ws	Inputs	Mean NO3 wet deposition within watershed	StreamCat
ElevCat	Landscape properties	Mean elevation within catchment	StreamCat
PctUrbLo2006Cat	Land use	% Developed, low intensity land use within catchment	StreamCat
OmCat	Underlying soil properties	Mean organic matter of soils within catchment	StreamCat
MgOCat	Underlying soil properties	Mean lithological magnesium oxide (MgO) content in surface geology within catchment	StreamCat
BFICat	Landscape properties	Ratio of baseflow to total flow within catchment	StreamCat
PctGlacLakeCrsWs	Underlying soil properties	% of watershed classified as coarse-textured glacial outwash and glacial lake sediment	StreamCat
Al2O3Cat	Underlying soil properties	Mean lithological aluminum oxide content in surface geology within catchment	StreamCat
aws0_30	Underlying soil properties	Mean available water storage at 0–30 cm depth for gSSURGO map unit	gSSURGO
soc150_999	Underlying soil properties	Mean soil organic carbon stock from 150 cm to max reported depth of the soil profile for gSSURGO map unit	gSSURGO
AgKffactCat	Underlying soil properties	Mean soil erodibility factor on ag land within catchment	StreamCat
WetIndexWs	Underlying soil properties	Mean composite topographic (wetness) index within watershed	StreamCat
nccpi3sg	Underlying soil properties	National commodity crop productivity index for sorghum in gSSURGO map unit	gSSURGO
FertWs	Inputs	Mean rate of synthetic N fertilizer application to ag land within watershed	StreamCat
aws0_999	Underlying soil properties	Available water storage in total soil profile in gSSURGO map unit	gSSURGO
Al2O3Ws	Underlying soil properties	Mean lithological aluminum oxide content in surface geology within watershed	StreamCat
aws0_20	Underlying soil properties	Available water storage estimate at 0–20 cm depth in gSSURGO map unit	gSSURGO
PctUrbLo2006CatRp100	Land use	% Developed, low intensity land use in riparian zones within catchment	StreamCat
WtDepWs	Underlying soil properties	Mean seasonal water table depth of soils within watershed	StreamCat
StreamType	Landscape properties	Whether soil samples were collected within riparian buffer of stream network	NHDv2Plus
SN_2008Ws	Inputs	Mean wet deposition for average sulfur and nitrogen within watershed	StreamCat
nccpi3all	Underlying soil properties	National commodity crop productivity index for all crops in gSSURGO map unit	gSSURGO
nccpi3soy	Underlying soil properties	National commodity crop productivity index for soybean in gSSURGO map unit	gSSURGO
Fe2O3Ws	Underlying soil properties	Mean lithological ferric oxide content in surface geology within watershed	StreamCat
aws0_150	Underlying soil properties	Mean available water storage in top 150 cm of soil depth in gSSURGO map unit	gSSURGO
PctOw2006Cat	Land use/landscape properties	% Open water within catchment	StreamCat
RdCrsCat	Land use	Road crossings in catchment	StreamCat
PermCat	Underlying soil properties	Mean permeability of soils within catchment	StreamCat
aws50_100	Underlying soil properties	Available water storage at 50–100 cm depth in gSSURGO map unit	gSSURGO
CaOWs	Underlying soil properties	Mean lithological calcium oxide content in surface geology within watershed	StreamCat
PctHay2006WsRp100	Land use	% Hay in riparian zones within watershed	StreamCat
TmaxCat	Climate	30 year normal max temperature, 1981–2010, within catchment	StreamCat
PctEolFineWs	Underlying soil properties	% of watershed area classified as eolian sediment, fine-textured (glacial loess)	StreamCat
PrecipWs	Climate	PRISM climate data - Mean precipitation (mm) within the watershed. Period: 2008	StreamCat
PctCrop2006WsRp100	Land use	% Crop land use in riparian zones within watershed	StreamCat
ElevWs	Landscape properties	Mean watershed elevation (m)	StreamCat
BFICat	Landscape properties	Ratio of baseflow to total flow within catchment	StreamCat
SN_2008Cat	Inputs	Mean wet deposition for average sulfur and nitrogen within catchment	StreamCat
RunoffWs	Landscape properties/climate	Mean runoff (mm) within watershed	StreamCat
CompStrgthWs	Underlying soil properties	Mean lithological uniaxial compressive strength content in surface geology within watershed	StreamCat
ManureCat	Inputs	Mean rate of manure application to ag land from CAFOs within catchment	StreamCat
PctImp2006Ws	Land use	% Impervious surfaces within watershed	StreamCat
claytotal_r	Underlying soil properties	% Clay for soils in gSSURGO map unit	gSSURGO
Na2OCat	Underlying soil properties	Mean lithological sodium oxide content in surface geology within catchment	StreamCat
Na2OWs	Underlying soil properties	Mean lithological sodium oxide content in surface geology within watershed	StreamCat
FertCat	Inputs	Mean rate of synthetic nitrogen fertilizer application to ag land within catchment	StreamCat
PctUrbHi2006CatRp100	Land use	% Developed, high intensity land use in riparian zones within catchment	StreamCat
NH4_2008Cat	Inputs	Mean wet deposition for ammonium ion concentration 2008 within catchment	StreamCat
PermWs	Underlying soil properties	Mean permeability of soils within watershed	StreamCat
PctUrbOp2006Ws	Land use	% Developed, open space land use within watershed	StreamCat
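
The StreamCat abbreviations in Table 1 appear to encode their spatial extent in a suffix: `Cat` for catchment, `Ws` for watershed, and an optional trailing `Rp100` for the 100 m riparian buffer. The following Python sketch decodes that suffix. It is an illustration only and not part of the authors' workflow: the function name `streamcat_scope` and the `Scope` type are hypothetical, and the naming rule is inferred from the table rows rather than taken from StreamCat documentation. It should only be applied to rows whose source layer is StreamCat.

```python
import re
from typing import NamedTuple, Optional


class Scope(NamedTuple):
    """Decoded parts of a StreamCat-style abbreviation (hypothetical type)."""
    base: str       # metric name without the extent suffix, e.g. "PctImp2006"
    extent: str     # "catchment" or "watershed"
    riparian: bool  # True if restricted to the 100 m riparian buffer


# Assumed convention, inferred from Table 1: <base><Cat|Ws>[Rp100]
_SUFFIX = re.compile(r"^(?P<base>.+?)(?P<extent>Cat|Ws)(?P<rp>Rp100)?$")


def streamcat_scope(abbreviation: str) -> Optional[Scope]:
    """Return the decoded scope, or None if the name has no Cat/Ws suffix."""
    match = _SUFFIX.match(abbreviation)
    if match is None:
        return None
    extent = "catchment" if match.group("extent") == "Cat" else "watershed"
    return Scope(match.group("base"), extent, match.group("rp") is not None)


if __name__ == "__main__":
    for name in ("PctImp2006CatRp100", "NO3_2008Ws", "slope_r"):
        print(name, streamcat_scope(name))
```

A check of this kind can help spot rows where the description and the abbreviation disagree about catchment versus watershed extent.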

DolphetalSIAppendix11.16.22.docx

Editorial decision: Major revisions
21 Dec, 2022
Reviewers agreed at journal
18 Nov, 2022
Reviewers invited by journal
18 Nov, 2022
Editor invited by journal
18 Nov, 2022
Editor assigned by journal
18 Nov, 2022
First submitted to journal
17 Nov, 2022
