Urban soil contamination with potentially toxic elements (PTEs) has elicited significant global concern because of its potential environmental and human health threats (Praveena et al. 2015; Amari et al. 2017). The threat of PTEs to the environment and human health is due to their toxicity, persistence, bioaccumulation (Amari et al. 2017), easy enrichment, and poor mobility in soil (Sundaramanickam et al. 2016). In addition, urban soils are enriched with a substantial level of PTFs relative to the natural reference values (Liu et al. 2006; Miao et al. 2008; Adedeji et al. 2019). High levels and variations in PTEs in urban soils are mainly due to anthropogenic activities, such as city expansion and urbanization (Rodríguez-Seijo et al. 2015), the density of vehicles and the types of fuels (Minguillón et al. 2014), and emissions from factories, industries, and transportation (Dao et al. 2014). Therefore, understanding the efficient and accurate spatial variation of PTEs in the soil is crucial for effective management and contamination remediation (Song et al. 2019).
Numerous methods have been reported to generate and predict the concentrations and distributions of soil PTEs. For example, ordinary kriging (OK), simple kriging, universal kriging, co-kriging (CK), regression kriging (RK), and empirical Bayesian kriging (EBK) are widely used geostatistical methods in soil and environmental sciences (Webster and Oliver 2007; Jiang et al. 2017; Gribov and Krivoruchko 2020). However, each geostatistical approach considers many assumptions and excludes various explanatory variables that influence the accuracy of the spatial prediction and distributions of PTEs in soils. For example, OK considers spatial autocorrelation and data stationarity assumptions (Webster and Oliver 2007) and excludes explanatory variables. Similarly, the relationship between the response variables and the spatial covariates (Shi et al. 2009; Sun et al. 2012) and the number of covariates (Wackernagel 1994; Giraldo and Herrera 2020) are further limitations in obtaining acceptable predictions of soil PTEs using RK and CK techniques. If any of these premises do not meet, the outcome values of OK, RK, and CK might be suboptimal, and predicting values at non-sampled locations lacks reasonable accuracy (Pilz and Spöck 2008).
EBK regression prediction (EBKRP) is an advanced geostatistical prediction technique that integrates kriging with regression methods to make predictions more precise than either regression or kriging can achieve independently (Krivoruchko and Gribov 2019; Gribov and Krivoruchko 2020). The EBKRP is also estimated regionally and accounts for regional effects (Gribov and Krivoruchko 2020). For example, the associations between the explanatory and dependent variables may change in different areas; however, the EBKRP can precisely model these local changes. However, despite the many advantages of EBKRP, it has several shortcomings. For example, the EBKRP does not identify the independent variables that are strongly associated with the response variables and the significance of the explanatory variables that influence the predicting variables. Therefore, Random forest (RF) is a capable machine learning technique for the spatial prediction that solves the limitations associated with the EBKRP method.
The RF model is more promising than other machine learning approaches for the accurate estimation and prediction of soil PTEs because of its insensitivity to noise features, resistance to overfitting, and unbiased measurement of the error rate (Breiman 2001). RF can also identify the influences and associations of explanatory variables. However, RF algorithms did not clearly describe how the predictions were made, and some studies have considered RF a “black box” algorithm (Taghizadeh-mehrjardi et al. 2016; Hengl et al. 2018; Minasny et al. 2018). In addition, the RF algorithm only accounts for the relationships between the predictors and auxiliary variables by ignoring the influence of neighboring observed data (spatial autocorrelation)(Guo et al. 2015). Hence, no single method has shown better performance than the others for soil PTE studies (Xiang et al. 2020); thus, a combination of different techniques can neutralize their weaknesses.
A combination of different techniques, hereafter called hybrid methods, refers to combining two conceptually different approaches to model the spatial variation of soil properties (Mirzaee et al. 2016). Previously, hybrid methods were introduced by combining other geostatistical techniques and additional information, such as kriging with external drift and CK with auxiliary variables(Tziachris et al. 2019). Recently, hybrid methods have been increasingly used to model geostatistics trends using machine learning models to enhance the prediction accuracy of interpolation methods (Matinfar et al. 2021).
Several studies have reported that hybrid models improve prediction accuracy (Dai et al. 2014; Mirzaee et al. 2016; Tziachris et al. 2019; Matinfar et al. 2021). However, few studies have been conducted on the performance of EBKRP over machine-learning approaches (Requia et al. 2019; Mallik et al. 2020). Specifically, no well-documented studies have reported the prediction performance of EBKRP, RF, and the hybrid methods of RF-EBKRP for the spatial prediction of PTEs in the soil. Furthermore, studies on the relationship with covariates using geostatistical techniques and machine learning models have not been conducted entirely worked out (Hengl et al. 2018). It is also critical to identify the key auxiliary variables that directly correlate with their target PTEs to explain the spatial variability of soil PTEs in the prediction process. For example, soil fertility indicators for managing urban green space areas, such as the application of fertilizer, irrigation, herbicides, and pesticides, influence the spatial predictions of PTEs in urban green space soils. Furthermore, topographic factors for designing and improving greenspace landscapes, such as slopes and elevations, play a role in predicting PTEs in greenspace soil. Therefore, the objective of this study was to identify the roles of auxiliary variables for predictive model performance and to compare the prediction capabilities of the EBKRP, RF, and hybrid RF-EBKRP models for estimating the five soil PTEs (Pb, Cu, Zn, Cr, and Pb).