Training and validation datasets used in the estimation of SOCS in surface soils are given in Table 3. The range of entire SOCS dataset was between 17.41 and 87.66 (Mg C ha− 1), with a mean value of 43.86 (Mg C ha− 1), a median of 36.19 (Mg C ha− 1) and a standard deviation of 20.04 (Mg C ha− 1). The measures of central tendency indicated the similarity of complete dataset and the train and test datasets. The samples used in the training and test datasets are representative of the entire SOCS data set, thus, the number of samples were adequate for development and validation of an accurate estimation model.
Table 3
Descriptive statistics for the complete SOCS dataset, as well as the training and validation datasets.
Dataset | Minimum | Maximum | Median | Mean | Standard Deviation |
| Mg C ha− 1 |
Complete Dataset | 17.41 | 87.66 | 36.19 | 43.86 | 20.04 |
Training Dataset | 17.41 | 87.24 | 36.29 | 44.90 | 19.65 |
Validation Dataset | 19.36 | 87.66 | 34.61 | 41.64 | 21.36 |
3.1. Analysis of remote sensing-based indexes
Descriptive statistics of remote sensing indexes calculated using the relevant bands of the Sentinel 2A optical satellite image are given in Table 4 and the correlation coefficients between SOCS and remote sensing indexes are given in Fig. 4. The ARVI with a mean value of 0.36 ranged between − 0.24 and 0.93. Significant positive correlation (r = 0.43; p < 0.01) was recorded between ARVI and SOCS. The mean value of NDVI was 0.34, the smallest value was − 0.43 and the highest value was 0.68. The mean values of NDVI and ARVI calculated with the NIR and Red bands of Sentinel 2A were quite close to each other. Less sensitivity of ARVI to atmospheric effects compared to the NDVI may cause the difference in minimum and maximum values (Kaufman and Tanre 1992). A moderately significant positive correlation (r = 0.44; p < 0.01) was found between NDVI and SOCS values (Fig. 4). The NDVI values indicated a dense vegetation from south to northwest of the study area, and the middle part of the study area was characterized by lower NDVI values (Fig. 5). The NDWI values ranged from − 0.44 to 0.78, with a mean value of 0.33. The SRCI and GSI are the covariates that provide information on soil texture. The SRCI ranged from 0.8 to 2.34, with a mean value of 1.64. The GSI value was between − 0.49 and 0.04 and the mean value was − 0.20. The minimum values obtained by SRCI are in line with Bousbih et al. (2019). The correlation coefficient values between SOCS and SRCI and GSI were 0.51 and − 0.39, respectively (Fig. 4).
Table 4
Descriptive statistics of remote sensing-based indexes for the study area
Variables | Min | Max | Mean | Std. Dev | CV (%) |
ARVI | -0.24 | 0.93 | 0.36 | 0.26 | 72.22 |
GNDVI | -0.07 | 0.83 | 0.50 | 0.14 | 28.00 |
NDVI | -0.43 | 0.68 | 0.34 | 0.14 | 41.18 |
BI | 0.02 | 0.69 | 0.11 | 0.05 | 45.45 |
NDWI | -0.44 | 0.78 | 0.33 | 0.20 | 60.61 |
SRCI | 0.80 | 2.34 | 1.64 | 0.25 | 15.24 |
GSI | -0.49 | 0.04 | -0.20 | 0.11 | 55.00 |
3.2. Artificial neural network modeling
Optimal number of neurons in the hidden layer is critical to the development of a MLP structure (Sergeev et al. 2019). The number of neurons in the hidden layers was increased from 1 to 30 (Küçüktopcu and Cemek 2021). The lowest RMSE and MAPE values in the test data set were obtained at 15 and 16 neurons in hidden layers 1 and 2, respectively (Fig. 6).
The results of ANN-MLP model architecture created to estimate the spatial distribution of SOCS are given in Table 5. The lowest RMSE and MAPE values were obtained when the activation function of both hidden layers was Hyperbolic tangent sigmoid transfer (tansig) and output layer transfer function was linear transfer function (purelin). Similarly, the lowest RMSE and MAPE in the MLP architecture were recorded using the training algorithm of Levenberg-Marquardt backpropagation training function (trainlm) and the learning algorithm of Gradient descent with momentum weight and bias learning function (learngdm) (Mathworks 2020). Coefficient of determination (R2) was 1 and 0.628 in the training and test dataset, respectively. The number of input variables used in the MLP architecture and the statistical properties of the data are very important in the learning algorithm. Similar to this study, Falahatkar et al. (2016), who modeled the spatial estimation of SOC content in surface soils with the MLP using remote sensing data, determined the number of hidden neurons as 12–12 using the Sinusoidal Activation Function in the hidden layers. The coefficient of determination (R2) values in the training and test datasets were 0.75 and 0.61, respectively. The difference in model architectures can be attributed to the number of samples and the coefficient of variation of the data, as well as the type of activation function.
Table 5
Summary of the best results for the MLP modeling architecture and optimum parameters used to predict the spatial distribution of SOCS in the study area
Number of | Activation Function | Training Algorithm | Learning Algorithm | R2 |
Hidden Layer | Neuron | Hidden Layers | Output Layer | Training | Test |
2 | 15–16 | tansig | Purelin | trainlm | learngdm | 1.0 | 0.628 |
The R2 value provides important information on the regression between the measured and the predicted values. Zhang et al. (2021) modeled the spatial distribution of soil organic matter using all bands of Sentinel 2A and the vegetation indexes calculated from the remote sensing bands. The coefficient of determination using only the bands was R2 0.56, while it was 0.65 using vegetation indexes. In this study, the R2 (0.628, test) value was consistent with those reported in the previous studies. In addition, the results revealed that spectral index variables can be used as effective input variables in SOCS estimation by reducing spectral reflection errors.
3.3. Gradient Boosted Decision Tree
This chapter contains the Gradient Boosted Decision Tree spatial distribution model of SOCS using the hyperparameters optimized Lsboot (least-squares boosting) with Bayesian optimization. The coefficients of determination and optimized hyperparameters for training, validation and all data are given in Table 6. The R2 values for the training, testing and all data sets were 0.899, 0.814 and 0.823, respectively. In the GDBT model conditions; Min Parent Size was 10, Min Leaf Size was 5, Learn Rate was 0.297, Number of learning cycles was 11 and Max Number of splits was 3. Breiman et al. (2017); Loh and Shih (1997), and Tahraoui et al. (2022) stated that the Gradient Boosted Decision Tree optimization conditions should be within the limits of 1 < Max Number Splits < 99, 1 < Min Leaf Size < 50 and 1 < Min Parent Size < 10. In these conditions, the estimation accuracy using the GBDT model with the LSBoost algorithm and the optimization of the hyperparameters with Bayesian Optimization was successful.
In addition to the Bayesian Optimization used to improve hyperparameters, k-fold optimization is an algorithm adopted in GBDT or Regression Tree machine learning methods. Zeraatpisheh et al. (2019) modelled the spatial distribution of SOC with remote sensing data using Regression Tree, optimizing with 10-fold validation and calibration. The calibration improved the RMSE value by 42% and R2 by 170%. The R2 (0.93) value reported in the post-optimization test dataset was compatible with our finding. The study area has a homogeneous topography and climatic data set was limited for the study area. Therefore, the interpolation maps obtained from the climate database did not show a spatial variation. The GBDT performance was higher than the Regression Tree considering that the presented study was carried out under limited covariates data conditions. In addition to optimization, this phenomenon is related to the use of gradient descent approach in the LSBoost and regression model at end nodes. Thus, the GBDT is capable of generating a range of estimations. The conventional regression tree end nodes contain a fixed value (Malone et al. 2014; Alajmi and Almeshal 2020). The GBDT also fits the error residual of the previous tree into a decision tree. Therefore, every new tree in the community is focused on reducing the error made by the previous weak learner rather than predicting the target directly (Inria 2022; Tahraoui et al. 2022). The aforementioned structure of the GDBT is very useful for spatial distribution models of soil properties, that are highly affected by environmental factors such as organic carbon and have high spatial variation, under the number of covariates is limited.
Table 6
The hyperparameter optimization results of GBDT machine learning
Min Parent Size | Min Leaf Size | Learn Rate | Number of Learning Cycles | Max Number Splits | Number of Node | R2 | |
Train | Test | All data | |
10 | 5 | 0.297 | 11 | 8 | 13 | 0.899 | 0.814 | 0.823 | |
3.4. The Performances of Different Estimation Models
The SOCS map obtained with the GBDT and MLP estimation models are shown in Fig. 7. The histogram statistics of GBDT map were min:18.23, max:82.67, mean:48.28, std. deviation: 17.75 and CV:41.98%. The statistics of SOCS map produced by the MLP model were min: 17.62, max:87.23, mean:52.17, std.deviation:21.53 and CV:41.32%. The min, max and CV values obtained in both methods were quite similar to complete dataset (Table 3).
The maps produced using the models were compatible with the fact that the Nehil Marshes, located in the South of Yüksekova and have the highest SOCS values. The remote sensing indexes of BI and GSI had a moderate negative correlation with the SOCS (Fig. 3). High SOCS content in Nehil Marshes shown in both model maps, where the BI and GSI values are the lowest, coincide with the laboratory analysis. In addition, the relationship between remote sensing indexes and SOCS has been explained under the subheading of Important Factors for Predicting SOCS in the Study Area.
The SOCS content was high in the south, mid west,east and central parts of the study area. The results clearly stated that spatial variation of SOCS was affected by land cover differences as well as soil moisture. In particular, the SOCS content was high in areas close to the streams. High moisture content around the streams throughout a year compared to the other areas in the dry season causes continues organic carbon accumulation and a slow decomposition of organic matter in such fields. Zhang et al. (2021) reported that soil erosion caused by rainfall and streams cause the accumulation of sediments and organic matter in low-lying areas, which is also consistent with our findings. These results showed that the input parameters used affect the significance of the estimation variance in the whole study area and therefore, significantly contribute to the spatial variation of the SOCS.
Since the test data was not introduced to the model during the training process, the test data set was used to compare the effects of different models on the estimation of SOCS. The values of ME, MAE, RMSE, MAPE and RI calculated using the estimated and measured values are given in Table 7. The ME index, which expresses the mean bias of the predictions indicates the differences between the estimation of models and the actual values. The underestimation was − 2.41 in the GBDT and 1.67 in the MLP model. The MAE is an absolute mean and expresses the size of the estimation error produced by the model compared to the measured value. The difference between the mean SOCS value estimated by the GBDT model and the measured SOCS value was 3.94, which was lower than the that estimated SOCS value of the MLP model (Isaaks and Mohan 1989). The RMSE is commonly used to measure the bias in the mean and variance (Chicco et al. 2021) and it was higher than the MAE values in all methods. The result indicates that errors have a significant effect on spatial variability of SOCS in the study area (Goodchild et al. 1999; Liu et al. 2015). The GBDT model with a much lower RMSE (6.64) compared to the other MLP model, produced fewer errors in SOCS estimation. The RI value, indicator of the relative improvement in the RMSE, was 50%. In addition, the relative improvement recorded with the GBDT model in MAPE compared to the relative to the MLP model was 45.43 and 60.88% in MAPE. The GBDT-MAPE and MLP-MAPE values were 6.64 and 13.28%, respectively. The model obtained in this study can be considered as highly accurate estimator since the MAPE value is < 10 (Lewis, 1982). On the other hand, the MLP-MAPE value was 25.49% that makes the model as reasonably estimator (20 < MAPE < 50) (Lewis, 1982). Low ability of the MLP model is associated with the low tolerance of MAPE for right-skewed data distribution and outliers in the data set (Juan et al., 2013). The MAPE values (0.02 and 54.7%) reported by Chernova et al. (2021) in the spatial distribution model of SOCS are consistent with our findings.
Table 7
Predictive performances of the models
| ME | MAE | MAPE | RMSE | RI |
| (ton/ha) | (%) |
MLP | 1.67 | 7.22 | 25.49 | 13.28 | |
GBDT | -2.41 | 3.94 | 9.97 | 6.64 | 0.5 |
3.5. Important Factors in Estimation of Soil Organic Carbon Stocks
The importance of variables in the estimation variance is given in Fig. 8. The relative importance of different variables in estimation of SOCS based on the sensitivity analysis is given in Fig. 9. The results of sensitivity analysis revealed that the most important factor affecting the SOCS in the study area is the SRCI. The contributions of the covariates to the SOCS variance in the GBDT model were SRCI > NDVI > ARVI > GSI > GNDVI > NDWI > BI, respectively. In addition, all Sentinel 2A-based remote sensing indexes and SOCS had a significant correlation (Fig. 4). Viscarra Rossel et al. (2006, 2009) conducted a study to assess the reliability in estimation of soil properties using visible and near infrared (VNIR, 400–1200 nm) and short-wave infrared (SWIR, 1200–2500 nm) reflections with a variety of approaches, such as chemometrics techniques and specific absorption characteristics. The researchers stated that many pedovariables (texture, CaCO3 and water content, etc.) can be characterized using the reflections of soil surface in the wavelength range of 400 to 2500 nm. Similarly, Chabrillat et al. (2002) stated that the SWIR region of electromagnetic spectrum characterizes clay minerals and carbonates. Simple Ratio Clay Index, which is one of the covariates and has the most influence on the estimation variance, was calculated using the SWIR1 and SWIR2 bands of the S2A optical satellite image (Sabins 1999). Drury (1987) and Levin et al. (2007) reported a significant correlation between SWIR and soil surface clay content. Topsoil Granin Size (GSI) calculated with the red, green and blue bands of Sentinel 2A was the fourth covariate affecting the estimation variance. The GSI had a moderate negative correlation with SOCS (Fig. 4). Xiao et al. (2006) reported a significant positive relationship between GSI and sand content (R2 = 0.738) and a negative correlation between GSI and clay and silt content (R2 = 614). Similarly, a strong negative relationship (r=-0.891, R2 = 0.793 p < 0.01) was recorded between GSI and SRCI in Yuksekova plain (Fig. 8). Soil texture significantly affects the estimation variance in the spatial distribution of SOCS in the study area. Consistent with the findings of Zhong et al. (2018), a positive strong relationship was found between SOCS and clay content in a wetland such as Yüksekova. In addition, Castaldi et al. (2015) reported that clay spectral indexes had high correlations with soil moisture classes. The results of previous studies indicated that soil moisture affects the relationship between clay content and wavelengths at different moisture levels. The strong correlation between soil texture indices and SOCS in our study can be attributed to the fact that these indices were calculated using S2A bands belonging to the period when the moisture content of soils and vegetation in Yüksekova were at the lowest level.
The variability of SOCS had a strong relationship with remote sensing indexes calculated from the visible and near infrared bands of Sentinel 2A (Fig. 9). A satellite sensor detects different reflectance values due to different soil organic matter contents in a field. Because, the main chemical groups of soil organic matter and water absorption properties are located in the same spectral region of the VNIR (Gaffey et al. 1993). The importance of estimation variance of SOCS in NDVI, GNDVI and ARVI calculated with visible bands B2, B3 and B4 of Sentinel 2 can be attributed to the similar spectral behaviors of chromophore groups in the VIS region and all are strongly correlated with soil organic matter (Vasques et al., 2010). The results revealed that spectral reflectance changes in the visible, NIR and SWIR regions of satellite sensors are accurately represented in the SOCS estimation variance (Khan et al., 2020; Kweon and Maxton, 2013). In addition, vegetation indexes are affected by temporal and spatial variation of land cover and differences in land cover causes fluctuations in vegetation index values (Vijith and Dodge-Wan 2020). In this context, our results are in line with the findings of Canedoli et al. (2020) who explained the effect of different land use types, including urban areas, on SOC variation in Milan, Italy. Similarly, vegetation indexes representing different land covers significantly affected the estimation variation in our study.
The NDWI was useful in explaining the estimation variance of SOCS in the study area (Fig. 9). The NDWI was highly correlated with the water content of vegetation and soil. In addition, land surface and vegetation are good indicators of moisture change (Bernstein 2012). Kerr and Ochsner (2020) conducted a study on different land use types (oak forest and grassland) in Oklahoma to reveal the effect of soil moisture on SOC and to assess the importance of soil moisture as a variable in SOC models. Soil moisture conditions were reported to be strong predictors of SOCS in temperate grasslands. The researchers stated that soil moisture is the most important variable explaining 8 to 15% variation in SOC at all soil depths. Consistent with our findings, Hursh et al. (2017) reported a statistically significant relationship between soil moisture and SOCS and stated that soil moisture is a dominant determinant of soil respiration in certain biomes (Fig. 4 and Fig. 9).
3.6. Soil Organic Carbon Stocks Assessment using the CORINE Land Classification Classes
The greatest advantages of machine learning models is to provide spatial soil information with high accuracy and precise resolution at low cost and labor. Thus, the machine learning provides the opportunity to evaluate the variations in soil properties in a realistic way. In this part of the study, the carbon content of CORINE LC classes was evaluated by using the SOCS spatial information obtained from the SOCS map produced using the GBDT machine learning model.
The Moran Index, like Pearson correlation statistics for independent samples, is a conventional method to determine the autocorrelation (Moran 1950). Moran Index result of spatial autocorrelation (Morans I) analysis for data independence was − 0.0051. According to Adhikari et al. (2011), the selected SOCS sample means had the data independence main condition of the ANOVA (Fig. 10) (Chen et al. 2015). The Z-score (0.114) indicates significant random sampling from the GBDT SOCS map. Thus, ANOVA analysis was started with independent and unbiased samples. The p value, which is the basic assumption of one-way ANOVA, was higher than 0.05 (0.612), thus, the variances were considered as homogeneous. The difference in carbon values among the LC classes of the points sampled from the GBDT SOCS map was statistically significant. The ANOVA results showed that land cover significantly affects SOC in Yuksekova plain. Tukey's Post Hoc test, which evaluates the mean SOCS between wetland and other land cover units, created 4 statistically different subgroups (Table 8). The mean values of SOCS in pastures (38.27 Mg C ha− 1) and non-irrigated arable land (39.77 Mg C ha− 1) were similar (sig. 0.786) (Fig. 11 and Table 8). Budak et al. (2018) recorded the highest SOCS (44.33 Mg C ha− 1) in forest lands and the lowest SOCS (28.91 Mg C ha− 1) in arable fields with field crops. They reported that overgrazing in pasture lands and conventional tillage and burning of harvest residues in agricultural lands had a negative impact on SOCS in the Upper Tigris Basin. The average SOCS for continuously irrigated farmlands was 45.58 Mg C ha− 1 and it was significantly different from the mean SOCS of non-irrigated arable lands. The mean SOCS in arable lands, with significant areas of natural vegetation was 50.22 Mg C ha− 1 and this amount was significantly different from the SOCS of other land covers (p < 0.01). The mean SOCS of inland marshes LC, which represents the Nehil Marshes in the study area, was higher than the SOCS in all other land covers (Table 8 and Fig. 11). Similar to our findings, Xiaolong Wang et al. (2016) investigated the spatial distribution of SOCS in Poyang Lake, one of the two important wetlands in China, and reported significant SOCS variation along the land cover gradient. Yu et al. (2016), indicated significant differences in SOCS of cultivated fields and wetlands in Yellow River Delta (China). They reported that SOCS was highly affected by anthropogenic activities. On the other hand, the researchers reported that SOCS decreased as wetlands transformed into paddy fields. The reduced organic carbon mineralization in the natural wetlands due to the low or none anthropogenic activities causes a high SOC content (Babauta et al. 2012). Previous studies have reported that increased decomposition in drylands compared to anaerobic conditions in wetlands results in a lower amount of SOCS (DeBusk and Reddy 1998; Sigua et al. 2009). In the study area, conventional tillage methods, which have a negative impacts on SOCS, especially in arable agricultural areas should not be used anymore, and conservational tillage methods should be adapted to the region. The rangelands should be improved to both preserve the existing SOCS and increase the carbon content during the reclamation process.
The findings of this study underline the importance of wetlands for terrestrial carbon storage and mitigation of global warming. Conversion and degradation of wetlands to arable lands reduce organic carbon storage potential of these lands. This study provided important spatial information for carbon accumulation in the Nehil Marshes, and the results drew attention to the conservation of wetlands to reduce CO2 emissions.
Table 8
Land cover subgroups SOCS group differences
LC | SOCS Means |
Pastures | 38.27 | | | |
Non-irrigated arable land | 39.77 | | | |
Complex cultivation patterns | | 42.90 | | |
Permanently irrigated land | | 45.58 | | |
Land principally occupied by agriculture, with significant areas of natural vegetation | | | 50.22 | |
Inland marshes | | | | 61.46 |
Sig. | 0.786 | 0.189 | 1.000 | 1.000 |