Improvement of spatial estimation for soil organic carbon stocks in Yuksekova plain using Sentinel 2 imagery and gradient descent–boosted regression tree

Carbon sequestration in earth surface is higher than the atmosphere, and the amount of carbon stored in wetlands is much greater than all other land surfaces. The purpose of this study was to estimate soil organic carbon stocks (SOCS) and investigate spatial distribution pattern of Yuksekova wetlands and surrounding lands in Hakkari province of Turkey using machine learning and remote sensing data. Disturbed and undisturbed soil samples were collected from 10-cm depth in 50 locations differed with land use and land cover. Vegetation, soil, and moisture indices were calculated using Sentinel 2 Multispectral Sensor Instrument (MSI) data. Significant correlations (p≤0.01) were obtained between the indices and SOCS; thus, the remote sensing indices (ARVI 0.43, BI −0.43, GSI −0.39, GNDI 0.44, NDVI 0.44, NDWI 0.38, and SRCI 0.51) were used as covariates in multi-layer perceptron neural network (MLP) and gradient descent–boosted regression tree (GBDT) machine learning models. Mean absolute error, root mean square error, and mean absolute percentage error were 3.94 (Mg C ha −1), 6.64 (Mg C ha−1), and 9.97%, respectively. The simple ratio clay index (SRCI), which represents the soil texture, was the most important factor in the SOCS estimation variance. In addition, the relationship between SRCI and Topsoil Grain Size Index revealed that topsoil clay content is a highly important parameter in spatial variation of SOCS. The spatial SOCS values obtained using the GBDT model and the mean SOCS values of the CORINE land cover classes were significantly different. The land cover has a significant effect on SOC in Yuksekova plain. The mean SOCS for continuously ponded fields was 45.58 Mg C ha−1, which was significantly different from the mean SOCS of arable lands. The mean SOCS in arable lands, with significant areas of natural vegetation, was 50.22 Mg C ha−1 and this amount was significantly higher from the SOCS of other land covers (p<0.01). The wetlands had the highest SOCS (61.46 Mg C ha−1), followed by the lands principally occupied by natural vegetation and used as rangelands around the wetland (50.22 Mg C ha−1). Environmental conditions had significant effect on SOCS in the study area. The use of remote sensing indices instead of using single bands as estimators in the GBDT algorithm minimized radiometric errors, and reliable spatial SOCS information was obtained by using the estimators. Therefore, the spatial estimation of SOCS can be successfully determined with up-to-date machine learning algorithms only using remote sensing predictor variables. Reliable estimation of SOCS in wetlands and surrounding lands can help understand policy and decision makers the importance of wetlands in mitigating the negative impacts of global warming.


Introduction
The amount of organic carbon stored at a particular location is termed as carbon stock, and carbon stock in wetlands indicates organic carbon stored in flooded soils, living plant biomass and litter (Pearse et al. 2018). Wetlands cover approximately 6% of earth surface (Lehner and Döll 2004), and, and are important for the regulation of global climate change. In addition to the role in carbon sequestration, the wetlands provide many ecosystem services such as providing freshwater, improving water quality, being habitat for biodiversity (IFAD 2021), mitigating flood hazards, recharging aquifers, and producing food and energy for human use (Boutwell and Westra, 2016). Carbon sequestration potential of coastal wetlands per unit area is estimated approximately as 3 to 50 times higher compared to that of rainforests (Howard et al. 2017) due to the greater productivity coupled with slow decomposition of organic matter under anaerobic condition.
Sustainability of carbon sequestration in wetlands is highly dependent on the health and productivity of wetlands (Lane et al. 2016). High rates of net primary productivity and low rates of decomposition in wetlands are dependent on water at surface or within the root zone, anaerobic soil conditions, and a biota adapted to flooding (Mitsch et al. 2013). The wetlands and organic carbon stored in wetlands are highly sensitive to climate change (Ma et al. 2016), and anthropogenic interference. In the last 50 years, degradation of wetlands due to climate change and human impact disturbed large amounts of previously stored soil carbon (Crowther et al. 2016). Despite the essential role in regulation of many ecosystem services, wetlands are being degraded and approximately 35% of wetlands in the world were lost between 1970 and 2015 (Pendleton et al. 2012).
Soil organic carbon (SOC) is the most important component determining the physical, chemical, and biological functions of the soil, and therefore the information on content and spatial distribution of SOC is useful for sustainable management of soils (Mirchooli et al. 2020). Estimating storage and spatial distribution of SOC in wetlands and surrounding lands is important to show the policy makers that the wetlands are under severe threat of degradation due to artificial drainage, unsustainable use, and upstream soil erosion. The decomposition of organic matter under anaerobic conditions in wetlands is much slower than that of aerobic conditions (Clarkson et al. 2014). The amount of SOC stock within a wetland will depend on the locations. The SOC content of a location within a wetland with a longer inundation period will be significantly higher than the outer edge and high-water table (Bernal and Mitsch 2012). The information on spatial variability of SOC with in a wetland determines the number of samples needed to accurately estimate the carbon storage potential of the whole wetland (Pearse et al. 2018). Some studies have been carried out to determine the spatial distribution of nutrients (Gao et al. 2019) and heavy metals in wetland soils (Ramachandra et al. 2018;Wang et al. 2019). However, the studies investigated the spatial distributions of SOC in the upper most soil layers of wetlands and surrounding lands are very rare.
Chemometric methods are generally used to determine organic carbon contents of soils in a region. However, the collection and analysis of large numbers of soil samples for a region are labor-intensive and time-consuming (Forkuor et al. 2017;Loiseau et al. 2019). Spatial distribution of SOC is significantly correlated with environmental conditions (such as elevation, vegetation type, precipitation, and temperature) and previous studies indicated that variability in soil formation factors affects the spatial distribution of SOC (Keskin et al. 2019). Therefore, optical satellite images have been widely used in digital mapping of soil characteristics with the rapid development of remote sensing technology (Amani et al. 2018). This approach is an operative method to reduce the costs of field sampling and analysis needed for spatial and temporal monitoring, analysis, and management of SOC content.
The spatial and spectral resolution of a remote sensing image significantly affect the prediction accuracy of soil properties. Several studies reported that SOC content is sensitive to the visible near-infrared (VNIR) and short-wave infrared (SWIR) (400-2500 nm) region (Chen et al. 2000;Viscarra Rossel et al. 2006). The SOC content may cause differences in the spectral reflectance properties of the soils (Hong et al. 2022). Therefore, many studies have been conducted to investigate the reliability of models in estimation of soil properties using sensitive bands of remote sensing images or several soil and vegetation spectral indexes. Most researchers used normalized difference vegetation index (NDVI) as a covariate, while vegetation density is the major limitation of NDVI in estimation of soil properties, because the spectral signal of soil color is significantly affected when the vegetation density is between 45 and 70% (Huete and Jackson 1988). On the other hand, the NDVI reduces the noise from cloud shadows and topographic level differences (Huete et al. 2002;Kawamura et al. 2005). In addition, the atmospherically resistant vegetation index (ARVI) was developed to eliminate or reduce confusion caused by the atmospheric particles using the blue band to make corrections on the red band (Kaufman and Tanre 1992;Liu et al. 2002). Green normalized difference vegetation index (GNDVI) has a high sensitivity to chlorophyll and reduces non-photosynthetic effects (Wu 2014); therefore, the discrimination potential of GNDVI is needed under heterogeneous vegetation in a landscape (Funghi et al. 2020).
Linear regression was initially used in SOC estimation with spectral data, and later multiple linear regression and geographically weighted regression models were used (Meersmans et al. 2008;Doetterl et al. 2013;Kumar et al. 2013). The implementation and interpretation of linear models are easy, while the SOC and covariates have complex and non-linear relationships. Data mining and machine learning techniques presented favorable opportunities to set up nonlinear relationships between several soil physical, chemical and biological properties, and remotely sensed data. Artificial neural networks (Mishra et al. 2010), classification regression trees (Keskin et al. 2019), random forest (Zeraatpisheh et al. 2019), and gradient-boosted decision trees (Pham et al. 2021) are the commonly used data mining and machine learning techniques, of which tree-based model algorithms had reliable estimation results (Zhou et al. 2020). Random forest is an extension of classification regression trees that can effectively control the risk of overfitting and has been proven to be superior in nonlinear relationships (Xie et al. 2022). Random forest consists of many decision trees, and classification regression trees are usually selected from the decision trees. This new approach, called gradient-boosted decision tree, consists of many decision trees (regression trees) and is a promising technique with high prediction success (Pham et al. 2020),. When training a tree, the training data is the residual (a gradient of the objective function) of the trained model (pre-trained tree). The approach uses a regression tree to optimize the objective function (Chen et al. 2021). In this study, gradient-boosted decision tree machine learning model was chosen to estimate SOC content in Yuksekova wetland and surrounding lands. The study area consists of a wide variety of land cover, including the Nehil Marshes, meadows, rangelands, and agricultural fields. Linear and nonlinear estimation models have been compared extensively in the literature. Contrary to previous studies, the reliability of two different non-linear models (multi-layer perceptron and gradientboosted decision trees) in estimation of SOC have been compared in this study.
Similar to many wetlands in other parts of the world, 291.339 (21.2%) ha of wetlands in Turkey have been drained from 1920 to 2014 due to the intention of combating malaria (Tuğluoğlu 2008;Ataol and Onmuş 2021) and land clearing for agricultural production. Yuksekova wetland is the largest high-altitude wetland of Turkey, and located in south east corner of Turkey. The major cause for the degradation of wetlands is creating new agricultural lands. The changes in SOC content following the cultivation of previous wetlands have attracted the interest of researchers from various disciplines (Xu et al. 2019a). However, the spatial distribution of SOC in Yukseova wetlands and surrounding areas, even in any other wetlands in Turkey has not been studied. Therefore, no information is available about SOC stocks of Yuksekova wetland. The objectives of this study are to determine the spatial distribution of soil organic carbon stock in Yuksekova plain, which has various land covers/land uses and homogeneous climate and topography, with different machine learning models; to investigate the ability of machine learning algorithms representing the spatial variation of soil organic carbon stock in Yuksekova plain; to reveal the effects of land use/land cover change on soil organic carbon stock, and to assess the effects of environmental variables represented by radiometric remote sensing indices on the estimated soil organic carbon stock variance.

Study area
The study area, known as Yuksekova plain, is located between 37°25′49″ -37°36′32″ N latitudes and 44°04′21″ -44°22′56″ E longitudes. Average altitude of the study area is 1950 m and the study area approximately covers 19.5 thousand ha land (Fig. 1). The plain is surrounded by high mountains and the majority of the plain was wetland in the 1950s. Unfortunately, the coverage area of the wetland significantly decreased due to the decreased precipitation and increased anthropogenic pressures. The entire Yuksekova plain (21.533 ha) was recognized as a wetland of national importance by the Ministry of Agriculture and Forestry of Turkey in 2015 (Anonymous 2019); however, the coverage area of wetlands decreased to 2755 ha in 2021. Today, majority of previous wetland area have been converted to meadow, grasslands, and cultivated fields where mostly wheat and alfalfa, chickpea, and vetch are grown. Long-term (1979Long-term ( -2018 average annual precipitation is 762 mm, and the annual average evaporation is 850 mm. Long-term average annual temperature is 6.9°C, with the lowest in January (−9.3°C) and the highest in July (21.3°C). The summer season in Yuksekova is very short and the winter season is quite long. The study area, which has a snow cover of approximately 1 m, remains covered with snow for an average of 120 days.
Soils in the Yuksekova plain are in Histosols, Mollisols, Alfisols, Inceptisols, and Entisols ordos (Soil Survey Staff, 2014). The geology of the wetland and its surroundings has been defined as silt, loam, gravel, coarse-fine clastics, and clay units. The basin, in which the study area is located, was filled with materials carried by the rivers, which travels from a route where layers of Eocene flysch and clay-radiolarite facies were located. The rocks at the edges of the study area are highly folded and have a metamorphic character. The thickness of peat in the wetland ranges from 0.5 to 12 m. Gyttja is found at the shore and lower layers of the wetland (Sahinoglu and Ozdemir 2019). The wetlands where the gyttja occurs are rich in nutrients. The gyttja is a clay-textured, plastic, and jellylike material when it is wet, and the gyttja becomes very hard when it is dry (Kroetsch et al. 2011) (Fig. 2).
The vegetation in grasslands is dominated by

Soil sampling
Fifty disturbed and undisturbed soil samples were collected from 0 to 10 cm depth to determine soil organic carbon stock in the Yuksekova plain between September 21 and October 03, 2020. The location of soil samples were determined considering changes in land use and parent material. The exact coordinates of the sampling points were recorded using a GPS (global position system) device in the field.

Soil analysis
Disturbed soil samples were dried in room temperature, and passed through 2 mm sieve for laboratory analysis. Soil organic carbon (SOC) content was determined with the modified Walkley-Black method (Nelson and Sommers 1996), and bulk density (Bd) was determined using undisturbed soil samples by the cylinder method (Blake and Hartge 1986). The amount of soil organic carbon stock (SOCS) was determined using the equation 1 introduced by Zhang and Hartemink (2017): where Bd is the bulk density of soil (Mg m -3 ), D is the soil depth (m), SOC is the soil organic carbon (g kg -1 ), and A is the area (ha: 10 4 m 2 ).

Processing of remote sensing data and soil and vegetation indexes
Sentinel-2A imagery dataset for June and September 2020 were downloaded from the European Space Agency (ESA) Fig. 1 Location of the study area and sampling points as remote sensing data to calculate soil and vegetation indexes for spatial modeling of SOC content. The Sentinel-2A imagery is available at the Sentinel Scientific Data Hub freely (https:// scihub. coper nicus. eu) (ESA 2021). The Semi-Automatic Classification Plugin, which is an add-on to the QGIS software, was used for image preprocessing. The Sentinel 2A satellite imagery was downloaded using the SCP. Bands (11 and 12) with a resolution of 20 m in the dataset were resampled to a resolution of 10 m with the nearest neighbor method. In this new dataset, the boundary of study area was cut for spectral index calculations (Congedo 2021).
Sentinel 2A SWIR (B11: SWIR1, B12: SWIR2), NIR (B8), red (B4), green (B3), and blue (B2) bands were used to calculate the soil and vegetation indexes presented in Table 1. The month of June (June 25, 2020) with the highest NDVI values was chosen to calculate the radiometric vegetation indexes. On the other hand, the end of September (September 23, 2020) with the lowest NDVI value was chosen to evaluate soil properties (Bousbih et al. 2019).
Different bands or band combinations of remote sensing images provide information on various soil properties such as soil moisture, organic carbon content, and soil texture   (He et al. 2021). Spectral indexes are operative in reducing spectral reflection errors while creating SOC estimation models. Previous studies have proven that spectral indexes are useful in estimation of SOC content of soils (Jin et al. 2016). The spectral indexes used as covariates in the spatial distribution models of SOCS were calculated using the equations in Table 1. The normalized difference vegetation index (NDVI) expresses the normalized difference in reflectance between the near-infrared (NIR) and visible red bands. The NDVI is useful index in the assessment and monitoring of important global biomasses and has been widely used in mapping of SOC content (Tucker 1979;Yengoh et al. 2016;Shafizadeh-Moghadam et al. 2022). The green normalized difference vegetation index (GNDVI) is calculated similar to the NDVI, except that the green band is used in GNDVI instead of the red band (Gitelson et al. 1996). The atmospherically resistant vegetation index (ARVI) uses the difference in brightness of blue and red channel to correct the brightness in red channel. The ARVI is resistant to atmospheric interferences due to self-correction process (Somvanshi and Kumari 2020). Moisture indexes reflect the moisture status of soils. Normalized difference water index (NDWI) has been proven to be significantly correlated with soil moisture (Casamitjana et al. 2020) The NDWI is calculated from the reflectance values of the NIR and shortwave infrared spectrum region reflecting the changes in water content and spongy-like mesophyll in plant canopies (Gao 1995).
Grain Size Index (GSI) was calculated using the red, blue, and green bands of Sentinel 2A to evaluate the surface soil texture (Xiao et al. 2006). Simple ratio clay index (SRCI) for the study area was calculated using late September images, because the highest sensitivity of B6 and B7 bands to clay content was recorded when the moisture content and vegetation are at the lowest density (Bousbih et al. 2019). The brightness index (BI) is used to evaluate the brightness or darkness of the surface. The BI is often used for mapping soil properties such as soil texture, salinity, and moisture (Forkuor et al. 2017;Asfaw et al. 2018). The BI combines the information from the red and near-infrared bands of the S2A optical satellite image, represented by the B4 and B9 bands.

CORINE land cover (CLC)
Coordination of Information on the Environment (CORINE) Land Cover/Use Classification is land cover/use data obtained by visual interpretation of remote sensing images. The main purpose of the CORINE project was to manage the same basic data and to create a database in order to set standards for determining environmental changes in the land, rational management of natural resources and establish environmental policies in all European Environment Agency (EEA) member countries in line with the criteria and classification system determined by the EEA (Anonymous 2015). CORINE Land Cover (CLC) 2018 data was obtained from Copernicus Land Monitoring Service in vector format . The area covering Yuksekova plain was cut in ArcMap 10.8 and dissolved process was applied for the generalization. Thus, vector data with the same land cover in different locations were aggregated according to their attributes. The ESRI GeoDatabase format map of Version 2020 20u1 (dated 02/2020) is shown in Fig. 3.
The raster prediction map produced by the MLP-ANN or GBDT machine learning algorithms, the model with the best accuracy performance, was used to determine the spatial distribution of SOCS according to CLC classes. Spatial autocorrelation was eliminated to evaluate average SOCS among the CLC classes. Spatial analysis of variance (ANOVA) was carried out using the method introduced by Chen et al. (2015). In addition, Moran's index was calculated in ArcGIS 10.8 for spatial autocorrelation analysis (Moran 1950). The ANOVA was carried out following the confirmation of no spatial autocorrelation among the SOCS sampled according to the CLC classes. The main purpose of analyzing spatial autocorrelation is to eliminate the effects of nearby spatial units to satisfy one of the basic conditions of ANOVA. This process filters out the spatial autocorrelation and transforms the spatial data into ordinary data, thus, ensures that the data are independent of each other .

Multilayer perceptron network spatial modeling approach
Multilayer perceptron (MLP) architecture is the universal estimation algorithm of feed forward artificial neural network (ANN) (Falahatkar et al. 2016). The MLP was introduced to improve the performance of Backpropagation neural network (BPNN), and is widely used for modeling of soil characteristics using remote sensing data (Gupta et al. 2016). A typical MLP architecture includes input, output, and at least one hidden layer. Generally, the input and output layers receive the data to make an estimation, while the hidden layer continues to store the model parameters (weight and bias) (Gruszczyński 2019). Therefore, the MLP is very useful as it allows the definition of hidden layers and the number of neurons in the hidden layer. Additionally, the MLP allows the use of the dropout rate function to reduce overfitting and improve estimation accuracy (Xu et al. 2019b).
The MLP network architecture was constructed in the Matlab R2021a with one input, two hidden, and one output layers. In addition, the network was trained by optimizing the weight and bias values using the Levenberg-Marquardt training algorithm and a feed-forward backpropagation network (Samui and Sitharam 2011;Liew et al. 2022). In determining the number of hidden neurons, the number of neurons was determined considering the root mean square error value starting from 1 neuron (Sergeev et al. 2019). Hyperbolic tansig activation function (Eq. 2) was used in hidden layers to model nonlinear relationships between remote sensing estimators and the SOCS data.
In the equation, x represents the corresponding input and is defined as the "tansig," as well as the transfer function (Vogl et al. 1988).

Gradient-boosted decision tree
Gradient-boosted decision tree (GBDT) is a descent-based tree model that performs regression tasks by performing iterative operations. Each decision tree in the GBDT is trained in sequential order, and iteration aims to fit residual values in sequential order of the trees (Friedman 2002). The GDBT has been well suited for spatial distribution models of soil properties due to highly flexible and adaptable nature (Xie et al. 2022). The GDBT machine learning algorithm was implemented with the fit ensemble function in the Matlab environment. Grid search strategy was carried out on the gradient boosting strategy applied for the least squares (LSBoost). Hyperparameters are usually used to develop the LSBoost. Bayesian optimization is an algorithm well suited to optimize the hyperparameters of − 1 classification and regression models (MathWorks 2020); therefore, the Bayesian optimization was used to optimize the hyperparameters. The parameters are number of learning cycles, learn rate, minimum leaf size, minimum parent size, and maximum number of splits. All parameters were optimized to obtain a reliable estimation (Breiman 2001;. The importance of a variable in the GBDT algorithm was calculated by the average importance of the j characteristics in a single tree using the Equation 3 (Xie et al. 2022).
In the equation, M is the number of tree. The significance of variable j in a single tree was calculated using Equation 4: In the equation, L is the number of leaf nodes of the tree, L-1 is the number of non-leaf nodes, and v t is the associated property at node t. ́l 2 t t is the decrease in quadratic loss after dividing the node t. When this value is large, the loss of this node is desired to be reduced, because the model gains greater estimation ability with this loss.

Error criteria for interpolation method
The SOCS values determined by the final output of the network and laboratory analyzes were used to assess the accuracy of different spatial interpolation models. The performances of models were assessed using the error metrics given in Table 2.

Results and discussion
Training and validation datasets used in the estimation of SOCS in surface soils are given in Table 3. The range of entire SOCS dataset was between 17.41 and 87.66 (Mg C ha −1 ), with a mean value of 43.86 (Mg C ha −1 ), a median of 36.19 (Mg C ha −1, ) and a standard deviation of 20.04 (Mg C ha −1 ). The measures of central tendency indicated the similarity of complete dataset and the train and test datasets. The samples used in the training and test datasets are representative of the entire SOCS dataset; thus, the number of samples was adequate for development and validation of an accurate estimation model.

Analysis of remote sensing-based indexes
Descriptive statistics of remote sensing indexes calculated using the relevant bands of the Sentinel 2A optical satellite image are given in Table 4 and the correlation coefficients between SOCS and remote sensing indexes are given in Fig. 4. The ARVI with a mean value of 0.36 ranged between −0.24 and 0.93. Significant positive correlation (r = 0.43; p<0.01) was recorded between ARVI and SOCS (Fig. 4b).
The mean value of NDVI was 0.34, the smallest value was −0.43 and the highest value was 0.68. The mean values of NDVI and ARVI calculated with the NIR and Red bands of Sentinel 2A were quite close to each other. Less sensitivity of ARVI to atmospheric effects compared to the NDVI may cause the difference in minimum and maximum values (Kaufman and Tanre 1992). A moderately significant positive correlation (r=0.44; p<0.01) was found between NDVI and SOCS values ( Fig. 4a and 4b). The NDVI values indicated a dense vegetation from south to northwest of the study area, and the middle part of the study area was characterized by lower NDVI values (Fig. 5). The NDWI values ranged from −0.44 to 0.78, with a mean value of 0.33. The SRCI and GSI are the covariates that provide information on soil texture. The SRCI ranged from 0.8 to 2.34, with a mean value of 1.64. The GSI value was between −0.49 and 0.04 and the mean value was −0.20. The minimum values obtained by SRCI are in line with Bousbih et al. (2019). The correlation coefficient values between SOCS and SRCI and GSI were 0.51 and −0.39, respectively ( Fig. 4a and 4b).

Artificial neural network modeling
Optimal number of neurons in the hidden layer is critical to the development of a MLP structure (Sergeev et al. 2019).  The number of neurons in the hidden layers was increased from 1 to 30 (Küçüktopcu and Cemek 2021). The lowest RMSE and MAPE values in the test dataset were obtained at 15 and 16 neurons in hidden layers 1 and 2, respectively (Fig. 6).
The results of ANN-MLP model architecture created to estimate the spatial distribution of SOCS are given in Table 5. The lowest RMSE and MAPE values were obtained when the activation function of both hidden layers was hyperbolic tangent sigmoid transfer (tansig) and output layer transfer function was linear transfer function (purelin). Similarly, the lowest RMSE and MAPE in the MLP architecture were recorded using the training algorithm of Levenberg-Marquardt backpropagation training function (trainlm) and the learning algorithm of Gradient descent with momentum weight and bias learning function (learngdm) (Mathworks 2020). Coefficient of determination (R 2 ) was 1 and 0.628 in the training and test dataset, respectively. The number of input variables used in the MLP architecture and the statistical properties of the data are very important in the learning algorithm. Similar to this study, Falahatkar et al. (2016), who modeled the spatial estimation of SOC content in surface soils with the MLP using remote sensing data, determined the number of hidden neurons as 12-12 using the Sinusoidal Activation Function in the hidden layers. The coefficient of determination (R 2 ) values in the training and test datasets were 0.75 and 0.61, respectively. The difference in model architectures can be attributed to the number of samples and the coefficient of variation of the data, as well as the type of activation function.
The R 2 value provides important information on the regression between the measured and the predicted values. Zhang et al. (2021) modeled the spatial distribution of soil organic matter using all bands of Sentinel 2A and the vegetation indexes calculated from the remote sensing bands. The coefficient of determination using only the bands was R 2 0.56, while it was 0.65 using vegetation indexes. In this study, the R 2 (0.628, test) value was consistent with those reported in the previous studies. In addition, the results revealed that spectral index variables can be used as effective input variables in SOCS estimation by reducing spectral reflection errors.

Gradient-boosted decision tree
This chapter contains the gradient-boosted decision tree spatial distribution model of SOCS using the hyperparameters optimized Lsboot (least-squares boosting) with Bayesian optimization. The coefficients of determination and optimized hyperparameters for training, validation, and all data are given in Table 6. The R 2 values for the training, testing, and all datasets were 0.899, 0.814, and 0.823, respectively. In the GDBT model conditions, min parent size was 10, min leaf size was 5, learn rate was 0.297, number of learning cycles was 11, and max number of splits was 3. Breiman et al. (2017), Loh and Shih (1997), and Tahraoui et al. (2022) stated that the gradientboosted decision tree optimization conditions should be within the limits of 1<max number splits<99, 1< min leaf size <50, and 1< min parent size <10. In these conditions, the estimation accuracy using the GBDT model with the LSBoost algorithm and the optimization of the hyperparameters with Bayesian optimization was successful.
In addition to the Bayesian optimization used to improve hyperparameters, k-fold optimization is an algorithm adopted in GBDT or regression tree machine learning methods. Zeraatpisheh et al. (2019) modeled the spatial distribution of SOC with remote sensing data using regression tree, optimizing with 10-fold validation and calibration. The calibration improved the RMSE value by 42% and R 2 by 170%. The R 2 (0.93) value reported in the post-optimization test dataset was compatible with our finding.
The use of the STEP-AWBH model (S: Soil, T: Topography, E: Ecology, P: Parent material, A: Atmosphere, W: Water, B: Biota and H: Human), which takes both anthropogenic and natural factors into account, has been proposed to determine soil and space-time interactions (Grunwald et al. 2011;Thompson et al. (2012). The topography of Yuksekova plain is quite homogeneous; therefore, the variation in topographic index values produced from digital elevation models was very low. In addition, there is no local meteorological station in the region to obtain detailed climate data. The resulting precipitation and temperature estimation maps produced from meteorological stations around the Yuksekova district were uniform. Thus, radiometric remote sensing indices were used in estimation of soil organic carbon stock of Yuksekova plain. This study provides important clues in assessing the impact of collinearity on GBDT estimation results. The boosted regression tree models of GBDT are robust to collinear predictors; therefore, the most commonly used predictors (Zhang et al. 2016) which have ability to interpret the effects of covariates affecting SOCS on the estimation variance of SOCS predictors were used in the study. Likewise, Wang et al. (2021) used remote sensing data for spatial estimation of SOC content in the Wanzhou Region of China. The researchers used similar remote sensing indices as in the current study and the R 2 value obtained with XGBoost (eXtreme Gradient Boosting) was reported between 0.86 and 0.74. In this context, the R 2 (0.814) value obtained with GBDT is consistent with the literature (Table 6). Zhang et al. (2016) reported that the GBDT model inherently reduces the effect of collinearity, which is quite common among radiometric remote sensing indices, and therefore, successful predictions can be obtained using GBDT even if the predictors have collinearity.
The study area has a homogeneous topography and climatic dataset was limited for the study area. Therefore, the interpolation maps obtained from the climate database did not show a spatial variation. The GBDT performance was higher than the regression tree considering that the presented study was carried out under limited covariates data conditions. In addition to optimization, this phenomenon is related to the use of gradient descent approach in the LSBoost and regression model at end nodes. Thus, the GBDT is capable of generating a range of estimations. The conventional regression tree end nodes contain a fixed value (Malone et al. 2014;Alajmi and Almeshal 2020). The GBDT also fits the error residual of the previous tree into a decision tree. Therefore, every new tree in the community is focused on reducing the error made by the previous weak learner rather than predicting the target directly (Inria 2022;Tahraoui et al. 2022). The aforementioned structure of the GDBT is very useful for spatial distribution models of soil properties, that are highly affected by environmental factors such as organic carbon and have high spatial variation, under the number of covariates which is limited.

The performances of different estimation models
The SOCS map obtained with the GBDT and MLP estimation models are shown in Fig. 7 (Table 3).
The maps produced using the models were compatible with the fact that the Nehil Marshes, located in the South of Yuksekova and have the highest SOCS values. The remote sensing indexes of BI and GSI had a moderate negative correlation with the SOCS (Fig. 3). High SOCS content in Nehil Marshes shown in both model maps, where the BI and GSI values are the lowest, coincide with the laboratory analysis. In addition, the relationship between remote sensing indexes and SOCS has been explained under the subheading of Important Factors for Predicting SOCS in the Study Area.
The SOCS content was high in the south, mid-west, east, and central parts of the study area. The results clearly stated that spatial variation of SOCS was affected by land cover differences as well as soil moisture. In particular, the SOCS content was high in areas close to the streams. High moisture content around the streams throughout a year compared to the other areas in the dry season causes continues organic carbon accumulation and a slow decomposition of organic matter in such fields. Zhang et al. (2021) reported that soil erosion caused by rainfall and streams cause the accumulation of sediments and organic matter in low-lying areas, which is also consistent with our findings. These results showed that the input parameters used affect the significance of the estimation variance in the whole study area and therefore, significantly contribute to the spatial variation of the SOCS.
Since the test data was not introduced to the model during the training process, the test dataset was used to compare the effects of different models on the estimation of SOCS. The values of ME, MAE, RMSE, MAPE, and RI calculated using the estimated and measured values are given in Table 7. The ME index, which expresses the mean bias of the predictions, indicates the differences between the estimation of models and the actual values. The underestimation was −2.41 in the GBDT and 1.67 in the MLP model. The MAE is an absolute mean and expresses the size of the estimation error produced by the model compared to the measured value. The difference between the mean SOCS value estimated by the GBDT model and the measured SOCS value was 3.94, which was lower than the that estimated SOCS value of the MLP model (Isaaks and Mohan 1989). The RMSE is commonly used to measure the bias in the mean and variance (Chicco et al. 2021) and it was higher than the MAE values in all methods. The result indicates that errors have a significant effect on spatial variability of SOCS in the study area (Goodchild et al. 1999;Liu et al. 2015). The GBDT model with a much lower RMSE (6.64) compared to the other MLP model, produced fewer errors in SOCS estimation. The RI value, indicator of the relative improvement in the RMSE, was 50%. In addition, the relative improvement recorded with the GBDT model in MAPE compared to the relative to the MLP model was 45.43 and 60.88% in MAPE. The GBDT-MAPE and MLP-MAPE values were 6.64 and 13.28%, respectively. The model obtained in this study can be considered as highly accurate estimator since the MAPE value is <10 (Lewis, 1982). On the other hand, the MLP-MAPE value was 25.49% that makes the model as reasonably estimator (20<MAPE<50) (Lewis, 1982). Low ability of the MLP model is associated with the low tolerance of MAPE for right-skewed data distribution and outliers in the dataset (Juan et al., 2013). The MAPE values (0.02 and 54.7%) reported by Chernova et al. (2021) in the spatial distribution model of SOCS are consistent with our findings.

Important factors in estimation of soil organic carbon stocks
The importance of variables in the estimation variance is given in Fig. 8. The results of sensitivity analysis shown in Fig. 4. a Correlation coefficients between SOCS and remote sensing indexes (parameters with significant correlation at the 0.01 level are marked with *); b scatterplots for SOCS and predictors and between predictors.
◂ Fig. 5. Remote sensing based indexes used as covariates Fig. 9 revealed that the most important factor affecting the SOCS in the study area is the SRCI. The contributions of the covariates to the SOCS variance in the GBDT model were SRCI>NDVI>ARVI>GSI>GNDVI>NDWI>BI, respectively. In addition, SOCS had a significant correlation with all Sentinel 2A-based remote sensing indexes (Fig. 4). Viscarra Rossel et al. (2006 determined the reliability of estimation for soil properties using visible and near-infrared (VNIR, 400-1200 nm) and short-wave infrared (SWIR, 1200-2500 nm) reflections with a variety of approaches, such as chemometrics techniques and specific absorption characteristics. The researchers stated that many pedovariables (texture, CaCO 3 and water content, etc.) can be characterized using the reflections of soil surface in the wavelength range between 400 and 2500 nm. Similarly, Chabrillat et al. (2002) stated that the SWIR region of electromagnetic spectrum characterizes clay minerals and carbonates. Simple Ratio Clay Index, which is one of the covariates and has the highest influence on the estimation variance, was calculated using the SWIR1 and SWIR2 bands of the S2A optical satellite image (Sabins 1999). Drury (1987) and Levin et al. (2007) reported a significant correlation between SWIR and clay content of surface soils. Topsoil granin size (GSI) calculated with the red, green, and blue bands of Sentinel 2A was the fourth impotant covariate affecting the estimation variance. The most important factors affecting soil aggregation are the silt and clay content and the surface area of mineral soil particles (Kiem et al. 2002;Ge et al. 2019). Gonzalez and Laird (2003) reported that chemical stabilization of organic molecules is provided by the binding of mineral and organic molecules with each other through soil mineral colloids. The retention of organic carbon on surfaces of clay minerals and in small aggregates prevents the decomposition by microorganisms; thus, a higher amount of C is stored in well aggregated soils Singh et al. 2018). Schweizer et al. (2021), who investigated the effect of clay content on OC content of soils at Scheyern research station in Munich, Germany, reported that more particulate organic carbon was stored with increasing clay content in soils.
The GSI had a moderate negative correlation with SOCS (Fig. 4). Xiao et al. (2006) reported a significant positive relationship between GSI and sand content (R 2 =0.738) and a negative correlation between GSI and clay and silt content (R 2 =614). Similarly, a strong negative relationship (r=−0.891, R 2 =0.793 p<0.01) was recorded between GSI and SRCI in Yuksekova plain (Fig. 8). The results showed  that soil texture significantly affects the estimation variance in the spatial distribution of SOCS in the study area. Consistent with the findings of Zhong et al. (2018), a positive strong relationship was found between SOCS and clay content in a wetland such as Yuksekova plain. In addition, Castaldi et al. (2015) reported that clay spectral indexes had high correlations with soil moisture classes. The results of previous studies indicated that soil moisture affects the relationship between clay content and wavelengths at different moisture levels. The strong correlation between soil texture indices and SOCS in our study can be attributed to the fact that these indices were calculated using S2A bands belonging to the Fig. 7. The SOCS maps obtained using the MLP and GBDT spatial distribution models period when the moisture content of soils and vegetation in Yuksekova were at the lowest level. The variability of SOCS had a strong relationship with remote sensing indexes calculated from the visible and near-infrared bands of Sentinel 2A (Fig. 9). A satellite sensor detects different reflectance values due to different soil organic matter contents in a field. Because, the main chemical groups of soil organic matter and water absorption properties are located in the same spectral region of the VNIR (Bishop 2019). The importance of estimation variance of SOCS in NDVI, GNDVI, and ARVI calculated with visible bands B2, B3, and B4 of Sentinel 2 can be attributed to the similar spectral behaviors of chromophore groups in the VIS region and all are strongly correlated with soil organic matter (Vasques et al., 2010). The results revealed that spectral reflectance changes in the visible, NIR, and SWIR regions of satellite sensors are accurately represented in the SOCS estimation variance (Khan et al., 2020;Kweon and Maxton, 2013). In addition, vegetation indexes are affected by temporal and spatial variation of land cover and differences in land cover causes fluctuations in vegetation index values (Vijith and Dodge-Wan 2020). In this context, our results are in line with the findings of Canedoli et al. (2020) who explained the effect of different land use types, including urban areas, on SOC variation in Milan, Italy. Similarly, vegetation indexes representing different land covers significantly affected the estimation variation in our study.
The NDWI was useful in explaining the estimation variance of SOCS in the study area (Fig. 9). The NDWI was highly correlated with the water content of vegetation and soil. In addition, land surface and vegetation are good indicators of moisture change (Bernstein 2012). Kerr and Ochsner (2020) conducted a study on different land use types (oak forest and grassland) in Oklahoma to reveal the effect of soil moisture on SOC and to assess the importance of soil moisture as a variable in SOC models. Soil moisture conditions were reported to be strong predictors of SOCS in temperate grasslands. The researchers stated that soil moisture is the most important variable explaining 8 to 15% variation in SOC at all soil depths. Consistent with our findings, Hursh et al. (2017) reported a statistically significant relationship between soil moisture and SOCS and stated that soil moisture is a dominant determinant of soil respiration in certain biomes ( Fig. 4 and Fig. 9).

Soil organic carbon stocks assessment using the CORINE land classification classes
The greatest advantages of machine learning models are to provide spatial soil information with high accuracy and precise resolution at low cost and labor. Thus, the machine learning provides the opportunity to evaluate the variations in soil properties in a realistic way. In this part of the study, the carbon content of CORINE LC classes was evaluated by using the SOCS spatial information obtained from the SOCS map produced using the GBDT machine learning model. The Moran's index, like Pearson correlation statistics for independent samples, is a conventional method to determine the autocorrelation (Moran 1950). Moran's index result of spatial autocorrelation (Moran's I) analysis for data independence was −0.0051. According to Adhikari et al. (2011), the selected SOCS sample means had the data independence main condition of the ANOVA (Fig. 10)  . The Z-score (0.114) indicates significant random sampling from the GBDT SOCS map. Thus, ANOVA analysis was started with independent and unbiased samples. The p value, which is the basic assumption of one-way ANOVA, was higher than 0.05 (0.612); thus, the variances were considered homogeneous. The difference in carbon values among the LC classes of the points sampled from the GBDT SOCS map was statistically significant. The results showed that land cover significantly affects SOC in Yuksekova plain. Four statistically different subgroups have been created by Tukey's post hoc test, which evaluated the mean SOCS between wetland and other land cover units (Table 7). The multiple comparison table used to report ANOVA results is given as Supplementary Information.
The mean values of SOCS in pastures (38.27 Mg C ha -1 ) and non-irrigated arable land (39.77 Mg C ha -1 ) were similar (p=0.786) (Fig. 11 and Table 7). The difference in carbon content (1.507 Mg C ha -1 ) between non-irrigated arable land and pastures was not statistically significant (Supplementary Information) (p=0.817, p≥0.05). Budak et al. (2018) recorded the highest SOCS (44.33 Mg C ha -1 ) in forest lands and the lowest SOCS (28.91 Mg C ha -1 ) in arable fields with field crops. They reported that overgrazing in pasture lands and conventional tillage and burning of harvest residues in agricultural lands had a negative impact on SOCS in the Upper Tigris Basin. The average SOCS for continuously irrigated farmlands was 45.58 Mg C ha −1 and it was significantly different from the mean SOCS of non-irrigated arable lands. The mean SOCS in arable lands, with significant areas of natural vegetation was 50.22 Mg C ha −1 and this amount was significantly different from the SOCS of other land covers (p<0.01). The mean SOCS of inland marshes LC, which represents the Nehil Marshes in the study area, was higher than the SOCS in all other land covers (Table 7 and Fig. 11). Similar to our findings, Wang et al. (2016) investigated the spatial distribution of SOCS in Poyang Lake, one of the two important wetlands in China, and reported significant SOCS variation along the land cover gradient. Yu et al. (2016), indicated significant differences in SOCS of cultivated fields and wetlands in Yellow River Delta (China). They reported that SOCS was highly affected by anthropogenic activities. On the other hand, the researchers reported that SOCS decreased as wetlands transformed into paddy fields. The reduced organic carbon mineralization in the natural wetlands due to the low or none anthropogenic activities causes a high SOC content (Babauta et al. 2012). Previous studies have reported that increased decomposition in drylands compared to anaerobic conditions in wetlands results in a lower amount of SOCS (DeBusk and Reddy 1998; Sigua et al. 2009). In the study area, conventional tillage methods, which have a negative impacts on SOCS, especially in arable agricultural areas should not be used anymore, and conservational tillage methods should be adapted to the region. The rangelands should be improved to both preserve the existing SOCS and increase the carbon content during the reclamation process. The findings of this study underline the importance of wetlands for terrestrial carbon storage and mitigation of global warming. Conversion and degradation of wetlands to arable lands reduce organic carbon storage potential of these lands. This study provided important spatial information for carbon accumulation in the Nehil Marshes, and the results drew attention to the conservation of wetlands to reduce CO 2 emissions. (Table 8)

Conclusions
This study aimed to reveal the most accurate spatial distribution of SOCS with the MLP network and GBDT machine learning algorithm, which use the remote sensing indexes calculated based on Sentinel 2A as a covariate. The results also contributed to the modeling efforts of SOC using only remote sensing data. The SOCS had significant correlation with the SRCI calculated with the Sentinel 2A SWIR spectral region and the GSI calculated with red, green, and blue spectral region. Thus, soil texture factor affecting soil organic carbon content were included in the SOCS estimation model. Similarly, the vegetation, moisture, and soil brightness indexes used in this study were also significantly correlated with SOCS. The MLP network architecture was designed based on common variables and SOCS that have been proven to be statistically reliable. The RMSE and MAPE values of the MLP model were 25.49% and 13.28 Mg C ha −1 , respectively. Spatial distribution of SOCS was modeled using the same covariates, train and test dataset with LSBoost method, Bayesian optimization algorithm, and gradient descent boosting machine learning. The RMSE and MAPE values in GBDT model were 6.64 Mg C ha −1 and 9.97%, respectively. The MAPE value indicated that the GBDT provided highly accurate forecasting, while a reasonable estimation was carried out with the MLP. The RMSE values revealed that the GBDT machine learning performed 50% more successful estimation than the MLP network. In addition, the most important covariates in estimation variance were SRCI and GSI, which represent soil texture. Highly negative correlation (r=−0.891 and R 2 =0.794) between SRCI and GSI revealed that clay content was very important in SOCS estimation. The effect of land cover on SOCS was quite significant and the common variables of NDVI, GNDVI, and ARVI contributed significantly to the estimation variance. In this context, CORINE LC was used to assess the relationship between SOCS and land cover. The results revealed that Nehil Marshes have a significant carbon stock in Yuksekova. Reliable spatial SOCS information was obtained with the combination of Sentinel-2 guided multiindex remote sensing modeling strategy and GBDT. This methodology followed in this study allows for a reliable assessment of SOCS information, which provides strong support for policymakers and decision-makers in developing best suitable carbon sequestration scenarios to avoid land degradation and reach sustainable development goals.

Authors' contributions
All authors contributed to the study conception and design. Mesut Budak: conceptualization; data curation; formal analysis; funding acquisition; investigation; methodology; software; project administration; resources; supervision; validation; visualization; roles/writing-original draft; writing-review and editing. Elif Günal: conceptualization; data curation; formal analysis; funding acquisition; investigation; methodology; project administration; resources; software; supervision; validation; visualization; roles/writing-original draft; writing-review and editing. Miraç Kılıç: conceptualization; data curation; formal analysis; methodology; software; validation; visualization; roles/writing-original draft; writing-review and editing. İsmail Çelik: project administration; resources; supervision; investigation; roles/writing-original draft; writing-review and editing. Mesut Sırrı: formal analysis; methodology; validation; visualization; roles/writing-original draft; writing-review and editing. Nurullah Acir: data curation; formal analysis; resources; software; validation; visualization; roles/writing-original draft; writing-review and editing. All authors have read, understood, and have complied as applicable with the statement on "Ethical Responsibilities of Authors" as found in the Instructions for Authors.
Funding This study was funded by The Scientific and Technological Research Council of Turkey (TUBITAK; project number 119O947). The authors would like to thank TUBITAK for the funding.
Data availability Data will be made available on request.

Declarations
Ethical approval Not applicable.

Consent to participate Not applicable.
Consent to publish Not applicable.

Competing interests
The authors declare no competing interests. This paper described has not been published previously. This manuscript is not under consideration for publication elsewhere. If accepted, this manuscript will not be published elsewhere in the same form, in English or in any other language, including electronically without the written consent of the copyright holder.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.