Modeling soil organic carbon using remotely-sensed predictors: a case study from Fuzhou city, 1 China

10 Background: Assessing the spatial dynamics of soil organic carbon (SOC) is essential for carbon monitoring. 11 Since, variability of SOC is mainly attributed to biophysical land surface variables, integrating a compressive set of 12 such indices may support the pursuit for optimum set of predictor variables. Therefore, this study was aimed at pre- 13 dicting the spatial distribution of SOC in relation to remotely-sensed variables and other covariates. Hence, the land 14 surface variables were combined from remote sensing, topographic, and soil spectral sources. Moreover, the most 15 influential variables for prediction were selected using the RF and Classification and Regression Tree (CART). 16 Results: The results indicated that the RF model has good prediction performance with corresponding R 2 and 17 RMSE values of 0.96, and 0.91 mg/g, respectively. The distribution of SOC content showed variability across land- 18 forms (CV=78.67%), land-use (CV=93%), and lithology (CV=64.67%). Forestland had the highest SOC (13.60 19 mg/g) followed by agriculture (10.43 mg/g), urban (9.74 mg/g), and water body (4.55 mg/g) land-uses. Furthermore, 20 bauxite and laterite lithology had the highest SOC content (14.69 mg/g) followed by fluvial (14.52 mg/g) and shale 21 (13.57 mg/g), able information for prediction of SOC. Hence, the land surface indices may provide new insights into SOC model- ing in complex landscapes of warm sub-tropical urban regions.


32
Soil organic carbon (SOC) is essential for the normal functioning of the ecosystems [1]. It also plays a significant 33 role in global C dynamics and climate change study as it stores the largest total carbon pool of terrestrial ecosystems 34 [2]. In the urban context, the existence of an optimum SOC content is a critical factor for greening projects and is 35 also a good indicator of the state of urban ecosystems and soil quality [3][4][5]. However, its spatial distribution is in-36 fluenced by landscape, lithologic and land use factors. Particularly, land use change due to intensive human activi-37 ties including urbanization and industrialization processes hugely impacts its spatial distribution [6,7].

38
In line to this, estimation of the spatial variability of SOC in urban areas has attracted the interests of many research-

48
Even though, many previous researches estimated SOC in urban areas globally and nationally, the current study area 49 has limited similar studies. Moreover, there is a need to find the optimal set of suitable environmental predictors that 50 may influence prediction of SOC distribution. As the spatial distribution of SOC is highly influenced by various 51 environmental variables, assessing suitability of remotely-sensed predictors and other environmental covariates for 52 mapping of the spatial variability of topsoil SOC in complex urban environment is imperative. To that end, McBrat-53 ney et al. (2003) also highly recommended the need of new researches so as to select suitable environmental covari-54 ates for digital soil mapping [12]. Furthermore, even though the previous researches used multisource datasets, 55 they ignored essential biophysical land surface variables that may add information to SOC distribution. For instance, 3 soil moisture, land surface temperature and built-up index were omitted. Since, SOC is highly influenced by soil 57 temperature and moisture than any other factors, integrating them may provide new insight for SOC mapping [13, 58 14]. The previous SOC prediction studies overlooked soil moisture and temperature indices perhaps due to a short-59 age of high-resolution soil moisture and temperature data. Especially, the traditional ground-based soil-moisture 60 observation networks produce sparse soil-moisture data for smaller regions. Meanwhile, some studies used atmos-61 pheric temperature and precipitation as a proxy to measure soil temperature and moisture [15,16]. However, they 62 also have very coarse resolution for smaller geographic scales. In the meantime, the remote sensing products of soil 63 moisture such as Advanced Scatterometer (ASCAT) and Soil Moisture Ocean Salinity (SMOS) have coarse resolu-64 tions (i.e., in tens of kilometers) [17]. But, optical/thermal infrared (TIR) sensor products have higher spatial resolu-65 tions (meters to kilometers) and could be good solution for regional and local applications [18]. Variables derived 66 from optical sensor products including vegetation temperature condition index (VTCI) and land surface temperature 67 (LST) can be better choice to obtain the soil moisture and temperature information.

68
Therefore, the main aim of this study was to identify the role of remotely-sensed variables for SOC prediction and to 69 understand the contribution of the landform, land-use, and lithology on the spatial variation of SOC in the coastal 70 city of Fuzhou, China. To that end, land surface variables such as vegetation temperature condition index (VTCI), 71 land surface temperature (LST), and normalized difference built-up index (NDBI) were integrated with other envi-72 ronmental covariates obtained from multiple sources including remote sensing, topography, and proximal sensing.

73
Meantime, the most important environmental variables were selected and used to estimate the spatial distribution of 74 SOC using a random forest and CART model.

79
Zone with a total area of 11,462.41 km 2 . It is geographically located at 118°08'E to 120°31' East Longitude and 25° 80 15'N to 26° 29'N North Latitude in the southeast of China. It neighbors with Ningde and Nanping to the north,

81
Quanzhou, and Putian to the south, and Sanming to the west. The city has 13 administrative regions comprising of 6 82 districts, one county-level city, and six counties. The current study site includes all regions except Pingtan County. 4 The climatic condition of the area belongs to the humid subtropical maritime climate, with an annual average tem-84 perature of 16-20℃ and the annual average precipitation of 900-1200 millimeters.

85
It is also covered by acidic volcanic rocks and Cretaceous sandstones from Jurassic Period [19,20]

96
(GPS) receiver was used to identify sample points and to capture the location information. Then, the samples were 97 transported to the laboratory, air-dried, and sieved. The samples were separated for spectroscopic measurements and 98 chemical analysis. The SOC was determined using a dry combustion method using a CNS elemental analyzer (Flash

99
EA 1112 NC-Soil, Thermo Fisher Scientific, Pittsburgh, PA). Spectral soil data was measured using an indoor spec-100 tral measurement in the wavelength ranges of 350nm to 2500nm using the FieldSpec-3 spectroradiometer. The sam-101 pling intervals were set to be 1nm, and as a result, a total of 2151 bands were produced.  Moreover, bandwidth of the spectral soil data was ranging from 350nm to 2500nm. Different spectral transfor-111 mations such as multiplicative scatter correction (MSC), first derivative, second derivative, normalization, and con-112 tinuum removal were applied to remove background noise [26]. Moreover, smoothing approaches of the Savitzky-

113
Golay filtering algorithm (SG) with a second-order polynomial and averaging was performed across a 10-band win-

121
The LST of the study area was calculated using Landsat 8 TIR bands, while the NDBI was used to extract built-up 122 areas [28]. A threshold value of 0.038 was used to extract the built-up area. Additionally, the spectral bands of the 123 Landsat-8 image were used along with soil spectral data. Additionally, the DEM data with a resolution of 10m was 124 obtained from aerial imageries and used to extract landform, hydrological, and spatial indices. The topography was 125 classified using the approaches described in FAO guidelines for soil description [62]. Additional landform character-126 istics of the study area was described using different landscape metrics including slope, relief, curvature, TWI, TPI, 127 and others. Furthermore, proximity to essential features such as industries, landfill sites, water-points, and port 128 points was calculated as the Euclidean distance from sampling points [29].

129
All environmental covariate maps and point datasets were converted into similar formats and stacked using ArcGIS 130 10.3 software. All datasets were resampled into the 30m pixel using 'bilinear resampling' for continuous data and

133
The land-use /cover of the study area was classified using a supervised image classification technique. Representa-134 tive ground control points (i.e., 50 points for each class) were captured to compare the class signatures. Out of 50 135 ground control points, 40 were used for classification calibration, whereas the remaining ten were used for valida-136 tion. The land-use was classified into four predominant (i.e., urban, vegetation (forest), agriculture, and water bod-137 6 ies) covers, as can been seen in Figure 8. Finally, classification results and accuracy of classification were generated 138 (Table 3).

140
The summary statistics were calculated using standard descriptive statistics for the variates. The coefficient of varia-141 tion (CV) and the Pearson correlation coefficients were used to measure the spread of the mean and the linear de-142 pendence between SOC values and environmental covariates, respectively. Similarly, Wilding (1985) and Nielsen 143 and Bouma (1985) used CV values to characterize the variability into classes of low (0-15), moderate (16-35%) and 144 high (> 36%) SOC variability [31,32].

145
A combination of random forest and CART were used for prediction, performance evaluation, and selection of the 146 essential variables. Random forest avoids over-fitting and provides reliable error estimates of out-of-bag (OOB), 147 avoiding the need for an independent validation dataset [33].

148
The 75 % of soil samples were used for the training of a model, and identifying essential variables for prediction, 149 and then the model was applied to the validation set. Additionally, CART, was implemented using the R package 150 "rpart" to improve interpretability [34]. The generated regression forest with a minimum split factor of 4, is seen in

166
Where n-is number of observations, zi-is average prediction of the i th observation, zi OOB -is the average prediction for 168 the i th observation from all trees for which the observation was OOB.

169
Additionally, the percentage of explained variance (Varex) was calculated as: Where, Varz-is the total variance of the response variable.

172
The ntree parameter (the number of trees in the forest) was adjusted using the mean squared error (MSE values) as a 173 measure of the prediction accuracy of the RF model (figure 13).

180
Ntree and mtry were selected to be 500 and 52, respectively, as a large number of trees is recommended for datasets 181 with sophisticated features and when the emphasis is given for identifying essential variables [40] (Figure 15).

182
The statistical indices of root mean square error (RMSE) and mean error (ME) were used, as stated in equations 7 183 and 8 to evaluate the performances of the model.  important predictors was shown in Figure 16.

200
The mean concentration of SOC was 11.70 mg/g, with a minimum value of 0.7 mg/g and the maximum value of 201 45.80 mg/g. It was highly variable with the coefficient of variation (CV) of 81% and a standard deviation of 8.92.

218
Based on the result, the VTCI index was the most crucial variable for the prediction of SOC from all 52 predictors, 219 followed by VIs, lithology, and NDBI. The remaining variables, such as TPI, Slope, X2224, LST, Band4, X424, 220 RVI, TWI, Band7, and NDVI, occupied the remaining top fifteen positions as influential variables. NDBI and LST

221
were among the influential variables with rankings of fifth and ninth places.

222
Similarly, topographic variables of TPI and Slope had superior impact ranking in sixth and seventh positions, re-223 spectively, while TWI was thirteenth.

225
The topographic and hydrological factors such as slope, curvature, aspect, TPI, MBI, and TWI of the study area had 226 high variability with increased CV (see Table 4). Slope, aspect, and MBI were more variable with > 50 % CV than 233 had a significant positive correlation with SOC concentration at p<0.05 (see Table 4).

235
The distribution of SOC content across landforms, land-use, and lithology was examined. The result shows that its 236 distribution was highly variable across landforms with a mean CV of 78.67 %. The highest SOC variation was rec-237 orded within medium-gradient Mountains(SM) with a CV of 122.08% (Table 5), followed by high-gradient moun-238 tains (TM) (12.64 mg/g). Considerably high SOC contents were predicted in plain, medium gradient hills (SH) and 239 medium-gradient mountains (SM) with mean values of 11.94 mg/g, 11.88 mg/g, and 10.90 mg/g, respectively. On 240 the contrary, a lower proportion of SOC was predicted in high-gradient hills (TH) and shoreline (WR) with mean 241 values of 7.47 mg/g and 1.67 mg/g, respectively.

260
VTCI was the most essential variable for the prediction of SOC from 52 environmental predictors. The main reason 261 for the superior influence of VTCI on the spatial prediction of SOC distribution may be attributed to its precision to 262 measure the crop water status and the subsequent impacts of soil moisture on the aboveground biomass. Moreover,

263
since VTCI is derived based on the relationship between land surface temperature (LST) and vegetation index 264 (NDVI), it could provide more information for the spatial prediction of SOC. Alvarez and Lavado (1998)  The reason for high importance of VTCI for SOC mapping can be due to its control on the extent of vegetation cov-270 er (i.e., quantity and quality of OM enters into the soil), the rate of mineralization, and litter decomposition [47,48].

271
Moreover, since soil moisture is an essential element for microbial growth, it may facilitate the degradation of plant 272 and animal residues that improves SOC contents [13,49].

273
ASTAVI and EVI were the second and third influential variables for the prediction of SOC distribution. The reason 274 for their influence may be due to their ability to provide proxies for measuring aboveground biomass that may influ-

339
In general, the landform variations may contribute to changes in soil properties (clay content), human activities (i.e.,

344
Additionally, the SOC distribution was highly variable across the land-use.

381
Additionally, the selection of NDBI as one of the essential predictors may provide an insight to predict SOC in resi-382 dential areas.

383
The current study has derived biophysical land variables such as soil moisture, land surface temperature, and human 384 influence using substantially improved indices, which were often ignored in the prediction of SOC in previous stud-

385
ies. However, the indices were among the most influential variables for the prediction of the spatial distribution of 386 SOC in a complex coastal urban environment of Fuzhou city. Even though, the variables were derived from high-387 resolution Landsat-8 images, the result might be further improved in the future studies by using better resolution 388 images such as Sentinel products.

389
This result shows that similar approaches and biophysical land variables can be used in other regional and local level 390 SOC prediction studies in similar sub-tropical coastal environments.

393
The data used for this study is part of an interdisciplinary project on Land Degradation (IUCLAND). Data support-394 ing the conclusions of this manuscript will be made available by the corresponding author. All supporting data will 395 be publicly available.

397
The authors declare no competing interests.