Machine learning and geostatistical approaches for estimating aboveground biomass in Chinese subtropical forests

DOI: https://doi.org/10.21203/rs.3.rs-25148/v2

Abstract

Background: Aboveground biomass (AGB) is a fundamental indicator of forest ecosystem productivity and health and hence plays an essential role in evaluating forest carbon reserves and supporting the development of targeted forest management plans. Methods: Here, we proposed a random forest/co-kriging framework that integrates the strengths of machine learning and geostatistical approaches to improve the mapping accuracies of AGB in northern Guangdong province of China. We used Landsat time-series observations, Advanced Land Observing Satellite (ALOS) Phased Array L-band Synthetic Aperture Radar (PALSAR) data, and National Forest Inventory (NFI) plot measurements, to generate the forest AGB maps at three time points (1992, 2002, and 2010) showing the spatio-temporal dynamics of AGB in the subtropical forests in Guangdong, China. Results: The proposed model provided excellent performance for mapping AGB using spectral, textural, and topographical variables, and the radar backscatter coefficients. The root mean square error of the plot-level AGB validation was between 15.62 and 53.78 (t/ha), the mean absolute error ranged from 6.54 to 32.32 t/ha, and the relative improvement over the random forest algorithm was between 3.8% and 17.7%. The highest coefficient of determination (0.81) and the lowest mean absolute error (6.54 t/ha) were observed in the 1992 AGB map. The spectral saturation effect was minimized by adding the PALSAR data to the modeling variable set in 2010. By adding elevation as a covariable, the co-kriging outperformed the ordinary kriging method for the prediction of the AGB residuals, because co-kriging resulted in better interpolation results in the valleys and plains of the study area. Conclusions: Validation of the three AGB maps with an independent dataset indicated that the random forest/co-kriging performed best for AGB prediction, followed by random forest coupled with ordinary kriging (random forest/ordinary kriging), and the random forest model. The proposed random forest/co-kriging framework provides an accurate and reliable method for AGB mapping in subtropical forest regions with complex topography. The resulting AGB maps are suitable for the targeted development of forest management actions to promote carbon sequestration and sustainable forest management in the context of climate change.

1. Introduction

Forests play an important role in global carbon cycling because they act as carbon sinks and sources for atmospheric CO2 (Chave et al. 2014; Pan et al. 2011). Forest aboveground biomass (AGB) is an indicator for assessing forest ecosystem productivity and health and determining the potential for carbon storage and carbon sink, as well as an important parameter for estimating carbon emissions and disturbances caused by land use and climate change (Schulze et al. 2006; Muukkonen and Heiskanen 2007; Baccini et al. 2017). Currently, the most accurate methods for obtaining forest AGB are the use of site- and species-specific allometric equations based on measured forest biometric parameters, such as the diameter at breast height (DBH), height, crown closure, and stem density (Ali et al. 2015; Chave et al. 2014; Paul et al. 2015). However, due to the time required to obtain the field data, the biomass information is often outdated when it is used (Chave et al. 2014). Thus, region-level estimates of AGB using only field data may not attain the same precision and timeliness as methods that combine field data with auxiliary information, such as remote sensing data.

Remote sensing images contain abundant spectral and textural information and have excellent temporal and spatial resolutions, wide coverage, and excellent timeliness (Zhu and Liu, 2015). The combination of forest inventory plots and remote sensing data to estimate forest AGB has become a mainstream method(Lu, 2006; Lu et al. 2016) in the last decades. Remote sensing-based AGB estimates have used three types of remotely sensed data: optical imagery, radar, and light detection and ranging (LiDAR), depending on the spatial resolution required and the application purposes. Optical remote sensing data are the dominant type of data. The most commonly used optical data include low-resolution AVHRR and MODIS (Li et al. 2018), medium-resolution Landsat and SPOT data (Gasparri et al. 2010; Zhu and Liu, 2015), and high-resolution IKONOS, Quickbird, Worldview, and drone data (Arroyo et al. 2010; Dassot et al. 2012; Proisy et al. 2007). High spatial resolution images have a significant amount of shadows from trees and terrain, resulting in errors for AGB estimation (Thenkabail et al. 2004; Vaglio Laurin et al. 2019). Due to the high cost of the imagery, the estimation of biomass is often limited to a relatively small area (Foody et al. 2001; Gonzalez et al. 2010). Meanwhile biomass estimation using low-resolution remote sensing images typically has two major drawbacks, namely, mixed pixels and difficulty to match the pixel size with the sample plot size; these problems result in relatively large errors of AGB estimates, limiting the application to national, continental, or global scales (Baccini et al. 2012; Chopping et al. 2011). In contrast, medium-resolution Landsat (30 m) data are widely used in combination with sample plot data for AGB estimations in the past two decades because they are freely available since 2008, 16-day revisit time, and wide coverage. In addition, the 30 m resolution is similar to the size of sample plots in China forest inventories (NFIs), thus reducing the error when matching the pixel to the field plot and obtaining better estimation results (Hall et al. 2006; Karlson et al. 2015; Powell et al. 2010).

In structurally complex forests, spectral saturation is encountered when using optical remote sensing imagery, causing estimation errors for areas of high biomass (Mermoz et al. 2015). Optical images are also susceptible to the influence of clouds and differences in solar illumination (Avtar et al. 2012). The successful launch of the Advanced Land Observing Satellite (ALOS) Phased Array L-band Synthetic Aperture Radar (PALSAR) in 2007 has increased the use of radar data to measure biomass. The instrument is the first long-wavelength (L-band, 23-cm wavelength) synthetic aperture radar (SAR) satellite sensor with the capability of collecting data in single, dual, full, and scan-SAR mode with cross-polarized (horizontal-transmit, vertical receive (HV)) and co-polarized (horizontal-transmit, horizontal receive (HH); vertical-transmit, vertical receive (VV)) data (Avtar et al. 2013). The high penetration capability of SAR allows for extensive information extraction of the structural parameters of plants for improving biomass estimation. Methods that use L-band radar and optical data to estimate AGB have proven successful in forests with low to medium biomass levels (Lu et al. 2012; Mitchard et al. 2012; Ploton et al. 2013) but similar to passive optical remote sensing, these systems suffer from signal saturation at high AGB and show limited sensitivity to high AGB. Thus, the combination of optical and radar systems to estimate AGB in dense tropical and subtropical forests remains problematic and often results in the underestimation of AGB in complex and mature forests (Sandberg et al. 2011). Furthermore, radar systems are affected by terrain, surface moisture, and speckle noise (Imhoff, 1995; Martins et al. 2016; Mermoz and Le Toan, 2016).As such, light detection and ranging (LiDAR) is an active remote sensing technology capable of providing detailed, spatially explicit, and three-dimensional information on forest canopy structure (Lefsky et al. 2002b). Thus, LiDAR estimates of AGB provide better precision than radar and optical data (Cao et al. 2016). However, it is expensive to collect wall-to-wall LiDAR for forest structure characterization over large areas, e.g., at the state or country level. Thus, current LiDAR systems are limited to areas smaller than those extents. Multi-source remotely sensed images in conjunction with appropriate statistical modeling algorithms are regarded as an effective and reliable means for mapping AGB over large areas.

Linear regression models have been widely used in early studies of AGB estimation (Lefsky et al. 2002a; Myneni et al. 2001). This parametric method is simple and straightforward, but can require strict assumptions on the model error and the normality of dependence variable. In addition, relationship models for practical AGB estimation are limited because of the complex nonlinear relationship between forest AGB and remote sensing features. Non-parametric methods do not require a strictly linear relationship between the response and covariates, and training data are used to train the model to estimate the parameter of interest. Due to the development of non-parametric estimation models, numerous studies on forest biomass estimation have used these methods in recent years, including the k-nearest neighbor (KNN) method (Chirici et al. 2008), neural networks (Foody et al. 2001), support vector machines (SVM) (Chen et al. 2010) and decision trees (Hansen et al. 2016). Random forest (RF) is a machine learning algorithm based on decision trees that has been used extensively for forest AGB mapping using remote sensing data in the last two decades. RF provides higher accuracy than comparative machine learning methods and conventional statistical regressions because RF is less sensitive to noise in the training samples (Hoover et al. 2018; Powell et al. 2010; Zhao et al. 2019). However, a major shortcoming of RF is that it ignores the spatial autocorrelation of the data when mapping the feature distribution (Chen et al. 2019).

Geostatistics is based on the theory of regionalized variables; it does not only quantitatively describe spatial heterogeneity or spatial correlation but also establishes a spatial prediction model to interpolate and estimate spatial data (Isaaks and Mohan, 1989). Kriging is a geostatisticial method and is also known as local spatial estimation or local spatial interpolation; it provides the best linear unbiased prediction (BLUP) of the regionalized variable values in a limited area and is based on the theory of variograms and structural analysis (Le and Zidek, 2006). On the basis of linear regression, kriging needs to assume the spatial stability in the spatial variation analysis within limited distance, which can make up for the defects of non-parametric models (RF) to some extent. Nothdurft et al. (2009) used a random parameter model with variance components to develop non-parametric most similar neighbour (MSN) approach to predict forest stand variables in management units (stands) from sample plot inventory data. Fox et al. (2020) proposed the random forest regression kriging (RFRK) model to improve the two techniques by comparing spatial regression and the RF. The prediction accuracy of RF model combining with kriging outperformed RF for predicting the spatial distribution of soil attributes and pollutant concentrations (Guo et al. 2015; Tziachris et al. 2019). Several studies have focused on AGB prediction using remote sensing data from different sources based on RF coupled with ordinary kriging (OK) (RFOK); this method combines RF predictions and residual estimations by OK (Chen et al. 2019; Silveira et al. 2019a). OK is a linear estimation method that is suitable for inherently stationary random fields and satisfies the isotropic hypothesis. In a large study area, the interpolation of the RF residuals is affected by the uneven distribution of terrain and climate, and OK is the most suitable method (Cressie, 1990; Le and Zidek, 2006). Co-kriging (CK) is an extension of OK and uses one or multiple auxiliary variables to interpolate the variables of interest. The auxiliary variables are related to the target variables and are used to improve the accuracy of target prediction (Chatterjee et al. 2015).

Subtropical and tropical forests are very diverse (in the type, climate, and site conditions), with high uncertainty of AGB estimates. AGB prediction models trained with limited ground observations in tropical or subtropical forests over a large area are prone to overfitting and do not describe local features adequately (Zhao et al. 2016; Lu et al. 2016). In this case, the optimized integration of multiple types of remote sensing data and multiple modeling algorithms may provide an alternative to reduce high uncertainty (Santoro et al. 2015; Yu et al. 2014). Numerous studies have combined active and passive remote sensing data sources for forest mapping over large spatial scales (Deo et al. 2017; Shen et al. 2016; Su et al. 2016). However, there are few studies on AGB mapping of large areas in subtropical forests using the combination of RF and CK (RFCK), especially in mountainous areas with complex terrain. In this study, we proposed a framework that uses field sample plot data, time-series passive and active remote sensing data, and Co-kriging with RF models to improve the mapping accuracy of AGB of subtropical forests in southern China for the years 1992, 2002, and 2010. We expect that the results of this study will support the strategic development of carbon sequestration forests and sustainable forest management practices.

2. Methods

2.1 Study Area

The study area is located in northern Guangdong province (extending from 113.10˝E, 23.64˝N to 114.75˝E, 25.44˝N) China, and includes the administrative cities of Shaoguan, Qingyuan, and Heyuan (Figure 1). The topography is undulating, and the elevation ranges from 13 to 1709 m above sea level. The area has a mid-subtropical monsoon climate, with mean annual precipitation of 1300 to 2400 mm and a mean annual temperature of 18 to 21 °C. The rainy season lasts from March to August, and approximately 53% of the annual precipitation falls between April and June. The vegetation includes natural forests and plantations. The dominant tree species are Pinus massoniana, Cunninghamia lanceolata, Pinus elliottii Engelm, Eucalyptus, Pinus kwangtungensis, Castanopsis fissa, Acacia mangium, and Phyllostachys edulis. Most of these species are evergreen and fast-growing. Other vegetation includes deciduous trees and shrubs. The most common meteorological events causing plant damage in the region are chilling events, storms, floods, and droughts.

2.2 Data collection

2.2.1 Field data

The NFI in China is the first level of China’s three-tiered inventory system, which is administered by the State Forestry and Grassland Administration (Xie et al. 2011). The system is based on permanent sample plots (PSPs), systematic sampling, and design-based inference. The NFI has been conducted eight times from the 1970s to 2012, with a cycle of five years; the sample sites are fixed and evenly distributed. The Guangdong Provincial Center for Forest Resources Monitoring provided us with eight NFI datasets collected between 1979 and 2012 to support this research. In this study, we used three years (1992, 2002, and 2012) of data from the inventory plots located in the study area. A total of 1355 plots with a size of 25.82 m × 25.82 m (0.0667 ha) were located in the study area. The AGB of the plots was derived using tree species-specific allometric equations developed by the Guangdong Provincial Center for Forest Resources Monitoring. The AGB is calculated using linear or non-linear regression models and DBH, tree height, and wood density of the surveyed trees. The biomass of the sample plots in 2010 was obtained by linear interpolation of the results of 2007 and 2012 to match the date of the used remote sensing data.

Since most remote sensing images have amount of cloud cover, the sample plots covered by clouds in the images were excluded in the analysis to obtain more accurate inversion results. Additionally, quality control of the raw sample data was selected to remove unreliable observations. Statistical analysis was conducted in SPSS software (version 21.0, IBM, Armonk, NY, USA) after screening of the samples in the three inventories. Values that were larger or smaller than the mean plus/minus three times the standard deviation were considered outliers and were removed. The AGB unit used in this paper is tons per hectare (t/ha).

2.2.2 Landsat data and SRTM digital elevation data

The Shuttle Radar Topography Mission (SRTM) digital elevation data produced by NASA represents a breakthrough in global digital elevation mapping, providing access to high-quality elevation data for large portions of the tropics and other areas of the developing world. This study used a free digital elevation model (DEM) with a resolution of 90 m (https://earthexplorer.usgs.gov/). We extracted geographical variables from the DEM data. We resampled the SRTM data from 90 m to 30 m resolution for consistency with the image data and extracted the slope and aspect in ArcGIS.

The study area of Northern Guangdong is covered by the Landsat World Reference System-2 path/row 122/043 tile. The Landsat 5-Thematic Mapper (TM) imagery was acquired from the United States Geological Survey (USGS) EROS data center (https://glovis.usgs.gov/). We downloaded the images for the three years (1992, 2002, 2010) and ensured that the acquisition dates of the imagery fell into the growing season, and the images had minimal cloud cover (Table 2).

Table 2. Description of the Landsat imagery used in this study

Filename

Acquisition time

Cloud cover

LT51220431992212BJC00

1992/7/30

3%

LT51220432002287BJC00

2002/10/14

4%

LT51220432010085BKT00

2010/9/26

11%


The Landsat ecosystem disturbance adaptive processing system(LEDAPS) is an image preprocessing algorithm developed by NASA for the generation of surface reflectance for the analysis of long-term time-series Landsat data. The algorithm is based on the 6S radiation transmission model and performs inversion of atmospheric parameters, such as aerosol data, the visible, near-infrared, and short-wave bands of Landsat TM/Enhanced Thematic Mapper (ETM)+ data (Vermote and Saleous, 2007). In this study, LEDAPS was used to obtain surface reflectance products through geometric correction, radiometric calibration, atmospheric correction, and F-mask processing for cloud detection. Subsequently, the C-correction (Fan et al. 2014) model was used to conduct topographic correction based on the surface reflectance data. The relationship between the slope and aspect from the DEM and the solar azimuth was used to correct the pixel reflectance value in the shadows of the mountains.

It is crucial to select appropriate variables for the inversion of remote sensing data to determine AGB. We performed spectral feature transformations, including principal components analysis (PCA), tasseled-cap transformation, and extraction of vegetation indices to obtain suitable variables. In addition to the six original bands of the image (excluding the thermal infrared band), we also extracted the reciprocal reflectance of the band as an additional variable for the inversion.

For extracting image texture, the first principal component (PC1) of the PCA, which contained more than 80% of the original spectral information, was used to extract image texture using eight texture variables. The texture variables were calculated with three window sizes (3 × 3, 5 × 5, and 7 × 7) with an offset ([1,1]) and 64 gray-level quantification.

2.2.3 ALOS PALSAR data

The ALOS/PALSAR data were pre-processed at the Japan Aerospace Exploration Agency (JAXA). The global 25 m resolution PALSAR/PALSAR-2 mosaic is a seamless global SAR image created by mosaicking SAR images of the backscattering coefficient obtained from PALSAR/PALSAR-2, where all the paths within 10×10 degrees in latitude and longitude are path-processed and mosaicked for processing efficiency. Since the ALOS/PALSAR data ranged from 2007 to 2010, we selected the mosaicked images of the study area in 2010 to match the sample field data and Landsat images. Correction of the geometric distortion specific to SAR (orthorectification) and topographic correction based on image intensity (slope correction) were performed to facilitate forest classification. The digital numbers of the SAR signal amplitude were extracted from the imagery and were converted to backscattering coefficients in dB using Eq. (1): (see Equation 1 in the Supplemental Files) 

where DN is the digital number of the amplitude of the HH and HV backscatters (expressed as an unsigned short integer), and CF is the absolute calibration factor, which is equal to -83 (dB). A 7×7 Lee filter was used on the HH and HV backscatter images to reduce speckle noise (Lee 1980).

The variables included the HH and HV polarizations and HH/HV and the radar forest degradation index (RFDI) (HH−HV)/(HH+HV) (Aslan et al. 2016). The RFDI is a useful input layer for wetland vegetation mapping because HH polarized data are sensitive to water beneath the canopy, whereas HV polarized data are known for the sensitivity to volume scattering and biomass (citation is needed here). The derived variables were resampled to 30 m resolution using bilinear interpolation to match the Landsat datasets.

2.3 Prediction methods

2.3.1 Random Forest model

The RF model is a supervised machine learning algorithm based on decision trees. It has the advantages of providing variable importance, being robust to data reduction, generating an internal unbiased estimate of the generalization error as the forest building progresses, and having higher accuracy than individual decision trees and low sensitive to parameter adjustment than other machine learning models (e.g., neural networks) (Amit and Geman 1997; Breiman 2001, 2004; Hastie 2008). We used the randomForest package (Liaw and Wiener 2002) in the R software for AGB prediction.

Numerous variables have been used for forest AGB prediction (Kumar et al. 2015),but choosing the right variables has a significant influence on the prediction accuracy of the model. After stacking the available feature layers, the coordinates of the sample plots were matched to those of the pixels so that the AGB of the plot corresponded to the suite of predictor variables (feature names), and established the models. We then performed variable importance analysis in the random forest package by deriving two different variable importance measures, the analytical result provided means to assess the contribution of each predictor variable to the modeling performance based on the response type. Importance type 1 was calculated by randomly permuting each predictor variable and computing the associated reduction in predictive performance using the out of bag (OOB) error for RF models and importance type 2 was calculated using the decrease in node impurities attributable to each predictor variable (Breiman,2001). The higher the percent increase in mean square error (MSE) (%IncMSE) and increase in NodePurity (IncNodePurity) indicated a stronger importance of these predictor variables (Karlson et al. 2015b; Shen et al. 2016).

Based on the ranking of each variable in GI-index and OOB error rate, the variables with the higher ranking were selected as the predictors highly related to AGB, so as to reduce the calculation complexity and obtain more accurate and easily interpretable results. Appendix Table A1 shows all the alternative variables.

The correct parameters in machine learning do not only increase the predictive power of the model but also facilitate model training. The parameters of the model in this study included the mtry, ntree, and node size, where mtry represents the number of variables used to split the tree at each node, ntree represents the number of decision trees in the RF, and node size represents the minimum number of nodes in the decision tree. For parameter tuning, mtry was defaulted to the quadratic root of the number of variables in the data set (classification model) or to one third (prediction model). Nodesize was defaulted to the discriminant model at 1 and the regression model was 5. The selection of ntree value was determined by constantly testing that how much number of decision trees was gained -when the error in the model was relatively stable. After multiple tries, we used the following values for the parameters: mtry=3, ntree=500, and node size=5.

2.3.2 Kriging-based model

Geostatistics is an effective method for determining spatial or spatiotemporal variation based on statistical measures. Since the RF model does not consider spatial autocorrelation of the AGB sample plots, we used a combination of RF and kriging to determine the spatial distribution of AGB. The detailed flowchart of RFCK is shown in Figure 2. Specifically, a regression-kriging technique was used to extract the components of the residuals obtained from the RF regression (Guo et al. 2015). The residual is the observed AGB minus the predicted AGB from RF. By adding the extracted components to the RF-based predictions, we obtained a higher prediction accuracy of the AGB. Since the RFOK considered the spatial autocorrelation of the AGB residuals, the saturation problem encountered in optical remote sensing data for dense or mature forests was minimized (Lu, 2005), resulting in an unbiased and reliable AGB map. Here, we focused on comparing the accuracies of ordinary krig (OK) and co-krig (CK).

OK is a linear estimation method suitable for inherently stationary random fields and satisfies the isotropic hypothesis (Cressie et al. 1990). In this hypothesis, the mathematical expectation of the variable of interest is independent of its position, and the covariance is a function of the distance between the points. We assume that we estimate n points in the neighborhood of point x0; the interpolation formula of the OK method is defined in Eq.(2):(see Equation 2 in the Supplemental Files) 

where Zok*(x0) is the residual value of the AGB to be estimated, n is the number of sample points used for interpolation, Z(xi) is the AGB residual of site i, and λi is the weighting coefficient at point i.

 CK is an improvement over the OK method and can deal with multivariable problems. The random fields that need to be modeled are called primary variables, and other random fields involved in modeling are called covariables. Variograms and covariograms of each attribute are used in the calculation. Since the research area is located in the mountainous area of northern Guangdong province, the AGB is affected by the terrain. Therefore, elevation was selected as a covariable for the interpolation. The residual value is defined in Eq.(3): (see Equation 3 in the Supplemental Files) 

where Z2,CK*(x0) is the residual value of the AGB to be estimated; Z1(x1i) is the AGB residual of the site i; λ1i is the weighting coefficient of site i, Z2(x2j) is the elevation of site j, λ2j is the weight assigned to the elevation, N1 is the number of training samples, and N2 is the number of sample points of the elevation, where N1 ≥N2 .

A variogram describes the structural changes of regionalized variables, as well as random changes, and is an effective tool for the analysis of spatial variation and spatial structure. In kriging, the type of variogram is important because it determines the accuracy and reliability of the estimate. In this study, the semivariogram was modeled in GS+ (version 9.0, Leland Stanford Junior University, Stanford, California, USA) using spherical, exponential, and Gaussian functions. The three model parameters of a semivariogram are the nugget, range, and sill. The nugget represents the small-scale variability of the data. The range is the distance beyond which little or no spatial autocorrelation occurs. The sill is the maximum value of the semivariogram, where the spatial distance between two locations reaches the range (Ou et al. 2017). The nugget effect, that is, the ratio between the nugget and sill, calculating the nugget effect can be used to compare the relationship between local variation and population variation, the stronger spatial autocorrelation is denoted by the lower values of nugget/sill (Matheron 1963; Zimmerman and Zimmerman 1991). The fitting performance of the variogram was estimated by the coefficient of determination (R2) and the residual sum of squares (RSS). The larger the R2, and the smaller the RSS, the lower nugget effect, the better the fitting performance is.

The following steps were performed for RFOK modeling: the estimated residual value of each sample point was obtained by subtracting the predicted value of the RF result from the observed AGB value (Eq. (4). The residual values of the sample points were modeled using Eq.(2) and (3) to obtain the structure of the component in the residuals. Subsequently, the final AGB estimate was obtained by adding the structure of the component to the RF-based prediction (Eq.(5)). (see Equations 4 and 5 in the Supplemental Files) 

where Z(xi) is the AGB residual value of site i, C(xi) is the observed AGB of site i, ĈRF(xi) is the RF-based predicted AGB at site i, ĈRFOK(xi) and ĈRFCK(xi) are the predicted AGB at site i using RFOK and RFCK respectively.

Finally, we used the maximum likelihood classifier in the ENVI 5.3 environment to classify the Landsat images into forest and non-forest classes. The classified forest pixels were used as a mask to extract AGB maps of the forest areas.

2.3.3 Model Fitting and Evaluation

The ground observation data were randomly divided into a training set (80% of the data) and a validation set (20% of the data) for each bootstrap analysis. We used several statistical measures to quantify the model performance, including the R2, (Eq. (6)), the mean absolute error (MAE) (Eq. (7)), the root mean square error (RMSE) (Eq. (8)), and percentage mean residual deviation (Bias%) (Eq. (9)). (see Equations 6-9 in the Supplemental Files) 

where n is the number of samples, ŷi is the AGB predicted by the model, yi is the observed AGB from the ground measurements, y is the arithmetic mean of all observed AGB values. Additionally, the relative improvement (RI) index of the RFOK and RFCK models over the RF model was calculated using Eq.(10) to assess the performance improvement: (see Equation 10 in the Supplemental Files) 

3. Results

3.1 Variable Importance

In the variable importance analysis, %IncMSE is the importance ranking result obtained by replacing the OOB data, and IncNodePurity is the importance ranking result calculated by the Gini index. Figure 3 shows the top 25 %IncMSE variables identified by the RF OOB strategy, indicating the ability of the variables to predict AGB. The size and color of the circles indicate the IncNodePurity. The ten variables with the highest IncNodePurity were selected as independent variables for AGB mapping. In the 2010 variable selection, the PALSAR-derived variables (e.g., HH_HV and RFDI), were included in the list, indicating the relatively high importance for the prediction of AGB. Overall, the reflectance and combined indices from Landsat-5 accounted for a large proportion of the better-performing variables, and the texture variables from the PC1 were also important variables for AGB estimation.

We selected ten variables to reduce the complexity and calculation load, including TM7_1, B7, B2, SAVI, TM14, TM17, B5, TM37, VRI, and second77 to map AGB in 1992. For 2002, we used the mean55, RSI, mean77, TM37, VRI, TM12, TM27, TCD, elevation, and TM34. Mean77, B1, TM34, HH_HV, TM2_1, TM25, ARVI, B2, RFDI, and EVI were selected for mapping AGB in 2010.

3.2 Random forest modeling

We attained the performance measures for the three RF prediction models. Table 3 lists the performance measures of the fitted models for 1992, 2002, and 2010. The R2 for model training ranged from 0.86 in 1992 to 0.95 in 2010; when the fitted models were used to predict the training data, we obtained RMSEs of 9.05 t/ha, 43.68 t/ha, and 37.51 t/ha for 1992, 2002, and 2010 respectively. We also obtained bias% of -0.05%, -0.67%, and -1.84% for 1992, 2002, and 2010 respectively. The lowest RMSE was observed for the 1992 AGB because the biomass values were lowest for the 1992 training plot; the same was observed for the standard deviation (Table 1). As a result, the R2 had the lowest value (0.86) of the three years. In contrast, due to the use of microwave information (HH_HV, RFDI) in the 2010 RF model, the model achieved the highest R2 (0.95) (Table 3). The values of the Bias% for the 1992 model were nearer closer to 0 than those of 2002 and 2010. Figure 4 shows the validation performances of the three fitted models; 20% of the data were used to conduct the validation. The validation R2 values were around 0.40, and the fitted line (red) was substantially different from the 1:1 line; the low AGB observations were overestimated, and the high AGB observations were underestimated.

Table 1. Descriptive statistics of the AGB (t/ha) of the sample plots in the three forest inventories.

Year

Number of sample plots

AGB (t/ha)

Number of samples used

Value range

Median

Mean

Std Deviation

Training

Validation

1992

201

1.134 - 50.241

13.511

9.484

6.152

161

40

2002

278

2.388 - 256.310

55.340

41.856

23.522

222

56

2010

218

4.728 - 250.881

61.839

47.600

24.924

174

44


Table 3. Summary of the modeling performance measures derived from the three models.

Year

Variables used for model prediction

Predicted RMSE (t/ha)

MAE (t/ha)

Bias

(%)

Training R2

1992

TM7_1, second77, B2,SAVI, TM14,TM17, B5,TM37,VRI, B7

9.05

4.04

-0.05

0.86

2002

mean55, RSI, mean77,TM37,VRI, TM12, TM27, TCD,elevation,TM34

43.68

16.68

-0.67

0.90

2010

mean77,B1,TM34, HH_HV, TM2_1, TM25, ARVI, B2, RFDI,EVI

37.51

13.88

-1.84

0.95


3.3 Random forest combined with kriging models

3.3.1 Residuals of RF-derived AGB and semivariance analysis

Table 4 summarizes the descriptive statistics of the residuals. The absolute kurtosis values of the three groups of residuals were close to 3 or 2, and the skewness was close to 1, indicating that the residuals were approximately normally distributed (Figure 5). The means of the residuals in 1992 and 2010 were 0.004 t/ha and 0.186 t/ha, respectively; these values were closer to 0 than the value in 2002 (-1.223 t/ha). The 1992 residuals had the lowest standard deviation, indicating a small difference between most residuals and the mean residual. The range of the residuals was largest in 2002 (159.1t/ha). The residuals are shown in different colors and sizes based on the quartile distribution (Figure 5).

Table 4. Descriptive statistics of the residuals 

Year

Residual value

Mean (t/ha)

Std deviation (t/ha)

Value range(t/ha)

Skewness

Kurtosis

1992

0.004

5.746

-11.16—20.90

1.39

2.67

2002

-1.223

21.906

-53.87—104.23

1.31

2.95

2010

0.186

18.166

-42.98—70.42

0.90

2.09


After confirming the approximate normality, the three groups of the residuals were used to calculate empirical semivariograms for the subsequent RFOK interpolation. In GS+, for the semivariograms, the Gaussian model was used for 1992, the exponential model for 2002, and the spherical model for 2010. Additionally, the elevation of the plots was used as a covariable in the RFCK model to obtain accurate estimates. Table 5 and Figure 6 show the parameters of the semivariogram models and the semivariograms, respectively of the RFOK and RFCK models. Overall, the fitting performance was better for the RFCK than the RFOK model; the former had a larger R2, smaller RSS, and lower nugget/still value than the latter (Table 5). In the comparisons of Table 5, The nugget/still values indicated that the variability caused by spatial autocorrelation in the 1992_RFOK model is the larger than the other years’ models. The highest nugget value of 402.91 for the 2002 RFOK model suggested that the 2002 residuals exhibited the strongest spatial heterogeneity. The RFCK models had lower nugget values than the RFOK models, indicating that the addition of the elevation covariable reduced the spatial variability (Table 5). Overall, there was lower spatial heterogeneity in the 1992 residuals than in the 2002 and 2010 residuals, indicating by the smallest nugget/sill ratio in the 1992_RFOK model. The 1992 models had the highest spatial autocorrelation and highest suitability for kriging than models in 2002 and 2010, so 1992 models provided better kriging results.

Table 5. Parameters of the theoretical semivariogram models of the residuals for the RFOK and RFCK models.

Model parameter

Theoretical model

Nugget

Sill

Nugget/Sill ratio

Range (degree)

R2

RSS

1992_RFOK

Gaussian

9.95

27.19

0.366

0.618

0.921

80.90

1992_RFCK

Gaussian

8.11

12.42

0.653

0.653

0.923

52.15

2002_RFOK

Exponential

402.91

507.46

0.792

0.935

0.139

9865.00

2002_RFCK

Exponential

365.46

482.14

0.758

0.905

0.143

9217.00

2010_RFOK

Spherical

307.80

328.50

0.937

0.348

0.156

9364.00

2010_RFCK

Spherical

302.93

312.96

0.968

0.492

0.181

8124.50


3.3.2 Forest AGB mapping using RFOK and RFCK

Based on Eq. (5), the predicted AGB was obtained from the RFOK and RFCK models (Figure 7), followed by validation with 20% of the samples. Table 6 shows the validation accuracies of the three AGB models in each of the three years. The 1992 RFCK model had the highest RI value (17.7%) (compared with the 1992 RF model); the R2 increased from 0.41 to 0.81, and the MAE decreased from 9.93 t/ha to 6.54 t/ha, the bias % changed from -22.59% to -10.68%. The accuracy of the 1992 AGB maps was substantially higher than that of the 2010 and 2002 maps. The reason may be that the observed forest AGB values were lowest in 1992, and the variability was lowest (Table 1); thus, the saturation effect was minimal in the 1992 models, resulting in higher validation R2 values. In 2002 and 2010, saturation occurred (the highest measurements on the ground were above 250 t/ha, Table 1), resulting in a significant decrease in the validation R2 values to about 0.4 and 0.5. The bias% of the validated model in 2010 was lower than those in 1992 and 2002. Additionally, the difference in the RI value between the RFOK and RFCK model in 2010 was 0.007, which was attributed to the use of microwave data in the model; in contrast, the differences in the RI values for 1992 and 2002 were only 0.001 and 0.002, respectively since no microwave data were used. The RFCK model outperformed the RFOK and RF models in all three years (Table 6).

Table 6. Accuracy of the RF, RFOK, and RFCK models based on the validation data.

Year

Models

MAE (t/ha)

RMSE (t/ha)

R2

Bias(%)

RI

1992

RF

9.93

18.99

0.41

-22.59

/

 

RFOK

6.64

15.64

0.81

-10.48

0.176

 

RFCK

6.54

15.62

0.81

-10.68

0.177

2002

RF

34.81

56.67

0.35

-5.67

/

 

RFOK

33.57

53.81

0.40

-5.47

0.050

 

RFCK

32.32

53.78

0.41

-3.67

0.052

2010

RF

24.56

31.21

0.49

-3.89

/

 

RFOK

23.47

30.23

0.53

-3.59

0.031

 

RFCK

22.14

29.97

0.55

-1.85

0.038



















The range of the AGB predictions obtained from the RF model was 3.73-36.73 t/ha in 1992, 24.47-247.04 t/ha in 2002, and 33.30-258.38 t/ha in 2010. The range of AGB predictions obtained from the RFOK model was 3.78-48.84 t/ha in 1992, 8.09-266.67 t/ha in 2002, and 23.23-270.58 t/ha in 2010. The range of AGB predictions obtained from the RFCK model was 2.09-51.03 t/ha in 1992, 14.15-259.55 t/ha in 2002, and 21.89-267.43 t/ha in 2010. The range of AGB showed an increasing trend from 1992 to 2010. A decreasing trend in AGB was observed from the southeast to the northwest in 1992, whereas, in 2002 and 2010, higher AGB was observed in high-elevation mountain areas (Figure 7 a, d, e). From 2002 to 2010, areas with AGB greater than 100 t/ha increased substantially, whereas areas with AGB less than 40 t/ha decreased.

4. Discussion

4.1 Variable selection and AGB estimation accuracy

The selection of suitable remote sensing variables is a critical step in AGB estimation (Lu et al. 2016). The variables used as input parameters before modeling can be divided into three types: spectral index, terrain variables, and texture measures. An importance analysis was used to determine the best predictors. The influence of the variables on AGB mapping was determined by order of importance (Figure 3), the performance measures of the models (Table 3), and the mapping results (Figure 7). Previous research has shown that texture measures have the potential to improve AGB estimation. Zhao et al. found that texture measures were valuable for AGB estimation of subtropical forests in southwest China, especially for forests with complex stand structures, such as mixed forests and pine forests with understories of broadleaf species (Zhao et al. 2016). Tuominen et al. extracted features from normalized difference vegetation index and band ratio images and analyzed the correlations between the extracted image features and forest attributes measured from sample plots (Tuominen and Pekkarinen, 2005). Our research results confirm that the role of texture in AGB estimation depends on the vegetation type. The results of the importance analysis of the 2002 and 2010 variables indicated a high correlation between the textural variables and AGB, especially for the mean texture measure of the gray-level co-occurrence matrix (Figure3). The eight texture measures based on the gray-co-occurrence matrix generated from the PC1 and the backscatter of PALSAR performed well for AGB estimation, and the texture variables obtained from the Landsat PC1 in the 2010 model were better than those from PALSAR HH/HV. All predictors contributed to the integrated model, but the vegetation indices and spectral bands comprised the largest proportion of modeling variables (Table 3); this result was consistent with previous studies (Foody et al. 2003; John et al. 2018; Zhang et al. 2019). For forest sites with complex forest structure and species composition, such as pine forests with understories of broadleaf species, texture measures are needed. They had higher importance values than spectral information when TM imagery or PALSAR HH and HV polarization data were used. The importance analysis of the 1992 data showed that spectral variables ranked higher than texture variables. The reason is that no large-scale afforestation projects were conducted in Guangdong province, and the average AGB in the study area is relatively low; thus, spectral saturation was not a problem. Furthermore, most forests are not mature (low AGB values in Table 1) and have a relatively simple structure; therefore, texture measures are less suitable than spectral variables for capturing the simple structure.

Precipitation is absorbed by the forest soil and plant leaves, and the moisture affects not only the spectral information of ground features but also the forest biomass (Chen et al. 2019). We used Landsat images acquired during the summer in the growing season (Table 2), which was coincident with the period of heavy precipitation in the study area. The spectral signatures of sparse forests are affected by the soil background. Subtropical forests grow vigorously due to the abundance of water. Vegetation indices were included because they minimize the influence of the background on AGB estimation, but more ground data need to be obtained by multi-source sensors. HH polarized data are sensitive to moisture beneath the canopy, whereas HV polarized data are sensitive to volume scattering and biomass (Hess et al. 1995). However, the estimation of forest biomass using ALOS PALSAR data currently has limitations, because the L-band saturates at about 150 t/ha (Minh et al. 2018). Mermoz et al. found a negative correlation between the SAR backscattering coefficient and forest biomass after reaching a maximum value (Mermoz et al. 2015). This result was attributed to signal attenuation from the forest canopy as the canopy becomes denser with increasing biomass. We also selected PALSAR-based polarized features as predictors to map AGB, and our results were in agreement with previous research findings.

We found that RF modeling overestimated low biomass values and underestimated high biomass values (Figure 4), which partly explained the presence of bias in the AGB prediction and indicated a saturation problem. The 2010 RF model incorporated a combination of Landsat TM and PALSAR variables, and this model improved the AGB estimation of forest stands by less than 150 t/ha. The scatterplots in Figure4 showed that the points of the 2002 RF model were clustered. The predicted value obtained by the 2010 model was lower than that of the observed AGB. Similarly, the predicted values of the two verification points with AGB values greater than 150 t/ha in 2010 are low, indicating the limitation of the L-band SAR sensor when dealing with saturation in high biomass stands. Nonetheless, the use of SAR sensors with higher radar wavelengths (e.g., P-band) may be suitable for the estimation of biomass at higher levels (on the order of 300 mg/ha) (Minh et al. 2013). Additionally, LiDAR systems can capture the horizontal and vertical structure of vegetation in great detail, and the data have a higher threshold for sensor saturation (e.g., biomass estimation on the order of 1200 Mg/ha) (Lefsky et al. 2002b; Giannico et al. 2016; Manuri et al. 2017; Valbuena et al. 2017). However, considering the high acquisition cost and limited acquisition scope, LiDAR data were not considered in this study.

In addition, we had to obtain the AGB of the sample plots in 2010 by linear interpolation to ensure time matching of the data; therefore, the results may have differed from the AGB obtained by the allometric growth equations using NFI data. The forest growth rate is not uniform, and the dependent variable in 2010 had errors, which affected the estimation accuracy of the model. During sample point screening, the standard deviation method we used to remove outliers may not be appropriate, because the occurrence of outliers would have a great impact on the mean value and standard deviation. There is a more stable technique of outlier removal worth considering like the Center distance calculation method based on median absolute deviation (Leys et al. 2013). In the co-kriging process for the residual, we did not consider the nonlinear relationship between the covariable with AGB, the selection of covariables could be quantitatively considered in the further research, and some methods should be used for linear transformation of covariates (e.g. box-cox) (Fox et al. 2020). In general, the results of the RF model indicate that time-series derived from multi-temporal Landsat images and PALSAR data can improve the accuracy of AGB estimation and reduce the saturation problem. The accuracy of model prediction can be further improved by obtaining dependent variables obtained from more accurate sample site measurements in a future study.

4.2 Model improvement and map accuracy

This study proved that the RF model combined with kriging provided higher mapping accuracy than the RF models, and on this basis, RFCK was marginally better than RFOK. Table 6 shows the performance of the three models (RF, RFOK, and RFCK). Among the three models each year, the RFCK model was the one with the largest R2, the smallest RMSE and MAE, and the bias% was closest to 0. The RI was largest in 1992, followed by 2002 and 2010. Chen et al. found that due to the low spatial autocorrelation of the RF-based residuals, the improvement in the accuracy of the RFOK model was limited, and the degree of accuracy improvement was higher in the models that used variables from a single sensor (Chen et al. 2019). Hengl et al. showed that if the model residuals exhibited spatial autocorrelation, the model performance could be improved by interpolating residuals using kriging and adding the components to the RF model (Hengl et al. 2004). Nothdurft et al. (2009) and Fox et al. (2020) has proved the feasibility of improve the accuracy of non-parametric models by variance or residuals. The results of our analysis confirmed these results. The model variables in 1992 were derived from a single remote sensing source, and the RI of the RFOK model was 0.176; in contrast, the 2010 model used a combination of optical remote sensing data and radar data, resulting in the lowest RI of the RFOK model for the three years (RI=0.031). In addition, the semivariogram results (Table 5) indicated that the 1992 RFOK model had a high RI because of the strong spatial autocorrelation of the residuals. Studies using similar residual interpolation methods to predict soil organic matter have shown that the accuracy was substantially improved when the nugget/sill values were lower than 0.6 (Guo et al. 2015; Tziachris et al. 2019). The results of our study confirmed these findings. The spatial distribution of the RF residuals (Figure 5) showed that there was less spatial autocorrelation, the results of the OK interpolation were more accurate, and the RI in the accuracy of the RF model was higher (i.e., for the 1992 model) when the residual range was relatively narrow, and the spatial distribution was uniform. The validation R2 values of the RFOK and RFCK models were both 0.81, and the RMSE of the RFOK model increased by 17.6% compared to the RF model. As shown in Figure 7 (a) and (e), there are significant differences in the AGB between the RFOK map and the RF map, and the patches are better defined in the RFOK map.

Previous studies have shown that there is a strong correlation between elevation and biomass in subtropical mountains (Silveira et al. 2019b). Vegetation carbon storage and elevation showed similar spatial variation trends, and carbon storage was closely related to AGB (Yadav et al. 2019). In our study, the residual values of 2002 and 2010 were randomly distributed, and high residual values were observed in the mountain areas. This finding was related to the overestimation and underestimation of the low values in the RF model in 2002 and 2010, respectively. The sample sites in the high mountains are in areas of dense forest with high biomass; the AGB at these sites was underestimated by the RF model, resulting in high residual values. As shown in Table 6, in 2002 and 2010, the RFCK model, which included elevation as a covariable, had lower nugget and sill values and higher R2 values than the RFOK model. This result demonstrated that in regions with an uneven spatial distribution of AGB, the use of elevation as a covariable improved the accuracy of the interpolation results and minimized the saturation effect in areas with high biomass, especially in intersection of mountains and plains. Future studies should consider other covariables, such as fractional vegetation cover (FVC) and precipitation, which are often used for estimating forest AGB with geostatistical methods (Eduardo et al. 2016).

Compared with the RF model that only considered the variables from remote sensing data, the RFCK model included the residuals, resulting in higher accuracy of the AGB prediction because spatial autocorrelation was considered (Table 6). Figure 4 shows the scatter plots of the observed values against the predicted values from the model validation process, different colors are used to distinguish different models. The RFOK scatter points(yellow) get closer to the 1:1 trend line than the RF’s (blue), and the RFCK scatter points (green) are closer to the 1:1 trend line than the RFOK’s. Figure 7 shows the interpolation results of the residuals (Figure 7 b and c) and the final AGB maps (Figure 7 d and e) derived from adding the structured components to the maps generated by the RF model (Figure 7 a). The two kriging methods showed the same spatial pattern, but the CK results exhibited more spatial details than the OK results. The interpolation results of OK and CK exhibit differences, but both results show lower residual values in areas with less forest biomass, such as valleys and rivers. At the boundary of mountain valleys and plains, errors caused by the sensor angle and solar azimuth cannot be eliminated (Saraf et al. 1996). The RFCK model provides better results than the RFOK model regarding the texture and rich details of the AGB on the mountain slopes.

The R2 of the RCFK model was 0.81 in 1992, 0.41 in 2002, and 0.55 in 2010. The ensemble model used by Zhang et al. explained 75% of the variance of forest AGB in a Chinese subtropical forest, with an RMSE of 45.5 Mg/ha (Zhang et al. 2019). Our study proved that the RFCK model provided excellent accuracy for forest AGB estimation. Shen et al. used an RF model with multi-temporal Landsat images for AGB estimation in northern Guangdong in 2011; the RMSE was 39.49 t/ha, and the R2 was 0.51 (Shen et al. 2016). The accuracies of this study are better than those of other studies of modeling AGB in subtropical mountains using a combination of optical and radar data. The RFCK model requires no assumptions on the relationships between the target variable and predictor variables, and can deal with nonlinear relationships and eliminate some overfitting in non-parametric estimation. On the other hand, the disadvantage is that it is challenging to interpret the relationships between the response variable and independent variables (e.g., the residuals represented the unexplained variance of the model).

4.3 Long-term changes in AGB and their effects on policy

In the study area, significant changes were observed in the AGB values and spatial distribution in the 18 years. Previous studies have shown that extensive deforestation contributes to global climate change, altering hydrological cycle patterns, and resulting in adverse environmental effects, such as soil erosion and degradation (Eckert et al. 2011; Muttaqin et al. 2019). Shen et al. found increases in the temperature and decreases in the precipitation from 1986 to 2016 of Guangdong province, and the correlation between climate and AGB decrease was lower than that between human activities and AGB (Shen et al. 2018). In addition, there were interactions between climate and human activities. The changes in the AGB in the study area was primarily the result of afforestation and deforestation. Guangdong Province had limited forest resources in the 1980s, and the forest coverage was only 30.2%; however, the population density was high, and the economy prospered. After the implementation of China's "Economic reform and opening up" policy, Guangdong's economy developed rapidly, the population increased, and a large proportion of forest land was used for urban expansion and construction. As a result, the forest area was reduced substantially. In the 1990s, the province implemented forest land protection policies and deforestation projects, and the area of forest land gradually increased.

The afforestation areas in Guandong province, China from 1992 to 2010 are shown in Figure 8 and Table 7; the data were obtained from the Statistical yearbook of Guangdong (http://stats.gd.gov.cn/). Wang et al. found that trees planted in the late 1980s grew into medium and near-mature forests between 1992 and 1997, increasing the capacity of the forest for carbon storage and resulting in a substantial increase in forest biomass and carbon density in 1997 (Wang et al. 2016) . These data are in agreement with our findings that both the AGB data provided by the NFI and the AGB estimates obtained from the proposed model in this study showed a significant increase in AGB from 1992 to 2002. In the decade from 2002 to 2010, the mean and maximum of the biomass also increased slightly. These changes are attributed to harvesting of timber within the scope of forest management practices, as well as the establishment of many nature reserves to develop tourism resources in the mountains in northern Guangdong. In 2012, the proportions of near-mature forests, mature forests, and over-mature forests in the province were substantially higher than those in 2002 and 1992, confirming the effects of afforestation in the past 20 years. The AGB trend of the maps obtained by the RFCK model for the period corresponds well to the data obtained from the Statistical yearbook of Guangdong well, demonstrating the reliability of the model. Future research should focus on longer time-series data and determine the influence of other factors on regional AGB to provide information for administrative departments and forest management.

Table 7. Proportion of forests of different stand ages in 1992-2012 in Guangdong

Year

Young forest %

Half-mature forest %

Near-mature forest %

Mature forest %

Over-mature forest %

1992

40.19

40.52

15.32

3.04

0.93

2002

38.78

39.74

15.18

5.92

0.38

2012

21.90

41.33

20.75

12.84

3.18


5. Conclusions

The results demonstrated that the RFCK model based on Landsat had the best performance for the prediction of AGB in 1992, with an R2 value of 0.81. The changes in the spatial distribution of the AGB in the three years were confirmed by forest management statistics. Validation with an independent dataset showed that the RFCK model had the highest accuracy for AGB estimation, followed by the RF and RFOK models. The RFCK model also provided a more realistic spatial distribution of AGB than the RFOK model. The saturation effect was minimized by using PALSAR data; the residuals had higher spatial autocorrelation and less heterogeneity in 2010 than in 2002. Overall, the proposed RFCK model provided the best performance for AGB mapping in the subtropical forest with complex terrain. The result of this study can be used by forest managers to develop forest management practices and protect forest resources.

Declarations

Authors’ contributions

H.S. designed the study, data preparation, analysis, and wrote the paper. W.S. participated in the conceptual design and provided suggestion of the method. J.W. participated in the data preparation and processing. A.A. provided the paper editing. M.L. provided the project design and paper editing. All the authors read and approved the final manuscript.

Acknowledgements

The authors would like to acknowledge the United States Geological Survey (USGS),National Aeronautics and Space Administration (NASA),and Japan Aerospace Exploration Agency (JAXA) for providing the image data. Special thanks to the Guangdong Provincial Center for Forest Resources Monitoring for sharing their forest inventory data.

Funding

This work was jointly supported by the Natural Science Foundation of China [31670552, 31971577],China Postdoctoral Science Foundation [2019M651842], and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Conflict of interest statement

The authors declare no conflict of interest.

References

Ali A, Xu M-S, Zhao Y-T, et al (2015) Allometric biomass equations for shrub and small tree species in subtropical China. Silva Fenn 49:1–10. https://doi.org/10.14214/sf.1275

Amit Y, Geman D (1997) Shape Quantization and Recognition with Randomized Trees. Neural Comput 9:1545–1588. https://doi.org/10.1162/neco.1997.9.7.1545

Arroyo LA, Johansen K, Armston J, Phinn S (2010) Integration of LiDAR and QuickBird imagery for mapping riparian biophysical parameters and land cover types in Australian tropical savannas. For Ecol Manage 259:598–606. https://doi.org/https://doi.org/10.1016/j.foreco.2009.11.018

Aslan A, Rahman AF, Warren MW, Robeson SM (2016) Mapping spatial distribution and biomass of coastal wetland vegetation in Indonesian Papua by combining active and passive remotely sensed data. Remote Sens Environ 183:65–81. https://doi.org/10.1016/j.rse.2016.04.026

Avtar R, Sawada H, Takeuchi W, Singh G (2012) Characterization of forests and deforestation in Cambodia using ALOS/PALSAR observation. Geocarto Int 27:119–137. https://doi.org/10.1080/10106049.2011.626081

Avtar R, Suzuki R, Takeuchi W, Sawada H (2013) PALSAR 50 m Mosaic Data Based National Level Biomass Estimation in Cambodia for Implementation of REDD+ Mechanism. PLoS One 8:. https://doi.org/10.1371/journal.pone.0074807

Baccini A, Goetz SJ, Walker WS, et al (2012) Estimated carbon dioxide emissions from tropical deforestation improved by carbon-density maps. Nat Clim Chang 2:182–185. https://doi.org/10.1038/nclimate1354

Baccini A, Walker W, Carvalho L, et al (2017) Tropical forests are a net carbon source based on aboveground measurements of gain and loss. Science (80- ) 358:230 LP – 234. https://doi.org/10.1126/science.aam5962

Breiman L (2004) CONSISTENCY FOR A SIMPLE MODEL OF RANDOM FORESTS

Breiman L (2001) Random forests. Random For 45:5–32. https://doi.org/10.1201/9780367816377-11

Cao L, Coops NC, Innes JL, et al (2016) Estimation of forest biomass dynamics in subtropical forests using multi-temporal airborne LiDAR data. Remote Sens Environ 178:158–171. https://doi.org/10.1016/j.rse.2016.03.012

Chatterjee S, Santra P, Majumdar K, et al (2015) Geostatistical approach for management of soil nutrients with special emphasis on different forms of potassium considering their spatial variation in intensive cropping system of West Bengal, India. Environ Monit Assess 187:1–17. https://doi.org/10.1007/s10661-015-4414-9

Chave J, Réjou-Méchain M, Búrquez A, et al (2014) Improved allometric models to estimate the aboveground biomass of tropical trees. Glob Chang Biol 20:3177–3190. https://doi.org/10.1111/gcb.12629

Chen G, Hay GJ, Zhou Y (2010) Estimation of forest height, biomass and volume using support vector regression and segmentation from lidar transects and Quickbird imagery. In: 2010 18th International Conference on Geoinformatics. IEEE, pp 1–4

Chen L, Wang Y, Ren C, et al (2019a) Assessment of multi-wavelength SAR and multispectral instrument data for forest aboveground biomass mapping using random forest kriging. For Ecol Manage 447:12–25. https://doi.org/10.1016/j.foreco.2019.05.057

Chen X, Chen HYH, Chen C, Peng S (2019b) Water availability regulates negative effects of species mixture on soil microbial biomass in boreal forests. Soil Biol Biochem 139:107634. https://doi.org/10.1016/j.soilbio.2019.107634

Chirici G, Barbati A, Corona P, et al (2008) Non-parametric and parametric methods using satellite images for estimating growing stock volume in alpine and Mediterranean forest ecosystems. Remote Sens Environ 112:2686–2700. https://doi.org/10.1016/j.rse.2008.01.002

Chopping M, Schaaf CB, Zhao F, et al (2011) Forest structure and aboveground biomass in the southwestern United States from MODIS and MISR. Remote Sens Environ 115:2943–2953. https://doi.org/https://doi.org/10.1016/j.rse.2010.08.031

Cressie N (1990) The origins of kriging. Math Geol 22:239–252

Cressie N, Gotway CA, Grondona MO (1990) Spatial prediction from networks. Chemom Intell Lab Syst 7:251–271. https://doi.org/https://doi.org/10.1016/0169-7439(90)80115-M

Dassot M, Colin A, Santenoise P, et al (2012) Terrestrial laser scanning for measuring the solid wood volume, including branches, of adult standing trees in the forest environment. Comput Electron Agric 89:86–93. https://doi.org/https://doi.org/10.1016/j.compag.2012.08.005

Deo R, Russell M, Domke G, et al (2017) Evaluating Site-Specific and Generic Spatial Models of Aboveground Forest Biomass Based on Landsat Time-Series and LiDAR Strip Samples in the Eastern USA. Remote Sens 9:598. https://doi.org/10.3390/rs9060598

Eckert S, Rakoto H, Olivia L, et al (2011) Deforestation and forest degradation monitoring and assessment of biomass and carbon stock of lowland rainforest in the Analanjirofo region , Madagascar. For Ecol Manage 262:1996–2007. https://doi.org/10.1016/j.foreco.2011.08.041

Eduardo P, Ocimar A, Monteiro T, et al (2016) Spatial distribution of forest biomass in Brazil’s state of Roraima,northern Amazonia. For Ecol Manage 377:170–181. https://doi.org/10.1016/j.foreco.2016.07.010

Fan Y, Koukal T, Weisberg PJ (2014) A sun-crown-sensor model and adapted C-correction logic for topographic correction of high resolution forest imagery. ISPRS J Photogramm Remote Sens 96:94–105. https://doi.org/10.1016/j.isprsjprs.2014.07.005

Foody GM, Boyd DS, Cutler MEJ (2003) Predictive relations of tropical forest biomass from Landsat TM data and their transferability between regions. Remote Sens Environ 85:463–474. https://doi.org/10.1016/S0034-4257(03)00039-7

Foody GM, Cutler ME, McMorrow J, et al (2001) Mapping the biomass of Bornean tropical rain forest from remotely sensed data. Glob Ecol Biogeogr 10:379–387. https://doi.org/10.1046/j.1466-822X.2001.00248.x

Fox EW, Ver Hoef JM, Olsen AR (2020) Comparing spatial regression to random forests for large environmental data sets. PLoS One 15:1–22. https://doi.org/10.1371/journal.pone.0229509

Gasparri NI, Parmuchi MG, Bono J, et al (2010) Assessing multi-temporal Landsat 7 ETM+ images for estimating above-ground biomass in subtropical dry forests of Argentina. J Arid Environ 74:1262–1270. https://doi.org/10.1016/j.jaridenv.2010.04.007

Giannico V, Lafortezza R, John R, et al (2016) Estimating Stand Volume and Above-Ground Biomass of Urban Forests Using LiDAR. Remote Sens 8:339. https://doi.org/10.3390/rs8040339

Gonzalez P, Asner GP, Battles JJ, et al (2010) Forest carbon densities and uncertainties from Lidar, QuickBird, and field measurements in California. Remote Sens Environ 114:1561–1575. https://doi.org/https://doi.org/10.1016/j.rse.2010.02.011

Guo P-T, Li M-F, Luo W, et al (2015a) Digital mapping of soil organic matter for rubber plantation at regional scale: An application of random forest plus residuals kriging approach. Geoderma 237–238:49–59. https://doi.org/https://doi.org/10.1016/j.geoderma.2014.08.009

Guo P, Li M, Luo W, et al (2015b) Geoderma Digital mapping of soil organic matter for rubber plantation at regional scale : An application of random forest plus residuals kriging approach. Geoderma 237–238:49–59. https://doi.org/10.1016/j.geoderma.2014.08.009

Hall RJ, Skakun RS, Arsenault EJ, Case BS (2006) Modeling forest stand structure attributes using Landsat ETM+ data: Application to mapping of aboveground biomass and stand volume. For Ecol Manage 225:378–390. https://doi.org/https://doi.org/10.1016/j.foreco.2006.01.014

Hansen MC, Potapov P V., Goetz SJ, et al (2016) Mapping tree height distributions in Sub-Saharan Africa using Landsat 7 and 8 data. Remote Sens Environ 185:221–232. https://doi.org/10.1016/j.rse.2016.02.023

Hastie T et. all. (2008) The Elements of Statistical Learning (2nd ed.). Springer

Hengl T, Heuvelink GBM, Stein A (2004) A generic framework for spatial prediction of soil variables based on regression-kriging. 120:75–93. https://doi.org/10.1016/j.geoderma.2003.08.018

Hess LL, Melack JM, Filoso S, Yong Wang (1995) Delineation of inundated area and vegetation along the Amazon floodplain with the SIR-C synthetic aperture radar. IEEE Trans Geosci Remote Sens 33:896–904. https://doi.org/10.1109/36.406675

Hoover CM, Ducey MJ, Colter RA, Yamasaki M (2018) Evaluation of alternative approaches for landscape-scale biomass estimation in a mixed-species northern forest. For Ecol Manage 409:552–563. https://doi.org/10.1016/j.foreco.2017.11.040

Imhoff ML (1995) Radar backscatter and biomass saturation: ramifications for global biomass inventory. IEEE Trans Geosci Remote Sens 33:511–518. https://doi.org/10.1109/TGRS.1995.8746034

Isaaks,Edward H., Mohan SR (1989) An introduction to applied geostatistics. Oxford, Oxford University Press

John R, Chen J, Giannico V, et al (2018) Grassland canopy cover and aboveground biomass in Mongolia and Inner Mongolia: Spatiotemporal estimates and controlling factors. Remote Sens Environ 213:34–48. https://doi.org/10.1016/j.rse.2018.05.002

Karlson M, Ostwald M, Reese H, et al (2015a) Mapping Tree Canopy Cover and Aboveground Biomass in Sudano-Sahelian Woodlands Using Landsat 8 and Random Forest. Remote Sens 7:10017–10041. https://doi.org/10.3390/rs70810017

Karlson M, Ostwald M, Reese H, et al (2015b) Mapping tree canopy cover and aboveground biomass in Sudano-Sahelian woodlands using Landsat 8 and random forest. Remote Sens 7:10017–10041. https://doi.org/10.3390/rs70810017

Kumar L, Sinha P, Taylor S, Alqurashi AF (2015) Review of the use of remote sensing for biomass estimation to support renewable energy generation. J Appl Remote Sens 9:97696

Le ND, Zidek J V (2006) Statistical analysis of environmental space-time processes. Springer Science & Business Media

Lee J-S (1980) Digital image enhancement and noise filtering by use of local statistics. IEEE Trans Pattern Anal Mach Intell 165–168

Lefsky MA, Cohen WB, Harding DJ, et al (2002a) Lidar remote sensing of above‐ground biomass in three biomes. Glob Ecol Biogeogr 11:393–399

Lefsky MA, Cohen WB, Parker GG, Harding DJ (2002b) Lidar remote sensing for ecosystem studies: Lidar, an emerging remote sensing technology that directly measures the three-dimensional distribution of plant canopies, can accurately estimate vegetation structural attributes and should be of particular inte. Bioscience 52:19–30

Leys C, Ley C, Klein O, et al (2013) Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. J Exp Soc Psychol 49:764–766. https://doi.org/10.1016/j.jesp.2013.03.013

Li X, Du H, Mao F, et al (2018) Estimating bamboo forest aboveground biomass using EnKF-assimilated MODIS LAI spatiotemporal data and machine learning algorithms. Agric For Meteorol 256–257:445–457. https://doi.org/10.1016/j.agrformet.2018.04.002

Liaw A, Wiener M (2002) Classification and regression by randomForest. R news 2:18–22

Lu D (2006) The potential and challenge of remote sensing‐based biomass estimation. Int J Remote Sens 27:1297–1328. https://doi.org/10.1080/01431160500486732

Lu D (2005) Aboveground biomass estimation using Landsat TM data in the Brazilian Amazon. Int J Remote Sens 26:2509–2525. https://doi.org/10.1080/01431160500142145

Lu D, Chen Q, Wang G, et al (2016) A survey of remote sensing-based aboveground biomass estimation methods in forest ecosystems. Int J Digit Earth 9:63–105. https://doi.org/10.1080/17538947.2014.990526

Lu D, Chen Q, Wang G, et al (2012) Aboveground forest biomass estimation with Landsat and LiDAR data and uncertainty analysis of the estimates. Int J For Res 2012:. https://doi.org/10.1155/2012/436537

Manuri S, Andersen H-E, McGaughey RJ, Brack C (2017) Assessing the influence of return density on estimation of lidar-based aboveground biomass in tropical peat swamp forests of Kalimantan, Indonesia. Int J Appl Earth Obs Geoinf 56:24–35. https://doi.org/https://doi.org/10.1016/j.jag.2016.11.002

Martins F da SRV, dos Santos JR, Galvão LS, Xaud HAM (2016) Sensitivity of ALOS/PALSAR imagery to forest degradation by fire in northern Amazon. Int J Appl earth Obs Geoinf 49:163–174

Matheron G (1963) Principles of geostatistics. Econ Geol 58:1246–1266. https://doi.org/10.2113/gsecongeo.58.8.1246

Mermoz S, Le Toan T (2016) Forest disturbances and regrowth assessment using ALOS PALSAR data from 2007 to 2010 in Vietnam, Cambodia and Lao PDR. Remote Sens 8:217

Mermoz S, Réjou-Méchain M, Villard L, et al (2015) Decrease of L-band SAR backscatter with biomass of dense forests. Remote Sens Environ 159:307–317. https://doi.org/10.1016/j.rse.2014.12.019

Minh DHT, Le Toan T, Rocca F, et al (2013) Relating P-band synthetic aperture radar tomography to tropical forest biomass. IEEE Trans Geosci Remote Sens 52:967–979

Minh DHT, Ndikumana E, Vieilledent G, et al (2018) Potential value of combining ALOS PALSAR and Landsat-derived tree cover data for forest biomass retrieval in Madagascar. Remote Sens Environ 213:206–214. https://doi.org/10.1016/j.rse.2018.04.056

Mitchard ETA, Saatchi SS, White L, et al (2012) Mapping tropical forest biomass with radar and spaceborne LiDAR in Lopé National Park, Gabon: overcoming problems of high biomass and persistent cloud. Biogeosciences 9:179–191

Muttaqin MZ, Alviya I, Lugina M, et al (2019) Developing community-based forest ecosystem service management to reduce emissions from deforestation and forest degradation. For Policy Econ 108:101938. https://doi.org/10.1016/j.forpol.2019.05.024

Muukkonen P, Heiskanen J (2007) Biomass estimation over a large area based on standwise forest inventory data and ASTER and MODIS satellite data: A possibility to verify carbon inventories. Remote Sens Environ 107:617–624. https://doi.org/10.1016/j.rse.2006.10.011

Myneni RB, Dong J, Tucker CJ, et al (2001) A large carbon sink in the woody biomass of Northern forests. Proc Natl Acad Sci 98:14784–14789

Nothdurft A, Saborowski J, Breidenbach J (2009) Spatial prediction of forest stand variables. Eur J For Res 128:241–251. https://doi.org/10.1007/s10342-009-0260-z

Ou Y, Rousseau AN, Wang L, Yan B (2017) Spatio-temporal patterns of soil organic carbon and pH in relation to environmental factors—A case study of the Black Soil Region of Northeastern China. Agric Ecosyst Environ 245:22–31

Pan Y, Birdsey RA, Fang J, et al (2011) A large and persistent carbon sink in the world’s forests. Science (80- ) 333:988–993. https://doi.org/10.1126/science.1201609

Paul KI, Roxburgh SH, Chave J, et al (2015) Testing the generality of above‐ground biomass allometry across plant functional types at the continent scale. Glob Chang Biol 22:2106–2124

Ploton P, Pélissier R, Barbier N, et al (2013) Canopy texture analysis for large-scale assessments of tropical forest stand structure and biomass. In: Treetops at risk. Springer, pp 237–245

Powell SL, Cohen WB, Healey SP, et al (2010) Quantification of live aboveground forest biomass dynamics with Landsat time-series and field inventory data: A comparison of empirical modeling approaches. Remote Sens Environ 114:1053–1068

Proisy C, Couteron P, Fromard F (2007) Predicting and mapping mangrove biomass from canopy grain analysis using Fourier-based textural ordination of IKONOS images. Remote Sens Environ 109:379–392. https://doi.org/https://doi.org/10.1016/j.rse.2007.01.009

Sandberg G, Ulander LMH, Fransson JES, et al (2011) L-and P-band backscatter intensity for biomass retrieval in hemiboreal forest. Remote Sens Environ 115:2874–2886

Santoro M, Beaudoin A, Beer C, et al (2015) Forest growing stock volume of the northern hemisphere: Spatially explicit estimates for 2010 derived from Envisat ASAR. Remote Sens Environ 168:316–334

Saraf AK, Das JD, AGARWAL B, Sundaram RM (1996) False topography perception phenomena and its correction. Int J Remote Sens 17:3725–3733

Schulze E, Biological ES, Geosciences E, Schulze E (2006) Biological control of the terrestrial carbon sink To cite this version : HAL Id : hal-00297549 Biogeosciences Biological control of the terrestrial carbon sink ∗. 3:147–166

Shen W, Li M, Huang C, et al (2018) Annual forest aboveground biomass changes mapped using ICESat/GLAS measurements, historical inventory data, and time-series optical and radar imagery for Guangdong province, China. Agric For Meteorol 259:23–38. https://doi.org/10.1016/j.agrformet.2018.04.005

Shen W, Li M, Huang C, Wei A (2016) Quantifying Live Aboveground Biomass and Forest Disturbance of Mountainous Natural and Plantation Forests in Northern Guangdong, China, Based on Multi-Temporal Landsat, PALSAR and Field Plot Data. Remote Sens 8:595. https://doi.org/10.3390/rs8070595

Silveira EMO, Espírito FD, Wulder MA, et al (2019a) Pre-stratified modelling plus residuals kriging reduces the uncertainty of aboveground biomass estimation and spatial distribution in heterogeneous savannas and forest environments. For Ecol Manage 445:96–109. https://doi.org/10.1016/j.foreco.2019.05.016

Silveira EMO, Henrique S, Silva G, et al (2019b) Object-based random forest modelling of aboveground forest biomass outperforms a pixel-based approach in a heterogeneous and mountain tropical environment. Int J Appl Earth Obs Geoinf 78:175–188. https://doi.org/10.1016/j.jag.2019.02.004

Su Y, Guo Q, Xue B, et al (2016) Spatial distribution of forest aboveground biomass in China: Estimation through combination of spaceborne lidar, optical imagery, and forest inventory data. Remote Sens Environ 173:187–199. https://doi.org/10.1016/j.rse.2015.12.002

Thenkabail PS, Stucky N, Griscom BW, et al (2004) Biomass estimations and carbon stock calculations in the oil palm plantations of African derived savannas using IKONOS data. Int J Remote Sens 25:5447–5472. https://doi.org/10.1080/01431160412331291279

Tuominen S, Pekkarinen A (2005) Performance of different spectral and textural aerial photograph features in multi-source forest inventory. Remote Sens Environ 94:256–268. https://doi.org/10.1016/j.rse.2004.10.001

Tziachris P, Aschonitis V, Chatzistathis T, Papadopoulou M (2019) Assessment of spatial hybrid methods for predicting soil organic matter using DEM derivatives and soil parameters. Catena 174:206–216

Vaglio Laurin G, Ding J, Disney M, et al (2019) Tree height in tropical forest as measured by different ground, proximal, and remote sensing instruments, and impacts on above ground biomass estimates. Int J Appl Earth Obs Geoinf 82:101899. https://doi.org/10.1016/j.jag.2019.101899

Valbuena R, Maltamo M, Mehtätalo L, Packalen P (2017) Key structural features of Boreal forests may be detected directly using L-moments from airborne lidar data. Remote Sens Environ 194:437–446. https://doi.org/10.1016/j.rse.2016.10.024

Vermote E, Saleous N (2007) LEDAPS surface reflectance product description. Coll Park Univ Maryl

Wang J, Wu H, Xin S, et al (2016) Forest Carbon Storage and Dynamic Change in Guangdong Province. J Northeast For Univ 36:18–22

Xie X, Wang Q, Dai L, et al (2011) Application of China’s national forest continuous inventory database. Environ Manage 48:1095–1106

Yadav RP, Gupta B, Bhutia PL, et al (2019) Biomass and carbon budgeting of land use types along elevation gradient in Central Himalayas. J Clean Prod 211:1284–1298. https://doi.org/10.1016/j.jclepro.2018.11.278

Yu G, Chen Z, Piao S, et al (2014) High carbon dioxide uptake by subtropical forest ecosystems in the East Asian monsoon region. Proc Natl Acad Sci 111:4910–4915

Zhang R, Zhou X, Ouyang Z, et al (2019) Estimating aboveground biomass in subtropical forests of China by integrating multisource remote sensing and ground data. Remote Sens Environ 232:111341. https://doi.org/10.1016/j.rse.2019.111341

Zhao P, Lu D, Wang G, et al (2016) Forest aboveground biomass estimation in Zhejiang Province using the integration of Landsat TM and ALOS PALSAR data. Int J Appl Earth Obs Geoinf 53:1–15. https://doi.org/10.1016/j.jag.2016.08.007

Zhao Q, Yu S, Zhao F, et al (2019) Comparison of machine learning algorithms for forest parameter estimations and application for forest quality assessments. For Ecol Manage 434:224–234. https://doi.org/10.1016/j.foreco.2018.12.019

Zhu X, Liu D (2015) Improving forest aboveground biomass estimation using seasonal Landsat NDVI time-series. ISPRS J Photogramm Remote Sens 102:222–231. https://doi.org/10.1016/j.isprsjprs.2014.08.014

Zimmerman DL, Zimmerman MB (1991) A comparison of spatial semivariogram estimators and corresponding ordinary kriging predictors. Technometrics 33:77–91