Landslide susceptibility mapping in Three Gorges Reservoir area based on GIS and boosting decision tree model

Landslides are among the most destructive geological disasters, and a myriad of them have revived and developed in the Three Gorges Reservoir area under the combined action of various detrimental factors. Targeted regional landslide susceptibility mapping (LSM) is therefore of great significance for disaster prevention and mitigation. In this study, LSM is prepared using a boosting-C5.0 decision tree model. Based on landslides verified by on-site investigations, the study area is divided into accumulation and rock areas, and a total of 12 impact factors are selected. Tolerance (TOL) and the variance inflation factor (VIF) are employed to assess the multicollinearity among the impact factors. Independent training (80%) and validation (20%) datasets are constructed by random sampling for LSM. ANN, C5.0, and SVM are selected for comparative analysis. The results show that there is no severe multicollinearity among the impact factors proposed in this paper. The landslide susceptibility in the study area is divided into low, moderate, high, and very high. The very high susceptibility area is distributed along the riverside, where the landslide ratio is 37.05% in the boosting-C5.0 model. ROC curves are then used to assess the accuracy of each model. Boosting-C5.0 performs best, with the largest area under the curve in both the accumulation and rock areas, reaching 0.991 and 0.990 in the validation sets, respectively. Finally, the composite modification of the 5 validation sets shows that the uncertainty of boosting-C5.0 is concentrated in the intermediate probability areas of susceptibility. This study reveals the feasibility of machine learning in landslide susceptibility assessment, which could provide a basis for the risk management and control of geological disasters.


Introduction
Due to the complex geological conditions and periodic water level fluctuations, the Three Gorges Reservoir area has been threatened by geological disasters for years (Gong et al. 2019, 2021). Moreover, the subtropical monsoon climate is usually accompanied by seasonal heavy rainfall (Miao et al. 2021), which further raises the risk of massive landslide deformation and revival (Reichenbach et al. 2018; Ering et al. 2020; Gong et al. 2022), making the Three Gorges Reservoir area a high-incidence area for geohazards. Such extremely destructive landslides can lead to huge losses of life and property (Li et al. 2020; Gong et al. 2020). The susceptibility of landslides therefore deserves attention (Song et al. 2019; Li et al. 2022), and it is of great significance for risk management and control in the Three Gorges Reservoir area (Huang et al. 2020, 2022). Landslide susceptibility refers to the probability of a landslide occurring under the contribution of various qualitative or quantitative impact factors (Hong et al. 2019; Mehrabi 2021). Landslide susceptibility mapping (LSM) remains an appealing approach for risk mitigation, as impact factors may have spatiotemporal continuities (Chen et al. 2018; Azarafza et al. 2021). Landslide susceptibility assessments can identify potential relationships between landslides and impact factors, so as to effectively delineate the susceptible areas for corresponding preventive measures (Abbaszadeh et al. 2021; Loche et al. 2022). Moreover, geographic information systems (GIS) have been regarded as an attractive tool for qualitative and quantitative LSM implementations (Bragagnolo et al. 2020; Ullah et al. 2022), such as the expert scoring model (Khalil et al. 2022), the failure probability model (Di et al. 2018), and the neural network model (Ortiz et al. 2018; Lee et al. 2020), with excellent performance in landslide susceptibility assessments.
Compared with subjective qualitative methods, quantitative methods analyze the relationship between past landslide events and associated variables more objectively within a mathematical framework.
Thanks to the rapid development of remote sensing and computer technologies, high-resolution digital elevation models and engineering data are more readily available (Casagli et al. 2017; Bopche et al. 2022). Machine learning methods such as random forest (RF) (Wang et al. 2021; Sekkeravani et al. 2022), support vector machine (SVM) (Miao et al. 2018; Saha et al. 2022), and artificial neural network (ANN) (Conforti et al. 2014; Mehrabi 2021) successfully deal with nonlinear data at different scales in the fields of identification (Khosravi et al. 2018), prediction (Liao et al. 2020; Miao et al. 2022a, b), mitigation (Piao et al. 2022), and modeling (Nguyen et al. 2019; Liu et al. 2021a, b). Machine learning models autonomously mine a set of logical criteria from input data and aim to make the most accurate predictions, while traditional statistical models seek to infer relationships between variables. For example, Huang et al. (2017a, b) selected nine environmental factors as input variables and used the SOM-ELM model to generate the LSM of the Three Gorges Reservoir area. Liu et al. (2021a, b) considered the effect of feature interaction in LSM and applied the attention factorization machine (AFM) to obtain more dependable LSM results. Deng et al. (2022) used different evaluation units to perform LSM for mesoscale regions based on RF. Great efforts have thus been devoted to LSM using different machine learning methods. However, current LSM research focuses on large areas, where the nature of the landslide itself (accumulation or rock) is usually not considered. In fact, both field investigations and theoretical studies have suggested that geological characteristics influence the occurrence of accumulation landslides and rock landslides differently.
Due to the unique geological conditions induced by reservoir water fluctuations, it is logical and necessary to choose impact factors for the development of accumulation landslides that differ from those of traditional LSM.
The performance of machine learning also depends on the fitting and the quality of the input data (Mandal et al. 2021; Naceur et al. 2022). However, the majority of these prevailing models suffer from many disadvantages, such as limited training time, unstable convergence, and local optima, because they are shallow learning structures with only a few hidden layers, which is not sufficient to meet the accuracy requirements of multivariate properties (Ngo et al. 2021; Zhao et al. 2022). Many researchers have concluded that decision tree (DT) models exhibit significant advantages over other machine learning models (Dou et al. 2019; Merghadi et al. 2020; Akinci et al. 2021). Because the modeling process continually classifies the sample data, namely the training data set, the DT model has unique advantages in the logical prediction of input variables in data sets without prior knowledge and order (Tsangaratos et al. 2016; Park et al. 2018). Meanwhile, as LSM gradually transitions from large scales to refined regional scales, the reduction of data samples inevitably provokes overfitting. It is therefore urgent to effectively validate and screen the best prediction models.
In this study, a boosting-C5.0 decision tree model was used for landslide susceptibility mapping. Rangdu County, located in the Three Gorges Reservoir area, was selected as a case study. First, on-site investigations and landslide statistics were conducted to obtain the impact factors. Notably, the study area was divided into rock and accumulation areas. In addition, the natural breakpoint method was employed to classify the impact factors, and a multicollinearity test was used to improve the rationality of the factor selection. Then, a boosting-C5.0 decision tree model for landslide susceptibility assessment was built for prediction. Compared with the C5.0, SVM, and ANN models, boosting-C5.0 performed best. Finally, the uncertainty of the spatial prediction of regional landslides based on model reliability was discussed. Figure 1 displays the technical flowchart of this study.

Study area
Rangdu County is located in the Three Gorges Reservoir Area along the Yangtze River in Wanzhou District, Chongqing City. The study area extends between 30°32′25″ to 30°37′55″ N and 108°15′05″ to 108°20′25″ E, covering an area of about 36.4 km². Figure 2 portrays the geographical location of the study area.
The study area is a quintessential karst landform with hilly, mountainous terrain. The highest altitude, 650 m, is found in the northeastern village, and the lowest, 113 m, along the riverside. Owing to its typical humid subtropical monsoon climate, Rangdu County has moderate temperatures, with an annual rainfall of 1500 mm and an average temperature of 18.2 °C. On-site investigations confirm that the exposed rock formations along the Yangtze River are all sedimentary rocks. Owing to the influence of the Sichuan Basin, strong tectonic action makes tight anticlines and broad synclines alternate in a "wing-like" distribution. The strata comprise monoclinal Middle Jurassic and Quaternary units with a direction of about 310° and a gentle dip of 5°-35°. The geological environment of Rangdu County is complex and fragile, which leads to unfavorable geological conditions. In particular, extreme conditions such as torrential rainfall and reservoir water fluctuations may aggravate latent risks. Moreover, since the reconstruction of township roads in Wanzhou District in 2018, road construction in Rangdu County has entered a period of rapid development. During engineering construction, unreasonable excavation of the slope toe (Fig. 3a) and random piling of engineering waste (Fig. 3b) occur frequently and destroy the stability of the original slope structure. Meanwhile, this area is greatly affected by the water fluctuations of the Three Gorges Reservoir. The fluctuation of the reservoir water seriously disturbs the geotechnical properties of the slopes along the river, increasing the probability of landslides (Fig. 3c). The intensity and probability of landslide disasters increase further during the flood season. In addition to protective measures (Fig. 3d), the preparation of a landslide susceptibility map would help reduce the risk to lives and property.

Susceptibility assessment system
The databases for landslide susceptibility mapping in this paper mainly include: (1) a topographic map, used to calculate elevation, slope angle, slope shape, and gully density; (2) Landsat-8 imagery, used to verify river erosion and reservoir impoundment; (3) a geological map, used to extract lithology and genetic type; (4) a landslide inventory map, used to obtain the thickness of the accumulation layer and road cutting. In addition, thorough on-site investigations were conducted in the study area, and a total of 30 landslides were found (Table 1).

Data preparation
According to the material composition, the landslides in the study area can be divided into accumulation landslides and rock landslides (Fig. 2). The landslide inventory map indicates that there are fundamental distinctions in the disaster-forming conditions between the two types (Fig. 4). Moreover, there are significant differences in movement characteristics, terrain environments and impact factors. Most of the rock landslides are of the retrogressive type, accounting for 63.6%, while slumping accumulation landslides account for a higher proportion of 68.4%. In terms of terrain environments, rock landslides and accumulation landslides show diametrically opposite trends. In particular, accumulation landslides are more likely to occur on gentle slopes (less than 30°), while 72.7% of rock landslides occur on steep slopes (greater than 30°). A similar contrast can be seen in terms of impact factors. It is reasonable to believe that road cutting and rainfall are the more adverse impact factors for accumulation and rock landslides, accounting for 68.4% and 72.7%, respectively.
In addition, the on-site investigation reveals many further differences in characteristics between accumulation and rock landslides in Rangdu County. As displayed in Table 2, the sliding surfaces of the accumulation landslides are mainly linear, accounting for 68.4%. The boundaries are mainly controlled by valleys and ridges, both accounting for 36.8%. Conversely, the sliding surfaces of rock landslides are mainly along the joint direction, accounting for 54.5%. Moreover, the height differences between the head and toe of accumulation landslides are generally greater than those of rock landslides. Accumulation landslides whose height differences are greater than 30 m account for 68.4%, while the majority of rock landslides have an elevation difference of less than 20 m, accounting for 63.6%. However, in terms of sliding inclination and stone diameter, the values for accumulation landslides are significantly lower than those for rock landslides. The sliding inclinations of accumulation landslides are mainly less than 30°, accounting for 78.9%, and stone diameters less than 0.3 m account for 73.7%. In contrast, the sliding inclinations of rock landslides are significantly higher: landslides with a dip angle greater than 70° account for 63.6%, and stone diameters greater than 1 m account for 90.9%.
Considering these differences in the nature of landslides, the study area is divided into two areas (rock and accumulation), as shown in Fig. 5. Consequently, the LSM is obtained by superimposing the accumulation area and the rock area. Notably, the selection of the assessment cell is a crucial prerequisite for conducting landslide susceptibility assessment, and it is also a key issue in determining the accuracy and reliability of the results. At present, commonly used assessment cells mainly include the raster cell, terrain cell, slope cell, unique condition cell, administrative cell, and topography cell. Different cells should therefore be selected for landslide susceptibility mapping under different scales and data precisions to achieve the best results. The use of raster cells can greatly improve the accuracy of assessment when superposing a large amount of spatial data. Considering that the total area of the study area is approximately 36.4 km², the landslide points can be clearly displayed on the mesoscale map of 1:50,000 (Fig. 5). Therefore, we select the raster cell as the basic assessment cell for LSM. The resolution of the raster cells in the study area is 5 m × 5 m. Among them, the raster of the rock study area has 1524 rows and 2035 columns, with a total of 590,863 raster cells. Likewise, the accumulation study area has 1417 rows and 2037 columns, with a total of 426,200 raster cells.

Impact factors
Landslides result from the coupling of internal geological conditions (causing factors) and external factors (triggering factors). Specifically, the causing factors mainly include stratum lithology, topography, and geological structure, while the triggering factors refer to the adverse conditions that directly induce landslides, such as the hydrogeological environment and human engineering activities. Based on the on-site investigations in Rangdu County, we selected 30 landslides as the research objects for landslide susceptibility mapping. The GIS platform was employed to extract landslide causing factors such as topography and geological properties, as well as landslide triggering factors such as hydrological conditions and human engineering activities. On this basis, a landslide susceptibility assessment system for Rangdu County is established.
The impact factors selected for the accumulation area and the rock area are largely the same. They include triggering factors representing hydrological conditions (flow ratio and river erosion), triggering factors representing engineering activities (road cutting and impoundment), and causing factors representing topography (height difference, slope angle, slope shape and gully density). However, in terms of geological properties, the two areas are distinctly different: the causing factors of the rock area include lithology and slope structure, whereas those of the accumulation area include genetic type and accumulation thickness. Table 3 illustrates the extraction method of each impact factor.
Among them, the impact factors based on Landsat-8 imagery were extracted using maximum likelihood supervised classification. The others were extracted from the 5 m × 5 m DEM and verified by on-site investigations. In particular, slope (Eq. 1), gully density (Eq. 2) and slope structure (Eq. 3) were calculated using the following formulas:

$$\text{slope} = \arctan\left(\sqrt{\left(\frac{dz}{dx}\right)^{2} + \left(\frac{dz}{dy}\right)^{2}}\right) \times \frac{180}{\pi} \qquad (1)$$

where dz/dx and dz/dy represent the rate of change of the surface from the center raster cell in the horizontal and vertical directions.

$$\text{density} = \frac{L_1 V_1 + L_2 V_2}{A} \qquad (2)$$

Here, L_1 and L_2 represent the lengths over the raster, V_1 and V_2 represent the corresponding field values, and A is the area of the search neighborhood around the raster cell.

$$\operatorname{atan2}(y, x) = \begin{cases} \arctan(y/x), & x > 0 \\ \arctan(y/x) + \pi, & x < 0,\ y \ge 0 \\ \arctan(y/x) - \pi, & x < 0,\ y < 0 \\ +\pi/2, & x = 0,\ y > 0 \\ -\pi/2, & x = 0,\ y < 0 \\ \text{undefined}, & x = 0,\ y = 0 \end{cases} \qquad (3)$$

Specifically, if the slope is less than 10°, it is defined as a horizontal slope. Otherwise, when the difference between the direction of the structure and the slope aspect is within 30°, the slope is classified as a bedding slope. An oblique slope corresponds to a difference between 30° and 150°, and a reverse slope to a difference of 150°-180°. The direction is obtained by the ordinary kriging method.
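The slope calculation and the slope structure rules described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the slope formula is the standard gradient-based one, and the handling of the exact boundary values (30° and 150°) is an assumption, since the text does not state which class they belong to.

```python
import math

def slope_degrees(dz_dx: float, dz_dy: float) -> float:
    """Slope angle in degrees from the surface gradients at a raster cell."""
    return math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))

def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest angle between two compass directions, in [0, 180] degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return 360.0 - diff if diff > 180.0 else diff

def slope_structure(slope_deg: float, direction_deg: float, aspect_deg: float) -> str:
    """Classify slope structure from the rules in the text (boundaries assumed)."""
    if slope_deg < 10.0:
        return "horizontal"
    diff = angular_difference(direction_deg, aspect_deg)
    if diff <= 30.0:
        return "bedding"
    if diff < 150.0:
        return "oblique"
    return "reverse"

print(slope_structure(20.0, 310.0, 320.0))  # direction and aspect differ by 10 degrees
```

Wrapping the difference into [0, 180] matters because directions are circular: 350° and 10° differ by 20°, not 340°.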
The data are divided into continuous and categorical types, which are extracted differently (Table 4). Continuous data were mainly extracted directly from the topographic map or Landsat-8 imagery based on GIS. Categorical data, such as river erosion and road cutting, were assigned to the raster cells using statistical methods based on verification by the on-site investigations. Furthermore, to facilitate the landslide susceptibility model, we adopted the natural breakpoint method (Jenks) to obtain the subcategories of the impact factors. Specifically, the optimal arrangement of values was determined by iteratively comparing the sum of the squared differences between the mean x̄ of the elements of each impact factor and the observed values x_i. That is, the goodness of variance fit (R²) is maximized iteratively so as to minimize the sum of squared differences within the groups, which yields the best subcategories (Eq. 6).
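To make the natural breakpoint idea concrete, the following minimal Python sketch computes the goodness of variance fit for a candidate partition and finds the best partition by exhaustive search. The names `sum_sq_dev`, `goodness_of_variance_fit` and `natural_breaks` are hypothetical, and the exhaustive search is an assumption chosen for clarity; it is only practical for small samples and is not the optimized algorithm used in GIS software.

```python
from itertools import combinations

def sum_sq_dev(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def goodness_of_variance_fit(sorted_values, cut_indices):
    """1 - (within-class squared deviations) / (total squared deviations)."""
    bounds = [0, *cut_indices, len(sorted_values)]
    classes = [sorted_values[a:b] for a, b in zip(bounds, bounds[1:])]
    within = sum(sum_sq_dev(c) for c in classes)
    total = sum_sq_dev(sorted_values)  # must be non-zero (values not all equal)
    return 1.0 - within / total

def natural_breaks(values, n_classes):
    """Return the lower bounds of classes 2..n and the best fit found."""
    data = sorted(values)
    best = max(combinations(range(1, len(data)), n_classes - 1),
               key=lambda cuts: goodness_of_variance_fit(data, cuts))
    return [data[i] for i in best], goodness_of_variance_fit(data, best)

breaks, fit = natural_breaks([1, 2, 3, 10, 11, 12, 30, 31, 32], 3)
print(breaks, round(fit, 4))
```

A fit close to 1 means that almost all of the variance lies between the classes rather than within them, which is the criterion the iteration in the text maximizes.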
The subcategories and grading of each impact factor are shown in Fig. 6. Notably, in view of the partial differences in the causing factors between the accumulation area and rock area, the rock area includes lithology and slope structure, while the accumulation area includes the genetic type and thickness, thus they are displayed separately.

Factors assessment
During the construction of a regional landslide susceptibility assessment system based on the engineering geological analogy method, the selection of impact factors is critical. Even though impact factors appear to be closely related to landslides, they may not be completely independent of each other. Additionally, blindly pursuing more impact factors without considering their appropriateness may make the accuracy of the evaluation questionable. Therefore, to avoid mutual interference and overlapping of information, it is imperative to correctly evaluate the multicollinearity among the impact factors. At present, the variance inflation factor (VIF) is considered an attractive way to reveal multicollinearity: it measures how much the variance of a given explanatory variable's coefficient is inflated because that variable is explained by all the other explanatory variables in the equation.
For the center-standardized independent variables X, the correlation matrix is X^T X, and the diagonal elements c_jj of its inverse C = (X^T X)^{-1} are called the VIF:

$$\mathrm{VIF}_j = c_{jj} = \frac{1}{1 - R_j^{2}}, \qquad \mathrm{TOL}_j = \frac{1}{\mathrm{VIF}_j}$$

where R_j² is the coefficient of determination obtained by regressing the j-th independent variable on all the other independent variables.
Here, the VIF value reflects how much each independent variable is affected by multicollinearity. For each independent variable, when it is found that the VIF value is greater than 10, or the tolerance (TOL) is less than 0.1, a serious multicollinearity can be considered. Then, methods such as eliminating some irrelevant independent variables or increasing the sample size can be used to overcome multicollinearity.

Fig. 6 Landslide impact factors: a flow ratio, b river erosion, c road cutting, d impoundment, e height difference, f slope angle, g slope shape, h gully density, i lithology, j slope structure, k genetic type, l thickness

C5.0 decision tree model
Recognized as a prevalent supervised machine learning method, the decision tree follows a set of rules and an established dendrogram for classification and prediction. The root node of the dendrogram is the sample attribute, and the branches are the attribute values composed of several leaf nodes and split points.
Standing out among many similar models, the C5.0 algorithm determines the optimal branching variable and separation threshold based on the rate of decrease of information entropy. Generally, the C5.0 decision tree remains robust without excessive training time, even with multivariate data.

Growth and pruning
The growth of the C5.0 decision tree is derived from the concept of entropy, that is, the average uncertainty of the information source before the information is released. Specifically, for a certain node n, assume that N is the entire sample set, C is the set of target variables, and t is the number of classes of C. Then, the entropy can be defined as:

$$\mathrm{Ent}(N) = -\sum_{i=1}^{t} p(C_i \mid N)\,\log_2 p(C_i \mid N)$$

Here, p(C_i|N) is the relative probability of C_i (i = 1, 2, …, t). If a variable T with attributes is divided into K categories T_1, …, T_K, the conditional entropy after introducing the variable can be derived:

$$\mathrm{Ent}(N \mid T) = \sum_{k=1}^{K} p(T_k)\,\mathrm{Ent}(N \mid T_k)$$

The difference between the entropy of the original node and that of the newly split node is the information gain, expressed as:

$$\mathrm{Gain}(N, T) = \mathrm{Ent}(N) - \mathrm{Ent}(N \mid T)$$

Typically, Ent(N|T) < Ent(N). Therefore, the information gain rate can be obtained:

$$\mathrm{GainRatio}(N, T) = \frac{\mathrm{Gain}(N, T)}{\mathrm{Ent}(T)}$$

where Ent(T) is the entropy of the partition induced by T. If node n contains N_n samples, of which E_n are incorrectly predicted, then the error rate f_n is:

$$f_n = \frac{E_n}{N_n}$$

Thus, the estimation error of node n is defined as:

$$e_n = f_n + Z\sqrt{\frac{f_n\,(1 - f_n)}{N_n}}$$

where Z represents the threshold, which is generally equal to 1.15. Subsequently, when the weighted error of the unpruned leaf nodes is greater than the estimated error of the parent node, the leaf nodes can be pruned, expressed as:

$$\sum_{n=1}^{r} P_n e_n > e$$

where r is the number of unpruned sub-leaf nodes, P_n is the ratio of the sample size in the leaf node to the subtree sample size, and e is the estimated error of the parent node.
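The growth and pruning quantities described above can be illustrated with a short Python sketch. It assumes the standard Shannon entropy with base-2 logarithms and the usual upper-bound form of the estimation error; the function names are hypothetical and the toy labels are invented purely for illustration, not taken from the study data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, attribute):
    """Entropy of the labels after splitting on the attribute values."""
    groups = {}
    for y, t in zip(labels, attribute):
        groups.setdefault(t, []).append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(labels, attribute):
    """Information gain divided by the entropy of the split itself."""
    gain = entropy(labels) - conditional_entropy(labels, attribute)
    split_info = entropy(attribute)
    return gain / split_info if split_info > 0 else 0.0

def estimated_error(n_errors, n_samples, z=1.15):
    """Error rate plus a z-scaled standard error, as used for pruning."""
    f = n_errors / n_samples
    return f + z * math.sqrt(f * (1 - f) / n_samples)

labels = [1, 1, 0, 0]                              # toy landslide / non-landslide labels
print(gain_ratio(labels, ["a", "a", "b", "b"]))    # attribute separates the classes
print(gain_ratio(labels, ["a", "b", "a", "b"]))    # attribute carries no information
```

Dividing by the split entropy penalizes attributes with many categories, which would otherwise obtain a high information gain simply by fragmenting the sample.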

Boosting technology
Boosting technology can promote the robustness of the C5.0 algorithm, encompassing two stages of modeling and voting (Fig. 7). During the modeling stage, the technique augments the simulated sample set by oversampling the existing weighted samples. Assuming that the entire process requires k iterations and the sample size of the training sample T is N, the modeling process can be expressed as follows.
(a) Initialize sample weights:

$$w_j(1) = \frac{1}{N}, \qquad j = 1, 2, \ldots, N$$

Here, w_j(i) represents the weight of the j-th sample in the i-th iteration.
(b) According to w_j(i), extract n samples from T with replacement to form a training sample set T_i.
(c) Obtain the model C_i from T_i, and calculate the error e(i) of the model.
(d) End the modeling process when e(i) > 0.5 or e(i) = 0; otherwise update the weight of each sample according to the error. The weights of the correctly classified samples are:

$$w_j(i+1) = w_j(i)\,\beta(i), \qquad \beta(i) = \frac{e(i)}{1 - e(i)}$$

The weights of misclassified samples remain unchanged. Then, normalize:

$$w_j(i+1) = \frac{w_j(i+1)}{\sum_{j=1}^{N} w_j(i+1)}$$

(e) Repeat the iterative steps (b-d) to obtain k models and k errors.
In the voting stage, for the new sample set X, each model C_i (i = 1, 2, …, k) gives a predicted value C_i(X) and a weight value:

$$W_i(X) = -\log \beta(i)$$

The sum of the weights is calculated by category, and the category with the highest sum is the final classification result of the set X. In addition, combining the cross-validation method with boosting techniques can improve the generalization ability of the model and prevent the model from overfitting.
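The reweighting and voting steps can be sketched as follows. This is a minimal illustration that assumes the AdaBoost.M1-style rule (correctly classified samples are scaled by e/(1 - e), modeling stops when the error is 0 or exceeds 0.5, and each model votes with weight -log of that factor); the function names are hypothetical, and fitting the C5.0 trees themselves is outside the sketch.

```python
import math
from collections import defaultdict

def update_weights(weights, is_correct, error):
    """One boosting reweighting step; returns normalized weights and beta."""
    if error <= 0.0 or error > 0.5:
        raise ValueError("boosting stops when the error is 0 or exceeds 0.5")
    beta = error / (1.0 - error)
    # Shrink correctly classified samples; misclassified ones stay unchanged.
    scaled = [w * beta if ok else w for w, ok in zip(weights, is_correct)]
    total = sum(scaled)
    return [w / total for w in scaled], beta

def weighted_vote(predictions, betas):
    """Each model votes for its predicted class with weight -log(beta)."""
    tally = defaultdict(float)
    for label, beta in zip(predictions, betas):
        tally[label] += -math.log(beta)
    return max(tally, key=tally.get)

weights, beta = update_weights([0.25] * 4, [True, True, True, False], error=0.25)
print(weights, beta)
print(weighted_vote([1, 0, 0], [0.1, 0.5, 0.5]))
```

After normalization the single misclassified sample carries half of the total weight, so the next model concentrates on it; in the vote, one accurate model can outweigh two weaker ones.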

The analysis of impact factors
Collinearity is a common phenomenon among independent variables and may exist objectively. However, multicollinearity introduces serious uncertainty into the modeling of landslide susceptibility assessment, and it calls for vigilance when conducting multivariate data analysis. In this paper, TOL and VIF were selected to evaluate the multicollinearity of the impact factors of the accumulation and rock areas, respectively.
As shown in Table 5, the TOL and VIF of all impact factors are within the thresholds. Moreover, factors such as flow ratio, road cutting, slope angle, slope shape, and gully density are highly independent: their TOL values are all above 0.8, and their VIF values are all within 1.2. The slope shape performs best, with near-perfect results. Therefore, there is no significant multicollinearity among the selected parameters. For the accumulation factors, the minimum TOL, that of reservoir water storage, is 0.417, and the maximum VIF is 2.397. Analogously, reservoir water storage also has the minimum TOL of 0.199 and the maximum VIF of 5.025 among the rock factors. Consequently, the impact factors proposed in this paper can be considered suitable for subsequent modeling.
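As an illustration of how such a TOL/VIF screening might be computed, the sketch below takes a correlation matrix of the impact factors, inverts it, and reads the VIF values from the diagonal. The helper names are hypothetical, the pure-Python Gauss-Jordan inversion is only meant for small matrices, and the two-factor correlation of 0.6 is an invented example rather than a value from Table 5.

```python
def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small dense matrices)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect collinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif_and_tol(corr):
    """VIF is the diagonal of the inverse correlation matrix; TOL = 1 / VIF."""
    inv = invert(corr)
    vif = [inv[j][j] for j in range(len(corr))]
    return vif, [1.0 / v for v in vif]

def flag_collinear(vif, vif_max=10.0):
    """Indices of factors whose VIF exceeds the threshold (TOL below 0.1)."""
    return [j for j, v in enumerate(vif) if v > vif_max]

vif, tol = vif_and_tol([[1.0, 0.6], [0.6, 1.0]])
print(vif, tol, flag_collinear(vif))
```

Because TOL is the reciprocal of VIF, the two thresholds (VIF above 10, TOL below 0.1) express the same criterion.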

Landslide susceptibility map
In this paper, LSM is obtained based on the previous data preparation work. All landslide data formats are also 5 m × 5 m raster cells. Here, the accumulation area is taken as an example for description. First, the impact factor state matrix A_{590863×10} was formed, which represents the state value of each factor after reclassification. Then, the landslide raster cells were assigned a value of 1 (a total of 20,954), while the non-landslide raster cells in the accumulation area were set as 0 (a total of 569,909), so as to constitute the landslide matrix Y_{590863×1}. Finally, the impact factor state matrix and the landslide matrix in the accumulation area were combined to form the attribute matrix C_{590863×11} of each raster cell.
Fig. 7 Boosting technology of decision tree

Based on the sample command (random sampling without replacement) of the Pandas library in the Python environment, 80% of the landslide raster cells from the above matrix were randomly selected to build the training sample matrix and generate the training data set, with a total of 33,526 sets of sample data. Then, the trained model was applied to perform predictive computation on the raster cell attribute matrix of the accumulation area to output the predicted probability of susceptibility. The landslide susceptibility index (LSI) was obtained by using the quantile classification method (Pradhan et al. 2010), and the points with the first 25%, 35%, 30%, and 10% of the predicted probability were classified into very high (VH), high (H), moderate (M), and low (L) landslide susceptibility, respectively. In addition, to explore the prediction accuracy of the boosting-C5.0 decision tree model, we selected the ANN, SVM, and C5.0 algorithms for comparison. Here, the number of boosting iterations for boosting-C5.0 was set to 10, the number of folds for cross-validation was likewise set to 10 to prevent the model from overfitting, and the expected noise was set to 10%. The parameter settings of C5.0 remained the same except for the boosting technology. A multi-layer perceptron and a radial basis function were applied for ANN and SVM, respectively.
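The random 80/20 split without replacement mentioned above can be sketched as follows. To keep the example self-contained, the standard-library `random.sample` is used here as a stand-in for the Pandas sampling call; the function name, the fixed seed and the toy rows are assumptions made for illustration only.

```python
import random

def split_without_replacement(rows, train_fraction=0.8, seed=42):
    """Randomly split rows into training and validation sets, no row reused."""
    rng = random.Random(seed)
    n_train = round(train_fraction * len(rows))
    train_idx = set(rng.sample(range(len(rows)), n_train))
    train = [r for i, r in enumerate(rows) if i in train_idx]
    valid = [r for i, r in enumerate(rows) if i not in train_idx]
    return train, valid

rows = list(range(10))  # stand-ins for raster cell attribute rows
train, valid = split_without_replacement(rows)
print(len(train), len(valid))
```

Sampling without replacement guarantees that the training and validation sets are disjoint, so the validation metrics are computed on cells the model has not seen.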
The LSM is visualized by susceptibility raster cells (Fig. 8). The results of the four methods all show that the higher the hazard level, the greater the probability of landslide disasters. Moreover, all models identify construction sites along the river and county roads in the northern part of the study area, as well as farmland in the southern part, as H or VH level. Because water flow scours and reshapes the riverbank terrain, the groundwater distribution of reservoir bank landslides is greatly affected under the combined action of rainfall and the periodic fluctuation of the reservoir water, which is unfavorable for landslide stability. In addition, the sliding resistance of a landslide is reduced where interlayer lithology is distributed. Similarly, the degree of road cutting in the northern part of Rangdu County is relatively high, and frequent engineering activities have changed the natural geological environment and accelerated the deformation and failure of the slopes. However, the susceptibility to landslides is low for primary forests with less human activity and relatively high elevations, which is consistent with the on-site investigations. It is worth mentioning that numerous accumulation landslides have occurred in the Three Gorges Reservoir area. Using the landslide susceptibility assessment system proposed in this paper, appropriate control measures can be taken for areas with different susceptibility levels to avoid landslide disasters. Figure 9 illustrates the relative distribution of landslide susceptibility grades and landslide density. The distribution of susceptibility levels shows a trend consistent with the LSM (Fig. 9a). The proportion of low susceptibility is the highest for all models, because the non-landslide areas are much larger than the landslide areas.
The very high susceptibility areas of boosting-C5.0, ANN, SVM, and C5.0 account for 4.99%, 4.00%, 3.41%, and 2.35% of the study area, respectively, which indicates that the very high susceptibility area suggested by the boosting-C5.0 model fits the distribution of actual landslides. In addition, the landslide ratio is positively correlated with the susceptibility level (Fig. 9b): the higher the susceptibility level, the greater the landslide ratio. Here, we define the landslide ratio as the ratio of the landslide raster cells to the area of the corresponding susceptibility class. In the very high area, the boosting-C5.0 model has the highest landslide ratio at 37.05%, while the C5.0 model has the lowest at 26.47%. Consequently, it is reasonable to conclude that the susceptibility areas calculated by the boosting-C5.0 model are more consistent with the actual landslides. Nonetheless, the credibility of the model still needs to be verified.
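The landslide ratio defined above amounts to a per-class proportion, which can be sketched as follows. The function name and the six toy cells are hypothetical and serve only to show the arithmetic; they are not values from the study.

```python
from collections import Counter

def landslide_ratio_by_class(classes, is_landslide):
    """Share of landslide cells among all cells of each susceptibility class."""
    totals = Counter(classes)
    hits = Counter(c for c, y in zip(classes, is_landslide) if y)
    return {c: hits[c] / totals[c] for c in totals}

classes = ["VH", "VH", "L", "L", "L", "L"]   # susceptibility class of each cell
is_landslide = [1, 0, 0, 0, 0, 1]            # 1 = mapped landslide cell
print(landslide_ratio_by_class(classes, is_landslide))
```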

Model comparison and validation
The receiver operating characteristic (ROC) curve, a comprehensive index of continuous variables reflecting sensitivity and specificity, was employed to evaluate the accuracy of probabilistic models in this study. In short, the closer the curve is to the upper left corner, the more accurate the model is. Likewise, ANN, SVM, and C5.0  decision tree machine learning methods are utilized as a comparative study. Figure 10 portrays the ROCs of the accumulation and rock area, respectively. The applicability and prediction of the boosting-C5.0 decision tree are the best. Specially, the training dataset area under the curve (AUC) of accumulation and rock are 0.990 and 0.987, while the validation datasets are 0.991 and 0.990, respectively. High accuracy indicates that more accurate landslide susceptibility map can be obtained with the advanced instruments and skilled historical experience when conducting regional geohazard investigations. Besides, due to the refined landslide raster division in Rangdu County, the number of landslides grids is relatively low (3.55% for accumulation and 0.95% for rock), which is also the main reason for such a high degree of model fitting.
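For readers who want to see how an AUC value is obtained, the following sketch uses the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen landslide cell receives a higher predicted probability than a randomly chosen non-landslide cell. The function name and the four toy scores are illustrative assumptions, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve as the share of correctly ordered pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]              # 1 = landslide cell
scores = [0.1, 0.4, 0.35, 0.8]     # predicted susceptibility
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is why curves closer to the upper left corner indicate better models.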
Furthermore, the accuracy of this model in the accumulation area is slightly higher than that in the rock area. Alternatively, the false detection rate in the accumulation area is relatively lower, which indicates that the model is more suitable for LSM in the accumulation area. This result can also be reflected in the other three models (Table 6), which further indicates that the accumulation impact factors in the landslide susceptibility assessment system have a better representativeness.
In addition, in terms of model prediction accuracy, we diagnosed the validation dataset using the mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE), and the results are listed in Table 7. The error results of the validation set for both the accumulation and rock areas follow the order Boosting-C5.0 < ANN < SVM < C5.0. Moreover, the error results of all models except C5.0 in the accumulation area are smaller than those in the rock area, which also confirms that the model is more suitable for the accumulation area. In short, the error diagnoses indicate that the Boosting-C5.0 model constructed in this paper has the best prediction accuracy.
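The three error measures can be written out as a short sketch. The function name `error_metrics` and the sample values are hypothetical and are not taken from Table 7.

```python
import math


def error_metrics(y_true, y_pred):
    """Return MSE, RMSE, and MAE between observed labels and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}


# Hypothetical validation cells: 1 = landslide, 0 = non-landslide.
y_true = [1, 0, 1, 0]
y_pred = [0.5, 0.5, 1.0, 0.0]
metrics = error_metrics(y_true, y_pred)
print(metrics)
```

Because the MSE and RMSE square the residuals, they penalize a few large misclassifications more heavily than the MAE does, which is why the three are usually reported together.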

Parameters of boosting-C5.0 decision tree model
In the boosting-C5.0 decision tree model, two parameters are crucial: pruning severity (Ps) and minimum records per child branch (Mr). Ps determines the extent to which the decision tree or rule set is pruned; increasing this value yields a smaller, more concise tree. Mr limits the number of splits in any branch of the tree: a branch is split only if two or more of the resulting subbranches would each contain at least this many records from the training set. To explore the influence of the two parameters on the prediction results of the boosting-C5.0 decision tree model, we selected the accumulation landslide dataset and divided it into a training set and a prediction set at a ratio of 8:2. The prediction results of the model under different parameter combinations are shown in Tables 8 and 9.
The results show that when Ps lies within a certain range (10-35), the prediction results of the model are relatively close. In fact, trees are pruned in two stages: first, a local pruning stage examines subtrees and collapses branches to increase the accuracy of the model; second, a global pruning stage considers the tree as a whole, and weak subtrees may be collapsed. Ps affects local pruning only. Notably, the prediction accuracy of the boosting-C5.0 decision tree model decreases markedly as Mr increases. Nevertheless, the model may overfit noisy data if Mr is too small. Therefore, for the dataset in this paper, the recommended parameter combination is Ps = 35 and Mr = 2.
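The effect of Mr can be made concrete with a minimal sketch of the split rule as it is described in this section. The function `split_allowed` is hypothetical and only restates that rule; it is not the internal implementation of C5.0.

```python
def split_allowed(subbranch_counts, min_records):
    """True if at least two subbranches hold min_records or more training records."""
    large_enough = sum(1 for count in subbranch_counts if count >= min_records)
    return large_enough >= 2


# Hypothetical candidate split producing three subbranches.
counts = [40, 3, 1]
print(split_allowed(counts, min_records=2))  # two subbranches reach 2 records
print(split_allowed(counts, min_records=5))  # only one subbranch reaches 5 records
```

Under this rule, raising Mr rejects more candidate splits, which gives a coarser tree and is consistent with the accuracy decrease reported for larger Mr, whereas a very small Mr admits splits that isolate a handful of possibly noisy records.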

Generalization performance
Generalization performance refers to the adaptability of a machine learning algorithm to fresh samples. The purpose of learning is to identify the latent laws behind disorganized data, so that the trained model can also provide appropriate outputs for other data sets that follow the same regularity.
In this paper, the boosting-C5.0 model was used to predict the landslide susceptibility of Rangdu County, and the spatial probability and susceptibility of landslide occurrence in each raster cell were obtained. However, the reliability of the susceptibility obtained by this model has not yet been assessed. In view of this, the generalization performance of the spatial prediction of regional landslides should be discussed.
In this research, the resolution of the raster cells in the study area is 5 m × 5 m. There are 590,863 rock raster cells and 426,200 accumulation raster cells in total. All grid cells are used as the database, and the landslide samples and an equal number of non-landslide samples were generated in the Python environment using the sample method (random sampling without replacement) of the Pandas library. First, 4 training samples were randomly and equally drawn in the same way as for LSM. Then, the model in this paper was trained and used for prediction on the 4 newly extracted sets of training samples; together with the prediction results of the original samples, 5 groups of data sets and landslide susceptibility prediction results were finally formed.
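The balanced sampling step can be sketched as follows. The study used the Pandas library; this illustration uses `random.sample` from the Python standard library instead, which likewise draws without replacement. The function `balanced_sample`, the seeds, the toy raster, and the assumption that every landslide cell is kept while only the non-landslide cells are drawn at random are all hypothetical.

```python
import random


def balanced_sample(cell_ids, is_landslide, seed):
    """Keep all landslide cells and draw an equal number of non-landslide cells."""
    rng = random.Random(seed)
    landslide = [c for c, ls in zip(cell_ids, is_landslide) if ls]
    stable = [c for c, ls in zip(cell_ids, is_landslide) if not ls]
    # random.sample draws without replacement, so no cell is selected twice.
    non_landslide = rng.sample(stable, k=len(landslide))
    return landslide, non_landslide


# Hypothetical raster of 100 cells in which the first 10 are landslide cells.
cell_ids = list(range(100))
is_landslide = [i < 10 for i in cell_ids]

# Five independent sample sets, one per seed (set 1 plus 4 additional draws).
sample_sets = [balanced_sample(cell_ids, is_landslide, seed=s) for s in range(5)]
for landslide, non_landslide in sample_sets:
    print(len(landslide), sorted(non_landslide))
```

Fixing a seed per set keeps each draw reproducible while still producing different non-landslide samples across the five sets.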
As shown in Table 10, there is only a tiny difference between the validation AUCs of the data sets in the accumulation and rock areas, with a maximum difference of 0.1%. It is reasonable to conclude that the C5.0 decision tree model based on boosting technology and cross-validation optimization is robust in predicting landslide susceptibility in Rangdu County.
In addition, it can be seen from Fig. 11 that the variation of the ROCs in the accumulation area is more consistent than that in the rock area, which suggests that the boosting-C5.0 decision tree model is more robust in LSM in the accumulation area.

Uncertainty analysis
Owing to the complexity of the disaster mechanism, the uncertainty of landslide cataloguing and other related data acquisition, and the differences in the prediction principles of different analysis models, LSM inevitably has uncertainties. Therefore, uncertainty analysis is one of the key elements of landslide prediction, and it plays a crucial role in promoting the usability of prediction results. Correlation analysis was performed between the landslide susceptibility prediction of set 1 and the predicted mean of the 5 sets, with the points arranged in ascending order of susceptibility (Fig. 12). The coefficients of determination for the accumulation area and rock area are 0.925 and 0.891, respectively. Moreover, it can be seen from Fig. 12 that when the model is used for landslide susceptibility assessment, the multiple prediction results in the higher and lower susceptibility areas are relatively consistent and strongly correlated. However, in the intermediate probability area, the multiple prediction results of the model differ to a certain degree.
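The correlation analysis can be sketched as below, assuming that the coefficient of determination is taken as the squared Pearson correlation between the set 1 prediction and the per-cell mean of the 5 sets. The function `r_squared` and the prediction values are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical predictions: one row per raster cell, one column per sample set.
predictions = [
    [0.05, 0.04, 0.06, 0.05, 0.05],
    [0.30, 0.45, 0.25, 0.40, 0.35],
    [0.55, 0.40, 0.60, 0.45, 0.50],
    [0.95, 0.96, 0.94, 0.95, 0.95],
]
set1 = [row[0] for row in predictions]
cell_mean = [mean(row) for row in predictions]
r2 = r_squared(set1, cell_mean)
print(f"R^2 = {r2:.3f}")
```

In this toy data the middle rows vary most between sets, mirroring the observation that disagreement is concentrated in the intermediate probability area.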
On the basis of the above, the double standard deviation of the 5 sets of predicted values in each raster cell was calculated for correlation analysis with the susceptibility prediction of set 1. As displayed in Fig. 13, the relationship between the double standard deviation of each raster cell and its susceptibility prediction has an "arch bridge" shape. Specifically, as the predicted value of landslide susceptibility increases, the double standard deviation of the corresponding raster cell also increases. When the landslide susceptibility reaches about 0.35 (intermediate probability area), the double standard deviation reaches its maximum. Subsequently, the double standard deviation gradually decreases as the susceptibility prediction increases. It is worth noting that raster cells with a lower double standard deviation are generally distributed in the lower (< 0.2) or higher (> 0.8) susceptibility areas, while the standard deviation of the intermediate probability area is relatively large. This likewise shows that the predictions of the boosting-C5.0 model carry uncertainty. The relationship between the landslide susceptibility value and the double standard deviation can be fitted by a quadratic polynomial for the accumulation and rock areas, respectively (Eq. 19). Here, x is the predicted landslide susceptibility value of each raster cell, and y is the fitted value of the double standard deviation. This value can be regarded as the degree of uncertainty of the predicted landslide susceptibility in each raster cell. Specifically, the smaller the expected standard deviation, the smaller the corresponding uncertainty and the more credible the prediction result.
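A minimal sketch of the two steps described here, the per-cell double standard deviation and the quadratic least-squares fit, is given below. The functions `double_std` and `fit_quadratic`, the use of the sample standard deviation, and the synthetic data are assumptions for illustration; the fitted coefficients are not those reported in this paper.

```python
from statistics import stdev


def double_std(row):
    """Twice the sample standard deviation of one cell's repeated predictions."""
    return 2.0 * stdev(row)


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]               # sums of x**k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x4 system for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Synthetic arch-shaped data: x = predicted susceptibility, y = double std.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [-x * x + x for x in xs]
a, b, c = fit_quadratic(xs, ys)
peak = -b / (2 * a)  # susceptibility value at which the fitted uncertainty peaks
print(a, b, c, peak)
```

A negative leading coefficient a gives the arch shape, and the vertex at -b/(2a) marks the susceptibility value with the largest fitted uncertainty.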
Applying Eq. 19 to the landslide susceptibility values predicted in this paper yields the uncertainty of the prediction in each raster cell, which is visualized in Fig. 14. Compared with the analysis in Fig. 8, it can be deduced that the locations with high uncertainty in the predicted susceptibility are mainly distributed in the low and moderate susceptibility areas, while the uncertainty in the very high susceptibility area is low, which is consistent with the uncertainty analysis above.

Conclusions
Landslide disasters are a serious concern in the Three Gorges Reservoir area, and LSM deserves attention. In this study, a boosting-C5.0 decision tree machine learning model was used to carry out LSM on the basis of micromesh on-site investigations. The following conclusions can be drawn:
(1) The susceptibility assessment system proposed in this paper includes 12 impact factors in the accumulation area and rock area. The factor assessment indicated that the TOL and VIF values of all factors were within their thresholds, with no severe multicollinearity. Among the factors, the slope shape showed the highest predictive ability.
(2) The boosting-C5.0 model placed the areas with the highest landslide susceptibility along the riverside and farmland, with a landslide ratio of 37.05%, while the C5.0 model had the lowest ratio at 26.47%.
(3) The ROCs of the validation sets revealed that the boosting-C5.0 model achieved the highest AUC, 0.991 and 0.990 for the accumulation and rock areas, respectively, which was higher than those of the other models (ANN, SVM, C5.0).
(4) The robustness of the boosting-C5.0 model tended to be higher in the accumulation area than in the rock area, and the prediction areas with higher uncertainty were mostly concentrated in the low and moderate susceptibility areas.
In summary, this work contributes to planning and formulating effective disaster prevention and mitigation strategies in the Three Gorges Reservoir area, where a large number of accumulation landslides are still reviving and developing. Such a high-precision boosting-C5.0 machine learning model is expected to be widely used and promoted in regional LSM.
Project of Hubei Provincial Department of Natural Resources (ZRZY2022KJ07). The authors thank the colleagues in our laboratory for their constructive comments and assistance.
Funding The authors have not disclosed any funding.

Declarations
Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.