Novel Ensemble-Based Machine Learning Models Based on The Bagging, Boosting and Random Subspace Methods for Landslide Susceptibility Mapping

Indivisual machine learning models show different limitations such as low generalization power for modeling nonlinear phenomena with complex behavior. In recent years, one of the best approaches to this issue is to use ensemble models. The purpose of this paper is to investigate the predictive power and modeling of three novel ensemble models constructed with four machine learning models: Decision Tree (DT), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Naive Bayes (NB) models based on three approaches of Bagging, boosting and Random Subspace (RS) in landslide susceptibility mapping (LSM) in the Province of Ajloun in Jordan. A total number of 91 landslide locations along with 16 conditioning factors in LSM were identified and used. Also, before modeling, the selection of effective conditioning factors in LSM was done using genetic algorithm and four single models including DT, KNN, NB and SVM. The selected factors were used in modeling with individual and ensemble models. The results show that the area under the receiver operating characteristic curve (AUROC) for ensemble models is significantly higher than the individual models and the AUC for ensemble models was on average 14% higher than individual models. Based on the results, the most accurate models were RS ensemble model (AUROC = 0.850), Boosting (AUROC = 0.848) and Bagging (AUROC = 0.814), respectively. This study showed that by combining the results of simple machine learning models and making ensemble models, models with the desired accuracy can be achieved.


Introduction
Encompassing a wide range of natural and man-made risks, environmental hazards are barriers to the development of various regions (Cutter 1996;Hunter 2005;Paul 2011). Landslides are a prominent environmental hazard with relatively severe impacts on socio-cultural systems and infrastructures so that they are considered as a barrier to socioeconomic growth, and development of any region. Annually, this phenomenon accounts for significant damages and loss of lives (Dikshit & Satyam 2018;Parkash 2011). Rapid soil erosion and the destruction of agricultural and residential lands, forests, and roads are amongst such losses (Dolidon et al 2009;Pradhan et al. 2012). Landslideinduced destructive impacts are being increasingly intensified by the urbanization (Cui et al. 2019). Landslides are characterized by widespread distribution, frequent occurrence, rapid motion, and drastic losses with horrible impact in high-densely populated regions (Bardi et al. 2017;Persichillo et al. 2017).
Despite above mentioned, landslides could be well-managed than other natural disasters. The measures such as modifying slope geometry, installing structures such as piles and anchors, stabilizing structures and rerouting surface and underwater drainage can be adopted for landslide disaster mitigation (Popescu & Sasahara 2009). To this end, identifying susceptible regions is one of the most important steps to preventing resultant losses, and provides great supports to natural resources management and development plans (Kamp et al. 2008;Kavzoglu et al 2014). Identifying the influential factors of landslides in addition to appropriate susceptibility mapping, with the aim of determining high-risk regions, provides planners with essential and beneficial information . Landslide susceptibility mapping, which represents the spatial distribution of landslides, is an efficient tool for mitigating landslide-induced impacts, and efficient land use planning ( Huang & Zhao 2018;Wang et al. 2020)In addition, susceptibility zoning gives a better estimation of a zone's potential capabilities, and divides a susceptible zone to several subzones, enabling planners to work more easily. Furthermore, scientific and accurate zoning is a time and cost effective approach in terms of mitigating resultant risks.
To date, scholars have provided various models for modeling and mapping landslide-susceptible regions. They are generally data-driven or expert-based models, or a combination of them (Arabameri et al. 2018;Zhu et al. 2018). In recent years, data-driven approaches and artificial intelligent algorithms have been in the area of focus due to their higher efficacy. Random forest (RF), Neural Networks (NNs), Support Vector Machine (SVM), Decision Tree (DT), K-nearest neighbors (KNN), Extreme Learning Machine (ELM) and Naïve Bayes (NB) are amongst such techniques used for landslide susceptibility mapping (Boateng et al. 2020;Chen et al., 2019;Dou et al., 2019;Huang et al. 2017).
Different models have diverse capabilities and functions. Therefore, attentions have been recently attracted to integrating the output of multiple models in order to benefit from the advantages of each single model, and eventually to obtain more precise and accurate predictions than could be obtained from any of the constituent models alone (Brown 2010;Sagi & Rokach 2018). Ensemble models are integrated techniques which have been in the area of focus due to their capability of improving the results of learning models and techniques used to solve modeling and classification problems (Arabameri et al, 2020;Pham et al., 2020;Roy et al, 2019). This study proposes and compares the use of three novel ensemble model by bagging, boosting and random subspace (RS) techniques and based on SVM, KNN, NB, and DT models for landslide susceptibility mapping in Ajloun province of Jordan. The obtained results from ensemble models will be compared by those of the constituent models.

Figure 1. Study area location map
Ajloun Governorate is one of the governorates in northern Jordan, and the most important cities of Ajloun Governorate is Ajloun city, which considered as the economic and administrative center of the governorate located in (32°19'57" N , 35°45'6" E). Although Ajloun Governorate is the second smallest governorate in the Kingdom in terms of area with 420 km 2 , it has the highest governorate in the Kingdom in terms of the percentage of vegetation cover. The topography of the area varies greatly, while some of the heights in the governorate are considered among the highest in the Kingdom with 1247 meters above sea level, some areas of the governorate are located under the sea surface, which means that there are many steep slopes in the study area, and this is what causes many landslides accidents in it. In recent years, the investment from the government and the local community has resulted in the growth of the tourism infrastructure in Ajloun, especially after the locked down caused by the Corona pandemic, which raised the interest in domestic tourism due to the ban on travel outside the country, which made such a study necessary to help decision makers know the places that are Vulnerable to landslides.

Methodology
Landslide inventory map along with 16 initial conditioning factors for modeling were first prepared. Then, the relationships between each class of conditioning factors and landslide occurrence were evaluated in the study area using Frequency Ratio (FR) method (Yilmaz 2009). Next, four wrapper-based feature selection algorithms were developed and used to select effective features. The selected optimal features were used in the models and ensemble models were built by bagging, boosting and RS approaches. Finally, the built models were evaluated, and landslide susceptibility maps were developed. Fig. 2 shows the steps of this study.

Preparation of landslide inventory map
The main types of landslides that occur in the study area are: Rock slides, Slumping, Earth flow, Toppling, Rock spreads, It is noted that most of these landslides occur when the steep areas contain geological structures consisting of Limestone and marly limestone, as same as Weak marl and clay. Multiple field inspections and ground control points (GCPs) were conducted between June 2 nd and August 23 rd , 2020, using a global positioning system (GPS) to obtain mapping information on landslide locations ( Figure 3). These points were utilized in the GIS context to produce a map of the landslide inventory. 64 landslide locations (70 percent) were used to build models and 27 landslide locations (30 percent) were used for validation (Oh & Pradhan 2011).

Landslide conditioning factors
Providing sufficient data and information from earlier landslides and significant landslide drivers is the first and most important step in landslide hazard zonation and analysis. A variety of factors including material, mineral composition, geological structure, water table, earthquakes, distance to faults, vibrations caused by construction machinery, and subsidence may result in slope instability (Rotaru et al. 2007). The present study investigates 16 landslide conditioning factors, including elevation, slope, aspect, land use, distance to faults, geology, plan curvature, profile curvature, topographic wetness index (TWI), sediment transport index (STI), stream power index (SPI), rainfall, distance to roads, distance to drainage, soil texture and normalized difference vegetation index (NDVI) that are amongst the most important causes of slope instability, as the significant landslide factors ( Figure 3); and each factor is explained briefly in the following sections. These factors play significant role in the landslide occurrence in the study area.

Elevation
Elevation is a basic factor in most of the spatial modeling landslide susceptibility studies. The amount of precipitations, number of freezing days, and slope steepness usually depend on elevation (Du et al. 2020). High altitudes are believed to be more prone to landslides than low altitudes (Devkota et al. 2013). Rocks at higher altitudes are stronger than rocks at low altitudes or foothills (Chen et al. 2018b (Beven 1997;Beven and Freer 2001).

Slope
In landslide hazard zonation methods, including knowledge-based and data driven approaches and in all scales, slope is regarded as a main factor (Dai et al. 2002;Fall et al. 2006). Slope steepness has a significant effect in occurrence of landslides (Chen et al. 2017). This factor also intensifies the effective role of the other factors in mass movement critically (Vergari et al 2011).

Aspect
Aspect is another factor with a significant effect on slope instability. Although using this factor in landslide zonation models is not as conventional as the slope steepness, in certain areas, there might be a remarkable temperature difference between slopes exposed to the sunlight and those not exposed to the sunlight. This temperature difference might be in the range from above to below the freezing point of water. In such cases, there will be a meaningful relation between distribution of slope instabilities and slope aspects; so that the slope instabilities are more likely to happen in northern slopes (not exposed to the sunlight) and less in the southern slopes (exposed to the sunlight).

Land use
Land use and human activities play a decisive role in changes in the environment. In many cases the land uses are positioned improperly, causing disorders and damages to the natural ecosystems, This is because many of the infrastructure projects undertaken in the last two decades, such as road construction and new housing projects, were not based on sufficient geological studies. Also, many of the agricultural roads that reach the farms outside the area of the municipal authority area (outside the urban organization), did not take the adequate amount of planning and appropriate design. Improper manipulation of land in land uses like agriculture can increase the potential of landslide, because such land uses prepare soft and unstable soil formations. Figure 3.d. shows the land use map. It has become necessary for the the municipalities to extend their authority to more areas than it is now, and to update the maps of land use provided by them to become more responsive to modern requirements such as modrn infrastructure and new housing projects.

Distance to faults
Faults have significant effects on development or activation of the areas with landslide potential. Crushing and shearing in fault areas, penetration of water from these areas to the slopes, discontinuities appearing around the faults, and different level of erosion in the slopes are some of such effects.

TWI, SPI and STI
There are two groups of topographic attributes. The first group consists of primary topographic attributes, including slope and aspect. The second group consists of secondary topographic attributes including TWI, STI, and SPI indicators. TWI is used to show wetness conditions at the catchment scale and is calculated using the Eq. 1 (Moore et al 1991): SPI is used to investigate and describe the erosion potential of the terrain. This indicator is calculated using Eq. 2 (Wilson 2000): (2) STI is the result of erosion and sedimentation in the region, and is calculated using the Eq. 3 (Moore et al. 1993).
In Equations 1, 2, and 3 A s is the specific area of the basin based on the total upstream area and σ is slope gradient in degrees. In order to calculate flow accumulation, first the flow direction map is calculated using DEM in GIS environment, and then the flow accumulation map is calculated using the flow direction map.

Rainfall
Atmospheric precipitations in forms of rain or snow play a significant role in formation of mass movements. It is therefore essentially important to know characteristics of the precipitations. Intrinsic conditions prone to landslide, such as type of the material, increase the possibility of landslide with increasing rainfall and humidity (Kayastha et al. 2013).

Distance to roads
In many places, development of road networks on the slopes requires removing the hills. On the other hand, the construction of roads on top of the slopes increases the overhead and mechanical driving force. As a result, the probability of landslide occurrance increases. This factor, which is caused by the human activities, results in increased instability of slopes and potential of landslide (Juliev et al. 2019).

Distance to drainage
Flowing waters are amongst the major landslide factors in slopes. There are more waterways in less steep slopes, which results in an increased number of landslides. This is indicative of the significant role of waterways in formation of landslides (Lee & Sambath 2006;Pradhan, & Buchroithner 2010). In the steeper slopes, there are fewer waterways, as a result of which less mass movements happen. Waterways usually experience more landslides due to drainage of water and steep walls (Lee & Sambath 2006;Pradhan et al. 2010).

Soil texture
This factor is another important landslide factor (Temme 2021). Lots of studies are conducted on the role of soil texture in occurrence of landslides (Lee et al. 2017;Rahim et al. 2018). The studies show that soils with higher silt content are more likely to erode compared to clay and sandy soils (Fonseca et al. 2017). This factor has 4 classes incuding Silty Clay Loam, Loam, Silty Loam, and Clay Loam.

NDVI
This factor delineates green vegetation in the study area. NDVI is based on the idea that chlorophyll in the plant structure is capable of absorbing red radiations, and the mesophyll layer is capable of reflecting the infrared radiations (Goward, Markham, Dye, Dulaney, & Yang, 1991;Pettorelli, 2013). This index ranges between -1 for minimum and +1 for maximum photosynthesis activity. NDVI is defined by Eq. 4 (Goward et al. 1991): In this index, for the vegetation areas, red and near infrared (NIR) wavelengths are marked by high absorption and low reflection, respectively. Since this index is somehow indicative of changes in vegetation density, it can be used for modeling the landslides.

Spatial correlation between landslides and each class of factors
Having general knowledge of the number of occurred landslides in each class is a conventional step in landslide susceptibility mapping with FR as one of the most typical methods. FR is derived from Eq. 5 (Lee & Sambath, 2006): Where A is the number of landslide points in each class of conditioning factors, B is the total number of landslide points in the study area, C is the area of each class of conditioning factors and D is the area of the whole studied area.

Feature selection process using Genetic Algorithm (GA)
Feature selection is an important step in the modeling process through the machine learning algorithms (Balachandran et al., 2018). Various methods have been proposed so far for feature selection. In general, they can be classified as two main categories, i.e. filter and wrapper methods (Talavera 2005). The filter methods are independent of the machine learning algorithms and calculate the credibility of every feature by using the resultant criteria (Sánchez-Marono et al. 2007). The features are then sorted in terms of their ranks, and the features whose ranks are smaller than a predetermined threshold will be removed. These methods are very fast and also appropriate for high-dimensional data; however, they are less accurate than the wrapper methods in which the learning algorithm feedback is used for feature selection (Das 2001;Solorio-Fernández et al. 2020). In other words, the feature selection criterion is the classification algorithm accuracy. Hence, these methods are more accurate. The wrapper approach was adopted in this study for feature selection through the GA based on the DT, KNN, NB, and SVM algorithms. In fact, the accuracy of these four models was used as the criteria for selecting the effective features for modeling.
For feature selection, a population of solutions (i.e. a set of features) was first generated randomly. Each solution or chromosome is defined to consist of 16 genes, each of which indicates one feature or factor of the problem. These genes can be either 0 or 1 at random. If a gene is 0, it shows the absence of the corresponding feature; however, if a gene is 1, it shows the presence of the corresponding feature. After that, the modeling process was performed through the DT, NB, KNN, and SVM algorithms based on the features selected from each chromosome. The outputs of the models were then compared to the real values from the training set to determine the RMSE, which was used as the fitness function. Afterwards, the roulette wheel was used for selection in the genetic algorithm based on the problem solutions and the fitness values. The chromosomes entered into the second generation through selection and two-point mutation. The algorithm continues, and the features with the best fitness values are selected as the optimal features.
There were 20 chromosomes in each generation, whereas the number of generations was 200.

Decision Tree (DT)
DTs belong to the new generation of data mining techniques and have seen considerable development in the last two decades. These models can be used to extract knowledge from a database, or alternatively to create a predictive model (Kotsiantis,2013). Unlike artificial neural networks, the DTs can generate interpretable rules (Kotsiantis 2013). In fact, a DT describes its prediction as a set of rules. In addition, non-numerical (categorical) variables can also be used in these models. In this research, the Classification and Regression Tree (CART) model (Breiman et al. 1984) is employed to map the landslide susceptibility. The DT used in this research is based on three main principles (Breiman et al. 1984):  A set of questions in the general form of < , where is an independent variable and is a constant, and each question is answered with yes or no.
 Selecting the best criterion for determining the best independent variables that can be used to create the tree branches.
 Providing a summary statistic for the final node Various metrics can be used to create a branch and build the DT. However, several studies have shown that the Gini index is more suitable for most of the data set (Daniya et al. 2020). This index can be computed using Eq. 6 (Daniya et al. 2020).
where ( ) is the relative frequency of class and ( ) is the gini index at node . The tree grows until the maximum purity (i. e., minimum impurity index) at the leaf nodes is achieved.

K-nearest neighbors (KNN)
The KNN algorithm finds the nearest samples and predicts the value of the desired sample based on their value 4) Determining the state of the input sample and the class that it belongs to.

Support Vector Machine (SVM)
SVM is a supervised machine learning algorithm and is usually used to solve classification and regression problems.
It is a relatively new algorithm and has shown better performance compared to the previous classification algorithms, including the perceptron neural networks, in recent years. The equation that determines the optimal hyperplane separating the different classes is solved using quadratic programming (QP), which is a well-known technique for solving constrained equations (Vapnik et al. 1995). To handle the data complexity, the model uses special functions called kernels to transfer the data to a higher-dimensional space, in which the transferred data can be linearly separated (Vapnik et al. 1995). Radial basic function (RBF) is the most well-known and commonly used kernel function and is reported in various works as the optimal option (Patle & Chouhan 2013).

Naïve Bayes (NB)
NB is a simple, widely used model that has been used in various domains. This method can be used in problems where sample can be described as conjugation of conditionally independent features (Leung 2007). Bayesian learning, which is founded on Bayesian probability theory, allows us to determine the posterior probability based on the prior probabilities (Kelly & Kolstad 1999): (ℎ) is the initial probability of ℎ ℎ hypothesis before observing the training sample (prior probability). If this probability is not known, one can assign the same probability to all hypotheses. ( ) is the prior probability of observing the training sample . ( |ℎ) is the probability of observing the training sample given that the ℎ hypothesis is true.

The Proposed Bagging, Boosting and Random Subspace Schemes Ensembel models
In recent years, ensemble models have been used extensively by researchers. These models combine the outputs of multiple models and utilize their various capabilities to improve modeling efficiency. Bagging is one of the techniques that can be used to build an ensemble model. To create an ensemble bagging model, one can combine the bootstrap sampling method with a single model (for example, a decision tree) or use multiple models built using the same training dataset (Bühlmann 2012). The current study has used the second approach to combine the results of five different models, namely, DT, SVM, KNN and NB, to create an ensemble model. The averaging method is used to combine the outputs of these models.
In the boosting method, the Classifier 1 is built by a group of samples from main data sets. Then, the samples that were mis-classified are given to Classifier 2 to perform the learning process better (Bühlmann 2012); therefore, this method acts like the bagging method; however, the mis-classified samples are entered into the next model in each step to better perform the learning process on them. Finally, the RS model operates on features instead of samples. In other words, models are developed from features through random selection (Tian & Feng 2020).

Results
After the necessary data are prepared and created, the FR method was adopted to identify important classes in each of the designated factors. Table 1 shows the results of this method. The important classes of every factor can be seen clearly. Before the modeling process is performed, the optimal features were obtained using GA and the KNN, DT, NB, and SVM models based on the wrapper approach. Table 2    After the optimal features are found for the four models, the ensemble models were developed. Figure 8 demonstrates the ROC curves of all models in the training and test phases. Accordingly, the ensemble algorithms obtained the best AUROC in the learning phase, and the boosting, RS, and bagging models had the highest AUROC values, whereas the NB, SVM, KNN, and DT algorithms obtained the lowest AUROC values, respectively. In the validation phase, the ensemble models obtained higher accuracies than the single models. In other words, the highest values of AUROC resulted from the RS, boosting, and bagging models, respectively; therefore, the ensemble models achieved desired accuracy by integrating less accurate models.
After these models were validated, all samples of the study area were entered into the seven developed models based on the selected features in order to prepare landslide susceptibility maps. The outputs of these models were then produced in the form of landslide susceptibility Figure 8. The ROC curves for seven models in the train and test phase Figure 9. Landslide susceptibility maps obtained from the single and ensemble models maps in GIS environment. These maps were classified as very high, high, moderate, low, and very low categories based on the natural break method (Figure 9). Also, figure 9 shows the approximate percentage of each class based on the seven adopted models. Nearly 50% of the entire region can be classified as very high and high categories by using the best model, i.e. RS. These two classes are mostly observed in the western half of the region.

Discussion
Landslides occur due to various natural and man-made factors where each factor plays a key role in landslide occurrence. This study used 16 conditioning factors for landslide susceptibility mapping in Ajloun, Jordan. The first section discusses the factors which are present at least in 3 of 4 feature selection models used to select optimal factors.
It should be noted that when the factors are evaluated independently and without considering other factors, it may appear that the studied independent factor leaves a significant impact on modeling, whereas this may be completely different when all factors are evaluated as whole. For example, some studies consider distance to fault as an important factor of landslide susceptibility mapping while this was selected as an effective factor in none of the four feature selection algorithms. A total of 8 features, namely slope, NDVI, plan curvature, profile curvature, geology, distance to drainage, STI and TWI were selected at least in 3 feature selection algorithms as important optimal modeling factors. According to results, almost 81% of mass movement occurred in the slope classes of 5 to 30 degree. Instability of slope, taking the form of slope failure, accelerates due to sliding but decreases after a certain slope, and the cause behind this lies in the hardness of the formations that constitute the slope and also due to damages that occurred during soil formation on the steeper parts of the slope. This hypothesis acknowledges the results obtained by Ahmadi at al., Selecting effective features is among the most important activities in developing machine learning and data mining models. Regarding feature selection, some factors have no effects on the final output and produce random values for every sample. At the same time, some factors affect the model output and cannot be considered among the features having no effects on the modeling process. Many studies have proven that it is fundamentally important to select relevant and necessary features in the preprocessing step. Therefore, feature selection was performed through all models in this study, and the effective features were selected through the genetic algorithm. Hence, researchers and model developers should take this step extremely seriously in the modeling process, before which it should be performed as an essential preprocessing step.
In recent years, many studies have been conducted on the ensemble classifiers to develop a final classifier. According to various studies, the resultant classifiers are more accurate than the initial classifiers. This study adopted bagging, boosting, and random subspace approaches to develop ensemble models. The KNN, NB, DT, and SVM algorithms were also employed to develop these models. This study drew a comprehensive comparison between these four models and the three ensemble models. The accuracy of every ensemble model was tangibly higher than the single models, and the RS model was more accurate than the other three ensemble models. The same finding was also obtained in other studies of spatial modeling. Another study used ensemble, bagging, and boosting models as well as the random forest to classify the land cover and compared them with the DT model. The results of ensemble models were more accurate than that of the DT model. Regarding hazard mapping, the ensemble algorithms were more accurate than the single models.

Conclusion
This paper conducted a comparative study using three ensemble approaches, namely bagging, boosting and RS constructed with four well-known machine learning algorithm including DT, KNN, NB and SVM for landslide susceptibility mapping in the Ajloun province of Jordan. Although various landslide zonation techniques and methods have been developed in recent two decades, ensemble models have been adopted and compared in fewer studies. This study selected effective features based on GA and DT, NB, SVM and KNN algorithms. The results of these algorithms showed that 8 factors, namely slope, plan curvature, profile curvature, TWI, STI, distance to drainage, geology, and NDVI are the main drivers of landslide in the study area. AUROC was used to compare the models. The value of this index was higher in all three ensemble models than individual ones, and the three models showed a desired modeling accuracy. Landslide susceptibility maps are considered as a beneficial tool in land use planning, disaster management