3.1 Methodology flowchart
The methodological framework of this study is shown in Fig. 2. The framework consists of six main steps: (1) data collection and preparation, (2) checking the multicollinearity of the landslide-related variables, (3) landslide inventory mapping, (4) landslide modeling using a hybrid ML ensemble framework, (5) landslide susceptibility mapping, and (6) model comparison and validation. First, the topographical, geological, hydrological, and environmental data and the historical landslide locations were collected and prepared as modeling samples. Second, these data were checked for multicollinearity to avoid computational instability in model assessment. Third, the landslide inventory map was created, and the sample data were randomly split into training (70%) and validation (30%) datasets using the Sample tool in ArcGIS Pro. Fourth, ensemble ML techniques were developed for the modeling, including Bagging – MLP, Dagging – MLP, Decorate – MLP, Rotation Forest – MLP, and Random SubSpace – MLP. Fifth, these hybrid ML models were used to create landslide susceptibility maps. Finally, the susceptibility models were compared and verified using the cross-validation approach.
3.2 Landslide inventory map
Studies of landslide formation mechanisms, landslide susceptibility mapping, and landslide risk mitigation strategies all rely on detailed information about previous landslide events (Mirus et al., 2020). Landslide inventory mapping is therefore often the first and most crucial step in data preparation for modeling (Bui et al., 2019). In this study, 1,689 landslide locations were collected from several sources: 1,225 landslide positions were obtained from the website of the Institute of Geosciences and Minerals of Vietnam (available at http://canhbaotruotlo.vn/phanvungcactinh.html), and 464 landslide locations were added through field surveys combined with the interpretation of Google Earth images.
Three main causes of landslide formation have been identified in Son La province: construction on steep slopes, thick weathering crust, and high groundwater levels (Thach & Canh, 2011). In addition, heavy prolonged rainfall events are considered the primary cause of landslides in the study area (Hoang & Tien Bui, 2018). Many landslides have been recorded in the Quynh Nhai, Sop Cop, Song Ma, Moc Chau, Van Ho, and Thuan Chau districts (Hoang & Tien Bui, 2018; IFRC, 2021; Thach & Canh, 2011). The landslide curves ranged from 10 m to 100 m in length, and the landslide areas were larger than 30 m² (Thach & Canh, 2011). In this study, the landslide inventory data were randomly split into two samples using the Sample tool of ArcGIS Pro: a training dataset (70%, 1,183 landslide sites) and a validation dataset (30%, 506 landslide sites).
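The random 70/30 split can be illustrated with a minimal Python sketch. This is not the ArcGIS Pro workflow used in the study; the function name `split_inventory`, the seed, and the choice to round the training size up are assumptions made so that the sketch reproduces the reported 1,183 / 506 counts.

```python
import math
import random

def split_inventory(site_ids, train_fraction=0.7, seed=42):
    """Randomly split landslide site IDs into training and validation lists."""
    ids = list(site_ids)
    rng = random.Random(seed)  # fixed seed (assumed) so the split is reproducible
    rng.shuffle(ids)
    # Assumption: rounding up gives the 1,183 / 506 split reported for 1,689 sites.
    n_train = math.ceil(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

train_ids, valid_ids = split_inventory(range(1689))
print(len(train_ids), len(valid_ids))
```

Every site ends up in exactly one of the two lists, so the validation sites are never seen during training.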
3.3 Landslide causative factors
Modeling work begins with the identification of landslide causative factors. The factors were selected with reference to previous studies and according to the data available for the research area (Kavzoglu et al., 2019). Precipitation, topography, hydrology, geology, geomorphology, and geoenvironment are factors that significantly impact landslide formation in mountain areas (Bui et al., 2019; Meghanadh et al., 2022). In this study, 16 landslide causative factors were selected for modeling and are described in the following subsections (Fig. 3 and Table 1).
3.3.1 Elevation
Elevation is a crucial factor in landslide susceptibility mapping (Goetz et al., 2015; Tehrany & Kumar, 2018). Higher elevation zones correspond to higher landslide frequencies (Myronidis et al., 2016). The elevation map was constructed from a Digital Elevation Model (DEM). The DEM was derived from ALOS imagery with a spatial resolution of 30 m, obtained from https://www.eorc.jaxa.jp/ALOS/en/aw3d30/ in March 2021. Elevation in the research area ranges from 70 to 2884 m and is divided into nine groups (Fig. 3 (a)).
3.3.2 Slope
The slope is a critical factor in landslide susceptibility evaluations since it can control landslide initiation and movement in tropical mountainous areas (Dai et al., 2002; Guns & Vanacker, 2013). The slope angle map was created in ArcGIS Pro from a DEM with a spatial resolution of 30 m. The slope angle map is divided into six classes, ranging from 0° to 78.3° (Fig. 3 (b)).
3.3.3 Aspect, curvature
Other topographical characteristics, such as aspect and curvature, are often used as primary input data in landslide prediction (Arabameri et al., 2020; Hang et al., 2021; Kavzoglu et al., 2019). These variables were estimated using ArcGIS Pro software and a DEM with a 30 m spatial resolution. The aspect map (Fig. 3 (c)) was divided into nine classes, and the curvature map (Fig. 3 (d)) was categorized into five levels.
3.3.4 Elevation difference
Elevation difference represents the altitude difference of all points on the Earth’s surface (Corsini et al., 2005). This factor also reflects the small-scale separation of elevations, where water redistribution is very significant to landslide formation (Van Westen et al., 2003). The elevation difference factor was derived from the 1:50,000 topographical map as the relative altitude (meters) within each square grid cell (1 km²). It was categorized into nine classes (Fig. 3 (e)).
3.3.5 TWI
The Topographic Wetness Index (TWI) measures the impact of topography on the location and amount of saturated runoff source zones (Pourghasemi et al., 2012). TWI was derived from a 30 m DEM in ArcGIS Pro and divided into seven categories (Fig. 3 (f)). TWI is calculated as follows:
\(\mathrm{TWI}=\ln\left(\frac{A_{S}}{\tan\beta}\right)\) | (1) |
where \(A_{S}\) denotes the specific basin (catchment) area (m²/m) and β denotes the slope angle in degrees.
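Equation (1) can be evaluated per grid cell as in the following minimal Python sketch. The function name `twi` and the small minimum slope used to avoid division by zero on flat cells are assumptions for illustration, not part of the original workflow.

```python
import math

def twi(specific_catchment_area, slope_deg, min_slope_deg=0.1):
    """Topographic Wetness Index, Eq. (1): ln(A_S / tan(beta)).

    specific_catchment_area: A_S in m^2/m.
    slope_deg: slope angle beta in degrees.
    min_slope_deg: assumed lower bound so flat cells do not divide by zero.
    """
    slope = max(slope_deg, min_slope_deg)
    return math.log(specific_catchment_area / math.tan(math.radians(slope)))

# A gentle slope with a large contributing area is wetter than a steep slope
# with the same contributing area.
print(twi(500.0, 5.0) > twi(500.0, 40.0))  # True
```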
3.3.6 NDVI
The Normalized Difference Vegetation Index (NDVI) measures the development of vegetation on the Earth’s surface (Jaafari et al., 2014). The NDVI explains the link between vegetation density and the occurrence and distribution of landslides (Chen et al., 2019). The NDVI is expressed as below:
\(\mathrm{NDVI}=\frac{\mathrm{NIR}-\mathrm{Red}}{\mathrm{NIR}+\mathrm{Red}}\) | (2) |
where NIR denotes the near-infrared reflectance and Red denotes the red reflectance of the electromagnetic spectrum.
The NDVI value in this research ranged from −0.643 to 0.694 and was separated into five groups (Fig. 3 (g)).
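As an illustration of Eq. (2), the sketch below computes NDVI for a single pixel in Python. The function name `ndvi` and the convention of returning 0 when both reflectances are zero are assumptions.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index, Eq. (2), for one pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # assumed convention for pixels with no signal
    return (nir - red) / total

# Dense vegetation reflects strongly in the near infrared, so NDVI is high;
# bare or built surfaces give values near zero or below.
print(round(ndvi(0.5, 0.1), 3))  # 0.667
```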
3.3.7 Rainfall
Heavy prolonged rainfall is considered the primary cause of landslides in mountain areas (Singh et al., 2021). It might trigger unexpected landslides depending on the topographical and geological characteristics of the ground/rock mass (Dung et al., 2021). In this study, the rainfall data were derived from 25 rain gauge stations in Son La province and neighboring provinces from 2010 to 2021. The rainfall map was created by interpolating the rainfall distribution based on DEM using the geostatistical Kriging technique (Fig. 3 (h)).
3.3.8 Drainage density
Drainage density describes the drainage availability in the short term in response to changes in environmental conditions (Mezughi et al., 2011). Drainage density is directly associated with landslide formation in mountainous areas (Arabameri et al., 2020). The drainage density is calculated by dividing the total drainage length (km) in each square grid by the area of that grid (km²) (Fig. 3 (i)).
3.3.9 Road density and distance to road
The road network is often associated with an increase in landslide events (Skilodimou et al., 2018). Road density is often used to measure the effect of development on landslide formation and distribution (Simon et al., 2015). Distance to road is negatively related to landslide events (Akgün & Bulut, 2007): the shorter the distance to a road, the more frequent landslides are (Skilodimou et al., 2018). The road density map was grouped into five levels (Fig. 3 (l)), and the distance to road map was classified into six classes (Fig. 3 (k)).
3.3.10 Distance to rivers
Many previous studies have shown that landslides are closely related to distance to rivers (Arabameri et al., 2020; Bui et al., 2019). Landslides often happen along valley sides where groundwater flows toward rivers and streams (Raja et al., 2017). In this study, the distance to rivers map was calculated in ArcGIS Pro and consisted of seven classes (Fig. 3 (n)).
3.3.11 Hydrogeology
Hydrology and hydrogeology significantly affect landslide formation in hilly areas (Kayastha et al., 2012). Many studies have examined the importance and complexity of hydrogeology in landslide susceptibility evaluations (Frodella et al., 2021; Sujatha, 2021). The hydrogeology map of Son La province, at a scale of 1:50,000, was obtained in 2020 from the Vietnamese Ministry of Natural Resources and Environment. The map comprises four hydrogeological units (Fig. 3 (m)).
3.3.12 Geology and geomorphology
Although most geological and geomorphological factors change over relatively long periods, their characteristics have a substantial influence on the evolution of erosional and landslide processes in mountainous areas (Bui et al., 2019; Pisano et al., 2017). In 2020, the Vietnamese Ministry of Natural Resources and Environment released geological and geomorphological maps of Son La province at a scale of 1:50,000. The geological map comprises nine geological groups (Fig. 3 (s)), which are detailed in Table 1. The geomorphological map comprises fifteen geomorphological types (Fig. 3 (p)).
3.3.13 Land cover
Anthropogenic activities can change land cover by transforming the natural landscape (Promper et al., 2014). Therefore, this factor is often considered an important triggering factor for landslide occurrences in mountain regions (Van Westen et al., 2003). Changes in land cover can quickly affect landslide frequency and distribution (Hong et al., 2007). In this study, the land cover map was the one developed by ESRI in 2020 using Deep Learning models and satellite images (downloaded at https://livingatlas.arcgis.com/landcover/). The land cover map of the study area consists of eight groups: bare ground (0.02%), built area (2.57%), crops (6.59%), flooded vegetation (0.01%), grass (0.53%), scrub/shrub (28.86%), trees (59.55%), and water bodies (1.87%) (Fig. 3 (m)).
Table 1
Geological complexes and formation types in the study area
No | Group | Name | Complex and Formation type |
---|---|---|---|
1 | Group 1 | Quaternary | Upper Holocene; Lower-Middle Holocene; Middle-Upper Pleistocene; undifferentiated Quaternary; HaiHung Formation; VinhPhuc Formation; HangMon Formation; NamPo Formation; ThaiBinh Formation; HaNoi Formation; LeChi Formation |
2 | Group 2 | Paleogene | ChoDon Complex; Ye Yen Sun Complex; Pu Sam Cap Complex; PuTra Formation; NamXe_TamDuong Complex; NamBay Formation; CocPia Complex |
3 | Group 3 | Jurassic-Cretaceous | YenChau Formation; NgoiThia Subcomplex; TuLe Subcomplex; Phu Sa Phin Complex; NamChien Complex; Middle Subformation; Lower Subformation; NgoiThia volcanic Subcomplex; TuLe volcanic Subcomplex; SuoiBe Formation; YenChau Formation; NamMa Formation; TuLe_NgoiThia Complex; NamPo Formation; BanMuong Complex; MongHinh Formation; NamThep Formation; HaCoi Formation; TramTau Formation |
4 | Group 4 | Triassic | SuoiBang Formation; SongBoi Formation; NamTham Formation; NaKhuat Formation; DongGiao Formation; KhonLang Formation; TamDao Formation; TanLac Formation; VienNam Formation; Bavi Complex; CoNoi Formation; MongTrai Formation; PhiaBioc Complex; PacMa Formation; BanXang Complex; SongMa Complex; HoangMai Formation; DongTrau Formation; SuoiBang Formation |
5 | Group 5 | Devonian | BanCai Formation; BanPap Formation; BanNguon Formation; SongMua Formation; NamPia Formation; BoHieng Formation; SinhVinh Formation; TayTrang Formation; HuoiNhi Formation; NamSap Formation; MiaLe Formation |
6 | Group 6 | Cambrian-Ordovician | BenKhe Formation; PoSen Complex; BanNgam Complex; MuongHum Complex; CamDuong Formation; NuiNa Complex; BoXinh Complex; ChiengKhong Complex; HamRong Formation; SongMa Formation; DongSon Formation; ChangPung Formation; HaGiang Formation |
7 | Group 7 | Neoproterozoic-Cambrian | AnPhu Formation; ThacBa Formation; DaDinh Formation; ChaPa Formation; NamCo Formation; NamTy Formation; BoXinh Group; Nam Sl Formation; HuoiHao Formation; SinhQuyen Formation |
8 | Group 8 | Carboniferous-Permian | DaNieng Formation; BacSon Formation; SiPhay Formation; NaVang Formation; CamThuy Formation; YenDuyet Formation; BanDiet Group; DienBienPhu Complex; SongDa Formation; DienThuong Complex; PhuSiLung Complex; MuongLat Complex |
9 | Group 9 | Unknown | Dykes and veins of unknown age |
Table 2
Landslide influencing factors and their classes
Factor | Classes | Classification method |
---|---|---|
Elevation (m) | (1) 70–300, (2) 300–500, (3) 500–700, (4) 700–900, (5) 900–1100, (6) 1100–1300, (7) 1300–1500, (8) 1500–2000, (9) > 2000 | Natural Breaks |
Slope angle (degree) | (1) 0–10, (2) 10–20, (3) 20–30, (4) 30–40, (5) 40–50, (6) > 50 | Natural Breaks |
Aspect | (1) Flat (-1), (2) North (0-22.5), (3) Northeast (22.5–67.5), (4) East (67.5-112.5), (5) Southeast (112.5-157.5), (6) South (157.5-202.5), (7) Southwest (202.5-247.5), (8) West (247.5-292.5), (9) Northwest (292.5-337.5) | Azimuth |
Curvature | (1) [(-9.786) - (-0.625)], (2) [(-0.625) - (-0.173)], (3) [(-0.173) - 0.208], (4) [0.208–0.659], (5) [0.659–9.717] | Natural Breaks |
Elevation difference | (1) [0-132.8], (2) [132.8-232.4], (3) [232.4-312.1], (4) [312.1-385.2], (5) [385.2-464.8], (6) [464.8-557.8], (7) [557.8–684.0], (8) [684.0-1035.9], (9) [1035.9–1700] | |
TWI | (1) [2.283–4.542], (2) [4.542–5.432], (3) [5.432–6.459], (4) [6.459–7.759], (5) [7.759–9.334], (6) [9.334–11.251], (7) [11.251–19.808] | Natural Breaks |
NDVI | (1) [(-0.643) - (-0.038)], (2) [(-0.038)-0.009], (3) [0.009–0.051], (4) [0.051–0.093], (5) [0.093–0.694] | Natural Breaks |
Rainfall (mm) | (1) [972–1038], (2) [1038–1089], (3) [1089–1143], (4) [1143–1194], (5) [1194–1279] | Rainfall combined with DEM using the geostatistical Kriging method |
Drainage density (km/km²) | (1) [0-2.13], (2) [2.13–4.27], (3) [4.27–6.40], (4) [6.40–8.54], (5) [8.54–10.68], (6) [10.68–12.82], (7) [12.82–14.95], (8) [14.95–17.08], (9) [17.08–19.22] | Natural Breaks |
Distance to road (m) | (1) [0–50], (2) [50–100], (3) [100–200], (4) [200–500], (5) [500–1000], (6) > 1000 | Natural Breaks |
Road density (km/km²) | (1) [0-2.077], (2) [2.077–3.676], (3) [3.676–5.274], (4) [5.274–7.671], (5) [7.671–13.638] | Natural Breaks |
Distance to river (m) | (1) [0-200], (2) [200–500], (3) [500–1000], (4) [1000–1500], (5) [1500–2000], (6) [2000–2500], (7) > 2500 | Natural Breaks |
Hydrogeology | (1) Rich layer of water, (2) Middle poor layer of water, (3) Very poor layer of water, (4) Poor layer of water | Hydrogeological categories |
Geological | (1) Group1, (2) Group2, (3) Group3, (4) Group4, (5) Group5, (6) Group6, (7) Group7, (8) Group8, (9) Group9 | Geological group |
Geomorphology | (1) Valley of invasion, (2) Cavitation plateaus develop on carbonate rocks, (3) Driftwood washing plateau grows on carbonate rock, (4) Erosion plateaus develop on carbonate rocks, (5) Erosion and erosion plateaus develop on carbonate rocks, (6) Cavitation mountain range growing on carbonate rock, (7) Massive and structural mountain ranges developed on non-carbonate rocks, (8) Erosion and erosion mountain ranges develop on rocks, (9) Erosion massif develops on carbonate rock, (10) Masses and eroded mountain ranges develop on rocks, (11) The valley erodes and accumulates, (12) Cavitation mountain range growing on non-carbonate rock, (13) The mountain range erodes the structure growing on the rock, (14) Karst Funnel, (15) Invasion valley | Geomorphological categories |
Landcover | (1) Bare ground, (2) Built area, (3) Clouds, (4) Crops, (5) Flooded vegetation, (6) Grass, (7) Scrub/Shrub, (8) Tree, (9) Water | Landcover categories |
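The class assignment in Table 2 amounts to looking up each continuous value against a list of break points. The Python sketch below shows this for the slope classes; the names `SLOPE_BREAKS` and `classify`, and the choice to assign a value lying exactly on a break to the upper class, are assumptions, since Table 2 does not state how boundary values are handled.

```python
from bisect import bisect_right

# Upper bounds (degrees) of slope classes 1 to 5 from Table 2; class 6 is > 50.
SLOPE_BREAKS = [10, 20, 30, 40, 50]

def classify(value, breaks):
    """Return the 1-based class index of value for ascending break points.

    Assumption: a value equal to a break point falls in the upper class.
    """
    return bisect_right(breaks, value) + 1

print([classify(v, SLOPE_BREAKS) for v in (0, 25, 78.3)])  # [1, 3, 6]
```

The same function applies to any other continuous factor in Table 2 once its break points are listed in ascending order.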
3.4 Landslide susceptibility modeling
Multilayer Perceptron (MLP) is used as the base classifier in this study. Five ML ensemble techniques were developed with MLP as the base classifier: Bagging – MLP (BAMLP), Dagging – MLP (DAMLP), Decorate – MLP (DEMLP), Rotation Forest – MLP (RFMLP), and Random SubSpace – MLP (RSSMLP).
3.4.1 Multilayer Perceptron (MLP)
The MLP method is one of the most common artificial neural network approaches and is widely applied in landslide susceptibility assessment (Gómez & Kavzoglu, 2005; Li et al., 2019). The MLP is trained with the backpropagation algorithm (Gómez & Kavzoglu, 2005), which iteratively adjusts the weight values to minimize the objective function (Li et al., 2019). The model has three main components: input, hidden, and output layers. The landslide influencing factors form the input layer. The output layer gives the classification result, which separates landslides from non-landslides. The hidden layers transform the inputs into outputs (Gómez & Kavzoglu, 2005). For an MLP with multiple input and output variables, the numbers of neurons in the input layer, with input vector \(X=({x}_{1},{x}_{2},\dots ,{x}_{{m}_{0}})\), the hidden layer, and the output layer are denoted m0, m1, and m2, respectively. The input and output of neurons in the hidden layer are calculated as follows:
\(h_{j}=\sum_{i=1}^{m_{0}}w_{ij}x_{i}+\theta_{j}\) | (3) |
\(z_{j}=f\left(h_{j}\right)=\left(1+e^{-h_{j}}\right)^{-1}\) | (4) |
where \(h_{j}\), \(\theta_{j}\), and \(z_{j}\) denote the input, the threshold, and the output of the jth neuron in the hidden layer; \(x_{i}\) denotes the ith input variable; \(w_{ij}\) denotes the weight between the ith input neuron and the jth hidden layer neuron; and \(f\left(h_{j}\right)\) denotes the activation function. The inputs and outputs of the output layer neurons are then given by:
\(h_{k}=\sum_{j=1}^{m_{1}}w_{jk}z_{j}+\theta_{k}\) | (5) |
\(z_{k}=h_{k}\) | (6) |
where \(h_{k}\), \(\theta_{k}\), and \(z_{k}\) denote the input, the threshold, and the output of the kth neuron in the output layer; \(z_{j}\) is the output of the jth hidden layer neuron from Eq. (4); and \(w_{jk}\) denotes the weight between the jth hidden layer neuron and the kth output layer neuron.
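The forward pass described by Eqs. (3) to (6) can be sketched in a few lines of Python. The sketch reads the summed terms in Eqs. (3) and (5) as the inputs to each layer (the input variables and the hidden layer outputs, respectively). The function names, the nested-list weight layout, and the toy weights are assumptions for illustration; training by backpropagation is not shown.

```python
import math

def sigmoid(h):
    """Logistic activation function of Eq. (4)."""
    return 1.0 / (1.0 + math.exp(-h))

def mlp_forward(x, w_hidden, theta_hidden, w_out, theta_out):
    """Forward pass of a one-hidden-layer MLP.

    w_hidden[i][j]: weight from input i to hidden neuron j.
    w_out[j][k]:    weight from hidden neuron j to output neuron k.
    """
    m0, m1, m2 = len(x), len(theta_hidden), len(theta_out)
    z_hidden = []
    for j in range(m1):
        h_j = sum(w_hidden[i][j] * x[i] for i in range(m0)) + theta_hidden[j]  # Eq. (3)
        z_hidden.append(sigmoid(h_j))                                          # Eq. (4)
    z_out = []
    for k in range(m2):
        h_k = sum(w_out[j][k] * z_hidden[j] for j in range(m1)) + theta_out[k]  # Eq. (5)
        z_out.append(h_k)                                                       # Eq. (6)
    return z_hidden, z_out

# Toy network: 2 inputs, 2 hidden neurons, 1 linear output neuron.
hidden, output = mlp_forward(
    x=[0.0, 0.0],
    w_hidden=[[1.0, -1.0], [0.5, 2.0]],
    theta_hidden=[0.0, 0.0],
    w_out=[[2.0], [4.0]],
    theta_out=[1.0],
)
print(hidden, output)  # [0.5, 0.5] [4.0]
```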
3.4.2 Bagging (BA)
Bootstrap aggregating, or Bagging, was proposed by Breiman (1996). The algorithm builds an aggregated predictor from classifiers trained on different bootstrap samples (Breiman, 1996). It uses a training dataset T (xk, yk), where xk ∈ Q, yk ∈ (landslide; non-landslide), k = 1, …, M, and M is the number of bootstrap samples. First, a bootstrap sample Tk is generated from the initial training dataset by sampling with replacement. Next, a base classifier Bk(x) is trained on each bootstrap sample Tk. Finally, the classifier B* is synthesized from B1, B2, …, Bn and calculated as below (Bauer & Kohavi, 1999):
\(B^{*}\left(x\right)=\arg\max_{y\in Y}\sum_{i=1}^{n}\mathbf{1}\left(B_{i}\left(x\right)=y\right)\) | (7) |
where \(B_{i}(x)\) is the classifier trained on the ith bootstrap sample, \(\mathbf{1}(\cdot)\) is the indicator function, which equals 1 when its argument is true and 0 otherwise, and Y is the set of class labels. B* therefore returns the class that receives the most votes.
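The bootstrap sampling and the majority vote of Eq. (7) can be sketched in Python as follows. To keep the sketch self-contained, the base classifier is a hypothetical one-feature threshold rule (`train_stump`) standing in for the MLP, and the toy slope values are invented for illustration only.

```python
import random
from collections import Counter

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement (one bootstrap sample T_k)."""
    return [rng.choice(dataset) for _ in range(len(dataset))]

def train_stump(sample):
    """Hypothetical stand-in base classifier: threshold between class means."""
    pos = [x for x, y in sample if y == "landslide"]
    neg = [x for x, y in sample if y == "non-landslide"]
    if not pos or not neg:  # the sample happens to contain one class only
        only = "landslide" if pos else "non-landslide"
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: "landslide" if x > threshold else "non-landslide"

def majority_vote(labels):
    """Eq. (7): return the label predicted by the most classifiers."""
    return Counter(labels).most_common(1)[0][0]

def bagging_fit(dataset, n_classifiers=11, seed=0):
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(dataset, rng)) for _ in range(n_classifiers)]

def bagging_predict(classifiers, x):
    return majority_vote([clf(x) for clf in classifiers])

# Toy data (invented): slope angle in degrees with a class label.
toy = [(5, "non-landslide"), (8, "non-landslide"), (12, "non-landslide"),
       (15, "non-landslide"), (35, "landslide"), (40, "landslide"),
       (45, "landslide"), (50, "landslide")]
ensemble = bagging_fit(toy)
print(bagging_predict(ensemble, 42))
```

An odd number of classifiers is used so that a two-class vote cannot end in a tie.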
3.4.3 Dagging (DA)
Dagging was first proposed by Ting & Witten (1997). The method determines the final prediction by majority vote. The training dataset is divided into several disjoint parts, and each part is used to train one base learner (Ting & Witten, 1997). If the input training dataset D has K samples, the dagging algorithm creates N datasets from it. Each dataset consists of k samples (kN ≤ K), and no sample appears in more than one dataset. Each dataset is then used to train a base classifier, so N classification models are formed from the N datasets. Each model predicts a class for a given query sample. Finally, the class with the most votes is taken as the prediction of the dagging method (Chen & Li, 2020).
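The partitioning step that distinguishes Dagging from Bagging can be sketched in Python as below. The function name `dagging_partitions`, the shuffle seed, and the choice of N = 10 are assumptions; each resulting subset would then train one base classifier, and the predictions would be combined by majority vote.

```python
import random

def dagging_partitions(dataset, n_parts, seed=0):
    """Split the data into n_parts disjoint subsets of equal size k.

    k = K // n_parts, so k * n_parts <= K; leftover samples are not used.
    """
    data = list(dataset)
    random.Random(seed).shuffle(data)
    k = len(data) // n_parts
    return [data[i * k:(i + 1) * k] for i in range(n_parts)]

# 1,183 training sites split into N = 10 disjoint subsets of k = 118 sites.
parts = dagging_partitions(range(1183), n_parts=10)
print(len(parts), len(parts[0]))  # 10 118
```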
3.4.4 Decorate (DE)
Decorate was developed by Melville & Mooney (2003). Decorate (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples) is an ensemble meta-learner that builds a diverse set of classifiers by training each new classifier on the original training data combined with artificially generated examples. The Decorate algorithm can be summarized as follows (Melville & Mooney, 2003):
1. Input a training dataset \(D=\left\{\left({x}_{1},{y}_{1}\right),\left({x}_{2},{y}_{2}\right),\dots ,\left({x}_{p},{y}_{p}\right)\right\}\) with \({x}_{i}\in {R}^{b},{y}_{i}\in Y=\left\{{l}_{1},{l}_{2},\dots ,{l}_{q}\right\},i=1,2,\dots ,p;j=1,2,\dots ,q\), in which p denotes the number of training examples (p>1), b denotes the number of attributes of each example (b>1), and q denotes the number of class labels in the classification model (q>1).
2. The base learning algorithm, BaseLearn, is used to train a classifier C1 on the original input dataset D, which gives the first ensemble \({C}^{*}=\left({C}_{1}\right)\).
3. The Decorate algorithm then generates further classifiers iteratively. In each iteration, artificial examples are generated and labeled in opposition to the current ensemble's predictions, a new classifier is trained on D together with these examples, and the new classifier is kept only if it does not increase the ensemble's training error.
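The oppositional relabeling that gives Decorate its name can be illustrated with a short Python sketch. It assumes, following the description in Melville & Mooney (2003), that an artificial example receives a label with probability inversely proportional to the probability the current ensemble assigns to that label. The function names and the `eps` floor for zero probabilities are assumptions, and the generation of the artificial feature vectors themselves is not shown.

```python
import random

def oppositional_label_weights(ensemble_probs, eps=1e-3):
    """Turn ensemble class probabilities into relabeling probabilities.

    Each label is weighted by 1 / p (p floored at eps, an assumed value),
    then the weights are normalized to sum to 1.
    """
    inverse = {label: 1.0 / max(p, eps) for label, p in ensemble_probs.items()}
    total = sum(inverse.values())
    return {label: w / total for label, w in inverse.items()}

def oppositional_label(ensemble_probs, rng):
    """Sample one label for an artificial example."""
    weights = oppositional_label_weights(ensemble_probs)
    labels = list(weights)
    return rng.choices(labels, weights=[weights[l] for l in labels], k=1)[0]

# The ensemble is fairly sure the artificial example is a landslide, so the
# example is most likely to be labeled as a non-landslide.
weights = oppositional_label_weights({"landslide": 0.8, "non-landslide": 0.2})
print(weights)
print(oppositional_label({"landslide": 0.8, "non-landslide": 0.2}, random.Random(0)))
```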
3.4.5 Rotation Forest (RF)
Rotation Forest (RF) is an ensemble learning method proposed by Rodríguez et al. (2006). The method builds an ensemble of decision trees by combining the random subspace and bagging methodologies with principal component analysis (PCA). The input training dataset is divided into several sub-datasets to create the classifiers, and PCA is applied to each sub-dataset (Kuncheva & Rodríguez, 2007).
The input training dataset is E, the class label of the input training dataset is F, and the feature set is G. If there are T training samples with t features, then E is a T × t matrix. The class labels F take values from the class set \(H=({h}_{1},{h}_{2},\dots ,{h}_{j})\); the feature set is split into M subsets, and the rotation forest contains P decision trees \(D=({D}_{1},{D}_{2},\dots ,{D}_{P})\). The two parameters M and P must therefore be set in advance. The procedure to build the training dataset for each classifier consists of four steps (Sahin et al., 2020):
(1) Divide the feature set G into M feature subsets, each containing N = t/M attributes;
(2) \(G_{i,j}\) is the jth attribute subset used to train the classifier \(D_{i}\), and \(Y_{i,j}\) is the training data restricted to the attributes in \(G_{i,j}\);
(3) Apply a linear transformation (PCA) to \(Y_{i,j}\) to obtain the coefficient matrix \(F_{i,j}\), whose coefficients are \({f}_{ij}^{\left(1\right)},{f}_{ij}^{\left(2\right)},\dots ,{f}_{ij}^{\left({N}_{j}\right)}\);
(4) Assemble the coefficients into the rotation matrix \(R_{i}\), which has one block per feature subset:
\(R_{i}=\left[\begin{array}{cccc}f_{i1}^{(1)},\dots ,f_{i1}^{(N_{1})}& 0& \dots & 0\\ 0& f_{i2}^{(1)},\dots ,f_{i2}^{(N_{2})}& \dots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0& 0& \dots & f_{iM}^{(1)},\dots ,f_{iM}^{(N_{M})}\end{array}\right]\) | (8) |
In the classification step, for a given case y, \({q}_{ij}\left(y{R}_{i}^{b}\right)\) denotes the probability, assigned by the ith classifier, that y belongs to class j. The confidence of each class is then obtained by averaging these probabilities over the m classifiers in the ensemble, for each of the d classes:
\(\sigma_{j}=\frac{1}{m}\sum_{i=1}^{m}q_{ij}\left(yR_{i}^{b}\right),\quad j=1,2,\dots ,d\) | (9) |
Finally, y is assigned to the class with the largest confidence.
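The confidence averaging of Eq. (9) and the final class assignment can be sketched in Python as follows. The function names and the toy probabilities are assumptions; each row stands for the class probabilities that one classifier in the ensemble returns for the rotated case, and the rotation itself is not shown.

```python
def class_confidences(prob_rows):
    """Eq. (9): average each class probability over the m classifiers.

    prob_rows[i][j] is the probability that classifier i assigns to class j.
    """
    m = len(prob_rows)
    d = len(prob_rows[0])
    return [sum(row[j] for row in prob_rows) / m for j in range(d)]

def predict_class(prob_rows):
    """Return the index of the class with the largest confidence."""
    sigma = class_confidences(prob_rows)
    return max(range(len(sigma)), key=sigma.__getitem__)

# Three classifiers, two classes (index 0: landslide, index 1: non-landslide).
rows = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
print([round(s, 3) for s in class_confidences(rows)], predict_class(rows))  # [0.6, 0.4] 0
```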
3.4.6 Random SubSpace (RSS)
The Random SubSpace (RSS) method was first proposed by Ho (1998). In the RSS method, the training dataset is rebuilt in a modified feature space, so that each base classifier is trained on a different subset of the features (Ho, 1998). The RSS algorithm can be expressed as follows. Input a training dataset D(xi, yi), where xi ∈ T is an m-dimensional vector xi = (xi1, xi2, ..., xim) and yi ∈ (landslide; non-landslide). First, m∗ features are randomly selected from the training dataset, where m∗ < m, which generates an m∗-dimensional random subspace of the original m-dimensional feature space. Second, the modified training dataset D* is formed from the m∗-dimensional training vectors xi = (xi1, xi2, ..., xim*). Finally, the final classifier is built by combining the primary classifiers trained on such subspaces according to a voting scheme (Lai et al., 2006).
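The feature selection step of RSS can be sketched in Python as below. The function names, the seed, and the choice of m∗ = 8 out of the 16 causative factors are assumptions for illustration; one such subspace would be drawn for each base classifier.

```python
import random

def random_subspace(n_features, n_selected, rng):
    """Pick n_selected distinct feature indices out of n_features (m* < m)."""
    return sorted(rng.sample(range(n_features), n_selected))

def project(x, feature_idx):
    """Keep only the selected features of one m-dimensional sample."""
    return [x[i] for i in feature_idx]

rng = random.Random(1)
idx = random_subspace(n_features=16, n_selected=8, rng=rng)
sample = list(range(16))  # placeholder feature vector (feature i has value i)
print(idx, project(sample, idx))
```

Because the placeholder vector stores the value i in position i, the projected sample equals the list of selected indices, which makes the selection easy to inspect.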
3.5 Model validation and comparison
The prediction models must be validated to assess their applicability. We used a variety of statistical metrics to assess the performance of the five proposed ensemble models: Positive Predictive Value (PPV), Negative Predictive Value (NPV), Sensitivity, Specificity, Accuracy (ACC), Kappa, F-measure, Jaccard, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), the Receiver Operating Characteristic (ROC) curve, and the Area Under the ROC Curve (AUC). Sensitivity and Specificity denote the true positive rate for landslide locations and the true negative rate for non-landslide locations, respectively. Accuracy is the proportion of landslide and non-landslide pixels that are correctly classified. F-measure is the weighted harmonic mean of precision and recall in binary classification. Jaccard is the Jaccard similarity coefficient. The ACC, Kappa, MSE, and RMSE statistical indicators are determined as in the following equations:
\(\mathrm{ACC}=\frac{\mathrm{TN}+\mathrm{TP}}{\mathrm{TN}+\mathrm{FN}+\mathrm{TP}+\mathrm{FP}}\) | (10) |
\(\mathrm{Kappa}=\frac{\mathrm{ACC}-\mathrm{ACC\_EXP}}{1-\mathrm{ACC\_EXP}}\) | (11) |
\(\mathrm{MSE}=\frac{1}{N}\sum_{i=1}^{N}\left(X_{\mathrm{Act},i}-X_{\mathrm{Pred},i}\right)^{2}\) | (12) |
\(\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_{\mathrm{Act},i}-X_{\mathrm{Pred},i}\right)^{2}}\) | (13) |
in which TP denotes true positive, TN denotes true negative, FP denotes false positive, and FN denotes false negative; ACC_EXP denotes the accuracy expected by chance; \(X_{\mathrm{Act},i}\) and \(X_{\mathrm{Pred},i}\) denote the actual and predicted values for the ith sample; and N denotes the number of samples.
RMSE represents the dispersion of the predicted values around the actual values. The lower the RMSE, the higher the accuracy of the prediction model. Meanwhile, the AUC is considered the main indicator for assessing the performance of the predictive models (Ye et al., 2016).
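The confusion-matrix metrics and the RMSE can be computed as in the minimal Python sketch below. Several choices are assumptions rather than statements of the study's exact formulas: ACC_EXP is taken as the chance agreement of Cohen's kappa, the F-measure is the balanced (equal-weight) harmonic mean, and Jaccard is computed for the landslide class. The counts and probabilities are invented for illustration, and no guard against empty classes is included.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Metrics derived from a 2 x 2 confusion matrix."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n                                                   # Eq. (10)
    # Assumption: chance agreement as in Cohen's kappa.
    acc_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return {
        "PPV": ppv,
        "NPV": tn / (tn + fn),
        "Sensitivity": sensitivity,
        "Specificity": tn / (tn + fp),
        "ACC": acc,
        "Kappa": (acc - acc_exp) / (1 - acc_exp),                         # Eq. (11)
        "F-measure": 2 * ppv * sensitivity / (ppv + sensitivity),         # balanced F (assumed)
        "Jaccard": tp / (tp + fp + fn),                                   # landslide class (assumed)
    }

def rmse(actual, predicted):
    """Root mean squared error, Eqs. (12) and (13)."""
    n = len(actual)
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    return math.sqrt(mse)

metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 3) for name, value in metrics.items()})
print(round(rmse([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.1]), 3))
```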