Research on Non-Landslide Selection Method for Landslide Hazard Mapping

Most landslide prediction models require non-landslide samples. At present, non-landslides are mainly selected by subjective inference or at random, which makes it easy to select non-landslides in high-risk areas. To solve this problem and improve the accuracy of landslide prediction, this study proposes selecting non-landslides by the information value (IV) method. Firstly, 230 historical landslides and 10 landslide conditioning factors are extracted and interpreted using remote sensing (RS) images, a geographic information system (GIS) and field survey. Secondly, the random, buffer, river channel or slope, and IV methods are used to obtain non-landslides, and the obtained non-landslides are applied to the widely used support vector machine (SVM) model for landslide hazard mapping (LHM) in the western area of Tumen City. The landslide hazard map based on the river channel or slope method is seriously inconsistent with the actual situation of the study area. Therefore, only the random, buffer, and IV methods are verified and compared by accuracy, the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The results show that the landslide prediction accuracy of all three methods exceeds 80%, with IV the highest. In addition, IV identifies very high hazard regions with a smaller area. Therefore, it is more reasonable to use IV to select non-landslides, and the IV method is more practical in landslide prevention and engineering construction. The results may provide basic landslide hazard information for decision makers and planners.


Introduction
Landslides are among the natural disasters that occur worldwide (Guzzetti 2015; Zhang et al. 2017) and have a significant impact on human life and property. Understanding landslide mechanisms and drawing landslide hazard maps can provide basic information for decision makers and planners dealing with landslide disasters (Dou et al. 2015; Nguyen et al. 2017). However, due to the complexity of landslide disasters, reliable spatial prediction of landslides remains a big challenge. Many scholars have put forward a variety of landslide evaluation methods from different angles, which can be divided into two types: physics-based and statistics-based methods (Bui et al. 2015). The support vector machine (SVM) has been applied in landslide prediction, and its prediction ability is reported to be better than that of traditional methods (Bui et al. 2018; Pham et al. 2016).
Most landslide prediction models need the same number of non-landslides as landslides. There are three main ways to choose non-landslides: (1) taking the edge of the known landslides as the boundary and buffering outwards to produce non-landslides (Peng et al. 2014; Su et al. 2017) (buffer method); (2) randomly selecting non-landslides from the study area outside the known landslides (Pham et al. 2016) (random method); (3) assuming that non-landslides lie in river channels or in regions with slope less than 2° (Taskin Kavzoglu et al. 2013) (river channel or slope method). These methods are strongly random and subjective, so the selected non-landslides are prone to fall in high hazard regions, which reduces the accuracy of landslide prediction. Therefore, to reduce the probability that selected non-landslides fall in high hazard regions and to improve the spatial prediction accuracy of landslides, this paper proposes the information value (IV) method for selecting non-landslides. First, combined with field survey, landslides and the conditioning factors are extracted and interpreted by remote sensing (RS) and a geographic information system (GIS). Second, the four methods (buffer, random, river channel or slope, and IV) are used to select non-landslides, which are then applied to SVM for landslide hazard mapping (LHM). Finally, the four non-landslide selection methods are verified and compared.

Study Area
The study area (550.71 km²) (Fig. 1) lies in the west of Tumen City, Jilin Province, China, in the low mountain and hilly area of Changbai Mountain. Neotectonic movement, long-term weathering and water erosion have formed a landform of crisscrossing valleys, well-developed rivers and rolling hills. Surrounded by mountains, the landform has three gradients: mountains, hills and basins. The land of Tumen City is roughly described as "eight mountains, half grass, half water and one field". The annual precipitation is 500-700 mm.

Landslides
The first step of LHM is to locate landslides that have occurred (Jiménez-Perálvarez et al. 2010).
Based on the project "Remote Sensing Survey of Basic Geology in Northeast of China", SPOT5 and 02C images, and field survey, a total of 108 landslides are identified. The spatial distribution of the landslides, analyzed with RS and GIS, shows that they are mainly distributed along roads (Figure 1).

Landslide conditioning factors
Landslides are affected and restricted by many factors. Starting from the characteristics and causes of landslide disasters in the study area, and based on RS interpretation and field investigation, the relationship between the geological environment and landslides in the study area is studied in depth. Ten factors are selected for LHM: geotechnical, distance to fault, altitude, slope, aspect, distance to road, normalized difference vegetation index (NDVI), land use, distance to river and rainfall.

Methodology
The methodology centers on the selection of non-landslides for the landslide prediction model. Firstly, the 108 historical landslides are represented as 230 landslide units, and the 230 landslide units are divided into training (80%) and validation (20%) datasets by five-fold cross-validation. Correlation analysis (CA) and the information gain ratio (IGR) are used to evaluate the conditioning factors. The buffer, random, river channel or slope, and IV methods are used to select non-landslides. Then, the non-landslides from the four selection methods are applied to SVM for LHM. Finally, the performance of the non-landslide selection methods is evaluated and compared by accuracy, ROC curve and AUC.

Training and validation datasets
SVM is a binary classifier, so 230 grid cells, equal in number to the landslide cells, are selected as non-landslides.
In landslide hazard modeling, the landslide and non-landslide grid cells must be divided into a training dataset to build the model and a validation dataset to test the model accuracy (Chung and Fabbri 2008). In this paper, five-fold cross-validation is adopted to improve the stability and accuracy of the model.
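The split described above can be illustrated with a minimal sketch. It assumes that landslide and non-landslide cells are shuffled and dealt into folds separately so that each fold stays balanced; the text does not state how the folds were built, and the function name `five_fold_splits` is hypothetical.

```python
import random

def five_fold_splits(n_landslide, n_non_landslide, k=5, seed=0):
    """Yield (train, validation) lists of (label, index) pairs for k folds.

    Landslide (label 1) and non-landslide (label 0) cells are shuffled and
    dealt into folds separately, so every fold keeps the class balance.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label, count in ((1, n_landslide), (0, n_non_landslide)):
        indices = list(range(count))
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append((label, index))
    for held_out in range(k):
        validation = folds[held_out]
        train = [cell for i, fold in enumerate(folds) if i != held_out for cell in fold]
        yield train, validation

# 230 landslide cells and 230 non-landslide cells, as in the text.
splits = list(five_fold_splits(230, 230))
```

With 230 cells per class, each of the five folds holds 92 validation cells (20%) and leaves 368 cells (80%) for training.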

Evaluation of landslide conditioning factors
Feature selection is very important in landslide spatial prediction. Too many features may lead to the curse of dimensionality, which makes prediction accuracy worse. In this study, correlation analysis (CA) and the information gain ratio (IGR) are used to select features, remove irrelevant and redundant ones, and obtain high-quality conditioning factors.
CA measures how closely two or more factors are related. In this study, the 10 conditioning factors are analyzed with SPSS.
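As an illustration of what such an analysis computes, the sketch below evaluates a correlation coefficient between two factors. It assumes the Pearson coefficient; the text does not say which coefficient was used in SPSS, and the sample values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)

# Hypothetical values of two factors sampled at the same five cells.
altitude = [120.0, 180.0, 240.0, 310.0, 400.0]
ndvi = [0.21, 0.35, 0.30, 0.52, 0.61]
r = pearson(altitude, ndvi)
```

A coefficient near 0 indicates little linear relationship, while values near 1 or -1 indicate factors that carry largely redundant information.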
Information gain (IG) quantifies the importance of a factor by the reduction in entropy it produces, and can therefore measure the predictive ability of conditioning factors (Bui et al. 2015). However, IG has an inherent bias, which may reduce the prediction ability of the model (Wang and Borgelt 2004). The IGR was proposed to overcome this problem. The higher the IGR of a conditioning factor, the greater its predictive ability.
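A minimal sketch of the IGR for one classified factor follows. It assumes the usual definition (information gain divided by the split information of the factor) with base-2 logarithms; the text does not give the formula, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def information_gain_ratio(factor_classes, labels):
    """IGR of a classified factor with respect to landslide labels (1 or 0)."""
    total = len(labels)
    groups = {}
    for cls, label in zip(factor_classes, labels):
        groups.setdefault(cls, []).append(label)
    # Entropy of the labels that remains after splitting by factor class.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(factor_classes)
    return gain / split_info if split_info > 0 else 0.0

# Toy example: the first factor separates the labels perfectly, the second not at all.
labels = [1, 1, 0, 0]
igr_informative = information_gain_ratio(["near", "near", "far", "far"], labels)
igr_uninformative = information_gain_ratio(["near", "far", "near", "far"], labels)
```

Dividing by the split information penalizes factors with many classes, which is the bias of plain IG that the ratio is meant to correct.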
(2) Random method: Excluding the known landslide cells of the study area, 230 grid cells are randomly selected as non-landslides.
(3) River channel or slope method: The slope is calculated from DEM data with 10 m resolution. Taking the river channel into account, 230 grid cells are randomly selected as non-landslides in areas with slope less than 2°.
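The random method and the slope criterion can be sketched as follows. This is a simplified illustration, not the authors' workflow: the grid layout (`cell_slopes` as a mapping from cell id to slope in degrees) and the function name are assumptions, the slope values are toy data, and the river channel criterion is not modeled.

```python
import random

def sample_non_landslides(cell_slopes, landslide_ids, n=230, max_slope=None, seed=0):
    """Randomly pick n non-landslide cells, optionally only where slope < max_slope."""
    candidates = [
        cell_id for cell_id, slope in cell_slopes.items()
        if cell_id not in landslide_ids and (max_slope is None or slope < max_slope)
    ]
    if len(candidates) < n:
        raise ValueError("fewer candidate cells than requested samples")
    return random.Random(seed).sample(candidates, n)

# Toy grid: 4000 cells with slopes between 0 and 19.5 degrees (hypothetical).
cell_slopes = {i: (i % 40) * 0.5 for i in range(4000)}
landslide_ids = set(range(0, 4000, 20))

random_cells = sample_non_landslides(cell_slopes, landslide_ids)
low_slope_cells = sample_non_landslides(cell_slopes, landslide_ids, max_slope=2.0)
```

Both variants exclude known landslide cells; they differ only in the pool from which the 230 cells are drawn.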

Information value
IV is a statistical method based on information theory, which has been widely used in the evaluation of geological disasters such as landslides. Its principle is to analyze the landslide hazard under the combined influence of the conditioning factors according to information entropy. The model uses information to describe the quantity and quality of the conditioning factors, so as to determine the probability of landslide disaster (Clerici et al. 2002). The total information of each study cell is the sum of the information of all conditioning factors in the cell.
I represents the total information value of each study cell, and n represents the number of conditioning factors. S_i and S represent the area of factor x_i and of the study area, respectively; N_i and N represent the number of landslides in factor x_i and in the study area, respectively.
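The formula that these symbols belong to is not reproduced in this text. A commonly used form of the information value that is consistent with the symbols defined above is given below as an assumption, not as the authors' exact expression; in particular, the use of the natural logarithm is assumed.

```latex
I = \sum_{i=1}^{n} I(x_i) = \sum_{i=1}^{n} \ln \frac{N_i / N}{S_i / S}
```

Under this form, a factor class with a larger share of landslides than of area (N_i/N > S_i/S) contributes a positive value, and a class with a smaller share contributes a negative value.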
The landslide hazard index (LHI) is calculated according to formula 1 and partitioned into five levels by the natural breaks method. 230 cells are then randomly sampled from the very low hazard regions as non-landslide cells.
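The selection step can be sketched as below, under several stated assumptions: the information value of a class is taken as ln((N_i/N)/(S_i/S)); the natural breaks classification is not implemented, and the lowest 20% of cells ranked by LHI stand in for the "very low" level; all names and numbers are hypothetical toy data.

```python
import math
import random

def information_values(class_area, class_landslides):
    """IV of each class of one factor: ln((N_i / N) / (S_i / S))."""
    total_area = sum(class_area.values())
    total_landslides = sum(class_landslides.values())
    values = {}
    for cls, area in class_area.items():
        count = class_landslides.get(cls, 0)
        if count == 0:
            # ln(0) is undefined; the text gives no convention for this case.
            raise ValueError(f"class {cls!r} contains no landslides")
        values[cls] = math.log((count / total_landslides) / (area / total_area))
    return values

def hazard_index(cell_classes, iv_tables):
    """Sum the IV of the class a cell falls in, over all factors."""
    return sum(iv_tables[factor][cls] for factor, cls in cell_classes.items())

def sample_lowest(lhi_by_cell, n, fraction=0.2, seed=0):
    """Randomly pick n cells from the lowest `fraction` of cells ranked by LHI."""
    ranked = sorted(lhi_by_cell, key=lhi_by_cell.get)
    pool = ranked[: int(len(ranked) * fraction)]
    if len(pool) < n:
        raise ValueError("lowest-LHI pool is smaller than the requested sample")
    return random.Random(seed).sample(pool, n)

# Toy data: areas in cells and landslide counts per class (hypothetical).
iv_tables = {
    "slope": information_values({"gentle": 80, "steep": 20}, {"gentle": 2, "steep": 8}),
    "road": information_values({"far": 90, "near": 10}, {"far": 4, "near": 6}),
}
cells = {
    i: {"slope": "steep" if i % 5 == 0 else "gentle",
        "road": "near" if i % 10 == 0 else "far"}
    for i in range(100)
}
lhi = {i: hazard_index(classes, iv_tables) for i, classes in cells.items()}
non_landslides = sample_lowest(lhi, n=10)
```

Because the sampling pool is restricted to cells with the lowest index, the selected non-landslides cannot fall in cells that the information value itself rates as hazardous, which is the motivation for the method.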

Support vector machine
SVM is a supervised learning method for binary classification. It finds the optimal separating hyperplane automatically through learning (Xu et al. 2012). The performance of SVM is affected by the kernel function. In this paper, the commonly used radial basis function (RBF) kernel is selected to predict landslides. The parameters C and γ of the RBF kernel SVM are 0.8 and 0.95, respectively (Bui et al. 2015; Zhuang and Dai 2006).
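For reference, the RBF kernel can be written as a short function. The sketch assumes the common parameterization K(x, y) = exp(-γ‖x − y‖²); the text does not state which convention for γ was used, and the feature vectors are hypothetical.

```python
import math

GAMMA = 0.95  # kernel parameter reported in the text
C = 0.8       # penalty parameter reported in the text (used in training, not in the kernel)

def rbf_kernel(x, y, gamma=GAMMA):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two hypothetical, already normalized feature vectors of conditioning factors.
similarity = rbf_kernel([0.2, 0.7, 0.1], [0.3, 0.5, 0.1])
```

The kernel equals 1 for identical cells and decays toward 0 as cells become less alike in factor space; a larger γ makes this decay faster.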

Accuracy evaluation methods
The performance and precision of the model can be tested with the known landslides (Kayastha et al. 2013) and measured by accuracy, the ROC curve and the AUC.
Accuracy is the proportion of correctly predicted cells in the total number of cells, and measures the landslide prediction ability of the model. The ROC curve is a common method for checking the spatial validity of LHM (Bui et al. 2015). The closer the ROC curve is to the top left corner, the higher the model accuracy. AUC is a common index for testing the overall performance of a landslide model. The model predicts well if the AUC value is close to 1 (Bui et al. 2011). Generally, a model is considered to have high accuracy when its AUC value is greater than 0.7 (Luo et al. 2019).
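Both measures can be computed directly from labels and model outputs, as in the sketch below. The AUC is computed here through its rank interpretation (the probability that a randomly chosen landslide cell scores higher than a randomly chosen non-landslide cell), which equals the area under the ROC curve; the labels and scores are hypothetical.

```python
def accuracy(labels, predictions):
    """Share of cells whose predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def auc(labels, scores):
    """AUC as the share of (landslide, non-landslide) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = landslide) and model scores for four cells.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(labels, predictions)
area = auc(labels, scores)
```

Accuracy depends on the chosen classification threshold (0.5 here, an assumption), whereas the AUC summarizes performance over all thresholds.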

Importance of conditioning factors
The correlation analysis of the 10 conditioning factors shows that the correlation coefficients of altitude with NDVI, distance to river, distance to road, slope and rainfall are 0.481, 0.447, 0.432, 0.306 and 0.303, respectively. Therefore, after comprehensive consideration, altitude is removed and the remaining 9 conditioning factors are retained. Figure 2 shows the IGR of each conditioning factor with respect to landslide occurrence. Factors with a higher IGR value are more important to landslide occurrence. The IGR value of distance to road is the highest (0.163).
Rainfall has the lowest IGR value (0.016) and thus contributes least to landslide prediction. The IGR values of all 9 factors are greater than 0, so all 9 factors contribute to landslide occurrence. Finally, the 9 factors are selected as landslide conditioning factors (Figure 3).

Landslide hazard map
The four non-landslide selection methods (random, buffer, IV, and river channel or slope) are applied to SVM to draw landslide hazard maps; the results are shown in Figure 4. The map obtained with the river channel or slope method is seriously inconsistent with the actual situation of the study area, so this method is not considered further. The maps obtained with the other three methods are basically consistent with the actual situation of the study area, and all three predict that the very high hazard regions are mainly distributed along roads.
To quantitatively analyze and compare the prediction results of the three methods, the hazard area percentage and landslide distribution are counted (Figure 5). In the very low hazard regions, the landslide number percentages are 2.61%, 2.17% and 2.17% for random, IV and buffer, respectively, while the area percentages are 60.58%, 62.27% and 60.14%. Thus the very low hazard regions cover more than 50% of the area for all three methods but contain fewer than 3% of the landslides. In the very high hazard regions, the landslide number percentages are 85.65%, 89.57% and 87.39% for random, IV and buffer, respectively, while the area percentages are 6.95%, 6.09% and 6.16%. The very high hazard regions are therefore very small, yet most landslides are concentrated in them.
Table 1 shows the accuracy of the three methods. The accuracy of IV is the highest in both training (91.30%) and validation (85.87%). In validation, the accuracy of random and buffer is equal (84.78%), while in training the accuracy of random is the lowest (89.13%). Therefore, IV has the highest accuracy of the three methods. Figure 6 shows that the ROC curve of IV is closest to the upper left corner in both training and validation.
The AUC of IV is the highest: 0.933 in training and 0.876 in validation. In training, the AUC of buffer is the lowest (0.918), while in validation the AUC of random is the lowest (0.874). IV thus performs better than the other two methods and has a higher spatial prediction ability for landslides.
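The percentages reported above for the very high hazard regions can be condensed into one number per method: the share of landslides divided by the share of area. This ratio is not reported in the study; it is introduced here only as an illustrative way to read the figures, using the values stated in the text.

```python
# (landslide number percentage, area percentage) in the very high hazard regions,
# as reported in the text.
very_high = {
    "random": (85.65, 6.95),
    "IV": (89.57, 6.09),
    "buffer": (87.39, 6.16),
}

# Landslide share per unit of area share; higher means landslides are more
# concentrated in a smaller very high hazard region.
density_ratio = {
    method: landslide_pct / area_pct
    for method, (landslide_pct, area_pct) in very_high.items()
}
ranking = sorted(density_ratio, key=density_ratio.get, reverse=True)
```

On this reading, IV concentrates the largest share of landslides in the smallest very high hazard area, followed by buffer and then random.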

Discussion
The main purpose of this paper is to discuss the practicability of using the IV model to select non-landslides, so as to reduce the probability that non-landslides fall in high-risk areas and to improve the spatial prediction accuracy of landslides. The random, buffer, and river channel or slope methods and the IV method proposed in this paper are compared using SVM.
The selection of landslide conditioning factors affects landslide spatial prediction. Before analyzing the landslide hazard, the correlation between conditioning factors and their relative importance for landslides are evaluated by CA and IGR. The results of CA show that altitude is correlated with several other conditioning factors, so altitude is removed. According to the IGR results, distance to road has the highest IGR value and is the most important factor for landslide occurrence. The main reason is that road construction changes the original natural slope, which may lead to landslides.
At present, non-landslides in most models are selected by subjective inference or at random, so the selected non-landslides are prone to fall in high-risk areas. Therefore, this study proposes selecting non-landslides by IV. IV can achieve good results in the hazard assessment of geological disasters, with simple theory, high objectivity and strong practicability (Du et al. 2018). The four non-landslide selection methods are applied to SVM to draw landslide hazard maps. The results show that the map drawn with the river channel or slope method is seriously inconsistent with the actual situation of the study area. The maps of the other three methods show that the very high hazard regions are mainly distributed along roads and occupy a small area, which is consistent with the actual situation of the study area. The LHM is analyzed by the following rules: (1) the investigated landslides should be mainly distributed in the very high hazard regions; (2) the very high hazard regions should account for only a small proportion of the study area (Gokceoglu et al. 2005; Su et al. 2017). By these rules, IV is more practical than the other three methods.
The random, buffer, and IV methods are compared by accuracy, ROC curve and AUC. The IV method has the highest accuracy of the three. The ROC curve of IV is the closest to the upper left corner, and its AUC value is the highest. Therefore, the IV method can reduce the probability that selected non-landslides fall in high-risk areas and improve the accuracy of landslide spatial prediction.

Conclusions
To reduce the probability that selected non-landslides fall in high hazard regions and to improve the accuracy of landslide prediction, a method of selecting non-landslides by IV is proposed, and the performance of the four non-landslide selection methods is evaluated and compared.
The following conclusions can be drawn: (1) compared with the three existing main non-landslide selection methods, the method proposed in this paper has the highest accuracy and reduces the probability that selected non-landslides fall in high hazard regions, which is beneficial to improving the prediction accuracy of landslides; (2) the river channel or slope method is not suitable for LHM in the study area; (3) compared with the other three methods, the IV method better confines the very high hazard regions to a smaller area, which will greatly reduce the cost of prevention in specific engineering practice and is more practical for landslide prevention. The results can provide new ideas and a scientific basis for LHM, and have reference value for landslide management and land use planning.