Application of an integrated model based on bivariate and multivariate method in landslide susceptibility mapping

Abstract: Landslides often cause human losses and economic damage in mountainous areas, especially in the Himalayas. Landslide susceptibility mapping (LSM) is a key approach for avoiding hazard and risk. This study explores an improved model that combines multivariate and bivariate statistical methods for LSM. Four models were established: logistic regression (LR), LR integrated with the certainty factor (CF), LR integrated with the frequency ratio (FR) and LR integrated with the information value method (IV), and their performance in LSM was compared. First, a landslide inventory map with 313 identified landslide events was prepared and 12 predisposing factors were selected. Second, the dataset was randomly divided into two parts: 75% for modeling and 25% for validation. Finally, the area under the curve (AUC) and statistical metrics were applied to validate and compare the performance of the models. The results show that the IVLR model performs best (AUC = 0.792 and accuracy = 78.8%). [...] susceptible areas. It identified the major factors and intervals of high susceptibility: areas with profile curvature greater than 0.1, less than 2 km from a stream, maximum elevation difference greater than 1200 m and rainfall between 440 and 450 mm were prone to landslides. The conclusion reveals that the quality of LSM can be improved by comparing and combining bivariate and multivariate methods, which can serve as a more effective guide for land use planning in the study area or other highlands where landslides are frequent.


Landslides are sudden geological phenomena widely distributed across the world, causing direct or indirect damage to property and injuries or fatalities among people residing in the affected area 1,2,3. The frequency and scale of landslides in China far exceed those of other countries 4,5. Prevention measures require identifying existing landslides for spatial zonation 6. In general, damage can be limited by predicting where disasters may occur in the future 7. Therefore, landslide susceptibility mapping (LSM) is considered an effective approach to avoid hazards and risk.

The approaches for landslide susceptibility modeling can be broadly classified as qualitative (knowledge-driven methods or physically based methods) and quantitative (data-driven methods) [...]

$$p = \frac{1}{1 + e^{-y}} \quad (1)$$

where p represents the probability of an event, ranging from 0 to 1, and y represents a linear fitting function as shown below:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \quad (2)$$

where b0 is the intercept of the model, b1, b2, ..., bn are the partial regression coefficients and x1, x2, ..., xn are the variables.

LR was modeled in SPSS software, and the forward stepwise method was applied to exclude the non-significant variables. The values of the 12 evaluation factors of all units were extracted as independent variables, while the dependent variable represents the occurrence of a landslide event, i.e. 1 represents occurrence and 0 represents non-occurrence. The significance values of all variables retained in the last step of the analysis were less than 0.05 and no further variables were added.
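As an illustration of how p follows from the linear fitting function y, a minimal Python sketch is given here. It assumes the standard logistic link p = 1/(1 + e^(-y)); the function names and the coefficient values in the usage note are hypothetical and are not the SPSS output of this study.

```python
import math


def linear_predictor(b0, coefs, xs):
    """Return y = b0 + b1*x1 + ... + bn*xn."""
    if len(coefs) != len(xs):
        raise ValueError("coefs and xs must have the same length")
    return b0 + sum(b * x for b, x in zip(coefs, xs))


def landslide_probability(b0, coefs, xs):
    """Map the linear predictor y to a probability p in (0, 1).

    Assumes the standard logistic link. The two branches are
    algebraically identical and only avoid overflow for large |y|.
    """
    y = linear_predictor(b0, coefs, xs)
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)
```

For example, with a hypothetical intercept of -1.0 and a single coefficient of 2.0, a unit whose standardized factor value is 0.5 has y = 0 and therefore p = 0.5.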

CF method
CF is a bivariate statistical method that is commonly used to analyze the probabilistic relationship between the dependent and independent variables. Accordingly, the classification of [...]

where PPa represents the ratio of the landslide area of the a-th conditioning factor in a specific interval to the area of the a-th factor in that interval; PPs represents the ratio of the total number (or area) of landslides to the total study area.

The value of CF ranges from -1 to 1 according to equation 3. A positive CF value indicates that the occurrence of landslides is highly certain and that the geological environmental conditions are prone to geological disasters; the higher the value, the higher the certainty. Conversely, negative values represent low certainty of landslide occurrence.
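The behaviour of CF over its range can be checked with a short Python sketch. It assumes the piecewise form of the certainty factor that is commonly given in the landslide literature, which may differ in notation from equation 3 of this study; the function name and the input values in the usage note are hypothetical.

```python
def certainty_factor(pp_a, pp_s):
    """Certainty factor of one factor class.

    pp_a: landslide density in the class (landslide area / class area).
    pp_s: landslide density of the whole study area.
    Assumed piecewise form (commonly used, not confirmed as equation 3):
        CF = (pp_a - pp_s) / (pp_a * (1 - pp_s))   if pp_a >= pp_s
        CF = (pp_a - pp_s) / (pp_s * (1 - pp_a))   if pp_a <  pp_s
    """
    if not 0.0 <= pp_a <= 1.0:
        raise ValueError("pp_a must lie in [0, 1]")
    if not 0.0 < pp_s < 1.0:
        raise ValueError("pp_s must lie strictly between 0 and 1")
    if pp_a >= pp_s:
        # pp_a > 0 here because pp_s > 0, so the denominator is positive
        return (pp_a - pp_s) / (pp_a * (1.0 - pp_s))
    # pp_a < pp_s < 1, so (1 - pp_a) > 0 and the denominator is positive
    return (pp_a - pp_s) / (pp_s * (1.0 - pp_a))
```

Under this assumed form, a class with no landslides gives CF = -1, a class whose density equals the regional density gives CF = 0, and a class twice as dense as a regional density of 0.1 gives a positive CF of about 0.56.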

The CFLR model was established in SPSS, taking the CF value of the units as the independent variable and the occurrence of landslide as the dependent variable.

[...]

where i = 1, 2, 3, ..., n; j = 1, 2, 3, ..., m; ni-j represents the landslide area of the i-th conditioning factor in the j-th interval; si-j represents the area of the i-th conditioning factor in the j-th interval; n represents the total landslide area and s represents the total area.
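The quantities defined above can be combined as in the following Python sketch. It assumes the common form of the information value, IV = ln((ni-j / si-j) / (n / s)), with the natural logarithm; both the form and the logarithm base are assumptions, since other bases appear in the literature. The function name is hypothetical.

```python
import math


def information_value(n_ij, s_ij, n, s):
    """Information value of the j-th interval of the i-th factor.

    n_ij: landslide area inside the interval.
    s_ij: total area of the interval.
    n:    total landslide area in the study area.
    s:    total study area.
    Assumed form: IV = ln((n_ij / s_ij) / (n / s)).
    """
    if s_ij <= 0 or n <= 0 or s <= 0:
        raise ValueError("s_ij, n and s must be positive")
    if n_ij < 0:
        raise ValueError("n_ij must be non-negative")
    if n_ij == 0:
        # ln(0) is undefined; returning -inf is a choice made for this
        # sketch, and a finite floor value may be preferred in practice.
        return float("-inf")
    return math.log((n_ij / s_ij) / (n / s))
```

An interval whose landslide density is twice the regional density has IV = ln 2, which is about 0.69, while an interval with exactly the regional density has IV = 0.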

The values from the IV method can be either positive or negative. A positive value indicates that the factor is conducive to the occurrence of landslides in a specific interval: the greater the IV, the higher the possibility of landslide, and vice versa.

Similarly, the IVLR model was established in SPSS, taking the IV value of the units as the independent variable and the occurrence of landslide as the dependent variable.

[...]

If FR > 1, a strong correlation exists between landslide occurrence and the factor class, and if FR < 1, the correlation is weak. The IV, FR and CF methods produced the corresponding indexes for each class of the 12 control factors, as shown in [...]
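The FR threshold of 1 can be illustrated with a small Python sketch. It assumes the usual definition of the frequency ratio, namely the share of landslide area falling in a class divided by the share of the study area occupied by that class. The function name and the four-class table in the usage note are hypothetical and are not data from this study.

```python
def frequency_ratios(landslide_areas, class_areas):
    """Frequency ratio of every class of one conditioning factor.

    landslide_areas[k]: landslide area inside class k.
    class_areas[k]:     total area of class k.
    Assumed form: FR_k = (landslide share of class k) / (area share of class k).
    """
    if len(landslide_areas) != len(class_areas):
        raise ValueError("both lists must have one entry per class")
    total_landslide = sum(landslide_areas)
    total_area = sum(class_areas)
    if total_landslide <= 0 or total_area <= 0:
        raise ValueError("totals must be positive")
    ratios = []
    for n_k, s_k in zip(landslide_areas, class_areas):
        if s_k <= 0:
            raise ValueError("every class area must be positive")
        ratios.append((n_k / total_landslide) / (s_k / total_area))
    return ratios
```

For a hypothetical factor with class areas of 400, 300, 200 and 100 units and landslide areas of 4, 6, 20 and 20 units, the sketch yields FR values of 0.2, 0.4, 2.0 and 4.0, so only the last two classes exceed the threshold of 1. Under these assumed definitions FR is the ratio of class density to regional density, which means that an information value defined as the logarithm of that ratio changes sign exactly where FR crosses 1.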

Performance and comparison of different models
The Z-score normalization method was applied to standardize the data and to eliminate the impact of different dimensions (units) before modeling. In addition, a correlation analysis was conducted to test collinearity among the independent variables. The variance inflation factor (VIF) is a commonly applied index 41. A VIF greater than 5 or 10 indicates severe collinearity between the selected variables.

Table 2 shows the VIF values of the chosen independent variables and indicates that no multicollinearity exists among them. SPSS also provides test indexes that reflect the overall goodness of fit of the model: -2LL, CSR² and NR². The Cox and Snell R square values and the Nagelkerke R square values indicated that the independent variables can explain the dependent variable, with values of 57.7% and 50.3%, 63.3% and 74.4%, 66.2% and 68.2%, and 67.2% and 69.5% for the LR, CFLR, FRLR and IVLR models respectively (Table 4).

The IVLR model achieves the highest sensitivity (81.6%), followed by the FRLR model (sensitivity = 80.9%), the CFLR model (sensitivity = 76.8%) and the LR model (sensitivity = 74.2%), as shown in [...]

The performance of the models declined in verification, especially for the CFLR model, which indicated that the model was over-fitting and that its generalization ability was doubtful. The hybrid models were better than the single model in terms of prediction capacity. However, there was a certain gap between the three hybrid models. The improvement of the CFLR model over the LR model was not obvious. The performance of the FRLR and IVLR models was close and better than that of the CFLR model.
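The standardization and collinearity check can be sketched in plain Python as follows. The sketch relies on the identity that the VIF of the j-th variable equals the j-th diagonal element of the inverse of the correlation matrix. It is not the SPSS procedure used in this study, and all function names and input values are hypothetical.

```python
import math


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) std."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    if any(sd == 0 for sd in stds):
        raise ValueError("a constant column cannot be standardized")
    return [[(v - m) / sd for v, m, sd in zip(row, means, stds)]
            for row in rows]


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns."""
    z = zscore_columns(rows)
    n, k = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(k)]
            for i in range(k)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect collinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(rows):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(rows))
    return [inverse[j][j] for j in range(len(inverse))]
```

Two uncorrelated columns give VIF values of 1, whereas two nearly identical columns give values far above the thresholds of 5 or 10 mentioned in the text.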

The IV method was applied to determine the relationship between the influencing factors and the occurrence of landslides, and the results are shown in Table 1. As for rainfall, the percentages of landslide area for 440~450 mm and >450 mm were 52.05% and 38.81% respectively, which means that about 90% of the landslide area was distributed among the two classes of [...]

The integrated models finally selected the factors that have a significant influence on model fitting: rainfall, elevation, maximum elevation difference, profile curvature, distance to fault and distance to stream (Table 3) [...]

Three integrated models were established in this study to explore the relative importance of conditioning factors, the results of which were obviously different. On the other hand, aspect was not involved in the integrated models but appeared in the LR model (Table 3).

The aforementioned analyses indicated that the IVLR model shows prominent fitting and generalization capability in predicting landslide susceptibility compared with the other three models presented in this study. Therefore, it was determined to be the most suitable model and applied to calculate the landslide susceptibility index for the whole study area.

The probability P of landslide occurrence in the whole study area was determined based on the four models (LR, CFLR, FRLR and IVLR). The equal spacing principle was used to reclassify the landslide susceptibility index into five levels: very low (0~0.2), low (0.2~0.4), moderate (0.4~0.6), high (0.6~0.8) and very high (0.8~1).

Fig. 7 shows the distribution of landslide susceptibility classes, and the area percentage of each class of each map is summarized in Fig. 8. For the LR model, the very low, low, moderate, high and very high susceptibility classes occupied 27.32%, 20.85%, 12.55%, 21.74% and 17.53% of the study area respectively (Fig. 8) [...]

A sound landslide susceptibility map should meet two rules: (1) the identified landslide locations should appear in the high or very high susceptibility class areas as much as possible and (2) the very high susceptibility class area should occupy only a small proportion (Bui et al., 2012). The landslide samples were mainly located in the dark (purple or red) areas and the non-landslide points in the light (green or yellow) areas for the IVLR model. In addition, its LSM had the smallest percentage of the very high susceptibility class compared with the others. The very high susceptibility areas are mainly distributed around the Yarlung Zangbo river and its tributaries in the study area. The river network shapes the geomorphology, and scouring erodes slopes to a great extent 42. The areas near streams are densely populated with human activities, and the occurrence of landslides threatens lives and property.

The performance of FRLR was also excellent in terms of prediction capability. However, its percentage of moderate susceptibility area was the largest among the models, at 21.69%. The units predicted as the moderate class are difficult to interpret. In addition, its combined percentage of low and very low susceptibility areas was the smallest, at 37.87%, which was contrary to previous research. Therefore, the LSM constructed by the IVLR model was more analytical and acceptable.
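The equal-spacing reclassification and the per-class area percentages can be sketched in Python as follows. The treatment of values that fall exactly on a class boundary is an assumption made here (a boundary value is assigned to the higher class), since the text does not specify it. The function names and input values are hypothetical, and the percentage calculation assumes that all mapping units have equal area.

```python
import bisect

CLASS_NAMES = ["very low", "low", "moderate", "high", "very high"]
BOUNDARIES = [0.2, 0.4, 0.6, 0.8]  # equal spacing over [0, 1]


def susceptibility_class(p):
    """Assign a susceptibility index p in [0, 1] to one of five classes.

    Assumption: a value exactly on a boundary goes to the higher class,
    and p = 1.0 belongs to "very high".
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return CLASS_NAMES[bisect.bisect_right(BOUNDARIES, p)]


def class_percentages(probabilities):
    """Percentage of units in each class, assuming equal-area units."""
    if not probabilities:
        raise ValueError("at least one unit is required")
    counts = {name: 0 for name in CLASS_NAMES}
    for p in probabilities:
        counts[susceptibility_class(p)] += 1
    total = len(probabilities)
    return {name: 100.0 * c / total for name, c in counts.items()}
```

Applied to the predicted probabilities of all units, `class_percentages` gives the kind of per-class summary reported for each model, which can then be checked against the two rules for a sound susceptibility map.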

Ensemble algorithms such as bagging, stacking and boosting have been applied in LSM, and accuracy exceeded 85% or 90% in previous studies 43,44. New machine learning and deep learning methods emphasize optimization, and the multiple parameters involved need to be tuned before application, which is difficult to implement, especially for non-professionals 45. Traditional statistical methods establish mathematical equations to explore the relationship between landslide-related factors and landslide occurrence, which are more acceptable. In this study, the IVLR model also performed well, with satisfactory prediction capability (AUC = 0.792 and accuracy = 78.8%). The three integrated models performed better than the plain LR model in terms of accuracy, which indicated that the combination was effective.

Previous researchers have applied bivariate and multivariate statistical methods and compared their performance in LSM 46,47. Although CF, FR and IV are similar in both principles and results, their performance varies when combined with the LR model. Each method has its own strengths and weaknesses, and its performance generally varies with the study area 9. Some studies indicated that bivariate methods perform better than multivariate methods, while others support the multivariate methods 17,48. However, it is believed that integrated models give more accurate results than an individual classifier, which has only slightly better generalization ability than random guessing 13. Therefore, it is recommended to compare various models and select the most suitable one on the basis of robustness and reasonability.

Accuracy is a major consideration for LSM but it should not be the only focus. Identifying the major conditioning factors responsible for landslide occurrence is also important, as it helps in further engineering guidance. The determination of subjective and objective weights helps to distinguish the contribution of these factors, and the analytic hierarchy process (AHP) and factor analysis (FA) are two commonly used methods without prior conditions 49,50 [...] no special requirement for data distribution, the LR model needs to convert nominal variables into dummy variables, which makes the regression model more complex. Therefore, the bivariate and multivariate methods are complementary to some extent, and it is worth combining them for a more reasonable and comprehensive analysis that allows the major factors to be analyzed in detail.

In the current study, four models based on bivariate and multivariate methods (LR, CFLR, FRLR and IVLR) were explored and their performance in LSM was compared in Luoza county. The following conclusions can be drawn:

The IVLR model performed best in terms of accuracy, and the landslide susceptibility map constructed by the IVLR model was reasonable and analytical. It indicated that landslides are more likely to occur in areas with profile curvature greater than 0.1, within 2 km of a stream, maximum elevation difference greater than 1200 m and rainfall between 440 and 450 mm. The combination of bivariate and multivariate methods not only improves the prediction accuracy but also allows the major conditioning factors to be analyzed in detail. It is desirable to advance the application by combining multiple methods, considering that some methods are complementary in some ways. The conclusion of the current study is helpful for landslide risk mitigation in [...]

The authors declare no competing interests.