A comparative study of statistical methods and machine learning algorithms for prediction of landslides in Mizoram state of India through analysis of causative factors using Geo-informatics

doi:10.21203/rs.3.rs-4196847/v1

Abstract

Landslides are among the most severe and significant natural hazards in the study area, Mizoram, where rolling hills and deep valleys occur in almost every landform. A comparative study of landslide hazards in the area was conducted using statistical techniques and machine learning algorithms. The statistical methods are Frequency Ratio (FR), Analytic Hierarchy Process (AHP), Shannon’s Entropy (SE), and Weight of Evidence (WOE). The machine learning algorithms comprise basic classifiers, namely Gradient Boosting Decision Tree (GBDT), Random Forest (RF), and Extreme Gradient Boosting (XGB), and hybrid classifiers that stack each of them with Logistic Regression (LR): GBDT + LR, RF + LR, and XGB + LR. The study aims to test the collinearity of the landslide-inducing factors and to analyse their weights, from the most to the least contributing factor. It also aims to develop Landslide Hazard Zonation (LHZ) maps from the parameter weights by layer stacking with a weighted sum overlay in a GIS software environment. The generated LHZ maps were divided into five classes: low, moderate, high, very high, and severe. For the statistical methods, the zonation maps were validated against past landslide inventories by counting the past landslide points that fall in each class of each map. More than 65 per cent of landslide points fall in the High to Severe zones for FR, AHP, and SE, which was considered a positive validation, whereas only 60 per cent fall in the High to Severe zones for WOE, which was considered inadequate for an applicable LHZ map. For the machine learning algorithms, a buffer zone of 50 m radius was created around each landslide point so that the seed cell technique could be applied in preparing the landslide inventory. More than 10,000 landslide seed cells and non-landslide cells were sampled and split 80%/20% into training and test sets. Accuracy, precision, recall, f-measure, Area Under the receiver operating characteristic Curve (AUC), kappa index, mean absolute error (MAE), and root mean square error (RMSE) were used to evaluate the accuracy and performance of the machine learning models. Based on AUC, the XGB model, with the highest value (0.9039), was identified as the most efficient of the machine learning models. The machine learning models improved accuracy by more than 15% compared with the statistical approach. The results suggest that machine learning is well suited to landslide estimation in the study area.

Keywords: Landslide Hazard Map, Statistical Analysis, Landslide Parameters, Stacking Ensemble Machine Learning Algorithms, Zonation, Validation, GIS, Mizoram

1. Introduction

Landslides triggered by rainfall pose a threat to human lives and property in Mizoram. As the state’s population grows and more infrastructure is developed, landslides have become a major concern for the safety of its citizens. The need for new residential areas and new engineering structures has increased rapidly because of population growth (Yalcin and Bulut, 2007). Unlike flooding, which causes damage to structures that can more often be repaired, landslides may leave irreparable damage (Alejandrino et al., 2016). Every year, many lives and properties are lost to landslides. Landslides also disrupt the economic growth of the state, as roads and bridges are blocked, especially during the triggering period. More than a thousand past landslides were recorded in the study area during 2011–2019, mainly during the monsoon season. Anthropogenic activities have also been responsible for some past landslide events, although the impact of human-induced landslides is generally not significant. Landslide risk is expected to increase with increasing development activities, urbanization, continued deforestation in landslide-prone areas, increased precipitation, and seismic activity (Aversa et al., 2018). Landslide susceptibility mapping (LSM) in a geographic information system (GIS)-integrated environment is the key to formulating disaster prevention measures and reducing future risks (Yuke et al., 2022). A systematic study of landslide-affected areas and events over the North-East region of India is valuable for reducing landslide-related risk in this region. Advances in satellite remote sensing technology and the increasing availability of high-resolution geospatial products have provided an unprecedented opportunity for such study (Bhusan et al., 2013). A GIS environment helps in calculating and visualizing the cumulative effects of causative factors on landslides. A few studies on landslide hazard assessment have been carried out around the study area, and a micro-zonation has been developed in recent years. In the present work, a comparative study of landslide hazards in the area was conducted using statistical techniques and machine learning algorithms.

The results of this study can help in the delineation of landslide-prone regions in the study area (Panchal and Shrivastava, 2021). The landslide hazard zonation map is also expected to be helpful to highway engineers, geologists, and other professionals involved in infrastructure development for hazard mitigation and intelligent planning in the study area.

1.1 Study Area

The study area, shown in Fig. 1, is located in the southern part of North-East India, between 92°39′54″ E and 92°46′57″ E longitude and between 23°39′54″ N and 23°50′35″ N latitude, in hilly terrain with an average annual rainfall of 2500 mm to 3000 mm. In the Survey of India toposheets, Aizawl is covered by Nos. 83 A/9, 83 A/10, and 83 A/84. The geographical area of Aizawl is approximately 120.3 km². Geologically, the region ranges from the pre-Cambrian to the Quaternary. Tertiary rocks of the Disang and Barail groups, consisting of shale and sandstone, are predominant in the territory; on weathering they become platy and splintery, creating ideal conditions for landslides. The terrain is also seismogenic, being one of the most seismically active regions of the world. According to BIS (2002), it falls under Seismic Zone V, with frequent moderate to large magnitude earthquakes causing extensive damage to both life and property (Sengupta and Nath, 2022).

2. Methodology

2.1 Collinearity Test of Independent Factors

Collinearity, also known as multicollinearity, refers to the presence of two or more highly correlated predictor variables in a regression model. Closely related predictors capture similar aspects of the outcome variable, which makes it difficult to isolate the unique effect of each one: if two or more predictors are highly correlated, it is hard to pinpoint which of them is driving the outcome. Collinearity can also produce inaccurate estimates of regression coefficients and standard errors, which compromises the model's validity and reliability. It is therefore important to identify and remove collinearity to safeguard the quality of statistical analyses. The correlation matrix and the variance inflation factor (VIF) are two techniques used to check for collinearity. A heat map can be used to visualise a correlation matrix, with the hue of each cell representing the strength of the correlation coefficient (Yuke et al., 2022).

A correlation matrix, shown in Fig. 2, tabulates the correlation coefficients between pairs of variables in a dataset. A correlation coefficient measures the strength and direction of the linear relationship between two variables. In research, a correlation matrix is frequently used to determine how strongly variables are related and to investigate possible relationships between them. Each cell of the correlation matrix holds a correlation coefficient between −1 and 1. A coefficient of −1 is a perfect negative correlation, meaning that the variables move in opposite directions. A coefficient of zero means that there is no linear relationship between the two variables. A coefficient of 1 denotes a perfect positive correlation, meaning that the variables move in the same direction (Yuke et al., 2022).
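
As a minimal sketch of how such a matrix is built, the Python snippet below computes pairwise Pearson coefficients for a few conditioning factors. The factor names and sample values are hypothetical placeholders, not data from this study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict holding the coefficient for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical per-cell values for three conditioning factors.
factors = {
    "slope": [5.0, 12.0, 20.0, 31.0, 44.0],
    "elevation": [310.0, 450.0, 520.0, 800.0, 990.0],
    "twi": [9.1, 7.4, 8.0, 5.2, 4.3],
}

matrix = correlation_matrix(factors)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

Each diagonal entry is 1 because a factor is perfectly correlated with itself, and the matrix is symmetric, so a heat map only needs one triangle.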

The variance inflation factor (VIF) is also used to test for multicollinearity among the predictor variables in a regression model. Multicollinearity occurs when two or more predictor variables are highly correlated with each other, which makes it difficult to estimate their coefficients accurately and to interpret the results of the regression analysis. For each predictor, the VIF is the ratio of the variance of its estimated coefficient in the full model to the variance that coefficient would have if the predictor were completely uncorrelated with all the other predictors. A VIF of 1 indicates no multicollinearity, while a VIF greater than 1 indicates some degree of multicollinearity. Generally, a VIF greater than 5 or 10 is considered problematic (Yuke et al., 2022).

The VIF is calculated by the equation:

$$VIF_{j}= \frac{1}{1-{R}_{j}^{2}} \tag{1}$$

where ${R}_{j}^{2}$ is the coefficient of determination obtained by regressing the j-th independent variable on all the other independent variables.
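
A minimal Python sketch of the VIF equation follows. Rather than fitting a separate regression for each factor, it uses the equivalent identity that the VIF of the j-th predictor is the j-th diagonal element of the inverse of the predictors' correlation matrix. The factor names and values are hypothetical and serve only to illustrate the calculation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF per predictor: the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

# Hypothetical predictors: factor_a and factor_b are correlated (r = 0.8),
# factor_c is uncorrelated with both.
predictors = {
    "factor_a": [1.0, 2.0, 3.0, 4.0],
    "factor_b": [1.0, 3.0, 2.0, 4.0],
    "factor_c": [1.0, -1.0, -1.0, 1.0],
}
vifs = vif(predictors)
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.3f}")
```

In this toy case the two correlated factors share a VIF of 1 / (1 − 0.8²) ≈ 2.78, while the uncorrelated factor has a VIF of 1.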

The VIF gauges the level of multicollinearity among the predictor variables in a regression model. A VIF above 1 indicates some degree of collinearity, whereas a value of 5 or above is regarded as problematic. As shown in Table 1, none of the VIF values exceeds 5; all are close to 1, so there is no problematic multicollinearity among the predictor variables. Collinearity is therefore unlikely to have a substantial impact on the precision or interpretation of the regression results.

The correlation matrix also indicates that the correlation coefficients between the conditioning factors lie mostly between −0.4 and 0.6. Based on this result, it can be inferred that the predictor variables in the model are relatively independent of each other, which suggests that each makes a unique contribution to the prediction of the outcome variable.

Table 1: Multicollinearity assessment by VIF

Sl No.	Factors	Variance Inflation Factor (VIF)
1	ASPECT RATIO	1.00002
2	ELEVATION	1.010394
3	PLAN CURVATURE	1.000112
4	DISTANCE TO ROAD	1.014721
5	HILLSHADE	1
6	LITHOLOGY	1.000324
7	DISTANCE TO DRAIN	1.000939
8	PROXIMITY TO EARTHQUAKE	1.00001
9	DISTANCE TO LINEAMENT	1.000059
10	LANDUSE/ LAND COVER (LULC)	1.001737
11	RAINFALL	1.000468
12	SLOPE	1.003634
13	TOPOGRAPHIC WETNESS INDEX (TWI)	1.003962
14	GEOMORPHOLOGY	1.000089
15	NORMALIZED DIFFERENCE VEGETATION INDEX (NDVI)	1.01304

2.2 Modelling Approach

Based on the literature reviewed, datasets were collected from several sources. Fifteen landslide conditioning factors and the landslide inventory were obtained from sources including the Geological Survey of India, the United States Geological Survey, Google Earth Engine Sentinel-2 and Mizoram Remote Sensing Application Centre (MIRSAC) station data. The datasets were processed in ArcGIS to derive the data required for further analysis. A multicollinearity test was conducted using a correlation matrix and the variance inflation factor (VIF) to detect any factors whose collinearity could reduce the precision of the output. After the multicollinearity test, the datasets were passed to the statistical and machine learning approaches for hazard estimation.

For the statistical methods, all datasets were reclassified into the same number of classes so that the pixel information in each class could be derived easily. The datasets were resampled to a 30 m × 30 m cell size and fed into the selected statistical methods: the Analytic Hierarchy Process (AHP), Frequency Ratio (FR), Shannon’s Entropy (SE) and Weight of Evidence (WOE). These methods were chosen based on the literature and on their ability to assign weights to the parameters of the landslide conditioning factors. To validate each hazard zonation map, the past landslide data were used: at least 60% of past landslide points were expected to fall in the high hazard zones of the map.

For the machine learning methods, a buffer zone of 50 m radius was created around the crown of each landslide point. This approach was chosen on the assumption that the area closest to the landslide point contains the best undisturbed morphological features. The buffer zone approach is also known as the seed cell approach, originally proposed by Süzen and Doyuran (2004). The attribute values of the data layers were mapped to the buffer zones through point conversion, so that attribute information was transferred from the original data layers to the buffer zones. With the seed cell approach, over ten thousand additional landslide points were produced, while non-landslide points were generated by random point extraction outside the buffer zones. Based on the literature, Random Forest (RF), Gradient Boosting Decision Tree (GBDT), Extreme Gradient Boosting (XGB) and their stacking with Logistic Regression (LR) were adopted. These algorithms were selected for their ability to assign feature importance (weights) to the landslide parameters and for their learning capability with parameter tuning.

The machine learning algorithms were then fed a mix of landslide and non-landslide points that were randomly split into training and testing data (80%/20%). Sensitivity and specificity, summarised by the Receiver Operating Characteristic (ROC) curve, were used to assess each model's accuracy in predicting landslide hazard regions within the study area.
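
The seed cell sampling and the 80%/20% split described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the ArcGIS workflow used in the study: coordinates are taken to be projected metres, seed cells lie on a 30 m grid centred on each landslide point, non-landslide points are drawn in equal number to the seed cells, and the helper names (`seed_cells`, `split_train_test`) and sample coordinates are hypothetical.

```python
import random
from math import hypot

CELL_SIZE = 30.0      # raster cell size in metres (from the text)
BUFFER_RADIUS = 50.0  # seed cell buffer radius in metres (from the text)

def seed_cells(point, radius=BUFFER_RADIUS, cell=CELL_SIZE):
    """Centres of grid cells lying within `radius` of a landslide point.

    Assumes the grid is centred on the point itself (an illustrative choice).
    """
    x0, y0 = point
    steps = int(radius // cell)
    cells = []
    for i in range(-steps, steps + 1):
        for j in range(-steps, steps + 1):
            dx, dy = i * cell, j * cell
            if hypot(dx, dy) <= radius:
                cells.append((x0 + dx, y0 + dy))
    return cells

def split_train_test(samples, test_fraction=0.2, seed=42):
    """Shuffle labelled samples and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical landslide crown locations in projected metres.
landslide_points = [(1000.0, 2000.0), (4000.0, 5500.0)]
landslide_cells = [c for p in landslide_points for c in seed_cells(p)]

# Draw random non-landslide points outside every buffer (assumed 1:1 ratio).
rng = random.Random(0)
non_landslide_cells = []
while len(non_landslide_cells) < len(landslide_cells):
    candidate = (rng.uniform(0.0, 10000.0), rng.uniform(0.0, 10000.0))
    if all(hypot(candidate[0] - px, candidate[1] - py) > BUFFER_RADIUS
           for px, py in landslide_points):
        non_landslide_cells.append(candidate)

# Label landslide seed cells 1 and non-landslide points 0, then split 80/20.
samples = ([(c, 1) for c in landslide_cells]
           + [(c, 0) for c in non_landslide_cells])
train, test = split_train_test(samples)
print(len(landslide_cells), len(train), len(test))
```

With a 30 m grid and a 50 m radius, each landslide point yields nine seed cells under these assumptions, which is how a modest inventory can grow into a much larger training sample.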

3. Result and Discussion

3.1 Comparison of the two model methods

Landslide hazard maps were generated by the statistical models and the machine learning methods in the GIS environment. Each map was divided into five classes: low, moderate, high, very high, and severe. Validation against the past landslide inventory was intended to determine how many past landslide points fall in each class of the hazard zonation map and to calculate the overall percentage for each class. Under the basic assumption that future landslides will most likely occur in physiographic settings similar to those of past and present landslides, a hazard map was considered satisfactory for practical use if at least 60% of the past landslide points fall in the high to severe classes. Fig 3 shows the percentage pixel count in each class. The low and moderate classes were treated as non-landslide area, and the high to severe classes as landslide area. In the case of FR, 68.49% of landslide points fall in the high to severe classes, compared with 68.79% for AHP, 67.96% for SE, and 63.25% for WOE. In terms of area, the pixels of each hazard zonation map were divided into landslide and non-landslide pixels, and the corresponding areas were calculated in the GIS environment, with the high to severe classes again counted as landslide area. For FR, 57.39% of the total study area falls under landslide area and 42.6% under non-landslide area. For SE, the figures are 62.5% and 37.4%; for AHP, 51.9% and 48%; and for WOE, 46.39% and 53.6%.
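
The validation criterion above amounts to a simple count. The Python sketch below, using hypothetical class labels for a handful of past landslide points, computes the share that falls in the high to severe classes and checks it against the 60% threshold.

```python
from collections import Counter

CLASSES = ["low", "moderate", "high", "very high", "severe"]
LANDSLIDE_CLASSES = {"high", "very high", "severe"}
THRESHOLD = 60.0  # minimum acceptable share, in per cent (from the text)

def class_percentages(point_classes):
    """Percentage of landslide points falling in each hazard class."""
    counts = Counter(point_classes)
    total = len(point_classes)
    return {c: 100.0 * counts[c] / total for c in CLASSES}

def high_to_severe_share(point_classes):
    """Percentage of landslide points in the high, very high and severe classes."""
    percentages = class_percentages(point_classes)
    return sum(percentages[c] for c in LANDSLIDE_CLASSES)

# Hypothetical hazard class sampled at ten past landslide points.
sampled = ["low", "moderate", "moderate", "high", "high",
           "high", "very high", "very high", "severe", "severe"]

share = high_to_severe_share(sampled)
print(f"{share:.1f}% of points in high to severe classes")
print("map accepted" if share >= THRESHOLD else "map rejected")
```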

For the machine learning methods, 73.32% of landslide points fall in the high to severe classes in the case of GBDT, and 70.7% in the case of RF. From the percentage pixel calculation in each class (Fig 4), the corresponding figures are 72% for XGB, 70% for GBDT+LR, 67% for RF+LR and 70% for XGB+LR. Accuracy, precision, recall, f-measure, area under the ROC curve (AUC), mean absolute error (MAE), root mean square error (RMSE) and kappa index were used to evaluate the performance of the six models (Yuke et al., 2022). Higher values of accuracy, precision, recall, f-measure, AUC, and kappa, and lower values of RMSE and MAE, indicate better model performance (Yuke et al., 2022). The evaluation results are given in Table 2 and Table 3. Table 2 shows that XGB has the highest AUC (0.923), followed by GBDT and RF; all three exceed 0.900, indicating very satisfactory predictive capability. The XGB+LR model has the highest accuracy, precision and f-measure, and matches the highest recall (0.999). In Table 3, XGB, GBDT+LR and XGB+LR show high kappa index values, and XGB+LR shows the lowest MAE and RMSE. The kappa index indicates the compatibility and reliability of the LSM models (Yuke et al., 2022). The overall performance of the machine learning algorithms in landslide estimation is satisfactory. The performance indicators also suggest that the stacking ensemble method is a useful tool for improving the accuracy of model prediction (Yuke et al., 2022). Overall, the XGB model, with an AUC of 0.923, has the best predictive ability in terms of AUC among the six models.
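
For reference, the evaluation metrics named above can be computed from a binary confusion matrix as in the following Python sketch. The counts are hypothetical. MAE and RMSE are computed here on hard 0/1 class labels, which is an assumption about how the study derived them; under that assumption MAE equals the misclassification rate and RMSE is its square root.

```python
from math import sqrt

def evaluate(tp, fp, fn, tn):
    """Classification and error metrics from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    # With 0/1 labels every error has magnitude 1, so MAE is the error rate.
    mae = (fp + fn) / n
    rmse = sqrt(mae)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
        "mae": mae,
        "rmse": rmse,
    }

# Hypothetical test-set counts: landslide is the positive class.
metrics = evaluate(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because kappa corrects for chance agreement, it can be low even when accuracy is high if one class dominates the test set.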

Fig 5 shows the landslide hazard maps generated from the predictions of the machine learning approaches, together with historical landslide locations. The ROC curves of the machine learning and statistical models (Fig 6 and Fig 7) indicate that the machine learning models outperform each of the statistical models. Adopting machine learning improves accuracy and predictive capability by more than 15%. Choosing suitable hyperparameters for each machine learning and stacking ensemble algorithm increases reliability and model robustness. However, Yuke et al. (2022) note that the importance of landslide conditioning factors is specific to a region and cannot be extrapolated to other regions. Likewise, statistical methods will not necessarily perform worse than advanced machine learning methods in every region. Model fitness for landslide prediction is also strongly influenced by the quality of the datasets collected. Among the machine learning methods, the AUC values of the basic classifiers exceed those of all the stacking ensemble models. Yuke et al. (2022) suggested that simple stacking does not necessarily improve a model's performance, and that a fusion model does not always perform better than a single model. In terms of error-based assessment, however, the XGB+LR model has the lowest MAE and RMSE and the highest kappa index of the six machine learning models, which shows that it performs best in error reduction.
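
The AUC used to compare the ROC curves can be computed directly from predicted scores, as the probability that a randomly chosen landslide cell is scored higher than a randomly chosen non-landslide cell. The Python sketch below uses this pairwise formulation with hypothetical scores and labels; it is illustrative and not the study's implementation.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (landslide, non-landslide) pairs in which the
    landslide sample receives the higher score; ties count as half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = landslide) and predicted landslide probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of landslide from non-landslide cells.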

Table 2: Evaluation of landslide estimation models using machine learning metrics

MODELS	ACCURACY	PRECISION	RECALL	F-MEASURE	AUC
GBDT	0.978	0.979	0.998	0.988	0.9223
RF	0.976	0.976	0.999	0.988	0.9037
XGB	0.98	0.98	0.999	0.989	0.923
GBDT+LR	0.982	0.982	0.999	0.991	0.8613
RF+LR	0.981	0.981	0.999	0.99	0.8721
XGB+LR	0.984	0.9855	0.999	0.992	0.8962

Table 3: Evaluation of landslide estimation models using error metrics.

MODELS	MAE	RMSE	KAPPA INDEX
GBDT	0.021	0.147	0.233
RF	0.0232	0.152	0.071
XGB	0.0198	0.141	0.511
GBDT+LR	0.017	0.133	0.418
RF+LR	0.018	0.135	0.385
XGB+LR	0.0152	0.123	0.551

4. Conclusion

Landslides are among the most threatening natural disasters: they can claim lives, damage property, and destroy infrastructure in a single event. Unlike other natural hazards such as earthquakes, however, the causative factors of landslides can be assessed, and prevention planned, through scientific research. In this study, a GIS-based statistical approach was carried out for landslide hazard zonation and for understanding the influence of the conditioning factors. From the results, it can be concluded that the overall performance of the statistical methods was satisfactory, as each met the minimum criterion that at least 60% of past landslide points fall within the landslide regions of its hazard map.

Landslide estimation by the second approach, stacking ensemble machine learning, outperformed the conventional statistical models, with all models exceeding 80% accuracy. Adopting the seed cell technique, by creating a 50 m buffer around the crown of each landslide point, increased the robustness of the datasets for the train-test split. From the assessment by machine learning metrics and error metrics, it is concluded that the XGB model, with an AUC of 0.923, performs best among all the models, including the statistical models. Stacking the basic classifiers also substantially reduces the error of the machine learning models. Other classifiers with improved algorithms might further increase the reliability of prediction in the study area.

The landslide hazard estimation maps prepared in this study are expected to help other researchers interested in the same study area to examine the influence of the causative factors of landslides in more depth and to apply advanced methodologies. They are also expected to help construction practitioners, engineers and geologists plan infrastructure projects with landslide mitigation in mind.

Declarations

Funding: The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Competing Interests: The authors declare no competing interest.

Author Contributions: Conceptualisation was done by Satyaprakash. Material preparation, data collection and processing were performed by Joel TC Vanlalnunzira. The analysis of the maps and interpretation was done by both Satya Prakash and Joel. The manuscript was written by Joel TC Vanlalnunzira; further editing and finalisation of the manuscript was done by Satyaprakash.

Ethical Approval: The present study does not include any testing on humans or animals hence no ethical approval was sought from any agency. The data was collected from sources which are open to everyone and was collected as per our institution’s Ethical Review Guidelines.

Consent to Participate: Since no tests were performed on humans or animals, no consent was obtained to participate in the research.

Consent to Publish: Informed consent was obtained from all the authors before the publication of research findings.

References

Alejandrino, I.K., Lagmay, A.M., Eco, R.N., 2016. Shallow landslide hazard mapping for Davao Oriental, Philippines, using a deterministic GIS model, in: Advances in Natural and Technological Hazards Research. Springer Netherlands, pp. 131–147. https://doi.org/10.1007/978-3-319-20161-0_9
Aversa, S., Cascini, L., Picarelli, L., Scavia, C., 2018. Landslides and engineered slopes. Experience, theory and practice: proceedings of the 12th International Symposium on Landslides (Napoli, Italy, 12-19 June 2016).
Bhusan, K., Singh, M.S., Sudhakar, S., 2013. Landslide hazard zonation using RS and GIS techniques: A case study from north east India, in: Landslide Science and Practice: Landslide Inventory and Susceptibility and Hazard Zoning. Springer Science and Business Media Deutschland GmbH, pp. 489–492. https://doi.org/10.1007/978-3-642-31325-7_63
Panchal, S., Shrivastava, A.K., 2021. A comparative study of frequency ratio, Shannon’s entropy and analytic hierarchy process (AHP) models for landslide susceptibility assessment. ISPRS Int J Geoinf 10. https://doi.org/10.3390/ijgi10090603
Sengupta, A., Nath, S.K., 2022. GIS-Based Landslide Susceptibility Mapping in Eastern Boundary Zone of Northeast India in Compliance with Indo-Burmese Subduction Tectonics, in: Advances in Geographic Information Science. Springer Science and Business Media Deutschland GmbH, pp. 19–37. https://doi.org/10.1007/978-3-030-75197-5_2
Süzen, M.L., Doyuran, V., 2004. Data driven bivariate landslide susceptibility assessment using geographical information systems: a method and application to Asarsuyu catchment, Turkey. Eng Geol 71, 303–321. https://doi.org/10.1016/S0013-7952(03)00143-1
Yalcin, A., Bulut, F., 2007. Landslide susceptibility mapping using GIS and digital photogrammetric techniques: A case study from Ardesen (NE-Turkey). Natural Hazards 41, 201–226. https://doi.org/10.1007/s11069-006-9030-0
Yuke, H., Khan, U., Zhang, B., Huan, Y., Song, L., 2022. Stacking ensemble of machine learning methods for landslide susceptibility mapping in Zhangjiajie City, Hunan Province, China. https://doi.org/10.20944/preprints202203.0337.v2
