Communality from equal weight for GIS-based landslide susceptibility mapping

Modeling landslide susceptibility is an important aspect of land-use planning and risk management. Several modeling methods are available, based either on highly specialized knowledge of causative attributes or on good landslide inventory data used for training and testing during model development. Understandably, these two criteria are rarely available to local land regulators. This paper presents a new modeling methodology which requires minimum knowledge of causative attributes and does not depend on a landslide inventory. As landslides result from the combined effect of causative attributes, this model utilizes the communality (common variance) of the attributes, extracted by exploratory factor analysis and used for the calculation of a landslide susceptibility index. The model can reveal the inter-relationships of the different geo-environmental attributes responsible for landslides, along with identification and prioritization of the attributes' contributions to model performance, to delineate non-performing attributes. Finally, the model performance is compared with the well-established AHP method (knowledge-driven) and FRM method (data-driven) by cut-off independent ROC curves along with cost-effectiveness. The model performs almost at par with the established models while involving minimum modeling expertise. The findings and results of the present work will be helpful to town planners and engineers at a regional scale for generalized planning and assessment.

Geo-environmental attributes control slope instability and may differ from terrain to terrain. Statistical methods are quantitative with numerical expressions (Aleotti and Chowdhury, 1999), such as logistic regression, frequency ratio, fuzzy logic, artificial neural networks, etc. (Carrara and Pike, 2008; Ercanoglu and Gokceoglu, 2004; Ghosh et al., 2011; Komac, 2006; Lee and Pradhan, 2007; Lee et al., 2004; Suzen and Doyuran, 2004; van Westen et al., 2008), all of which depend on a priori knowledge of landslide events. Understandably, it is quite difficult to assure the completeness of a landslide inventory due to the inaccessibility of sites on rugged mountainous terrain, rapid land-use change and obliteration of historic landslide signatures (Gariano and Guzzetti, 2016; Ghosh et al., 2012; Rowbotham and Dudycha, 1998; Westen et al., 2013), or the area might not have any previous history of landslides (e.g. the Malin landslide in India, which claimed 160 lives (Ering and Babu, 2016)).
To overcome these shortcomings, here we introduce a new mixed modeling method using both knowledge and statistics, which can fairly classify the area at a regional scale with simple judgments on geo-environmental attributes, using a widely used statistical tool and without any landslide inventory data. We present the model with an aim to address the following questions:
a. Can we understand the inter-relationships of the different geo-environmental attributes responsible for landslides by this modeling technique?
b. Can we identify and prioritize the contribution of attributes to model performance and delineate non-performing attributes?
c. How do the proposed model's performance and cost-effectiveness compare with other widely used models?
We coin this model the Equal Weight Method (EWM), as it starts by assuming an equal influence of all attributes causing landslides and ends with the formulation of a composite indicator to delineate the landslide-susceptible zones.

Equal Weight Method
The weight assigned to an attribute cannot be directly related to the importance of that attribute in defining the composite indicator, as the importance of the attribute depends on its correlation and variance within the system. To develop a statistically sound index, weights can be used as "scaling coefficients" taking values between 0.5 and 1.0 (Becker et al., 2017). Assigning equal weights to different attributes is commonly employed for the Resource Governance Index (Quiroz and Lintzer, 2013), the Global Innovation Index (Dutta et al., 2014) and the Good Country Index (Anholt and Govers, 2014), as well as for multi-model atmospheric forecasting (DelSole et al., 2013). The present model, the Equal Weight Method (EWM), is based on understanding the correlation between the geo-environmental attributes, or in other words the influence of one attribute over another, as landslides result from the interplay of different attributes (Carrara and Pike, 2008). In this method, initially, all attributes are considered to have a similar influence on landslide production and are given equal weight (say 1). The weight of the parent attribute (1) is equally and systematically distributed to its sub-attributes considering their possible role in landslide hazards. At this stage, the modeler needs to make a simple judgment to rank the importance of the sub-attributes with respect to landslides. For example, relative relief is an important attribute during landslide activity in mountainous terrain. We can divide relative relief into three sub-attributes, viz. low (< 100 m), medium (100-300 m) and high (> 300 m), and accordingly we can scale the influence of relative relief on landslides as low relative relief < medium relative relief < high relative relief. To assign weights to each sub-attribute, we divide the original weight into three equal parts (i.e. 0.33) and add them cumulatively from the least influential sub-attribute to the most vulnerable sub-attribute.
In this process, low relative relief will have a weight of 0.33 and high relative relief a weight of 1.0. This weighting scheme is applied to all attributes. The process generates multivariate data, which we take as input to find out the relationships between the attributes.
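The cumulative sub-attribute weighting described above can be sketched as a few lines of Python. The function and the relative-relief class names follow the worked example in the text; any other attribute with ranked sub-attributes would be handled identically.

```python
# Sketch of the EWM sub-attribute weighting scheme: the parent weight (1)
# is split into n equal parts and accumulated from the least to the most
# landslide-prone sub-attribute.

def sub_attribute_weights(n_classes, parent_weight=1.0):
    """Return cumulative weights for n ranked sub-attributes."""
    step = parent_weight / n_classes
    return [step * (i + 1) for i in range(n_classes)]

# Relative relief: Low (< 100 m) < Medium (100-300 m) < High (> 300 m)
weights = sub_attribute_weights(3)
print({name: round(w, 2) for name, w in zip(["Low", "Medium", "High"], weights)})
# Low receives ~0.33 and High 1.0, matching the text
```

Applied cell by cell over every attribute layer, this yields the multivariate data table that feeds the factor analysis.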
As landslides are caused by a complex interplay of the attributes, we need to know the interrelations between attributes. To understand the interrelations in the multivariate data by providing a parsimonious and meaningful explanation for the observed correlations between different attributes, we performed factor analysis to reduce variable complexity (Kerlinger, 1979). The basic idea of factor analysis is to reduce the dimensionality of the observed attributes to unobservable latent attributes that share a common variance. An important postulate of factor analysis is that there exist internal attributes which cannot be directly measured but whose effects are reflected in the measures obtained on measurable attributes. These internal attributes are referred to as factors or latent attributes.
Mathematically, let A be the number of variables (X1, X2, X3, ..., XA) and B be the number of factors (F1, F2, F3, ..., FB); the model then assumes that each variable is a linear function of all the factors, reproducing the maximum correlation. If Xi is a variable, then

Xi = fi1·F1 + fi2·F2 + ... + fiB·FB + ui ……………………(1)

where fi1, fi2, fi3, ..., fiB are the factor loadings, which give an idea of how much the variable contributes to each factor and can vary from +1 to -1 (Dragon, 2006), and ui is the unique factor of Xi. Factor analysis uses matrix algebra, where the basic input is the correlation coefficients of the variables. Once the correlation matrix of the variables is computed, the factor loadings can be computed by (Rummel, 1988)

R(A×A) = F(A×B) F′(B×A) + U²(A×A)

where R(A×A) is the correlation matrix, F(A×B) denotes the common factor loadings (F′(B×A) is its transpose) and U²(A×A) is the unique variance. The variance associated with any variable arises from three sources. The variance due to common factors is referred to as common variance or communality. The variance associated with a specific factor is referred to as specific variance. Besides these two, another variance occurs due to error of measurement, referred to as error-of-measurement variance. The specific variance is commonly combined with the error-of-measurement variance to form the unique variance, or uniqueness. The difference between R(A×A) and U²(A×A) is the communality, which can be readily computed for each variable by adding the squares of the factor loadings of that variable.
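The "sum of squared loadings" computation of communality can be sketched in a few lines of numpy. The loading matrix below is invented purely for illustration; in the present work the loadings come from exploratory factor analysis of the weighted attribute grids.

```python
import numpy as np

# Communality = row-wise sum of squared factor loadings; for standardized
# variables the remainder of the unit variance is the uniqueness U^2.
loadings = np.array([
    [0.85, -0.10],   # hypothetical attribute 1 on factors F1, F2
    [0.20,  0.90],   # hypothetical attribute 2
])
communality = (loadings ** 2).sum(axis=1)
uniqueness = 1.0 - communality
print(communality)   # [0.7325 0.85]
print(uniqueness)    # [0.2675 0.15]
```

An attribute with loadings near zero on every factor therefore carries almost no common variance, which is exactly the property the EWM model later exploits to drop non-performing attributes.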
Communality coefficients are specific to the measured attributes.
As we do not have any clear prior hypotheses regarding the factor structure underlying the attributes, the exploratory method of factor analysis is employed (Yong and Pearce, 2013). In this method, no a priori restrictions are placed on the pattern of relationships between the observed attributes and the latent attributes. As we want to keep the maximum variance in the solution and we have seven attributes, we extracted six factors, explaining about 92% of the total variance, which is nearly equivalent to the required variance as per Hair et al. (1995) (Table-1). Factor-1 extracts the maximum variance, followed by the other factors. A close look at the factor loadings of the attributes indicates that slope, lithology, relative relief and road make important contributions to factor-1. Out of these four attributes, lithology and road are positively correlated, and slope and relative relief are negatively correlated, with factor-1. This also indicates a good interrelationship within these attributes, which have the maximum influence on the factor model.
The other three attributes (viz. LULC, drainage and lineament) have poor correlations with all factors, indicating that their interrelationships with the other attributes are poor and their contributions to the factor model are insignificant.
In the next stage, the extracted communality of each variable is used to calculate the Landslide Susceptibility Index (LSI):

LSI = C1·V1 + C2·V2 + ... + Cn·Vn

where C1, C2, ..., Cn are the communalities of the attributes V1, V2, ..., Vn. The next question is how to fine-tune the model performance, i.e. how to exclude from the model those attributes having minimum contributions to model performance. From the above discussion, we have seen that the main pillar of the EWM model is the common variance or communality, which is specific to the measured attribute. This means that an attribute with higher communality has more influence on model performance. This can be measured indirectly by a stepwise cumulative plot of the standard deviation of LSI along the Y-axis against the cumulative number of attributes along the X-axis, starting with the attribute of highest or lowest communality, or directly by plotting cumulative communality starting with the highest or lowest communality. The attributes having minimum influence on model performance will form a break in slope and will trend nearly asymptotically to the X-axis due to their lower contribution of common variance.
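The LSI summation and the communality-based prioritization above can be sketched as follows. The communality values and attribute names are illustrative placeholders, not the values obtained in this study.

```python
import numpy as np

# Hypothetical communalities for the seven attributes of the study
communality = {"slope": 0.95, "lithology": 0.90, "relief": 0.88,
               "road": 0.80, "lulc": 0.15, "drainage": 0.12, "lineament": 0.10}

def lsi(cell_values, comm):
    """LSI = C1*V1 + C2*V2 + ... + Cn*Vn for one grid cell."""
    return sum(comm[a] * v for a, v in cell_values.items())

# Rank attributes by communality; a cumulative plot of these values
# flattens where low-communality attributes stop adding common variance.
ranked = sorted(communality, key=communality.get, reverse=True)
cumulative = np.cumsum([communality[a] for a in ranked])
print(ranked)        # attributes from most to least influential
print(cumulative)    # break in slope marks candidates for exclusion
```

With these illustrative numbers, the curve rises steeply over the first four attributes and nearly flattens thereafter, which is the pattern used to decide which attributes to drop.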

Study area
Although the model generates a composite map giving information on the landslide susceptibility of an area, it does not require any spatial structure information and hence can be applied to any geographical area.
As mountainous terrains are more prone to landsliding, and as road corridors are the main locales of development activity in any hilly terrain, we have selected a well-studied major road corridor of the eastern Himalaya in Sikkim state, India (Kumar N et al., 2017), connecting several densely populated townships and major hydroelectric projects which are affected by the frequent occurrence of landslides (Fig-1).
The elevation of the area varies between 320 m and 3680 m, with a general increase in elevation from south to north. Due to the high elevation differences, climatic conditions vary from tropical to alpine from south to north, with an average annual rainfall between 300 and 400 cm. The monsoon generally sets in by June and continues for about four months. The Tista river, one of the major rivers of Sikkim, originates from the Tista Khangse glacier in the north and passes through the entire study area from Chungthang up to Rangpo. Geologically, the area falls in the Lesser Himalayan and Central Crystalline zones of the Sikkim-Darjeeling Himalayas. The rock types found in the area are represented by alternate meta-pelites and meta-psammites of different metamorphic grades belonging to the Daling Group, and high-grade gneiss and schist belonging to the Darjeeling Gneiss Formations. We have mapped along 100 km of National Highway (NH-34A) and 44 km of village roads (made up of gravel and often mixed with tar), delimiting the lateral boundaries between the Tista river on one side and the water divide on the opposite hillside. For an up-to-date landslide inventory, we have taken the help of a previous inventory catalog (Paul and Ghosal, 2009), remote sensing (Google Earth imagery) and field verification using a GPS survey. All landslides are debris slides (Varnes, 1978), mostly triggered by heavy precipitation and favored by oriented rock-mass discontinuities, joints, faults or schistosity. A total of 55 landslides were mapped within an area of about 149 sq. km.
As the Himalaya is a tectonically active region, geological structure (fault-lineament) and lithology are important controlling factors. Moreover, as a landslide is a gravitational process, terrain slope and relative relief play important roles. Similarly, vegetation cover and distance from drainage are also found to play pivotal roles in landslide occurrence (Pradhan and Lee, 2010). In addition to natural controlling factors, anthropogenic activities, viz. road construction, land-use pattern, etc., also promote landslides. Based on the above, for the present study we have selected lithology, land use-land cover, slope, relative relief, lineament-fault, drainage and road as the causative geo-environmental attributes responsible for landslides. We used the ASTER Digital Elevation Model (DEM) for the derivation of slope angle and local relative relief. Since the ASTER DEM does not pick up water bodies and is low in vertical resolution, we rectified the DEM by using the topographical map of the Survey of India for water bodies and high-resolution field GPS data for vertical correction. Excluding the drainage, road and lineament-fault layers, all layers were classified following the code provided by the Bureau of Indian Standards (BIS, 1998), which is widely used in the Indian subcontinent. For the drainage, road and lineament-fault layers, we calculated systematic buffer intervals on a GIS platform. The geo-environmental attributes were converted to a raster grid with 50 × 50 m cell size, used as the mapping unit with the assumption that each grid cell represents a spatially homogeneous domain for the susceptibility modeling. The area grid consists of 61,667 cells, out of which 55 cells are occupied by landslides. Details of the geo-environmental attributes are given by Kumar N et al. (2017).
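The derivation of slope angle and relative relief from the DEM can be sketched with numpy. This is a minimal illustration under stated assumptions: the tiny elevation array is a placeholder, and relative relief is computed here over the whole demo array where a GIS would use a moving window.

```python
import numpy as np

cell = 50.0                      # grid resolution in metres, as in the study
dem = np.array([[320., 340., 360.],
                [360., 400., 440.],
                [420., 480., 540.]])   # placeholder elevations (m)

# Slope angle: arctan of the magnitude of the elevation gradient
gy, gx = np.gradient(dem, cell)              # elevation change per metre
slope_deg = np.degrees(np.arctan(np.hypot(gx, gy)))

# Relative relief: max minus min elevation within a window
# (the whole demo array stands in for one window here)
relative_relief = dem.max() - dem.min()
print(round(relative_relief))    # 220
```

A 220 m relief would fall in the "medium" (100-300 m) sub-attribute class of the weighting scheme described earlier.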

Landslide susceptibility map
The proposed model is used to prepare a landslide susceptibility map which demarcates areas by their likelihood or probability of landsliding. In the present work we have used natural breaks for five classes (viz. very low, low, moderate, high and very high) on a GIS platform to produce the susceptibility map from the LSI calculated by EWM. We calculated LSI taking into account all attributes, i.e. the total communality derived for the model, described as the EWM-TC map (Fig-2a), and compared it with the susceptibility map considering partial communality of the model (EWM-PC, Fig-2b), obtained by the exclusion of some attributes. The decision on the exclusion of attributes from the model was taken by plotting the cumulative standard deviation as described earlier (inset in Fig-2b). This process excludes LULC, drainage and lineament, which individually represent less than 10% of the total communality. The distribution of landslides shows 79% and 84% of landslides per sq. km in the high and very high zones of the EWM-TC and EWM-PC maps respectively, within an areal coverage of about 32% of the total mapped area. The very low and low susceptibility areas, although having cumulative coverages of about 42% and 45% for the EWM-TC and EWM-PC maps respectively, contain no landslides in the very low class and only 7% and 4% landslides per sq. km respectively in the low class (Table-3).
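The classification step above can be sketched as follows. The break values and LSI values here are hypothetical; in the study the class boundaries come from the natural breaks (Jenks) routine of the GIS platform.

```python
import numpy as np

# Hypothetical class boundaries produced by a natural-breaks routine
breaks = [1.5, 2.5, 3.5, 4.5]
labels = ["Very low", "Low", "Moderate", "High", "Very high"]

lsi_values = np.array([0.8, 1.7, 3.0, 4.0, 5.2])   # sample cell LSI values
classes = np.digitize(lsi_values, breaks)           # bin index per cell
print([labels[c] for c in classes])
# ['Very low', 'Low', 'Moderate', 'High', 'Very high']
```

Mapping every grid cell through such a lookup produces the five-class susceptibility raster compared in Figures 2a and 2b.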
To compare the performance of the EWM-based landslide susceptibility map, we have selected the well-referred Analytic Hierarchy Process (Saaty and Vargas, 2001) and the Frequency Ratio Method (Lee and Pradhan, 2007). The former (AHP) is fully knowledge-driven, based primarily on expert opinion without taking known landslide incidences into account, while the latter (FRM) is fully data-driven, depending solely on landslide inventory data. Details of the model results from these two methods are given in Table-2. Figure-3a shows the landslide susceptibility map from the AHP method, with about 69% and 17% landslides per sq. km respectively in the very high and high susceptibility zones, covering about 26% of the mapped area, while about 37% of the mapped area is demarcated as of very low and low susceptibility, with concentrations of about 4% and 6% landslides per sq. km (Table-3). On the other hand, the FRM-based landslide susceptibility map (Figure-3b) delineates about 13% of the mapped area as of very high to high susceptibility, with about 80% and 12% of landslides per sq. km respectively, while about 73% of the mapped area is demarcated as of very low and low susceptibility, with concentrations of about 1% and 4% landslides per sq. km (Table-3).
Comparison plots of all three models (Figure-4) show a sharp break in the slope of landslides per sq. km from the high class onwards, indicating that all three models almost equally picked up the high and very high susceptibility areas. From the medium to the very low susceptibility areas, however, the curves for the AHP and FRM models are almost parallel to the X-axis, indicating the low resolution of these models in distinguishing the classes of comparatively low landslide susceptibility, while both EWM models clearly distinguish the very low susceptibility area, without any reported landslide incidences, and maintain a low-angle smooth gradient until the high susceptibility zone is reached. The area-percentage plots show a highly positively skewed area distribution for the FRM model, which is unrealistic and indicates a higher probability of misclassifying areas as low landslide-prone. The AHP model shows a nearly Gaussian area distribution, but here the problem is the high proportion of medium landslide-prone areas, the class on which it is most difficult for any planner to decide future developments. On the other hand, neither skewness nor a high medium class is observed for the EWM model. EWM-TC distributes the classes nearly equally, while fine-tuning to EWM-PC has the potential to produce a bimodal distribution pattern, separating the high and low susceptibility areas with a comparatively smaller medium class than AHP.

Statistical comparison
The computational power of a model depends on its sensitivity, or in other words its ability to minimize misclassification while constraining the attributes having minimum effect on model results (Douglas-Smith et al., 2020). The performance of the models is judged by ROC (Receiver Operating Characteristic) curves with a binary classification of stable and unstable units, considered one of the best performance-evaluation techniques. The confusion matrix of a two-class classifier is built with rows representing the observed class and columns representing the predicted class (Table-4). ROC curves are prepared by standard curve fitting of several points, each representing a confusion matrix calculated at one probability over a range of probabilities. The Area Under the Curve (AUC) of the ROC is used as a measure of model performance. The accuracy of tests with AUC between 0.50 and 0.70 is low, accuracy between 0.70 and 0.90 is moderate, while an AUC over 0.90 indicates high accuracy (Streiner and Cairney, 2007). Although the AUC can compare different classification schemes directly, it has some drawbacks when the ROC curves cross over one another. When one classifier crosses over another, both classifiers are the best performer over certain ranges of points (Drummond and Holte, 2006), as can be seen from Figure-5, where both EWM-TC and EWM-PC cross over AHP and FRM at different points, indicating they are superior to AHP and FRM over certain ranges. Moreover, all the models show definite misclassification, as evidenced by the convex hulls of the ROC curves (otherwise the ROC curve would have merged with the coordinate axes). For any land-classification scheme, identifying the stable land is important from the economic point of view, because the unstable land will be restricted in use (Frattini et al., 2010); this means the model should have a minimum false-negative count (type-II error).
A higher false-positive count (type-I error) will lower the extent of the stable areas, but it is not as harmful as a type-II error. A comparison of the ratio of misclassification counts (type-II/type-I) of all four models shows that EWM-PC performs better than FRM and AHP from 40% and 85% probability onwards respectively. At 100% probability, this ratio is much lower for EWM-PC than for AHP, FRM and EWM-TC, indicating that EWM-PC is the comparatively better performer (Figure-6). However, at this stage we do not know the misclassification costs and class probabilities needed to choose the best performer. Hence it is important to identify the model with minimum misclassification cost over a definite probability range.
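The cut-off independent evaluation described above can be sketched with numpy: sweep a probability threshold over the susceptibility scores, build the confusion matrix at each cut-off, and trace out the ROC curve. The scores and labels below are synthetic, not the study's data.

```python
import numpy as np

# Synthetic scores: unstable cells tend to score higher than stable ones
rng = np.random.default_rng(0)
scores = np.concatenate([rng.uniform(0.4, 1.0, 50),    # unstable cells
                         rng.uniform(0.0, 0.6, 500)])  # stable cells
labels = np.concatenate([np.ones(50), np.zeros(500)])

thresholds = np.linspace(1, 0, 101)
tpr, fpr = [], []
for t in thresholds:
    pred = scores >= t                                  # one cut-off
    tp = np.sum(pred & (labels == 1)); fn = np.sum(~pred & (labels == 1))
    fp = np.sum(pred & (labels == 0)); tn = np.sum(~pred & (labels == 0))
    tpr.append(tp / (tp + fn))                          # true-positive rate
    fpr.append(fp / (fp + tn))                          # false-positive rate

auc = np.trapz(tpr, fpr)            # Area Under the ROC Curve
print(round(auc, 2))
```

The type-II/type-I ratio compared in Figure-6 comes from the same sweep, taking fn/fp at each threshold where fp is non-zero.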
To achieve this goal, Drummond and Holte (2006) proposed cost-curve analysis, where a point (X, Y) on the ROC curve corresponds to a line segment in the cost curve that has Y = FPR when X = 0 and Y = FNR when X = 1. The equation of this line is

Y = (FNR − FPR) · X + FPR ……………………(i)

where X and Y represent probability and normalized expected cost respectively. In the cost analysis, the minimum cost is incurred for correct classification, and the misclassification cost is always higher.
However, we cannot use the above equation directly, because it is meant to evaluate models whose capability is measured by capturing the maximum number of positive cases. This type of analysis is mainly performed in biomedicine, signal processing, etc., where positive cases are the goal sought; there the model should have predictive power with a minimum of missed positive cases, although the other error type still carries its weight in the total misclassification cost. For landslide susceptibility, by contrast, we are mainly interested in stable areas, i.e. areas with no or minimum landslides, with a different representation of the confusion matrix, as shown in Table-4 (Frattini et al., 2010), than that of Drummond and Holte (2006). In this case, a false negative (an area demarcated stable but actually unstable) is more important than a false positive (an area demarcated unstable but actually stable). Equation (i) then has to be rewritten with Y = FNR when X = 0 and Y = FPR when X = 1:

Y = (FPR − FNR) · X + FNR ……………………(ii)

X and Y remain as stated earlier. Figure-7 shows the cost analysis of all the models, with a maximum bounding cost of 0.4, which indicates that all the models are cost-effective. Out of the four models, FRM is the most cost-effective, followed by AHP and EWM. The maximum cost difference between FRM and EWM is only 0.13 (line A-B in Figure-7). However, it can be seen that at probabilities < 0.25 all four models behave almost similarly. Within the probability range 0.25 to 0.5, FRM outperforms, followed by AHP and the two EWM models respectively. From 0.5 to 0.78, FRM, AHP and EWM-PC behave similarly and outperform EWM-TC. At > 0.78, all models behave similarly. This indicates that the EWM models can perform equally well as the other well-established models over at least three-fourths of the entire probability range, with a competitive bounding cost.
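The cost-curve construction of equation (ii) can be sketched as follows: each (FNR, FPR) pair from a confusion matrix maps to a line in cost space, and the lower envelope over all cut-offs bounds the expected cost of the model. The operating points below are illustrative, not the study's values.

```python
import numpy as np

# Illustrative (FNR, FPR) pairs, one per classifier cut-off
operating_points = [(0.30, 0.05), (0.15, 0.20), (0.05, 0.45)]

x = np.linspace(0, 1, 101)                     # probability axis
# Equation (ii): Y = (FPR - FNR) * X + FNR, one line per cut-off
cost_lines = [(fpr - fnr) * x + fnr for fnr, fpr in operating_points]
lower_envelope = np.min(cost_lines, axis=0)    # best achievable cost at each X
print(round(lower_envelope.max(), 3))          # the model's bounding cost
```

The maximum of this lower envelope is the "bounding cost" compared across the four models in Figure-7.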
All of the above metrics indicate that the premise of the Equal Weight Method pragmatically fulfills the desire envisaged under a sensible real-world constraint: land regulatory bodies may not have highly specialized, knowledgeable persons to categorize the stability of land based on experience, or the area may not have sufficient landslides to use for training and testing with standard data-driven methods. This shows the robustness of the proposed variance-based EWM methodology, in which the modeler classifies the area with minimum knowledge of the attributes' intrinsic properties and without any landslide checks, and even has the opportunity to fine-tune the model performance by excluding attributes insignificant for that particular area.

Conclusion
Landslide susceptibility mapping lays the foundation for land-hazard management and is of great help to planners and engineers in choosing suitable locations for development. With the available techniques, a good landslide susceptibility map can be produced either by a person having sound knowledge of the effects of the geo-environmental attributes of the studied area, or when the area to be mapped has nearly complete inventory data. The proposed Equal Weight Method depends on the mapper's preliminary terrain-specific knowledge and does not depend on landslide inventory data. The common variance (communality) is extracted by the factor analysis technique from the data assigned by the modeler for each attribute. This communality is used to weight the original data and, finally, the LSI is calculated by summing the individual weighted attributes.
Factor loadings imply correlations between attributes and factors. Factor analysis of the proposed model indicates good correlations of lithology, relative relief, slope and road with the latent variable factor-1, indicating good intercorrelation between these variables, which have a major influence on the EWM model. The other attributes (LULC, drainage and lineament) have insignificant correlations with all factors, which also indicates their poor interrelationships with the other attributes.
A landslide is the result of the cumulative effect of different attributes and is in turn dependent on the communality of the individual attributes. To prioritize the contribution of attributes to model performance and to delineate non-performing attributes, cumulative stepwise LSI plots or communality plots are suggested, which can indicate the non-influential attributes. We have generated two models by using EWM: the first takes into account all communality-weighted attributes (EWM-TC) and the second selectively deletes attributes (EWM-PC) based on the above criterion.
To understand the acceptability of the EWM-TC and EWM-PC models, we carried out ROC and cost analyses and compared them with well-referred knowledge-driven (AHP) and data-driven (FRM) models. The AUCs of the ROC curves for the AHP, FRM, EWM-TC and EWM-PC models are 0.755, 0.809, 0.74 and 0.774 respectively, indicating little performance difference between the models. However, as the ROC curves cross each other, the AUC does not indicate model performance sensu stricto. The ratio of misclassification counts indicates the best performance of EWM-PC over a wider probability range. The cost-effectiveness analysis indicates that the EWM models can perform equally well as the other well-established models over at least three-fourths of the entire probability range, with a competitive bounding cost.
All of this indicates the robust predictive power of the EWM method for delineating landslide-susceptible areas. The findings and results of the present work will be helpful to town planners and engineers at a regional scale for generalized planning and assessment. This type of map can be prepared without taking help from persons with specialized knowledge of environmental attributes and without any landslide inventory data, satisfying a sensible real-world constraint. However, the application of the proposed methodology to local and site-specific studies needs to be tested.

Declaration
The data sets used and/or analyzed during the current study are available from the corresponding author on reasonable request. There are no competing interests. The work was funded by the Geological Survey of India. SKS contributed mainly to the research work by developing the model for the landslide study and prepared the first draft of the manuscript. GIS inputs and model comparisons are from SG and SDG. TK, JNH, MM and PD were involved in field data generation and compilation.

Tables

Table 1. Factor matrix with communality. RR: Relative relief; LULC: Land use land cover.

Figure 1

Location and extent of the studied area as depicted on a hillshade map generated from the DEM, showing the locations of landslides.

Figure 2
A. Landslide susceptibility map obtained from the EWM model considering total communality. B. Landslide susceptibility map from the EWM model considering partial communality. The inset shows a cumulative standard deviation plot starting with only C1V1 (the highest standard deviation) and then gradually adding CnVn one after another, from highest to lowest standard deviation.
Figure 5

ROC curves for all models.

Figure 6
Comparison of the misclassification ratios of all models.

Figure 7
Cost curves for all models. A-B is the maximum cost difference.