Land subsidence spatial modeling and assessment of the contribution of geo-environmental factors to land subsidence: comparison of different novel ensemble modeling approaches

Abstract: Land subsidence is a worldwide threat. In arid and semiarid lands, groundwater depletion is the main factor that induces subsidence, resulting in environmental damage and high economic losses. To foresee and prevent the impact of land subsidence, it is necessary to develop accurate maps of the magnitude and evolution of subsidence. Land subsidence susceptibility maps (LSSMs) are one of the effective tools for managing vulnerable areas and for reducing or preventing land subsidence. In this study, we used a new approach to improve Decision Stump Classification (DSC) performance by combining it with the machine learning algorithms (MLAs) Naive Bayes Tree (NBTree), J48 decision tree, alternating decision tree (ADTree), logistic model tree (LMT) and support vector machine (SVM) for land subsidence susceptibility mapping (LSSM). We employed data from 94 subsidence locations, of which 70% were used to train the hybrid models and the other 30% were used for validation. The models’ performance was assessed by ROC-AUC, accuracy, sensitivity, specificity, odds ratio, root-mean-square error (RMSE), Kappa, frequency ratio and F-score. A comparison of the results obtained from the different models reveals that the new DSC-ADTree hybrid algorithm has the highest accuracy (AUC = 0.983) in preparing LSSMs compared with the other models: DSC-J48 (AUC = 0.976), DSC-NBTree (AUC = 0.959), DSC-LMT (AUC = 0.948), DSC-SVM (AUC = 0.939) and DSC (AUC = 0.911). The LSSMs generated through the novel approach presented in our study provide reliable tools for managing and reducing the risk of land subsidence.


Introduction
area the plain of Semnan is relatively flat and is bounded by several faults. Geologically, the northern part of the Semnan plain exposes volcanic rock formations including rhyodacite and andesite, whereas the southern part is composed of marl, shale, sandstone and conglomerate (termed the Upper Red Fm). According to hydrographic information, the groundwater level of the Semnan plain has shown a descending trend over the last 24 years, indicating a decrease in groundwater resources. The northeastern part has the largest drop in water level because of the high density of agricultural wells. The drop decreases from east to west and toward the south. The falling water level increases the grain-to-grain pressure and the density of the soil layers, eventually leading to permanent subsidence of the land.

Methodology
The methodology followed in this study comprises six main stages (Figure 2). The first stage was preliminary data collection (an inventory of previous land subsidence) through a sequence of field surveys supported by high-resolution satellite images. After the LS inventory was compiled, the second stage was the preparation of the conditioning factors (LSSCFs). The third stage was to assess the correlation between the predictor variables. Following this, land subsidence susceptibility was modeled using the Decision Stump Classification (DSC) model and its ensembles with the SVM, LMT, NBTree, J48 and ADTree approaches.

Data Acquisition
organization), and field observations. Figure 1 shows the location of subsidence in the study area. The 94 available subsidence locations were randomly assigned to training (70%, 66 locations) and validation (30%, 28 locations). The LSIM was prepared in ArcGIS with a pixel size of 12.5 m. Some of the land subsidence features in the study area are shown in Figure 3.
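The random 70/30 partition described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_inventory`, the use of integer indices in place of real subsidence locations, and the fixed seed are all assumptions made for the example.

```python
import random


def split_inventory(n_locations=94, train_fraction=0.70, seed=42):
    """Randomly split location indices into training and validation sets.

    The indices stand in for the mapped subsidence locations; the seed is
    an arbitrary choice that only makes the example reproducible.
    """
    indices = list(range(n_locations))
    rng = random.Random(seed)
    rng.shuffle(indices)
    n_train = round(n_locations * train_fraction)  # 94 * 0.70 = 65.8, rounded to 66
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_ids, valid_ids = split_inventory()
print(len(train_ids), len(valid_ids))
```

With 94 locations, rounding 65.8 gives the 66 training and 28 validation locations reported in the text.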

Many types of LSSCFs can affect the occurrence of land subsidence. Based on previous studies and field surveys in the study area, 12 LSSCFs were selected for this study: clay content, sand content, groundwater withdrawal, topographic wetness index (TWI), plan curvature, elevation, slope, distance to road (DR), distance to stream (DtS), drainage density (DD), lithology, and land use/land cover (LU/LC).

These factors were derived from a digital elevation model (DEM) with 12.5 m spatial resolution, geological maps, land use maps and satellite imagery. The map of each layer was prepared using ArcGIS 10.4.2 and SAGA GIS 3.2 at a spatial resolution of 12.5 × 12.5 m for the entire study area (Table 1). In order to analyze the occurrence of land subsidence, the layers were classified according to the Natural Breaks method (Figure 4).

Many land subsidence events occur because of water leaking from thin soils. One of the influential soil factors in LSS relates to the engineering characteristics of clay and sediments (Ma et al., 2006). Because of its high contraction and impermeability, clay can retain moisture for a long time (Gong et al.,

Decision Stump Classification (DSC) is a type of MLA: a one-level decision tree with a single internal node (the root) that is connected directly to the terminal nodes (its leaves) (Iba and Langley, 1992). As a base classifier, the DSC algorithm uses only a single attribute for splitting (Chen et al., 2019). The

where I is an indicator function: I(z) = 1 if z is true and I(z) = 0 otherwise; x is the molecular fingerprint vector, i is the value of the fingerprint and t is the target (Oliver and Hand, 1994).
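To make the idea of a one-level tree concrete, the following is a minimal sketch of a decision stump on a single numeric attribute. It is an illustration under stated assumptions, not the implementation used in the study: the function names, the exhaustive threshold search that minimises training errors, and the toy groundwater drawdown values are all hypothetical.

```python
def fit_stump(values, labels):
    """Fit a one-level decision tree on a single numeric attribute.

    Returns (threshold, label_if_at_or_below, label_if_above), chosen to
    minimise the number of misclassified training samples.
    """
    best = None
    for threshold in sorted(set(values)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if v <= threshold else right) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return threshold, left, right


def predict_stump(stump, value):
    """Classify one value with a fitted stump."""
    threshold, left, right = stump
    return left if value <= threshold else right


# Hypothetical toy data: drawdown values and subsidence labels (1 = subsidence).
drawdown = [0.5, 1.0, 1.5, 4.0, 5.5, 7.0]
subsided = [0, 0, 0, 1, 1, 1]
stump = fit_stump(drawdown, subsided)
print(stump, predict_stump(stump, 2.0))
```

Because the stump looks at one attribute and one threshold only, it is a weak learner on its own, which is why the study pairs it with stronger algorithms in an ensemble.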

Decision trees provide high speed and accuracy in forecasting and interpreting data sets (Sikandar et al., 2018). ADTree is one of the algorithms that can be used for strong classification in data mining

The NBTree method was first proposed by Kohavi (1996) as a hybrid algorithm and is a type of classification algorithm used in data mining (Rahmati et al., 2019). The NBTree model is very popular because of its simplicity of construction, its short implementation time, and its use of low-ranking training data (Pham et al., 2017a; Saha et al., 2020). The first step in modeling with the NBTree algorithm is tree growth based on entropy (degree of disorder) (Nhu et al., 2020b). If S is the training set and |S| is its total number of elements, the elements can be classified into n classes S_i (i = 1, 2, …, n), where |S_i| is the number of elements belonging to class S_i. As a result, the expected classification can be calculated as follows:

In addition, if attribute A is considered in set S, the entropy is as follows:

The information gain ratio (IGR) is used to express the difference between Entropy(S) and Entropy(S, A): (6)

calculated as follows:

where P(·) is the probability of the output variable, which takes the values (1, 0), and μ and σ denote the mean and standard deviation of r_i, respectively (Murphy, 2006).
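The entropy-based quantities named above can be illustrated with a short sketch. It assumes the standard Shannon entropy and the C4.5-style gain ratio (information gain divided by the split information), which may differ in notation from the equations of the original manuscript. The function names and the toy soil attribute are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(attribute, labels):
    """Entropy of the labels after splitting on each value of the attribute."""
    n = len(labels)
    total = 0.0
    for value in set(attribute):
        subset = [y for a, y in zip(attribute, labels) if a == value]
        total += len(subset) / n * entropy(subset)
    return total


def gain_ratio(attribute, labels):
    """Information gain normalised by the split information (C4.5 style)."""
    gain = entropy(labels) - conditional_entropy(attribute, labels)
    split_info = entropy(attribute)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: a soil attribute that separates the classes perfectly.
soil = ["clay", "clay", "sand", "sand"]
subsided = [1, 1, 0, 0]
print(entropy(subsided), gain_ratio(soil, subsided))
```

In this toy case the attribute removes all uncertainty about the class, so the gain equals the full entropy of the labels.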

LMT is a type of supervised classification model that uses the C4.5 algorithm for splitting (Quinlan, 1993b). The LMT algorithm combines LR and decision tree learning (Landwehr et al., 2005). In the logistic variant, the LogitBoost algorithm is used to generate an LR model in each group, and the splitting is then performed with the C4.5 criterion (Sumner et al., 2005). The C4.5 algorithm uses the entropy technique to achieve optimal classification accuracy (Lim et al., 2000). In the LMT classification method, the IGR technique for dividing the tree into nodes and leaves is formulated as follows:

maps. The relationships of these methods are summarized in Table 3. Here, A is the number of true positives, B is the number of false positives, C is the number of true negatives (non-subsidence pixels classified correctly), D is the number of false negatives (subsidence pixels classified incorrectly), and the remaining two symbols denote the measured and expected values, respectively. Also, in the Kappa index, X_est and X_obs denote the simulated and actual values of the model, respectively.
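The counts A to D defined above feed directly into the usual validation statistics. The sketch below uses the textbook definitions of accuracy, sensitivity, specificity, Cohen's kappa and RMSE, which are assumed to match Table 3 but are not copied from it. The function names and the example counts are hypothetical.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute common validation statistics from confusion-matrix counts.

    tp, fp, tn, fn correspond to A, B, C and D in the text.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # share of subsidence pixels detected
    specificity = tn / (tn + fp)  # share of non-subsidence pixels detected
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_obs = accuracy
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


def rmse(observed, estimated):
    """Root-mean-square error between observed and estimated values."""
    squared = [(o - e) ** 2 for o, e in zip(observed, estimated)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical counts for a balanced validation sample.
metrics = classification_metrics(tp=25, fp=3, tn=25, fn=3)
print(metrics, rmse([1, 0], [0.5, 0.5]))
```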

The odds ratio (OR) is a statistical indicator that quantifies the strength of the association between two events, A and B (Szumilas, 2010). The OR is defined as the ratio of the odds of A in the presence of B to the odds of A in the absence of B, or equivalently as the ratio of the odds of B in the presence of A to the odds of B in the absence of A (Morris and Gardner, 1988). Therefore, if the OR is greater than 1, A and B are positively associated, whereas if the OR is less than 1, A and B are negatively associated, and the presence of one event reduces the odds of the other (Viera, 2008). Table 4 shows the values

The F-score is a measure of the accuracy of a test (Sasaki, 2007) and is commonly used to evaluate information retrieval and classification performance (Derczynski, 2016). It is calculated as follows:

$$F_1 = \frac{2}{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (13)$$

where F_1 is the harmonic mean of precision and recall (Chicco, 2020).
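As a small worked example of the harmonic-mean definition, the sketch below computes F_1 from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 25 true positives, 3 false positives, 3 false negatives.
f1 = f1_score(tp=25, fp=3, fn=3)
print(round(f1, 3))
```

When precision and recall are equal, as in this example, F_1 takes the same value as both of them.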

Frequency ratio analysis
To determine the relationship between the location of the subsidence and the conditioning factors, the frequency ratio (FR) method was used. The calculated FR values for each class of factors are presented in Figure 6. In the case of the lithology layer with 10 classes, the highest FR values were

groundwater exploitation, the lower the FR, so that the lowest FR value corresponds to class 8.95 (0.35).

were calculated. Hence, there is a clear relationship between the slope and elevation classes and the FR values: as elevation and slope increase, FR decreases.
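The FR of a class is commonly computed as the share of subsidence occurrences falling in that class divided by the share of the study area covered by the class. The sketch below assumes this usual definition. The function name, the slope class labels and the pixel counts are hypothetical and are not taken from Figure 6.

```python
def frequency_ratio(subsidence_in_class, pixels_in_class):
    """Frequency ratio per class of one conditioning factor.

    FR = (subsidence in class / all subsidence) / (class pixels / all pixels).
    Values above 1 indicate a class that is more prone to subsidence than
    its share of the study area would suggest.
    """
    total_subsidence = sum(subsidence_in_class.values())
    total_pixels = sum(pixels_in_class.values())
    return {
        name: (subsidence_in_class[name] / total_subsidence)
        / (pixels_in_class[name] / total_pixels)
        for name in pixels_in_class
    }


# Hypothetical slope classes (degrees) with made-up counts.
subsidence_counts = {"0-8": 50, "8-20": 12, ">20": 4}
pixel_counts = {"0-8": 4000, "8-20": 3000, ">20": 3000}
fr = frequency_ratio(subsidence_counts, pixel_counts)
print({name: round(value, 2) for name, value in fr.items()})
```

In this made-up example the FR falls as slope increases, which mirrors the kind of trend described in the text.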

Multi-collinearity analysis
In order to ensure the independent effect of each of the conditioning factors, multicollinearity testing is

into five categories of subsidence sensitivity: very low, low, moderate, high and very high (Figure 8 a-f). The spatial relationship of the subsidence locations with the sensitivity maps is shown in Table 5.

The odds ratio (OR) technique was used to compare the relative chance of subsidence in the models used in the study. The ratio between the correctly and incorrectly classified values was therefore calculated for both the training and validation groups (Figure 11). Based on the results of the odds ratios for the training group,

show that most of the subsidence is recorded in areas with altitudes of 1031 to 1089 and slopes of less than 8 degrees. Also, the LSS risk map based on the optimal model (DSC-ADTree) shows that 15% of the study area is in the high and very high risk range, which indicates a critical situation. In general, the extent of agricultural land, the excessive use of groundwater resources, drought and the numerous wells for agricultural use have depleted the groundwater storage, which has not been replenished.

and Visser et al. (2019), the contribution of a circular economy and the rational use of the planet's resources will avoid the desertification of the land. In semiarid climatic conditions, new management practices will help to achieve sustainability and to meet the Land Degradation Neutrality challenge (Keesstra et al., 2018).