Forecasting the Factor of Safety of Slope Stability Using Several Machine Learning Techniques

Because of the disasters associated with slope failure, the analysis and forecasting of slope stability are crucial for geotechnical engineers. In this work, six machine learning (ML) techniques were applied to forecast the factor of safety (FOS) of slopes: Gaussian process regression (GPR), support vector regression (SVR), decision trees (DT), long short-term memory (LSTM), deep neural networks (DNN), and K-nearest neighbors (KNN). A total of 327 slope cases in Iran with various geometric and shear strength parameters, analyzed by PLAXIS software to evaluate their FOS, were employed in the models. The K-fold (K = 5) cross-validation (CV) method was applied to evaluate the models' prediction performance. All the models produced acceptable results that were close to one another. However, the GPR model, with R2 = 0.8139, RMSE = 0.160893, and MAPE = 7.209772%, was the most accurate model for predicting slope stability. In addition, the backward selection method was applied to evaluate the contribution of each parameter to the prediction problem. The results showed that all the features considered in this study contribute significantly to slope stability, with the friction angle (φ) and the unit weight (γ) being the most effective and least effective parameters, respectively.


Introduction
In geotechnical engineering, the analysis of slope stability is of great importance. Economies have developed over time and, with that progress, the number of slopes is growing. As a result, landslides caused by slope instability have been among the three most catastrophic natural disasters in the world, alongside volcanoes and earthquakes. In order to minimize or avoid landslide failure, it is essential to understand the importance of slope stability analysis. Anticipating the stability of slopes is complicated because it does not depend on specified data or a single parameter; it depends on a variety of variables, such as slope angle, slope height, soil properties, and seasonally changing soil water content (Sakellariou and …). These researchers have done a remarkable job of tackling slope stability. However, the problem is that the techniques they built are not optimal and only give specific solutions for certain slopes. These models are not suitable for solving all slope stability problems completely, because of several drawbacks (Lin et al. 2018):
a) The methods used are not fully appropriate because of their shortcomings. For instance, the limit equilibrium method cannot consider the constitutive relationship of soil, and it needs to search for the most dangerous sliding surface. Moreover, the analysis is complicated for multi-layer soils. Therefore, this approach does not accurately elucidate safety and reliability during slope stability analysis.
b) The safety factor is not clearly defined. In general, slopes with a factor of safety (FOS) greater than 1.20 are considered safe. In spite of this, slopes with an FOS greater than 1.2 have failed in real engineering.
c) These methods are affected by subjective human factors.
In recent years, artificial intelligence and soft computing models have been broadly applied in geotechnical applications, e.g., tunnel resources (…). … revealed that the combination of SVM and PSO models is an effective analysis method for predicting slope stability. Li et al. (2013) and Samui (2008) … situations and observed that the estimation outcomes were precisely the same as the real situations. The model may be used to accurately evaluate slope stability, achieving high predictive performance and low misclassification with a low error rate.
The intelligent models mentioned above explain slope failures, but they are not adequate to solve them entirely. A particular technique may be suitable for some situations but not for others, and no distinction between intelligent modeling approaches has been observed in the prediction of slope stability. Therefore, the models' credibility and precision are not well established. Furthermore, cross-validation (CV) is an efficient and accurate approach to confirm the models' prediction performance (Lin et al. 2018), so it is crucial to recommend CV for the accuracy verification of optimized models.
In this work, six machine learning (ML) methods were used to forecast the slopes' FOS: Gaussian process regression (GPR), support vector regression (SVR), decision trees (DT), long short-term memory (LSTM), deep neural networks (DNN), and K-nearest neighbors (KNN). PLAXIS software is used to measure the FOS of 327 slope cases in Iran, with input parameters of unit weight (γ), cohesion (c), friction angle (φ), slope angle (α), slope height (H), and pore pressure ratio (r_u). These datasets are employed in the prediction models. The prediction output of the models is evaluated using the K-fold (K = 5) CV process. The FOS values analyzed using PLAXIS are compared to the values forecast by the ML models, and finally the best prediction model is specified.
Also, in order to show the effect of each input parameter on the FOS of a slope, the backward selection method is used. In this way, the most important parameters affecting the stability of slopes are identified. Fig. 1 presents the complete flowchart of the analysis.

Database and statistical analysis
Fig. 2 shows a slope geometry and the parameters affecting slope stability. In the boxplots of Fig. 3, the boxes mark the interquartile ranges, i.e., the spaces between the bottoms and tops. The sample median is the line that runs through the center of each box. For the boxplots of γ and c, the median is roughly in the middle of the box, while for the others it is not, indicating sample skewness. The whiskers are the lines extending above and below the boxes, spreading from the ends of the interquartile range to the farthest data within the whisker range. Referring to Fig. 3, the distance between the upper whisker and the upper quartile differs from the distance between the lower whisker and the lower quartile for the H, c, and r_u factors, whereas these distances are roughly the same for the γ, φ, and α factors. The figure also illustrates the distribution of the data. Observations beyond the whisker length are outliers. Several outlying cases are present in the H and c parameters, as depicted in Fig. 3, and an improvement in the models' performance could be achieved by removing them. Therefore, these outliers (17 slope cases) were eliminated, and 327 slope cases remained. Table 2 shows the outcomes of a simple descriptive statistical analysis of the reduced dataset.
Among the 327 datasets considered in this study, there are two slope stability modes: failure (168 cases) and stable (159 cases). The two modes are relatively balanced, as shown in Fig. 4.
In soft computing problems, prediction models' performance must be verified on new datasets to assess their general efficiency. Therefore, the entire dataset is subdivided into training and testing sets. The prediction model is trained, and the hyper-parameters are tuned, using the training datasets.
The testing datasets are used separately to measure the overall prediction efficiency of the models.
This article applies the K-fold CV (K = 5) method to divide the database into training and testing groups. The whole dataset is split into five equal categories, so that each time four categories are used as the training set and the remaining category is used as the testing set. In the K-fold CV method, all data therefore appear in both the training and testing sets, which gives more credibility to the predicted results. Furthermore, all the datasets were normalized to the range [-1, 1] before modeling in order to eliminate variations in the input and output parameters. Standardization of the dataset, by contrast, was not efficient, and the results were unfavorable due to the excessive spread of the data.
Statistical indices, such as the coefficient of determination (R2), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE), are used to determine the models' precision. They are calculated as follows (Eqs. 1-4):

R2 = [Σ(y_i − ȳ)(y'_i − ȳ')]^2 / [Σ(y_i − ȳ)^2 · Σ(y'_i − ȳ')^2]   (1)
RMSE = [(1/n) Σ(y_i − y'_i)^2]^(1/2)   (2)
MAPE = (100/n) Σ |(y_i − y'_i) / y_i|   (3)
MAE = (1/n) Σ |y_i − y'_i|   (4)

where y_i is the real value, y'_i is the forecasted value, ȳ is the mean of y, ȳ' is the mean of y', and n is the number of test datasets.
We chose SVR because:
The use of kernels, sparse solutions, and Vapnik-Chervonenkis (VC) regulation of the margin and of the number of support vectors distinguish SVR. One of the key benefits of SVR is that its computational complexity is independent of the dimensionality of the input space, so it requires less computation than other regression techniques. It also has good generalization, high prediction precision, and is resistant to outliers.
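As a minimal sketch (not the authors' code), the evaluation indices and the 5-fold split described above can be written with NumPy and scikit-learn; the feature matrix here holds random placeholder values, not the 327 slope cases:

```python
import numpy as np
from sklearn.model_selection import KFold

def r2(y, y_hat):
    """Coefficient of determination (1 - residual / total variation form)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y, y_hat):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return float(100.0 * np.mean(np.abs((y - y_hat) / y)))

# 5-fold CV: each round trains on 4 folds (~262 cases) and tests on 1 (~65)
rng = np.random.default_rng(0)
X = rng.random((327, 6))      # placeholder for γ, c, φ, α, H, r_u
y = 0.5 + rng.random(327)     # placeholder FOS values
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X)]
```

Every sample lands in exactly one test fold, so the 327 cases are covered once by the testing sets over the five rounds.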

We chose DT because:
DTs are a form of supervised learning algorithm that partitions the sample multiple times based on specific questions about the sample. They can play a significant role in prediction problems, being relatively simple to comprehend and very effective. DTs reflect a series of decisions with varying chances of occurring. Through this technique, the most substantial variables and the relations between two or more variables can be identified. In our problem, we have various variables that are related to each other, so we selected DT as one of the compared models. In other words, a decision tree has the advantage of forcing the analysis of all possible outcomes and tracing each path to a conclusion. It performs a thorough analysis of the implications of each branch and defines the required decision nodes.

Key advantages:
- There is no need to preprocess the data.
We chose DNN because:
The substantial advantage of DNN over its predecessors is that it automatically identifies essential features without the need for human intervention. In our case, we have no prior idea about the dependencies among the features; therefore, we use DNN as one of the comparators, to reduce the dimensionality of the datasets and observe the effect of this automatic dimensionality reduction on the results.
We chose LSTM because:
The data of our problem have a time-dependency feature, i.e., the variables depend on each other in terms of time. Therefore, we decided to use RNN-based models to consider this time dependency. When switching from an RNN to an LSTM (long short-term memory) network, ever more control knobs are introduced that regulate the flow and mixing of inputs according to trained weights. As a result, LSTM provides the most control and, consequently, better results, although this leads to an increase in complexity and operating costs. Consequently, in order to obtain the best accuracy, we chose LSTM as a replacement for a conventional RNN.
It is very important to understand the data so that the right algorithm can be chosen for the right problem. Some algorithms require only small sample sets, while others can work with very large ones; certain algorithms require numerical input, while others work with categorical data. Because our datasets are large, we chose to work with the LSTM algorithm.
In the analysis of slope stability, the GPR, SVR, and DT models are implemented in MATLAB 2018. The LSTM, DNN, and KNN algorithms are implemented in Anaconda version 3.6, a free, open-source application for scientific computation in Python and R that aims at simplifying package management and execution. These six predictive models are compared to determine the best predictive model for slope stability.
In the following subsections, details of the ML prediction models and their prediction results are presented.

GPR
A Gaussian process (GP) is a collection of random variables X_1, X_2, … for which any finite subset of the variables has a joint multivariate Gaussian distribution,
where the elements of m(x) are given by a prior mean function m(·), and k is the kernel function. The kernel takes two indices i and j and returns the covariance between the corresponding variables X_i and X_j. Given vectors of indices i and j, K returns the matrix of covariances between all pairs of variables where the first element of the pair comes from i and the second from j. Each X_i is marginally Gaussian, with mean m(x_i) and variance k(x_i, x_i) (Mahmoodzadeh et al., 2020b).
Suppose there is a function f(x) that we would like to estimate. In addition, suppose f cannot be observed directly, but a random variable Y can be observed that is indexed by the same domain as f and whose expected value is f, i.e., ∀x ∈ X, E[Y_x] = f(x). In particular, it is assumed that the prior belief about the function obeys a Gaussian process with prior mean m and kernel k. Furthermore, assume that Y is an observation of f(x) corrupted by zero-mean i.i.d. Gaussian noise, i.e., Y_x = f(x) + ε, where ε ~ N(0, σ²). Consequently, f(x) is a hidden variable whose posterior distribution can be inferred after observing samples of Y at various locations in the domain. This inference is called Gaussian process regression (Mahmoodzadeh et al., 2020c).
Let x be the set of observation points and y the resulting real-valued observations. We need to compute the posterior distribution at some new point x̂ ∈ X. This distribution is Gaussian, with mean and variance given by Eqs. 6 and 7.
In the Regression Learner app embedded in MATLAB 2018, four different kernel models are available for GPR: squared exponential, rational quadratic, exponential, and Matern 5/2. After modeling with this app, the model type with the most accurate results is selected. The optimization mode of the app is also used, so that the app itself optimizes the types and values of the GPR hyper-parameters.
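As an illustrative analogue of this procedure (a scikit-learn sketch rather than the MATLAB Regression Learner used in the study, fitted on random placeholder data), a GPR with the Matern 5/2 kernel and a Gaussian observation-noise term can be set up as follows:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical stand-in for the six scaled slope features and a toy FOS target
rng = np.random.default_rng(1)
X = rng.random((60, 6))                                   # γ, c, φ, α, H, r_u
y = 1.0 + X[:, 2] - 0.5 * X[:, 3] + 0.05 * rng.standard_normal(60)

# Matern 5/2 is one of the four kernels mentioned in the text;
# alpha plays the role of the observation-noise variance σ²,
# and the kernel hyper-parameters are optimized during fit()
gpr = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-2,
                               normalize_y=True).fit(X, y)

# Posterior mean and standard deviation at new points (cf. Eqs. 6 and 7)
mean, std = gpr.predict(X[:5], return_std=True)
```

The returned standard deviation is what distinguishes GPR from the other models here: each prediction comes with an uncertainty estimate from the posterior.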
The optimized types and values of the GPR hyper-parameters produced by the Regression Learner app are presented in Table 3.
The 5-fold CV results of slope stability predicted by the GPR model are depicted in Fig. 5. The R2 (Fig. 6), RMSE, and MAPE are obtained as 0.8139, 0.160892561, and 7.209771731%, respectively. For the database employed in this research, all the outcomes confirm the GPR model's high prediction accuracy.

SVR
SVR was introduced by Vapnik (1995). … The R2 (Fig. 8), RMSE, and MAPE are obtained as 0.6950, 0.20650649, and 11.89200416%, respectively. All of these results suggest that the SVR model is a somewhat appropriate predictive model for slope stability.
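A minimal sketch of an ε-insensitive SVR with an RBF kernel (an illustrative scikit-learn setup on placeholder data; the hyper-parameters are assumptions, not the optimized values of Table 4) also shows the [-1, 1] scaling applied to the dataset in this study:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the slope dataset
rng = np.random.default_rng(2)
X = rng.random((80, 6))
y = 1.0 + X[:, 2] - 0.5 * X[:, 3]

# Scale features to [-1, 1], as done for the inputs in this study
Xs = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# C controls the margin/penalty trade-off; epsilon is the insensitive tube width
svr = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(Xs, y)
pred = svr.predict(Xs)
```

Only the samples lying on or outside the ε-tube become support vectors, which is the sparsity property mentioned above.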

DT
The DT is a classification and regression method based on a non-parametric supervised learning technique. It consists of a set of if-then-else decision rules. The best prediction of the model occurs when the DT grows deeper to achieve the best fit with the actual data. The DT has several advantages. First, no assumption about the distribution of the explanatory variables is required. Second, strong relations among the independent variables do not affect the DT outcomes. Third, various types of dependent variables, such as survival, categorical, and numerical data, can be covered by the DT. Fourth, the technique retains the most powerful variables describing the dependent variable and eliminates the least powerful ones. The DT can predict small and large datasets well, even though it was initially developed only for large data (Mahmoodzadeh et al., 2020b).
The DT algorithm can be explained as follows:
1. First, the variance of the target is calculated.
2. Based on the various attributes, the database is divided into distinct parts, and the variance of each part is deducted from the variance prior to the division. This is defined as the variance reduction (VR). For a node it can be written as:

VR = Var(S) − (|S_t| / |S|) · Var(S_t) − (|S_f| / |S|) · Var(S_f)   (10)

where S is the group of samples that has not yet been separated, S_t is the group of separated samples with a true result, and S_f is the group of separated samples with a false result. Each of the summands in Equation 10 is in fact a variance estimate, written in a form that need not refer to the mean directly. The attribute chosen for the node is the one with the highest VR.
3. Depending on the values of the selected attribute, the datasets are separated.
If the variance of a part is greater than zero, it is separated once more.

The trials are repeated until all the data have been evaluated.
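The variance-reduction criterion of step 2 can be sketched as follows (a minimal NumPy illustration, not the authors' implementation):

```python
import numpy as np

def variance_reduction(parent, left, right):
    """VR = Var(S) - |S_t|/|S| * Var(S_t) - |S_f|/|S| * Var(S_f):
    the drop in target variance achieved by splitting the node samples
    (parent) into the true branch (left) and false branch (right)."""
    parent, left, right = map(np.asarray, (parent, left, right))
    n = len(parent)
    return float(np.var(parent)
                 - len(left) / n * np.var(left)
                 - len(right) / n * np.var(right))

# A perfectly separating split removes all variance within the children,
# so the whole parent variance (here 1.0) is recovered as reduction
vr = variance_reduction([1.0, 1.0, 3.0, 3.0], [1.0, 1.0], [3.0, 3.0])
```

At each node the candidate split with the largest `variance_reduction` value is chosen, and any child with non-zero variance is split again, exactly as in the steps above.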
In the DT approach, three model types (medium, coarse, and fine) are embedded in MATLAB 2018. The FOS predictions were performed with these three models, and the model providing the most precise results was eventually selected. Information about the optimized DT hyper-parameters considered in this analysis is provided in Table 5.
In Fig. 9, the FOS results obtained from the DT model are compared with the analyzed ones. As shown in Fig. 9, there are small differences between the predicted outcomes and the analyzed values. The MAE is 0.143; achieving such an error in predicting the FOS of a slope is good. The other statistical evaluation indices, R2 = 0.6830 (Fig. 10), RMSE = 0.210590831, and MAPE = 12.2080868%, also indicate the good accuracy of the DT model predictions.

LSTM
Deep learning can be modeled as an extended version of neural networks with several layers. The knowledge found within these deep networks is better retained than in traditional neural networks. A recurrent neural network (RNN) is a neural network containing loop connections, and knowledge persists across these loops. Such a network gathers inputs and data from prior steps, applies specified procedures, and provides the next output. Some applications may only need current knowledge, while others may also need additional information from the past. A learning lag exists in ordinary recurrent neural networks when the gap between the relevant past input and the point where it is required grows. However, LSTM networks are a special kind of RNN (Mahmoodzadeh et al., 2021b) capable of learning in these situations. Such networks are intentionally designed so that the long-term dependency problem is mitigated.
An essential function of LSTM is recalling knowledge over a long time. It is a natural choice, since the precision of the model permanently depends on the amount of prior knowledge. The basic LSTM module, called the repeating module, has a particular structure of four interacting neural network layers, as shown in Fig. 11. The module contains three gates with activation functions σ1, σ2, and σ3, and two output activation functions, Ф1 and Ф2, as seen in Fig. 11. The Σ and π symbols represent element-wise addition and multiplication, and a bullet (•) symbol describes concatenation. The principal element of the LSTM is the cell state, into which the memory of the previous block (S_{t−1}) is carried; the information then flows straight down the line.
In the network, the first layer determines how much of the prior knowledge (S_{t−1}) is allowed to flow, and Eq. 9 describes the operation carried out by this layer (Mahmoodzadeh et al., 2021c). Two further layers and their connections govern the mechanism of storing new data in the cell state: a sigmoid layer (σ2), which determines the value (I_t) to be updated (Eq. 10), and a tanh layer (Ф1), which creates a new candidate value C̃_t (Eq. 11). The two are combined and added to the state, and the cell state is finally updated by Eq. 12.
In order to optimize the LSTM model, it was tested with different types of parameters and changes in the number of neurons, layers, batch_size, epochs, and so on. Finally, the types/values of the LSTM parameters were optimized as in Table 6. The structure of the optimized LSTM network is shown in Fig. 12.
The FOS results of the LSTM model are compared with the analyzed FOS results in Figs. 13 and 14. Given that deep learning methods such as LSTM require much more data to learn, it can be said that, with the amount of data used in this article, the model still provided acceptable accuracy. The MAE for the LSTM results is 0.169; achieving such an error in predicting the FOS of a slope is good. The other statistical evaluation indices, R2 = 0.6616 (Fig. 14), RMSE = 0.221912033, and MAPE = 14.52803317%, also indicate good accuracy in the LSTM model predictions.
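The gate operations referred to by Eqs. 9-12 can be sketched as a single LSTM cell step in NumPy (these are the standard LSTM equations; the tiny random weights below are placeholders, not the optimized network of Table 6):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, s_prev, W, b):
    """One LSTM step: forget gate f_t, input gate i_t, candidate C̃_t,
    cell-state update S_t = f_t * S_{t-1} + i_t * C̃_t, output h_t."""
    z = np.concatenate([h_prev, x_t])       # • concatenation of h_{t-1} and x_t
    f_t = sigmoid(W["f"] @ z + b["f"])      # σ1: how much past memory to keep
    i_t = sigmoid(W["i"] @ z + b["i"])      # σ2: how much new info to write
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # Ф1: candidate values
    s_t = f_t * s_prev + i_t * c_tilde      # element-wise Σ and π on the cell state
    o_t = sigmoid(W["o"] @ z + b["o"])      # σ3: output gate
    h_t = o_t * np.tanh(s_t)                # Ф2: gated output
    return h_t, s_t

# Tiny example: 2 hidden units, 3 inputs (so the concatenated vector has 5 entries)
rng = np.random.default_rng(3)
W = {k: 0.1 * rng.standard_normal((2, 5)) for k in "fico"}
b = {k: np.zeros(2) for k in "fico"}
h, s = lstm_cell(rng.standard_normal(3), np.zeros(2), np.zeros(2), W, b)
```

The cell state `s` is the straight-through memory line described above; the three sigmoid gates only scale what enters, stays in, and leaves it.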

DNN
A DNN is a deep feed-forward neural network with more than one layer of hidden units between its inputs and outputs. The layers comprise an input layer, followed by the hidden layers, and finally the output layer, and each layer is connected to its neighboring layers. Being feed-forward means that there are no feedback loops: the links between the layers run one way only, forward. Furthermore, DNN is especially well suited to analyzing raw input data because it can recognize patterns and learn useful features without the need for rigorous feature engineering, data pre-processing, or hand-crafted rules. Moreover, its efficiency further improves as the amount of training data rises (Mahmoodzadeh et al., 2021b). DNNs have a much wider range of applications, from simple text creation to machine vision tasks, with early uses in automatic translation.
In order to optimize the DNN model, like the LSTM model, it was tested with different types of parameters and changes in the number of neurons, layers, batch_size, epochs, and so on. Finally, the types/values of the DNN parameters were optimized as in Table 7. The network structure of the optimized DNN model is illustrated in Fig. 15.
KNN
… In the classification process, the correct K for the data is selected by trying multiple values of K and finding the one that fits best (Mahmoodzadeh et al., 2020c). To implement the KNN method in this study, Anaconda version 3.6 is used. During the KNN model's predictive analysis, various values of K were evaluated, and the best predictions were obtained with K = 2.
For the KNN model, as shown in Fig. 18, the difference between the predicted FOS values and the analyzed results is low.
The MAE of 0.0913 indicates the high accuracy of the predictions made by the KNN model. The other statistical evaluation indices, R2 = 0.7080 (Fig. 19), RMSE = 0.205355281, and MAPE = 7.874855748%, also indicate the high precision of the KNN model in predicting slope stability. Thus, for the database employed in this paper, the KNN model is a significant and considerable method for slope stability prediction.
To determine the best prediction model among the six ML models used in this paper, a comparison between their results is made in Fig. 20 and Table 8. By analyzing and comparing the values of the statistical evaluation indices for each model, it can be concluded that almost all the models provide results close to each other, with little difference in accuracy. However, the highest accuracy is produced by the GPR model; therefore, the most acceptable results, differing little from the analyzed ones, are provided by the GPR model.
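The search over K described for the KNN model can be sketched as follows (an illustrative scikit-learn cross-validated grid on placeholder data; on the real 327 slope cases the study's search settled on K = 2):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Random stand-in for the slope dataset
rng = np.random.default_rng(4)
X = rng.random((100, 6))
y = 1.0 + X[:, 2] - 0.5 * X[:, 3]

# Try several neighbor counts and keep the K with the best mean 5-fold R2
scores = {k: cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                             cv=5, scoring="r2").mean()
          for k in range(1, 8)}
best_k = max(scores, key=scores.get)
```

Because K is chosen by cross-validated score rather than training fit, this guards against the trivially perfect (and useless) K = 1 training solution.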

Discussion
In this section, the generalization of the suggested Gaussian process regression methodology is discussed. Generalization characterizes the model's ability to interact and adapt to new information: a model that generalizes can ingest novel data and predict accurately on data not used during training. The basis of a model's success and its practical performance is its capability to generalize. When a model is trained too closely on the training data, it cannot generalize; when new data are given, the model makes inaccurate predictions and is worthless, even if it can accurately predict the training data. The model starts 'memorizing' the training data instead of 'learning'; this is known as overfitting.
Feature selection can be used to avoid overfitting of the model. Feature selection minimizes the number of features, which decreases the computational complexity of the model. Here, the stepwise approach is used to choose an important collection of features from the total features available in the datasets.
Stepwise regression methods can be classified into three strategies:
- The first strategy (forward selection) begins without any predictors in the model and iteratively adds predictors; it stops when the improvement in the results no longer has a statistically significant impact.
- The second strategy (backward selection) begins with all predictors in the model and periodically eliminates the least contributive predictors; it stops once all the predictors remaining in the model are statistically meaningful.
- The third strategy (stepwise selection) is a mixture of the forward and backward processes. It begins without any predictors and then sequentially adds the predictors that contribute most to the outcome (like forward selection).
While adding each new variable, those that no longer enhance the model's fit are removed (like backward elimination).
In this research, the stepAIC function [MASS package] was used, which selects the best design by AIC. The function has an option called direction, which takes the values: i) forward (forward selection); ii) backward (backward elimination); iii) both (sequential replacement, i.e., combined forward and backward elimination). The best final model is returned. In R, stepAIC is among the most popular search methods for selecting features; it continuously attempts to reduce the AIC value to arrive at the final feature set. In the tables reporting the findings, three asterisks (***) reflect a highly significant p-value. A small p-value for the intercept and slope lets us reject the null hypothesis and establish a good relationship between the two measured variables (the target and the predictor variables). A p-value of about 5% or less is a good cut-off point for most situations.
In the first step, we fit the model with all predictors and the target. The results of the first step of feature selection are presented in Table 9. As shown in Table 9, the effect of all the features on slope stability is significant; in this case, we do not delete any of the features, and we stop at this step. According to Table 9, the most effective parameters on slope stability with respect to the p-values are φ and c, while the parameter γ has the least impact on slope stability.
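In the spirit of stepAIC, backward elimination by AIC can be sketched in NumPy (an illustrative re-implementation under the usual Gaussian-likelihood AIC, not the MASS code used in this paper; the data and feature names are hypothetical):

```python
import numpy as np

def aic(X, y):
    """AIC of an OLS fit with intercept: n * log(RSS / n) + 2 * (p + 2),
    counting the coefficients, the intercept, and the error variance."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return n * np.log(rss / n) + 2 * (Xd.shape[1] + 1)

def backward_select(X, y, names):
    """Repeatedly drop the feature whose removal lowers AIC the most;
    stop when no removal improves the AIC (backward elimination)."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        base = aic(X[:, keep], y)
        trials = {j: aic(X[:, [k for k in keep if k != j]], y) for j in keep}
        j_best = min(trials, key=trials.get)
        if trials[j_best] >= base:
            break                      # every remaining feature earns its keep
        keep.remove(j_best)
    return [names[k] for k in keep]

# Toy data where only the first two features matter and the third is noise
rng = np.random.default_rng(5)
X = rng.standard_normal((200, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.standard_normal(200)
selected = backward_select(X, y, ["phi", "c", "gamma"])
```

Features whose removal would sharply raise the AIC (here the stand-ins for φ and c) always survive, mirroring the outcome reported in Table 9 where every slope feature proved significant.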

Conclusions
This study proposed six ML models to predict slope stability. A total of 327 datasets analyzed using PLAXIS software, including the six input parameters γ, c, φ, α, H, and r_u and one target, the FOS, were employed in the models. The 5-fold CV method was used to evaluate the prediction performance of the models.
All the models produced acceptable results that were close to one another. However, the GPR model, with R2 = 0.8139, RMSE = 0.160893, and MAPE = 7.209772%, was the most accurate model for predicting slope stability.
The backward selection method was used to assess the contribution of each parameter to the prediction problem. The results showed that all the features contribute significantly to slope stability. However, the features φ and γ were the most effective and least effective parameters on slope stability, respectively.

Conflict of interest
There is no conflict of interest.

Figure captions
Fig. 1 Overall procedure of slope stability prediction using ML techniques.
Fig. 2 A slope geometry and the effective parameters on the stability of slopes.
Fig. 3 The box graph of six factors.
Fig. 4 Pie chart of all slope cases.
Fig. 5 Comparison of the FOS results predicted by the GPR model with the analyzed ones.
Fig. 6 FOS results produced by the GPR model versus the analyzed results.
Fig. 7 Comparison of the SVR results with the analyzed ones.

Table captions
Table 1 Earlier studies on slope stability predictions using soft computing approaches.
Table 2 Statistical features of the remaining slope cases.
Table 3 The optimized parameters of the GPR model.
Table 4 The optimized parameters of the SVR model.
Table 5 The optimized model type and hyper-parameters of the DT method.
Table 6 The type/value of the optimized LSTM parameters.
Table 7 The type/value of the optimized DNN parameters.
Table 8 Comparison of the statistical evaluation indices produced by the prediction models.
Table 9 The first step of feature selection.