Using short-interval landslide inventories to build short-term and overall spatial prediction models for earthquake-triggered landslides based on machine learning for the 2018 Lombok earthquake sequence

During an earthquake sequence, landslides often recur at the same sites. Understanding the spatial distribution of the landslides triggered by the first earthquake can help predict the landslide susceptibility for subsequent shocks over the short term. This study used two landslide inventories from the 2018 Lombok earthquake sequence in Indonesia to construct a short-term secondary-disaster prediction model and an overall spatial prediction model using four machine learning algorithms. The average predicted probability of the positive samples under the overall prediction model was 7.1% lower than that under the short-term model. The highest accuracy of the overall prediction model was 10.5% higher, on average, and its area under the ROC curve (AUC) score was 8.1% higher, on average, but the corresponding probability thresholds were lower. The reason for this difference is that, in the short-term prediction model, most of the landslides in the first landslide inventory failed two or more times under the effect of multiple earthquakes, so the prediction results have a high positive rate. This feature makes the short-term prediction model suitable for guiding landslide rescue during a sequence of earthquakes. In contrast, the overall prediction model better represents the spatial distribution of the earthquake-triggered landslides in the area.


Introduction
As one of the most destructive geological disasters on Earth, earthquakes commonly induce large numbers of landslides (Gorum et al., 2011; Li et al., 2014). For example, on May 12, 2008, the Wenchuan earthquake in China (Mw 7.9) triggered hundreds of thousands of landslides, causing heavy casualties and economic losses (Xu et al., 2013). As early as the 1970s and 1980s, studies were conducted on the mechanism, process, and impact of earthquake-induced landslides (Pain, 1972; Garwood et al., 1979; Pearce et al., 1986). Since the 1999 Chi-Chi earthquake in Taiwan, many studies of the mechanisms and distribution characteristics of earthquake-triggered landslides have been conducted (Shou et al., 2003; Lin et al., 2004; Tang et al., 2009). These studies have revealed the mechanisms of earthquake-induced landslides and the landslide distribution characteristics from different perspectives.
Since the 1980s, much effort has been devoted to investigating the relationship between earthquake-triggered landslides and possible impact factors (Keefer, 1984; Rodriguez et al., 1999; Mahdavifar et al., 2006). These studies have shown that the factors with the greatest influence on earthquake-triggered landslides are topographical factors (Yagi et al., 2009; Yuan et al., 2015; Serey et al., 2019), geological structure (Mahdavifar et al., 2006; Sato et al., 2007; Huang et al., 2009), and ground shaking parameters (Meunier et al., 2007; Nowicki et al., 2014; Huang et al., 2021). Many scholars have used remote sensing images from before and after an earthquake to identify co-seismic landslides and generate earthquake-triggered landslide inventories (ETLIs) (Sato et al., 2007; Wartman et al., 2013; Tanyas et al., 2017; Valkaniotis et al., 2018; Ferrario, 2019). Against this background, some scholars have constructed spatial prediction models for earthquake-triggered landslides using machine learning methods and obtained good results. For example, Niu et al. (2014) used an improved support vector machine model to identify and predict the landslides triggered by the 2013 Lushan earthquake in China and evaluated the landslide susceptibility in the area of high seismic intensity. Chuang et al. (2021) selected the 1999 Chi-Chi earthquake in Taiwan and used the logistic regression (LR) algorithm for machine learning modeling. They then tested the model using data from the 1998 Jueili earthquake in Taiwan and gave suggestions for the probability threshold. Fan et al. (2021) used a random forest (RF) model to track the landslides that occurred in the ten years after the 2008 Wenchuan earthquake and established multi-temporal landslide susceptibility maps.
However, in areas of high seismicity there are often earthquake sequences, or a large earthquake followed by many aftershocks over a short period. An ETLI then results from the continuous influence of multiple shocks, and many slopes can suffer more than one failure under the impact of numerous shocks. For instance, Ferrario (2019) constructed two landslide inventories for the landslides that occurred in the first and second halves of the 2018 earthquake sequence in Lombok, Indonesia. However, previous studies used the landslides mapped after the earthquake sequence to build an overall spatial prediction model and did not consider predicting repeated landslides within the sequence. If an ETLI is generated soon after the first shock, can we use machine learning to predict the possibility of secondary landslides once subsequent shocks occur over a short period? Is there any difference between machine learning models for short-term landslide prediction and for overall spatial prediction?
In response to the above questions, we chose the two ETLIs for the 2018 Lombok sequence built by Ferrario (2019) and used two different strategies for machine learning modeling. One approach was to use the ETLI for the first half of the earthquake sequence for training and the ETLI for the second half for testing. The other was to merge the two ETLIs and use tenfold cross-validation to evaluate the prediction results. Terrain information, vegetation coverage, and seismic parameters were used as input factors for the model training. We selected three different types of commonly used machine learning methods, namely logistic regression (LR), random forest (RF), and neural network (NN), to eliminate differences between the results of the two strategies caused by the algorithm. For the NN, we used both a basic single-hidden-layer artificial neural network (ANN) and a multi-layer deep neural network (DNN), giving a total of four algorithms. Thus, eight machine learning models were established for the two strategies using these four algorithms. In this paper, we compare the differences in the prediction results of the four models under the two strategies and analyze the reasons for those differences.

Study area
Lombok is located in southern Indonesia, at the junction between the Sunda Block and the Australian Plate (see Fig. 1a). The area has a tropical rainforest climate with warm and rainy weather all year round. Rainfall is concentrated from November to March, and the dry season runs from April to October (Ferrario, 2019). The geological structure of Lombok is complex, and it is an active crustal region. The Flores Thrust, located to the north of the islands, marks a transition zone between subduction of oceanic crust and continental collision (Silver et al., 1983; Jones et al., 2014).
Our study area is located between 8°35′-8°14′S and 116°00′-116°42′E, around the Mount Rinjani volcano (see Fig. 1b). Mount Rinjani is 3726 m above sea level, and the average slope of the surrounding area is about 16°. In the center of the volcano is Segara Anak Lake, which was formed by a volcanic eruption (Lavigne et al., 2013). Figure 2 shows the basic geographic information of the Lombok area. The terrain is the one arc-second resolution raster digital elevation model (DEM) obtained by the Shuttle Radar Topography Mission (SRTM, Accessed Sep 2021), which was further processed to derive slope, aspect, and curvature. Valley lines were extracted from the DEM and were considered as rivers or ditches (see Fig. 2f). To investigate the vegetation coverage in the study area, we used Landsat 8 Operational Land Imager (OLI) optical images (30-m resolution) covering Lombok to calculate the normalized difference vegetation index (NDVI). Because of the active local water vapor, there are few complete images without cloud cover. To reduce the influence of seasonal differences on vegetation growth, we selected three images with good observation conditions close to the season in which the earthquakes occurred, dated August 26, 2017, May 09, 2018, and July 06, 2018. The three images were stitched together and processed to obtain an NDVI product with almost no cloud occlusion.
Fig. 1 Study area map. a Geographical location of the study area. b Earthquakes above Mw 6.0 that occurred during July to August 2018. The earthquake and fault information was obtained from the U.S. Geological Survey database (USGS, Accessed Sep 2021)
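For illustration, the NDVI computation applied to the stitched mosaic can be sketched as below; the snippet assumes the OLI near-infrared (band 5) and red (band 4) reflectance rasters have already been read into arrays, and the values shown are purely illustrative:

```python
import numpy as np

def ndvi(nir, red, eps=1e-12):
    """NDVI = (NIR - Red) / (NIR + Red); eps guards against division by zero
    over water or deep-shadow pixels where both bands are near zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

# Toy 2x2 reflectance tiles standing in for the 30-m OLI rasters.
nir = np.array([[0.40, 0.35], [0.10, 0.50]])
red = np.array([[0.10, 0.05], [0.08, 0.10]])
print(ndvi(nir, red))
```

Dense vegetation yields values near 1, bare slopes and fresh landslide scars values near 0 or below.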

Earthquake events and landslide inventories
From July to August 2018, a series of earthquakes occurred in the Lombok area over a period of less than one month. The seismic sequence started on July 28, 2018, with an Mw 6.4 earthquake. As of August 25, there had been 30 earthquakes above Mw 5.0, nine above Mw 5.5, and four above Mw 6.0, as listed in Table 1 (USGS, Accessed Sep 2021). The epicenters were concentrated in the northeastern coastal area of the island (see Fig. 1b).
This series of earthquakes made the slopes around the volcano vulnerable, resulting in a large number of landslides. Ferrario (2019) identified the earthquake-triggered landslides by comparing high-resolution PlanetScope optical images from July 02, 2018, August 08, 2018, and September 13, 2018. Two landslide inventories were constructed: an inventory of the landslides triggered by earthquakes before 08/08/2018 (named "LI1"), and an inventory of the landslides that occurred during the whole earthquake sequence (named "LI2") (see Fig. 3). According to Ferrario (2019), the landslides in LI1 were mainly triggered by the Mw 6.9 earthquake on August 5, 2018, and the newly mapped landslides in LI2 were mainly triggered by the two earthquakes (Mw 6.3 and Mw 6.9) that took place on August 19, 2018. Additionally, landslides already present in LI1 may have been reactivated by the subsequent earthquakes, so there may be many secondary landslides in LI2. All of these landslides were considered positive samples in this experiment; they resulted from the continuous influence of multiple earthquakes over a short period. LI1 followed two major earthquakes, and LI2 followed four, not counting the other small (< Mw 6.0) earthquakes.
We selected three shaking variables, the modified Mercalli intensity (MMI), peak ground acceleration (PGA), and peak ground velocity (PGV), as the influencing factors of earthquake-induced landslides. In order to investigate the comprehensive influence of multiple earthquakes, we took each of these three modeling parameters as the logarithm of the product of the corresponding parameters of the earthquakes preceding each landslide inventory. For LI1, we took:

MMI* = lg(MMI1 × MMI2), PGA* = lg(PGA1 × PGA2), PGV* = lg(PGV1 × PGV2) (1)

For LI2, we took:

MMI* = lg(MMI1 × MMI2 × MMI3 × MMI4), PGA* = lg(PGA1 × PGA2 × PGA3 × PGA4), PGV* = lg(PGV1 × PGV2 × PGV3 × PGV4) (2)

where the subscripts represent the indices of the four earthquakes. The logarithm was taken to avoid overly large differences in these variables between LI1 and LI2. Table 2 lists the details of the data used in this experiment. The landslide polygon data were rasterized to 30-m resolution pixels. The positive samples are the pixels of the landslide areas. Since the non-landslide area was much greater than the landslide area, we randomly selected a batch of pixels in the non-landslide area as negative samples, about twice the number of positive samples. The numbers of positive samples in LI1 and LI2 were 5198 and 10,734, respectively. The distribution of each factor in LI1, LI2, and the whole area is shown in Table 3.
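The multiply-then-take-logarithm scheme for the combined shaking parameters can be sketched as follows; the per-event PGA values are hypothetical, and the base-10 logarithm is an assumption (the text does not state the base):

```python
import numpy as np

def combined_log_param(values):
    """Logarithm (base 10 assumed) of the product of the per-event
    shaking values, following the scheme described above."""
    return float(np.log10(np.prod(values)))

# Hypothetical PGA values (g) at one pixel for the four major earthquakes.
pga = np.array([0.30, 0.45, 0.25, 0.40])

pga_star_li1 = combined_log_param(pga[:2])  # LI1: first two earthquakes
pga_star_li2 = combined_log_param(pga)      # LI2: all four earthquakes
```

Because log of a product equals the sum of logs, this is equivalent to summing the per-event log-parameters, which keeps the LI1 and LI2 factor ranges comparable.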

Logistic regression
Logistic regression (LR) is a widely used machine learning algorithm that combines linear regression with classification mapping (Peng et al., 2002; Yesilnacar et al., 2005). LR constructs a linear combination of variables as the decision boundary and then maps the result to (0, 1) through a nonlinear function to classify the samples. We use the sigmoid function as the mapping function, for which the basic form is as follows:

P = 1 / (1 + e^(-z)) (3)

where P represents the probability of the output result being positive, and z is a linear combination of the factors:
z = W^T X (4)

in which X is the vector of the input factors, and W is the weight vector of the factors.
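The sigmoid mapping and linear combination described above amount to a few lines of NumPy; the weights below are illustrative placeholders, not fitted values:

```python
import numpy as np

def sigmoid(z):
    """Map the linear combination z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lr_predict_proba(X, w):
    """Landslide probability for each row of X under weight vector w.
    X: (n_samples, n_factors) normalized factors; w: (n_factors,)."""
    return sigmoid(X @ w)

# Illustrative, unfitted weights for three factors (e.g. slope, elevation, PGA*).
w = np.array([1.2, 0.8, -0.3])
X = np.array([[0.5, 0.2, 0.1], [-0.4, -0.1, 0.3]])
probs = lr_predict_proba(X, w)
```

In practice the weights are estimated by maximizing the likelihood over the training samples.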

Random forest
The random forest (RF) algorithm, developed by Breiman (2001), is a robust machine learning algorithm based on ensemble learning and is widely used in classification and regression problems. RF implements modeling and prediction through a set of decision trees (DTs). Each DT makes its own classification based on the selected control factors. Following the bagging algorithm (Bauer et al., 1999), all DT classification results are used in the voting, and the final voting result is the RF classification result. In RF, two main parameters need to be determined: the number of trees (n) and the number of predictive variables (m) used by the nodes of each tree. As parameter n increases, the classification results tend to converge. In this experiment, after preliminary analysis, we found that the accuracy of the classification results was no longer significantly improved when n exceeded 60, so we took n as 60. Parameter m is approximately the arithmetic square root of n and was set to 8 (Chen et al., 2020) (Fig. 4).
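The voting step at the heart of RF can be illustrated with a minimal sketch; the tree outputs below are made up (in the actual experiment, n = 60 trees and m = 8 variables per node were used):

```python
from collections import Counter

def rf_vote(tree_labels):
    """Final RF classification = majority vote over the decision-tree labels."""
    return Counter(tree_labels).most_common(1)[0][0]

# Toy output of n = 60 trees for one sample: 42 trees vote "landslide" (1).
votes = [1] * 42 + [0] * 18
label = rf_vote(votes)
```

The fraction of trees voting 1 (here 42/60 = 0.7) also serves as the predicted landslide probability for the threshold analysis later in the paper.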

Neural network algorithms
Neural network (NN) algorithms are a well-developed machine learning method used for regression analysis and classification problems. NN algorithms update the parameters of the model through error backpropagation. They calculate the gradient of the loss function by derivation and then adjust the weight and offset of each node so that the output result decreases along the gradient direction, and a convergence result is iteratively obtained (Lee, 2004;Pradhan et al. 2010). The backpropagation algorithm has two stages in each iteration of the training process: forward forecast and backward update. The model input data are run forward to calculate the probability of the sample being a landslide, and are compared with the actual result. The error is then backpropagated, and the parameters of each layer are re-estimated. This study used single and multiple hidden layers to build the NN and DNN models, respectively. An NN model includes an input layer, a hidden layer, and an output layer. These layers are connected by linear combinations and nonlinear activation functions. The performance of an NN model is related to the network structure, including the number of layers and nodes, the activation function, and the update method of the parameters (Haykin, 1998). The number of nodes in the input layer is the same as the number of selected factors, and they are usually set as the normalized values of the factors. The connection of nodes between adjacent layers is completed by a linear connection and a nonlinear activation function. Each linear connection contains two undetermined parameters, the weight and bias, which also need to be initialized and updated for NN modeling. In the forward steps, the output result is compared with the target value, and the difference between them is used to estimate the loss. Then, the loss is passed to the front in the backward process, and the parameters are re-estimated. 
In this experiment, a basic single hidden layer NN model was selected, with the hidden layer containing 20 nodes. In order to classify the results, we chose the sigmoid function as the activation function of the output layer.

A DNN increases the number of hidden layers in the neural network to two or more. The nodes in each layer are connected to the nodes in the previous layer through a linear combination and an activation function. Generally speaking, increasing the number of hidden layers and nodes is conducive to better model performance. However, to prevent the model from overfitting, the number of parameters should not greatly exceed the number of samples. In this experiment, after testing different numbers of layers and nodes, we used three hidden layers, with 16, 64, and 4 nodes, respectively; more layers and nodes did not consistently achieve better prediction results. The sigmoid and cross-entropy functions were selected as the classifier and the loss function, respectively (Fig. 5).
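A forward pass through the stated hidden-layer widths (16, 64, 4) can be sketched as below; the nine input factors and the tanh hidden activation are assumptions of ours, since neither is specified in the text, and the weights are random (untrained):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed layer widths: 9 input factors -> 16 -> 64 -> 4 -> 1 output.
sizes = [9, 16, 64, 4, 1]
params = [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    """Linear combination + activation per layer; the sigmoid on the
    output layer yields the landslide probability in (0, 1)."""
    h = x
    for i, (W, b) in enumerate(params):
        z = h @ W + b
        h = np.tanh(z) if i < len(params) - 1 else sigmoid(z)
    return h

p = forward(rng.normal(size=9), params)
```

Training would run this forward pass, compare the output with the label via cross-entropy, and backpropagate the error to update each (W, b).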

Training and testing strategies
We adopted two approaches to training and validation from two different perspectives. The first approach was to use LI1 for training and LI2 for testing (named "Strategy-1" or "S1" for short). This is because, in a series of earthquakes occurring over a short period, many areas typically experience new or repeated landslides following the subsequent shocks. This approach predicts possible secondary landslides over the short term based on the first landslide inventory in the earthquake sequence. The other approach was to merge LI1 and LI2 into an overall landslide inventory, which was then divided into training and test data in proportion (named "Strategy-2" or "S2" for short). The overall landslide inventory of an earthquake sequence reflects the tendency for earthquake-affected landslides in the area. This approach provides a spatial reference for long-term earthquake-triggered landslides and is similar to the one used in the existing research cases mentioned in the introduction. In this approach, we divided the merged landslide catalog into ten equal parts and performed tenfold cross-validation.
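The two data-splitting strategies can be sketched as follows (array shapes and contents are placeholders; the fold count follows the tenfold scheme above):

```python
import numpy as np

def strategy1_split(X_li1, y_li1, X_li2, y_li2):
    """S1: train on all LI1 samples, test on all LI2 samples."""
    return (X_li1, y_li1), (X_li2, y_li2)

def strategy2_folds(X, y, k=10, seed=0):
    """S2: merge the inventories, then yield k-fold cross-validation splits."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield (X[train], y[train]), (X[test], y[test])

# Toy merged dataset: 100 samples, 9 factors.
X = np.zeros((100, 9))
y = np.zeros(100, dtype=int)
splits = list(strategy2_folds(X, y))
```

Each of the ten folds serves once as test data while the other nine are used for training, and the reported S2 metrics are averaged over the ten runs.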
The performance of the models was measured by the overall accuracy, precision, recall, and the area under the receiver operating characteristic (ROC) curve (AUC):

Accuracy = (TP + TN) / (TP + TN + FP + FN) (5)
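These indicators can be computed directly from the confusion counts; a minimal sketch:

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from the TP/TN/FP/FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

Precision penalizes false alarms (FP), while recall penalizes missed landslides (FN); the trade-off between them is governed by the probability threshold discussed below.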

Result analysis and model evaluation
The predicted landslide probabilities under S1 and S2 are shown in Figs. 6 and 7, respectively. The prediction target of S1 was LI2, and that of S2 was the overall landslide inventory after decile division. The figures show that the spatial distributions of the probability predicted by the four models under the two strategies are consistent with the actual landslide distribution, and that the four types of models produce similar overall distributions of landslide probability. Setting aside the differences in the performance of the four algorithms, the prediction results of S1 and S2 are approximately the same in spatial distribution. However, when we verified the prediction results against the test data, we found that the results of S1 and S2 were quite different (Figs. 6 and 7). Table 4 lists the average probability for landslides and non-landslides as predicted by the four models. It shows that, compared with S1, the average probability under S2 is lower for both landslide and non-landslide samples: by 7.1% for landslide samples and by 14.9% for non-landslide samples (relative to S1).
The ROC curves are used to investigate the performance of the models under S1 and S2 (see Fig. 8). The ROC curves under S2 are drawn from the average true positive rate (TPR) and true negative rate (TNR) values of the 10-fold cross-validation. We use the AUC to evaluate the overall performance of a model; a value close to 1 indicates good performance. The AUC scores of the four models under S1 are 0.7602, 0.8480, 0.7954, and 0.8346, rising to 0.8337, 0.9076, 0.8519, and 0.9126 under S2, respectively. The prediction performance of each model under S2 is better than that under S1, and the AUC score is 8.1% higher, on average.
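The AUC can be approximated by sweeping a probability threshold, collecting the resulting ROC points, and integrating with the trapezoid rule; this is a generic sketch, as the paper does not describe its exact computation:

```python
def roc_auc(y_true, probs, n_steps=200):
    """Approximate AUC: sweep a threshold over [0, 1], record the
    (FPR, TPR) point at each step, and integrate by the trapezoid rule."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    pts = []
    for i in range(n_steps + 1):
        t = i / n_steps
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= t)
        pts.append((fp / neg, tp / pos))
    pts.sort()
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A perfect ranking of positives above negatives gives an AUC of 1, and a random ranking gives about 0.5, which is why the AUC comparison is independent of any single threshold choice.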
The probability threshold (p-threshold) was varied from 0 to 1 in steps of 0.05. We recorded the precision and recall corresponding to the highest accuracy reached while adjusting the p-threshold, as summarized in Table 5. The accuracy of the four models under S2 is again clearly higher, by an average of 10.5%, than that from S1. The precision and recall under S2 are also significantly higher than those under S1. However, considering that precision and recall depend on the p-threshold, and that the p-thresholds at which each model reaches its highest accuracy differ, the comparison of precision and recall is of limited practical importance here. We discuss this issue further in Sect. 5.1. It is worth noting that when the models under S1 reach their highest accuracy, the corresponding p-thresholds are all 0.7, while this drops to 0.6 under S2.
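The threshold sweep can be sketched as follows (the labels and probabilities are illustrative):

```python
def best_threshold(y_true, probs, step=0.05):
    """Sweep the p-threshold from 0 to 1 in the given step and return the
    (threshold, accuracy) pair with the highest accuracy (first on ties)."""
    best_t, best_acc = 0.0, -1.0
    n = round(1.0 / step)
    for i in range(n + 1):
        t = i * step
        pred = [1 if p >= t else 0 for p in probs]
        acc = sum(p == y for p, y in zip(pred, y_true)) / len(y_true)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

bt, ba = best_threshold([0, 0, 1, 1], [0.1, 0.4, 0.6, 0.9])
```

The same sweep, applied per model, yields the per-model optimal p-thresholds (0.7 under S1, 0.6 under S2) reported above.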
Precision = TP / (TP + FP) (6)

Recall = TP / (TP + FN) (7)

Judging from the precision, accuracy, and recall of the 10-fold cross-validation (Fig. 9), the trends of each indicator are generally similar for the same model, while those for the different models appear very different. This suggests that the four models are independent of each other. Therefore, we believe that the similar differences of these models between S1 and S2 are caused by the different training and test data selected by the two strategies. The reason for this is discussed in Sect. 5.1.

Importance of factors
To evaluate the importance of each factor for the models, we conducted another series of training and testing. We used the accuracy of the previous models as the reference accuracy and removed the variables one at a time. The remaining variables were then used to carry out the modeling, and we compared the final accuracy with the reference accuracy. The largest difference in maximum accuracy before and after excluding a variable was used as a reference, and the differences for the other factors were scaled proportionally, so as to compare the importance of each factor to the accuracy of the model. The importance of each factor in each model is calculated by Eq. (8):

w_i^k = (Accuracy_R^k - Accuracy_i^k) / max_j (Accuracy_R^k - Accuracy_j^k) (8)

where k refers to a given model (LR, RF, NN, or DNN) and i is the factor index. w_i^k represents the importance of factor i to model k, Accuracy_i^k is the accuracy of model k after removing factor i, and Accuracy_R^k is the reference accuracy of model k. The importance of each factor in each model is listed in Table 6. It shows that, in most cases, slope and elevation are the most important factors. The distance to rivers/ditches also shows relatively high importance. This is because the occurrence of landslides shows strong spatial characteristics; that is, most landslides occur on the sides of valleys, in the middle of the mountainside. For the two strategies, S1 and S2, the importance of each factor is similar. The aspect, curvature, and NDVI are much less important to the models, especially the curvature, which has almost no effect. The value distributions of these factors differ relatively little between landslides and non-landslides (see Table 3).
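Assuming "scaled proportionally" means dividing each accuracy drop by the largest drop, the leave-one-factor-out scheme of Eq. (8) can be sketched as below (the accuracies are illustrative, not the paper's values):

```python
def factor_importance(ref_acc, acc_without):
    """Importance of each factor = accuracy drop when the factor is removed,
    scaled so the largest drop equals 1 (our reading of Eq. 8)."""
    drops = {f: ref_acc - a for f, a in acc_without.items()}
    max_drop = max(drops.values())
    return {f: d / max_drop for f, d in drops.items()}

# Illustrative accuracies after removing each factor (not the paper's numbers).
imp = factor_importance(0.90, {"slope": 0.80, "elevation": 0.82,
                               "curvature": 0.899})
```

Removing an influential factor such as slope costs the model far more accuracy than removing curvature, which is exactly the pattern Table 6 reports.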
By averaging the importance of each factor across the models and sorting, we obtained the factor importance ranking shown in Fig. 10. Slope and elevation are the two most important factors, each with an importance of more than 0.9. The importance of distance to rivers/ditches is also above 0.5. The earthquake-related factors are of moderate importance: MMI*, PGA*, and PGV* have importance of 0.45, 0.54, and 0.50, respectively.

Analysis of the differences from two strategies
The results of the two machine learning strategies (S1 and S2) show remarkable differences, as described in Sects. 4.1 and 4.2. Specifically, for each model, the difference in the results has three main manifestations:
(a) The mean probability of landslide samples from S2 is lower than that from S1.
(b) The highest accuracy and AUC score of the models from S2 are higher than those from S1.
(c) The p-threshold with the highest accuracy from S2 is lower than that from S1.
Since precision and recall depend on the value of the p-threshold, we list in Table 7 all indicators for p-threshold = 0.6-0.75 to compare S1 and S2 at different p-threshold values. Comparing the results from S1 and S2 for each model, the accuracy and precision from S2 are higher than those from S1, while the recall from S2 is lower. This means that the models from S2 have a poorer recognition ability for real landslides, even though the overall accuracy of the S2 predictions is higher than that of S1.
According to the characteristics of the training and test data, we can infer the following reasons. Firstly, Fig. 3 shows that the landslide distributions of LI1 and LI2 overlap to a high degree. We computed statistics on the landslide pixels: in LI2, over 44% of the landslide pixels were already landslide samples in LI1, and many of the new positive samples were reactivations of landslides in LI1. This large number of overlapping positive samples results in the models trained with LI1 having a relatively high positive rate for LI2. Secondly, the number of earthquakes plays an important role in causing new landslides and reactivating old ones. The test data of S1 (LI2) were affected by more earthquakes, which makes the predicted landslide probability higher; thus, non-landslide samples with geographic conditions similar to the landslide samples also receive higher probabilities, increasing the false positive rate (FPR) under S1. Thirdly, the number of samples used for training in LI1 is smaller than in LI2, which also contributes to the model's poorer performance in S1. Besides, the epicenters of the two earthquakes on August 19, 2018 were to the east of those of the previous earthquakes, which shifted the affected area eastward. Therefore, there is an error in predicting the landslides of LI2 using the model trained on LI1, which lowers the accuracy and AUC of S1. In practice, moreover, it is difficult for the first landslide inventory to cover all the landslides occurring during the entire earthquake period. Furthermore, due to the higher FPR, the p-threshold has to take a higher value to obtain the best accuracy, which is why the p-threshold under S1 is higher than that under S2. In contrast, under S2, the models are built using the overall landslide inventory, with a uniform distribution of the training and test data, and the dataset contains all landslides induced by the whole earthquake sequence.
Therefore, S2 can better represent the level of landslide susceptibility to earthquakes in the study area.

Respective advantages of two strategies in practical applications
These two training strategies have different practical values. On the one hand, the training and prediction data of S1 are composed of two landslide inventories from a series of earthquakes that occurred over a short period. In areas with frequent earthquakes, not only can strong earthquakes trigger many landslides, but aftershocks or subsequent shocks can also do so over the short term. According to the two landslide inventories, a very high proportion of the landslides in LI1 experienced secondary failure due to the subsequent earthquakes. Some unstable slopes may be reactivated one or more times during consecutive earthquakes, which is very critical for disaster rescue work. Accordingly, the models of S1 have a better predictive ability for recurring landslides but also produce more false positive predictions. In short-term disaster relief, we would rather assume that an area has a high probability of landslides than omit places where there may be victims. In the numerical experiments in this study, the RF and deep learning models accurately predicted more than 85% of the secondary landslides, even when the probability threshold was simply set to 0.5. However, the rapid construction of landslide inventories remains a challenge due to the susceptibility of optical satellite imagery to weather. Short-term prediction of earthquake-triggered landslides still has much room for improvement. UAV aerial photography has become an effective means of cataloguing landslides in recent years (Lin et al., 2017; Gupta et al., 2018; Valkaniotis et al., 2018), which can make up for this deficiency to a certain extent. While satellite images are being continuously updated, drones can be used to photograph surrounding surface changes at short intervals once an earthquake occurs. It is expected that the joint application of multi-sensor earth observation technology will support the rapid construction of landslide inventories in the future.
On the other hand, S2 uses the overall landslide catalog for training and testing, and its results better represent the landslide susceptibility level in the study area. In this case, it is more suitable to adopt treatment approaches according to the level of landslide susceptibility. In long-term landslide prevention and management, a high FPR means more effort spent on unnecessary prevention work, which wastes resources. In addition, for routine landslide management and prevention, training with only the first co-seismic landslide inventory is clearly insufficient, and training with the complete post-sequence inventory is better. For long-term disaster prevention, we need a more comprehensive distribution of landslide susceptibility in the earthquake-affected area.
The results of the four types of models show that these machine learning algorithms differ considerably in performance. RF and DNN perform better than the other algorithms under both S1 and S2. For S1, Table 5 shows that RF has both higher accuracy and higher recall than DNN; the better recall is very important for short-term disaster relief. For S2, similar conclusions can be drawn when the accuracy reaches its peak (p-threshold = 0.6), although the AUC of DNN (0.9107) is higher than that of RF (0.9066). The results of these two algorithms are not very different, so for long-term landslide prevention and control, either can be selected according to need.

Conclusion
In this study, the landslide inventories for the 2018 Lombok, Indonesia earthquakes were used to build machine learning models under two strategies. The first strategy (S1) was to use the two short-interval landslide inventories for training and testing separately. The second strategy (S2) was to divide the merged overall landslide inventory into training and test data proportionally. We chose four commonly used machine learning algorithms (LR, RF, NN, and DNN) and built a total of eight models under these two strategies. Compared with S1, the predicted probabilities of landslide and non-landslide areas under S2 are lower by 7.1% and 14.9%, on average, respectively, while the highest accuracy is 10.5% higher and the AUC score is 8.1% higher, on average. Under the influence of multiple earthquakes over a short period, and owing to the high landslide recurrence rate, S1 can more comprehensively predict the possibility of secondary landslides, but it also brings a higher misidentification rate. S2 better represents the landslide susceptibility level in the study area. We therefore suggest that models built from the first inventory (LI1) are more suitable for secondary landslide prediction during an earthquake sequence, whereas modeling with the overall landslide inventory is more suitable for long-term landslide prevention during non-earthquake periods.