Optimizing machine learning algorithms for spatial prediction of gully erosion susceptibility with four training scenarios

Gully erosion causes high soil erosion rates and is an environmental concern posing major risk to the sustainability of cultivated areas of the world. Gullies modify the land, shape new landforms, and damage agricultural fields. Gully erosion mapping is essential to understand the mechanism, development, and evolution of gullies. In this work, a new modeling approach was employed for gully erosion susceptibility mapping (GESM) in the Golestan Dam basin of Iran. The measurements of 14 gully erosion (GE) factors at 1042 GE locations were compiled in a spatial database. Four training datasets comprised of 100%, 75%, 50%, and 25% of the entire database were used for modeling and validation (for each data set in the common 70:30 ratio). Four machine learning models—maximum entropy (MaxEnt), general linear model (GLM), support vector machine (SVM), and artificial neural network (ANN)— were employed to check the usefulness of the four training scenarios. The results of random forest (RF) analysis indicated that the most important GE effective factors were distance from the stream, elevation, distance from the road, and vertical distance of the channel network (VDCN). The receiver operating characteristic (ROC) was used to validate the results. Our study showed that the sample size influenced the performance of the four machine learning algorithms. However, the ANN had a lower sensitivity to the reduction of sample size. In addition, validation results revealed that ANN (AUROC = 0.85.7–0.90.4%) had the best performance based on all four sample data sets. The results of this research can be useful and valuable guidelines for choosing machine learning methods when a complete gully inventory is not available in a region.


Introduction
Gully erosion (GE) is the most destructive type of water erosion (Patton and Schumm 1975;Poesen et al. 2003), and continues as an environmental concern such as landslide (Zhang et al. 2019a) due to the damages and changes in the Earth landforms (Mekonnen et al. 2017;Eisenberg and Muvundja 2020;Amare et al. 2019;Karyda and Panagos, 2020;Weldu Woldemariam and Edo Harka 2020). Gullies are vertical cuts and steep slopes and are often seen on pastures, in forests, or on agricultural lands (Zabihi et al. 2018). Gullies modify the land surface, shape new landforms, and damage agricultural fields (Castillo et al. 2018). In some cases, GE can be generated by natural events like extreme rainfall events that results in floods (Stankoviansky and Ondrčka 2011) and landslides (Ionita et al. 2015). Large amounts of sediment and nutrients from Earth's surface can be washed into rivers and reservoirs due to GE which can disrupt ecosystems (Valentin et al. 2005). The impacts of GE processes depend on a range of environmental factors (Arabameri et al. 2018a). Land use change, degradation of forests and grasslands, reduced vegetation cover, and disturbance of hydrological conditions could exacerbate erosion (Deng et al. 2015). There is consensus among researchers that management of gully formation is needed to prevent land degradation (Arabameri et al. 2019a). Therefore, identifying areas prone to GE is important to the reduction of damages (Yang et al. 2011;Pourghasemi et al. 2017;Rahmati et al. 2017). Numerous studies have sought to assess landscapes' susceptibilities to GE, although there is not yet a consensus on the best way to map it. With the advent of geographic information systems (GIS) and remote sensing (Li et al. 2017;Zhao et al. 2020;Zhou et al. 2021a, b), researchers have developed methods to model areas prone to GE and several have achieved good results (Arabameri et al. 2018b;Nwankwo and Nwankwoala 2018;Saha et al. 2020). Lucà et al. (2011) used two-variable and multi-variable statistical models to map GE. The results from modeling with logistic regression achieved high accuracy. However, Conoscenti et al. (2014) used multivariate statistical models -grid cell units (CLUs) and slope units (SLUs) -to examine GE. They found the CLU model to produce the highest accuracy. Others have also used statistical models to map GE (Dube et al. 2014;Rahmati et al. 2016; Al-Abadi and Al-Ali 2018). More recently, high accuracy data mining methods for prediction of patterns of extreme events have been endorsed by many researchers (Chang et al. 2020;Nguyen et al. 2020;Pourghasemi et al. 2020). Statistical models (Zabihi et al. 2018) and multivariate decisionmaking (Arabameri et al. 2018b;Vijith and Dodge-Wan 2019) have been shown to be adequately accurate for assessing patterns of erosion risk. In order to improve predictions, machine learning models can be applied (Webb et al. 2001). Several machine learning methods have been used for gully erosion susceptibility mapping (GESM). These include logistic regression (LR) (Martínez-Casasnovas et al. 2004;Roy and Saha 2019), random forest (RF) (Saha et al. 2020), reduced-error pruning tree (REPTree) , naïve-Bayes tree (NBTree) (Arabameri et al. 2019c), boosted regression tree (BRT) (Garosi et al. 2018), support vector machine (SVM) (Tien , and k-nearest neighbors (KNN) (Avand et al. 2019). In addition to using machine learning models to improve GESM, improvements can be made to augment the conventional and limited topographic factors to model and identify GE-prone areas Azareh et al. 2019), including other environmental factors that can affect GE may increase the robustness and accuracy of predictions. In this work, we employ a new modeling approach by testing four model types -generalized linear model (GLM), maximum entropy (MaxEnt), support vector machine (SVM), and artificial neural network (ANN) -and four differently sized samples (25%, 50%, 75%, and 100%) of the GE inventory data set for training. Because of the significant amount of GE occurring in the region, the basin of the Golestan Dam, Iran, was chosen as the study area. The aim of this research is to map susceptibility to GE to support the prevention and management.
The accuracy of models for gully erosion susceptibility or other natural hazards are significantly influenced by sample size which remains a challenge. The number of gullies is a critical factor in gully erosion susceptibility modeling. Given the time requirements of field surveys, the largest expenditure related to modeling of gully erosion susceptibility is the cost of collecting gully location data (head cuts), which poses a major challenge in developing countries such as Iran. Owing to time and cost limitation in such cases, it is necessary to identify robust sample size (i.e., the number of gullies). Determining the dependence of a model's results on the sample size used is a research gap (Davoudi Moghaddam et al. 2020). Thus one of the fundamental questions addressed in this study is the response of machine learning models to changing sample size in gully locations. We evaluate the sample size on the accuracy of different models including GLM, ANN, SVM, and MaxEnt to model gully erosion susceptibility, considering a range of gullies from 260 to 1042.

Study area
The Golestan Dam basin is in northeastern Golestan Province at near the city of Kalaleh. The basin is delineated by coordinates 37°24′00″ and 37°45′00″ N, and between 55°20′00″ and 56°00′00″ E (Fig. 1). The study area covers about 222,421 hectares. Elevation in the study area ranges from 58 to 2168 m.a.s.l. The slopes in the study area range from 0 to 160%. Most of the basin, which forms part of the Alborz Mountain range, is mountainous with gentle slopes. The minimum and maximum temperatures in the area are 10 C ° and 18 C° (IRIMO 2012). Diverse lithology cover the area. The largest are swamp (60.7%) and swamp and marsh (13.7%), and grey to block shale with thin layers of siltstone and sandstone (7.9%), The remaining area is accounted by other formations (GSI, 1997) (Table 1). Average annual precipitation in the study area ranges from 224 to 736 mm; there is more rainfall in the southeast. The study area experiences four climates: humid, semi-humid, semi-arid, and Mediterranean (IRIMO 2012). This region is also the source region of the main branch of the Gorganrood River. Land use and land cover is classified as moderate rangeland (25.61%), poor rangelands (19.18%), orchards and dryland farming (14.98%), dryland farming (11.08%), and dense forest (10.92%); the remaining cover is composed of gardens, moderate forest, good rangeland, flood plain, low forest, and residential land uses.

Methodology
This study involved three steps ( Fig. 2): (i) data collection, (ii) tests of multi-collinearity test, and (iii) GESM. Highresolution images, field investigation, and global positioning system (GPS) were used to identify the locations of 1042 gullies in the study area. Data describing fourteen environmental factors at each location were compiled for modeling. These fourteen factors or independent variables were identified in previously published studies. To test the robustness of the models and the degree to which they are sensitive to the training data sample sizes, four data sets (D) were created following a strategy of incremental reduction of the data samples: D1 (100% of GE, n = 1042), D2 (75% of GE, n = 781), D3 (50% of GE, n = 521), and D4 (25% of GE, n = 260). Each data set was randomly split into two groups with a 70 to 30 (training to validation) ratio. The independent variables were examined for multicollinearity using tolerance (TOL) and variance inflation factor (VIF) tests. Finally, GESMs were prepared with the MaxEnt, GLM, SVM, and ANN models.

Gully erosion inventory map
An inventory map is the basic part of any spatial modeling especially for gully erosion. A dataset was collected to inventory gully using the available geoinformatics (interpreting Google Earth images, a handheld GPS device, archived organizational data (General Department of Natural Resources and Watershed Management of Golestan Province), and local information). Field surveys were conducted to check these data at random locations using handheld geographic positioning system (GPS) units with suitable accuracy. A total of 1042 locations were compiled for the analysis. To evaluate the best sample size to model gully erosion, four sampling proportions of the entire dataset were used: 100% (1042), 75% (781), 50% (521), and 25% (260) of the cases. All selection and preparation for the 4 sample sets was conducted in GIS using random spatial distribution of case locations and random case selection methods. The final inventory map was generated in ArcGIS 10.7.

Gully erosion effective factors
Optimal selection of effective factors is an important step in modeling (Rahmati et al. 2017). In this study, the 14 GE effective factors (GEEFs) included bulk density, cation exchange capacity of soil (CEC), elevation, distance from road, distance from stream, land use, lithology, sand soil fraction, silt soil fraction, slope, topographic position index (TPI), topographic wetness index (TWI), and vertical distance to channel network (VDCN) (Fig. 3). Elevation was determined from the digital elevation model (DEM) of the study area ( Fig. 3a). Bulk density is a soil property that refers to the amount of mineral in a soil (Okunlola et al. 2014). A GIS layer for bulk density was prepared using the cutting ring method (Fig. 3b). The values ranged from 1.3 to 1.6 g/cm 3 . The cation exchange capacity (CEC) is another soil characteristic that can affect a soil's capacity to retain nutrients and prevent acidification, conditions that can be affected by erosion (Van Zijl et al. 2014). Positive ions such as (Ca +2 ), magnesium (Mg +2 ), potassium (K + ), sodium (Na + ), hydrogen (H + ), aluminum (Al +3 ), iron (Fe +2 ), manganese (Mn +2 ), zinc (Zn +2 ), and copper (Cu +2 ) maintain the cations on soil surfaces (Brady et al. 2008). The CEC layer was prepared using a pre-treatment method (Rengasamy and Churchman 1999) in a soil laboratory (Fig. 3c). The values ranged from 12 to 35 meq/100 g. Other soil tissue classifications were also included as GEEFs (Dong et al. 2016): percentages of clay, sand, and silt were measured using the hydrometrics (Fig. 3d-f). Hydrological condition has a key role in earth sciences and natural hazards (Liu et al. 2020;Huang et al. 2019;Wang et al. 2022b, c).The TWI was first introduced by (BEVEN and Kirkby 1979). It is a topographic feature that reflects hydrological conditions (Gómez-Gutiérrez et al. 2015). The TPI is another topographic factor (Kanti Hembram et al. 2019) that measures elevation and the mean of elevation at the center point of predetermined of pixels ). These two geomorphological factors were calculated with two equations. Equation 1 is: where and are the basin-specific area to the point of discharge and the slope of the pixel, respectively. And Eq. 2 is: where E pixel and E surrounding are the cell elevation and the average elevation of the surrounding pixels, respectively. The TWI values of 2.25 and − 24.01 and the TPI values of − 35.7 and − 29.8 were derived from the DEM images of the study area in SAGA-GIS software (Fig. 3g, h). Roads have great impacts on erosion due to their concentration of overland flow (Frankl et al. 2012). Distance from the nearest roads were measured as Euclidean distances in ArcGIS10.4 software; they ranged from 0 to 20,469 m in the study area (Fig. 3i). Distance from the nearest streams were used to reflect the influence of hydrological processes on GE (Dube et al. 2014). This was also measured as Euclidean distances from GE sites to streams in ArcGIS10.4; the distances ranged from 0 to 8,336 m (Fig. 3j). The topographic factors slope and VDCN were also included. VDCN is another topographic factor that influences the occurrence of GE. VDCN represents the vertical distance between the cell elevation and the calculated elevation for the channel network (Feloni et al. 2020). Slope and VDCN were measured with the SAGA-GIS Tool parameter (Fig. 3k, l). Land use can also affect the onset of GE. Areas with low vegetation density are more susceptible to GE than areas with high vegetation coverage (Zakerinejad and Maerker 2015). The land use layer for the study area was classified into 12 groups (Table 2) by supervised classification of Landsat 8 imagery (Fig. 3m). Rock type is a geological factor that is directly associated with the potential for GE (Stolte et al. 2003). The lithology layer was created by digitizing a map of the geology of the study area at a scale of 1:100,000 and then classified into 8 groups (Table 1) (Fig. 3n).

Generalized linear model
In recent years, deep learning and machine learning models have found many applications in different sciences (Shi et al. 2022;Zhang et al. 2021Zhang et al. , 2022a. GLM is a flexible conventional linear regression technique (Dunteman and Ho 2005). GLMs were first developed by (Nelder and Wedderburn 1972) to incorporate statistical models like linear regression, logistic regression, and Poisson's regression into the modeling process. A generalized linear model (GLM) connects linear regression and the dependent variable using a link function, and allows variance of each measure to be a function of the predicted values. In this model, each Y result is assumed to follow the distribution of an exponential family with the mean μ: where E(Y) is the expected value of Y ; X is the linear value predicted, and the link function is a linear combination of unknown parameters .g (Urso et al. 2018).

Maximum entropy
Maximum entropy (MaxEnt) (Phillips et al. 2006) considers the probabilities of constraints within the occurrence data and allocates non-negative values to each portion of the study area. Typically, MaxEnt has been used to study Earth processes (Dyke and Kleidon 2010). To derive the probability of the conditional occurrence of Pr(y = 1|z), performance is determined by the conditions of the variables in the locations where the phenomenon is present, f 1(z) , in the study area f (z) (Davis and Blesius 2015) (Fig. 4).

Support vector machine
Support vector machine (SVM) analyzes data classification (Cortes and Vapnik 1995). SVM is designed to solve problems in small samples and classify nonlinear data (Fu et al. 2020). In general, SVM models represent the samples as points in space, so that the predictions will be drawn in the same space (Fu et al. 2019). If ((x i , y i ) i = 1, 2, … , n), x i is the set of samples, x i as the input vector ith and yi is where b, w , and Dot (.) represent the bias parameter, weight vector, and internal product, respectively. Also, in a binary classification, the correct identification of samples is:

Artificial neural networks
Artificial neural networks (ANNs) are computational connecting systems that function in ways that are similar to a human neurological system . ANNs are formed by connected units or nodes that are called artificial neurons. These can be used for prediction and classification. The artificial neural networks provide accurate results as output by receiving and processing information, so they have been used in many studies of environmental risk assessment (Arabameri et al. 2019d); (Gorsevski et al. 2016). One of ANN's most popular models is the multi-layer perceptron (MLP) model which includes input, hidden, and output layers (Alizadeh et al. 2018). In an ANN model, the input values are processed in the hidden layers, and testing and errors determine the number of nerve cells in the hidden layers (Abraham 2005). The hidden neurons then interact with weight inputs, and the output signal of neuron 0 can be expressed: where Wj and f (net) are vectors of weight and activation function, respectively. The variable net is defined as a scale net for weight and input vectors: where T is the easiest way to transfer a matrix (Abraham 2005).

Method validation using receiver operating characteristic
Validation of results has a key role in scientific works (Zhang et al. 2019a, b;Xu et al. 2022;Wang et al. 2022a; Gong et al. Zheng et al. 2021a, b). The relative operating characteristic (ROC) curve receiver is a graphic diagram that shows the diagnostic capability through the binary method ). The receiver operating characteristic (ROC) curve shows different thresholds in adjustments by creating a true positive rate (TPR) versus a false positive rate (FPR) (Fawcett 2006). The area under the curve (AUC) can be used to evaluate the ROC curve. The AUC threshold is between 0.5 and 1, a higher value indicating better quality prediction by a model (Nguyen et al. 2020).
The robustness of a model can be determined by analyzing the changes of the model efficiency when small alterations in input data are made (Cama et al. 2017). In the current study, four data sets were used for both training and validation steps. Models were applied to the mentioned datasets, and then were verified using the validation datasets. The robustness of the predictive model is defined as the stability of the model's results in terms of accuracy models when the training and validation samples are altered. The robustness of the models was calculated by differentiating
If VIF < 5 and TOL > 0.1, there is no LR between factors (Achour and Pourghasemi 2019). The MT results confirm that there is no collinearity among the variables (Table 3)

Relative importance of GEEFs using the random forest model
The random forest (RF) model is commonly used in studies relating to environmental issues and hazards . The RF algorithm can combine data and factors without consideration of distributions (Breiman 2018). Accordingly, it was used here to determine the relative importance of each GEEF (Table 4). The results reveal that all factors have different weights, but distance from stream in

Gully erosion susceptibility mapping
Gully erosion susceptibility mapping (GESM) was performed with the training data (dependent variable locations), GEEFs (independent variables), and the four models (SVM, GLM, MaxEnt and ANN). For each model, four GESMs were created using the four training samples (D1-D4). Each map was classified with five susceptibility classes (very low, low, moderate, high, and very high) using the natural break classification method (Figs. 5,6,7,8). The values of each susceptibility classes of each model were compiled (Table 5). The results show that all four algorithms correctly identified a high percentage of the locations susceptible to GE. The most susceptible GE areas (found mainly in the northern and northeastern portions of the study area) were properly classified. Comparing the models, the D4 sample (25% of GE), MaxEnt predicted the highest percentage of the GE locations would be in the very high susceptibility class (13.93%). For the D3 (50% of GE), D2 (75% of GE), and D1 (100% of GE) sample data sets, the ANN, SVM, and ANN models predicted the highest percentages of GE locations in the very high susceptibility class (Fig. 9).

Analysis of the goodness-of-fit and predictive performance of models with different sample sizes
Considering the predictive performance and robustness of models (Tables 6, Figs. 10 and 11), the ANN model had minimum robustness and sensitivity to change of data (0.014 and 0.047) in the training and validation step indicating the highest stability and robustness in comparison with other models, followed by GLM (0.024), MaxEnt (0.03), and SVM (0.048) for training data and SVM (0.059), MaxEnt (0.06), and GLM (0.066) for validation dataset. Generally, all four methods yielded acceptable learning and predictive power in the validation (AUROC ranged between 0.781 and Fig. 7 Gully erosion susceptibility mapping using 75% gullies: a GLM, b MaxEnt, c ANN, d SVM 0.904) and training step (i.e., AUROC ranged between 0.818 and 0.921). In some cases, models had reached excellent performance (AUROC higher than 0.9). As there was no evidence of severe performance degradation as all AUROC values were within and above the acceptable range, it can be safely said that all models showed almost stable results. The sensitivity of the results to the change of the input sample size can generally be ignored.

Discussion
Gully erosion contributes voluminous sediments into rivers oceans (Valentin et al. 2005), and the risk assessment using GESM is an important management measure. Numerous methods have been employed for this task, but these generally strive to achieve the same goals. To this end, the choice of effective methods and appropriate GEEFs for GESM has become a bit contentious among researchers. The growing capacities of machine learning algorithms for prediction and classification, and the increasing accuracies of the results, have forced the development of GESM evaluation processes (Amiri et al. 2019); (Rahmati et al. 2017). In this study, susceptibility to GE in the Golestan Dam basin was investigated with a new modeling approach using SVM, GLM, MaxEnt, and ANN models with different dataset sampling sizes. The results of the maps of predicted GE susceptibilities were classified into five levels (Figs. 5-8). Though the model chosen can factor into risk assessments, the factors chosen can also affect the prediction of probabilities. Since the most important of these could affect the performance of models (Ballabio and Sterlacchini 2012), 14 GEEFs were selected and VIF and TOL were used to ensure their independence from each other. Based on the results of the MT, there was no LR among these factors. Further testing the relationships between the models' performances and the dependent variables, the GE inventory was used in four different proportions: set D1 used 100% of the inventoried gully locations, D2 used 75%, D3 used 50%, and D4 used 25%. The relative importance of each of the factors in the GE predictions was evaluated using RF. The results showed that the most important independent variable was distance from the nearest stream for all four sample data sets; the weights for the entire sample (D1) was 72.61 and for the smallest sample (D4) was 17.05 -these were the greatest weights among the variables. Streams, by rinsing and lateral erosion, upset the balance of slopes along waterways and they increase the susceptibility to GE along rivers. The results of previous studies (e.g., Tien Rahmati et al. 2016) also confirm our findings. Elevation, distance from road, and VDCN were also relatively important based on RF. In this basin, lithology and land use were the least important GEEFs. A study of slope as a GEEF in this study showed that lands with lower slopes have the least effect on gullying. The results of other research also showed areas with steeper average slopes had greater ditch density (the ratio of the length of the ditches per areal unit). Although the results reported by Zinck et al. (2001) indicate the occurrence of gully erosion in low-slope  areas, the hydrological conditions of the study area increase the flow rate of overland runoff from upstream and transport soil particles from higher slopes, making it possible to increase GE. The results from our novel approach of modeling and GESM indicate that the ANN model had the highest robustness among all sample sizes. The SVM, GLM, and MaxEnt algorithms showed that by emphasizing the number of dependent variables used with each of the sample sets for training and integrating independent variables (GEEFs), they can adequately identify the areas susceptible to GE. A review of the literature shows that the performance of the models used in this study has never before been evaluated for GESM using different sample sizes. Therefore, the ANN model in this study has provided better performance (Fig. 9). Garosi et al. (2018) also pointed to the optimal performance of the ANN model for GESM. A recent study indicate a strong performance by ANN, increasing support for validity of our findings.
Our contribution allows to foresee where the gullies will be developed in a region, thus aiding to take measures for reducing the soil erosion and promote the sustainability of the land Fig. 9 Gully erosion susceptibility mapping using the best model (ANN 50% gullies)  Guadie et al. 2020). Our study is also highly relevant to achieve the goals and objectives of the Sustainable Development Goals for 2030 promoted by United Nations (Visser et al. 2019). Moreover, our research results would aid in avoiding land degradation processes as well as desertification processes because our models can predict where gullies can be developed (Keesstra et al. 2018a, b;Rodrigo-Comino et al. 2018).
One of the important points in implementation of artificial intelligence models and datamining for gully erosion susceptibility mapping over large areas is that sample size should be logical and that models could analyze all parts of the region. In this study, the sensitivity of models to gully erosion sample size and their capabilities were investigated. The results illustrate that the NN model has a lower sensitivity to sample size and better prediction for gully erosion. The ANN models can be applied in other large regions even if a gully inventory is not readily available. This conclusion is vital and useful for gully erosion modeling in large and data-scarce regions.

Conclusions
As the Golestan Dam basin is a major gully erosionprone area, a susceptibility survey was performed in this study. A new approach trained SVM, GLM, Max-Ent, and ANN machine learning models with four differently proportioned samples from the GE location database. The results revealed that the performances of the models with the varied proportions were different. In addition to differences in AUC values, the maps from each model showed significant differences in GESM. The classified maps showed high erosion in the northern and northeastern parts of the basin, particularly in those areas with gentle slopes. The ANN model was not only the most accurate of the models, but also performed well with all four training samples. It was able to accurately simulate future GE by using the 50% sample. Nevertheless, the other models did perform well. Therefore, machine learning models offer simple and inexpensive techniques that can eliminate the need to use more traditional metrics, and they produce maps that reliably predict the locations that are most susceptible to gullying. 1. By reducing the sample size from 100 to 50%, an increase in the accuracy of the models is observed, which varies depending on the type of model. By reducing the sample size from 50%, the accuracy of the models decrease again. For example, in SVM model, accuracy of model increases from 0.852 to 0.898 when the sample size is decreased from 100 to 50%. 2. The MLP model was not only the most accurate of the models, but also performed well with all four training samples. It was able to accurately simulate future GE by using the 50% sample. Nevertheless, the other models did perform well. Therefore, machine learning models offer simple and inexpensive techniques that can eliminate the need to use more traditional metrics, and they produce maps that reliably predict the locations that are most susceptible to gullying. The most salient finding was that ANN model was identified as the best model in terms of both accuracy and robustness for gully erosion susceptibility modelling 3. Although all factors have been effective in modeling, some factors such as distance from stream, elevation, distance from road and silt have shown more impact, suggesting the greater impact of these factors in the occurrence of gully erosion in the study area.

Data availability
The data used in this research are available by the corresponding author upon reasonable request.

Declarations
Ethics approval The authors confirm that this article is original research and has not been published or presented previously in any journal or conference in any language (in whole or in part).
Consent to participate and consent to publish Not applicable.

Competing interests
The authors declare no competing interests.