Novel Bayesian Additive Regression Tree Methodology for Flood Susceptibility Modeling

Identifying areas prone to flooding is a key step in flood risk management. The purpose of this study is to develop and present a novel flood susceptibility model based on Bayesian Additive Regression Tree (BART) methodology. The predictive performance of the new model is assessed via comparison with the Naïve Bayes (NB) and Random Forest (RF) based methods that were previously published in the literature. All models were tested on a real case study based in the Kan watershed in Iran. The following fifteen climatic and geo-environmental variables were used as inputs into all flood susceptibility models: altitude, aspect, slope, plan curvature, profile curvature, drainage density, distance from river, distance from road, stream power index (SPI), topographic wetness index (TWI), topographic position index (TPI), curve number (CN), land use, lithology and rainfall. Based on the existing flood field survey and other information available for the analyzed area, a total of 118 locations were identified as prone to flooding. The data available were divided into two groups with 70% used for training and 30% for validation of all models. The receiver operating characteristic (ROC) curve parameters were used to evaluate the predictive accuracy of the new and existing models. Based on the area under curve (AUC) the new BART (86%) model outperformed the NB (80%) and RF (85%) models. Regarding the importance of input variables, the results obtained showed that the location's altitude and distance from the river are the most important variables for assessing flooding susceptibility.


Introduction
A natural disaster is defined as any unforeseen natural occurrence that weakens or destroys economic, social and physical capacity, for example through loss of life and finances or destruction of infrastructure, economic resources and areas of employment. Examples include earthquakes, floods, drought, seawater, volcanoes, landslides, hurricanes and natural pests (Vetrivel et al. 2018). Flooding is one of the most dynamic and disruptive natural events and puts human life, property, and social and economic conditions at greater risk than any other natural disaster (Rahmati et al. 2016; Yariyan et al. 2020). This phenomenon causes damage to human achievements at all times (Woodward et al. 2014; Darabi et al. 2019; Vafakhah et al. 2020). The highest risk of flooding and corresponding damage is in populated, i.e. urban, areas. In recent years, the increase in urban flood hazards, particularly along river banks, has put residents and movable property at risk of flooding. Due to the varying climate and unpredictable temperatures and rainfall in many of Iran's watersheds, several floods occur every year (Tehrany et al. 2014). The limitation, reduction and destruction of environmental resources as a result of the expansion of human activities poses many challenges for today's society and the next generation. The Kan watershed is affected by flooding events annually and this vulnerability has been documented (Hooshyaripor et al. 2020). According to the available information, seven important flood events have been recorded in this watershed since 1954, causing fatalities and damage to industrial, residential and agricultural land.
Reducing human casualties as well as damage to property and the environment is a key objective shared by countries most often impacted by natural disasters. They are increasingly conducting feasibility studies with economic analysis to mitigate the effects of these disasters (Molinos-Senante et al. 2011). Although flooding cannot be prevented, the damage can be mitigated through appropriate analysis and forecasting techniques (Heidari 2014). The first step is to identify flood-prone areas (Janizadeh et al. 2019;Hosseini et al. 2020). One way to prevent and reduce flood damage is to provide people with reliable information through flood hazard zoning maps (Cook and Merwade 2009). The modelling of flood hazards, which may involve multi-temporal data sets, is required. Recently, machine learning methods have been successfully applied to assess flood risk with higher accuracy (Ngo et al. 2018;Talukdar et al. 2020). However, there is still no agreement on which method or set of methods can provide the best predictions (Kalantar et al. 2021;Costache et al. 2021).
Rapid access to satellite imagery based on remote sensing data has increased the use of geographic information systems in the preparation of flood susceptibility maps. A wide range of modelling techniques has been proposed and used in natural disaster assessment, including AI based techniques (Sayers et al. 2014). In recent years, Bayesian methods have been developed to model flood susceptibility, partly because of their robustness to small sample sizes and their ability to deal with missing or incomplete data. These include Naïve Bayes models (Liu et al. 2016; Pham et al. 2020b; Tang et al. 2020). Other widely used approaches include regression tree models such as Random Forest (RF) models (Chen et al. 2020; Vafakhah et al. 2020) and Decision Tree models (Khosravi et al. 2018; Costache 2019; Janizadeh et al. 2019; Pham et al. 2020a), as well as Logistic Regression models (Shafapour Tehrany et al. 2017; Al-Juaidi et al. 2018; Tehrany and Kumar 2018). Regression tree models have become popular in the research environment due to their capability to model nonlinear phenomena such as floods.
Machine learning algorithms by default usually present point estimates only, and so decisions are made ignoring the uncertainty surrounding these estimates. In recent years, the use of ensemble models has attracted the attention of researchers in various fields as ensemble models benefit from several individual models and therefore tend to have better performance than individual models (Al-Abadi 2018; Tehrany et al. 2019a;Costache and Bui 2020;Shahabi et al. 2020). Bayesian Additive Regression Tree is one of the new ensemble models that combines Bayesian and Regression tree algorithms giving the access to the full posterior distribution of all unknown parameters in the model. This can be useful to reduce the uncertainty.
The BART model has been used for modeling and prediction in different areas such as ecological processes (Plant et al. 2021) and gully erosion (Chowdhuri et al. 2020). Because flooding is a non-linear phenomenon subject to considerable uncertainty, the use of models that are able to predict this phenomenon and reduce uncertainty is essential in the management, planning and prevention of flood risk. In the field of flood hazard modeling, very little attention has so far been paid to the role of hybrid Bayesian and Decision Tree algorithms. Therefore, the purpose of this study is to develop and present a new flood susceptibility model based on the ensemble type Bayesian Additive Regression Tree (BART) method. The new method is compared with the Naïve Bayes (Bayesian type) and Random Forest (regression tree type) based models to evaluate its predictive performance.

Study Area
The Kan River watershed covers 200 km² and is located northwest of Tehran, Iran, between longitudes 51° 10′ and 51° 23′ east and latitudes 35° 46′ and 35° 58′ north (Fig. 1). The average height of the watershed is 2428 m, the average slope of the whole watershed is 43.4% and the most important river in this mountainous region is the Kan River. In terms of geological status, the study area is located in the southern margin of the central Alborz region and has a mountainous climate with an average annual rainfall of 414.13 mm. The average annual discharge of the Kan River is 2.2 m³/s and the runoff volume is about 70 million m³/year. Seven important flood events have been reported in the Kan watershed since 1954, which have caused damage to commercial and residential facilities and agricultural land, and even caused casualties in the region (Delkash et al. 2014).

Fig. 1 Location of case study: a) country of Iran, b) Tehran Province and c) Kan watershed

Flood Inventory Data Preparation
In order to prepare a flood susceptibility map it is necessary to analyze the historical floods. The Kan watershed has been severely affected by dangerous floods in recent decades, causing extensive damage and casualties. Based on the historical floods recorded by the Regional Water Company of Tehran Province (1954/8/27, 1955/6/9, 1978/3/7, 1981/7/25, 1986/2/2, 1995/4/23, 1996/4/3), field visits and interviews with locals from 2019/10/5 to 2019/10/9, and the identification of flood-affected areas using GPS equipment (Fig. 2), 118 flooding locations were identified in the area. In addition, a further 118 non-flood points were randomly placed in the inter-fluvial area or at very steep, high-altitude locations where flooding is almost impossible in the case study area. The positions of all 236 locations are presented in Fig. 1. The data were divided into training and validation sets for modeling, with 70% of the data used for training and 30% for validation (Ahmadlou et al. 2019; Choubin et al. 2019). The flowchart of the research methodology is given in Fig. 3.
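The 70/30 partition described above can be illustrated with a short Python sketch. It assumes that flood and non-flood points are split separately (a stratified split) and that counts are rounded to the nearest integer; the text does not state either detail, and the function name, point identifiers and random seed are hypothetical.

```python
import random


def stratified_split(flood_ids, non_flood_ids, train_fraction=0.7, seed=42):
    """Split flood and non-flood point IDs separately, then merge.

    Assumption: stratification by class and rounding to the nearest
    integer; the study only reports the overall 70/30 proportion.
    """
    rng = random.Random(seed)
    train, valid = [], []
    for label, ids in ((1, flood_ids), (0, non_flood_ids)):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        n_train = round(train_fraction * len(shuffled))
        train += [(point_id, label) for point_id in shuffled[:n_train]]
        valid += [(point_id, label) for point_id in shuffled[n_train:]]
    return train, valid


# 118 flood points and 118 non-flood points, as in the inventory.
flood = range(118)
non_flood = range(118, 236)
train, valid = stratified_split(flood, non_flood)
print(len(train), len(valid))
```

With 118 points per class, this rounding yields 83 training points per class, so the split is only approximately 70/30.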

Spatial Data Preparation
Floods are natural phenomena that are affected by various climatic and geo-environmental factors. In this study, the following 15 climatic and geo-environmental variables are used as potential explanatory factors for flood susceptibility at a given location: altitude, aspect, slope, plan curvature, profile curvature, drainage density, distance from river, distance from road, stream power index (SPI), topographic wetness index (TWI), topographic position index (TPI), curve number (CN), land use, lithology and annual rainfall (Ngo et al. 2018; El-Magd et al. 2021).
The above 15 factors (i.e. potential flood susceptibility model independent variables) were screened using multi-collinearity analysis. The multi-collinearity analysis evaluates the intensity of multiple correlations between the considered variables by calculating the variance inflation factors (VIFs). The higher the VIF, the more likely it is that the variable is redundant and does not play a significant role in flood susceptibility prediction (Miles 2014). In this study, a VIF threshold of 5 was used to identify significant independent variables (Tehrany et al. 2019a; Hosseini et al. 2020). VIFs were estimated using the usdm package in R software. The analysis has shown that all fifteen variables have VIF values below this threshold (see Sect. 4.1), hence they have all been used as potential explanatory factors for predicting the flooding susceptibility.
The values of the above 15 variables were prepared based on previous studies (Wang et al. 2015; Ngo et al. 2018; Kalantar et al. 2021) (see Figs. 4, 5 and 6). For this purpose, the digital elevation model (DEM) of the study area, with a resolution of 12.5 × 12.5 m, was developed from elevation data obtained using the Phased Array type L-band Synthetic Aperture Radar (PALSAR) (https://vertex.daac.asf.alaska.edu/#). The aspect map was prepared from the DEM in nine classes in the ArcGIS 10.5 software (Janizadeh et al. 2019). The ground slope is one of the important factors in the occurrence of floods in watersheds (Tehrany et al. 2015; Chapi et al. 2017). The slope map was prepared from the DEM in ArcGIS 10.5 software (Khosravi et al. 2018). The plan and profile curvature are spatial parameters used in the preparation of flood maps of watersheds. These variables were prepared in ArcGIS 10.5 software using the DEM (Rahmati et al. 2016; Hong et al. 2018). The drainage density of the study area was prepared in ArcGIS 10.5 (Tehrany et al. 2014; Khosravi et al. 2016, 2018). This map was prepared (Khosravi et al. 2018). Distance from the road is also a factor affecting flooding. This variable was prepared using the 1:50,000 road map of Tehran Province and the Euclidean distance tool in the ArcGIS 10.5 software (Shafapour Tehrany et al. 2017).
The stream power index (SPI) is one of the important parameters for flooding in watersheds and is defined by the following relationship (Tehrany et al. 2014; Shafizadeh-Moghadam et al. 2018):

$$\mathrm{SPI} = \text{Catchment Area} \times \tan(\text{slope}) \qquad (1)$$

The System for Automated Geoscientific Analyses Geographic Information System (SAGA GIS 2.6) software was used to prepare this variable (Tehrany et al. 2014).
Topographic position index (TPI) indicates the topographic status of the area, with positive values indicating high altitudes and negative values indicating low altitudes such as valleys (Papaioannou et al. 2015). Due to the role of topographic shape in the formation of floods, this index is considered one of the factors affecting floods, and this variable was prepared using the SAGA GIS 2.6 software. TWI measures the effect of local topography on runoff production and shows the long-term moisture content (Hong et al. 2018; Khosravi et al. 2019), hence this indicator is one of the influential variables in flood risk assessment in watersheds. This variable was obtained in SAGA GIS 2.6 software based on the following relationship (Khosravi et al. 2019):

$$\mathrm{TWI} = \ln\left(\frac{\text{Catchment Area}}{\tan(\text{slope})}\right) \qquad (2)$$

Lithology is one of the important factors in watershed flooding due to its direct effect on the level of permeability and surface runoff (Rahmati et al. 2016). The geological map of the Kan watershed was prepared based on the 1:100,000 geological map of the Iranian National Cartographic Center (NCC) and then turned into a raster layer with a resolution of 12.5 m. The lithology map of the study area was divided into seven different classes. The soil type map was also prepared using the data from the Administration of Natural Resources of Tehran Province, and the vector file of this map was converted to a raster format with a pixel size of 12.5 m using the ArcGIS 10.5 software (Tehrany et al. 2014).
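As an illustration of the SPI and TWI relationships in Eqs. (1) and (2), the following Python sketch evaluates both indices for a single cell. It is not the SAGA GIS implementation; the function names and the use of slope in degrees are assumptions made for this example.

```python
import math


def stream_power_index(catchment_area, slope_deg):
    """SPI = catchment area * tan(slope), following Eq. (1).

    Assumption: slope is supplied in degrees.
    """
    return catchment_area * math.tan(math.radians(slope_deg))


def topographic_wetness_index(catchment_area, slope_deg):
    """TWI = ln(catchment area / tan(slope)), following Eq. (2).

    Flat cells (slope of zero) make the ratio undefined, so they are
    rejected here; GIS packages handle this case in their own ways.
    """
    if slope_deg <= 0 or catchment_area <= 0:
        raise ValueError("catchment area and slope must be positive")
    return math.log(catchment_area / math.tan(math.radians(slope_deg)))


# Hypothetical cell: catchment area of 100 units on a 45 degree slope.
print(stream_power_index(100.0, 45.0))
print(topographic_wetness_index(100.0, 45.0))
```

For a fixed catchment area, steeper slopes raise SPI and lower TWI, which is why the two indices capture erosive power and moisture accumulation, respectively.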
Land use is the result of the interrelationships of socio-cultural parameters and the potential of the land (Rahmati et al. 2016; Bui et al. 2018). Changes in land use and land cover can have a significant impact on flooding in watersheds (Khosravi et al. 2018). The land use map was prepared from a Landsat 8 OLI image acquired on 2019/07/15 using supervised classification with the maximum likelihood algorithm in the ENVI 5.1 software, and was divided into four classes: orchard, rangeland, residential and rocky lands. Using field survey, Google Earth and local queries, training samples were prepared from the study area to ensure that land use does not change. The samples for each class (rangeland, orchard, residential and rocky lands) were randomly collected from the study area. A total of 47 samples were prepared for orchard, 85 samples for rangeland, 48 samples for residential and 51 samples for rocky land use. The samples were divided into two groups so that 70% were used for classification and 30% for evaluating the accuracy of the produced maps.
In order to prepare the annual rainfall map, the rainfall data of seven gauge stations (inside and outside the watershed) were used in the period 1994-2019. After carefully examining the various interpolation methods in the ArcGIS 10.5 software, the distribution of annual rainfall in Kan watershed was prepared based on the ordinary Kriging method.
Soil condition and land use are among the most important factors in the occurrence of floods because they directly affect the amount of water infiltration into the land. In other words, the curve number (CN) of each area indicates the hydrological behavior of that area and its discharge regime during rainfall. In order to determine the CN map, the land use map and the hydrological soil groups map were combined in the ArcGIS software environment. Then, based on the CN tables for different land uses of watersheds and according to the hydrological soil groups map, the value of CN was determined for average antecedent moisture conditions (Mahmoud and Gan 2018; Tang et al. 2018).
The data summary information of all independent variables is shown in Table 1.

Pearson Correlation
Pearson correlation is a method based on parametric statistics that shows the intensity and direction of the relationship between two variables (Nahler 2009). This method, like other correlation methods, considers the relationships of variables in pairs. The coefficient r measures the correlation between two interval or ratio variables and its value lies between -1 and +1. If the value obtained is positive, the two variables change in the same direction, i.e. as one variable increases the other also increases. If the value of r is negative, the two variables act in opposite directions, i.e. as the value of one variable increases, the value of the other decreases. If the value obtained is zero, there is no linear relationship between the two variables; a value of +1 indicates a perfect positive correlation and a value of -1 a perfect negative correlation (Benesty et al. 2009).
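A minimal Python sketch of the Pearson coefficient is given below to make the sign conventions concrete. The function name and the toy data are illustrative only and are not taken from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means.
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Toy data: y rises with x in the first case and falls in the second.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

The first call gives a coefficient at the upper bound and the second at the lower bound, matching the perfect positive and perfect negative cases described above.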

Flood Susceptibility Models
This section describes three different models for predicting flood susceptibility: BART, NB and RF. All models are based on different machine learning methods that predict the flood susceptibility defined as the probability of flood occurrence at a given location of the analyzed watershed. All three models have the same set of input variables, the fifteen explanatory / independent variables described in Sect. 2.3. These model inputs were determined in all cases using correlation and multi-collinearity analysis (see next section). Finally, all models are trained and tested using the data described in the next section.

Naïve Bayes Model
The Naïve Bayes method classifies a phenomenon based on the probability of that phenomenon occurring or not occurring. Based on the inherent properties of probability, the Naïve Bayes method offers good results after initial training (Rish et al. 2001). It is one of the simplest forms of supervised learning. Bayes' theorem provides a way to calculate the posterior probability, P(c|x), from P(c), P(x) and P(x|c):

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \qquad (3)$$

The Naïve Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is known as conditional independence:

$$P(c \mid X) \propto P(x_1 \mid c)\,P(x_2 \mid c) \cdots P(x_n \mid c)\,P(c) \qquad (4)$$

where P(c|x) is the posterior probability of the class (target) given the predictor, P(c) is the prior probability of the class, P(x|c) is the likelihood of the predictor given the class and P(x) is the prior probability of the predictor (Zhang 2004). The e1071 package in R software was used for Naïve Bayes modeling.
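The conditional independence assumption can be illustrated with a small Python sketch that multiplies per-predictor likelihoods by the class prior and normalizes the result. This is not the e1071 implementation; the function name, the two categorical predictors and all probability values are hypothetical.

```python
def naive_bayes_posterior(priors, likelihoods, observation):
    """Posterior class probabilities under conditional independence.

    priors:      {class: P(c)}
    likelihoods: {class: [{value: P(x_i = value | c)}, ...]} per predictor
    observation: tuple with one observed value per predictor
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature_probs, value in zip(likelihoods[cls], observation):
            score *= feature_probs[value]
        scores[cls] = score
    # Dividing by the sum plays the role of P(x) in Bayes' theorem.
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


# Hypothetical probabilities for two predictors: altitude and river distance.
priors = {"flood": 0.5, "non_flood": 0.5}
likelihoods = {
    "flood": [{"low": 0.8, "high": 0.2}, {"near": 0.7, "far": 0.3}],
    "non_flood": [{"low": 0.3, "high": 0.7}, {"near": 0.2, "far": 0.8}],
}
posterior = naive_bayes_posterior(priors, likelihoods, ("low", "near"))
print(posterior)
```

In this toy case a low-altitude location near the river receives a flood probability of 0.28 / 0.31, or about 0.90.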

Random Forest Model
The Random Forest (RF) method is a relatively complex method in which several decision trees are trained in order to increase the predictive accuracy of the model. The result is the combined prediction of a group of decision trees. In the random forest learning method, each decision tree is trained using a random sample selected from the training data set. The selection of predictive variables used to split nodes is also random. In the random forest method, the two parameters mtry and ntree determine the number of candidate variables considered at each split and the number of trees in the forest, respectively. One of the advantages of a random forest is that it can be used for both classification and regression type models. Random forest has parameters similar to those of a decision tree or a "Bagging Classifier". Random forest adds randomness to the model as the trees grow. Instead of searching for the most important feature among all features when splitting a "node", this algorithm looks for the best feature among a random subset of features. This leads to more variety and ultimately a better model. Therefore, in a random forest, only a subset of features is considered by the algorithm when splitting a node. By using a random threshold for each attribute, instead of searching for the best possible threshold, trees can be made even more random (Liaw et al. 2002). The randomForest package in R software was used for the RF modeling here.
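The two sources of randomness described above, together with the aggregation of tree predictions, can be sketched in Python as follows. This is only an illustration of the sampling steps, not the randomForest package; the function names, the mtry value of 4 and the training set size of 165 are assumptions made for this example.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Draw a bootstrap sample: n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, mtry, rng):
    """Pick mtry distinct candidate features for a single node split."""
    return rng.sample(range(n_features), mtry)


def majority_vote(tree_predictions):
    """Combine the class predictions of all trees for one location."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = bootstrap_indices(165, rng)          # hypothetical training set size
features = candidate_features(15, 4, rng)   # 15 predictors, mtry = 4 (assumed)
print(len(set(rows)), features)
print(majority_vote([1, 1, 0, 1, 0]))       # three of five trees predict flood
```

Each tree would be grown on its own bootstrap sample and would draw a fresh feature subset at every split, which is what decorrelates the trees.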

Bayesian Additive Regression Tree (BART) Model
BART is a Bayesian approach to non-parametric function estimation using regression trees. Regression trees rely on recursive binary partitioning of the predictor space into a set of hyperrectangles in order to approximate some unknown function f. The predictor space has dimensions corresponding to the number of variables. Tree-based regression models are capable of flexibly capturing interactions and nonlinearities (Hill et al. 2020). Models consisting of a sum of regression trees are more capable of capturing interactions and nonlinearities than single trees, as well as additive effects in f.
BART can be considered a sum-of-trees ensemble with a novel estimation method based on a fully Bayesian probability model. For a binary response, the BART model can be expressed as follows:

$$P(Y = 1 \mid x) = \Phi\left(\sum_{j=1}^{m} T_j(x)\right)$$

where $T_j$ denotes the j-th regression tree, m is the number of trees and $\Phi$ denotes the cumulative distribution function of the standard normal distribution. In this formulation, the sum-of-trees model serves as an estimate of the conditional probit at x, which can be easily transformed into a conditional probability estimate of Y = 1 (Kapelner and Bleich 2013). The bartMachine package in R software was used for BART modeling.
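The probit transformation of the sum-of-trees output can be sketched in Python as follows. This is not the bartMachine implementation: the leaf values and posterior draws are hypothetical, and the sketch only shows how the summed tree outputs are mapped to a probability and how posterior draws could be summarized.

```python
import math


def standard_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bart_probability(tree_outputs):
    """Map the sum of the tree outputs at one location to P(Y = 1)."""
    return standard_normal_cdf(sum(tree_outputs))


def posterior_summary(draws):
    """Mean, minimum and maximum probability across posterior draws.

    Each draw is the list of tree outputs from one posterior sample.
    """
    probabilities = sorted(bart_probability(draw) for draw in draws)
    mean = sum(probabilities) / len(probabilities)
    return mean, probabilities[0], probabilities[-1]


# Hypothetical leaf values of three trees in three posterior samples.
draws = [[0.4, 0.3, 0.2], [0.5, 0.1, 0.1], [0.6, 0.4, 0.3]]
print(bart_probability([0.0, 0.0, 0.0]))
print(posterior_summary(draws))
```

Because every posterior sample yields its own probability, the spread of these values gives a direct indication of the uncertainty of the susceptibility estimate at a location.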

Model Validation and Performance Assessment
The ROC curve characterizes the relative performance of each model. The ROC curve is a graph in which the true positive rate (sensitivity) is shown on the vertical axis whilst the false positive rate (1 − specificity) is shown on the horizontal axis (Frattini et al. 2010). Sensitivity is the proportion of occurrence pixels that have been correctly predicted; the larger this value, the more accurate the model is in determining the occurrence points. Specificity is the proportion of non-occurrence pixels that the model correctly predicted. The area under the curve (AUC) summarizes the overall performance in a single value. The value of AUC varies from 0 to 1, where a value of 0.5 denotes random prediction and 1 denotes perfect prediction (Yesilnacar and Topal 2005). In this study, the true positive rate (TPR, sensitivity) and the true negative rate (TNR, specificity) were calculated from the numbers of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP), together with the AUC:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP}$$
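The performance statistics used in this study can be sketched in Python from the counts of a confusion matrix. The definitions of sensitivity, specificity, PPV and NPV are the standard ones; the AUC is computed here with the rank-based (Mann-Whitney) formulation, which is an assumption because the text does not reproduce the AUC equation. All counts and scores are hypothetical.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard rates derived from confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


def auc(flood_scores, non_flood_scores):
    """Probability that a flood point is scored above a non-flood point.

    Ties count as one half; this equals the area under the ROC curve.
    """
    wins = 0.0
    for p in flood_scores:
        for n in non_flood_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(flood_scores) * len(non_flood_scores))


# Hypothetical validation counts and predicted susceptibilities.
metrics = confusion_metrics(tp=30, fp=5, tn=25, fn=10)
print(metrics)
print(auc([0.9, 0.8, 0.4], [0.1, 0.5, 0.2]))
```

In the second call, eight of the nine flood and non-flood pairs are ranked correctly, which gives an AUC of 8/9.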

Analysis of Independent Variables
In order to build a flood susceptibility model, potential model input variables are first analyzed for independence (via correlation) and collinearity (via multi-collinearity analysis).
The results of the correlation study of the variables used in flood susceptibility modelling, based on the Spearman correlation test, are shown in Fig. 7. As can be seen from this figure, the analyzed variables have a relatively low correlation with each other, hence they were all selected for further analysis. The highest correlation was obtained between the distance from river and drainage density variables (-0.684). Because the magnitude of this correlation is below the threshold of 0.7 suggested by Dormann et al. (2013), both variables were retained for modeling. In order to determine the appropriate inputs for flood susceptibility modelling, multi-collinearity and tolerance tests were carried out using the usdm package (in the R software environment). All variables with a VIF value smaller than 5 were considered acceptable. The results of the multi-collinearity and tolerance analyses are shown in Table 2. The analysis shows that all analyzed variables have a VIF value smaller than 5. The highest collinearity was obtained for distance from the river, with a VIF of 2.39 and a tolerance of 0.42. The lowest collinearity was obtained for the aspect variable, with a VIF of 1.07 and a tolerance of 0.93. Based on this, all variables shown in Table 2 are selected as potential inputs into the flood susceptibility model.
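The relationship between VIF and tolerance can be checked with a short Python sketch. It uses the standard definitions, VIF = 1 / (1 − R²) and tolerance = 1 / VIF, where R² comes from regressing one predictor on all the others; the function names are illustrative, and only the two VIF values quoted above are taken from the study.

```python
def vif_from_r_squared(r_squared):
    """VIF of a predictor given the R^2 of its regression on the others."""
    return 1.0 / (1.0 - r_squared)


def tolerance(vif):
    """Tolerance is the reciprocal of the VIF."""
    return 1.0 / vif


VIF_THRESHOLD = 5.0  # threshold adopted in the study

# VIF values reported above for the most and least collinear variables.
reported_vif = {"distance from river": 2.39, "aspect": 1.07}
for name, vif in reported_vif.items():
    print(name, round(tolerance(vif), 2), vif < VIF_THRESHOLD)
```

The rounded tolerances reproduce the reported values of 0.42 and 0.93, and a VIF of 5 corresponds to an R² of 0.8 between a predictor and the remaining predictors.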

Tuned Parameters
The tuned parameter values for the BART model are shown in Table 3 and Fig. 8.

Model Validation
The ROC curve parameters include sensitivity, specificity, NPV, PPV and area under curve (AUC). These parameters were used to evaluate the efficiency of the Naïve Bayes, RF and BART models. The corresponding results for the training and testing stages of these models are shown in Figs. 9 and 10 and Table 4. According to the results obtained in the training phase, the sensitivity statistics in the NB, RF and BART models are equal to 0.76, 0.99 and 0.99, respectively. This shows the high sensitivity of the three models and their accuracy. The specificity statistics for the NB, RF and BART models are equal to 0.89, 0.95 and 0.90, respectively. PPV statistics of 0.74, 0.95 and 0.91 and NPV statistics of 0.77, 0.99 and 0.98 were obtained for the NB, RF and BART models, respectively. This shows the high accuracy of these models in predicting the non-flood points. The results of model evaluation based on the AUC show that the accuracy of the NB, RF and BART models is 0.88, 0.99 and 0.89, respectively. Therefore, all three models have high predictive accuracy at the training stage.
Evaluation of the three models at the validation stage shows that the sensitivity statistics for the NB, RF and BART models are equal to 0.76, 0.91 and 0.94, respectively. This shows the high sensitivity of these models in flood estimation. The specificity statistics in the NB, RF and BART models are equal to 0.75, 0.72 and 0.78, respectively. Evaluation of the same three models based on the PPV and NPV statistics results in PPV values of 0.74, 0.75 and 0.80, and NPV values of 0.77, 0.90 and 0.93, respectively, indicating high accuracy of these models when predicting non-flood points compared to flood points. For the overall evaluation of the models at the validation stage, the AUC statistic was also used, and the values obtained for the NB, RF and BART models are equal to 0.81, 0.85 and 0.89, respectively.

Fig. 8 The result of the BART model for flood susceptibility
Fig. 9 The ROC curve analysis for Naïve Bayes, RF and BART models using the train dataset

Flood Susceptibility Modeling Results
After modelling the flood susceptibility using the NB, RF and BART models and evaluating the efficiency of these models, flood susceptibility was predicted for the whole analyzed watershed. The final map was divided into five flooding susceptibility classes (very low, low, moderate, high and very high) by using the natural breaks algorithm (Fig. 11). According to the map obtained, flooding susceptibility is highest around the main river and in the areas near the outlet of the watershed, which have a lower altitude. At the same time, most of the area analyzed, which is generally at high altitude, has a very low susceptibility.
The area and percentage covered by each susceptibility class are shown in Table 5. According to the results, the area of the very high susceptibility class is equal to 22.11 km² (10.26%) in the NB model, 21.23 km² (9.85%) in the RF model and 19.48 km² (9.04%) in the BART model. However, the BART model, with 50.5 km² (23.5%), has predicted the largest combined area of the very high and high susceptibility classes. In order to evaluate the validity of the predicted flood susceptibility maps in relation to the identified flood points in the study area, the frequency ratio (FR) approach was used (Fig. 12). As can be seen from Fig. 12, the highest frequency ratio is in the very high and high classes, which indicates that the models predict flood-susceptible areas appropriately. However, the frequency ratios of the RF and BART models in the very high class are much higher than that of the remaining (NB) model, which indicates a more accurate prediction of flood susceptibility in this class.

Fig. 10 The ROC curve analysis for Naïve Bayes, RF and BART models using the testing dataset
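The frequency ratio check can be illustrated with a brief Python sketch. It assumes the commonly used definition of FR as the share of flood points falling in a class divided by the share of the area covered by that class; the text does not spell out the formula, and the function name and all numbers below are hypothetical.

```python
def frequency_ratio(points_in_class, total_points, class_area, total_area):
    """Share of flood points in a class divided by the class's area share.

    Assumption: this is the usual FR definition; values above 1 mean
    that floods are over-represented in the class.
    """
    point_share = points_in_class / total_points
    area_share = class_area / total_area
    return point_share / area_share


# Hypothetical example: 20 of 100 flood points fall in a class that
# covers 10 of 100 km^2.
print(frequency_ratio(20, 100, 10.0, 100.0))
```

A well-performing susceptibility map should give frequency ratios that increase from the very low class to the very high class.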

Explanatory Variable Importance
The results of the importance of the independent (i.e. input) variables used to model the flood susceptibility using the three models are shown in Fig. 13. It is clear that in the three models used different input variables have different effects on determining the flood susceptibility. It is also clear that altitude and distance from the river are more important than other variables in all three models.
Due to the importance of 4 variables (altitude, distance from the river, distance from the road and rainfall) on flood susceptibility in the BART model, these 4 variables were further investigated (Fig. 14). As it can be seen from Fig. 14, the flood susceptibility decreases with increasing altitude, with highest sensitivity to floods being at an altitude of 1400 m (which is close to the altitude of the outlet of the watershed). This indicates the inverse relationship between the altitude and the flooding susceptibility. Further, a study of the distance from the river shows that locations with distances smaller than 500 m have a high susceptibility to flooding whilst locations with distances larger than 500 m from the river have a decreasing flooding susceptibility which stabilizes around a low value for the distances of 1000-1500 m. Regarding the distance from the road, it can be noted from Fig. 14 that the flooding susceptibility decreases with the increasing distance from the road with most sensitive areas being located less than 1000 m from the road. Finally, a study of the effect of rainfall on flooding susceptibility shows that areas with 450 to 500 mm of rainfall per year are more sensitive than the areas with higher rainfall (the susceptibility decreases so that from rainfall 550 to 650 mm it is low and constant).

Discussion
In the present study, we developed and presented a novel flood susceptibility BART model that is based on machine learning and Bayesian approach. In addition, two existing models, NB and RF were used for comparison. The results obtained showed that all three models have a high performance in predicting the flooding susceptibility in the Kan watershed in Iran but, based on the model performance criteria, the new BART model has outperformed the other two models. In terms of input variable importance, the results obtained show that the altitude and distance from the river are the most important variables for assessing flooding susceptibility in the study area.
One of the main objectives of this study was to apply the BART model and evaluate the efficiency of this model in flood modeling in the study area. Performance evaluation of NB, RF, and BART models shows that the BART model performed best in the validation stage in terms of predicting flood susceptibility. The use of the Bart model in Natural Hazard studies and especially flood sensitivity modeling has been reported rarely before. The efficiency of this model has been proven in other fields such as forest science. Ahmadi et al. (2020) used BART model to mapping forest stand characteristics and showed that this model has a high performance in comparison to other models.
The BART model is a non-parametric Bayesian regression approach that uses consistent basic random elements. Bayesian Additive Regression Trees (BART) provides a flexible way to fit a variety of regression models while avoiding strong parametric assumptions (Hill et al. 2020). The tree ensemble model is supported by an uncertainty framework in the Bayesian inferential framework and provides a principled approach to regulation through previous specifications (Pratola and Higdon 2016;Sparapani et al. 2016). This model uses a non-parametric tree aggregation model to allow flexibility of the average structure of a regression. But it also has the advantages of a Bayesian inferential framework given the amount of uncertainty and its regulation through calibrated data locations (Sparapani et al. 2016;Hill et al. 2020;Prado et al. 2021;Wu et al. 2021).
One of the main advantages of the BART model is the capacity to draw inference on numerous features of the response distribution (for example, the survival distribution in survival analysis) directly from the posterior samples. As a Bayesian model, BART consists of a set of priors for the tree structure and the leaf parameters, and a likelihood for the data in the terminal nodes (Pratola and Higdon 2016; Sparapani et al. 2016). The purpose of the priors is to provide regularization, preventing any single regression tree from dominating the total fit. Many machine learning (ML) models suffer from missing data problems. BART has a feature that lets the user incorporate missing covariate data directly within the BART structure. This method adds missing data indicators to the training data set and allows splits on these indicators, leading to improved performance under a pattern-mixture model framework (Hill et al. 2020; Prado et al. 2021; Sparapani et al. 2021).
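The missing-data mechanism described above can be illustrated with a small sketch of the augmentation step only: each covariate gets a companion 0/1 column that flags whether its value is missing, and a tree is then free to split on those columns. The helper name `add_missing_indicators` and the example covariate values are hypothetical; this is not the code of any particular BART package.

```python
import math


def add_missing_indicators(rows):
    """Append one 0/1 missingness indicator per covariate to every row.

    rows: list of equal-length lists of floats, with NaN marking a
    missing value. Returns new rows; the input is left unchanged.
    """
    if not rows:
        return []
    n_features = len(rows[0])
    augmented = []
    for row in rows:
        if len(row) != n_features:
            raise ValueError("all rows must have the same number of covariates")
        # 1.0 where the covariate is missing, 0.0 where it is observed.
        indicators = [1.0 if math.isnan(v) else 0.0 for v in row]
        augmented.append(list(row) + indicators)
    return augmented


# Hypothetical covariates: [altitude (m), distance from river (m)].
nan = float("nan")
training_rows = [
    [1200.0, 350.0],
    [nan, 80.0],      # altitude missing
    [1850.0, nan],    # distance from river missing
]
augmented = add_missing_indicators(training_rows)
for row in augmented:
    print(row)
```

Each augmented row has twice as many columns as the original; the last two columns are the indicators that a splitting rule can use in the same way as any other covariate.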
Determining the importance of the independent variables in flood susceptibility modeling in the Kan watershed showed that altitude, distance from river, distance from road and rainfall are important factors affecting flood susceptibility in this region. Examination of the altitude variable shows that low altitudes, which are often at the outlet of watersheds, are highly susceptible to flooding, which is consistent with the findings of Khosravi et al. (2019) and Pham et al. (2020a).
Distance from river is another important factor in flood susceptibility in the Kan watershed, and the results indicate the sensitivity of areas close to the river. Ahmadlou et al. (2019) showed that areas 500-1000 m from the river are highly sensitive to flooding. Flood-prone areas are located near the river because flow rises out of the river channels (Darabi et al. 2019; Panahi et al. 2021). In the Kan watershed, the riverbed and river boundaries have not been respected: several restaurants and villas have been built near the river, and the many orchards in the river area obstruct the flow and increase the sudden release of flood water. Encroachment on the river boundaries and the creation of orchards within them, in addition to causing financial damage to the residents of the area, block the flow in sections such as tunnels, which will cause secondary floods and intensify the damage to people and downstream areas. Another factor affecting flood susceptibility in the Kan watershed is the distance from road. Road construction increases runoff volume and runoff speed because it reduces the surface area available to absorb rainfall, and thus increases the sensitivity to flooding in these areas (Tehrany et al. 2019b; Zhao et al. 2019).
Fig. 11 Flood susceptibility map using the Naïve Bayes, RF and BART models
Fig. 12 Analysis of the frequency of floods on the flood susceptibility maps predicted using the FR method
Fig. 13 Results of relative importance of independent variables in flood sensitivity modeling in Naïve Bayes, RF and BART models
The analysis of rainfall indicates that areas with less rainfall, which are mainly close to the outlet of the Kan watershed, are highly susceptible to flooding. Due to the mountainous nature of the region, most of the precipitation in the upstream areas of the Kan watershed falls as snow, so in these areas the possibility of infiltration is higher. In addition, precipitation in the downstream areas occurs in the form of storms, which are usually more severe in the autumn and cause river inundation and flooding.
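The effects of altitude, distance from river, distance from road and rainfall discussed above are the kind of relationship summarized by a partial effect plot (Fig. 14). One common way to obtain such a curve is partial dependence: the variable of interest is fixed at each value of a grid for every location, and the model predictions are averaged. The sketch below assumes this approach; the source does not state how the plots were produced, and `toy_susceptibility` is a made-up stand-in for a fitted model, not the BART model of this study.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when column `feature` is forced to each grid value.

    predict: function mapping one row (list of floats) to a score
    rows:    list of rows describing the observed locations
    feature: index of the covariate whose effect is examined
    grid:    values at which the covariate is fixed
    """
    if not rows:
        raise ValueError("rows must not be empty")
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy, so the input rows stay intact
            modified[feature] = value  # overwrite only the chosen covariate
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_susceptibility(row):
    """Hypothetical model: susceptibility falls with altitude and distance."""
    altitude_m, river_dist_m = row
    score = 1.0 - altitude_m / 4000.0 - river_dist_m / 2000.0
    return min(1.0, max(0.0, score))  # clip to the [0, 1] range


# Made-up locations: [altitude (m), distance from river (m)].
locations = [[1500.0, 100.0], [2500.0, 600.0], [3200.0, 1200.0]]
altitude_grid = [1000.0, 2000.0, 3000.0]
curve = partial_dependence(toy_susceptibility, locations, 0, altitude_grid)
for altitude, effect in zip(altitude_grid, curve):
    print(f"altitude {altitude:.0f} m -> mean susceptibility {effect:.3f}")
```

With this toy model the curve decreases as altitude increases, which mirrors the direction of the altitude effect reported for the Kan watershed.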
In recent years, due to human interventions and the resulting climate and land-use changes, the rate of flooding and the corresponding damages have increased significantly. Studies such as this one allow managers to reduce flood risks through planning and flood susceptibility analysis. Therefore, more accurate modeling approaches are continually sought to reduce the bias in the prediction of flood susceptibility. In the present study, we showed that the BART model is an accurate model that can be used for effective flood susceptibility modeling. This model can be applied in the future along with other models that have shown high ability in flood modeling studies.
Fig. 14 Partial effect plot for the four most important variables (altitude, distance from river, distance from road and rainfall)

Conclusion
Floods are one of the most frequent and destructive natural disasters and can cause extensive damage. Researchers have developed different methods to investigate and analyze the susceptibility of an area to flooding.
In this study, a Bayesian model (Naïve Bayes), a regression tree type model (Random Forest) and an ensemble type model (Bayesian Additive Regression Tree, BART) were developed to predict flood susceptibility in the Kan watershed. Following multi-collinearity analysis, a total of 15 explanatory (i.e. model input) variables were used as independent variables. The dependent variable consisted of 118 flood locations and 115 non-flood locations, identified from field surveys and other available information.
The validation results for flood susceptibility modeling showed that the Naïve Bayes, RF and BART models all have good predictive performance. However, the new BART model has higher prediction accuracy than the Naïve Bayes and RF models. This is due to the fact that it uses features of both methods in the ensemble setting.
The analysis of the importance of explanatory variables showed that the effect of the independent variables differs between models. However, altitude and distance from the river were more important than the other variables in all three models, meaning that low-altitude areas and areas close to the river are more susceptible to flooding.
The Kan watershed is close to the city of Tehran, and the pleasant climate of this tourist area has led to extensive construction along its riverbanks. These areas receive a large number of tourists in spring and summer and hence are strongly affected by floods. It is therefore necessary to provide flood hazard maps for the region. The results of this research can be used as a baseline map in development projects to identify areas susceptible to flooding and hence to prevent construction in these high-risk areas.