Data used
Several datasets were used in the current study: 1) Digital elevation model (DEM): a 12.5-m-resolution PALSAR ALOS DEM, acquired in 2020 (https://asf.alaska.edu/data-sets), was used to extract the geomorphological layers that condition flooding. Eight parameters were derived: elevation, slope angle, plan curvature, and profile curvature using ArcGIS (v10.8), and the topographic position index (TPI), convergence index (CI), slope length (LS), and topographic wetness index (TWI) using SAGA (v8.0). 2) Topographic and geological maps: topographic maps (Egyptian Survey Authority, 1990) at 1:50,000 scale and geological maps (Conoco, 1987) at 1:250,000 scale were used to extract drainage networks and lithological units. 3) Satellite and Google Earth imagery: three types of remote sensing data were used. Passive Sentinel-2 imagery (https://earthexplorer.usgs.gov), with a spatial resolution of 10 m, was used to derive the NDVI and land use/land cover. Sentinel-1 (SAR) imagery, acquired from the Sentinels Scientific Data Hub (https://apps.sentinel-hub.com), was used to identify flooded areas and build the inventory dataset. Google Earth imagery was used to map flood locations related to previous events. 4) Meteorological data: rainfall records from three meteorological stations (Qena, Luxor, and Qusayr), obtained from the Water Resources Research Institute of the Ministry of Water Resources and Irrigation, were used to estimate the distribution of rainfall depth over the catchment area. 5) Anthropogenic parameters: roads were extracted from government data and high-resolution Google Earth images. 6) Field study: field campaigns were carried out between 2018 and 2021 to study the geographical features of the catchment and to verify the accuracy of the results (flooded areas extracted from historical records, Google Earth, and radar imagery).
Seven steps are required for flood susceptibility mapping, as follows: 1) data collection from various sources; 2) production of a flood inventory map based on remote sensing imagery, historical records, and field visits; 3) extraction of the flood-conditioning factors and construction of a database in ArcGIS 10.8; 4) application of the multivariate (LR) and machine learning (EGB and RF) models; 5) construction of flood susceptibility maps; 6) validation of the flood models using ROC-AUC and other statistical parameters; and 7) selection of the optimal flood model for flood management analysis (Fig. 2).
Inventory map (dependent factor)
Historical and current flood events are crucial for preparing the flood inventory, which serves as the basis for flood susceptibility modeling and for predicting areas vulnerable to future floods (Sarkar and Mondal, 2020). Taking advantage of remote sensing applications, C-band Sentinel-1 Synthetic Aperture Radar (SAR) has been used to detect flooded areas after flood events. These data are freely available, have a short revisit time, operate under all weather conditions, and are sensitive to water bodies and wetland areas (Voormansik et al., 2014; Filion et al., 2016; Anusha and Bharathi, 2019). Two Sentinel-1 SAR images covering the study area and its surroundings were acquired from the Sentinels Scientific Data Hub (https://apps.sentinel-hub.com). These images were acquired on 25 February 2016 (flood event on 14 February 2016) and 19 November 2021 (flood event on 13 November 2021). The data were processed using ENVI v5.5 and ArcGIS v10.8 and were corrected for speckle and noise effects. In the current study, polarimetry was applied mainly to identify wetlands and flooded areas; accordingly, Sentinel-1 imagery was acquired after the 2016 and 2021 flood events. Sentinel-1 is a dual-polarization system that transmits and receives signals in either horizontal (H) or vertical (V) polarization. Different objects on the ground (e.g., forest canopy and water surfaces) have different scattering characteristics and therefore produce distinct polarization signatures (different reflection intensities and polarization exchange between H and V). Polarimetric techniques can separate these scattering mechanisms and thus provide information about the various objects. In the current study, composite RGB (color) images were generated from the Sentinel-1 VV, VH, and VV/VH ratio channels assigned to red, green, and blue, respectively. These RGB color composites of the Sentinel-1 SAR imagery were used to highlight the flooded areas after the flood events in the downstream part of the El-Matulla basin (Fig. 3a, c). Additionally, Google Earth was used to detect and visualize past floods; its time-slider feature allows going back in time to see the impact of flooding on the affected areas. In the current study, Google Earth was applied to extract the flooded locations after the two flood events of 2016 and 2021 (Fig. 3b, d), where the signature of flowing water appears in white (upstream section of the wadi). Field photos were also taken after the 13 November 2021 flood event (Fig. 3e, f). As can be seen from the photos, the flood waters flowed through the villages and isolated them; people feared drowning and stood on the roofs of their houses, and after the flood the community built dikes to protect themselves from future rains. Based on the data obtained from Sentinel-1 imagery, Google Earth, archived surveys, and other historical records, a flood database was created in ArcGIS 10.8. In this work, 480 flood and non-flood points were prepared (Fig. 2). These data were randomly separated into two datasets: a training set (70% of the flood and non-flood points) used to build the flood susceptibility models.
The remaining data (30% of the flood and non-flood points) were used for model testing and to help select the optimal model based on accuracy and performance evaluations (Islam et al., 2021; Tang et al., 2021).
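As an illustration of this partition, the following minimal R sketch randomly splits a hypothetical point table (flood_points.csv, with a binary flood column and one column per conditioning factor) into 70% training and 30% testing subsets; the file name, column name, and seed are assumptions, not the authors' actual files.

```r
# Minimal sketch of the 70/30 random split (hypothetical file and column names)
set.seed(42)                                            # for a reproducible partition
points <- read.csv("flood_points.csv")                  # 480 flood / non-flood points with conditioning factors
n <- nrow(points)
train_idx <- sample(seq_len(n), size = round(0.7 * n))  # 70% of the records
train_set <- points[train_idx, ]                        # used to build the susceptibility models
test_set  <- points[-train_idx, ]                       # 30% held out for validation
table(train_set$flood)                                  # check the flood / non-flood balance
```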
Flood-conditioning factors (independent factors)
Flood susceptibility modeling relies on two main pillars: the dependent factor (the inventory map) and the independent flood-conditioning factors (Rahman et al., 2019; Yagoub et al., 2020; Waqas et al., 2021). In this study, 15 flood-conditioning factors were extracted from various data sources and stored as database layers in ArcGIS 10.8 with a final pixel size of 12.5 m × 12.5 m. These factors are: elevation (EL) (Fig. 4a), slope angle (SA) (Fig. 4b), topographic position index (TPI) (Fig. 4c), convergence index (CI) (Fig. 4d), slope length (LS) (Fig. 4e), topographic wetness index (TWI) (Fig. 4f), profile curvature (PrC) (Fig. 4g), plan curvature (PlC) (Fig. 4h), drainage density (DD) (Fig. 4i), geology (GEO) (Fig. 4j), normalized difference vegetation index (NDVI) (Fig. 4k), land use/land cover (LULC) (Fig. 4l), distance to roads (DtR) (Fig. 4m), distance to wadis (DtW) (Fig. 4n), and rainfall (RF) (Fig. 4o). The weight and contribution of each flood-related factor are presented in Table 1. TWI and LS were determined using equations (1) and (2).
$$TWI=\ln\left(\frac{A}{\tan\beta}\right)$$
1
$$LS=\ln\left(\frac{A}{\tan\beta}\right)$$
2
where A is the catchment area and β is the slope angle (in degrees) (Wolock and McCabe, 1995).
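As a hedged illustration of equation (1), the sketch below computes TWI from a flow-accumulation raster and a slope raster using the terra package in R; the authors derived TWI in SAGA, so this is only an equivalent raster-algebra sketch, and the file names (flow_acc.tif, slope_deg.tif) are assumptions.

```r
# Illustrative TWI calculation (eq. 1) with the terra package; file names are hypothetical
library(terra)
flow_acc  <- rast("flow_acc.tif")      # upslope contributing cells
slope_deg <- rast("slope_deg.tif")     # slope angle in degrees
cell_size <- res(flow_acc)[1]          # 12.5 m for the ALOS PALSAR DEM
A <- (flow_acc + 1) * cell_size        # specific catchment area; +1 avoids log(0)
beta <- slope_deg * pi / 180           # convert slope to radians
twi <- log(A / tan(beta + 0.0001))     # small offset avoids division by zero on flat cells
writeRaster(twi, "twi.tif", overwrite = TRUE)
```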
The NDVI map was produced from the relationship between band 8 (near-infrared radiation, wavelength ≈ 842 nm) and band 4 (red radiation, wavelength ≈ 665 nm) of Sentinel-2 (equation 3). Healthy vegetation strongly reflects near-infrared radiation and strongly absorbs red light, which is why these two bands are used to calculate the NDVI (Rouse et al., 1974).
$$NDVI=\frac{B8-B4}{B8+B4}$$
3
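A minimal R sketch of equation (3), assuming the two Sentinel-2 bands have been exported as hypothetical GeoTIFF files (B04.tif and B08.tif):

```r
# NDVI from Sentinel-2 bands 4 (red) and 8 (NIR) following eq. 3; file names are hypothetical
library(terra)
b4 <- rast("B04.tif")                  # red band, 10 m
b8 <- rast("B08.tif")                  # near-infrared band, 10 m
ndvi <- (b8 - b4) / (b8 + b4)          # values range from -1 to 1
writeRaster(ndvi, "ndvi.tif", overwrite = TRUE)
```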
The topographic position index (TPI) was calculated using equation (4) (Guisan et al., 1999).
$$TPI=Zo-\frac{\sum _{i=1}^{n}Zi}{n}$$
4
where Zo is the elevation of the central cell and \(\frac{\sum _{i=1}^{n}Zi}{n}\) is the mean elevation of the n neighboring cells.
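A small R sketch of equation (4), assuming a hypothetical DEM file (dem.tif) and the terra package; the authors computed TPI in SAGA, so this is only an equivalent moving-window illustration.

```r
# TPI as the difference between each cell and the mean of its 8 neighbours (eq. 4)
library(terra)
dem <- rast("dem.tif")                                   # 12.5 m DEM (hypothetical file name)
w <- matrix(1, nrow = 3, ncol = 3)                       # 3 x 3 neighbourhood weights
w[2, 2] <- 0                                             # exclude the central cell
nbr_sum <- focal(dem, w = w, fun = sum, na.rm = TRUE)    # weighted sum of the 8 neighbours
tpi <- dem - nbr_sum / 8                                 # positive = ridge-like, negative = valley-like
writeRaster(tpi, "tpi.tif", overwrite = TRUE)
```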
ArcGIS 10.8 was used to create the drainage density layer at 12.5 m resolution using the line density tool, while the distance to roads (DtR) and distance to wadis (DtW) layers were created using the Euclidean distance tool. Different LULC types in a watershed can significantly affect flood susceptibility; the LULC layer was extracted by supervised classification of the Sentinel-2A image and classified into five categories: agriculture, urban, sparse tree cover, wadi deposits, and barren rocks. To estimate rainfall for the 100-year return period, daily rainfall data recorded at the three stations for the period 1970-2018 were collected. Statistical probability distributions were tested to select the most suitable distribution for the rainfall data of each meteorological station: the generalized extreme value (GEV) distribution was suitable for the Qena and Qusayr stations, and the Pearson type III distribution was suitable for the Qusayr station. The 100-year return period rainfall values were 70, 57, and 45 mm for the Qena, Luxor, and Qusayr stations, respectively. The inverse distance weighting (IDW) algorithm was used to generate the rainfall distribution map, as it is the most common technique for interpolating scattered points (Kilinc, 2018).
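Since only three stations are available, the IDW interpolation can be illustrated with a very small R sketch; the station coordinates, grid extent, and power parameter below are placeholders, not the values actually used in the study.

```r
# Minimal inverse distance weighting (IDW) sketch for the three rain stations
# Coordinates are placeholders; rainfall values are the 100-year depths reported above
stations <- data.frame(x = c(0, 60, 180),      # hypothetical easting (km)
                       y = c(80, 40, 10),      # hypothetical northing (km)
                       rain = c(70, 57, 45))   # Qena, Luxor, Qusayr (mm)

idw_point <- function(px, py, pts, power = 2) {
  d <- sqrt((pts$x - px)^2 + (pts$y - py)^2)   # distances to the stations
  if (any(d == 0)) return(pts$rain[d == 0][1]) # exact match at a station
  w <- 1 / d^power                             # inverse-distance weights
  sum(w * pts$rain) / sum(w)                   # weighted average rainfall
}

# Interpolate over a coarse grid covering the catchment (placeholder extent)
grid <- expand.grid(x = seq(0, 180, by = 10), y = seq(0, 80, by = 10))
grid$rain <- mapply(idw_point, grid$x, grid$y, MoreArgs = list(pts = stations))
```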
Background of the algorithms
Flood susceptibility mapping is crucial for identifying zones vulnerable to potential flooding and can be carried out by identifying the relationship between actual flood locations and the associated causal factors. Three models were used in this study: logistic regression (LR), extreme gradient boosting (EGB), and random forest (RF).
Logistic regression (LR)
LR is a common multivariate algorithm that identifies the regression relationship between independent factors (two or more) and a dependent variable (one factor) (Liao et al., 1988). It calculates the probability that an event will occur versus the probability that it will not (Sofia et al., 2018). It is a proven approach for flood susceptibility mapping and for identifying the variables that contribute most (Wubalem et al., 2020). Equation (5) shows the relationship between flood events and the conditioning factors.
$$P=\frac{1}{1+e^{-z}}$$
5
where P is the probability of a flooded or non-flooded area. The coefficients are determined as the best mathematical fit for the specified model; each coefficient indicates the influence of its independent variable on the outcome variable, taking into account all the other independent variables. This technique has been adopted as an accurate hazard prediction algorithm that can determine the probability of flooding (Band et al., 2020; Malik et al., 2020). In the current study, the LR model was built in SPSS (v26) using the 15 flood-conditioning factors (independent factors) and the flood and non-flood points (dependent factor). The predicted value of the model is the sum of the products of the coefficient values and the independent variables, as shown in equation (6).
$$z=C+B_{1}X_{1}+B_{2}X_{2}+\dots +B_{15}X_{15}$$
6
where B1 to B15 are the coefficient values and X1 to X15 are the flood-related factors. The fitted linear model expresses the presence or absence of flooding as a function of the values of the independent factors associated with flood occurrence in previous years. Flood susceptibility is expressed as a probability ranging from 0 to 1.
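The study fitted the LR model in SPSS; as a hedged equivalent, the sketch below shows how the same model could be fitted in R with glm(), reusing the hypothetical train_set and test_set objects introduced earlier.

```r
# Logistic regression (eq. 5-6) fitted in R as an equivalent to the SPSS workflow
# 'flood' is the binary dependent factor; the other columns are the 15 conditioning factors
lr_model <- glm(flood ~ ., family = binomial(link = "logit"), data = train_set)
summary(lr_model)                                                     # coefficients B1..B15 and intercept C
lr_prob <- predict(lr_model, newdata = test_set, type = "response")  # probabilities in [0, 1]
```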
Extreme gradient boosting (EGB)
EGB is an advanced supervised machine learning algorithm developed by Chen and Guestrin (2016). The major advantages of the EGB technique are its ability to generate a strong learner from the outcomes of multiple weak learners, to handle missing data in the dataset, to tune the factors without overfitting the model, and to use parallel processing to reduce computation time (Fan et al., 2018; Naghibi et al., 2020). Building an EGB model involves three steps: 1) fitting an initial learner over the entire dataset of factors, 2) fitting the next model to the residuals, and 3) ending the procedure when a stopping criterion is reached (Fan et al., 2018). In the current study, EGB was performed using the open-source xgboost package for R (Chen et al., 2015).
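A minimal sketch of how the xgboost package can be applied to the training set; the hyperparameter values (learning rate, tree depth, number of boosting rounds) are illustrative assumptions, not the tuned values used in the study, and the flood column is assumed to be coded 0/1.

```r
# Extreme gradient boosting with the xgboost package; hyperparameters are illustrative only
library(xgboost)
x_train <- as.matrix(train_set[, setdiff(names(train_set), "flood")])  # 15 conditioning factors
y_train <- train_set$flood                                             # 1 = flood, 0 = non-flood

egb_model <- xgboost(data = x_train, label = y_train,
                     objective = "binary:logistic",                    # outputs probability of flooding
                     nrounds = 200, eta = 0.1, max_depth = 4,
                     verbose = 0)

x_test   <- as.matrix(test_set[, setdiff(names(test_set), "flood")])
egb_prob <- predict(egb_model, x_test)                                 # susceptibility values in [0, 1]
```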
Random forest (RF)
RF, developed by Breiman (2001), is an ensemble approach that grows numerous decision trees as predictors and is run for classification and regression analyses. Many authors have pointed out that RF is a robust and flexible method in which random trees are built from a set of cases through a bootstrapping technique, and the cases not used in the constructed trees are kept "out of the bag" (Zhu and Zhang, 2021). RF has several advantages: 1) the ability to rank parameters according to their contribution using the mean decrease in accuracy (MDA) and mean decrease in Gini (MDG) (Wang et al., 2016), and 2) the ability to process large datasets and obtain satisfactory results (Rahman et al., 2019). In this study, the flood susceptibility map was created using the randomForest package in R (Breiman and Cutler, 2015), and the final map was classified in ArcGIS 10.8.
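A hedged R sketch of the randomForest workflow described above; the number of trees and mtry are illustrative assumptions, and the MDA/MDG ranking relies on the importance option.

```r
# Random forest with the randomForest package; ntree and mtry are illustrative values
library(randomForest)
train_set$flood <- as.factor(train_set$flood)        # classification mode (flood / non-flood)

rf_model <- randomForest(flood ~ ., data = train_set,
                         ntree = 500, mtry = 4,
                         importance = TRUE)          # enables the MDA and MDG measures

importance(rf_model)                                 # MeanDecreaseAccuracy and MeanDecreaseGini per factor
rf_prob <- predict(rf_model, newdata = test_set,
                   type = "prob")[, "1"]             # assumes the flood class is coded as "1"
```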
Multicollinearity, models’ validation and comparisons
Various methods are used to evaluate factor effectiveness in model construction, such as multicollinearity analysis (Yoo and Cho, 2019). The variance inflation factor (VIF) and tolerance (TOL) were applied to evaluate the effectiveness of the factors (equations 7 and 8):
$$TOL=1-{R}_{j}^{2}$$
7
$$VIF=\frac{1}{1-{R}_{j}^{2}}$$
8
where \({R}_{j}^{2}\) is the coefficient of determination of the regression of independent factor j on all the other independent factors. Rahman et al. (2019) indicate that TOL < 0.10 and VIF > 5 point to multicollinearity problems.
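A small R sketch of equations (7) and (8), regressing each conditioning factor on the remaining ones; it assumes the factor values at the sample points are stored (as numeric columns) in the hypothetical train_set data frame used above, excluding the flood column.

```r
# TOL and VIF (eq. 7-8) for each flood-conditioning factor
factors <- train_set[, setdiff(names(train_set), "flood")]   # the 15 independent factors
collinearity <- sapply(names(factors), function(f) {
  fit <- lm(reformulate(setdiff(names(factors), f), response = f), data = factors)
  r2  <- summary(fit)$r.squared        # R_j^2 of factor j regressed on all other factors
  tol <- 1 - r2                        # eq. 7
  c(TOL = tol, VIF = 1 / tol)          # eq. 8
})
t(collinearity)                        # TOL < 0.10 or VIF > 5 indicates multicollinearity
```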
In addition, the reliability of the models' performance was measured using the receiver operating characteristic (ROC) curve and the area under the curve (AUC), a crucial step in flood susceptibility modeling (Metz, 1978). The ROC-AUC was used in this study to quantify the accuracy of the flood susceptibility maps, as it is an appropriate indicator that provides valuable information (Liu et al., 2021). Cao et al. (2020) classified AUC values into four classes: below 0.6, indicating a model with no scientific significance (weak); 0.6-0.7, a moderately significant model; 0.7-0.8, a good model; and above 0.8, a very good model.
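A brief R sketch of this validation step using the pROC package, applied to the held-out test points and the predicted probabilities from any of the three models; the object names follow the hypothetical sketches above.

```r
# ROC curve and AUC for a fitted model, evaluated on the 30% test set
library(pROC)
roc_lr <- roc(response = test_set$flood, predictor = lr_prob)   # observed classes vs. predicted probabilities
auc(roc_lr)                                                     # area under the curve
plot(roc_lr, legacy.axes = TRUE, main = "ROC curve (LR model)")
```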
To compare the susceptibility models, various statistical indices were applied: 1) discrimination accuracy measures, namely hazard accuracy (HAC), non-hazard accuracy (NHAC), overall hazard accuracy (OHAC), and the kappa index (K); and 2) reliability measures, namely the mean absolute error (MAE) and root mean square error (RMSE) (Hembram et al., 2021). These indices are given in equations (9)-(14). HAC, NHAC, and OHAC are calculated from the confusion matrix (Table 2), which is based on the flood and non-flood points.
Table 1
Characteristics and contributions of the flood-conditioning factors.

| # | Dataset | Data type and scale/resolution | Impacts | Figure / Range |
|---|---------|-------------------------------|---------|----------------|
| 1 | EL | Continuous (12.5 m) | Elevation is a regulating factor for climatic conditions such as precipitation (Roundy and Chambers, 2021). | Fig. 4a (72-1083 m) |
| 2 | SA | Continuous (12.5 m) | The slope angle has tremendous effects on runoff, infiltration, and erosion rates; water generally floods areas with low slope (Khan et al., 2016; Vojtek and Vojteková, 2019). | Fig. 4b (0-67°) |
| 3 | TPI | Continuous (12.5 m) | The topographic position index is defined as the height difference between adjacent pixels in the area (Weiss, 2001). A pixel higher than its neighbors has a positive value, while a negative value means it is lower than its neighbors, which generally represents the flooded areas (De Reu et al., 2013). | Fig. 4c (-36.6 to 36.8) |
| 4 | CI | Continuous (12.5 m) | The convergence index (CI) is a terrain parameter that describes ground topography; negative values represent low-lying areas (streams), while positive values represent areas with steep slopes (ridges) (Grohmann and Riccomini, 2009). It plays a crucial role in determining the locations with the highest flood potential (Chowdhuri et al., 2020). | Fig. 4d (-99.3 to 99.2) |
| 5 | LS | Continuous (12.5 m) | Slope length (LS) affects water flow and erosion rate (Nguyen et al., 2020; Yariyan et al., 2020). | Fig. 4e (0 to 877.3) |
| 6 | TWI | Continuous (12.5 m) | TWI is a morphometric parameter that affects overland flow, soil moisture, and water distribution, and is used to delineate flood-prone areas (Mokarrama and Hojati, 2018; Ali et al., 2019). | Fig. 4f (2.8 to 26.5) |
| 7, 8 | PrC, PlC | Continuous (12.5 m) | Profile and plan curvature are flood-related variables that indicate the direction of maximum slope of a surface. The surface is linear when the value is close to 0, concave (convergent) when values are below 0, and convex (divergent) when values are above 0, which affects runoff and the concentration of water (Popa et al., 2019; Rejith et al., 2019). | Fig. 4g, h; PrC (-12.2 to 11.1); PlC (-11.1 to 8.6) |
| 9 | DD | Continuous (12.5 m) | Sangireddy et al. (2016) defined drainage density as the total length of streams in relation to the catchment area. It plays a crucial role in flood susceptibility, as higher drainage density results in higher flood probability (Onuşluel Gül, 2013). | Fig. 4i (0 to 0.743) |
| 10 | GEO | Categorical (1:250,000) | Geology influences hydrological processes (Zhao et al., 2019). | Fig. 4j (15 units) |
| 11 | NDVI | Continuous (10 m) | The normalized difference vegetation index assesses vegetation distribution and its role in runoff and flood hazard (Ullah and Zhang, 2020). | Fig. 4k (-0.51 to 0.79) |
| 12 | LULC | Categorical (10 m) | LULC has a major influence on inundation potential and affects hydrological processes (Hölting and Coldewey, 2019; Yariyan et al., 2020); urban expansion due to human activities increases the rate of runoff (Lei et al., 2021). | Fig. 4l (5 units) |
| 13 | DtR | Polyline (1:50,000) | Roads and other human structures can influence flooding by acting as barriers (Nyssen et al., 2002; Sarkar et al., 2020); road networks also increase impervious surfaces, which reduce infiltration and increase runoff (Swain et al., 2020). | Fig. 4m (0-58,349 m) |
| 14 | DtW | Polyline (1:50,000) | Floodplains are the most susceptible to flooding, so distance from adjacent wadis is critical in flood susceptibility analysis (Mignot et al., 2019; Sarkar et al., 2020). | Fig. 4n (0-12,479 m) |
| 15 | RF | Points (rain stations) | Rainfall is a major factor causing flooding, and many studies have shown that it is a triggering factor that can contribute greatly to the flooding potential (Wang et al., 2021). | Fig. 4o (47-70 mm) |
Table 2
Confusion matrix used to quantify the different accuracy measures.

| Observed data | Predicted: Flood | Predicted: Non-flood |
|---------------|------------------|----------------------|
| Flood | True flood (TF) | False non-flood (FNF) |
| Non-flood | False flood (FF) | True non-flood (TNF) |
$$HAC=\frac{TF}{TF+FF}$$
9
$$NHAC=\frac{TNF}{FNF+TNF}$$
10
$$OHAC=\frac{TF+TNF}{TF+TNF+FF+FNF}$$
11
$$MAE=\frac{1}{n}\sum _{i=1}^{n}\left|{X}_{ei}-{X}_{oi}\right|$$
12
$$RMSE=\sqrt{\frac{1}{n}\sum _{i=1}^{n}{({X}_{ei}-{X}_{oi})}^{2}}$$
13
$$K=\frac{{L}_{c}-{L}_{exp}}{1-{L}_{exp}}$$
14
where TF is the number of correctly classified flood pixels, TNF is the number of correctly classified non-flood pixels, FF is the number of incorrectly classified false-flood pixels, FNF is the number of incorrectly classified false-non-flood pixels, Xei are the values predicted by the model, Xoi are the observed values, n is the number of data points, Lc is the proportion of correctly classified flood and non-flood pixels, and Lexp is the expected agreement.
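A compact R sketch of equations (9)-(14), computed from observed classes and predicted probabilities on the test set; the 0.5 probability threshold used to convert probabilities to classes is an assumption, and the object names follow the hypothetical sketches above.

```r
# Accuracy and reliability indices (eq. 9-14); a 0.5 probability threshold is assumed
evaluate_model <- function(obs, prob, threshold = 0.5) {
  pred <- as.integer(prob >= threshold)          # 1 = flood, 0 = non-flood
  TF  <- sum(pred == 1 & obs == 1)               # true flood
  TNF <- sum(pred == 0 & obs == 0)               # true non-flood
  FF  <- sum(pred == 1 & obs == 0)               # false flood
  FNF <- sum(pred == 0 & obs == 1)               # false non-flood
  n   <- length(obs)
  hac  <- TF / (TF + FF)                         # eq. 9
  nhac <- TNF / (FNF + TNF)                      # eq. 10
  ohac <- (TF + TNF) / n                         # eq. 11 (also Lc for the kappa index)
  mae  <- mean(abs(prob - obs))                  # eq. 12
  rmse <- sqrt(mean((prob - obs)^2))             # eq. 13
  l_exp <- ((TF + FF) * (TF + FNF) + (TNF + FNF) * (TNF + FF)) / n^2  # expected agreement
  kappa <- (ohac - l_exp) / (1 - l_exp)          # eq. 14
  c(HAC = hac, NHAC = nhac, OHAC = ohac, MAE = mae, RMSE = rmse, Kappa = kappa)
}

evaluate_model(obs = as.integer(as.character(test_set$flood)), prob = rf_prob)
```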