2.1 Setting of the study
The research area covers Khuzestan province in Iran and Al-Basrah and Maysan provinces in Iraq with an area of 99494.38 km2 (Fig. 1). The highest altitudes in the region of 3676 and the lowest altitudes are − 37 m. Al-Hawizeh / Al-Azim swamps, located on the border between Iraq and Iran, are known as one of the main dust sources in Iran. According to Javadean et al. (2019), Al-Howizeh/Al-Azim marshes are one of the three greatest important dust sources in the city of Ahvaz. Wetlands in Iran in the last decade due to oil extraction around Al-Azim, have dried up and created a source of dust (Arkian, 2017). Several other factors, including climate change, drought, wars in the region and military operations, surface water control, river diversion, and dam construction projects, have contributed to the destruction of these wetlands and the emergence of their surface as dust source (Cao et al., 2015; Rashki et al., 2021). The Khuzestan Plain, which forms a large part of Khuzestan Province in southwestern Iran, contains much small dust sources. Approximately nine percent of the Khuzestan plain is the potential of producing dust, part of which is sensitive to the dust source of the Al-Azim dried-up swamp (Heidarian et al., 2018(. These source areas contain degraded pastures, abandoned rainfed farmlands and irrigation fields, and temporary salt lakes. Among these salt lakes, we can mention Maleh Playa, Sabzeh Zohreh Sharghi, and Sabkha Karun Gharbi (Abyat et al., 2019).
Figure 1. Location of the research area
In the present research, the flowchart (Fig. 2) shows two main stages. At first, DSA was determined using satellite images (MODIS) and dust detection indices. In the second stage, sensitivity maps were prepared by six environmental forecasters and the DSA extracted from satellite images and three machine learning algorithms, namely logistic regression, random forest and MARS model were examined. In the following, the working method and dataset are explained in detail.
Figure 2. Flow chart for preparation of dust susceptibility maps using LR, MARS, and RF models
2.2 Dust Source Identification
In this study, MODIS sensor images from Terra and Aqua satellites were collected (Vickery and Eckardt, 2013). Initially, the days of dust occurrence in the years 2005 to 2020 were determined using meteorological data, such as visibility less than 2000 m, wind speed above 7 m per second, and cloudy conditions. Then, the days when the occurrence of dust coincided with the imaging were determined, and finally, 31 MODIS sensor images were used by the Terra satellite related to the selected dust days from 2005 to 2020.
MODIS images were selected due to several advantages including the free access, the high spatial resolution (LAADS Alerts, 2021), and the wide view (Hahnenberger and Nicoll, 2014; Lee et al., 2021). The method used to identify and detect DSA is the Dust Detection Products (DEP) method or the Miller method, which was calculated for all satellite images (Miller, 2003). Finally, false-color combinations were used to identify dust storms. These were according to the DEP plus bands 3 and 4 of MODIS images. The false-color combinations included DEP (Miller, 2003), B4, and B3. A Gaussian columnar model was used to detect dust. This technique is developed on a cone of dust diffusion viewed in the processed MODIS images, where the apex shows the source of the dust (Walker et al., 2009; Lee et al., 2009).
2.3 Dust source effective factors
Quantifying the relative significance of the factors influencing the area's dust source is essential to understanding the dust cycle (Kok et al., 2018). To prepare the potential map of the DSA, after preparing the distribution map of the DSA, six factors including soil maps, lithology, slope, vegetation index (Normalized Difference Vegetation Index), geomorphology units, and land use were considered. These factors play the most important roles in creating DSA (Kandakji et al., 2020).
Dust storms are more likely to occur in regions with erosion-sensitive lithological units than in areas with resistant units (Francis et al., 2017). Lithology classes have significant effects on dust storms and occur more in regions susceptible to lithology than in regions with dust storm-resistant units (Francis et al., 2017). To prepare the geological map of the research area, the world geological map at a scale of 1: 250,000 was used. Also, lithological units were obtained from the map in 6 classes Siltstone and Marne, dolomitic limestone and clay, limestone, conglomerate, river sediments, and wind sand (Fig. 3a). Dust storms and the amount of dust raised to depend on the physical characteristics of the soil (Alilou et al., 2019; Rahmati et al., 2020). Vegetation-free soils are erosion-sensitive areas with high dust production capacity (Gholami et al., 2020). In this study, 6 main types of floors including Clay-Loam, Loam, Loamy-Sand, Sandy-Loam, Sand, and Salty-Clay were identified (Fig. 3b).
Plain and flatlands with low slopes have a better potential for dust generation and wind erosion than steep lands. In these regions, the wind quickly reaches the threshold of wind erosion and leads to dust storms (Wu et al., 2016). Using the DEM of 30 m, the slope map of the research area was readied in six classes. (Fig. 3c). In many studies (e.g., An et al., 2018; Lee and Sohn, 2011), by analyzing the spatial and long-term changes of vegetation, a negative relationship between vegetation and the number of dust storms have been identified. The vegetation can absorb wind energy, increase ground roughness, and protect soil from wind erosion (Feng and Janssen., 2018). The Normalized Vegetation Differential Index (NDVI) was used to prepare the vegetation map of the area. For this goal, ETM + images of the Landsat 8 satellite (related to 2019) were used. The NDVI map was prepared using Eq. 2.
\(\text{N}\text{D}\text{V}\text{I}=\frac{\text{N}\text{I}\text{R}-\text{R}\text{e}\text{d}}{\text{N}\text{I}\text{R}+\text{R}\text{e}\text{d}}\)
|
(2)
|
The NIR indicates infrared and the Red indicates the red band. The NDVI was divided into 3 vegetation classes including − 0.0196-0, 0-0.025, and 0.025 < throughout the study area (Fig. 3d).
Land use is strongly associated with dust storms; for example, lack of vegetation and degraded soil crust has a greater capacity to produce dust (Goossens and Buck, 2014). The land use map was prepared using Landsat 8 satellite images for summer 2018 due to the highest amount of dust occurring in this period. The land use map was classified into the Agriculture land. Bare land, Forest, Marshland, Residential Area, Shrub Cover, Sparse Shrub Cover, and Wetland. (Fig. 3e).
Soil particles released into the air are closely related to the geomorphological characteristics of DSA. (Lee et al., 2021). To prepare the map of geomorphology units, first, the required map was prepared using slope and topographic maps with an accuracy of 1: 50,000 and the geology of the research area. In the next step, after interpreting the satellite images of Google Earth and Landsat 8 in 2017, more detailed information was extracted and transferred to the initial map. Finally, the map of the geomorphological units of the region in six classes was prepared (Fig. 3f).
Figure 3. Map of effective factors of dust source
2.4 Modeling
To examine alignment make a preliminary study between among independent variables, multi-collinearity was analyzed. If there are several lines between independent variables, the error increases, and the accuracy of the model prediction decreases (Park, 2017). Two indicators, A tolerance of less than 0.1 (≤ 0.01) and a variance coefficient of ≥ 10 indicating alignment were used to examine the alignment between the independent variables (O’brien, 2007).
The first step for modeling is to prepare an educational data set and validation (Boroughani et al, 2020). To perform the modeling, first using the stratified random sampling method (Lee et al., 2021), dust source was divided into two groups of 70% (106 dust sources) for modeling and 30% (46 dust sources) for evaluation (Validation). To analyze the sensitivity of the DSA using data mining algorithms, and the number of DSA in the area, the number of points without the DSA was randomly extracted. In the next step, the data related to the effective factors in the area of dust source and source distribution map were entered into the R software, and LR, MARS, and RF algorithms were used using packages (glm; earth, random forest etc.). The dust source susceptibility was then calculated for each pixel in the study area. Finally, the results were transferred to the ARC MAP 10.5 environment to produce a dust susceptibility map. A description of the algorithms used is given below.
2.5 Logistic Regression (LR)
LR is a nonlinear mathematical model for determining the relationship between a binary dependent variable and several independent variables (Crawford et al., 2021). In this research, using LR, the most effective factors in DSA were investigated. The output of the model should have coefficients between 0 and 1, which through logit theory gives probabilities higher than 0.5 value of 1 (affecting the DSA) and less than 0.5 value of zero (without the effect of the DSA) (Martinez-Garcia et al., 2011; Chen and Chen., 2021).
The logistic model in the simplest form can be expressed as Eq. 3:
\(\text{P}=\frac{1}{{1+\text{e}}^{-\text{z}}}\)
|
(3)
|
Where P is the possibility of an occurrence of an event (DSA), the value of which fluctuates from 0 to 1 in an S-shaped curve, Z is defined as an equation (linear logistic model) whose value fluctuates from -∞ to +∞ (Wang et al., 2021). Eq. 4 is stated (Nhu et al., 2020).
\(\text{Y}=\text{L}\text{o}\text{g}\text{i}\text{t}\left(\text{p}\right)=\text{ln}\left(\frac{\text{p}}{1-\text{p}}\right)={\text{C}}_{0}+{\text{C}}_{1}{\text{X}}_{1}+{\text{C}}_{2}{\text{X}}_{2}+\dots +{\text{C}}_{\text{n}}{\text{X}}_{\text{n}}\)
|
(4)
|
P: Probability of an occurrence (probability of occurrence of DSA): The so-called odds ratio or probability, the width of the source or a constant coefficient, and C1, C2,… Cn, the coefficients of the independent variables (X1, X2,…, Xn). This model analyzes the presence or absence of dependent variables (DSA) to independent variables (Crawford et al., 2021).
2.6 Random Forest (RF) model
RF is one of the best learning algorithms (Schonlau and Zou, 2020). For many datasets, it performs high-accuracy classification (Lee et al., 2017) and another positive feature is that it works very well with huge data records (Jiang et al., 2021). RF classification was performed initially on educational information and then on validation confirmation information (Schonlau and Zou, 2020; Quevedo et al., 2021). Finally, the model with the lowest Out of Bag (OOB) error is chosen. To determine the priority of the effect of each of the effective factors in RF, two factors of mean reduction accuracy and Gini mean reduction were used (Wu et al., 2021).
2.7 Multivariate Adaptive Regression Spline model (MARS)
MARS is one of the local nonparametric models first proposed by Friedman (Wang et al., 2021). In this method, regression models create flexibility for dividing the response space into intervals of input variables and fitting a base pan (spline). (Wang et al., 2020; Friedman, 1991). This method is based on functions called base functions (functions that are used to display information in one or more variables), which are defined for each variable as follows:
Y = max (0, x-t) and Y = max (0, t-x)
|
(5)
|
which is t reached a Knot and is a threshold value. These functions are called spline functions, where t pairs are reflected in a node. The general shape of the MARS model is defined as follows (Roy et al, 2020):
\(\widehat{Y}=\text{f}\left(X\right)+{c}_{0}+\sum _{k=1}^{M}{c}_{k}{B}_{k}\left(X\right)\)
|
(6)
|
In Eq. 6, Y is the value of the objective parameters defined by the function f (X), B_k which are the base function, and the coefficients c_k which are determined by minimizing the sum of the remaining squares. In the second step, the basic functions that are less important and effective in estimating are removed. Finally, the best model is chosen based on the minimum criterion called Generalized Cross-Validation (\(GCV\)). GCV is used to identify the sub-models that have the least effect on modeling.
Suppose \({GCV}_{k}\)the value \(GCV\) for the \(k\) model is in the elimination phase the GCV on which the best adaptive multiple spline regression models is selected is also as follows (Martinello et al., 2021; Hassangavyar et al., 2020):
\({GCV}_{k}=\frac{1}{n}\sum _{i=1}^{n}\left({y}_{i}-{\widehat{f}}_{k}\right.{ \left({x}_{i}\right))}^{2}/{ (1-\frac{C\left(k\right)}{n})}^{2}\)
|
(7)
|
That \({\widehat{f}}_{k}\) the model is estimated in step k of the regression elimination stage and \(\lambda . m\) + Number of model sentences in step k =\(C\left(k\right)\)
\(m\) represents the number of nodes of the linear functions in the model and \(\lambda\) is called the smoothing parameter and in practice is usually chosen between 2 and 4.
2.8 Evaluation of dust source susceptibility map
One of the most important parts of modeling is the validation of predicted results (Shano et al., 2021). In this research, all identified DSA were separated into two parts: One is for training (70% DSA) and the other is for validation (30% DSA). DSAs were selected based on random logic for training and validation of models. The ROC-AUC (r receiver-operating characteristic area below the curve) and RMSE (the root mean square error) were used to evaluate the accuracy of the statistical model. The most ideal model has the highest subsurface level and its value is between 0.5 to 1 (Nandi and Shakoor., 2009). The closer the surface below the curve is to one, the more accurate the zoning map (Hao et al., 2020). Qualitative-quantitative correlation should be below the curve and estimation of the 0.9-1, excellent; 0.8–0.9, very good; 0.7–0.8, good; 0.6–0.7, average; 0.5–0.6, weak is suggested (Arabameri et al., 2021). The ROC curve was used in this study to appraise the performance of the generated susceptibility map. However, some studies suggest that model performance evaluation using a factor such as AUC may not be appropriate because in some cases high AUCs may not guarantee the high accuracy of spatial predictions (Chaudhary et al., 2021). For this case, RMSE measurements are also used as additional criteria for evaluating model predictions and supporting decisions in model selection.