In this study, the process of GPM involves five different steps as shown in Fig. 1. In the first step, investigation region features were illustrated, and the wells present in the region were identified. In the second step, data collection was done, and a spatial database was created for effective criterion. Distribution maps of well data along with fifteen different considered factors were ready for modelling including plan curvature, drainage distance, distance to fault, rainfall, soil, TWI, land cover, slope, drainage density, elevation, slope length, fault density, profile curvature, lithology, and aspect. In the third step, the spatial correlation between effective criteria and available wells was estimated using bivariate models such as CF, EBF, FR and CNN. The fourth step involves the preparation of GPM by considering the weights of CF, EBF, FR and CNN models as input values for LMT and RF models. Finally, AUC and ROC were used to validate the GPM, and the most suitable model was chosen in the end.
2.1 Study area
Islamabad is the federal capital city of Pakistan. It is located at 33°28´ and 33°48´ north latitudes and 72°48´ and 73°22´ east longitudes as shown in Fig. 2. It is situated on the northern edge of the Pothohar plateau. Mountains and plains are spread over 1175 m in the city. There is a diversity in the seasonal conditions of the city. It has an average minimum temperature of 2°C and an average maximum of 45°C. Islamabad is a semi-arid city with an average monthly precipitation of 95.2 mm. In the monsoon season, the monthly average precipitation increases up to 267 mm and 309 mm in July and August, correspondingly.
The sudden trend of migration from rural to urban regions has been increased. The same is the case with this city. The population of the city has reached over 2 million. This rapid growth in population has put a burden on the natural resources and affected the environment adversely. In the city, the water demand is mainly fulfilled by surface water and groundwater. Moreover, the Rawal dam, Khanpur dam and Simli dam are the major water reservoirs of the region. Surface water and 180 tube wells are currently being used by the Capital Development Authority (CDA) for water supplies to the city. Some private companies are also supplying water to the inhabitants.
2.2 Good inventory
The Water Resources Administration of Islamabad published reports that provide the groundwater wells data. Based on water management reports and past research, in this study data of groundwater wells with high potential (≥ 11 m3/h) and electric conductivity (EC) and an average pH of 350 µmhos/cm and 6 was used, correspondingly. In the current study, wells were classified as random and divided into two testing and training datasets. Training includes 70% of the total locations (108 wells) and the remaining wells were 30% (34 wells) considered as validation. The station of water wells was already built-in, and these were picked up randomly to remove biasness. The location of the wells is demonstrated in Fig. 3a.
2.3 Conditioning Factors
There are mainly 5 parameters (hydrological, topographic, geological, climate and ecological parameters) involved in modelling which are subdivided into 15 factors according to the classification of each factor. The parameters are discussed in order of decreasing discriminatory value.
2.3.1 Hydrological parameters
Drainage distance, drainage density and TWI play a significant role in hydrological factors. Drainage distance was obtained by forming a 300 m buffer Polygon around the streams. Drainage density was extracted from SRTM-based DEM. TWI was calculated from bands of Landsat satellite. In hydrological processes, topographic control is illustrated by the TWI factor. Moore, et al. [43] introduced an equation to find TWI:
$$TWI=\text{ln}\left(As|tan\alpha \right)$$
1
Where, As represents the cumulative upslope area and α is the slope angle. TWI, as shown in Fig. 3h, is classified into ranges from low (1.85) to high (7.16). Areas with higher values of drainage density are unsuitable for the production of groundwater resources because it increases the surface runoff and decreases the infiltration process [44]. Figure 3b and Fig. 3i shows the classification of drainage distance in meter (< 100, 100–200, 200–500, > 500) and drainage density in \({m}^{2}\) (< 1, 1−1.84, 1.84–2.68, 2.68–3.53, > 3.53) respectively.
2.3.2 Topographic Parameters
Different topographical parameters considered in this study including slope, slope length, profile curvature, elevation, aspect, and plan curvature were extricated from ASTER digital elevation model (DEM) with a spatial resolution of 30m. SAGA GIS®2.1.2 and ArcGIS®10.3 software were used to create these layers. Previous literature, natural break technique and features of the study area define the classification of maps. Terrain surface crossing with a horizontal plane made counter line curvature termed as plane curvature while surface crossing vertical plane made flow line curvature known as profile curvature [45]. Figure 3c and Fig. 3d show the classification of plan and profile curvature in the study region. Both plan and profile curvatures include five classes in radian/sec: <−2.34,−2.34-(−0.6),−0.6−0.6, 0.6–2.6, > 2.6 and <−3.14,−3.14-(−1.1),−1.1−0.7, 0.7–2.9,>2.9, respectively.
The slope is responsible for the speed of groundwater runoff and infiltration [46]. This factor is classified into five classes including, < 12, 12–24, 24–36, 36–48, > 48 (degrees) as shown in Fig. 3j. Moore and ID.Burch [47] presented an equation to determine the potential of soil loss by combining slope steepness and slope length, collectively known as the LS factor:
$$LS={\left(Bs|22.13\right)}^{0.6}{\left(\text{sin}\alpha |0.0896\right)}^{1.3}$$
2
Where Bs signifies the particular catchment area and α angle of slope. Figure 3k shows the five classes (< 10, 10–20, 20–30, 30–40, > 40) of slope length in meters.
At different elevations, soil type and vegetation vary due to climatic changes [48]. Study area, as shown in Fig. 3m, includes different elevations in meters, < 531, 531–617, 617–756, 756–994, > 994. The slope aspect is another factor that influences the vegetation and rainfall intensity. Figure 3p illustrate the 9 classifications of slope aspect in the investigation region, including flat, north, north-east, east, south-east, south, south-west, west, north-west, and north.
2.3.3 Geological parameters
Geological parameters include two factors: lithology, fault density and distance from faults. Lithology and faults data was obtained from the Geological Map of Pakistan issued by the Government of Pakistan. Through hydraulic conductivity, the potential of groundwater is influenced by lithology. The map shows that lithology has three classes: sandstone, limestone, and unconsolidated deposits. Spatial network distribution and groundwater accumulation are controlled by faults [49]. Fault density in kilometer Square and distance from fault in kilometers are classed into various classes (100, 100–200, 200–500, 500–1000, 1000–2000, 2000–5000, > 5000) and (< 15, 15–41, 41–65, 65–91, > 91) as shown in Fig. 3n and Fig. 3e, respectively.
2.3.4 Climate parameters
The climate parameter includes rain which influences the recharge of groundwater [50]. Rainfall data is acquired by interpolation of annual rainfall records and then it was blended with the rainfall map issued by Pakistan Metrological Department (PMD). It significantly assesses the flow of water and the nutritional status of the basin. ArcGIS®10.3 is used to generate a map as shown in Fig. 3f, of rainfall in millimeters and divided into five different classes: <882, 882–999, 999–1116, 1116–1233, 1233–1350.
2.3.5 Ecological parameters
Soil and land use are considered among the most important ecological factors. The source of soil and land use is FAO land cover. Soil is an important parameter defining the surface and subsurface runoff generation and accumulation [51]. Soil map, as shown in Fig. 3g, is classified into seven different types of soil including, calcareous loamy soil piedmont, calcareous silty soil-gullied land complex, rough broken land, mountainous land with nearly continuous soil, mountainous soil with patchy soil, urban and calcareous loamy soil. Land use indirectly or directly influenced evapotranspiration, runoff, and permeability [51]. Figure 3i shows the four classifications of land use including bare area, built-up, vegetation and water.
2.4 Models
In this study, on training data four bivariate statistical models including CNN, CF, FR and EBF are applied and points of all considered factors are calculated. After that, on these values, two hybrid models i.e. RF and LMT are applied over each point individually. In the first step, CNN, CF, FR and EBF are used to prepare the input file and then run this input file on RF and LMT model. Detail of each model is given below.
2.4.1 CNN Model
DLAs (Deep Learning Algorithms) are being developed on structure human brain. These are based on computational system like Artificial Neutral System (ANN) [52]. CNN has been adopted by many research scholars. CNN based classification and prediction has been employed to earth sciences and many other disciplines [53–55]. The features like Local connections, pooling, shared weights, and especially multiple layers make CNN recognizable and distinctive. Input data in CNN are images or they might be explained as images. This the main concept in CNN which makes it differentiable. This distinctive feature reduces the parameters which speeds up the process. Normally, CNN architecture is composed of different layers such as pooling layers (PLs),including convolutional layers (CLs), and the rectified linear unit (ReLU). PLs monitors overfitting, allow stable conversion, and reduces the convulsion structures which improves computational resulting from the convolutions. CLs identifies the Obscurities are identified by CLs. It provides the best results for data classification [56]. The property of ReLU activation enhances the nonlinear characteristics of the network through the ReLU activation function. For studies, scholars use different structures on the basis of the the data type, the image, and the objective. Some popular examples of structures are GoogleNet [57], ZFNet [58] and VGGNet [59]. Many previously published research articles and papers have explained in detail the types of layers, Primary Parameters of learning, and handling of training data in CNN [60].
2.4.2 FR Model
The FR shows the occurrence probability of a specific action or a factor. It determines the relationship between wells and effective factors of groundwater [35]. The FR is the proportion of the area with wells to the total investigation region. The below equation is used for the calculation of this factor:
$$FR= \frac{\frac{\text{N}\text{p}\text{i}\text{x} \left(\text{S}\text{X}\text{i}\right)}{{\sum }_{i=1}^{m}\text{N}\text{p}\text{i}\text{x} \left(\text{S}\text{X}\text{i}\right)}}{\frac{\text{N}\text{p}\text{i}\text{x}\left(\text{X}\text{j}\right)}{{\sum }_{j=1}^{n}\text{N}\text{p}\text{i}\text{x}\left(\text{X}\text{j}\right)}}$$
3
The numerator represents the ratio of wells number to the number of the total well in a class while denominators are the proportion of the number of the pixels in that class to the number of the total pixels. 15 groundwater affecting parameters are used in this study. Each parameter is termed as criteria, and each criterion is classified into various subcategories. Each subcategory is known as a class. Simple and easy concepts are used in the FR model. It is also capable of analyzing bivariate statistical models and classification of each factor. This method ignores the relationship between variables, which is counted as a drawback of this method [61].
2.4.3 CF Model
The CF model, as a bivariate statistical model, was introduced in 1975 by Shortliffe and Buchanan. This model was later modified in1986 by Longman [62, 63]. GIS is used to integrate the results in a spatial database in this model. Well outcomes frequency in layers each class is the base for calculation of this factor. It is calculated by the following equation [64]:
$$\left\{\begin{array}{c}cf= \frac{\text{p}\text{p}\text{a}-\text{p}\text{p}\text{s}}{\text{p}\text{p}\text{a}(1-\text{p}\text{p}\text{s})}, if ppa \ge pps\\ cf= \frac{\text{p}\text{p}\text{a}-\text{p}\text{p}\text{s}}{\text{p}\text{p}\text{s}(1-\text{p}\text{p}\text{a})}, if<pps\end{array}\right.$$
4
The proportion of well-pixels number in a division to the total pixels of the same division is called ppa. The proportion of the total pixels with wells of the study region to the total pixels of the generated map is called pps. The values of the CF model range from−1 to + 1. Positive values are indicators of an increase in precision, and less certainty is indicated by negative values [65].
2.4.4 EBF Model
In 1967, Dempster proposed the Dempster-Shafer theory of evidence. It was later advanced by Shafer in 1976. On the basis of this theory, a statistical bivariate technique, i.e. the EBF model was developed [66]. Four basic EBFs are used which include the rank of belief (Bel), ambiguity level (Unc), rank of disbelief (Dis), and rank of plausibility (Pls). Bel parameter is considered pessimistic mode and has a low probability. State of high probability and optimistic mode are considered by Pls parameter. In this model, Pls parameter value is greater or equal to the Bel factor value. These two parameters differences give the third parameter i.e., Unc. The spatial relationship of the active parameters and well occurrence is estimated from the extracted data by this model. This model also estimates the spatial relationship between each class factor as well. The following equations were used for this model [67]:
$${Bel}_{{c}_{ij}}= \frac{{w}_{{c}_{ij}D}}{{\sum }_{f=0}^{m}{w}_{{c}_{ij}D}}$$
5
$${w}_{{c}_{ij}D}= \frac{N({C}_{ij}\cap )/N\left({C}_{ij}\right)}{N\left(D\right)-N({C}_{ij}\cap D)/N\left(T\right)-N\left({C}_{ij}\right)}$$
6
Equation 6 demonstrates the conditional existing well probability (i.e., the occurrence of groundwater) in the nonexistence of \({C}_{ij}\) (each division of each parameter). \({W}_{{C}_{ij}D}\) (Weight of \({C}_{ij}\).) shows the belief that there is a well instead of it being lacking. Here, \(m\) shows the considered number of criteria for modelling, each class of each criterion is represented by \(i\), each criterion is represented by \(j\). In these relations, N (T) shows the total pixels in the investigation region and N (D) represent the total well-pixels.
2.4.5 RF Model
The RF model is a well-known data mining algorithm. Different classification trees use this model. A significant number of decision trees are generated by switching and changing the target affecting variables. Then all trees are integrated by the algorithm in a prediction [68]. Then each tree’s original data is randomly chosen in the training process [69]. A number of tree depths are selected to start a for loop and divide dataset for validation inside it. After training the decision tree with that depth on training data and testing it on the validation set, keep validation error, plot it and find the global minimum of a validation error. Afterwards, three user-defined parameters are included in the RF model. It includes the number of factors used for constructing each tree, tree’s number, and minimum node number. Strength enhancement of independent trees and power reduction of the relationship between them improves the forecast of the RF model [70]. All accessible information for its tree growth is not used by the RF model. Only 66% of Bootstrap information is utilized. A forecast variable is randomly implemented during the growing phase. In the same phase, this predictor variable is used for node generation in a tree. Now the maximum size of the decision tree is produced [71]. The remaining 33% of the information is used for the evaluation of the fitted tree. This procedure is repeated numerous times. All likely values average are used as the final forecast of algorithms [70].
2.4.6 LMT Model
The nature of the logistic model tree (LMT) is a non-parametric classification model. It is associated with a supervised training algorithm combining decision tree learning and logistic regression (LR). Quantitative variables or classified variables are predicted in this model on the basis of qualitative and quantitative predictor variables collection. In reality, the hierarchical model’s decision tree includes decisive means that decompose into homogeneous regions of independent variables [72]. Interpretation of a set of laws is made by decision trees that lead to a category or value. Measurement of distance is a distinction between building techniques of a decision tree. Decision trees are known as classification trees, and they are applied to predict discrete variables. Those decision trees which are applied for forecasting continuous variables are called regression trees [73]. The objective of a decision tree is discovering methodology in the way of a series of procedures for presenting the prediction results which are extricated from input parameters. For the isolation of an increment in logistic type data, the LR model is produced in each tree leaf using the Logit Boost algorithm. The CART algorithm is used to cut the tree. LR of an additive in the Logit Boost algorithm is used for each C class (non-well or well) with least squares, where C class defines the drilling of water wells for the purpose of obtaining groundwater data as shown in the equation [74]:
$${L}_{c}\left(X\right)={\beta }_{0}+\sum _{i=0}^{D}{\beta }_{i}$$
7
$$\left(C|X\right)=\frac{\text{e}\text{x}\text{p}\left({L}_{c}\left(X\right)\right)}{{\sum }_{{c}^{{\prime }}=1}^{C}\text{e}\text{x}\text{p}\left({L}_{{c}^{{\prime }}}\left(X\right)\right)}$$
8
Where C shows the class number in the equation.
2.4.7 Validation
Estimating the classification results and evaluating its competence to recognize a particular class is done with the help of the ROC curve and is defined as a probability curve. AUC (area under ROC curve) is used for validation of the method’s sensitivity [36]. The relation of classified and unclassified values is justified by the concept of sensitivity. The efficiency of a classifier in identifying the class is directly proportional to the baseline change for a specific class in the ROC curve. The AUC is also calculated considering the trend diagram of the ROC curve [75]. AUC gives a total performance measure across all possible classification thresholds. The higher the value of the AUC, the greater that method's reliability as a predictor [76]. This index is used to assess the assigned values of the objective class properly (True Positive), assigned incorrect class values (False Positive), the non-allocated values to the specified class (True Negative) and non-allocated incorrect class values (False Negative). X-axis and Y-axis of the equation are calculated in Equations [77]:
$$X=1-\frac{TN}{TN+FP{\prime }}$$
9
$$Y=\frac{TP}{TP+FN}$$
10
The state of actual positive output is called True positive (TP). The state of actual positive output and negative predictive value is termed as False negative (FN). The situation of actual negative output and negative predictive value is known as True negative (TN). The situation of actual negative output and positive predictive value is presented by False positive (FP). The mentioned indices come from the confusion matrix. This is the base for the calculation of the ROC curve. The validation of the model is carried out by calculating ROC Curves. Such curves explain how accurate the model is which means how reliable the results are.