Characterizing Groundwater Potential Using GIS-Based Machine Learning Model in Chihe River Basin, China


Mapping groundwater potential over space by combining environmental variables with machine learning models is of great significance for regional water resources management. Taking the Chihe River basin in Anhui Province as an example, thirteen influence factors were used to predict the spatial distribution of groundwater: elevation, slope, aspect, plan curvature, profile curvature, topographic wetness index (TWI), drainage density, distance to rivers, distance to faults, lithology, soil type, land use, and normalized difference vegetation index (NDVI). The groundwater potential of this region was predicted using three GIS-based machine learning models: logistic regression (LR), deep neural networks (DNN), and random forest (RF). The accuracy of the predictions was then evaluated using the root mean square error (RMSE), the mean absolute error (MAE) and the correlation coefficient (R). The results show that there is no collinearity among the 13 influence factors, so all of them can serve as environmental variables for evaluating regional groundwater potential. All three models indicate that the area is dominated by moderate to high groundwater potential: these classes account for 81.14% of the area in the LR model, and for 90.36% and 87.55% in the DNN and RF models, respectively. According to the evaluation indexes, all three models have high prediction accuracy, and the LR model performs best. The good predictive capability of these machine learning techniques can provide a reliable scientific basis for the spatial prediction of groundwater potential and the management of water resources.


Introduction
With socio-economic development and population growth, the contradiction between water supply and demand has become increasingly acute. As the demand for groundwater increases, delineating areas of high groundwater potential has become an important tool for groundwater measurement, protection and management (Ozdemir et al. 2011; Kordestani et al. 2019). Groundwater potential assessment is vital for regional development, since groundwater is an essential source of water for domestic use, industrial production and agricultural irrigation.
Groundwater potential analysis aims to determine the best area for groundwater development by studying the multiple factors that affect the presence of groundwater in a given area (Díaz-Alcaide and Martínez-Santos 2019). Traditional methods for determining the best development area include field surveys, geophysical methods and drilling projects. The common prediction methods for groundwater potential are geostatistical, and are often combined with geographic information (2009) used a machine learning model to analyze groundwater potential and found that the combination of the weight of evidence (WOE) and artificial neural network (ANN) models has good objectivity, that the results of these two models are easy to implement in ArcGIS, and that combining them with index factor analysis achieves better prediction results.
However, the performance of the various models differs with the regional geological environment, climatic factors, and regional scale. In addition, the selection of index factors is restricted by objective conditions and follows no fixed standard. Machine learning models are well suited to processing multi-dimensional, nonlinear, massive data and to improving the generalization ability of a model (Kermani et al. 2021). They have been applied to research in many fields. Especially in the study of groundwater spatial variability and spatial prediction, machine learning methods that work with multi-source environmental variables show great potential. Common models include random forest (RF) Evaluation of groundwater potential based on frequency ratio (FR), weights-of-evidence (WoE) and logistic regression (LR) models, and the evaluation results show that the predictions of the three models have high accuracy. The groundwater spring potential map can be useful for planners and engineers in water resources management and land-use planning. Hao et al. (2016) found that deep learning algorithms have more levels of non-linear operations than shallow learning methods such as neural networks and support vector machines. Hengl et al. (2015) found that random forest can avoid overfitting and is insensitive to multicollinearity. The model handles missing data easily, and the RF algorithm consistently outperforms the linear regression algorithm.
Hence, groundwater potential mapping can be applied to the development and planning of regional water resources management systems. It also reveals the relationship between groundwater resources and human activities, and helps to understand ecosystem vulnerability and over-exploitation. This work can also help water resource planners formulate sustainable groundwater management strategies and determine suitable locations for production wells in the Chihe River basin (Rizeei et al. 2019).
In this area, the spatial distribution of groundwater is still unclear. No groundwater potential evaluation has been carried out, so it is difficult to provide an intuitive basis for groundwater development and management in the study area. Rather than using traditional approaches, we leverage the correlations between various environmental influence factors and the presence of groundwater in machine learning models that analyze the relationships among the data sets and use them for prediction.
The objectives of this study are to map the groundwater potential of this area, to evaluate the accuracy of the LR, DNN and RF models using the RMSE, MAE and R evaluation indexes, and to find a suitable method for groundwater potential prediction. The datasets used for training, validation and prediction are easily available in GIS, and processing these data in the models helps to reveal the groundwater potential. Thus, a new methodological framework using logistic regression (LR), deep neural networks (DNN) and random forest (RF) models was developed to improve groundwater potential mapping in the Chihe River basin. The groundwater potential maps are expected to provide the necessary data support for groundwater assessment and water resources management in the Chihe River basin.

Description Of The Study Area
The study area is located in the eastern part of China, between longitudes 117°26′13.5″ and 118°11′50″ E and latitudes 32°17′47″ and 32°37′46″ N, and covers an area of about 4008 km² (Fig. 1). The elevation of the Chihe River basin varies between 11 m and 383 m above sea level. The Chihe River system runs through the whole area. The area lies north of the Jianghuai drainage divide and belongs to the Huaihe River network. Its east and west sides are higher than the central region.
The fresh water for human activities in this area is mainly taken from groundwater, which is generally scarce. Most of the soil is yellow-brown clay with poor water-holding capacity, so precipitation infiltrates with difficulty and surface runoff is fast. In addition, there are few ponds, dams, and reservoirs in this area, and their water storage capacity is poor. A large area of red sandstone lies beneath the soil layer (Fig. 2). The inflow of most wells in this area is less than 1 m³/h. In the rainy season, most of the precipitation flows into the surface water system and leaves the area along the Chihe River into the Huaihe River. Therefore, regional groundwater resources are very limited, and the exploitation of groundwater is difficult and costly.
Groundwater in the study area is mainly stored in the pores and fractures of bedrock. Pore-fracture aquifers in loose rock are widely distributed in this area and mainly contain phreatic water and weakly confined water. The lithology is mainly silty-fine sand of uneven thickness. The water richness of the aquifer is unevenly distributed in time and space. The mean inflow of a single well in this area varies

Database
With the help of various methods and techniques, groundwater potential maps with high reliability and accuracy can be built (Moghaddam et al. 2015). In the current study, spatial data and materials were prepared, including a geology map and a hydrogeology map. A digital elevation model (DEM) with a spatial resolution of 30×30 m was used to extract a set of influence factors in the study area. All influence factors were processed in ArcGIS 10.2. In order to evaluate the regional groundwater potential, the whole area was divided into 400,501 grid cells with a size of 100×100 m, based on the prediction accuracy of the models and the geological conditions of the area.
In terms of groundwater data, a total of 245 wells were identified from a field survey with a handheld GPS and from historical hydrogeological materials. For this analysis, these wells were randomly divided into two groups: 172 wells (70%) were used as the training dataset and 73 wells (30%) for validation (Fig. 1).
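The random 70/30 partition of the well inventory can be illustrated with a minimal Python sketch. The function name `split_wells`, the seed value and the use of upward rounding are assumptions made here for illustration (upward rounding is chosen only because it reproduces the reported 172/73 split); the study itself does not state how the random division was implemented.

```python
import math
import random


def split_wells(well_ids, train_fraction=0.7, seed=0):
    """Randomly split well IDs into a training and a validation subset."""
    ids = list(well_ids)
    random.Random(seed).shuffle(ids)  # seed value is an arbitrary assumption
    # Rounding up is an assumption chosen so that 245 wells give 172/73.
    n_train = math.ceil(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# 245 wells, identified here only by a hypothetical running index.
train_ids, valid_ids = split_wells(range(245))
print(len(train_ids), len(valid_ids))
```

Fixing the seed keeps the partition reproducible, so that the three models can be compared on the same validation wells.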

Selection And Analysis Of Influence Factors
The presence of groundwater is closely related to various environmental geological factors (Cantonati et al. 2016). By quantifying environmental geological parameters, the groundwater potential of the area can be analyzed: a functional relationship between the factors and groundwater is assumed, and the presence or absence of groundwater is then evaluated using machine learning methods. There are no fixed guidelines for selecting groundwater potential influence factors (Oh et al. 2011).
Based on the results of the field geological survey, this study analyzed the relevant geological materials and previous literature. Thirteen influence factors were selected to predict the spatial distribution of groundwater: elevation, slope, aspect, plan curvature, profile curvature, topographic wetness index (TWI), drainage density, distance to rivers, distance to faults, lithology, soil type, land use, and normalized difference vegetation index (NDVI) (Fig. 3a-m). Elevation is often used as an important factor in locating groundwater (Wang et al. 2015). It was extracted from the DEM to show the undulations of the terrain. This study divides elevation into 8 categories according to an equal-interval classification scheme: <20, 20-40, 40-60, 60-80, 80-100, 100-120, 120-140, and >140 m.
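The reclassification of a continuous factor into ordered classes can be sketched as follows, using the elevation breaks listed above. This is only an illustration: the helper name `classify` and the convention that a value exactly on a break falls into the upper class are assumptions, since the text does not say how boundary values were assigned in ArcGIS.

```python
from bisect import bisect_right

# Upper class boundaries in metres, taken from the elevation classes in the text.
ELEVATION_BREAKS = [20, 40, 60, 80, 100, 120, 140]


def classify(value, breaks):
    """Return a 1-based class index for `value` given ascending class breaks.

    Assumption: a value equal to a break is assigned to the upper class.
    """
    return bisect_right(breaks, value) + 1


print(classify(11, ELEVATION_BREAKS))   # lowest elevation reported for the basin
print(classify(383, ELEVATION_BREAKS))  # highest elevation reported for the basin
```

The same helper applies to the other factors with numeric classes (slope, curvature, TWI, drainage density and the distance factors) by swapping in their break lists.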
Slope, derived from slope units of the terrain, was adopted because it directly controls groundwater flow. The slope units were extracted from the 30 m resolution DEM as the basic data, and the hydrological analysis module in ArcGIS 10.2 was used to extract the regional slope. The slope values were divided into 7 categories according to the natural breaks method: <0.5°, 0.5-1°, 1-1.5°, 1.5-2°, 2-5°, 5-10°, and >10°. To a certain extent, the slope can indicate the direction of groundwater flow (Naghibi et al. 2016). The slope aspect is the direction in which a slope faces, which controls the flow of precipitation, wind direction and plant photosynthesis (Zabihi et al. 2016). Compared with shady slopes, sunny slopes have longer sunshine duration, and their surfaces undergo stronger weathering and evaporation. The aspect was extracted from the DEM and divided into 9 categories by direction: Flat, North, Northeast, East, Southeast, South, Southwest, West, and Northwest.
Plan curvature is the change rate of slope at any point on the surface, measured along the curve formed by the intersection of a horizontal plane and the surface (Arabameri et al. 2020). This morphological feature affects the convergence and divergence of surface runoff and reflects the degree of contour curvature. After the aspect had been extracted in ArcGIS 10.2, the slope of the aspect surface was calculated and shown as the plan curvature map. The plan curvature was divided into 7 categories using an equal-interval classification scheme: <20, 20-30, 30-40, 40-50, 50-60, 60-70, and >70. Profile curvature represents the change rate of the surface slope at any given point. Like the plan curvature, the profile curvature map was generated by calculating the slope of the DEM twice in succession, and the values were reclassified into 7 categories: <0.4, 0.4-0.8, 0.8-1.2, 1.2-2, 2-5, 5-10, and >10.
The topographic wetness index (TWI) is a physical indicator of the influence of regional topography on groundwater flow direction and convergence (Moore et al. 1991). This index is a function of the slope and the upstream contributing area. It is defined as in Eq. (1):
$$\mathrm{TWI} = \ln\left(\frac{\alpha}{\tan\beta}\right) \tag{1}$$
where α is the upstream area, and β is the slope of each point. According to the TWI values, five classes were created: <6, 6-8, 8-10, 10-12, and >12.
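A minimal sketch of the TWI calculation and its five-class binning is given below. It assumes the commonly used form TWI = ln(α / tan β) from Moore et al. (1991), with the slope supplied in degrees; the function names, the example cell values (α = 1000, β = 2°) and the treatment of flat cells are illustrative assumptions, not values or procedures from the study.

```python
import math
from bisect import bisect_right

TWI_BREAKS = [6, 8, 10, 12]  # class boundaries listed in the text


def twi(upstream_area, slope_deg):
    """Topographic wetness index, assuming TWI = ln(alpha / tan(beta))."""
    tan_beta = math.tan(math.radians(slope_deg))
    if tan_beta <= 0:
        # Flat cells make the ratio undefined; how they were handled is not stated.
        raise ValueError("slope must be positive")
    return math.log(upstream_area / tan_beta)


def twi_class(value):
    """1-based TWI class index; values on a break go to the upper class (assumption)."""
    return bisect_right(TWI_BREAKS, value) + 1


value = twi(1000.0, 2.0)  # hypothetical cell
print(round(value, 2), twi_class(value))
```

Gentle slopes and large upstream areas both raise the index, which is why low-relief cells near the river network tend to fall into the upper TWI classes.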
Drainage density is the total length of rivers per unit area. It is closely related to precipitation, elevation difference, and the moisture retention capacity of the soil. The drainage density in this area is binned into four classes: <0.15, 0.15-0.3, 0.3-0.45, and >0.45 km/km². Distance to rivers is a key factor affecting groundwater potential, because rivers are an important source of groundwater recharge. This area is mainly covered by the Chihe River system, so the distance between wells and rivers has an important influence on this research. Based on the hydrological conditions of the area, the buffer zones along the river system were divided into 5 classes: <100, 100-200, 200-300, 300-400, and >400 m.
Faults can control the flow and storage of regional groundwater. Regional faults were extracted from the geological map, and distance to faults was reclassified into five groups: <500, 500-1000, 1000-1500, 1500-2000, and >2000 m. The lithology of an aquifer is the basis of groundwater flow and storage, and it determines the porosity and permeability of the aquifer (Ayazi et al. 2010). Lithology categories were extracted from the regional geological map and divided into fourteen types, as shown in Table 1. Soil type has a vital influence on the infiltration of surface water and the recharge of groundwater, and to a certain extent determines the distribution of groundwater potential (Razandi et al. 2015). This factor was roughly divided into three groups according to the regional geological map: rock outcrops, sandy soil, and cohesive soil. Different land-use types affect both the quantity and the quality of groundwater resources. The relationship between human activities and natural systems was revealed and the distribution of groundwater potential was reflected (Chen et al.

Multicollinearity Analysis Of Factors
A multicollinearity analysis was performed to select factors that have a significant relationship with groundwater distribution. Multicollinearity refers to a certain degree of linear correlation between independent factors, which affects the contribution of the factors to the model (Pourtaghi and Pourghasemi 2014). If two factors are collinear, it is difficult to distinguish the effect of each factor on the results, and the regression model lacks stability. Two statistical parameters were used to detect multicollinearity among the factors, namely tolerance (TOL) and variance inflation factor (VIF). Values of TOL>0.1 or VIF<10 suggest independence between the factors.
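The two screening statistics are linked by the standard definitions TOL = 1 − R² and VIF = 1/TOL, where R² is obtained by regressing one factor on all the others. The sketch below only illustrates this relationship and the threshold check; the function names are hypothetical, the study computed the statistics in SPSS, and the R² value of 0.722 is an assumed input chosen because it reproduces the worst-case pair reported in the results (TOL 0.278, VIF 3.597).

```python
def tolerance(r_squared):
    """TOL = 1 - R^2, with R^2 from regressing one factor on all other factors."""
    return 1.0 - r_squared


def vif(r_squared):
    """Variance inflation factor, VIF = 1 / TOL."""
    return 1.0 / tolerance(r_squared)


def is_independent(r_squared, tol_min=0.1, vif_max=10.0):
    """Apply the screening thresholds quoted in the text."""
    return tolerance(r_squared) > tol_min and vif(r_squared) < vif_max


r2 = 0.722  # hypothetical R^2 for the most collinear factor
print(round(tolerance(r2), 3), round(vif(r2), 3), is_independent(r2))
```

Because VIF is the reciprocal of TOL, the two thresholds (0.1 and 10) describe the same cut-off, so either statistic alone is sufficient for the screening.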

Description Of Models
Logistic regression (LR)
In the logistic regression (LR) model, a nonlinear dynamic response relationship between a dependent variable and several corresponding independent variables is established by training and testing on known samples, and the model then predicts or evaluates the probability of an event in unknown samples (Lombardo et al. 2018). When evaluating groundwater potential, each influencing factor was taken as an independent variable, and the presence or absence of groundwater was taken as the dependent variable. In this study, P is the probability of the presence of groundwater, with a range of [0, 1]; 1-P is the probability of the absence of groundwater; and P/(1-P) is the odds, of which the natural logarithm is usually taken. The LR model is expressed as follows:
$$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n \tag{2}$$
where $X_1, X_2, \ldots, X_n$ are the independent factors and $\beta_0, \beta_1, \ldots, \beta_n$ are the regression coefficients.
The probability P of groundwater potential can be obtained from Eq. 3:
$$P = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n}} \tag{3}$$
To calculate the groundwater potential index of the Chihe River basin with the LR model, 245 known well locations (with potential index 1) and 245 randomly selected non-well location units (with potential index 0) were randomly divided into two parts, of which 70% was used for model training and the rest for model validation. The groundwater potential index of each grid unit calculated by the LR model ranges from 0.001 to 0.999; the higher the value, the greater the groundwater potential. The data processing for the LR model was completed in SPSS 22.0, and the P values were imported into ArcGIS 10.2 to generate the groundwater potential map.
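How the probability follows from the linear predictor can be shown with a short Python sketch. The function name and the coefficient and factor values below are purely hypothetical (the fitted coefficients are reported in Table 3, which is not reproduced here); the sketch only demonstrates that the logistic transform of the linear predictor returns a probability whose log-odds equal that predictor.

```python
import math


def groundwater_probability(intercept, coefficients, factors):
    """P = e^z / (1 + e^z) with z = b0 + sum(b_i * x_i)."""
    if len(coefficients) != len(factors):
        raise ValueError("one coefficient is needed per factor")
    z = intercept + sum(b * x for b, x in zip(coefficients, factors))
    # Numerically stable evaluation of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical coefficients and factor values for a single grid cell.
p = groundwater_probability(0.5, [0.8, -0.3], [1.0, 2.0])
print(round(p, 3), round(math.log(p / (1.0 - p)), 3))  # probability and its log-odds
```

Applying this function to every grid cell yields the potential index map, which can then be reclassified into potential classes.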

Deep neural networks (DNN)
In deep neural networks, sample data are processed through multiple layers, and the initial low-level feature layer is gradually converted into a more abstract high-level feature layer. The distributed characteristics of regional data were mined, which was more conducive to the visualization of
ReLU function:
$$f(x) = \max(0, x) \tag{4}$$
Sigmoid function:
$$f(x) = \frac{1}{1 + e^{-x}} \tag{5}$$
In model training, the loss function is the binary cross-entropy, and the Adam method is used to find the optimum. The initial learning rate is set to 0.005, and the number of training iterations is 2500.
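The activation functions and the loss can be written out in a few lines of Python. This is a sketch under two assumptions: the ReLU function takes its standard form max(0, x), and the loss is the binary cross-entropy averaged over samples. The clipping constant `eps` is an implementation detail added here to avoid log(0) and is not taken from the study.

```python
import math


def relu(x):
    """Rectified linear unit, assumed standard form max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x), evaluated in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -[y ln p + (1 - y) ln(1 - p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(y_true)


print(relu(-2.0), relu(3.0), sigmoid(0.0))
print(round(binary_cross_entropy([1, 0], [0.9, 0.2]), 4))
```

ReLU is typically used in the hidden layers, while the sigmoid output maps the final activation to a value between 0 and 1 that can be read as a groundwater potential index.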
The DNN was iteratively optimized using the training samples (70%) and the validation samples (30%). The prediction data were then fed into the trained model to obtain predicted values, and these values, together with their location information, were imported into ArcGIS to generate the groundwater potential map.

Random forest (RF)
Random forest (RF) is the most representative algorithm based on bagging ensemble learning (Norouzi et al. 2019). It uses random sampling to train multiple decision trees and finally makes predictions through a majority voting mechanism. Several samples are drawn from the original dataset with replacement, a decision tree is trained on each drawn sample, and the trees are then combined (Naghibi et al. 2016). After voting, the final classification result is the class with the most votes.
The main steps of the RF model are as follows. First, the original sample set is sampled many times with replacement, each draw forming a training dataset, and a decision tree is generated for each training dataset. If the samples have X features in total, n features are randomly selected from the X features as the candidate split features at each internal node of a decision tree. The node is then split using the optimal split among these candidate features. The classification and regression tree (CART) algorithm is used to generate the decision trees. Finally, all decision trees are combined into a random forest model to classify and predict unknown data: each tree votes, and the class with the most votes is the classification result.
This study uses the RF package in R 4.1.0 to fit the model. Each groundwater potential influence factor is taken as an auxiliary variable, and the final model variables are screened using the out-of-bag (OOB) error: variables are eliminated one by one and the change in the OOB error is observed. If the error increases, the variable is retained; otherwise the variable is eliminated.
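The sampling and voting steps of the random forest procedure can be sketched in Python without fitting any trees. All function names here are hypothetical, tree construction with CART is deliberately omitted, and the example sizes (172 training wells, 13 factors, 3 candidate features per split) are taken from figures quoted elsewhere in the paper; the study itself fitted the model in R.

```python
import random
from collections import Counter


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement (one bagging round)."""
    return [rng.choice(samples) for _ in samples]


def random_feature_subset(n_features, n_selected, rng):
    """Pick the candidate split features for one node, without replacement."""
    return sorted(rng.sample(range(n_features), n_selected))


def majority_vote(votes):
    """Return the class with the most votes; ties go to the class seen first."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)  # arbitrary seed, for reproducibility only
bag = bootstrap_sample(list(range(172)), rng)
features = random_feature_subset(13, 3, rng)
print(len(bag), len(set(bag)), features)
print(majority_vote([1, 0, 1, 1, 0]))
```

Samples that never appear in a given bag are the out-of-bag samples for that tree, which is what makes an internal error estimate possible without a separate hold-out set.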

Model Comparison
The accuracy of model evaluation is controlled by many factors. This study evaluates the prediction accuracy of the three models by calculating the mean absolute error (MAE), root mean square error (RMSE) and correlation coefficient (R) between the measured and predicted values of the validation data (Singha et al. 2020). RMSE measures the spread of the errors, and its equation is shown in Eq. 6. The smaller the RMSE, the higher the prediction accuracy of the model. MAE is the average of the absolute errors, which better reflects the actual error of the predicted values, as shown in Eq. 7. The smaller the MAE, the more accurate the prediction of the model. R is a statistical indicator that reflects the degree of correlation between variables, and it is calculated as in Eq. 8. The closer R is to 1, the stronger the correlation between the two variables.

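$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2} \tag{6}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - O_i\right| \tag{7}$$
$$R = \frac{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(P_i - \bar{P}\right)}{\sqrt{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2}} \tag{8}$$
where $O_i$ and $P_i$ are the measured and predicted values of the i-th validation sample, $\bar{O}$ and $\bar{P}$ are their means, and n is the number of validation samples.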
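The three evaluation indexes can be computed with a few lines of Python. This sketch assumes the standard definitions of RMSE, MAE and the Pearson correlation coefficient; the measured and predicted values in the example are invented for illustration and are not results from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error between measured and predicted values."""
    n = len(observed)
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / n


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_o * var_p)


obs = [1, 0, 1, 0]            # hypothetical presence/absence labels
pred = [0.9, 0.2, 0.8, 0.1]   # hypothetical predicted potential indexes
print(round(rmse(obs, pred), 4), round(mae(obs, pred), 4), round(pearson_r(obs, pred), 4))
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of how the errors are distributed.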
The thirteen influence factors in this study were checked for multicollinearity in SPSS 22.0. The highest VIF value is 3.597 and the lowest TOL value is 0.278 (Table 2), which indicates that these factors are independent of each other. The multicollinearity results are shown in Figure 4.

Table 3. The larger the regression coefficient and OR value of an input variable, the greater its influence on groundwater potential. Grid data from the area were put into Eq. 2 and Eq. 3, and the distribution of the groundwater potential index of the Chihe River basin was then calculated. As shown in Fig. 5 and Table 4, the groundwater potential map was reclassified into five categories according to the natural breaks method: very high (29.5%), high (30.2%), moderate (21.44%), low (15.92%), and very low (2.94%).

The DNN model learns from the training data and validation data. After many tests, adjustments and optimizations, a four-layer perceptron model (13-6-6-1) was built, and the error between the predicted values and the validation values was less than 0.15, indicating relatively accurate prediction by this model. When the number of iterations exceeds 200, the training result tends to be stable. The prediction data were then imported into the DNN model to obtain the groundwater potential index. Finally, the predicted index was imported into ArcGIS for mapping (Fig. 6), and the results were divided into 5 categories according to the natural breaks method: very high (36.82%), high (38.25%), moderate (15.29%), low (5.22%), and very low (4.42%) (Table 4).

Random forest (RF)
The trained RF model confirmed the predictive performance for regional groundwater potential. Seventy percent of the data was used to train the RF model, and the remaining thirty percent to verify it. Through experiments, the number of decision trees (ntree) and the number of variables tried at each node (mtry) were set to 500 and 3, respectively (Hengl et al. 2015). Finally, the optimized RF model was used to calculate the predicted potential from the raster data, and the result is shown in Fig. 7 and Table 4. This map was classified into five categories using the natural breaks method: very high (42.15%), high (16.95%), moderate (28.45%), low (5.92%), and very low (6.53%).
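The class shares reported for the three maps can be cross-checked with simple arithmetic. The sketch below uses only the percentages quoted in the text; the dictionary layout and function name are illustrative. It also shows that the "moderate to high" share quoted in the abstract and conclusions appears to be the sum of the moderate, high and very high classes, which is an interpretation inferred from the numbers rather than something stated explicitly.

```python
# Area shares (%) of each potential class, as quoted in the text for Table 4.
SHARES = {
    "LR": {"very high": 29.5, "high": 30.2, "moderate": 21.44, "low": 15.92, "very low": 2.94},
    "DNN": {"very high": 36.82, "high": 38.25, "moderate": 15.29, "low": 5.22, "very low": 4.42},
    "RF": {"very high": 42.15, "high": 16.95, "moderate": 28.45, "low": 5.92, "very low": 6.53},
}


def moderate_or_better(shares):
    """Combined share of the moderate, high and very high classes."""
    return round(shares["very high"] + shares["high"] + shares["moderate"], 2)


for model, shares in SHARES.items():
    total = round(sum(shares.values()), 2)
    print(model, moderate_or_better(shares), total)
```

The combined shares come out at 81.14%, 90.36% and 87.55% for the LR, DNN and RF models, and each set of five classes sums to 100%.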

Prediction Accuracy Evaluation
In this study, the RMSE, MAE and R evaluation indexes were calculated using the validation dataset. As shown in Table 5, the LR model has the smallest RMSE and the largest R, while its MAE is slightly higher than that of the RF model. This shows that the LR model has significantly higher prediction accuracy than the DNN and RF models. The RMSE of the DNN model is lower than that of the RF model, and its R is significantly higher, which shows that the DNN model has significantly higher prediction accuracy than the RF model. Overall, the evaluation indexes indicate good prediction accuracy for all three models (Kayhomayoon et al. 2021). However, the LR model performs better in the evaluation and prediction of regional groundwater potential.

Discussion
Studying groundwater potential is very important for the development, utilization and protection of water resources in a region (Díaz-Alcaide and Martínez-Santos 2019). Over the years, different methods for water resources research have been developed, which can provide effective guidance. However, many methods have low accuracy in prediction and evaluation, and it is difficult for them to give detailed and systematic results. Recently, with the deepening of interdisciplinary research, various methods have gradually been applied to the field of hydrogeology. With the rise of big data, the potential of machine learning methods in regional groundwater spatial prediction is gradually being revealed. The development of information science and data science has made it possible to obtain accurate hydrogeological information. Many machine learning models that work with multi-source environmental variables have been widely used to reveal the spatial variability of groundwater and for spatial mapping (Majumdar et al. 2020).
Through model analysis, training and validation, the relationship between the environmental influence factors and groundwater potential was extracted. The LR, DNN and RF models were then selected for the regional spatial prediction of groundwater. Compared with traditional interpolation analysis, this process can systematically obtain the probability of groundwater presence, which effectively improves the prediction accuracy and provides more accurate guidance for the management of regional groundwater resources.
In this study, the prediction accuracy of the LR model is better than that of the other two models (DNN and RF). The nonlinear relationship between groundwater and the environmental variables was well captured by the LR model. In this model, the independent variables can be discrete or continuous, and the result is expressed by a regression formula. In the LR model, the regression coefficients and OR values of TWI, drainage density and soil type were higher, indicating that these factors have a great influence on regional groundwater. Table 3 shows that drainage density was the dominant factor affecting groundwater potential. Elevation, slope and distance to rivers were negatively correlated with groundwater potential, which is probably related to the low topographic relief.
The DNN and RF models also give accurate predictions of groundwater potential in this area, consistent with the prediction of the LR model. Of the two, the RF model can capture the nonlinear relationship between groundwater potential and the environmental variables well. It can also simulate high-order interactions between variables. Norouzi and Shahmohammadi-Kalalagh (2019) used an RF model to accurately locate artificial recharge sources of groundwater in the Shabestar region, Iran.
Arabameri et al. (2020) used an integrated machine learning method to map the groundwater potential of the Bastam watershed, Iran. RF modeling revealed that LU/LC, lithology, and elevation were the most important factors for predicting groundwater potential and production. Integrated machine learning methods can provide an accurate and effective reference for groundwater potential mapping.
Table 4 shows slight differences among the regional groundwater potential predicted by the three models, but overall the area has high groundwater potential. As shown in Fig. 5 to Fig. 7, the highest groundwater potential in this region is mainly distributed around the Chihe River. Groundwater potential in the hilly areas on the east and west sides is low, while the central plain is mainly composed of sandy rocks with relatively higher potential. The distribution of groundwater resources in this area is basically controlled by these factors. Whereas each factor provides partial information on the regional hydrological conditions, the combination of environmental factors gives a comprehensive understanding of the distribution and hydrogeology of this complex system. Geological factors, such as distance to faults and lithology, have less effect on groundwater. Land use is closely related to human activities, so this factor also has a certain impact on groundwater potential.

Conclusions
Taking the Chihe River basin as the study area, this work not only applied GIS-based machine learning models to identify areas of groundwater potential, but also compared and discussed the applicability and accuracy of these models.
The relationships among the geological environment, human activities and topographic features of the Chihe River basin were analyzed, and the patterns of groundwater occurrence were summarized. Thirteen independent environmental variables were selected for the groundwater potential analysis: elevation, slope, aspect, plan curvature, profile curvature, TWI, drainage density, distance to rivers, distance to faults, lithology, soil type, land use and NDVI. This study applied the LR, DNN, and RF models to evaluate and partition areas of groundwater potential. The applicability of the three models was evaluated with the MAE, RMSE and R indicators. On the whole, the three models show good accuracy and can objectively evaluate the groundwater potential of the Chihe River basin.
Groundwater potential is divided into five categories according to the natural breaks method. The results show that the area is dominated by moderate to high groundwater potential, which accounts for 81.14%, 90.36% and 87.55% of the area in the LR, DNN, and RF models, respectively. Overall, the regional groundwater has good potential.
The results of this study show that combining machine learning models with hydrological databases, DEM data and thematic maps can better avoid the subjectivity of expert judgment. The relationship between environmental factors and groundwater is analyzed with different models, and a more reliable regional groundwater potential map is generated. Compared with traditional methods, this combination of technologies provides some new ideas and can serve as an important reference for regional hydrogeological surveys.

Histogram of multicollinearity results of regional influence factors
Groundwater potential distribution established by the LR model