Evaluation of machine learning algorithms for groundwater quality modeling

Groundwater quality is typically measured through water sampling and lab analysis. The field-based measurements are costly and time-consuming when applied over a large domain. In this study, we developed a machine learning-based framework to map groundwater quality in an unconfined aquifer in the north of Iran. Groundwater samples were provided from 248 monitoring wells across the region. The groundwater quality index (GWQI) in each well was measured and classified into four classes: very poor, poor, good, and excellent, according to their cut-off values. Factors affecting groundwater quality, including distance to industrial centers, distance to residential areas, population density, aquifer transmissivity, precipitation, evaporation, geology, and elevation, were identified and prepared in the GIS environment. Six machine learning classifiers, including extreme gradient boosting (XGB), random forest (RF), support vector machine (SVM), artificial neural networks (ANN), k-nearest neighbor (KNN), and Gaussian classifier model (GCM), were used to establish relationships between GWQI and its controlling factors. The algorithms were evaluated using the receiver operating characteristic curve (ROC) and statistical efficiencies (overall accuracy, precision, recall, and F-1 score). Accuracy assessment showed that ML algorithms provided high accuracy in predicting groundwater quality. However, RF was selected as the optimum model given its higher accuracy (overall accuracy, precision, and recall = 0.92; ROC = 0.95). The trained RF model was used to map GWQI classes across the entire region. Results showed that the poor GWQI class is dominant in the study area (covering 66% of the study area), followed by good (19% of the area), very poor (14% of the area), and excellent (< 1% of the area) classes. An area of very poor GWQI was observed in the north. Feature analysis indicated that the distance to industrial locations is the main factor affecting groundwater quality in the region. The study provides a cost-effective methodology in groundwater quality modeling that can be duplicated in other regions with similar hydrological and geological settings.


Introduction
Groundwater is the primary water source in arid and semiarid regions of the world and is used for agricultural, domestic, and industrial purposes (Li et al. 2017). Securing the safety of groundwater resources is a crucial driver of public health policies. Unfortunately, industrial development, agricultural practices, rapid population growth, and climate change have exerted extra pressure on groundwater resources in many parts of the world (Burri et al. 2019;Mohammadi et al. 2022). Growing pressure on groundwater resources resulted in various environmental issues, including water scarcity, water contamination, land subsidence, and groundwater salinization (Sahour et al. 2020a;Gill et al. 2017). It was reported that almost half of the world's population is exposed to water pollutants endangering human and wildlife health. Also, nearly 80% of the world's wastewater is released into the environment without sufficient treatment (U.N 2018). Once an aquifer system becomes contaminated, it is not easy to restore it to its original state (Shrestha et al. 2016). These contaminants can originate from both anthropogenic and natural sources. Major groundwater contaminants include heavy metals, hydrocarbons, organic compounds, agricultural fertilizers, nanoparticles, and microplastics that threaten public health, terrestrial ecosystems, and sustainable development (Li et al. 2021).
Heavy metal and metalloid contamination of water can occur naturally due to the interaction between water and rocks (Garbarino et al. 1995). Some of these elements are essential micronutrients for living systems, and others are extremely toxic (Hashim et al. 2011). For instance, As 3 , Cd 2+ , Pb 2+ , and Hg 2+ react with the cell's biomolecules to form highly stable toxic compounds that are difficult to disassociate (Duruibe et al. 2007). Compared to other water contaminants such as petroleum and agricultural byproducts, heavy metals are more persistent and thus can transfer from one environment to another (Nour et al. 2022a and b). In addition to aquifer formation and climate as natural groundwater quality determinants, anthropogenic factors such as agricultural activity and industrial and urban discharge can negatively impact groundwater quality (Li et al. 2018;Mukate et al. 2020). Agricultural activity can introduce nitrate (NO3 − ) to the aquifer system due to nitrogen fertilization (Jiang et al. 2020;Paredes et al. 2020;Torres-Martínez et al. 2021). Waste disposal, septic tank leakages, dysfunctional sewage systems, urban flash floods, and industrial discharges degrade groundwater quality by releasing chemical compounds and heavy metal pollutants into aquifer systems (Ebrahimzadeh et al. 2021;Karunanidhi et al. 2021).
Monitoring water quality is a crucial management tool to ensure the security of groundwater resources. One possible way to measure groundwater quality is to develop comprehensive indices that can consider various chemical and physical water parameters. Water quality indices (WQI)s are simple mathematical tools that can assess the overall status of groundwater quality (Badeenezhad et al. 2020;Saeedi et al. 2010;Sutadian et al. 2016). Mapping the spatial distribution of WQI indices provides easy-to-understand tools that assist decision-makers and local authorities with proper management and ensure public awareness (Rawat and Singh 2018;Alexakis 2021). Many researchers in previous studies have used WQIs to assess the quality of surface water and groundwater (Mohebbi et al. 2013;Abbasnia et al. 2018;Jha et al. 2020). One of the most widely used quality assessment methods is the World Health Organization (WHO) standards for drinking water quality (WHO 2004). This method has been further developed by other researchers and integrated with GIS capabilities to map drinking water quality spatial distribution (Babiker et al. 2007;Machiwal et al. 2011;El-Fadel et al. 2014).
WQIs are easy to calculate and provide a single criterion for monitoring drinking water quality. However, extensive fieldwork, water sampling, and lab analysis make monitoring a large area extremely costly and time-consuming. This is even more challenging in developing countries with limited infrastructure and financial constraints. In order to map the groundwater quality over an extensive area (e.g., on regional scales), various GIS-based approaches have been implemented (Jha et al. 2007;Esquivel et al. 2015;Machiwal et al. 2018). The principle behind the GIS-based groundwater quality mapping is to employ interpolation methods to estimate the groundwater quality of unknown areas located within the range of a discrete set of measured data points. The interpolation techniques are easy to implement but not precise because the error rate is proportional to the squared distance between the measured points (Ghezelbash et al. 2019, Sahour et al. 2020b). Thus, it may disregard the local variation of the target variable.
Machine learning (ML) is a branch of artificial intelligence (AI) focusing on predictive algorithms. It relies on the concept that systems can learn from data and gradually improve accuracy with minimum human intervention (Jordan and Mitchel 2015). ML techniques can predict the target of interest by establishing a linear or nonlinear relationship between the target and a set of predictors (Singh et al. 2016). Therefore, groundwater quality can be estimated by its controlling factors (e.g., topography, lithology, hydroclimate variables, and distance to pollution sources) and training an ML algorithm (Haghiabi et al. 2018;Ahmed et al. 2019;Sahour et al. 2021c).
Previous studies mainly focused on surface water quality assessments using WQIs (Ahmed et al. 2019;Elsayed et al. 2020). For example,)Bui et al. 2020) employed multiple ML algorithms to predict water quality in the Talar river in Iran. They achieved high performance on a hybrid of bagging and random tree algorithms. Haghibi et al. (2018( investigated the performance of the artificial neural network (ANN) support vector machine (SVM) and group method of data handling (GMDH) in predicting the water quality of the Tires river in southwest Iran. They concluded that all three models provided acceptable accuracy; however, the SVM model yielded the most favorable result. Those studies provided helpful information regarding the application of ML in water quality prediction; however, additional studies are necessary to quantify groundwater quality, especially in arid and semiarid countries that rely heavily on groundwater resources. Successful studies for groundwater quality prediction mainly focused on improving the accuracy of ML algorithms and less on the importance of the factors controlling groundwater quality (Agrawal et al. 2021;Singha et al. 2021). Identifying the most critical parameters affecting groundwater quality can help decision makers adopt strategies focusing on the high-priority parameters.
This study aimed to provide three major contributions to the literature. First, to apply a cost-effective methodology to predict the spatial distribution of groundwater quality on a regional scale in the north of Iran, one of the country's most populated regions. Second, to train, evaluate, and conduct a comparative analysis on six ML algorithms for modeling groundwater quality. Third, use ML to identify and prioritize the parameters affecting the groundwater quality across the study area.

Study area
The study area is a coastal plain between the southern edge of the Caspian Sea and the Alborz Mountain range in the Mazandaran province of Iran. The plain extends from 50° 34′ E to 54° 10′ E and 35° 47′ N to 37° N (Fig. 1). The study area is approximately 10,000 km 2 . The area's climate is humid, and precipitation varies from 600 to 1400 mm. The region is one of the country's most populated areas and a producer of agricultural and industrial products. The region's main land use and land cover are rice farms, citrus gardens, urban areas, rangelands, and forests (Dadashpoor and Salarian 2020). The area is known for its rich ecosystems and wildlife-protected regions . The coastal plain of Mazandaran consists mainly of alluvial sediments deposited during the Quaternary period. The aquifer system consists of a shallow unconfined coastal aquifer in the northern parts. A deep confined aquifer with highpressure groundwater underlies that shallow aquifer. The thickness of the shallow aquifer varies from 25 to 50 m. The depth of the water table ranges from 1 to 40 m. Transmissivity of the aquifer varies between 75 and 10,000 m 2 per day. Toward the Alborz mountains in the south, the karst aquifer created from the dissolution of limestone and dolomite exists; however, our study does not focus on those areas. The study area heavily relies on its groundwater resources for the residential sector and its extensive rice farms making it crucial to assess groundwater quality.

Materials and methods
Modeling groundwater quality in this study involved six steps. Water quality data were collected from 248 wells across the region, and groundwater quality index (GWQI) parameters for each well were calculated (step 1). Variables affecting the quality of groundwater in the area were identified, and the dataset (GWQI and its associated variables) was randomly divided into training (70% of the total data or 174 wells) and testing (30% of the total data 74 wells) subsets (step 2). The relationship between GWQI and the GWQI's controlling variables was developed through the optimization of six ML models for classification, namely support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), artificial neural network (ANN), K-nearest neighbors (KNN), and the gaussian classifier model (GCM) (step 3). The accuracy assessment was performed, and the optimum model for GWQI modeling was selected (step 4). The spatial distribution of GWQI across the entire region was mapped using the optimum ML algorithm (the model with the highest accuracy) (step 5). The variable importance analysis was conducted to identify and rank regional groundwater quality factors (step 6). The detail of the dataset and methodology is as follows.

Data collection
The dataset includes a target (independent variable) and multiple predictors (dependent variables). In our study, the GWQI is the target (dependent variable); and the variables affecting GWQI include elevation, lithology, aquifer transmissivity (T), distance to industrial centers (DTI), distance to residential area (DTR), population density, groundwater withdrawal (GWW), precipitation (P), and evaporation are independent variables. These variables were selected based on their relevance to water quality, as reported in previous studies.

Groundwater quality classes
Water samples were collected from 248 semi-deep wells (20-40 m) across the region in an unconfined aquifer and were analyzed in the lab. Water chemical parameters, including electrical conductivity (EC), nitrate (NO3. − ), K + , Na + , Ca +2 , Mg +2 , SO −2 4 , Cl − , pH , and TDS were measured. These parameters were measured by collecting groundwater samples from each well and laboratory analysis. The groundwater quality index (GWQI) for each well was calculated using WHO standard values and according to Eq. (1) where QW i is calculated by QW i = Q i × RW i , where Q i is the relative quality of each parameter and Rw i is the relative weight of each parameter. Q and Rw for each parameter are calculated using the following formulas: where C i is the parameter concentration in each sample, S i is the standard concentration of that parameter in drinking (3) Rwi = Wi ΣWi water according to WHO, W i is the weight of each parameter, and ∑W i is the sum of weights for n parameters. Each parameter contributes differently to water quality; therefore, different weights are assigned. The weight of each parameter in the calculation of GWQI is presented in Table 1. The next step was setting classes for each value of GWQI. The GWQI ranking was based on Yadav et al. (2010( and ranged from excellent quality to very poor  Table 2 shows the values for each groundwater quality class.

Groundwater quality predictors
The factors potentially affecting groundwater quality were identified and used as the inputs for the modeling process. These variables included elevation, lithology, aquifer transmissivity (T), distance to industrial centers (DTI), distance to residential area (DTR), population density, groundwater withdrawal (GWW), precipitation (P), and evaporation. Not all these variables were included in the modeling process. We performed a feature selection technique to identify relevant variables. Feature selection is the technique of selecting a subset of the most relevant input variables for a dataset. Relevant features that explain the target variable can allow ML algorithms to perform more efficiently concerning space and complexity. Irrelevant input variables can mislead some ML algorithms reducing their predictive performances (Darst et al. 2018). We used a recursive feature elimination (RFE) method to select the optimum independent variables. In the RFE method, variables are recursively eliminated, and the modeling is performed on those remaining variables. The technique uses the accuracy of the fit to select the best combination of the independent variables for the modeling process. For this purpose, first, an estimator is trained using all variables, and the importance of each variable is calculated (Dianati Tilaki et al. 2020). Then, the least important variables are pruned from the set of independent variables. This procedure is iterated recursively on the pruned set until the smallest subset of variables that provide the highest accuracy is identified. The optimum variables in this study were T, DTI, GWW, P, population, and evaporation.
The aquifer transmissivity (T) measures how much water can be transmitted horizontally through an aquifer's whole saturated thickness and unit width (Bear 2012). Pollution can transfer with groundwater flow from location to location. Therefore, transmissivity was selected as one of the potential variables for the modeling process. The Mazandaran Regional Water Company (MRWC) performed the pumping test to measure the aquifer's transmissivity at the monitoring wells' location. Transmissivity in the study area is lower in the northern parts close to the Caspian seashore. Transmissivity increases toward the south; however, the highest transmissivity was observed in the east and southwest of the study region. The transmissivity ranged between 75 and 3500 m 2 /day in the study area (Fig. 2).
Distance from industrial centers and distance from the residential area are two factors that affect groundwater quality by introducing pollutants to the aquifer systems. In the study area, industrial towns, manufacturers, small industries, livestock, and poultry farms significantly affect groundwater quality (Gholami et al. 2015). The industrial areas are distributed between the northern parts of the study area and the Alborz mountain in the south. The concentration of the manufacturers increases from west to east. The dataset was provided by Mazandaran Province Industries and Mines Organization (MPIMO) and further verified using high-resolution google earth imagery to create the georeferenced map in GIS (Fig. 2). The residential areas map was also provided by MPIMO using Landsat-8 satellite images. The concentration of the residential area increases from the north to the central parts of the study area. The distances between each point to villages and cities were calculated within the GIS environment. To obtain the distance from the residential areas, we created a network of grid points, and the distance from each point to the nearest residential area was calculated using the nearest function in ArcGIS. The same approach was used to obtain distance from the industrial centers. Therefore, a value was assigned to each point, representing the distance from the residential areas. The spatial distribution of population density was determined by interpolation of population statistics in GIS (Fig. 2). The higher the population density, the higher the probability of groundwater being exposed to the pollutant (Alkindi et al. 2022).
Excessive groundwater withdrawal (GWW) introduces atmospheric oxygen and high-organic surface water to the aquifer systems (Gholami and Booij 2022). Moreover, the pumped groundwater used for irrigation interacts with soil materials, infiltrates the groundwater system, and degrades the groundwater quality (Costantini et al. 2021). The agricultural lands in the study area were identified using Landsat-8 imagery. By applying the annual water consumption  of each type of agricultural land (citrus gardens, rice farms, etc.), the amount of water usage was mapped in GIS (Fig. 3). Climate variabilities, including precipitation and evaporation, can indirectly affect groundwater quality through the recharge and discharge of aquifers (Band et al. 2020;Alkindi et al. 2022). Precipitation is the region's primary source of groundwater recharge. High precipitation can introduce nitrate and organic materials into the aquifer system in agricultural areas. Precipitation in the study area increases from east to west and from north to south. Precipitation can also reduce the concentration of pollutants through dilution. Therefore, the role of precipitation can vary based on the hydrological settings and landscape characteristics of an area. Excessive evaporation can degrade the water quality in groundwater and surface water resources (Awasthi et al. 2005;Maliqi et al. 2020). Evaporation increases from west to east in the study area. The highest evaporation can be observed in the central and western parts of the region. Mazandaran regional water organization (MRWO) provided the climate data, including annual precipitation and evaporation, using data from meteorological stations. The maps were generated using The mean annual precipitation (mm). (E) Residential areas and water resources locations. (F) Transmissivity of aquifer formation (m 2 /day) the kriging technique within the GIS environment. Two items of elevation and geological units were selected for the modeling; however, they were not among the optimum factors chosen in the feature selection step. This is due to the invariability of these parameters within the study area.
Groundwater quality modeling As we mentioned earlier, the entire dataset, including GWQI, and its predictors, were divided into two subsets of train and test. The training set was used to develop ML algorithms, and the test set was used to validate the models. Preprocessing and data cleaning were performed to remove incorrect data. Data cleaning included removing the points outside the study area as well as duplicated wells. The dataset went through additional preprocessing, including normalization and oversampling. Normalization improves the accuracy of the machine learning algorithms, especially when the range of predictors is different. It also helps uniformity of the dataset in how the models utilize and treat the data. The practice of oversampling involves selecting observations so that some classes have a larger share of the dataset than they actually do in the population (Crone and Finlay 2012). The reason for performing this practice was to create a balanced dataset to ensure that the model had enough observations to learn. An imbalanced dataset can result in biased predictions. After data preprocessing, six classification algorithms were used to establish relationships between GWQI and its predictors. The selected algorithms are as follows.

XGBoost (XGB)
XGBoost (XGB) classification algorithm was used in this study to predict a GWQI class from the set of predictors. Gradient boosting refers to combining several weak models to create a collectively robust model by "boosting" or improving a single ineffective model (Johnson et al. 2017). The XGB enables boosted tree algorithms to be computed faster and more efficiently through scalable gradient-boosting implementations. Instead of building trees sequentially, XGB builds them in parallel. In every possible split in the training set, the quality of separations is evaluated using partial sums of gradient values. As a result of gradient boosting, the following model is set up to predict the classes with the least possible errors (Moisen et al. 2006). The parameters to be optimized in the XGB include the number of trees (number of boosting rounds), maximum depth (maximum depth of tree for the base learner), maximum leaves, and maximum bin (maximum number of bins for each variable). The optimum values are given in Table 3.

Random forest (RF)
As a supervised classification algorithm, the random forest is made up of many decision trees, which are assembled randomly. A committee-based prediction is more accurate than any individual tree because it uses bagging and includes randomization of trees. The decision tree is made based on different samples, and the majority vote of the  (Chen et al. 2011). Tree learners are trained by applying bootstrap aggregating or bagging to random forests.
Given a set of predictors X = x − 1,…, x − n and target Y = y − 1,…,y − n bagging is repeatedly used B time to select a random sample and fit trees to them. By averaging all the predictions from the training on x ′ , predictions can be made for unseen samples x ′ as follows (Friedman and Meulman 2003).
The hyperparameters of the RF models include the number of trees, min sample split (the minimum number of samples needed to divide an internal node), maximum depth (maximum depth of tree for the base learner), and maximum features (the number of features to consider for the best split). Table 3 shows the optimum hyperparameter values.

K-nearest neighbors (KNN)
The input in the KNN classifier is comprised of the k nearest training examples that assign objects to classes. Objects with k = 1 are given to their nearest neighbors' classes. The output of the KNN classifier is the average of k nearest neighbors. KNNs approximate functions locally and defer all computations until after the function has been evaluated. The KNN can make a more significant contribution to the average than the farther ones (Cherif 2018). Cross validation was used to determine the optimal K value, which is performed by holding out a portion of the training set to estimate the validation error rate. With N sample size and pi probability of every i sample, the nearest neighbor is found by the following: Ci represents the set of points that fall within the same class as the sample; i and pij represent the softmax over Euclidean distances within the embedded space: The hyperparameters include leaf size, number of neighbors, and weights (the values of weights used for prediction). The algorithm used to compute the nearest neighbor in our study is called KDtree. The KDtree algorithm reduces the required number of distance measurements by efficiently encoding aggregate distance information for the sample leaf size is passed to KDTree affects the speed of the process and memory required to store the constructed tree. The optimum hyperparameter values for our study area are presented in Table 3.

Support vector machine (SVM)
SVM is one of the most used algorithms for classification problems (Phan et al. 2017). SVM algorithms are used to classify points based on hyperplanes in N-dimensional spaces. There is a direct relationship between the number of features and the hyperplane's dimension. A hyperplane is just a line if there are two input features, and two features make the hyperplane a two-dimensional plane if there are three input features (Tan et al. 2007). In a best-case scenario, when the data are perfectly separable, the distance between hyper-planes and the nearest elements of the classes will be maximum. Generally, SVM attempts to approximate this condition as closely as possible. Nonlinear SVM replaces hyperplanes with other classes of manifolds, but the basic principle remains the same. Hyperplanes are replaced with different classes of manifolds in nonlinear SVM, but the principle remains the same. The hyperparameters of the SVM include regularization parameter (C), gamma (coefficients of the kernel), and kernel type. The strength of the regularization is inversely proportional to the hyperparameter C. The optimized hyperparameters are given in Table 3. Number of trees = 500, max depth = 2, max leaves = 0, max bin = 2, learning rate = 1 K-nearest neighbor Leaf size = 10, number of neighbors = 2, weights = distance Artificial neural network Activation = logistic, alpha = 0.001, solver = adam, learning rate = constant, max iteration = 300 Gaussian process classifier Kernel = RBF, number of restarts optimizer = 0, optimizer = fmin_l_bfgs_b

Artificial neural network (ANN)
Each ANN consists of three layers: input, hidden, and output. By establishing empirical relationships between input variables and corresponding targets, the ANN method establishes possible nonlinear relationships (Tu 1996). This is performed by a series of nodes that mimic the functions of neurons found in animals or humans; the neurons pass information among themselves, thereby enabling ANNs to learn. The ANN used in this study is known as a multilayer perceptron (MLP) and was trained with a backpropagation method (Durairaj and Revathi 2015). Backpropagation is performed by changing the weight values internally by approximating the nonlinear relationship between the predictors and the target. The process is divided into two stages feedforward and backpropagation. In a feedforward process, the output is generated by applying a pattern to the input layer and letting it propagate through layers. Afterward, an error signal is calculated for each output. The errors are transmitted backward from the output layer to each neuron in the hidden layer that contributed to the output layer since every neuron in the hidden layer contributed to the output errors. Layer by layer, this process is repeated until each neuron receives an error signal representing its relative contribution. Each neuron adjusts the values for each connection weight once the error signal for each neuron is calculated. In weight space, we seek to minimize the error function. The learning problem is then solved with weights with the least error functions (Huang et al. 2015). The ANN model in our study contained one hidden layer. Other hyperparameters include activation type, alpha, solver type, learning rate, and maximum iteration. The activation function defines the output of that neuron using a set of inputs. Alpha defines the regularization strength, also known as the penalty term. It is used to prevent overfitting by limiting the size of weights. The solver type is used to optimize the weights. The learning rate is used to schedule the weight updates. Maximum iteration determines the number of epochs (cycle of trainings). The optimum values and types of hyperparameters are presented in Table 3.

Gaussian classification model (GCM)
A Gaussian classification model (GCM) is an infinitedimensional extension of a multivariate Gaussian distribution. Data generated by a Gaussian process are distributed in a multivariate manner across a range. A set of n observations,y = y 1 , … , y n , can be viewed as one point taken from a multivariate (n-variate) Gaussian distribution. In this situation, we can work backward to pair this data set with a GCM. This partner GP is usually assumed to be zero everywhere. When comparing observations in these situations, we simply usek x, x ′ . "Squared exponentials" are popular choices, It should be high for functions covering a wide y axis range, where the maximum allowable covariance is expressed by 2 f . By this definition, k x, x ′ approaches this value, indicating f (x) is close to the perfect correlation between f (x) , f x ′ and x ≈ x � (Hensman et al. 2015). The hyperparameters of the GCM model include a kernel (the kernel type used for prediction), the number of restarts of the optimizer (the number of restarts of the optimizer to find the kernel's parameters that maximize the log-marginal likelihood), and the type of optimizers. The optimized hyperparameters are presented in Table 3.

Model development and evaluation
Each model contains a few parameters known as a hyperparameter that controls the learning process (Yang and Shami 2020). In order to determine the optimal hyperparameters in each model, tenfold cross-validation was performed. Cross-validation is holding out a subset of the training set from the modeling process to estimate the validation error rate. During tenfold cross-validation, approximately equalsized groups of training data are randomly divided into ten subsets. The models were trained with 90% of the training data and cross-validated with the remaining 10%. Based on the 10% validation data, we calculated the misclassification rate. There are ten repetitions of this procedure. Each of the ten validation sets is composed of different observations. Averaging out the results of the validation error gives ten estimates. The optimum hyperparameters of each model obtained from cross-validation are presented in Table 3. As we mentioned earlier, we used an out-of-sample test subset that was not used for the modeling process for the accuracy assessment; therefore, the accuracy assessment reveals the actual predictive power of the algorithms used in the prediction of groundwater quality study. The ML algorithms for GWQI classification were evaluated using the receiver operating characteristic (ROC) curve. A ROC is a visual plot showing the diagnostic ability of a model as its discrimination threshold changes. The ROC plot shows the false positive and true positive rates on the x-axes and y-axes, respectively (Hand and Till 2001). Additionally, the area under the ROC curve (AUC), precision, recall, and F-1 score were also used to validate the models. The evaluation was performed on the test subset not used for developing the models.
Variable importance (VI) analysis was also implemented to analyze the strength of the associations between GWQI and its predictors. Two methods of impurity and permutation were used to score the input variables based on their importance. Impurity-based feature importance uses a treebased feature importance metric on the GWQI classification task, and this metric ranks the variables based on their importance. The limitation of the impurity-based metric is that it is computed on the training set and may not be able to generalize on the test set. It is also biased toward high cardinality variables (Kazemitabar et al. 2017). Therefore, we also used permutation variable importance. For permutation importance, a baseline metric, determined by scoring, is estimated on a dataset. Then, a variable column from the validation set is permuted, and the metric is re-evaluated. The difference between the permutation of the variable column and the baseline metric is defined as the permutation importance (Altmann et al. 2010).
To discover the monotonic, linear, or nonlinear relationship between the features and the GWQI classes, we implemented the partial dependence plot (PDP). The PDP displays the marginal effect that input variables have on the predicted outcome (each class of GWQI) in an ML model (Sahour et al. 2020a). The flowchart of the methodology is depicted in Fig. 4.

Mapping groundwater quality classes
The optimum model was identified and tested with an accuracy assessment. This model is able to predict the GWQI classes using the independent variables (distance to industry, distance to residential area, population, groundwater extraction, transmissivity, evaporation, and precipitation). The independent variables are available over the entire area (Figs. 2 and 3), but the GWQI is unknown and is only available over the recorded wells. The ML model was applied to the independent variables to predict GWQI over unknown areas. The result is predicted GWQI classes over each location of the study area. We used the capability of GIS to create the final GWQI map. For this purpose, a network of grid points was created over the entire study area. Values of the input variables for every single point were captured. Then, the optimum model was applied to the input values to predict the GWQI for every single point. The predicted values were converted to a continuous raster format for further visualization.

Model evaluation
All models provided satisfactory results varying significantly from one to another. The accuracy assessment during the training and testing stages are presented in Tables 4 and  (Table 4). On the test subset, the RF yielded the most favorable results (overall accuracy: 0.92, precision: 0.92, recall: 0.92), followed by SVM (overall accuracy: 0.90, precision: 0.90, recall: 0.91), XGB (overall accuracy: 0.89, precision: 0.90, recall: 0.89), KNN (overall accuracy: 0.88, precision: 0.89, recall: 0.89), and ANN (overall accuracy: 0.81, precision: 0.83, recall: 0.80). The GCM model provided the least accuracy (overall accuracy: 0.80, precision: 0.83, recall: 0.81) ( Table 5). The ROC curves of the six ML algorithms used are displayed in Fig. 5. The area under ROC curves (AUC) values ranges between 0 and 1. The model generating a higher AUC value shows a better performance. In this study, RF provided higher AUC (average AUC for 4 GWQI classes = 0.95) followed by SVM (average AUC for 4 GWQI classes = 0.94), XGB (average AUC for the 4 GWQI classes = 0.93), KNN (average AUC for the 4 GWQI classes = 0.92), GPC (average AUC for the 4 GWQI classes = 0.88), and ANN (average AUC for the 4 GWQI classes = 0.87) ( Table 6). The performance of the other three models was satisfactory to high (AUC ranging from 0.87 to 0.92) (Fig. 5). Further AUC analysis for the four classes of GWQI (excellent, good, poor, and very poor) revealed that the prediction of class 1 (excellent) and class 4 (very poor) yielded the most favorable results. The average AUC from six ML models for both class 1 (excellent) and class 4 (very poor) was 0.97. The least accuracy was associated with class 2 (good) of the GWQI class, with an average AUC of 0.82 (Fig. 6).

Variables analysis
The variable importance analysis performed by the two methods of impurity and permutation (Fig. 6) revealed that the distance to industrial centers (DTI) is the most influential factor affecting groundwater quality in the Mazandaran plain. The second important factor is population density (Fig. 6). That is an important finding because both impurity and permutation selected the DTI and population as the most influential factors affecting groundwater quality in the region. Other factors, including precipitation (P), aquifer transmissivity (T), groundwater withdrawal (GWW), distance to residential areas (DTR), and evaporation, were also among the influential parameters; however, the per-mutation method shows that evaporation and DTR are two least import factors (Fig. 6). Figure 7 shows the partial dependence plot (PDP) of the predictors. The PDP analysis showed that GWQI classes have nonlinear and complex relationships with the predictors. The PDPs for the four classes should be different; therefore, the relationships between predictors and all four GWQI classes are presented in Fig. 7. For example, the dependence of the excellent class and very poor class on DTR should be different because it is expected to have a higher groundwater quality in areas far from industrial centers. Conversely, the area closer to the industrial centers is expected to have low-quality groundwater. This was the case in this study as well. In most cases, the excellent and very poor classes are in opposite directions (Fig. 7). This is, however, not a general rule due to the interaction of the predictors with one another. For example, one may expect to have excellent groundwater quality in a particular location that is far away from an industrial center, but due to high population density and groundwater withdrawal, the groundwater quality may be poor. Therefore, the interaction between variables and the behavior of groundwater quality with respect to its controlling factors can be complex and vary from location to location. Looking into the population density (Fig. 7), the dependence of excellent groundwater quality on population is in decreasing order; the opposite direction can be observed for the very poor class.

Mapping groundwater quality classes
In this study, the RF model provided the highest accuracy and was used to provide the final map of groundwater quality classes across the entire study area (Fig. 8). According to the final map, the groundwater quality in most parts of the study area is "poor" as indicated by the GWQI class. The "good" quality groundwater can be found in the western and southwestern, comprising the second major class. "Very poor" GWQI can be found in the northern part. Statistical analysis showed that the GWQI in 66% of the study area is poor GWQI. The second common GWQI class in the area is good (19% of the area). Very poor GWQI covers 14% of the study area. Finally, "excellent" GWQI in the area is extremely rare, covering less than 1%, and except for very small local spots, there is no "excellent" groundwater quality in the study region.

Discussion
According to the final map, three classes of "poor," "very poor," and "good" are the three major classes of GWQI in the study area. This ranking designates different groundwater usage for each class. The most dominant groundwater quality class is "poor," which is associated with a GWQI  of 51 to 75 ( Table 2). The poor-quality groundwater concentrates mainly in eastern parts of the Mazandaran plain southeast of the Caspian sea (Fig. 8). The poor GWQI class is associated with the areas of high industrial and residential sectors, high evaporation, low precipitation, and various transmissivity values (Fig. 2). The "good" GWQI is the second dominant class of groundwater quality in the study area. This area is mostly concentrated in the southwest of the study area and is associated with an area of high aquifer transmissivity, high precipitation, low evaporation, and fewer industrial sectors. Finally, the northwest part of the study area is identified to have the poorest groundwater quality and is classified as a "very poor" GWQI class (Fig. 8). This area is heavily occupied by the industrial sector and is associated with an area of high population, high evaporation, and high aquifer transmissivity (Fig. 2). The area lacks the "excellent" GWQI except for very small localities. According to WHO water quality standards (WHO 2004), each class of GWQI is suitable for particular usage. Subsequently, excellent GWQI (0-25) is ideal for drinking without further treatments; good GWQI (26-50) is permissible for drinking water as well as industrial purposes. It is permissible to use poor GWQI (51-75) for irrigation, while very poor class (> 75) is not suitable for human consumption and irrigation purposes.
One of the main factors determining groundwater quality is the geological units containing the groundwater (Umar et al. 2013). As reported in previous studies, the geochemical interactions between water and rocks can affect groundwater quality by introducing heavy metals (Khatri and Tyagi 2015;Kubier et al. 2019). In our study area, however, geological units comprised of alluvial sediments of the Quaternary period with minimum variation. Therefore, geological units were removed from the sets of predictors due to their invariability.
The aquifer characteristics affect the groundwater quality in various ways. Shallow unconfined aquifers are more exposed to pollutants compared to deep aquifers. The sample in this study was taken from the unconfined aquifer, and the depth of the monitoring wells varies between 20 and 40 m. In the study area, deep wells (> 100-m depth) that are less affected by pollutants are used for residential drinking purposes.
Six ML algorithms used in this study showed strong performance in GWQI modeling. However, significant differences were observed during the accuracy assessment (Tables 4 and 5; Fig. 5). The three models that showed very high performance in GWQI modeling are RF, SVM, and XGB. These three models have proven to be robust algorithms in environmental modeling (Sahour et al. 2022, 2021a, b, and. RF slightly outperformed the other models. This was reported in previous studies. For example, RF and SVM were used for water quality index modeling in Algeria, and the results revealed an overall outperformance of the RF model (Sakaa et al. 2022). Another research compared SVM, RF, and ANN models to develop a chemical sensor and concluded that the RF algorithms perform better. This is not a general rule, and the performance of ML algorithms varies based on the type, distribution, and size of datasets (Gayathri et al. 2022;Najwa Mohd Rizal et al. 2022;Waziry et al. 2022;Wang and Zeng 2022).
The models' performance was highest in classifying the excellent and very poor classes according to the ROC curves (Fig. 4). This shows that the predictors can better explain extreme (very high and very low quality) classes compared to the other two classes.
Data structure and attributes are important factors in implementing a specific ML algorithm. For example, ANN in this study provided satisfactory results. It also has proven to be a strong tool for environmental modeling (Alshehri et al. 2020;Gholami et al. 2021a, b a and b;Gholami and Sahour 2022). However, compared to other models, it performed poorly during the accuracy assessment. The ANN models are designed to work better on a large dataset and may not perform well on a small dataset (Mao et al. 2006). The adopted methodology uses already-available datasets for the prediction of groundwater quality. The method is cost-effective and can be used for other regions with similar geological and climate settings. In short, the advantage of using ML in this study is the estimation of groundwater quality for unknown areas, a practice that is typically used through expensive field studies, aquifer testing, and laboratory analysis.
The generalization of trained algorithms is one of the major limitations of ML-based methods for water quality modeling. For example, a model that is trained in one location may not be suitable for prediction in other regions. The performance of models depends on the area's characteristics and, in most cases, cannot be generalized for prediction in other locations, especially those with different geological and climate settings (Garza-Pérez et al. 2004).
The source of pollution can be site-specific and varies from region to region. Therefore, it is imperative to analyze further the controlling factors of groundwater quality to prioritize treatment practices. In this study, we used two variable importance analyses that considered the interaction among the variables and ranked them based on their importance. The most important factor in groundwater quality in the study area was found to be the distance from manufacturers and industrial sectors. This is especially true for local groundwater quality, as the very poor quality was found around manufacturers and industrial towns. As further analysis using the partial dependence plot showed (Fig. 6), the quality of groundwater increases with the increase in the distance from industrial locations. This is a crucial finding since the focus for groundwater quality treatments can focus on the industrial sector and adopt strategies to reduce the pollutants and industrial discharges to the aquifer system. In agricultural lands, the consumption of nitrogen fertilizers is an influential factor affecting groundwater quality. The predominant type of agriculture in the area is rice farms that rely on fertilizers, which introduce nitrates to unconfined aquifer systems and degrades the groundwater quality. Population density was identified as the second important factor affecting groundwater quality. This parameter affects groundwater quality in different ways. Population density has a direct relationship with introducing pollutants to the aquifer systems and groundwater extraction. Additionally, there are various small businesses, including small farms, small manufacturers, and local livestock and poultry farming, with the local communities introducing pollutants to the groundwater system.
In this study, GWQI was used as an indicator to evaluate the groundwater samples' quality. As explained earlier, this indicator assigns weights to the various chemical and physical parameters of water to measure its quality. One of the limitations of this method is ignoring other harmful materials that may exist in groundwater samples. For example, radioactive elements and bacterial pollution are not included in GWQI and, therefore, were not considered in this study.

Conclusion
Assessment of groundwater quality is crucial for public health, especially in the arid and semiarid regions of the world, where the population and industry are highly dependent on this valuable resource. However, water sampling and lab analysis are costly and time-consuming when applied over a large area. Additionally, groundwater quality monitoring requires establishing a network of monitoring wells, a practice that may not be practical in many developing countries. This study implemented a cost-effective methodology based on sampling at specific locations and predicting its quality across the entire region. Affecting factors in groundwater quality were identified and prepared within the GIS environment. The relationships between GWQI and the affecting factors were modeled using ML algorithms, and the results were evaluated using ROC curves and statistical efficiencies. The RF was selected as the optimum model considering its highest performance (ROC = 0.95, overall accuracy = 0.92). The trained RF model was used to predict the GWQI classes across the entire area. Given the high accuracy of the optimum model, ML can be used to predict the spatial distribution of groundwater quality. This, however, requires a set of geological and hydrological data that explain the variation in groundwater quality. Since the hydroclimate and landscape settings vary from one region to another, we recommend considering regional and local hydroclimate variables. The findings revealed that the relationship between groundwater quality and its controlling factor is complex and is affected by the interaction between the predictors. The result also shows ML can be used to disclose and quantify relationships between groundwater quality and its controlling factors. In our case, the distance to industrial areas is the most influential parameter affecting groundwater quality. The findings can be used for better ground-water quality management, locating the critical areas (very poor GWQI class), and prioritizing the treatment practice by focusing on industrial discharge. In this study, WHO WQI was used as a criterion for groundwater quality assessment that is limited to the chemical parameters of water samples. For future studies, we suggest using ensemble learning techniques for groundwater quality modeling. The ensemble technique allows the combining of two or more ML algorithms to improve the predictive performance. Additionally, we suggest using various water quality indices and parameters for a comprehensive assessment of groundwater quality in the region.