Comparative analysis of machine learning techniques for estimating groundwater deuterium and oxygen-18 isotopes

Isotope techniques are most frequently used when hydrochemical analysis are insufficient to determine the origin and quality of groundwater and reveal seawater intrusion into groundwater along coastlines. In this study, the potential of the multilayer perceptron, adaptive neuro-fuzzy inference system, generalized regression neural networks, radial basis neural networks, classification and regression tree, Gaussian process regression, multiple linear regression analysis, and support vector machines were compared using known hydrochemical properties of waters for estimating deuterium (δD) and oxygen-18 (δ18O) isotopes in groundwater of the Bafra plain, Northern Turkey. The data were divided into training (70%) and testing (30%) sets. Cluster analysis was performed to decrease the number of input variables. The data on electrical conductivity, chloride, magnesium, and sulfate were introduced into the models after examining different combinations of these variables in the studied models. The determination coefficient (R2), mean absolute error (MAE), and root mean square error (RMSE) were used to evaluate the performances of the models. In addition, visualization techniques (Taylor diagram and heat maps) were prepared to assess the similarities between the measured and estimated δD and δ18O values. The R2, RMSE, and MAE for δ18O (0.98, 0.31 and 0.20‰, respectively), and δD (0.95, 2.85 and 1.89‰, respectively) values for the testing datasets revealed that the performance accuracy of multilayer perceptron is the best among the applied models tested. Therefore, the study suggests using data-driven methods, multilayer perceptron in this case, when lacking appropriate laboratory isotope analysis or facing high laboratory analysis costs.


Introduction
The precipitation, evaporation, soil type, geological structure, and human activities have a significant impact on the quality of groundwaters. The water used in irrigation, residential areas, and industrial facilities substantially increased due to the constantly growing human population worldwide. Excessive use of groundwater as a water source in coastal areas disturbs the water balance between fresh and saline water and triggers seawater intrusion into groundwater. Excessive groundwater use causes groundwater to be salinized and it negatively affects plant growth and soil salinization (Libutti et al. 2018;Ding et al. 2020).
Understanding the origin of salinity and the dynamics of saline-water intrusion in aquifers, especially under the threat of saline water intrusion is of great importance for the sustainable groundwater management. In addition to classical hydrogeology techniques, isotope hydrology methods are also used to determine the regions with saline water intrusion. The distribution of deuterium (dD) and oxygen-18 (d 18 O) isotopes in groundwater or surface waters is widely used to determine the origin of waters and the relationships between water resources (Jayathunga et al. 2020). The d 18 O and d 2 H, among the environmental isotopes, are good tracers as they are natural water components. The studies on water quality assessment revealed that high volume of groundwater abstraction, excessive pumping, and less recharge in coastal wells lead to seawater intrusion, thus increased groundwater salinization (Klassen and Allen 2017;Mohanty and Rao 2019). Many studies indicated a significant relationship between the stable isotopes and other hydrochemical properties of groundwater (Lin et al. 2011;Arslan et al. 2012;Mongelli et al. 2013;Isawi et al. 2016;Bahir et al. 2018;Jahnke et al. 2019). However, isotopes analysis is costly, and very few laboratories are equipped to analyze them.
Artificial intelligence (AI) techniques such as artificial neural networks (ANN), adaptive neuro-fuzzy inference system (ANFIS), and support vector machine (SVM) are widely used to solve complex problems in a various research field, thus gaining popularity (Odabas et al. 2017;Ahsan et al. 2022). Complex nonlinear systems can be easily modeled using AI techniques that compensate for numerical models' weaknesses. The AI techniques have recently been employed successfully to estimate water quality parameters. Barzegar and Moghaddam (2016) evaluated the performances of three different ANN models including, multi-layer perceptron (MLP), generalized regression neural network (GRNN), and radial basis function neural network (RBNN), to estimate groundwater salinity in the Tabriz plain. Hameed et al. (2017) predicted the water quality index using a backpropagation neural network (BPNN) and RBNN. Lal and Datta (2018) employed genetic programming (GP) and gaussian process regression (GPR) models in modeling groundwater salinity. Juntakut et al. (2019) used the classification and regression tree (CART) to predict the nitrate concentration of groundwaters in Nebraska. Jafari et al. (2019) estimated the total dissolved solid contents of the groundwater aquifers in Tabriz plain using four AI techniques, including MLP, ANFIS, SVM, and gene expression programming (GEP). Noori et al. (2020) combined a process-based watershed model and an ANN to improve water quality predictions. Cahyadi et al. (2020) applied the BPNN model to estimate the hydraulic conductivity of fractured groundwater flow media.
Previous studies made significant contributions to the improve AI technology to evaluate water quality parameters. However, the AI techniques have not been used to estimate the distribution of dD and d 18 O isotopes in groundwater. Hence, the estimation of dD and d 18 O in groundwater using AI models will fill the gap in the literature as a novel modeling approach. Overall, the objective of this study was to develop a simple, low-cost, and accurate model to estimate the dD and d 18 O isotopes in groundwater by using different data-driven models, 2 Material and methods

Study area and dataset
The study was conducted in Kizılırmak Delta of Bafra plain, which was located in the middle part of the Black Sea region of northern Turkey and situated in 41°30 0 -41°45 0 north latitudes, 35°30 0 -36°15 0 east longitudes (Fig. 1). The soils in the study area formed over young alluviums deposited by the Kızılırmak River. Most of the soils in the area are hydromorphic, while coastal dunes are found at the seaside and colluvial-alluvial mixtures in the inland. Overall, the soils in the plain are young and have been extensively used in agricultural production.
The climate is semi-humid with an annual average temperature of 14.40°C. Annual precipitation ranges between 536.4 and 783.5 mm. The elevation rises from the coast to the inland and reaches 10 m within about 6 km. The differences in elevation cause significant fluctuations in groundwater levels and drainage within the study area. Irrigation is carried out mainly by using border and furrow irrigation methods. About 75% of the agricultural fields in the plain are irrigated using surface waters, while 25% of them are using groundwater (Cemek et al. 2007).
A total of sixty-one water samples were collected between October 2007 and September 2008 in different locations in the study area. Fifty-six water samples were collected from the twenty-eight different monitoring wells in Bafra plain and five from the Black Sea. The water samples were filtered using a 0.45 lm filter, filled in polyethylene bottles, and stored at 4°C until processing. Electrical conductivity (EC) and pH of water samples were measured in a portable handheld probe in the field. Major cation (K ? , Na ? , Ca ?2 , and Mg ?2 ) and anion (Cland SO 4 -2 ) concentrations were analyzed using the ion chromatography ( , and Na ? ranged from 37.00 to 305.00 mg L -1 , 46.00 to 839.00 mg L -1 , 4.00 to 200.00 mg L -1 , and 247.70 to 6900.00 mg L -1 , respectively. Average concentrations of these major cations were 102. 44, 170.52, 35.03, and 1170.21 mg L -1 , respectively. The concentrations of Cl -, SO 4 -2 , and HCO 3 vary from 345.34 to 9800.00 mg L -1 (average, 1551.02 mg L -1 ), 40.00 to 1950.00 mg L -1 (average 419.34 mg L -1 ), and 174.44 to 1617.00 mg L -1 (average 840.07 mg L -1 ), respectively.

Multilayer perceptron (MLP)
The MLP is one of the most frequently used neural network types (Fig. 2a). This network transmits the information from the input layer neurons to the hidden layer(s) neurons and finally to the output layer neurons (Skansi 2018).
Levenberg-Marquardt (LM) learning algorithm was employed in this study (El-Bakry 2003;Cigizoglu and Kişi 2005). The hidden and output layers were constructed using tangent sigmoid (tansig) and linear transfer (purelin) functions. The number of hidden nodes was increased from three to seven to optimize the training network.

Radial basis neural networks (RBNN)
The RBNN was introduced into the neural network literature by Broomhead and Lowe (1988) to represent an efficient mechanism for complex nonlinear functions of local approximators (Mai-Duy and Tran-Cong 2003), pattern classification and recognition (Umasankar and Kalaiarasi 2014), and modeling and controlling dynamic systems (Zhao 2008) from the input-output data. The RBNN is distinguished from other neural networks due to its universal approximation and shorter training phase. The RBNN networks are special feed-forward neural networks composed of three layers (Fig. 2b). The first layer (input) represents the input data. The second layer (hidden) comprises several nodes that implement a nonlinear transformation to the input variables. The last layer corresponds to the final output of the network and uses a linear activation

Generalized regression neural networks (GRNN)
The GRNN, initially proposed and developed by Specht (1991), composes of four layers: input, pattern, summation and output (Fig. 2c). The total number of variables is equal to the number of input units in the first layer. Each unit in the pattern layer represents a training pattern. The S and Dsummation neurons are connected to each pattern layer in the summation layer. The S-summation neuron adds the product of the pattern layer's weight outputs, whereas the D-summation neuron adds all weights. The final layer divides the output of the S-summation neuron by the output of the D-summation neuron and generates the predicted output. The distinguishing feature of GRNN is its use of a smoothing factor, which affects the generalization of the network. Low smoothing factors impair the generalizability of the network, whereas high smoothing factors frequently increase the prediction error while are helpful in generalization. Detailed information on GRNN can be obtained from Haykin (2001).

Adaptive neuro-fuzzy inference system (ANFIS)
This technique combines the learning ability of ANN and relational structure with the decision-making mechanism of the fuzzy inference system (FIS). Similar to the ANN, the ANFIS learns with samples from a training set. This method generates the most appropriate ANN structure for resolving the problem at hand. The test process is used to determine the effect of structure on previously unobserved samples. The low error values demonstrate the applicability of a model. However, one of the primary shortcomings of the ANN is the lack of justification for weight values acquired. The FIS, which is integrated into the structure of the ANFIS, addresses this shortcoming. The structure of ANFIS allows to be applied in many real-world issues. Daneshmand et al. (2015) provided the details on ANFIS. The SVM, a binary learning machine, can be used for pattern recognition and nonlinear regression problems. The SVM is a two-layer network composed of nonlinear weights in the first layer and linear weights in the second (Bray and Han 2004). The SVM can efficiently handle nonlinear classification/modeling using the kernel method. Given a training data of n samples z i ¼ x i ; ð f y i Þ; i ¼ 1; 2; 3; . . .; ng , where x i 2 R n is the training data and y i ¼ Ris the response for x i . The following quadratic optimization problem can be written as follow by solving the Kuhn-Tucker conditions.
where 1=2x 2 is the regularization term, uðx i Þ is the nonlinear mapping function, C is the error penalty factor to regulate the difference between the empirical error and the regularization term,n i and n Ã i are the positive slack variables,e is the loss function, b is scalar, and x represents a normal vector.
The best decision function f(x) can be stated as; where a i a Ã i ði ¼ 1; 2; 3; . . .; nÞ are the Lagrange multipliers, and kðx i ; xÞ is the kernel function.
Different kernel functions (i.e., linear, polynomial, Gaussian, exponential) were employed in the SVM. A polynomial kernel function was used in this study (Achirul Nanda et al. 2018). The readers may refer to Vapnik (1995) and Vapnik et al. (1997) for more information on SVM theoretical basis.

Gaussian process regression (GPR)
The GPR is a widely used nonparametric kernel-based machine learning technique that clarifies responses by presenting latent variables f(x i ), I = 1,2,3,…,n derived from a Gaussian process (GP) and explicit basis functions. A GP is a finite number of random variables with a joint Gaussian distribution.
A GP f(x) is specified by a mean m(x) function and covariance function which are defined as follows (Williams and Rasmussen 1996): when f ðxÞ $ GPð0; kðx i ; x j ÞÞ, the GPR model can be defined as: KðX; XÞ ¼ The covariance function is described by different kernel functions, which can be configured in kernel parameters in vector h; therefore, the covariance function is stated as kðx i ; x j h j Þ. This study employed four kernel functions, rational quadratic, exponential, squared exponential, and matern 5/2, to estimate the dD and d 18 O. Detailed information on the GPR can be obtained from Rasmussen (2003). Breiman et al. (1984) developed the CART, a nonparametric modeling approach. This technique is one of most commonly used methods in groundwater research management (Naghibi et al. 2016;Zhao et al. 2016;Choubin et al. 2019;Juntakut et al. 2019;Knoll et al. 2019) as it is easy to understand and interpret (Genç et al. 2015). The process begins by subdividing the data into subgroups, resulting in more homogeneous samples in the child nodes compared to the parent nodes. This process is repeated until the CART reaches a steady state in which the parent nodes do not significantly improve the capacity of the entire tree. The CART can effectively estimate the outcome in regression problems by using the least-square deviation criteria. The CART model is not a black box model, therefore, the effect of each variable on the desired output can be visualized clearly in the corresponding tree structure. Detailed information on the CART model can be obtained from Breiman et al. (1984) and Chipman et al. (1998).

Multiple linear regression analysis (MLR)
Multiple linear regression (MLR) is a statistical technique that predicts the outcome of a response variable by combining several explanatory variables. The MLR attempts to model the linear relationship between independent and dependent variables. General equation of a MLR model is as follows: where y is the dependent variable; x 1 , x 2, and x n are the independent variables; b 1 , b 2, and b n are the slope coefficients for each independent variable; b 0 is the constant term; e is the error term of the model.

Cluster analysis
Cluster analysis is a technique to group related observations into a set of clusters based on the observations of multiple variables for each individual. The cluster technique has effectively been used to classify water samples in many studies (Kazakis et al. 2017;Rao and Chaudhary 2019;Egbueri 2020;Yang et al. 2020). In this study, the cluster analysis was used to group groundwater samples for d 18 O, dD, EC, pH, Ca, Mg, K, Na, Cl, SO 4 , and HCO 3 content.

Data pre-processing
Before developing the model structure, the input and output data were standardized between 0 and 1 to certify the equal handling of all variables and enhance the efficiency of the training network. The following equation is used to normalize data: where Z norm is the standardized value, Z mea is the measured value, and Zmin and Zmax are the minimum and maximum values.
The k-fold cross-validation technique was used to select a model from a 70% training data set and a 30% testing data set. This technique randomly divides the initial data into k equal-sized subsets (k-folds). A single subset of the k partitions is designated as testing data to evaluate the model's ability, and the remaining k-1 subsets are used to train the network. This procedure is repeated k times, and the k fold results can then be averaged to produce a single estimate. In this study, the k value was taken as 10. The specific information about this procedure can be obtained from Cemek et al. (2020).

Performance criteria of model
The coefficient of determination (R 2 ), mean absolute error (MAE) and root mean square error (RMSE) were used to evaluate the performance of models (Waller 2003).
where Z mea is the measured value; Z pre is the predicted value; Z avg is the average value measured; and n is the number of data.
In addition, Taylor diagrams were used to analyze the standard deviation (SD) and correlation coefficients (R) between the model-predicted and observed values.

Results and discussion
Extensive groundwater extraction causes salinization of groundwater and soils in the areas where seawater intrusion occurs, potentially resulting in significant reductions in plant yields (Alfarrah and Walraevens 2018). Generally, hydrochemical properties alone are insufficient to determine seawater intrusion. Thus, combining hydrochemical properties and isotopes can provide additional information about seawater intrusion into the aquifer (Pennisi et al. 2006;Mahlknecht et al. 2017).
The isotopes of d 18 O and dD are two of the most critical isotopes for determining the source of water (Maurya et al. 2019). In this study, four different ANN models-MLP, GRNN, RBNN, and ANFIS-and MLR were used to estimate the dD and d 18 O isotopes in groundwater Bafra plain. The prediction performance of AI models varies depending on the input parameters (Küçüktopcu and Cemek 2021). In AI studies, appropriate variable selection is accomplished with multivariate statistical analysis methods (Lu and Ma 2020). For example, Shah et al. (Shah et al. 2021) used principal component analysis, a type of multivariate statistical analysis, to select the most influential input parameters for estimating the concentrations of dissolved oxygen and total dissolved solids in waters.
In this study, Hierarchical cluster analysis was applied to reduce the number of input variables before developing the ANN and MLR models, and similarities between the variables were determined. Three groups were generated from the hierarchical cluster analysis (Fig. 3). Group A includes EC, Cl, Na, Mg, SO 4 , dD, and d 18 O. The variables of K, Ca, and pH composed group B, and HCO 3 was constituted group C.
Increased EC and Cl values in coastal groundwater may indicate seawater intrusion (Arslan et al. 2012). Cluster analysis revealed that EC and Cl variables were in the same group as dD and d 18 O isotopes, meaning that a seawater intrusion occurred, and these variables could be successfully used to estimate dD and d 18 O isotopes. The evaluation indicated a similarity between the isotopes (dD and d 18 O) and the variables of EC, Cl, Na, Mg, and SO 4 . Therefore, dD and d 18 O isotopes were estimated using these variables in the same group as isotopes.
Different input combinations were tested to assess the degree of effect of each variable on dD and d 18 O values. The input combinations tested were (i) EC; (ii) EC and Cl; (iii) EC and SO 4 ; (iv) EC, Cl and SO 4 ; (v) EC, Cl, SO 4 , and Mg.
In the MLP models, the different numbers of hidden nodes were examined, and the optimal number was generated. The lowest RMSE value in the testing phase was selected. In the ANFIS technique, the most appropriate outputs were tested for Gaussian, trapezoidal, and triangle membership functions (MFs) with varying numbers of MFs. In the RBNN model, the optimum spread and hidden node numbers were determined using an approach to trial and error. The GRNN application found optimal models by a trial-and-error method using different spreads. In the SVM technique, optimal parameters are selected using the rule and the stopping criteria. Several studies have been used machine learning models to predict water quality parameters (Ahmed et al. 2019;Bui et al. 2020;Lu and Ma 2020).
In the MLR analysis, the dD and d 18 O were used as dependent variables, and EC, Cl, SO 4 , and Mg were assumed to be independent variables for the training data set. After deriving the regression equations, the values of dD and d 18 O for the testing data set were determined.
The training and testing results for the MLP, ANFIS, RBNN, GRNN, SVM, GPR, CART, and MLR models in the estimation of d 18 O values are given in Table 2. In the training stage, the RMSE values for MLP, ANFIS, RBNN,GRNN,SVM,GPR,, and 0.55-0.70%, respectively. The lowest RMSE value (0.19%) was obtained for the MLP5 (4,5,1) model in which the inputs were EC, Cl, SO 4 , and Mg; however, the highest value (0.72%) was obtained for the SVM2 (Gaussian) model in which the inputs were EC and SO 4 . Similarly, the lowest value of MAE (0.10%) was calculated for the MLP5 (4,5,1) model, whereas the highest value (0.56%) was for the MLR1 model. The R 2 values between the measured and estimated d 18 O were between 0.91 and 0.98 for MLP models and between 0.94 and 0.98 for GPR models, while the R 2 value varied between 0.77 and 0.97 for the other models.
The R 2 , MAE and RMSE values of the best models for different input combinations in the testing dataset were shown in Fig. 4, which facilitates the graphical comparison of the tested models. The models with input combination  Fig. 7.
The d 18 O and dD estimation models were evaluated by using a Taylor diagram (Fig. 8). The MLP5 (4,5,1) models yielded a lower SD and RMSE values and a higher correlation coefficient compared to the other studied models. Therefore, a comparison for the findings of models shows that the MLP5(4,5,1) models provided the most accurate results in the prediction d 18 O and dD values.
Studies on the estimation of isotopes using artificial intelligence models are rare. However, different researchers have estimated the hydrochemical properties of waters with AI models (El Bilali et al. 2021;Singha et al. 2021;Egbueri and Agbasi 2022). The researchers stated that different AI models could make the most accurate predictions in evaluating different water properties. Haghiabi et al. (Haghiabi et al. 2018) compared the ANN, SVM, and group method of data handling models to determine the pH, SO 4 , Na, Cl, Ca, Mg, HCO 3 values of the waters and obtained the most accurate results with the SVM model. Kouadri et al. (Kouadri et al. 2021) predicted the irrigation water quality indices using different AI models, including ANN, MLR, and long short-term memory models.

Conclusions
Excessive groundwater extraction in coastal areas degrades and pollutes water sources. When these waters are used for irrigation, they significantly reduce plant yields and contribute to soil salinization. Thus, continuous monitoring of water's hydrochemical and isotope composition is required in coastal areas with excessive groundwater withdrawal, particularly in paddy cultivation areas. Hydrochemical properties of waters are generally used in seawater intrusion studies due to the low cost and ease of obtaining. However, environmental conditions can affect the hydrochemical properties very quickly; therefore, the isotope analysis provides significant benefits in seawater intrusion studies.
This study employed different data-driven models called MLP, ANFIS, RBNN, GRNN, SVM, GPR, CART, and MLR to estimate the d 18 O and dD values using the known hydrogeochemical properties of groundwaters in the Bafra plain of Turkey. The estimated values obtained with the models were statistically compared using R 2 , RMSE, and MAE. Taylor diagram was also employed to assess the performance of the investigated models. Comparative analysis of the results obtained by the models indicated that MLP5 (4,5,1) generated the most accurate models for all estimations based on R 2 , RMSE, and MAE values for the estimation of d 18 O (0.98, 0.31 and 0.20%, respectively) and dD (0.95, 2.85 and 1.89% respectively) values in the testing datasets. In the case of lack of data availability, the MLP1 (1,5,1) model, which only used the EC values generated satisfactory results in estimating the dD and d 18 O values. Overall, the study suggests using data-driven methods, especially MLP, when lacking the appropriate isotope analysis and facing high costs.
Funding The authors have not disclosed any funding.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.