Prediction of effluent arsenic concentration of wastewater treatment plants using machine learning and kriging-based models

This study evaluates the potential of kriging-based (kriging and kriging-logistic) and machine learning models (MARS, GBRT, and ANN) in predicting the effluent arsenic concentration of a wastewater treatment plant. Two distinct input combination scenarios were established, using seven quantitative and qualitative independent influent variables. In the first scenario, all seven independent variables were taken into account for constructing the data-driven models. For the second input scenario, the forward selection k-fold cross-validation method was employed to select effective explanatory influent parameters. The results obtained from both input scenarios show that the kriging-logistic and machine learning models are effective and robust. However, using the feature selection procedure in the second scenario not only made the model architecture simpler and more effective, but also enhanced the performance of the developed models (e.g., around 7.8% improvement in RMSE). While the standard kriging method provided the weakest predictive results (RMSE = 0.18 ug/L and NSE = 0.75), the kriging-logistic method gave the best performance among the applied models (RMSE = 0.11 ug/L and NSE = 0.90).


Introduction
Most municipal and hospital wastewater treatment plants (WWTPs) have regular effluent inspections in order to meet regulatory standards. However, due to high laboratory analysis costs, human error, and unforeseen technical problems, it is not always possible to rely on regular measurements. Under these conditions, the effluent quality cannot always be appropriately predicted using statistical and numerical methods. In addition, the water quality of a WWTP is sensitive to parameters such as water temperature, pH, influent discharge, and contaminants. This is because, in WWTPs, the wastewater is biologically treated by metabolic processes resulting from the activity of microorganisms. Hence, the treatment process follows a complex and highly nonlinear trend influenced by several known and unknown independent parameters (Guo et al. 2015). In that sense, having a reliably trained mathematical system, either for predicting missing data or for checking measurements affected by ambiguity in the measurement process, would be of great value.
Models based on data mining approaches, such as machine learning (ML) models, have already proven to be appropriate and effective tools for various environmental engineering problems (Haghiabi et al. 2018; Zounemat-Kermani et al. 2020; Şimsek and Alkay 2020; Yetis et al. 2021). Given their architecture, ML models are capable of predicting the output value(s) of nonlinear systems relying only on the input parameters, once the models have been trained and evaluated.
Due to the complexity of wastewater treatment processes, several researchers have already attempted to evaluate the effectiveness of machine learning models in modeling and predicting water quality in WWTP effluent (Qin et al. 2012; Zounemat-Kermani et al. 2019; Bernardelli et al. 2020; Icke et al. 2020). Pai et al. (2007) used a grey model (GM) and an artificial neural network (ANN) to predict chemical oxygen demand (COD) and suspended solids (SS) in the effluent of a hospital wastewater treatment plant. They confirmed the suitability of the ANN for the prediction task. In a similar study, Pai et al. (2009) developed machine learning methods, including an adaptive neuro-fuzzy inference system (ANFIS) and an ANN, for predicting the effluent SS and COD of a hospital wastewater treatment plant; the ANFIS model performed better at predicting the effluent parameters. Guo et al. (2015) employed ANNs and support vector machines (SVMs) to predict the effluent concentration in a WWTP in Korea. Both models performed acceptably in the prediction procedure; however, the SVM model surpassed the ANN in accuracy. Sharafati et al. (2020) applied ensemble ML models, including AdaBoost regression, gradient boost regression, and random forest regression, to predict WWTP effluent water quality parameters such as total dissolved solids (TDS), COD, and biochemical oxygen demand (BOD); the AdaBoost and gradient boost regression methods showed the best performance for TDS and COD, respectively. Granata et al. (2017) modeled and estimated wastewater quality indicators such as BOD, COD, TSS, and TDS using support vector regression (SVR) and regression trees (RT) as black-box indirect models. They reported successful application of both machine learning models, with the SVR model outperforming the RT model. Ansari et al. 
(2018) modeled and predicted the influent flow rate of a sewage treatment plant using a stochastic, nonlinear autoregressive network and SVM model. The results of the study demonstrated that the SVM model was successful in modeling peak inflow. Fard et al. (2020) trained a multilayer perceptron ANN by the Levenberg-Marquardt algorithm and employed a genetic algorithm to predict the total COD, ammonia, phosphate, and turbidity of the effluent of a slaughterhouse WWTP. The findings indicated that the ANN trained by the genetic algorithm performed better than the standard ANN.
Arsenic is a toxic heavy metal and is seen as being synonymous with severe contamination of water bodies. The World Health Organization recommends that arsenic concentrations in drinking water should be below 10 ppb (parts per billion). Arsenic can be found in aquatic systems in its inorganic form resulting from the dissolution of solid phases such as arsenic anhydride (As2O5), arsenolite (As2O3), and realgar (As2S2) (Harper and Kingham 1992). Nowadays, high arsenic concentrations in surface water and groundwater are a worldwide problem; hence, this study focuses on the modeling of this important parameter in the effluent of WWTPs.
The reasons for the presence of arsenic contamination in wastewater have been discussed in several research papers (Harper and Kingham 1992; Komorowicz and Barałkiewicz 2016). According to the literature, arsenic may be found in wastewater for the following two main reasons.
1) Contamination of the main water resources (e.g., the existence of arsenic in groundwater used as a source of drinking water)
2) Human activities that pollute surface/drinking water (e.g., ore mining, fossil fuel combustion, preparation of arsenic-containing drugs, and the use of pesticides)
Different strategies have been employed to remove arsenic or diminish its concentration levels in drinking water, industrial water, and wastewater effluent (Sheikhi et al. 2021). The most used technologies for removing or mitigating heavy metals from contaminated water are membrane distillation (Yapıcıoğlu 2020), electrocoagulation (Yapıcıoğlu 2018a), and biosorption (Agarwal et al. 2020). In this regard, arsenic removal technologies in WWTPs and other facilities cover a wide range of methods, such as oxidation (Zhang et al. 2018; Yapıcıoğlu 2018b), coagulation and softening (Ge et al. 2020), sorption and ion-exchange techniques (Zhou et al. 2017), synergistic reduction (Wang et al. 2020), membrane technologies (Hubadillah et al. 2019), and bioremediation (Plewniak et al. 2018). Although there are no related works using ML models to predict effluent arsenic in WWTPs, several chemical and environmental engineering researchers have developed ML approaches to model the arsenic removal process. Instances include using ANNs to predict arsenic removal from aqueous solution by living cells of Bacillus cereus biomass (Giri et al. 2011), using an ANFIS model to simulate the removal of arsenic and chromium from water (Mandal et al. 2015), and using ANNs to model the removal of aquatic arsenic species by indigenous microbes (Altowayti et al. 2020).
This study aims to predict the effluent arsenic concentration in WWTPs based on several influent water quality parameters, such as influent arsenic values, flow, BOD, TSS, pH, temperature, and turbidity, using data-driven techniques, including kriging, logistic-kriging, multilayer perceptron ANN, gradient boosted regression trees (GBRT), and multivariate adaptive regression splines (MARS). To the best of the authors' knowledge, previous studies have not implemented statistical and ML methods to model effluent arsenic concentration in WWTPs. Moreover, pertinent studies in the field of arsenic removal modeling using ML methods are confined to conventional and simple soft computing methods such as ANNs and ANFIS. Accordingly, the present study's contribution not only comprises the specific case study problem but also challenges the capability of five different data mining statistical (kriging and kriging-logistic) and ML (ANN, GBRT, and MARS) methods in the prediction procedure. It should be noted that in this work, the flexibility of nonlinearity for input data of the developed kriging model is enhanced using the logistic map. This logistic map has a nonlinear scalar factor, which is determined for each input variable to control their nonlinearity.

Case study and dataset description
Ten years of monthly quantity and quality influent (input parameters) and effluent (target parameter) data have been used for the development of the models. The input vector consists of influent variables, including arsenic concentration, flow, BOD, total suspended solids (TSS), pH, water temperature, and turbidity. The output (target) parameter is the effluent arsenic concentration. All datasets were extracted from the annual pretreatment reports for the Point Loma Wastewater Treatment Plant, San Diego City, from 2009 to 2019. The Point Loma WWTP opened in 1963 and treats almost 660,000 m3 of wastewater per day. The reports include a summary of pretreatment program activities of the metropolitan sewerage system, including the Point Loma WWTP, the South Bay Water Reclamation Plant, and the North City Water Reclamation Plant (Pretreatment Annual Report for the Point Loma POTW, 2009-2019). The general process for wastewater treatment and pollutant removal at the Point Loma plant consists of several stages: (i) influent screens for removing large debris; (ii) a grit chamber for settling heavy debris, such as sand; (iii) sedimentation tanks for removing heavy particles, organic solids, and scum; (iv) anaerobic digesters for converting organic matter to methane gas, carbon dioxide, and biosolids; and (v) waste gas burners and a cogeneration gas utilization facility. Figure 1 illustrates the location of the Point Loma WWTP. Table 1 summarizes the statistical characteristics of the data used in this study. One interesting fact about the mean values presented in Table 1 is that, in general, the WWTP has mitigated the effluent arsenic by 30% (from 1.42 to 0.99 ug/L).
Besides, based on the highest Pearson's correlation coefficient provided in Table 1, the influent arsenic concentration is the input variable most strongly related to the effluent arsenic concentration.

Methods
This section describes the methodology and philosophy behind the development of the GBRT, MARS, and kriging-based models. The results of these models are later compared with the conventional and well-known perceptron ANN, as a benchmark model, to evaluate the capability of the applied ML and kriging-based models. For details on the perceptron ANN method, we refer to the numerous available papers and reports, such as Gardner and Dorling (1998).

Gradient boosted regression trees (GBRT)
GBRT is one of the most widely applied techniques for regression-based problems (Friedman 2001; Zounemat-Kermani et al. 2021). The core concept of GBRT is to improve the performance of regression trees (RTs) by employing the most explanatory input variables to reach the most accurate corresponding target. Generally, RTs generate several leaf nodes by mapping the predictor features to target values, which brings several advantages: interactions between input and response parameters are easy to identify, the structure is convenient for analyzing the obtained results, and the method is computationally efficient (Song et al. 2019; Samadi et al. 2021). In the GBRT approach, generating initial RTs is the first step. Then, the RTs are trained through multiple iterations to minimize the residual values. The forward stagewise format of the GBRT can be described as follows:

F_m(x) = F_(m-1)(x) + h_m(x)

where F(x) and h(x) denote the GBRT model and the basis function, respectively, while m and x represent the boosting iteration and the input parameters. By finding the optimum value of h(x) at each stage, namely

h_m = arg min_h Σ_i L(Y_i, F_(m-1)(x_i) + h(x_i))

the response values can be obtained, where Y_i indicates the target value and L is the loss function.
After each iteration, a new RT is fitted to minimize the training error. To improve generalization and avoid over-fitting, the GBRT model applies a learning rate factor (η) that sets the contribution of each RT to the final model across the iterations (Hastie et al. 2009; Song et al. 2019; Alizamir et al. 2020a). With this factor, the update equation becomes (Kim et al. 2020):

F_m(x) = F_(m-1)(x) + η h_m(x)

Smaller values of η help the model perform more accurately during the testing phase; however, the model then also needs more iterations. In this study, the parameters of the GBRT model were tuned by a trial-and-error process to minimize the differences between the target and predicted values: the shrinkage factor, subsampling factor, and maximum tree depth were set to 0.015, 0.25, and 5, respectively, and 150 iterations with the squared loss function were applied for the training process. More detail on the structure and the training procedure of the GBRT can be found in Friedman (2001).

Multivariate adaptive regression splines (MARS)
MARS was proposed by Friedman (1991) as a robust nonparametric statistical tool to extract the relationship between the response and input parameters. The MARS model constructs piecewise-linear basis functions by partitioning the training dataset at knots, with each region receiving its own slope. After fitting a regression equation to the training dataset, the basis functions (BFs) and knots are obtained (Heddam and Kisi 2018). The MARS algorithm achieves further error reduction through a forward phase that improves the location of each knot and its corresponding pair of BFs. A backward phase then improves the performance of the MARS model by deleting redundant BFs that have a negligible effect, using the generalized cross-validation (GCV) criterion. The general expression of a MARS model can be presented as follows (Zhang and Goh 2016; Alizamir et al. 2020b):

f(x) = β_0 + Σ_(m=1)^M β_m λ_m(x)

where f(x) stands for the dependent variable, λ_m is a spline function, and β_m is a constant coefficient calculated using the least squares methodology. A linear (hinge) basis function in the MARS model can be defined at knot value t as:

max(0, x − t) and max(0, t − x)

Finally, the GCV value indicates the best-fit model. For this study, the GCV value and the highest degree of interaction in the final model were equal to 0.0128 and 1, respectively. More details regarding MARS are available in Friedman (1991) and Friedman and Roosen (1995).
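The hinge-function mechanics described above can be sketched in a few lines of Python. This hand-rolled example uses a single, manually chosen knot and plain least squares, not the full forward/backward MARS algorithm with GCV pruning.

```python
# MARS-style mirrored hinge basis functions at one knot, combined by least
# squares. A sketch of the building block only, not the full MARS algorithm.
import numpy as np

def hinge_pair(x, t):
    """Mirrored basis functions max(0, x - t) and max(0, t - x) at knot t."""
    return np.maximum(0.0, x - t), np.maximum(0.0, t - x)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 10.0, 100))
# Piecewise-linear target with a slope change at x = 4, plus mild noise
y = np.where(x < 4.0, 0.5 * x, 2.0 + 0.1 * (x - 4.0)) + 0.05 * rng.normal(size=100)

bf1, bf2 = hinge_pair(x, 4.0)                  # knot placed at t = 4 for the sketch
A = np.column_stack([np.ones_like(x), bf1, bf2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # constant term + spline coefficients
y_hat = A @ beta
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```

Because the target really does change slope at the knot, the two hinges recover the piecewise trend almost exactly; in MARS proper, the knot locations themselves are searched in the forward phase.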

Kriging-based methods
The kriging-based surrogate model was introduced by Krige (1952), while Matheron (1963) formulated the kriging model mathematically. Kriging models are commonly applied to the prediction of complex problems in fields such as hydrology (Heddam et al. 2020), structural design optimization (Sakata et al. 2003; Huang et al. 2006), energy (Keshtegar et al. 2018; Zhang et al. 2020), and structural reliability analysis (Yun et al. 2018; Jiang et al. 2018b; Zhang et al. 2020; Jiang et al. 2020; Lu et al. 2019; Xiao et al. 2020), owing to their ability to handle highly nonlinear relations.
In the standard and conventional kriging model, two terms are used for the interpolation, as follows:

ŷ(X) = G(X)^T β + r(X)^T γ   (7)

In the first term of Equation (7), G(X) and β respectively denote the basis function (BF) and unknown coefficient vectors. The nonlinearity of the predicted model is controlled by the nonlinear form of the function G(X); for accurate predictions with the highest tendency and agreement, the mathematical form of the BF should follow the trend line of the prediction model. The second term represents the stochastic term, where

γ = R^(−1) (O − G β)   (8)

and

r(X) = [R(θ, X_1, X), R(θ, X_2, X), …, R(θ, X_n, X)]^T   (9)

In Equations (8) and (9), O represents the observed data in the training phase, and R denotes the correlation function, given as a stationary Gaussian function with inputs X_p, X_q (p, q = 1, 2, ..., m, where the p-th and q-th input samples are taken from m data points) and correlation parameter θ. The parameter θ is determined by using the maximum likelihood estimator as follows:

θ* = arg max_θ { −[m Ln(σ²) + Ln|R|] / 2 }   (10)

In Equation (10), Ln represents the logarithmic operator. The correlation matrix, R, is determined as below:

R = [r(X_p, X_q)], p, q = 1, 2, …, m   (11)

in which r(X_p, X_q) is the covariance basis function. Commonly, the Gaussian function is utilized to define the kernel function in the modeling process (Sun et al. 2017). By applying the Gaussian function for the correlation function, r(X_p, X_q) is given as follows:

r(X_p, X_q) = exp( −Σ_(i=1)^k θ_i (x_(p,i) − x_(q,i))² )   (12)

where θ_i, i = 1, 2, …, k, is the correlation parameter for the i-th of the k input variables. The coefficients of Equation (7), under the optimum condition of Equation (10), are given as follows:

β = (G^T R^(−1) G)^(−1) G^T R^(−1) O   (13)

In contrast to the stochastic term, whose behavior is set by the parameters of the kernel function, the BFs shape the regression term directly. Therefore, in the regression analysis of problems with limited data, the BFs are effective tools for providing accurately predicted results.
Commonly, a second-order polynomial function is used as the BF of the kriging model, with the following relation:

G(X) = [1, x_1, …, x_k, x_1², x_1 x_2, …, x_k²]^T   (14)

The two terms in the kriging interpolation provide a perfect prediction in the training phase for the studied dataset. Nonetheless, the accuracy and tendency of the predicted results for the testing (validation) dataset depend on the BFs, which should follow the trend of the prediction line (Hasanipanah et al. 2021; Fei et al. 2020). The accuracy of the predicted model can therefore be controlled by selecting a flexible, sigmoid-based logistic function. The logistic function can change from a nearly linear to a strongly nonlinear form by using a shape factor κ, with 0.1 ≤ κ ≤ 10, in the following form:

G_i(X) = 1 / (1 + exp(−κ x_i))   (15)
The logistic function presented in Eq. (15), with its sigmoid form, can represent a nonlinear relationship whose flexibility is controlled by the factor κ, which is determined by trial and error in the current work. The functions presented in Equations (14) and (15) are used, respectively, in the kriging model and in the kriging-based logistic model proposed in this study. As can be seen, the presented logistic model is as simple as the standard kriging model, and it retains all the abilities of the standard kriging model in the training phase, achieving perfect in-sample simulations.
The models presented in Eqs. (14) and (15) can be used for predicting the data in the testing (validation) phase for a new input point X as follows:
Step 1: Determine r(X) using Eq. (9) for the point X.
Step 2: Compute the stochastic term r(X)^T γ using the result obtained in Step 1.
Step 3: Add the regression term G(X)^T β to obtain the prediction of Eq. (7).
The unknown coefficients and correlation parameters are computed in the regression process using Eqs. (10) and (13) for the training data points. The correlation matrix, R, applied in Eqs. (8) and (13), is computed for the training data, while r(X) is determined as presented in Step 1 for the testing data points.
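The prediction steps above can be sketched in Python with NumPy. The training data, the fixed correlation parameter θ, and the shape factor κ are all illustrative assumptions here (in the study, θ follows from the maximum likelihood estimator of Eq. (10) and κ from trial and error).

```python
# Minimal kriging-logistic predictor assembled from Eqs. (7)-(13) and (15):
# Gaussian correlation, GLS estimate of beta, gamma from training residuals,
# then prediction at a new point. theta and kappa are fixed, not tuned.
import numpy as np

def gauss_corr(Xp, Xq, theta):
    # Gaussian correlation of Eq. (12) between two sample sets
    d2 = ((Xp[:, None, :] - Xq[None, :, :]) ** 2 * theta).sum(axis=2)
    return np.exp(-d2)

def logistic_basis(X, kappa):
    # Logistic (sigmoid) basis of Eq. (15); kappa controls nonlinearity
    return np.column_stack([np.ones(len(X)), 1.0 / (1.0 + np.exp(-kappa * X))])

rng = np.random.default_rng(2)
Xtr = np.linspace(-2.0, 2.0, 15).reshape(-1, 1)        # training inputs
Otr = np.sin(Xtr[:, 0]) + 0.01 * rng.normal(size=15)   # observed targets O

theta = np.array([5.0])   # correlation parameter (assumed, not MLE-tuned)
kappa = 1.0               # assumed shape factor of the logistic basis
R = gauss_corr(Xtr, Xtr, theta) + 1e-8 * np.eye(15)    # Eq. (11), with jitter
G = logistic_basis(Xtr, kappa)
Rinv = np.linalg.inv(R)
beta = np.linalg.solve(G.T @ Rinv @ G, G.T @ Rinv @ Otr)   # Eq. (13)
gamma = Rinv @ (Otr - G @ beta)                            # Eq. (8)

Xnew = np.array([[0.5]])
r = gauss_corr(Xtr, Xnew, theta)[:, 0]                     # Step 1: r(X)
y_hat = (logistic_basis(Xnew, kappa) @ beta + r @ gamma)[0]  # Steps 2-3
print(y_hat)
```

Because the stochastic term interpolates the training residuals, the predictor reproduces the training targets exactly and smoothly fills in between them.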
The predictions of these models on the testing dataset should be evaluated with several statistical measures of accuracy, tendency, and agreement. We believe that this model can provide highly accurate results for problems with limited data points in the training phase, provided that the number of training data points exceeds 2k+1.

Feature selection process
Besides the importance of methods used for achieving proper prediction/simulation results in data mining problems, feature selection-which is the process of choosing the best and most relevant input parameters-plays an essential role in constructing the optimum data mining model. In other words, in the feature selection process, one confines the original input vector to an optimal one by eliminating some irrelevant or non-effective input parameters, which leads to a faster training procedure and more accurate results. There are a variety of different approaches for applying the feature selection process, including the wrapper methods (e.g., forward feature selection) (Meyer et al. 2018), filter methods (e.g., the mutual information approach) (Cameron et al. 2019), and embedded methods (Rodriguez-Galiano et al. 2018).
In this study, based on the wrapper methodology, we used forward selection combined with the k-fold cross-validation method, where ten folds were randomly assigned. The result of the feature selection process implies that just three input parameters, i.e., influent flow, influent temperature, and influent arsenic values, effectively influence the target value (effluent arsenic). As could be expected, influent arsenic values greatly affect the prediction of effluent arsenic (see Table 1). The interesting point is that the applied feature selection method selected only flow rate and temperature among the other six independent variables (flow rate, BOD, TSS, pH, temperature, and turbidity). It is worth mentioning that the flow rate is the only quantitative parameter among the available independent variables, so no other parameter could represent its impact on effluent arsenic. On the other hand, temperature, as a qualitative parameter, was chosen to represent the impact of the other qualitative parameters, such as BOD, TSS, pH, and turbidity. This does not mean that the other parameters do not directly affect the effluent arsenic; rather, it implies that, in this specific case study, taking into account the other input parameters, such as pH, would not improve the predictive results of the applied models. In Fig. 3, the final results of the feature selection process, based on the calculated R² (%), are illustrated. The highest calculated R² value is attributed to selecting influent arsenic, flow rate, and temperature.
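A minimal sketch of wrapper-style forward selection with 10-fold cross-validation might look as follows; the variable names, the linear base learner, and the synthetic data are illustrative assumptions, not the study's actual pipeline.

```python
# Forward feature selection with 10-fold CV: greedily add the feature that
# most improves the mean cross-validated R^2, stopping when no candidate helps.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
names = ["As_in", "flow", "BOD", "TSS", "pH", "temp", "turbidity"]  # illustrative
X = rng.normal(size=(120, 7))
# Synthetic target driven by three of the seven candidate features
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 5] + 0.1 * rng.normal(size=120)

selected, remaining = [], list(range(7))
best_score = -np.inf
while remaining:
    # Try adding each remaining feature; keep the one with the best CV score
    trials = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                                 cv=10, scoring="r2").mean() for j in remaining}
    j_best = max(trials, key=trials.get)
    if trials[j_best] <= best_score:
        break                                  # no further improvement: stop
    best_score = trials[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
print([names[j] for j in selected])
```

The greedy loop mirrors the paper's procedure: the strongest single predictor enters first, then each subsequent feature is kept only if it raises the cross-validated score.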

Implementation of the models
In this study, the MATLAB programming language was used to create the artificial intelligence models. After implementing the models, the data were split into two datasets: 75% for the training phase and 25% for the testing period. Based on a trial-and-error process, the best parameters of the different models, including the number of hidden layers, activation function type, and learning rate, were obtained to minimize the difference between the measured and estimated arsenic concentration values.
In addition, the data are predicted using the kriging model via the DACE MATLAB toolbox (Lophaven et al. 2002), while the calibration process of the developed kriging-based logistic map is coded in MATLAB with an extra subroutine connected to the DACE toolbox, which searches the nonlinearity degree of the input variables through the logistic map.

Modeling scenarios
This study deals with two types of input combinations for constructing the data mining models; in the first scenario, all of the seven available independent variables are considered in the input vector (see Table 1). On the other hand, the second scenario is established based upon the effective independent variables elicited from the feature selection process: flow rate, influent arsenic, and temperature. Differentiating between these two input scenarios helps to evaluate the effectiveness and significance of the feature selection process on the final performance of the applied models in predicting effluent arsenic.

Evaluation of the models' performance
To evaluate the performance of the models applied, several statistical measures are used, including the root mean square error (RMSE) and mean absolute error (MAE), as absolute scale deviance measures; the mean absolute geometric error (MAGE) and geometric reliability index (GRI), as relative scale deviance measures; and the Pearson's correlation coefficient (PCC) and Nash-Sutcliffe efficiency (NSE) coefficient, as similarity measures (Jachner et al. 2007). The formula for each of the above-mentioned measures and the possible ranges and ideal values are presented in Table 2.
In Table 2, y and ŷ denote respectively the observed and simulated/predicted values of the influent/effluent arsenic concentrations in the training/testing stages, ȳ is the average of the arsenic values over the period considered, and N is the number of data, i.e., the number of observed values in the training or testing stage.

Fig. 3 The result of forward feature selection analysis with 10-fold cross-validation, leading to the selection of three effective input parameters (flow rate, influent arsenic, and temperature) out of the seven available variables (see Table 1)
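For reference, the measures can be coded directly. RMSE, MAE, NSE, and PCC below follow their standard definitions; the MAGE and GRI lines use a common geometric formulation, which is an assumption here since Table 2's exact formulas are not reproduced in the text.

```python
# Deviance and similarity measures for a pair of observed/predicted vectors.
# MAGE and GRI are geometric (relative) measures, so they assume strictly
# positive values; their formulation here is an assumed common variant.
import numpy as np

def metrics(y, y_hat):
    e = y - y_hat
    rmse = np.sqrt(np.mean(e ** 2))
    mae = np.mean(np.abs(e))
    nse = 1.0 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)  # Nash-Sutcliffe
    pcc = np.corrcoef(y, y_hat)[0, 1]                         # Pearson's r
    log_ratio = np.log(y_hat / y)
    mage = np.exp(np.mean(np.abs(log_ratio)))       # geometric absolute error
    gri = np.exp(np.sqrt(np.mean(log_ratio ** 2)))  # geometric reliability index
    return dict(RMSE=rmse, MAE=mae, NSE=nse, PCC=pcc, MAGE=mage, GRI=gri)

y = np.array([1.0, 1.2, 0.9, 1.1, 0.95])
print(metrics(y, y))  # ideal values: RMSE = MAE = 0, NSE = PCC = MAGE = GRI = 1
```

The ideal values match Table 2: zero for the absolute deviance measures and one for the relative deviance and similarity measures.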

Results
In this section, the simulated and predicted outcomes of the applied models in the training and testing stages are presented for the two input combination scenarios separately.

First scenario, all available input parameters
As mentioned, the first scenario considers all of the available independent parameters for the construction of the kriging and machine learning models. Table 3 shows the overall performance of the compared machine learning models in the training stage. The GBRT model shows a better training performance compared to the MARS and ANN models.
Given the results in Table 4 for the testing phase, it can clearly be seen that the kriging-logistic method excelled at predicting effluent arsenic values in terms of the absolute deviance and similarity measures in comparison to the other applied models, while all models acted similarly for the relative deviance measures (MAGE and GRI). Specifically, the kriging-logistic method performed better than the other models for all of the statistical measures except the MAGE (1.009 for the kriging-logistic vs. 1.004 for the ANN model). By and large, the kriging-logistic method surpassed the other models in the prediction of effluent arsenic. The scatter plots of the MARS, GBRT, kriging, and kriging-logistic models for the predicted values are shown in Fig. 4. They compare the modeled effluent arsenic concentration with the measured samples using the coefficient of determination (R²). In these plots, with the aid of the line of agreement (1:1 line) and the 20% error lines, it is shown how close the predicted values are to the measured data. The R² for the kriging-logistic model is 0.86, which means the predicted results match the measured arsenic values to a fair extent.
In Fig. 5, the predicted outcomes of the models, as well as the measured arsenic concentrations, are illustrated using violin plots. Similar to boxplots, violin plots are used to compare the distribution of a variable (here the effluent arsenic) across various data categories. The median, mean, interquartile range (midspread), and kernel density estimate (the body of the violin plot) are presented in these plots. Taking these characteristics into account reveals that the kriging-logistic and ANN models performed better than the other models. The kriging method clearly produced the poorest results, as shown by its deformed kernel density shape and interquartile range height compared to the corresponding plot of the measured data.

Second scenario, modeling based on the feature selection procedure
Similar to the previous section, the performances of the ML models using the statistical measures are given in Table 5. As can be seen, the ANN model performed slightly better in the training phase according to the absolute deviance and similarity measures. However, this does not imply the model's superiority, and one should refer to the predicted results on the testing set for a final judgment.
Based on the calculated measures for the predicted effluent arsenic values in Table 6, it is obvious that the ANN model's performance is not as accurate as the GBRT's. However, the highest NSE (= 0.896) and PCC (= 0.962) scores in Table 6 show that the kriging-logistic method was the best-fitted model among the five at predicting effluent arsenic. Likewise, the kriging-logistic method provided the lowest values of the absolute deviance measures (RMSE = 0.113 ug/L and MAE = 0.083 ug/L), making it the most accurate model, followed by the GBRT. Even though the ANN gave the best MAGE score, the kriging-logistic model ranked second for the MAGE (= 1.007) and first for the GRI (= 1.099). Consequently, the kriging-logistic model is the best predictive model for effluent arsenic in this study.
In order to assess the performance of the developed models graphically, scatter plots and violin plots of the predicted values versus the measured data are depicted in Figs. 6 and 7, respectively. First, the effluent arsenic concentrations predicted by the MARS, GBRT, kriging, and kriging-logistic models are compared with the measured data in the scatter plots of Fig. 6. In addition to having a higher coefficient of determination (R² = 0.92), it can clearly be seen from the scatter plots that the values predicted by the kriging-logistic method are closer to the corresponding measured effluent arsenic data than those of the other applied models. Given the trend-line equations in the scatter plots (taking the linear form y = ax + b), the kriging-logistic method, followed by the MARS model, provided a and b coefficients closest to 1 and 0, respectively.

Fig. 6 Scatter plots of the predicted results versus measured effluent arsenic concentration for the second input scenario using the feature selection procedure

Fig. 7 Violin plots of the measured versus predicted effluent arsenic by kriging-based and machine learning methods in the second modeling scenario

In Fig. 7, the predicted results versus the measured arsenic concentrations are summarized graphically using violin plots. Similar to the results obtained for the first scenario, all of the applied predictive models acted similarly in reproducing the interquartile range. Nevertheless, the ML models show better performance for the measures of central tendency, including the mean and median. According to the rotated kernel density plots, it can also easily be observed that the predictions obtained from the kriging-logistic and ANN models are more or less the same as the rotated density plot of the measured data.

Discussion
In general, the ranking of the applied machine learning models is similar for both the first and second input scenarios: overall the GBRT first, then the MARS, and then the ANN. However, on the whole, the kriging-logistic method provided the best predictive results for both input scenarios, while the standard kriging method gave the weakest performance among all of the models. To gain more insight into the results of the two input scenarios, we compared the performance of both and present the performance enhancement of the predicted values in the second scenario over the first in Table 7.
As can be seen from the results in Table 7, the statistical measures of the predicted arsenic values improved noticeably for the kriging methods, e.g., an average RMSE improvement of 15%. This emphasizes the importance of the feature selection procedure in improving the performance of statistical data-driven models such as kriging.
A similar claim can be made for the improvement in the performance of the applied machine learning models (with the exception of the MAGE measure). However, the magnitude of improvement (an average RMSE improvement of 2.7%) is not as large as for the kriging-based methods. It can be concluded that, because of their complex structure, ML models are better at capturing the nonlinear properties of the system even when some ineffective or irrelevant independent input parameters are included. In other words, ML models are less sensitive to the composition of the input vector than the kriging methods.
Because applications of ML methods in modeling WWTPs are rare in the literature, it is highly recommended that more studies be carried out on the use of ML methods for simulating WWTPs. This should improve knowledge and understanding of the ability of ML methods to evaluate and predict the efficiency of WWTPs.

Conclusions
This study assessed and evaluated the potential of kriging-based and machine learning models in predicting the effluent arsenic concentration of a treatment plant, using a feature selection procedure to establish two different input scenarios. To this end, monthly influent wastewater discharge and six quality parameters, as well as effluent arsenic concentrations, were gathered from the Point Loma WWTP, a primary wastewater treatment facility in San Diego, California. The performance of both model categories was evaluated in terms of statistical measures, including absolute scale deviance measures (RMSE and MAE), relative scale deviance measures (MAGE and GRI), and similarity measures (PCC and NSE), as well as diagnostic methods such as scatter and violin plots. The results showed that ML models, such as the ANN, MARS, and GBRT, can be effectively applied to the monthly prediction of effluent arsenic concentration. Among the ML models, the GBRT showed a slightly higher prediction accuracy in the testing stage. The standard kriging model was unsuccessful in giving acceptable predictive results compared to the other applied models. However, the modified kriging model, namely the kriging-logistic model, provided the best predictive results in comparison to the ML models. It is worth mentioning that the kriging-based models were more sensitive to the outcomes of the feature selection procedure than the ML models. To sum up, this study demonstrated the efficiency and robustness of machine learning models and the higher accuracy of the kriging-logistic method in modeling effluent arsenic of WWTPs.
Acknowledgements As the corresponding author, I wish to express my appreciation to the Alexander von Humboldt Foundation for providing financial support for this research project within the framework of the Return Fellowship program. Also, the third author was supported by University of Zabol under grant nos UOZ-GR-9618-1 and UOZ-GR-9719-1.
Author contribution Mohammad Zounemat-Kermani designed the study and carried out the analyses. Meysam Alizamir and Behrooz Keshtegar executed the models. Mohammad Zounemat-Kermani, Okke