Comparison of river water quality assessment methods using the tree model and the nearest neighbor method (A case study: Ahvaz Hydrometric Station)

Ahvaz Hydrometric Station is responsible for monitoring surface water resources and the Karoon River near Ahvaz City in southwestern Iran. The present study aimed to determine the parameters affecting water quality, especially total hardness (TH) and the sodium adsorption ratio (SAR). For this purpose, 39 years of statistical data comprising 463 samples were collected. To determine the water quality, the correlation matrix method and statistical analysis were applied first, and then the correlation between the parameters and the accuracy of these methods were checked using the tree model and the K-Nearest Neighbor (K-NN) method. The K-NN method and multivariate regression were compared for water quality characteristics, including SAR. The results indicated that the K-NN method performed better than the regression method. In addition, the K-NN method using combinations of the effective anions and cations yielded better estimates of SAR and TH. Furthermore, the accuracy of the tree model in estimating TH was higher with SO₄²⁻ than with Ca²⁺, and its accuracy in estimating SAR was higher with Cl⁻ data than with Na⁺ data. In general, according to the APHA standard (2005), the river water is in the high-risk and low-alkaline group.

pollution and quality of water resources. It is crucial to monitor and analyze the water quality of rivers.
Total hardness (TH) is a measure of the dissolved minerals in water. It is a vital parameter for assessing water quality.
The presence of alkaline earth metals leads to the hardening of water (Ameen, 2019). A significant level of water hardness is vital for drinking and irrigation. The most essential property determining the quality of irrigation water is the sodium adsorption ratio (SAR).

In another study, machine learning models were combined with satellite image data for an 18-year time series in a river estuary in China. The findings showed that a machine learning-based approach could be developed to estimate total suspended solids (TSS) and chlorophyll-a (Chl-a) in the turbid Pearl River, and that the satellite-derived time series of TSS and Chl-a presented significant spatiotemporal variations (Ma et al., 2022). Alqahtani et al. (2022) compared individual supervised machine learning (ML) models, such as gene expression programming (GEP) and artificial neural networks (ANN), with an ensemble learning model for predicting river water salinity in terms of electrical conductivity (EC) and total dissolved solids (TDS) in the Upper Indus River basin, Pakistan. The results of the sensitivity analysis demonstrated that HCO₃⁻ is the most effective variable, followed by Cl⁻ and SO₄²⁻, for both EC and TDS. The assessment of the models on external criteria ensured the generalized results of all the aforementioned techniques (Haritash et al., 2016; Zakwan et al., 2017; Anmala and Turuganti, 2021). The findings of other researchers who have worked on the water quality of different rivers reinforce the idea that machine learning models can predict water quality assessments with a high degree of accuracy. As a result, machine learning-based models reduce the cost and complexity of calculating the sub-indices of water quality parameters and hence improve water quality management (Shamsuddin et al., 2022; Malek et al., 2022; Khoi et al., 2022).

The Karoon River Basin is the largest river basin in Iran. One of the major cities along the Karoon River is Ahvaz City, the capital of Khuzestan Province. The water quality of the Karoon River has declined over many years due to the increase in water use for different purposes and the disposal of sewage. The degradation of water quality has endangered aquatic life and has reduced the quality of drinking water as well as the status of the river ecosystem. The Karoon River supplies the water required by large areas of farmland, industries, and 1.7 million people in sixteen large cities in southwestern Iran. Therefore, regular water quality analysis of the Karoon River is crucial.
In short, several studies have been conducted on the chemical quality of water. For example, Sattari et al. (2015) showed that the tree model can be used to identify the water quality class with the least number of hydro-chemical variables. Fallah and Haghizadeh (2017) used the Mann-Kendall test and reported a growing trend of Cl⁻, EC, and TDS in the Lorestan Territory (Iran). At the Dehnu Station, a positive trend of Ca²⁺ and Cl⁻ was found. Moreover, they showed that SO ...

2-Materials and Methods
Figure 1 shows the study area and Ahvaz Hydrometric Station.

2.2-Sample Collection
The chemical water quality parameters include TDS, EC, CO

2.3-Methods
Water quality variables are non-linearly related; therefore, water quality studies, such as testing the relation between TH, SAR, Na%, and the anions and cations, are complex. Existing models are a mixture of hydrodynamic and quality models. In this regard, older methods for water quality modeling have low accuracy, whereas newer methods such as artificial intelligence (AI) are useful in this area (Chen & Liu, 2015). An important feature of artificial intelligence methods is their ability to relate the input and output of a process regardless of its physical characteristics (Vafakhah, 2012).
The decision tree is a learning algorithm whose classification structure resembles a flow chart. It is a hierarchical model built on the independent variables, and it combines this hierarchical structure with the prediction target. The most significant variable sits at the root of the tree (Lee & Lee, 2015). The decision tree includes three components: decision nodes, branches, and leaf nodes. Classification rules are defined from the data set and interpreted by the tree model. The entire decision-making process begins at the root decision node. From top to bottom, the data corresponding to each decision node are split into groups, and every leaf node produces a result. Each distinct path from the root to a leaf defines a classification rule, and the set of classification rules makes up the decision tree (Chen et al., 2019). Decision trees are a relatively recent data mining method used to create predictive models, and they are common tools for classification and prediction.
Unlike neural networks, decision trees generate rules (Mounce et al., 2017). Classification trees and regression trees are the two types of decision tree. A regression tree is used when the tree predicts continuous values (Breiman et al., 1984). The benefits of this method include fast, simple training and effectiveness on large data sets (Bhatia & Vandana, 2010). The CHAID algorithm, which is based on supervised learning, is used to develop the tree model. The tree is grown using the chi-square test together with the significance value p (Milanovic & Stamenkovic, 2016). CHAID is a non-parametric method that makes no assumptions about the underlying distribution of the data (Bashari et al., 2021).
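The chi-square-driven choice of a splitting variable described above can be illustrated with a minimal sketch. The function names and the toy data are hypothetical, and the sketch is not the SPSS implementation: the category merging and p-value adjustment of the full CHAID procedure are omitted, and predictors are ranked by their raw chi-square statistic, which is a fair comparison only when they have the same number of categories.

```python
from collections import Counter

def chi_square(xs, ys):
    """Pearson chi-square statistic for the contingency table of two categorical lists."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))  # missing cells count as 0
    stat = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

def best_split(predictors, target):
    """Return the name of the predictor with the largest chi-square statistic."""
    return max(predictors, key=lambda name: chi_square(predictors[name], target))

# Made-up data: binned ion levels versus a hardness class (illustration only).
target = ["hard", "hard", "soft", "soft", "hard", "soft"]
predictors = {
    "SO4_level": ["high", "high", "low", "low", "high", "low"],
    "Ca_level": ["high", "low", "low", "high", "high", "low"],
}
print(best_split(predictors, target))
```

In this toy table the first predictor separates the two classes perfectly, so it is chosen as the root split.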
The likelihood-ratio chi-square statistic may be utilized when the target is an ordered continuous-type variable (Song & Chae, 2008; Althuwaynee et al., 2014; Yeon et al., 2010).

The K-Nearest Neighbor (K-NN) algorithm is a common classification method based on distance measurement. It considers the training data and their corresponding classes, and its output is a class membership. K-NN belongs to the group of methods that can classify an unidentified sample when provided with data having specified attributes (X) and the related response (Y). The K-NN classifier is a pattern-based, non-parametric method that measures the distance from the target point to its nearest points for a specified value of k. The value of k is selected according to the maximum points given to the neighborhood points, and the K-NN algorithm places samples with similar properties in one class directly on the basis of this decision. The K-NN algorithm assumes that pixels close to one another in the attribute space fall into one class (Avand et al., 2019). The method searches for the k training samples nearest to the target object; the dominant class among these k samples is identified and assigned to the target object. The basis of the K-NN method is that all samples with similar attributes are grouped into one class in the feature space. To classify a sample, the method identifies the class to which it belongs on the basis of the nearest distances only. The K-NN method relies on a small number of neighboring samples, rather than on discriminating the class domains, to determine the class. It is therefore more suitable than other methods for pending sample sets whose class domains overlap.
The K-NN algorithm may be implemented as follows (Fan et al., 2019):
1. Select the k value.
2. Calculate the distance between each point of known category and the current point.
3. Sort the points in ascending order of distance.
4. Select the k points with the minimum distance from the current point.
5. Determine the number of points of each category among these k points.
6. Assign the current point to the category with the maximum number of points among the first k points.
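The steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPSS routine used in the study: the function name `knn_classify`, the two-feature samples, and the class labels are all hypothetical.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from every labelled point to the query point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: sort in ascending order of distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 4: keep the k closest points.
    nearest = distances[:k]
    # Step 5: count how many of them fall in each category.
    votes = Counter(label for _, label in nearest)
    # Step 6: the most frequent category is the prediction.
    return votes.most_common(1)[0][0]

# Made-up two-feature samples (for example, two ion concentrations) with risk classes.
points = [(1.0, 1.2), (1.1, 0.9), (0.8, 1.0), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["low", "low", "low", "high", "high", "high"]

# Step 1: choose k = 3.
print(knn_classify(points, labels, query=(1.0, 1.0), k=3))  # prints "low"
```

An odd k is a common choice for two-class problems because it avoids tied votes.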
The parameters TH and SAR are analyzed, and corrective relations are then presented to estimate their values. The sodium adsorption ratio (SAR) is obtained from Eq. (1) (Sadick, 2017). SAR is a dimensionless parameter. TH is calculated from Eq. (2). Water quality was assessed based on the Schoeller Diagram for irrigation (Sheriff & Hussain, 2017). In the Schoeller Diagram, the anions, cations, TH, and TDS are in milligrams per liter, and pH is in mol/l. The Wilcox classification is used to rate the water quality for irrigation purposes (Brhane, 2016). As Table 1 shows, based on EC and SAR, water is classified as "low risk", "medium risk", "high risk", or "very high risk". The general form of the multivariate regression is given by Eq. (3) (Kewan, 2015), where β_i are the constant coefficients, x_i are the independent parameters, and ε is the measurement error. Before the data sets were analyzed, the reliability of the chemical measurements was examined through the ion charge balance, or reaction inaccuracy (RI), based on Eq. (4). If the value of RI is more than five, the accuracy of the data is questionable (Ebadati & Hooshmandzadeh, 2014).
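As a worked illustration of the quantities defined in this section, the sketch below uses the standard textbook forms of SAR, total hardness, and the ion charge balance. These forms are an assumption here: the exact expressions and coefficients of Eqs. (1), (2) and (4) may differ in detail, and the function names and sample values are hypothetical.

```python
import math

def sar(na_meq, ca_meq, mg_meq):
    """Sodium adsorption ratio, Na / sqrt((Ca + Mg) / 2), with all inputs in meq/l."""
    return na_meq / math.sqrt((ca_meq + mg_meq) / 2.0)

def total_hardness(ca_mg, mg_mg):
    """Total hardness in mg/l as CaCO3 from Ca and Mg in mg/l (common approximation)."""
    return 2.497 * ca_mg + 4.118 * mg_mg

def reaction_inaccuracy(cations_meq, anions_meq):
    """Ion charge balance error in percent from cation and anion lists in meq/l."""
    total_cations = sum(cations_meq)
    total_anions = sum(anions_meq)
    return 100.0 * abs(total_cations - total_anions) / (total_cations + total_anions)

# Made-up sample (illustration only).
print(sar(8.0, 4.0, 4.0))                    # prints 4.0
print(round(total_hardness(40.0, 10.0), 2))  # prints 141.06
ri = reaction_inaccuracy([8.0, 4.0, 4.0], [9.0, 4.0, 2.5])
print(ri < 5.0)                              # prints True, so the sample would be kept
```

The last line applies the acceptance rule stated in the text, where an RI above five makes the measurement questionable.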

Results and Discussion
Table 2 presents a summary of the statistical analysis of the data. The table shows the concentrations of anions and cations in milligrams per liter; pH is in mol/l. Figures 2 and 3 present the trend analysis for TH (mg/l) and SAR. In the figures, the x-axis is dimensionless.
The TH time series over the statistical period is described by Eq. (5), and the SAR time series by Eq. (6). Comparing the average TDS with the Schoeller Diagram, the water quality of the Karoon River in Ahvaz City is considered acceptable. Similarly, comparing the average EC with Table 1, the water poses a high risk for soil. In Fig. 4, the slope of the line in the time series is positive; hence, these parameters have an upward trend, and the quality of water for drinking and agriculture has been challenged. Table 3 presents the significance test results in addition to the coefficients of SAR, TH, and the water quality parameters.

... and Cl⁻ as model design variables to estimate SAR, and also used Na⁺, SO₄²⁻, Cl⁻ and Mg²⁺ to predict TH.
Among the above-mentioned anions and cations, Na⁺ had the greatest effect.
Before the anion and cation data were used, they were tested for normality using the Monte Carlo test, which is dimensionless. Given that the Monte Carlo constant for all of them was 0.000, the values used were taken to have a normal distribution.
Anion and cation application scenarios for TH and SAR are used to implement the tree model as well as the K-NN method. However, the combined anion and cation scenarios were not used to run the tree model. Table 4 shows how these scenarios are used.

3.1-Nearest Neighbor Method
To run this model using the SPSS Package, 70% of the data were used for training and the remaining 30% for validation and testing. The distance calculation was based on the Euclidean method. The number k was five. In this research, k values between 3 and 5 were considered in order to choose the optimal value, that is, the k for which the error of the TH and SAR values calculated by the nearest neighbor method is the lowest.

TH: The linear regression equations between the real values and the values calculated by the nearest neighbor method using SO₄²⁻ were determined as follows:

$TH_{act} = 25.5 + 0.917\,TH_{KNN}$, $R^2 = 0.947$, $k = 4$ (8)
$TH_{act} = 19.56 + 0.946\,TH_{KNN}$, $R^2 = 0.935$, $k = 5$ (9)

The correlation between the estimated and measured TH data was 0.867 using SO₄²⁻ data (Fig. 5) and 0.862 using Ca²⁺ data. Similarly, Fig. 6 shows the correlation between TH, Ca²⁺ and SO₄²⁻ data, together with the results of the K-NN model and the correlation between the obtained and measured data.
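The selection of k described in this subsection (a 70/30 split, Euclidean distance, and k between 3 and 5) can be sketched as follows. This is a minimal illustration on synthetic data, not the SPSS procedure: the function names are hypothetical, the prediction is assumed to be the mean of the k nearest targets, and the mean absolute error is assumed as the selection criterion.

```python
import math
import random

def knn_predict(train_x, train_y, query, k):
    """Predict a continuous value as the mean target of the k nearest training samples."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k

def mean_abs_error(train_x, train_y, test_x, test_y, k):
    """Average absolute difference between K-NN predictions and the held-out values."""
    errors = [abs(knn_predict(train_x, train_y, q, k) - y) for q, y in zip(test_x, test_y)]
    return sum(errors) / len(errors)

# Synthetic single-predictor data standing in for (ion concentration, TH) pairs.
rng = random.Random(0)
xs = [(rng.uniform(1.0, 10.0),) for _ in range(100)]
ys = [50.0 + 30.0 * x[0] + rng.gauss(0.0, 5.0) for x in xs]

# 70% of the samples for training, the remaining 30% for testing.
split = int(0.7 * len(xs))
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

# Evaluate k = 3, 4, 5 and keep the value with the lowest test error.
errors = {k: mean_abs_error(train_x, train_y, test_x, test_y, k) for k in (3, 4, 5)}
best_k = min(errors, key=errors.get)
print(best_k, errors)
```

With real data, several error measures would normally be compared before fixing k, since neighboring k values often give nearly identical errors.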
Fig. 7 shows the time series of the real data and the data estimated by the nearest neighbor method using SO₄²⁻ for different k values.
The linear regression equations between the real values and the values calculated by the nearest neighbor method using Ca²⁺ were obtained as follows:

$TH_{act} = 24.94 + 0.917\,TH_{KNN}$, $R^2 = 0.698$, $k = 3$ (10)
$TH_{act} = 24.68 + 0.919\,TH_{KNN}$, $R^2 = 0.719$, $k = 4$ (11)
$TH_{act} = -8.04 + 1.033\,TH_{KNN}$, $R^2 = 0.719$, $k = 5$ (12)

The obtained correlation coefficients are all smaller than 0.8; therefore, Ca²⁺ data cannot be used to implement the K-NN model to predict TH. Because the correlation coefficients for k = 4 and k = 5 are equal, other statistical parameters should be used to select the optimal k. Fig. 8 shows the time series of the real and estimated data using the nearest neighbor method and Ca²⁺ for different k values. In Fig. 11, the correlation coefficient of the TH estimation model was 0.992 using a combination of SO₄²⁻ and Ca²⁺.
Accordingly, the fitted line is closer to the first-quadrant bisector of the coordinate system, which lies at a 45° angle, showing that the result is satisfactory. Linear regression equations between the real and calculated SAR values were determined by the nearest neighbor method using the Cl⁻ ion (Eqs. 16 to 18). Based on the correlation coefficients, there is a very small difference between the three values. Figure 11 shows the time series of real and estimated SAR data using the K-NN method and the Cl⁻ anion.
Linear regression equations between the real and calculated SAR values were determined by the nearest neighbor method using Na⁺ (Eqs. 19 to 21). Based on the correlation coefficient, the results for k = 5 were more favorable than those for k = 3 and k = 4.
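The regression lines between observed and K-NN-calculated values reported in this subsection are ordinary least-squares fits accompanied by a coefficient of determination. A minimal sketch of that calculation is given below; the function name `linear_fit` and the sample values are hypothetical and are not taken from the study's data.

```python
def linear_fit(x, y):
    """Least-squares line y = a + b*x and its coefficient of determination R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot

# Made-up observed and K-NN-calculated SAR values (illustration only).
sar_obs = [2.1, 2.6, 3.0, 3.5, 4.2]
sar_cal = [2.2, 2.5, 3.1, 3.4, 4.3]
a, b, r2 = linear_fit(sar_obs, sar_cal)
print(f"SAR_cal = {a:.3f} + {b:.3f} SAR_obs, R^2 = {r2:.3f}")
```

A slope close to one and an intercept close to zero place the fitted line near the 45° bisector, which is the visual criterion used in the text.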
Figure 12 shows the outputs of the K-NN method with Na⁺ and Cl⁻ to estimate SAR.
According to Fig. 13, the correlation coefficient between the SAR data obtained from the K-NN method and the observed data was 0.991. In conclusion, the K-NN method is good at estimating SAR. Based on the explanations given and the diagram in Fig. 14, the K-NN model provides favorable results for SAR estimation using Na⁺ and Cl⁻.
Based on Figs. 11 to 13, the results obtained from the K-NN method did not exhibit good agreement with the observed data when Na⁺ and Cl⁻ were used separately, so a high-precision relation for estimating SAR cannot be obtained from either ion alone. The results obtained by Dezfooli et al. (2017) indicated that K-NN had a 10% error in both the calibration and validation stages. Kim et al. (2015) showed that the disparity in predictive accuracy was around 5% under dry and wet weather conditions. Babbar and Babbar (2017) observed that the rate of wrong water quality classes was around 2%-28% for the K-NN method, compared with 1%-38% and 10%-20% for artificial neural network and rule-based classifiers, respectively.

3.2-Decision Tree Model
In this analysis, 70% of the data were used in the training stage and the remaining 30% in the testing stage. TDS was taken as the effective parameter. The CHAID method was used as the growing method. Figure 14 presents the output of the tree model for TDS estimation, including the training and testing stages using Ca²⁺. The correlation coefficients between TH and Ca²⁺ were 0.999 and 0.757 in the training and testing stages, respectively. In addition, the correlation coefficients between TH and SO₄²⁻ in the training and testing stages were 1.00 and 0.974, respectively (Fig. 15).
The correlation coefficients of the tree model for estimating SAR were 0.919 and 0.983 using Na⁺ data for the training and testing stages, respectively. Furthermore, the coefficients of the tree model in the training and testing stages using Cl⁻ were 0.999 and 0.998, respectively. Figure 16 shows the outputs of the tree model for estimating SAR using Na⁺ and Cl⁻.
Assuming a linear relationship of Na⁺ and Cl⁻ with SAR, single-variable regression was applied to estimate SAR as follows:

$SAR = 0.518\,Na^{+}$, $R^2 = 0.983$ (22)
$SAR = 0.52\,Cl^{-}$, $R^2 = 0.983$ (23)

According to Tables 5 to 7, the regression equations were significant. Since Na⁺ and Cl⁻ tend to combine to form salt, the correlation coefficient between them is greater than 0.9. In general, multivariate regression using Na⁺ together with Cl⁻ cannot be used to estimate SAR, because the "collinearity" problem occurs. In short, multivariate regression was unable to produce accurate results. Linearity is a characteristic of a relationship or mathematical function that appears in visual form as a straight line. Fig. 17 shows the effect of collinearity, that is, the linear correlation between Na⁺ and Cl⁻.

Table 8 presents the correlation coefficients between the real and predicted data using the tree method with different growth algorithms. As can be seen, SAR prediction with Cl⁻ and the CHAID algorithm gives the highest correlation coefficient. In general, all R² coefficients obtained with the three growth algorithms and Cl⁻ are higher than those obtained with Na⁺. The smallest range of variation in R² is related to the combination of Ca²⁺ and SO₄²⁻. On this basis, it can be said that when these two ions are combined, the type of growth algorithm has no effect on the results of the tree model. In predicting TH using Ca²⁺ and SO₄²⁻ separately, the CHAID algorithm has the highest correlation coefficient. The difference between the results may be due to the different geological conditions and the effect of evaporative formations on the water quality of the Karoon River.
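The collinearity between Na⁺ and Cl⁻ noted in this subsection can be checked numerically before a multivariate regression is attempted. The sketch below computes the Pearson correlation and, for the two-predictor case only, the variance inflation factor 1 / (1 - r²). The function names and concentrations are hypothetical, and the threshold of 10 is a common rule of thumb rather than a criterion stated in the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor for a model with exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Made-up Na+ and Cl- concentrations in meq/l (illustration only).
na = [4.1, 5.0, 6.2, 7.1, 8.3, 9.0]
cl = [4.0, 5.2, 6.0, 7.3, 8.1, 9.2]
r = pearson_r(na, cl)
vif = vif_two_predictors(na, cl)
print(round(r, 3), vif > 10.0)
```

When the two predictors are this strongly correlated, their individual regression coefficients become unstable, which is why a single-ion regression or a non-parametric method is preferred here.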
Ebadati and Hooshmandzadeh (2019) presented two bivariate regression equations to calculate TH in terms of TDS and EC with p = 0.001. In addition, they presented a trivariate regression of TDS and EC with the involvement of SO ...

Conclusions
According to the statistical analysis performed, and by comparing the K-NN method with the tree model, the water quality assessment results are as follows:
1-In addition to SAR, the CO₃²⁻ and HCO₃⁻ parameters have the lowest correlation with TH, which indicates that CO₃²⁻ and HCO₃⁻ do not affect the parameters examined in this study. Furthermore, it can be concluded that SAR can be determined using only Na⁺ or Cl⁻. SAR can be accurately estimated using the tree model with Na⁺ and Cl⁻. The accuracy of the tree model for SAR estimation was higher using Cl⁻ than using Na⁺.
2-When the K-NN method is used to determine TH from SO₄²⁻ values, the accuracy of the results increases by approximately 14.5%. Additionally, based on the results of the K-NN method for determining SAR, common formulas are not reliable.
3-Statistical analysis of the K-NN model results using the correlation coefficient gives, in some cases such as Cl⁻, results similar to those of the coefficient of variation. In the case of Na⁺, however, the results are opposite to those for Cl⁻.
4-The use of TDS as an effective variable provided favorable results for the implementation of the tree model, which depends on the high correlation coefficient between TDS and the research parameters.

Output of the K-NN method to estimate SAR using Cl⁻
Output of the K-NN method for SAR estimation using Na⁺ and Cl⁻ data.
Linear correlation between Na⁺ and Cl⁻

Statistical evaluation: Data sets were analyzed with statistical packages (SPSS 18.00 and Minitab 16.00). Two-way ANOVA was used to analyze the relationship between sites and seasons at the significance level of p < 0.05. The correlation coefficient test was applied to determine the association between water quality and chemical parameters. MAPE is the mean absolute percentage error (Kim & Kim, 2016). To analyze the error, the Mean Absolute Difference (MAD) was computed (Ballester et al., 2016). MSD is the mean square deviation (Karmaker et al., 2017). The variance is one of the dispersion indicators and shows how far the data are from the average value. If the standard deviation of a data set is close to zero, the data are close to the mean and have little dispersion.
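The three error measures named in the statistical evaluation can be written out explicitly. The sketch below assumes the usual definitions, with errors taken as observed minus estimated values; the function names and sample values are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def mad(actual, predicted):
    """Mean absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def msd(actual, predicted):
    """Mean square deviation between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Made-up observed and estimated values (illustration only).
observed = [4.0, 5.0, 8.0, 10.0]
estimated = [5.0, 5.0, 6.0, 10.0]
print(mape(observed, estimated), mad(observed, estimated), msd(observed, estimated))
# prints 12.5 0.75 1.25
```

MAPE is scale-free, while MAD and MSD carry the units of the variable, with MSD penalizing large errors more heavily.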

Figure 9 presents the result of the K-NN method using Ca²⁺ and SO₄²⁻ to estimate TH with the SPSS Package. It can be seen that the results obtained with different k values match the real data. The regression relationships between the real and calculated values of TH were obtained for k = 3, 4 and 5.

The regression relationships for SAR were as follows:

$SAR_{cal} = 0.275 + 0.926\,SAR_{obs}$, $R^2 = 0.921$, $k = 4$
$SAR_{cal} = 0.25 + 0.935\,SAR_{obs}$, $R^2 = 0.927$, $k = 5$ (18)
$SAR_{cal} = 0.438 + 0.879\,SAR_{obs}$, $R^2 = 0.875$, $k = 4$
$SAR_{cal} = 0.391 + 0.887\,SAR_{obs}$, $R^2 = 0.885$, $k = 5$ (21)



(Nikoloski et al., 2020) reported that the stream had the poorest water quality conditions in the summer and the best conditions in the winter. Ebadati et al. (2014) presented a regression formula for TH using Ca²⁺ whose correlation was 0.732. According to the studies mentioned, comprehensive studies show that global models are more practical for domain experts: they can be easily interpreted because they predict all the targets simultaneously, while overfitting less and maintaining or even improving on the performance of local models (Nikoloski et al., 2020). Sattari et al.
... SO₄²⁻ and K⁺ declined. Using SAR and Na%, most of the samples were within the permissible limits. Most of the samples presented high TDS values, which made them a concern for irrigation. Overall, the results have practical significance for maintaining the sustainable use of water in the Syr Darya River (Zhang et al., 2019). Bagherian et al. (2014) used the QUAL2K model to analyze the water quality of the Karoon River. Haghighi and Arabi (2010) used BOD and DO in the analysis of critical stages for water quality. Zarei et al. (2013) analyzed the impact of the Gachsaran Formation on the water quality of the Karoon and Dez rivers. Namdari and Hooshmandzadeh (2019) showed that SO₄²⁻ and Ca²⁺ correlated highly with TH and, accordingly, presented regression relations to calculate TH using this cation and anion. The coefficient of the linear regression of TH on TDS was 0.704; thus, the correlation between TH and TDS was poor. Jozaghi et al. (2018) compared multiple water quality criteria to select the optimal location for constructing a dam in southeast Iran. Khorrami et al. (2019a; 2019b) analyzed the effect of land displacement on groundwater quality. For this purpose, a novel method known as SAR was used. In addition, they analyzed the effects of TDS, BOD, arsenic, and lead on groundwater quality. Sarani et al. (2012) compared artificial neural networks (ANN) and multivariate regression for calculating SAR and concluded that ANN was better than the regression method. Gorgan-Mohammadi et al. (2023) studied decision tree models for predicting the water quality parameters dissolved oxygen and phosphorus in lake water. Their results show that decision tree methods, with the help of hydro-chemical parameters, can classify and predict water quality with high accuracy and in a short time. They concluded that the Classification And Regression Tree (CART) model is better than the Chi-squared Automatic Interaction Detector (CHAID) model in predicting data, and that the C5 model is better than the Quick Unbiased Efficient Statistical Trees (QUEST) model in predicting group numbers. Afkhami et al. (2002) analyzed the effect of wastewater on the Karoon River using the water quality index.

Table 2
Summary of the Karoon River water quality data at

Table 2 presents the maximum, minimum, mean, and standard deviation of the water quality parameters. By comparing the values in this table with quality standards, the quality class of the Karoon River water can be determined. Based on the sums of the cations and anions, the ion charge balance error was 0.0037; thus, all data were used for the chemical quality analysis. It should be noted that a change in TDS causes a change in the TH and SAR values, and the time series of TDS in the Karoon River water has an increasing trend. This increases the values of TH and SAR, and the result is consistent with the other methods and confirms their accuracy.

Table 4
Sarani et al. (2012) ... of the ANOVA test in addition to the estimation of coefficients. The tables show the sum of squares and mean squares for regression and residuals, and the significance level of the ANOVA test for each variable. In addition, F is the test statistic and df represents the degrees of freedom. Ebadati and Hooshmandzadeh (2019) also identified Cl⁻ and Na⁺ as dominant in determining SAR, and obtained equations with coefficients of 0.946 for Na⁺ and 0.928 for Cl⁻. Sarani et al. (2012) obtained a lower coefficient than the present study. In their research, they regarded pH as a physical property of water.
Table 7 presents the results of estimating the coefficients of SO₄²⁻ and Ca²⁺ with TH, as well as of Na⁺ and Cl⁻ with SAR. Here, m is the test statistic, B is the coefficient of the correlation equation, and Sig denotes the significance level.

Table 8
... SO₄²⁻, yielding a correlation coefficient of 0.873. Ahmed et al. (2016) obtained 0.727 and 0.434 as well as 0.794 and 0.529 for the accuracy and precision of K-NN and tree models, respectively.