Environmental Factors Assisted the Evaluation of Entropy Water Quality Indices with Efficient Machine Learning Technique

Water is an indispensable resource for human production and life. The evaluation of water quality by scientific methods supports the regeneration and recycling of water resources. In this study, entropy theory was used to evaluate water quality, which overcomes a limitation of traditional water quality assessment: it does not consider the impact of different environmental factors on water quality. Considering the complexity of the traditional evaluation process, two typical machine learning (ML) methods, the generalized regression neural network (GRNN) and the support vector machine (SVM), were used to predict the entropy water quality index (EWQI). Correlation analysis was applied to divide environmental factors into different combinations that subsequently acted as the input vectors for the ML models. According to the root mean squared error (RMSE), the SVM was selected as the better prediction model. Then, four optimization algorithms were used to optimize the SVM for nonlinear regression prediction and classification of water quality. Judged by several evaluation indicators, the hybrid differential evolution and gray wolf optimization algorithm (DE-GWO) achieved markedly better performance than the other three algorithms and has an important advantage in preventing the prediction model from falling into a local optimum. The results of this study provide guidance for water quality prediction and could contribute to the rational use and protection of water resources.


Introduction
Over the past several decades, rapid population growth accompanied by increasing urbanization, agriculture and industrialization has imposed numerous stresses on natural systems. The contradiction between growing demands and the availability of water has been one of the major constraints on global development and is even tougher in developing countries (Bansal and Ganesan 2019; Xiao et al. 2019; van Vliet et al. 2021). Growing agricultural, domestic and industrial water demand and consumption have produced a large amount of wastewater that is discharged directly into rivers because sewage collection and treatment systems are insufficient. Water plays pivotal roles in ecological and human health, economic development and social prosperity; thus, it is critical to prevent and control declining water quality in water bodies (Wu et al. 2018; Tang et al. 2022). Understanding the conditions and performing robust assessment of surface water quality are essential to provide reliable information for decision-making and effective management strategies.
Surface water quality evaluation is a fundamental task for water environment management. Assessing water quality requires plenty of monitoring data on physical, chemical, and biological parameters. Traditional water quality evaluation methods compare each water quality parameter with its corresponding standard level and then grade the overall water quality by the worst parameter. With this approach, water quality categories and the primary pollution factors can be determined quickly and intuitively, but it provides neither continuous, quantitative analysis nor timely early warnings. Water quality indices (WQIs), which are mathematical instruments, can integrate and interpret a group of water-monitoring data in a single composite value in a simple and understandable manner (Nong et al. 2020). WQIs can describe a comprehensive picture of water quality spatially and temporally and support comparative analyses among different rivers, regions or basins. They also facilitate the rapid transfer of information about water quality to water resource managers and the general public, guiding decision-makers' policies and actions.
The WQI was proposed in the 1970s and has been widely used to evaluate water quality around the world (Hou et al. 2016); examples include the Canadian Water Quality Index (CWQI), the British Columbia Water Quality Index (BCWQI), and the Florida Stream Water Quality Index (FWQI). The indices differ in how they statistically integrate their sub-indices and in their WQI calculation equations (Gao et al. 2020; Seifi et al. 2020; Ahsan et al. 2021). Many of these indices are subjective evaluation techniques that rely largely on expert experience and judgment when assigning parameter weights and setting gradation standards, which results in the loss of some significant information about water quality. In contrast, an objective weighting system can overcome this limitation and enhance the robustness of water quality evaluation, because it depends only on the monitored indicators and the local variations in a dataset, without any subjective judgment. The entropy weight-based water quality index (EWQI) is one such system: it uses information entropy, which quantifies the disorder or uncertainty in a random process, and can provide a more accurate and reliable water quality assessment (Feng et al. 2019; Singha et al. 2021).
WQI calculation is complex and requires considerable computation time, which affects the decision-making ability of water resources management. In this context, machine learning (ML) techniques have been increasingly used for water quality modeling and evaluation to obtain quick and accurate assessment results (Kadkhodazadeh and Farzin 2021). Successful computations of WQI have been achieved using artificial neural networks (ANNs) and support vector machines (SVMs). Sakizadeh (2016) optimized an ANN prediction model using Bayesian regularization, achieving an accuracy of 95%. Machiwal et al. (2018) used an ANN to assist groundwater quality evaluation and protection. Gupta et al. (2019) proposed a complex cascaded forward neural network to predict WQI. An ANN-based WQI can achieve high prediction accuracy, but it requires a complex network structure and has some weaknesses when datasets are small. SVM is a supervised learning method that is commonly used with a radial basis function kernel. Its structure is simple, and it is more suitable for environments with missing data. Haghiabi et al. (2018) used SVM to predict WQI, which improved prediction accuracy and markedly reduced the time required for model training and testing. To enhance modeling performance, Wang et al. (2017) used particle swarm optimization (PSO) to optimize an SVM and predict a WQI, which can prevent the SVM from falling into a local optimum. Both ANN and SVM can produce adequate WQI predictions, but comparative analyses of ANN and SVM in this field are still lacking, and the advantages and disadvantages of different optimization algorithms for SVM are not yet well understood.
In this study, EWQI was used to evaluate the water quality of the Pearl River Delta (PRD), which mitigates the loss of water quality information caused by empirical weights. Two typical types of ML methods (ANN and SVM) were used to predict the EWQI. By comparing and analyzing the advantages and disadvantages of these two ML models, the better ML method was identified. Then, four different intelligent algorithms were adopted to optimize this ML method to predict the EWQI and water quality grade with high accuracy. By comparing the advantages and disadvantages of the four hybrid prediction models, the most appropriate prediction model was selected. The results indicate that ML techniques are more reliable and accurate than traditional water quality evaluation methods.
The contributions of this study are as follows: 1. Entropy theory was used to evaluate water quality by balancing the relationship between environmental factors and the EWQI. 2. ML methods were adopted to predict the EWQI and the water quality classification, which overcomes the drawbacks of the traditional evaluation method regarding computation time and complexity. 3. Different optimization algorithms were used to optimize the ML model, and the most appropriate hybrid ML model was obtained.
The remainder of this paper is organized as follows: Section 2 describes the study area and data collection. Section 3 describes the methods. Section 4 compares different optimized machine learning methods. Section 5 concludes the paper.

Study area
The Pearl River system originates from the Yunnan-Guizhou Plateau, flows through the regions of Guangdong and Guangxi, and finally enters the South China Sea via the PRD estuary. It is the second largest river system in China: its annual runoff exceeds 349.2 billion m³, which is lower only than that of the Yangtze River. The Pearl River system has a total length of 2,320 km, and the entire drainage area is approximately 440,000 square kilometers.
The PRD is one of the most populated regions with the highest rate of developmental and infrastructural activities in China. The PRD has experienced several decades of extensive development since the reform and opening policy. Many factories have been built, and a large population influx occurred as people sought jobs and better lives. Thus, many nutrients and pollutants are discharged into rivers and the sea, leading to water quality deterioration. With the backflow of seawater and port transportation, water pollution is more serious in the PRD estuary. In the 21st century, because pollution control policies and measures have become increasingly strict, many sewage treatment plants have been built, and the environmental monitoring system has been constantly improved. Water environmental quality has thus markedly changed, and there is an urgent need for effective continuous monitoring of the water quality in this region, which is useful for its sustainable economic and social development.

Data collection
Considering the distribution of the monitoring stations and the completeness of data collection, five typical monitoring stations (Lixinsha, Tingjiao Bridge, Xiaohu, Nanheng, Jiaomen) are used as the data sources in this study. The location of the PRE and distribution of the monitoring stations are shown in Fig. 1. Nine years (2011-2019) of monthly water quality data were collected at the five monitoring stations. The water quality indicators include COD, DO, BOD5, NH3-N, TN, TP, petroleum, and FC. Details are shown in Table 1.

Entropy-based water quality index
Traditional WQI evaluation methods typically adjust the weights of the parameters according to empirical values, which can lead to information loss (Busico et al. 2020). To reduce the error caused by inaccurate weights of the input parameters, information entropy is used to weight each relevant parameter (Singha et al. 2021). Entropy measures the uncertainty of an information set: the larger the entropy, the higher the uncertainty of the event. Entropy theory has been applied to water quality evaluation and other fields. The main calculation steps are as follows.
Matrix X consists of m water samples (i = 1, 2, 3, …, m), and each sample includes n water quality parameters (j = 1, 2, 3, …, n); it can be written as Eq. (1). To eliminate the influence of the different units and magnitudes of the characteristic indices, the indices are normalized with Eq. (2).
The ratio of the value of index j in sample i can be calculated by Eq. (3).
The information entropy can be calculated with Eq. (4).
Then, the entropy weight can be obtained according to Eq. (5).
The next step in calculating the EWQI is to assign a quality rating scale Qj to each input parameter. Qj is calculated by Eq. (6),
where Cj is the concentration of each environmental factor and Sj is the permissible limit for surface waters in the Chinese standard GB3838-2002. Finally, the EWQI can be calculated according to Eq. (7).
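The steps above can be sketched in code. The version below is only a sketch under stated assumptions: it uses the form of entropy weighting commonly found in the EWQI literature (min-max normalization, Qj = Cj / Sj × 100), which may differ in detail from Eqs. (1)-(7) of this paper. The function name `ewqi`, the small offset `eps` that keeps the logarithm defined, and the sample values are all hypothetical.

```python
import math

def ewqi(samples, limits, eps=1e-4):
    """Entropy-weighted WQI for each sample (assumed common formulation).

    samples: list of m rows, each holding n concentrations C_j.
    limits:  list of n permissible limits S_j.
    eps:     small offset (an assumption) that keeps log() defined.
    """
    m = len(samples)
    cols = list(zip(*samples))

    # Min-max normalization of each parameter (assumed form of Eq. 2).
    norm = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0
        norm.append([(x - lo) / span + eps for x in col])

    # Share of each sample within a parameter (Eq. 3) and its entropy (Eq. 4).
    entropy = []
    for col in norm:
        total = sum(col)
        shares = [y / total for y in col]
        entropy.append(-sum(p * math.log(p) for p in shares) / math.log(m))

    # Entropy weights (Eq. 5): a parameter with low entropy gets a high weight.
    spread = [1.0 - e for e in entropy]
    weights = [s / sum(spread) for s in spread]

    # Quality rating q_j = C_j / S_j * 100 (Eq. 6) and weighted sum (Eq. 7).
    scores = [
        sum(w * (c / s) * 100.0 for w, c, s in zip(weights, row, limits))
        for row in samples
    ]
    return weights, scores

# Hypothetical data: two parameters, four samples.
weights, scores = ewqi(
    [[15.0, 0.5], [7.0, 0.2], [20.0, 1.0], [10.0, 0.3]],
    limits=[15.0, 0.5],
)
```

A sample whose concentrations all equal their permissible limits scores 100 in this formulation, whatever the weights are. The sketch needs at least two samples and at least one parameter that varies between samples.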

Artificial neural network model
The widely used ANN models are the back propagation (BP) network and the generalized regression neural network (GRNN). The BP network typically requires feedback information to modify the model, which leads to a complex structure. The GRNN can effectively deal with this problem (Specht 1991; Bodyanskiy et al. 2017). It is a radial basis function (RBF) neural network based on the nonparametric kernel regression statistical method and is typically used to solve function fitting and regression problems.
The structure of GRNN includes four layers: the input layer, pattern layer, summation layer and output layer. The number of pattern layer neurons is equal to the number of training set samples. The summation layer uses two types of neurons to sum the outputs of all pattern layer neurons. The first type is unweighted: the connection weight between each pattern layer neuron and this neuron is set to 1. The second type weights all the neurons in the pattern layer, using the outputs of the training set as the connection weights. Each neuron of the output layer divides the weighted sum by the unweighted sum of the summation layer, which gives the prediction result.
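The four layers can be sketched as a kernel-weighted average. This is a minimal sketch for a single output, assuming Gaussian pattern neurons; the function name `grnn_predict` and the smoothing factor `sigma` are illustrative, and the value of `sigma` is an arbitrary assumption rather than a setting from this study.

```python
import math

def grnn_predict(train_x, train_y, x, sigma=0.5):
    """Predict one output with a GRNN (kernel-weighted average of targets)."""
    # Pattern layer: one Gaussian neuron per training sample.
    activations = []
    for xi in train_x:
        dist2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        activations.append(math.exp(-dist2 / (2.0 * sigma ** 2)))

    # Summation layer: an unweighted sum (all weights equal to 1) and a sum
    # weighted by the training targets.
    unweighted = sum(activations)
    weighted = sum(a * y for a, y in zip(activations, train_y))

    # Output layer: ratio of the two sums.
    return weighted / unweighted

# Hypothetical one-dimensional data: the midpoint gets the average target.
midpoint = grnn_predict([[0.0], [1.0]], [0.0, 2.0], [0.5])
```

No iterative training is involved, which is why the pattern layer needs one neuron per training sample. If a query point is so far from every training sample that all activations underflow to zero, the division fails; a practical implementation would guard against that.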

Support vector machine method and optimization algorithm
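The SVM method is based on the VC dimension of statistical learning theory and the principle of structural risk minimization (Christopher and Burges, 1998; Cristianini, 2000). In structure, SVM is similar to a multi-layer perceptron and a radial basis function (RBF) network. The input vector is mapped into a high-dimensional feature space with a nonlinear mapping that has been selected in advance, and the optimal classification hyperplane is constructed in this space. The primary regression function is shown in Eq. (8).
$$f(x) = w^{T}\phi(x) + b \qquad (8)$$
where w is the weight vector, b is the bias, and ϕ(x) is a nonlinear mapping function. The optimization process is as follows:
$$\min_{w,\,b,\,\xi,\,\xi^{*}} \ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right)$$
$$\text{s.t.}\quad y_i - w^{T}\phi(x_i) - b \le \varepsilon + \xi_i,\qquad w^{T}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^{*},\qquad \xi_i,\ \xi_i^{*} \ge 0$$
where ξ_i and ξ_i* are the slack variables; C is the penalty parameter; ε is the loss function parameter; and (x_i, y_i), i = 1, …, n, are the training samples. The form of SVM for regression or classification models can be described as follows:
$$f(x) = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^{*}\right)K(x_i, x) + b$$
where α_i and α_i* are the Lagrange multipliers and K(x_i, x) is the kernel function. The RBF is usually used as the kernel function, which can be represented as:
$$K(x_i, x) = \exp\left(-\gamma\lVert x_i - x\rVert^{2}\right)$$
where γ is the key parameter of the RBF function.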
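The kernel form of the regression function can be illustrated in a few lines. The sketch below only evaluates an already trained model, using the standard ε-insensitive support vector regression formulation; it does not solve the optimization problem. The function names and the numeric values are hypothetical.

```python
import math

def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))

def svr_predict(support_vectors, dual_coefs, bias, x, gamma):
    """f(x) = sum_i coef_i * K(x_i, x) + b, with coef_i = alpha_i - alpha_i*."""
    return bias + sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, dual_coefs)
    )

def eps_insensitive_loss(y_true, y_pred, eps):
    """Errors smaller than eps cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)

# Hypothetical model with a single support vector at the origin.
value = svr_predict([[0.0]], [2.0], bias=1.0, x=[0.0], gamma=0.5)
```

The three quantities a user has to choose are the penalty parameter C, the tube width ε and the kernel parameter γ. A larger γ makes each support vector influence a smaller neighborhood, which is why the fit is sensitive to it.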
A. Particle swarm optimization (PSO)
The PSO algorithm was first proposed by Kennedy and Eberhart (1995) and was derived from the foraging behavior of bird flocks. While foraging, the location of a food source can be shared via information exchange. After determining this location, the flock moves toward it, spreads out around it to look for other food sources, and thus forages simply and efficiently. The PSO algorithm randomly initializes a population of particles in the feasible solution space. Each particle has a fitness value, which measures the quality of the solution it represents, and is characterized by a position and a velocity. Each particle tracks the best position it has found itself and the best position found by the particles in its neighborhood, and uses both to adjust its own velocity vector. In each iteration, the best position found so far is replaced whenever a particle finds a better one. The fitness values are thus adjusted dynamically until the optimal solution is obtained.
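A compact version of this update rule is sketched below under several assumptions: it is the global-best variant, where the neighborhood is the whole swarm; the inertia weight `w` and acceleration coefficients `c1` and `c2` are common textbook values, not settings from this study; and the objective is a toy function. In the setting of this paper the objective would presumably be the SVM prediction error as a function of its hyperparameters (for example C and γ), but the tuned parameters are not stated in this text.

```python
import random

def pso(objective, bounds, n_particles=20, n_iter=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` inside box `bounds` with global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to own best + pull to swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp the particle to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Toy objective (an assumption): the sphere function, minimum 0 at the origin.
sphere = lambda p: sum(v * v for v in p)
best, best_val = pso(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
```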

B. Artificial bee colony (ABC)
The artificial bee colony algorithm was first proposed by Karaboga (2005). It is inspired by the behavior of bee colonies when collecting honey. The algorithm divides the bee colony into three types: collecting (employed) bees, observing (onlooker) bees and investigating (scout) bees. The goal of the algorithm is to find the optimal nectar source, which is the optimal solution of an optimization problem. Each collecting bee corresponds to a certain nectar source (i.e., a solution vector) and searches its neighborhood for better nectar sources. According to the abundance of nectar sources (i.e., the fitness value), observing bees are recruited at random, with richer sources more likely to be chosen, to collect nectar (i.e., search for new nectar sources). If a nectar source is not improved after multiple updates, it is abandoned, and its collecting bee turns into an investigating bee that randomly searches for a new nectar source.
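The three roles can be sketched as three phases of one loop. The code below is a simplified illustration, not the configuration used in this study: the population size, the abandonment threshold `limit`, the fitness transform 1 / (1 + value), which assumes a non-negative objective, and the toy objective are all assumptions.

```python
import random

def abc_minimize(objective, bounds, n_sources=10, n_iter=200, limit=20, seed=0):
    """Minimize a non-negative `objective` with a basic artificial bee colony."""
    rng = random.Random(seed)
    dim = len(bounds)

    def random_source():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def try_improve(i):
        # Perturb one coordinate of source i relative to another random source.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        cand = sources[i][:]
        cand[d] += rng.uniform(-1.0, 1.0) * (sources[i][d] - sources[k][d])
        lo, hi = bounds[d]
        cand[d] = min(max(cand[d], lo), hi)
        val = objective(cand)
        if val < values[i]:          # greedy selection
            sources[i], values[i], trials[i] = cand, val, 0
        else:
            trials[i] += 1

    sources = [random_source() for _ in range(n_sources)]
    values = [objective(s) for s in sources]
    trials = [0] * n_sources
    best_val = min(values)
    best = sources[values.index(best_val)][:]

    for _ in range(n_iter):
        # Collecting (employed) bees: each refines its own nectar source.
        for i in range(n_sources):
            try_improve(i)
        # Observing (onlooker) bees: richer sources are chosen more often.
        fitness = [1.0 / (1.0 + v) for v in values]
        for _ in range(n_sources):
            try_improve(rng.choices(range(n_sources), weights=fitness)[0])
        # Investigating (scout) bees: replace sources that stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = random_source()
                values[i] = objective(sources[i])
                trials[i] = 0
        # Remember the best source seen so far.
        i = min(range(n_sources), key=lambda j: values[j])
        if values[i] < best_val:
            best, best_val = sources[i][:], values[i]
    return best, best_val

# Toy objective (an assumption): the sphere function, minimum 0 at the origin.
best, best_val = abc_minimize(lambda p: sum(v * v for v in p),
                              [(-5.0, 5.0), (-5.0, 5.0)])
```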

C. Gray wolf optimization (GWO)
The gray wolf optimizer (GWO) is a meta-heuristic algorithm proposed by Mirjalili et al. (2014). It simulates the social hierarchy and group hunting behavior of a gray wolf population and performs optimization via tracking, encircling, hunting, and attacking. The internal hierarchy of the gray wolf population can be divided into four levels: α, β, δ and ω. This is a strict pyramid hierarchy, where α is the best solution, followed by β and δ, and the remaining solutions belong to ω. The three best wolves, which are closest to the prey, are α, β, and δ; they guide ω to search for prey in more promising search areas. While hunting, each wolf updates its position around α, β and δ, which determines the direction of optimization.
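The position update around α, β and δ can be sketched as follows. This follows the commonly published form of the GWO update, in which a control parameter `a` decreases linearly from 2 to 0 so that the pack moves from exploration to exploitation. The pack size, iteration count and toy objective are assumptions.

```python
import random

def gwo(objective, bounds, n_wolves=20, n_iter=100, seed=0):
    """Minimize `objective` inside box `bounds` with a basic gray wolf optimizer."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]

    for t in range(n_iter):
        # Rank the pack: the three best wolves act as alpha, beta and delta.
        wolves.sort(key=objective)
        leaders = [w[:] for w in wolves[:3]]
        a = 2.0 * (1.0 - t / n_iter)  # decreases linearly from 2 toward 0

        for i in range(n_wolves):
            new_pos = []
            for d, (lo, hi) in enumerate(bounds):
                guided = 0.0
                for leader in leaders:
                    big_a = 2.0 * a * rng.random() - a   # step size and sign
                    big_c = 2.0 * rng.random()           # random leader emphasis
                    dist = abs(big_c * leader[d] - wolves[i][d])
                    guided += leader[d] - big_a * dist
                # New coordinate: average of the three leader-guided moves.
                new_pos.append(min(max(guided / 3.0, lo), hi))
            wolves[i] = new_pos

    best = min(wolves, key=objective)
    return best, objective(best)

# Toy objective (an assumption): the sphere function, minimum 0 at the origin.
best, best_val = gwo(lambda p: sum(v * v for v in p), [(-5.0, 5.0), (-5.0, 5.0)])
```

Because every wolf is pulled toward the same three leaders, the pack can collapse onto one region early, which is the local-optimum risk that a diversity-preserving step is meant to reduce.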

D. Differential evolution and gray wolf optimization (DE-GWO)
To maintain the diversity of the population, a differential evolution (DE) algorithm based on global optimization is adopted. This method generates a new population via mutation, crossover, and selection, and determines the optimal solution. The GWO algorithm achieves good performance on many problems, but easily falls into a local optimum when handling some complex problems. Using the DE algorithm to update the optimal position of the wolf pack can prevent the GWO algorithm from falling into a local optimum.
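The mutation, crossover and selection steps can be sketched as one DE generation. How DE and GWO are interleaved is not detailed in this text, so the sketch shows only the DE part, in the widely used DE/rand/1/bin form. One plausible arrangement, stated here purely as an assumption, is to apply such a generation to the wolf pack before each GWO position update. The scale factor `f` and crossover rate `cr` are typical textbook values, not settings from this study.

```python
import random

def de_generation(population, objective, bounds, f=0.5, cr=0.9, rng=None):
    """One DE/rand/1/bin generation: mutation, crossover, greedy selection.

    Requires at least four individuals in `population`.
    """
    rng = rng or random.Random(0)
    dim = len(bounds)
    next_pop = []
    for i, target in enumerate(population):
        # Mutation: combine three distinct individuals other than the target.
        others = [p for j, p in enumerate(population) if j != i]
        r1, r2, r3 = rng.sample(others, 3)
        mutant = [r1[d] + f * (r2[d] - r3[d]) for d in range(dim)]
        # Crossover: mix mutant and target; one gene always comes from the mutant.
        forced = rng.randrange(dim)
        trial = [
            mutant[d] if (rng.random() < cr or d == forced) else target[d]
            for d in range(dim)
        ]
        trial = [min(max(v, lo), hi) for v, (lo, hi) in zip(trial, bounds)]
        # Selection: keep whichever of trial and target is better.
        next_pop.append(trial if objective(trial) <= objective(target) else target)
    return next_pop

# Hypothetical pack of six candidate solutions on a toy objective.
sphere = lambda p: sum(v * v for v in p)
box = [(-5.0, 5.0), (-5.0, 5.0)]
init_rng = random.Random(1)
pack = [[init_rng.uniform(lo, hi) for lo, hi in box] for _ in range(6)]
new_pack = de_generation(pack, sphere, box)
```

Because selection is greedy, no individual gets worse in a generation, while the difference vector r2 - r3 keeps injecting variation for as long as the population itself is spread out.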

Model evaluation
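To evaluate the performance of the prediction model, the root mean squared error (RMSE), mean absolute percentage error (MAPE), coefficient of correlation (R), and Nash-Sutcliffe efficiency (NSE) are considered in this study. The relevant mathematical formulae are as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{EWQI}_i^{o} - \mathrm{EWQI}_i^{p}\right)^{2}}$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\mathrm{EWQI}_i^{o} - \mathrm{EWQI}_i^{p}}{\mathrm{EWQI}_i^{o}}\right|$$
$$R = \frac{\sum_{i=1}^{n}\left(\mathrm{EWQI}_i^{o} - \overline{\mathrm{EWQI}^{o}}\right)\left(\mathrm{EWQI}_i^{p} - \overline{\mathrm{EWQI}^{p}}\right)}{\sqrt{\sum_{i=1}^{n}\left(\mathrm{EWQI}_i^{o} - \overline{\mathrm{EWQI}^{o}}\right)^{2}\sum_{i=1}^{n}\left(\mathrm{EWQI}_i^{p} - \overline{\mathrm{EWQI}^{p}}\right)^{2}}}$$
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(\mathrm{EWQI}_i^{o} - \mathrm{EWQI}_i^{p}\right)^{2}}{\sum_{i=1}^{n}\left(\mathrm{EWQI}_i^{o} - \overline{\mathrm{EWQI}^{o}}\right)^{2}}$$
where $\mathrm{EWQI}_i^{o}$ and $\mathrm{EWQI}_i^{p}$ are the real and predicted values of EWQI in month i, respectively; $\overline{\mathrm{EWQI}^{o}}$ and $\overline{\mathrm{EWQI}^{p}}$ are the means of the real and predicted values of EWQI, respectively; and n is the number of samples.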
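The four indicators can be computed directly from paired real and predicted values. The sketch below uses their standard definitions; the function names and the example numbers are illustrative only.

```python
import math

def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mape(obs, pred):
    """Mean absolute percentage error, in percent (obs must be non-zero)."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)

def pearson_r(obs, pred):
    """Coefficient of correlation between real and predicted values."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the mean of obs."""
    mo = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mo) ** 2 for o in obs)
    return 1.0 - sse / sst

# Hypothetical values: every prediction is too high by exactly 1.
obs, pred = [1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]
print(rmse(obs, pred), pearson_r(obs, pred), nse(obs, pred))
```

In this example R is 1 because the predictions follow the real values perfectly up to a constant offset, while RMSE (1.0) and NSE (0.2) expose the bias. This is one reason R and NSE are read together with the error measures.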

Analysis of relative environmental factors
The time series of water quality parameters are shown as mean concentrations from all the sampling sites in Fig. 2; the red lines indicate the level of the GB3838-2002 Class III standards. The variation of each parameter shows different characteristics without obvious regularity. COD concentrations varied from 4.2 to 17.6 mg/L, with an average of 7.82 mg/L; DO concentrations varied from 5.17 to 7.92 mg/L, with an average of 6.37 mg/L; BOD5 concentrations varied from 0.88 to 2.68 mg/L, with an average of 1.49 mg/L; NH3-N concentrations varied from 0.05 to 0.72 mg/L, with an average of 0.20 mg/L; and TP concentrations varied from 0.05 to 0.19 mg/L, with an average of 0.09 mg/L. According to China's environmental quality standards for surface waters (GB3838-2002), COD and BOD5 meet Class I standards with concentration limits of 15.0 and 3.0 mg/L, respectively; DO and NH3-N meet Class II standards with concentration limits of 6.0 mg/L and 0.5 mg/L, respectively; TP meets Class III standards with a concentration limit of 0.2 mg/L; and the majority of TP values meet Class II standards with a concentration limit of 0.1 mg/L. Considering the first five parameters, little biochemical pollution and organic pollution were found in the Nansha District waterways due to increasingly strict sewage discharge control policies. TN concentrations varied from 0.65 to 3.56 mg/L, with an average of 2.25 mg/L. TN often exceeds the Class III limit of 1.0 mg/L and has been identified as a major pollution index that is mainly caused by fisheries and domestic pollution. Petroleum concentrations exhibited a marked falling trend and have met Class III standards with a concentration limit of 0.05 mg/L since July 2012. Petroleum was higher in the early stage and then decreased continuously. Because petroleum is directly related to industry, this change is attributed to the treatment of industrial sewage. Except for the high sewage discharge shown early on, overall sewage discharge changed markedly over time, which also implies that industrial sewage has been effectively treated and has reached local discharge standards. FC concentrations ranged from 283.4 to 14754.2 pcs/L and increased marginally. Typically, FC meets Class III standards with a concentration limit of 10,000 pcs/L.

Entropy weight and different combinations of input environmental factors
Traditional water quality evaluation methods do not consider the relationship between the physical and chemical properties of different environmental factors and water quality change. Entropy theory is used to weight the different parameters, and the entropy weights of the environmental factors are shown in Table 2. Different environmental factors have different weight coefficients: the weight of petroleum is as high as 0.2865, whereas that of DO is only 0.0041, far lower than those of the other environmental factors. Thus, different environmental factors likely play different roles in water quality evaluation. The performance of the prediction model is related to the structure of the model and to the combination of the input environmental factors. In this study, the input environmental factors of the model are classified and combined using the Pearson correlation method. According to Table 2, petroleum is most strongly related to EWQI, and DO shows a negative correlation with EWQI. By convention, the Pearson correlation coefficient is classified into four grades (high, medium, low, and no correlation), with corresponding value ranges of |r| ≥ 0.8, 0.8 > |r| ≥ 0.5, 0.5 > |r| ≥ 0.3, and 0.3 > |r|, respectively. Referring to this standard, the input environmental factors are divided into three categories, and three different combinations of the input environmental factors act as the input vectors for the optimization model. To select the best combination of environmental factors to achieve high accuracy prediction, RMSE is used to evaluate the performance of GRNN and SVM. The results are shown in Table 3.
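The grading rule for the Pearson coefficient translates directly into code. The sketch below groups factors by grade using the thresholds quoted above; the function names are illustrative, and the coefficients in the example are hypothetical placeholders, not values from Table 2.

```python
def correlation_grade(r):
    """Grade a Pearson coefficient by its absolute value."""
    magnitude = abs(r)
    if magnitude >= 0.8:
        return "high"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.3:
        return "low"
    return "no"

def group_by_grade(correlations):
    """Map each grade to the factors whose correlation falls in it."""
    groups = {}
    for name, r in correlations.items():
        groups.setdefault(correlation_grade(r), []).append(name)
    return groups

# Hypothetical coefficients, for illustration only.
example = {"factor_a": 0.85, "factor_b": -0.62, "factor_c": 0.35, "factor_d": 0.10}
print(group_by_grade(example))
```

The sign is ignored when grading, so a strongly negative correlation counts as strongly related; what matters for input selection is how much information a factor carries about the EWQI.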

According to Table 3, both the GRNN and SVM can achieve high prediction accuracy in the training phase, and the accuracy of the SVM is better than that of GRNN in the testing phase. The RMSEs differ between combinations of environmental factors, indicating that different environmental factors may have different effects on the prediction model. Finally, the second combination is selected as the input vector for the optimization model.

Comparative analysis of the application of different optimization SVM models in nonlinear regression prediction
To predict the EWQI with high accuracy, the SVM is improved using different optimization algorithms. In this study, four typical optimization algorithms (PSO, ABC, GWO, DE-GWO) are used to optimize the prediction model. The input dataset is divided into training and testing parts at a ratio of 7:3, and the prediction results are shown in Fig. 3. According to the evaluation indicators shown in Table 4, the accuracy of the training phase is better than that of the testing phase. In the training phase, the model can learn all the characteristics of the training data. In the testing phase, however, the results are derived from the established model. This model captures the characteristics of the training data as far as possible, but some characteristics of the testing dataset, such as those of NH3-N and petroleum, may not be represented in it. The abnormal points in these environmental factors would cause model over-fitting and lead to a testing accuracy lower than that of the training phase. From the perspective of R and NSE, the ABC-SVM is marginally lower than the other three algorithms. However, the four hybrid algorithms all achieve high prediction accuracy in both the training and testing phases. The R and NSE of the DE-GWO-SVM algorithm are higher than those of GWO-SVM in the training phase, but show no marked advantages over GWO-SVM in the testing phase. According to RMSE and MAPE, DE-GWO-SVM shows obvious advantages over the other three optimization algorithms in both the training phase and the testing phase.
Fig. 3 The prediction results of four optimization algorithms in the testing phase
To directly characterize the deviation between predicted and real values, a detailed comparative analysis is performed, as shown in Fig. 4. The predicted and real values tend to fluctuate along the same line. However, when the EWQI is large, the accuracies of the four algorithms are not consistent: when the EWQI exceeds 100, the predictions of ABC-SVM and PSO-SVM deviate from the real values, and only the results of GWO-SVM and DE-GWO-SVM remain consistent. The deviation at some important points indicates that the optimization algorithm may fall into local optima in some cases and cannot find the global optimum. These results also indicate that the GWO or DE-GWO optimization algorithm achieves better performance in some complex situations. The DE algorithm can be used to improve the performance of the prediction model and avoid falling into local optima.
Table 4 The results of three different evaluation indicators using different optimization algorithms
Fig. 4 The predicted value vs. actual value using four optimization algorithms

Comparative analysis of the application of different optimization SVM models in nonlinear regression classification
As a supervised learning model, SVM also performs well when solving nonlinear classification problems. Different optimization algorithms were used for nonlinear regression prediction in Sect. 4.3; here, we focus on the evaluation of water quality grade using ML models. Using the best combination of environmental factors selected in Sect. 4.2, the EWQI values are assigned to different grades that represent different water quality levels. The evaluation indicators mentioned in Sect. 3.3 are largely not applicable to the evaluation of water quality level, so the accuracy rate is used to evaluate the different hybrid optimization algorithms. The accuracy rates of PSO-SVM, ABC-SVM, GWO-SVM and DE-GWO-SVM in the training phase are 94.71%, 94.97%, 97.09% and 97.88%, respectively; and those in the testing phase are 88.96%, 87.73%, 92.02%, and 95.09%, respectively.
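Grading and the accuracy rate can be sketched as follows. The grade boundaries used in this study are not given in this text, so the cut-offs below (25, 50, 100, 150) are an assumption borrowed from a classification often seen in the EWQI literature; the function names and example grades are hypothetical as well.

```python
import bisect

# Assumed upper bounds of grades 1 to 4; anything above the last is grade 5.
GRADE_UPPER_BOUNDS = [25.0, 50.0, 100.0, 150.0]

def ewqi_grade(value, bounds=GRADE_UPPER_BOUNDS):
    """Return grade 1 (best) to len(bounds) + 1 (worst) for an EWQI value."""
    return bisect.bisect_right(bounds, value) + 1

def accuracy_rate(true_grades, predicted_grades):
    """Percentage of samples whose predicted grade equals the true grade."""
    hits = sum(t == p for t, p in zip(true_grades, predicted_grades))
    return 100.0 * hits / len(true_grades)

# Hypothetical grades: three of four predictions are correct.
print(accuracy_rate([1, 2, 3, 4], [1, 2, 3, 5]))  # 75.0
```

Unlike RMSE or MAPE, the accuracy rate treats every misclassification equally, so a prediction that is off by one grade counts the same as one that is off by three.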
According to the results, all four optimization algorithms achieve high accuracy in the training phase, and the prediction accuracies of GWO-SVM and DE-GWO-SVM exceed 97%. In the testing phase, the prediction accuracy of the optimized models tends to decrease. The prediction accuracy of ABC-SVM is the lowest (only 87.73%), which makes it unsuitable for practical applications. However, the accuracy rate of DE-GWO-SVM reaches 95.09% in the testing phase, which can meet real application requirements.

Conclusions
In this paper, we apply entropy theory to water quality assessment. This theory comprehensively considers the impact of different environmental factors on water quality. The Pearson method is then used to divide environmental factors, and the performances of the two typical algorithms are compared and analyzed. Because of its simple structure and its ability to handle nonlinear problems, SVM is more suitable for this study. To improve the prediction accuracy, four artificial intelligence algorithms are used to optimize the prediction model. When the results of the evaluation indicators (RMSE, MAPE, R, NSE, accuracy rate) of the four hybrid models are compared, the DE-GWO-SVM model shows marked advantages over the other three models. The optimized ML methods are used to evaluate water quality, which overcomes the drawbacks of the traditional evaluation method regarding computation time and complexity. The method proposed in this paper achieves high prediction accuracy for the research topic, but its applicability in other fields still requires more research. Therefore, a more efficient, faster and more accurate machine learning model should still be developed in future work.