A Novel LSSVM Model Integrated with GBO Algorithm to Assessment of Water Quality Parameters

In this study, a novel least square support vector machine (LSSVM) model integrated with the gradient-based optimizer (GBO) algorithm is introduced for the assessment of water quality (WQ) parameters. For this purpose, three stations in the Karun river basin (Ahvaz, Armand, and Gotvand) were selected to model electrical conductivity (EC) and total dissolved solids (TDS). First, to demonstrate the superiority of the LSSVM-GBO algorithm, its performance was evaluated on three benchmark datasets (Housing, LVST, Servo). Then, the results of the new hybrid algorithm were compared with those of the artificial neural network (ANN), adaptive neuro-fuzzy inference system (ANFIS), and LSSVM algorithms. The inputs for the assessment of the WQ parameters EC and TDS consist of Ca+2, Cl−1, Mg+2, Na+1, SO4, HCO3, sodium adsorption ratio (SAR), sum of cations (Sum.C), sum of anions (Sum.A), pH, and Q. Based on the evaluation criteria, LSSVM-GBO performed best across all benchmark datasets and algorithms. The results also showed that Sum.C, Sum.A, and Na+1 at the Ahvaz station, and Sum.C, Sum.A, and Cl−1 at the Armand and Gotvand stations, have the greatest impact on modeling EC and TDS. EC and TDS were then modeled with the best input combination and the best algorithm at different time delays. The highest accuracy in modeling EC and TDS was obtained at the Gotvand station with the C1 time delay.


Introduction
River pollution and river WQ are important issues that strongly affect human health. Rivers carry water and nutrients and provide important resources for drinking, industrial, aquatic, recreational, and agricultural use (Ho et al. 2019). Therefore, they require at least an acceptable level of WQ. In recent years, rapid population growth, urban expansion, agriculture, economic development, and increasing industrial production have increased river pollution (Busico et al. 2020); therefore, the qualitative study of water resources is one of the most critical subjects in most regions.
One of the best ways to study water pollution problems is to model and analyze WQ using modern methods such as artificial intelligence (AI). In recent years, many studies have modeled EC and TDS in different regions using data mining methods, because these methods are accurate and, unlike physical and mathematical models, do not require many parameters to be specified, which reduces the cost of research work. Modeling and predicting TDS and EC concentrations is essential for pollution control and water resource management.
In recent years, AI models have been widely employed for WQ variables such as TDS, water quality index (WQI), dissolved oxygen (DO), chemical oxygen demand (COD), biochemical oxygen demand (BOD), EC, SAR, total hardness (TH), ammoniacal nitrogen (AN), suspended solids (SS), and pH (Emamgholizadeh et al. 2014; Tiyasha et al. 2020; Baghapour et al. 2020; Vijay and Kamaraj 2021). Common examples of these AI models include ANN, ANFIS, support vector machine (SVM), group method of data handling (GMDH), and genetic programming (GP) (Barzegar et al. 2016; Salami et al. 2016; Haghiabi et al. 2018; Aryafar et al. 2019). Hybrid models have higher accuracy than individual models and have recently been widely used in the assessment of WQ parameters (Khosravi et al. 2018). Kisi et al. (2019) reported that hybrid models combining ANFIS with differential evolution (DE), genetic algorithm (GA), ant colony optimization for continuous domains (ACOR), compact genetic algorithm (CGA), and particle swarm optimization (PSO) performed well in modeling EC, SAR, and TH.
In another study, Najafzadeh et al. (2019) used gene expression programming (GEP), model tree (MT), and evolutionary polynomial regression (EPR) to estimate BOD, DO, and COD. The results showed the relative superiority of EPR. Deterministic and numerical models have been applied extensively to model WQ; the counting convolutional neural network (CCNN) model has high accuracy for estimating the DO concentration (Zounemat-Kermani et al. 2019). Li et al. (2019) proposed a new model combining a recurrent neural network (RNN) with improved Dempster/Shafer (D-S) evidence theory (RNNs-DS) for the prediction of WQ. The results indicated that the new model performs better. Lu and Ma (2020) offered two new hybrid models that combine complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) with extreme gradient boosting (XGBoost) and random forest (RF) for the prediction of WQ parameters. The results showed the superiority of CEEMDAN-XGBoost. Banadkooki et al. (2020) used ANN, ANFIS, and SVM to predict TDS, with optimization algorithms used to train these models. In that study, the ANFIS-moth flame optimization (MFO) and ANFIS-cat swarm optimization (CSO) models showed good performance. Melesse et al. (2020) applied two individual algorithms, M5 prime (M5P) and RF, and eight novel hybrid algorithms to predict EC. The results showed that the hybrid algorithms have small errors.
This study proposes, for the first time, a novel hybrid model that combines the LSSVM model and the GBO algorithm to estimate EC and TDS accurately at three hydrometric stations with different climates and WQ in the Karun river area (as a case study). To demonstrate the superiority of the LSSVM-GBO algorithm, its performance is evaluated on benchmark datasets. Then, the results of LSSVM-GBO are compared with those of the ANN, ANFIS, and LSSVM algorithms to demonstrate the ability and accuracy of the proposed algorithm. After EC and TDS are modeled with different inputs at the three stations, the best input combination and the best algorithm are selected. Finally, EC and TDS are modeled with the best input combination and the best algorithm at different time delays. The study concludes that the LSSVM-GBO algorithm, which integrates a novel meta-heuristic optimization algorithm with machine learning, can raise the accuracy of the assessment of WQ parameters, and that the proposed model has the potential to analyze other engineering problems. The remaining sections of the paper are organized as follows. Sect. 2 first introduces the study area, benchmark datasets, ANN model, ANFIS model, LSSVM model, optimization algorithm, and the hybrid of LSSVM and GBO. Afterward, the data collected, the evaluation criteria, and finally the steps of modeling WQ parameters are presented. In Sect. 3, the correlation coefficients between the WQ parameters are calculated first, and then the results on the benchmark datasets are presented. After the best algorithm and the best input combination are selected based on modeling WQ parameters, EC and TDS are modeled at different time delays. Finally, the main points of the present study are summed up in Sect. 4.

Study Area
The Karun river is situated in the southwest of Iran. With a basin area of 67,257 km² and a length of about 950 km, it is the most important river of Iran and flows to the Persian Gulf. The Karun river originates in the Zagros mountain ranges, which stretch from northwest to southeast, and it is the only navigable river in Iran. The average annual precipitation in the Karun basin is 620 mm. In the present study, three stations on the Karun river (Ahvaz, Armand, and Gotvand) were selected to model WQ. The Ahvaz and Gotvand stations are located in Khuzestan province, and the Armand station is located in Chaharmahal Bakhtiari province. The stations were chosen so that all climates are examined: Gotvand has an arid and semi-arid climate, Ahvaz has a dry and extremely dry climate, and Armand has a humid and Mediterranean climate. Fig. 1 shows the location of the study area.

Benchmark Datasets
In this article, the performance of the proposed LSSVM-GBO algorithm is compared with that of other algorithms on three real-world regression problems. Benchmark datasets drawn from real-world problems are a good criterion for judging how well algorithms work. In recent years, many studies have compared algorithms on real-world regression problems (Breiman 2001; Zhang and Yang 2015; Henríquez and Ruz 2017). The specifications of the benchmark datasets are shown in Table S1 in the Supplementary Information. Fig. S1 (Supplementary Information) shows how the target data vary in the benchmark datasets.

Artificial Neural Network (ANN)
ANN is a data-processing method with specific performance characteristics resembling those of the human brain (Lee et al. 2008). ANN has three layers: input layer, hidden layer, and output layer. ANN consists of a set of artificial neurons. Neurons and the links between them have weights. Input data are processed in the hidden layers to create an output (Acharya et al. 2019). In this study, the multi-layer perceptron (MLP) is used as the structure of the ANN. Various methods have been proposed for training the network, the best known of which is the Levenberg-Marquardt method (Hadi and Tombul 2018). The relation between the inputs (x) and the output (Z) in the ANN is as follows:

$$Z = f\left(\sum_{i} w_i x_i + b\right) \tag{1}$$

where $f$ is the activation function, $b$ is the bias, $w_i$ is the weight of the connection, $Z$ is the output, and $x_i$ is the input. Figure S2 shows the structure of the ANN model.
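The neuron relation described above can be illustrated with a minimal NumPy sketch of a forward pass through an MLP with one hidden layer. The layer sizes, the tanh hidden activation, the linear output neuron, and the names `neuron` and `mlp_forward` are illustrative assumptions, not the configuration used in the study; training (for example with Levenberg-Marquardt) is omitted.

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Single artificial neuron: Z = f(sum_i w_i * x_i + b)."""
    return f(np.dot(w, x) + b)

def mlp_forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer MLP with a linear output neuron."""
    # Each row of W_hidden holds the weights of one hidden neuron.
    hidden = np.tanh(W_hidden @ x + b_hidden)
    # Linear output neuron, a common choice for regression (assumption).
    return float(np.dot(w_out, hidden) + b_out)

# Hypothetical sizes: 4 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W_hidden = rng.normal(size=(3, 4))
b_hidden = rng.normal(size=3)
w_out = rng.normal(size=3)
b_out = 0.1
z = mlp_forward(x, W_hidden, b_hidden, w_out, b_out)
```

Each hidden unit in `mlp_forward` is one instance of `neuron`, so the network output is a weighted sum of neuron outputs plus a bias.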

Adaptive Neuro-Fuzzy Inference System (ANFIS)
The ANFIS method is a well-known AI method that is currently used for WQ parameters and for predicting rainfall and hydrological variables (Alizamir et al. 2020). ANFIS modeling is an approach in which neural networks and fuzzy reasoning are combined to exploit the strengths of both (Bisht and Jangid 2011).

Fig. 1 Study area and hydrometric station
This model combines the advantages of both neural networks and fuzzy logic and can benefit from both at the same time (Kumar et al. 2019). ANFIS techniques can learn a system's behavior from sufficiently large data sets and automatically generate fuzzy sets to a pre-specified accuracy level. The ANFIS model consists of five layers: input layer, rule layer, average layer, consequent layer, and total output layer. The main task of ANFIS is to optimize the parameter values of the equivalent fuzzy inference system such that the error between the target and the actual output is minimized. Two fuzzy "if-then" rules are used as follows:

$$\text{IF } x \text{ is } A_1 \text{ and } y \text{ is } B_1 \text{ THEN } f_1 = p_1 x + q_1 y + r_1 \tag{2}$$

$$\text{IF } x \text{ is } A_2 \text{ and } y \text{ is } B_2 \text{ THEN } f_2 = p_2 x + q_2 y + r_2 \tag{3}$$

where $x$ and $y$ are the input variables, $A_i$ and $B_i$ are the linguistic labels characterized by convenient membership functions ($i$ = 1 or 2), and $p_i$, $q_i$, $r_i$ are the output function parameters ($i$ = 1 or 2).
Refer to other research for more information on the ANFIS algorithm (Jang 1993; Milan et al. 2021). Figure S3 shows the structure of the ANFIS model.

Least Square Support Vector Machine (LSSVM)
LSSVM is an implementation of the support vector machine for classification and pattern recognition, regression analysis, and learning a ranking function. The advantages of LSSVM include high precision, mathematical tractability, and direct geometric interpretation. The algorithm converts the nonlinear relationship between inputs and outputs into a linear relationship (Keshtegar et al. 2019). Equation (4) shows the relationship between input and output in LSSVM:

$$M = \sum_{i} \alpha_i \, k(x, x_i) + b \tag{4}$$

where $M$ is the output, $\alpha_i$ is the weighting coefficient of the input data, $b$ is the bias, and $k(x, x_i)$ is the value of the kernel function for different inputs. The LSSVM tries to minimize the difference between measured and estimated data. The parameters $\alpha_i$ and $b$ are computed from a linear system (Farzin and Valikhan Anaraki 2021) in which $C$ is the regularization parameter and $\alpha$, $M$, $I$, and $1$ denote the corresponding vectors and matrices. The radial basis function (RBF), with width parameter $\sigma$, is used as the kernel function (Farzin et al. 2020). Figure 2 shows the structure of the LSSVM model.
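As a rough illustration of the LSSVM regression described in this section, the sketch below fits the weighting coefficients and the bias by solving the standard LSSVM linear system from the general literature and then evaluates the kernel expansion of Eq. (4). The exact system and RBF form used in the study are not reproduced in the text, so both are assumptions here (in particular the $2\sigma^2$ denominator), as are the function names and the synthetic data.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel matrix; the exp(-||a - b||^2 / (2 sigma^2)) form is an assumption."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def lssvm_fit(X, y, C, sigma):
    """Solve the standard LSSVM system [[0, 1^T], [1, K + I/C]] [b, alpha] = [0, y]."""
    n = X.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / C
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b, coefficients alpha

def lssvm_predict(X_train, b, alpha, sigma, X_new):
    """Kernel expansion: M(x) = sum_i alpha_i * k(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0])
b, alpha = lssvm_fit(X, y, C=100.0, sigma=1.0)
y_hat = lssvm_predict(X, b, alpha, sigma=1.0, X_new=X)
```

In this formulation the training residual of each sample equals `alpha_i / C`, which shows how a larger `C` pushes the model toward a closer fit of the training data.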

GBO Algorithm
The GBO was proposed as a meta-heuristic optimization algorithm by Ahmadianfar et al. (2020), who showed that it performs more promisingly than other optimization algorithms. The GBO uses two main operators: the gradient search rule (GSR) and the local escaping operator (LEO) (Hassan et al. 2021). GSR uses a slope-based method to reach better positions in the search space. The algorithm performs well because the exploration and exploitation stages operate simultaneously and are suitably balanced (Olorunda and Engelbrecht 2008; Patel and Savsani 2015; Draa et al. 2015).

Gradient Search Rule (GSR)
GSR is the core of the GBO algorithm. Its role is to find better positions in the search space and, together with the direction of movement (DM) term, to accelerate convergence. The current vector position ($x_n^m$) is updated by Eq. (8) to give $X1_n^m$ (see Ahmadianfar et al. 2020 for the full expressions), where randn is a random number, $\varepsilon$ is a small number in [0, 0.1], rand is a random number in [0, 1], and $x_{best}$ is the best solution. $\rho_1$ is a factor that balances the exploration and exploitation stages, $\rho_2$ is a random parameter, and $\Delta x$ is the difference between the best solution ($x_{best}$) and a randomly selected position ($x_{r1}^m$). By replacing the position of the best vector ($x_{best}$) with the current vector ($x_n^m$) in Eq. (8), the new vector ($X2_n^m$) is obtained. Based on the positions $X1_n^m$, $X2_n^m$, and $X_n^m$, the new solution at the next iteration ($x_n^{m+1}$) is defined using two random numbers $r_a$ and $r_b$ in [0, 1]. Figure 3 shows the structure of the GBO model.
Fig. 2 The schematic structure of LSSVM model
Fig. 3 The schematic structure of GBO algorithm
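The final step of the GSR, in which the candidate positions are blended into the next-iteration solution, can be sketched as follows. The text does not reproduce the update equations, so the blending form used here, including the third candidate `X3`, is recalled from the general GBO literature and should be treated as an assumption; `X1` and `X2` are taken as given inputs, and the function name is hypothetical.

```python
import numpy as np

def next_position(x_n, X1, X2, rho1, rng):
    """Blend candidate positions into the next-iteration solution (assumed form)."""
    r_a, r_b = rng.random(2)          # two random numbers in [0, 1)
    X3 = x_n - rho1 * (X2 - X1)       # assumed third candidate built from X1 and X2
    return r_a * (r_b * X1 + (1.0 - r_b) * X2) + (1.0 - r_a) * X3

# Hypothetical 2-D example with decision variables such as (C, sigma).
rng = np.random.default_rng(0)
x_n = np.array([10.0, 1.0])
X1 = np.array([12.0, 0.8])
X2 = np.array([11.0, 0.9])
x_next = next_position(x_n, X1, X2, rho1=0.5, rng=rng)
```

Because the weights `r_a` and `r_b` are redrawn on every call, repeated calls from the same candidates give different points between them, which is one way such an update mixes exploration with exploitation.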

Local Escaping Operator (LEO)
The LEO helps the algorithm escape local optima and accelerates its convergence. It improves the ability of the GBO algorithm to solve complex problems. By using several solutions, the LEO generates a solution with superior performance ($X_{LEO}^m$). In the expression for $X_{LEO}^m$, $f_1$ is a uniform random number in [−1, 1], $f_2$ is a random number from a normal distribution with a mean of 0 and a standard deviation of 1, $pr$ is the probability, and $u_1$, $u_2$, and $u_3$ are random numbers. For more details, see Ahmadianfar et al. (2020).

Hybrid of LSSVM and GBO
The parameters C and σ have a notable effect on the performance of the LSSVM model. In this study, the GBO algorithm was used to find the optimal values of the LSSVM parameters. In the hybrid LSSVM-GBO algorithm, the values of the LSSVM parameters are treated as decision variables. The scheme of LSSVM-GBO is shown in Fig. 4. The steps of the LSSVM-GBO algorithm are as follows:
1. Test and training data are randomly selected from the available data.
2. The initial parameters of the GBO algorithm (the number of iterations and the population size) are randomly determined.
3. The LSSVM parameters (initial population) are initialized, and the GBO algorithm finds the optimal solution (values of parameters C and σ) in the search space.
4. After the optimal LSSVM parameters are obtained, the training data are used to build the optimized LSSVM model, and the test data are used to evaluate its predictive ability.
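The steps above can be sketched as a small tuning loop in which (C, σ) are the decision variables and the objective is the error of the resulting LSSVM model. In this sketch a plain random search stands in for GBO, so it illustrates only the coupling between optimizer and model, not the GBO update rules. The search bounds, the held-out scoring set, the RMSE objective, the assumed RBF form, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def rbf(A, B, sigma):
    """RBF kernel matrix (assumed form exp(-d^2 / (2 sigma^2)))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit_predict(X_tr, y_tr, X_new, C, sigma):
    """Fit an LSSVM regressor on (X_tr, y_tr) and predict at X_new."""
    n = len(y_tr)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X_tr, X_tr, sigma) + np.eye(n) / C
    sol = np.linalg.solve(A, np.concatenate(([0.0], y_tr)))
    b, alpha = sol[0], sol[1:]
    return rbf(X_new, X_tr, sigma) @ alpha + b

def random_search(objective, bounds, n_iter, rng):
    """Stand-in optimizer (NOT GBO): log-uniform sampling, keep the best point."""
    log_lo, log_hi = np.log10(bounds[:, 0]), np.log10(bounds[:, 1])
    best_params, best_value = None, np.inf
    for _ in range(n_iter):
        params = 10.0 ** rng.uniform(log_lo, log_hi)
        value = objective(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(60, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=60)

# Step 1: random split (70% to fit, 30% held out to score candidates).
idx = rng.permutation(60)
fit_idx, val_idx = idx[:42], idx[42:]

def rmse_for(params):
    """Objective: held-out RMSE of the LSSVM built with candidate (C, sigma)."""
    C, sigma = params
    pred = lssvm_fit_predict(X[fit_idx], y[fit_idx], X[val_idx], C, sigma)
    return float(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))

# Steps 2-3: decision variables are (C, sigma); the bounds are hypothetical.
bounds = np.array([[1e-2, 1e3], [1e-1, 1e1]])
best_params, best_rmse = random_search(rmse_for, bounds, n_iter=30, rng=rng)
```

Replacing `random_search` with a GBO implementation would leave the objective unchanged, since the optimizer only needs to propose candidate (C, σ) pairs and receive an error value for each.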

Data Collected
According to Table S2, WQ data from three Karun river stations were used in this study for EC and TDS modeling. Because each combination of inputs can affect the accuracy of the results differently, 13 input combinations with time delays of 0-12 months were created as inputs to the algorithms. Table S3 shows the details of the time delays applied to the best inputs.
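One plausible way to build delayed inputs is to pair the inputs observed in month t − d with the target observed in month t, for each delay d from 0 to 12 months. The sketch below follows that reading. How Table S3 maps the combination labels to delays is not reproduced here, so the pairing rule, the function name, and the synthetic series are assumptions.

```python
import numpy as np

def make_delayed_pairs(inputs, target, delay):
    """Pair inputs at month t - delay with the target at month t."""
    if delay < 0 or delay >= len(target):
        raise ValueError("delay must be between 0 and len(target) - 1")
    if delay == 0:
        return inputs.copy(), target.copy()
    # Drop the last `delay` input rows and the first `delay` target values.
    return inputs[:-delay], target[delay:]

# Synthetic monthly series (36 months, 2 input columns), for illustration only.
months = 36
inputs = np.column_stack([
    np.arange(months, dtype=float),
    np.arange(months, dtype=float) * 2.0,
])
target = np.arange(months, dtype=float)

# Thirteen datasets, one per delay of 0 to 12 months.
datasets = {d: make_delayed_pairs(inputs, target, d) for d in range(13)}
```

Each additional month of delay removes one sample from the dataset, so longer delays leave slightly fewer input-target pairs for training and testing.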

Evaluation Criteria
In the present study, four evaluation criteria, namely mean absolute error (MAE), relative root mean square error (RRMSE), correlation coefficient (R), and coefficient of determination (R²), were calculated in the testing phase. The expressions for these measures follow Zhu et al. (2020).

Assessment of WQ Parameters
Figure 5 depicts the general framework for the assessment of WQ parameters. The steps of WQ modeling are as follows:
1. The performance of the algorithms is compared on three benchmark datasets.
2. 70% of the data is used as training data and 30% as test data.
3. WQ is modeled with the mentioned algorithms based on the training data and the test data.
4. According to the evaluation criteria, the best algorithm and the best input combination are determined.
5. Time delays are created for the best algorithm and the best input combination.
6. The evaluation criteria are calculated for the time delays, and the best time delay is selected.
7. Finally, the EC and TDS time series are calculated.
Fig. 5 Flowchart for modeling WQ parameter
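The 70/30 split in step 2 and the four evaluation criteria can be sketched as below. The formulas are the common textbook definitions rather than the study's own expressions, which are not reproduced in the text. In particular, normalizing the RMSE by the standard deviation of the observations is an assumption, chosen because it is roughly consistent with the reported R and RRMSE pairs, and R² is taken as the square of R. All function names are hypothetical.

```python
import numpy as np

def split_70_30(X, y, rng):
    """Randomly assign 70% of the samples to training and 30% to testing."""
    idx = rng.permutation(len(y))
    n_train = int(round(0.7 * len(y)))
    tr, te = idx[:n_train], idx[n_train:]
    return X[tr], y[tr], X[te], y[te]

def mae(obs, est):
    """Mean absolute error."""
    return float(np.mean(np.abs(obs - est)))

def rrmse(obs, est):
    """RMSE divided by the standard deviation of the observations (assumed normalization)."""
    rmse = np.sqrt(np.mean((obs - est) ** 2))
    return float(rmse / np.std(obs))

def corr(obs, est):
    """Pearson correlation coefficient R."""
    return float(np.corrcoef(obs, est)[0, 1])

def r2(obs, est):
    """Coefficient of determination, taken here as R squared (assumption)."""
    return corr(obs, est) ** 2

# Tiny worked example with made-up numbers.
obs = np.array([1.0, 2.0, 3.0, 4.0])
est = np.array([1.0, 2.0, 3.0, 5.0])
scores = {"MAE": mae(obs, est), "RRMSE": rrmse(obs, est),
          "R": corr(obs, est), "R2": r2(obs, est)}
```

Lower MAE and RRMSE and higher R and R² indicate a better model, which is the ordering used when the best algorithm and input combination are selected in step 4.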

Calculate the Correlation Coefficient
Tables 1, 2, and 3 show the correlations between WQ parameters at the Ahvaz, Armand, and Gotvand stations. According to the correlation matrices, EC and TDS are most strongly correlated with Sum.A and Sum.C at all three stations. The correlation between EC and the inputs is greater than that between TDS and the inputs. The higher the correlation between inputs and outputs, the higher the modeling accuracy.
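A correlation matrix of this kind can be computed as in the following sketch, which also ranks the inputs by the absolute value of their correlation with EC. The data are synthetic and the relation between the columns is made up for illustration; only the use of Pearson correlation is assumed to match the analysis described above.

```python
import numpy as np

# Synthetic columns, for illustration only (not the station data).
rng = np.random.default_rng(0)
n = 120
sum_c = rng.uniform(2.0, 20.0, size=n)
ph = rng.normal(7.8, 0.3, size=n)
ec = 100.0 * sum_c + rng.normal(0.0, 5.0, size=n)  # made-up relation

names = ["Sum.C", "pH", "EC"]
data = np.column_stack([sum_c, ph, ec])

# Pearson correlation between every pair of columns.
corr_matrix = np.corrcoef(data, rowvar=False)

# Rank the inputs by |correlation| with EC, strongest first.
ec_col = names.index("EC")
ranking = sorted(
    ((abs(corr_matrix[i, ec_col]), names[i]) for i in range(len(names)) if i != ec_col),
    reverse=True,
)
```

Inputs at the top of `ranking` are natural first candidates when input combinations are assembled.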

Testing Algorithms with Benchmark Datasets
According to Table 4, the performance of the LSSVM-GBO algorithm is compared with that of other algorithms on three real-world regression problems. The results showed that the hybrid model provides much better accuracy than the ANN, ANFIS, and LSSVM models. The values of MAE, RRMSE, and R in the Housing dataset were 5.30, 0.91, and 0.44, respectively. In the LVST dataset, they were 99.94, 0.20, and 0.98, respectively, and in the Servo dataset, they were 0.46, 0.41, and 0.91, respectively. Fig. 6 shows the modeling accuracy of the LSSVM-GBO algorithm on the benchmark datasets.

Select the Best Input Combination and the Best Algorithm
In this section, the best input combination and the best algorithm are identified by estimating EC and TDS using different inputs. According to the results obtained, the LSSVM-GBO algorithm has the highest accuracy at all stations. Table 5 lists the results of the algorithms for EC modeling at the Ahvaz station. These results show that the Sum.C parameter has the greatest impact on modeling EC. In the optimal hybrid model, the values of MAE, RRMSE, and R were 74.30, 0.14, and 0.99, respectively. Table 6 lists the evaluation criteria for EC modeling at the Armand station. The Sum.C parameter also has the greatest impact on modeling EC at this station, with MAE equal to 19.67, RRMSE equal to 0.22, and R equal to 0.98. In Table 7, the results of the EC

EC and TDS Modeling
After identifying the best algorithm (LSSVM-GBO) and the best input combination (Sum.C, Sum.A, Na+1, and Q at the Ahvaz station, and Sum.C, Sum.A, Cl−1, and Q at the Gotvand and Armand stations), the effect of time delay on the EC and TDS modeling results is investigated in Table 11. For this purpose, 13 different input combinations with time delays of 0-12 months are defined. According to the evaluation criteria in Table 11, the LSSVM-GBO algorithm is more accurate with combination C1 than with the other combinations at all stations. Fig. 7 shows the modeling accuracy of the LSSVM-GBO algorithm for the EC parameter at the Ahvaz, Armand, and Gotvand stations. The modeling results showed that in most cases the difference between the estimated and observed values is very small. The high value of the correlation coefficient reflects the positive effect of GBO in improving LSSVM performance. Fig. 8 shows the modeling accuracy of the LSSVM-GBO algorithm for the TDS parameter at the three stations. In Fig. 9, the results of the EC and TDS time series model based on the best input combination, best time delay, and best algorithm (LSSVM-GBO) are compared at the three stations. According to these figures, the EC and TDS fluctuations are well modeled by the LSSVM-GBO algorithm, which indicates the high accuracy of this algorithm.

Conclusion
Accurate assessment of WQ parameters is an important issue because of its impact on various environmental factors. The introduction of a new hybrid model can potentially give an effective solution in this regard. In this research, a novel LSSVM model integrated with the GBO algorithm was used to estimate EC and TDS values in the Karun river at three hydrometric stations: Gotvand, Ahvaz, and Armand. In the first step, the performance of the proposed LSSVM-GBO algorithm was compared with that of other algorithms on three real-world regression problems (Housing, LVST, Servo). The modeling results showed that LSSVM-GBO has the highest accuracy on all three regression problems. The values of MAE, RRMSE, and R in the Housing dataset were 5.30, 0.91, and 0.44, respectively; in the LVST dataset, they were 99.94, 0.20, and 0.98, respectively; and in the Servo dataset, they were 0.46, 0.41, and 0.91, respectively. Eleven inputs, namely Ca+2, Cl−1, Mg+2, Na+1, SO4, HCO3, SAR, Sum.C, Sum.A, pH, and Q, were used to model the quality parameters EC and TDS. Based on the evaluation criteria, LSSVM-GBO performed best among all algorithms. The modeling results showed that the Sum.C, Sum.A, Na+1, and Cl−1 parameters have the most noticeable impact on modeling EC and TDS. The highest accuracy in modeling EC and TDS was obtained at the Gotvand station with input combination C1. The values of MAE, RRMSE, and R were 49.86, 0.14, and 0.99, respectively, in modeling EC, and 33.86, 0.16, and 0.99, respectively, in modeling TDS. Comparison of the EC and TDS time series modeled by LSSVM-GBO with the observed time series showed a high correlation between the modeled and observed results. The LSSVM-GBO algorithm has various advantages, such as high estimation accuracy, a balance between exploration and exploitation, fast convergence, the ability to find a global solution, and easy implementation.
The GBO does not fall into the trap of local optima due to the use of a local escaping operator. Also, the GBO uses the direction of movement term to move towards the solution. Therefore, GBO, by tuning the LSSVM parameters, can more accurately evaluate WQ parameters and other engineering problems.
Funding The research has not been supported through any funds.
Data Availability All data generated or used during the study are available on request. This article contains a supplementary information file.

Declarations
Ethics Approval Not applicable.