Performance Evaluation of Machine Learning Models with an Ensemble Learning Approach in Classification of Water Quality Indices Based on Different Subsets of Features

Since groundwater, the most readily available fresh water resource for human consumption, is extremely limited and, owing to uncontrolled human activities, prone to contamination, it is of great importance to constantly monitor the quality of ground fresh water resources in order to provide sustainable drinking water for people as well as to protect the ecosystem. One tool for modeling the water quality of a basin is the Water Quality Index (WQI). However, calculating a WQI is complicated and time-consuming; scientists are therefore increasingly inclined to propose simpler ways of modeling the quality of water resources, such as machine learning algorithms. In this study, the performance of four machine learning algorithms with an ensemble learning approach was evaluated in order to propose the classification model (classifier) with the highest performance. Moreover, to identify the most important water quality parameters in the classification process, three feature selection methods with a machine learning approach were applied. Among the four classifiers, XGBoost showed outstanding performance, with an accuracy of 96.9696% when all the parameters of interest were involved in the classification process. However, to make the model cost-effective, it is suggested to conduct the classification with an optimal subset of parameters; for the dataset used in this study, the XGBoost classifier is suggested as the best classifier, with a maximum accuracy of 95.606% under 10-fold cross-validation when the seven parameters identified by the Backward Feature Elimination feature selector were involved in the classification process.


Introduction
Although roughly 90% of all fresh water resources are groundwater, only a minuscule portion is accessible for human use as drinking water (Arabgol et al. 2016; Ostad-Ali-Askari et al. 2017). Driven indirectly by rapid population growth, anthropogenic activities such as agriculture and industry, the main sources of surface water contamination, increase the call for sustainable administration and management (Motevalli et al. 2019) of this irreplaceable asset for humankind's life on this planet (Dohare et al. 2014).
Constant monitoring of surface water quality is of paramount importance not only for human well-being but also for preserving the aquatic ecosystem. When nutrients such as N and P and/or their derivatives are released, willfully or ignorantly, into aquatic systems, unwanted consequences such as eutrophication of surface water reservoirs occur: nutrient enrichment causes algal blooms, which decrease dissolved oxygen (DO) levels and thereby inhibit and restrict the environment that aquatic organisms need to grow (Rozemeijer and Broers 2007; Varol and Şen 2012). Hydrogen sulfide, produced by the reduction of sulfate through anaerobic bacterial activity, can likewise be considered an indicator of eutrophication in surface water reservoirs. What is more, high concentrations of sulfate in drinking water can cause health problems (WHO 2006; EPA 2021). Therefore, sulfur and its derivatives must be considered important water quality parameters. Chloride is another crucial water quality parameter owing to its toxic effects in aquatic environments. Chloride is found in the environment first and foremost as a free anion and is therefore considered the chief component of salts such as sodium chloride, potassium chloride, and magnesium chloride (Elphick et al. 2011). Hence, chloride is associated with dissolved solids in aquatic systems. The presence of iron and manganese, two of the most abundant elements in the earth's crust, in surface water is inevitable.
Although the presence of these two elements is not a serious health concern, higher concentrations result in unwanted taste and color of the water, as well as clogging of pumps associated with sediment formation by anaerobic iron-manganese bacterial activity (Wong 1984; WHO 2006; Dalai et al. 2015).
Only a small portion of the water quality parameters, and the problems that accompany each of them, were described above. Consistent monitoring and assessment of water quality parameters, depending on the probable contamination sources in a basin, should therefore be among governments' responsibilities, firstly to sustain the aquatic life cycle and, moreover, to provide safe and sustainable drinking water for the people.
One tool that has aided environmental scientists, as well as decision-making agencies concerned with water quality, in assessing the quality of water without letting themselves in for the frustrating task of analyzing water quality parameters individually is the Water Quality Index (WQI).
One great advantage of the WQI is that it unifies massive amounts of data into a single value that is simpler, more understandable, more interpretable, and consequently more logical (Tyagi et al. 2013; Uddin et al. 2021).
Besides these advantages, however, interpreting water quality based on a WQI has many drawbacks, such as the discrepancy between WQI methods, as numerous different equations are used worldwide by different agencies (Tyagi et al. 2013; Bui et al. 2020).
Moreover, the method is quite time-consuming and complex (Bui et al. 2020). Because of these disadvantages, researchers today are inclined to use simpler, more reliable, and more effective methods for classifying and modeling water quality.
With the flourishing of advanced technologies, specifically the application of artificial intelligence (AI) in a variety of research lines, environmental scientists today tend to evaluate the accuracy and reliability of different machine learning methods, particularly in classification of water quality, firstly to evade complicated calculations when making rational decisions and secondly to expedite the classification and prediction process, as these methods have recorded promising results in this regard in recent years (Modaresi and Araghinejad 2014; Saghebian et al. 2014; Danades et al. 2016; Radhakrishnan and Pillai 2020).
Despite these satisfying results, the accuracy of classifiers can be degraded by high-dimensional data as well as by noise, owing to overfitting and costly computation. Bouamar and Ladjal (2007) evaluated and compared the performance of two machine learning algorithms (ANN and SVM) in classification of water quality. The chief disadvantage of both models was reported to be their sensitivity to noise. It can therefore be concluded that where the data are noisy, these models are not effective for predicting or classifying water quality. Thus, eliminating noise and selecting the most significant parameters can reduce the risk of overfitting and improve both the runtime and the accuracy of classification methods (Uyun and Sulistyowati 2020).
Although some studies in the literature focus on selecting the most prominent attributes for classification of water quality, they do not take machine learning approaches (Dezfooli et al. 2018; Bui et al. 2020).
Yet there are some studies in which feature selection based on a machine learning approach has been conducted. Muhammad et al. (2015) evaluated the performance of five machine learning algorithms in classification of water quality indices. The models were evaluated based on the number of attributes involved in the classification, and the CfsSubsetEval wrapper feature selection method was used to extract the most important attributes in the classification process.
The objective of this study is to investigate the performance of different ensemble machine learning models in classifying water quality indices, which are generated to indicate the suitability of water resources, in terms of water quality, for public consumption. A secondary aim is to investigate the effect of water quality attribute sets reduced by different feature selection methods on the performance of the machine learning algorithms in classifying water quality indices. In addition, the effect of different dataset splitting methods on the performance of the machine learning algorithms was also evaluated in this study.

Materials and methods
In this section, the classification algorithms whose performance was evaluated in classifying the water quality indices (LogitBoost, Random Forest, AdaBoost, and XGBoost) are discussed. Moreover, the feature selection methods used in this study to extract the most significant parameters in the classification process are described.
The overall workflow of the study is illustrated in Fig. 1.

Fig. 1
Schematic workflow of the study

Study area
The Büyük Menderes Basin lies in the west of Anatolia and basically comprises the Büyük Menderes River and its tributaries, which discharge into the Aegean Sea. The basin area is approximately 2,600,967 ha.
The major industries in the basin are agriculture, livestock, the food industry, and tourism, which can be considered the basin's main contamination sources. Almost 79% of the water in the basin is used for agriculture, and 21% is used by industry and dwellings (General Directorate of Environmental Management, Turkey, 2016). Since the Büyük Menderes Basin is one of the most contaminated basins in Turkey, owing to uncontrolled industrial discharges into the Büyük Menderes River, the artery of the basin, and its tributaries, regular water quality monitoring is essential for the survival of the basin; that is why this basin was selected as the study area in this research (Fig. 2).

Fig. 2
Büyük Menderes Basin and the selected stations

Data acquisition
For the purpose of this study, water quality data were compiled and provided by the General Directorate of State Hydraulic Works (GDSHW) in Turkey. There were 142 stations in the basin, operated by the GDSHW from 1985 to 2014, measuring the water quality of the basin at roughly regular monthly intervals. Among these 142 stations, water quality data from 10 stations covering 2004-2014 were selected to represent the downstream part of the basin, owing to the abundance of measured parameters and the frequency of observations: some stations were placed in inaccessible locations, and in seasons with harsh climate it was difficult to reach them for sampling. Therefore, at many of the stations only the most important water quality parameters were measured regularly, according to the demand for sampling and measurement arising from possible contamination in the area.

Dataset formation
After data acquisition, it was essential to create a dataset of the necessary parameters for the purpose of this study. This is an essential part of data preprocessing called data integration, which merges data from different sources into a unified dataset serving the purpose of a study (Kumar and Manjula 2012). The resulting dataset is summarized in Table 1.
The schematic framework for developing the LR model for missing value imputation is shown in Fig. 3. The dataset was divided into training and test sets by separating out the rows with missing values, which formed the test set for the prediction process. The LR model was then trained on the training set to establish the regression between the variables and, based on what it learned from the training data, predict the values in the test set. Finally, the predicted values were merged back into the training set to form a complete dataset without missing values.

Fig. 3
The schematic process of the missing value imputation by LR machine learning algorithm
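As a rough illustration of this workflow, the sketch below imputes a missing value with a one-predictor linear regression: rows where the target value is present train the model, the row with the missing value acts as the test set, and the prediction is merged back. The column names (EC, Cl) and the figures are hypothetical placeholders, not the study's data, and the study's LR model may use more predictors.

```python
# Minimal sketch of the missing-value imputation process in Fig. 3.
# Rows with a missing target form the "test" set; the rest train a
# linear regression; predictions are merged back into the dataset.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b (single predictor)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

def impute(rows, predictor, target):
    """Fill rows where rows[i][target] is None using the fitted line."""
    train = [r for r in rows if r[target] is not None]
    a, b = fit_linear([r[predictor] for r in train],
                      [r[target] for r in train])
    for r in rows:
        if r[target] is None:
            r[target] = a * r[predictor] + b  # predicted value merged back
    return rows

# Hypothetical records: electrical conductivity (EC) and chloride (Cl).
records = [{"EC": 500.0, "Cl": 50.0},
           {"EC": 600.0, "Cl": 60.0},
           {"EC": 700.0, "Cl": None},   # missing Cl to be imputed
           {"EC": 800.0, "Cl": 80.0}]
impute(records, "EC", "Cl")
```

The same train/predict/merge pattern applies regardless of which regression model fills the gaps.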

Water Quality Index (WQI)
Once the data were cleaned, the mass of information had to be reduced to a single logical expression describing the general quality of the water for public use. One tool serving this purpose is the Water Quality Index (WQI): functions developed by national and international agencies worldwide to describe the general status of water systems. Generally, WQI functions follow a four-stage process. First, depending on the possible contamination in a water system and on expert opinion, the parameters of interest are chosen. Second, the concentrations are converted into single-valued subindices. Third, a unit weight is calculated for each water quality parameter. Finally, the WQI value is calculated from the previously computed subindices and unit weights (Tyagi et al. 2013; Mădălina and Gabriela 2014; Uddin et al. 2021).
In this study, the Weighted Arithmetic Water Quality Index (WAWQI) method (Tyagi et al. 2013) (Eq. 2) was chosen to generate the water quality class labels for the Büyük Menderes Basin, according to 16 water quality parameters observed at 10 stations between 2004 and 2014 at two-month intervals. Here, Qi is the quality rating of the i-th parameter and Wi is its unit weight, calculated by Eqs. 3 and 4, respectively, where Vi is the measured concentration of the i-th parameter and V0 is its ideal value.

Wi = K / Si        (Eq. 4)
Here, Si is the recommended standard value of the i-th parameter and K is the proportionality constant, calculated through Eq. 5. According to the outcome of the WAWQI function, the water quality is classified into five major classes based on Table 2.
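Under the common formulation of the WAWQI (WQI = Σ QiWi / Σ Wi, with Qi = 100(Vi − V0)/(Si − V0), Wi = K/Si, and K = 1/Σ(1/Si)), the calculation can be sketched as follows. The standards, ideal values, and measurements below are placeholders, not the study's actual figures.

```python
# Hedged sketch of the WAWQI computation (Eq. 2-5 as described above).

def wawqi(measured, standard, ideal):
    """Weighted Arithmetic Water Quality Index."""
    # Eq. 5: proportionality constant
    K = 1.0 / sum(1.0 / s for s in standard)
    # Eq. 4: unit weight of each parameter
    W = [K / s for s in standard]
    # Eq. 3: quality rating of each parameter
    Q = [100.0 * (v - v0) / (s - v0)
         for v, s, v0 in zip(measured, standard, ideal)]
    # Eq. 2: weighted arithmetic aggregation
    return sum(q * w for q, w in zip(Q, W)) / sum(W)

# Three illustrative parameters (values are hypothetical):
Vi = [8.0, 250.0, 0.3]    # measured concentrations
Si = [10.0, 500.0, 1.0]   # recommended standard values
V0 = [0.0, 0.0, 0.0]      # ideal values
print(round(wawqi(Vi, Si, V0), 2))
```

Note that with Wi = K/Si and K = 1/Σ(1/Si), the unit weights sum to one, so the denominator in Eq. 2 normalizes exactly.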

Ensemble learning approach
In ensemble learning, the outputs of several single classifiers are combined to provide an improved predictive model in which the final result is based on the predictions of all the individual classifiers (Dong et al. 2020). The ensemble classifiers Random Forest (RF), XGBoost, AdaBoost, and LogitBoost were used in this study. Random Forest (Breiman 2001), basically a combination of decision trees, takes advantage of bootstrap aggregation (bagging) in selecting instances to reduce the classification error.
Moreover, this algorithm is known to be beneficial in classification applications owing to its ability to handle high-dimensional data and diverse data types, especially non-parametric data (Arora and Kaur 2020). The AdaBoost classifier (Freund and Schapire 1997) is another widely used ensemble method. It combines sequential singular hypotheses to increase the performance of each weak learner; various machine learning classifiers, such as SVM or Decision Tree, can serve as its base classifier (Liu et al. 2020). Extreme Gradient Boosting (XGBoost) (Chen and Guestrin 2016) is another ensemble-based classifier. It builds sequential decision trees in which each instance is given a weight indicating the probability of its being chosen by a decision tree in subsequent steps. Since the XGBoost algorithm uses a parallel implementation of sequential trees and reduces overfitting, it is considered one of the more advantageous classifiers (Bhati et al. 2021).
The LogitBoost classifier (Friedman et al. 2000) is another ensemble boosting classifier, proposed to deal with noisy data and decrease training error. It can be regarded as an AdaBoost classifier in which Logistic Regression serves as the base classifier (Tehrany et al. 2019).
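The ensemble principle underlying all four classifiers can be illustrated with a minimal voting sketch: several weak classifiers each predict a label, and the majority label is the ensemble's prediction. The "classifiers" below are toy threshold rules on hypothetical parameters, not the actual RF/XGBoost/AdaBoost/LogitBoost models of the study.

```python
# Minimal illustration of ensemble voting: the final decision is based
# on the predictions of all individual classifiers.
from collections import Counter

def majority_vote(classifiers, x):
    """Return the label predicted by the most classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy weak learners, each inspecting one hypothetical parameter:
weak = [
    lambda x: "Poor" if x["BOD"] > 6 else "Good",
    lambda x: "Poor" if x["DO"] < 5 else "Good",
    lambda x: "Poor" if x["NH4"] > 1 else "Good",
]

sample = {"BOD": 8.0, "DO": 7.0, "NH4": 2.5}
print(majority_vote(weak, sample))  # two of the three learners vote "Poor"
```

Bagging (Random Forest) and boosting (AdaBoost, XGBoost, LogitBoost) differ in how the individual learners are trained and weighted, but both aggregate many weak decisions into one stronger one.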

Feature selection methods
Feature selection methods not only decrease the dimensionality of the feature space, which contributes to eliminating irrelevant features, but also select the most effective subset of features relevant to the target class or classes. This reduces the size of the dataset, shortens training time, improves classification accuracy, and mitigates overfitting.

Filter method
In the filter method, the subset is selected by considering the features' own characteristics, such as distance or correlation among features and each feature's rank, instead of applying machine learning algorithms. Filter methods are considered fast and scalable because they rely on statistical measures for selecting features. They do, however, overlook dependencies among features and the interaction with the classifier (Khaire and Dhanalakshmi 2019). In this study, Mutual Information (MI) was used as the filter method.
I(X;Y) denotes the mutual information between two random variables X and Y, quantifying their mutual dependency in terms of information theory. Since mutual information does not assume any particular form of dependency between the variables, such as linearity or continuity, it is more generic than linear measures like the correlation coefficient, which makes it the more advantageous method. Equation 6 defines the mutual information of X and Y.
Here, p(x, y) denotes the joint probability density function of X and Y, and p(x) and p(y) denote their marginal densities. A higher mutual information value indicates a stronger dependency between X and Y. The mutual information matrix is a symmetric n × n matrix (Eq. 8) in which each entry is the mutual information between two attributes of the dataset D (Sefidian and Daneshpour 2019).
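For discrete variables, the formula reduces to a sum over joint frequencies, which the sketch below computes. Practical MI-based feature selectors for continuous data use more involved estimators, so this only illustrates the formula of Eq. 6.

```python
# Discrete mutual information from paired observations:
# I(X;Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y))).
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) of two sequences of discrete values."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # joint counts
    px = Counter(xs)             # marginal counts of X
    py = Counter(ys)             # marginal counts of Y
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p(x) * p(y)) written with counts to avoid tiny ratios:
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

# Perfectly dependent variables: MI equals the entropy of X, here log(2).
xs = [0, 0, 1, 1]
ys = ["a", "a", "b", "b"]
print(round(mutual_information(xs, ys), 4))
```

Independent variables would give a value near zero, which is why features with low MI against the class label are candidates for removal.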

Wrapper method
In the wrapper method, the feature selector is wrapped around a machine learning method, and candidate subsets are evaluated by the accuracy or error rate of the classifier. Since the wrapper method relies on the learning algorithm to select the most significant subset of features, the classifier's error rate should be low. Backward Feature Elimination (BFE) was used as the wrapper method in this study.
In the wrapper-based BFE method, which is more efficient than forward feature selection, importance scores are computed with all remaining features in each iteration, and the least important features are eliminated, so each round can improve the performance of the model. The rounds continue on the remaining features until no further improvement is achieved by removing features (Pan et al. 2009).
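The elimination loop just described can be sketched as follows. The `score_fn` stands in for a cross-validated classifier accuracy; the toy scoring function and feature names are hypothetical, purely to make the loop runnable.

```python
# Hedged sketch of backward feature elimination: starting from the full
# set, repeatedly drop the feature whose removal does not hurt the score,
# and stop when no removal helps.

def backward_elimination(features, score_fn):
    selected = list(features)
    best = score_fn(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for f in list(selected):
            trial = [g for g in selected if g != f]
            s = score_fn(trial)
            if s >= best:          # removal did not hurt: keep it removed
                selected, best = trial, s
                improved = True
                break
    return selected, best

# Toy score: only "pH" and "EC" carry signal; extra features add a small
# cost penalty, mimicking the cost-effectiveness argument of the text.
def toy_score(subset):
    signal = sum(f in ("pH", "EC") for f in subset)
    return signal - 0.01 * len(subset)

kept, score = backward_elimination(["pH", "EC", "Temp", "Month"], toy_score)
print(sorted(kept))
```

In a real experiment the score function would retrain and cross-validate the classifier on each candidate subset, which is what makes wrapper methods expensive but classifier-aware.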

Embedded method
In the embedded method, feature selection is built into the learning algorithm itself, so the feature evaluator is assisted by the learner in selecting the most effective subset. Compared with filter and wrapper methods, embedded feature selectors can achieve more satisfying results because they require less computation while still exploiting the learning algorithm for feature selection (Chen et al. 2020). RF as an embedded feature selector (RFE) was applied in this study, owing to its beneficial results across different problems and its generalization ability. In this method, different subsets of the training set are first produced by bagging, and a decision tree is constructed for each. The growth of a tree corresponds to the splitting of its nodes: nodes are divided based on features selected at random from candidate subsets of features. The trees are grown without pruning, and each tree acts as a classifier. In the final stage, all trees are combined into a random forest, which selects subsets of features among all the features in the dataset (Zhou et al. 2016).
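The idea behind impurity-based ranking in such tree ensembles can be illustrated in miniature: a feature that yields a large Gini-impurity decrease when used to split the data ranks as important. A real random forest averages this over many trees and bootstrap samples; the sketch below scores one feature with a single best stump split, on made-up data.

```python
# Rough sketch of Gini-impurity-based feature ranking, the mechanism
# commonly used by embedded random forest selectors.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def importance(values, labels):
    """Best single-split Gini decrease achievable with one feature."""
    parent = gini(labels)
    best = 0.0
    for t in sorted(set(values))[:-1]:          # candidate thresholds
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        best = max(best, parent - child)
    return best

labels      = ["Good", "Good", "Poor", "Poor"]
informative = [1.0, 2.0, 8.0, 9.0]   # separates the classes perfectly
noisy       = [5.0, 1.0, 5.0, 1.0]   # carries no class information
print(importance(informative, labels) > importance(noisy, labels))
```

Features whose accumulated impurity decrease is small across the forest are the ones an embedded RF selector discards.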

Evaluation metrics
The multi-class confusion matrix is a tool for evaluating the performance of classifiers applied to multi-class datasets. It has two dimensions, indicating actual classes and predicted classes. Table 3 illustrates a multi-class confusion matrix in which A1, A2, ..., An are the actual classes and Nij is the number of instances of class Ai classified as class Aj. The performance assessment metrics used in this study are presented below (Eqs. 9-11) (Danaei Mehr and Polat 2019). Accuracy: the proportion of correctly predicted instances among all instances.
Recall (sensitivity): the proportion of the actual instances of a water quality class that are predicted correctly.
Precision: the proportion of the instances predicted as a given water quality class that actually belong to it.
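These metrics follow directly from the confusion matrix, as the sketch below shows. The class names and counts are illustrative, not the study's results.

```python
# Accuracy, per-class recall, and per-class precision (Eq. 9-11)
# computed from a multi-class confusion matrix.

def metrics(cm, classes):
    """cm[i][j] = number of instances of actual class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    # Accuracy: correct predictions (the diagonal) over all instances.
    accuracy = sum(cm[i][i] for i in range(len(classes))) / total
    # Recall of class c: diagonal cell over its row sum (actual instances).
    recall = {c: cm[i][i] / sum(cm[i]) for i, c in enumerate(classes)}
    # Precision of class c: diagonal cell over its column sum (predictions).
    precision = {c: cm[i][i] / sum(row[i] for row in cm)
                 for i, c in enumerate(classes)}
    return accuracy, recall, precision

classes = ["Excellent", "Good", "Poor"]
cm = [[8, 2, 0],    # actual Excellent
      [1, 9, 0],    # actual Good
      [0, 1, 9]]    # actual Poor
acc, rec, prec = metrics(cm, classes)
print(round(acc, 3), round(rec["Good"], 3), round(prec["Good"], 3))
```

Note that recall reads along a row of the matrix while precision reads down a column, which is why the two can diverge for an imbalanced class.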

Results and discussion
In this study, the WAWQI method was used to assign a water quality class to each instance of the dataset, each instance belonging to a particular station in a particular month of a specific year. Of all 660 instances, 119 were classified as Excellent, and 139, 95, 59, and 248 were classified as Good, Poor, Very Poor, and Unfit for Consumption, respectively.
To evaluate the effect of feature-reduced datasets on the performance of each ML classifier used in this study, different feature selection methods were used to generate subsets of the features most important to the classification process. Out of the 19 parameters in the dataset, the Backward Feature Elimination method highlighted 7 parameters as most important to classifying the WQI, the Mutual Information method identified 9, and the Random Forest embedded feature selector identified 14 (Table 4).
In this study, the effect of splitting methods on the performance of each classifier was also evaluated. K-fold cross-validation (KFCV), namely 5-FCV and 10-FCV, along with percentage splitting at 70:30, 75:25, and 80:20 ratios, was used to divide the dataset into training and test sets. The main objective of similar studies in the literature is to suggest a particular machine learning model that has achieved promising results based on specific parameters of interest and a particular splitting method in classifying water quality indices. The purpose of developing such a model is to evade the time-consuming computations of classifying water quality by the WQI method; however, it is still essential at the outset to identify the water quality class with a particular WQI model, chosen according to the purpose of the classification, in order to supervise the machine learning algorithm during learning. In this study, the XGBoost classifier clearly achieved the most promising performance over the other classifiers. These three classes were the ones in which the same classifier made the most classification mistakes when reduced combinations of features were involved. It can therefore be concluded that, for higher water quality classification accuracy, it is better to involve more parameters rather than fewer, as machine learning algorithms predict the multiple water quality classes based on what they learn from a variety of parameters rather than a limited number. However, involving more parameters is not entirely rational either, for two reasons the authors of this study discussed and unanimously agreed on. First, to avoid extra computation by the machine learning algorithm and make it cost-effective, it is better to use a smaller combination of features.
Second, to make the monitoring and evaluation of a basin's water quality economically justifiable, scientists must select a limited number of parameters of interest, since some measurements are expensive as well as time-consuming. For these reasons, although the best performance was achieved by the XGBoost classifier when all the features of interest selected at the beginning of the study were involved, it is worth weighing the performance of XGBoost on the other feature subsets used in this study. The second-best performance was achieved with the 14 parameters selected by the RFE feature selector; however, the eliminated parameters (Year, Month, Station, Temperature, pH) carry no extra processing cost, so it would have been rational to keep all the parameters of interest rather than sacrifice performance. Glancing at the performance of the XGBoost classifier with the 7 parameters selected by the BFE feature selector, which gave the third-best performance, this combination can be suggested instead: a small share of performance is sacrificed, a lack that can be compensated during training with larger datasets, while reducing the number of parameters by 12 makes the model cost-effective.
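The two dataset-splitting schemes compared above, K-fold cross-validation and percentage splitting, can be sketched as index partitions. Real experiments would shuffle and typically stratify the instances first; the round-robin fold assignment below is only illustrative.

```python
# Sketch of the two splitting schemes: K-fold cross-validation index
# partitioning and a simple percentage split.

def kfold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def percentage_split(n, train_percent):
    """First train_percent% of indices train; the rest test."""
    cut = n * train_percent // 100
    return list(range(cut)), list(range(cut, n))

train, test = percentage_split(660, 70)      # 70:30 on the 660 instances
print(len(train), len(test))
splits = list(kfold_indices(660, 10))        # 10-FCV: ten folds of 66
print(len(splits), len(splits[0][1]))
```

Under K-fold cross-validation every instance appears in a test fold exactly once, which is why the reported fold-averaged accuracies are less sensitive to a lucky split than a single percentage split.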

Conclusion
The primary objective of this study was to suggest an ensemble machine learning model for classifying water quality. To supervise the machine learning algorithms during learning, WAWQI was used to classify the instances of the 19-parameter dataset into the 5 major class labels suggested by the WAWQI function for public consumption. Moreover, since the decisions made by machine learning algorithms are profoundly affected by the dimensionality of the dataset, three feature selection methods were used to reduce the number of parameters selected as parameters of interest at the beginning of this study.
Consequently, the results achieved by each classifier on the datasets generated by each of the feature selectors, as well as on the main dataset containing all the features, were evaluated in order to propose an effective machine learning algorithm for classifying the basin's water quality.
The results demonstrated that classifier performance was significantly affected by the number of parameters involved in the classification process, and the XGBoost classifier exhibited a clear merit over the other three classifiers whichever dataset was used. Its best performance was achieved when all the features selected at the beginning of the study were involved. However, to avoid extra computation and make the algorithm cost-effective, it is suggested to use the feature subset selected by the BFE feature selector, since its results did not differ significantly from those achieved by this classifier with all the parameters of interest involved, yet the number of features decreased by 12 parameters, which can make the process cost-effective.