In this study, 13 parameters of interest (i.e., year, month, station, temperature (T), pH, electrical conductivity (EC), chemical oxygen demand (COD), biological oxygen demand (BOD), ammonium nitrogen (NH4-N), nitrite (NO2), nitrate (NO3), phosphate (PO4) and dissolved oxygen (DO)) were chosen to assist the ML algorithms in the prediction of DO, of which 9 were selected by the wrapper feature selector to reduce the dimension of the dataset. The plot matrix of the selected variables is illustrated in Fig. 3. Although the eliminated variables were closely associated with DO concentration, the wrapper feature selector identified them as a source of noise (Chandrashekar and Sahin 2014) in the dataset; the most relevant and distinct variables associated with DO were found to be year, month, station, T, pH, EC, COD, BOD5, and NO3.
As discussed in the previous section (on feature selection methods), a feature selection algorithm works by eliminating irrelevant variables while keeping the most distinct ones, reducing noise with the aim of increasing the performance of the classifier or predictor. The result of the applied wrapper feature selector shows that the algorithm tended to keep only one nitrogen source to assist the predictors. Interestingly, the feature selector chose nitrate over nitrite for the generated subset of features. Along with ammonium nitrogen and nitrite, phosphate was also eliminated from the set of variables of interest, even though it is the main source of nutrient contamination in ground waters, causing algal blooms and consequently significantly diminishing DO concentrations (Rozemeijer and Broers 2007; Varol and Şen 2012), which demonstrates its close association with DO levels in water systems. Furthermore, despite the mechanism of eliminating near-identical variables to reduce noise and dimension, COD and BOD5 were both kept in the generated subset of features by the wrapper feature selector.
Validating the variables selected by the wrapper feature selector: besides observation information (i.e., year, month, and station), stratification indicators such as T and pH and mineral budget indicators such as EC (Ouma et al. 2020) have been found to be essential parameters in the prediction of DO concentration, as many researchers have chosen them as input water quality variables based on their expertise (Heddam and Kisi 2017; Keshtegar and Heddam 2018; Csábrági et al. 2019; Ouma et al. 2020; Zhu and Heddam 2020; Dehghani et al. 2021). Beyond these, some researchers have also included parameters such as COD (Zhu and Heddam 2020). Yet the effect of parameters such as NO3 and BOD5, which are closely associated with DO levels in water systems, needs further consideration in such studies. The effect of the parameters selected by the wrapper feature selector on the performance of the predictors is discussed in the following section(s).
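A wrapper feature selection of this kind can be sketched with scikit-learn's SequentialFeatureSelector wrapping a random forest. This is a minimal illustration on synthetic data; the study's exact wrapper configuration, estimator, and dataset are not reproduced here.

```python
# Sketch of wrapper-style feature selection (synthetic data; illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
# Synthetic stand-in for the 12 candidate predictors of DO
X = rng.normal(size=(200, 12))
y = 2 * X[:, 0] + X[:, 3] - X[:, 5] + rng.normal(scale=0.1, size=200)

selector = SequentialFeatureSelector(
    RandomForestRegressor(n_estimators=25, random_state=0),
    n_features_to_select=9,   # the study retained 9 of the candidate predictors
    direction="forward",
    cv=3,
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of retained features
```

The wrapper evaluates candidate subsets by cross-validated performance of the wrapped regressor itself, which is what distinguishes it from filter methods that score features independently of the model.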
Table 1 shows the results of the metrics used to evaluate the performance of each ML model applied for the prediction of the DO concentration in the target basin. Evidently, the RF regressor outperformed the MLP on all metrics, yet the effect of the reduced variables on the performance of the regressors was considerable. While the performance of the RF regressor slightly increased when 9 variables were included in the prediction process, the performance of the MLP declined noticeably with the reduced subset of features. This suggests an adverse effect of a reduced subset of features on ANN algorithms: they tend to benefit from observing as many variables as possible in the training phase in order to make sound decisions, and related variables do not seriously harm their performance. In other words, it can be concluded that ANN models are less prone to noise than tree-based models such as random forest.
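The full-versus-reduced comparison summarized above can be sketched as follows. The dataset is synthetic and the hyperparameters are illustrative assumptions, not the study's configuration.

```python
# Hedged sketch of comparing RF and MLP regressors on the same split
# (synthetic data; not the study's dataset or tuning).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=300, n_features=12, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for name, model in [("RF", RandomForestRegressor(random_state=0)),
                    ("MLP", MLPRegressor(max_iter=2000, random_state=0))]:
    model.fit(X_train, y_train)
    results[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {results[name]:.3f}")
```

Repeating such a run once with all candidate features and once with the wrapper-selected subset reproduces the kind of comparison reported in Table 1.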
Csábrági et al. (2019), comparing the performance of differently optimized ANN models, reported that ANN performance increases significantly with a larger dataset. Ouma et al. (2020), comparing an ANN model with an MLR statistical model, likewise reported that the ANN model's performance increases when all the variables of interest are included in the DO prediction process.
Returning to Table 1, although the correlation coefficient of the MLP regressor dropped by two percentage points, MAE, which indicates the closeness of the predictions, did not increase significantly. Therefore, it can be concluded that the general proximity of the estimations to the actual DO concentrations is virtually the same. However, alongside the slight increase in the correlation coefficient of the RF regressor with the reduced subset of features, MAE as well as RMSE declined considerably, which demonstrates the improvement of the algorithm's estimations when the noise and dimension of the dataset are reduced.
Table 1
Performance evaluation metrics of the regressors
Model | Variables | Pearson Correlation Coefficient | Mean Absolute Error | Root Mean Square Error
RF | Whole | 0.7958 | 0.9188 | 1.3093
MLP | Whole | 0.7586 | 1.2424 | 1.7291
RF | Reduced | 0.8052 | 0.8911 | 1.2805
MLP | Reduced | 0.7367 | 1.2495 | 1.7525
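The three metrics reported in Table 1 can be computed as follows. This is a minimal sketch; the toy numbers are illustrative, not the study's data.

```python
# Pearson r, MAE, and RMSE as used to evaluate the regressors in Table 1.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate(y_true, y_pred):
    """Return (Pearson r, MAE, RMSE) for a set of predictions."""
    r, _ = pearsonr(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return r, mae, rmse

# Illustrative values only (not the study's predictions)
r, mae, rmse = evaluate([5.0, 6.0, 7.0, 8.0], [5.2, 5.9, 7.3, 7.8])
print(f"r = {r:.4f}, MAE = {mae:.4f}, RMSE = {rmse:.4f}")
```

Note that RMSE penalizes large individual errors more heavily than MAE, which is why the two can move by different amounts between model configurations.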
Fig. 4 depicts the fit of the DO values predicted by the RF regressor, trained on all variables of interest, to the actual DO values. Evidently, for DO concentrations between 5 and 9 mg/l the values predicted by the algorithm fit the actual DO values well. However, values below and above this range could not be estimated as accurately by the algorithm.
Examining the error distribution of the RF model trained with all variables of interest (Fig. 5), the estimation error deviation (EED) of the algorithm was small (between -1 and 1.3) for 72.91% of the values predicted in the testing phase, which can be considered a promising general error deviation for the model trained on this set of features. Although the error deviation of the remaining values (27.09%) exceeded this range, only 7.38% of the predicted values fell in the unacceptable estimation error deviation range (EED > 2.34 or EED < -2.22).
Compared to the actual-versus-predicted scatter plot of the RF regressor when all features of interest were considered, the DO values predicted with the reduced subset of features fit the actual values more closely in Fig. 6. As in Fig. 4, the RF regressor, when trained on the reduced subset of features, predicted DO values between 5 and 9 mg/l remarkably well.
Moreover, the estimation error deviation histogram of the model with the reduced subset of features (Fig. 7) demonstrates the positive effect of noise cancellation and dimension reduction on the prediction ability of the regressor, as it predicted 73.86% of the instances within the remarkable estimation error deviation range of -1 to 1.26. Only 6.81% of the instances were predicted within the unacceptable estimation error deviation range (EED < -2.1 or EED > 2.38).
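The share of predictions falling inside a given EED band, as quoted above, can be computed along these lines. Here EED is assumed to mean predicted minus actual, and the values are illustrative, not the study's predictions.

```python
# Fraction of predictions whose estimation error deviation lies in a band.
import numpy as np

def fraction_within(y_true, y_pred, lo, hi):
    """Share of predictions whose EED (assumed: predicted - actual)
    falls inside the closed interval [lo, hi]."""
    eed = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.mean((eed >= lo) & (eed <= hi)))

# Illustrative DO values in mg/l
y_true = [6.0, 7.0, 8.0, 5.5]
y_pred = [6.5, 6.2, 9.5, 5.6]
print(fraction_within(y_true, y_pred, -1.0, 1.26))  # → 0.75
```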
Conversely, the MLP algorithm demonstrated that its prediction ability depends strongly on the number of variables in the training phase (Table 1). In other words, the more variables are included in the training phase, the more accurately the MLP algorithm can predict values in the testing phase.
Figures 8 and 9, respectively, depict the scatter plots of actual versus predicted DO values produced by the MLP regressor with all variables of interest and with the reduced subset of features. Comparatively, the fit is remarkable when all the variables of interest were included in the training phase. These two figures clearly demonstrate the dependence of the MLP algorithm on training from more sources rather than a limited number of features, which is contrary to the mechanism and results of tree-based algorithms, as they are susceptible to an excessive number of features when predicting a target value (Kohavi and John 1997). As with the RF regressor's results, the band of most accurate DO predictions for the MLP algorithm is 5 to 6 mg DO/l.
As for the estimation error deviation of the MLP model when all variables of interest were included in the training phase (Fig. 10), 71.51% of the predicted DO values fell in the outstanding EED range of -1.44 to 1.35, whereas 77.21% of the DO values predicted by the algorithm fell in this range when the reduced subset of features was used in the training phase (Fig. 11). This aligns with the EED results of the RF model: when the reduced subset of features was included in the training phase, both models' EED narrowed significantly, which demonstrates the remarkable effect of noise cancellation and dimension reduction on the estimation accuracy of both regressors.
The performance of ANN models, as well as tree-based models including random forest (a robust tree-based algorithm built on the ensemble learning technique), has been investigated in many lines of research aiming to propose promising models for classification and/or regression problems.
The results of the current study showed that RF is a promising regressor for predicting DO concentrations from other water quality parameters. Despite its satisfactory results, the MLP algorithm exhibited weaker performance than RF. Furthermore, the effect of feature selection on the performance of the regressors is undeniable.
Over the years, the superiority of RF and its derivatives over many other ML models in solving regression problems has been reported repeatedly. For instance, Singh et al. (2013) reported the superiority of the RF algorithm over a simple decision tree as well as an SVR benchmark model in predicting air quality parameters. Wang et al. (2016) evaluated the performance of RF and Random Bits Forest (RBF), a method derived from the original concept of RF, against seven other ML algorithms on a regression problem, and reported the superiority of the RF and RBF models over the others. Maheshwari and Lamba (2019) investigated the performance of six regressors, including RF and MLP, on an air quality regression problem; despite an outstanding performance by MLP, the random forest regressor outperformed the other regressors in accuracy. It can also be concluded from the results of that study that the MLP regressor is a robust algorithm for such regression problems. Ghorbani et al. (2016) reported the superior regression performance of MLP over an SVR model in predicting river flow. In a different study, Heidari et al. (2016) investigated the regression ability of the MLP algorithm in predicting the viscosity of nanofluids and achieved outstanding results, demonstrating its exceptional predictive capability. However, this does not imply the feebleness of other regressors compared to the MLP. For example, Amid and Gundoshmian (2016) investigated the performance of three regressors, including MLP, in predicting the output energies of boilers; MLP showed the weakest performance among the three. There are other examples in the literature in which MLP was not found to be an effective algorithm for regression problems (Ghritlahre and Prasad 2018).
Therefore, given this algorithm's popularity and its simplicity of application, its regression performance should be investigated further under different scenarios.
Recently, feature selection has attracted the attention of many researchers across research domains due to its capability of making regression and/or classification cost-effective. A reduced subset of features is generated from the main dataset not only to reduce the dimension of the dataset, which decreases extra computation and consequently makes a model feasible, but also to improve prediction performance. However, a reduced subset of features does not always boost performance; it may affect the performance of some ML regressors adversely.
Masmoudi et al. (2020) reported improved regressor predictions when a feature importance framework was applied to reduce the dimension of the dataset. Castangia et al. (2021) reported the positive impact of a reduced subset of features on the performance of five ML regressors, including the RF regressor.
The results of our study also confirmed the positive impact of training the RF regressor with the reduced subset of features for predicting DO concentrations. Yet the adverse effect of the reduced features on the performance of the MLP regressor for DO prediction is undeniable. However, there are studies in the literature that investigated the effect of a reduced subset of features on the performance of an MLP regressor and reached results contradicting those of this study. For example, Hossain et al. (2013) reported improved accuracy of an MLP regressor for the prediction of solar power using a reduced subset of features.
Considering the benefits of feature selection and a reduced subset of features, sacrificing a little accuracy for a considerable reduction in computation does not seem irrational. However, when a potent regressor such as RF can achieve excellent results with or without feature selection, it is not cost-effective to pursue a less effective algorithm for the prediction of DO concentration in this study. Nevertheless, the robustness and accuracy of the MLP regressor can be investigated further with different scenarios and different datasets.