Optically active and non-active WQPs were modelled from satellite data by applying the XGBoost and MLP regressors to each cluster (Table 2) for 2014-2018. Spatiotemporal maps of the measured and predicted values were then generated to show the distribution of the different WQPs across the basin.
In-situ WQPs
The collected WQ data were preprocessed to handle missing values and identify outliers. All preprocessing steps were carried out in Python using the sklearn.preprocessing module of the scikit-learn library. Because the selected in-situ variables have different scales and units, they were standardized by z-transformation using SciPy's stats module. Basic summary statistics of the WQPs were computed for all clusters; those for C-0 are presented in Table 3. Box-and-whisker plots of the WQPs were drawn for each cluster to identify outliers in the data.
Table 3
Summary of basic statistics for C-0 from 2014-2018
| Statistics | EC_GEN (µmho/cm) | pH_GEN (pH units) | TDS (mg/l) | Temp (deg C) | Ca (mg/l) | SiO2 (mg/l) | BOD (mg/l) | DO (mg/l) |
|---|---|---|---|---|---|---|---|---|
| count | 159 | 159 | 159 | 159 | 159 | 159 | 159 | 159 |
| mean | 226.80 | 7.83 | 176.93 | 21.94 | 33.27 | 8.12 | 2.38 | 7.26 |
| std | 59.96 | 0.48 | 22.46 | 3.97 | 7.98 | 0.76 | 0.64 | 0.71 |
| min | 116.00 | 2.70 | 119.53 | 11.00 | 17.60 | 6.10 | 0.90 | 3.70 |
| 25% | 180.00 | 7.70 | 166.31 | 19.00 | 28.50 | 7.65 | 2.00 | 6.90 |
| 50% | 230.00 | 7.90 | 179.50 | 22.50 | 33.70 | 8.20 | 2.50 | 7.30 |
| 75% | 260.00 | 8.00 | 191.54 | 25.00 | 37.50 | 8.60 | 2.70 | 7.60 |
| max | 430.00 | 8.40 | 239.32 | 30.50 | 59.50 | 9.80 | 6.30 | 8.80 |
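The z-transformation and outlier screening described for the in-situ data can be sketched as follows. This is a minimal, library-free illustration rather than the study's actual code: the pH readings are hypothetical, the |z| > 3 cut-off is an assumed threshold that the text does not state, and the population standard deviation is used because that is the default of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined for constant data")
    return [(v - mu) / sigma for v in values]


def flag_outliers(values, threshold=3.0):
    """Return the indices whose |z| exceeds the threshold (assumed cut-off)."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]


# Hypothetical pH readings; the last entry mimics the suspicious minimum in Table 3.
ph = [7.8, 7.9, 8.0, 7.7, 7.9, 8.1, 7.8, 8.0, 7.9, 7.7, 8.2, 2.7]
standardized = z_scores(ph)
outliers = flag_outliers(ph)
```

Standardizing each WQP this way puts variables measured in µmho/cm, mg/l and deg C on a common scale before modelling.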
The correlation matrix calculated with the Pearson method is presented in Table 4. All WQPs are positively correlated with one another except EC, which is negatively correlated with every other WQP. Strong correlations (r of roughly 0.80 or higher) are observed for pH-DO, SiO2-DO, Temp-DO and SiO2-pH, while the remaining pairs show moderate to weak correlations.
Table 4
Correlation between WQPs in Cluster0
| | EC_GEN | pH_GEN | TDS | Temp | Ca | SiO2 | BOD | DO |
|---|---|---|---|---|---|---|---|---|
| EC_GEN | 1 | -0.74 | -1.00 | -0.65 | -0.62 | -0.79 | -0.26 | -0.61 |
| pH_GEN | -0.74 | 1 | 0.73 | 0.68 | 0.36 | 0.87 | 0.50 | 0.89 |
| TDS | -1.00 | 0.73 | 1 | 0.63 | 0.58 | 0.78 | 0.26 | 0.61 |
| Temp | -0.65 | 0.68 | 0.63 | 1 | 0.35 | 0.63 | 0.34 | 0.86 |
| Ca | -0.62 | 0.36 | 0.58 | 0.35 | 1 | 0.56 | 0.07 | 0.20 |
| SiO2 | -0.79 | 0.87 | 0.78 | 0.63 | 0.56 | 1 | 0.28 | 0.77 |
| BOD | -0.26 | 0.50 | 0.26 | 0.34 | 0.07 | 0.28 | 1 | 0.42 |
| DO | -0.61 | 0.89 | 0.61 | 0.86 | 0.20 | 0.77 | 0.42 | 1 |
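For readers who want to reproduce a matrix of this kind, the Pearson coefficient can be computed directly from its definition. The sketch below uses only the standard library; the function names and the three short series are hypothetical illustrations, not the study's data or code.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return s_xy / sqrt(s_xx * s_yy)


def correlation_matrix(columns):
    """Nested dict of pairwise Pearson coefficients for named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values for illustration only (not the measured data).
sample = {
    "EC": [180.0, 230.0, 260.0, 200.0, 310.0],
    "pH": [8.0, 7.9, 7.7, 8.0, 7.5],
    "DO": [7.6, 7.3, 7.0, 7.5, 6.6],
}
matrix = correlation_matrix(sample)
```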
Feature selection criteria on different Clusters
Statistical tests were performed on the Rrf data extracted at each sampling station to check for inconsistencies, and outliers were then corrected by applying a z-score (Sudheer et al., 2007). The Pearson correlation matrix (R) between the Landsat-8 Rrf values of individual bands and band ratios and the WQPs was examined for every cluster to select the best features for modelling; the matrix for Cluster-0 is presented in Table 5. Multispectral bands and band combinations with r ≥ 0.50 were retained to form the input layer (Hafeez et al., 2019; Sharaf El-Din et al., 2017). The retained correlations were then checked with a significance test, and only those significant at the 5% level (p<0.05) were used to finalize the input parameters (Abdelmalik, 2018; Nas et al., 2010; Swain & Sahoo, 2017).
Table 5
Cluster-0 Pearson correlation
| | B1 | B2 | B3 | B4 | B5 | B6 | B7 | EC | pH | TDS | Temp | Ca | SiO2 | DO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B1 | 1 | | | | | | | | | | | | | |
| B2 | 0.99 | 1 | | | | | | | | | | | | |
| B3 | 0.95 | 0.96 | 1 | | | | | | | | | | | |
| B4 | 0.86 | 0.89 | 0.94 | 1 | | | | | | | | | | |
| B5 | 0.29 | 0.29 | 0.36 | 0.49 | 1 | | | | | | | | | |
| B6 | 0.16 | 0.16 | 0.16 | 0.33 | 0.85 | 1 | | | | | | | | |
| B7 | 0.16 | 0.16 | 0.14 | 0.32 | 0.76 | 0.98 | 1 | | | | | | | |
| EC | -0.55 | -0.54 | -0.62 | -0.53 | -0.10 | 0.03 | 0.06 | 1 | | | | | | |
| pH | 0.53 | 0.51 | 0.63 | 0.51 | 0.18 | 0.04 | 0.01 | -0.74 | 1 | | | | | |
| TDS | 0.54 | 0.53 | 0.60 | 0.57 | 0.09 | -0.03 | -0.05 | -1.00 | 0.73 | 1 | | | | |
| Temp | 0.57 | 0.56 | 0.62 | 0.53 | 0.28 | 0.15 | 0.13 | -0.65 | 0.68 | 0.63 | 1 | | | |
| Ca | 0.32 | 0.32 | 0.36 | 0.25 | 0.06 | -0.13 | -0.18 | -0.62 | 0.36 | 0.58 | 0.35 | 1 | | |
| SiO2 | 0.53 | 0.52 | 0.62 | 0.51 | 0.19 | 0.05 | 0.01 | -0.79 | 0.87 | 0.78 | 0.63 | 0.56 | 1 | |
| DO | 0.57 | 0.60 | 0.68 | 0.74 | 0.65 | 0.04 | 0.01 | -0.61 | 0.89 | 0.61 | 0.56 | 0.20 | 0.77 | 1 |
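The significance screening applied to these coefficients can be made explicit. The text does not name the test used, so the following assumes the standard two-sided t-test for a Pearson coefficient r computed from n paired samples:

```latex
t = \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad \mathrm{df} = n-2
```

As an illustration only, taking n = 159 (the Cluster-0 sample count in Table 3, assumed here to carry over to the matched satellite samples) and the threshold r = 0.50 gives t = 0.50 · √157 / √0.75 ≈ 7.23, well above the two-sided 5% critical value of about 1.98. With a sample of that size, any band meeting the r ≥ 0.50 criterion would therefore also pass the p<0.05 criterion.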
As shown in Table 5 for Cluster-0, all WQPs except Ca are significantly (p<0.05) correlated with bands B1-B4, and a similar trend was found in the other clusters. The remaining Landsat-8 bands and their combinations, such as Cirrus and TIR 1 and 2, were less correlated with the WQPs (r < 0.50). This is likely because the Thermal Infrared bands 1 and 2 were designed mainly to measure surface temperature, while the Cirrus band is commonly used for cloud detection. To strengthen the relationship between the input and output variables of the ML algorithms, we created many band combinations that correlate significantly with the WQPs. The feature selection criteria for the different clusters are explained below. Cluster-0 lies on the downstream side of the study area, and its stations are close to the other stations. Visual interpretation of Google Earth Engine (GEE) satellite images shows agriculture and barren land near the sampling stations. Cluster-2, on the upstream eastern side of the study area, consists of agriculture and barren land with moderate to high urban cover (Ayodhya and Basti). Eliginbridge, Balrampur, Ayodhya, Basti, Birdghat and Turtipar are the six sampling stations in Cluster-2; Balrampur was omitted from the study because Landsat-8 data were unavailable for it. The stations in Cluster-0 and Cluster-2 show no seasonal variation, as they belong to the same clusters in both the dry and wet seasons.
Therefore, the analysis for these clusters was carried out by combining all stations for both seasons. Cluster-1 and Cluster-3 show a fluctuating seasonal trend and are spread from upstream to downstream along the basin. GEE visual interpretation identifies heavily settled areas near Kanpur, Lucknow, Allahabad and Varanasi (Vinod et al., 2013). Our previous study identified a seasonal shift of sampling stations from Cluster-1 in the dry season to Cluster-3 in the wet season and vice versa. Because Landsat-8 data availability during the wet season was below 50%, the study was restricted to the dry season (Nov-May) for Cluster-1 and Cluster-3.
The Pearson correlation coefficient and the Gini importance from ExtraTreesRegressor were used together to identify the best-correlated (r above 0.50) and significant (p<0.05) bands and band combinations for the model input. A total of 166 input parameters (not presented here due to space constraints), comprising bands and their combinations, were identified for the different clusters and the finalized WQPs (Temp, EC, pH, SiO2 and DO). Correlations of 0.567-0.923 were observed for the different combinations at a significance of p<0.05. Feature importance scores from ExtraTreesRegressor, based on Gini importance, were plotted for the various combinations, and the same procedure was repeated for feature selection in each cluster. Feature scores for EC in Cluster-0 and for pH in Cluster-1 and Cluster-3 are presented in Fig. 5 and Fig. 6a & b, respectively. Correlations of 0.51-0.89 were observed with the bands and band combinations in Cluster-1 and Cluster-3.
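The correlation-based part of this screening can be sketched as follows. The sketch is an assumption-laden illustration: it builds only simple ratios Bi/Bj (the text does not list the 166 combinations actually used), the reflectance values are hypothetical, and the target is deliberately constructed as a linear function of B3 so the outcome is predictable. The Gini-importance ranking from ExtraTreesRegressor is not reproduced here.

```python
from itertools import permutations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def band_ratio_features(bands):
    """Return the single bands plus every ordered ratio Bi/Bj with i != j."""
    features = dict(bands)
    for a, b in permutations(bands, 2):
        features[f"{a}/{b}"] = [p / q for p, q in zip(bands[a], bands[b])]
    return features


def screen_features(features, target, r_min=0.50):
    """Keep features with |r| >= r_min, sorted by decreasing |r|."""
    scores = {name: pearson_r(vals, target) for name, vals in features.items()}
    kept = {name: r for name, r in scores.items() if abs(r) >= r_min}
    return dict(sorted(kept.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Hypothetical reflectances for three bands at six sampling dates.
bands = {
    "B3": [0.10, 0.12, 0.15, 0.11, 0.14, 0.13],
    "B4": [0.08, 0.09, 0.12, 0.10, 0.11, 0.09],
    "B6": [0.03, 0.05, 0.02, 0.04, 0.03, 0.05],
}
# Hypothetical target built from B3 so that B3 is known to correlate perfectly.
target = [5.0 + 20.0 * v for v in bands["B3"]]

features = band_ratio_features(bands)
selected = screen_features(features, target)
```

In practice the p<0.05 check would be applied to the retained features as a second filter before they are passed to the models.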
Hyperparameter optimization (HPO) for XGBoost
In this study, HPO for the XGBoost regressor was performed with Optuna. As discussed above, the model was run with the selected features and target values, evaluated separately for each cluster. Model performance was assessed with standard statistical indices: R2, adjusted R2 and Root Mean Square Error (RMSE). Key results for each cluster are discussed in the sections below.
The XGBoost models were developed using the scikit-learn compatible API. The dataset is first converted into an optimized data structure called DMatrix, which is the format XGBoost works with. HPs were optimized with Optuna, an automatic optimization framework published under the MIT license (https://github.com/pfnet/optuna/) and designed around a define-by-run technique (Akiba et al., 2019), which allows the programmer to construct the search space dynamically. Optuna uses the history of completed trials to identify promising regions of the search space and so optimize the HPs in minimal time. Of the HPs that XGBoost exposes, we tuned learning rate, max_depth, l1_reg (L1 regularization term on weights), l2_reg (L2 regularization term on weights) and n_estimators. Unpromising trials can be stopped early in the training phase through the pruning feature. We considered one WQP at a time as the output because the best features identified differ between WQPs. The same set of HPs was tuned for every WQP and every cluster. The optimized HPs are displayed in Table 6.
Table 6
Optimized HPs for different WQPs along different Clusters in XGBoost
| Clusters | WQPs | Learning Rate | Max_depth | l1_reg | l2_reg |
|---|---|---|---|---|---|
| C0 | EC | 0.2926042 | 7 | 0.001134 | 0.01710109 |
| | pH | 0.237338 | 4 | 0.2236841 | 0.00059588 |
| | Temp | 0.205234461 | 8 | 0.0115554 | 2.8569528 |
| | SiO2 | 0.11117548 | 7 | 0.0038749 | 0.92009494 |
| | DO | 0.1492115 | 8 | 0.0001574 | 0.18985823 |
| C1 | pH | 0.178377066 | 7 | 0.014976 | 0.00103462 |
| | Temp | 0.114854821 | 7 | 0.0001023 | 0.01760738 |
| | SiO2 | 0.558027582 | 7 | 3.582E-05 | 9.11E-05 |
| | DO | 0.267273993 | 6 | 0.0034017 | 0.23174842 |
| | TDS | 1.055113332 | 5 | 1.914E-05 | 1.12E-05 |
| C2 | EC | 0.03212539 | 8 | 0.0001335 | 1.3344E-05 |
| | pH | 0.54051175 | 7 | 0.228148 | 0.00035278 |
| | Temp | 0.138631144 | 4 | 6.56E-05 | 0.01605706 |
| | TDS | 0.030726507 | 5 | 4.51E-05 | 1.51E-04 |
| | SiO2 | 0.084894633 | 4 | 0.0025398 | 9.96E-01 |
| | DO | 0.158258279 | 4 | 0.0054742 | 9.86053923 |
| C3 | EC | 0.831185087 | 6 | 0.0001335 | 1.51E-04 |
| | pH | 0.097761003 | 5 | 0.005306 | 0.09855875 |
| | Temp | 0.114854821 | 7 | 0.0001023 | 0.01760738 |
| | TDS | 1.055113332 | 5 | 1.91E-05 | 1.12E-05 |
| | SiO2 | 0.558027582 | 7 | 3.58E-05 | 9.11E-05 |
| | DO | 0.480724584 | 5 | 0.3392282 | 0.00656407 |
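For reference, the roles of the tuned HPs can be read off the regularized objective given in the XGBoost documentation. This is a sketch in standard notation; mapping l1_reg and l2_reg in Table 6 to α and λ (the `reg_alpha` and `reg_lambda` arguments of the scikit-learn API) is an assumption, since the text does not give the exact argument names.

```latex
\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert

\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t\left(x_i\right)
```

Here l is the training loss, T is the number of leaves of a tree f, w_j are its leaf weights, and η is the learning rate that shrinks the contribution of each new tree. max_depth bounds the depth of every tree f_k, and n_estimators sets the number of boosting rounds, i.e. the number of trees summed over k.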
Hyperparameter optimization (HPO) for MLP
MLPRegressor is an estimator in the neural network (NN) module of scikit-learn that performs regression with a multilayer perceptron. To optimize the HPs we applied GridSearchCV, which systematically builds and evaluates a model for every combination of parameter values in a specified grid. We searched over the HPs listed in Table 7 to find the best estimator for our study, using 3- to 7-fold cross-validation. The training share of the train:test split was varied from 70% to 80% until the training and testing accuracies were equal or differed negligibly. This HP search was carried out for every cluster, taking one WQP at a time, and the best HPs were identified for each WQP and cluster. The identified HPs are presented in Table 7.
Table 7
Optimized HPs for different WQPs along different Clusters in MLP regressor
| Clusters | WQPs | Activation | Hidden Layers | Learning Rate | Solver |
|---|---|---|---|---|---|
| C0 | EC | logistic | (50,150) | Constant | L-BFGS |
| | pH | identity | (150,100,50) | Constant | L-BFGS |
| | Temp | relu | (100,50) | Constant | L-BFGS |
| | SiO2 | relu | (150,100,50) | Constant | L-BFGS |
| | DO | relu | (100,50) | Constant | L-BFGS |
| C1 | EC | relu | (150,50,100) | Constant | Adam |
| | pH | relu | (100,150,50) | Constant | L-BFGS |
| | Temp | tanh | (100,) | Constant | L-BFGS |
| | SiO2 | relu | (150,50,100) | Constant | L-BFGS |
| | DO | tanh | (150,100,50) | Constant | L-BFGS |
| C2 | EC | relu | (50,100,150) | Constant | Adam |
| | pH | relu | (100,150,50) | Constant | L-BFGS |
| | Temp | tanh | (100,) | Constant | L-BFGS |
| | SiO2 | relu | (50,100,150) | Constant | L-BFGS |
| | DO | relu | (100,50,150) | Constant | L-BFGS |
| C3 | EC | relu | (50,150,100) | Constant | L-BFGS |
| | pH | relu | (100,150,50) | Constant | L-BFGS |
| | Temp | logistic | (50,) | Constant | L-BFGS |
| | SiO2 | relu | (100,50,150) | Constant | L-BFGS |
| | DO | relu | (50,150,100) | Constant | L-BFGS |
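The exhaustive search with k-fold cross-validation described for the MLP can be sketched without any library. This is a conceptual stand-in, not the study's code: a one-parameter ridge regression replaces MLPRegressor so the example stays small, the data are hypothetical (exactly y = 2x), and all function names are illustrative. In scikit-learn the same loop is what `GridSearchCV` performs, with the grid passed as `param_grid` and the fold count as `cv`.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_slope(x, y, penalty):
    """Closed-form slope of a no-intercept ridge regression (toy model)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)


def cv_mse(x, y, penalty, k):
    """Mean squared error averaged over k held-out folds."""
    fold_errors = []
    for fold in k_fold_indices(len(x), k):
        held_out = set(fold)
        x_train = [v for i, v in enumerate(x) if i not in held_out]
        y_train = [v for i, v in enumerate(y) if i not in held_out]
        slope = fit_ridge_slope(x_train, y_train, penalty)
        fold_errors.append(sum((y[i] - slope * x[i]) ** 2 for i in fold) / len(fold))
    return sum(fold_errors) / len(fold_errors)


def grid_search(x, y, grid, k):
    """Evaluate every combination in the grid; return (best_params, best_score)."""
    names = list(grid)
    best = None
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = cv_mse(x, y, k=k, **params)
        if best is None or score < best[1]:
            best = (params, score)
    return best


# Hypothetical data following y = 2x exactly, so no penalty is the best choice.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
best_params, best_score = grid_search(x, y, {"penalty": [0.0, 1.0, 10.0]}, k=3)
```

For the MLP, the grid would instead hold candidate values for the columns of Table 7 (activation, hidden layer sizes, learning rate schedule and solver), and the cost grows with the product of the candidate counts.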
Evaluation and Comparisons of Results
The full dataset of in-situ and Rrf data was split randomly into 70% training and 30% testing sets to build the XGBoost and MLP regression models. R2, adjusted R2 and RMSE were calculated for the predicted EC, pH, Temp, SiO2 and DO, with each cluster evaluated separately; the results are presented in Tables 8 and 9. Except for EC, the R2 values for all WQPs were high and close to 1, showing a notable correlation between the Rrf data and the in-situ observations. The feature importance for each WQP in the different clusters was studied to identify the best bands and band combinations; the feature importance for DO in Cluster-0 is presented in Fig. 7. The developed Landsat-8 based WQ modelling can be recommended as a cost-effective and time-saving method for monitoring optically active and non-active WQPs. The coefficients of determination obtained for pH, Temp, SiO2 and DO range from 0.74 to 0.98 with XGBoost and MLP, with a p-value <0.005. However, EC performed poorly in all the clusters, with R2 ranging from 0.23-0.37 for XGBoost. Comparatively better results were observed with MLP, with R2 of 0.78-0.99 for Cluster-1, Cluster-2 and Cluster-3, except for Cluster-0, where R2 was 0.32 and 0.27 in the training and testing phases, respectively.
Table 8
Regression statistics of XGBoost regressor along different clusters
| Clusters | WQPs | Train R2 | Test R2 |
|---|---|---|---|
| C0 | EC | 0.32 | 0.27 |
| | pH | 0.94 | 0.78 |
| | Temp | 0.88 | 0.73 |
| | SiO2 | 0.98 | 0.98 |
| | DO | 0.98 | 0.97 |
| C1 | EC | 0.35 | 0.33 |
| | pH | 0.74 | 0.74 |
| | Temp | 0.87 | 0.89 |
| | SiO2 | 0.96 | 0.96 |
| | DO | 0.98 | 0.927 |
| C2 | EC | 0.23 | 0.21 |
| | pH | 0.74 | 0.74 |
| | Temp | 0.85 | 0.89 |
| | SiO2 | 0.97 | 0.97 |
| | DO | 0.97 | 0.90 |
| C3 | EC | 0.34 | 0.32 |
| | pH | 0.81 | 0.76 |
| | Temp | 0.87 | 0.90 |
| | SiO2 | 0.98 | 0.97 |
| | DO | 0.97 | 0.96 |
Table 9
Regression statistics of MLP regressor along different clusters
| Clusters | WQPs | R2 | RMSE |
|---|---|---|---|
| C0 | EC | 0.37 | (mg/l) |
| | pH | 0.89 | 0.06198 (mg/l) |
| | Temp | 0.82 | 0.0812 (mg/l) |
| | SiO2 | 0.93 | 0.00426 (mg/l) |
| | DO | 0.93 | 0.00595 (mg/l) |
| C1 | EC | 0.27 | |
| | pH | 0.87 | 0.00535 (mg/l) |
| | Temp | 0.93 | 0.00065 |
| | SiO2 | 0.91 | 0.00472 (mg/l) |
| | DO | 0.87 | 0.02724 (mg/l) |
| C2 | EC | 0.23 | 0.13671 (mg/l) |
| | pH | 0.84 | 0.0183 (mg/l) |
| | Temp | 0.95 | 0.0023 |
| | SiO2 | 0.97 | 0.0063 (mg/l) |
| | DO | 0.81 | 0.02233 (mg/l) |
| C3 | EC | 0.35 | 0.00001 (mg/l) |
| | pH | 0.87 | 0.00509 (mg/l) |
| | Temp | 0.92 | 0.00065 |
| | SiO2 | 0.92 | 0.00472 (mg/l) |
| | DO | 0.82 | 0.02724 (mg/l) |
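The three indices reported in Tables 8 and 9 follow from their textbook definitions. The sketch below is a minimal reference implementation with illustrative function names; the numbers in the example are made up to keep the arithmetic easy to check and are not taken from the study.

```python
from math import sqrt


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_features):
    """R2 penalized for the number of input features."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def rmse(observed, predicted):
    """Root mean square error, in the units of the observed variable."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Made-up values: one prediction is off by 1, the other two are exact.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
score = r2_score(observed, predicted)
error = rmse(observed, predicted)
```

Because RMSE carries the units of the target, values are only comparable across WQPs if the targets were standardized first, which is worth keeping in mind when reading the RMSE column.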
Scatter plots and box plots of the testing-phase performance are presented for MLP (Fig. 8A-V) and XGBoost (Fig. 9A-V) for the different WQPs in all clusters. Because similar trends were observed in all clusters, only the scatter plots for C0 and C2 and the box plots for C3 are presented here. The scatter plots show better performance for XGBoost (R2 in the range 0.88-0.98) than for MLP (R2 in the range 0.72-0.97) across all parameters. Box plots of observed vs. predicted values were generated for all WQPs in the different clusters to compare the observed minimum, maximum and mean values with the predicted ones. Only a minimal difference in mean value is observed across all clusters for both models.
Spatial Distribution of WQPs
Spatiotemporal maps were plotted from the values predicted by the developed models to study the variation of WQPs across all clusters. The maps for the dry and wet seasons of 2017 are presented and discussed here.
The spatial patterns in the spatiotemporal maps indicate that land use modification and seasonal variation primarily regulate WQ conditions. EC reflects the total concentration of ionized constituents in water. Higher conductivity along some stretches of the river, specifically at stations belonging to Cluster-1 and Cluster-3, reflects higher water pollution. The GRB shows EC concentrations higher than the recommended value of 3000 µS/cm along the basin (CWC and NRSC 2014). A typical value of EC is 300 µS/cm (Bhuyan et al., 2018). EC in the different clusters did not show a significant seasonal shift in 2017 (Fig. 10A & B). The values range from 190-600 µS/cm in the dry season and 150-340 µS/cm in the wet season, i.e. they are somewhat higher in the dry season, although no clear seasonal trend is observed. Lower EC values are observed in Cluster-0 and Cluster-2 in the dry season. According to WHO, the permissible limit of EC for drinking water is 1500 µmhos/cm (WHO 1996). Apart from the difference in value range, no serious spatiotemporal change in pattern was detected between the dry and wet seasons. High concentrations are marked at Lucknow, Allahabad, Varanasi and nearby stations in Cluster-1 and Cluster-3. These higher EC concentrations could be attributed to heavy anthropogenic activities such as agricultural and urban runoff (GEE images).
As per the IS:2296 specification, the permissible range for pH is 6.5-8.5. In general, a pH of 6.5 to 9 is suitable for the survival of aquatic ecosystems (CWC and NRSC 2014). Only a slight variation in values is observed between the two seasons (Fig. 10C & D). Keeping the water within this range is crucial, as very high or low pH values can be disastrous to the ecosystem (Al-Badaii et al., 2013). This work found stable pH values in all clusters and seasons, ranging from 7.2-8.5, which fall within the permissible limits.
WT along the basin in the state of UP ranges from 16 to 46°C. WT is strongly correlated with salinity, the absorption of toxic substances and DO. It also controls the rate of photosynthesis of aquatic plants. Human interactions such as agricultural and industrial runoff and deforestation cause significant variation in WT and eventually reduce DO. Significant variation in DO and WT therefore poses a greater threat to the aquatic ecosystem. The inverse relationship between WT and DO is a natural phenomenon: warmer water reaches oxygen saturation at a lower concentration and therefore holds less DO (Bhat et al., 2014; Lamaro et al., 2013). As shown in Fig. 10E & F, a clear seasonal and temporal shift in WT is observed in all years, especially along the middle stretch of the basin. The values ranged from 14-25°C in the dry season and 24-32°C in the wet season.
The downstream part of the study area shows high TDS values in the dry and wet seasons, 216.96-285 mg/l and 231.281 mg/l respectively (Fig. 10G & H). Some stations in Cluster-1 and Cluster-3 on the upstream side of the study area showed a seasonal shift, falling to a minimum range of 132-181 mg/l from 150-195 mg/l for both the seasons. A similar spatiotemporal trend is observed for the other years considered in this study. The permissible TDS values as per IS:2296 are 500 mg/l (Class A), 1500 mg/l (Class C) and 2100 mg/l (Class E). The predicted values for 2017 are well within these limits in both seasons. Higher TDS values indicate strong anthropogenic activity and runoff with heavy suspended matter loading.
No marked seasonal change is observed for SiO2, with values ranging from 5.00-11.36 mg/l in both seasons (Fig. 10I & J).
Most of the sampling sites recorded DO values above 5 mg/l (Dutta et al., 2020), which is permissible for bathing as per IS:2296. The fluctuation in DO could be linked to waste discharge from various non-point sources, increases in WT and the biological activity of aquatic organisms along the river basin. Most stations in Cluster-2 and Cluster-3 show a clear seasonal shift. The minimum values are in the ranges 5.50-5.8 mg/l and 5.70-5.93 mg/l in the dry and wet seasons, respectively. The stations Kanpur, Lucknow and Ankinghat followed the same pattern in both seasons, with a range of 6.12-7.05 mg/l. The maximum DO values (7.67-8.30 mg/l in both seasons) were observed along some stretches of the basin (Fig. 10K & L).
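The comparisons against permissible limits made in this section can be automated when screening predicted maps. The sketch below encodes only the limits quoted in the text (pH range and bathing DO per IS:2296, Class A TDS, and the WHO drinking-water EC limit); the station labels and predicted values are hypothetical, and the function names are illustrative.

```python
# (lower, upper) permissible bounds as quoted in the text; None means unbounded.
LIMITS = {
    "pH": (6.5, 8.5),       # IS:2296 permissible range
    "DO": (5.0, None),      # mg/l, minimum for bathing per IS:2296
    "TDS": (None, 500.0),   # mg/l, IS:2296 Class A
    "EC": (None, 1500.0),   # µmhos/cm, WHO (1996) drinking-water limit
}


def within_limits(parameter, value):
    """True if the value lies inside the permissible bounds for the parameter."""
    low, high = LIMITS[parameter]
    return (low is None or value >= low) and (high is None or value <= high)


def flag_exceedances(predictions):
    """Map each station to the list of parameters outside their limits."""
    return {
        station: [p for p, v in values.items() if not within_limits(p, v)]
        for station, values in predictions.items()
    }


# Hypothetical predicted values for two unnamed stations.
predictions = {
    "S1": {"pH": 7.9, "DO": 6.4, "TDS": 240.0, "EC": 420.0},
    "S2": {"pH": 8.7, "DO": 4.6, "TDS": 260.0, "EC": 580.0},
}
flags = flag_exceedances(predictions)
```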