This section presents and compares the performance obtained with the predictive models, the relative importance of the hydrological and climate variables, and their relationships with Chla.
3.1. Performance of the regression models
Figure 2 presents the scatterplots of predicted and observed values for all the models tested in this study. The plots show that linear regression and GLM underestimate Chla. These models rely on strong assumptions about the error distribution: homoscedasticity, normality, and no autocorrelation. Although the variables with an elevated correlation were removed, some multicollinearity remained among the predictors, which could be a problem for the prediction. Predictors of water quality indicators will frequently be correlated (both temporally and spatially) since the mechanisms associated with their increase or decrease are interrelated (Su et al. 2012; Liu et al. 2019; Mesquita et al. 2020). It is important to keep in mind that highly correlated variables can present complementary information when combined (Guyon and Elisseeff 2003), which reinforces the need for integrating correlation analysis with model-based variable importance.
GBM, RF, and MLP provided the best predictions (Table 3). These models are designed to capture nonlinear relationships between variables, which is likely to be the case here. GBM and RF can reduce the variance of the predicted values by employing ensemble techniques (boosting and bagging, respectively), outperforming the regression tree (Hastie et al. 2009). The SVM model with a radial kernel is also able to detect nonlinearity, as it maps the data to a higher-dimensional space where they can be linearly separable (Awad and Khanna 2015). However, SVM performed slightly worse than GBM, RF, and MLP.
Table 3
Performance metrics for the fitted models.
Model | R² | RMSE | MAE
GBM | 0.52 | 9.32 | 7.15
RF | 0.46 | 10.26 | 8.01
MLP | 0.45 | 9.74 | 7.66
SVM | 0.36 | 10.92 | 8.77
KNN | 0.35 | 10.67 | 8.22
Regression Tree | 0.34 | 10.77 | 8.21
Linear regression | 0.26 | 11.48 | 9.10
GLM | 0.26 | 11.48 | 9.08
As expected, the predictive models were able to explain only part of the variability in Chla, since the best-performing model had an R² of 0.52 (Table 3). This performance can be considered satisfactory for a watershed-scale model: a reference value for evaluating phosphorus (P) prediction, which can be easier to predict than Chla, is an R² > 0.5 (Moriasi et al. 2015).
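The three metrics reported in Table 3 can be sketched as follows. This is a minimal illustration, assuming that R² denotes the coefficient of determination (the text does not state which definition was used; some packages report the squared correlation instead). The function name and the Chla values are hypothetical and are not taken from the study.

```python
import math

def regression_metrics(observed, predicted):
    """Return (R2, RMSE, MAE) for paired observed and predicted values."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / n
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)                # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot       # assumption: coefficient of determination
    rmse = math.sqrt(ss_res / n)     # root mean squared error
    mae = sum(abs(r) for r in residuals) / n              # mean absolute error
    return r2, rmse, mae

# Made-up Chla values, for illustration only.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 37.0]
r2, rmse, mae = regression_metrics(obs, pred)
```

RMSE penalizes large errors more heavily than MAE, which is why the two metrics can rank models differently.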
This result also suggests that hydrological and climate factors alone are not enough to predict Chla and that additional variables, such as water quality indicators, might be necessary (Rocha et al. 2020). However, it must be emphasized that the relationship between P and Chla in tropical lakes is not comparable to that in temperate ones, where empirically estimated relationships between P and Chla provide reliable models to calculate Chla levels (Sakamoto 1966; Dillon and Rigler 1974; Jones and Bachmann 1976). A correlation analysis between measured total phosphorus concentration, obtained from the COGERH database (http://www.hidro.ce.gov.br/), and estimated Chla reveals that nutrient enrichment may not be the only factor influencing eutrophication in tropical reservoirs (Fig. 3).
Although past studies have obtained better predictive performances (Stefanidis et al. 2021), Chla can be harder to predict in semiarid regions, due to the significant water level variability (which implies more complex mechanisms behind eutrophication) and the usually higher trophic levels (Wiegand et al. 2021). There are, however, other possible explanations. The Chla time series were derived from satellite data, which have high estimation accuracy (Lins et al. 2017) but might contain noise or components that cannot be explained with known variables. Also, past studies have indicated that the drivers of Chla can vary with the temporal resolution (Blauw et al. 2018; Liu et al. 2019). For example, on a monthly scale, water temperature is less important to predict Chla than nutrient loadings (Liu et al. 2019), which means that some of the explanatory variables may be unable to explain Chla in our model.
3.2. Variable importance
To measure the relative influence of the model’s explanatory variables, the importance measure attributed by each predictive model was extracted and scaled using min-max normalization (Fig. 4). This approach has been widely used to make machine learning models more interpretable (Hastie et al. 2009), and can be more accurate than looking only at the correlation between explanatory and dependent variables. Correlation criteria or the goodness of fit of a linear model are simple and direct strategies to obtain information about a set of variables, but they ignore multicollinearity and interactions between the variables. Although this study was not intended to perform variable selection, some of the models used here, such as RF and regularized GLM, have built-in processes to select the most relevant predictors, the so-called embedded methods (Guyon and Elisseeff 2003).
Radial SVM and KNN models were excluded from this analysis since they do not have a direct importance measure. For RF, GBM, and the regression tree models, the importance corresponds to the reduction in predictive performance obtained by removing the variable from the model. In GLM and MLP, the importance is associated with the weights attributed to each variable.
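The min-max normalization applied to the importance scores can be sketched as below. This is only an illustration: the function name, the variable names, and the raw scores are hypothetical, and the handling of the all-equal case is an assumption, since the text does not describe it.

```python
def min_max_scale(scores):
    """Rescale a dict of raw importance scores to the [0, 1] interval."""
    lo = min(scores.values())
    hi = max(scores.values())
    if hi == lo:
        # Assumption: if all scores are equal, report zero importance for all.
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}

# Made-up raw importances from a single model, for illustration only.
raw = {"water_volume": 42.0, "mix_layer_depth": 18.0, "wind_speed": 2.0}
scaled = min_max_scale(raw)
```

After scaling, the most important variable of each model receives 1 and the least important receives 0, which puts models with different native importance units on a common scale.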
The boxplots in Fig. 4 reveal that water volume was considered the most important predictive variable in all models. The models do not agree on the importance of mix-layer depth and bottom temperature, which varied widely among them. The dummy variables related to the spatial location of the reservoirs (Castanhão, Orós, and Banabuiú) did not seem to significantly influence Chla, indicating that spatial variability could be less important than climate variability, or that the relationships between explanatory variables and Chla are similar for all three reservoirs.
The relative influence of the variables depends on the interactions identified by each model and the procedure used to identify them. For example, decision trees choose the optimal variable in each split based on the information gained by adding it to the tree. The regression tree constructed to predict Chla had only the mix-layer depth and water volume as predictors (Supplementary material, Fig. S2). This means that these two variables provide enough information to give an approximate estimation of Chla. The regression tree alone can be considered a weak predictor, as it is very sensitive to small changes in the dataset and can easily overfit. Since regression trees assume that all variables have some interaction between them, they suit our problem well, but the tree fitted here fails to provide accurate estimations of Chla (it presented an R² of only 0.34; Table 3). However, it can still give interesting information on variable importance.
GBM and RF, as explained in the Methods section, combine several regression trees to provide stronger predictive models. RF performs variable selection during its model building process, as the variables used to construct each tree in the ensemble are selected from a random subset of the explanatory variables. The trees are fitted to bootstrap samples of the data, and the importance measure is calculated on the left-out observations (out-of-bag set). The advantage of RF’s strategy to calculate variable importance is that it considers both the individual effect and the interactions between the variables (Strobl et al. 2007). GBM, on the other hand, calculates importance on the entire training set instead of using the out-of-bag sets.
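The idea of measuring importance on observations that were not used for fitting can be illustrated with a generic permutation scheme: the values of one variable are shuffled and the resulting loss in predictive performance is recorded. The sketch below is not the RF or GBM implementation used in this study; the toy model, the data, and all names are hypothetical.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions on rows X against targets y."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, column, seed=0):
    """Increase in MSE after shuffling one column of held-out data."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    # Rebuild each row with the shuffled value in the chosen column.
    X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
    return mse(model, X_perm, y) - baseline

def toy_model(row):
    # Hypothetical fitted model that depends only on the first variable.
    return 2.0 * row[0]

# Made-up held-out data with two explanatory variables.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [toy_model(row) for row in X]
imp_used = permutation_importance(toy_model, X, y, column=0)
imp_unused = permutation_importance(toy_model, X, y, column=1)
```

A variable that the model ignores yields an importance of zero, whereas shuffling a variable that the model relies on increases the error.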
To verify the effect of the climate season on the relationships between the explanatory variables and Chla, two additional models were fitted, one containing only observations registered between February and May (wet season), and one containing the observations from the remaining months (dry season). Again, variable importance was extracted for each model and normalized so one could visualize their relative influence on Chla prediction (Fig. 5).
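The seasonal split can be sketched as follows, assuming that each observation carries a calendar month. The record layout and the names are hypothetical.

```python
# February to May, the wet season as defined in the text.
WET_MONTHS = {2, 3, 4, 5}

def split_by_season(records):
    """Split observations into wet-season and dry-season subsets by month."""
    wet = [r for r in records if r["month"] in WET_MONTHS]
    dry = [r for r in records if r["month"] not in WET_MONTHS]
    return wet, dry

# One placeholder record per calendar month, for illustration only.
records = [{"month": m, "chla": 0.0} for m in range(1, 13)]
wet, dry = split_by_season(records)
```

With one observation per month, the wet-season subset holds four records and the dry-season subset holds eight.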
Water volume and water level continue to be the most relevant indicators of Chla in both scenarios. However, mix-layer depth and mean temperature seem to be more important in the wet season. It is important to keep in mind that the wet season model has a smaller dataset than the dry season model, as it corresponds to the observations of four months only (February to May). For this reason, the model can be biased, and more data could be necessary to provide reliable predictions.
3.3. Relative influence of hydrological and climate variables on Chla
The partial dependence plots (PDPs) in Fig. 6 illustrate the relationships between hydrological and climate variables and Chla. The GBM model was selected for this analysis, as it presented the best performance according to all the metrics evaluated (Table 3). These plots, however, should be interpreted with caution, as they may not display all interactions of the explanatory variables.
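A partial dependence curve can be computed by fixing one variable at each value of a grid, keeping the other variables at their observed values, and averaging the model predictions. The sketch below shows this under stated assumptions: the toy model, its coefficients, and the data are made up and do not correspond to the fitted models of this study.

```python
def partial_dependence(model, X, column, grid):
    """Average prediction over X with one column forced to each grid value."""
    curve = []
    for value in grid:
        preds = [model(row[:column] + [value] + row[column + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve

def toy_model(row):
    # Hypothetical model: Chla falls with volume and rises with temperature.
    volume, temperature = row
    return 50.0 - 0.4 * volume + 0.5 * temperature

# Made-up observations: [water volume, mean temperature].
X = [[20.0, 26.0], [50.0, 28.0], [80.0, 30.0]]
pd_volume = partial_dependence(toy_model, X, column=0, grid=[10.0, 50.0, 90.0])
```

A monotonically decreasing curve, as obtained here for the hypothetical volume variable, is read as an inverse relationship with the response.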
Confirming the findings of previous studies, Chla tends to increase as water volume decreases (da Rocha Junior et al. 2018; Wiegand et al. 2021). The decrease in water volume due to evaporation loss, water withdrawals, and extended drought periods is usually associated with higher phosphorus loads in tropical reservoirs (Raulino et al. 2021; Delmiro Rocha and Lima Neto 2021). During the dry period, sediment release and nutrient resuspension are important mechanisms associated with Chla in these reservoirs. Although the effect of internal loading has been pointed out as more significant in shallow reservoirs, in semiarid regions precipitation levels come close to zero and inflow decreases drastically during the dry season, so that deep reservoirs reach very low volumes and almost no external loads are carried to them (Delmiro Rocha and Lima Neto 2021; Lima Neto et al. 2022).
Wind speed did not seem to play an important role in Chla levels, which might be due to the reservoirs’ morphology and the temporal scale considered here. In deep reservoirs, wind speed is indeed unimportant to Chla, as it is not a relevant driver of water column mixing. In shallow reservoirs, on the other hand, wind speed presents a significant correlation with nutrient resuspension (Araújo et al. 2019; Mesquita et al. 2020). Past research has indicated that although wind speed affects the dynamics of algal growth and eutrophication, information on wind dynamics is lost on a monthly scale (Stefanidis et al. 2021).
Mix-layer depth has an inverse relationship with Chla, which is consistent with previous findings (Stockwell et al. 2020; Stefanidis et al. 2021). There are several factors to consider when interpreting this relationship, such as water temperature, reservoir morphology, and the ratio between the mix-layer depth and thermocline depth. In deep reservoirs, stratification is more likely to occur and lake stability tends to increase, with a higher possibility of solute accumulation in the hypolimnion, dissolved oxygen depletion, and phosphorus release from sediments (Butcher et al. 2015; Kraemer et al. 2015; Moura et al. 2020). However, an increase in mix-layer depth also results in a reduction of the light available to phytoplankton (Stockwell et al. 2020) and in lower water temperatures, which could inhibit Chla growth (Zhao et al. 2020).
Bottom temperature, mean temperature, solar radiation, and water level have direct relationships with Chla. The first three variables are directly related to each other, and their increase usually enhances phytoplankton productivity (Liu et al. 2019). The direct influence of water level on Chla is surprising, as previous studies have reported the opposite relationship (Medeiros et al. 2015; Wiegand et al. 2020; Braga and Becker 2020). However, the effect of increasing water levels on Chla depends on the quality of the inflow, whether or not it is related to a reduction in the outflow (Bakker and Hilt 2015), the depth, and the trophic state of the reservoir (Costa et al. 2015). During the rainy season (when water levels rise), external loads from rivers and surface runoff are added to the internal loads due to thermal stratification and phosphorus release from sediment, which is highly correlated with Chla growth (Moura et al. 2020). Agriculture and cattle raising are important activities in all reservoirs analyzed here and are the main cause of nonpoint source pollution that increases external total phosphorus loading; this effect is higher in the wet season than in the dry season (Rocha and Lima Neto 2021; Lima Neto et al. 2022).
The PDPs for the dry and wet season models were also examined. Except for mean precipitation and wind speed, all variables maintained the patterns observed in the general model. Figure 7 presents the variables with opposing behaviors. While precipitation has a positive effect in the dry season, it presents a negative and almost insignificant effect during the wet season. One explanation for this behavior is that during the dry season, as the reservoirs have lower water volumes, precipitation can increase nutrient loadings (Jeppesen et al. 2015; da Rocha Junior et al. 2018). During the wet season, increased precipitation might induce greater flushing and lower Chla (Reichwaldt and Ghadouani 2012). This effect, however, does not seem very relevant, as it produces little variation in Chla. The extent of the influence of precipitation on Chla can also be related to the intensity and frequency of rainfall events (Reichwaldt and Ghadouani 2012; Ho and Michalak 2020).
During the wet season, stronger winds seem to result in a slight decrease in Chla (up to 3 µg/L), while in the dry season they have the opposite effect. Although wind speed has a small influence on Chla, it is still interesting to investigate the sign of this relationship. Previous studies have indicated that increased wind speed can result in greater mixing of the upper layer, thus reducing Chla (Stockwell et al. 2020); however, under oligotrophic conditions, stronger winds can carry nutrients to the bottom layer and increase Chla (Kahru et al. 2010; Kim et al. 2014). This mechanism also depends on the reservoirs' morphology and water level; hence, for shallow reservoirs (or for reduced water levels in the dry season), stronger winds can induce resuspension and increase internal nutrient loads (Araújo et al. 2019). In the wet season, wind-induced resuspension is less significant, as external sources of nutrients play a more important role in Chla fluctuations (Rocha and Lima Neto 2021).
PDPs can also be plotted for two variables at the same time (Supplementary material, Figure S3). Again, one must be careful when interpreting these plots, as they can show correlations between variables rather than a causal relationship. When considering higher values of solar radiation, wind speed presents an inverse relationship with Chla. Whether the mix-layer is shallow or deep, when solar radiation is higher, Chla tends to increase, a relationship that is confirmed by previous research (Berger et al. 2006). One can also notice that mix-layer depth seems to have a stronger effect on Chla only up to a certain point.
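The two-variable case extends the same averaging to a grid of value pairs. The sketch below uses a made-up model with an interaction term, chosen only so that the effect of one variable changes with the level of the other; its coefficients and data are hypothetical and are not estimates from this study.

```python
def partial_dependence_2d(model, X, col_a, grid_a, col_b, grid_b):
    """Average prediction over X for every pair of values of two variables."""
    surface = []
    for a in grid_a:
        line = []
        for b in grid_b:
            total = 0.0
            for row in X:
                modified = list(row)      # copy so X is left unchanged
                modified[col_a] = a
                modified[col_b] = b
                total += model(modified)
            line.append(total / len(X))
        surface.append(line)
    return surface

def toy_model(row):
    # Hypothetical model with a wind-radiation interaction term.
    wind, radiation, depth = row
    return 10.0 + 0.1 * radiation - 0.01 * wind * radiation - 0.2 * depth

# Made-up observations: [wind speed, solar radiation, mix-layer depth].
X = [[1.0, 150.0, 5.0], [3.0, 180.0, 10.0], [2.0, 120.0, 15.0]]
surface = partial_dependence_2d(toy_model, X, 0, [0.0, 5.0], 1, [100.0, 200.0])
```

In this toy surface, raising wind speed from 0 to 5 lowers the averaged prediction by 5 units at the lower radiation value and by 10 units at the higher one, which is the kind of pattern a two-variable PDP makes visible.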
Wind speed has little effect on Chla when the water volume is constant. Again, this might be related to the size of the reservoirs analyzed here and does not necessarily mean that wind speed does not influence Chla. Precipitation can have distinct effects on nutrient concentrations (Ho and Michalak 2020). Our analysis indicates that when the water volume is high, increased precipitation levels mean higher Chla (Wiegand et al. 2020), while for low water volumes, increased precipitation levels mean lower Chla. This, again, can be related to the climate season, as previously discussed.