Advanced Hybrid Machine Learning Algorithms for Multistep Lake Water Level Forecasting


Random Tree (RT) and Iterative Classifier Optimizer (ICO, based on an Alternating Model Tree (AMT) regressor) machine learning (ML) algorithms, coupled with the Bagging (BA) or Additive Regression (AR) hybrid algorithms, were applied to forecast multistep-ahead (up to three months) water levels (WL) of Lake Superior and Lake Michigan. The partial autocorrelation function (PACF) of each lake's WL time series identified the most important lag times — up to five months in both lakes — as potential inputs. The WL time series was partitioned into training (1918 to 1988) and testing (1989 to 2018) subsets for model building and evaluation, respectively. The developed algorithms were validated using statistical and visual metrics on the testing data. Although both hybrid ensemble algorithms improved the individual ML algorithms' performance, the BA algorithm outperformed the AR algorithm. As a novel model in forecasting problems, the ICO algorithm was shown to have great potential for generating robust multistep lake WL forecasts.

In this study, we selected Lakes Michigan and Superior to test the developed models on WL series with low and high variations. Note that the amplitude of WL variations is smaller for Lake Superior (the largest lake) than for the other lakes, while the WL variations in Lake Michigan are high compared to the other Great Lakes. Monthly WL data for Lakes Michigan and Superior over the period 1918 to 2018 were obtained from the U.S. National Oceanic and Atmospheric Administration (NOAA). The WL time series was partitioned into two subsets: training (70%; 1918 to 1988) and testing (30%; 1989 to 2018). It is worth noting that there is no 'rule of thumb' for partitioning train/test sets. However, previous studies (Palani et al., 2008; Barzegar et al., 2016) generally agree that the testing subset must employ data never used in the training phase and that these data should represent approximately 10-40% of the size of the training set; the 70:30 partitioning approach is often used for training and testing machine learning models (Huang and 2020) and was thus adopted here. Summary statistics of the WL data for these subsets (Table 1) indicate that the data are light-tailed relative to a normal distribution: the kurtosis for all subsets, except the training data for Lake Superior, was negative, and the left side is the longer tail of the distribution.
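The chronological 70:30 partitioning described above can be sketched as follows (a minimal NumPy illustration with placeholder data; the function name and values are hypothetical, not from the study):

```python
import numpy as np

def chronological_split(series, train_frac=0.7):
    """Split a time series into chronological train/test subsets.

    Unlike a random split, this preserves temporal order, so the test
    subset contains only data never seen during the training period.
    """
    n_train = int(len(series) * train_frac)
    return series[:n_train], series[n_train:]

# Hypothetical example: 101 years of monthly water levels (1918-2018).
monthly_wl = np.arange(101 * 12, dtype=float)  # placeholder values
train, test = chronological_split(monthly_wl, train_frac=0.7)
```

With a 1212-month record, this yields 848 training and 364 testing observations, and every test observation postdates the training period.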

Base machine learners
In the present study, we used two ML techniques, RT and ICO, to develop lake water level forecasting models. Each of these algorithms is described as follows:

Random Tree (RT) 129
The Random Tree (RT) model is a supervised learning algorithm that considers 'K' randomly selected input variables when performing a split at each node in an unpruned decision tree. RT can be used for both classification and regression problems and is the foundation of the Random Forest algorithm (Breiman, 2001). As a fast learning algorithm, RT can achieve high accuracy even in complex modeling situations (e.g., when dynamic changes in the environment must be considered; Mu et al., 2017). RT's strengths stem from its randomization when selecting input variables to split nodes, which is not considered in 'standard' decision trees. Section 4 includes the RT hyper-parameters and their optimal values (determined via trial-and-error).
According to Drmota and Gittenberger (1997), if T is a plane rooted tree with n nodes, considered as a member of a family (class) of trees, the profile of the tree may be described by the number of nodes or the number of leaves. For a class 𝒯 of trees T (T ∈ 𝒯) consisting of n nodes, the weight function ω(T) can be written as (Drmota and Gittenberger, 1997):

ω(T) = ∏_{k≥0} φ_k^{D_k(T)}    (1)

where (φ_k; k ≥ 0) are non-negative numbers and D_k(T) is the number of nodes v ∈ T with out-degree k. Moreover, set y_n = ∑_{|T|=n} ω(T). Then the corresponding generating function (GF) y(x) = ∑_{n≥0} y_n x^n satisfies the functional equation

y(x) = x φ(y(x)),

where φ(t) = ∑_{k≥0} φ_k t^k, thereby equipping the sets 𝒯_n = {T ∈ 𝒯 : |T| = n} with a probability distribution induced by the weight function ω(T) (Drmota and Gittenberger, 1997).
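The node-level input randomization that distinguishes RT from a 'standard' decision tree can be illustrated with scikit-learn, whose DecisionTreeRegressor accepts a max_features argument that plays the role of 'K' (a sketch on synthetic lagged-style inputs, not the study's data or software):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Hypothetical design matrix: columns stand for WL(t), WL(t-1), ..., WL(t-4).
X = rng.normal(size=(200, 5))
y = 0.8 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.05, size=200)

# An unpruned tree that considers K=2 randomly chosen inputs at each split,
# approximating the Random Tree behaviour described above.
rt = DecisionTreeRegressor(max_features=2, random_state=0).fit(X, y)
train_score = rt.score(X, y)  # unpruned trees typically fit training data closely
```

Because the tree is unpruned, the training fit is near-perfect; generalization is assessed on held-out data, as done in the study's testing phase.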

Iterative Classifier Optimizer (ICO)
The Iterative Classifier Optimizer (ICO) is used to identify the optimal number of iterations for a given base iterative classifier/regressor through cross-validation and a user-defined performance metric (Saad, 2018). At first, the base classifier/regressor is specified, along with any additional model parameters. Afterwards, ICO tests the base method for different iteration numbers and evaluates the pre-defined performance metric for each iteration number. ICO has a 'look ahead' mechanism that tests the base method at a higher number of iterations to improve the algorithm's ability to locate an optimal iteration number. The number of iterations that results in the best cross-validation performance is selected as the optimal number of iterations. In this study, Alternating Model Trees (Frank et al., 2015) are used as the base regressor, as briefly described below.
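The iteration search performed by ICO can be sketched as a cross-validated scan over iteration counts. Here scikit-learn's GradientBoostingRegressor stands in for an iterative base regressor (AMT is not available in scikit-learn), and the 'look ahead' mechanism is omitted for brevity; data and the candidate grid are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.1, size=150)

# ICO-style search (sketch): evaluate the base iterative regressor at an
# increasing number of iterations and keep the best cross-validated score.
best_iters, best_score = None, -np.inf
for n_iter in (10, 25, 50, 100):
    model = GradientBoostingRegressor(n_estimators=n_iter, random_state=0)
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_iters, best_score = n_iter, score
```

The iteration count with the best cross-validation score plays the role of ICO's 'optimal number of iterations'.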

Alternating Model Trees (AMT)
Alternating Model Trees (AMT) are a class of regression trees that are grown using boosting, synthesizing an ensemble of trees in a single tree structure (Frank et al., 2015). In AMT, there are two types of nodes: splitter and prediction nodes. At each splitter node, two prediction nodes are generated. The splitting variable is selected by considering all input variables and splitting on the median value of the data that reaches that node. Afterwards, two linear regression models are fit on the subsets resulting from the split. The input variables used in the linear regression models are those that minimize the squared error. The two linear regression models are then placed at the prediction nodes attached to the recently created splitter node. When generating the prediction nodes at a given splitter node, it is possible to use two different input variables in the two linear regression models. To obtain the final prediction via AMT, the individual predictions made at each visited prediction node are multiplied by a shrinkage parameter (used to control overfitting) and summed together. The number of iterations (equivalent to the number of splitter nodes) and the shrinkage parameter must be set by the user. As noted in Section 3.1.2, ICO can be used to optimize the number of iterations in AMT. Additional details on AMT can be found in Frank et al. (2015).
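A single AMT-style splitter node can be sketched as follows: split on the median of one input variable and fit a linear regression model on each side, the two fitted models playing the role of the attached prediction nodes. This is a simplified illustration on synthetic data; real AMT selects the splitting variable and regression inputs by squared-error minimization and grows many such nodes whose outputs are shrunk and summed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(100, 2))
# Piecewise-linear target: a different linear rule on each side of X0 = 0.5.
y = np.where(X[:, 0] < 0.5, 2.0 * X[:, 1], 1.0 - X[:, 1])

# One splitter node: split on the median of input 0 (hypothetical choice).
split_var = 0
threshold = np.median(X[:, split_var])
left = X[:, split_var] <= threshold
right = ~left

# Two prediction nodes: a linear model on each subset of the split.
left_model = LinearRegression().fit(X[left], y[left])
right_model = LinearRegression().fit(X[right], y[right])

pred = np.where(left, left_model.predict(X), right_model.predict(X))
mse = float(np.mean((pred - y) ** 2))
```

Even this single splitter node captures most of the piecewise-linear structure, which is why stacking many such nodes with shrinkage (as AMT does) yields a flexible regressor.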

Hybrid algorithms 174
To develop hybrid models, two commonly used algorithms, namely AR and BA, were applied. Details on these algorithms are given as follows:

Additive Regression (AR)
At each iteration of the Additive Regression (AR) method adopted here (Friedman, 2002), the standalone algorithm is fit to the residuals from the previous iteration, i.e., AR is a gradient boosting method (Friedman and Meulman, 2003). At the final iteration, all predictions are added to generate the final prediction (Frank et al., 2016). The AR algorithm can be represented as (Friedman, 2002):

F(x) = ∑_{m=0}^{M} β_m h(x; a_m)    (2)

where h(x; a_m) is the standalone model output considering inputs x and model parameters a_m at iteration m (m = 0, …, M), where M is the number of iterations (after the initial model is fit at m = 0); β_m is a set of expansion coefficients at iteration m; and F(x) is the AR algorithm output. The standalone and AR algorithms' parameters (a_m and β_m) are fit in a stepwise manner (i.e., each set of parameters is estimated at a particular iteration) (Friedman, 1999). At each iteration of the AR algorithm, an 'intermediate' AR model output (F_m(x)) is generated (Friedman, 2002).
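The stepwise residual-fitting scheme described above can be sketched as a small wrapper around any base regressor (a hypothetical class on synthetic data; WEKA's AdditiveRegression implementation differs in detail, e.g., in how the expansion coefficients are handled):

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor

class AdditiveRegression:
    """Sketch of AR: each iteration fits a clone of the base learner to the
    residuals left by the shrunken sum of all previous iterations."""

    def __init__(self, base, n_iterations=10, shrinkage=1.0):
        self.base = base
        self.n_iterations = n_iterations
        self.shrinkage = shrinkage

    def fit(self, X, y):
        self.models_ = []
        residual = y.astype(float)
        for _ in range(self.n_iterations):
            m = clone(self.base).fit(X, residual)
            residual = residual - self.shrinkage * m.predict(X)
            self.models_.append(m)
        return self

    def predict(self, X):
        return self.shrinkage * sum(m.predict(X) for m in self.models_)

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 2))
y = np.sin(X[:, 0]) + 0.2 * X[:, 1]
ar = AdditiveRegression(DecisionTreeRegressor(max_depth=2),
                        n_iterations=20, shrinkage=0.5).fit(X, y)
train_r2 = float(1 - np.mean((ar.predict(X) - y) ** 2) / np.var(y))
```

The residual shrinks at every iteration, so the summed prediction converges toward the target on the training data, which is the boosting behaviour the equation expresses.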

Bagging (BA)
Bagging (BA) is a robust ensemble method that is often used to improve model accuracy. Drawing random samples from the training data (input-target variable pairs) with replacement (also referred to as bootstraps), BA creates a model for each sample of the data (Bauer and Kohavi, 1999). After each model has been trained, an ensemble prediction can be made for new input data by taking the mean of the predictions across all trained models (also known as ensemble members). Since each training set is made by creating a bootstrap replicate of the original training set, this method can weaken the defects of poorly performing ensemble members and increase the overall accuracy of the ensemble (Breiman, 1996). Despite its simplicity, BA has been shown to significantly reduce the out-of-sample mean square error (MSE) when applied to classification and regression trees on a wide range of datasets (Breiman, 1996) and is equally applicable to any base algorithm. Coupling RT with BA yields a Random Forest (RF) model.
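A minimal bagging sketch with scikit-learn (its BaggingRegressor draws bootstrap replicates, trains one member per replicate, and averages member predictions; its default base learner is a decision tree, so the result is close to a Random Forest, as noted above; data and settings are illustrative):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# 25 ensemble members, each trained on a bootstrap replicate of (X, y);
# predictions for new inputs are the mean across members.
bagged = BaggingRegressor(n_estimators=25, bootstrap=True,
                          random_state=0).fit(X, y)
train_r2 = bagged.score(X, y)
```

Averaging over bootstrap-trained members is what reduces the variance contribution to the out-of-sample error.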
The KGE can be written as:

KGE = 1 − √[(r − 1)² + (α − 1)² + (β − 1)²]

where N is the number of data points; Mi, Fi, Mm, and Fm are the observed, forecasted, mean observed, and mean forecasted data, respectively; r is the Pearson correlation coefficient between the forecasts and observations; α is the ratio between the variance of the forecasts and the variance of the observations; and β is the ratio of the means of the simulations and observations. KGE ranges from −∞ to 1, with an ideal value of 1. Additionally, the performance of the models was visualized using different plots, including comparative time series plots, bar graphs, spider diagrams, and box-swarm plots.

The first input combination comprised the variable with the highest correlation coefficient [i.e., WL(t)] (Combination 1). The second input combination was constructed by adding the variable with the next highest correlation coefficient [i.e., WL(t-1)] to the previous input. This procedure was continued by successively adding the next most highly correlated of the remaining input variables, until the variable with the lowest correlation coefficient [i.e., WL(t-5) in WL(t+1) forecasting for both lakes] was added to construct input combination 6 (Tables 4 and 5).

Table 2 Correlation matrix heatmap of the input candidates and outputs of the models for Lake Michigan
Table 3 Correlation matrix heatmap of the input candidates and outputs of the models for Lake Superior

The training-set RMSE values for the input combinations are presented in Table 6, where the lowest training-set RMSE is highlighted to indicate the optimal input variable combination. Input combination No. 2, including WL(t) and WL(t-1), resulted in the lowest RMSE for RT-based WL(t+1) and WL(t+2) forecasting in Lake Michigan, while input combination No. 1, including WL(t), resulted in the lowest error for WL(t+3) forecasting. Similar to Lake Michigan, input combination No. 2 provided the minimum RMSE for RT-based WL(t+1) forecasting for Lake Superior, whereas input combination No. 4, including variables WL(t), WL(t-1), WL(t-2), and WL(t-3), and combination No. 6, including variables WL(t), WL(t-1), WL(t-2), WL(t-3), WL(t-5), and WL(t-4), obtained the lowest RMSE for RT-based WL(t+2) and WL(t+3) forecasting, respectively, for Lake Superior. The most effective variable(s) for RT-based forecasting models differed between the two lakes. For the ICO-based models, input combination No. 6 provided the minimum RMSE for all time horizons in both lakes.
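The successive input-combination procedure can be sketched as follows, assuming a synthetic WL-like series; the lag sets mirror Combinations 1-6 for WL(t+1) forecasting, and the combination with the lowest training RMSE is retained (model and settings are illustrative, not the study's configuration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def make_lagged(series, lags):
    """Design matrix of the given lags (lag 0 = WL(t)); target is WL(t+1)."""
    max_lag = max(lags)
    X = np.column_stack(
        [series[max_lag - L : len(series) - 1 - L] for L in lags])
    y = series[max_lag + 1 :]
    return X, y

rng = np.random.default_rng(4)
wl = np.cumsum(rng.normal(size=400))  # hypothetical persistent WL-like series

# Nested combinations: {WL(t)}, {WL(t), WL(t-1)}, ..., up to lag 5.
combos = [list(range(k + 1)) for k in range(6)]
rmse = {}
for c in combos:
    X, y = make_lagged(wl, c)
    model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
    rmse[tuple(c)] = float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))

best_combo = min(rmse, key=rmse.get)
```

In the study this selection was repeated per model type, forecast horizon, and lake, which is why the optimal combination differs across those cases.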

After identifying the most effective input variables for each multistep-ahead WL forecast for each lake, the selected input variables served as the models' inputs and the models' optimal hyper-parameters were calibrated. The WEKA software was used to determine the optimal values through a trial-and-error process, the most popular method for this software (de Lima et al., 2014). The hyper-parameters' optimal values are listed in Table 7. The models were developed using those optimal hyper-parameters and then validated using the testing data. To develop the standalone models, the single ICO (i.e., based on AMT as an iterative regressor) and RT algorithms were separately calibrated using the training data and afterwards validated using the testing (unseen) data. For the hybrid models, the ICO and RT models were used as the base regressors of the AR and BA ensembles for construction and evaluation of the models. The same training and testing datasets were used for the AR-RT, AR-ICO, BA-RT, and BA-ICO ensemble models. In the next section, the performance of the different models is presented and discussed.

* Predictors used in the models (see Tables 4, 5, and 6).
** When AMT is coupled with ICO, the Number of Iterations value(s) listed in the table were found to be optimal by ICO.
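In code, the trial-and-error hyper-parameter calibration can be approximated by enumerating candidate values and keeping the best cross-validated combination (an illustrative grid and data; the study's tuning was done manually in WEKA rather than with an automated search):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 2))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=150)

# Enumerate candidate hyper-parameter values and retain the combination
# with the best cross-validated score, mimicking trial-and-error tuning.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]},
    cv=3,
).fit(X, y)
best_params = grid.best_params_
```

The selected values then play the role of Table 7's optimal hyper-parameters when the final model is refit on the full training set.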

Results
Two different standalone models (RT and ICO) and their corresponding hybrid versions, developed using the AR and BA algorithms and drawing on their optimal input variable combinations, were tested and their performance evaluated using quantitative metrics and visual methods.

Comparative time series and scatter plots of measured vs. forecasted WL values for WL(t+1), WL(t+2), and WL(t+3) forecasting for Lake Michigan are shown in Fig. 3 and Figs. S1-S2 (in the supplementary material). For test-phase WL(t+1) forecasts (Fig. 3), the BA-ICO model was the most accurate (R² = 0.979), while the standalone RT model showed greater scatter between measured and forecasted WLs (R² = 0.965). For WL(t+2) forecasting (Fig. S1), the BA-ICO (R² = 0.876) and AR-ICO (R² = 0.877) models produced the best WL forecasts, compared to the poorer predictions and greater scatter of the RT model (R² = 0.854). The scatter plots of the forecasted WL(t+3) and the equivalent measured WLs for Lake Michigan showed the ICO and AR-ICO (R² = 0.77) and BA-ICO (R² = 0.768) models to have obtained the highest accuracy and the RT model (R² = 0.722) to have provided the lowest accuracy in their WL forecasts. All developed models, for all time horizons for Lake Michigan, can be categorized as having very good performance (0.7 < R² ≤ 1.0; Moriasi et al., 2007).
For Lake Superior, the measured vs. forecasted WL(t+1), WL(t+2), and WL(t+3) scatter plots, presented in the supplementary material (Figs. S3-S5), show, based on R², that for all steps the forecasted WLs obtained using the ICO and its hybrid counterparts with the AR and BA algorithms had higher accuracy than the other models. Moreover, in the test phase, WLs forecast by the RT model showed greater scatter compared to the other models, along with the poorest fitted linear regression equation for all steps ahead. All models developed for WL(t+1) forecasting for Lake Superior showed a very good performance (0.7 < R² ≤ 1.0), while for WL(t+2) the RT models performed satisfactorily (0.5 < R² ≤ 0.6) and the other developed models well (0.6 < R² ≤ 0.7). For WL(t+3), all developed models had an unsatisfactory performance (0.0 < R² ≤ 0.5); however, the ICO-based models (single ICO and hybrid BA-ICO and AR-ICO) had much better performance compared to the RT models. The comparative plots for both lakes illustrate that coupling a standalone ML model (i.e., RT or ICO) with BA or AR results in capturing the WL extremes (both minimum and maximum) much better than the standalone models. Moreover, the scatter plots showed that the WLs forecasted using the hybrid models are much closer to the 1:1 line (i.e., forecasted WL = measured WL) than those of the standalone models for both lakes.

The error metrics (i.e., RMSE and MAE) for the developed Lake Michigan WL models are shown in Fig. 4a,b. For WL(t+2) and WL(t+3) forecasting, the AR-ICO and BA-ICO hybrid models, along with the standalone ICO, provided the lowest error, indicating the ICO-based approach to be the most efficient for Lake Michigan WL forecasting. Although the AR and BA algorithms could not improve the ICO models in forecasting two- and three-months-ahead WLs of Lake Michigan, they could enhance the accuracy of the RT models.
The percentage errors of the WL forecasting models, in terms of RMSPE and MAPE, are provided in Table 8. Based on these model accuracy statistics, the BA-ICO and AR-ICO showed no improvement over their corresponding standalone ICO models for WL(t+2) and WL(t+3) forecasting for Lake Michigan, but did so for WL(t+1) forecasting. The greatest improvement belonged to the AR-RT for WL(t+1) forecasting in Lake Michigan, which outperformed all other developed models (Fig. 4d).

Models' errors for Lake Superior WL forecasting show that, similarly to Lake Michigan, the RMSE and MAE of the individual ICO model were lower than those of the RT model for all time horizons (Fig. 5a,b). The BA-ICO model provided the best performance for WL(t+1) forecasting in terms of RMSE and MAE, while the BA-RT model was the best for WL(t+2) and WL(t+3) forecasting. Generally, the BA hybrid algorithm generates more accurate predictions than the AR hybrid algorithm.

The MAPE and RMSPE of the models developed for WL forecasting for Lake Superior (Table 8) show error reductions of up to 4% for the ICO-based WL forecasting models. The error reduction achieved with the BA algorithm was higher than that of the AR algorithm for the ICO-based model (i.e., up to 4% and 1% decreases in error based on the RMSPE and MAPE, respectively). These statistics showed that for Lake Superior, similar to WL(t+2) forecasting, for WL(t+3) forecasting the BA algorithm performs better than the AR algorithm, not only in improving the ICO models but also in improving the individual RT model. It is worth noting that the AR and BA algorithms not only improve model performance compared to the standalone models, but also reduce overfitting, which is a crucial issue when applying machine learning models to real-world data.
For Lake Superior, the BA-ICO model (0.973) showed the highest KTAU value for WL(t+1), while the highest KTAU values for WL(t+2) and WL(t+3) were 0.835 and 0.686 for the AR-ICO and BA-RT models, respectively, indicating the superior performance of the hybrid models for this lake's WL forecasting (Fig. 5c). Overall, based on the KTAU, the hybrid models performed better than their corresponding single models.
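The KGE and KTAU metrics used above can be computed as follows (a sketch; note that the standard KGE formulation uses the ratio of standard deviations for the variability term, and SciPy's kendalltau supplies Kendall's tau; the sample values are hypothetical):

```python
import numpy as np
from scipy.stats import kendalltau

def kge(obs, sim):
    """Kling-Gupta efficiency: r is the Pearson correlation, alpha the
    ratio of forecast to observed variability, beta the ratio of means.
    KGE ranges from -inf up to an ideal value of 1."""
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = np.array([183.1, 183.4, 183.2, 183.6, 183.5])  # hypothetical WLs (m)
perfect_kge = kge(obs, obs)          # ideal value of 1 for a perfect forecast
ktau, _ = kendalltau(obs, obs)       # rank agreement; 1 for a perfect forecast
```

A forecast that reproduces the observations exactly attains both metrics' ideal value of 1, which is the reference point for the model comparisons above.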
The KGE metric shows that AR-RT, with values of 0.950 and 0.760, outperforms the other models for WL(t+1) and WL(t+2) forecasting of Lake Superior, respectively, while the BA-RT (0.642) had the best performance among the standalone and hybrid models for WL(t+3) forecasting (Fig. 5d).

To further evaluate differences in the models' performance in forecasting WLs for Lakes Michigan and Superior, we present combined box-swarm plots (Fig. 6). Although the statistical metrics indicated that, generally, the BA-ICO outperformed the other developed models in forecasting WL(t+1) for Lake Michigan, the box plot also confirms that the BA-ICO performed better than the other developed models, and considerably better than the AR-ICO model, in capturing the minimum WL. The swarm plot also confirmed the superiority of the BA-ICO over the AR-ICO in a detailed comparison. The main differences among Lake Michigan's models were attributable to capturing the minimum values, particularly for the three-months-ahead forecasts. Swarm patterns for the different models developed for Lake Michigan vary only slightly, indicating the low variability of the models' errors. However, it is evident that the hybrid models could reproduce the measured WLs better than the standalone models. For Lake Superior, the box plots illustrate that the developed models performed differently for the different steps-ahead WL forecasts. Moreover, outlier WL forecasts could be seen for this lake. The plots showed that the ICO-based models could capture these extremes, along with the median WLs, much better than the RT-based models. Similar to Lake Michigan, the BA-ICO, followed by the BA-RT, outperformed the other developed models for Lake Superior WL forecasting.

Although both hybrid BA- and AR-based algorithms can improve the performance of individual ML algorithms, generally, the BA outperforms the AR.
However, the performance enhancement provided by the hybrid algorithms differed for the different forecast lead times, and further differed according to whether the AR or BA algorithm was employed at a particular lead time. For most steps ahead, the BA algorithm's performance was much better than that of the AR; however, the AR did perform better for some steps ahead. It is worth noting that, in generating multistep forecasts in hydrology, it is recommended to explore (and potentially use) different hybrid algorithms at different steps to obtain the best possible performance. Omran et al. (2016) applied seven standalone ML models (including regression tree models and a decision stump), and two hybrid methods (i.e., AR and BA) with each of the seven standalone models as base classifiers, to predict the compressive strength of concrete. They concluded that the hybrid algorithms improved prediction accuracy for the four regression tree models, but had less success with the other three ML predictive models. The current study concurs with the findings of Omran et al. (2016).

As a model newly applied to forecasting problems, the ICO algorithm using an iterative base regressor (e.g., AMT) is capable of generating accurate multistep WL forecasts. Moreover, the ICO algorithm was shown to be more accurate than the RT model in WL forecasting for the study sites. Differences in model performance were attributed to the different algorithms' structures. Currently, no published study has used ICO for hydrological forecasting, making comparisons difficult; however, Pak and Teh (2016) did employ several ML algorithms for classification purposes, including ICO and RT. Their study indicated that the ICO algorithm performed better than the RT in their particular application, as it did in the current study. Altunkayak (2014) developed standalone models as well as hybrid wavelet-FL and wavelet-MLP models.
Compared to the results of the current study, it is found that the models developed herein have higher accuracy, in terms of RMSE, than the standalone models in Altunkayak (2014). While the wavelet-based models appear to outperform the models presented herein, it is important to note that the wavelet-based models in Altunkayak (2014) were developed using an approach that is inappropriate for real-world forecasting applications (since the forecasts incorporate 'future data'). Other approaches have also been applied to forecast WLs in both lakes. However, as indicated in the literature, these models have several disadvantages relative to ML models (e.g., they are time consuming and rely on restrictive assumptions). The results of the hybrid models developed in this study show that they can be used as accurate alternatives to those models. Lake WL forecasting could also be carried out using mechanistic models, which focus on the causality of input-output relationships (Baker et al., 2018). Such models have frequently been used for rainfall-runoff, streamflow, and river water level modeling (Lees, 2000).

Conclusions
The accurate forecasting of lake water levels (WLs) is important for sustainable water resources planning and management. Given the nonlinear behaviour of WL fluctuations in lakes in response to driving factors (rainfall, evaporation, etc.), machine learning (ML) algorithms are especially useful for generating accurate forecasts of such processes. Drawing upon data collected from Lakes Superior and Michigan, the present research evaluated, for the first time, novel hybrid ensemble ML algorithms by investigating the feasibility of newly introduced methods to forecast WLs. The ability to forecast multistep-ahead [WL(t+1), WL(t+2), and WL(t+3)] monthly lake WLs using standalone ML algorithms (ICO, i.e., based on AMT as an iterative regressor, and RT), as well as ICO- and RT-based hybrid ensemble algorithms integrated with the AR and BA models, was assessed. Prior to the modeling process, input time series data analysis was carried out to identify important time lags. Then, in order to find the most effective input variables for each step-ahead lake WL forecast, different input combinations were constructed. The models developed drew on these input variables and optimal model hyper-parameters (determined by trial-and-error). The performance of the developed models was assessed in terms of statistical metrics: R², RMSE, MAE, RMSPE, and MAPE, as well as KTAU and KGE.
The statistics showed that the ICO-based models (single ICO and hybrid BA-ICO and AR-ICO) had better performance compared to the RT-based models (both single and corresponding hybridized models) in WL forecasting for both Lakes Michigan and Superior. All models developed for all time horizons for Lake Michigan had an excellent performance. In contrast, the performance of the models for Lake Superior varied between unsatisfactory and excellent.
All WL(t+1) forecasts for Lake Superior were excellent, whereas the standalone RT and the corresponding hybrid algorithms (BA-RT and AR-RT) showed unsatisfactory performance for WL(t+3) forecasting.
Overall, hybrid ML techniques provided forecasts of greater accuracy than did the standalone models. However, the performance enhancement of these hybrid algorithms differed for different forecast lead times, and their performance was also tied to whether the AR or BA algorithm was used. Based on the RMSPE and MAPE, the BA-ICO and AR-ICO improved their corresponding standalone ML models for WL(t+1), WL(t+2), and WL(t+3) forecasts of Lake Michigan WLs more so than the other developed hybrid models. For Lake Superior, the BA-ICO in WL(t+1) and WL(t+2) forecasting and the BA-RT in WL(t+3) forecasting improved their corresponding standalone ML models more so than the other developed hybrid models. Therefore, in order to obtain the best possible hydrological forecasts, examining different hybrid algorithms at all relevant forecast lead times is important. In general, the hybrid ensemble models provided better performance than the standalone models since they reduced forecast bias and variance by drawing on the strengths of multiple models. This study also demonstrated that the ICO algorithm, as a novel algorithm in forecasting problems, offers great potential to produce accurate multistep WL forecasts.

Acknowledgements
Funding for this research was provided by an NSERC Discovery Grant held by Jan Adamowski.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.