2.2.2 Methods
1. Long short-term memory (LSTM) model
The LSTM is a type of recurrent neural network widely used for processing time-series data. Its forget gate and cell-state structure alleviate the long-term dependency problems of recurrent neural networks, such as vanishing and exploding gradients. The structure is shown in Fig. 3 (4).
\({\text{x}}_{t}\) denotes the value of the input sequence at time t. \({c}_{t}\) denotes the memory cell, or cell state, which is the core of the network and controls the transfer of information. \({i}_{t}\) denotes the input gate, which determines how much of the current input \({x}_{t}\) is retained in \({c}_{t}\). \({f}_{t}\) denotes the forget gate, which determines how much of the previous cell state \({c}_{t-1}\) is kept in the current \({c}_{t}\). \({o}_{t}\) denotes the output gate, which determines how much of \({c}_{t}\) is transmitted to the current output \({h}_{t}\). \({h}_{t-1}\) denotes the state of the hidden layer at time t-1. \({\sigma }\) denotes the sigmoid activation function. The synergy of the forget gate and the cell state filters important features and allows them to be transmitted over longer distances.
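As a minimal sketch of these gate computations (written in PyTorch with illustrative names; not the implementation used in this paper), a single LSTM cell step can be expressed as:

```python
import torch
import torch.nn as nn


class LSTMCellSketch(nn.Module):
    """Minimal LSTM cell illustrating the gate equations described above."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One linear map produces all four gate pre-activations at once.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([x_t, h_prev], dim=-1))
        i_t, f_t, g_t, o_t = z.chunk(4, dim=-1)
        i_t = torch.sigmoid(i_t)          # input gate: how much of the candidate enters c_t
        f_t = torch.sigmoid(f_t)          # forget gate: how much of c_{t-1} is kept
        o_t = torch.sigmoid(o_t)          # output gate: how much of c_t is exposed as h_t
        g_t = torch.tanh(g_t)             # candidate cell state
        c_t = f_t * c_prev + i_t * g_t    # cell state update
        h_t = o_t * torch.tanh(c_t)       # hidden state
        return h_t, c_t
```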
2. Autoregressive recurrent network (DeepAR) model
The DeepAR model (Fig. 3 (3)) uses an autoregressive recurrent network architecture consisting of multilayer LSTM components. The inputs to the model are the time-series values \({\text{x}}_{t}\) up to time t and the covariates \({\text{z}}_{t+w}\) at time t+w. The time-series data and covariates are concatenated and fed into the LSTM layer. The output of the LSTM layer is fed to two dense layers: one acts as an affine function to give the mean \({\mu }\), and the other acts as an affine transformation to generate the standard deviation \({\sigma }\). For the standard deviation, the output of the dense layer is passed through a SoftPlus layer to ensure positive values. Finally, the mean and standard deviation are the inputs to the Gaussian likelihood model used to generate the samples. In this paper, we use the mean and standard deviation to parameterize the Gaussian likelihood \({\theta }_{t+w}=(\mu ; \sigma )\). The likelihood \(l\left({z}_{t+w}\right|{\theta }_{t+w})\) is calculated, and its median is taken as the final output \({\widehat{y}}_{t+w}\).
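A minimal sketch of the output head described above (PyTorch, with illustrative layer names; not the authors' implementation) is:

```python
import torch
import torch.nn as nn


class GaussianOutputSketch(nn.Module):
    """Illustrative DeepAR-style output head mapping an LSTM state to (mu, sigma)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mu_layer = nn.Linear(hidden_size, 1)      # affine map to the mean
        self.sigma_layer = nn.Linear(hidden_size, 1)   # affine map to the pre-scale
        self.softplus = nn.Softplus()                  # keeps sigma strictly positive

    def forward(self, lstm_output):
        mu = self.mu_layer(lstm_output).squeeze(-1)
        sigma = self.softplus(self.sigma_layer(lstm_output)).squeeze(-1)
        # Gaussian likelihood parameterized by (mu, sigma); samples and their
        # median are drawn from this distribution.
        return torch.distributions.Normal(mu, sigma)
```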
3. Light gradient boosting machine (LightGBM) model
LightGBM is a model based on the gradient boosting decision tree (GBDT), proposed by Microsoft in 2017. Like other boosting algorithms, it combines weak learners into a strong learner. The computational time of traditional GBDT algorithms is largely consumed by the construction of decision trees, which requires finding the optimal split points. The usual approach is to sort the feature values and then enumerate all possible split points, which is time-consuming and memory-intensive. The LightGBM algorithm instead uses a histogram-based algorithm: it divides the continuous feature values into several intervals (bins) and selects the split points among those intervals. As a result, it outperforms the standard GBDT algorithm in terms of training speed and memory efficiency. Moreover, because each decision tree is a weak classifier, the histogram algorithm has a regularization effect that helps prevent overfitting. For tree growth, the LightGBM algorithm uses a leaf-wise strategy; compared with the traditional level (depth)-wise approach, leaf-wise growth achieves a larger loss reduction for the same number of leaf splits.
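In the LightGBM library, the histogram binning and leaf-wise growth described above correspond to the `max_bin` and `num_leaves` parameters. The following sketch uses synthetic data and illustrative hyperparameters, not the settings used in this study:

```python
import numpy as np
import lightgbm as lgb

# Synthetic example only; hyperparameters are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=1000)

params = {
    "objective": "regression",
    "max_bin": 255,        # histogram algorithm: continuous features bucketed into intervals
    "num_leaves": 31,      # leaf-wise growth is bounded by the leaf count, not the depth
    "learning_rate": 0.05,
}
train_set = lgb.Dataset(X, label=y)
model = lgb.train(params, train_set, num_boost_round=100)
pred = model.predict(X[:5])
```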
4. Temporal convolutional network (TCN) model
The TCN builds on 1D convolution by using causal convolution and dilated convolutions. The causal convolution ensures that the output at each time step depends only on current and past inputs, while the dilated convolution expands the receptive field of the model, allowing the network to learn more temporal features with fewer layers. Residual skip connections are also used to address the degradation problem of deep networks.
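A minimal PyTorch sketch of a causal dilated 1D convolution is shown below (illustrative only; causality is obtained by padding on the left so no future values are seen):

```python
import torch
import torch.nn as nn


class CausalDilatedConv1d(nn.Module):
    """1D convolution made causal by left-padding, with a configurable dilation rate."""

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        # Pad only on the left so the output at time t never sees inputs after t.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return self.conv(x)


# Receptive field of a single dilated layer: (K - 1) * d + 1
kernel_size, dilation = 3, 4
print((kernel_size - 1) * dilation + 1)          # -> 9
```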
The structure of the TCN is shown in Fig. 3 (2) and consists of two main parts: the causal/dilated convolution and the residual block. The left-hand side of Fig. 3 (2) shows the causal and dilated convolution in the TCN architecture, where the value at time t in each layer depends only on the values at times t, t-1, ... of the previous layer, reflecting the property of causal convolution. In addition, each layer extracts information from the previous layer by skipping over positions, and the dilation rate increases exponentially by a factor of 2 with depth, reflecting the characteristics of dilated convolution. Because dilated convolution is used, each layer is padded (usually with zeros). The receptive field size of a dilated convolution is \((\text{K}-1)\text{d}+1\), so increasing either K or d enlarges the receptive field. In Fig. 3 (2), d takes the values 1, 2, and 4 in turn, and K is 3. The right-hand side of Fig. 3 (2) shows the residual block in the TCN architecture, calculated as Eq. (1):
$$H\left(x\right)=F\left(x\right)+x \quad \left(1\right)$$
The input undergoes two rounds of dilated convolution, weight normalization, activation, and dropout, forming F(x) in the residual function, while the input is also passed through a 1×1 convolution to serve as x in the shortcut connection. The two outputs are summed to give the final output.
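A minimal sketch of such a residual block in PyTorch, assuming illustrative channel sizes and a ReLU activation, might look like:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm


class TCNResidualBlockSketch(nn.Module):
    """Residual block as described above: two rounds of (dilated conv, weight norm,
    activation, dropout) form F(x); a 1x1 convolution on the input forms the shortcut x."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int, dropout: float = 0.2):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        self.branch = nn.Sequential(
            nn.ConstantPad1d((pad, 0), 0.0),   # left-pad to keep the convolution causal
            weight_norm(nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.ConstantPad1d((pad, 0), 0.0),
            weight_norm(nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.shortcut = nn.Conv1d(in_ch, out_ch, kernel_size=1)   # the x of the shortcut

    def forward(self, x):                                         # x: (batch, in_ch, time)
        return self.branch(x) + self.shortcut(x)                  # H(x) = F(x) + x
```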
5. Deep temporal convolutional network (DeepTCN) model
DeepTCN modifies the residual structure of the TCN and adopts an encoder-decoder mechanism. The encoder consists of residual blocks and is responsible for building stacked dilated causal convolutional networks that capture long-term temporal dependencies. The decoder is a variant of the residual block (referred to as ResNet-v) designed to integrate the output of the stochastic process with historical observations and future covariates. Finally, an output dense layer maps the output of ResNet-v to the final prediction.
The left-hand side of Fig. 3 (5) shows the residual block of the encoder, where the input \({x}_{t}\) first passes through a dilated causal convolutional layer and a batch normalization layer, repeated twice (with a ReLU activation layer in between). A residual operation is then performed with the original input \({x}_{t}\), and the result is finally passed through a ReLU activation layer and fed to the decoder residual block. The batch normalization layer is designed to provide a stable distribution of activation values during training (Ioffe and Szegedy, 2015), which speeds up convergence and shortens the training process of the model. The right-hand side of Fig. 3 (5) shows the decoder module ResNet-v, where \({\text{z}}_{t+w}\) is the future covariate. A dense layer and batch normalization are first applied to project the future covariates, followed by a ReLU activation; the result is then processed by another dense layer and batch normalization. The processed result is combined with the output of the encoder in a residual operation. Finally, after a ReLU activation, the output of the decoder residual block is mapped by the output dense layer to produce the probabilistic prediction, and its median \({\widehat{y}}_{t+w}\) is taken as the final output.
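As an illustrative sketch (not the authors' code) of how the decoder projects the future covariates and combines them with the encoder output in a residual operation:

```python
import torch
import torch.nn as nn


class ResNetVSketch(nn.Module):
    """Illustrative decoder variant: project future covariates through
    dense + batch norm (+ ReLU) + dense + batch norm, then add the encoder
    output as a residual before the final activation."""

    def __init__(self, covariate_size: int, hidden_size: int):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(covariate_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
        )

    def forward(self, z_future, encoder_out):    # shapes: (batch, covariate_size), (batch, hidden_size)
        return torch.relu(self.project(z_future) + encoder_out)
```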
6. Transformer model
The Transformer (Fig. 3 (1)) discards the traditional CNN and RNN structures; the entire network is built from the attention mechanism. More precisely, the Transformer consists solely of self-attention and feed-forward neural network layers. By using the attention mechanism, the model reduces the distance between any two positions in the sequence to a constant, which effectively addresses the long-term dependency problem in time-series tasks. Furthermore, since it is not an RNN-like sequential structure, it offers better parallelism and fits well with existing GPU frameworks.
The encoder consists of a position encoding layer and a stack of self-attention and fully connected feed-forward sublayers. Each sublayer is followed by a residual connection and layer normalization. Specifically, because the model contains no recurrent or convolutional structure, positional encoding vectors computed with sine and cosine functions are added element-wise to the input vectors so that the model can exploit the order of the sequence. The encoded vector is fed into the self-attention sublayer and then the fully connected feed-forward sublayer, and the final output is fed into the decoder module. The decoder consists of a position encoding layer and a stack of self-attention, multi-head attention, and fully connected feed-forward sublayers. Again, each sublayer is followed by a residual connection and layer normalization. In the prediction task, the input to the decoder consists of two parts: the output \({\widehat{y}}_{t+w-1}\) of the decoder at the previous moment and the output of the encoder. After position encoding, \({\widehat{y}}_{t+w-1}\) is input to the self-attention sublayer, then to the multi-head attention sublayer together with the output of the encoder, and finally to the output layer via the fully connected feed-forward sublayer. The output layer is a dense layer with a softmax activation function, which maps the output of the last decoder layer to the target time series. In addition, we use a look-ahead mask and a single-position offset between the decoder input and the target output to ensure that the prediction of each time-series data point depends only on the previous data points.
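The sine/cosine positional encoding and the look-ahead mask can be sketched as follows (illustrative PyTorch code, assuming a (batch, time, d_model) input layout; not the configuration used in this paper):

```python
import math
import torch


def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sine/cosine positional encodings added to the inputs to convey sequence order."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


def look_ahead_mask(seq_len: int) -> torch.Tensor:
    """Upper-triangular boolean mask so the prediction at step t attends only to steps <= t."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)


x = torch.randn(1, 24, 64) + positional_encoding(24, 64)   # add encodings to a (batch, time, d_model) input
mask = look_ahead_mask(24)                                  # supplied as the attention mask in the decoder
```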