Effects of Training Data on the Learning Performance of LSTM Network for Runoff Simulation

With the rapid development of Artificial Intelligence (AI) technology, the Long Short-Term Memory (LSTM) network has been widely used for forecasting hydrological processes. To evaluate the effect of the amount of training data on the performance of LSTM, this study proposes an experiment scheme. First, the K-Nearest Neighbour (KNN) algorithm is employed to generate a 130-year meteorological data series based on the observed data, and the Soil and Water Assessment Tool (SWAT) model is used to obtain the corresponding runoff series from the generated meteorological series. Then, the 130 years of rainfall and runoff data are divided into two parts: the first 80 years for model training and the remaining 50 years for model verification. Finally, LSTM models are developed and evaluated, with the first 5, 10, 20, 40 and 80 years of the training series used as training data, respectively. The results obtained for the Yalong River, Minjiang River and Jialing River show that increasing the amount of training data can effectively reduce over-fitting of the LSTM network and improve its prediction accuracy and stability.


Introduction
The reliability of runoff prediction is of great significance to water resource management, but the evident non-linearity and randomness of the runoff process make runoff challenging to predict (Wang et al. 2021a, b; Rahimzad et al. 2021). The models available for runoff prediction fall into three categories: (1) conceptual models, which are simple and easy to use but give a less detailed description of runoff generation and flow concentration; (2) physical models, which enjoy a sound physical foundation but involve complicated calculation structures and require large amounts of basic data to build; and (3) data-driven models, which only need to establish the relationship between input and output data, without considering the hydrological mechanism of the basin, and can still achieve good simulations. Among the data-driven models, however, the conventional back propagation (BP) neural network converges too slowly and thus has limited learning ability. With the rapid development of deep learning over the years (Hinton and Salakhutdinov 2006), the learning abilities of Deep Neural Networks (DNN) have advanced by leaps and bounds, demonstrating strong performance across various industries. However, the traditional DNN cannot be used for time series modelling.
In order to learn the changing rules of time series, the Recurrent Neural Network (RNN) came into being. Building on the conventional RNN, the Long Short-Term Memory (LSTM) network was first proposed by Hochreiter and Schmidhuber (1997). Owing to its strong ability to learn time series data by digging out long-term dependence relationships, the LSTM has been widely used in many fields, such as image processing (Jiang and Liu 2021; Lin et al. 2021), transportation (Huang et al. 2021a, b), geological disaster relief (Dikshit et al. 2021), reference evapotranspiration estimation, financial economics (Dami and Esterabi 2021), and so on. Moreover, its applicability to hydrological prediction has been demonstrated by many researchers. For example, Xu et al. (2020) analyzed the performance of the LSTM network, the Soil and Water Assessment Tool (SWAT) model, the Xin'anjiang Model, a Multiple Regression Model and a Back-propagation Neural Network (BPNN), and concluded that the LSTM network has better learning abilities, except for the problem of over-fitting. Rahimzad et al. (2021) compared the accuracy of Linear Regression (LR), the Multilayer Perceptron (MLP), the Support Vector Machine (SVM), and the LSTM network in daily streamflow forecasting, and the results indicated that the LSTM is a robust data-driven technique for characterizing time series behaviors in hydrological modeling applications. All of the aforementioned research indicates that the LSTM network has a strong capability for simulating runoff. To improve the learning ability of the LSTM network, some scholars have conducted in-depth research (Wang et al. 2021a, b; Yin et al. 2021; Dilip 2021). However, there are few studies focusing on the issue of over-fitting and the effect of the training data amount on the performance of the LSTM model. At present, training a deep learning network requires big data as its foundation (Hu et al. 2018; Mao et al. 2021); however, the available hydrological data often fail to meet this need.
It is necessary to find out the appropriate training data amount for developing a stable model. Therefore, studying the influence of training data amount on the LSTM and finding the future breakthrough point is the key to the application of deep learning in the field of hydrology.

Experiment Scheme
In order to study the effect of the training data amount on the performance of the LSTM network, an experiment scheme is proposed in this study. Firstly, the K-Nearest Neighbour (KNN) algorithm (Lall and Sharma 1996) is used to derive a long rainfall data series. Secondly, the rainfall data are used as input to the SWAT model to simulate runoff data (which serve as the experimental "real" data). The rainfall and runoff data are then divided into training data and verification data, and the effect of the training data amount on the performance of the LSTM network is analyzed by varying that amount. Finally, the better parameter choices and data volume are identified, and the learning ability of the LSTM network is tested on that basis. The flow chart of this experiment scheme is shown in Fig. 1.

LSTM Network
The key parts of the LSTM network are a fully connected layer and LSTM cells. As shown in Fig. 2a, the LSTM network includes four kinds of layers: (1) the input layer, which receives the input series data; (2) the fully connected layer, which acts as a bridge between the input layer and the LSTM cell layer by transforming the dimension of the input data to the dimension of the LSTM cells; (3) the LSTM cell layer, which provides n cells with different memory capacities; and (4) the output layer, which passes on the output of the LSTM cells.
The LSTM cell structure is shown in Fig. 2b. There are two key states in the calculation of an LSTM cell: the Cell State (S_{t-1}) and the Hidden State (the output of time step t-1, y_{t-1}). In the hidden state, information can be added to or deleted from the cell state, which is controlled by the "Forget Gate", "Input Gate" and "Output Gate". The LSTM cell uses these gates to control the memory process so as to avoid the long-term dependence problem (Yu et al. 2019). The details of these gates are given below.
The "Forget Gate" f_t decides which cell states are forgotten; the extent of forgetting is determined by the output of a sigmoid network, whose value ranges over [0, 1], where 1 represents "keeping all" and 0 "forgetting all". The "Forget Gate" makes it possible to recall past information at the current time step, see Eq. (1). The "Input Gate" i_t determines the update of the cell state and consists of a sigmoid network and a tanh network. The sigmoid network determines how much "Hidden State" information is involved in the update, see Eq. (2); the output of the tanh network is a candidate vector, which determines what "Hidden State" information needs updating, see Eq. (3). Based on the outputs of the "Forget Gate" and "Input Gate", the cell state can be updated, see Eq. (4). In the "Output Gate" o_t, a sigmoid network is used to determine which information of the cell state is output, see Eq. (5). Finally, the output of the LSTM cell is given by Eq. (6).
$$f_t = \sigma(W_f \cdot [y_{t-1}, x_t] + b_f) \quad (1)$$
$$i_t = \sigma(W_i \cdot [y_{t-1}, x_t] + b_i) \quad (2)$$
$$\tilde{C}_t = \tanh(W_c \cdot [y_{t-1}, x_t] + b_c) \quad (3)$$
$$S_t = f_t \times S_{t-1} + i_t \times \tilde{C}_t \quad (4)$$
$$o_t = \sigma(W_o \cdot [y_{t-1}, x_t] + b_o) \quad (5)$$
$$y_t = o_t \times \tanh(S_t) \quad (6)$$

where σ represents the sigmoid function; W_f and b_f are the weight matrix and bias of the sigmoid network in the "Forget Gate"; W_i and b_i are those of the sigmoid network in the "Input Gate"; tanh is the hyperbolic tangent function, whose output ranges over (-1, 1); W_c and b_c are the weight matrix and bias of the tanh network; f_t × S_{t-1} is the information retained by the "Forget Gate", and i_t × C̃_t is the updated information from the "Input Gate"; W_o and b_o are the weight matrix and bias of the "Output Gate"; and y_t is the output of the LSTM cell.
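The gate calculations can be traced in a minimal NumPy sketch of a single LSTM cell step. The variable names follow Eqs. (1)-(6); the weight dictionaries `W` and `b` are hypothetical stand-ins for the trained sigmoid and tanh networks, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, y_prev, s_prev, W, b):
    """One LSTM cell step following Eqs. (1)-(6).

    Each gate network acts on the concatenation of the previous
    output y_{t-1} and the current input x_t.
    """
    z = np.concatenate([y_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # Eq. (1): forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # Eq. (2): input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # Eq. (3): candidate update
    s_t = f_t * s_prev + i_t * c_tilde       # Eq. (4): new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # Eq. (5): output gate
    y_t = o_t * np.tanh(s_t)                 # Eq. (6): cell output
    return y_t, s_t
```

Because o_t lies in (0, 1) and tanh(S_t) in (-1, 1), the cell output is always bounded in (-1, 1), which is why the fully connected layers are needed to map between data dimensions and cell dimensions.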
The LSTM network has two categories of parameters: neuron parameters and hyper-parameters. Neuron parameters, such as neuron weights and biases, are updated during model iterations, while hyper-parameters, namely time-step, batch-size and cell-size, vary in their sensitivity and need to be adjusted based on experience. In particular, the time-step parameter reflects the length of each data series: by learning a series, the network can extract periodic relationships within one time-step. The batch-size parameter is the number of data series in the training data packet extracted at one time, so the product of time-step and batch-size gives the amount of training data used in a single training step. The cell-size parameter is the number of neurons in a single hidden layer.
In the rainfall-runoff simulation application, rainfall data are used as the input of the LSTM network, and the number of input nodes equals the number of rainfall stations in the study basin. Runoff data are the output of the LSTM network, and the number of output nodes equals the number of hydrological stations in the study basin.
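To illustrate how the time-step and batch-size dimensions interact with the station counts, the following sketch slices daily rainfall and runoff series into `(batch_size, time_step, n_stations)` batches. The function name and array layout are illustrative assumptions, not the authors' code:

```python
import numpy as np

def make_batches(rainfall, runoff, time_step, batch_size):
    """Slice daily series into training batches.

    rainfall: (T, n_rain_stations) array of daily rainfall;
    runoff:   (T, n_hydro_stations) array of daily runoff.
    Returns arrays of shape (n_batches, batch_size, time_step, n_stations).
    """
    windows_x, windows_y = [], []
    # cut the series into non-overlapping windows of length time_step
    for start in range(0, len(rainfall) - time_step + 1, time_step):
        windows_x.append(rainfall[start:start + time_step])
        windows_y.append(runoff[start:start + time_step])
    x, y = np.stack(windows_x), np.stack(windows_y)
    # group consecutive windows into batches of batch_size, dropping leftovers
    n_batches = len(x) // batch_size
    x = x[:n_batches * batch_size].reshape(n_batches, batch_size, time_step, -1)
    y = y[:n_batches * batch_size].reshape(n_batches, batch_size, time_step, -1)
    return x, y
```

With time-step 80 and batch-size 10, each training step therefore consumes 800 days of data, which is the product described above.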

Performance Metrics
The Nash-Sutcliffe Efficiency coefficient (NSE) and the correlation coefficient R² are used to evaluate the simulation performance of the LSTM network. The NSE is calculated as:

$$\mathrm{NSE} = 1 - \frac{\sum_t (Q_{0,t} - Q_{m,t})^2}{\sum_t (Q_{0,t} - \bar{Q}_0)^2} \quad (7)$$

where Q_{0,t} represents the measured value at time t, Q_{m,t} the simulated value at time t, and Q̄_0 the mean of the measured values. The closer the NSE value is to 1, the higher the reliability of the model.
$$R^2 = \frac{\left[\sum_i (Q_{m,i} - \bar{Q}_m)(Q_{s,i} - \bar{Q}_s)\right]^2}{\sum_i (Q_{m,i} - \bar{Q}_m)^2 \sum_i (Q_{s,i} - \bar{Q}_s)^2} \quad (8)$$

where Q_{m,i} represents the measured data, Q_{s,i} the simulated data, Q̄_m the mean of the measured data, and Q̄_s the mean of the simulated data. The value range of R² is [0, 1]; the closer R² is to 1, the better the simulation.
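Both metrics can be written as short functions; this is a straightforward sketch of the two formulas (function names are illustrative):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - sum((obs-sim)^2) / sum((obs-mean(obs))^2)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def r_squared(obs, sim):
    """Squared Pearson correlation between observed and simulated series."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    num = np.sum((obs - obs.mean()) * (sim - sim.mean())) ** 2
    den = np.sum((obs - obs.mean()) ** 2) * np.sum((sim - sim.mean()) ** 2)
    return num / den
```

Note the asymmetry between the two: a simulation that is perfectly correlated with, but biased against, the observations still yields R² = 1, while its NSE can be far below 1 (even negative), which is why both metrics are reported.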

Overview of the Study Area
Three watersheds, the Minjiang River, Jialing River and Yalong River, are selected as the study area. Details of the three watersheds are given below. The Minjiang River, an important tributary of the upper Yangtze River, stretches from latitude 28°13′N to 33°39′N and from longitude 99°36′E to 104°37′E and covers a total drainage area of 135,400 km² (Yin et al. 2022). Except for a very small amount of snowmelt replenishment in spring along the trunk stream at and beyond Zhenjiangguan Town, the seasonal variation of the Minjiang River's runoff corresponds to its rainfall season. The annual average flow at the Minjiang estuary is 2850 m³/s, with an annual runoff of 90 billion m³. Rainfall data from 8 meteorological stations and runoff data from the Xiangjiaba hydrological station in the watershed are used in this paper.
The Jialing River is the largest tributary of the Yangtze River, stretching from latitude 29°18′N to 34°32′N and from longitude 102°31′E to 109°16′E, with a length of 1345 km and a drainage area of 160,000 km² (Yin et al. 2022). The main runoff sources are rainfall and groundwater, and the flood period lasts from June to September each year. Rainfall data from 12 meteorological stations and runoff data from the Shapingba hydrological station in the watershed are used in this paper.
The Yalong River, located at 26°32′~33°58′N and 96°52′~102°48′E (Liu et al. 2019), covers a total drainage area of approximately 136,000 km², with a length of 1571 km. The average annual flow is 1890 m³/s and the annual runoff is 59.6 billion m³. Half of the runoff comes from rainfall, with the rest from groundwater and snowmelt. Its annual runoff is abundant and stable, with little inter-annual variability. Rainfall data from 7 meteorological stations and runoff data from the Panzhihua hydrological station in the watershed are used in this paper.
The location of the three river basins and the distribution of their meteorological stations are shown in Fig. 3 below.

Generation of Training Data
In this study, the training data for the Minjiang River, Jialing River and Yalong River basins are generated using the SWAT model and the KNN algorithm. For each of the three basins, first, the observed daily meteorological data and runoff data from 1991 to 2016 are used to calibrate the SWAT model. Then, the KNN algorithm, a non-parametric, lazy-learning method, is used to derive 130 years of daily rainfall data. Finally, the calibrated SWAT model is used to simulate the 130-year daily runoff process in the three river basins.
The procedure for deriving data with the KNN model includes the following steps: (1) prepare the data (i.e., the observed daily meteorological data from 1991 to 2016); (2) select the parameter K, the number of neighbouring days to consider; (3) traverse the data set to extract the meteorological data of the neighbouring days; (4) calculate the distance between the extracted meteorological data and their mean value; (5) select a candidate day with a probability depending on that distance to generate the random meteorological data.
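The steps above can be sketched as a single resampling step. The decreasing kernel p(j) ∝ 1/j over the ranked neighbours is the standard choice in KNN weather generators following Lall and Sharma (1996); the function name and interface here are illustrative assumptions:

```python
import numpy as np

def knn_resample_day(history, current, k, rng):
    """One step of a KNN weather generator (steps (2)-(5) above).

    history: (T, n_vars) observed daily meteorological data;
    current: (n_vars,) state of the day being simulated.
    Finds the k nearest historical days to `current`, samples one with
    probability decreasing in its rank, and returns the weather of the
    day that followed it in the observed record.
    """
    dist = np.linalg.norm(history[:-1] - current, axis=1)  # step (4)
    neighbours = np.argsort(dist)[:k]                      # steps (2)-(3)
    ranks = np.arange(1, k + 1)
    prob = (1.0 / ranks) / np.sum(1.0 / ranks)             # step (5): p(j) ∝ 1/j
    chosen = rng.choice(neighbours, p=prob)
    return history[chosen + 1]                             # successor day
```

Iterating this step day by day, with each generated day fed back in as the new `current` state, produces an arbitrarily long synthetic series whose values are all resampled from the observed record, which is why the method is non-parametric.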
As a distributed watershed hydrological model, SWAT is often used to simulate long-term hydrological processes. The data required by the SWAT model mainly include the Digital Elevation Model (DEM), the digital river network, hydrometeorological data (temperature, precipitation, solar radiation, wind speed, relative humidity and streamflow), and land use and soil data. The model was calibrated using the SUFI-2 algorithm provided in the SWAT-CUP software; the key parameters are the same as those listed in Table 3 of Xu et al. (2020).
The 130 years of rainfall and runoff data are then divided into two parts: the first 80 years and the remaining 50 years. The former is used for model training and the latter for model verification. From the training data, the first 5, 10, 20, 40 and 80 years of rainfall data (input) and runoff data (output) are taken respectively to train the LSTM network, so as to analyze the effect of the training data amount on the runoff simulation performance of the LSTM network.
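The split described above can be sketched as plain array slicing (assuming, for illustration only, a fixed 365-day year and a generic placeholder series):

```python
import numpy as np

DAYS = 365                                # simplifying assumption: no leap days
series = np.arange(130 * DAYS)            # stand-in for a daily rainfall/runoff series

train_pool = series[:80 * DAYS]           # first 80 years: training pool
verify = series[80 * DAYS:]               # remaining 50 years: verification

# Nested training subsets of the first 5, 10, 20, 40 and 80 years
subsets = {n: train_pool[:n * DAYS] for n in (5, 10, 20, 40, 80)}
```

Because the subsets are nested prefixes of the same pool and the verification window is fixed, any change in performance across the five schemes is attributable to the training data amount alone.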

Selection of Training Parameters
Drawing on the parameter selection experience of Xu et al. (2020), in this study the parameters time-step, batch-size and cell-size are discretized into multiple values, while the parameters Num_layer, Learning_rate and epoch are set to fixed values. This yields 36 different parameter combinations in total. The detailed parameter value ranges and meanings are shown in Table 1. Based on these 36 combinations, the learning performance was analyzed under the schemes of 5, 10, 20, 40 and 80 years of training data.

Analysis of KNN Simulation Results
To evaluate the rationality of the KNN simulation results, the coefficient of variation (CV) and the skewness coefficient (CS) of the simulated 130-year meteorological data series are compared with those of the measured series. The results for monthly rainfall and daily maximum temperature at the Yushu meteorological station in the Minjiang basin are shown in Fig. 4. It can be seen that the statistical characteristics of the simulated data are similar to those of the measured data, which indicates that the simulated data can be used for building the LSTM network.
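The two statistics can be computed as below (a plain sketch using population moments; function names are illustrative):

```python
import numpy as np

def cv(x):
    """Coefficient of variation: standard deviation divided by the mean."""
    x = np.asarray(x, float)
    return x.std() / x.mean()

def cs(x):
    """Skewness coefficient: third central moment divided by std cubed."""
    x = np.asarray(x, float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3
```

Matching CV checks that the generator preserves relative variability, while matching CS checks that it preserves the asymmetry of the distribution, e.g. the long right tail typical of monthly rainfall.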

Effect of Parameter Combinations
To test the effect of the parameter combinations, 900 simulations (36 parameter combinations × 5 training data sets × 5 repeated runs) were conducted for each of the three basins. The results are shown in Fig. 5. Figure 5 (YL1)-(YL5) shows the NSEs of the verification phase in the Yalong River Basin. It can be seen that the accuracy fluctuates with the parameter combination: as the combination changes from 1 to 4, with time-step and batch-size fixed at 80 and 10 respectively and cell-size increasing from 10 to 40, the NSE shows a downward trend, indicating that increasing the cell-size can reduce the performance of the LSTM network.
The simulation results of the Jialing River, shown in Fig. 5 (JL1)-(JL5), are similar to those of the Yalong River and Minjiang River. It can also be observed that parameter changes may lead to obvious changes in the performance of the LSTM network. In summary, the prediction accuracy of the LSTM network fluctuates as the parameters change.
In order to further study the effect of each parameter on prediction accuracy, Fig. 5 (YL1) is enlarged and divided by crest and trough, as shown in Fig. 6, with the corresponding parameter values given in Table 2. From Fig. 6, it can be seen that Zone 1, Zone 2 and Zone 3 are located at crests, where the corresponding LSTM networks have higher prediction accuracy, while Zone 4, Zone 5 and Zone 6 are located at troughs, where the corresponding networks have lower prediction accuracy. From Table 2, it can be observed that for the parameter combinations in the crest zones most of the batch-size and cell-size values are 10 and 20, while for those in the trough zones most are 30 and 40. It can therefore be said that the LSTM network obtains higher prediction accuracy when the batch-size and cell-size values are smaller.

Effect of Training Data Amount
To evaluate the effect of the training data amount on the performance of the LSTM network in the three basins, for each training data amount the mean NSE of five simulations with the same parameters is calculated for the training phase and the verification phase, and the results are shown in Fig. 7. Figure 7 (YL1)-(YL5) presents the results for the Yalong River Basin. In Fig. 7 (YL1), the NSE at the training phase is much larger than that at the verification phase. The reason is that over-fitting, caused by inadequate hydrological data at the training phase, leads to very poor generalization of the LSTM network at the verification phase. Compared with Fig. 7 (YL1), Fig. 7 (YL2) shows that increasing the training data amount decreases the NSE at the training phase, raises the NSE at the verification phase, and reduces the over-fitting. In Fig. 7 (YL3), the decline of the NSE at the training phase and its increase at the verification phase narrow the gap between them, which indicates that the trained network is highly stable. Figure 7 (YL4) and (YL5) show that further increasing the data amount can further weaken the over-fitting of the LSTM network, but the improvement of the NSE at the verification phase is limited. All of the above results show that, as the amount of training data increases, the LSTM network tends to become more stable.
The simulation results of Minjiang River basin are shown in Fig. 7(MJ1)-(MJ5). In Fig. 7(MJ1), the NSE in the training phase remains around 0.9, and it is only around 0.7 in the verification phase. In Fig. 7(MJ2)-(MJ5), the NSE at the training phase declines obviously while it increases at the verification phase. It also corroborates that the increase of training data amount weakens the over-fitting of LSTM network.
The simulation results of the Jialing River basin are shown in Fig. 7 (JL1)-(JL5). It can be seen that, as the training data amount increases from 5 years to 10, 20 and 40 years, the NSE at the training phase and verification phase gradually decreases. But when the training data amount increases to 80 years, the NSEs at the training phase and verification phase stand at the same level, indicating that the simulation performance of the LSTM network remains stable.
In order to further analyze the effect of the training data amount on the performance of the LSTM network, the NSE and R² of the 180 verification-phase simulations with the same training data amount are averaged over the three basins. The mean NSE and R² are shown in Fig. 8. It can be seen that, as the training data increase, the NSE and R² increase too, which indicates that increasing the training data amount can considerably reduce over-fitting in the LSTM network and improve its prediction accuracy. But it can also be observed that, once the training data amount exceeds a certain value or range, the performance of the LSTM network cannot be improved remarkably, and this critical value varies between basins. That is, increasing the training data amount can improve forecast performance, but over-increasing it does not necessarily lead to better results. In other words, choosing an appropriate amount of training data better exploits the learning ability of the LSTM network.

Analysis of Runoff Process
In Sects. 4.2 and 4.3, it was shown that the LSTM network achieves better performance when the amount of training data is set to 20 years and the parameters time-step, batch-size and cell-size are set to 120, 10 and 20, respectively. With this configuration, the simulated runoff of the three basins over the 50-year verification phase is compared with the observed runoff in Fig. 9. It can be seen that the NSEs for the Minjiang River, Jialing River and Yalong River at the verification phase reach 0.91, 0.94 and 0.96 respectively, which indicates that the LSTM network has a strong learning ability and can thus be used for runoff simulation.

Conclusion
In order to analyze the influence of the training data amount on the learning performance of the LSTM network in runoff simulation, the LSTM network is applied to the runoff simulation of the Minjiang River, Jialing River and Yalong River, and its performance under different parameter and training data amount schemes is evaluated in this study. The following conclusions can be drawn:

1. When the parameters change, the prediction accuracy of the LSTM network changes too. When the batch-size and cell-size are set to 10 or 20, the LSTM network performs better, while it performs worse when they are set to 30 or 40. The results show that the LSTM network can provide higher prediction accuracy when the batch-size and cell-size are smaller.

2. As the training data amount increases, over-fitting is reduced and the learning performance of the LSTM network improves significantly. But once the training data amount increases beyond a certain extent, the prediction accuracy of the LSTM network no longer improves much. The LSTM network in this study achieves better prediction accuracy and stability when more than 20 years of training data are used.

3. When the parameters time-step, batch-size and cell-size are set to 120, 10 and 20, respectively, and the training data amount is set to 20 years, the NSEs for the Minjiang River, Jialing River and Yalong River at the verification phase reach 0.91, 0.94 and 0.96 respectively, which shows that the LSTM network can be used for runoff simulation.
Author Contribution All authors contributed to the study conception and design. Material preparation and data collection were performed by Wei Xu, the model was built by Anbang Peng, data analysis was performed by Xiaoli Zhang, and figures were drawn by Yuanyang Tian. The first draft of the manuscript was written by Anbang Peng, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Competing Interests
The authors have no relevant financial or non-financial interests to disclose.