A Novel Deep Learning Algorithm for Groundwater Level Prediction based on Spatiotemporal Attention Mechanism


Groundwater resources play a vital role in production, human life and economic development. Effective prediction of groundwater levels would support better water resources management. Although machine learning algorithms have been studied and applied in many domains with good results, research in the hydrologic domain remains limited. This paper proposes a novel deep learning algorithm for groundwater level prediction based on a spatiotemporal attention mechanism. Short-term (one month ahead) and long-term (twelve months ahead) predictions of groundwater level are conducted with observed groundwater levels collected from several boreholes in the middle reaches of the Heihe River Basin in northwestern China. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used to evaluate the performance of the proposed algorithm and several baseline models (i.e., SVR, Support Vector Regression; FNN, Feedforward Neural Networks; LSTM, Long Short-Term Memory neural network). The results show that the proposed model can effectively improve the prediction accuracy compared to the baseline models, with an MAE of 0.0754 and an RMSE of 0.0952 for short-term prediction, and an MAE of 0.0983 and an RMSE of 0.1215 for long-term prediction. This study provides a feasible and accurate approach for groundwater prediction that may facilitate decision making for water management.


Introduction
Groundwater is an important source of drinking water, industrial water and agricultural irrigation water in many countries. Several unique qualities of groundwater (e.g., widespread and continuous availability, low development cost, drought reliability) have promoted its exploitation. In the past few decades, the effects of climate change and human activities have seriously altered the distribution of groundwater resources, leading to shortages and over-exploitation of groundwater, as well as severe ecological and environmental problems, in some arid and semi-arid areas 1,2 . Water resources management, of which the assessment of groundwater availability is an important part, is a feasible way of dealing with these problems.
Groundwater level is an important indicator of groundwater resources; it is highly nonlinear and nonstationary and depends on many heterogeneous environmental factors (e.g., aquifer properties, precipitation, surface water, exploitation, evapotranspiration). The first step in groundwater management is the accurate simulation and prediction of groundwater level. Therefore, it is essential to explore and develop methods and models for groundwater level simulation.
Research on groundwater level simulation may date back to the nineteenth century. In 1856, the famous French water engineer Darcy proposed Darcy's law (also called the law of permeability) to describe the relationship between groundwater seepage and hydraulic gradient 3 .
In 1935, the Theis Equation was proposed to provide a solution for calculating the groundwater drawdown caused by the discharge of a well in transient groundwater hydraulics 4 . At the end of the twentieth century, many hydrological models were developed (e.g., MODFLOW 5 ) which provided effective approaches to simulate and analyze the spatiotemporal variations in the distribution of groundwater. Since then, hydrological models have attracted researchers' attention for groundwater level simulation. Furthermore, researchers have tended to integrate different models to simulate the multidisciplinary nature of natural systems, which led to the development of agent-based models 6 and modeling environments (e.g., OMS 7 , OpenMI 8 ). Cristian Guevara-Ochoa et al. coupled a hydrological-hydrogeological model under different climate change scenarios to analyze the impact of climate change on water resources by quantifying the spatio-temporal dynamics of the water balance and groundwater-surface water interactions 9 . However, several issues of traditional hydrological models remain to be addressed, including relatively low accuracy, high computation costs and the requirement of expert knowledge.
With the development of computer science and hardware, data-driven methods (e.g., regression analysis, statistics, gray theory, machine learning) have been thoroughly studied 10 and widely used in many areas with good results [11][12][13] . In the field of hydrology, French et al. developed a neural network to forecast rainfall intensity fields in space-time and presented the advantages of neural networks compared to numerical models 14 . A comparative study analyzed the pros and cons of machine learning and numerical models for simulating groundwater dynamics 12 . Although great progress has been made in the study of groundwater level prediction, some problems still exist. First, there is little research on long-term groundwater level prediction, which is essential for regional water resources management. Second, there is a lack of analysis and modeling of the spatial and temporal sequences of groundwater levels. Given that many groundwater level observation boreholes are distributed in a watershed, certain spatial and temporal correlations among different boreholes exist which would greatly help to improve the accuracy of groundwater prediction.
In this paper, we propose a novel DL algorithm (ST-Att-LSTM) based on the spatiotemporal attention mechanism to forecast short-term and long-term groundwater levels by combining information from multiple observation boreholes. The main contributions of this paper are: (1) the proposal of a novel deep learning algorithm based on a spatiotemporal attention mechanism to incorporate multiple data sources; (2) short-term and long-term prediction of groundwater level based on the proposed algorithm with more accurate results compared to baseline methods; (3) a potential way of knowledge discovery for domain experts. The rest of the article is organized as follows: the study area and collected data are described in Section 2; the structure of the proposed algorithm based on the spatiotemporal attention mechanism is depicted in Section 3; the evaluation of the proposed algorithm is demonstrated in Section 4; the conclusion and future work are presented in Section 5.

Study area and data description
The study area is the middle reaches of the Heihe River Basin (HRB) in northwestern China. In the study area, groundwater levels observed at 42 observation boreholes (yellow and blue dots in Figure 1) were collected from 1986 to 2008. The groundwater level at observation borehole "22" was selected as the prediction target. Observation data from several spatially distributed boreholes near the river (i.e., "Daman", "Yanuanzhangwan", "Banqiaodongliu", "Liaoquanwanzi" and "Luocheng") (yellow dots in Figure 1) were selected to drive the algorithm.
The spatial and geographical relationships between each observation borehole and the target borehole are differentiated by the spatial attention mechanism. The data processing will be introduced in Section 4.

Model structure
The proposed novel DL algorithm for groundwater level prediction is built upon LSTM, the sequence-to-sequence (seq2seq) model and the attention mechanism. Therefore, the proposed algorithm is depicted after brief descriptions of LSTM, seq2seq and the attention mechanism in this section.

LSTM
The traditional Feedforward Neural Network (FNN) with back propagation establishes weight connections between layers, and the output at the current time step depends only on the input at the current time step. The Recurrent Neural Network (RNN) was then proposed 22 to overcome this limitation. In an RNN, the hidden layer is able to receive its own information from the previous time step through interconnections between hidden layers.
However, the problem of long-term dependencies prevents RNN from large-scale applications. To solve this problem, Hochreiter and Schmidhuber 23 introduced Long Short-Term Memory neural networks. Figure 2 shows the structure of LSTM, in which xt refers to the input vector at time step t and ht is the hidden output vector. From this structure, we can see that ht contains information from ht-1 along with the input vector xt. This information is passed to the next time step t+1 through the interconnection to ensure the memorization of information. The difference between LSTM and RNN is that LSTM adds a "processor" to judge whether information is useful or not. The structure of this processor is called a cell. Three gates are contained in a cell, namely the input gate, forget gate and output gate. Information entering the LSTM is judged according to the gates' rules: only information that complies will be kept, while information that does not match will be discarded (forgotten) through the forget gate 24 . Equations (1)~(6) describe in detail how the LSTM cell maps the input vector sequence x to the hidden vector sequence h. In these equations, ft, it, ot, and Ct represent the forget gate, input gate, output gate and the memory cell vector, respectively; Wf, Wi, Wo, and WC are weighting parameter matrices; σ and tanh are activation functions given by Equations (7) and (8).
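As a concrete illustration, the LSTM cell update described by Equations (1)~(6) can be sketched in NumPy as follows; the variable names and random weights are illustrative, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step: gates act on the concatenated [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate, Eq. (1)
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate, Eq. (2)
    C_hat = np.tanh(W["C"] @ z + b["C"])    # candidate memory, Eq. (3)
    C_t = f_t * C_prev + i_t * C_hat        # memory cell update, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate, Eq. (5)
    h_t = o_t * np.tanh(C_t)                # hidden state, Eq. (6)
    return h_t, C_t

# Tiny example with random weights
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.standard_normal((n_hid, n_hid + n_in)) for k in "fiCo"}
b = {k: np.zeros(n_hid) for k in "fiCo"}
h, C = lstm_cell_step(rng.standard_normal(n_in),
                      np.zeros(n_hid), np.zeros(n_hid), W, b)
```

Because the output gate and tanh are both bounded, each component of the hidden state h stays strictly inside (-1, 1).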
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (1)
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (2)
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ (3)
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (4)
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (5)
$h_t = o_t \odot \tanh(C_t)$ (6)
$\sigma(x) = 1/(1 + e^{-x})$ (7)
$\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ (8)

seq2seq model
RNN requires fixed dimensions of inputs and outputs. To overcome this limitation, the seq2seq model was proposed. The most distinctive element of the seq2seq model is the inclusion of an encoder and a decoder 25 , with RNNs processing the sequence information. The basic idea of the seq2seq model is to use the encoder to compress the input sequence into a vector of a specified length, which serves as an implicit representation of the original sequence. The most direct way to obtain this hidden representation is to use the hidden state at the last time step of the RNN; it is also possible to transform all hidden states of the input sequence to obtain the hidden variables. The role of the decoder is to generate the specified result based on the hidden vector C. The most common calculation method is to feed the hidden vector C obtained by the encoder into the RNN structure of the decoder as its initial state.
The output of the decoder at each time step is then used as the input at the next time step, and the hidden vector C participates in the computation only as the initial state.
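A minimal sketch of this encoder-decoder loop, using a plain RNN update for brevity and random illustrative weights (not the paper's configuration):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h):
    """Single vanilla-RNN update used by both encoder and decoder."""
    return np.tanh(W_x @ x_t + W_h @ h_prev)

def seq2seq(inputs, n_out, W):
    """Encode a variable-length sequence into a fixed vector C,
    then decode n_out steps, feeding each output back as input."""
    h = np.zeros(W["enc_h"].shape[0])
    for x_t in inputs:                 # encoder compresses the sequence
        h = rnn_step(x_t, h, W["enc_x"], W["enc_h"])
    C = h                              # fixed-length representation
    outputs = []
    y = np.zeros(W["dec_x"].shape[1])  # decoder's first input
    h = C                              # C initializes the decoder state
    for _ in range(n_out):
        h = rnn_step(y, h, W["dec_x"], W["dec_h"])
        y = W["out"] @ h               # output is fed back next step
        outputs.append(y)
    return outputs

rng = np.random.default_rng(1)
d = 4
W = {"enc_x": rng.standard_normal((d, 2)), "enc_h": rng.standard_normal((d, d)),
     "dec_x": rng.standard_normal((d, 1)), "dec_h": rng.standard_normal((d, d)),
     "out": rng.standard_normal((1, d))}
preds = seq2seq([rng.standard_normal(2) for _ in range(5)], n_out=3, W=W)
```

Note how the input sequence (length 5) and output sequence (length 3) have different lengths, which is exactly the flexibility seq2seq adds over a plain RNN.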

Attention mechanism
The attention mechanism has recently demonstrated success in a wide range of tasks 26 .
LSTM typically uses the last hidden state or the mean of all hidden states as an output, while the attention mechanism allows for a more direct dependency between the states of the model at different time steps. For the $k$-th input series $x^k$, the spatial attention weight can be calculated according to the hidden state $h_{t-1}$ and cell state $s_{t-1}$ of the encoder at the last time step as

$e_t^k = v_e^{\top} \tanh(W_e[h_{t-1}; s_{t-1}] + U_e x^k)$,
$\alpha_t^k = \exp(e_t^k) / \sum_{j} \exp(e_t^j)$,

where $v_e$, $W_e$ and $U_e$ are trainable parameters. The structure of the spatial attention is shown in Figure 3.
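This score-and-softmax step can be sketched as follows; an additive (DA-RNN-style) scoring form is assumed here, and all names and dimensions are illustrative:

```python
import numpy as np

def spatial_attention(X, h_prev, s_prev, v_e, W_e, U_e):
    """Score each of the K input series against the encoder's previous
    hidden state h_prev and cell state s_prev, then softmax-normalize."""
    hs = np.concatenate([h_prev, s_prev])
    scores = np.array([v_e @ np.tanh(W_e @ hs + U_e @ x_k) for x_k in X])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights                            # one weight per borehole series

rng = np.random.default_rng(2)
K, T, m = 5, 12, 8                            # 5 driving series, window 12, hidden size 8
X = [rng.standard_normal(T) for _ in range(K)]
v_e = rng.standard_normal(m)
W_e = rng.standard_normal((m, 2 * m))
U_e = rng.standard_normal((m, T))
w = spatial_attention(X, np.zeros(m), np.zeros(m), v_e, W_e, U_e)
```

The resulting weights sum to one, so each driving borehole's contribution can be read off directly, which is what allows the learned weights to hint at hydraulic connections.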
For long-term prediction, the seq2seq model structure is used with LSTM being the decoder.
Each time step of the decoding process obtains a different encoding vector through the temporal attention mechanism, as described by Equations (18)~(20):

$l_t^i = v_d^{\top} \tanh(W_d[d_{t-1}; s'_{t-1}] + U_d h_i)$ (18)
$\beta_t^i = \exp(l_t^i) / \sum_{j} \exp(l_t^j)$ (19)
$c_t = \sum_{i} \beta_t^i h_i$ (20)

where $d_{t-1}$ and $s'_{t-1}$ represent the hidden state and cell state of a hidden layer unit of the decoder, $h_i$ is the $i$-th encoder hidden state, and $c_t$ represents the context vector in the current decoding state.
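The weighted-sum step that produces the context vector can be sketched as follows (variable names are hypothetical):

```python
import numpy as np

def temporal_context(H, beta):
    """Context vector c_t: attention-weighted sum of encoder hidden states.

    H    -- (T, m) matrix stacking encoder hidden states h_1..h_T
    beta -- (T,) softmax attention weights for the current decoding step
    """
    return H.T @ beta

rng = np.random.default_rng(3)
T, m = 12, 8
H = rng.standard_normal((T, m))
scores = rng.standard_normal(T)
beta = np.exp(scores) / np.exp(scores).sum()  # softmax over time steps
c_t = temporal_context(H, beta)
```

Because the weights differ at every decoding step, each predicted month attends to a different mixture of the encoder's hidden states.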
The structure of the spatiotemporal attention mechanism is shown in Figure 4. Because the input features and the prediction target vary over different ranges, the data were scaled to [0, 1] using Equation (21):

$x_i = (x - x_{\min}) / (x_{\max} - x_{\min})$ (21)

where $x_i$ is the normalized value, $x$ is the raw data, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of $x$.
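Equation (21) is ordinary min-max scaling; a short sketch with made-up groundwater levels:

```python
def min_max_scale(values):
    """Scale a sequence to [0, 1] as in Equation (21)."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

# Illustrative groundwater levels in meters (not the paper's data)
levels = [1452.3, 1452.9, 1451.8, 1453.4]
scaled = min_max_scale(levels)
```

In practice the scaling parameters fitted on the training set are reused to transform the test set and to invert the predictions back to meters.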

Parameters and performance evaluation
The environments (software and hardware) for conducting the experiments are shown in Table 1. Backpropagation with Adaptive moment estimation (Adam) 28 is used to train the proposed algorithm by minimizing a loss function with respect to the parameters. Adam is a mini-batch gradient descent optimization method that is simple, efficient and computationally inexpensive, and is well suited to non-stationary objective functions. The hyper-parameters of the proposed algorithm and the baseline models are listed in Table 2.

The short-term prediction results shown in Figure 5 indicate a reasonable match for all the models. The results from LSTM are better than those from SVR and FNN because of the inclusion of time dependencies in the time sequence. However, S-Att-LSTM performs better still, which may benefit from the features extracted from the spatially distributed time sequences in the study area.
The best match is obtained by ST-Att-LSTM, whose spatiotemporal attention mechanism considers both the spatial and temporal relations through learned attention weights.
A quantitative comparison between the different methods is conducted using the performance criteria (i.e., RMSE and MAE). The results shown in Table 3 indicate the best match from ST-Att-LSTM, which is in accordance with Figure 5. It is apparent that better results are obtained from the LSTM-based algorithms than from SVR and FNN, which may be attributed to the fact that LSTM-based algorithms consider the information through time. The S-Att-LSTM algorithm with only the spatial attention mechanism performs better than the traditional LSTM without an attention mechanism, and the ST-Att-LSTM algorithm with the spatiotemporal attention mechanism performs best among the three LSTM-based algorithms. These comparisons indicate that the attention mechanism further improves the performance of the algorithms.

The hyper-parameters used for long-term prediction are listed in Table 4. The results of the algorithms are shown in Figure 7, which indicate that the performance of the traditional methods (ARIMA, LSTM) for long-term prediction is worse than that of the more advanced algorithms (seq2seq, seq2seq-Att, ST-Att-LSTM). Part of the reason may be attributed to the larger number of parameters in the advanced algorithms, which brings more degrees of freedom to approximate the data. It is noted that a local minimum around the 9th time step is missed by all the algorithms, which may be acceptable for the following two reasons.
First, the purpose of long-term prediction is to capture the trend of the data (which has already been captured well) rather than particular extreme values, in order to avoid overfitting. Second, the groundwater level at time step 9 could be considered an abnormal variation at this scale. A quantitative comparison between the different algorithms is conducted by calculating the MAE and RMSE (Table 5). The proposed ST-Att-LSTM algorithm ranks best among all the algorithms, which is in accord with Figure 7.

Conclusion
In this paper, a groundwater level prediction algorithm based on the spatiotemporal attention mechanism was proposed to explore the potential information in spatial and temporal distribution.
The groundwater levels from five spatially distributed observation boreholes (i.e., "Daman", "Yanuanzhangwan", "Banqiaodongliu", "Liaoquanwanzi" and "Luocheng") and the target borehole "22" in the middle reaches of the HRB were collected and used to train and validate the proposed algorithm for both short-term and long-term prediction. The MAE and RMSE values were used to evaluate the performance by comparison with several baseline algorithms. The results showed that the inclusion of spatiotemporal information through attention mechanisms was able to improve the performance of deep learning algorithms significantly. Furthermore, the attention weights among the spatially distributed observation boreholes were examined, which may provide information on potential hydraulic connections for hydrologists. Therefore, the proposed algorithm provides (1) a potential application of spatiotemporal data; (2) an accurate prediction for groundwater level; and (3) a potential way of knowledge discovery for domain experts.