Climate change is one of the most pressing issues for humanity in the Anthropocene (Malhi et al. 2020). The mean land surface air temperature in 2006–2015 was 1.53°C higher than in 1850–1900 (IPCC 2019). It is now well established that higher temperatures are also causing severe changes in precipitation regimes, with increasingly extreme events (Hardwick Jones et al. 2010; Myhre et al. 2019; Tramblay et al. 2020; Luppichini et al. 2023b). Warming is also shifting the beginning and end of growing seasons, contributing to a general decrease in regional crop yields and freshwater availability. Biodiversity is further stressed and tree mortality is increasing (IPCC 2019). Understanding and modelling the past, present, and future climate are therefore of fundamental importance for addressing climate change and variability. Effective climate models represent one of our primary tools for forecasting and adapting to climate change (Schmidt 2011).
Moreover, climate change has direct repercussions on hydrogeological systems and groundwater resources, and hydrogeological models are consequently part of climate models. Rainfall data, and precipitation data in general, with their variability, intensity, and duration, are of paramount importance for hydrogeological models (Sattari et al. 2017). Regardless of the model used, several problems can still affect the input rainfall datasets. These issues often concern missing or incorrect information, which can lead models to misleading results. The geographical distribution of raingauges is generally not uniform (e.g., some areas have few or no raingauges). Furthermore, the completeness of the time series is not always guaranteed, for example because raingauges did not operate continuously during the monitoring period (Lebay and Le 2020).
Many physically based methods simplify the features of the natural system to predict its behaviour (Antonetti and Zappa 2018). However, natural systems are inherently heterogeneous (Marçais and de Dreuzy 2017), and physically based methods may show inherent limitations in reproducing natural phenomena. In recent years, the use of artificial intelligence (AI) and graphics processing units (GPUs) has enabled remarkable advances in machine learning (and especially in deep learning) applications, such as techniques based on multilayer artificial neural networks (ANNs). Deep learning models have been successfully applied in many forecasting situations, including time series forecasting (Zheng et al. 2019; Yi et al. 2019; Fawaz et al. 2020; Nigro et al. 2022). Time series are typically chaotic and noisy, and deep learning approaches are among the most effective techniques for handling them (Livieris et al. 2020). Several authors have used rainfall datasets to create deep learning models able to replicate run-off processes (van Loon and Williams 1976; Marçais and de Dreuzy 2017; Kratzert et al. 2018; Boulmaiz et al. 2020; Sit et al. 2020; Tien Bui et al. 2020; Chattopadhyay et al. 2020; Luppichini et al. 2022a, a). Long short-term memory (LSTM) networks and convolutional neural networks (CNNs) are two of the most popular, efficient, and widely used deep learning techniques (Zheng et al. 2019; Yi et al. 2019; Fawaz et al. 2020). Recently, some works have combined LSTM and CNN models for time series prediction (Kimura et al. 2019; Baek et al. 2020; Van et al. 2020; Xu et al. 2020). The combined CNN-LSTM models benefit from the complementary roles of the two components: the LSTM layers, thanks to their particular architecture, efficiently capture the information contained in sequence patterns, while the CNN layers filter out the noise in the input data and extract the most significant features needed for the final prediction model.
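The noise-filtering role attributed to the convolutional layers can be illustrated with a minimal sketch. It assumes a fixed moving-average kernel in place of the filters a CNN would learn during training, and a synthetic sine-plus-noise series in place of real data; the function name `conv1d_valid` and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def conv1d_valid(series, kernel):
    """Slide a kernel over a 1-D series (cross-correlation, no padding)."""
    k = len(kernel)
    return np.array([np.dot(series[i:i + k], kernel)
                     for i in range(len(series) - k + 1)])

# Synthetic, purely illustrative series: a slow oscillation plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 50)
noisy = signal + rng.normal(0.0, 0.5, size=t.size)

# A fixed averaging kernel stands in for a learned convolutional filter.
kernel = np.full(5, 1 / 5)
features = conv1d_valid(noisy, kernel)

# Compare both series with the clean signal at the window centres.
err_raw = np.mean((noisy[2:-2] - signal[2:-2]) ** 2)
err_smooth = np.mean((features - signal[2:-2]) ** 2)
print(f"MSE raw: {err_raw:.3f}, MSE after convolution: {err_smooth:.3f}")
```

In a CNN-LSTM model the kernel weights are learned rather than fixed, and the resulting feature sequence is passed on to the recurrent layers.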
Furthermore, a standard CNN can identify spatial autocorrelation in the data but is usually not suited to analysing complex temporal dependencies over long time spans (Bengio et al. 2013; Livieris et al. 2020). Several works have used deep learning models based on LSTM networks to create run-off simulations (Kratzert et al. 2018; Le et al. 2019; Boulmaiz et al. 2020; Li et al. 2020; Liu et al. 2020; Nguyen and Bae 2020; Hu et al. 2020), whereas others are based on CNNs (Li et al. 2018; Huang et al. 2020; Kim and Song 2020; Hussain et al. 2020) or on a combination of both (CNN-LSTM) (Kimura et al. 2019; Baek et al. 2020; Van et al. 2020; Xu et al. 2020). Encoder-decoder LSTM (ED-LSTM) architectures perform well with sequential data such as time series. This architecture consists of two blocks: one reads the input sequence and encodes it into a fixed-length vector, and the second decodes the fixed-length vector and outputs the target sequence (Sutskever et al. 2014).
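The two-block structure of the encoder-decoder can be sketched as follows. This is only an illustration of the data flow: a plain tanh recurrent cell is used in place of an LSTM cell to keep the code short, the weights are random and untrained, and all names (`rnn_step`, `encode`, `decode`, `init_params`) and sizes are hypothetical.

```python
import numpy as np

def rnn_step(x, h, W_x, W_h, b):
    """One step of a plain tanh recurrent cell (stand-in for an LSTM cell)."""
    return np.tanh(x @ W_x + h @ W_h + b)

def encode(inputs, params):
    """Read the input sequence step by step and return the final hidden state."""
    h = np.zeros(params["W_h"].shape[0])
    for x in inputs:  # inputs has shape (n_steps_in, n_features)
        h = rnn_step(x, h, params["W_x"], params["W_h"], params["b"])
    return h  # fixed-length context vector, independent of n_steps_in

def decode(context, n_steps_out, params):
    """Unroll the decoder from the context vector, one output value per step."""
    h = context
    outputs = []
    for _ in range(n_steps_out):
        # The context vector is fed to the decoder at every output step.
        h = rnn_step(context, h, params["W_c"], params["W_h"], params["b"])
        outputs.append(h @ params["w_out"] + params["b_out"])
    return np.array(outputs)

def init_params(n_features, n_hidden, seed=0):
    """Random, untrained weights: enough to show shapes, not to predict."""
    rng = np.random.default_rng(seed)

    def w(*shape):
        return rng.normal(0.0, 0.1, size=shape)

    enc = {"W_x": w(n_features, n_hidden), "W_h": w(n_hidden, n_hidden),
           "b": np.zeros(n_hidden)}
    dec = {"W_c": w(n_hidden, n_hidden), "W_h": w(n_hidden, n_hidden),
           "b": np.zeros(n_hidden), "w_out": w(n_hidden), "b_out": 0.0}
    return enc, dec

enc, dec = init_params(n_features=5, n_hidden=8)
short_seq = np.ones((12, 5))
long_seq = np.ones((30, 5))
context_short = encode(short_seq, enc)
context_long = encode(long_seq, enc)
decoded = decode(context_short, 3, dec)
print(context_short.shape, context_long.shape, decoded.shape)
```

Input sequences of different lengths are compressed into context vectors of the same size, which is the property the decoder block relies on.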
Beyond run-off simulation, deep learning models are also used in hydrological modelling for other tasks, such as reconstructing missing data and predicting rainfall data (Gers et al. 2001). For these models, the input data must be both abundant and of good quality.
This work uses machine learning models to predict rainfall data by taking advantage of a network of sensors. The models recreate precipitation time series by using data from nearby raingauges as inputs. The training data contain no explicit temporal information, but each record refers to a specific time. This allows missing values to be filled in within a rainfall database, so that time series can be completed for the types of study that require time series continuity (e.g., statistical methods and trend analysis). We also analysed the errors of the models, investigating the role of anomalous data that can influence the performance of deep learning models. Indeed, all meteorological databases can contain anomalous data caused by rare natural phenomena or anthropic factors (e.g., malfunctions of the sensor network). Understanding how machine learning models respond to the presence of these data is a key point for future applications of AI techniques in hydrological and meteorological studies.
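The record structure described above can be sketched as follows, assuming that each time step provides one value for the target raingauge and one value for each neighbouring raingauge, with `None` marking a missing observation. The function name `split_records` and the example values are hypothetical, and discarding records with incomplete inputs is a simplification adopted only for this sketch.

```python
def split_records(target, neighbours):
    """Split time-indexed records into training rows and rows to reconstruct.

    target: rainfall values of the gauge to rebuild (None = missing).
    neighbours: one row per time step with the neighbouring gauges' values.
    """
    train, to_fill = [], []
    for t, (y, x) in enumerate(zip(target, neighbours)):
        if any(v is None for v in x):
            continue  # incomplete inputs: this sketch simply skips the record
        if y is None:
            to_fill.append((t, x))  # keep the time index to place the estimate
        else:
            train.append((x, y))    # the time index is not a model input
    return train, to_fill

# Hypothetical rainfall values (mm) for six time steps and two neighbours.
target = [0.0, 2.4, None, 5.1, None, 0.0]
neighbours = [[0.0, 0.2], [2.0, 3.1], [1.2, 0.8],
              [4.7, None], [0.0, 0.0], [0.0, 0.0]]

train, to_fill = split_records(target, neighbours)
print(len(train), [t for t, _ in to_fill])
```

The time index is used only to write each estimate back to the right position in the series; the model itself sees the neighbouring values alone.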
We applied three machine learning models: a linear regression (LR) and two deep learning models based on CNN and LSTM layers. The first deep learning architecture relies on a combination of CNN and LSTM layers (CNN-LSTM), whereas the second relies on an encoder-decoder LSTM (ED-LSTM). The dataset is derived from 349 raingauges located in Tuscany (central Italy; Fig. 1), a region characterized by an extensive monitoring network and a wide variability in mean annual precipitation (MAP), which is influenced by the morphology of the territory (Cantù 1977; Rapetti and Vittorini 1994; Fratianni and Acquaotta 2017). Tuscany is indeed very heterogeneous from a morphological and geological point of view, with mountain ranges, extensive hilly areas, and some relatively large plains (Carmignani et al. 2013; Baroni et al. 2015). In summary, the study area makes it possible to apply the methodology and carry out the investigations in a region with great climate variability and a large number of raingauges.
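As an illustration of the simplest of the three models, the following sketch fits a linear regression that estimates the rainfall at one raingauge from three neighbouring raingauges. The data are synthetic, the coefficients and noise level are arbitrary, and the clipping of negative predictions to zero is an assumption of this sketch, not a step taken from the study.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted coefficients to new neighbour observations."""
    A = np.column_stack([X, np.ones(len(X))])
    # Rainfall cannot be negative, so the linear estimate is clipped at zero.
    return np.clip(A @ coef, 0.0, None)

# Synthetic, purely illustrative data: 3 neighbouring gauges, 200 time steps.
rng = np.random.default_rng(42)
X = rng.gamma(shape=0.5, scale=8.0, size=(200, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0.0, 0.5, 200)
y = np.clip(y, 0.0, None)

# Fit on the first 150 records and estimate the remaining 50.
coef = fit_linear(X[:150], y[:150])
y_hat = predict_linear(coef, X[150:])
mae = np.mean(np.abs(y_hat - y[150:]))
print(f"Mean absolute error on held-out records: {mae:.2f} mm")
```

Such a baseline gives a reference error against which the gain from the CNN-LSTM and ED-LSTM architectures can be judged.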