Data-driven global weather predictions at high resolutions

Society has benefited enormously from the continuous advancement in numerical weather prediction that has occurred over many decades driven by a combination of outstanding scientific, computational and technological breakthroughs. Here, we demonstrate that data-driven methods are now positioned to contribute to the next wave of major advances in atmospheric science. We show that data-driven models can predict important meteorological quantities of interest to society such as global high resolution precipitation fields (0.25°) and can deliver accurate forecasts of the future state of the atmosphere without prior knowledge of the laws of physics and chemistry. We also show how these data-driven methods can be scaled to run on supercomputers with up to 1024 modern graphics processing units and beyond resulting in rapid training of data-driven models, thus supporting a cycle of rapid research and innovation. Taken together, these two results illustrate the significant potential of data-driven methods to advance atmospheric science and operational weather forecasting.


Introduction
Improvements in our understanding of the physical and chemical processes that govern the state of the atmosphere have combined with rapid advances in computational science to pave the way to dramatically improved understanding of weather and climate over the past few decades (Bauer et al., 2015). Predicting the evolution of the weather using the laws of physics and chemistry, a deterministic approach, is achieved by integrating a known set of partial and ordinary differential equations using a set of initial conditions that are based on observations. Significant improvements in weather forecasting using a deterministic approach were achieved not only through better data assimilation and modelling systems, but also from massive increases in the amount of observational data, particularly meteorological data derived from satellites (Simmons and Hollingsworth, 2002).
Not all significant physical processes, particularly those operating at small scales, can be explicitly resolved by weather and climate models and instead are parameterised (Bauer et al., 2015). The chaotic non-linear nature of weather forecasting also presents fundamental limits to its predictive skill. To better address these uncertainties, ensemble weather forecasting approaches have been developed that produce probabilistic forecasts, with the computational cost proportional to the number of members in the ensemble (Zhang and Pu, 2010).
More recently, machine learning technology driven by the availability of massive data sets and graphics processing units (GPUs) has produced substantial breakthroughs in our ability to analyse images, video and audio streams, text and speech that have enjoyed widespread adoption in both commercial and research settings (Lecun et al., 2015). Deep learning, a sub-category of machine learning, has played a leading role in these developments. Deep learning constructs a representation of input data through a series of connected non-linear layers. These networks are able to produce the desired output by optimising a set of adjustable parameters, or weights, that minimise a desired loss function. The optimisation, or training, process adjusts the weights using a form of gradient descent (Lecun et al., 2015). This article explores the application of deep learning methods to weather and climate problems.

Applying deep neural networks to weather and climate modelling
An important feature of multi-layer deep neural networks (DNNs) is their ability to represent arbitrarily complex functions or a combination of functions that map from one finite dimensional space to another, referred to as the universal approximation theorem (Hornik et al., 1989). The sweeping range of successful applications of deep learning models provides further confirmation that DNNs are a class of universal approximators (Lecun et al., 2015). Where a deterministic relationship exists between the input and the target, DNNs can potentially complement our understanding of any physical system. Particularly, where the complexity of the physical system is high, parameters are poorly constrained and/or the required set of initial conditions is not complete.
While the universal approximation theorem provides confidence that we can apply DNNs to complex physical systems, it does not provide guidance as to the number of layers in the DNN and the number of parameters or weights required to approximate an unknown function to an acceptable degree of accuracy (Hornik et al., 1989). A DNN model may be under-fit, where we have a DNN of insufficient complexity, or, in the more challenging problem, over-fit, where we have a DNN with too many parameters so it learns just the features included in the training dataset without generalising to the underlying problem. To avoid over-fitting, practitioners commonly match the increasing capacity of models with larger datasets, providing a regularisation effect that leads to better generalisation. This requirement for large quantities of data leads to the characterisation of many machine learning approaches as big data-driven methods. The results presented below address this need for large quantities of high-quality training data.
The development of convolutional neural networks (CNNs) produced a dramatic drop in error rates for image classification tasks and resulted in their rapid adoption (Krizhevsky et al., 2012). An important property of CNNs is their efficiency when compared with fully connected neural networks, so they can be applied to multi-dimensional data and can scale to large image sizes. They have seen widespread application to 2-D image and video data in many scientific domains (Liu et al., 2020). In order to address the problem of modelling data sets that include a time dimension, recurrent neural networks (RNNs) map the current time step from past time steps (Lecun et al., 2015). A crucial advance in the practical application of RNNs was the development of long short-term memory (LSTM) RNNs (Hochreiter and Schmidhuber, 1997), which can capture the long-range dependencies in the time series data while avoiding the problem of numerically unstable backpropagated gradients. The combination of CNNs and an LSTM allows us to model complex multi-dimensional time series data and is likely to enjoy widespread application to scientific problems, including weather forecasting and climate modelling (Shi et al., 2015).
Machine learning techniques have been successfully applied to solve problems such as weather forecasting (Arcomano et al., 2020;Reichstein et al., 2019;Wang et al., 2019;Zhang and Pu 2010), estimating weather forecasting uncertainty (Scher and Messori, 2018), post-processing ensemble weather forecasts to surface stations (Rasp and Lerch, 2018), climate analytics on massive climate modelling data sets (Kurth et al., 2019) and identifying synoptic-scale fronts (Lagerquist et al., 2019). Applying machine learning to precipitation estimation has attracted significant interest including its application to downscaling (Pan et al., 2019), predicting precipitation from weather radar data at high spatial and temporal resolutions (Aditya Sai Srinivas et al., 2020;Hernandez et al., 2016) and combining continuous and categorical binary indices for precipitation forecasting (Larraondo et al., 2020).
With the early successes of applying machine learning techniques to aspects of climate modelling and weather forecasting, attempts are now being made to develop elementary weather prediction models using DNNs (Scher and Messori, 2019). Recent results deliver predictions that outperform simple persistence, climatology and early deterministic weather forecast models (Weyn et al., 2019). DNNs have also been found to outperform numerical weather prediction (NWP) models under specific circumstances (Sønderby et al., 2020). DNNs have been applied to build low resolution global scale models that predict the 500 hPa geopotential height using about 7 years of atmospheric data as the training data set in order to better understand the challenges that we will face in achieving the goal of building weather forecasting models using DNNs of comparable complexity to existing deterministic models (Dueben and Bauer, 2018). In order to accelerate the development of data-driven methods for weather forecasting, benchmark data sets and recommended evaluation metrics have recently been released that are intended to facilitate rapid and consistent model inter-comparisons (Rasp et al., 2020) and proposals for the development of a novel scalable infrastructure for weather and climate modelling have been proposed (Bauer et al., 2021).
Due to the large size and complexity of weather and climate datasets, exploration of new deep learning methodologies remains a challenge. In this work, we present a framework for scaling up DNN models and for accelerating the process of discovering new DNN-based approaches to model the weather and climate.

Dataset
The ERA5 data set (Hersbach et al., 2020) provides hourly estimates, currently commencing in 1979, of many atmospheric, land and oceanic variables at global scale with a spatial resolution of 0.25°, ≈ 30 km. Atmospheric variables are computed on 137 levels to a height of 80 km. ERA5 dataset was created by combining a comprehensive set of historical meteorological observations with a sophisticated data assimilation and modelling workflow developed by ECMWF. ERA5 data from ECMWF can be obtained on request from ECMWF's meteorological data archive and retrieval system (MARS).
Our experiments use geopotential height (z) at the pressure levels [1000, 800, 500] hPa and total precipitation (tp) variables. We use the full global data set with latitude and longitude dimensions of [720,1440]. The temporal domain data span from the year 2000 up to the end of 2019, with a temporal resolution of 1 h. The resulting geopotential height data are represented as a four-dimensional numerical array with shape [175296, 720, 1440 and 3] corresponding to dimensions [time, latitude, longitude and height], a total of approximately 2.18 TBytes. Similarly, the total precipitation is represented by a three-dimensional numerical array with shape [175296,720,1440] representing the [time, latitude and longitude] dimensions, an additional 727 GBytes.
The forecasting experiments use only geopotential height at 500 hPa arranged as a three-dimensional array with shape [175296,720,1440]. Input data is selected as a moving window using from 2-10 past time steps.

Models
We apply a convolutional encoder-decoder neural network that delivers pixel-wise semantic segmentation that enables us to generate quantitative estimates of meteorological variables of interest such as precipitation and geopotential height at each latitude-longitude grid point [720,1440]. Leading examples of the application of convolutional encoderdecoder neural networks approaches include SegNet (Badrinarayanan et al., 2017), VGG16 (Simonyan and Zisserman, 2015) and Unet (Ronneberger et al., 2015). Previous work by Larraondo et al. (2019) investigated the application of SegNet, VGG16 and Unet to the prediction of precipitation fields and concluded that Unet delivered the best estimates of precipitation while employing significantly fewer model parameters. A smaller number of model parameters results in faster computation and lower memory consumption. The Unet model has been shown to work efficiently on large images running on modern GPUs as it was originally developed on 572x572 pixel images (Ronneberger et al., 2015). Based on these advantages, we have adopted the Unet model as the underlying model architecture for our study.

Methodology
The DNN model described in Figure 1 was written in Python using the TensorFlow and Keras APIs. We use Horovod to implement a data-parallel model. We selected the Adam optimiser and a learning rate of 0.003 and a learning rate warmup over the first 5 epochs. The higher learning rate and the warmup improved the model fitting on the large batch sizes when using many GPUs. We chose a batch size that maximises the use of GPU memory,16 for predicting surface precipitation and a batch size of 2 for forecasting 500 hPa geopotential height. The total batch size is then the number of GPUs multiplied by the batch size on each GPU. The total batch size is therefore a function of the number of GPUs used in model training. We use 80% of the available time steps for training and the remaining 20% of the available data for validation. We run the model training for 50 epochs at which point the MSE value has reached a minimum, then save the model and report the resulting MSE values. We found that 50 epochs ensured that the MSE value always reached a minimum without over-fitting. Using the saved model, we then make model predictions (inference).
In order to load the 2.1 TBytes of model training data using a data-parallel approach, we distribute the model data required by each GPU onto the CPU memory of the corresponding node. This approach facilitates the rapid loading of each batch of data to GPU memory and makes possible the highly scalable data-parallel training by preventing a filesystem IO bottleneck and delivering IO bandwidth at rates even greater than from local SSD drives. The approach is particularly important when running the forecast DNN model as the batch size includes multiple input time steps, and we construct a batch using a rolling window from 2-10 time steps. In order to further reduce memory usage for the forecast DNN, we define a data loader, so we load from memory only the data that each batch requires at each time step. The data stored as multiple NetCDF files which we read using the Xarray package. We divide the training and test data sets equally by time onto each GPU. It is essential that each GPU has exactly the same number of time steps to avoid problems with load balancing and the timely communication of model parameters at the end of each epoch. When fitting the precipitation data, we shuffle the data on each GPU.
When running the forecast DNN model, we also normalised the input geopotential height data and trained the model using the differences, t i+1 À t i . We found that using the normalised differences significantly improved the model training performance. We reverse the procedure when using the trained model to make predictions of geopotential height. We used the tanh activation function and ran the model training for 20 epochs at which point the MSE value has reached its minimum value, saving the best result. We also added an additional 16-wrap longitude grid points covering all latitudes at both the eastern and western extremities in order to maintain numerical consistency and stability of the spatial convolution resulting in a grid of [1472,720]. For forecasting, a time series problem, we do not shuffle the data.
In this article, we focus on the following five key features of developing a data-driven approach to weather forecasting and climate modelling. (1) Our focus is at the global scale. We use global high resolution (0.25°) weather fields extending over multiple decades at an hourly temporal resolution as our training data. (2) As a first step, we demonstrate that a DNN is capable of learning the complex physical relationship between geopotential height and precipitation at high spatial and temporal resolutions comparable to modern global weather forecasting systems. (3) We investigate scaling of the DNN model by training the model using up to 1024 GPU devices, a necessary requirement when working with terabytes of training data. (4) We then extend the DNN model to the time dimension and make predictions of geopotential height. We chose geopotential height as it has long been recognised as a key meteorological forecast field. (5) We suggest a pathway for developing data-driven models of comparable complexity to modern numerical weather prediction models.

Predicting global high resolution precipitation using a DNN
Our first experiment focusses on using a DNN to successfully perform a 2-dimensional spatial regression using geopotential height data to predict surface precipitation. While the dataset includes a temporal domain, in these initial experiments we consider each sample to be independent. We test the limits of what is possible by using meteorological data with global coverage and high spatial and temporal resolution in order to determine if data-driven models can operate at comparable scales to modern weather forecasting and climate models, in this case, by predicting the complex surface total precipitation field using geopotential height fields at 500, 800 and 1000 hPa levels. Figure 1(a) presents a schematic representation of the data set that we have utilised for model training and validation. Figure 1(b) provides a summary of the DNN we have trained to predict surface total precipitation fields. The model shown in Figure 1(b) was implemented in Tensor-Flow 2.2 (Abadi et al., 2015).
We trained the model shown in Figure 1(a) and (b) using 20 years of data over 50 epochs using a RELU activation function and the root mean square error (RMSE) as the loss function. The DNN model was implemented in TensorFlow 2.2 (Abadi et al., 2015), a programming interface specifically designed for implementing machine learning algorithms. Importantly, TensorFlow natively runs on GPU computing devices. We also use the Horovod deep learning framework (Sergeev and Balso, 2018), which takes advantage of decades of research on parallel computing by the HPC community to facilitate distributed deep learning across multiple GPU devices on multiple nodes.
The model was run on 80 Nvidia V100 processors, each with 32 GB memory on the Gadi Supercomputer at the National Computational Infrastructure. On 80 GPUs, a single epoch consisting of 175,296 time steps was completed in 287s and produced an MSE value of 0.0791 on the validation data set. Figure 2 presents an example set of results obtained from training the DNN. Figure 2 shows a plot of the ERA5 total precipitation field which we wish to estimate, the input geopotential height field at two of the three pressure levels used to predict the surface precipitation Figure 2. Prediction of surface total precipitation using the trained DNN model. (a) The ERA5 total precipitation field which we wish to estimate (mm); the plot scale is non-linear in order to reveal the variation in precipitation over several orders of magnitude; (b) the input geopotential height field at 500 hPa pressure level (m 2 s À2 ); (c) the input geopotential height field at 1000 hPa pressure level (m 2 s À2 ); (d) DNN estimates of surface total precipitation (mm) corresponding to Figure 2a. and the resulting estimates of surface total precipitation. While the model training requires significant compute resources, the inference step, where we predict the global high resolution surface precipitation, requires only a fraction of a second of compute time. The DNN model is clearly able to model the spatial variability of key precipitation features both at mid-latitudes and in the tropics, though to a lesser extent in the tropics, indicating that geopotential height at multiple levels provides important information that determines the spatial distribution of surface total precipitation. Further improvements, such as to the prediction of precipitation in the tropics, could be achieved by employing additional data fields in addition to the geopotential height data used in this study.

Scaling a DNN model on GPU devices
The combination of TensorFlow and Horovod allows DNN models to be scaled to thousands of GPU devices providing access to a level of scalability and a concomitant compute capability similar to what is currently available to run some current weather and climate models (Kurth et al., 2019). We investigate the scalability of our DNN model on GPU devices on the Lassen supercomputer at Lawrence Livermore National Laboratory. Figure 3(a) presents the results of a scaling DNN training using 4-256 Nvidia V100 GPUs with 16 GB memory using 2 years of training data. Figure 3(b) presents the results of a scaling DNN training using 128-1024 Nvidia V100 GPUs with 16 GB memory. We use 10 years of input data, a total of 87602 time steps, to train the DNN model, noting that the performance gains (scalability) shown in Figure 3 are not dependent on the number of years of input data used for training. We find that training times increase linearly with increasing years of input data; for example, using 128 GPUs with 2 years of training data, we observed a training time per epoch of 11s as shown in Figure 3(a) and using 10 years of data, we obtain a per epoch training time of 55s as shown in Figure 3(b). The results in Figure 3 illustrate that a high degree of scalability can be achieved up to 1024 GPUs when training our DNN. We also investigate scalability on the Gadi supercomputer using 16, 32 and 64 V100 GPUs with 32 GB memory using 1 year of input data for training. We observed a per epoch training times of 52, 26 and 13 s, respectively, again consistent with a very high degree of scalability of training for our DNN using a combination of TensorFlow and Horovod.

Forecasting global high resolution 500 hPa geopotential height using a DNN
Having successfully applied our DNN model, as described in Figure 1, to predict the spatial distribution of total precipitation, we extended our DNN model to include the temporal dimension. We achieved this by adding an LSTM (Hochreiter and Schmidhuber, 1997), in combination with the 2-D convolutional capabilities, making it possible to forecast the time series evolution of 2-D meteorological variables. In Figure 1(b) we replaced all the layers labelled 'Conv' with a 'Conv2DLSTM' layer; otherwise, the DNN architecture remains the same. This change, however, produced a more than ten-fold increase in the number of parameters in the DNN model as a result of including additional parameters required by the LSTM to capture the time series evolution of complex global high resolution 2-D meteorological fields.
With our forecast DNN model, we focus on the prediction of 500 hPa geopotential height using only past observations of 500 hPa geopotential height as input. We chose the 500 hPa geopotential height as prior studies using DNNs at much lower spatial resolutions have successfully forecast 500 hPa geopotential height Bauer 2018, Weyn et al., 2019). We use input data for the forecast DNN consisting of 10 prior hourly time steps and output a forecast consisting of 2 hourly time steps into the future. We then use a recursive approach to forecast up to 24 h ahead in 1-h increments. The selection of the number of input and output time steps was limited by the availability of GPU memory. Both the increase in the number of model parameters required by the LSTM and in the number of time steps to 12 (10 past + 2 future) result in a large increase in GPU memory requirements. Figure 4(a) provides example time series predictions of 500 hPa geopotential height at t = 1, 6, 12, 18 h into the future. Figure 4(b) provides a comparison of the forecast error against persistence and a weekly climatology using a range of 2-10 input time steps. We find that the forecast DNN model is able to deliver results that are significantly better than these simple forecast metrics, indicating the potential of this approach to provide predictions of global scale meteorological parameters at comparable resolutions to modern NWP models, noting that we have not yet reached the point where the forecast DNN model achieves the same accuracy as NWP models. Figure 4(b) demonstrates the importance of the number of past time steps to the successful prediction of future 500 hPa geopotential height with the predictions improving as we increase in the number of input time steps. Using 4 input time steps results in the transition from an exponential increase in RMSE to a linear increase in RSME with increasing forecast time. As we increase the number of input data points, we observe that the slope of the RMSE curve decreases and the forecast improves particularly at the longer forecast times. Figure 4(b) also indicates that increasing the number of prior time steps beyond 10 is likely to continue to improve the forecast accuracy.

Discussion
The results above demonstrate that we can make high resolution global scale predictions of meteorological variables, such as total precipitation, at a scale and time frequency comparable to current generation NWP models. While we have been successful with our current DNN model, further gains in the DNN model performance are possible. For example, the DNN model can be improved by adding additional input variables or modifying how the variables are presented to the DNN, a process referred to as feature selection by the machine learning community. Additional input variables would improve the model fit only if they provide additional information about the processes producing total   , 6, 12, 18 h (m). Each panel includes the actual geopotential height, the corresponding forecast DNN model estimates of geopotential height and the difference between these two plots. At mid-latitudes, the location of the greatest spatial and temporal variation in geopotential height, we observe the greatest prediction errors; (b) a comparison of the forecast error as RMSE (m) using a forecast DNN models trained with a range of input time steps (2-10) compared with persistence and a weekly climatology.
precipitation. By selectively adding and removing additional input variables, we are able to explore the importance of each variable and the role that it plays in the prediction of the desired output. Feature selection can also include data transformations that improve the process of numerical optimisation of DNN parameters, such as data normalisation, and can be just as critical to successfully constructing a DNN model as selecting the best set of input parameters.
The DNN model can also be modified by changing the architecture of the model and by modifying the many hyperparameters related to the model fitting process. We have based our DNN model on the Unet model architecture and have clearly demonstrated that this model architecture can deliver predictions of meteorological quantities with high spatial variability. We were then able to modify the Unet model to enable weather prediction at high spatial and temporal resolutions by combining the Unet model with the LSTM, Unet-LSTM. The combined Unet-LSTM approach has delivered excellent results in our weather forecasting application, and this architecture is likely to be relevant to many other applications that seek to predict the spatialtemporal evolution of physical quantities. Our current results provide a useful benchmark against which other high resolution global DNN models can be compared. Increasing the number of past time steps used as input to the LSTM well beyond the 10 used in this study will likely lead to improved model predictions, and this is currently only limited by available GPU memory.
While the results presented in this study are very promising, the DNN model developed in this study has a very limited focus compared to major weather forecasting systems, which incorporate dozens of high resolution input and output fields and forecast models with millions of lines of code. However, the ongoing development of modern programming interfaces written in Python for developing DNN models, such as TensorFlow (Abadi et al., 2015), which now incorporates the high level Keras API (Chollet, 2015) that runs on top of TensorFlow, will allow the rapid development of complex data-driven models. When we construct a DNN model, we define a set of inputs, model layers and outputs, which limits our ability to build large complex models as the resulting model would then be monolithic. Fortunately, the Keras interface includes a functional API that can handle multiple inputs and outputs and can treat any model as yet another layer by invoking it as an input or an output of another layer, which allows both the model architecture and the weights to be reused. Models can be nested, and ensembles of models can easily be defined using this approach. The Keras functional API therefore provides an example of a pathway for efficiently developing DNN models with complex graph topologies.
In addition to the challenges posed by developing complex DNN models, we also need to be able to train these models in a reasonable timeframe. In our study, we have demonstrated how we can scale our DNN models using Horovod (Sergeev and Balso, 2018) to run on up to 1024 GPUs with significant potential for scaling beyond this number of GPUs, which has allowed us to take full advantage of the improvement in computing power achieved on HPC systems over the past three decades. Again, while these results are very promising, we have encountered limitations in the available GPU technology. As we increase model complexity and the number of input and output variables, particularly at high resolutions demanded by weather forecasting models, we overwhelm the available GPU memory. GPU device memory is steadily increasing; however, developing complex DNN models will likely require GPU devices with hundreds of GBs to TBs of memory, similar to what is available in current CPU servers. Other developments addressing the issue of GPU memory limitations include combining both data-parallel approaches with model-parallel approaches (Awan et al., 2020, Van Essen et al., 2015. Building complex DNN models will also pose a significant challenge related to the amounts of data required for training. A DNN model similar in complexity to an NWP model will probably use PBs of data during training. Moving this data from storage through to the GPU efficiently will be essential, as model training requires that all training data is read during multiple training cycles or epochs. In our study, we took advantage of the large amounts of CPU memory available on the GPU nodes (256 GB) by staging the data to CPU memory. The total CPU memory available on Lassen when using 1024 GPUs on 256 GPU nodes was 65 TB. Given our experience in training DNN models, training complex DNN models will clearly benefit from co-design of future HPC systems to optimise that training.

Conclusions
Based on the many previous studies and the results presented here, data-driven methods are clearly now positioned to contribute to the next wave of major breakthroughs in atmospheric science. However, it is important to note that achieving this goal will require a substantial long-term effort and that data-driven weather prediction (DWP) is intended to be complimentary to existing numerical weather forecasting approaches.