Tendency-and-attention-informed deep learning for ENSO forecasts

Deep learning has been acknowledged as an increasingly important technology for ENSO forecasts. The most cutting-edge deep learning algorithm is based on the Convolutional Neural Network (CNN), which achieves multi-year (about 17-month-lead) forecasts and has overcome the 'spring predictability barrier' problem. However, this group of methods is still challenged by several critical issues. First, they usually use the global sea surface temperature (SST) fields as inputs without considering the specific contributions of different oceanic regions to ENSO forecasts. Consequently, they cannot effectively investigate the role of the 'teleconnection' mechanism among different oceans (the Indian, Pacific, and Atlantic Oceans) and different oceanic regions (tropical and non-tropical), especially in the forecast of extreme ENSO events. Second, existing methods mainly use discrete monthly SST fields for ENSO forecasts without investigating the rates of change between adjacent months, which also provide important information for predicting the variation tendency. To solve these problems, this paper develops a Tendency-and-Attention-Informed Deep Residual Network (TA-DRN) for multi-year ENSO forecasts. The contributions of different oceanic regions are learned by a spatial attention module, while the variation tendency across the previous and current months is captured by the first- and second-order differences of SST fields. Informed by these two modules, the performance of TA-DRN is improved significantly, especially in predicting extreme El Niño and La Niña events.


Introduction
El Niño-Southern Oscillation (ENSO) is the most significant interannual climate signal in the world, with a sea surface temperature (SST) and air pressure oscillation occurring in the equatorial Pacific Ocean (Bradley et al 1987; Rasmusson and Carpenter 1982). It is known as a sea-air interaction phenomenon at low latitudes characterised by irregular fluctuations of SST in the tropical Pacific. The warm phase of ENSO is usually known as El Niño, while the cold phase is La Niña (McPhaden et al 2006). During the ENSO growing phase, the Bjerknes feedback is an essential process (Bjerknes 1969). The initial positive/negative SST anomaly, associated with a weakened/strengthened Walker Circulation and a westerly/easterly anomaly, is amplified through the Bjerknes feedback, leading to a mature El Niño/La Niña event. Although ENSO originates in the equatorial Pacific, it can exert profound impacts on climate, ecosystems and economies around the world through atmospheric and oceanic teleconnections (Ropelewski and Halpert 1987; Ashok and Yamagata 2009).
Accurate prediction of ENSO is highly important because of its profound impacts on the world, and it accounts for the major skill source of seasonal-to-interannual climate prediction (Cane et al 1986; Luo et al 2008, 2016). Compared with other climate phenomena, ENSO is perhaps the one whose occurrence, dynamics and influences are relatively well understood and predicted. Its predictability is based on its obvious quasi-periodicity, occurring every 2-7 years (Latif et al 1998; Fang and Xie 2020). For ENSO prediction, different ENSO indices are designed to reflect the complex climate change factors. Niño3.4 is the most commonly used to define El Niño and La Niña events. An El Niño/La Niña event is defined when the three-month moving average sea surface temperature anomaly in the Niño3.4 region is higher than 0.5 °C (or lower than −0.5 °C) and lasts for at least five months (Cane et al 1986; Bamston et al 1997).
Substantial effort has been devoted to predicting the occurrence, development and evolution of ENSO. Traditional ENSO forecasting models mainly consist of statistical and numerical forecast models. Statistical models use historical data to construct the evolution of ENSO to achieve analysis and prediction of ENSO phenomena. These models are less expensive and easier to develop than complex numerical models. Commonly used statistical methods, such as multiple linear regression and Markov chains, have made progress in ENSO forecasting (Tseng et al 2017). However, they are still challenged by several issues. For example, ENSO is a complex system, and statistical methods are not ideal for discovering such complex patterns. In addition, observational data in the tropical Pacific are scarce and insufficient to provide the large amount of historical data needed for statistical ENSO forecasts. Numerical models mainly utilize coupled atmosphere-land-ocean physical equations to study ENSO mechanisms, simulation and prediction. Numerical forecast models include the Intermediate Coupled Model (ICM) (Zebiak and Cane 1987), the Hybrid Coupled Model (HCM) (Tang and Hsieh 2002) and the Coupled General Circulation Model (CGCM) (Jin et al 2008), among others. There exist more than 20 models of varying complexity for ENSO forecasts. These models have made great progress in the past decades and can predict ENSO events 6-12 months in advance (Zhang et al 2020). However, due to the uncertainty of the initial conditions, numerical models also have difficulty simulating the interannual variability of SST (Jin et al 2008). Because of the diversity and variability of the spatial and temporal evolution of ENSO, state-of-the-art coupled models still do not adequately capture these features (Zhang et al 2020). Moreover, the 'spring predictability barrier' (McPhaden 2003; Webster and Yang 1992) is prevalent in numerical forecast models.
With the advent of the Big Data era, deep learning, which can discover complex patterns from large data sets, has already had a huge impact on many fields (LeCun et al 2015). Deep learning comprises many novel models to address the challenges of climate prediction, including ENSO forecasts. In 2018, Clifford et al. utilized various complex network metrics extracted from climate networks with a long short-term memory (LSTM) neural network to forecast ENSO; the LSTM model plays an important role in multistep-ahead prediction (Broni-Bedaiko et al 2019). In 2019, Mahesh et al. combined convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to predict Niño3.4 using SST, with an accuracy similar to that of state-of-the-art seasonal dynamic forecasting models (Mahesh et al 2019). In 2020, Gupta et al. developed a convolutional LSTM (ConvLSTM) network that can predict the monthly mean Niño3.4 index one year ahead; the ConvLSTM is a flexible deep learning model that can simulate complex spatio-temporal sequences (Gupta et al 2020). Yan et al. proposed an ensemble empirical mode decomposition-temporal convolutional network (EEMD-TCN), which decomposed the highly variable Niño3.4 index and Southern Oscillation Index (SOI) into relatively flat subcomponents, used the TCN model to predict each subcomponent in advance, and finally combined the sub-prediction results to achieve the final ENSO forecasts (Yan et al 2020). These existing works all demonstrate the promising capabilities of deep learning for ENSO forecasts.
The state-of-the-art deep learning method for ENSO forecasts is the CNN network proposed in Ham et al (2019). It achieves multi-year ENSO forecasts with a 17-month-lead prediction. Its performance is superior to that of most existing models, and the forecast results are less affected by the 'spring predictability barrier' (Ham et al 2019). Although it has been acknowledged as a milestone for ENSO forecasts, it is still challenged by several issues. First, the CNN model operates on the spatial distribution of the global SST fields without considering the temporal SST variations (also known as 'tendency') (Zhang and Gao 2016) between adjacent monthly SSTs. Specifically, the inputs of this model are discrete monthly spatial SST fields, without the temporal features correlating the preceding and succeeding months, resulting in a lack of temporal feature representation. ENSO is a complex nonlinear signal that also evolves with the changing climate, including its responses to the changing global climate and its feedbacks on the worldwide climate. Thus, models need to investigate the temporal features between adjacent SST fields to understand the variation of ENSO over time. Second, the CNN model (Ham et al 2019) treated the global oceanic SST fields equally without considering the different contributions of the various oceans and sea areas to ENSO prediction. Although ENSO occurs with the warmest SST anomalies in the tropical Pacific, recent studies have shown that ENSO events, especially super (or extreme) El Niño and La Niña events, are caused by the joint boost of the Pacific, Indian and Atlantic Oceans. The Indian and Atlantic Oceans and their 'teleconnections' with the Pacific Ocean can greatly improve super El Niño/La Niña prediction (Wang and Wang 2021). However, the teleconnection mechanism cannot be investigated or reflected by using the global SST fields equally as in Ham et al (2019), which challenges the prediction of ENSO.
To overcome the above problems, this paper develops a Tendency-and-Attention-Informed Deep Residual Network (TA-DRN) for multi-year ENSO prediction. Motivated by the numerical models for ENSO prediction, we feed the deep network with the tendency information along with the original SST fields as inputs. The tendency of SST is defined by the first- and second-order differences of SST fields in adjacent months. These supply the deep network with the temporal information of SST variations over time. For the second issue, this paper introduces a spatial attention module in the deep residual network, which can learn the contributions of regional oceanic areas to ENSO prediction. The teleconnection mechanism of the Indo-Pacific-Atlantic Oceans and the tropical-nontropical oceanic areas can be investigated extensively through the spatial attention module. This attention module not only improves the performance of deep learning in ENSO forecasts, especially for super ENSO events, but also makes the deep network more physically interpretable.

Contributions of this work
The contributions of this paper can be summarized as follows:
1. We integrated temporal information by using the first- and second-order differences of SST fields in the input and developed a tendency-informed network, which supplies both spatial and temporal features for deep learning.
2. We investigated the 'teleconnection' mechanism of the three major oceans by introducing a spatial attention module and developed an attention-informed deep network, which learns the contributions of different oceans and various oceanic areas automatically, thus improving the performance of the deep network in forecasting super ENSO events and making it more physically interpretable.

Motivation 1: Tendency-informed deep learning
Using tendency as temporal information in the input of the deep network is motivated by the essential idea of using numerical models to simulate the SST variation (defined as tendency) for ENSO forecasts. The Intermediate Coupled Model developed by the Institute of Oceanology, Chinese Academy of Sciences (IOCAS ICM) is one of the most important numerical models, ranked among the top 10 models for ENSO forecasts. We take it as an example for explanation. In IOCAS ICM, the tendency defined by the SST variation is incorporated into the ocean dynamical model to characterize thermodynamic processes, as shown in Eq. 1. The left-hand side of Eq. 1 is defined as the tendency of SST, which is inspired by the governing equation of SST in Zebiak and Cane (1987).
The governing equation determining the evolution of interannual SST variability in Zhang and Gao (2016) can be written, in schematic form, as

$$\frac{\partial T'}{\partial t} = -\Big(u\frac{\partial T'}{\partial x} + v\frac{\partial T'}{\partial y}\Big) - \frac{M(w)}{H}\,(T' - T_e) + \kappa_h \nabla_h^2 T' + Q' \tag{1}$$

In Eq. 1, T′ and T_e are the SST anomaly and the subsurface upwelling temperature field; u, v, w denote the zonal, meridional and vertical velocity; M is the Heaviside step function; H is the depth of the mixed layer; κ_h is the horizontal diffusivity; and Q′ is the anomalous heat flux. The left-hand side of the equation is the SST variation (tendency), and the right-hand side contains the contributing terms, including anomalous heat flux, horizontal advection, entrainment, horizontal diffusion and vertical mixing (the last absorbed into the entrainment and flux terms in this schematic form). The SST tendency quantifies the effects of these physical processes. Therefore, introducing the SST tendency can enhance the temporal features of SST, whereas using only SST fields at different time steps may miss this information.
In this paper, the tendency is represented by the first- and second-order differences of the adjacent previous and current monthly SST fields along time, shortened to the first- and second-order time-differences of SST in the following text. Physically, they can be considered as the 'velocity' and 'acceleration' of SST changes, respectively. They supply important temporal information to the original SST fields. For further explanation, we use the time series of the Niño3.4 index, which reflects the changes of the SST fields in the tropical Pacific. The original Niño3.4 index and its first- and second-order differences are shown in Fig. 1. We can see that the differences supply significantly important temporal information for ENSO forecasts by reflecting the rate of change, which can improve the accuracy of ENSO forecasts. The computation of the SST differences is explained in Sect. 3.4 in detail.

Motivation 2: Attention-informed deep learning
As mentioned above, the 'teleconnection' mechanism among the Indo-Pacific-Atlantic and tropical-nontropical oceans can help forecast ENSO events. However, the specific contributions of different oceans and oceanic regions are still unknown (Dayan et al 2014). Moreover, for different El Niño and La Niña events, the specific contributions of regional oceans also differ. Thus, it is very challenging for humans and existing methods to learn the specific contributions of the global oceans for each ENSO event. Motivated by this finding, this paper introduces a spatial attention module, which can learn the contribution of different oceans and oceanic regions automatically for each ENSO event. Through the spatial attention module, the 'teleconnection' mechanism can be investigated by the deep learning network specifically, making the ENSO forecasts more accurate.

Overview of the whole structure
This paper develops a Tendency-and-Attention-informed Deep Residual Network (TA-DRN) for ENSO forecasts. The whole structure is shown in Fig. 2 and consists of three parts: the Tendency-Informed Representation Module at the bottom level, the Attention-Informed Residual-based Network at the middle level and the fully connected layer at the top level. The TA-DRN model uses SST anomaly (SSTA) grid data over 0°-360°E, 55°S-60°N for twelve consecutive months as predictors and the Niño3.4 index as the output to be predicted up to two years ahead. The model is designed to fit the nonlinear relationship between the SSTA and the future Niño3.4 index.
The Tendency-Informed Representation Module is designed for processing the input data. The most important variables affecting ENSO variations are sea surface temperature (SST) and heat content (HC) (McPhaden et al 2006). Existing methods mainly used these two variables as predictors (Ham et al 2019, 2021; Ye et al 2021). However, HC cannot be obtained from observations directly: it requires high computational cost and resources to calculate HC from observational results (Leipper and Volgenau 1972). To make prediction more efficient, this paper discusses the possibility and effectiveness of using SST only for ENSO forecasts. The SSTA for each sample (monthly averaged SSTA data) can be represented by a tensor X ∈ R^{12×H×W}, where H × W is the spatial resolution of the SSTA fields and 12 is the length of the time window (12 months). The first- and second-order time-differences of X (denoted by ΔX ∈ R^{11×H×W} and Δ²X ∈ R^{10×H×W}) are computed by the Tendency Module in Sect. 3.4. The three tensors (X, ΔX, Δ²X) are concatenated together to form a new tensor X̃ ∈ R^{33×H×W}, which is used as the final input to the model.
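As a concrete illustration, the construction of this 33-channel input tensor can be sketched in a few lines of NumPy. This is a minimal sketch with toy random data; the array names (`X`, `dX`, `d2X`, `X_in`) are our own and not from the paper's code.

```python
import numpy as np

# Toy dimensions: 12 months of SSTA fields on a 24 x 72 grid (5-degree resolution).
T, H, W = 12, 24, 72
rng = np.random.default_rng(0)
X = rng.standard_normal((T, H, W))          # monthly SSTA fields X_t

# First- and second-order time-differences (the "tendency" channels).
dX = np.diff(X, n=1, axis=0)                # shape (11, H, W): X_t - X_{t-1}
d2X = np.diff(X, n=2, axis=0)               # shape (10, H, W): X_t - 2 X_{t-1} + X_{t-2}

# Concatenate along the channel/time axis to form the final 33-channel input.
X_in = np.concatenate([X, dX, d2X], axis=0)
print(X_in.shape)                           # -> (33, 24, 72)
```

The channel counts 12 + 11 + 10 = 33 match the tensor dimensions stated above.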
The Attention-Informed Residual-based Network is composed of a convolutional layer and multiple residual blocks. The first convolutional layer is used for preprocessing the inputs. Convolution can effectively extract local features from meteorological data, and outliers in the data can be suppressed by the convolution process. Therefore, the inputs first pass through a convolutional layer to obtain a feature map. This feature map then flows through residual blocks in which the attention module is embedded. The attention module enables the model to learn the contributions of different regions so that it can focus on key information and ignore irrelevant information. The residual structure can prevent excessive information loss caused by the attention module, and thus helps the attention module perform better. After this procedure, the final feature map is obtained.

Fig. 1 The time series of the original Niño3.4 index and its first- and second-order differences
The final feature map is mapped to the Niño3.4 index to obtain the final results through a fully connected layer and an output layer, as shown in the top layer of Fig. 2. Each part of the whole structure is explained in more detail in the following sections.

Basic structure of deep residual network
The Deep Residual Network is used as the basic structure of the deep model, and the spatial attention module is embedded in the residual blocks. Figure 3 shows the structure of the residual block, which consists of two convolutional layers and a spatial attention module.
The convolutional layer has a great ability to acquire spatial structure information. The convolution process extracts local features from the feature map by calculating the dot products between the weights of the convolution kernel and the corresponding values in the feature map:

$$v_{i,j}^{x,y} = \sum_{m=1}^{M_{i-1}} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{i,j,m}^{p,q}\, v_{i-1,m}^{x+p,\,y+q} + b_{i,j}$$

where x, y denote the position in the feature map at the next layer; i, j denote the j-th feature map at the i-th convolutional layer; P_i and Q_i are the height and width of the convolution kernel for the i-th convolutional layer; M_{i−1} is the number of channels of the feature map in layer i − 1; w is the weight at position p, q of the convolution kernel; v_{i,j} is a value of the j-th feature map at the i-th convolutional layer; and b_{i,j} is the bias unit.
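The convolution sum described above can be implemented directly (if inefficiently) with explicit loops. The following NumPy sketch is illustrative only; `conv2d_valid` is a hypothetical helper, not the model's actual implementation.

```python
import numpy as np

def conv2d_valid(v_prev, w, b):
    """Direct implementation of the convolution sum in the text:
    v[x, y] = sum_m sum_p sum_q w[m, p, q] * v_prev[m, x+p, y+q] + b,
    producing one output feature map from an M-channel input ('valid' padding)."""
    M, H, W = v_prev.shape
    _, P, Q = w.shape
    out = np.empty((H - P + 1, W - Q + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # Dot product between the kernel and the current receptive field.
            out[x, y] = np.sum(w * v_prev[:, x:x + P, y:y + Q]) + b
    return out

rng = np.random.default_rng(1)
v = rng.standard_normal((3, 6, 8))   # M = 3 input channels
w = rng.standard_normal((3, 2, 4))   # one 2 x 4 kernel slice per input channel
feat = conv2d_valid(v, w, b=0.1)
print(feat.shape)                    # -> (5, 5)
```

In practice a deep learning framework's convolution op computes the same sum far more efficiently.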
The residual block can be summarized as follows:

$$\mathbf{y} = R(\mathbf{x}, \{W\}) + \mathbf{x}$$

where x and y represent the input and output feature maps of the residual block, respectively, and the function R represents the residual mapping to be learned. The operation R(x, {W}) + x is realized by a skip connection and element-wise addition. The reason for using the residual structure as the basic model is that the residual mapping is easier to optimize. The spatial attention module is described in Sect. 3.3.

Principle of the spatial attention module
The spatial attention module is introduced in this paper to learn the contributions of different oceans and oceanic regions for ENSO forecasts. The attention map generated by this module guides the model to focus on the key regions with higher contributions for the prediction of extreme ENSO events. This module enables the deep learning model to focus on significant features and suppress unnecessary ones.
The intermediate feature map can be adaptively refined through the spatial attention module at each residual block. The spatial attention module is inspired by the Convolutional Block Attention Module (Woo et al 2018). Its structure is shown in Fig. 4.
Given a feature map F ∈ R^{C×H×W}, the attention module infers a 2D spatial attention map M ∈ R^{1×H×W} to obtain a new feature map F′ by the following calculation:

$$F' = M(F) \otimes F$$

where ⊗ denotes element-wise multiplication. The feature maps F and F′ have the same shape, and M(F) denotes the calculation of the spatial attention map M from the original feature map F.
The spatial attention map M is generated by utilizing the inter-spatial relationship of features: average-pooling and max-pooling are applied along the channel axis, and the concatenated descriptor is passed through a convolutional layer and a sigmoid function:

$$M(F) = \sigma\big(f^{k\times k}([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)])\big)$$

where σ denotes the sigmoid function and f^{k×k} represents the convolution operation with a filter size of k × k.
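A minimal NumPy sketch of this CBAM-style spatial attention (channel-wise average/max pooling, a k × k convolution of the pooled descriptor, a sigmoid, and element-wise multiplication) might look as follows. `spatial_attention` is a hypothetical stand-alone function, not the paper's TensorFlow code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(F, w, b=0.0):
    """CBAM-style spatial attention: pool along the channel axis, convolve the
    2-channel descriptor with a k x k filter (zero-padded 'same', odd k),
    apply a sigmoid, and rescale the feature map.
    F: (C, H, W) feature map; w: (2, k, k) filter. Returns the refined map F'."""
    C, H, W = F.shape
    desc = np.stack([F.mean(axis=0), F.max(axis=0)])  # [AvgPool(F); MaxPool(F)]
    k = w.shape[1]
    pad = k // 2
    desc_p = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))
    M = np.empty((H, W))
    for x in range(H):
        for y in range(W):
            M[x, y] = np.sum(w * desc_p[:, x:x + k, y:y + k]) + b
    M = sigmoid(M)            # attention map with values in (0, 1)
    return M[None] * F        # element-wise multiplication, broadcast over channels

rng = np.random.default_rng(2)
F = rng.standard_normal((8, 6, 6))
w = rng.standard_normal((2, 3, 3)) * 0.1
F_ref = spatial_attention(F, w)
print(F_ref.shape)            # -> (8, 6, 6)
```

Because the attention values lie in (0, 1), the module can only attenuate features; in TA-DRN the residual skip connection then adds the unweighted input back, which is what prevents excessive information loss.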

Principle of the tendency module
The tendency information is defined by the first- and second-order differences of the SSTA fields between the previous and the current time steps. It serves as an important temporal feature for ENSO forecasts and can be calculated as follows:

$$\Delta X_t = X_t - X_{t-1}, \qquad \Delta^2 X_t = \Delta X_t - \Delta X_{t-1} = X_t - 2X_{t-1} + X_{t-2}$$

where X_t refers to the SSTA fields at the t-th time step, and ΔX and Δ²X represent the first- and second-order differences of X, respectively.
Since the distributions of X, ΔX, and Δ²X differ from each other, we use Gaussian normalization, i.e. each tensor is standardized to zero mean and unit variance, to make them match.
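The tendency computation and Gaussian normalization together can be sketched as follows; `gaussian_normalize` is a hypothetical helper that standardizes each tensor before concatenation, under the assumption that each component is normalized independently.

```python
import numpy as np

def gaussian_normalize(A):
    """Standardize a tensor to zero mean and unit variance (z-score), so that
    X, dX and d2X are on a common scale before concatenation."""
    return (A - A.mean()) / A.std()

rng = np.random.default_rng(3)
X = rng.standard_normal((12, 24, 72)) * 2.0 + 1.0   # toy SSTA fields
dX = np.diff(X, n=1, axis=0)                        # first-order tendency
d2X = np.diff(X, n=2, axis=0)                       # second-order tendency

parts = [gaussian_normalize(A) for A in (X, dX, d2X)]
X_in = np.concatenate(parts, axis=0)                # 33-channel model input
print(X_in.shape)                                   # -> (33, 24, 72)
```

Each normalized component now has (numerically) zero mean and unit standard deviation.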
2. The reanalysis data from 1871 to 1972 from the Simple Ocean Data Assimilation (Giese and Ray 2011) were used for validation, containing a total of 1224 (102 × 12) calendar months: 102 denotes the number of years and 12 the months per year. The Global Ocean Data Assimilation System reanalysis, with a total of 432 (36 × 12) calendar months during 1982-2017, was used to test the performance of the model (Saha et al. 2006). The SSTA fields over 0°-360°E, 55°S-60°N with a spatial resolution of 5° × 5° are used in this paper; the grid size of the SSTA fields is 24 × 72.
We used a single NVIDIA Tesla P100 GPU for computation, and the deep model was built under Python 3.7 with TensorFlow. In the architecture of the model, the residual blocks are stacked 9 times in total. No downsampling is used in the first three blocks, so TA-DRN requires two different sizes of convolution kernels. To accelerate the convergence of the model, batch normalization is utilized, and a dropout strategy is applied to the fully connected layer. The loss function is the mean squared error, which is commonly used in regression problems. During the training phase, the number of epochs is set to 200, Adam is used as the optimizer and early stopping is applied. Finally, we determined three hyperparameters to be optimized: the convolution filter sizes, the batch size and the learning rate.

Evaluating metrics
Following existing works, the Pearson Correlation Coefficient (PCC) and the Root Mean Square Error (RMSE) are used as the two metrics for evaluating the performance of different methods for ENSO forecasts. The PCC measures the linear correlation between the true Niño3.4 index and the forecast one, while the RMSE measures their difference. They can be calculated as follows:

$$\mathrm{PCC} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(p_i - \bar{p})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(p_i - \bar{p})^2}}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - p_i)^2}$$

where n is the number of samples; y_i denotes the true value of the i-th sample; p_i denotes the predicted value of the i-th sample; and ȳ and p̄ denote the mean of the true and predicted values, respectively.
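Both metrics are straightforward to compute; a NumPy sketch with toy index series follows (the numbers are illustrative only, not from the paper's experiments).

```python
import numpy as np

def pcc(y, p):
    """Pearson correlation coefficient between observed and forecast series."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    dy, dp = y - y.mean(), p - p.mean()
    return float(np.sum(dy * dp) / np.sqrt(np.sum(dy**2) * np.sum(dp**2)))

def rmse(y, p):
    """Root mean square error between observed and forecast series."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return float(np.sqrt(np.mean((y - p) ** 2)))

obs = [0.5, 1.2, 2.1, 1.0, -0.3, -1.1]   # toy "observed" Nino3.4 values
fcst = [0.4, 1.0, 2.3, 0.8, -0.2, -0.9]  # toy forecast
print(round(pcc(obs, fcst), 3), round(rmse(obs, fcst), 3))
```

A perfect forecast gives PCC = 1 and RMSE = 0; in the paper, lead times with PCC above 0.5 are counted as skilful.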

Visual explanation analysis for attention-module evaluation
To demonstrate the effectiveness of the spatial attention module for ENSO forecasts, this paper applies the visual explanation analysis method developed in Selvaraju et al (2017) to compare the contribution maps of the global ocean with and without the spatial attention module. The results are shown in Sect. 4.4.5. This method can be applied to CNN-based deep models. The final contribution map (named the localization map) (Selvaraju et al 2017) L^m for the target month m is determined as follows. We first calculate the gradient of the prediction with respect to the feature map A^k of the convolutional layer. The gradients are then back-propagated and pooled globally over the width and height dimensions to obtain the importance weights α_k^m:

$$\alpha_k^m = \frac{1}{Z}\sum_{i}\sum_{j}\frac{\partial y^m}{\partial A_{ij}^k}$$

where α_k^m represents the importance of feature map A^k for the prediction of the target month m, and Z is the number of spatial locations in A^k. Finally, the visual explanation of the contribution map is obtained from:

$$L^m = \mathrm{ReLU}\Big(\sum_{k} \alpha_k^m A^k\Big)$$

where k indexes the feature maps.
In the CNN-based model, the feature map of the last convolutional layer is connected to the fully connected layer and the output layer, and its encoding ability directly determines the prediction performance of the model. The feature map of this layer retains the spatial information of the input data, and the attention mechanism can allocate high weights to the key regions with more significant information. That means the features of these key regions are reflected in the feature map of the last convolutional layer. Therefore, this paper visualizes the contribution of the feature map obtained from the last convolutional layer for heatmap analysis. In the contribution map L, larger values represent higher contributions of the corresponding regions to ENSO forecasts. The final contribution map is obtained after normalizing the values to the range of 0 to 1. The size of the contribution map is the same as the output of the last convolutional layer, which is 6 × 18 (Selvaraju et al 2017).
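Given the feature maps and gradients, the localization map can be sketched as below. In a real model the gradients come from back-propagation through the network; here a random array stands in for them, so the resulting map is meaningless and only the mechanics (pooled weights, weighted sum, ReLU, [0, 1] normalization) are shown.

```python
import numpy as np

def contribution_map(A, dY_dA):
    """Grad-CAM-style localization map: global-average-pool the gradients to
    obtain per-channel importance weights alpha_k, form the weighted sum of the
    feature maps, apply ReLU, and normalize to [0, 1].
    A: (K, H, W) feature maps of the last conv layer; dY_dA: gradients of the
    target-month prediction w.r.t. A (here supplied as a toy array)."""
    alpha = dY_dA.mean(axis=(1, 2))                      # (K,) importance weights
    L = np.maximum(np.tensordot(alpha, A, axes=1), 0.0)  # ReLU(sum_k alpha_k A_k)
    if L.max() > 0:
        L = L / L.max()                                  # normalize to [0, 1]
    return L

rng = np.random.default_rng(4)
A = rng.standard_normal((16, 6, 18))  # 6 x 18 matches the map size in the text
g = rng.standard_normal((16, 6, 18))  # stand-in for back-propagated gradients
L = contribution_map(A, g)
print(L.shape)                        # -> (6, 18)
```

The normalization to [0, 1] matches the paper's description of the final contribution map.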

Determination of the optimal hyperparameters
Figures 5, 6 and 7 show the PCC and RMSE for different hyperparameters on the test set. We find that using a larger batch size may improve the speed of training but does not always improve the effectiveness of the model. A learning rate that is too large or too small can also degrade the experimental results. We finally determined the optimal parameters as follows: the filter sizes are 4 × 8 and 2 × 4; the batch size is 256; the learning rate is 0.001. TA-DRN ultimately adopted these parameters to ensure a larger PCC and a smaller RMSE.
To check whether the model over-fits during training, we illustrate the training and validation loss curves for one case in Fig. 8. The training loss gradually decreases as training proceeds. Although the validation loss fluctuates in the early stages of training, it eventually converges. Thanks to the early stopping method, this case ended after 42 training epochs, even though the maximum number of epochs was set to 200. The final model we saved was the one with the lowest validation loss.

The PCC and the RMSE of the different structures are shown in Fig. 9a and b, respectively. From Fig. 9, we can see that the performances of 'Base', 'Base+Diff' and 'Base+Attention' are similar to each other. They achieve about 15-month-lead forecasts, before which the PCC is higher than 0.5. Although the effectiveness of using either the tendency or the attention module alone is not obvious, when they are integrated together, the final structure ('Base+Diff+Attention') performs much better than the other three models. It achieves about 17-month-lead forecasts, and its RMSE is obviously lower than the others. The reason may be that the basic deep structure DRN is relatively simple, comprising few intermediate layers. After adding the large quantity of temporal features defined by the first- and second-order time-differences of the SSTA fields, the deep structure cannot accommodate so many features, and overfitting may occur because of the information overload. However, after integrating the spatial attention module with the tendency module, the contribution map generated by the attention module can guide the deep model to focus on the most significant features, and thus the temporal features from the tendency module can be well utilized. Therefore, the final structure 'Base+Diff+Attention' performs much better than 'Base', 'Base+Diff' and 'Base+Attention'. The quantitative comparisons of PCC and RMSE are also shown in Tables 3 and 4, respectively.

Performance comparison of our method to existing methods
We compared the results of our method ('TA-DRN') with the existing cutting-edge deep learning method (denoted by 'CNN') proposed in Ham et al (2019) and the leading dynamical forecasting method (denoted by 'SINTEX-F') proposed in Luo et al (2008). The results are illustrated in Fig. 10 and Table 3, respectively. Figure 10 shows the all-season correlation skill (PCC) and RMSE of the three-month moving-averaged Niño3.4 index from 1982 to 2017.
In general, we can see that as the lead time (in months) increases, the PCCs of all the methods decrease while their RMSEs gradually increase. That is, the forecast accuracy degrades with increasing lead time. Nevertheless, our method still demonstrates better performance than the other two methods. From Fig. 10a, we can see that the PCC of our method and that of the 'CNN' method show similar results. Their correlation skills are greater than 0.5 for up to 17 months. That means both of them can provide effective 17-month-lead forecasts, which is much longer than the 13-month-lead forecasts of the SINTEX-F method. More interestingly, the PCC of our method is higher than that of the 'CNN' for lead times beyond 15 months. For example, as shown in Table 3, at an 18-month lead the PCC of our method is 0.48 while that of the 'CNN' is 0.46, and at a 21-month lead the PCC of our method is 0.41 versus 0.34 for the 'CNN'. The better performance of our method is demonstrated even more clearly by the comparison of RMSE shown in Fig. 10b. We can see that the RMSE of our method is much lower than that of the 'CNN' and 'SINTEX-F' for all lead times, with an RMSE of less than 0.8 for forecasts 24 months ahead. That means the forecast error of our method is much lower than that of the other two methods. From the comparison of both PCC and RMSE, we may conclude that our method has a higher capacity to achieve forecast skill beyond 17 months.

Forecasts of time series of Niño3.4 Index
In this part, we perform experiments on the time series of the Niño3.4 index over 1982-2017 to evaluate the effectiveness of our method in ENSO forecasts over long periods. Figure 11 shows the Niño3.4 index for the December-January-February (DJF) season for 1-, 6-, 12- and 18-month-lead forecasts, respectively. In Fig. 11a-c, we compare the results of our method with the actual observations to demonstrate its good ability. In Fig. 11d, apart from the actual observations, we also compare the results of our method with those of the aforementioned 'CNN' method (Ham et al 2019) for a more detailed analysis of the most challenging 18-month-lead forecasts.
Generally, the accuracy of the forecasts decreases as the lead time increases; thus, the accuracy of the one-month-lead forecasts is the highest of all cases. For the 1-, 6- and 12-month-lead forecasts, the forecast results of our method match well with the actual observations. However, as the lead time increases, the forecast error increases slightly. For the 12-month-lead forecasts, our method can accurately forecast the trend of the Niño3.4 index. For the more challenging 18-month-lead forecasts, our method and the 'CNN' method predict the trend and amplitude of the Niño3.4 index more satisfactorily before 2003 than afterwards. This is probably because ENSO also changes as the global climate warms and exhibits different characteristics in the twenty-first century from those in the twentieth century.

Extreme ENSO forecast
To investigate the performance of our method extensively, we use the more challenging extreme (or super) El Niño/La Niña event forecasts in this experiment. Super El Niño/La Niña events are characterized by intense, basin-wide warming/cooling in the central-eastern tropical Pacific (Wang and Wang 2021). During the post-1950 period, there were four super El Niño events in the records, which occurred in 1972, 1982, 1997 and 2015 (Wang and Wang 2021). In this paper, we use the last one as an example for analysis. This experiment is also performed to demonstrate the effectiveness of the spatial attention module for investigating the 'Indo-Pacific-Atlantic Ocean teleconnection' and its use for extreme ENSO event forecasts.
Figure 12a shows the forecast results starting from boreal spring (March, April and May, denoted by MAM) 2014 and lasting to February-March-April (FMA) 2016. The forecasts are initiated in boreal spring because it is a season that is especially difficult to predict due to the 'spring barrier' problem. We can see that the Niño3.4 index predicted by our method closely tracked the observed evolution of this super El Niño event.

To demonstrate the effectiveness of the attention module more comprehensively, we use the visual explanation method introduced in Sect. 4.3 for a thorough analysis. Figure 12b and c show the visual explanation maps of the forecast results for the above-mentioned 2015 DJF Niño3.4 index using our TA-DRN model and using 'Base+Diff' without the attention module, respectively.
The explanation map illustrates the contribution of each grid point to the target value through a weight ranging from 0 to 1. Compared with Fig. 12c, the visual explanation map generated by TA-DRN in Fig. 12b shows higher contributions from the North Pacific, South Pacific, tropical Indian Ocean and North Atlantic to the final forecast. This finding is supported by the latest work on super El Niño forecasts, which indicated that a super El Niño event is usually formed under the joint boosting of the Indian and Atlantic Oceans (Wang and Wang 2021). Thanks to the 'teleconnection' mechanism, our attention module successfully learns the contributions of the different oceans and helps our model focus on the key information. The visual explanation map therefore verifies the effectiveness of our spatial attention module in the forecast of extreme El Niño events.
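Such contribution maps can be illustrated with a generic gradient-based saliency sketch. This is a hedged stand-in (the actual visual explanation method of Sect. 4.3 may differ), assuming a PyTorch model whose output is the scalar forecast:

```python
import torch

def contribution_map(model, x):
    """Per-grid-point contribution of the input SST fields to the
    forecast, rescaled to [0, 1] like the maps in Fig. 12.
    x: tensor of shape (B, C, H, W); returns a (B, H, W) map."""
    x = x.clone().detach().requires_grad_(True)
    model(x).sum().backward()        # scalar forecast(s) -> input gradients
    m = x.grad.abs().sum(dim=1)      # aggregate over input channels
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)  # normalize to [0, 1]
    return m
```

Grid points with weights near 1 are the ones the model relies on most, which is how the high-contribution regions in Fig. 12b are read off.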
We also performed an experiment on an extreme La Niña forecast for further analysis. Figure 13 shows the forecast results initialized at MAM 1998 and extending to FMA 2000. From Fig. 13a, we can see that the Niño3.4 index predicted by our TA-DRN model matches the ground truth well, whereas the performance of 'Base+Diff' without spatial attention is unsatisfactory. The explanation maps shown in Fig. 13b and c also verify the effectiveness of the spatial attention module: the North Pacific, tropical Indian Ocean and subtropical Atlantic make high contributions to the formation of this extreme La Niña event. We also find that the contribution of the East Pacific increases noticeably, which is related to the strong 1997 El Niño event and the pronounced SST changes that occurred in the eastern Pacific. With the attention module, our model directs its attention to these focal areas.

Conclusion
This paper developed a Tendency-and-Attention-Informed Deep Residual Network (TA-DRN) for ENSO forecasts. Motivated by the numerical models for ENSO forecasts, the tendency module supplies the DRN with important temporal features defined by the first- and second-order time differences of SSTA fields. The spatial attention module, which learns the specific contributions of different oceans and ocean regions to ENSO forecasts, plays a significant role in investigating the teleconnection mechanism among the Indian, Pacific and Atlantic Oceans and between tropical and non-tropical regions in the formation of extreme (super) ENSO events, and thus makes the forecasts more accurate. For each ENSO event, our method generates a specific contribution map of the global ocean independently. By integrating the tendency module and the attention module, our method automatically learns the most important spatial and temporal features guided by the contribution map, making it more intelligent. Extensive experimental results demonstrate that TA-DRN outperforms the existing cutting-edge methods for ENSO forecasts: it achieves 17-month-lead forecasts with a much smaller RMSE than existing methods, shows better performance in extreme El Niño and La Niña forecasts, and performs satisfactorily in real-time ENSO forecasts.
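The tendency features summarized above amount to discrete time differences of the monthly SSTA fields. A minimal sketch with NumPy (the function name and output alignment are our illustration):

```python
import numpy as np

def tendency_features(ssta):
    """First- and second-order time differences of monthly SSTA fields.
    ssta: array of shape (T, H, W); returns arrays of shape
    (T-1, H, W) and (T-2, H, W), respectively."""
    d1 = np.diff(ssta, n=1, axis=0)  # month-to-month rate of change
    d2 = np.diff(ssta, n=2, axis=0)  # change of that rate (acceleration)
    return d1, d2
```

These difference fields would be stacked alongside the raw SSTA maps as extra input channels, giving the network explicit rate-of-change information.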
Our method provides novel ideas for deep learning techniques in ENSO forecasting. Although the experimental results have demonstrated its effectiveness, the basic deep structure used in this paper is relatively simple, chosen with time efficiency in mind. In the near future, we plan to apply the developed tendency and attention modules to more sophisticated deep learning models, such as the Transformer, to obtain more satisfactory results.
ENSO is a climate mode that emerges in the tropical Pacific, but climate variabilities in the Pacific, Indian and Atlantic Oceans are tightly connected, and even the tropical and non-tropical parts of the Pacific are highly connected. This is known as the 'teleconnection' among different oceans and different ocean regions (Kug et al 2020). Climate variability outside the tropical Pacific Ocean may influence ENSO (Kug et al 2020). In the Indian Ocean, the Indian Ocean Dipole (IOD) also exhibits interannual climate fluctuations; the positive and negative phases of the IOD are effective predictors of El Niño and La Niña events (Izumo et al 2010). SST anomalies associated with the Atlantic zonal mode affect the Walker Circulation, driving westward wind anomalies over the equatorial Pacific during boreal summer. These wind stress anomalies can enhance the SST gradient across the Pacific Ocean, which then causes SST anomalies in the Pacific (Ding et al 2012). Thus, using SST from the Indian and Atlantic Oceans could also contribute to ENSO forecasts.

Fig. 2 The architecture of TA-DRN for ENSO forecasts. This structure comprises three parts: the Tendency-Informed Representation Module at the bottom level, the Attention-Informed Residual Network at the middle level, and the fully connected layer at the top level

In the spatial attention module, the input features are pooled along the channel axis and concatenated to generate a feature descriptor; pooling along the channel axis effectively highlights the most significant information. The feature descriptor is then passed through a convolution layer to generate the final spatial attention map M. Details are as follows. Two pooling operations are used to aggregate channel information: F_avg ∈ R^{1×H×W} denotes the average-pooled features and F_max ∈ R^{1×H×W} denotes the max-pooled features. They are concatenated and passed through a convolutional layer and a sigmoid activation function to generate the spatial attention map, i.e., M = σ(f([F_avg; F_max])), where f denotes the convolution layer and σ the sigmoid function.

Fig. 4 The architecture of the spatial attention module, where the new feature map F′ is obtained by element-wise multiplication of the original feature map F and the calculated spatial attention map M based on the pooling and convolution process
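The pooling-concatenation-convolution-sigmoid procedure described above, together with the element-wise multiplication F′ = F ⊙ M, can be sketched as a PyTorch module (the kernel size and class name are our assumptions; the paper's exact hyperparameters may differ):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise average and max pooling,
    concatenation, a convolution and a sigmoid produce the map M; the
    refined features are F' = F * M (element-wise)."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, f):
        f_avg = f.mean(dim=1, keepdim=True)        # F_avg in R^{1×H×W}
        f_max = f.max(dim=1, keepdim=True).values  # F_max in R^{1×H×W}
        m = torch.sigmoid(self.conv(torch.cat([f_avg, f_max], dim=1)))
        return f * m                               # F' = F ⊙ M
```

Because M lies in (0, 1), the module can only attenuate, never amplify, each grid point, which is what lets it act as a soft mask over the less relevant ocean regions.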

Fig. 5 Fig. 6
Fig. 5 The all-season correlation skill (PCC) and the RMSE for different parameters; the batch size is 128. The filter sizes of the first column in the figure are 4 × 8 and 2 × 4; the filter sizes of the second column are 5 × 5 and 3 × 3; the filter sizes of the third column in the figure

Fig. 7 Fig. 8
Fig. 7 The all-season correlation skill (PCC) and the RMSE for different parameters; the batch size is 512. The filter sizes of the first column in the figure are 4 × 8 and 2 × 4; the filter sizes of the second column are 5 × 5 and 3 × 3; the filter sizes of the third column in the

Fig. 9 The all-season correlation skill (PCC) and the RMSE of our method with different modules: a PCC, b RMSE of the four structures. The x-axis denotes the forecast lead month; a correlation skill above 0.5 is considered an effective forecast

Fig. 11 The historical forecast results of our method on the time-series Niño3.4 index in the DJF season over 1982-2017 for different lead-month cases: a 1-month-lead forecasts; b 6-month-lead forecasts; c 12-month-lead forecasts; d 18-month-lead forecasts

Fig. 13 Forecast of the 1999/2000 La Niña event: a the forecast results of the Niño3.4 time series with inputs from March 1997 to February 1998; b visual explanations for the forecast of the 1999 DJF Niño3.4 index

Table 1
The datasets for training, validation and testing

Table 2
Details of the CMIP5 models utilized in training