Multi-step forecasting of the Philippine storm frequencies using Poisson neural network

The paper aims to forecast Philippine storm frequencies using a nonlinear Poisson autoregressive model with exogenous variables. The nonlinearity is defined by its kernel, which is an artificial neural network (ANN) with one hidden layer and two output neurons, trained to simultaneously forecast two semesters ahead for a given input. Furthermore, the covariates studied were the Average Sea Surface Temperatures in the NINO3.4 region (5°N–5°S, 170°–120°W) and the Average Sea Surface Temperatures in the eastern pole (0°–10°S, 90°–110°E) of the Dipole Mode Index. The data, taken from the Japan Meteorological Agency's Regional Specialized Meteorological Center with time points running from 1951 to October 2021, is modeled at a semester-level granularity. The estimation is done using maximum likelihood estimation by minimizing the negative log-likelihood function. Furthermore, Bayesian hyper-parameter optimization was used to tune the model across different activation functions, numbers of hidden neurons, training optimizers, and train-validation splits. Lastly, the best model from the hyper-parameter optimization was compared to the univariate Poisson autoregressive model (and its Bayesian counterpart) and the Negative Binomial Autoregressive model. The proposed model captures the characteristics of the data well, both in terms of the point forecasts and the associated uncertainties.


Introduction
From 2001 to 2016, tropical storms cost the Philippines an estimated 12.5 million tons in lost rice production, with 260,000 tons of this loss believed to have come from the 2013 typhoon Haiyan alone (Blanc and Strobl 2016). The said typhoon is also responsible for the deaths of 6,300 individuals (Lagmay et al. 2015), and for forcing surviving coconut farmers to explore other sources of income, including raising livestock and growing other crops (Seriño et al. 2021). Apart from the strong winds, Haiyan (known locally as Yolanda) also brought with it huge storm surges, drowning locals in the coastal area of Leyte province (Yi et al. 2015). It is therefore of interest for the Philippines to further study the behavior of these natural events, to help policy makers take early precautions and hopefully reduce further losses.

Al-Ahmadgaid B. Asaad, alahmadgaid@gmail.com, School of Statistics, University of the Philippines Diliman, T.M. Kalaw Street, Diliman, Quezon City, 1101, Metro Manila, Philippines.
Among the interests related to these disasters is the annual number of tropical cyclones (TCs) entering the Philippine Area of Responsibility (PAR). From the literature, the task of forecasting the TC counts for the PAR was first studied by Magee et al. (2021), whose paper also modeled the areas of responsibility of other Southeast Asian countries. The said paper started with 16 covariates, which were then filtered using an automated variable selection algorithm for each location involved. The model used by Magee et al. (2021) was a Poisson regression model for all locations studied. Furthermore, among the covariates frequently selected by their automated variable selection algorithm, the Average Sea Surface Temperatures in the NINO3.4 region (5°N–5°S, 170°–120°W) and the Average Sea Surface Temperatures in the eastern pole (0°–10°S, 90°–110°E) of the Dipole Mode Index (DMI) are both used as predictors in this paper. Outside the work of Magee et al. (2021), the rest of the literature has focused on major ocean basins, for example, the Western North Pacific (WNP) and the Indian Ocean regions.
Moreover, most TC count modeling is done at an annual-level granularity. This is true for Magee et al. (2021) and other papers which focused on major ocean regions, for example Fan (2007) and Fan and Wang (2009) for the WNP, and McDonnell and Holbrook (2004b) for the Australian-Southwest Pacific region. For this study, however, we model the TC frequencies at a semester-level granularity. This is because we want to include autoregressive covariates in the proposed model, and a semester-level granularity gives the historical data an annual periodicity, so that future frequencies can be described by this periodicity expressed as the autoregressive term of the model input. The periodicity follows from the fact that the first semester is a non-TC season for the PAR, contrary to the second semester. This is evident from the results of Asaad (2021). Therefore, the first semester of the year will always be lower than the second semester in terms of TC frequencies, and this pattern recurs annually.
In addition, most TC count models are trained for one-step-ahead forecasting. That is, given an input, we forecast 1 year ahead for an annual-level dataset. For multi-step forecasting, we roll the forecast; that is, the previous forecast is used as input for the next, and the independent covariates involved need to be forecast ahead as well. The problem with this approach is that it can quickly accumulate errors, since the error of the preceding forecast is propagated to future forecasts. This is true of the model proposed by Magee et al. (2021), and we address it by simultaneously forecasting two steps ahead, i.e., two semesters, for every given input. Furthermore, instead of using linear models, which dominate the literature of TC count modeling, we explore a nonlinear model. In particular, we propose an artificial neural network (ANN) model with one hidden layer and two output neurons (for the two-semesters-ahead forecasting).
As for the literature on TC count forecasting for major ocean basins, most of the models used are linear. Popular choices are multiple linear regression models, as in the works of Fan (2007), Choi et al. (2010), and Fan and Wang (2009); others have used Poisson generalized linear regression models, for example, in the studies of McDonnell and Holbrook (2004a, 2004b), Chand et al. (2010), Werner and Holbrook (2011), Zhang et al. (2018), and Magee et al. (2021). As for nonlinear models, however, only Nath et al. (2015) have applied them to TC count forecasting, to the best of the author's knowledge. The model used by Nath et al. (2015) is based on an ANN. It should be noted that there are other nonlinear Poisson regression models apart from ANN-based kernels, for example, the exponential Poisson autoregressive models studied by Fokianos et al. (2009). ANN models have gained success in the field of Machine and Deep Learning in many applications, including time series forecasting. In fact, the popular extension of the ANN for sequential modeling, the Long Short-Term Memory (LSTM) network by Hochreiter and Schmidhuber (1997), has performed better on complex time series forecasting than statistical models (see, for example, Elsaraiti and Merabet (2021)). The striking feature of ANN-based models is their complexity, which dwarfs the simplicity of statistical models. This complexity, however, is also the reason why ANN-based models are treated as "black boxes," mainly because of their lack of interpretability. This limitation has motivated some recent innovations; for example, the Neural Basis Expansion Analysis Time Series (NBEATS) model by Oreshkin et al. (2020) aims to provide an interpretable yet powerful univariate Deep Learning model, and the Temporal Fusion Transformer (TFT) model by Lim et al. (2021) is also interpretable.
These Deep Learning models are suitable for complex time series datasets: those high in granularity (e.g., every-3-s or every-5-min datasets) with low autocorrelations, meaning past values do not have much influence on current values, and those requiring long-horizon forecasting, for example, 120 points for 1-day-ahead forecasting of a 5-min dataset. As such, these models can easily overfit the 142 data points considered in this paper. Hence, for this study, we will use the simplest ANN model, that is, one with a single hidden layer. More details are given in Section 2.
Furthermore, due to the small number of typhoons ever recorded in history, at least in the Best Track data of the Japan Meteorological Agency's Regional Specialized Meteorological Center, Deep Learning-based methods have not been used for forecasting typhoon frequencies, at least to the best of the author's knowledge. This is because Deep Learning models typically have a significant number of parameters or weights to train, making them an overkill for small datasets. As such, Deep Learning models have instead been applied to studying typhoons whose input data are usually augmented by a satellite image of the cyclone itself. This is specifically true for predicting typhoon tracks, as in the work of Rüttgers et al. (2019), where a Generative Adversarial Network (GAN) was used. Apart from Deep Learning models, tree-based models have not been used for forecasting typhoon frequencies either, at least to the best of the author's knowledge. The said models were rather used for understanding the effect of typhoons on critical structures such as long-span bridges, as in the work of Zhang et al. (2021), which uses Quantile Random Forest (QRF). QRF outputs prediction intervals useful for uncertainty quantification. Alternatively, Bayesian dynamic linear regression models have also been used for studying the effect of typhoons on critical structures in Structural Health Monitoring (SHM). The said models are also capable of imputing missing data, as in the works of Wang et al. (2022) and Zhang et al. (2022).
Going back to the proposed ANN model, the output layer is designed to have two output neurons, each assumed to be governed by an independent Poisson distribution, and thus each has a conditional mean conditioned on the nonlinear combinations of the autoregressive inputs and the covariates. The model is therefore named the Poisson Neural Network (PoisNN). The assumed Poisson distribution provides the observations' uncertainties, just like QRF, which we refer to in this paper as the aleatoric uncertainties, a term adopted from Kendall and Gal (2017). An alternative to the Poisson would be the negative binomial distribution, as in the works of Zhu (2011) and Christou and Fokianos (2014). The work of Christou and Fokianos (2014) is the one used in this paper for performance comparison against PoisNN. Lastly, the model parameters are inferred using Maximum Likelihood Estimation (MLE), but instead of maximizing, we minimize the negative log-likelihood function.
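To make the estimation target concrete, the negative log-likelihood of two observed counts under independent Poisson rates (one rate per output neuron) can be sketched as below; the function name and example values are hypothetical, not from the paper:

```python
import numpy as np
from math import lgamma

def poisson_nll(z, lam):
    """Negative log-likelihood of counts z under independent Poisson
    distributions with rates lam; lgamma computes log(z!) stably."""
    z = np.asarray(z, dtype=float)
    lam = np.asarray(lam, dtype=float)
    log_fact = np.array([lgamma(v + 1.0) for v in z])
    return float(np.sum(lam - z * np.log(lam) + log_fact))

# Two-semester example: observed counts and forecast rates
nll = poisson_nll([3, 12], [2.5, 10.0])
```

Minimizing this quantity over the network weights that produce the rates is exactly the MLE-by-negative-log-likelihood route described above.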
PoisNN models were first studied by Fallah et al. (2009), to the best of the author's knowledge. The said paper found that PoisNN models can largely improve prediction in nonlinear situations. Another application of PoisNN is the work of Montesinos-López et al. (2020) on prediction of genomic count data. Furthermore, Huang et al. (2019) also used PoisNN, but for forecasting the number of emergency calls. Lastly, Rodrigo and Tsokos (2020) proposed a Bayesian PoisNN and found that it performs well relative to traditional Poisson or negative binomial regression models based on their simulation and real data studies.
The work of Rodrigo and Tsokos (2020) explored two different activation functions for the hidden layer of their model: the sigmoid and the hyperbolic tangent, which we also consider in this paper. These functions were also studied by Nath et al. (2015), who found them to be the optimal activations for their ANN model with one hidden layer.
The proposed model is benchmarked against univariate count time series models, as mentioned earlier. These models are the log-linear Poisson autoregressive (PoisAR) model by Fokianos and Tjøstheim (2011) and its Bayesian counterpart by Asaad (2022), and the Negative Binomial Autoregressive (NegBinomAR) model by Christou and Fokianos (2014). Finally, hyper-parameter tuning is done using Bayesian optimization.
In summary, this paper has two main contributions to the literature of TC activities: (i) the use of Poisson Neural Network model and its comparison to other count time series models, and (ii) the use of multi-step forecasting.

Data and methods
The data is publicly available from the Japan Meteorological Agency's (JMA's) Regional Specialized Meteorological Center (RSMC). The author used JMA data so that the findings in this study can build on and relate to the initial analyses of Asaad (2021, 2022) on Philippine storm tracks, in which JMA's best tracks were used. The available time points downloaded from JMA at the time of writing run from 1951 to October of 2021. The raw data contains both the metadata and the data itself, and hence data processing was done to separate the two. The metadata contains, among others, the typhoon ID and typhoon name recorded by JMA, which in total equates to 1852 unique TC international IDs (IDs).
To count the frequency of TCs for the PAR, the 1852 TC IDs were further filtered to only those that have entered the PAR. These include tracks that made a recursive path, meaning they entered the PAR briefly and then left for the north. In total, 1250 TC IDs were tallied under the PAR. It should be noted that the Philippines has its own local weather agency (PAGASA, short for Philippine Atmospheric, Geophysical and Astronomical Services Administration) tracking the TCs, and there can be minor differences in TC counts relative to JMA. Nonetheless, the difference should be marginal. Figure 1a shows the 1852 TCs in the WNP, and Fig. 1b shows the 1250 TCs in the PAR (tails and heads of the tracks beyond the PAR were cut out for emphasis). These data points were then grouped per semester annually, so that Fig. 2a depicts the time series of the historical frequencies of the TCs in the PAR. Along with this are the annual frequencies and the decadal averages in Fig. 2b for comparison. From Fig. 1a, one would notice the spike at 1964; this is the year with the largest number of tropical cyclones ever recorded in the WNP region by JMA (see also Cassidy (1964)). Furthermore, since the spike is not due to an encoding error but actual data, as confirmed in the report of Cassidy (1964), it was kept as is in the filtered PAR-only tracks. The goal here is to let the models learn this characteristic should there be a potential spike in the future.
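The per-semester tallying described above can be sketched as a simple grouping of TC entry dates; the `semester_counts` helper and the dates are hypothetical, assuming semester 1 covers January-June and semester 2 covers July-December:

```python
from collections import Counter
from datetime import date

def semester_counts(entry_dates):
    """Tally TC frequencies per (year, semester) pair."""
    return Counter((d.year, 1 if d.month <= 6 else 2) for d in entry_dates)

# Illustrative PAR entry dates (not actual JMA records)
counts = semester_counts([date(1964, 5, 2), date(1964, 8, 9),
                          date(1964, 11, 21), date(1965, 7, 3)])
```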
The historical data for the NINO3.4 average SST is displayed in Fig. 3a, and Fig. 3b shows the historical data of the IODE average SST. It should be noted that Asaad (2021) defines the Philippine storms as those that went straight to the Philippines, and classified those that made a recursive track, including those that briefly entered the PAR, as belonging to a different cluster. While the definitions differ, the one proposed in this paper is already a good proxy for the number of storms that passed straight through the country. In this study, the data is partitioned into training and testing, with training accounting for 80% of the data, that is, up until the first semester of 2007, and the rest accounting for the test data, that is, up until the first semester of 2021. We exclude the second semester of 2021, since its data is not complete. The model parameters are estimated using the training data, which is then further split into training and validation. Three different splitting percentages are explored through Bayesian hyper-parameter optimization; these split percentages are 20%, 30%, and 40% validation. The final model is then evaluated on the test data. The software used for this study is the Python (Van Rossum and Drake 2009) programming language, using TensorFlow Probability (Dillon et al. 2017) as the core library for ANN modeling via MLE. The Julia (Bezanson et al. 2017) programming language was used to produce the maps in Fig. 1. Lastly, the R (Ihaka and Gentleman 1996) programming language was used for the univariate count time series models (PoisAR and NegBinomAR) via the tscount (Liboschik et al. 2017) library.
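The chronological 80/20 partition can be expressed as a plain index cut (no shuffling, since the observations are ordered in time); the helper name is hypothetical:

```python
def chronological_split(series, train_frac=0.8):
    """Split an ordered series into a leading train part and a
    trailing test part, preserving time order."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

# 142 semester-level points, as considered in the paper
train, test = chronological_split(list(range(142)))
```

The training part would then be re-split into train/validation at the 60/40, 70/30, or 80/20 proportions explored during tuning.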
The remainder of the paper is organized as follows: Section 2.1 introduces the PoisNN model; Section 2.2 discusses the model estimation using MLE; Section 2.3 briefly presents the theory of Bayesian optimization; Section 2.4 details the model inference for PoisNN; Section 2.5 defines PoisAR and NegBinomAR; Section 3 presents the results and discussions; and the final section concludes and outlines future works.

Poisson artificial neural network
This section provides details on the proposed model defined in Definition 1.

Definition 1 (Poisson ANN) Let the model be as given in Eq. 1, where η := [m, n, p], and g is the activation for the hidden layer.
Remark 1 We assume that the joint distribution of Z_{t+1} and Z_{t+2} given F_t^{X,Y,Z} is equal to the product of its marginal distributions, i.e., they are independently distributed, so that, given F_t^{X,Y,Z}, Pr(Z_{t+1}, Z_{t+2}) = Pr(Z_{t+1}) Pr(Z_{t+2}) = ∏_{k=1}^{2} Poisson(z_{t+k} | λ_{t+k}), where z_{t+k} is the random value of Z_{t+k}.

Remark 3 The model in Eq. 1 encodes the input data into a logarithmic scale and decodes the output using the exponential activation function. These encoding and decoding functions are inspired by the log-linear Poisson autoregressive model proposed by Fokianos and Tjøstheim (2011).
The network architecture of the proposed model is given in Fig. 4 (network architecture of the proposed Poisson ANN model for the two-semester horizon). The dashed outline in each neuron indicates that the node is activated by its activation function. Therefore, the input neurons with this indicator are activated by the log function specified in Eqs. 3-5. Also, the output neurons are fixed to exponential activation functions, as specified in Eq. 1.
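A minimal sketch of the forward pass implied by this architecture is given below: log-encoded inputs, one hidden layer activated by g, and two exponential output neurons producing the Poisson rates. All names, shapes, and the log1p encoding are illustrative assumptions, not the paper's exact notation:

```python
import numpy as np

def poisnn_forward(x, W1, b1, W2, b2, g=np.tanh):
    """One-hidden-layer PoisNN sketch: log-encode the inputs, apply the
    hidden activation g, then decode with the exponential activation to
    obtain the two rates (lambda_{t+1}, lambda_{t+2})."""
    h = g(np.log1p(x) @ W1 + b1)   # logarithmic input encoding
    return np.exp(h @ W2 + b2)     # exponential output activation

rng = np.random.default_rng(0)
x = np.array([4.0, 11.0, 27.1, 28.3])     # lagged counts and SST covariates
W1 = rng.normal(scale=0.1, size=(4, 14))  # 14 hidden neurons
b1 = np.zeros(14)
W2 = rng.normal(scale=0.1, size=(14, 2))  # two output neurons
b2 = np.zeros(2)
lam = poisnn_forward(x, W1, b1, W2, b2)   # two positive Poisson rates
```

The exponential decoding guarantees positive rates regardless of the hidden-layer output, which is why no further constraint on the weights is needed.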

Model estimation
The weights of the model are estimated using MLE, and this is done as follows. Let the collections of inputs and outputs be defined per observation, where T is the total number of observations, so that the joint of each collection in matrix form is represented as follows: X := [x_1, ..., x_R], Y := [y_1, ..., y_R], W := [w_1, ..., w_R], and Z := [z_1, ..., z_R]. Furthermore, let V be the random variable of observing the data z_i. The log-likelihood is then given by Eq. 15, where λ_{t+k} is a function of w_i, x_i, y_i and P as defined in Eq. 1. Substituting (1) into (15) gives the complete form of the log-likelihood, which then becomes the objective function. The goal therefore is to maximize this objective function, i.e., P̂ := argmax_P ℓ(P | Z, W, X, Y).
However, maximizing (17) is equivalent to minimizing its negative, i.e., the negative log-likelihood. Minimizing this objective function is done by taking the partial derivative with respect to each of the parameters or weights of the model. The resulting derivatives are then set to 0 to locate the critical values, and solving for the corresponding weight or parameter gives the solution.
The second derivative then confirms whether the solution is a minimum. However, most objective functions of nonlinear models like ANNs do not have closed-form solutions, so exact solutions through differentiation are not possible. In practice, we approximate the solution numerically, for example, using Stochastic Gradient Descent (SGD) (Robbins and Monro 1951), RMSProp (Hinton et al. 2012), or the ADAM (Kingma and Ba 2015) algorithm. For this paper, we used RMSProp. Lastly, minimization is often the preferred formulation since most software implementing these numerical approximations was developed for minimizing objective functions; this is true of the algorithms implemented in TensorFlow (Abadi et al. 2015), which is used by TensorFlow Probability (Dillon et al. 2017).
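As a toy illustration of this numerical route (not the paper's full model), a hand-rolled RMSProp update can minimize the Poisson negative log-likelihood of a simple log-linear model with rate exp(Xw); all names and settings here are assumptions for the sketch:

```python
import numpy as np

def rmsprop_poisson(X, z, lr=0.05, rho=0.9, eps=1e-8, steps=500):
    """Minimize the Poisson negative log-likelihood sum(lam - z*log(lam))
    for lam = exp(X @ w) with RMSProp: scale each gradient step by a
    running root-mean-square of past gradients."""
    w = np.zeros(X.shape[1])
    s = np.zeros_like(w)                  # running mean of squared grads
    for _ in range(steps):
        lam = np.exp(X @ w)
        grad = X.T @ (lam - z)            # gradient of the negative log-lik
        s = rho * s + (1 - rho) * grad ** 2
        w -= lr * grad / (np.sqrt(s) + eps)
    return w

# Intercept-only check: the MLE is w = log(mean(z))
rng = np.random.default_rng(1)
X = np.ones((400, 1))
z = rng.poisson(np.exp(1.2), size=400).astype(float)
w_hat = rmsprop_poisson(X, z)
```

In the intercept-only case the estimate should settle near log of the sample mean, which is the closed-form MLE, giving a quick sanity check on the optimizer.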

Bayesian optimization
The PoisNN model has hyper-parameters that need tuning; these hyper-parameters are the number of neurons in the hidden layer, the activation function of these neurons, the proportion of the training data, and the model training optimizer. The idea behind Bayesian optimization is that it approximates the objective function with a surrogate function, and then uses this surrogate to find the global maximum or minimum of the objective function. Furthermore, it does this by fitting the surrogate to the objective function in as few input points or iterations as possible. As such, an acquisition function is needed to smartly choose the next input points used for fitting the surrogate. Using these components, Bayes' theorem is used in fitting the surrogate.

Objective function
The objective function here is the global cost function: it takes a collection of hyper-parameter inputs and uses them to train the model (in this case, PoisNN); the corresponding residual of the estimated model then becomes the score of the objective function.

Surrogate function
The surrogate function aims to approximate the objective function. In Bayesian optimization, the surrogate function is defined by the Gaussian process (GP).
Definition 2 (Gaussian Process) Let Y_1, Y_2, ..., Y_T be a sequence of random variables such that Y_t ∈ 𝒴; then the sequence is a Gaussian process (GP) if and only if Y := [Y_1, ..., Y_T] is jointly distributed as N(μ, Σ).
Definition 3 (Global Cost Function) Let G be the domain of the activation functions, L be the domain of the numbers of hidden neurons, B be the domain of the proportions of the train data, and A be the domain of training optimizers; then P := G × L × B × A is the domain of the hyper-parameters, such that any global cost function γ is a mapping γ : P → R defined for all v ∈ P.

Remark 4
The global cost function for this paper is computed from the error of the PoisNN model after training with a given hyper-parameter input v ∈ P.

Proposition 1 Let P be the domain of the hyperparameters and suppose
Proof The proof follows from the proof of Theorem 1.2.9 of Muirhead (2005).
Remark 6 From Proposition 1 and Definition 2, δ is a GP.

Proposition 2 From Proposition 1, suppose δ(v_i), ∀i ∈ N_{≤n+p}; then the joint distribution of the γ(v_i)s, i.e., Pr(u), follows from the proof of Theorem 1.2.9 of Muirhead (2005).
Corollary 1 From Proposition 2, let m and m* be zero vectors; then the conditional distribution of the new global costs given the previous ones is Gaussian. Proof The proof follows by reversing the conditioning from X_1 | X_2 to X_2 | X_1 in the proof of Theorem 1.2.11 of Muirhead (2005).
Definition 4 (Lower Confidence Bound) Let m(v) be the mean of the GP and s(v) its standard deviation; then the Lower Confidence Bound (LCB), notated ξ, is given by ξ(v) := m(v) − ζ s(v), where ζ is the balancing factor.
The global costs computed from the global cost function defined in Definition 3 are approximated by the GP through Proposition 1. Thus, for any new observed global cost computed from a new hyper-parameter input v, chosen by the acquisition function defined in Definition 4, the joint distribution with the previous global costs is given in Proposition 2. The conditional distribution of the new global cost conditioned on the previous global costs then follows from Corollary 1. The process of acquiring a new hyper-parameter candidate and computing the global cost is done iteratively. The algorithm runs until there is no significant change between the new hyper-parameter candidate and the immediately preceding one, or alternatively, until a specified maximum number of iterations is reached.
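The loop just described (GP surrogate, LCB acquisition, iterate) can be sketched on a one-dimensional toy cost; the kernel, its length scale, the standardization of the observed costs, and the deterministic initial design are all illustrative assumptions rather than the paper's tuning setup:

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential covariance for the GP surrogate."""
    d = np.asarray(a).reshape(-1, 1) - np.asarray(b).reshape(1, -1)
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(xs, ys, grid, noise=1e-4):
    """Zero-mean GP posterior mean and std over candidate points."""
    K = rbf(xs, xs) + noise * np.eye(len(xs))
    Ks = rbf(grid, xs)
    Kinv = np.linalg.inv(K)
    mean = Ks @ Kinv @ ys
    var = np.diag(rbf(grid, grid) - Ks @ Kinv @ Ks.T)
    return mean, np.sqrt(np.maximum(var, 0.0))

def bayes_opt(cost, grid, init, iters=8, zeta=2.0):
    """Fit the surrogate to the observed global costs, then pick the
    next candidate by the Lower Confidence Bound m(v) - zeta*s(v)."""
    xs = np.array(init, dtype=float)
    ys = np.array([cost(v) for v in xs])
    for _ in range(iters):
        ys_std = (ys - ys.mean()) / (ys.std() + 1e-12)
        m, s = gp_posterior(xs, ys_std, grid)
        v = grid[np.argmin(m - zeta * s)]   # LCB acquisition
        xs = np.append(xs, v)
        ys = np.append(ys, cost(v))
    return xs[np.argmin(ys)]               # best candidate seen

grid = np.linspace(-3.0, 3.0, 61)
best = bayes_opt(lambda v: (v - 1.0) ** 2, grid, init=[-2.0, 0.0, 2.0])
```

In the paper's setting, the grid would instead range over hyper-parameter configurations (activation, hidden neurons, optimizer, split) and the cost would train a PoisNN and return its error.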

Model inference
For every given input, the model returns two point estimates as forecasts for the two semesters ahead. These two estimates serve as the rate parameters, or means, of the Poisson distributions, which we can use to compute the corresponding quantiles of the prediction intervals. The intervals are 95% (lower bound at the 2.5th percentile and upper bound at the 97.5th percentile), 75% (12.5th lower and 87.5th upper percentiles), and 50% (25th lower and 75th upper percentiles). The percentile/quantile q of the forecast for the training or test data is denoted by ẑ^{(i)}_{t+k,q}, while for future TC counts the notation is ẑ_{t+k,q}, and it is computed as ẑ_{t+k,q} := F^{-1}(q | λ_{t+k}), the generalized inverse of the cumulative distribution function F(· | λ_{t+k}).
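The interval bounds amount to inverting the Poisson CDF at the chosen quantiles. A minimal pure-Python sketch of this inverse (equivalent in spirit to `scipy.stats.poisson.ppf`; the function name is hypothetical):

```python
import math

def poisson_ppf(q, lam):
    """Smallest count k whose cumulative Poisson probability reaches q,
    i.e. F^{-1}(q | lam); uses the recursion p(k) = p(k-1) * lam / k."""
    cdf, pmf, k = 0.0, math.exp(-lam), 0
    while True:
        cdf += pmf
        if cdf >= q:
            return k
        k += 1
        pmf *= lam / k

# Bounds at the 2.5th and 97.5th percentiles for a forecast rate of 10
lower, upper = poisson_ppf(0.025, 10.0), poisson_ppf(0.975, 10.0)
```

Because the Poisson is discrete, the bounds are integers and the achieved coverage is at least, rather than exactly, the nominal level.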

Univariate count time series models
As mentioned earlier, the best PoisNN model from the hyper-parameter tuning is compared to three univariate count time series models defined in this section.
Definition 6 (Bayesian Log-linear PoisAR model) From Definition 5, let b = 1 and p = 2; then we obtain the Bayesian model proposed by Asaad (2022), with a priori distributions set for the parameters. The details on the parameter estimation for PoisAR and NegBinomAR are given in Fokianos and Tjøstheim (2011) and Christou and Fokianos (2014), respectively. As for BayesPoisAR, a Markov Chain Monte Carlo (MCMC) approach was used by Asaad (2022).
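For orientation, the conditional-mean recursion of the log-linear Poisson autoregression of Fokianos and Tjøstheim (2011), with one feedback term and p count lags, can be sketched as below; parameter names and values are illustrative, and this is the recursion only, not the tscount estimation code:

```python
import math

def loglinear_poisar(y, d, a1, b):
    """Conditional means of a log-linear Poisson AR:
    nu_t = d + a1*nu_{t-1} + sum_k b[k]*log(1 + y_{t-1-k}),
    lambda_t = exp(nu_t), with nu initialized at zero."""
    p = len(b)
    nu = [0.0] * len(y)
    lam = [1.0] * len(y)                  # exp(0) before enough lags exist
    for t in range(p, len(y)):
        nu[t] = d + a1 * nu[t - 1] + sum(
            b[k] * math.log(1.0 + y[t - 1 - k]) for k in range(p))
        lam[t] = math.exp(nu[t])
    return lam

# Order p = 2, as in the comparison models of the paper
lam = loglinear_poisar([4, 11, 5, 13, 6], d=0.3, a1=0.2, b=[0.4, 0.3])
```

The log(1 + y) transform of the lagged counts is the same encoding that motivated the PoisNN input layer.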

Results and discussions
The results are presented in the following sections: Section 3.1 discusses the performance of the candidate models, and Section 3.2 presents the forecasts.

Candidate models' performance
Bayesian optimization was done separately for each training split: 60/40, 70/30, and 80/20. The results are given in Table 1, where the column Score corresponds to the negative log-likelihood of the model. From the said table, the best model is PoisNN0.6 with a score of 4.454, followed by PoisNN0.7 and then PoisNN0.8. Therefore, the best hyper-parameters for the PoisNN model are the following: 14 hidden neurons, each activated by a hyperbolic tangent, trained using the Nadam (Dozat 2016) algorithm. Figure 5 shows the training history of the PoisNN0.6 model.

Forecasts
The forecasts of the best model (i.e., PoisNN0.6) for the training and testing datasets are given in the corresponding figures. Moreover, it is common practice in the literature of TC count forecasting (for example, Zhang et al. (2018) and McDonnell and Holbrook (2004b), among others) to also check the Pearson correlation between the forecasts and the actual values. For the proposed model, the correlation is high, since the model takes advantage of the semester-level data, where periodic patterns recur every year.
Now comparing PoisNN to PoisAR, BayesPoisAR, and NegBinomAR, the statistics are given in Table 2. The errors from the non-ANN models (all at order 2, i.e., with input covariates up to lag 2) are higher compared to PoisNN. In addition, both PoisAR and NegBinomAR have low correlations as well, and this is due to the method of inference used for these models as implemented in the tscount (Liboschik et al. 2017) library, in which rolling forecasting is done over more than one horizon, as in the test dataset, and as such it accumulates errors. One could improve this, however, if both models (PoisAR and NegBinomAR) took advantage of the actual data in the test horizon as well; this is the approach used by Asaad (2022) for the BayesPoisAR model. In fact, from the results in Table 2, BayesPoisAR is the second best model. Despite this, however, BayesPoisAR, like PoisAR and NegBinomAR, cannot forecast two semesters ahead simultaneously; the said models will always need to roll when forecasting longer horizons, which is prone to accumulating errors, a problem noted above. In terms of the forecasts for the second semester of 2021 and the first semester of 2022, Table 3 presents the details. As mentioned in Section 2, the second semester of 2021 was not used since it is incomplete, accounting only for TC tracks up until October 2021. Having said that, the second semester of 2021 therefore needs to be forecast along with the first semester of 2022 to capture the proposed two-semesters-ahead forecasting. Using the forecast mean of the second semester of 2021, we can append it to the data to simultaneously forecast the next two semesters ahead. The results are provided in the said table with the corresponding uncertainties.

Conclusion and future works
The aim of the study was to explore the use of the PoisNN model for modeling Philippine storm frequencies. The results suggest that the two independent covariates, the average SSTs in the NINO3.4 and IODE regions, do contribute to characterizing the TC counts. Furthermore, hyper-parameter tuning through Bayesian optimization was considered in this study, and the best configuration of PoisNN requires 14 hidden neurons, all activated by the hyperbolic tangent function, with the training data split into 60% training and 40% validation. In terms of the numerical approximation algorithm for model training, the best algorithm is Nadam. Lastly, there is still room for improvement. The Poisson distribution used in the output neurons assumes that the mean and the variance of the population distribution are equal. This is often not the case in practice, and one can explore a negative binomial neural network instead.