Quantication and Forecasting of Cumulonimbus (Cb) Clouds Direction, Nebulosity and Occurrence With Autoregression Using 2018-2020 Dataset From Yaounde-nsimalen

This works reports the quantification and forecasting of Cumulonimbus (Cb) clouds direction, nebulosity 34 and occurrence with auto regression using 2018-2020 dataset from Yaoundé – Nsimalen of 35 Cameroon. Data collected for October 2018-2020 consisted of 2232 hourly observations. Codes were written to 36 automatically align, multi-find and replace data points in excel to facilitate treating big datasets. The approach 37 included quantification of direction generating time series from data and determination of model orders using the 38 correlogram. The coefficients of the SARIMA model were determined using Yule-Walker equations in matrix form, 39 the Augmented Dickey Fuller test (ADF ) was used to check for stationarity assumption, Portmanteau test to check 40 for white noise in residuals and Shapiro-Wilk test to check normality assumptions. After writing several algorithms 41 to test different models, an Autoregressive Neural Network (ANN) was fitted and used to predict the parameters 42 over window of 2 weeks. Autocorrelation Function (ACF) shows no correlation between residuals, with 0.05 p  , 43 fitting the stationarity assumption. Average performance is 80%. A stationary wavelike occurrence of the direction 44 has been observed, with East as the most frequented sector. Forecast of Cb parameters is important in effective air 45 traffic management, creating situational awareness and could serve as reference for future research. The method of 46 decomposition could be made applicable in future research to quantify/forecast cloud directions.

The solutions proposed here could be helpful in an effective air traffic management, increasing efficiency, reducing 83 uncertainty and economic disadvantages.

84
The method of quantification used here could be made applicable in combination with vector auto-regression 85 (recommended at the end of the work) to have better results in any future research on the subject.

86
The research findings in this work are based on the claim that air traffic management and improvement of air safety

98
The direction of the cloud had been recorded in a qualitative or categorical form, as in "N", or "N/NE" as shown in 99   Table 1 below. Thus, time series analysis on the direction of the cloud necessitated quantification, the conversion of 100 the qualitative variable into a quantitative variable (numerical expression).

103
Since the nebulosity had already been collected in a quantitative form, it necessitated no quantification. Therefore,

104
we will mostly deal in the following with the pre-processing of the qualitative variable, direction of cumulonimbus 105 clouds.
Figures 1-2 show the time evolution of the direction when binary representation has been used.
In the Figure 1, we notice that the values evolve as binaries or square signals between 0 and 1, discarding the need 108 for normalization. This is done for the sectors N, S, NE and E. In this figure 2, we notice that the direction of Cb 109 with respect to the other 4 principal components SE, SW, W, and NW are shown here.

110
The dataset corresponding to each variable (Nebulosity and direction) was partitioned into two: the training set and

121
The partition was such that: Thus, the length of the training set for each variable = 2233 x (80/100) = 1786.

126
For the validation set, we have: The length of the validation set per variable = 2233 -1786 = 447.

130
The partitioned dataset for each of the nine variables is shown in Figures 3, 4, 5 and 6. Figure

147
The approach taken in this work was to partition the dataset of each variable into 2 sets: the training set and the 148 validation set.

149
The dataset for the nebulosity variable consisted of 2233 data points, while for the direction variable we have 150 8×2233 data points (2232 data points for each category). The higher number of data points for the direction was 151 accounted for by the decomposition of direction into 8 independent variables.

152
An algorithm was first written and used to test several ARIMA and SARIMA models of different orders. The

153
following tests were done:

154
 The Augmented Dickey Fuller test (ADF) was done to check for the stationarity condition of each variable.

155
 The LJung Box (Box Pierce) test was used to check for the presence of white noise in the residuals.

156
 Shapiro-Wilk test for normality was made to check if the residuals (errors) had a normal distribution.

157
The optimal models were then selected based on the results of these statistical tests. As will be seen, other tests were 158 performed, and the work made use of several other criteria to choose an optimal model.

159
Parameters of the model for each variable were estimated using the autocorrelation and partial autocorrelation 160 function plots (correlogram) to determine the orders of the autoregressive process respectively.

161
The coefficients of the mathematical equations that model each process were estimated using the Yule Walker 162 equations. Finally, a Neural Network Auto-regression model was appropriated to the analysis of the direction and 163 nebulosity when ARIMA models didn't provide a better fit.

164
A roll-forward forecast was adopted with a forecast window of 2 weeks (336 hours). A simple comparison between 165 forecasted value and actual value was used to show the predictive ability of the model.
When t X is expressed as a function of t Z and some linear coefficient, i Where 0   and  is the mean.

181
The basic models encountered and widely used in time series analysis include AR(p) models, describing auto-182 regressive processes of order, p, obeying (1) and MA(q) models, describing moving average processes of order, q, 183 obeying (2). Based on these two models, we could have ARMA models, ARMA integrated models (ARIMA),

186
Differencing, simple exponential smoothing (SES), or Holt-winter's smoothing could be used to adjust or properly 187 fit our dataset.

193
For an MA(q) process, the order, q, of the moving average process is often given as the number of significant spikes 194 that show up in the ACF plot. This may be true for a weakly stationary time series. However, if the ACF plot is 195 sinusoidal, then the significant spikes indicate the order of the AR(p) process. The PACF would then do for the 196 MA(q) process, and vice versa.

197
From the AR process defined in (1), we can develop the Yule Walker Equations. By taking expectations on both 198 sides of equation (1): From (3), since the AR process is considered stationary with a constant mean:

202
Equation (3) becomes: Subtracting (4) on both sides of (1): Since t X is a dummy variable, we can consider (6) as equivalent to: By multiplying (7) by 1 t X  and taking the expectation on both sides, we have: For a lag k process, (8) is: The relation (12) is the Yule walker equation, which is used to estimate the coefficients of the model, 12 , , , p    K .

223
Note that 0 1   , since the autocorrelation function at lag 1 is always 1.

224
Where k = 1, 2, 3, 4, 5… p. Substituting the values of k: 226 Equation (13) in matrix form gives: Thus, the general form of the Yule Walker equation is r M R  where R is the column matrix of the coefficients,

239
The models that were then selected in the next phase were chosen based on a given set of criterion. The relation (17) 240 displays the Akaike Information Criteria (AIC) is defined as a function of the maximum likelihood estimator and the 241 orders of the MA (q) and AR (p) processes respectively: The Bayesian Information Criteria (BIC) is a similar estimator as the AIC; however, the BIC applies more       Figure 11displays the distribution of the forecast errors. The forecast errors have a turkosis or skewness inclined to 289 the left, which shows that the distribution of the forecast errors does not meet exactly with the normality assumption 290 for forecast errors. Thus, while this model was used and found as a better fit compared to antecedent arima models 291 that were tested, the Portmanteau test for residuals (Box Pierce test) shows that this normality is not met.

292
The high frequency at 0 shows a deviation from the normal distribution of residual errors. This could possibly mean 293 that the model is over-fitted to the dataset.

294
In this figure 11, Forecast errors are in blue while the red curve represents the actual normal distribution. Notice the 295 high frequency at 0, presenting a skewness or turquosis.

300
The ACF shows clearly that there is no autocorrelation of residuals. Thus, the residuals are completely generated 301 from a random process. The autocorrelation values are all found to be below the p=0.05 level. This could be 302 confirmed with the Shapiro-Wilk test for residuals that checks the fit of models against residuals as null hypothesis.

303
The bottom right plot shows the distribution of forecast errors, which has a normal distribution. The normal

309
Again, the previous neural model that was tested for our dataset could be modeled differently by applying log 310 transformation to the absolute value of residuals to get a normal distribution of forecast errors. Figure 13 shows the 311 absolute log-transformation, which shows an approximation to a normal distribution. Red lines show the normal 312 distribution. A log-transform have been applied to the data in Figure 11 (in blue) to get a distribution closer to the 313 normal than in the former.

332
In this figure 14, we observe the high fit during the training period, but a poor fit for the validation period. Clearly, it 333 was necessary to choose a better model that fits both validation period and training period.

334
To evaluate the predictive performance, each forecasted value has been compared against the actual value in the 335 validation period. Table 3 summarizes this process.   respectively. That is relatively as good as a study carried with 75% of accuracy.

355
The findings in this study could be crucial in strategic flight planning, improvement of situational awareness of 356 pilots via provision of forecast of Cb parameters before take-off and an improvement in safety related to the risks 357 resulting from Cb.

358
Data decompositions and direction quantification could be a useful approach in future research that incorporate 359 vector auto-regression methods to produce a more powerful forecast.

361
The authors thank the editor and anonymous reviewers for their careful review, important comments, and helpful 362 suggestions that will serve to improve this article.

364
The Authors declare that no conflict of interests exist in this paper.

365
Funding Statement

366
This study was funded by the authors.

373
Availability of data and material:

374
We have the materials, all the raw and processed data and result products. We can provide the processed data and 375 documents if it is required.

377
The tools used in this work included Excel software for pre-treatment of data, and R software for statistical analysis 378 and modeling.

380
Not applicable to this paper because there was no potential conflict of interest.

381
Consent to participate:

382
Not applicable to this paper 383 Consent for publication:

384
The submission of this paper has been approved by all relevant authors and is the independent work of the authors.

385
The latters have seen and agreed to the submitted version of the paper. It has not been submitted and not published 386 or accepted for publication and is not under consideration for publication in another journal or book.