Predicting South Africa’s Daily COVID-19 Cases using ARIMA Forecasting Model: 6 March to 6 July 2020

Background and Objective: The COVID-19 pandemic had caused approximately 11,421,822 laboratory-confirmed cases globally, with 196,750 confirmed cases in South Africa, by 6 July 2020. Coronavirus can be transmitted from one person to another even before any symptoms appear, thus posing a severe threat to society as a whole. This study aimed to develop an ARIMA model to predict daily COVID-19 cases in South Africa using data from online sources. Materials and Methods: The study used online data on daily reported COVID-19 cases in South Africa (SA) recorded from 6 March 2020 to 6 July 2020. Time series analysis was used to investigate the trend in the daily COVID-19 cases, leading to the Auto-Regressive Integrated Moving Average (ARIMA) model. Results: The time plot of the series suggested the need for differencing of the data up to the second order to achieve a stationary time series. The best candidate model was an ARIMA(7,2,0). Residuals for the selected model were uncorrelated and normally distributed with mean zero and constant variance, as expected of a good model. The fitted model predicted a continuous increase in the daily COVID-19 cases over the next 20 days, up to day 143, with slight falls at a few time points. Conclusion: The results showed that ARIMA models can be applied to COVID-19 patterns in South Africa. The model forecast a continuous increase in the daily COVID-19 cases in South Africa. These results are important for public health planning in order to combat the pandemic.


INTRODUCTION
The pandemic of the coronavirus disease (COVID-19) started in Wuhan, Hubei province, China, and the first cases were officially recorded on 31 December 2019 1. The number of COVID-19 cases accelerated at an alarming rate in China and, on 30 January 2020, the outbreak was declared a Public Health Emergency of International Concern. This was the highest level in the World Health Organisation's (WHO) emergency response to infectious diseases 2.
Coronaviruses, a large family of viruses, can cause illnesses that range from the common cold to much more severe illnesses such as SARS, Middle East respiratory syndrome, and COVID-19 3. Signs of COVID-19 may include fever, cough, shortness of breath, and general breathing difficulties, and severe cases can lead to organ failure and even death. Chinese health authorities stated that coronavirus is likely to be transmitted from one person to another even before any symptoms appear (spread during the incubation period), making prevention and control difficult. This poses a severe threat to society as a whole. By 6 July 2020, the pandemic had caused approximately 11,421,822 laboratory-confirmed cases globally, with 196,750 confirmed cases in South Africa 4.
Various mathematical and statistical models have been proposed to predict the spread of COVID-19. These models include SEIR models 5 and autoregressive moving average models 6,7.
The autoregressive integrated moving average (ARIMA) model can support timely analysis and prediction of changes in COVID-19 case counts, and provide dynamic information to relevant departments 8. The ARIMA model is good at forecasting linear time series 9.
ARIMA models have been used to predict future trends of incidence and prevalence in epidemiological data. Benvenuto et al. used an ARIMA model on the Johns Hopkins epidemiological data to predict the epidemiological trends 10. From their analysis, an ARIMA(1,0,4) and an ARIMA(1,0,3) were selected as the best models to determine the early prevalence and incidence of COVID-19, respectively.
In their research, Duan and Zhang introduced the ARIMA model to analyse daily new COVID-19 case data from Japan and South Korea 8. They selected ARIMA(6,1,7) and ARIMA(2,1,3) to predict 7 days in advance for Japan and South Korea, respectively.
The objective of this study was to identify an optimal ARIMA model that best describes the trend in daily COVID-19 cases in South Africa. The selected model was used to forecast 20 days ahead to assist the South African government and policymakers in devising ways to combat the pandemic.

Data collection:
The data used were obtained from the online geographic distribution of COVID-19 cases data set 11, which is updated daily. The daily COVID-19 cases for South Africa were extracted, and the data collected from 6 March 2020 to 6 July 2020 were used for building the models.

The Time Series Process
A time-series process is a set of random variables $\{X_t, t \in T\}$, where $T$ is the set of times at which the process was observed. The assumption is that each random variable $X_t$ is distributed according to some univariate distribution function $F_t$. It is also assumed that the time intervals are equidistant and that the random variables $X_t$ are real-valued, which allows the set of times to be enumerated as $T = \{1, 2, \ldots, n\}$ 12.

Non-Seasonal Autoregressive Integrated Moving Average (ARIMA) models
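The growth of daily COVID-19 cases in South Africa can be captured, like other series, by an integrated model such as the ARIMA 13. ARIMA models are aimed at describing series that exhibit a trend which can be removed by differencing. The differenced series can then be described by an ARMA($p$,$q$) model, so that an ARIMA($p$,$d$,$q$) process $X_t$ satisfies $\phi(B)(1-B)^d X_t = \theta(B)\varepsilon_t$, where $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$ and $\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$ are the autoregressive and moving-average polynomials, $\varepsilon_t$ is a white noise process, and $B$ is the backshift operator defined as $B X_t = X_{t-1}$.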
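To make the role of differencing concrete, the following is a minimal sketch of repeated differencing in plain Python. The function name `difference` and the quadratic toy series are assumptions made for illustration; they are not the study's code or data.

```python
def difference(series, d=1):
    """Apply the differencing operator (1 - B) to `series` d times."""
    out = list(series)
    for _ in range(d):
        # Each pass replaces x_t with x_t - x_{t-1} and shortens the series by one.
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


# Hypothetical counts following a quadratic trend (not the South African data).
cases = [t * t + 3 * t + 5 for t in range(10)]

first = difference(cases, d=1)   # still trending: linear in t
second = difference(cases, d=2)  # constant: the quadratic trend has been removed
```

In this toy case one difference leaves a linear trend and the second removes it, which is the situation an ARIMA model with d = 2 is designed for.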

Selection of the best ARIMA(p,d,q) model for the data
Goodness-of-fit for the fitted ARIMA(p,d,q) models is assessed using the likelihood ratio test (LRT), the Akaike Information Criterion (AIC) 15, the Mean Absolute Percentage Error (MAPE), and the Bayesian Information Criterion (BIC) 13. Residual autocorrelation is checked with the Box-Ljung statistic

$$Q = n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^{2}}{n-k},$$

where $n$ is the length of the time series, $\hat{\rho}_k$ are the sample autocorrelation coefficients at lag $k$, and $h$ is the lag up to which the test is performed. The test statistic asymptotically follows a $\chi^2$ distribution with $h$ degrees of freedom.
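As a rough illustration of these criteria, the sketch below computes the Box-Ljung statistic from a residual series, together with the common textbook forms AIC = -2 log L + 2k, BIC = -2 log L + k log n, and MAPE as a mean absolute percentage error. The exact variants used in the study are not shown in the text, so these forms and all function names are assumptions.

```python
import math


def autocorr(x, k):
    """Sample autocorrelation of x at lag k."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return num / denom


def ljung_box_q(residuals, h):
    """Box-Ljung statistic Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k)."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorr(residuals, k) ** 2 / (n - k) for k in range(1, h + 1)
    )


def aic(log_lik, k):
    """Akaike criterion for log-likelihood `log_lik` and k estimated parameters."""
    return -2.0 * log_lik + 2.0 * k


def bic(log_lik, k, n):
    """Bayesian criterion; the penalty grows with the sample size n."""
    return -2.0 * log_lik + k * math.log(n)


def mape(actual, predicted):
    """Mean absolute percentage error (actual values must be non-zero)."""
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)
```

Lower AIC and BIC values indicate a better trade-off between fit and complexity, while a large Q relative to the chi-squared reference distribution signals residual autocorrelation.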

RESULTS
This section presents a step-by-step procedure for arriving at the optimal ARIMA(p,d,q) model for the daily recorded COVID-19 cases in South Africa from 6 March 2020 (the day the first case was recorded) to 6 July 2020, giving 123 days since identification of the first COVID-19 case. To check for stationarity of the time series data, a time series plot of the observed daily COVID-19 cases is shown in Fig. 1.

Figure 1: Trajectory of daily COVID-19 cases in South Africa from 6 March to 6 July, 2020
The time series of new COVID-19 cases in South Africa shows an exponential increase in the number of cases. The augmented Dickey-Fuller test (Dickey-Fuller = 2.6456, lag order = 4, p-value = 0.99) confirms that the recorded data are not stationary. The important finding is the persistent increase in cases.
The homogeneity between successive terms in the time series is checked by plotting a lagged scatter plot of the observed daily COVID-19 cases for South Africa. The lagged scatter plot is shown in Fig. 2.
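The relationship that a lagged scatter plot displays can be summarised by the lag-1 Pearson correlation between each day's count and the previous day's count (the paper reports a coefficient of 0.95 for this relationship). The sketch below is one simple way to compute it; the function names and the toy series are illustrative assumptions, not the study's data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lag1_correlation(series):
    """Correlation between x_{t-1} and x_t, the pairs shown in a lagged scatter plot."""
    return pearson(series[:-1], series[1:])


# Hypothetical daily counts (not the South African data).
cases = [3, 5, 8, 13, 21, 34, 55, 89]
r = lag1_correlation(cases)
r_squared = r ** 2  # share of day-to-day variability explained by the previous day
```

Squaring the coefficient gives the proportion of variability explained, so a correlation of 0.95 corresponds to 0.95 x 0.95 = 0.9025, or about 90%.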

Diagnostic checking
The ARIMA model for predicting daily COVID-19 cases in South Africa can only be built once the series is stationary. Stationarity is achieved by taking the n-th order difference (d = n), testing for stationarity (the unit root problem) after each difference until the series becomes stationary. The test for stationarity is done using the Augmented Dickey-Fuller (ADF) test under the null hypothesis H0: the time series is non-stationary (presence of a unit root), against the alternative Ha: the time series is stationary.
The results from the first difference suggest an ARIMA(7,1,0). However, further investigation of the behaviour of the residuals shows that the residual PACF violates the white noise assumption. This was confirmed by the Box-Ljung test (X-squared = 22.679, degrees of freedom = 7, p-value = 0.001939, close to zero), which indicates the need for further differencing.
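The idea behind the unit root test can be sketched with the simplest, non-augmented Dickey-Fuller regression, which regresses the first difference on the lagged level with an intercept and examines the t-statistic of the slope. This is a simplification of the ADF test used in the study (it omits the lagged-difference terms), and the critical value of about -2.86 is the large-sample 5% value for a regression with an intercept; the function name and toy series are assumptions.

```python
import math

# Approximate large-sample 5% critical value (regression with intercept, no trend).
DF_CRITICAL_5PCT = -2.86


def dickey_fuller_t(series):
    """t-statistic of gamma in: (y_t - y_{t-1}) = alpha + gamma * y_{t-1} + e_t."""
    y_lag = series[:-1]
    dy = [series[t] - series[t - 1] for t in range(1, len(series))]
    n = len(dy)
    mean_x = sum(y_lag) / n
    mean_y = sum(dy) / n
    sxx = sum((x - mean_x) ** 2 for x in y_lag)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(y_lag, dy))
    gamma = sxy / sxx                      # OLS slope
    alpha = mean_y - gamma * mean_x        # OLS intercept
    rss = sum((v - alpha - gamma * x) ** 2 for x, v in zip(y_lag, dy))
    se_gamma = math.sqrt(rss / (n - 2) / sxx)
    return gamma / se_gamma


# Hypothetical trending series (not the South African data).
trending = [t * t + (t % 3) for t in range(30)]
t_stat = dickey_fuller_t(trending)

# H0 (unit root, non-stationary) is rejected only for sufficiently negative values.
reject_unit_root = t_stat < DF_CRITICAL_5PCT
```

For a strongly trending series the statistic is not sufficiently negative, so the unit root hypothesis is not rejected and the series would be differenced and tested again.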
Results from the second difference suggest an ARIMA(7,2,0) based on the BIC. The Box-Ljung test (X-squared = 7.2104, df = 7, p-value = 0.4073) shows little evidence of non-zero autocorrelation in the residuals. Thus, the ARIMA(7,2,0) is the best model for predicting and forecasting daily COVID-19 cases over the next 20 days. Table 1 shows the estimated parameters for the optimal model, ARIMA(7,2,0), for daily COVID-19 cases in South Africa. The selected model indicates the need to take into account the COVID-19 cases at 7 lags (a week's worth of information) from a given time point t. It also indicates that the original series is not stationary, so a second-order difference is needed. The z-values for the estimated parameters are all greater than 1.96 in absolute value; thus, all coefficients are significant at the 5% level.
Hence, the selected model was used to forecast 20 days in advance in the next subsection.
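The mechanics of producing point forecasts from an ARIMA(p,2,0) model can be sketched as follows: the autoregressive part forecasts the twice-differenced series, and the result is integrated twice to return to the scale of daily cases. This is a sketch under stated assumptions (no constant term, point forecasts only); the function name, the history, and the coefficients below are hypothetical and are not the estimates in Table 1.

```python
def forecast_arima_p20(history, phi, steps):
    """Point forecasts from an ARIMA(p,2,0) model without a constant term.

    history: observed series (needs at least len(phi) + 2 values)
    phi:     AR coefficients for the twice-differenced series, most recent lag first
    steps:   number of periods to forecast ahead
    """
    p = len(phi)
    if len(history) < p + 2:
        raise ValueError("history is too short for the requested AR order")
    d1 = [history[t] - history[t - 1] for t in range(1, len(history))]
    d2 = [d1[t] - d1[t - 1] for t in range(1, len(d1))]
    w = d2[-p:]                          # last p second differences
    level, slope = history[-1], d1[-1]   # last observed value and first difference
    forecasts = []
    for _ in range(steps):
        # AR(p) forecast of the next second difference (future noise set to zero).
        w_next = sum(phi[i] * w[-1 - i] for i in range(p))
        slope += w_next                  # undo the second difference
        level += slope                   # undo the first difference
        forecasts.append(level)
        w.append(w_next)
    return forecasts


# Hypothetical inputs for illustration only.
history = [120, 135, 160, 170, 210, 230, 260, 300, 310, 365, 400, 450]
phi_hypothetical = [-0.9, -0.8, -0.7, -0.6, -0.5, -0.3, -0.1]
next_20_days = forecast_arima_p20(history, phi_hypothetical, steps=20)
```

With all coefficients set to zero the recursion reduces to a straight-line extrapolation of the last observed slope, which is a useful sanity check on the integration steps.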

DISCUSSION
The study employed ARIMA models to forecast the daily COVID-19 cases in South Africa 20 days ahead. The data used were obtained from online daily COVID-19 case counts as updated by WHO. The choice of the ARIMA(p,d,q) model was influenced by the fact that ARIMA models can handle predictions when a time series shows either an increasing or a decreasing trend and autocorrelation between successive values 17,18. The time series plot of the COVID-19 cases in South Africa from 6 March to 6 July showed a persistent increase in the number of daily cases. The data showed a strong positive correlation between successive days (Pearson's correlation coefficient = 0.95). The square of the correlation coefficient showed that about 90% of the variability observed on any day is explained by the previous day's count. Thus, the previous day's number of cases is highly relevant to the current number of cases, which will in most cases be higher. Stationarity of the time series was achieved by differencing, followed by diagnostic checks on the candidate models.
Diagnostic checks were based on residual plots, AICs, and BICs of the candidate models. The Box-Ljung test was also performed on the candidate models. This led to the selection of the ARIMA(7,2,0).
After the second difference of the time series, the residual plot for the selected model indicated that the errors were constant in mean and variance over time, although some slight variations appeared towards the end of the series. The absolute values of the estimated autoregressive coefficients were all far below 7, which is a further confirmation that the series was now stationary.
The selected model reveals that a week's information on the number of cases is needed to forecast into the future. It also indicates that taking a second-order difference is sufficient for the time series to be stationary.
The successive residuals (forecast errors) for the selected model were statistically tested and found to be uncorrelated. The residuals appear to be approximately normally distributed with mean zero and constant variance. This led to the conclusion that the ARIMA(7,2,0) provides an adequate predictive model for the COVID-19 cases in South Africa over the selected period. This shows that a second-order difference of the time series was sufficient to make the series stationary. This agrees with results obtained by Tandon et al. 19 and Kufel 20, who also suggested second-order differencing in their time series models. A study by Chellai et al. showed that the optimal ARIMA model was the ARIMA(7,2,0) for the daily COVID-19 cases in South Africa up to 19 March 18. In that case, the trend in COVID-19 cases in South Africa was not well pronounced, since the study was done during the early stages of the pandemic.
The selected model in this study was used to forecast the daily COVID-19 cases for South Africa 20 days ahead. The ARIMA(7,2,0) predicted a continuous increase in the daily COVID-19 cases from day 123 to day 128, a fall on day 129, and an increase on subsequent days up to day 143. In general, the selected model predicted a continuous increase in daily COVID-19 cases from approximately 9,062 on 6 July 2020 to approximately 17,446 on 25 July 2020. Although studies carried out in India proposed an ARIMA(4,1,1) model, they also forecast an increasing trend in COVID-19 cases from 30 June to 19 July 21,22. Thus, the South African government needs to plan for an increase in hospitalisations and deaths from COVID-19 over this short-term period.
However, the data are likely to be prone to error because of economic challenges and the cost of testing. This might lead to underestimation of the daily COVID-19 cases.

Authors' contributions
Claris Shoko devised the initial idea and drafted the first manuscript. Delson Chikobvu finalised and proofread the article. Claris Shoko and Delson Chikobvu contributed to the analysis and interpretation of the data. Both authors participated in critical revision of the manuscript drafts and approved the final version.