Forecasting the Prevalence of COVID-19 in Maharashtra, Delhi, Kerala, and India using an ARIMA model


Aims: As the whole world was preparing to welcome the year 2020, a new deadly virus, COVID-19, was reported in the city of Wuhan, China, in late December 2019. By 18 May 2020, approximately 4.7 million cases and 0.32 million deaths had been reported globally. There is an urgent need to predict the COVID-19 prevalence to control the spread of the virus.
Methods: Time-series analyses can help understand the impact of the COVID-19 epidemic and take appropriate measures to curb the spread of the disease. In this study, an ARIMA model was developed to predict the trend of COVID-19 prevalence in the states of Maharashtra, Delhi, and Kerala, and in India as a whole.
Results: The prevalence of COVID-19 from 16 March 2020 to 17 May 2020 was collected from the website of Covid19india. Several ARIMA models were generated along with the performance measures. ARIMA (2,3,1), ARIMA (2,2,0), ARIMA (2,2,0), and ARIMA (1,3,1), with the lowest MAPE (5.430, 10.440, 2.607, and 2.390), were selected as the best fit models for Maharashtra, Delhi, Kerala, and India respectively. The findings show that over the next 20 days, the total number of confirmed COVID-19 cases may increase to 2.45 lakhs in India, 93,709 in Maharashtra, 19,847 in Delhi, and 925 in Kerala.
Conclusion: The results of this study can shed light on the intensity of the epidemic in the future and will help the government administrations in Maharashtra, Delhi, and Kerala to formulate effective measures and policy interventions to curb the virus in the coming days.


Introduction
COVID-19, a global pandemic, is an emerging disease that spreads from human to human and has infected millions and killed hundreds of thousands of people since the first fatal cases were reported in late 2019. The virus that causes COVID-19 belongs to the family of zoonotic coronaviruses, which includes the Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) and the Middle East Respiratory Syndrome coronavirus (MERS-CoV), that have their origin in bats, mice, and domestic animals. The virus first emerged in Wuhan, the capital city of China's Hubei province, in late December 2019. In just a few months, the virus spread rapidly across the world, reaching a total of approximately 4.7 million confirmed cases and 315,496 deaths as of 18 May 2020 (Hopkins 2020). The first case of COVID-19 in India was reported in Kerala on 30 January 2020, with origins in China (PIB 2020). By 17 May 2020, India had registered a total of 95,698 confirmed cases and 3,024 deaths (Covid19 2020).
As of today, the disease has spread all over the world. The number of confirmed COVID-19 cases varies across countries and regions because of differences in testing and disease surveillance capacities.
Since there is no validated treatment or preventive measure for this virus yet, effective planning and proper implementation of health infrastructure and services is the only way to control the spread of the virus.
For this reason, accurate forecasting of the total number of confirmed cases plays a vital role in managing the health system and allows decision-makers to develop strategic plans and interventions to avert a possible epidemic. Such estimates also help guide the intensity and types of interventions needed to lessen the outbreak (Zhang et al. 2020). To estimate the additional manpower and resources needed to control the outbreak, a mathematical and statistical modeling tool is required that can be used for short- and long-term disease forecasting.
In the last few years, studies have used different statistical methods such as multivariate linear regression (Thomson et al. 2006), the simulation-optimization approach (Nsoesie et al. 2013), the generalized growth model and generalized logistic model (Chowell et al. 2020), Holt's method (Myrzakerimova et al. 2020), and the grey model (Zhang et al. 2017) to forecast epidemic cases. These statistical models, however, are inadequate for analyzing the influence of randomness on the epidemic outbreak. Random factors play an important role in the spread of a disease, as Nakamura and Martinez (2019) have described in their study.
Autoregressive integrated moving average (ARIMA) models are among the most commonly used prediction models and are widely regarded as well suited to predicting epidemic diseases such as malaria (Anokye et al. 2018), tuberculosis (Zheng et al. 2015), measles (Sharmin and Rayhan 2011), and influenza (He and Tao 2018). An ARIMA model is commonly used for predicting the time series data of infectious diseases, especially for series that have a cyclical or repeating pattern. It mostly deals with non-stationary time series in order to capture the linear trend of an epidemic or a disease, and it predicts a future time series value from the previous time series values and the lagged forecast errors.
In recent studies, different models have been used to predict the prevalence, incidence, and mortality rate of COVID-19. Perone (2020) used an ARIMA model and predicted that Italy would reach the inflection point in terms of cumulative cases during the months of April and May. Zhao et al. (2020) applied the Metropolis-Hastings algorithm and predicted the effects of three epidemic intervention scenarios, that is, suppression, mitigation, and mildness, in controlling the spread of COVID-19 in African countries. Wang et al. (2020) used the SEIR model and the virus reproduction rate R to predict the number of infectious cases in Wuhan, China.
With the rising number of COVID-19 cases every day, there is a lot of stress on the administration and the health care system in India for accommodating patients with symptoms of COVID-19. Hence, the prediction of the estimated new cases in the coming days will help the health administration make adequate arrangements with ample time.
This paper aims to forecast the prevalence of COVID-19 cases in Maharashtra, Delhi, Kerala, and India as a whole. The COVID-19 data correspond to the period between 16 March 2020 and 17 May 2020. The best-fit ARIMA model was used to estimate the prevalence of COVID-19 cases for a period of 20 days. In addition to highlighting the characteristics of the epidemic and the behavior of its spread, this study also provides the health authorities with crucial information about the intensity of the epidemic at peak times using an ARIMA model. These models can help predict the health infrastructure and materials that patients will need in the future.

Data
For the validation and analysis of the proposed study, the prevalence of COVID-19 cases was taken from the Covid19india website (Covid19 2020), and Microsoft Excel was used to build a time-series database. The minimum sample size required for time series forecasting is 30 observations (Yaffee and McGee 2000). Hence, in this study, 63 daily time-series observations between 16 March 2020 and 17 May 2020 were used to predict the prevalence of COVID-19 cases over the next 20 days with 95% confidence limits. All analyses were performed using Statgraphics Centurion XVII.II software, with p < 0.05 as the level of statistical significance.
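As a quick check on the study window, the number of observations and the end of the 20-day forecast horizon follow directly from the dates given above. The sketch below assumes one observation per calendar day with both endpoints included; the variable names are illustrative and are not taken from the study.

```python
from datetime import date, timedelta

# Study window stated in the text (both endpoints assumed inclusive).
start = date(2020, 3, 16)
end = date(2020, 5, 17)

# One observation per calendar day.
n_obs = (end - start).days + 1

# The forecast horizon covers the 20 days after the last observation.
horizon_days = 20
horizon_end = end + timedelta(days=horizon_days)

print(n_obs)        # 63
print(horizon_end)  # 2020-06-06
```

The horizon end of 6 June 2020 agrees with the date quoted in the Conclusions.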

ARIMA Model
A time series is a sequence of observations, each recorded at a specific time; it may be measured continuously or discretely (Yaffee and McGee 2000). The main aim of time series analysis is to study past observations and develop an appropriate model to forecast future values. The ARIMA model, first introduced by Box and Jenkins in the 1970s, is the most used time series model when the data show no seasonal pattern. The ARIMA model, generally represented as ARIMA(p,d,q), is an extension of the autoregressive AR(p), moving average MA(q), and ARMA(p,q) models (He and Tao 2018). The letters p, d, and q correspond to the order of autoregression, the degree of differencing, and the order of the moving average respectively (Yaffee and McGee 2000). In an AR(p) model, the current time series value $y_t$ is expressed as a linear combination of the p past observations $y_{t-1}, y_{t-2}, \ldots, y_{t-p}$ and a random error $\varepsilon_t$, together with a constant term.
Similarly, in an MA(q) model, the current time series value uses the past q error terms $\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots, \varepsilon_{t-q}$ as the explanatory variables. The general formulas of the AR(p) and MA(q) models can be expressed as in Eq (1) and (2) respectively.
[Please see the supplementary files section to view the equations.] (1), (2)
Here $\varphi_i$ (i = 1, 2, ..., p) and $\theta_j$ (j = 1, 2, ..., q) are the autoregressive and moving average parameters respectively, $y_t$ is the observed value at time t, and $\varepsilon_t$ is the random error (or random shock) at time t. C is the constant term, and μ is the mean of the series. The random shock is assumed to be a white noise process, that is, a sequence of independent and identically distributed (i.i.d.) random variables with mean zero and a constant variance (Yaffee and McGee 2000).
The ARMA(p,q) model is a combination of the AR(p) and MA(q) models in which the current time series value is defined linearly in terms of its past p observations as well as the current and past q random shocks, together with a constant term. The general formula of an ARMA(p,q) model can be expressed as in Eq (3).
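For readers without access to the supplementary files, the three models are commonly written as follows. These are the standard textbook forms, offered as an assumption about what Eq (1) to (3) contain; the symbols $y_t$, $\varepsilon_t$, $\varphi_i$, and $\theta_j$ are the conventional ones and may differ from the authors' notation.

```latex
\begin{align}
y_t &= C + \sum_{i=1}^{p} \varphi_i \, y_{t-i} + \varepsilon_t
    && \text{AR}(p) \\
y_t &= \mu + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
    && \text{MA}(q) \\
y_t &= C + \sum_{i=1}^{p} \varphi_i \, y_{t-i}
         + \sum_{k=1}^{q} \theta_k \, \varepsilon_{t-k} + \varepsilon_t
    && \text{ARMA}(p,q)
\end{align}
```

Setting q = 0 in the ARMA(p,q) form recovers the AR(p) model, and setting p = 0 recovers the MA(q) model with the constant playing the role of the series mean.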
[Please see the supplementary files section to view the equations.] (3)
Where C is a constant and $\varepsilon_{t-k}$ (k = 1, 2, ..., q) are the values of the previous random shocks. Time series analysis requires a stationary time series, that is, a series that shows no trend or periodicity over time. In an ARIMA model, a non-stationary time series is made stationary by applying finite differencing to the time series. The differenced stationary time series can then be modeled as an ARMA process, which yields the ARIMA forecast (He and Tao 2018). The fit of each candidate model was assessed using the mean absolute error (MAE), the mean absolute percentage error (MAPE), and the root mean square error (RMSE), expressed as in Eq (4), (5), and (6) respectively.
[Please see the supplementary files section to view the equations.] (4), (5), (6)
Where $y_t$ is the actual value at time t, and $e_t$ is the difference between the actual and the predicted values. Also, n is the number of time points. Lower MAE, MAPE, and RMSE values indicate a model that better fits the data (Tseng and Shih 2019).
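The three performance measures can be sketched in a few lines of Python. This is a minimal illustration using the usual definitions (mean of absolute errors, mean of absolute errors relative to the actual values expressed in percent, and the square root of the mean squared error); the function name `forecast_errors` and the toy numbers are hypothetical and do not come from the study.

```python
import math

def forecast_errors(actual, predicted):
    """Return (MAE, MAPE in percent, RMSE) for two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]  # e_t = actual - predicted
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when any actual value is zero.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, mape, rmse

# Toy numbers for illustration only (not data from the study).
actual = [100, 200, 400]
predicted = [110, 190, 400]
mae, mape, rmse = forecast_errors(actual, predicted)
print(round(mae, 3), round(mape, 3), round(rmse, 3))  # 6.667 5.0 8.165
```

Because MAPE divides by the actual values, it is scale-free, which makes it convenient for comparing fits across regions with very different case counts.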
Steps involved in ARIMA modeling
Four critical steps are involved in ARIMA modeling, namely, identification, estimation, diagnostic checking, and forecasting. The first step is to check the seasonality and stationarity of the time series data by drawing a time series plot of the observed series against time. A time series is considered stationary if a shift in time does not cause a change in the shape of the distribution, that is, statistical properties such as the mean, variance, and autocorrelation are constant over time. The stationarity of time-series data is important as it helps develop powerful techniques to forecast future values (Brockwell and Davis 2001). The second step is to construct the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the stationary time series to determine the order of the AR and MA processes. The ACF is the correlation between the observation at time t and the observation at a different time lag, while the PACF is the amount of correlation between the current observation at time t and the observation at lag k that is not explained by the correlation at all lower-order lags (that is, lag < k) (Brockwell and Davis 2001). The third step involves estimating the parameters of the best-fit model, which is selected using the performance measure criteria. The ACF plot of the residuals and the Box-Pierce test of white noise were used to evaluate the goodness of fit of the model. The fourth step involves forecasting future values using a well-fitting model.
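The identification and diagnostic steps can be illustrated with a short, dependency-free Python sketch. It is not the Statgraphics procedure used in the study: the helper names (`difference`, `acf`, `box_pierce`) are hypothetical, the series is synthetic, and the Box-Pierce statistic is computed without its chi-square p-value. In practice the statistic is applied to model residuals; here it is applied to the differenced series only to show the call.

```python
import math

def difference(series, d=1):
    """Apply d rounds of first-order differencing."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag (r_0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(v * v for v in dev)
    if c0 == 0:
        raise ValueError("ACF is undefined for a constant series")
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

def box_pierce(values, lags):
    """Box-Pierce statistic Q = n * sum of squared autocorrelations, lags 1..lags."""
    r = acf(values, lags)
    return len(values) * sum(rk * rk for rk in r[1:])

# Synthetic daily counts (illustrative only, not data from the study).
daily = [5 + 2 * t + (3 if t % 2 else -3) for t in range(30)]
cumulative = []
total = 0
for x in daily:
    total += x
    cumulative.append(total)

raw_acf = acf(cumulative, 5)          # slow decay suggests non-stationarity
diff2 = difference(cumulative, d=2)   # second-order differencing
diff2_acf = acf(diff2, 5)
bound = 1.96 / math.sqrt(len(diff2))  # approximate 95% limits around zero

print([round(r, 2) for r in raw_acf])
print([round(r, 2) for r in diff2_acf], round(bound, 2))
print(round(box_pierce(diff2, 5), 2))
```

Autocorrelations that fall outside the approximate 95% limits are the ones treated as non-zero when choosing the AR and MA orders.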

Results
Prevalence and incidence of COVID-19
The descriptive statistics of the prevalence and incidence of COVID-19 in Maharashtra, Delhi, Kerala, and India are given in Table 1.

Forecasting the prevalence of COVID-19 pandemic using the time series ARIMA model
The two lines on the ACF plot indicate the lower and upper limits of the 95% confidence interval. These lines help identify the presence of non-zero autocorrelation. The ACF plot confirms that the prevalence of COVID-19 is not stationary, as the autocorrelation decreases only slightly with increasing lag (see Appendix).
First- and second-order differencing were applied to stabilize the mean of COVID-19 prevalence for Maharashtra, Delhi, Kerala, and India. After the second-order differencing, the series became stationary, and the parameters of the ARIMA model were determined according to the ACF and PACF plots as shown in Fig 2. All the analyses were performed on the transformed prevalence of COVID-19. The ARIMA model with the lowest MAPE and statistically significant parameters was selected as the best model for forecasting. ARIMA (2,3,1), ARIMA (2,2,0), ARIMA (2,2,0), and ARIMA (1,3,1) were selected as the best fit models for Maharashtra, Delhi, Kerala, and India respectively. With minimum MAPE values of 5.430 for Maharashtra, 10.440 for Delhi, 2.607 for Kerala, and 2.390 for India, the models fitted the prevalence of COVID-19 very well (Fig 2 and Table 2). All the estimated parameters of the best fit models and the Box-Pierce test statistic are presented in Table 3. The fitted and predicted total confirmed COVID-19 cases are presented in Table 4 and Fig 3. For the next 20 days, the total number of confirmed COVID-19 cases is estimated to lie between 79,406 and 1,08,013 in Maharashtra, 14,423 and 25,272 in Delhi, 554 and 1,295 in Kerala, and 2,18,484 and 2,73,172 in India.
Table 4 Prediction of the total confirmed COVID-19 cases for the next 20 days according to ARIMA models with 95% confidence interval.
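The figures above use Indian digit grouping, in which 1,08,013 denotes 108,013 and one lakh equals 100,000. A small sketch of the conversion follows; the helper name `parse_grouped` is hypothetical and is introduced only for illustration.

```python
LAKH = 100_000  # one lakh in the Indian numbering system

def parse_grouped(number_text):
    """Parse an integer written with comma grouping (Indian or Western)."""
    return int(number_text.replace(",", ""))

# Figures quoted in the text.
india_upper = parse_grouped("2,73,172")
maharashtra_upper = parse_grouped("1,08,013")
india_forecast = round(2.45 * LAKH)

print(india_upper)          # 273172
print(maharashtra_upper)    # 108013
print(india_forecast)       # 245000
print(f"{india_upper:,}")   # 273,172 (Western grouping)
```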

Discussion
In an effort to slow down the spread of COVID-19, the Indian government took strong measures by announcing a countrywide lockdown on 24 March 2020 as the number of confirmed positive cases was increasing in the country. Estimating the prevalence and intensity of an epidemic is crucial for allocating medical and health resources, organizing production activities, and even managing the economic situation of the country.
Hence, developing a forecasting model that accurately predicts the future intensity of an epidemic can help government administrators and decision-makers prepare the manpower and medical supplies required during an outbreak. In this study, the ongoing trend and the intensity of the COVID-19 pandemic were estimated using the ARIMA time series model. The ARIMA model is one of the best models and has been extensively employed to predict the incidence of contagious diseases (Wang et al. 2018a). To the best of our knowledge, this is the first study in India to apply the ARIMA model to estimate the prevalence of COVID-19 in India and its major states.
India has reported a lower COVID-19 death rate than countries like China, the United Kingdom, Italy, Spain, and the United States (Ghosal et al. 2020). However, the total confirmed COVID-19 cases in most of the Indian states show no sign of a downward trend. At the time of writing this article, India had 82,000 confirmed positive cases (Covid19 2020) and was expected to overtake China's total COVID-19 cases shortly. Minhas (2020) points out that India is another potential epicenter of the global COVID-19 pandemic due to human overpopulation and unhygienic living conditions. Containing the spread of the virus among economically disadvantaged people, who may not be able to self-isolate, is a challenge. In Maharashtra, the number of daily new cases since March 16 has grown exponentially and crossed the 1000-cases-per-day mark on May 6. Mumbai, the state capital of Maharashtra and also India's financial capital, has been the city worst hit by COVID-19, having recorded 15,750 total cases, accounting for 20 percent of all positive COVID-19 cases in India (Dutta 2020). Kerala reported the first case of COVID-19 in India; however, over a period of one month, the daily new confirmed cases reduced significantly, to zero for five consecutive days (Covid19 2020). Delhi, the national capital, reported 472 COVID-19 cases in a single day on May 14, the highest jump so far. With the lockdown curbs being relaxed after May 17, the number of new cases may increase further (Dutt 2020). This pattern will burden the health system to its maximum capacity. As a result, if adequate measures to contain the spread are not appropriately enforced, and social distancing is not maintained, the number of cases is not expected to plateau any time soon.

Conclusions
An epidemic is a numbers game and as far as numbers are concerned, India has a handful of them. With no valid medical treatment and preventive measures for this virus to date, forecasting the prevalence of the disease is a vital strategy to strengthen the surveillance and allocate health resources accordingly.
Our forecasting model shows that, if left unchecked, the epidemic in India is likely to cross 2.45 lakh cases by 6 June 2020 and overburden the health care system. The results of the study will help health authorities and health care management plan the necessary supply of resources, which include medical staff, medical equipment, intensive care facilities, hospital beds, and other healthcare facilities. This will make the epidemic controllable and bring it within the domain of the available healthcare resources in India.

[Figure] The estimated ACF and PACF plots to predict the trend of COVID-19 prevalence for (a-b) Maharashtra, (c-d) Delhi, (e-f) Kerala, and (g-h) India.