COVID-19 Forecasting using Multivariate Linear Regression

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It was first identified in December 2019 in Wuhan, the capital of China's Hubei province. The objective of this research is to propose a forecasting model, built with machine learning algorithms, using the available COVID-19 dataset from the top affected regions across the world. Regression models are supervised machine learning techniques for predicting continuous values from large-scale data. This research applies Multivariate Linear Regression to predict the number of confirmed COVID-19 cases and deaths one and two weeks ahead. The experimental results explain 99% of the variability in prediction, with an R-squared score of 0.992. The algorithm is evaluated using error metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and accuracy for the top affected regions across the world.


Introduction
Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and was first identified in December 2019 in Wuhan, the capital of China's Hubei province. High fever, cough, sore throat, headache, fatigue, muscle pain, and shortness of breath are the initial symptoms of COVID-19. Mathematical analysis of infectious diseases using SIR epidemic and endemic models has been studied in the past literature [18]. Luo [21] indicates that certain measures taken by governments are based on various predictions. These predictions cover hospital needs, future deaths, infection peaks and other quantities. Several regions across the world have gone into a state of lockdown, including travel restrictions, to prevent the spread of this deadly virus [10], [17].
Despite lockdowns, a massive number of deaths and confirmed cases occurred across the world. Data from countries like China, Japan and South Korea indicate that although government-imposed lockdowns have reduced the spread of COVID-19, there are other important factors that cannot be neglected. One such factor is wearing masks in public, which has now been accepted as a norm [27]. Social distancing, considered an important factor in controlling the spread of this deadly virus, is expected to be prolonged in the near future [19].
Social distancing and lockdowns alone are not enough to prevent the spread of COVID-19 across the world. Forecasting cumulative deaths and confirmed cases is useful for taking preparatory measures that can prevent a massive number of deaths and confirmed cases across the world.
The objective of this research is to predict the cumulative count of future deaths and confirmed cases in the top affected regions across the world using Multivariate Linear Regression. Regression models are known to be more robust to noise, provided the response and exposure variables are aggregated [29].
The experimental results obtained in this paper are compared using error metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared value, and Root Mean Squared Error (RMSE) to validate the efficiency of the proposed technique. Multivariate Linear Regression achieves an accuracy of 99%. These results are taken from the top affected regions across the world: the United States of America, United Kingdom, Spain, Italy and India (countries), along with Maharashtra, Tamil Nadu, Gujarat, Uttar Pradesh and Delhi (states in India) and Chennai (district in Tamil Nadu).

Related Works
COVID-19 has taken the world by surprise, and almost all walks of life have been affected by it. Different medical specializations have been diverted towards caring for COVID-19 patients. One such situation arises for radiologists, who are currently experiencing downtime in their profession. Fessell et al. [12] discuss how to make this downtime useful and meaningful. They discuss the importance of self-care, likening it to putting on your own emergency mask on a flight before helping your dependents, and how stress is created by the current news outlets. Revisiting books that were bought but never read and parts of the home that are not well maintained are among the thought-provoking suggestions they make [12]. The reduction in acute coronary syndrome in current times [27] is attributed to a reduction in air pollution. Earlier campaigns advising the general public to be wary of air pollution did not gain popularity, but the current widespread use of N95 masks to prevent the spread of COVID-19 has addressed multiple issues [27]. Prior research has shown the effect of masks in protecting humans from air pollutants like PM2.5 [26], indicating the size and spread of such pollutants. A comparison of the presence of air pollutants with that of SARS-CoV-2 in the nasal epithelium and upper respiratory tract [14] suggests that masks could be one factor in controlling the spread of this deadly virus.
A recent study of the COVID-19 curve using Levitt's Metric suggests that the peak could be reached within two to four weeks from 13th July 2020 [3]. Levitt's Metric treats the closeness of H(t) to 1 as an indicator of no more new deaths. This study considers only one parameter at a time, either COVID-19 deaths or COVID-19 cases. Luo [21] indicates that predictions made by many researchers around the globe have tended to fall outside even the next day's 95% confidence interval. Predictive monitoring, which monitors predictions of crucial future events and treats the differences not as errors but as changes, is adopted in [21]. Regression is used in predictive monitoring, and only those countries with an R-squared value greater than 0.8 are accepted and reported. The date of the last expected case, along with the dates for reaching 99% and 97% of the expected cases, is also reported. The author cautions that the predictions might go wrong because of efforts taken by the respective governments.
Many researchers have predicted the final size of COVID-19 [15,23,6,28,31]. The logistic growth model and the classic susceptible-infected-recovered (SIR) model are used to estimate the final size of COVID-19 [7]. The efficiency of the proposed logistic regression is justified by a higher coefficient of determination (0.996) and a lower p-value (< 0.000). Hashem et al. [10] have modelled the spread of COVID-19 using machine learning techniques. Their approach can be used not only to track the spread but also to forecast, trace and diagnose COVID-19. Such a framework can be used in smart cities around the world. Prior to COVID-19, cardiovascular mortality in a time series for Canada was studied using generalized additive models (GAM), distributed lag nonlinear models (DLNM) and autoregressive-moving average (ARMA) models [1].
Petropoulos has applied simple time series modeling based on exponential smoothing to forecast the pattern of COVID-19 [8]. Related works are summarized in Table 1.
Likewise, in this article, we aim to predict the future for the top affected regions, hoping for a downward trend in the near future that can offer some relief to the stressful life of these days.
One important limitation of the data obtained is due to false negatives in the RT-PCR tests. These might be caused by the virus being present in the lower respiratory tract rather than the upper respiratory tract, from which the sample was collected [32,20]. Thus, there may be a discrepancy between the active cases and the confirmed cases, and considering only one parameter might affect the outcome of the prediction. This work considers multiple parameters, as described in Section 3.

Dataset
The COVID-19 dataset is collected from the Kaggle repository as a CSV file. There are eight parameters: region, date, confirmed cases, deaths, confirmed cases and deaths after one week, and confirmed cases and deaths after two weeks (Table 1). Data preprocessing is an important stage for identifying missing and redundant data in the dataset [10]. Any missing data (NaN) are to be removed to bring uniformity to the dataset. However, this dataset contained no missing or unnecessary values. Table 2 gives the description of the dataset. Figure 1 shows the cumulative number of confirmed COVID-19 cases across the world. Figure 2 shows the cumulative number of COVID-19 deaths across the world. For India, Figure 3 shows that confirmed cases are higher for Maharashtra and India compared to Gujarat and Uttar Pradesh. The cumulative number of deaths in Tamil Nadu is gradually increasing and is second only to Maharashtra among the states in India (Figure 4).
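The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the column names and the three sample rows are hypothetical stand-ins, since the exact layout of the Kaggle CSV file is not given in the text.

```python
import io

import pandas as pd

# Hypothetical column names and made-up rows; the actual Kaggle CSV may differ.
SAMPLE_CSV = """region,date,confirmed,deaths,confirmed_1w,deaths_1w,confirmed_2w,deaths_2w
A,2020-07-01,100,5,150,8,210,12
A,2020-07-02,110,6,160,9,225,13
B,2020-07-01,40,1,70,2,,4
"""


def load_and_clean(source):
    """Read the CSV, count missing values, and drop incomplete or duplicate rows."""
    df = pd.read_csv(source, parse_dates=["date"])
    missing_per_column = df.isna().sum()
    cleaned = df.dropna().drop_duplicates().reset_index(drop=True)
    return cleaned, missing_per_column


cleaned, missing = load_and_clean(io.StringIO(SAMPLE_CSV))
print(missing)        # number of NaN entries in each of the eight columns
print(cleaned.shape)  # rows that survive the cleaning step, by eight columns
```

For a dataset with no missing values, as reported here, the missing-value counts would all be zero and the cleaning step would leave the data unchanged.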

Proposed Methodology
The proposed methodology is shown in Figure 5. The COVID-19 dataset is obtained from the Kaggle repository. The main objective of this paper is to forecast the future cumulative deaths and confirmed cases for the top affected regions across the world. Regression and classification algorithms are among the popular supervised learning techniques. In general, regression models are used to predict continuous target values, whereas classification models are used to predict discrete target variables. In this paper, the dependent variables (four target or output variables) are continuous, so regression models are a suitable choice for the proposed methodology. After checking the dataset for unnecessary values and missing data, we analyse the data to select the independent variables (features). The cumulative numbers of confirmed cases and deaths (two independent variables) are used in this methodology. In the Multivariate Linear Regression algorithm, multiple independent variables are used to predict multiple target variables. Multivariate Linear Regression, a supervised learning technique, is an extension of simple linear regression. It is used to estimate the relationship between two or more independent variables and the dependent variables. The algorithms were implemented with Python's scikit-learn and TensorFlow libraries. Error metrics are used to evaluate the performance of the regression algorithm. The evaluation parameters are accuracy, R-squared score, Mean Absolute Error, Mean Squared Error and Root Mean Squared Error.
Independent variables (input): cumulative number of confirmed cases and cumulative number of deaths.
Dependent variables (target or output): confirmed cases after one week, deaths after one week, confirmed cases after two weeks, and deaths after two weeks.
There are thus two independent variables (input) and four dependent variables (target or output), so Multivariate Linear Regression (multiple target variables) is a suitable way to handle the proposed methodology.
Multivariate regression is a regression technique that estimates a single regression model with more than one dependent (target) variable. When there is more than one dependent variable and more than one independent variable, the model is called Multivariate Multiple Linear Regression.
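The model has the form

$$Y_{ik} = b_{0k} + \sum_{j=1}^{p} b_{jk} x_{ij} + e_{ik}$$

for $i \in \{1, \ldots, n\}$ and $k \in \{1, \ldots, m\}$, where
$Y_{ik} \in \mathbb{R}$ is the k-th real-valued response for the i-th observation,
$b_{0k} \in \mathbb{R}$ is the regression intercept for the k-th response,
$b_{jk} \in \mathbb{R}$ is the j-th predictor's regression slope for the k-th response,
$x_{ij} \in \mathbb{R}$ is the j-th predictor for the i-th observation, and
$e_{ik} \in \mathbb{R}$ is the error term for the k-th response of the i-th observation.
Here n is the number of observations, p the number of predictors and m the number of responses.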
The model is multivariate because m > 1 (more than one dependent variable), and the proposed model is a Multivariate Multiple Linear Regression because m > 1 and p > 1 (more than one dependent variable and more than one independent variable).
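As an illustration of the model with p = 2 predictors and m = 4 responses, the sketch below fits scikit-learn's LinearRegression to synthetic data. The data, slopes and intercepts are invented for the example and are not the paper's dataset or results; the cross-check with numpy's least-squares solver is included only to show that the fitted slopes and intercepts are the ordinary least squares solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the paper's dataset): n observations and
# p = 2 predictors (cumulative confirmed cases, cumulative deaths).
n = 200
confirmed = rng.uniform(1_000, 1_000_000, size=n)
deaths = confirmed * rng.uniform(0.01, 0.05, size=n)
X = np.column_stack([confirmed, deaths])

# m = 4 responses generated from assumed slopes b_jk and intercepts b_0k plus noise.
true_B = np.array([[1.10, 0.002, 1.25, 0.004],
                   [0.50, 1.050, 0.90, 1.120]])  # shape (p, m)
true_b0 = np.array([50.0, 2.0, 120.0, 5.0])      # shape (m,)
Y = true_b0 + X @ true_B + rng.normal(0.0, 10.0, size=(n, 4))

# One call fits all four responses at once.
model = LinearRegression().fit(X, Y)

# Cross-check: ordinary least squares on the design matrix [1, X].
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)  # shape (p + 1, m)

print(model.coef_.shape)            # (4, 2): one row of slopes per response
print(model.intercept_.shape)       # (4,): one intercept per response
print(model.predict(X[:1]).shape)   # (1, 4): four forecasts for one input row
```

Each row of `model.coef_` holds the slopes for one response, so the fitted object corresponds to the matrix of slopes transposed, with the intercepts stored separately.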

Evaluation Parameters
In regression models, error metrics are used to evaluate performance. Several error metrics are used to check the result: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-squared score and Root Mean Squared Error (RMSE). Table 3 presents the performance analysis for the COVID-19 world dataset. In Table 3, the accuracy is 99.7%, the Mean Absolute Error is 79818 and the R-squared score is 99.28% using Multivariate Linear Regression (MvLR). With scores this high, overfitting might be a concern, so the dataset is divided into three parts: training, test and validation data. Using the validation data, we evaluate the error metrics and check for overfitting. On the validation data, the accuracy is 99%, which indicates no overfitting in the proposed methodology. Multivariate Linear Regression is therefore a suitable algorithm for forecasting the cumulative number of confirmed COVID-19 cases and deaths for the top affected regions across the world.
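The evaluation procedure can be sketched with scikit-learn as follows. This is an assumption-laden example: the data are synthetic, the 60/20/20 split proportions are chosen for illustration (the text does not state them), and accuracy is omitted because its definition for a regression task is not given in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with two inputs and four targets (not the paper's dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1_000, size=(300, 2))
Y = X @ rng.uniform(0.5, 1.5, size=(2, 4)) + rng.normal(0, 5, size=(300, 4))

# Assumed 60 / 20 / 20 split into training, validation and test data.
X_train, X_rest, Y_train, Y_rest = train_test_split(
    X, Y, test_size=0.4, random_state=0)
X_val, X_test, Y_val, Y_test = train_test_split(
    X_rest, Y_rest, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, Y_train)


def report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-squared, each averaged over the four targets."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # square root of the averaged MSE
        "R2": r2_score(y_true, y_pred),
    }


splits = {
    "train": (X_train, Y_train),
    "validation": (X_val, Y_val),
    "test": (X_test, Y_test),
}
scores = {name: report(Yp, model.predict(Xp)) for name, (Xp, Yp) in splits.items()}

for name, metrics in scores.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

A training score that is much better than the validation score would point to overfitting; comparable scores on both, as reported here, suggest that the model generalizes.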

Results And Discussion
The Multivariate Linear Regression (MvLR) analysis was conducted with the cumulative confirmed cases and deaths on July 21 as input, and it predicted the cumulative confirmed cases and deaths on July 28 and August 5 as output for the top eleven affected regions across the world. Results are tabulated in Table 4.
Figure 6 represents the forecast spread of COVID-19 (cumulative confirmed cases) across the world. The forecast cumulative numbers of confirmed cases on July 28 and August 5 are 1359221 and 1903457 in India, 4428455 and 4905644 in the United States of America (USA), 322611 and 365408 in the United Kingdom (UK), 279847 and 300939 in Spain, and 245815 and 246179 in Italy.
Figure 7 represents the forecast spread of COVID-19 (cumulative death cases) across the world. In India, the forecast cumulative numbers of deaths on July 28 and August 5 are 34962 and 50048.
The figures (Figures 6 and 7) represent the forecast spread of COVID-19. For the states in India, the forecast cumulative numbers of confirmed cases on July 28 and August 5 are 217822 and 291515 in Tamil Nadu, 62287 and 95773 in Uttar Pradesh (UP), 382336 and 512465 in Maharashtra, 141010 and 174828 in Delhi, and 55559 and 72911 in Gujarat.
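To compare the regions, the forecast figures quoted in the text can be turned into relative increases between the two forecast dates. The short sketch below does only this arithmetic on the confirmed-case numbers given above; the helper name `relative_increase` is introduced for illustration and is not part of the proposed methodology.

```python
# Forecast cumulative confirmed cases (July 28, August 5), copied from the text.
forecast_confirmed = {
    "India": (1359221, 1903457),
    "USA": (4428455, 4905644),
    "UK": (322611, 365408),
    "Spain": (279847, 300939),
    "Italy": (245815, 246179),
    "Tamil Nadu": (217822, 291515),
    "Uttar Pradesh": (62287, 95773),
    "Maharashtra": (382336, 512465),
    "Delhi": (141010, 174828),
    "Gujarat": (55559, 72911),
}


def relative_increase(first, second):
    """Fractional growth from the first forecast to the second."""
    return (second - first) / first


growth = {
    region: relative_increase(july_28, august_5)
    for region, (july_28, august_5) in forecast_confirmed.items()
}

# Print the regions from the fastest to the slowest forecast growth.
for region, g in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{region:15s} {g:6.1%}")
```

Among these ten regions, the forecast growth is largest for Uttar Pradesh and smallest for Italy.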

Conclusion
This article proposed the use of regression models for COVID-19 epidemic analysis using the COVID-19 dataset from the Kaggle COVID-19 repository. The results show that Multivariate Linear Regression (MvLR) yields good accuracy in forecasting the COVID-19 pandemic. The experiments are evaluated using error metrics such as Mean Absolute Error (MAE), R-squared score, Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). The proposed methodology predicts the cumulative number of deaths and the cumulative number of confirmed cases (after one and two weeks) from the cumulative number of confirmed/active cases and the cumulative number of deaths each day. It will help governments and the health care sector (hospitals and pharmacies) to prevent the spread of COVID-19. The study can be extended using deep learning models to forecast the spread of COVID-19 more efficiently.

Figure 5 Proposed Methodology
Figure 6 Forecasting the spread of COVID-19 (Cumulative confirmed cases) across the World on July 28 to August 5
Figure 7 Forecasting the spread of COVID-19 (Cumulative death cases) across the World on July 28 to August 5
Figure 8 Forecasting the spread of COVID-19 (Cumulative death cases) across India on July 28 to August 5
Figure 9 Forecasting the spread of COVID-19 (Cumulative confirmed cases) across India on July 28 to August

Supplementary Files
This is a list of supplementary files associated with this preprint. Tables.pdf