Can a Meteorological Variable Be Considered As a Predictor of COVID-19 Cases in Urban Agglomerations of Indian Cities?

Coronavirus disease has been identified as one of the deadliest diseases, and the WHO has declared it a pandemic and a global health crisis. It has become a massive challenge for humanity. India is also facing its fierceness, as the virus is highly infectious and mutating at a rapid rate. Many interventions have been applied in India since the first case was reported on January 30, 2020. Several studies have been conducted over the last year to assess the impact of climatic and weather conditions on its spread. As it is well established that temperature and humidity can trigger the onset of diseases such as influenza and respiratory disorders, the association of several meteorological variables with the number of COVID-19 cases has been studied in the past. The conclusions of those studies were based on data obtained at an early stage, when it was too early to draw any inference. This study attempts to assess the influence of temperature, humidity, wind speed, dew point, the previous day's number of deaths, and government interventions on the number of confirmed COVID-19 cases in 18 districts of India. It also attempts to identify the important predictors of the number of confirmed COVID-19 cases in those districts. The random forest model and a hybrid model obtained by modelling the random forest model's residuals are used to predict the response variable. It is observed that meteorological variables are useful only to some extent, and that too when used with


Introduction
The novel coronavirus disease (COVID-19) has spread to almost all countries over the last one and a half years. Nations are facing the wrath of the disease through outbreaks, during which seasonal variations have been observed. In India, over 32,474,773 cases and 435,050 deaths have been reported (Worldometer, 2021). The temporal variations in the number of COVID-19 cases observed across regions from January 2020 to date indicate seasonal variations (Gautam et al., 2020; Chelani and Gautam 2021). It is a well-established fact that temperature and humidity can trigger the onset of diseases such as influenza and respiratory disorders (Shaman and Kohn, 2009; Golakota et al. 2021). The association of several meteorological variables with the number of COVID-19 cases has therefore been assessed in various studies (Zhu et al., 2020; Yao et al., 2020; Cole et al., 2020; Gautam 2020a&b; Gautam et al., 2021; Ambade et al., 2021). In winter, a high number of flu or influenza cases is usually witnessed due to low temperature, whereas during summer fewer cases are generally observed (Damette et al., 2021; Chen et al., 2021). It has been observed that cities with an average temperature below 10 °C and lower humidity have a higher chance of virus spread than those with higher temperatures (Sajadi et al., 2020; Araujo and Naimi, 2020).
It is observed that a rise of 1 °C in temperature may cause an increase of approximately 4.861% in the daily confirmed COVID-19 cases in China (Xie & Zhu, 2020). A few studies, however, have found an insignificant or negative effect of meteorological factors on confirmed COVID-19 cases (Yao et al., 2020; Liu et al., 2020; Shi et al., 2020; Méndez-Arriaga, 2020; Wu et al., 2020). The studies on the association between meteorological parameters and COVID-19 have thus given mixed results and do not provide empirical evidence or confirmed statistical significance of the association. In India, warm temperatures are observed in most regions for many days, which raised the hope that the disease would not spread to the extent seen in areas with cooler climates (Chen et al., 2021). However, the havoc created by the virus in India is well known (BBC, 2021). Even in the warmer period, a large number of cases have been reported (https://www.hopkinsmedicine.org/health/conditions-anddiseases/coronavirus/first-and-second-waves-of-coronavirus).
The studies above used small samples due to the limited availability of data on confirmed COVID-19 cases. Outcomes based on small datasets may lead to erroneous conclusions. Over time, an enormous dataset has accumulated. With a large sample size available, a rigorous study can be conducted on the relationship between meteorological factors and the number of confirmed COVID-19 cases to assess the role of the former in governing the spread of the disease. In addition, the effect of policy interventions such as complete or partial lockdowns implemented to prevent the spread of the virus can also be assessed. In India, the complete lockdown was initiated on March 24, 2020 and was later relaxed from April 14, 2020 in a phase-wise manner. During the complete lockdown, all activities except essential services were completely halted. Partial lockdowns were imposed after April 14, 2020, with relaxation in phases (Wikipedia, 2021). It is of interest to know the influence of complete and partial lockdowns in the cities on containing the spread of the disease. The number of deaths due to COVID-19 on the previous day may influence the spread of the virus because people gather for condolence meets, although in fewer numbers due to government restrictions. The number of deaths on the previous day was included in the model as it contains the effect of earlier days due to autocorrelation. In this study, with the data on meteorological factors such as daily mean temperature, wind speed,
The time series of the number of confirmed cases for each district is normalized by dividing it by the population of the corresponding district. The resulting time series is considered as the response variable, which is modelled as a function of the meteorological variables (temperature, wind speed, relative humidity and dew point), a lockdown variable coded as a categorical dummy variable, and the number of deaths reported on the previous day. The meteorological data are obtained from Wunderground (2021). The policy interventions such as complete and partial lockdown are incorporated in the model as dummy variables with the following descriptors:

Data collection and model application
Complete lockdown - The complete lockdown was initiated in India starting with the 'Janta Curfew' on March 22, 2020. The complete lockdown was announced on March 24, 2020 and lasted till May 31, 2020. During this phase, non-essential services and factories were suspended; essential services such as grocery stores and vegetable sellers were allowed to remain open for a limited duration. Traffic movement was restricted through strict enforcement by the police and local governments.
Partial lockdown - The essential and non-essential activities were allowed for a few hours till
Unlock - On June 1, 2020, the central government announced Unlock 1.0 with eased restrictions, followed by a series of unlock phases (Unlock 2.0, 3.0, 4.0 and 5.0), which were extended till November 30, 2020. All the unlock phases are incorporated in the model as one dummy variable.
Complete upliftment of the lockdown or no lockdown - From December 1, 2020, the restrictions were lifted until the beginning of the second wave of COVID-19, which began on April 5, 2021.
From April 5, 2021 to April 30, 2021, a partial lockdown was imposed, which became stricter from May 1, 2021 to May 31, 2021. Only the lockdown phases imposed by the Central government were followed in this study. The model incorporates the lockdown as a categorical variable with complete lockdown as Lk1, unlock as Lk2, partial lockdown as Lk3 and no lockdown as Lk4. Four dummy variables were therefore introduced in the model with binary values as described above. The other variables, namely mean daily temperature (°C), humidity (%), wind speed (m s⁻¹) and dew point temperature (°C) for the respective district, are used in the model as Temp, RH, WS and Dew. The number of deaths on the previous few days may influence the spread of the virus. Only the number of deaths on the previous day is included in the model, as it is assumed that the effect of deaths on earlier days is already contained in that value. It is denoted as Death_1 in the model.
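The dummy coding described above can be sketched as follows. This is only an illustration: the function names are hypothetical, the Lk1 to Lk4 labels follow the text of this section, the April to May 2021 period is treated as a single partial-lockdown phase, and dates outside the listed phases are assumed to be "no lockdown" because the text does not state how they were coded.

```python
from datetime import date

# Phase boundaries as stated in the text (central government phases only).
PHASES = [
    (date(2020, 3, 24), date(2020, 5, 31), "Lk1"),   # complete lockdown
    (date(2020, 6, 1), date(2020, 11, 30), "Lk2"),   # unlock 1.0 to 5.0
    (date(2020, 12, 1), date(2021, 4, 4), "Lk4"),    # no lockdown
    (date(2021, 4, 5), date(2021, 5, 31), "Lk3"),    # partial lockdown
]
LEVELS = ("Lk1", "Lk2", "Lk3", "Lk4")


def lockdown_phase(day):
    """Return the phase label for a calendar day."""
    for start, end, label in PHASES:
        if start <= day <= end:
            return label
    # Assumption: days outside the listed phases count as "no lockdown".
    return "Lk4"


def lockdown_dummies(day):
    """Return the four binary dummy variables for a calendar day."""
    phase = lockdown_phase(day)
    return {level: int(level == phase) for level in LEVELS}


print(lockdown_dummies(date(2020, 4, 1)))
```

Exactly one of the four indicators is 1 on any given day, so a regression model with an intercept would normally drop one of them; tree-based models such as random forest can use all four.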
The reported cases may depend on the previous number of cases due to lagged behaviour. Lagged values of confirmed cases are, however, not included, to avoid a tautological effect on the model. Including the previous day's number of confirmed cases would also dominate the model and would carry more importance than the other features.
Therefore, the model is developed only with the meteorological variables, the number of the previous day's deaths and the exogenous intervention variables. The meteorological variables may also be autocorrelated; however, their lags are not included in the model, as it is assumed that the values observed on a particular day include the effect of the previous day's values, so any effect of those earlier values on the response variable is taken care of by the present day's observations.

The random forest model
Random forest (RF) is an ensemble method built on a forest of randomized decision trees (Breiman, 2001). It is a supervised learning algorithm that constructs decision trees based on the training data set. The combination of decision trees minimizes the out-of-bag error based on the training data sets. Studies have shown its applicability even in the presence of noise in the data (Kontschieder et al., 2011). Many nonlinear and high-dimensional complex classification and regression problems have been solved by applying RF (Yu et al., 2016). In a regression problem, the predictions are obtained by randomly selecting a number of predictor variables in the decision tree. The best solution is determined based on the number of nodes and the variables at the nodes of the fully grown tree. Every training set is fed to each decision tree (Breiman, 2001; Breiman, 2002). The number of variables or features used to construct the model is set to a specified value. For many regression problems, one-third of the total number of variables is often used (Liaw and Wiener, 2002). The usual practice for selecting the number of trees is to try initial values from 10 to 1000 or higher and choose one based on model performance criteria. A higher number of trees, however, slows the learning process. A discussion of other selection criteria based on cross-validation and tuning is given in Stone (1974). The decision tree is constructed for the training cases with the specified nodes of the tree. The training set is selected and trained, whereas the remaining cases are used to estimate the error. So, for each case, out-of-bag error estimates are provided. The random forest model has an advantage over other supervised learning models for classification and regression problems, as it avoids over-fitting and provides a solution based on averaging. Each variable has its own importance in the overall model performance, which can be shown based on the Gini index for classification and the mean square error (MSE) for regression problems. RF is applied using R 4.0.0 (R Development Core Team, 2010). For variable importance, %IncMSE is computed in the R package, which is the increase in the MSE of predictions on out-of-bag samples as a result of permuting the variables.
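The idea behind a permutation-based importance measure can be illustrated with a minimal sketch. It is not the R package's implementation: it uses one already-fitted predictor and a single evaluation set instead of per-tree out-of-bag samples, and it reports the raw mean increase in MSE. The function names and the toy model are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean square error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        total_increase = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            permuted = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(X, column)]
            total_increase += mse(y, [predict(row) for row in permuted]) - baseline
        importances.append(total_increase / n_repeats)
    return importances


# Toy check: the response depends on the first feature only.
X = [[float(i), float((7 * i) % 5)] for i in range(10)]
y = [2.0 * row[0] for row in X]
toy_model = lambda row: 2.0 * row[0]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

In the toy check, shuffling the first column degrades the predictions while shuffling the second leaves them unchanged, which is the behaviour a ranking by increase in MSE relies on.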

AR1 model
Sometimes, the fitted model shows heteroscedasticity in the residuals, and accepting the model is not advisable, as some patterns in the datasets may not have been appropriately captured. Where $\hat{y}_t$ is the forecast value of the RF model for time t. The AR1 model can be fitted as in Equation (2); Like the RF model, the coefficients are estimated for the residuals of the training set and the predictions are obtained for the residuals of the testing set.
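A minimal sketch of fitting an AR1 model to a residual series is given below. It assumes the standard form r_t = phi0 + phi1 * r_(t-1) + e_t and estimates the two coefficients by ordinary least squares on consecutive pairs; the symbol names, the function names and the choice of estimator are assumptions, since the estimation procedure is not specified here.

```python
def fit_ar1(residuals):
    """Least-squares estimate of (phi0, phi1) in r_t = phi0 + phi1 * r_{t-1} + e_t."""
    x = residuals[:-1]  # r_{t-1}
    y = residuals[1:]   # r_t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi1 = sxy / sxx
    phi0 = mean_y - phi1 * mean_x
    return phi0, phi1


def predict_ar1(phi0, phi1, previous_residual):
    """One-step-ahead forecast of the residual."""
    return phi0 + phi1 * previous_residual


# Toy series that follows r_t = 0.1 + 0.5 * r_{t-1} exactly (no noise).
series = [1.0]
for _ in range(10):
    series.append(0.1 + 0.5 * series[-1])
phi0, phi1 = fit_ar1(series)
print(phi0, phi1)
```

The coefficients would be estimated on the training-set residuals and then applied to the testing-set residuals, matching the split used for the RF model.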

Hybrid model
The predictions obtained by the RF model and the AR1 model are combined. The hybrid methodology is based on the combination of a linear autocorrelation model and the random forest model (Chelani and Devotta, 2006), which can be given as; Where $r_t$ denotes the AR1 model fitted to the residuals of the RF model as given in Equation (2) and $RF_t$ denotes the RF model of the number of confirmed cases. First, the RF model is fitted to the data and the residuals are obtained, which are then modelled as an AR1 process.
Let the forecast from the AR1 model be denoted as $\hat{r}_t$. The new forecasts can therefore be obtained as,
To evaluate the performance of the models, error statistics such as the correlation between observed and predicted cases, the root mean square error (RMSE) and the normalized root mean square error (NRMSE) are utilized. These test statistics can be obtained as, ----(5) Where $y_t$ is the observed and $\hat{y}_t$ is the predicted data and n is the total number of data points.
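The combination step and the error statistics can be sketched as below. The function names are hypothetical, and the NRMSE is normalized here by the range of the observed values, which is one common convention and an assumption, since the normalization used in this study is not recoverable from the text.

```python
import math


def hybrid_forecast(rf_pred, ar1_resid_pred):
    """Add the AR1 residual forecast to the RF forecast, element by element."""
    return [f + r for f, r in zip(rf_pred, ar1_resid_pred)]


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def nrmse(observed, predicted):
    """RMSE divided by the observed range (assumed normalization)."""
    return rmse(observed, predicted) / (max(observed) - min(observed))


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (sd_o * sd_p)


obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0, 6.0]
print(rmse(obs, pred), nrmse(obs, pred), pearson_r(obs, pred))
```

Because the NRMSE is scale-free, it allows districts with very different case counts to be compared on the same footing.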

Results and Discussion
The descriptive statistics of the number of confirmed cases and the meteorological variables are given in Table 1. The mean daily temperature, wind speed, relative humidity and dew point temperature range from 23.3±6.7 °C to 35.8±12.5 °C, 0.4±0.2 m s⁻¹ to 3.1±1 m s⁻¹, 44.7±24.3% to 72.9±13.5% and 7.08±3.82 °C to 21.69±5.02 °C, respectively. The number of confirmed cases ranges from 0 to 2036 per million, whereas the number of deaths ranges from 0 to 184 per million across all the districts during the study period. Further, the correlation analysis of the meteorological variables, number of deaths and number of confirmed cases is given in Table 2.

Conclusion
Considering the fierceness of COVID-19, the WHO has declared it a pandemic and a global health crisis. In India, many interventions have been applied, and over the last year several studies have been conducted to assess the impact of climatic and weather conditions on its spread.

Fig 1
Fig. 1 shows the districts included in the study. The total number of cases reported
Hence, further modelling of the residuals may explain the existing temporal relationships and may provide more reliable model performance than relying on a single model fitted to the datasets. An autoregressive model of order 1 (AR1) is therefore fitted to the residuals of the RF model. The details of the model are given below. Let $r_t$ represent the residuals obtained by fitting the RF model to the number of confirmed cases for the training set.
$r_t = y_t - \hat{y}_t$ ----(1)
The monthly variations in confirmed cases divided by the corresponding district population are given in Fig. 2, which shows that April and May witnessed an outbreak due to the second wave in all the districts. In Nagpur, Patiala, Thane, Mumbai, Ludhiana and Pune, a high number of confirmed cases was also observed in March. A few spikes are also observed in other months. The correlation between the number of deaths and confirmed cases is statistically significant in all the districts except Mumbai and Patna. The correlation of the number of confirmed cases with temperature is significant at p=0.05 in all the districts except Chandrapur, Thane and Ujjain. The correlation between relative humidity and the number of confirmed cases is mostly negative and significant at p=0.05, except in Kolkata and Nagpur. In Mumbai and Thane, the correlation with relative humidity is positive. The correlation of the number of confirmed cases with wind speed is mostly negative and insignificant. Wind speed has been characterized as one of the factors defining the ventilation coefficient of an area (Goyal and Chalapati Rao, 2007). It has been observed in the past that high wind speed and good ventilation are associated with fewer COVID-19 cases. The relationship in this study is negative but not significant. The relationship between dew point and confirmed cases is inconsistent across districts, with negative correlation in Chandrapur, Dewas, Kanpur and Ujjain and positive correlation in Chennai, Kolkata, Mumbai, Nagpur, Pune and Varanasi.
The scatter plots of confirmed cases against the meteorological parameters show an inverted parabolic relation of confirmed cases with temperature, relative humidity and dew point. The number of cases increases with these factors and, beyond a certain point, starts decreasing. The number of confirmed cases starts decreasing at a temperature of 31±0.5 °C, a relative humidity of 51.7±0.6% and a dew point of 21±0.4 °C. The relationship of the number of confirmed cases with wind speed was exponentially decreasing after an initial increase in the number of confirmed cases.
The random forest model was developed using the 'randomForest' library in R. The data were divided into training and testing sets with a ratio of 85:15 for each district. The number of trees and the number of randomly selected variables were adjusted using the mean square error as the cost function. Bootstrapping with a sample size of 500 was done to arrive at an optimum model. The 'caret' library in R was used to carry out bootstrapping of the training samples. The optimum number of trees and number of variables with minimum mean square error are observed to be 80 and 7, respectively. The RF model helps in ranking the relative importance of the predictors in modelling the response variable. %IncMSE is used to rank the important features for the response variable. The model is run each time when a split on one predictor is conducted. A large change in the MSE is usually observed for an important predictor. The variable importance ranking shown in Table 3 suggests that the major governing factor of confirmed cases is the number of deaths on the previous day. Temperature is the second most important factor influencing the confirmed cases, closely followed by wind speed. Dew point and relative humidity, on the other hand, have relatively less influence on the predictability of confirmed cases. The influence of the lockdown variables is quite low compared to the meteorological variables and the number of the previous day's deaths. Partial lockdown, effective only for a few hours, is seen to be highly effective in confining the cases compared to the complete lockdown and no lockdown. The unlock period is the second most influential factor among the lockdown measures governing the number of confirmed cases. This finding is very useful from the point of view of policymakers and economic growth. Policymakers can opt for partial lockdowns instead of complete or no lockdown to sustain economic activities while confining the spread of the virus. Moreover, it can be seen from the importance matrix that meteorological variables alone cannot be used as predictors to estimate the number of confirmed cases for any district. Instead, the number of the previous day's deaths and lockdown details are also required.
The results of the RF models are then further improved by fitting an AR1 model to the residuals of the model. The AR1 model is observed to have coefficients $\phi_0$ = −0.000127 (p>0.05) and $\phi_1$ = 0.46 (0.44±0.48) with p<0.05. The estimates obtained by the AR1 model, i.e. $\hat{r}_t$, are added to the estimates of the RF model ($\hat{y}_t$) as in Equation (4). The performance of the final hybrid model is given in Table 4. The prediction results for the training and testing sets are shown as box plots in Fig. 3a and 3b, respectively. It can be seen that the hybrid model performs better than the RF model, as the NRMSE is lower for both the training and testing sets. The model can estimate the number of confirmed cases from the exogenous meteorological variables for a particular day along with the previous day's deaths.
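As a worked illustration of the residual correction, the reported AR1 coefficients (−0.000127 and 0.46) can be applied to a single day. The standard AR1 form is assumed, with the symbols $\phi_0$ and $\phi_1$ chosen for the two coefficients; the previous-day residual $r_{t-1} = 0.001$ and the RF prediction $\hat{y}_t = 0.010$ are hypothetical values chosen only to show the arithmetic.

```latex
\begin{align*}
\hat{r}_t &= \phi_0 + \phi_1 r_{t-1} = -0.000127 + 0.46 \times 0.001 = 0.000333 \\
\hat{y}_t + \hat{r}_t &= 0.010 + 0.000333 = 0.010333
\end{align*}
```

With a slope of 0.46, a little under half of the previous day's RF error is carried forward into the corrected forecast, while the intercept contributes almost nothing.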

Fig. 1. Location of districts in map of India.

Fig. 3a. Observed and predicted number of cases using hybrid model in 18 districts for training

Fig. 3b. Observed and predicted number of cases using hybrid model in 18 districts for testing

Table 3. Variable importance matrix

Table 4. Model evaluation for training and testing set

Table 1. Descriptive statistics of number of confirmed cases and meteorological variables

Table 2. Correlation between meteorological variables and number of confirmed cases. Death_1: number of deaths due to COVID-19 on previous day, WS: wind speed, Temp: temperature, RH: relative humidity, Dew: dew point temperature, Lk1: no lock down, Lk2: unlock, Lk3: partial lockdown, Lk4: complete lockdown
