The Spatial Predicting of COVID-19 Incidence and Its Mortality Based On OLS and GWR Models in Iran

Background: Within six months of the COVID-19 outbreak, 350279 people were infected and 20125 people died of COVID-19 in Iran. There is an urgent need to find the most accurate and effective indicators of this disease's spread in order to control and predict it. Methods: We examined the effect of 36 demographic, economic, environmental, health infrastructure, social, and topographic independent variables on the COVID-19 infection and mortality rates using the ordinary least squares (OLS) model in ArcGIS 10.5. Using an adjusted R-squared threshold of 0.7, we selected 20 variables for the COVID-19 infection rate and 16 variables for the mortality rate. Collinearity among the selected variables was resolved using the variance inflation factor (VIF). Then, we ran the OLS and geographically weighted regression (GWR) models in ArcGIS 10.5. Results: A large number of men, a large population, a lack of specialist doctors, a lack of hospitals, a large urban population, a large number of people aged 65 and over, and a high natural mortality rate had the most prominent impact on the increase in the COVID-19 infection rate. Also, a lack of ICU beds, a low number of insured people, a lack of subspecialist physicians, and a lack of hospital beds had the most prominent impact on the increase in COVID-19 mortality. The variables with a VIF above 7.5 were then removed, and finally a high rate of incoming immigrants and a lack of nurses were identified as the two independent variables for predicting the COVID-19 infection rate. In addition, a high rate of incoming immigrants and a high number of doctor consultations were identified as the two variables for predicting the COVID-19 mortality rate. The Akaike information criterion (AIC) and Adj.R2 showed that both models were appropriate for these analyses. Conclusions: Based on our results, there would be a considerable increase in COVID-19 infection in Kerman, Esfahan, and Kermanshah provinces. 
In addition, there would be a remarkable decrease in COVID-19 infection in Khuzestan, Lorestan, Azarbayjan Shargi, and Tehran provinces. Regarding COVID-19 mortality, there would be a substantial rise in Fars and Khorasan Razavi provinces. Moreover, our analyses predicted


Introduction
The first pneumonia cases of unknown origin were identified in Wuhan, the capital city of Hubei province of China, in early December 2019. The pathogen was subsequently identified as a novel enveloped RNA betacoronavirus and was named Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), which is phylogenetically similar to SARS-CoV (Manoj et al., 2020). It rapidly affected many individuals in nearby areas and then spread across the whole world. Following that, the WHO declared a global pandemic on March 11, 2020 (Park et al., 2020). This pandemic has caused global socioeconomic disruption, e.g., in sporting, educational, religious, political, and cultural activities. It is therefore one of the most important problems in the world (Santos et al., 2019; Fung et al., 2019; Quwaider and Jararweh, 2016).
Besides ongoing studies on the pathogenesis of this disease (Ahmet et al., 2020; Al-Zinati et al., 2020), the accurate prediction of disease outbreaks has remained a challenge for governments and policymakers (Graham et al., 2018; Metcalf and Lessler, 2018). Predicting disease outbreaks has therefore attracted the interest of researchers (Hassani et al., 2019). In this regard, spatial aspects have been shown to play a crucial role in predicting disease outbreaks (Melin et al., 2020). A geographic information system (GIS) is a useful tool in the public health domain, particularly for infectious disease surveillance and modelling strategies (Saran et al., 2020). GIS offers spatial modeling options that account for the influence of various factors (Abousaeidi et al., 2016). Therefore, GIS and spatial big data  Herein, we aimed to investigate the relationship of demographic, economic, environmental, health infrastructure, social, and topographic variables with COVID-19 incidence and mortality in Iran.
Moreover, we intended to apply them to predict the future course of COVID-19 in Iran. To the best of the authors' knowledge, this is the first study evaluating these variables to predict COVID-19 in Iran.

Data collection
In this study, we surveyed the impact of 36 different variables on COVID-19 incidence and mortality in order to predict its spread in Iran. We obtained the numbers of COVID-19 cases and deaths from the Ministry of Health and Medical Education of Iran.
As seen in Fig. 1, the infection and mortality rates and the mortality-to-infection ratio vary across provinces. According to Fig. 2a, Tehran, Fars, Khorasan Razavi, Azarbayjan Shargi, Esfahan, and Khuzestan provinces had the highest infection rates in the country. Tehran, Khuzestan, and Azarbayjan Shargi provinces also had the highest mortality rates (Fig. 2b). The highest mortality-to-infection ratios were found in Zanjan, Hamedan, Ardebil, Golestan, and Kerman provinces (Fig. 2c).
We categorized the independent variables into 6 groups. As shown in Table 1, we collected these data from different sources at the provincial level and joined them to the administrative boundary shapefile using ArcGIS Desktop 10.5.

Methods
Considering our major goals, data analysis comprised two steps. First, we examined the impact of the different variables on both COVID-19 incidence and mortality. Our objective was to figure out whether these independent variables were effective and, if so, how large their impact on the dependent variables was. To achieve this, we used the OLS model and the adjusted R-squared, along with scatter matrix plots, in ArcGIS 10.5. The OLS model is a regression method that investigates the relationships between a set of explanatory (independent) variables and a dependent variable, and it can also be used in spatial analysis (Wheeler and Calder, 2007). We therefore used this model to exclude ineffective variables, i.e., those poorly correlated with the dependent variables. In the second step, we aimed to predict the incidence of COVID-19 infection and its associated mortality in Iran by analyzing the relationship between the independent and dependent variables. First, we resolved the multicollinearity among the independent variables. We used the variance inflation factor (VIF) for this purpose and omitted the variables with a VIF larger than 7.5. Next, the ordinary least squares (OLS) and geographically weighted regression (GWR) models were run on the non-collinear variables in ArcGIS 10.5 for spatial analysis and prediction of COVID-19 incidence and mortality.
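The collinearity step described above can be sketched outside ArcGIS. The following is a minimal illustration in Python with NumPy, not the ArcGIS workflow used in this study: the function names `vif` and `drop_collinear` and the synthetic columns `a`, `b`, `c` are hypothetical, and only the 7.5 threshold is taken from the text. The sketch assumes the variable with the highest VIF is dropped at each pass until every remaining VIF is at or below the threshold.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (shape n x p)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

def drop_collinear(X, names, threshold=7.5):
    """Remove the highest-VIF column one at a time until all VIFs <= threshold."""
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Synthetic (hypothetical) data: column c is almost a copy of column a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)
X = np.column_stack([a, b, c])
X_kept, kept = drop_collinear(X, ["a", "b", "c"])
print(kept, vif(X_kept))
```

In this toy case one of the two near-duplicate columns is removed and the independent column `b` is retained, which mirrors how a long list of correlated provincial indicators can shrink to a few non-collinear predictors.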
In global regression models such as OLS, global parameters are produced to assess spatial relationships. The parameter estimates generated by global regression models are averages over the whole region; they have no local significance and do not reflect the true spatial characteristics of the regression parameters. It is therefore impossible to adequately explain an individual situation, and consequently spatial heterogeneity, using global parameters (Fotheringham et al., 2002), and the findings of classical estimation models are liable to bear a certain degree of bias. The details of the OLS may be found in Anselin and Arribas-  According to Tobler's first law of geography, everything is related to everything else, but near things are more related than distant things (Worboys and Duckham, 2004). Regarding the prevalence of COVID-19, the variables are clearly spatially correlated, so the spatial correlation between the independent and dependent variables should be taken into account. Fotheringham proposed geographically weighted regression (GWR), which considers spatial heterogeneity, geographic coordinates, and a kernel function to carry out local regression estimation on adjacent subsamples of each group (Wu, 2020). The GWR model extends the classical regression framework and effectively addresses spatial heterogeneity by allowing the variable coefficients to change with spatial location (Sun and Xu, 2016). Accordingly, one important aspect of GWR is its focus on the geographical location of the observations: the coefficients are estimated locally and are allowed to vary spatially (Lykostratis and Giannopoulou, 2020
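To make the contrast between global and local estimation concrete, the following minimal sketch fits a separate weighted least squares regression at every location. It is an illustration under stated assumptions rather than the ArcGIS implementation: the fixed Gaussian kernel, the `bandwidth` parameter, the function name `gwr_coefficients`, and the synthetic data are all assumptions introduced here.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Local coefficients (intercept first) at each location.

    For location i, solves (A^T W_i A) beta_i = A^T W_i y, where W_i holds
    Gaussian kernel weights based on the distance to location i.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    betas = np.empty((n, A.shape[1]))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)  # closer points weigh more
        AtW = A.T * w                            # equivalent to A^T W_i
        betas[i] = np.linalg.solve(AtW @ A, AtW @ y)
    return betas

# Synthetic (hypothetical) data: the slope grows from west to east.
rng = np.random.default_rng(1)
coords_demo = rng.uniform(0, 10, size=(60, 2))
x_demo = rng.normal(size=60)
y_demo = 2.0 + (1.0 + 0.2 * coords_demo[:, 0]) * x_demo

beta_local = gwr_coefficients(coords_demo, x_demo[:, None], y_demo, bandwidth=2.0)
beta_wide = gwr_coefficients(coords_demo, x_demo[:, None], y_demo, bandwidth=1e6)
print(beta_local[:3])
print(beta_wide[0])
```

With a very large bandwidth all weights approach 1 and every local fit collapses to the single global OLS solution, whereas a small bandwidth lets the coefficients vary from place to place. In practice the bandwidth is usually chosen by a criterion such as AICc or cross-validation.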

Results And Discussion
After compiling the 36 independent variables, we surveyed their effect on the two dependent variables using the adjusted R-squared (Adj.R2). Adj.R2 indicates how well an independent variable explains a dependent variable. A low Adj.R2 implies that the independent variable cannot explain the dependent variable well, and the opposite is true for a high Adj.R2. The results of this survey are shown in Table 2. Accordingly, the impacts of the independent variables on COVID-19 infection and mortality differ widely. We considered variables with an Adj.R2 higher than 0.7 to be effective. Thus, we found that 20 of the 36 independent variables could explain COVID-19 infection, and 16 could explain COVID-19 mortality. Table 2 shows that a large number of men, a large population, a lack of specialist doctors, a lack of hospitals, a large urban population, a large number of people aged 65 or older, and a high natural mortality rate had the most prominent impact on the increase in the COVID-19 infection rate, in that order. On the other hand, increasing average temperature, increasing unemployment rate, increasing average slope, increasing number of economically active people, increasing average altitude, and increasing average rainfall had the least impact on the increase in the COVID-19 infection rate, in that order. 
However, examining the impact of the independent variables on the COVID-19 mortality rate revealed somewhat different results: a lack of ICU beds, a low number of insured people, a lack of subspecialist physicians, and a lack of hospital beds had the most prominent impact on the increase in COVID-19 mortality, whereas a lack of health houses, increasing intraprovincial travel, increasing average altitude, increasing average slope, increasing average rainfall, and a high percentage of roads had the least impact on the increase in COVID-19 mortality, in that order. In addition to examining the Adj.R2, we drew scatter matrix plots to explore the relationship between the independent and dependent variables. The results are similar to those obtained with the Adj.R2. Figure 4 illustrates some of the matrix plots for effective and ineffective variables.
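The screening rule described above (keep a variable when its Adj.R2 exceeds 0.7) can be written out as a short sketch. It assumes that each independent variable is regressed on the dependent variable one at a time, which is how the text reads but is not stated explicitly; the function names, the variable names `population` and `rainfall`, and the synthetic values are hypothetical and are not the study's data.

```python
import numpy as np

def adjusted_r2(x, y):
    """Adjusted R-squared of a one-predictor OLS fit of y on x."""
    n = len(y)
    r2 = np.corrcoef(x, y)[0, 1] ** 2      # R^2 of simple linear regression
    return 1.0 - (1.0 - r2) * (n - 1) / (n - 2)

def screen(variables, y, threshold=0.7):
    """Names of the variables whose adjusted R-squared exceeds the threshold."""
    return [name for name, x in variables.items()
            if adjusted_r2(x, y) > threshold]

# Synthetic (hypothetical) provincial data, 31 rows.
rng = np.random.default_rng(42)
population = rng.uniform(1, 10, size=31)
rainfall = rng.uniform(0, 1, size=31)
cases = 100 * population + rng.normal(0, 50, size=31)

candidates = {"population": population, "rainfall": rainfall}
selected = screen(candidates, cases)
print(selected)
```

For a single predictor the adjustment is 1 - (1 - R2)(n - 1)/(n - 2), so with few provinces the adjusted value sits slightly below the raw R2.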

Bold cells indicate variables with a significant effect on COVID-19 infection and death.
After investigating the 36 independent variables and identifying the effective ones, we eliminated the non-significant variables. In the next step, we aimed to predict COVID-19 incidence and mortality in Iran. For this purpose, we needed to resolve the multicollinearity among the independent variables, so we used the variance inflation factor (VIF) to examine collinearity among the 20 variables for infection and the 16 variables for mortality due to COVID-19. We eliminated the variables with a high VIF one by one until the VIF values dropped below 7.5. Finally, only two variables for infection and two variables for mortality were retained for the spatial analysis and prediction. Table 3 shows the results of the multicollinearity analysis; all VIF values are lower than 7.5. The infection variables are a high rate of incoming immigrants and a lack of nurses, and the mortality variables are a high number of doctor consultations and a high rate of incoming immigrants. The OLS and GWR models also report several other statistical parameters that merit mentioning. The coefficient is one of the important parameters and represents the strength and type of the relationship between the independent and dependent variables. The higher the coefficient value, the better the model and the actual fitting effect; a negative coefficient indicates a negative relationship and a positive coefficient indicates a positive relationship (Yu et al., 2020). According to Table 3, the coefficients are strong and significant in both the infection and the mortality OLS analyses. The second important parameter is the probability, or p-value, which must be lower than 0.01 (p < 0.01) to be statistically significant (Mollalo et al., 2020); according to Table 3, all of the variables were statistically significant (p < 0.01). 
However, if the Koenker test is statistically significant, the robust probabilities should be used to assess the statistical significance of the explanatory variables. In our OLS results, the Koenker test was statistically significant, and the robust probabilities were also statistically significant (p < 0.01).
Eventually, according to the OLS results shown in Table 3, the statistical parameters were significant. After examining the statistical accuracy of the selected variables, we ran the OLS and GWR models to predict the COVID-19 infection and mortality rates; we then used the adjusted R2 and AICc to compare the performance of the two models. The higher the adjusted R2 and the smaller the AICc, the better the model fit. For more details about the adjusted R2 and AIC, see Zhang et al., 2020 and Yu et al., 2020. According to Table 4, Adj.R2 for infection and death is around 0.9 and is slightly better for GWR than for OLS. The AICc values for death are lower than those for infection, so the death analysis fits better than the infection analysis in both models. The AICc values are nearly the same for OLS and GWR: for death, GWR is slightly better, and for infection the opposite holds. If the difference between AICc values is less than about 3, the performance of the models is considered the same (Fotheringham et al., 2002). After examining the models' suitability for predicting COVID-19, we applied the GWR and OLS models to the spatial data. As seen in Fig. 5, the dark red areas are where the actual values are higher than the model predicted. Conversely, the light blue to dark blue areas are where the actual values are lower than the model predicted. Figure 5  change in the other provinces, and they will continue the previous process. These results are common to both models. But according to the GWR result, infection in Tehran and Ilam will decrease, whereas in the OLS result their infection trend will not change. As seen in Fig. 6 and Table 5, Tehran, Esfahan, Khorasan Razavi, and Fars will have the largest infection contribution, and Khorasan Shomali, Ilam, Golestan, and ChaharmahaloBakhtiari will have the lowest infection contribution.
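The model comparison rule used above (a smaller AICc is better, and a difference below about 3 means the models perform the same) can be sketched as follows. The AICc expression is the common least-squares form, n ln(RSS/n) + 2k plus the small-sample correction 2k(k + 1)/(n - k - 1); this is an assumption, since ArcGIS computes AICc for GWR from an effective number of parameters and conventions differ on whether the error variance is counted in k. The function names and the example numbers are hypothetical and are not the values in Table 4.

```python
import math

def aicc(rss, n, k):
    """Corrected AIC for a least-squares fit.

    rss: residual sum of squares, n: number of observations,
    k: number of estimated parameters (counting convention is an assumption).
    """
    aic = n * math.log(rss / n) + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def compare(aicc_a, aicc_b, tol=3.0):
    """Return 'equivalent' when the AICc values differ by less than tol,
    otherwise the label ('a' or 'b') of the model with the smaller AICc."""
    diff = aicc_a - aicc_b
    if abs(diff) < tol:
        return "equivalent"
    return "a" if diff < 0 else "b"

# Hypothetical values for illustration only.
print(compare(100.0, 101.5))
print(compare(100.0, 110.0))
```

The first call falls inside the tolerance and reports the two models as equivalent, while the second favours model `a`, the one with the smaller AICc.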
The COVID-19 mortality prediction results are shown in Fig. 7 and Table 5. According to these, mortality will increase in Fars, Khorasan Razavi, Alborz, and Esfahan, and it will decrease in Tehran, Zanjan, Lorestan, Khuzestan, Hormozgan, Golestan, Gilan, Bushehr, and Ardebil. However, the intensity of the increase or decrease varies across provinces. In some provinces, the two models give different results. The main difference is in Azarbayjan Gharbi, where OLS predicts a decrease and GWR predicts an increase. According to Fig. 8 and Table 5, Tehran, Esfahan, Fars, Khuzestan, Khorasan Shomali, Alborz, and Azarbayjan Shargi will have the largest mortality contribution, and Ilam, ChaharmahaloBakhtiari, Kohgiluye va Boyerahmad, and Semnan will have the lowest COVID-19 mortality contribution. The signs for the type of change mean: +2: will increase a lot, +1: will increase, 0: no change, -1: will decrease, and -2: will decrease a lot.

Conclusion
In line with this article's main purposes, we investigated the impact of 36 independent variables on COVID-19 incidence and mortality, the two dependent variables, in Iran. We categorized these independent variables into 6 groups to determine which kinds of variables have the greatest effect on the dependent variables. Then we analyzed COVID-19 spatially to predict its future incidence and mortality in Iran. For this purpose, we used the OLS and GWR models.
The results indicated that 20 different variables had a high correlation with the COVID-19 incidence in Iran. Five of these variables were demographic, 10 were health infrastructure, and 5 were social variables.
Thus, economic, environmental, and topographic variables such as average temperature, average altitude, average annual household income, unemployment rate, etc., do not affect the dependent variable, or their effects are very small. Examining the impact of the independent variables on COVID-19 mortality gave results similar to those for incidence, with one interesting difference: the most influential variables for COVID-19 incidence are demographic indicators, whereas the most influential variables for COVID-19 mortality are health infrastructure indicators such as a lack of ICU beds, a lack of subspecialist physicians, a lack of hospital beds, etc. We can therefore conclude that health infrastructure plays a very important role in COVID-19 incidence and mortality.
On the other hand, we analyzed COVID-19 spatially to predict its future infection and mortality rates across the different provinces. We used the OLS and GWR models for prediction. Generally, the results indicated that COVID-19 infection would increase considerably in Kerman, Esfahan, and Kermanshah, and would decrease significantly in Khuzestan, Lorestan, Azarbayjan Shargi, and Tehran. In addition, the mortality due to COVID-19 is expected to increase considerably in Fars and