The Spatiotemporal Interaction Effect of COVID-19 Transmission in the United States

Background: Human mobility among geographic units is a possible cause of the widespread transmission of COVID-19 across regions. Under the competing pressures of epidemic control and economic recovery, the states of the United States have adopted different policies for mobility limitations. Assessing the impact of these policies on the spatiotemporal interaction of COVID-19 transmission among counties in each state is critical to formulating epidemic policies.
Methods: The study utilized Moran's I index and K-means clustering to investigate the time-varying spatial autocorrelation effect of 49 states (except the District of Columbia) using daily new cases at the county level from Jan 22, 2020, to August 20, 2020. Based on the dynamic spatial lag model (SLM) and the SIR model with unreported infection rate (SIRu), the integrated SLM-SIRu model was constructed to estimate the inter-county spatiotemporal interaction coefficient of daily new cases in each state, which was further explored by Pearson correlation and stepwise OLS regression with socioeconomic factors.
Results: The K-means clustering divided the time-varying spatial autocorrelation curves of the 49 states into four types: continuous increasing, fluctuating increasing, weak positive, and weak negative. The Pearson correlation analysis showed that the spatiotemporal interaction coefficients in each state estimated by SLM-SIRu were significantly positively correlated with median age, population density, and the proportions of international immigrants and the highly educated population, but negatively correlated with the birth rate. The voting rate for Donald Trump in the 2016 U.S. presidential election showed a weak negative correlation. Further stepwise OLS regression retained only three positively correlated variables: poverty rate, population density, and the highly educated population proportion.
Interpretation: This result suggests that various state policies in the U.S. have imposed different impacts on COVID-19 transmission among counties. All states should provide more protection and support for the low-income population, states with high population density need to strengthen regional mobility restrictions, and the highly educated population should reduce unnecessary regional movement and strengthen self-protection.

distribution [3]. Countries with low incomes, incomplete health care capabilities and demographics with a large proportion of the elderly population have been facing the challenges of more serious disease output and health care burdens [4,5]. However, the United States, as the country with the most developed economy and the highest level of medical care, has the largest number of infections and shows geographic differences in COVID-19 transmission, which has become an important global health research issue for pandemic control.
The spatial heterogeneity in the spread of infectious diseases comes from the social, economic, and environmental differences of the geospatial units themselves [6]. Compared with the potential climate correlations implied by some studies [7,8], more studies indicate that population density [9], health measures, and mobility restrictions [10] have a greater impact on the spread of COVID-19. Mobility and connectivity [11], rather than population density [12], mainly drive the spatial differences in pandemic transmission, which is also supported by related research based on US county daily commute data [13] and mobility data of Boston [14], and is consistent with research on Italy's industrial spatial structure and epidemic distribution [15]. Inter-regional population movement is the main reason for the extensive spread of COVID-19 across regions [15]. Spatial distancing is considered to be the most effective countermeasure [16,17], and it has been verified as effective in both China [18,19] and Europe [20,21]. Because of the trade-off between epidemic control and economic recovery, the states in the United States have adopted different spatial regulatory policies at different stages to control regional mobility. Assessing the spatiotemporal interaction of the inter-county spread of COVID-19 in the states is critical to optimizing epidemic policy.
However, the measurement of spatiotemporal interaction faces two main obstacles: the choice of an appropriate model and data suppression. Spatial interaction can be captured by a variety of methods, such as Geographically Weighted Regression (GWR) [22], Geographically Weighted Principal Component Analysis (GWPCA) [23], and Spatial Panel Models (SPM), which include the Spatial Lag Model (SLM), the Spatial Error Model (SEM), and the Spatial Durbin Model (SDM) [24]. Previous explorations of the spatial interaction effect of COVID-19 transmission have two limitations. On the one hand, current studies mostly use static SPM with cross-sectional data [25], without considering the long-term effect on spatial interaction. On the other hand, related studies often use infection and socioeconomic data as dependent and explanatory variables [26], ignoring the fact that the increase in infections is mainly driven by the values of the infection variables of the Susceptible-Infective-Removal (SIR) model in the previous time step.
Data suppression is another critical problem when applying the spatial correlation model and the SIR model [27].
The officially released COVID-19 infection data may be biased due to potential unreported infections [28][29][30]. Studies have shown that asymptomatic infections and mildly infected people may not be fully reported, resulting in an underestimate in the infection data [31][32][33]. More importantly, county-level data on recovered patients in the United States is not released, which may limit the reliability of the traditional SIR model. A SIR model integrated with unreported infection rate (SIRu) was recently proposed, providing an effective correction for miscalculated data [34]. This paper thus proposes an integrated model of SLM and SIRu (SLM-SIRu) to calculate the inter-county spatial interaction effect of daily new cases in each state based on the county-level US COVID-19 data. As the uneven spatial correlation may derive from the inequity among spatial units in terms of socioeconomic features [35,36], the connection between the spatial effect and socioeconomic elements in each state, such as population density, income, and political elements, has been further explored. The spatiotemporal correlation has also been tested by Moran's I index, which was further used to identify spatial clusters before the application of the SLM-SIRu model.
This study provides an integrated model to capture the spatiotemporal interaction effect, which may help assess the impact of cross-regional mobility on epidemic transmission and assist the government in optimizing COVID-19 control policy.

Methodology
The workflow can be divided into three parts: (1) spatial autocorrelation analysis of daily new infections; (2) spatial effect exploration based on the SIRu model and the Spatial Lag Model; (3) correlation analysis of the spatial effect and socioeconomic factors (Figure 1).

Figure 1 Research Workflow.
The research workflow explains the data, methods, and models used in the article. Moran's I and K-means clustering were introduced to capture the spatiotemporal features of COVID-19 daily new case changes in the US; the spatial lag effect underlying the SIR model was then estimated by the SLM-SIRu model and used for the correlation test with socioeconomic variables.

Global Moran's I index and K-means Clustering
Except under extremely strict spatial restriction policies, any inter-county human movement between neighboring counties in a state may increase the chance of contact and thus affect the number of daily new infections in each geographic unit. Such a pattern of spatial interaction within a state can be defined as spatial autocorrelation, which can be measured by the global Moran's I index based on spatial weights. The value of Moran's I ranges from -1 to 1: -1 represents negative spatial correlation, 0 a random pattern, and 1 positive correlation.
For a certain attribute x of geographic units, the general formula of global Moran's I is
$$I = \frac{n}{W}\,\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \tag{1}$$
wherein $x_i$ is the attribute value of the ith county in a state, $\bar{x}$ is the mean value of all $x_i$ in the state, and $x_j$ is the value of the jth county near the ith county. n is the total number of counties, and W is the sum of the spatial weights $w_{ij}$.
The spatial weights $w_{ij}$ can be derived from a pre-defined neighboring pattern such as Queen, Rook, K-nearest neighbors, or distance. Here the K-nearest-neighbor scheme was adopted with a maximum of 4 neighbors. As the District of Columbia contains only one unit, the calculation cannot be applied to it.
If the total infections or daily new infections of a state are zero, making Equation 1 inapplicable, Moran's I of the corresponding state is set to zero. The same value is assigned when the Moran's I result is not significant (P>0.05).
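As a minimal sketch of the calculation described above, the following code builds binary K-nearest-neighbor weights and evaluates global Moran's I. The coordinates, the function names `knn_weights` and `morans_i`, and the toy values are assumptions for illustration; the significance test behind the P>0.05 rule is not included.

```python
from math import hypot


def knn_weights(coords, k=4):
    """Binary KNN weights: w[i][j] = 1 if unit j is among the k nearest units to i."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: hypot(coords[j][0] - xi, coords[j][1] - yi),
        )
        for j in others[:k]:
            weights[i][j] = 1.0
    return weights


def morans_i(values, weights):
    """Global Moran's I; returns 0.0 when all values are identical (e.g. no cases)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0
    total_w = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / total_w) * (num / denom)


# Hypothetical example: six units on a line, two nearest neighbors each.
coords = [(float(x), 0.0) for x in range(6)]
w = knn_weights(coords, k=2)
clustered = [1, 1, 1, 9, 9, 9]      # similar values sit next to each other
alternating = [1, 9, 1, 9, 1, 9]    # neighbors differ
print(morans_i(clustered, w), morans_i(alternating, w))
```

In this toy layout the clustered pattern yields a positive index and the alternating pattern a negative one, which is the reading of the sign used in the text.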

K-means Clustering Algorithm
At different stages of COVID-19 transmission, the population movement among adjacent spatial units may change with the state's control policies or the epidemic situation, causing Moran's I index to vary over time.
The characteristics of the time series of Moran's I may reflect the spatial homogeneity and heterogeneity among the states in the United States, which can be explored by clustering algorithms.
The K-means clustering algorithm is applied because it guarantees convergence and is relatively simple to implement compared with alternatives such as DBSCAN and GMM. The optimal number of groups is determined by the minimum value of the AIC.
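The clustering step can be sketched as follows, treating each state's Moran's I curve as one vector. This is a plain Lloyd's-algorithm illustration under stated assumptions: the curves are hypothetical, the centroids are seeded deterministically from the first k curves, and the AIC-based choice of the group number is left out because the likelihood used for it is not specified in the text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length curves."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(series, k, iters=100):
    """Lloyd's algorithm; the first k curves seed the centroids (an assumption)."""
    centroids = [list(s) for s in series[:k]]
    labels = [0] * len(series)
    for _ in range(iters):
        # Assignment step: each curve joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(s, centroids[c])) for s in series]
        # Update step: each centroid becomes the mean curve of its members.
        new_centroids = []
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    rss = sum(sq_dist(s, centroids[lab]) for s, lab in zip(series, labels))
    return labels, rss


# Hypothetical Moran's I curves for six states at three time points.
curves = [
    [0.0, 0.1, 0.0],
    [0.5, 0.6, 0.7],
    [0.1, 0.0, 0.1],
    [0.6, 0.7, 0.8],
    [0.0, 0.0, 0.1],
    [0.5, 0.7, 0.8],
]
labels, rss = kmeans(curves, k=2)
print(labels, round(rss, 4))
```

The residual sum of squares returned here is the quantity that a model-selection criterion such as AIC would trade off against the number of groups.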

SIR with unreported infections
In the classic SIR model, the daily new infections ($I_n$) can be expressed in terms of the infected population ($I$), the susceptible population ($S$), the total population ($N$), and the transmission rate ($\beta$):
$$I_n = \beta\,\frac{S\,I}{N} \tag{2}$$
However, the official data cannot be used directly, as there may be unreported infections. Moreover, because the actual number of recovered patients has not been released in the county-level data, the parameter $I$ in Equation 2 cannot be calculated directly. A SIR model integrated with unreported infections (SIRu) is proposed, adding two more parameters, $\varphi$ and $\tau$, wherein $\varphi$ is the average unreported/reported rate of infections (UIR) and $\tau$ is the recovery/death rate (RDR). With the susceptible population taken as $S = N - \varphi I_c$ and the infected population as $I = \varphi I_c - \tau R_d$, Equation 2 can be revised as:
$$\varphi I_n = \beta\,\frac{(N - \varphi I_c)(\varphi I_c - \tau R_d)}{N} \tag{3}$$
wherein $I_n$, $I_c$, and $R_d$ are the officially released COVID-19 data of daily new infections, cumulative infections, and deaths, respectively; therefore, $\varphi I_n$, $\varphi I_c$, and $\tau R_d$ are the corresponding factual data.
Expanding Equation 3 and dividing by $\varphi$ gives the simplified form:
$$I_n = \beta I_c - \frac{\beta\tau}{\varphi} R_d - \beta\varphi\,\frac{I_c^2}{N} + \beta\tau\,\frac{I_c R_d}{N} \tag{4}$$
Such an equation can be seen as a linear regression of $I_n$ on the variables $I_c$, $R_d$, $I_c^2/N$, and $I_c R_d/N$.
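To make the link between the product form and the linear regression form concrete, the sketch below evaluates both for one hypothetical county and recovers the parameters from the regression coefficients. It assumes that the actual susceptible and infected populations are N - φ·I_c and φ·I_c - τ·R_d; the function names and all numeric values are illustrative, not taken from the study.

```python
def siru_new_cases(beta, phi, tau, i_c, r_d, n):
    """Reported daily new cases under the assumed SIRu product form."""
    susceptible = n - phi * i_c        # actual susceptible population
    infected = phi * i_c - tau * r_d   # actual currently infected population
    actual_new = beta * susceptible * infected / n
    return actual_new / phi            # convert actual new cases back to reported


def siru_coefficients(beta, phi, tau):
    """Coefficients on I_c, R_d, I_c^2/N and I_c*R_d/N in the expanded form."""
    return (beta, -beta * tau / phi, -beta * phi, beta * tau)


def siru_linear(coeffs, i_c, r_d, n):
    """Reported daily new cases from the linear (regression) form."""
    a1, a2, a3, a4 = coeffs
    return a1 * i_c + a2 * r_d + a3 * i_c ** 2 / n + a4 * i_c * r_d / n


def recover_parameters(coeffs):
    """Map fitted coefficients back to (beta, phi, tau); a2 is redundant here."""
    a1, _a2, a3, a4 = coeffs
    return a1, -a3 / a1, a4 / a1


# Hypothetical parameter values and county state.
beta, phi, tau = 0.2, 5.0, 20.0
coeffs = siru_coefficients(beta, phi, tau)
product = siru_new_cases(beta, phi, tau, i_c=1000, r_d=50, n=100_000)
linear = siru_linear(coeffs, i_c=1000, r_d=50, n=100_000)
print(product, linear, recover_parameters(coeffs))
```

Because four coefficients are determined by three parameters, a fitted regression over-identifies them; the sketch uses three of the four coefficients for the recovery and ignores that redundancy.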

Dynamic Spatial lag model
Although dynamic spatial panel models include the Spatial Lag Model (SLM), the Spatial Error Model (SEM), and the Spatial Durbin Model (SDM), the SLM is adopted here to capture the spatial effect in the SIRu model.
In the classic SLM (Equation 5), y is the dependent variable of a certain geographic unit, y' represents the values of the adjacent geographic units, and x denotes the corresponding explanatory variables; W is the spatial weight, and λ is the spatial lag coefficient.
The SLM-SIRu model (Equation 6) is obtained by substituting Equation 4 into Equation 5. The COVID-19 data and the corresponding spatial weights at the county level were applied in Equation 6 to calculate the spatial coefficient λ, which was then taken as the dependent variable in the subsequent correlation analysis with socioeconomic data.
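As a hedged sketch only, a generic spatial lag form consistent with the symbols defined above, and the result of inserting the SIRu regressors into it, could be written as follows. The coefficient vector $\mathbf{a}$, its components $a_1,\dots,a_4$, and the error term $\varepsilon$ are notation assumed here rather than taken from the source, and any time-lag structure of the dynamic model is not represented.

```latex
\begin{aligned}
y   &= \lambda\, W y' + x\,\mathbf{a} + \varepsilon, \\
I_n &= \lambda\, W I_n' + a_1 I_c + a_2 R_d + a_3 \frac{I_c^2}{N} + a_4 \frac{I_c R_d}{N} + \varepsilon .
\end{aligned}
```

Under this reading, a positive λ means that a county's daily new cases rise with those of its neighboring counties after the county's own SIRu terms are accounted for.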

Data
The data used in the article contain COVID-19 data, geographic data, and socioeconomic data.
(2) Geographic data of all counties and the states in the United States. The data is available in the Harvard Dataverse (https://dataverse.harvard.edu/dataverse/cdl_dataverse).
(3) Socioeconomic data of all states in the United States in 2019, including factors such as birth rate, death rate, international immigration rate, poverty rate, median age, average education level (high school graduation rate, undergraduate graduation rate, advanced education rate), population density. Such data is available on the US Census website (https://www.census.gov). Trump's vote rate in the 2016 U.S. presidential election in the states was also included in the correlation analysis as a potential political variable of residents.

Time series Moran's I
The time series of Moran's I for each state was calculated from daily new cases and spatial weights at the county level from Jan 22, 2020, to August 20, 2020, and it indicated that most states showed significant changes in Moran's I (Figure 2). The spatial correlation within each state differed substantially over time: for example, New York remained in a restricted state after the first wave, while Georgia and Illinois were still increasing and Florida had just passed its peak. Such a time series of Moran's I may reflect the real effect imposed by spatial restriction policies.
represented the fitted value with a 95% confidence interval. Obvious changes from around the 50th day can be observed, displaying several patterns of temporal change in spatial effects, such as reversed U-shape and N-shape.

K-means Clustering of Time series Moran's I
The K-means clustering algorithm was then applied to the time series of Moran's I. The results showed that the optimal number of groups was 4 (Figure 3a), and the four groups could be roughly described as fluctuating growth, continuous growth, weak positive correlation, and weak negative correlation (Figure 3b). Figure 3c displays the Moran's I values of each group. The overall fluctuation of cluster 1 ranged between -0.1 and 0.4, while the fitted curve and its 95% confidence interval were concentrated between -0.1 and 0.2, indicating weakly positive spatial autocorrelation. Cluster 2, with values ranging from -0.2 to 0.1, showed weakly negative spatial autocorrelation. Both cluster 3 and cluster 4 exhibited increasing trends: the former was resurging after the first wave, while the latter was increasing continuously with relatively lower values of Moran's I. Figure 3d illustrates the clusters on the map, where spatial agglomeration can also be observed.

SIRu integrated with Spatial Lag Model
The combined SLM-SIRu model was tested with the spatial matrix and epidemic data of all the counties from Jan 22, 2020, to August 20, 2020. The results for all the states displayed high significance, verifying the feasibility of the SLM-SIRu model (Table 1). The parameter λ ranged from -0.08 to 0.56, showing a normal distribution (Figure 4). The SLM-SIRu model also showed high R-squared values, most of which were larger than 0.5. In terms of fit, the comparison between SIRu and SLM-SIRu indicated that the spatial lag coefficient in SLM-SIRu improved the fit of the original SIRu model (Figure 5).
Notes: *** Correlation is significant at the .001 level. ** Correlation is significant at the .01 level.
wherein Louisiana was the largest one. In the north, Michigan, Massachusetts, Connecticut, and New Jersey also have relatively high spatial autocorrelation coefficients, indicating that the inter-county human flow in such states remains high.
Figure 6 The choropleth map of spatial correlation coefficient λ.
The choropleth map of spatial correlation coefficient λ used Jenks breakpoints with 4 levels.

Correlation and Regression with Socioeconomic Variables
To explore the correlation between the social, economic, and political factors and the spatial correlation coefficient, the Spears test was applied. The results showed a significant negative relationship with the birth rate and a clear positive correlation with the proportion of international immigrants, median age, high education rate, and population density. In terms of the political factor, the support rate for President Trump in the 2016 U.S. election showed a weak negative correlation (90% CI).
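For illustration, a correlation between λ and one socioeconomic variable can be computed as below. The sketch gives both Pearson's r (the coefficient named in the abstract) and a rank-based Spearman coefficient; the five data points are hypothetical and the function names are illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation, computed as Pearson's r on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical values for five states (illustrative only, not from the study).
lam = [0.05, 0.12, 0.20, 0.31, 0.44]
density = [10.0, 40.0, 35.0, 120.0, 400.0]
print(pearson_r(lam, density), spearman_rho(lam, density))
```

The rank-based version is less sensitive to a few very dense states, which is one reason the two coefficients can differ on skewed variables such as population density.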

Figure 7 The correlation test between λ and socioeconomic variables.
The part below the diagonal shows the scatter plots, and the numbers are the correlation coefficients. The diagrams along the diagonal are histograms.
A further stepwise OLS regression retained five variables, of which only three were significant: population density, poverty rate, and bachelor's degree rate. Among them, population density and poverty rate were the more significant, indicating that the intrastate population flow associated with population density and poverty is still dominant, and that an increase in the proportion of the highly educated population would also increase the risk of inter-regional human flows.

Discussion
The initial objective of the project was to measure the inter-county spatiotemporal interaction effect of COVID-19 transmission in the United States and explore the correlation between the spatiotemporal interaction coefficient and socioeconomic features. The results of this study indicated that the inter-county spatial effects in the states change over time, displaying four types of spatial correlation trends: continuous increasing, fluctuating increasing, weak positive, and weak negative. Such clustering, though not previously reported, could be explained by the heterogeneity in the social and spatial peripheries [37].
The goodness of fit of the SIRu models across all states has an average value above 0.75, indicating that the model can explain the epidemic dynamics of COVID-19 transmission in the United States. The SLM-SIRu model showed a better fit than the SIRu model with statistical significance, further verifying the hypothesis of spatial heterogeneity in epidemic dynamics [38]. The result indicated that the eastern states have a relatively high spatial interaction coefficient, showing a high possibility of inter-county flow.
In terms of socioeconomic features, the spatiotemporal interaction coefficients in each state were found to be positively correlated with the proportion of international immigrants, median age, the proportion of the highly educated population, and population density, but negatively correlated with the birth rate. The correlations with median age and the international immigrant ratio support the view that residents in inequitable living, working, and environmental conditions may face a greater risk of COVID-19 infection [39]. Interestingly, the vote rate for Donald Trump in the 2016 U.S. presidential election showed a weak negative correlation, which implies that political factors may also have some impact on the inter-county flow in the states [40].
The results of the stepwise OLS regression suggested that poverty rate, population density, and the proportion of the highly educated population are the three main positively correlated variables. Population density and poverty rate have been considered highly correlated with the COVID-19 transmission rate in previous studies [41,42], while our result implies that more inter-county flow may also occur in states with higher population density or lower income in the United States. Such a result may support the study in Europe which proposed that states with high population density appear to benefit more from their Shelter-in-Place Orders [43]. The cross-regional movement of highly educated people is also noteworthy: these groups have better medical resources and lower infection rates [44], but their unrestricted movement may bring risks to other vulnerable groups.
The spatiotemporal dynamic model established in this research still has many deficiencies in terms of spatial weight calculation and model optimization. COVID-19 transmission in each state occurs not only between adjacent counties but also between different states, so the spatial weight calculation needs further exploration with travel networks. Moreover, the relevance was explored by spatial effect measurement and OLS regression separately, which could be improved by other robust methods such as machine learning. The SIRu model, adopted to adjust for the impact of data suppression, can also be further optimized in terms of accuracy.

Conclusion
This study proposes that inter-county movement within states has an impact on disease transmission, displaying obvious spatial heterogeneity that is potentially related to the social, economic, and political factors of each state. This result suggests that various state policies in the U.S. impose different impacts on COVID-19 transmission among counties. All states should provide more protection and support for the low-income population, states with high population density need to strengthen regional mobility restrictions, and the highly educated group should reduce unnecessary regional movement and strengthen self-protection.
This study provides an integrated model to capture the spatiotemporal interaction effect, which may help assess the impact of cross-regional mobility on epidemic transmission and assist the government in optimizing COVID-19 control policy.