The relative frequency and geographical distribution of infections follow two basic patterns (Fig. 1): infections were highest in prefectures that were geographically closest to Wuhan (e.g., in Anhui, Chongqing, Henan, Hunan, Jiangxi, Sichuan) or that had the strongest social and economic ties to it (e.g., in Beijing, Fujian, Guangdong, Jiangsu, Shandong, Shanghai, Zhejiang) despite being geographically farther away.
Population outflow from Wuhan, the outbreak epicenter, may be hypothesized to export the virus to other locations, where it causes local first-wave infections (either directly imported cases or local transmissions [15–17, 23–25]). Indeed, we find a strong correlation between population outflow from Wuhan and the number of infections in each prefecture (Fig. 2a, b; see also Supplementary Fig. 1). Consistent with our hypothesis, the cumulative number of infections is highly correlated with cumulative population outflow from Wuhan, and the correlation increases over time from r = 0.328 on January 24 to r = 0.861 on February 12 (p < .001 for both) (Fig. 2a, b, c). Since there is relatively little travel throughout the country during this period (when people are traditionally at home), and because of the quarantine, the population outflow variable is comparable to a lagged variable in a time series. The correlation exhibited the same robust pattern even when using different time windows of population outflow from Wuhan, ranging from 2 to 15 days (Supplementary Fig. 2).
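The date-by-date comparison described above can be illustrated with a minimal sketch of the Pearson correlation between cumulative outflow and cumulative confirmed cases across prefectures. The numbers below are hypothetical toy values, not data from this study, and the function name `pearson_r` is chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: cumulative outflow from the epicenter to five
# prefectures, and cumulative confirmed cases on an early and a late date.
outflow = [1200, 5400, 800, 15000, 3100]
cases_by_date = {
    "early": [3, 1, 5, 6, 4],
    "late": [12, 60, 9, 160, 30],
}

# One correlation per date, computed across prefectures.
r_by_date = {date: pearson_r(outflow, cases) for date, cases in cases_by_date.items()}
```

In this toy example the later date yields the higher correlation, mirroring the qualitative pattern reported above; the magnitudes carry no empirical meaning.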
We next compared the predictive strength of population outflow against that of several other factors over the same time period. First, we used the relative frequency of Baidu searches for the top virus-related terms in each prefecture (e.g., Wuhan pneumonia, novel coronavirus, flu, SARS, atypical pneumonia, surgical mask) [27]. Separately, we also evaluated each prefecture’s GDP and population as possible predictors (Fig. 2c). Each of these factors became less predictive of local outbreak size over time, whether for cumulative or daily reported cases (Fig. 2c, Supplementary Fig. 5). The pattern for search is the most striking, since search is the most predictive factor on January 31, with r = 0.735. The relative frequency of search engine terms is usually interpreted as a measure of interest or concern, and it has received prior attention as a potential proxy for the incidence of seasonal influenza, which it tracks reasonably well in a 2-week time frame [27–29]. However, since search tracks both cumulative and daily infection counts poorly (Supplementary Figs. 3–5), a Google Flu Trends-style ‘COVID–19 trends’ Baidu index is likely to have low predictive strength [28]. The initially high and then declining predictive strength of search may reflect the fact that initially high volumes of information search about the virus signaled stronger risk perception in any given prefecture (e.g., because of early reported cases, having more relatives in Wuhan, etc.), but that, over time, information saturation reduced the impetus for search (Supplementary Fig. 3).
A naive model of mobility might assume that people from Wuhan are more likely to go to the largest prefectures, which tend to be the most developed in China, and create more infection cases among a larger susceptible population. However, we find a relative decline over time in the predictive strength of a prefecture’s population size and GDP (Fig. 2c). Taken together with the high correlation of population outflow from Wuhan with infection numbers in destination locations, this pattern suggests that population outflow from Wuhan went to a more diverse set of prefectures and was determined by a more nuanced set of factors than prefecture population size alone, which is consistent with the scale of and rationale for migration during the Chinese New Year holiday. Indeed, we observe population outflow from Wuhan to almost every prefecture in China. This pattern of results is also consistent with the specifically social reasons for the choice of destination by many migrants and travelers, as shown in prior work, whether this involves short-distance rail travel or outright emigration [30–32]. For instance, previous research shows similar patterns of post-disaster migration in other nations and has found that the presence of social connections is a key determinant of destination selection [10].
We next use two sets of models, one cross-sectional and the other dynamic, to statistically model and benchmark the extent to which population outflow patterns from Wuhan predict the spread and distribution of COVID–19 infections across China.
We first modeled daily infection data using an exponential model (model 1 in the Supplementary Information) that also includes each prefecture’s population size and GDP as control variables (this model is a multiplicative form of the Poisson regression model; see Supplementary Information). We applied a supervised machine learning approach with confirmed cases as the dependent variable to estimate the parameters of a model with Wuhan population outflow as the sole variable (R² = 0.285 on January 24 to 0.833 on February 12) and a model with population size and GDP as covariates (R² = 0.710 on January 24 to 0.937 on February 12) (Supplementary Tables 1–2). Although these additional variables improve fit, the parameter for population outflow from Wuhan becomes increasingly dominant, while a prefecture’s GDP and population generally become less predictive over time. Overall, the models’ performance continuously improved as more infection cases were confirmed, suggesting that the spreading pattern of the virus gradually converges to the distribution of the population outflow from Wuhan to other prefectures in China.
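To make the multiplicative (log-link) structure concrete, the following is a minimal sketch of a single-covariate Poisson regression, E[y] = exp(b0 + b1·x), fitted by damped Newton steps, together with an R² computation. It is a stand-in under stated assumptions and not the estimation procedure of model (1), which is specified in the Supplementary Information: the single covariate (a hypothetical log-scaled outflow), the function names, and the synthetic data are all illustrative, and further covariates such as GDP and population would enter the exponent in the same way.

```python
import math

def fit_poisson_loglink(x, y, max_iter=50, tol=1e-10):
    """Fit E[y_i] = exp(b0 + b1 * x_i) by damped Newton steps on the
    Poisson log-likelihood. Requires mean(y) > 0 and non-constant x."""
    def loglik(b0, b1):
        # Poisson log-likelihood up to a constant independent of (b0, b1).
        return sum(yi * (b0 + b1 * xi) - math.exp(b0 + b1 * xi)
                   for xi, yi in zip(x, y))

    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(max_iter):
        lam = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - li for yi, li in zip(y, lam))
        g1 = sum(xi * (yi - li) for xi, yi, li in zip(x, y, lam))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(lam)
        h01 = sum(xi * li for xi, li in zip(x, lam))
        h11 = sum(xi * xi * li for xi, li in zip(x, lam))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        # Halve the step while it would lower the log-likelihood.
        step, current = 1.0, loglik(b0, b1)
        while loglik(b0 + step * d0, b1 + step * d1) < current and step > 1e-8:
            step /= 2
        b0, b1 = b0 + step * d0, b1 + step * d1
        if abs(step * d0) + abs(step * d1) < tol:
            break
    return b0, b1

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Synthetic illustration: cases generated exactly from known parameters.
log_outflow = [0.0, 1.0, 2.0, 3.0, 4.0]
cases = [math.exp(0.5 + 0.8 * v) for v in log_outflow]
b0, b1 = fit_poisson_loglink(log_outflow, cases)
fitted = [math.exp(b0 + b1 * v) for v in log_outflow]
fit_quality = r_squared(cases, fitted)
```

Because the synthetic cases follow the model exactly, the fit should recover the generating parameters; with real case counts the estimates and R² would of course depend on the data and on the covariates included.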
The logic behind this convergence over time, as well as the model’s predictive strength, is that population outflow from Wuhan to other prefectures in China fundamentally determines the eventual distribution of total infections around the country. During the earliest phase of the viral outbreak, before the quarantine of Wuhan, there was a relative lack of awareness of the virus, and there were few countermeasures at the collective or individual level to prevent its spread. COVID–19 should thus have spread randomly across the entire prefecture of Wuhan; that is, our results imply that the number of infected people was uniformly distributed (statistically speaking) in the population outflowing from Wuhan into different prefectures across the country.
Using the daily predicted cases from model (1) (details in the Supplementary Information), we can also calculate a daily risk score for prefectures based on the difference between their predicted and confirmed cases on any given date. Higher-than-expected levels of infection suggest greater viral transmissibility or more second-wave transmissions (i.e., spread from infected individuals not from Wuhan). Thus, we regard prefectures with more confirmed than predicted cases as having higher second-wave transmission risks (‘underperforming’ compared to the benchmark derived from the outflow population from Wuhan). On the other hand, ‘over-performing’ prefectures with fewer cases than expected are also noteworthy, since they could either have implemented highly successful public health measures or be at higher risk of inaccurate data reporting. Supplementary Figure S6 identifies prefectures with second-wave transmission risk index values over the upper bound of the 90% confidence interval on January 29, for example. Our model identified Wenzhou as having the most severe second-wave transmission risk that day, and the government announced a full quarantine of the prefecture on February 2. The predictive strength of population flow from Wuhan and the overall fit of model (1) over time can also act as an early warning index of an epidemiological phase transition; they reflect the degree to which first-wave infections are dominant at any point in time. If model strength declines significantly at any location, this may indicate that second-wave (propagating) infections are overtaking imported cases.
We next developed a dynamic model to explore changes in the distribution and growth of COVID–19 across all prefectures over time (rather than on individual dates) (Supplementary Information 2.1). We do so by using a Cox proportional hazards framework and replacing the constant scaling parameter of model (1) with a time-varying hazard rate function, chosen to be a logistic or Gompertz function, either of which has the S-shaped property that epidemic events typically follow. Using the hazard model, we are able to incorporate all infection cases across all dates to statistically derive the COVID–19 epidemic curve and growth pattern across China.
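For readers unfamiliar with these S-shaped functions, the sketch below gives the standard logistic and Gompertz curves and shows one way a time-varying scale could multiply a fixed prefecture-level term. The parameterization, the parameter names, and the helper `predicted_cases` are assumptions made for illustration; the functional form actually used in the dynamic model is given in Supplementary Information 2.1.

```python
import math

def logistic_curve(t, rate, midpoint):
    """Standard logistic S-curve rising from 0 to 1, equal to 0.5 at t = midpoint."""
    return 1.0 / (1.0 + math.exp(-rate * (t - midpoint)))

def gompertz_curve(t, displacement, rate):
    """Standard Gompertz S-curve rising toward 1; asymmetric around its inflection."""
    return math.exp(-displacement * math.exp(-rate * t))

def predicted_cases(t, prefecture_term, ceiling, curve):
    """Hypothetical combination: a time-varying scale, ceiling * curve(t),
    multiplying a fixed multiplicative prefecture term (e.g. exp(beta * x))."""
    return ceiling * curve(t) * prefecture_term

# Illustrative parameters only (not estimates from this study).
days = list(range(0, 31))
logistic_path = [logistic_curve(t, rate=0.3, midpoint=10) for t in days]
gompertz_path = [gompertz_curve(t, displacement=5.0, rate=0.2) for t in days]
example = predicted_cases(10, prefecture_term=2.0, ceiling=100.0,
                          curve=lambda t: logistic_curve(t, 0.3, 10))
```

Both curves start near zero, accelerate, and then saturate, which is the qualitative shape that cumulative case counts tend to follow during an outbreak.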
We used the same machine learning method as before to estimate the parameters (see Supplementary Information). When using only the single variable of population outflow from Wuhan to other prefectures, we observe R² = 0.815 (inclusion of population and GDP increases R² to 0.933; Supplementary Table 3). The surfaces in Figure 3 illustrate the basic features of our models together with the data on confirmed cases for two separate clusters: prefectures in Hubei (excluding Wuhan) and prefectures outside Hubei province.
As before, we contrast expected and observed outcomes to gauge epidemiological risk. Here, model predictions serve as reference patterns across time (as opposed to the reference points that a cross-sectional model generates) (Supplementary Figure 7). These curves serve as benchmark trends, and differences between the growth trends of predicted and confirmed cases can signal higher levels of COVID–19 second-wave transmission. We use the integral of the differences over time to create a total transmission risk index (normalized by subtracting the mean and dividing by the standard deviation), and identify a list of prefectures above and below the 90% confidence interval of the index (Supplementary Figure 8, Supplementary Table 4). Our model identifies a list of statistically significant “underperformers”; in most of these cases, we observed the subsequent imposition of quarantine, reflecting local leaders’ assessments that the epidemic indeed warranted heavier measures (see the Supplementary Information for these prefectures and our analysis, along with Supplementary Table S4 and Supplementary Figures 7 and 9). On the other hand, prefectures with lower trends than expected by our model might have had more successful public health measures in controlling the spread of the virus.
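The construction of the index can be sketched as follows: integrate the daily difference between confirmed and predicted cases, standardize across prefectures, and flag values outside a cutoff. This is a minimal illustration under assumptions, not the exact procedure of the Supplementary Information: the trapezoidal rule with unit daily spacing, the normal-approximation cutoff of 1.645 for the 90% interval, the prefecture labels "A" to "H", and all numbers are hypothetical.

```python
import statistics

def integrated_gap(confirmed, predicted):
    """Trapezoidal integral (unit daily spacing) of confirmed minus predicted cases."""
    diffs = [c - p for c, p in zip(confirmed, predicted)]
    return sum((a + b) / 2 for a, b in zip(diffs, diffs[1:]))

def risk_index(confirmed_by_pref, predicted_by_pref, z_cut=1.645):
    """Standardize integrated gaps across prefectures and flag the extremes.
    z_cut = 1.645 is an assumed normal-approximation bound for a 90% interval."""
    gaps = {name: integrated_gap(confirmed_by_pref[name], predicted_by_pref[name])
            for name in confirmed_by_pref}
    values = list(gaps.values())
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    index = {name: (gap - mean) / sd for name, gap in gaps.items()}
    above = [name for name, z in index.items() if z > z_cut]   # more cases than expected
    below = [name for name, z in index.items() if z < -z_cut]  # fewer cases than expected
    return index, above, below

# Hypothetical daily cumulative counts for eight prefectures over four days.
predicted = {
    "A": [1, 2, 3, 4], "B": [1, 2, 4, 5], "C": [1, 2, 4, 5], "D": [1, 2, 3, 4],
    "E": [1, 2, 3, 4], "F": [1, 2, 3, 4], "G": [1, 2, 3, 4], "H": [1, 2, 3, 4],
}
confirmed = {
    "A": [1, 2, 3, 4], "B": [1, 2, 3, 4], "C": [2, 3, 4, 5], "D": [1, 2, 3, 5],
    "E": [1, 2, 3, 3], "F": [2, 8, 20, 40], "G": [1, 2, 3, 4], "H": [1, 2, 3, 4],
}
index, above, below = risk_index(confirmed, predicted)
```

In this toy data only prefecture "F", whose confirmed cases grow far faster than its benchmark, lies above the cutoff; no prefecture falls below it.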
To provide convergent evidence regarding our model assumptions, we also used different types of model functions and obtained the same results (see the Supplementary Information). In addition, we used the entropy of the infection rate normalized by population outflow from Wuhan to other prefectures to test how stable population outflow is in predicting the distribution structure of the virus (Supplementary Fig. 10). Entropy increased in the first week of the study period and remained flat thereafter, which suggests that the rates of confirmed infection cases based on population outflow from Wuhan remained uniform across Chinese prefectures. One possible reason for our model’s robustness beyond January 24 (in the still-early stages of this epidemic) may relate to the fact that, as recent research has shown, most early transmissions have occurred in family clusters [23], which would explain why infection growth remains proportional to population outflow from Wuhan (with average household size as a possible scaling factor) (see Supplementary Information).
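One way to read the entropy argument is sketched below, assuming a Shannon entropy over each prefecture's share of the outflow-normalized infection rate; the exact definition used for Supplementary Fig. 10 may differ, and the function name and numbers are illustrative. Under this reading, entropy reaches its maximum, ln(n) for n prefectures, exactly when cases per unit of outflow are equal everywhere.

```python
import math

def outflow_normalized_entropy(cases, outflow):
    """Shannon entropy (natural log) of the shares of cases-per-outflow rates.
    Assumes every outflow value is positive and at least one case exists."""
    rates = [c / o for c, o in zip(cases, outflow)]
    total = sum(rates)
    shares = [r / total for r in rates]
    # Terms with zero share contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in shares if p > 0)

# Hypothetical toy values for three prefectures.
outflow = [100, 200, 300]
uniform_cases = [10, 20, 30]   # identical rate (0.1) in every prefecture
skewed_cases = [30, 1, 1]      # cases concentrated in one prefecture

h_uniform = outflow_normalized_entropy(uniform_cases, outflow)
h_skewed = outflow_normalized_entropy(skewed_cases, outflow)
h_max = math.log(len(outflow))
```

Here the uniform case attains the maximum entropy and the skewed case falls below it, which is why entropy that rises and then stays flat is consistent with infection rates remaining proportional to outflow.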