5.1 K-Means Clustering
Table 1 includes the information of 20 clusters of 329 cities. There are 11 clusters with less than 5 members.

| Cluster | Average TC_Pop | Number of Cities | Cluster | Average TC_Pop | Number of Cities |
|---|---|---|---|---|---|
| 0 | 0.138998 | 57 | 10 | 0.077705 | 9 |
| 1 | 44.898323 | 1 | 11 | 2.073256 | 3 |
| 2 | 4.348363 | 2 | 12 | 7.148953 | 1 |
| 3 | 0.036970 | 52 | 13 | 3.106608 | 3 |
| 4 | 0.030999 | 49 | 14 | 1.720157 | 1 |
| 5 | 0.067883 | 55 | 15 | 5.884736 | 1 |
| 6 | 0.051303 | 35 | 16 | 0.378943 | 7 |
| 7 | 13.154666 | 1 | 17 | 0.912154 | 2 |
| 8 | 0.080822 | 34 | 18 | 0.650542 | 1 |
| 9 | 0.073778 | 14 | 19 | 0.710686 | 1 |

Table 1: Information of 20 Clusters and 329 Cities
The bubble chart in Fig. 1 shows the classification result. The diameter of each circle is proportional to the number of cities in the group. The number inside each circle is the spread rate (TC_Pop) for the cities in that group.
There are 6 clusters that have fewer than 5 cities and a spread rate higher than 2.8. We discuss the demographic and geographic characteristics of these clusters and explain why these cities have high spread rates. The map of Hubei Province in Midland China (Fig. 2), where most COVID-19 cases were confirmed, illustrates the geographic distribution of cities in the province [18].
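The selection rule stated above (fewer than 5 cities and an average TC_Pop above 2.8) can be checked directly against the values in Table 1. The following is a minimal sketch for illustration only; the dictionary `clusters` and the variable names are ours, with the numbers transcribed from Table 1.

```python
# Table 1 transcribed as {cluster id: (average TC_Pop, number of cities)}.
clusters = {
    0: (0.138998, 57), 1: (44.898323, 1), 2: (4.348363, 2), 3: (0.036970, 52),
    4: (0.030999, 49), 5: (0.067883, 55), 6: (0.051303, 35), 7: (13.154666, 1),
    8: (0.080822, 34), 9: (0.073778, 14), 10: (0.077705, 9), 11: (2.073256, 3),
    12: (7.148953, 1), 13: (3.106608, 3), 14: (1.720157, 1), 15: (5.884736, 1),
    16: (0.378943, 7), 17: (0.912154, 2), 18: (0.650542, 1), 19: (0.710686, 1),
}

# Total number of cities across all clusters.
total_cities = sum(n for _, n in clusters.values())

# Clusters with fewer than 5 member cities.
small = sorted(c for c, (_, n) in clusters.items() if n < 5)

# Small clusters whose average spread rate exceeds 2.8.
small_high = sorted(c for c, (rate, n) in clusters.items() if n < 5 and rate > 2.8)

print(total_cities)  # 329
print(len(small))    # 11
print(small_high)    # [1, 2, 7, 12, 13, 15]
```

The six cluster ids returned are the ones discussed individually in the remainder of this subsection.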
Figure 1: Bubble Chart of Small Clusters (Less Than 5 Members)

| City | Province | Location | Population (1,000) | Total Cases (August 2020) | TC_Pop | Tourist | Cluster |
|---|---|---|---|---|---|---|---|
| Wuhan | Hubei | Midland | 11,212 | 50,340 | 44.89832 | 1 | 1 |
| Huanggang | Hubei | Midland | 6,333 | 2,907 | 4.590242 | 0 | 2 |
| Huangshi | Hubei | Midland | 2,471.7 | 1,015 | 4.106485 | 0 | 2 |
| Ezhou | Hubei | Midland | 1,059.7 | 1,394 | 13.15467 | 0 | 7 |
| Xiaogan | Hubei | Midland | 4,921 | 3,518 | 7.148953 | 0 | 12 |
| Jingzhou | Hubei | Midland | 5,570.1 | 1,580 | 2.836574 | 0 | 13 |
| Jingmen | Hubei | Midland | 2,897.5 | 928 | 3.202761 | 0 | 13 |
| Xianning | Hubei | Midland | 2,548.4 | 836 | 3.28049 | 0 | 13 |
| Suizhou | Hubei | Midland | 2,221 | 1,307 | 5.884737 | 0 | 15 |

Table 2: Clusters that have fewer than 5 cities with TC_Pop > 2.8
Cluster 1: Wuhan
Wuhan is the only city in Cluster 1. Wuhan is the first city in China where COVID-19 was identified. [19] It has the largest TC_Pop, around 45 cases per ten thousand population. With a population of 11 million and a high population density, Wuhan is exposed to the risk of a rapid increase in COVID-19 cases over a short period. As a transit center in Midland China, Wuhan has a well-connected transportation network, and people from nearby cities pour into Wuhan for employment and education opportunities. Another reason for the large number of cases is that Wuhan is a tourist city: travelers increase the regional population density, making it easier for COVID-19 to spread.
Cluster 2: Huangshi and Huanggang
These two cities are in the same province as Wuhan (Hubei Province), and both are in Midland China. Although the population of Huanggang is about 2.5 times that of Huangshi, by 2 August 2020 both cities had similar spread rates (TC_Pop) of 4.1–4.6. Neither is a tourist city, so the floating population of the two cities tends to be small. Looking at their geographic location within Hubei Province, we find that Huangshi and Huanggang lie on opposite sides of the Yangtze River, the longest river in China. Cluster 2 is characterized by similar spread rates, tourism development, and geographic location.
Cluster 7: Ezhou
What distinguishes Ezhou from other clusters is its mid-sized population combined with a high spread rate. This may be because Ezhou is geographically close to Wuhan, the capital city of Hubei Province, so many people work in Wuhan during the day and return home to Ezhou. As a result, the city has a very large number of confirmed COVID-19 cases relative to its population.
Cluster 12: Xiaogan
Xiaogan, a city in Hubei Province, lies north of the Yangtze River, across the river from Wuhan. Xiaogan is a mid-sized city with a population of around 5 million, but by 2 August 2020 it had recorded 3,518 cases, giving it a large TC_Pop of about 7.1. Like Ezhou in Cluster 7, Xiaogan is alone in Cluster 12 because of the change in its infected population over a short time span and its relatively large spread rate. Like Ezhou, Xiaogan is geographically close to Wuhan, and this closeness may explain why some cities near Wuhan fall into clusters with high spread rates.
Cluster 13: Jingmen, Jingzhou, Xianning
All three cities are in Hubei Province in Midland China, and none of them is a tourist city. Jingmen and Xianning have similar populations, while Jingzhou has about twice the population. The number of confirmed cases in Jingzhou is also nearly twice that of each of the other two cities, so all three cities have a spread rate of around 3.
Cluster 15: Suizhou
Suizhou is 200 kilometers away from Wuhan. [20] Compared with Xiaogan, Huanggang and Huangshi, Suizhou is far from Wuhan. According to the 2019 statistical yearbook published by Hubei Province, Suizhou has a population of 2.22 million, ranking 12th in the province; Suizhou's GDP is 10.11 million yuan, ranking 11th in the province. [21] On these figures Suizhou looks unremarkable. Its epidemic data tell a different story. Even in the early stage of prevention and control, Suizhou had a large number of infections, and its infection rate (TC_Pop) was the fourth highest in Hubei Province. Cities in China with a large number of confirmed cases typically share three characteristics: 1. they are close to Wuhan; 2. they have a large population; 3. they have frequent contact with Wuhan. Suizhou has none of these three characteristics but still has a large number of confirmed cases. This underscores the complexities of managing a virus such as COVID-19, and the need for statistical modeling to identify drivers of the spread of the virus.
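To make the grouping discussed above more concrete, the following toy sketch runs a one-dimensional K-Means on the TC_Pop values of the nine cities in Table 2 with k = 6. It is an illustration under strong assumptions and not the procedure behind Table 1: the actual clustering covers 329 cities and 20 clusters, and the features it uses are not listed in this section. The farthest-point seeding in `farthest_point_init` is our own choice, made only so that the example is deterministic.

```python
def farthest_point_init(values, k):
    """Deterministic seeding (an assumption of this sketch): start at the first
    value, then repeatedly add the value farthest from the chosen centroids."""
    centroids = [values[0]]
    while len(centroids) < k:
        centroids.append(max(values, key=lambda v: min(abs(v - c) for c in centroids)))
    return centroids

def kmeans_1d(values, k, max_iter=100):
    """Plain Lloyd iterations on scalar data."""
    centroids = farthest_point_init(values, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids

# TC_Pop values from Table 2, in table order.
names = ["Wuhan", "Huanggang", "Huangshi", "Ezhou", "Xiaogan",
         "Jingzhou", "Jingmen", "Xianning", "Suizhou"]
tc_pop_values = [44.89832, 4.590242, 4.106485, 13.15467, 7.148953,
                 2.836574, 3.202761, 3.28049, 5.884737]

k = 6
labels, centroids = kmeans_1d(tc_pop_values, k)
groups = {frozenset(n for n, lab in zip(names, labels) if lab == j) for j in range(k)}
for group in sorted(groups, key=lambda g: sorted(g)):
    print(sorted(group))
```

In this toy setting the six groups coincide with the cluster memberships in Table 2 (for instance, Jingzhou, Jingmen and Xianning end up together), which suggests that the spread rate alone already separates these nine cities.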
5.2 Linear Regression
The data include 329 cities in 8 geographical regions. After examining the distribution of cities across regions, we chose the Southwest region as the base category. One justification is that it contains a fairly large share of the cities, at 16%. It also has one of the lowest average case counts, at 27.87 cases per city. Our linear regression model expresses the number of cases as a function of Population (in tens of thousands), a binary variable indicating whether the city has strong tourism, and dummy variables for the remaining 7 regions.
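The role of the base category can be sketched with a small design matrix in which the Southwest dummy is omitted, so that every regional coefficient is read as a difference from a Southwest city with the same population and tourism status. The data and coefficient values below are hypothetical and chosen only for the demonstration (they are not the estimates in Table 3), the helper `design_matrix` is our own, and NumPy is assumed to be available.

```python
import numpy as np

def design_matrix(population, tourist, region, base):
    """Columns: intercept, population, tourist dummy, one dummy per non-base region."""
    cols = [np.ones(len(population)),
            np.asarray(population, dtype=float),
            np.asarray(tourist, dtype=float)]
    names = ["Intercept", "Population", "Tourist"]
    for level in sorted(set(region)):
        if level == base:
            continue  # the base category is absorbed into the intercept
        cols.append(np.array([1.0 if r == level else 0.0 for r in region]))
        names.append(f"Region[{level}]")
    return np.column_stack(cols), names

# Hypothetical toy cities (population in tens of thousands).
population = [100, 250, 400, 150, 300, 500, 200, 350]
tourist = [0, 1, 0, 0, 1, 0, 1, 0]
region = ["Southwest", "Southwest", "Southwest", "Midland",
          "Midland", "East", "East", "Midland"]

X, names = design_matrix(population, tourist, region, base="Southwest")

# Hypothetical coefficients, in the column order given by `names`.
true_beta = np.array([5.0, 1.0, 20.0, 8.0, 30.0])
cases = X @ true_beta  # noise-free response, so OLS recovers true_beta

beta, *_ = np.linalg.lstsq(X, cases, rcond=None)
for name, b in zip(names, beta):
    print(f"{name:16s} {b:8.3f}")
```

Here the coefficient on `Region[Midland]` is the expected gap in cases between a Midland city and an otherwise identical Southwest city, which is how the regional effects are interpreted in the discussion of Table 3.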
We present two regression models. Model (a) includes all 329 cities. In Model (b) we use an iterative approach to remove outliers. Since COVID-19 cases are highly concentrated in a few cities, these cities can be viewed as outliers. If the purpose of the analysis is to evaluate the average effects of different factors on the number of cases, outliers should be removed. As seen in Table 3, the differences between the models are substantial.
The iterative method for removing outliers fits a sequence of regression models. In each step we identify the cities with Studentized Residuals more extreme than \(\pm\)3.5 and deem these observations outliers. (This is justified under the assumption of a Normal distribution for the errors.) The outliers are removed and the regression is fitted again. In the first step, only one observation is removed (Wuhan). In the second step, 5 more observations are removed. In all, 43 cities were removed in 14 iterations.
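The iterative procedure can be sketched as follows. This is a minimal illustration under stated assumptions and not the exact implementation used for Model (b): the text does not say whether internally or externally studentized residuals were used, so the sketch assumes externally studentized (leave-one-out) residuals; the function names and the synthetic data are ours, and NumPy is assumed to be available.

```python
import numpy as np

def studentized_residuals(X, y):
    """Externally studentized residuals of an OLS fit of y on X."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages: diagonal of the hat matrix H = X (X'X)^-1 X'.
    leverage = np.einsum("ij,jk,ik->i", X, np.linalg.pinv(X.T @ X), X)
    s2 = resid @ resid / (n - p)
    internal = resid / np.sqrt(s2 * (1.0 - leverage))
    # Convert internally studentized residuals to leave-one-out form.
    return internal * np.sqrt((n - p - 1) / (n - p - internal**2))

def remove_outliers(X, y, threshold=3.5, max_iter=100):
    """Refit repeatedly, dropping observations with |residual| > threshold."""
    keep = np.arange(len(y))
    iterations = 0
    for _ in range(max_iter):
        t = studentized_residuals(X[keep], y[keep])
        flagged = np.abs(t) > threshold
        if not flagged.any():
            break  # no outliers left under the current fit
        keep = keep[~flagged]
        iterations += 1
    return keep, iterations

# Synthetic data: a straight line with small deterministic noise and one
# grossly inflated observation at index 5.
x = np.arange(40, dtype=float)
y = 1.0 + 2.0 * x + np.sin(3.0 * x)
y[5] += 100.0
X = np.column_stack([np.ones_like(x), x])

keep, iterations = remove_outliers(X, y)
print(len(keep), iterations)
```

Because each refit changes the residual scale, observations that were masked by a large outlier in one step can be flagged in a later step, which is consistent with Wuhan alone being removed first and further cities only afterwards.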
The two models present very different pictures of the number of COVID-19 cases in China. Both models conclude that Midland China experiences more COVID-19 cases than the Southwest region. Both also recognize that tourism is an important factor in the spread of the virus. Beyond that, the models have little in common: even the magnitudes of the coefficients differ substantially between them.
Based on Table 3, Model (a) implies that the average number of COVID-19 cases in a city in the Midland region is 1,000 more than in the Southwest region. Here the null hypothesis is that the number of cases in the Midland region is identical to the number of cases in the Southwest. We reject the null hypothesis, although only at a p-value a bit below 0.05. The conclusion is that there is a statistically significant difference.
In contrast, Model (b) finds that the difference between the Midland region and the Southwest is only 23.8 cases in an average city, but it is significant even at a very low p-value. The difference between the models is stark. The reason is that Model (b) does not include the cities with the highest numbers of COVID-19 cases, including Wuhan. Removing these outliers may give a better identification of the average effects of the different factors.
Table 3: Regression Results. Model (b) does not include outliers.
* p < 0.05, ** p < 0.01, *** p < 0.001 (Standard Errors in parentheses)
Looking at regional differences, Model (b) also identifies two more regions with statistically significantly more cases than the Southwest. These are the East, with almost 8 more cases in an average city, and the Northeast, with almost 9 more cases. It may be interesting to note that neither model identifies significant differences for the other regions in China.
The other factor that is significant in both models, tourism, also highlights the differences between them. Model (a) suggests that a tourist city has 2,527 more cases than one without tourism. In Model (b) the effect is substantially smaller, at only 18.5 more cases. Again, removing Wuhan and a few large outliers is at the heart of these differences.
The one other finding from Model (b) is that population is a statistically significant driver of the number of cases. Larger cities do have more cases of COVID-19, at approximately 1 more case per 10,000 population, and the result is significant even at a very low p-value. This result is not observed in Model (a), because Wuhan and other moderately sized cities account for the vast majority of COVID-19 cases in China. For example, Wuhan has over 50,000 cases, while Beijing and Shanghai, much larger cities, have fewer than 2,500 cases each.
From a statistical perspective, Model (b) fits much better, which should not be a surprise since it has fewer outliers. For Model (a) the R² is 0.072, while the RMSE is 2,722.8. In Model (b) these are 0.584 and 15.92, respectively. Overall, both models are statistically significant.
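For reference, the two fit statistics quoted above can be computed as in the sketch below. It uses generic definitions: the text does not state whether the reported RMSE divides the residual sum of squares by the number of observations or by the residual degrees of freedom, so the hypothetical `n_params` argument covers both conventions. The small input vectors are made up purely to exercise the functions.

```python
import math

def r_squared(y, y_hat):
    """Share of the variance of y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def rmse(y, y_hat, n_params=0):
    """Root mean squared error; set n_params to the number of estimated
    coefficients to divide by the residual degrees of freedom instead of n."""
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return math.sqrt(ss_res / (len(y) - n_params))

# Made-up observed and fitted values.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]

print(r_squared(observed, fitted))         # 0.8
print(rmse(observed, fitted))              # 0.5
print(rmse(observed, fitted, n_params=2))  # about 0.7071
```

Note that R² is unit-free while RMSE is in units of cases, so the drop in RMSE from Model (a) to Model (b) partly reflects the much smaller case counts that remain once the largest outbreaks are removed.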