4.1. Clustering Analysis
To determine the COVID-19 risk level of each county, K-means clustering was performed on the COVID-19 positive rates and death rates, grouping the counties by similarity in their risk profiles. The Elbow method was used to determine the optimal number of clusters (Figure 5). The Elbow method is a visual method that selects the number of clusters by examining the total within-cluster sum of squared Euclidean distances (the cost) as a function of k, the number of clusters. The optimal k is the value beyond which adding another cluster (k+1) no longer decreases the cost significantly. In the Elbow plot, this value is located at the elbow of the curve [37,38].
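The Elbow computation described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the data are synthetic (three hypothetical, well-separated groups of two-dimensional points), the helper names `sq_dist`, `farthest_point_init`, and `kmeans_wcss` are illustrative, and a deterministic farthest-point initialization is assumed in place of whatever initialization the original analysis used.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_wcss(points, k, n_iter=100):
    """Run Lloyd's k-means and return the total within-cluster sum of squares."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Synthetic stand-in for (positive rate, death rate) pairs; NOT the study data.
rng = random.Random(0)
blob_centers = [(1.0, 0.1), (3.0, 0.6), (5.5, 1.8)]
points = [
    (rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
    for cx, cy in blob_centers
    for _ in range(30)
]

# Cost for k = 1..6; the elbow is where the decrease levels off.
wcss = {k: kmeans_wcss(points, k) for k in range(1, 7)}
for k in sorted(wcss):
    print(k, round(wcss[k], 2))
```

On data with three clear groups, the cost falls sharply up to k=3 and only marginally afterwards, which is the pattern read off the Elbow plot. With real data, the two rates would typically be put on a comparable scale before clustering; whether that was done here is not stated.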
Figure 5 suggests that k=3 is the optimal number of clusters. Figure 6 shows the K-means result, which groups the counties into three clusters: low positive rate and low death rate, medium positive rate and medium death rate, and high positive rate and high death rate. The point located far from the others (in the top right quadrant of Figure 6) is New York County, which had a positive rate of 15.5% and a death rate of 1.5%.
Figure 7 and Table 3 show descriptive statistics for each cluster. Cluster 1 is the low-risk cluster: it contains the largest number of counties and has the lowest mean positive rate and death rate. Cluster 3, in contrast, contains the fewest counties but has the highest mean positive rate and death rate, and therefore represents the high-risk counties. Cluster 2 is the medium-risk cluster.
In the next section, classification analysis is applied to the clustering results to find the significant parameters that affect the risk level of each county. The three COVID-19 risk levels (low, medium, and high) were used as labels for the classification analysis.
TABLE 3 Cluster attributes
Cluster number | Number of Counties | Mean Positive Rate | Mean Death Rate
1 | 1873 | 1.231e-2 | 1.838e-4
2 | 1024 | 2.966e-2 | 6.639e-4
3 | 234 | 5.257e-2 | 18.293e-4
4.2. Model Selection
As can be seen in Table 3, the classes are imbalanced in size. We addressed this imbalance using the Synthetic Minority Over-sampling Technique (SMOTE). Then, to determine the significant factors, different classification models were employed to characterize the COVID-19 risk clusters. Based on the accuracy values attained, the best model was used to select the factors with the highest significance for the transmission and mortality rates of COVID-19.
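The core idea of SMOTE can be sketched as below: each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified illustration under stated assumptions. The function name `smote`, the neighbour count `k`, the toy minority points, and the choice to oversample every class up to the size of the largest class are all assumptions, not details reported in this study. Only the class counts are taken from Table 3.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (excluding x itself).
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sq_dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Cluster sizes from Table 3.
counts = {1: 1873, 2: 1024, 3: 234}
# Assumption: every class is oversampled up to the size of the largest class.
target = max(counts.values())
needed = {cluster: target - n for cluster, n in counts.items()}
print(needed)

# Toy minority class (hypothetical feature vectors, not study data).
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote(toy_minority, n_new=10, k=2)
print(synthetic[:3])
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.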
For classification, the data were divided into training and test sets (80% and 20%, respectively). Table 4 lists the classification models used in this study and their test accuracies. The linear models (MLR, LDA, and SVM Linear) performed similarly to one another, while Random Forest achieved the highest test accuracy. Therefore, the Random Forest model was selected for feature selection.
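An 80/20 split and the accuracy metric reported in Table 4 can be sketched as follows. The split is stratified by class here, which is an assumption: the text only states the 80%/20% proportions. The labels and predictions are toy values, and the names `stratified_split` and `accuracy` are illustrative.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_indices, test_indices), holding out test_frac of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(test_frac * len(idx))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels (not the study data): 50 low, 30 medium, 20 high.
labels = ["low"] * 50 + ["medium"] * 30 + ["high"] * 20
train_idx, test_idx = stratified_split(labels, test_frac=0.2)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))

# Toy predictions: three of four are correct.
toy_accuracy = accuracy(["low", "high", "medium", "low"],
                        ["low", "high", "low", "low"])
print(toy_accuracy)
```

Stratifying keeps the class proportions of the test set equal to those of the full data, so that the test accuracy is not dominated by whichever classes happen to be drawn.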
TABLE 4 Test accuracies of the classification models used
Method | Test Accuracy
Multinomial Logistic Regression | 72.92%
LDA | 70.55%
QDA | 64.1%
KNN | 80.81%
SVM Linear | 73.18%
SVM Radial | 85.54%
SVM Polynomial | 84.66%
Random Forest | 85.63%
4.3. Feature Selection
To identify the parameters which affect a county’s COVID-19 risk level, the Random Forest model was used. Figure 8 shows the variable importance scores of Mean Decrease Accuracy (MDA, a measure of model accuracy loss by excluding each variable) and Mean Decrease in Gini (MDG, a measure of each variable’s contribution to the homogeneity of the nodes and leaves in the resulting Random Forest) [39].
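The two importance measures can be illustrated on a toy example. The sketch below assumes the usual definitions: the Gini decrease of a split is the parent node's Gini impurity minus the size-weighted impurity of its children, and the accuracy loss for a variable is estimated by randomly permuting that variable's values and re-scoring the model. A full Random Forest averages these quantities over many trees, using out-of-bag samples for the accuracy measure. The function names, the toy rows, and the toy `predict` rule are illustrative and not taken from this study.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

def accuracy_drop(predict, rows, labels, col, seed=0):
    """Accuracy before minus accuracy after permuting the values in column col."""
    rng = random.Random(seed)
    base = sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
    perm_acc = sum(predict(r) == y for r, y in zip(permuted, labels)) / len(rows)
    return base - perm_acc

# Gini decrease: a perfectly separating split versus an uninformative one.
parent = ["low"] * 4 + ["high"] * 4
perfect = gini_decrease(parent, ["low"] * 4, ["high"] * 4)
useless = gini_decrease(parent, ["low", "low", "high", "high"],
                        ["low", "low", "high", "high"])
print(perfect, useless)

# Accuracy drop: a toy rule that uses only the first column.
def predict(row):
    return "high" if row[0] > 0.5 else "low"

rows = [(i / 10, (i * 7) % 10 / 10) for i in range(10)]
labels = [predict(r) for r in rows]
drop_signal = accuracy_drop(predict, rows, labels, col=0)
drop_noise = accuracy_drop(predict, rows, labels, col=1)
print(drop_signal, drop_noise)
```

A variable the model never uses produces no accuracy loss when permuted, whereas permuting a variable the model relies on can only leave the accuracy unchanged or lower it in this toy setting.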
Based on both MDA and MDG criteria, mean temperature, percent of people below poverty (people with income lower than the threshold determined by the United States Census Bureau [40]), air pressure, longitude, percentage of uninsured people and population density are the highest contributing factors to the level of COVID-19 transmission and mortality rate.
To verify the consistency of the important factors and to better interpret the contribution of each parameter to the COVID-19 risk level, we used the Multinomial Logistic Regression (MLR) model. A backward selection approach was applied based on the p-values of the coefficients in the MLR model, at a significance level of α=0.05 (95% confidence level). The factors selected by this criterion were mean temperature, percent of people below poverty, air pressure, longitude, percentage of uninsured people, population density, percent of adults with obesity (BMI > 30 [41]), and wind speed. The features selected by the MLR model are almost the same as those selected by the Random Forest model. It should be noted that other models with acceptable accuracy values (e.g., LDA, KNN, and SVM) suggested almost the same significant factors as MLR.
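The backward selection loop can be sketched generically as follows. The callable `fit_pvalues` is hypothetical: it stands in for refitting the MLR model on the current feature set and returning one p-value per feature. How the two per-cluster p-values of a multinomial model were combined into a single criterion is not stated in the text, so a single p-value per feature is an assumption here. The stub p-values and the features `hypothetical_a` and `hypothetical_b` are invented for illustration only.

```python
def backward_select(features, fit_pvalues, alpha=0.05):
    """Repeatedly drop the least significant feature until every
    remaining feature has a p-value at or below alpha."""
    selected = list(features)
    while selected:
        pvals = fit_pvalues(selected)  # refit on the current feature set
        worst = max(selected, key=lambda f: pvals[f])
        if pvals[worst] <= alpha:
            break
        selected.remove(worst)
    return selected

# Stub standing in for a real model fit. A real fit would return
# different p-values each time a feature is removed.
STUB_PVALUES = {
    "mean_temperature": 0.001,
    "wind_speed": 0.001,
    "hypothetical_a": 0.40,
    "hypothetical_b": 0.20,
}

def fit_pvalues(features):
    return {f: STUB_PVALUES[f] for f in features}

selected = backward_select(list(STUB_PVALUES), fit_pvalues, alpha=0.05)
print(selected)
```

Removing one feature at a time matters because p-values change after each refit; a feature that looks insignificant in the full model can become significant once a correlated feature is dropped.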
4.4. Significant Variables Analysis
In this step, the MLR model was applied to the significant variables only. Table 5 shows the odds ratios of the significant variables derived from the MLR model. The Cluster 2 column gives the odds ratio of each variable for cluster 2 relative to cluster 1, and the Cluster 3 column gives the odds ratio for cluster 3 relative to cluster 1.
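For reference, the standard baseline-category form of the MLR model, with cluster 1 as the reference class, relates the coefficients to the odds ratios as follows. The symbols $\beta_{kj}$ (coefficient of variable $x_j$ for cluster $k$) and $\mathrm{OR}_{kj}$ are introduced here for illustration and do not appear in the original text.

```latex
\begin{align}
\ln\frac{P(Y = k \mid \mathbf{x})}{P(Y = 1 \mid \mathbf{x})}
  &= \beta_{k0} + \sum_{j} \beta_{kj}\, x_j, \qquad k \in \{2, 3\}, \\
\mathrm{OR}_{kj} &= e^{\beta_{kj}}.
\end{align}
```

Under this form, an odds ratio above 1 means that a one-unit increase in $x_j$ raises the odds of cluster $k$ relative to cluster 1, and an odds ratio below 1 means that it lowers them.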
TABLE 5 Odds ratio of significant variables
Variable | Odds Ratio (Cluster 2 vs. 1) | P-value | Odds Ratio (Cluster 3 vs. 1) | P-value
Intercept | 0.545 | < 0.001 | 0.642 | < 0.001
Mean Temperature | 1.892 | < 0.001 | 1.952 | < 0.001
Percent of People Uninsured | 1.333 | < 0.001 | 1.660 | < 0.001
Percent of People Below Poverty | 1.205 | 0.003 | 2.410 | < 0.001
Air Pressure | 0.751 | < 0.001 | 0.468 | < 0.001
Percent of Adults with Obesity | 1.273 | < 0.001 | 1.349 | < 0.001
Population Density | 1.705 | < 0.001 | 1.957 | < 0.001
Longitude | 1.337 | < 0.001 | 1.156 | 0.014
Wind Speed | 1.275 | < 0.001 | 1.551 | < 0.001
Table 5 demonstrates that increasing the mean temperature, percent of people below poverty, percent of adults with obesity, longitude, wind speed, population density, and percent of people uninsured would increase a county's chance of being in a cluster with a higher level of COVID-19 risk (odds ratios above 1). Increasing the air pressure, on the other hand, would decrease that chance (odds ratios below 1). Among these factors, mean temperature and population density have the highest impact on the risk level.
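The odds ratios in Table 5 can be read as percentage changes in the odds per unit increase of each variable, as the short sketch below illustrates. The odds ratios are copied from Table 5; the helper `pct_change_in_odds` and the ranking by absolute log odds ratio are illustrative choices. Comparing magnitudes across variables is meaningful only if the predictors are on a common scale (for example, standardized), which is assumed here and is not stated in the text.

```python
import math

# Odds ratios from Table 5: (cluster 2 vs. 1, cluster 3 vs. 1).
odds_ratios = {
    "Mean Temperature": (1.892, 1.952),
    "Percent of People Uninsured": (1.333, 1.660),
    "Percent of People Below Poverty": (1.205, 2.410),
    "Air Pressure": (0.751, 0.468),
    "Percent of Adults with Obesity": (1.273, 1.349),
    "Population Density": (1.705, 1.957),
    "Longitude": (1.337, 1.156),
    "Wind Speed": (1.275, 1.551),
}

def pct_change_in_odds(odds_ratio):
    """Percentage change in the odds for a one-unit increase in the variable."""
    return (odds_ratio - 1.0) * 100.0

# Variable with the largest effect in each column, measured by |ln(OR)| so
# that ratios above and below 1 are compared on the same footing.
strongest = {}
for col, cluster in enumerate((2, 3)):
    strongest[cluster] = max(
        odds_ratios, key=lambda name: abs(math.log(odds_ratios[name][col]))
    )

for name, (or2, or3) in odds_ratios.items():
    print(f"{name}: {pct_change_in_odds(or2):+.1f}% (cluster 2), "
          f"{pct_change_in_odds(or3):+.1f}% (cluster 3)")
print(strongest)
```

Read this way, an odds ratio of 0.468 for air pressure corresponds to a decrease of about 53% in the odds of belonging to cluster 3 rather than cluster 1, while an odds ratio of 2.410 for the percent of people below poverty corresponds to an increase of about 141%.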
Previous studies concluded that there is a positive relationship between temperature and COVID-19 cases [21,22,42]. The results of this study are in line with that conclusion: higher average temperature was associated with membership in clusters 2 and 3, which had higher COVID-19 positive and mortality rates.
The results revealed that the percentage of people below poverty in a county was positively associated with belonging to a cluster with a higher level of COVID-19 risk, as shown in Table 5. Low-income people might have limited access to health products such as masks and sanitizers, which affects virus transmission and mortality [43,44]. They are also less likely to work from home because of unstable jobs and income, and less likely to have reliable and valid information about COVID-19 [45,46]. Compared to the other factors, the percentage of people below poverty has the most significant effect on a county’s chance of belonging to the high-risk cluster.
Additionally, low-income people are less likely to have health insurance, which is reflected in a higher percentage of uninsured people. Because of high medical expenses, they may avoid visiting clinics or hospitals or using medications, which might increase the COVID-19 death rate and, as a result, strengthen a county's association with a higher-risk cluster.
As the Centers for Disease Control and Prevention (CDC) [47] stated, obesity increases the risk of COVID-19 death. Obesity also affects the immune system adversely. These statements are in line with the findings of this study: as the obesity percentage in a county increases, the chance of being in the high-risk cluster increases as well.
The results demonstrate that air pressure was a significant factor that lowers the chance of a county belonging to a higher-risk cluster. Air pressure influences precipitation, wind, and weather conditions, and high air pressure is associated with mild wind and calm weather [48]. Coccia [25] found that high pressure decreased the transmission of COVID-19, and research conducted by Takagi et al. [49] also demonstrated that high air pressure would reduce COVID-19 prevalence. Consistent with these studies, our study shows that higher air pressure lowers the probability of a county belonging to a higher-risk cluster.
Table 5 indicates that densely populated counties have a higher chance of being in higher-risk clusters. In dense areas, people cannot keep physical distance from others, and physical distancing is one of the most important measures for preventing the transmission of COVID-19 [50,51].
Studies have shown that COVID-19 transmission depends on seasonal dynamics. Longitude is a factor that correlates with seasonal dynamics and affects COVID-19 transmission [52,53]. Our findings show that increasing longitude increases the probability that a county belongs to a higher-risk cluster.