The study included 2,509 counties, representing a total of 604,810 deaths from PCVM. The CART analysis identified seven phenotypes (A-G) using the training dataset (n = 2,008), with phenotypes A and G having the lowest and the highest median PCVM (34.2 and 96.6), respectively (Fig. 1). The algorithm selected five variables from all candidate predictors to serve as the six splitting nodes in the outcome tree, with income < 200% of federal poverty level (age 18–64) at the top of the tree, followed sequentially by physical inactivity, median household income, food insecurity, physical inactivity, and excessive drinking; physical inactivity appears twice because it splits both branches of the tree at different thresholds. All splits were statistically significant (p < 0.001). The characteristics associated with the phenotypes are summarized in Table 2.
Table 2
Characteristics of phenotypes and prevalent regions. Note: Under prevalent regions, a state listed under multiple phenotypes refers to different counties within that state, not overlapping areas.
Phenotype (median PCVM) | Characteristics associated with PCVM | Prevalent regions |
A (34.2) | Income < 200% of Federal Poverty Level (age 18–64) (≤ 33.7%), Physical Inactivity (≤ 21.4%) | Northeast and Mid-Atlantic coastal areas; Midwest (Wisconsin, Minnesota, Illinois); West (Colorado, Utah) and Western coastal areas (California, Washington, Oregon); other large metropolitan areas (Atlanta-GA, Austin and San Antonio-TX, etc.) |
B (44.2) | Income < 200% of Federal Poverty Level (age 18–64) (≤ 33.7%), Physical Inactivity (> 21.4%), Food Insecurity (≤ 11.2%) | Midwest (Minnesota, Wisconsin, Iowa, Illinois, Indiana, North Dakota); Northeast (New York, Pennsylvania, New Jersey); Mid-Atlantic (Virginia); West (Wyoming) |
C (53.1) | Income < 200% of Federal Poverty Level (age 18–64) (≤ 33.7%), Physical Inactivity (> 21.4%), Food Insecurity (> 11.2%), Excessive Drinking (> 18%) | Northeast (New York, Pennsylvania); Midwest (Ohio); South (Texas) |
D (57.8) | Income < 200% of Federal Poverty Level (age 18–64) (> 33.7%), Median Household Income (>$39,898), Physical Inactivity (≤ 26.2%) | West (California, Oregon, Washington, Idaho, Arizona, New Mexico, Colorado); South (Texas, North Carolina); Midwest (Michigan) |
E (60.2) | Income < 200% of Federal Poverty Level (age 18–64) (≤ 33.7%), Physical Inactivity (> 21.4%), Food Insecurity (> 11.2%), Excessive Drinking (≤ 18%) | South (Oklahoma, Texas, North Carolina, South Carolina, Florida, Georgia, Alabama, Tennessee, Kentucky); Midwest (Indiana, Ohio, Kansas) |
F (76.6) | Income < 200% of Federal Poverty Level (age 18–64) (> 33.7%), Median Household Income (>$39,898), Physical Inactivity (> 26.2%) | All states in the American South; Midwest (Ohio, Michigan, Indiana, Missouri); West (California, Nevada); Northeast (Maine) |
G (96.6) | Income < 200% of Federal Poverty Level (age 18–64) (> 33.7%), Median Household Income (≤$39,898) | All states in the American South, especially in the Black Belt and the Appalachian region; West (New Mexico, Arizona); Midwest (Michigan) |
On the right side of the tree (Fig. 1), phenotype G had the highest median PCVM (96.6) among all phenotypes, consisting of counties with more people (age 18–64) with income < 200% of federal poverty level (> 33.7%) and a lower median household income (≤ $39,898) compared to the other phenotypes. Counties of both phenotypes D and F had a lower median PCVM than phenotype G counties (57.8 and 76.6 vs 96.6). Phenotype F counties were differentiated from those of phenotype D by having a 33% higher percentage of people who were physically inactive, with the split at 26.2%.
On the left side of the tree (Fig. 1), all counties had fewer people (age 18–64) with income < 200% of federal poverty level (≤ 33.7%) and generally had lower rates of PCVM (except for phenotype E counties). Phenotype A, with a lower physical inactivity rate (≤ 21.4%), had the lowest median PCVM (34.2), about a third of the median PCVM for phenotype G (96.6). With more people who were physically inactive, phenotypes B, C, and E had a higher median PCVM than phenotype A. Food insecurity further distinguished phenotype B from phenotypes C and E: phenotype B had fewer people who lacked adequate access to food (≤ 11.2%) and about 9 to 16 fewer deaths from CVD per 100,000 people than phenotypes C and E. Excessive drinking further separated phenotypes C and E: phenotype C had more adults reporting binge or heavy drinking (> 18%) and a slightly lower median PCVM than phenotype E (53.1 vs 60.2).
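The splitting rules described above and in Table 2 can be written as a short decision function. The following is a minimal sketch, not the authors' code: the function name `assign_phenotype` and its parameter names are hypothetical, and only the thresholds and median PCVM values are taken from Table 2.

```python
# Median PCVM per phenotype, as reported in Table 2.
MEDIAN_PCVM = {"A": 34.2, "B": 44.2, "C": 53.1, "D": 57.8,
               "E": 60.2, "F": 76.6, "G": 96.6}


def assign_phenotype(low_income_pct, inactivity_pct, median_income_usd,
                     food_insecurity_pct, excessive_drinking_pct):
    """Return the phenotype letter (A-G) for one county.

    Parameter names are hypothetical; thresholds follow Table 2.
    Percent arguments are on a 0-100 scale.
    """
    if low_income_pct <= 33.7:            # left side of the tree
        if inactivity_pct <= 21.4:
            return "A"
        if food_insecurity_pct <= 11.2:
            return "B"
        # Higher inactivity and higher food insecurity: split on drinking.
        return "C" if excessive_drinking_pct > 18 else "E"
    # Right side of the tree: more people below 200% of the poverty level.
    if median_income_usd <= 39898:
        return "G"
    return "D" if inactivity_pct <= 26.2 else "F"


# Hypothetical county values, chosen only to exercise two branches.
example = assign_phenotype(30.0, 25.0, 52000, 13.0, 15.0)
print(example, MEDIAN_PCVM[example])
```

Each path through the function passes through at most four of the six splitting nodes, which is why some variables do not appear in the characteristics of every phenotype.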
Applying the CART model to the test dataset showed no substantial differences in the PCVM distributions versus the training dataset (Supplementary Figure S1). The analysis of PCVM subtypes revealed that the phenotypes were consistent across PCVM subtypes (Supplementary Figure S2).
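One simple way to look at agreement between the training and test datasets is to compare the median PCVM per phenotype in each. The sketch below illustrates that comparison only; the helper `median_by_phenotype` is hypothetical, the county values are made-up toy numbers rather than study data, and the source does not state which comparison the authors actually performed.

```python
from collections import defaultdict
from statistics import median


def median_by_phenotype(records):
    """Group (phenotype, pcvm) pairs and return the median PCVM per phenotype."""
    groups = defaultdict(list)
    for phenotype, pcvm in records:
        groups[phenotype].append(pcvm)
    return {p: median(values) for p, values in sorted(groups.items())}


# Toy (phenotype, PCVM) pairs; hypothetical values, not study data.
train = [("A", 30.1), ("A", 34.2), ("A", 39.0),
         ("G", 90.5), ("G", 96.6), ("G", 101.3)]
test = [("A", 31.0), ("A", 35.4), ("G", 94.0), ("G", 99.8)]

train_medians = median_by_phenotype(train)
test_medians = median_by_phenotype(test)

# Difference in medians (test minus training) for phenotypes present in both.
differences = {p: test_medians[p] - train_medians[p]
               for p in train_medians if p in test_medians}
print(differences)
```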
The sensitivity analysis of the CART model with a minimum of 100 counties in a terminal node produced more splitting nodes and more phenotypes in the model output (Supplementary Figure S3), suggesting that additional risk factors were significantly associated with county-level PCVM in different subgroups of the population. These additional splitting variables included broadband access, uninsured (age 18–64), smoking, and receipt of SNAP benefits. Notably, phenotype O, the highest-risk group, had a median PCVM almost 4-fold that of phenotype A, the lowest-risk group (113.7 vs 30.0). Supplementary Figure S4 illustrates the CART model applied to the test dataset, which revealed no significant differences compared to the model derived from the training dataset. We also evaluated model performance by comparing the predicted PCVM with the observed PCVM in the test dataset, using the CART analysis in which no limit was placed on the minimum number of counties in a terminal node. The scatterplot of predicted versus observed PCVM is presented in Supplementary Figure S5, with a correlation coefficient of 0.718 (R² of 0.52), suggesting that county-level risk factors explained 52% of the inter-county variation in PCVM.
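The reported R² follows from squaring the correlation between predicted and observed PCVM. The sketch below checks that arithmetic and shows how such a correlation could be computed; `pearson_r` is a hypothetical helper, and the observed values in the example are made-up toy numbers, not study data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Arithmetic check on the reported values: 0.718 squared is 0.515524,
# which rounds to the reported R² of 0.52.
reported_r = 0.718
reported_r_squared = reported_r ** 2

# Toy example: predictions are the phenotype medians from Table 2;
# the observed values are hypothetical.
predicted = [34.2, 44.2, 53.1, 57.8, 60.2, 76.6, 96.6]
observed = [40.0, 38.5, 61.0, 50.2, 72.3, 70.1, 88.0]
toy_r = pearson_r(predicted, observed)
print(round(reported_r_squared, 2), toy_r ** 2)
```

Here R² is taken to be the squared correlation between predicted and observed values, as the text does; other definitions of R² can give a different number for the same predictions.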
Figures 2A and 2B present the geographic distributions of the county-level PCVM and the phenotypes (for counties in both the training and test sets) from the main model. Counties with high PCVM were mostly in the Southern US. Most of these counties corresponded to the highest-risk phenotypes G and F, which were mostly distributed across the American South and the Appalachian region; phenotype G was concentrated in Kentucky, West Virginia, Mississippi, Arkansas, southern Alabama, southern Georgia, southern Missouri, and New Mexico. In contrast, many populous coastal counties in the Northeast and the West were of phenotype A, the lowest-risk phenotype. Counties of phenotype B, the second-lowest-risk phenotype, were mostly found in the Northeast and the Midwest. A large proportion of counties of phenotype C were found in rural New York and Pennsylvania, as well as in many counties in the Midwest, the West, and the state of Texas. Many counties of phenotype D, the median-risk phenotype, were located in rural areas of the West, including Arizona, California, Oregon, Washington, and Idaho. Counties of phenotype E, with fewer people with income < 200% of federal poverty level but higher levels of physical inactivity and food insecurity and a lower level of excessive drinking, were scattered across a few states in the South and the Midwest, such as Oklahoma, Indiana, and North Carolina. The geographic distribution of phenotypes is also summarized in Table 2.
The relative importance of risk factors in predicting PCVM (Fig. 3) suggested that the variables that appeared in the CART output were also among the top-ranking variables in the random forest analysis. Notably, median household income, income < 200% of federal poverty level, and food insecurity were the three most important variables in the random forest plot. Other high-importance variables included broadband access, smoking, and receipt of Supplemental Nutrition Assistance Program (SNAP) benefits, which also appeared in the output of the CART analysis with a minimum of 100 counties in terminal nodes (Supplementary Figure S3). Excessive drinking, not having a high school degree, and physical inactivity ranked between 7th and 10th in the random forest plot.