The distinction between states and mega regions, as resolved at the ZCTA level, is a complex geoprocessing concept and is central to this paper. Figure 1 shows this complexity visually: thick black lines outline states, colored and labeled areas mark mega regions, and thin enclosed black lines mark the ZCTAs contained within them. For example, the Philadelphia mega region includes ZCTAs from Pennsylvania, New Jersey, Maryland, Delaware, and New York State, while the city of Philadelphia itself lies within Pennsylvania. Conversely, Pennsylvania contains ZCTAs from several, but not all, mega regions: Philadelphia, Pittsburgh, Upstate New York, New York City, and Washington-Baltimore. The mega regions represent maximum home-to-work commuting extents learned from census records.
Figure 1. Mega region commuter extents with State outlines and ZCTA detail
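The many-to-many relationship between states and mega regions can be illustrated with a small lookup built from a ZCTA crosswalk. This is a minimal sketch, not the study's geoprocessing: the ZCTA codes, the `CROSSWALK` rows, and the `group_pairs` helper are hypothetical placeholders chosen only to mirror the Philadelphia and Pennsylvania example.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (ZCTA, state, mega region).
# The codes and assignments are placeholders, not the study's data.
CROSSWALK = [
    ("00001", "Pennsylvania", "Philadelphia"),
    ("00002", "New Jersey", "Philadelphia"),
    ("00003", "Delaware", "Philadelphia"),
    ("00004", "Pennsylvania", "Pittsburgh"),
    ("00005", "Pennsylvania", "Washington-Baltimore"),
]


def group_pairs(rows, key_index, value_index):
    """Map each value in column key_index to the set of values in value_index."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[value_index])
    return dict(groups)


# One mega region can draw ZCTAs from several states...
states_by_region = group_pairs(CROSSWALK, key_index=2, value_index=1)
# ...and one state can contribute ZCTAs to several mega regions.
regions_by_state = group_pairs(CROSSWALK, key_index=1, value_index=2)

print(sorted(states_by_region["Philadelphia"]))
print(sorted(regions_by_state["Pennsylvania"]))
```

Because neither geography nests inside the other, any comparison of the two has to be made at the ZCTA level rather than by aggregating one unit into the other.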
Eighty-six of the ninety-nine infectious agents occurred in regions and states with greater than 11 PMPM and were analyzed. Table 1 tabulates the relative case capture across all study months, by agent of infection, for states and regions, for the 20 agents with the highest cumulative PMPM case counts. Note that ‘states’ in CMS include Puerto Rico, Guam, and the US Virgin Islands, as well as out-of-US region codes for beneficiaries abroad. Differences in Table 1 should be interpreted with caution: there are slightly more regions than states, and the total cases presented are not adjusted for the eligible population from which cases were drawn. Further, cases are distinct individuals who can be counted once per month over the four observation years, for a maximum of forty-eight case-months per person per agent. Individuals who moved (changed their mailing address during the study period) and crossed state lines would count twice under State PMPM. Likewise, individuals who moved across regions would count twice if they billed for an agent of infection in the new region. Regions may capture a larger breadth of geographic change over time than states.
Table 1
Top 20 infectious agents by case-month and geographic unit
Agent of Infection | State PMPM Cases | Region PMPM Cases |
Tinea (Ringworm) | 57,032,882 | 59,926,428 |
Streptococcus | 13,290,331 | 14,170,451 |
Influenza | 10,676,429 | 11,335,067 |
Viral NOS | 7,649,741 | 8,031,743 |
Candida | 7,125,179 | 7,526,649 |
HIV | 6,162,992 | 6,285,040 |
Hepatitis C | 3,960,574 | 4,153,671 |
Clostridium Difficile | 2,271,677 | 2,400,716 |
Hand, Foot, and Mouth Disease (HFMD) | 2,065,127 | 2,160,636 |
Syncytial Virus | 1,722,965 | 1,842,686 |
Staphylococcus | 1,700,042 | 1,804,160 |
Molluscum Contagiosum | 1,247,306 | 1,305,903 |
Mycosis NOS | 1,227,030 | 1,294,095 |
Pseudomonas | 1,161,381 | 1,230,902 |
Bacteria NOS | 998,400 | 1,048,788 |
Lyme Disease | 812,570 | 873,648 |
Hepatitis NOS | 650,807 | 675,578 |
Blastomycosis | 447,016 | 474,869 |
Infection NOS | 446,024 | 471,091 |
Herpes NOS | 436,373 | 455,989 |
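The counting rule behind Table 1 (one case per person, per agent, per month, per geographic unit) can be sketched as follows. This is an illustration under stated assumptions rather than the study's pipeline: the `CLAIMS` records, their field order, and both helper functions are hypothetical.

```python
# Hypothetical claims: (person_id, month_index, agent, state, region).
# Month indices run 1..48 over the four observation years.
CLAIMS = [
    ("p1", 1, "Influenza", "Pennsylvania", "Philadelphia"),
    ("p1", 1, "Influenza", "Pennsylvania", "Philadelphia"),  # repeat bill, same month
    ("p1", 2, "Influenza", "New Jersey", "Philadelphia"),    # moved across a state line
    ("p2", 1, "Influenza", "Pennsylvania", "Pittsburgh"),
]

STATE, REGION = 3, 4  # column positions of the two geography types


def case_months(claims, geo_index):
    """Count distinct (person, month, agent, geography) combinations."""
    return len({(c[0], c[1], c[2], c[geo_index]) for c in claims})


def units_per_person(claims, geo_index):
    """Map each (person, agent) to the set of geographic units they appear in."""
    seen = {}
    for c in claims:
        seen.setdefault((c[0], c[2]), set()).add(c[geo_index])
    return seen


print(case_months(CLAIMS, STATE), case_months(CLAIMS, REGION))
print(units_per_person(CLAIMS, STATE)[("p1", "Influenza")])
print(units_per_person(CLAIMS, REGION)[("p1", "Influenza")])
```

In this toy data, person `p1` appears as a case in two states but in only one region, which is the kind of double counting across unit boundaries that the caution above refers to.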
Figure 2 describes the relative difference between states and regions in the share of geographic units whose monthly moving average was above the series median. This serves as a form of ‘peak’ detection. High-peak months by state and region are plotted below as the percent of geographic unit-agent-months. Both states and regions had superior detection for specific agents of infection. Note that ‘rare’ diseases are better detected with states than with regions. The percent difference between geography types ranged from 0.01% to 21.04% of months.
Figure 2. Top 20 PMPM monthly moving average above the median by geography type and agent
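The peak-detection rule described for Figure 2 can be sketched as a moving average compared against the series median. The window length, the choice of a trailing window, the use of the raw series median as the cutoff, and the `monthly_rate` values are all assumptions made for illustration; the text does not specify them.

```python
from statistics import median


def moving_average(series, window=3):
    """Trailing moving average; the first window-1 months have no value."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def share_above_median(series, window=3):
    """Fraction of smoothed months that exceed the median of the raw series."""
    smoothed = moving_average(series, window)
    cutoff = median(series)
    return sum(value > cutoff for value in smoothed) / len(smoothed)


# Hypothetical monthly rates for one agent in one geographic unit.
monthly_rate = [2, 2, 3, 9, 12, 10, 4, 2, 2, 3, 8, 11]

print(share_above_median(monthly_rate))
```

Computing this share separately for each state-agent and region-agent series, then differencing the two, would give a quantity comparable in spirit to the percent differences reported above.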
Within the spatial random forest model, non-state jurisdictions (colonies, territories) were not used; Hawaii and Alaska were also excluded because their nearest neighbors are not at reasonable study distances. Washington DC was included. The segmentation models attempted to predict the geographic unit’s name from a select list of infectious agents and their monthly rates. The model knew the local area of a given unit through nearest-neighbor local areas learned from centroid longitude and latitude. The Spatial ML package builds random forest ‘trees’ from the local area of a geographic observation rather than considering the total universe of observations. The models used 200 trees, and the choice infectious agents (independent variables) could be used for assignment. The segmentation model was highly accurate, with a state error rate (bad guesses) of 4.89% and a region error rate of 2.65%, indicating that regions outperformed states in segmentation potential learned from the select agents.
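The notion of a ‘local area’ learned from centroid coordinates can be sketched as a nearest-neighbor search on great-circle distance. This covers only the neighbor-selection idea and does not fit a random forest or reproduce the package's internals; the centroid values are rough illustrative figures, and the choice of `k` and the helper names are assumptions.

```python
from math import asin, cos, radians, sin, sqrt

# Rough centroids as (latitude, longitude); illustrative values only.
CENTROIDS = {
    "Pennsylvania": (40.9, -77.8),
    "New Jersey": (40.2, -74.7),
    "Delaware": (39.0, -75.5),
    "Ohio": (40.3, -82.8),
    "California": (37.2, -119.4),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def local_area(unit, centroids, k=2):
    """Return the k units whose centroids are closest to the given unit."""
    origin = centroids[unit]
    others = [name for name in centroids if name != unit]
    return sorted(others, key=lambda name: haversine_km(origin, centroids[name]))[:k]


print(local_area("Pennsylvania", CENTROIDS, k=2))
```

A unit such as Hawaii or Alaska would have every neighbor thousands of kilometers away under this rule, which is consistent with the stated reason for excluding them.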
Table 2 describes the differences in Mean Decrease Gini (MDG) between the models. Here MDG can be understood as the distinctiveness of the segmentation decisions: the higher the MDG, the more acute the independent variables (disease case-month volumes) used to make a split on a tree. For example, in Table 2 syphilis had a difference in MDG between geography types of 113.08 (region minus state), so states used additional information more often when considering syphilis relative to regions. Larger values indicate that a geography is better at finding segmentation using fewer diseases when a specific disease is present in the decision. Note that different geography types consider different diseases when deciding on segmentation.
Table 2
Mean Decrease Gini by Region and State, Spatial Random Forest Model with Choice Agents
Agent of Infection | Region MDG | State MDG | Difference (Region minus State) |
HIV | 695.86 | 409.54 | 286.33 |
Staphylococcus | 411.54 | 167.36 | 244.18 |
Syphilis | 341.71 | 228.62 | 113.08 |
Lyme | 304.77 | 233.27 | 71.51 |
Clostridium Difficile | 255.92 | 228.74 | 27.18 |
Hepatitis C | 255.65 | 329.38 | -73.74 |
Tuberculosis | 120.69 | 239.66 | -118.97 |
Streptococcus | 53.39 | 85.43 | -32.04 |
Hepatitis B | 44.19 | 102.27 | -58.08 |
Varicella | 40.68 | 54.31 | -13.63 |
Hepatitis A | 28.03 | 83.42 | -55.39 |
Campylobacter | 25.59 | 96.21 | -70.62 |
Hand, Foot, and Mouth Disease | 9.43 | 35.69 | -26.26 |
Influenza | 3.56 | 9.12 | -5.56 |
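To make the quantity behind MDG concrete, the sketch below computes the Gini impurity decrease for a single split. Random forest implementations typically accumulate such decreases for each variable over all splits and trees, often weighted by node size, so the values in Table 2 are on a different scale; the labels and the two example splits here are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node holding four observations from two geographic units.
parent = ["Miami", "Miami", "New York", "New York"]

# A split on an agent that separates the units perfectly...
clean_split = gini_decrease(parent, ["Miami", "Miami"], ["New York", "New York"])
# ...versus a split on an agent that carries no geographic signal.
mixed_split = gini_decrease(parent, ["Miami", "New York"], ["Miami", "New York"])

print(clean_split, mixed_split)
```

An agent whose rates separate geographic units cleanly earns a large decrease at each split, which is why geographically specific infections rank above general population infections in Table 2.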
Figure 3 shows variable importance by geographic unit of report, that is, the contribution each agent of infection made to the segmentation decision when the spatial random forest models attempted to tell labeled geographies apart. The difference in variable importance should be understood through the interquartile range of each disease’s importance across geographies. Large shifts are detectable within diseases such as HIV, whose largest variable importance was 160.83 for regions (Miami) and 105.13 for states (New York). Several general-population infections that lack geo-specificity were of low model value, in particular Hepatitis A and Influenza. Staphylococcus (regions) and Lyme disease (states) had noticeable departures in range. Larger interquartile ranges may suggest a geographic type’s superior fitness in detecting endemic diseases becoming epidemic.
Figure 3. Model variable importance with choice agents
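The interquartile range comparison described for Figure 3 can be sketched with the standard library. The importance scores below are hypothetical, and the inclusive quartile method is an assumption; the text does not state which quartile convention was used.

```python
from statistics import quantiles


def interquartile_range(values):
    """Third quartile minus first quartile, using the inclusive method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical per-unit importance scores for one agent of infection.
importance = {
    "Region": [10.0, 20.0, 30.0, 40.0, 50.0],
    "State": [24.0, 27.0, 30.0, 33.0, 36.0],
}

iqr_by_geography = {
    geography: interquartile_range(scores)
    for geography, scores in importance.items()
}

print(iqr_by_geography)
```

In this toy example both geography types share the same median importance, but the wider spread for regions indicates that the agent matters far more for some regions than for others.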