A Method for Constructing a Longitudinal Sample of Medicare Patients with Application to Diabetes Outcomes Research

Background: We present a method of randomly drawing a longitudinal sample of Medicare patients to 1) conduct longitudinal analysis of care use and outcomes over a ten-year follow-up; 2) provide a representative cross-sectional sample in each year during this time period; and 3) provide adequate precision for estimates in comparisons that involve minority patients at the county level. This method was applied to patients with diabetes in the Diabetes Belt (a region in the Appalachian and southern US with higher rates of diabetes) and surrounding counties.

Methods: We used the Medicare Master Beneficiary Summary Files (A/B/C/D and Chronic Conditions segments) to identify eligible patients for each year. We targeted a sample of just under 900,000 patients per year. The 2006 sample was stratified by county and white/minority status, and targeted at least 250 patients in each stratum, with the remaining sample allocated proportional to county size with oversampling of the minority population. Patients who were alive, did not move between counties, and stayed enrolled in Medicare fee-for-service (FFS) were retained in the sample for subsequent years. Non-retained patients were replaced with a sample of patients in their first year of Medicare FFS (e.g., new enrollees) or patients who moved into a sampled county during the previous year.

Results: The resulting sample contains an average of 899,266 patients each year and closely matches population demographics and chronic conditions. For all years in the sample, the weighted sample average age and the population age differ by less than 0.01 years; the proportion white is within 0.01%; and the proportion female is within 0.08%. No difference was statistically significant at the α = 0.05 level.

Conclusions: This carefully constructed survey sample will allow us to perform longitudinal and cross-sectional analysis on healthcare utilization and outcomes. This sampling strategy can be easily adapted to other projects that require random samples of Medicare beneficiaries for longitudinal follow-up with possible oversampling of some sub-populations.


Background
Medicare claims data capture national data on health care utilization for Americans who are aged 65 years or older, are disabled, or have end-stage renal disease (ESRD). Medicare is the only source of national data on healthcare utilization, thus its importance for epidemiological and health services research cannot be overemphasized. Because of the cost of acquiring Medicare claims data from the Centers for Medicare and Medicaid Services (CMS), obtaining full data on the Medicare population with longitudinal follow-up over several years may not be feasible or practical. For this reason, researchers often need to work with a representative sample of the Medicare population. In this paper we present a method of drawing a random sample that is representative of the Medicare population for both longitudinal follow-up and cross-sectional analysis.
We will illustrate this method with an example of a sampling design for a longitudinal study of Medicare patients with diabetes living in the Diabetes Belt and surrounding counties. The study is designed to assess the care received by and outcomes for these patients. The Diabetes Belt (Barker et al., 2011) is a collection of 644 counties in the Appalachian and southeastern United States that had diabetes prevalence of at least 11% in 2007. This area continues to have high prevalence of diabetes (CDC, 2020), which motivates the study of care delivery and outcomes in this population. Our survey additionally includes diabetic patients in geographically surrounding counties; these counties are expected to be as culturally similar as possible while providing a comparison population with a lower burden of diabetes and diabetes care.
The end goals of the study informed the sampling design we will describe. We plan to use this data to track changes in patient care, practice patterns, and outcomes over time. The sample we describe was designed to provide valid inference around these goals for patients with diabetes living in the Diabetes Belt and surrounding counties during the years from 2006 to 2015. In addition to providing longitudinal assessment of patients, the study is also designed to produce representative cross-sectional samples in each year, and to provide comparisons within counties.
From a sampling design perspective, the goals we have outlined are somewhat in conflict. For example, if the goal is to provide similar precision within each county, then the optimal sampling design would, as well as possible, sample approximately the same number of people in each county. In contrast, if the goal is to provide the best population level descriptive data, then sampling from each county in proportion to its size is approximately optimal (Lohr, 1999, Section 4.4, 104-108). The desire to compare white and minority populations in our study suggested oversampling of whichever group is smaller in each county. While surveys designed for a specific primary analysis can be further optimized, our survey needs to provide reasonable analytic power for multiple aims. This sampling design will provide good precision for a wide range of analyses.
Our goal was to enable estimates with good precision for a wide range of outcomes while keeping the total sample to less than 1,000,000 beneficiaries, the lowest tier of Medicare data pricing (RESDAC, 2016). With 100,000 set apart for a third comparison group beyond the scope of this report, our target sample size was just under 900,000. In the remainder of this paper we describe the methods used to construct this sample and we present an analysis that demonstrates the resulting sample was representative of the target population and that demographic estimates based on this sample have high precision and accuracy.

Methods

Population
We used the Medicare Master Beneficiary Summary Files (A/B/C/D and Chronic Conditions segments) to identify Medicare patients meeting inclusion criteria each year from 2006 to 2015. To be eligible for inclusion, Medicare patients needed to have been previously diagnosed with diabetes (identified in the Chronic Conditions segment), be living in the Diabetes Belt or surrounding counties, and be enrolled in Medicare Fee-for-Service for 12 months each year. Patients enrolled in Medicare HMOs were excluded because their claims data are not available.

Diabetes Belt and Surrounding Counties
The Centers for Disease Control and Prevention (CDC) identified 644 counties across 15 states in the Appalachian region and southeastern US as the Diabetes Belt (Barker et al., 2011). Some or all counties in Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, North Carolina, Ohio, Pennsylvania, South Carolina, Tennessee, Texas, Virginia, and West Virginia comprise the Belt. We used the CDC's definition based on 2008 data in this study. We identified 310 counties that are closest but not contiguous to the Belt counties as surrounding counties. They were chosen to serve as a basis for comparisons with the Belt counties. Counties that are immediately adjacent to the Belt were not included among the surrounding counties because some patients may cross county boundaries to seek care, which could confound our estimates of preventive care utilization and complication rates.

Construction of the 2006 Sample
The sample for 2006 was a random sample of eligible patients stratified by county and race. We divided race/ethnicity into two groups, non-Hispanic white and all minorities. We did not further sub-divide the sample by race/ethnicity because there were relatively few individuals of Hispanic ethnicity and other races of Medicare age residing in the Diabetes Belt and surrounding areas during this time period. A very small number of patients with missing race/ethnicity were included in the sampling frame along with the white population.
In order to balance the competing needs for county and regional level inference, we first allocated a sample of 500 persons to each county (or the county eligible population if fewer than 500). We considered several alternatives between 500 and 1000. We found that 500 allowed a complete enumeration for the smallest 18% of counties and at least 50% sampling for 70% of counties while still allowing significant sampling in the most populous regions. We then allocated the remaining available sample to each county proportional to the size of its un-sampled population, with the constant of proportionality chosen to produce a sample size as close as possible to the 900,000 person target; the resulting sampling rate was approximately 30% of the remaining population.
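The two-step county allocation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' R program: the function name `allocate_county_samples`, the rounding rule, and the toy county populations are all hypothetical, and the common sampling rate is obtained by simple division rather than by whatever search the original code used.

```python
def allocate_county_samples(county_pops, target_total, base=500):
    """Allocate a base sample per county, then spread the rest proportionally.

    county_pops: dict mapping county id -> eligible population size.
    Returns (dict county id -> sample size, rate applied to the remainder).
    """
    # Step 1: base allocation, capped at the county's eligible population
    # (small counties are completely enumerated).
    base_alloc = {c: min(base, n) for c, n in county_pops.items()}
    remaining_pop = {c: county_pops[c] - base_alloc[c] for c in county_pops}

    total_base = sum(base_alloc.values())
    total_remaining = sum(remaining_pop.values())
    if total_remaining == 0 or target_total <= total_base:
        return base_alloc, 0.0

    # Step 2: a single sampling rate for the un-sampled population, chosen
    # so that the overall total lands near the target (assumed approach).
    rate = min(1.0, (target_total - total_base) / total_remaining)
    alloc = {
        c: base_alloc[c] + round(rate * remaining_pop[c]) for c in county_pops
    }
    return alloc, rate


# Toy example (hypothetical numbers): three counties, target of 4,600.
toy_pops = {"A": 300, "B": 1500, "C": 10500}
toy_alloc, toy_rate = allocate_county_samples(toy_pops, target_total=4600)
print(toy_alloc, round(toy_rate, 3))
```

In the toy example county A is fully enumerated, while B and C receive 500 patients plus a common fraction of their remaining population. Because of rounding, the realized total can differ from the target by a few patients.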
Within counties we then initially tried to allocate a sample of size 250 (or the population size if less than 250) to the white population and 250 for the minority population. Remaining samples allocated to the county were then divided between the white and minority populations according to the proportion $p^{s}_{im} = 2\left(p^{r}_{im} - \tfrac{1}{2}\right)^{3} + \tfrac{1}{2}\left(p^{r}_{im} - \tfrac{1}{2}\right) + \tfrac{1}{2}$, where $p^{s}_{im}$ represents the non-white proportion in the remaining sample for county $i$ and $p^{r}_{im}$ represents the non-white proportion of the unsampled population of county $i$ (Figure 1). In counties where the non-white population is in the minority, this formula oversamples the non-white population by a rate of approximately two-to-one when the non-white population is proportionally small, while dropping to equal sampling as the white and non-white populations become equal. In counties where the white population is in the minority, the white population was oversampled. Our goal was to oversample whichever group (white/non-white) was in the minority in each county in order to improve within county comparisons while still providing significant coverage of the white population, which encompasses approximately 80% of the population living with diabetes in this region.
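To make the behavior of the allocation curve concrete, the short sketch below evaluates it at a few points. It assumes the formula reads as: sample share = 2(p − 1/2)^3 + (p − 1/2)/2 + 1/2, where p is the non-white share of the un-sampled county population. The cubic exponent is inferred from the stated two-to-one oversampling near p = 0, and the function name `minority_share_of_remainder` is hypothetical.

```python
def minority_share_of_remainder(p_r):
    """Non-white share of a county's remaining sample, given the non-white
    share p_r of its un-sampled population (assumed reading of the formula)."""
    if not 0.0 <= p_r <= 1.0:
        raise ValueError("p_r must be a proportion between 0 and 1")
    d = p_r - 0.5
    return 2.0 * d**3 + d / 2.0 + 0.5


# Evaluate the curve at a few population shares.
for p_r in (0.0, 0.05, 0.25, 0.5, 0.75, 1.0):
    share = minority_share_of_remainder(p_r)
    print(f"population share {p_r:.2f} -> sample share {share:.4f}")
```

Under this reading the curve passes through (0, 0), (1/2, 1/2), and (1, 1) and has slope 2 at both ends, so a group making up 5% of the un-sampled population receives about 9.3% of the remaining sample. The curve is symmetric about 1/2, which is why the white population is oversampled in the same way wherever it is the smaller group.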
Once we had defined the sample size by stratum (county and white/non-white), we then selected patients using simple random sampling within strata. Survey weights were defined to be the stratum population size divided by the stratum sample size.
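A minimal sketch of this selection step is shown below: simple random sampling without replacement inside each stratum, with the weight set to the stratum population size divided by the stratum sample size. The data layout (a dictionary keyed by county and white/minority group), the function name `draw_stratified_sample`, and the toy frame are assumptions made for illustration only.

```python
import random


def draw_stratified_sample(frame, stratum_sizes, seed=0):
    """Draw a simple random sample within each stratum.

    frame: dict mapping stratum -> list of patient ids (the sampling frame).
    stratum_sizes: dict mapping stratum -> number of patients to draw.
    Returns a list of (patient_id, stratum, weight) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for stratum, ids in frame.items():
        n = min(stratum_sizes.get(stratum, 0), len(ids))
        if n == 0:
            continue
        weight = len(ids) / n  # stratum population / stratum sample size
        for pid in rng.sample(ids, n):  # sampling without replacement
            sample.append((pid, stratum, weight))
    return sample


# Toy frame (hypothetical): one county with 1,000 white and 200 minority patients.
toy_frame = {
    ("county_1", "white"): list(range(1000)),
    ("county_1", "minority"): list(range(1000, 1200)),
}
toy_sizes = {("county_1", "white"): 250, ("county_1", "minority"): 100}
toy_sample = draw_stratified_sample(toy_frame, toy_sizes)
print(len(toy_sample), sum(w for _, _, w in toy_sample))
```

With these weights, the weights in each stratum sum to the stratum population, so weighted totals reproduce the frame counts by construction.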

Construction of Subsequent Years' Samples
Sampling in subsequent years was complicated by the need both to retain patients for longitudinal follow-up and to ensure a cross-sectionally representative sample in each year. The guiding principle, however, is straightforward.
The 2006 sample was representative of all Medicare patients who had diabetes and met inclusion criteria. If for 2007 we retained all patients from the 2006 sample who had remained alive and eligible, then these patients were representative of the population who had been eligible for at least 1 year (and they were therefore approximately 1 year older than the overall population). Patients in the 2006 data set who had died, enrolled in a Medicare HMO, or moved were replaced with an appropriately weighted sample of patients who were not eligible for inclusion in 2006.
To construct the 2007 replacement sample, we first allocated at least 10 patients to each stratum to ensure that we add new beneficiaries in every county every year. Additional patients were then allocated to each stratum to target the overall sample size as would have been calculated using the 2006 sampling procedure on the 2007 county populations. All replacement patients were sampled from the population who would have been ineligible in 2006 (not enrolled in Medicare, in a Medicare HMO, lived elsewhere, or were first diagnosed with diabetes in 2007). Sampling weights were calculated as the size of the first-year-eligible white or minority population in each county divided by the corresponding fill-in sample size. We similarly constructed the 2008-2015 replacement samples.
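The replacement step can be sketched as below. This is one possible reading of the procedure, not the authors' implementation: it assumes that each stratum is topped up toward the size the 2006 procedure would give for the current year, subject to the floor of 10 new patients and to the size of the newly eligible pool. The function name `replacement_allocation` and all numbers are hypothetical.

```python
def replacement_allocation(target, retained, newly_eligible, floor=10):
    """Size and weight of the fill-in sample for each stratum.

    target: dict stratum -> sample size the baseline procedure would give
            for the current year's population.
    retained: dict stratum -> patients kept from the previous year's sample.
    newly_eligible: dict stratum -> patients who were not eligible last year.
    Returns dict stratum -> (fill-in sample size, fill-in weight or None).
    """
    out = {}
    for stratum in target:
        pool = newly_eligible.get(stratum, 0)
        shortfall = max(0, target[stratum] - retained.get(stratum, 0))
        # At least `floor` new patients, never more than the pool holds.
        n_new = min(pool, max(floor, shortfall))
        # Weight: newly eligible population divided by fill-in sample size.
        weight = pool / n_new if n_new > 0 else None
        out[stratum] = (n_new, weight)
    return out


# Toy strata (hypothetical numbers).
toy_target = {"a": 300, "b": 250}
toy_retained = {"a": 260, "b": 248}
toy_new_pool = {"a": 120, "b": 30}
toy_fill = replacement_allocation(toy_target, toy_retained, toy_new_pool)
print(toy_fill)
```

Retained patients keep the weights they were assigned when first sampled, while fill-in patients are weighted to represent only the newly eligible population, so the two groups together cover the full current-year population.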

Comparison of Sample to Population
In order to ensure the sample demographics reflected the underlying population for each year, we compared the randomly selected sample to the population. This analysis was performed using weighted survey sample analysis procedures (e.g. Stata "survey" suite of programs) with weights as described above and sampling strata defined by county, white/minority, and year the patient was added to the sample.

Variance Estimation
Estimating variances (for standard errors and p-values) is a long-standing challenge in survey analysis with many possible approaches. For standard statistics (means, proportions, totals, regression coefficients), Taylor series based methods are built into all statistical survey analysis packages (e.g. SAS Survey procedures, the R Survey package, or the Stata Survey suite). We used this approach for the cross-sectional analyses presented below. For more complicated statistics, such as longitudinal models, resampling or jackknife methods are typically used; see Wolter (2007) for a good and very readable overview.
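For intuition, the sketch below computes a stratified mean and its standard error using the textbook estimator for stratified simple random sampling without replacement, including the finite population correction. It is a simplified stand-in for what survey packages do, since their linearization routines are more general; the function name `stratified_mean_and_se` and the toy data are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def stratified_mean_and_se(strata):
    """Estimate a population mean and its standard error.

    strata: list of (N_h, values) pairs, where N_h is the stratum population
    size and values are the sampled observations from that stratum.
    """
    total_pop = sum(n_pop for n_pop, _ in strata)
    estimate = 0.0
    var_total = 0.0
    for n_pop, values in strata:
        n_samp = len(values)
        share = n_pop / total_pop  # stratum share of the population
        estimate += share * mean(values)
        # Strata with a single observation or a full census add no variance
        # term here (a simplification for single-observation strata).
        if 1 < n_samp < n_pop:
            fpc = 1.0 - n_samp / n_pop  # finite population correction
            var_total += share**2 * fpc * variance(values) / n_samp
    return estimate, sqrt(var_total)


# Toy data (hypothetical): ages sampled from two strata of size 100 and 300.
toy_strata = [(100, [70, 72, 74, 76]), (300, [66, 68, 70, 72])]
toy_mean, toy_se = stratified_mean_and_se(toy_strata)
print(toy_mean, toy_se)
```

The finite population correction matters in a design like this one, where many small counties are sampled at rates of 50% or more and fully enumerated strata contribute no sampling variance at all.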
Data cleaning was performed in SAS (v9.4, Cary, NC) and Stata SE (v15.1, College Station, TX); the random sample was generated using an R (v3.6.1, Vienna, Austria) program which is available on request; and descriptive statistics and comparison to the reference population were calculated using Stata survey programs. This study was approved by the University of Virginia institutional review board (IRB #21127).

Results
Our goal was to create a sample in each year from 2006 to 2015 that satisfied our research need for longitudinal follow-up and cross-sectional analysis. We targeted a stratified random sample of about 900,000 from the Diabetes Belt and surrounding counties. Our sample design yielded an average sample size of 899,266 (standard deviation 408) over the 10-year study period. A total of 28% of the 2006 sample was retained for the full 10-year study period; these 28% included more than 200,000 non-Hispanic white and 70,000 minority patients. Table 1 shows year-by-year retention based on year of initial sampling. Although Hispanic and other race/ethnicity groups represented less than 1% of our total sample, the study retained a substantial number (approximately 2,000 or more) for area-wide subgroup comparison and for longitudinal follow-up.
In order to assess the resulting sample, for each year of the survey we made cross-sectional comparisons of the weighted sample to the population defined from the Medicare Master Beneficiary Summary Files (described in Table 1). Population size, race, and previous year sample eligibility were the factors we used in determining the sample. We therefore focused our descriptive statistics on race, age, and population totals. Age is a particularly important variable for assessment because if the fill-in samples were incorrectly constructed, we would expect to see drift from the underlying population as the retained samples from previous years aged. We additionally included sex because it is an important factor in most health outcomes and it provides a good additional point of comparison that was not incorporated in the sampling design.
Results are shown in Table 2. For all years in the sample, the weighted sample average age and the population age differ by less than 0.01 years; the proportion white is within 0.01%; and the proportion female is within 0.08%. No difference was statistically significant at the α = 0.05 level. Figure 2 shows that in the last year of follow-up (2015) the weighted age distribution closely matches the population age distribution. This comparison provides a visual check that the fill-in samples from years 2007-2015 were appropriately weighted to allow for valid cross-sectional comparisons of age.

Discussion
In this report we described a sampling method which produced a representative sample of Medicare patients with good cross-sectional properties while still retaining patients for longitudinal analysis. We will use this data to assess trends in preventive care utilization, long term outcomes, disparities, and associations between preventive care and diabetic complications in patients with diabetes. These goals require the sample to be valid both longitudinally and cross-sectionally.
A primary goal in this work was to document these methods for future researchers who might be interested in obtaining representative samples of Medicare claims data. In preparing for this project, we found only limited literature describing longitudinal sampling designs that could serve as a reference. Smith et al. (2009) offer a very high level overview and describe the principles of sampling design for longitudinal surveys. Other articles address subsets of our challenge. For example, Wolinsky et al. (2014) discuss matching Medicare claims to a longitudinally followed cohort without need for cross-sectional inference, while Thompson (2015) and Carrillo and Karr (2013) focus primarily on analytic approaches rather than design. While longitudinal surveys are not rare (for example the Population Assessment of Tobacco and Health (PATH) study, Hyland et al., 2017), they are largely the purview of governments or large survey research organizations because it is hard for a small team to longitudinally track and follow up with a large number of patients. Nonetheless, there are cases, such as this one, where a longitudinal survey is feasible as part of a smaller project. We hope this report will be useful to researchers interested in designing their own studies of this type.
In constructing this sample, we found that it was relatively easy to produce a representative sample for the baseline year (2006). Significantly more care needed to be taken to identify the sampling frame for subsequent years. An advantage of working with samples this large is that there is abundant power to identify potential problems before collecting data.
This study will be limited by the scope of Medicare claims data. In particular, this cohort only includes patients ages 65 and up, and it excludes those patients enrolled in a Medicare HMO. In addition, Medicare only captures claims data, so we will not have access to full clinical records.

Conclusions
We demonstrated that a representative sample of Medicare beneficiaries can be carefully constructed to be used in cross-sectional as well as longitudinal analyses. This sampling method will make the data request much more affordable. The computer algorithms we created can be used by future researchers in drawing random representative samples from Medicare claims data.

Declarations
Ethics approval and consent to participate
This study was approved by the University of Virginia Institutional Review Board (IRB #21127). Consent was not required because it is a retrospective analysis of existing data.

Consent for publication
This study does not include any individual data.

Availability of data and materials
The data that support the findings of this study are available from the Research Data Assistance Center (ResDAC, https://www.resdac.org/); they are used under agreement and cannot be released publicly. Interested researchers who would like to work with this or similar data should contact ResDAC. R code used to construct the sample is available from the authors on request.

Funding
This work was funded by the NIH/NIDDK grant R01DK113295.

Authors' contributions
TM developed the sampling algorithm with MS, implemented it, and wrote the majority of the manuscript. JL helped frame the research questions and sampling approach, and carefully reviewed and edited this manuscript. SK defined the original population from which this cohort was drawn, and was responsible for checking and evaluating the resulting sample. HK helped frame the research question and carefully reviewed the manuscript. MS developed the sampling algorithm with TM, implemented the algorithm independently, and helped frame the research questions.