Sampling strategy, characteristics and representativeness of the InGef research database

Abstract


Background
The use of claims data for health services research has steadily increased in Germany in recent years, and routine data are becoming a more common and important source for such research (1)(2)(3)(4). Claims data (insurance data) are routinely collected for administration and reimbursement purposes. In Germany, about 85% of inhabitants are covered by statutory health insurances (SHI). These data offer enormous potential for health services research, including health economic or pharmacoepidemiological studies. They provide cross-sector data content that is, unlike primary data or survey data, free of selection or recall bias. Further, claims data are available in large sample sizes and offer the possibility of longitudinal analyses. In contrast to the comprehensive all-payer claims databases (APCD) found in the US or Canada (1,5,6), various databases exist in Germany, which allow only very limited access and are still underreported regarding their content and validity. The InGef research database consists of anonymized data from approximately 8.8 million individuals who are insured with one of the 58 German SHIs currently contributing data to the database. A sample database (InGef sample database) consisting of 5% of the German population (~four million insurees) is drawn to ensure representativeness and validity of the database used for health services research. The aim of this study is to describe the sampling method of the representative InGef sample database and to demonstrate its representativeness for the German population based on relevant demographic and clinical measures.

Data source
The InGef research database currently (December 2020) includes longitudinal data of approximately 8.8 million SHI members, insured in one of the contributing SHIs (mainly company or guild health insurances), and covers insurees from all federal states of Germany. The claims data are collected in a specialized data centre owned by SHIs, which provides data warehouse and IT services. Data are anonymized by the data centre before entering the InGef research database; the data centre acts as a trust centre for this anonymization process. The anonymization process ensures that an identification of insured individuals, health care providers, and the respective SHI is not possible. Moreover, access to the InGef research database is strictly controlled and project bound, and analyses are performed exclusively by InGef employees. The lag time of data availability is about nine months. Sampling of the InGef sample database starts with the year 2011. However, due to data privacy regulations for research projects, the time period covered by the InGef sample database is limited to a 6-year look-back (starting with the most recent complete data year).
German SHI claims data available in the InGef research database include partly coarsened information on demographics (quarter of birth, sex, quarter of death if applicable, region of residence on administrative district level); inpatient care (diagnoses, diagnosis related groups (DRG), operations and procedures (OPS) (7)); outpatient services (diagnoses, treatments, specialities of physicians); dispensing of reimbursed drugs; dispensing of reimbursed remedies, devices and aids; and sick leave and sickness allowance times. In addition, costs from the SHI perspective are available for all healthcare sectors. Diagnoses in the InGef database are coded using the International Classification of Diseases, version 10 in the German Modification (ICD-10-GM) (8). Prescriptions of medication are identified based on Anatomical Therapeutic Chemical (ATC) codes as classified by the AOK Research Institute (WIdO) (9).

Sampling Strategy
The aim of the sampling strategy for the InGef research database was to select a sample (InGef sample database) that is representative of the German population with respect to age, sex and, of minor priority, region of residence, and that allows studies on various research questions in health services research, including rare diseases or complex treatment patterns. Therefore, a 5% sample of the German population was drawn annually (~four million insurees) to continuously ensure representativeness and validity (Figure 1). The sampling strategy favoured persons with complete data, i.e. valid data from all data strands (including insurance time, incapacity to work and remedies and aids).
The desired properties of the InGef sample database had the following general consequences for the sampling strategy. First, the sample size required a given number of insurees to be drawn in each age and sex stratum of the InGef research database. If there were not enough insurees in the InGef research database for a stratum, insurees of a close stratum had to be drawn instead. Second, insurees leave the InGef research database if they die or change their health insurance. Due to the annual sampling procedure, newborns for each sampling year cannot initially be included in the InGef research database, as they were not available in the year before. Accordingly, all newborns must be drawn in the annual sampling from the pool of insured persons. The German population also changes because of migration, death, or birth. Therefore, the InGef sample database had to be adapted each year to preserve representativeness over the years (Figure 1).
The reference population for the InGef sample database was 5% of the German population in the categories age (0-89 in yearly steps, 90+ years), sex and federal state, based on the statistics of the Federal Statistical Office (10) (further tables were provided upon request by the Federal Statistical Office). Federal state is further coarsened into North (Hamburg, Bremen, Schleswig-Holstein, Lower Saxony, Mecklenburg-Western Pomerania), South (Baden-Württemberg, Bavaria), West (North Rhine-Westphalia, Hesse, Rhineland-Palatinate, Saarland) and East (Berlin, Brandenburg, Saxony, Saxony-Anhalt, Thuringia).
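The stratum definition above can be illustrated with a minimal Python sketch. The names `REGION_BY_STATE`, `age_category`, `stratum` and `reference_target` are hypothetical, and rounding the 5% target to the nearest integer is an assumption; the text does not state how fractional targets were handled.

```python
# Hypothetical lookup: federal state -> coarsened region, as listed in the text.
REGION_BY_STATE = {
    "Hamburg": "North", "Bremen": "North", "Schleswig-Holstein": "North",
    "Lower Saxony": "North", "Mecklenburg-Western Pomerania": "North",
    "Baden-Württemberg": "South", "Bavaria": "South",
    "North Rhine-Westphalia": "West", "Hesse": "West",
    "Rhineland-Palatinate": "West", "Saarland": "West",
    "Berlin": "East", "Brandenburg": "East", "Saxony": "East",
    "Saxony-Anhalt": "East", "Thuringia": "East",
}


def age_category(age: int) -> int:
    """Single years 0-89; ages of 90 and above are pooled (coded here as 90)."""
    return min(age, 90)


def stratum(age: int, sex: str, state: str) -> tuple:
    """Return the (age category, sex, region) stratum of one person."""
    return (age_category(age), sex, REGION_BY_STATE[state])


def reference_target(population_count: int, fraction: float = 0.05) -> int:
    """Number of persons required in a stratum (rounding is an assumption)."""
    return round(population_count * fraction)
```

For example, `stratum(93, "F", "Bavaria")` yields `(90, "F", "South")`.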
Sampling started with the year 2011 and was performed as follows:
1. The reference population at 31st Dec 2011 was extracted, cross-tabulated by age, sex and region of residence. All persons in the InGef research database at 31st Dec 2011 were eligible for sampling except those with data anomalies (i.e. more than one birthday, insured after death or before birth). Sex and region of residence were determined on the last available date for each insuree in the database, in the same categories as in the reference population. Age was calculated as of 31st Dec in each year. The normalized insurance time was defined as the fraction of insured time over three consecutive years, i.e. insured time from 1st Jan 2012 to 31st Dec 2014 divided by the total time from 1st Jan 2011 to 31st Dec 2013. The insurance time is used in the further sampling strategy to prefer insured persons who are available in the database over a long time. This is important since insurance coverage is usually an inclusion criterion for further (longitudinal) analyses. Thus, including insurees with a normalized insurance time close to one enhances the chance to achieve representativeness even after insurance verification. Normalization was performed for a technical reason, to simplify the construction of the sampling algorithm. To avoid underrepresentation of deceased patients, their normalized insurance time was sampled from non-deceased patients.

2. Sampling for the first year (2011) was performed for each age and region of residence stratum in the reference population, separately for males and females. The sampling started with the least represented stratum, i.e. the stratum with the worst ratio of available insurees in the InGef research database to the number of persons required for that stratum according to the reference population. Insurees were prioritized for sampling if their age and region of residence agreed with the reference population and if they had complete data, i.e. maximal normalized insurance time and complete information on incapacity to work and remedies and aids. Insurees not fulfilling these criteria were assigned a lower priority.
Insurees were drawn according to their priority until the number of individuals required for the stratum was reached. In the same manner, the procedure iterated through the remaining strata until all strata were sampled. Sampled insurees were removed from the pool for further iterations.

3. For the sampling in each of the following years (2012-2019), the differences between the reference population for this year and the sample in all strata were determined. Missing persons in the strata were filled as in step 2. To keep the representativeness of the preceding years, newly drawn insurees who had been insured in the database in the preceding years but did not belong to the database sample so far were included at the beginning of one of the quarters of the respective year. The quarters in which observation time begins for newly drawn insurees in a respective calendar year were randomly assigned based on the distribution of all quarters in which all insurees entered the database in that year. All data of the respective insured person before this sampled quarter were not used for the InGef sample database.
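The three steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used for the InGef sample database: all names (`Insuree`, `priority`, `fill_strata`, `yearly_gaps`, `assign_entry_quarter`) are hypothetical, the ordering among lower-priority candidates (age distance, then region mismatch, then normalized insurance time) is an assumption because the text does not define how a "close" stratum is chosen, and the representation ratio is computed once before drawing.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Insuree:
    id: int
    age: int          # age category as of 31 Dec of the sampling year
    sex: str
    region: str       # North, South, West or East
    norm_time: float  # normalized insurance time in [0, 1]
    complete: bool    # complete data in all data strands


def priority(p, age, region):
    """Sort key, smaller is better. Exact match with complete data comes first;
    the tie-breaking order after that is an assumption of this sketch."""
    top = p.age == age and p.region == region and p.complete and p.norm_time >= 1.0
    return (not top, abs(p.age - age), p.region != region, -p.norm_time, p.id)


def fill_strata(pool, missing):
    """Draw insurees for every stratum in `missing`
    ({(age, sex, region): persons still required}), least represented first."""
    available = {p.id: p for p in pool}
    exact = Counter((p.age, p.sex, p.region) for p in pool)
    # Worst ratio of available to required persons is processed first.
    order = sorted((s for s, n in missing.items() if n > 0),
                   key=lambda s: exact[s] / missing[s])
    drawn = {}
    for age, sex, region in order:
        candidates = sorted((p for p in available.values() if p.sex == sex),
                            key=lambda p: priority(p, age, region))
        chosen = candidates[:missing[(age, sex, region)]]
        for p in chosen:
            del available[p.id]  # sampled insurees leave the pool
        drawn[(age, sex, region)] = [p.id for p in chosen]
    return drawn


def yearly_gaps(targets, sample_counts):
    """Step 3: persons missing per stratum relative to the new reference year."""
    return {s: max(n - sample_counts.get(s, 0), 0) for s, n in targets.items()}


def assign_entry_quarter(quarter_counts, rng):
    """Randomly pick the quarter in which observation starts, weighted by the
    observed distribution of entry quarters ({quarter: count})."""
    quarters = sorted(quarter_counts)
    weights = [quarter_counts[q] for q in quarters]
    return rng.choices(quarters, weights=weights, k=1)[0]
```

For the first year, `missing` equals the full reference targets. For later years, `yearly_gaps` yields the deficits, which are then passed to `fill_strata`, and each newly drawn insuree receives an entry quarter from `assign_entry_quarter` (for example with `rng = random.Random(2011)`).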

Analyses
The following information was extracted: i) demographic information (sex, age, region of residence) of all insurees alive at 31st December 2018 or born in 2019 who were fully insured until 31st December 2019 or until their date of death in 2019; ii) hospitalization rates grouped by discharge diagnosis (main ICD-10-GM chapters) in 2019; iii) mortality rates in 2019; and iv) drug prescription rates in 2019 (20 most frequently prescribed ATC groups (2nd level) as number of prescribed packages). The hospitalization rate per defined ICD chapter was calculated by dividing the number of hospitalizations (fully inpatient) with a discharge date between 1st January and 31st December 2019 and a main discharge diagnosis of the respective ICD chapter by the total number of fully insured persons in the InGef sample database in 2019. The drug prescription rate was calculated as the sum of the quantity of packages prescribed to all insured persons divided by the total number of fully insured persons in the InGef sample database in 2019. The drug prescription rate for the German reference population was calculated accordingly. The mortality rate was calculated by dividing the number of deceased insurees by the number of fully insured persons in the InGef sample database in 2019.
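All three rates share the same structure: an event count divided by the number of fully insured persons. A minimal Python sketch, assuming a hypothetical helper `rate_per` and a scaling to 1,000 persons as used for the reported hospitalization and mortality rates (the counts in the usage note are invented for illustration only):

```python
def rate_per(events: float, fully_insured: int, per: int = 1000) -> float:
    """Events per `per` fully insured persons.

    `events` may be the number of hospitalizations in an ICD chapter,
    the number of deaths, or the number of prescribed packages.
    """
    if fully_insured <= 0:
        raise ValueError("denominator must be a positive number of insured persons")
    return per * events / fully_insured
```

With purely illustrative counts, `rate_per(50, 10_000)` gives 5.0 events per 1,000 persons.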
For hospitalization rates, national reference data for 2019 were extracted from the Information System of the Federal Health Monitoring (11), in alignment with the Federal Statistical Office (Destatis) data (12) for the total German population. For hospitalization rates, diagnosis data based on the place of treatment were used, since the InGef sample database includes persons with a residence abroad. The mortality rate of the total German population was extracted from the Federal Health Monitoring (13). National reference data for the distribution of age, sex and region of residence were extracted from the Federal Statistical Office (14). Drug prescriptions for the German population insured within the SHI system were taken from the German Drug Prescription Report 2020 (15) and the Federal Health Monitoring System (16).
The mean continuous insurance time in the InGef sample database was determined as the time, in years, from Jan. 1, 2014 or entry into the database to the end of insurance, death, or Dec. 31, 2019, whichever came first.
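The definition above can be sketched per insuree as follows. The function name `insurance_years` is hypothetical, and three details are assumptions not stated in the text: the start is taken as the later of Jan. 1, 2014 and the entry date, the end day is counted inclusively, and days are converted to years by dividing by 365.25.

```python
from datetime import date
from statistics import mean

WINDOW_START = date(2014, 1, 1)
WINDOW_END = date(2019, 12, 31)


def insurance_years(entry, end_of_insurance=None, death=None):
    """Continuous insurance time in years within the observation window."""
    start = max(entry, WINDOW_START)
    # Whichever comes first: end of insurance, death, or end of the window.
    end = min(d for d in (end_of_insurance, death, WINDOW_END) if d is not None)
    days = max((end - start).days + 1, 0)  # inclusive end day (assumption)
    return days / 365.25


# Mean over a small, purely illustrative set of insurees.
example_mean = mean([
    insurance_years(date(2010, 1, 1)),
    insurance_years(date(2019, 1, 1), death=date(2019, 1, 10)),
])
```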

Results
The mean age of insurees in the InGef sample database was in good accordance with that of the German population (mean age: 44.1 vs. 43.9 years). Moreover, the proportion of women in the InGef sample database corresponded well to the proportion in the total German population (50.8% vs. 50.7%, InGef database vs. German population). The percentage of insurees living in the Eastern parts of Germany and the proportion of persons living in rural areas were slightly lower in the InGef sample database than in the total German population. Table 1 displays the comparison of the main demographic characteristics. The mean continuous insurance time since entry into the InGef sample database was 4.78 ± 2.01 years. The proportion of newly drawn insurees in the year 2019 was 3.33%. Hospitalization rates, mortality rates and drug usage of persons in the InGef database were similar to the German reference data. Hospitalization rates were slightly lower in most main ICD chapters. Larger deviations were found for pregnancy, childbirth and the puerperium (ICD chapter O; InGef vs. Germany: 19.3 vs. 24.5 per 1,000 persons) and certain conditions originating in the perinatal period (ICD chapter P; InGef vs. Germany: 1.7 vs. 2.4 per 1,000 persons) (Figure 2). Out of the 20 most frequently prescribed ATC drug classes, prescription rates for 18 drug classes were slightly higher in the InGef sample database than in the reference data from the German Drug Prescription Report 2020 (15) (Figure 3). The mortality rate of the persons insured in the InGef sample database was slightly lower than in the German population (10.5 vs. 11.3 per 1,000 persons).

Discussion
The InGef sample database demonstrates good overall accordance with the German reference population. In particular, differences in sex and age distribution as well as mortality rates between the InGef sample database and Germany were small. It was previously reported that substantial differences exist in the characteristics and socio-economic standards of the persons insured with the different German SHIs (17)(18)(19). Moreover, studies that have examined socioeconomic inequalities worldwide warn that these translate into differences in mortality and morbidity rates (20). Accordingly, lower mortality rates were reported for the year 2006 for the German Pharmacoepidemiological Research Database (GePaRD), a database with a population of presumably higher socioeconomic status than the overall German population (21). However, for Germany, van Raalte et al. recently described that regional disparities in mortality based on the large economic inequalities between the German federal states are declining (22), a finding that is supported by our comparative analysis.
Due to the structure of the SHIs that provide data for the InGef research database, the proportion of insurees in the regions of East and West Germany deviates slightly from the German reference population. The differences observed for the hospitalization rates and the prescribed ATC drug classes might likewise be explained by the described regional and socio-economic variations between insurees of the InGef research database and the German population (17,18,23). In particular, the lower hospitalization rates found for ICD chapters O (pregnancy, childbirth and the puerperium) and P (certain conditions originating in the perinatal period) might be linked to a higher socioeconomic status, which has been reported to result in reduced fertility (24,25). Therefore, in observational studies which aim at examining regional differences for specific outcomes, additional standardization with respect to region of residence should be considered. External validity is a feature of the InGef sample database and of utmost importance for epidemiological studies comparing the effect of treatments or health interventions. Thus, unless a very high external validity is explicitly required, the observed marginal differences between the InGef sample database and the reference data are negligible.
Some of the fundamental advantages of claims data characterized here should be mentioned. These secondary data are free of selection and recall bias and contain complete data offering an intersectoral perspective. Further strengths of the data are the possibility to precisely determine the base population, the large sample sizes and the continuous data collection, which allows the state of health to be monitored over a long time period (26,27). In addition to the known strengths of claims data, the InGef sample database provides a readily available, reliable, and representative data source for healthcare research.

Limitations
The comparison of the InGef sample database with the external reference data showed good accordance. However, although the sampling strategy was designed to draw a sample which is representative of the German population in each year, the representativeness might be lower after applying study-specific selection criteria. In particular, studies on incident drug use in a given year would not include newly drawn insurees in that year, as these persons did not contribute insurance data to the database in the previous year. However, for the analyzed year 2019, for example, the impact of excluding the newly drawn individuals on the representativeness is rather low (3.33% newly drawn insurees).
Further, there are a few limitations that are inherent to claims data or come with the use of the database. First, due to the anonymized nature of the data, it is not possible to validate the data using medical charts. Second, data availability for health services research purposes is limited to six years, which is critical for studies that require a longer observation period.

Conclusions
The InGef sample database can be considered representative for the German population and is thus a valuable data source for health services research.

Figure 1 Schematic representation of the sampling strategy of the InGef sample database.

Figure 2 Hospitalization rates in the InGef database and the German population in 2019 (Rates in 2019 per 1,000 persons)