Shrinking the Haystack: An Approach to Identifying Idiopathic Pulmonary Fibrosis in the Electronic Health Record using a Computable Phenotype

doi:10.21203/rs.3.rs-2008034/v1

Background: Computable phenotypes are computerized search queries that allow efficient identification of specific groups of individuals (e.g., those who may meet eligibility criteria for a clinical trial). Heterogeneous clinical syndromes challenge this approach because disease definitions and sub-phenotypes evolve, and diverse phenotypes may be needed for various applications (“use cases”) with diverse research aims. Herein we describe the development and validation of a computable phenotype for the rare disease idiopathic pulmonary fibrosis (IPF) that addresses its evolving terminology and variable use cases. The goal of this study was to develop and execute a single computable phenotype for IPF using a standard data architecture, and to evaluate it for different use cases, each with its own gold standard for validation.

Methods: The PaTH PCORnet Clinical Research Network (PaTH) IPF Working Group developed the candidate IPF computable phenotype and executed it against the Penn State PaTH to Health source population of 588,000 patients with an electronic medical record at Penn State Hershey Medical Center between January 1, 2011 and December 31, 2015. We established a consensus clinician diagnosis and performed duplicate (2-person parallel) chart review on a 100% sample with discrepancy adjudication.

We evaluated the computable phenotype performance for two use cases, each with a separate gold standard: the Inclusive Use Case [gold standard defined as IPF, familial pulmonary fibrosis (FPF), or combined pulmonary fibrosis and emphysema (CPFE)] and the Restrictive Use Case (gold standard defined as IPF, but not FPF nor CPFE).

Results: The IPF computable phenotype yielded an IPF Cohort (N=157) and an estimated population prevalence of 26.7/100,000. The computable phenotype had positive predictive values (PPV) for the Inclusive Use Case and Restrictive Use Case of 57% (89/157) and 47% (74/157), respectively, and an estimated population prevalence of 15.1 and 12.6/100,000, respectively.

Conclusions: These findings demonstrate the utility of a single computable phenotype that can be validated against different gold standards depending on the intended health care or research use case. In a disease where there is no discrete biomarker, this provides a flexible approach to meet diverse clinical research needs.

Trial registration: N/A

idiopathic pulmonary fibrosis

computable phenotype

electronic health record

validation

epidemiology

PCORnet

Adoption of electronic health records (EHRs) increased by an average of 14% per year in the United States with the implementation of the 2009 HITECH Act’s Meaningful Use incentives (1), and reached 96% adoption among non-federal acute care hospitals by 2017 (2). Similarly, the European Union’s rapidly rising adoption of EHRs reached 80% among primary care practices across 15 member states surveyed in 2016. The European Commission’s Digital Single Market Strategy aims to build on this success to establish full interoperability of member states’ electronic medical records (3). This growing repository of health information motivates the continuing development and refinement of methods to query patient-level data (3, 4) for various purposes, including identifying patient groups with specific disease states. Computable phenotypes are a precise, shareable, reproducible, and documented method for using EHR data to categorize people for clinical and population health research, improved diagnosis and health outcomes (4, 5). This method identifies individuals with specific characteristics through a computerized query of patient-level data using a defined set of data elements and logical expressions (5). Computable phenotypes prioritize sensitivity or specificity depending on the intended application (“use case”). Use cases span scientific disciplines and healthcare settings (6) for both common and rare diseases.

Scientific research on rare diseases has historically relied on the practical but flawed approach of assembling cases in disease registries. Registries are subject to selection bias, yielding a study population that may not be representative of the population of patients with a particular rare disease (7), excluding vulnerable populations with limited access to care, and over-representing persons of relative privilege. Identifying patients from insurance claims data sources has limitations (8) because they compile patients from a specific payor, thus excluding the uninsured.

Identifying people with rare diseases has been likened to “searching for a needle in a haystack” due to the need to comb through many unaffected individuals in order to find the few with the rare disease. Recruitment efficiency quantifies the task of identifying participants from source populations and enrolling them in clinical research studies (9). Large datasets, such as those from multiple health systems and EHR platforms, assemble large populations and thus minimize selection bias (10). An ideal search strategy for a rare disease within these large datasets would improve efficiency without losing cases, i.e., “shrink the haystack”.

Many investigators of rare diseases have developed and compared different search strategies (separate algorithms) to identify their study populations. A limitation is that a new algorithm has to be developed and tested if the clinical phenotype (or subphenotype) desired for the study changes (i.e., a new algorithm for each new use case). Another approach is to create a single computable phenotype that aims to be inclusive while still eliminating many true negatives. This sub-population (shrunken haystack) can then be utilized for numerous use cases, each with its own definition.

Idiopathic pulmonary fibrosis (IPF) is a rare, specific form of interstitial lung disease (ILD) characterized by chronic and progressive lung scarring, with an estimated prevalence of 14–43/100,000 persons (8, 11, 12). The 2011 International Consensus Document established criteria for IPF diagnosis (13). Esposito et al (8) compared three separate claims-based algorithms to identify IPF and estimated the incidence and prevalence in the US using a private insurance database. The three algorithms were validated against chart review of less than 3% of source cases. The focus was on identifying a single, ideal computable phenotype rather than evaluating its performance for different use cases. A similar approach was used by Kaul et al (14) to estimate IPF incidence and prevalence among Veterans in the United States Veterans Health Administration system using both a broad and narrow case definition, although no validation was done. Ley et al (15) evaluated an IPF diagnostic algorithm in a health maintenance organization population, validating with a chart review of an 8% sample of identified cases. These previous computable phenotype efforts in IPF, using the 2011 criteria, provide a baseline from which to consider different definitions and use cases.

The IPF diagnostic algorithm has evolved since the 2011 Consensus Document. A subset of IPF, familial pulmonary fibrosis (FPF), was noted by the Committee to have a variant clinical presentation, often at a younger age and with pathologic heterogeneity (13). Another subset identified by the Committee was IPF with coexisting emphysema. In 2011 it was not clear whether this represented a distinct clinical phenotype (13); however, the 2018 statement settled on the phrase “combined pulmonary fibrosis and emphysema” (CPFE) for this IPF variant (16). While the 2018 statement broadens the definition of IPF, clinical trials often prefer the narrower definition, thus creating two distinct applications (“use cases”) for an IPF computable phenotype.

Evolving clinical definitions and overlapping clinical entities create nuanced challenges to creating and validating an IPF computable phenotype that is durable and consistent for IPF patient identification for study protocols of varying purposes. The above modifications in the IPF classification and emergence of different use cases provide a real-world context to examine the utility of an inclusive computable phenotype or multiple-use-case approach to the identification of IPF in a large dataset. The goal of this study was to develop and execute a single updated computable phenotype for idiopathic pulmonary fibrosis using a standard data architecture, and to evaluate it against different use cases, each with its own gold standard for validation.

Study Setting

The PaTH Network (17, 18) is one of 9 clinical research networks within the Patient-Centered Clinical Research Network (PCORnet (19, 20)). PCORnet’s focus is to develop infrastructure for EHR-based research, termed “data at scale”. The PCORnet Common Data Model specifies common data architecture across different EHR systems (21). The original PaTH sites included University of Pittsburgh/University of Pittsburgh Medical Center (UPMC), Penn State College of Medicine/Penn State Hershey Medical Center, Lewis Katz School of Medicine at Temple University/Temple Health, and Johns Hopkins Medicine and Health System.

Development of the PaTH IPF Computable Phenotype

Development of a single updated computable phenotype for idiopathic pulmonary fibrosis (IPF) that was consistent with the PCORnet Common Data Model (21) occurred through consensus by the PaTH IPF Working Group. The group included PaTH network investigators, IPF specialists, and medical informatics experts from each site of the original PaTH network, as well as two patient partners with IPF.

The computable phenotype was intended to identify people with IPF as defined by the American Thoracic Society (ATS)’s clinical diagnostic criteria (13), building on methods from previous IPF epidemiologic studies (11, 22, 23). Our final criteria were modified from the work of Raghu et al (11): inclusion criteria were patients with at least one ICD-9 diagnosis code of 516.3 (idiopathic interstitial pneumonia) or 516.31 (idiopathic pulmonary fibrosis) during an inpatient or outpatient encounter. In order to exclude patients likely to have fibrosing lung diseases due to connective tissue disorders (and not IPF), patients were excluded if an ICD-9 code for a connective tissue disorder was present during the same time period at any inpatient or outpatient encounter (see Table 1). Two additional exclusions to those identified by Raghu et al (11) and Ley et al (15) were antisynthetase syndromes and undifferentiated connective tissue disease (24, 25), as these have been increasingly recognized in the spectrum of CT-ILD. We excluded emergency department visits because the brevity of the encounter could lead to diagnostic misclassification; we excluded lab encounters because provisional diagnoses used in laboratory studies (as part of a diagnostic evaluation) are often subject to revision.
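The inclusion and exclusion logic described above can be illustrated with a minimal Python sketch. It is not the query used in the study. The record layout, the encounter-type labels, and the function name `ipf_cohort` are hypothetical, and the three exclusion codes are placeholders only, because the actual exclusion list is given in Table 1. The sketch also assumes the records have already been restricted to the study period.

```python
from dataclasses import dataclass

# ICD-9 inclusion codes named in the text.
INCLUSION_CODES = {"516.3", "516.31"}

# Placeholder connective tissue disorder codes for illustration only
# (710.0 systemic lupus erythematosus, 710.1 systemic sclerosis,
# 714.0 rheumatoid arthritis). The study's actual list is in Table 1.
ILLUSTRATIVE_EXCLUSION_CODES = {"710.0", "710.1", "714.0"}

# Hypothetical encounter-type labels. Emergency department and lab
# encounters are deliberately absent, so their codes are ignored.
ELIGIBLE_ENCOUNTER_TYPES = {"inpatient", "outpatient"}


@dataclass(frozen=True)
class Diagnosis:
    patient_id: str
    icd9_code: str
    encounter_type: str


def ipf_cohort(diagnoses):
    """Return patient IDs with an inclusion code and no exclusion code."""
    included, excluded = set(), set()
    for dx in diagnoses:
        if dx.encounter_type not in ELIGIBLE_ENCOUNTER_TYPES:
            continue  # skip emergency department and lab encounters
        if dx.icd9_code in INCLUSION_CODES:
            included.add(dx.patient_id)
        if dx.icd9_code in ILLUSTRATIVE_EXCLUSION_CODES:
            excluded.add(dx.patient_id)
    return included - excluded


records = [
    Diagnosis("A", "516.3", "outpatient"),   # qualifies
    Diagnosis("B", "516.31", "inpatient"),   # has an inclusion code...
    Diagnosis("B", "714.0", "outpatient"),   # ...but also an exclusion code
    Diagnosis("C", "516.3", "emergency"),    # only an emergency encounter
    Diagnosis("D", "496", "outpatient"),     # no inclusion code
]
print(sorted(ipf_cohort(records)))  # ['A']
```

Collecting inclusions and exclusions in separate sets and subtracting at the end means the order of a patient's encounters does not affect the result.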

Validation of the PaTH IPF Computable Phenotype

Validation of the computable phenotype for different use cases, each with its own gold standard, followed the methodology of Richesson et al (5). The development of the use cases and validation procedure occurred at a single PaTH site. The study was conducted in accordance with the amended Declaration of Helsinki. The protocol was reviewed and approved by the Institutional Review Board at Penn State Milton S. Hershey Medical Center, STUDY00006433.

Determination of Use Cases and Associated Gold Standards: We created a logic diagram and standardized chart review procedure based on the international consensus statement diagnostic algorithm (13). We established two use cases, each with its own gold standard definition for subsequent evaluation. The Inclusive Use Case had a gold standard definition of IPF, FPF, and CPFE. The Restrictive Use Case had a gold standard definition of only IPF, excluding FPF and CPFE. The rationale for these definitions was based on the recognition that pharmaceutical studies aim for a more homogeneous population whereas patient registries aim to characterize the full spectrum of the disorder.

Query Strategy: The computable phenotype was alpha tested at Milton S. Hershey Medical Center (HMC), the Penn State-affiliated academic medical center in Central Pennsylvania. The source population for this validation study included all patients who had any clinical encounter recorded in the HMC electronic health record (Cerner®) in the pre-ICD-10 era, between January 1, 2011 and December 31, 2015. The computable phenotype criteria were translated into a SQL query consistent with the local database schema.

Duplicate Chart Review and Adjudication: Two reviewers independently assessed the charts of 100% of people identified by the IPF computable phenotype. Reviewers included a board-certified pulmonologist with expertise in IPF, a general internist/PaTH IPF Working Group member, and an IPF clinical research coordinator. Reviewers were instructed to search for and, if present, review each patient’s initial outpatient pulmonary sub-specialty consultation note originating from this health care system. Reviewers also searched for and reviewed associated diagnostic studies, including lung pathology reports and chest CT scans. We did not review archived records from other health care systems.

Based on the information gleaned, each reviewer independently categorized each patient as having IPF, FPF, CPFE, other pulmonary fibrosis (including NSIP, radiation-induced ILD, occupational or environmental ILD, connective tissue-associated ILD, granulomatous disease), or non-ILD. In cases where they concluded that there was insufficient information in the EHR, they listed the diagnosis as unknown. The diagnosis of IPF required exclusion of other known causes of ILD, as well as a radiographic pattern of usual interstitial pneumonia (UIP) suggested on high resolution computed tomography (HRCT) and/or a histologic pattern of UIP on surgical lung biopsy (SLB) (13).

The reviewers convened to compare diagnostic assignments. Where there was disagreement, a third reviewer also independently performed a chart review. If disagreement persisted, charts were reviewed collectively and iteratively until consensus was reached. The resulting diagnosis represented the consensus diagnosis. Patient race and sex as listed on the EHR patient information sheet were recorded and were compared with published distributions. Patient age was defined as age at time of the reviewed pulmonary subspecialty consultation note.
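The adjudication sequence can be sketched in Python as follows. This is an illustration rather than the study's procedure: the function name `consensus_diagnosis` is hypothetical, and the rule that a third reviewer who matches one of the first two settles the case is an assumption, since the text states only that persistent disagreement led to collective review.

```python
from collections import Counter
from typing import Optional


def consensus_diagnosis(first: str, second: str,
                        third: Optional[str] = None) -> Optional[str]:
    """Return the agreed category, or None if further review is needed."""
    if first == second:
        return first  # the two independent reviewers agree
    if third is None:
        return None  # disagreement: a third independent review is required
    # Assumption: agreement between the third reviewer and either of the
    # first two reviewers resolves the case.
    label, votes = Counter([first, second, third]).most_common(1)[0]
    return label if votes >= 2 else None  # None: collective review


print(consensus_diagnosis("IPF", "IPF"))           # IPF
print(consensus_diagnosis("IPF", "CPFE"))          # None
print(consensus_diagnosis("IPF", "CPFE", "IPF"))   # IPF
print(consensus_diagnosis("IPF", "CPFE", "FPF"))   # None
```

A return value of `None` marks a chart that still needs the next step of review, whether that is the third reviewer or the collective, iterative review.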

Statistical Analysis: We separately determined the performance of the computable phenotype for the Inclusive Use Case and the Restrictive Use Case, describing the performance with the positive predictive value (PPV): true positives/(true positives + false positives) (26). A statistical comparison of true positives and false positives for each of the Inclusive and Restrictive Use Cases used Fisher’s exact test for categorical factors (i.e., race and sex) and a two-sample Student’s t test for age. A two-sided α less than 0.05 was considered statistically significant. Statistical tests used SAS® version 9.4 (SAS Institute, Inc., Cary, NC).

Evaluation of Fit/Utility for Intended Use: The Inclusive Use Case is intended for epidemiologic studies of the full spectrum of IPF. The Restrictive Use Case is intended for clinical trial recruitment, in which early-onset pulmonary fibrosis and concurrent emphysema are excluded.

We evaluated approximately 588,000 patients who had an EHR entry at our institution between January 1, 2011 and December 31, 2015. The PaTH IPF computable phenotype identified 157 patients (“Test Positives,” The IPF Cohort, Figure 1), an estimated population prevalence of 26.7/100,000.

The chart review validation identified 74 people as having IPF, 9 people as having CPFE, and 6 people as having FPF. The validation further identified 44 people as having an alternate interstitial lung disease (not IPF, FPF or CPFE), 6 people as having no interstitial lung disease, and 18 people with insufficient information for classification.

Evaluation of the computable phenotype for the Inclusive Use Case (gold standard of IPF, FPF, or CPFE) identified 89 True Positives for a positive predictive value of 57% (89/157) and an estimated population prevalence of 15.1/100,000. Evaluation for the Restrictive Use Case (gold standard of IPF, and not FPF nor CPFE), identified 74 True Positives for a PPV of 47% (74/157) and an estimated population prevalence of 12.6/100,000 (Figure 2).
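The reported PPVs and prevalence estimates can be reproduced from the chart-review counts with a short Python sketch. The variable and function names are illustrative. The source population is taken as the approximate figure of 588,000, and patients with an unknown diagnosis are counted as false positives, which is consistent with the denominator of 157 used above.

```python
SOURCE_POPULATION = 588_000  # approximate figure reported in the text

# Consensus diagnoses among the 157 test positives.
chart_review_counts = {
    "IPF": 74,
    "CPFE": 9,
    "FPF": 6,
    "other ILD": 44,
    "no ILD": 6,
    "unknown": 18,
}

# Gold standard definition for each use case.
GOLD_STANDARDS = {
    "Inclusive": {"IPF", "FPF", "CPFE"},
    "Restrictive": {"IPF"},
}


def evaluate(counts, gold_standard, population):
    """Return (true positives, PPV, prevalence per 100,000)."""
    test_positives = sum(counts.values())
    true_positives = sum(n for dx, n in counts.items() if dx in gold_standard)
    ppv = true_positives / test_positives
    prevalence_per_100k = true_positives / population * 100_000
    return true_positives, ppv, prevalence_per_100k


for name, gold in GOLD_STANDARDS.items():
    tp, ppv, prev = evaluate(chart_review_counts, gold, SOURCE_POPULATION)
    print(f"{name}: TP={tp}, PPV={ppv:.0%}, prevalence={prev:.1f}/100,000")
# Inclusive: TP=89, PPV=57%, prevalence=15.1/100,000
# Restrictive: TP=74, PPV=47%, prevalence=12.6/100,000
```

Because the gold standard is a parameter, a further use case can be evaluated against the same chart-review counts without rerunning the phenotype.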

Table 2 shows the demographics of the source population and the three cohorts. The source population averaged 50 ± 20 years, with 75% of white race and 44% male sex. The age of the three cohorts averaged 72 to 75 years, with 88–90% of white race, and male sex ranging from 60–64%. The mean age among the 157 patients in the IPF Cohort was 72 years compared with a mean age of 73 for the Inclusive Use Case and 75 years for the Restrictive Use Case. These patterns persisted after removing the n=18 patients with an unknown diagnosis (insufficient information).

Recruitment Efficiency: We calculated that recruitment of persons with IPF beginning with an unfiltered EHR source population would require reviewing 3745 charts to identify a single individual with IPF (157/588,000). Recruitment beginning with persons from the IPF Cohort (based on the computable phenotype) would require reviewing 2 charts to identify a single individual who would meet the intended use. The enrichment of the candidate pool from 0.3% to 50% is a marked efficiency gain (Figure 2).
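The charts-per-case figures follow from simple ratios, sketched here in Python. The helper name `charts_per_case` is illustrative, and the counts are those reported in the Results.

```python
def charts_per_case(pool_size: int, cases: int) -> float:
    """Average number of charts reviewed per case found in the pool."""
    return pool_size / cases


# Unfiltered source population, using the ratio given in the text.
unfiltered = charts_per_case(588_000, 157)

# IPF Cohort (157 test positives) for each use case.
inclusive = charts_per_case(157, 89)
restrictive = charts_per_case(157, 74)

print(f"{unfiltered:.1f} {inclusive:.1f} {restrictive:.1f}")
# 3745.2 1.8 2.1
```

Rounded to whole charts, these are the 3745 and 2 charts cited above.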

This study describes the execution of an updated computable phenotype for the rare disease IPF using PCORnet data infrastructure. We demonstrate how a single computable phenotype can be evaluated for different use cases, each with its predefined gold standard. This provides a flexible approach to meet diverse clinical research needs: the Restrictive Use Case (consensus diagnosis of IPF but not FPF or CPFE) is commonly used in pharmaceutical research studies while the Inclusive Use Case (consensus diagnosis of IPF, FPF, or CPFE) allows more comprehensive characterization of the spectrum of IPF.

Using duplicate chart review and adjudication of 100% of the cases, the computable phenotype showed a positive predictive value of 57% for the Inclusive Use Case and 47% for the Restrictive Use Case. This accomplishes a marked improvement in recruitment efficiency from 0.3% (source population) to 50% (computable phenotype sub-population) when the task is to identify candidate individuals within the EHR for specific purposes such as clinical trials. However, individual-level chart adjudication is still necessary.

Our IPF population prevalence estimates (12.6–26.7/100,000) are concordant with previously published US estimates of 14–43/100,000 in a large health plan (11) and 2–29/100,000 in the general population (13); they are lower than the estimates of 276–725/100,000 in a population of US veterans, possibly attributable to high levels of exposure to risk factors in that population (14). A comprehensive dataset from Quebec, Canada estimates a prevalence of 78.4/100,000 (27). International estimates compiled by Maher et al (28) demonstrate geographic variation in prevalence: 5.7–45.1/100,000 in Asia Pacific; 3.3–25.1/100,000 in Europe; and 24–29.8/100,000 in North America; with an adjusted prevalence of 3.3–45.1/100,000 globally.

The comparative demographics of the source population and IPF Computable Phenotype cohort align with established demographic characteristics of this disorder (13). Further, the slight rise in mean age seen with the Restrictive Use Case compared with the Inclusive Use Case (Table 2) fits with the acknowledged presentation of FPF at a younger age than sporadic IPF. Thus we addressed the fitness/utility of this computable phenotype and found it to be favorable through a multidimensional qualitative assessment (5).

The ideal computable phenotype will identify people accurately and automatically. For IPF, this ideal does not yet exist. There is no single biomarker for IPF (29–31) and the current most accurate diagnostic approach is multidisciplinary (32). Absent this ideal, current computable phenotype methodologies offer flexible, agile approaches to harnessing the power of the EHR for clinical research. As demonstrated in this paper, it was possible to develop a single computable phenotype and ultimately have that work for different purposes based on using different gold standards for different use cases. Whether the goal is to look at disease burden broadly or to find people to contact and recruit, this EHR-based approach is feasible.

Computable phenotypes are of growing interest to the rare-disease research community. Rare diseases are defined in the U.S. as those affecting fewer than 200,000 people (33). Although individually rare, these diseases collectively affect 25–30 million Americans, of whom an estimated 1–2 million have a primary lung disorder (34). Rare diseases in Europe are defined as those affecting fewer than 1 in 2000 individuals and collectively affect 30 million European Union citizens (35). There is an “unprecedented interest” from industry to study IPF (36) and other ILDs resulting from an accelerated interest in pharmacotherapy for IPF in the last 25 years (37).

Our work builds on and extends the work of previous investigators. Esposito et al (8) developed claims-based algorithms to identify IPF using the 2011 criteria, thereby estimating IPF incidence and prevalence in the US. This approach compared three different algorithms to identify patients from a claims database comprising 14 million persons enrolled for at least 6 months between 2006 and 2012. The dataset excluded persons under 50 or over 100 years and required at least 1 physician diagnosis of IPF and the absence of an alternative diagnosis. Three increasingly narrow algorithms identified n = 4598 (broad case identifying algorithm), n = 2052 (narrow case identifying algorithm), and n = 1354 (IPF score algorithm) persons. The PPV rose from 44–62% to 76% using the treating clinician’s diagnosis and from 54–58% to 83% using the expert clinician’s diagnosis. Acknowledged limitations include that the dataset included only commercially insured patients, limiting generalizability. Medical record review was used as the gold standard without independent review of HRCT and pathology specimens. Clinical adjudicators determined both the treating clinician’s diagnosis and their own diagnosis. The adjudication of the IPF score was performed on only 3.7% of cases (n = 50). This was judged to be a poorly performing algorithm (38) for estimating disease incidence and prevalence.

Ley et al evaluated a computable phenotype for IPF using a single payer health maintenance organization, reviewing/validating through chart review of a 10% sample (15). They evaluated two algorithms: the IPF algorithm required age over 18 years and a diagnosis of IPF in the absence of an alternative diagnosis. Their broader IIP algorithm required age greater than 50 and a diagnosis of IPF or the less specific IIP while excluding those with an alternative diagnosis. Through adjudication of two random samples of n = 75 cases for the IPF algorithm and one random sample of n = 75 for the IIP algorithm, they found a PPV of 42.2% and 12% for the IPF and IIP algorithms, respectively. Our computable phenotype for IPF does not exclude prior or subsequent diagnosis of IIP, but rather estimates their importance through the chart review process for specific use cases. Among the false positive cases identified by our algorithm, there were n = 44 with other ILDs [other pulmonary fibrosis, not otherwise classifiable (n = 17); non-specific interstitial pneumonia (n = 8); connective tissue-associated ILDs (n = 6); occupational, radiation or drug-induced ILD (n = 6); granulomatous diseases (n = 3); respiratory bronchiolitis-ILD (n = 2); cryptogenic organizing pneumonia (n = 1), and hypersensitivity pneumonitis (n = 1)]. Our algorithm is also revised from that of Ley et al recognizing the increasing importance of the antisynthetase syndromes and undifferentiated connective tissue disease.

The present study offers a framework for computable phenotypes that differs from those used to date for IPF in that it accounts for varying use cases. Previous investigators compared distinct computable phenotypes (e.g., broad and narrow), aiming for estimates of disease prevalence or burden (8, 11, 12, 15). For the use case of incidence/prevalence estimates, a PPV of 50% is considered poor (38). However, for the purposes of recruitment efficiency and developing a candidate pool for clinical trials, this approach has merit. Prior studies and ours, taken together, indicate that IPF computable phenotypes can be applied to diverse geographic areas, payer mixes and EHR systems to identify people with IPF with a PPV of 40–50%. This represents a marked gain in recruitment efficiency compared to beginning with an unfiltered EHR pool. It also reduces, although does not eliminate, selection bias that often exists in clinical trial recruitment.

Strengths of our study include the source population from a health system comprising multiple insurers as well as the uninsured. By using a health system as the source population, instead of a claims database, we included all individuals regardless of insurance enrollment status. We utilized PCORnet’s Common Data Model (21) architecture, which makes the computable phenotype portable across PCORnet. Applied widely, this EHR-based computable phenotype will provide information complementary to claims-based studies (8, 11, 12). Another strength was building from a previously-published algorithm (11), which allows the results to be compared to other epidemiologic estimates based on a similar case definition. Strengths of our validation procedure included a chart review process with the use of consensus diagnostic criteria (13), two independent reviewers and a reconciliation process for disagreements. We were also able to perform a chart review validation, an opportunity not present when working with claims databases. We evaluated the computable phenotype with two predefined use cases, and validated each with chart review performed on 100% of the identified patients.

An inherent limitation of this and other studies (8, 15) is the inability to ascertain the true prevalence of IPF in our population, due to the practical barriers to identifying the false negatives from the source population. Based on published estimates of IPF prevalence (8, 11, 12, 28), our population of 588,000 patients might contain between 74 and 249 people with IPF. We identified 89 cases, confirmed by chart review. A well-designed population study could address this limitation but would be impractical due to the cost (39); we calculate that we would need to review 3675 charts at random to find one of the possible 160 missed diagnoses. This is a limitation shared by EHR-based rare disease research. An advantage of the detailed chart-level validation is the ability to estimate the magnitude of misclassification, which can be of value in interpreting claims-based studies.
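The chart-review burden quoted above can be checked with a brief Python sketch. The variable names are illustrative, and the upper bound of 249 expected cases is taken directly from the text rather than derived here.

```python
source_population = 588_000   # approximate source population
upper_expected_cases = 249    # upper bound quoted in the text
confirmed_cases = 89          # confirmed by chart review

# Cases the phenotype may have missed if the upper bound holds.
possible_missed = upper_expected_cases - confirmed_cases

# Average number of randomly selected charts per missed case found.
charts_per_missed_case = source_population / possible_missed

print(possible_missed, charts_per_missed_case)  # 160 3675.0
```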

Data for this single-center study were from a tertiary care academic medical center, which limits the generalizability of the results. However, the PCORnet CDM data architecture was chosen as it is standardized for use across more than 60 sites and 66 million people in the US, including diverse EHR systems, payer mixes and care delivery settings (40). Despite the common data architecture, analyses in other EHRs and/or in other health systems will be needed. Variability in EHR and coding practices may influence the performance of this computable phenotype in other health systems. As with any computable phenotype, the results are also limited by the quality of EHR data. Our computable phenotype was also limited by the data available. Addition of variables including HRCT findings, as identified by natural language processing, would likely improve the PPV substantially but possibly at the cost of sensitivity. As personalized medicine identifies biomarkers predicting responsiveness to specific therapies, these can also be incorporated into future computable phenotypes (41, 42).

The field of IPF research and treatment is changing to reflect a multi-targeted approach potentially with combination therapies (42) and precision medicine (41). EHR and claims-based studies have complementary but distinct advantages. For a rare disease, it becomes imperative to have efficient methods to identify potential study participants. The use of a computable phenotype within the EHR allows for identification of a source population for clinical trial recruitment and may help to address the acknowledged need to find people who can participate in evaluating new therapies. Chart review is still needed as a gold standard due to the integration of tests and history required for diagnosis of IPF and other disorders without a single biomarker. The positive predictive value calculated in this and other studies is poor for estimates of incidence and prevalence but excellent for recruitment efficiency. Thus, chart review is still necessary but many fewer charts need to be reviewed to identify eligible participants. In this way, computable phenotypes represent part of the approach to connecting people with ILD to clinical trials (43–48) and, as they become available, personalized therapies.

Future applications of the computable phenotype in EHR-based populations include ICD-10 coding, measures of disease severity, and changes in disease management, i.e., inclusion of newly-available therapies. Work is also needed to assess the broader landscape of fibrotic lung disease, such as identifying cases of progressive fibrotic interstitial lung disease. Additional uses for computable phenotype populations in IPF and fibrotic disease more broadly include biomarker-based studies and evaluation of practice variation and clinical outcomes (38). These applications provide an opportunity to evaluate computable phenotypes for additional use cases based on an understanding of disease patterns and the researchers’ goals for the computable phenotype.

The use of a computable phenotype applied to an EHR population allows investigators to “shrink the haystack” when searching for a rare disease such as IPF. This search strategy aims for sensitivity at the level of the computable phenotype and then specificity through gold standard validation for specific use cases. Chart review remains necessary for validation; however, this approach yields a marked gain in recruitment efficiency, from 0.3% to 50%, and allows for flexibility in use cases for the computable phenotype depending on study needs. This approach, demonstrated using the PCORnet Common Data Model, can be a valuable tool for multicenter electronic health record (EHR)-based research.

ATS: American Thoracic Society

COPD: chronic obstructive pulmonary disease

CPFE: combined pulmonary fibrosis and emphysema

EHR: electronic health record

FPF: familial pulmonary fibrosis

HMC: Hershey Medical Center

HRCT: high resolution computed tomography

ILD: interstitial lung disease

IPF: idiopathic pulmonary fibrosis

PCORI: Patient-Centered Outcomes Research Institute

PCORnet: Patient-Centered Clinical Research Network

PPV: positive predictive value

SLB: surgical lung biopsy

UIP: usual interstitial pneumonia

Ethics approval and consent to participate

This study was conducted in accordance with the amended Declaration of Helsinki. The protocol was reviewed and approved by the Institutional Review Board at Penn State Milton S. Hershey Medical Center, STUDY00006433. The protocol was approved with a waiver of informed consent from the Institutional Review Board.

Consent for publication

Not applicable.

Availability of data and materials

The datasets generated and analyzed during the current study are not publicly available due to the presence of Protected Health Information. A modified, de-identified dataset is available from the corresponding author on reasonable request.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (PCORI CDRN #1306-04912) for development of the National Patient-Centered Clinical Research Network, known as PCORnet.
PCORnet had no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Authors' contributions

Conception or design of the work: AEFD, CHC, RB
Acquisition of data: AEFD, CHC, RB
Analysis and interpretation of data: all authors
Drafting manuscript and critical revisions: all authors
Final approval of submitted version: all authors
Accountable for accuracy and integrity of the work: all authors

Acknowledgements

Jody McCullough
Jim Carns and Jim Uhrig, IPF Patient Partners
Francis C. Cordova, MD
Kathleen O. Lindell, PhD, RN
Kevin F. Gibson, MD
Herbert Y. Reynolds, MD

Conflict of Interest Statements:

No conflicts of interest

Notification of prior abstract/publication:

Presented at the American Thoracic Society meeting, Denver, CO (2015).

Dimmock AEF, Chuang CH, Bhattacharjee S, Meck DS, Bascom R. Evaluation of a Computable Phenotype for Identification of Patients with Idiopathic Pulmonary Fibrosis. Am J Resp Crit Care Med, 2015. Presented at the 2015 American Thoracic Society Meeting, Denver, CO.

References

Adler-Milstein J, Jha AK. HITECH Act drove large gains in hospital electronic health record adoption. Health Affairs. 2017;36(8):1416-22.
Office of the National Coordinator for Health Information Technology. Non-federal Acute Care Hospital Electronic Health Record Adoption [updated September 2017]. Available from: dashboard.healthit.gov/quickstats/pages/FIG-Hospital-EHR-Adoption.php.
OECD/EU. Resilience: Innovation, efficiency and fiscal sustainability: Adoption and use of electronic medical records and eprescribing. Health at a Glance: Europe 2018: State of Health in the EU Cycle. Paris: OECD Publishing; 2018. p. 192-3.
MSI LF, Dorene Markel MS M. Computable Phenotypes: Standardized Ways to Classify People Using Electronic Health Record Data. Perspectives in Health Information Management. 2018:1-8.
Richesson RL, Smerek MM. Electronic Health Records-Based Phenotyping. 2014. Available from: http://sites.duke.edu/rethinkingclinicaltrials/ehr-phenotyping/.
Richesson RL, Smerek MM, Blake Cameron C. A Framework to Support the Sharing and Reuse of Computable Phenotype Definitions Across Health Care Delivery and Clinical Research Applications. EGEMS (Washington, DC). 2016;4(3):1232.
Gliklich R, Dreyer N, Leavy M. Registries for Evaluating Patient Outcomes: A User's Guide [Internet]. 3rd ed. Rockville, MD: Agency for Healthcare Research and Quality (US); 2014.
Esposito DB, Lanes S, Donneyong M, Holick CN, Lasky JA, Lederer D, et al. Idiopathic pulmonary fibrosis in United States automated claims. Incidence, prevalence, and algorithm validation. American journal of respiratory and critical care medicine. 2015;192(10):1200-7.
Stewart RR, Dimmock AE, Green MJ, Van Scoy LJ, Schubart JR, Yang C, et al. An analysis of recruitment efficiency for an end-of-life advance care planning randomized controlled trial. American Journal of Hospice and Palliative Medicine®. 2019;36(1):50-4.
Haneuse S, Daniels M. A General Framework for Considering Selection Bias in EHR-Based Studies: What Data Are Observed and Why? EGEMS (Washington, DC). 2016;4(1):1203.
Raghu G, Weycker D, Edelsberg J, Bradford WZ, Oster G. Incidence and prevalence of idiopathic pulmonary fibrosis. American journal of respiratory and critical care medicine. 2006;174(7):810-6.
Pérez ERF, Daniels CE, Sauver JS, Hartman TE, Bartholmai BJ, Eunhee SY, et al. Incidence, prevalence, and clinical course of idiopathic pulmonary fibrosis: a population-based study. Chest. 2010;137(1):129-37.
Raghu G, Collard HR, Egan JJ, Martinez FJ, Behr J, Brown KK, et al. An official ATS/ERS/JRS/ALAT statement: idiopathic pulmonary fibrosis: evidence-based guidelines for diagnosis and management. American journal of respiratory and critical care medicine. 2011;183(6):788-824.
Kaul B, Lee JS, Zhang N, Vittinghoff E, Sarmiento K, Collard HR, et al. Epidemiology of Idiopathic Pulmonary Fibrosis among U.S. Veterans, 2010–2019. Annals of the American Thoracic Society. 2022;19(2):196-203.
Ley B, Urbania T, Husson G, Vittinghoff E, Brush DR, Eisner MD, et al. Code-based Diagnostic Algorithms for Idiopathic Pulmonary Fibrosis. Case Validation and Improvement. Ann Am Thorac Soc. 2017;14(6):880-7.
Raghu G, Remy-Jardin M, Myers JL, Richeldi L, Ryerson CJ, Lederer DJ, et al. Diagnosis of idiopathic pulmonary fibrosis. An official ATS/ERS/JRS/ALAT clinical practice guideline. American journal of respiratory and critical care medicine. 2018;198(5):e44-e68.
PaTH Network. About the PaTH Network. 2016. Available from: http://pathnetwork.org/about/.
Amin W, Tsui FR, Borromeo C, Chuang CH, Espino JU, Ford D, et al. PaTH: towards a learning health system in the Mid-Atlantic region. JAMIA. 2014;21(4):633-6.
PCORnet. About PCORnet. 2018 [updated February 13, 2018]. Available from: https://pcornet.org/about-pcornet/.
Collins FS, Hudson KL, Briggs JP, Lauer MS. PCORnet: turning a dream into reality. Journal of the American Medical Informatics Association. 2014;21(4):576-7.
PCORnet: The National Patient-Centered Clinical Research Network. Data-Driven. Available from: https://pcornet.org/data-driven-common-model/.
Esposito D, Lanes S, Deshpande G, Holick CN, Mines D, O'Quinn S, et al. Identification and Confirmation of IPF Cases in an Electronic Insurance Claims Database. OMOP-IMEDS Symposium 2013; Bethesda, MD; 2013.
Coultas DB, Zumwalt RE, Black WC, Sobonya RE. The epidemiology of interstitial lung diseases. Am J Respir Crit Care Med. 1994;150(4):967-72.
Pinal-Fernandez I, Casal-Dominguez M, Huapaya JA, Albayda J, Paik JJ, Johnson C, et al. A longitudinal cohort study of the anti-synthetase syndrome: increased severity of interstitial lung disease in black patients and patients with anti-PL7 and anti-PL12 autoantibodies. Rheumatology. 2017;56(6):999-1007.
Hallowell RW, Danoff SK. Treatment of Interstitial Lung Disease Associated With Myositis and the Anti-Synthetase Syndrome. Current Treatment Options in Rheumatology. 2018;4(4):316-28.
Altman DG, Bland JM. Statistics Notes: Diagnostic tests 2: predictive values. BMJ. 1994;309(6947):102.
Tarride JE, Hopkins RB, Burke N, Guertin JR, O'Reilly D, Fell CD, et al. Clinical and economic burden of idiopathic pulmonary fibrosis in Quebec, Canada. Clinicoecon Outcomes Res. 2018;10:127-37.
Maher TM, Bendstrup E, Dron L, Langley J, Smith G, Khalid JM, et al. Global incidence and prevalence of idiopathic pulmonary fibrosis. Respiratory Research. 2021;22(1):197.
Drakopanagiotakis F, Wujak L, Wygrecka M, Markart P. Biomarkers in idiopathic pulmonary fibrosis. Matrix Biology. 2018;68:404-21.
Ley B, Brown KK, Collard HR. Molecular biomarkers in idiopathic pulmonary fibrosis. American Journal of Physiology-Lung Cellular and Molecular Physiology. 2014;307(9):L681-L91.
Zhang Y, Kaminski N. Biomarkers in idiopathic pulmonary fibrosis. Current opinion in pulmonary medicine. 2012;18(5):441.
Walsh SLF, Maher TM, Kolb M, Poletti V, Nusser R, Richeldi L, et al. Diagnostic accuracy of a clinical diagnosis of idiopathic pulmonary fibrosis: an international case–cohort study. European Respiratory Journal. 2017;50(2):1700936.
Orphan Drug Act, Pub. L. No. 97-414, 96 Stat. 2049 (1983).
McCormack FX. Rare Lung Diseases. In: Schraufnagel DE, editor. Breathing in America: Diseases, Progress, and Hope: American Thoracic Society; 2010. p. 185-96.
EURORDIS Rare Diseases Europe. What is a rare disease? [updated June 14, 2019]. Available from: https://www.eurordis.org/content/what-rare-disease.
Gibson KF, Kass DJ. Clinical Trials in Idiopathic Pulmonary Fibrosis in the “Posttreatment Era”. JAMA. 2018;319(22):2275-6.
Raghu G. Idiopathic pulmonary fibrosis: lessons from clinical trials over the past 25 years. European Respiratory Journal. 2017;50(4):1701209.
Farrand E, Anstrom KJ, Bernard G, Butte AJ, Iribarren C, Ley B, et al. Closing the Evidence Gap in Interstitial Lung Disease. The Promise of Real-World Data. American journal of respiratory and critical care medicine. 2019;199(9):1061-5.
Lee CD, Williams SE, Sathe NA, McPheeters ML. A systematic review of validated methods to capture several rare conditions using administrative or claims data. Vaccine. 2013;31(Suppl 10):K21-K7.
PCORnet. About PCORnet [updated February 13, 2018]. Available from: https://pcornet.org/about-pcornet/.
Brownell R, Kaminski N, Woodruff PG, Bradford WZ, Richeldi L, Martinez FJ, et al. Precision medicine: the new frontier in idiopathic pulmonary fibrosis. American journal of respiratory and critical care medicine. 2016;193(11):1213-8.
Rangarajan S, Locy ML, Luckhardt TR, Thannickal VJ. Targeted Therapy for Idiopathic Pulmonary Fibrosis: Where To Now? Drugs. 2016;76(3):291-300.
Fischer A, Distler J. Progressive fibrosing interstitial lung disease associated with systemic autoimmune diseases. Clinical rheumatology. 2019;38(10):2673-81.
Maher TM, Corte TJ, Fischer A, Kreuter M, Lederer DJ, Molina-Molina M, et al. Pirfenidone in patients with unclassifiable progressive fibrosing interstitial lung disease: a double-blind, randomised, placebo-controlled, phase 2 trial. The Lancet Respiratory medicine. 2020;8(2):147-57.
King Jr TE, Bradford WZ, Castro-Bernardini S, Fagan EA, Glaspole I, Glassberg MK, et al. A phase 3 trial of pirfenidone in patients with idiopathic pulmonary fibrosis. N Engl J Med. 2014;370(22):2083-92.
Richeldi L, du Bois RM, Raghu G, Azuma A, Brown KK, Costabel U, et al. Efficacy and safety of nintedanib in idiopathic pulmonary fibrosis. N Engl J Med. 2014;370(22):2071-82.
Cottin V. Treatment of progressive fibrosing interstitial lung diseases: a milestone in the management of interstitial lung diseases. Eur Respiratory Soc; 2019.
Flaherty KR, Wells AU, Cottin V, Devaraj A, Walsh SL, Inoue Y, et al. Nintedanib in progressive fibrosing interstitial lung diseases. New England Journal of Medicine. 2019;381(18):1718-27.

Table 1: Exclusionary ICD-9 Codes

ICD-9-CM Code	Description
135	Sarcoidosis
237.7	Neurofibromatosis
272.7	Lipidoses
277.3	Amyloidosis
277.8	Other specified disorders of metabolism—includes eosinophilic granuloma
279.49	Antisynthetase syndromes*
446.21	Goodpasture’s syndrome
446.4	Wegener’s granulomatosis
495.x (includes 495.0-495.9)	Extrinsic allergic alveolitis
500	Coal worker’s pneumoconiosis
501	Asbestosis
502	Pneumoconiosis due to other silica or silicates
503	Pneumoconiosis due to other inorganic dust
504	Pneumoconiosis due to inhalation of other dust
505	Pneumoconiosis, unspecified
506.4	Chronic respiratory conditions due to fumes or vapors
508.1	Chronic and other pulmonary manifestations due to radiation
508.8	Respiratory conditions due to other specified external agents
516.0	Pulmonary alveolar proteinosis
516.1	Idiopathic pulmonary hemosiderosis
516.2	Pulmonary alveolar microlithiasis
516.8	Other specified alveolar and parietoalveolar pneumonopathies
516.9	Unspecified alveolar and parietoalveolar pneumonopathies
517.2	Lung involvement in systemic sclerosis
517.8	Lung involvement in other diseases classified elsewhere
518.3	Pulmonary eosinophilia
555.x (includes 555.0, 555.1, 555.2, 555.9)	Regional enteritis
710.0	Systemic lupus erythematosus
710.1	Systemic sclerosis
710.2	Sjogren’s disease
710.3	Dermatomyositis
710.4	Polymyositis
710.9	Undifferentiated connective tissue disease*
714.81	Rheumatoid lung
720.0	Ankylosing spondylitis
759.5	Tuberous sclerosis

Adapted from Raghu et al., Am J Respir Crit Care Med 2006. An asterisk (*) indicates a diagnosis code added by PaTH that was not part of the original list by Raghu et al.
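
As an illustration of how an exclusion list such as Table 1 might be applied in code, the minimal Python sketch below checks a patient's ICD-9-CM codes against a small subset of the table. The function names, the choice of subset, and the prefix-matching rule for the "x" entries are assumptions made for illustration; this is not the query logic executed by PaTH.

```python
# Illustrative subset of Table 1; a real query would load the full list.
EXCLUSION_EXACT = {"135", "237.7", "279.49", "501", "516.0", "710.1", "714.81"}

# Entries written as "495.x" and "555.x" in Table 1 are treated here as
# prefix matches (an assumption about how the wildcard would be implemented).
EXCLUSION_PREFIXES = ("495.", "555.")


def is_excluded(code: str) -> bool:
    """Return True if a single ICD-9-CM code appears on the exclusion list."""
    code = code.strip()
    return code in EXCLUSION_EXACT or code.startswith(EXCLUSION_PREFIXES)


def has_exclusionary_code(patient_codes) -> bool:
    """Return True if any of the patient's codes is exclusionary."""
    return any(is_excluded(c) for c in patient_codes)
```

For example, `has_exclusionary_code(["401.9", "710.1"])` returns `True` because 710.1 (systemic sclerosis) is on the list, so that patient would be removed from the candidate cohort under this sketch.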

Table 2: Demographic Characteristics of the Study Populations.

Population	Male Sex, N (%)	White Race, N (%)	Age, years (mean ± SD)
Full EHR Source Population	258,585 (44%)	440,710 (75%)	50 ± 20.8
IPF Cohort (Computable Phenotype) (n=157)	94 (60%)	140 (89%)	72 ± 13.8
Inclusive Gold Standard (IPF, FPF, CPFE)
IPF (n=89)	57 (64%)	80 (90%)	73 ± 11.8
Not IPF (n=68)	37 (54%)	60 (88%)	70 ± 15.9
Restrictive Gold Standard (IPF only)
IPF (n=74)	46 (62%)	65 (88%)	75 ± 9.9*
Not IPF (n=83)	48 (59%)	78 (90%)	69 ± 16.0

The EHR Source Population includes all patients with a clinical encounter in the HMC electronic health record (EHR) between January 1, 2011 and December 31, 2015. The IPF Cohort includes all patients identified by the computable phenotype. The Inclusive Gold Standard comprises all patients with a consensus diagnosis of IPF, FPF, or CPFE. The Restrictive Gold Standard comprises only patients with a consensus diagnosis of IPF (excludes FPF and CPFE). *p<0.01 for the comparison of “IPF” with “Not IPF” after gold standard chart review.
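
The group sizes in Table 2 are sufficient to reproduce the positive predictive value (PPV) for each use case. The short Python sketch below does the arithmetic; the function and variable names are illustrative and are not taken from the study's analysis code.

```python
def ppv(confirmed: int, not_confirmed: int) -> float:
    """Positive predictive value: confirmed cases / all cases flagged by the phenotype."""
    return confirmed / (confirmed + not_confirmed)


# Counts from Table 2 ("IPF" vs. "Not IPF" after gold standard chart review).
inclusive_ppv = ppv(89, 68)    # Inclusive Use Case: IPF, FPF, or CPFE
restrictive_ppv = ppv(74, 83)  # Restrictive Use Case: IPF only

print(f"Inclusive Use Case PPV: {inclusive_ppv:.0%}")
print(f"Restrictive Use Case PPV: {restrictive_ppv:.0%}")
```

Both denominators equal the 157 patients in the IPF Cohort, so the two values round to 57% and 47%.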
