2.1. Study population
We used the IBM Explorys clinical data set, which has more than 50 million patient records pooled from different health-care systems with EMRs (The IBM Explorys Network) [6]. The data were standardized and normalized using common ontologies, searchable through a HIPAA-enabled, de-identified cloud-computing platform. Patients were seen in multiple health-care systems between January 1, 1999, and December 31, 2015, with a combination of data from clinical EMRs, outgoing health-care system bills, and adjudicated payor claims.
We first defined a broad cohort of patients suspected of having liver disease (any diagnosis, test, or procedure related to the liver) or with at least one elevated triglyceride laboratory measurement. This definition resulted in a population of approximately 5 million patients; these included patients diagnosed with a liver-related disease as well as patients who underwent liver evaluations (e.g., biopsy, computed tomography) with no liver-related disease found. Given our objective to identify physiological and comorbid biomarkers, we excluded all patients with either a diagnosis code or a laboratory result indicative of viral hepatitis or human immunodeficiency virus; consequently, approximately 11,000 patients were removed. Patients at high probability for NAFLD were identified using a classification algorithm that members of our group have previously validated [7–9]. By applying this algorithm to the EMR database of patients with a suspected liver-disease-related finding (the population of 5 million), we identified the first date on which each patient developed a high probability for NAFLD (index date), resulting in a population of 334,258 patients. We selected only patients who were over the age of 18 at the index date and had at least 5 years of follow-up after the index date based on encounter entries (e.g., office visit, admission, emergency room visit, or observation). This selection yielded a population of 81,911 patients with a high probability for NAFLD. Table 1 presents patient characteristics for our cohort.
We extracted all disease outcomes during the 5-year follow-up and mapped them using the Clinical Classifications Software (CCS) categories based on the International Classification of Diseases (ICD) 9th and 10th revisions [10]. Keeping only those that were prevalent in at least 1% of the NAFLD population yielded 174 unique disease categories to be used as binary-variable outcomes in our analyses. All correlations between covariates and outcomes considered are presented in Supplementary Table 1 and Supplementary Figs. 1 and 2.
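The 1% prevalence filter described above can be illustrated with a minimal sketch. It assumes a hypothetical input layout (a mapping from patient ID to the set of CCS categories recorded during follow-up) and uses invented category labels; it is not the code used in the study.

```python
from collections import defaultdict

def prevalent_categories(patient_outcomes, min_prevalence=0.01):
    """Return outcome categories seen in at least min_prevalence of patients.

    patient_outcomes maps a patient ID to the CCS categories recorded for
    that patient during follow-up (hypothetical layout, for illustration).
    """
    n_patients = len(patient_outcomes)
    counts = defaultdict(int)
    for categories in patient_outcomes.values():
        for category in set(categories):  # count each patient once per category
            counts[category] += 1
    return sorted(c for c, k in counts.items() if k / n_patients >= min_prevalence)

# Toy data: 200 patients; the category labels are invented.
toy = {i: set() for i in range(200)}
for i in range(50):
    toy[i].add("hypertension")        # 50 of 200 = 25%
for i in range(4):
    toy[i].add("gout")                # 4 of 200 = 2%
toy[0].add("rare_condition")          # 1 of 200 = 0.5%, below the cutoff

kept = prevalent_categories(toy)
print(kept)
```

Each retained category would then serve as one binary outcome, with one experiment per category.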
Table 1. Characteristics of high probability for NAFLD cohort

| Variable and category | Overall (n = 81,911) |
|---|---|
| Age (years); Mean (SD) | 53.5 (12.5) |
| Gender (%) | |
| Male | 60.1 |
| Female | 39.9 |
| Ethnicity (%) | |
| Caucasian | 87.8 |
| African American | 7.5 |
| Other / Unknown | 4.7 |
| Top 10 comorbidities in prevalence (%) | |
| Disorders of lipid metabolism | 33.5 |
| Essential hypertension | 29.2 |
| Diabetes mellitus without complications | 17.9 |
| Other connective tissue disease | 14.5 |
| Spondylosis; intervertebral disc disorders; other back problems | 13.9 |
| Other upper respiratory infections | 12.2 |
| Other nutritional; endocrine; and metabolic disorders | 12.0 |
| Other non-traumatic joint disorders | 11.2 |
| Esophageal disorders | 10.9 |
| Other lower respiratory disease | 10.8 |
2.2. Discovery engine focused on NAFLD
To identify associations between covariates and outcomes in a follow-up window of 5 years after the index date, we developed a process that extracts a large collection of structured variables from the EMRs. The variables included demographics (e.g., gender, ethnicity, age), comorbidities, laboratory measurements (e.g., albumin, sodium, BMI), and behavioral descriptors (e.g., smoking status, alcohol use). For laboratory variables, we used the most recent value found in the 12 months preceding the index date. We considered a comorbidity present if at least one CCS code for it was found in the patient’s problem list before the index date. The disease outcomes in the 5-year follow-up window consisted of the 174 CCS categories described earlier. To assess the potential associations between these variables and disease outcomes, for each outcome we excluded patients who had a diagnosis of that specific outcome before the index date. In this way, we assessed only disease outcomes that newly developed after the index date. We imputed missing values with the mean of the available data for each variable. We performed all programming using Python (libraries: pandas.io.sql, pyodbc, time) and R (libraries: ggplot2, Hmisc, stringr).
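As an illustration of the laboratory look-back window and the mean imputation, the following standard-library sketch may help. It assumes that 12 months is approximated as 365 days and that a measurement taken on the index date itself falls outside the window; neither detail is specified in the text, and the function names and toy values are hypothetical.

```python
from datetime import date, timedelta

def latest_value_before(measurements, index_date, window_days=365):
    """Most recent value in the window_days before index_date, else None.

    measurements is a list of (date, value) pairs for one patient and one
    laboratory test. Assumption: the index date itself is excluded.
    """
    start = index_date - timedelta(days=window_days)
    in_window = [(d, v) for d, v in measurements if start <= d < index_date]
    if not in_window:
        return None
    return max(in_window, key=lambda pair: pair[0])[1]

def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to impute from
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Toy albumin measurements for three patients sharing one index date.
index_date = date(2010, 6, 1)
labs = {
    "p1": [(date(2009, 1, 15), 3.1), (date(2010, 2, 1), 4.0), (date(2010, 5, 1), 4.4)],
    "p2": [(date(2008, 3, 1), 3.9)],                            # too old, treated as missing
    "p3": [(date(2010, 1, 10), 3.6), (date(2010, 7, 1), 5.0)],  # second value is after the index date
}
raw = [latest_value_before(labs[p], index_date) for p in ("p1", "p2", "p3")]
albumin = impute_with_mean(raw)
print(raw, albumin)
```

Mean imputation keeps every patient in the analysis but shrinks the variance of sparsely measured variables, which is worth bearing in mind when interpreting associations for rarely ordered tests.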
Figure 1 illustrates the process of NAFLD index date extraction and feature selection. We followed this methodology for each of the considered outcomes, resulting in 174 separate experimental sets. To select a subset of the potentially most predictive variables, we first applied univariate analyses to all covariates related to each outcome. We compared categorical variables using a chi-squared test and compared the differences in the means of continuous variables using a t-test or Wilcoxon rank sum test as appropriate. All statistical tests were two-sided, with Bonferroni corrections for the 314 comparisons; the adjusted P value threshold for statistical significance was 1.6 × 10⁻⁴ (0.05/314) for each comparison. Three hundred and nine of the 314 covariates were continuous (including 281 CCS-defined comorbidities and 28 laboratory observations). Five of the 314 covariates were categorical, including smoking status, alcohol use, gender, and ethnicity (White, African American).
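The Bonferroni threshold and a univariate test for a binary covariate can be sketched as follows. The sketch assumes a Pearson chi-squared test on a 2 × 2 table without continuity correction (the text does not state which variant was used) and relies on the fact that, for 1 degree of freedom, the upper-tail probability equals erfc(√(χ²/2)). The counts are invented for illustration.

```python
import math

def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Layout: a = exposed with outcome,   b = exposed without outcome,
            c = unexposed with outcome, d = unexposed without outcome.
    Returns (statistic, p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

threshold = bonferroni_threshold(0.05, 314)   # about 1.6e-4, as in the text

# Toy table: 30 of 100 exposed and 10 of 100 unexposed patients develop the outcome.
stat, p_value = chi_squared_2x2(30, 70, 10, 90)
significant = p_value < threshold
print(f"threshold={threshold:.2e} stat={stat:.2f} p={p_value:.2e} significant={significant}")
```

In this toy table the covariate would pass a conventional 0.05 cutoff but not the Bonferroni-adjusted one, which shows how the correction narrows the set of covariates carried into the multivariate models.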
We used all covariates that were statistically significant in the univariate comparison to train a multivariate logistic regression model. This model, with its smaller set of covariates, yielded odds ratios (ORs) and P values for each covariate. We then took a stringent approach to feature selection and used only the variables that were statistically significant (P < 0.05) in this model to train a second multivariate logistic regression model. This two-step training increased our confidence in the reliability of the selected variables' significance. Finally, to account for variability, we applied bootstrapping to the reduced set of covariates and excluded those with confidence intervals that crossed an OR of 1. This methodology identified a subset of covariates that were strongly correlated with each outcome.
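The bootstrap filter can be illustrated with a deliberately simplified sketch. It computes an unadjusted OR from a single binary covariate rather than the adjusted ORs of the multivariate logistic regression described above, and it assumes a percentile interval, 500 resamples, and a 0.5 correction for empty cells; all of these choices, and the toy data, are assumptions made for illustration.

```python
import random

def odds_ratio(pairs):
    """Unadjusted odds ratio from (exposed, outcome) pairs of 0/1 values."""
    a = sum(1 for e, o in pairs if e and o)
    b = sum(1 for e, o in pairs if e and not o)
    c = sum(1 for e, o in pairs if not e and o)
    d = sum(1 for e, o in pairs if not e and not o)
    if 0 in (a, b, c, d):
        # Add 0.5 to every cell so the ratio stays defined for sparse resamples.
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)

def bootstrap_or_interval(pairs, n_resamples=500, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the odds ratio."""
    rng = random.Random(seed)
    n = len(pairs)
    estimates = sorted(
        odds_ratio(rng.choices(pairs, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper

def crosses_one(interval):
    """True when the interval contains an OR of 1 (covariate would be dropped)."""
    lower, upper = interval
    return lower <= 1 <= upper

# Toy data: one covariate strongly linked to the outcome, one unrelated to it.
strong = [(1, 1)] * 120 + [(1, 0)] * 80 + [(0, 1)] * 40 + [(0, 0)] * 160
null = [(1, 1)] * 60 + [(1, 0)] * 140 + [(0, 1)] * 60 + [(0, 0)] * 140

strong_ci = bootstrap_or_interval(strong)
null_ci = bootstrap_or_interval(null)
print(strong_ci, crosses_one(strong_ci))
print(null_ci, crosses_one(null_ci))
```

Applying the same resampling to a fitted multivariate model would mean refitting the model on each resample and collecting the coefficient of every covariate, which is more expensive but follows the same logic.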
2.3. General approach to extend the discovery engine
Although our manuscript focuses primarily on exploring NAFLD, we believe that the scientific community may be interested in applying our proposed methodology to additional conditions. Figure 2 illustrates a general representation of the steps required to assess associations, not necessarily restricted to NAFLD, between covariates and outcomes. The first step requires identifying a population with a disease of interest and the date of first diagnosis of the disease for each patient. For certain diseases, assessing when the patient is at a high risk for the disease is straightforward; for example, patients with high blood pressure levels (a commonly measured observation) may be candidates for the engine to explore hypertension outcomes. Other conditions are known to be under-documented and underdiagnosed; thus, a computational algorithm can help assess the probability of disease occurrence (e.g., [41]).
The second step is to define a large collection of distinct outcomes: for example, diagnoses, procedures, uncontrolled laboratory observations, and mortality. While not mandatory, a follow-up period should also be defined (e.g., 30 days, 5 years).
The third step is to define and extract covariates; these could be comorbidities defined by ICD or CCS codes, laboratory values, demographic details, and covariates extracted from the clinical narrative notes (e.g., smoking status, alcohol use, nonadherence, family history of cardiovascular disease) as well as more detailed covariates such as genetics-related factors and measurements captured from wearable devices or edible sensors. An observation window for the covariates should be defined (e.g., most recent value for a laboratory value within the preceding 12 months, a history of hospital admissions unrestricted by time, whether colonoscopy was performed within the past decade, whether an immunization was performed over the past 12 months). Another definition may include whether the disease outcome has been observed for the first time after the index date or if it is a recurrence of a preexisting condition.
Once index dates, covariates, and outcomes are defined, the fourth step is to extract the actual data from the EMR database; this will result in a table for each outcome with different covariate values, all relative to the index date of each patient and the subsequent presence or absence of the outcome during the follow-up window. For each such data frame, the engine then applies a feature selection algorithm. In the NAFLD use case, we followed a traditional epidemiological approach to select features (i.e., applying a univariate analysis on the covariates and outcome, filtering out covariates with no statistical significance [given a predefined threshold], and then applying a logistic regression model on the statistically significant covariates and the outcome). Although in the context of NAFLD we assessed levels of association by following a statistical approach (i.e., using P values and ORs), alternative approaches are possible as well (e.g., relying on importance scores calculated by a machine learning algorithm). More advanced feature selection methods may result in more trustworthy linkages; however, there is no guarantee that such methods would differ significantly from a standard statistical approach. Thus, which feature selection method is preferable remains an open research question within the context of discovery mechanisms focused on EMRs.
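The per-outcome table produced in the fourth step can be sketched as follows. The input layouts, the function name, and the category labels are hypothetical; the sketch also assumes that a diagnosis recorded on the index date counts toward follow-up and that 5 years is approximated as 5 × 365 days. It drops patients who already carried the outcome before the index date, in line with the NAFLD use case.

```python
from datetime import date, timedelta

def build_outcome_rows(index_dates, diagnoses, covariates, outcome,
                       follow_up_days=5 * 365):
    """Build one row per eligible patient for a single outcome.

    index_dates: patient ID -> index date
    diagnoses:   patient ID -> list of (category, diagnosis date)
    covariates:  patient ID -> dict of covariate values at the index date
    Patients diagnosed with the outcome before the index date are excluded,
    so only newly developed outcomes are labeled.
    """
    rows = []
    for pid, index_date in index_dates.items():
        dates = [d for category, d in diagnoses.get(pid, []) if category == outcome]
        if any(d < index_date for d in dates):
            continue  # prevalent case: outcome already present before the index date
        window_end = index_date + timedelta(days=follow_up_days)
        label = int(any(index_date <= d <= window_end for d in dates))
        rows.append({"patient": pid, **covariates.get(pid, {}), "outcome": label})
    return rows

# Toy data with invented patients and categories.
index_dates = {"a": date(2005, 1, 1), "b": date(2006, 1, 1),
               "c": date(2007, 1, 1), "d": date(2008, 1, 1)}
diagnoses = {
    "a": [("CKD", date(2004, 5, 1))],   # before the index date: excluded
    "b": [("CKD", date(2008, 3, 1))],   # within follow-up: outcome = 1
    "c": [("CKD", date(2013, 6, 1))],   # after follow-up: outcome = 0
    "d": [("gout", date(2009, 1, 1))],  # different category: outcome = 0
}
covariates = {pid: {"age": 50 + i} for i, pid in enumerate(index_dates)}

rows = build_outcome_rows(index_dates, diagnoses, covariates, "CKD")
for row in rows:
    print(row)
```

Running this once per outcome yields the collection of tables to which the feature selection step is then applied.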
Once the engine has provided levels of association between each outcome and covariates, there are several potential approaches to interpretation, as Figure 2 shows. In a desirable scenario, the engine is capable of validating results that are already known (e.g., tobacco use is harmful). In another scenario, also possibly desirable, the engine is capable of identifying a linkage that has not yet been reported. Such a scenario may prompt the scientific community to evaluate the correctness of the linkage: for example, by designing and applying an experiment in a wet laboratory (e.g., evaluating the potential linkage between covariates and outcome in mice or in zebrafish). Another possible approach to testing the linkage would be to extract the outcome and covariates of interest at a different medical site (using other EMRs or claims-based data) to evaluate whether the linkage holds in additional settings. In another scenario, the engine may identify an association; however, the association may be in a different order in time. For example, the engine finds a linkage between carrying disease A and the future development of disease B, consistent with the literature; however, the engine indicates that disease B is actually associated with a subsequent development of disease A. Such scenarios may be possible (we provide examples in the discussion) and may stimulate further debate by the scientific community regarding the interplay between the two diseases.
2.4. Visualization
We were also interested in providing a high-level overview of how the covariates and outcomes were connected so interesting relationships could be revealed across all 174 experiments. As illustrated in Figure 3, we created a network visualization that shows covariates and outcomes as nodes (rendered as a circle) with edges (rendered as a line) connecting them if they had a statistically significant OR relationship. We colored covariate nodes gray, whereas outcome nodes each received a unique color. We sized nodes proportionally to their degree centrality, so those with more connections were larger, and nodes with few connections were smaller. We colored the edges connecting nodes according to the nodes to which they were connected. Thick edges had an OR greater than 1 (a positive association between the covariate and the outcome), whereas thin edges had an OR less than 1 (a negative association between the covariate and the outcome). Because of space limitations, we only show a partial network in Figure 3, illustrating the connections between several outcomes related to diseases of the circulatory system. The nodes on the border of the figure show the covariates associated with only a single outcome, whereas the nodes in the center are shared across multiple outcomes. Such a visualization allows researchers to see how certain covariates can have different relationships among different outcomes.
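The node sizing and edge thickness rules can be sketched with the standard library alone. The sketch assumes degree centrality is normalized by the number of other nodes (degree divided by n − 1, the common convention) and that each covariate and outcome pair appears at most once; the covariates, outcomes, and ORs below are invented and are not results from the study.

```python
def build_network(associations):
    """Compute node degree centrality and edge widths for the network.

    associations: (covariate, outcome, odds_ratio) triples, one per
    statistically significant covariate and outcome pair (assumed unique).
    Nodes are keyed by (role, name) so that a category appearing both as a
    covariate and as an outcome yields two distinct nodes.
    """
    degree = {}
    edges = []
    for covariate, outcome, odds_ratio in associations:
        source = ("covariate", covariate)
        target = ("outcome", outcome)
        edges.append({
            "source": source,
            "target": target,
            # Thick edge for a positive association, thin for a negative one.
            "width": "thick" if odds_ratio > 1 else "thin",
        })
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    n_nodes = len(degree)
    scale = n_nodes - 1 if n_nodes > 1 else 1
    centrality = {node: d / scale for node, d in degree.items()}
    return centrality, edges

# Toy associations with invented values.
toy_associations = [
    ("smoking", "acute myocardial infarction", 1.8),
    ("smoking", "heart failure", 1.5),
    ("albumin", "heart failure", 0.7),
]
centrality, edges = build_network(toy_associations)
for node, value in sorted(centrality.items(), key=lambda item: -item[1]):
    print(node, round(value, 2))
```

A plotting library could then map the centrality values to node radii and the width labels to line thickness; covariates linked to a single outcome end up with the smallest nodes.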
2.5. Availability of data and materials
IBM's Data Access and Compliance Board approved this study and all its methods, including the EMR cohort assembly, data extraction, and analyses. Data contain potentially identifying information and may not be shared publicly. The data sets used and/or analyzed during the current study, as well as the source code used to develop the engine, are available from the corresponding author on reasonable request (address: 75 Binney St, Cambridge, MA 02142, USA; telephone: 857-500-2425; [email protected]).