Screening for Idiopathic Pulmonary Fibrosis with Comorbid Pattern Recognition in Electronic Health Records

Idiopathic pulmonary fibrosis (IPF) is an irreversible, debilitating, and ultimately lethal fibrosing interstitial lung disease (ILD) of unknown cause 1–3 . The poor prognosis of IPF (mean survival of 2-5 years post-diagnosis 4 ), combined with a worldwide prevalence greater than that of all but the most common cancers 5 , makes it a serious health problem, and the need for early diagnosis is paramount. Currently, early screening is hampered by the absence of reliable screening tools, non-specific symptomatology, a limited understanding of the phenotypic and genetic markers for early-stage IPF, and the potential need for invasive procedures for confirmatory diagnosis. In this study we introduce a new screening tool that requires no new diagnostic tests, may be universally administered, and does not necessarily require recognition of early symptoms by patients or care providers. Applying novel pattern discovery algorithms to the detailed history of past medical encounters of individual patients, we leverage subtle comorbidity patterns to compute the Zero-burden Co-Morbidity Risk Score for IPF (ZCoR-IPF), which is expected to be widely, readily, rapidly and inexpensively applicable at points of care. Our algorithm is trained and validated on a large national insurance claims database (n = 2,059,559). In out-of-sample validation, we demonstrate that ZCoR-IPF identifies IPF patients with an AUC approaching 84% at 4 years, ≈ 86% at 3 years, ≈ 87% at 2 years, and exceeding 88% at 1 year before conventional diagnosis. Our out-of-sample accuracy indicates that, on average, we correctly predict the eventual IPF status of 94–98 out of every 100 patients, with a positive predictive value > 50% irrespective of sex at a specificity of 99%. The maximum PPV and NPV achieved are 70% and ≈ 98%, respectively.
We investigate several important sub-populations: high-risk patients with known IPF co-morbidities; low-risk patients who lack those indications; older ( > 65 ) and younger populations; patients who already have a diagnosis of chronic obstructive pulmonary disease (COPD), asthma, or cardiac disorders; and patients who are never recorded to have experienced dyspnea and are thus at a higher risk of a missed diagnosis. The high performance of the ZCoR-IPF score across clinically relevant situations suggests that it may aid providers to more selectively flag patients for detailed diagnostic evaluation, substantially improving patient outcomes and quality-of-life, as well as the efficiency of healthcare resource utilization.

a CT finding of ILD to see a pulmonologist, while a Medicare claims study 13 found that more than a third of patients saw a pulmonologist > 3 years before IPF diagnosis. A medical chart analysis 10 and a patient survey 6 suggested misdiagnosis rates approaching or exceeding 40%. Notably, 28% of respondents in a 600-patient survey 6 reported that the burden of their journey to an IPF diagnosis contributed to a decision to apply for disability or to retire.
The difficulties in reliably diagnosing IPF early may be ascribed to several causes. First, the most common clinical symptoms of the disease 7,17 , insidiously progressive chronic exertional dyspnea and/or chronic, often mild cough, are highly non-specific. These symptoms are easily attributed by patients to age or deconditioning 6 , or by physicians to more common cardio-respiratory diseases 5,6,15,16 . Important risk factors for IPF, namely older age, male sex, and cigarette smoking 1 , are similarly "non-specific". Second, there is still limited understanding and characterization of the phenotypic and genetic findings associated with "early-stage IPF" 10 . Third, the current diagnostic hallmark of IPF, usual interstitial pneumonia (UIP) on HRCT or histology 2 , is a late-stage finding 3 . Moreover, confirming UIP may require relatively invasive procedures demanding specialized interpretation, e.g., HRCT or surgical lung biopsy 6 . Lastly, no validated or easily applicable screening modalities currently exist for IPF 9 .
In this study we introduce an innovative approach with the potential to ameliorate these key challenges: a tool that requires no new diagnostic tests, may be universally administered, and does not necessarily require recognition of early symptoms by patients or care providers. Analyzing large databases of electronic health records via novel pattern discovery algorithms, we identify subtle co-morbidity incidence, timing, and sequence characteristics presaging IPF. Combining these discovered features with standard machine learning then yields a powerful, accurate, automated screening tool based only on diagnostic codes that already exist in the patient's past medical record. Here we report the training and validation of our screening tool, the Zero-burden Co-Morbidity Risk Score for IPF (ZCoR-IPF), which is expected to be widely, readily, rapidly and inexpensively applicable at points of care.

Data Source & Patient Selection
Our patient data come from the Truven Health Analytics MarketScan® Commercial Claims and Encounters Database for the years 2003-2018 21 (referred to as the Truven dataset). This US national database merges data contributed by over 150 insurance carriers and large self-insured companies, and comprises over seven billion time-stamped diagnosis codes. The entire database tracks over 87 million patients for 1 to 15 years, reflecting a substantial cross-section of the US population. We select our cohort(s) from the Truven dataset in accordance with the inclusion/exclusion criteria described in Table I, ensuring that selected patients have at least three years of medical history recorded in the dataset. The geographical distribution of the patients in our selected cohort(s) is illustrated in Fig. 1a. Fig. 1b illustrates the age distribution at the time of IPF diagnosis (mean ≈ 68 years), which is consistent with the reported mean onset age for IPF (≈ 66 years 4 ). We also note that the observed risk of onset, computed as the number of IPF cases normalized by the total number of patients of the same age, increases with age, as shown in Fig. 1c.
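As a minimal sketch of this normalization (a hypothetical helper, not the study's codebase), the per-age onset risk of Fig. 1c can be computed as the count of IPF cases at each age divided by the cohort total at that age:

```python
from collections import Counter

def risk_by_age(case_ages, cohort_ages):
    """Observed onset risk per age: number of IPF cases at each age
    divided by the total number of cohort patients of that age."""
    cases = Counter(case_ages)
    totals = Counter(cohort_ages)
    return {age: cases[age] / n for age, n in totals.items()}
```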
We view the task of predicting a future IPF diagnosis as a binary classification problem, aiming to categorize time-stamped sequences of diagnostic codes into positive and control categories, where the "positive" category refers to patients diagnosed with IPF 1 year after the point of screening (see Table IV). We also consider earlier screening, up to 4 years before the actual diagnosis, as described later. The control cohort comprises patients who have no target codes in their records within 2 years after the point of screening. For both groups, we base our predictions on the past 2 years of medical history. We consider approximately 42 million diagnostic codes (with over 46K unique codes) for both sexes (see Table III for a detailed enumeration), and ultimately identified n = 2,059,559 patients, with 53,407 patients in the positive group and 2,006,152 patients in the control group. The cohort sizes are described in Table II.
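The labeling rule above can be sketched as follows. The `label_patient` helper and its signature are illustrative assumptions, and the target set is reduced here to the two core IPF codes rather than the full broadened set of Table IV:

```python
from datetime import date, timedelta

# Two core IPF codes only; the study's broadened target uses Table IV.
TARGET_CODES = {"516.31", "J84.112"}

def label_patient(records, screen_date, horizon_weeks=52, control_window_weeks=104):
    """Assign a cohort label from time-stamped (date, code) records.

    Positive: a target code appears within `horizon_weeks` after screening.
    Control:  no target code appears within `control_window_weeks` after screening.
    Otherwise the patient is excluded from this screening configuration.
    """
    horizon = screen_date + timedelta(weeks=horizon_weeks)
    control_end = screen_date + timedelta(weeks=control_window_weeks)
    target_dates = [d for d, c in records if c in TARGET_CODES and d > screen_date]
    if any(d <= horizon for d in target_dates):
        return "positive"
    if not any(d <= control_end for d in target_dates):
        return "control"
    return "excluded"
```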
The diagnostic codes specifically for IPF are 516.31 (ICD-9) and J84.112 (ICD-10). We use a somewhat broader definition, including five additional ICD-10 codes that indicate a diagnosis of pulmonary fibrosis without a known cause (see Table IV). This broadened target definition reduces the impact of erroneous coding and diagnostic uncertainty for IPF, while preserving the key characteristic of the diagnosis: a pulmonary fibrosis for which no clear causal mechanism (such as toxic exposure) can be established.
Importantly, we do not pre-select or reject any diagnostic or prescription code based on its known or suspected comorbidity with IPF. To investigate whether our predictive performance changes substantially for patients who are at "high risk" based on known co-morbidities, we also evaluate our performance within a high-risk and a low-risk sub-cohort. The high-risk sub-cohort comprises patients with one or more of the diagnoses enumerated in Table V, which identify the top known co-morbidities 22 . The low-risk sub-cohort comprises patients who are not at high risk by this criterion, i.e., who lack the Table V diagnoses before IPF diagnosis (positive cohort) or anywhere in their medical history (control cohort). Our results for the low-risk sub-cohort are of particular significance, as these patients are at a higher risk of not being diagnosed early. Additionally, we investigate the applicability of ZCoR-IPF in patients 1) for whom IPF diagnosis might be confounded due to a preexisting condition such as COPD, any heart condition, or asthma (see SI Table III for cohort definition codes), and 2) for whom the absence of any indication of dyspnea reduces the odds of IPF suspicion and might delay diagnosis (see SI Table II for codes that are absent in this cohort).

Modeling & Prediction
[Displaced Fig. 3 caption: Panel a shows estimated lower bounds on the survival function at two specificity levels (90% and 95%), and panel b estimated upper bounds on hazard rates; the bounds arise because actual deaths are often missing from claims data, so the last diagnostic record in a patient's history serves as a lower bound on survival time. Panel c shows the variation of the mean survival time as a function of the specificity at which ZCoR-IPF is operated; the baseline, calculated from our patient database, is somewhat lower than the 2-3 year median post-diagnosis survival reported in the literature, as expected for a lower bound, and ZCoR-IPF retains a clear advantage even at high specificities. Panel d shows estimated raw risk as a function of age for screening four years before the recorded IPF diagnosis, increasing almost linearly with age for patients eventually diagnosed with IPF. Panel e shows the degradation of out-of-sample AUC for earlier screening, from 88% at 1 year to ≈ 85% at 4 years before diagnosis.]

The significant diversity of diagnostic codes, along with their sparsity (0.49 to 0.55 diagnostic codes per patient per week on average), makes this a difficult learning problem. We proceed by partitioning the disease spectrum into 51 broad categories, e.g., infectious diseases, immunologic disorders, and endocrine disorders (see SI Tab. I for a detailed enumeration of these categories). Some of these categories comprise a relatively large number of diagnostic codes, aligning roughly with the categories defined within the ICD framework 23 . The remaining categories represent groups of one or more codes that might have some known or suspected association with pulmonary disorders. Each diagnostic category yields a single weekly
time series of events, which are compressed into specialized hidden Markov models known as probabilistic finite-state automata (PFSA) 24 . These models are inferred separately for each phenotype, for each sex, and for the control and positive cohorts, to identify the distinctive average patterns emerging at the population level; thus we infer 51 × 2 × 2 = 204 PFSA models in total in this study. Our inference algorithm (see Supplementary Text, Section XI) does not presuppose a fixed model structure, and is able to work with non-synchronized and variable-length data streams. Variations in the structure and parameters of the inferred models between the positive and control groups delineate the estimated risk of an IPF diagnosis at the population level. Given these models and the history of a specific patient, we can then quantify the likelihood of this particular history being generated by the control PFSA models as opposed to the positive models. We refer to this likelihood difference as the sequence likelihood defect (SLD) 25 , which is one of the key informative features in our approach. The SLD is a novel concept, generalizing the notion of KL divergence 26 between probability distributions to a divergence between possibly non-i.i.d. stochastic processes (see Supplementary Text, Section V).
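The PFSA inference itself is described in the Supplementary Text; as a simplified illustration of the SLD idea only, assume each cohort model is reduced to a first-order Markov chain over symbolized weekly events (a hypothetical reduction; actual PFSA carry inferred state structure):

```python
import math

def seq_log_likelihood(seq, init, trans):
    """Log-likelihood of a symbol sequence under a first-order Markov model
    with initial distribution `init` and transition matrix `trans`."""
    ll = math.log(init[seq[0]])
    for a, b in zip(seq, seq[1:]):
        ll += math.log(trans[a][b])
    return ll

def sequence_likelihood_defect(seq, control, positive):
    """Per-symbol log-likelihood difference between the positive-cohort and
    control-cohort models; positive values indicate the history is better
    explained by the positive (pre-IPF) model."""
    return (seq_log_likelihood(seq, *positive)
            - seq_log_likelihood(seq, *control)) / len(seq)
```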
In addition to the phenotype-specific Markov models, we use a range of engineered features reflecting various aspects of patient-specific diagnostic histories, referred to as the "sequence features", "p-scores", and "rare scores". The sequence features include: the ratio of the number of weeks with codes of a given phenotype to the total number of weeks in the sequence; the ratio of the number of weeks with codes of a given phenotype to the number of weeks with any diagnosis code recorded; and the length of the longest uninterrupted run of weeks with codes of a given phenotype (see Table VII for the complete list of such features). The p-scores and the rare scores encode prevalence characteristics of individual diagnostic codes (see Supplementary Text). Ultimately, we compute a total of 667 features for each patient, which are then used to train a network of standard gradient boosting classifiers 27 that maps individual patients to a raw risk score. We randomly choose 75% of our patients for training, with the rest held out as a validation set. More details of the predictive pipeline are given in the Supplementary Text.
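The three sequence features listed above can be sketched for a single phenotype; `sequence_features` is a hypothetical helper operating on a list of per-week sets of phenotype labels:

```python
def sequence_features(weekly_codes, phenotype):
    """Three of the per-phenotype 'sequence features': fraction of all
    weeks with the phenotype, fraction of code-bearing weeks with the
    phenotype, and the longest uninterrupted run of phenotype weeks."""
    n_weeks = len(weekly_codes)
    weeks_any = sum(1 for wk in weekly_codes if wk)
    weeks_pheno = sum(1 for wk in weekly_codes if phenotype in wk)
    longest = run = 0
    for wk in weekly_codes:
        run = run + 1 if phenotype in wk else 0
        longest = max(longest, run)
    return {
        "frac_of_all_weeks": weeks_pheno / n_weeks,
        "frac_of_weeks_with_any_code": weeks_pheno / weeks_any if weeks_any else 0.0,
        "longest_run": longest,
    }
```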

Raw Risk & Relative Risk
Our predictive pipeline produces a continuous estimate of the raw risk of a future IPF diagnosis. Since the raw risk estimate is a continuous number, we must choose a decision threshold to make crisp predictions, i.e., if the raw risk is greater than this calibrated threshold, the patient is predicted to be in the positive category. In this study, we select this threshold by maximizing the F1-score, the harmonic mean of precision and sensitivity, to make a balanced trade-off between Type 1 and Type 2 errors.
The relative risk is then defined as the ratio of the raw risk to the decision threshold, and a value > 1 indicates a predicted future IPF diagnosis.
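A minimal sketch of the threshold calibration and the relative-risk computation (the helpers and the grid of candidate thresholds are illustrative assumptions; the study's pipeline details are in the Supplementary Text):

```python
def f1_at_threshold(scores, labels, thr):
    """F1 = 2*TP / (2*TP + FP + FN) for a 'positive if score >= thr' rule."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < thr and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def calibrate_threshold(scores, labels):
    """Pick the decision threshold maximizing F1 over observed scores."""
    return max(set(scores), key=lambda t: f1_at_threshold(scores, labels, t))

def relative_risk(raw_risk, threshold):
    """Relative risk = raw risk / threshold; > 1 predicts a future diagnosis."""
    return raw_risk / threshold
```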

Performance Measurement
We measure our performance using standard metrics, including the area under the receiver-operating characteristic curve (AUC), sensitivity, specificity, and the positive predictive value (PPV). We also report accuracy (acc, see Table VI), which is the probability of a correct prediction (positive or control).
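These metrics reduce to simple confusion-matrix arithmetic; a minimal sketch (AUC computed here via the rank-based Mann-Whitney formulation, an equivalent standard route):

```python
def screening_metrics(pred, truth):
    """Sensitivity, specificity, PPV, NPV, and accuracy from binary
    predictions and ground truth (assumes no denominator is zero)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "acc": (tp + tn) / len(truth),
    }

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```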

Feature Importance & Comorbidity Spectra
Beyond the demonstrated predictive performance (see Results), calculation of the ZCoR-IPF score offers insights into the comorbid associations of IPF that have genuine predictive value. Estimating the relative importance of the features used is crucial for sanity checks, as well as for insights into the underlying causal mechanisms. We compute the relative importance of a feature by estimating the mean change in the raw risk under random perturbation of that feature; this is the "feature importance" shown in Fig. 2c for the different diagnostic categories, which illustrates that respiratory disorders are the most important diagnostic category modulating the ZCoR-IPF score.
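A simplified stand-in for this perturbation-based importance estimate; the exact perturbation scheme is an assumption here, implemented as a random shuffle of one feature column with the mean absolute change in the model's raw output taken as the score:

```python
import random

def perturbation_importance(predict, X, feature_idx, n_rounds=10, seed=0):
    """Mean absolute change in the raw risk score when one feature column
    is randomly shuffled across patients (illustrative scheme)."""
    rng = random.Random(seed)
    base = [predict(row) for row in X]
    deltas = []
    for _ in range(n_rounds):
        col = [row[feature_idx] for row in X]
        rng.shuffle(col)
        for row, v, b in zip(X, col, base):
            perturbed = list(row)
            perturbed[feature_idx] = v
            deltas.append(abs(predict(perturbed) - b))
    return sum(deltas) / len(deltas)
```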
Importantly, all of our features are based on data already available in past medical records. We do not demand results from specific tests, or look for specific demographic, bio-molecular, physiological or other parameters; we use what is available in the diagnostic history of patients, which presents an unstructured sequence of labels pertaining to ICD and prescription codes, and is typically prone to noise, coding errors and sparsity. Our ability to work effectively with such uncurated data and achieve high out-of-sample predictive performance showcases immediate clinical applicability with zero additional burden to patients and providers.
In addition to the patient-specific predictions, we compute the statistically significant log-odds ratios of specific ICD codes occurring in the true-positive vs the true-negative patient sets. We call these the comorbidity spectra (see Fig. 4). Removing the false positives and false negatives from consideration in computing the comorbidity spectra allows us to uncover patterns, at the level of individual codes, that are most representative of patient risk. Importantly, the comorbidity spectra are based on individual codes, as opposed to the feature importances shown in Fig. 2c, which consider the aggregated impact of all features based on the broad disease categories. The disorders listed in the comorbidity spectra obviously do not all appear in any single patient; rather, the codes with high log-odds ratios are significantly more likely to occur in the positive cohort. The comorbidity spectra, so named for their disease-category-specific color coding, offer unique insight into the predictive co-morbidity burden of IPF.
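The per-code log-odds ratio underlying the spectra can be sketched as follows; the pseudo-count correction for empty cells is my addition, not necessarily the study's choice:

```python
import math

def log_odds_ratio(code, tp_histories, tn_histories, pseudo=0.5):
    """Log-odds ratio of a diagnostic code appearing in true-positive vs
    true-negative histories, with a Haldane-Anscombe pseudo-count
    (an assumed regularization) to avoid zero cells."""
    a = sum(code in h for h in tp_histories) + pseudo      # present, TP
    b = sum(code not in h for h in tp_histories) + pseudo  # absent,  TP
    c = sum(code in h for h in tn_histories) + pseudo      # present, TN
    d = sum(code not in h for h in tn_histories) + pseudo  # absent,  TN
    return math.log((a * d) / (b * c))
```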

RESULTS
In this study we demonstrate three key results: 1) high out-of-sample predictive performance in identifying a future IPF diagnosis by leveraging subtle comorbidity patterns recorded in the past medical history of individual patients; 2) the ability of our models to maintain high predictive performance for an eventual diagnosis further into the future, up to 4 years; and 3) the ability to perform effectively under common confounders, such as a preexisting diagnosis of COPD, asthma or heart disease, or the absence of any indication of dyspnea in the past.
Our main prediction results are presented in Fig. 2a-b, which illustrate the ROC and precision-recall curves, respectively (for screening one year before current diagnosis), shown separately for males and females. As noted in the legends of these panels, our out-of-sample predictive performance is > 88% AUC irrespective of the sex of the patient, with sensitivity ≥ 55% at 95% specificity (55% for males and 60% for females). At 99% specificity, we obtain a positive predictive value (PPV) of 52% for females and 54% for males, with sensitivities of 42% and 38%, respectively. At these values we obtain an accuracy (denoted "acc") of ≈ 94-98% (see Table VI), which indicates the overall fraction of correct predictions. The PPV achieved by ZCoR-IPF at maximum accuracy is 70% for females and 71% for males, with a corresponding negative predictive value (NPV) of 98%.
Thus, to summarize: our predictive pipeline detects about 55-60 out of every 100 patients who will be diagnosed with IPF in 1 year, if we operate at 95% specificity. If we wish to operate at the higher specificity of 99%, then out of every 100 positive flags, about 52-54 are true positives. The accuracy metric indicates that we correctly identify the risk status (positive or control) of approximately 94-98 out of 100 patients, irrespective of sex, highlighting the potentially high clinical significance of the ZCoR-IPF score.
From the inferred relative importance of the features (see Fig. 2d-e), we conclude, as expected, that respiratory disorders in the past are the most important modulators of risk, followed by known or suspected IPF comorbidities, metabolic diseases, cardiovascular abnormalities, and diseases of the eye. Infections also feature in the top 20 co-morbidities shown in these panels. Importantly, while there are sex differences, the overall pattern of the relative importance ranking remains substantially invariant between males and females. With some exceptions, many of these patterns are not particularly surprising; the contribution of this study is to bring them together systematically to realize an accurate risk estimate via the ZCoR-IPF score.
A key metric for evaluating the potential impact of the ZCoR-IPF score is the expected change in the survival functions via a Kaplan-Meier (KM) analysis 28 , and the corresponding change in the mean survival time. The standard KM analysis is specifically designed to handle incomplete observations, e.g., not knowing the exact time of death for all patients, but merely a lower bound on their survival times. This is particularly useful in our case, since insurance claims often do not record deaths, with expired patients simply dropping off the database. This creates uncertainty as to whether a patient actually died, simply dropped insurance, or was dropped from the database for some other reason. Nevertheless, the time over which patients are observed in the database is clearly a lower bound on their survival. With this in mind, we construct the survival plots shown in Fig. 3a-b, which represent lower bounds on the survival function (panel a) and upper bounds on the hazard rate (panel b), shown along with 95% confidence bounds. Note that since we can operate our predictor at different specificity-sensitivity trade-offs, we obtain different curves as we vary the specificity. The survival functions are notably similar across the two sexes. Panel c shows the variation of the lower bound on the mean survival time, compared with the baseline currently observed in the database; even at 95% specificity we raise the lower bound on mean survival time from around 100 weeks to 180-200 weeks. It is crucial to note that these survival analyses do not take into account the possibility of actually prolonging life via clinical interventions once the diagnosis is moved earlier; in actuality we therefore expect the survival times to be markedly better than what is shown in Fig. 3a-c.
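The KM estimator with right-censoring (last record in the database treated as a censoring time, per the lower-bound argument above) can be sketched as:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve. `events[i]` is True when the endpoint
    was observed at `times[i]`, False when the patient was censored there
    (here, the last diagnostic record in the database)."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    surv, curve = 1.0, []
    for t in event_times:
        at_risk = sum(ti >= t for ti in times)
        d = sum(ti == t and ei for ti, ei in zip(times, events))
        surv *= 1.0 - d / at_risk
        curve.append((t, surv))
    return curve
```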
Our predictive performance expectedly degrades as we predict earlier, as illustrated in Fig. 3e. Importantly, however, the degradation is slow enough that we can use ZCoR-IPF with acceptable reliability to predict diagnoses four years into the future. To illustrate how the ZCoR-IPF risk varies with patient age, we estimate the distribution of the scores over the positive and control cohorts in Fig. 3d. Note that for patients who are eventually diagnosed, the risk increases almost linearly with age.
While these results demonstrate the importance of the diverse features used in our approach, understanding the source of this predictive power is important. The feature importances discussed earlier (Fig. 2d-e) identify the relative impact of broad disease categories; to evaluate the importance of a specific diagnostic category, we sum the importances of all features based on that category, rather than considering the presence or absence of individual diagnoses. The latter aspect is investigated via the co-morbidity spectra for out-of-sample patients, shown in Fig. 4 separately for males and females. We find that the important co-morbidities are diverse and vary with the sex of the patient, but are clearly dominated by respiratory disorders, followed by diseases of the cardiovascular and circulatory systems. Again, while many of these patterns are expected at the population level, the design of the personalized ZCoR-IPF score is not immediately obvious.
Since IPF co-morbidities have been investigated in the literature, a relevant question is whether our performance is dramatically better in sub-cohorts defined by the presence of these high-risk diagnoses in the past (defined in Table V). The results are tabulated in Table VI, showing that our performance in the high-risk sub-cohort is broadly comparable with the full-cohort performance. The AUCs achieved for the low-risk cohort are somewhat lower (> 85% for both males and females), but still acceptably high. Thus, even among low-risk patients, we still detect 48 out of every 100 patients who will receive a diagnosis within 1 year.

DISCUSSION
The present study describes the development and performance of ZCoR-IPF in a large US commercial claims database. Our key finding is that in both men and women aged 48-90 years (n=1345 and n=1191, respectively), ZCoR-IPF accurately identifies IPF cases 1-4 years sooner than occurred in a variety of everyday community or academic practice settings during 2003-2018: at 208 weeks (4 years) before IPF diagnosis, ZCoR-IPF would have predicted that classification with an AUC approaching 84%; at 156 weeks (3 years) before diagnosis, with an AUC of ≈ 86%; at 104 weeks (2 years), with ≈ 87%; and at 52 weeks (1 year) before diagnosis, with an AUC approaching 88%, regardless of the patient's sex (Fig. 3e). Importantly, ZCoR-IPF achieves such results noninvasively, inexpensively, and almost instantaneously, since it relies only on diagnostic data already in the patient's electronic medical record, and runs on existing information technology infrastructure. Also of note, ZCoR-IPF expands the data available to both primary care practitioners and expert IPF diagnosticians. The score reflects a sophisticated, highly detailed automated analysis of comorbidities, considering more than 600 features related to the incidence and timing of individual diagnostic codes. ZCoR-IPF thus supplements information currently used to evaluate and diagnose IPF, which focuses mainly on respiratory signs and symptoms, pulmonary function, and the radiographic and histologic appearance of the lung 1,2,29 . Adding comorbidity as a new dimension of assessment, albeit in much simpler fashion 30 , is an approach that has proved fruitful in COPD management 31 .
ZCoR-IPF may be expected to contribute in two main ways to increasing the timeliness and ease of IPF identification. First, ZCoR-IPF might serve as a screening tool to aid primary care physicians, radiologists, or pulmonologists to more selectively flag patients for referral for HRCT, referral to a pulmonologist, or both. Presently, the leading candidates for such flagging would be patients with one or more of chronic dyspnea, chronic cough, chronic "Velcro crackles" on auscultation, restrictive ventilatory patterns on pulmonary function tests, or incidental interstitial lung abnormalities (ILA) or ILD on chest or abdominal CT 5,7,10,14,17,18,20,32 . As these are relatively large groups, ZCoR-IPF might be applied to help distinguish individuals who require immediate referral, versus those who require increased surveillance, versus those who require less frequent follow-up. ZCoR-IPF might be an especially useful tool in patients with ILA, since while these findings might reflect an early stage of the disease, only some 0.5%-2% of this group will ever develop IPF 10,33 . Indeed, given the severity and actionability of the IPF diagnosis, the frequency of smoking history among patients with IPF 1 , and the non-invasive nature, applicability, speed, and low cost of ZCoR-IPF, this tool might be applied in primary care to screen smokers or former smokers for increased surveillance or evaluation for IPF. The feasibility, diagnostic yield, cost-effectiveness, and, ultimately, effect on patient outcomes of using ZCoR-IPF in these screening settings merit prospective evaluation.
Second, ZCoR-IPF might serve as a diagnostic aid for pulmonologists, radiologists, pathologists, or multidisciplinary teams in cases showing abnormalities suggestive of, or associated with IPF, but not UIP on HRCT or histopathology. These cases are relatively frequent: roughly half of patients histopathologically diagnosed with IPF lack classic CT findings associated with the disease 10 . Hence ZCoR-IPF might help patients without UIP on HRCT to avoid more invasive tests, especially surgical lung biopsy 2,34 , and/or may increase clinicians' diagnostic confidence. The effect of ZCoR-IPF use as a diagnostic aid on healthcare resource utilization and on diagnostic confidence in IPF classification also warrants prospective assessment.
ZCoR-IPF also might speed recruitment and decrease costs of clinical trials of new therapies for IPF and other progressive fibrosing ILD. The tool might do so by allowing more confident inclusion in study samples of patients with provisional IPF diagnoses.
Improving IPF screening and diagnosis may offer considerable opportunity to substantially enhance patient outcomes and quality-of-life, as well as efficiency of healthcare resource utilization. First and foremost, earlier, more efficient and confident diagnosis should enable timelier access to anti-fibrotic treatment, which slows but does not reverse disease. Evidence of similar effect sizes across a spectrum of IPF severity suggests that starting anti-fibrotics earlier in the course of disease may better preserve lung function 35 . Notably, in patients with IPF, each year's delay following an initial CT scan in referral to a specialized ILD center has been shown to be associated with a 2.94% (95% CI: 0.86-5.03%) increase in pulmonary fibrosis extent on chest CT 14 .
Conversely, more prompt and firm IPF diagnosis should spare patients unneeded or harmful treatments, e.g., corticosteroids, as well as unneeded diagnostic tests and healthcare visits 1,6,11-16,36 . Earlier identification also should allow more prompt referral for lung transplantation, the only current cure for the disease. Such referral is recommended immediately upon IPF diagnosis, since evaluation for eligibility and waits for graft availability may take months or years; starting this process when younger and less ill may allow patients to avoid disqualification for advanced age or frailty 1,6 . In the meanwhile, quicker IPF diagnosis will accelerate patients' access to interventions that may improve lung function and quality-of-life, namely supplemental oxygen, pulmonary rehabilitation, and palliative care 6,7 , as well as to clinical trials.
Development and use of a risk score for IPF diagnosis that is based on comorbidities aligns well with the longstanding appreciation of multimorbidity as characteristic of patients with IPF 22 . Indeed, a chart review at a German tertiary referral center 37 (N=272) found 58% of patients with IPF to have 1-3 comorbidities, 30% to have 4-7 comorbidities, and only 12% to have no concomitant illness. The high prevalence of comorbidities in patients with IPF presumably stems from shared risk and pathogenetic factors among IPF and the other conditions, e.g., male gender, cigarette smoking, or aging accelerated by genetic mutations resulting in telomere shortening 37 . Additionally, some comorbid conditions may be hypothesized to be induced by IPF symptoms, e.g., depression; some to contribute to IPF pathogenesis, e.g., gastroesophageal reflux (GER) 2 ; while others may at least in some instances represent common mis-diagnoses in patients with IPF, e.g., COPD, asthma, or pneumonia 6,15,16 .
Strengths and limitations of our work should be noted. Foremost among the former is our use of detailed "real-world" data from large databases (n=2536 cases, n=450,461 controls) collected over more than a decade throughout the United States or at an academic tertiary referral center in a major US metropolitan area. The huge overall sample size gives ZCoR-IPF robust statistical power to capture subtle patterns of comorbidity that would be indiscernible even in traditional multicenter cohorts 33 . As well, the size and scope of the databases increase the likelihood that our findings are generalizable throughout the United States and, indeed, elsewhere.
Conversely, use of administrative claims databases has certain disadvantages. Absent patient-level case validation, the most important of these is potential lack of accuracy in the ICD-9 and ICD-10 coding for IPF and related diagnoses 38,39 . A relatively small (n=150 validated cases) patient-level case validation study of another, smaller administrative claims database (Kaiser Permanente) suggested that more than half of IPF codes might be incorrectly applied, including roughly a quarter of cases that appeared to lack even findings of ILD 39 . Arguably, use of data including both patients with IPF and patients with other ILD might minimally affect the clinical relevance of our score, given the emerging concept of "progressive chronic fibrosing ILD". Under this concept, subgroups of patients with IPF (which progresses at heterogeneous rates from case to case) and subgroups of patients with other ILD, e.g., idiopathic non-specific interstitial pneumonia, hypersensitivity pneumonitis, systemic sclerosis-associated ILD, or rheumatoid arthritis-associated ILD, may have similar phenotypes and disease behavior, and should perhaps be managed similarly, with early initiation of anti-fibrotic therapy 3,40,41 . However, inclusion of patients without ILD among our IPF cases would of course be expected to decrease the accuracy and utility of ZCoR-IPF. It would be desirable to perform case validation studies in our databases, which might provide data enabling ZCoR-IPF to be fine-tuned.
Other disadvantages of our use of a commercial insurance administrative database may include excess proportions, relative to those seen in specialist clinical practice, of females, and of less severe cases of both sexes.
In other claims database studies, these differences have been attributed, respectively, to a greater tendency of women than men to seek medical attention 42 , and to a tendency of older and sicker patients to migrate from commercial to government insurance plans 43 . Interestingly, one may speculate that miscoding of soft tissue or rheumatologic diseases with an ILD component as IPF may be more prevalent among females, as are those disorders themselves 36 . Additionally, commercial claims databases contain at best highly limited information on signs and symptoms of disease that are insufficiently severe or impactful to be considered a discrete diagnosis, e.g., dyspnea, cough, or rales. These databases also provide a paucity of data regarding the results of biochemical or imaging investigations. Nor do claims data capture often-important lifestyle variables such as smoking, alcohol consumption, exercise, or diet 43 .

Conclusion
We developed and validated ZCoR-IPF, a risk score for IPF capable of identifying this diagnosis with high accuracy 1-4 years before it occurred in everyday practice conditions in community and academic settings across the United States. This score provides clinicians with detailed data on comorbidity patterns to distinguish patients with IPF from those without the disease, a strategy of particular interest given the well-acknowledged predominance of multimorbidity in patients with IPF. Thus ZCoR-IPF supplements the functional and morphological data regarding the lung that have been the mainstay in evaluating potential IPF cases. ZCoR-IPF has the advantages of considering only diagnostic data already in the patient's electronic medical record, of running on current information technology infrastructure, and of operating inexpensively and almost instantaneously at the point of care. Hence ZCoR-IPF holds promise as a screening tool for primary care practitioners, pulmonologists, and radiologists to use in patients with early signs and/or symptoms of IPF or those at elevated risk for that disease. As well, ZCoR-IPF may offer utility as a diagnostic aid to pulmonologists, radiologists, pathologists, and multidisciplinary teams in cases with radiological and/or histopathological suggestions, but not the classical radiological and/or histopathological picture, of IPF. The effect of ZCoR-IPF on speed, accuracy, and confidence of IPF diagnosis, on health care resource utilization, and ultimately, on patient outcomes, should be assessed prospectively.

Fig. 4: Co-morbidity Spectrum (males). Disorders that increase the odds of the patient being a "true positive" vs a "true negative". Such disorders (ranked according to the log-odds ratio) are more likely to be found in patients who are in the positive cohort.
Comparing panel a with panel b, we note that these odds change from males to females, but as expected the patterns are broadly similar, with over-representation of respiratory disorders.

Feature category: description (number of features)
- feature-phenotype scores relative to phn score: mean p-score of feature-phenotype codes within sequence divided by general p-score of the feature-phenotype (51)
- feature-phenotype scores relative to whole score: mean p-score of feature-phenotype codes within sequence divided by mean p-score of all codes in the record (51)
- aggregation score: aggregation of the p-scores in the record (9)
- high scores proportion: proportion of codes with very high p-scores among all codes in the record (1)
- low scores proportion: proportion of codes with very low p-scores among all codes in the record (1)
- dynamics of mean score: mean p-score of second half of the record divided by mean p-score of first half of the record (1)
- dynamics of st.dev score: standard deviation of p-scores of second half of the record divided by standard deviation of p-scores of first half of the record (1)
- dynamics of score range: range of p-scores of second half of the record divided by range of p-scores of first half of the record (1)
- dynamics of score skew: skew of p-scores of second half of the record divided by skew of p-scores of first half of the record (1)
- aggregation relative to phn score: aggregation of all feature-phenotype mean scores divided by corresponding general p-score of feature-phenotype (6)
- aggregation relative to whole score: aggregation of all feature-phenotype mean scores divided by mean p-score of all codes in the record (6)

II. Time-series Modeling of Diagnostic History
Individual diagnostic histories can have long-term memory 1 , implying that the order, frequency, and comorbid interactions between diseases are important for assessing the future risk of our target phenotype. We analyze patient-specific diagnostic code sequences by first representing the medical history of each patient as a set of stochastic categorical time-series, one for each specific group of related disorders, followed by the inference of stochastic models for these individual data streams. These inferred generators are from a special class of Hidden Markov Models (HMMs), referred to as Probabilistic Finite State Automata (PFSA) 2 .
The inference algorithm we use is distinct from classical HMM learning, and has important advantages related to its ability to infer structure, and its sample complexity (See Supplementary text, Section XI). We infer a separate class of models for the positive and control cohorts, and then the problem reduces to determining the probability that the short diagnostic history from a new patient arises from the positive as opposed to the control category of the inferred models.

III. Inference & Event Periods
We train our predictive pipeline with all diagnostic codes that are recorded in the past 2 years from the point at which a prediction is made. This period from which we use data to train our pipeline is called the "inference window". Our aim is to make predictions on the occurrence of the target diagnostic codes at 1 year from the end of the inference window. For patients in the control cohort, we make sure that no target code appears for 2 years after the end of the inference window. Additionally, when making predictions further into the future (up to 4 years, as described in the main text), we always make sure that the control group has no target codes for 1 year after the predicted time of diagnosis, i.e., if we are making a prediction of a diagnosis 4 years in the future, then control group patients are chosen to have no diagnosis for at least the next 5 years.
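The windowing logic above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (365-day years, hypothetical function names and record layout), not the production cohort-construction code:

```python
from datetime import date, timedelta

def inference_window_codes(history, prediction_date, window_years=2):
    """Return codes recorded in the inference window: the `window_years`
    immediately preceding `prediction_date`. `history` is a list of
    (date, code) pairs for one patient."""
    start = prediction_date - timedelta(days=365 * window_years)
    return [c for d, c in history if start <= d < prediction_date]

def eligible_control(history, prediction_date, target_codes, horizon_years=1):
    """A control patient predicted `horizon_years` ahead must show no target
    code for at least `horizon_years` + 1 years after the end of the
    inference window, per the exclusion rule described in the text."""
    cutoff = prediction_date + timedelta(days=365 * (horizon_years + 1))
    return not any(c in target_codes and prediction_date <= d < cutoff
                   for d, c in history)
```

For example, a patient whose target code appears 18 months after the prediction point is excluded from the 1-year-horizon control group, since the required target-free period is 2 years.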

IV. Step 1: Partitioning The Human Disease Spectrum
We begin by partitioning the human disease spectrum into 51 non-overlapping categories. Each category is defined by a set of diagnostic codes from the International Classification of Diseases, Ninth Revision (ICD9) (See Table SI-I for a description of the categories used in this study). For this study, we ended up using 17,008,378 and 25,074,255 diagnostic codes for males and females respectively (22,685 and 23,722 unique codes) spanning both ICD9 and ICD10 protocols (using ICD10 General Equivalence Mappings (GEMS) 3 equivalents where necessary), from a total of 2,059,559 patients. Transforming the diagnostic histories to report only the broad categories reduces the number of distinct codes that the pipeline needs to handle, thus improving statistical power.
Our categories largely align with the top-level ICD9 categories, with small adjustments, e.g., bringing all infections under one category irrespective of the pathogen or the target organ. We do not pre-select the phenotypes; we want our algorithm to seek out the important patterns without any manual curation of the input data.
For each patient, the past medical history is a sequence (t_1, x_1), ..., (t_m, x_m), where t_i are timestamps and x_i are ICD9 codes diagnosed at time t_i. We map each individual patient history to a categorical time series z^k over a three-letter alphabet, corresponding to the disease category k, as follows. For each week i, we have:

z^k_i = 0 if no diagnostic code is recorded in week i; 1 if a code from category k is recorded in week i; 2 if one or more codes are recorded in week i, but none from category k.    (1)

The time-series z^k is observed in the inference period. Thus, each patient is represented by 43 mapped trinary series.
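The weekly quantization can be sketched as follows. This is an illustration with a hypothetical input layout (a dict mapping week index to the set of category labels recorded that week), assuming the three letters encode: no code that week, a code from the target category, or codes only from other categories:

```python
def trinary_series(weekly_codes, category, n_weeks):
    """Map one patient's weekly diagnostic record to a trinary series for a
    single disease category:
      0 = no diagnostic code recorded that week,
      1 = at least one code from `category` recorded,
      2 = codes recorded, but none from `category`."""
    series = []
    for week in range(n_weeks):
        codes = weekly_codes.get(week, set())
        if not codes:
            series.append(0)
        elif category in codes:
            series.append(1)
        else:
            series.append(2)
    return series
```

One such series is produced per patient per disease category, yielding the set of parallel trinary streams described above.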

V. Step 2: Model Inference & The Sequence Likelihood Defect ¡
The mapped series, disease-category, and IPF diagnosis-status are considered to be independent sample paths, and we want to explicitly model these systems as specialized HMMs (PFSAs). We model the positive and the control cohorts and each disease category separately, ending up with a total of 86 HMMs at the population level (43 categories, 2 IPF status categories: positive and control). Each of these inferred models is a PFSA: a directed graph with probability-weighted edges, which acts as an optimal generator of the stochastic process driving the sequential appearance of the three letters (as defined by Eq. (1)) corresponding to disease category and IPF status-type (See Section XI in the Supplementary text for background on PFSA inference).
To reliably infer the IPF status-type of a new patient, i.e., the likelihood of a diagnostic sequence being generated by the corresponding IPF status-type model, we generalize the notion of Kullback-Leibler (KL) divergence 4,5 between probability distributions to a divergence D_KL(G||H) between ergodic stationary categorical stochastic processes 6 G, H as:

D_KL(G||H) = lim_{n→∞} (1/n) Σ_{x : |x| = n} p_G(x) log [ p_G(x) / p_H(x) ]    (2)

where |x| is the sequence length, and p_G(x), p_H(x) are the probabilities of sequence x being generated by the processes G, H respectively. Defining the log-likelihood of x being generated by a process G as

L(x; G) = -(1/|x|) log p_G(x)    (3)

the cohort-type for an observed sequence x, which is actually generated by the hidden process G, can be formally inferred from observations based on the following provable relationships (See Supplementary Text Section XI, Theorems 6 and 7):

lim_{|x|→∞} L(x; G) = h(G),   lim_{|x|→∞} L(x; H) = h(G) + D_KL(G||H)    (4)

where h(·) is the entropy rate of a process 4 . Importantly, Eq. (4) shows that the computed likelihood has an additional non-negative contribution from the divergence term when we choose the incorrect generative process. Thus, if a patient is eventually going to be diagnosed with IPF, then we expect the disease-specific mapped series corresponding to her diagnostic history to be modeled by the PFSA in the positive cohort. Denoting the PFSA corresponding to disease category j for the positive and control cohorts as G^j_+, G^j_0 respectively, we can compute the sequence likelihood defect (SLD, Δ_j) as:

Δ_j ≜ L(x; G^j_0) - L(x; G^j_+) → D_KL(G^j_0 || G^j_+)    (5)

With the inferred PFSA models and the individual diagnostic history, we estimate the SLD Δ_j defined in Eq. (5). The higher this likelihood defect, the higher the similarity of the diagnostic history to those of women with IPF.
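A minimal sketch of the SLD computation, assuming an already-inferred PFSA represented by a transition map and per-state symbol probabilities, and a sequence that starts from a synchronized state. The representation and function names are illustrative, not the paper's implementation:

```python
import math

def nll_rate(x, delta, pitilde, q0=0):
    """Negative log-likelihood per symbol of sequence x under a PFSA given by
    transition map delta[(q, s)] -> q' and emission probabilities
    pitilde[(q, s)], starting from state q0 (assumed synchronized).
    This is L(x; G) from Eq. (3)."""
    q, ll = q0, 0.0
    for s in x:
        ll -= math.log(pitilde[(q, s)])
        q = delta[(q, s)]
    return ll / len(x)

def sld(x, control_model, positive_model):
    """Sequence likelihood defect: excess per-symbol surprise of x under the
    control model relative to the positive model (Eq. (5))."""
    return nll_rate(x, *control_model) - nll_rate(x, *positive_model)
```

With toy single-state models, a sequence that is typical of the positive model but atypical of the control model yields a strictly positive defect.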

VI. Step 3: Risk Estimation Pipeline With Semi-supervised & Supervised Learning Modules
The risk estimation pipeline operates on patient-specific information limited to the available diagnostic history in the inference period, and produces an estimate of the relative risk of IPF, with an associated confidence value.
To learn the parameters and associated model structures of this pipeline, we transform the patient-specific data into a set of engineered features, and the feature vectors realized on the positive and control sets are used to train a gradient-boosting classifier 7 . The complete list of 667 features used is provided in Tab. VII in the main text.
We need two training sets: one to infer the models, and one to train the classifier with features derived from the inferred models. Thus, we do a random 3-way split of the set of unique patients into feature-engineering (25%), training (25%) and test (50%) sets. We use the feature-engineering set of ids first to infer our PFSA models (unsupervised model inference in each category), which then allows us to train the gradient-boosting classifier using the training set and PFSA models (classical supervised learning), and we finally execute out-of-sample validation on the test set. Fig. 2c in the main text shows the top 20 features ranked in order of their relative importance (relative loss in performance when dropped out of the analysis).
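The 3-way split can be sketched as follows; the function name and seeding are hypothetical conveniences, but the key property, splitting by unique patient id so no patient appears in two sets, mirrors the text:

```python
import random

def three_way_split(patient_ids, fractions=(0.25, 0.25, 0.50), seed=0):
    """Randomly partition unique patient ids into feature-engineering,
    training, and test sets (25/25/50 as described in the text)."""
    ids = sorted(patient_ids)          # deterministic base order
    random.Random(seed).shuffle(ids)   # reproducible shuffle
    n = len(ids)
    a = int(fractions[0] * n)
    b = a + int(fractions[1] * n)
    return ids[:a], ids[a:b], ids[b:]
```

Splitting at the patient level (rather than the record level) prevents leakage of one patient's history across the model-inference and classifier-training stages.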

VII. THRESHOLD SELECTION ON ROC CURVE
Once the ROC curve has been computed, we must choose a decision threshold to trade off true positive rate and false positive rate. In situations where the negatives vastly outnumber the positives (which is the case in our problem), it is better to base this trade-off on a measure that is independent of the number of true negatives. The two popular measures considered in the literature are accuracy and the F1-score:

accuracy = (t_p + t_n) / (t_p + f_p + f_n + t_n)

F1 = 2 t_p / (2 t_p + f_p + f_n)

The F1-score equals accuracy when the number of true negatives is the same as the number of true positives, thus partially correcting for the class imbalance.
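Threshold selection by maximizing F1 can be sketched directly from these definitions (a minimal brute-force scan; function names are illustrative):

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn); defined as 0 when tp == 0."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels, thresholds):
    """Scan candidate thresholds and return the one maximizing F1.
    Unlike accuracy, F1 ignores true negatives, so it is not inflated
    by the majority-negative class."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        f1 = f1_from_counts(tp, fp, fn)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```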
The selection of the threshold may also be dictated by the current practice of ensuring high specificities in screening tests. Thus, the most clinically relevant operating point is either the one corresponding to 95% specificity, which is highlighted in Fig. 2a in the main text, or even higher specificities due to the low prevalence of IPF (See Table VI in the main text).

VIII. NOTE ON RECEIVER OPERATING CHARACTERISTICS (ROC) AND PRECISION-RECALL CURVES
The ROC curve is a plot of the False Positive Rate (FPR) against the True Positive Rate (TPR), and the area under the ROC curve (AUC) is often used as a measure of classifier performance. For the sake of completeness, we introduce the relevant definitions. In the following, P denotes the total number of positive samples (the number of patients who are eventually diagnosed), and N denotes the total number of negative samples (the number of patients in the control group).
PPV = t_p / (t_p + f_p)    (11)

ρ = P / (N + P)    (12)

where, as before, t_p, t_n, f_p, f_n are true positives, true negatives, false positives, and false negatives respectively, and ρ denotes the prevalence.
Note that TPR is also referred to as recall or sensitivity, and PPV is also referred to as precision. True negative rate is also known as specificity.
A precision-recall curve, or a PPV-sensitivity curve is a plot between PPV and TPR.
Denoting sensitivity by s, and specificity by c, it follows that:

PPV = (t_p / P) / [ t_p / P + (f_p / N)(N / P) ] = TPR / [ TPR + ((N - t_n) / N)(N / P) ]    (13)

⟹ PPV = s / [ s + (1 - c)(1/ρ - 1) ]    (14)

Thus, we note that for fixed specificity and sensitivity, the PPV depends on prevalence. Indeed, it is clear from the above argument that PPV decreases with decreasing prevalence, and vice versa.
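The dependence of PPV on prevalence can be computed directly from sensitivity s, specificity c, and prevalence ρ (a small numerical illustration of Eq. (14), not a clinical calculator):

```python
def ppv(sensitivity, specificity, prevalence):
    """Eq. (14): PPV = s / (s + (1 - c) * (1/rho - 1)),
    where rho is the prevalence P / (N + P)."""
    s, c, rho = sensitivity, specificity, prevalence
    return s / (s + (1 - c) * (1 / rho - 1))
```

At 50% sensitivity and 99% specificity, a tenfold drop in prevalence sharply reduces the PPV, which is why screening for a rare disease such as IPF must operate at very high specificity.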

IX. EFFECT OF CLASS IMBALANCE
ROC curves are generally assumed to be robust to class imbalance. Note that if we assume that patient outcomes are independent (which is well-justified in the case of a non-communicable condition, particularly in large databases), then t_p should scale linearly with the total number of positives P, and f_p with the total number of negatives N, implying that TPR = t_p / P and FPR = f_p / N are unchanged; i.e., with different sizes of the set of positive samples (or negative samples), the ROC curve remains unchanged. In particular, note that even if the prevalence is very small (say 0.01%), we cannot cheat to boost the AUC by labeling all predictions as negative, or stating that risk is always zero: in that case, our P is very small, but our t_p = 0 strictly, implying that our TPR = 0, thus leading to a zero AUC. We can cheat to boost the accuracy (See the previous section), but not the AUC.
Note that while relative class sizes or imbalance does not affect the ROC (under the assumption that true positives and true negatives scale with the number of positives and negatives), very small absolute sample sizes might still result in poor performance of the model.
The precision-recall curves do get affected by class imbalance, or the prevalence, as shown by Eq. (14). However, they are important in diagnostic analysis, since we are generally less interested in the number of true negatives; the ratio of false positives to the total number of positive recommendations by the algorithm, i.e., the PPV or the precision, is much more relevant.

X. GENERATING PFSA MODELS FROM SET OF INPUT STREAMS WITH VARIABLE INPUT LENGTHS
Our PFSA reconstruction algorithm 2 is distinct from standard HMM learning. We do not need to pre-specify structures, or the number of states in the algorithm, and all model parameters are inferred directly from data. Additionally, we can operate either with 1) a single input stream, or 2) a set of input streams of possibly varying lengths which are assumed to be different and independent sample paths from the unknown stochastic generator we are trying to infer. At an intuitive level, we use the input data to infer the length of histories one must remember to estimate the current state, and predict futures for the process being modeled. Thus, we do not step through the symbol streams with a pre-specified model structure, and avoid the need to have equal-length inputs. More details of the algorithm are provided in the next section.
The ability to model a set of input streams of varying lengths is particularly important, since medical histories of different patients are typically of different lengths.

XII. Probabilistic Finite-State Automaton
Let Σ be a finite alphabet of symbols with size |Σ|. The set of sequences of length d over Σ is denoted by Σ^d. The set of finite but unbounded sequences over Σ is denoted by Σ*, the Kleene star operation 8 , i.e., Σ* = ∪_{d=0}^∞ Σ^d. We use lower case Greek letters, for example σ or τ, for symbols in Σ, and lower case Latin letters, for example x or y, for sequences of symbols, i.e., x = σ_1 σ_2 ... σ_n. We use |x| to denote the length of x. The empty sequence is denoted by λ.
We denote the set of strictly infinite sequences over Σ by Σ^ω, and the set of strictly infinite sequences having x as prefix by xΣ^ω. Let B = {xΣ^ω : x ∈ Σ*} ∪ {∅}; we can verify that B is a semiring 9 over Σ^ω. We use F_B to denote the sigma algebra generated by B.

Definition 2 (Stochastic Process over Σ). A stochastic process over a finite alphabet Σ is a collection of Σ-valued random variables {X_t}_{t∈N} indexed by positive integers 10 .

We are specifically interested in processes in which the X_i's are not necessarily independently distributed.
Definition 3 (Sequence-Induced Measure and Derivative). For a process P, let Pr_P(x), or simply Pr(x), denote the probability of P producing a sample path prefixed by x. The measure μ_x induced by a sequence x ∈ Σ* is the extension 9 to F_B of the premeasure defined on the semiring B given by

∀x, y ∈ Σ*,  μ_x(yΣ^ω) ≜ Pr(xy) / Pr(x),  if Pr(x) > 0    (16)

For any d ∈ N, the d-th order derivative of a sequence x, written as Φ^d_x, is defined to be the marginal distribution of μ_x on Σ^d, with the entry indexed by y denoted by Φ^d_x(y). The first-order derivative is called the symbolic derivative and is denoted by Φ_x for short.

Definition 4 (Probabilistic Nerode Equivalence and Causal States 11 ).
For any pair of sequences x, y ∈ Σ*, x is equivalent to y, written as x ∼ y, if and only if either Pr(x) = Pr(y) = 0, or Φ_x = Φ_y. The equivalence class of a sequence x is denoted by [x] and is called a causal state 12 . The cardinality of the set of causal states is called the probabilistic Nerode index, or the Nerode index for simplicity.
We can see from the definition that causal states capture how the history of a process influences its future. Since the probabilistic Nerode equivalence is right invariant, it gives rise naturally to the automaton structure introduced below.

Definition 5 (Probabilistic Finite-State Automaton). A PFSA is a quadruple G = (Q, Σ, δ, π̃), where Q is a finite set, Σ is a finite alphabet, δ : Q × Σ → Q is called the transition map, and π̃ : Q → P_Σ, where P_Σ is the space of probability distributions over Σ, is called the transition probability. The entry of π̃(q) indexed by σ is denoted by π̃(q, σ).

Definition 6 (Transition and Observation Matrices). The transition matrix Π is the |Q| × |Q| matrix with the entry indexed by q, q', written as Π_{q,q'}, satisfying

Π_{q,q'} ≜ Σ_{σ ∈ Σ : δ(q,σ) = q'} π̃(q, σ)    (17)

and the observation matrix Π̃ is a |Q| × |Σ| matrix with the entry indexed by q, σ equaling π̃(q, σ).
We note that both Π and Π̃ are stochastic, i.e., non-negative with rows summing up to 1.

Definition 7 (Extension of δ and π̃ to Σ*). For any x = σ_1 ... σ_k, δ(q, x) is defined recursively by

δ(q, x) ≜ δ( δ(q, σ_1 ... σ_{k-1}), σ_k )    (18)

with δ(q, λ) = q, and π̃(q, x) is defined recursively by

π̃(q, x) ≜ ∏_{i=1}^{k} π̃( δ(q, σ_1 ... σ_{i-1}), σ_i )    (19)

with π̃(q, λ) = 1.

Definition 8 (Strongly Connected PFSA).
We say a PFSA is strongly connected if the underlying directed graph is strongly connected 13 . More precisely, a PFSA G = (Q, Σ, δ, π̃) is strongly connected if for any pair of distinct states q, q' ∈ Q, there is an x ∈ Σ* such that δ(q, x) = q'.
We assume all PFSA discussed in the sequel are strongly connected unless specified otherwise. For a strongly connected PFSA G, there is a unique probability distribution v over Q that satisfies v^T Π = v^T. This is the stationary distribution 14,15 of G and is denoted ℘_G, or ℘ if G is understood.

Definition 9 (Γ-Expression). For σ ∈ Σ, the |Q| × |Q| matrix Γ_σ with entries (Γ_σ)_{q,q'} ≜ π̃(q, σ) if δ(q, σ) = q', and 0 otherwise, is called the event-specific transition matrix, the event being that σ is the current output. Γ can also be extended to arbitrary x ∈ Σ* by defining Γ_x ≜ ∏_{i=1}^{k} Γ_{σ_i}, with Γ_λ = I.

Definition 10 (Sequence-Induced Distribution on States). For a PFSA G = (Q, Σ, δ, π̃) and a distribution ℘_0 on Q, the distribution on Q induced by a sequence x is given by ℘^T_{G,℘_0}(x) = ℘^T_0 Γ_x (normalized to sum to 1), with ℘_{G,℘_0}(λ) = ℘_0. The entry indexed by q ∈ Q of the vector ℘_{G,℘_0}(x) is written as ℘_{G,℘_0}(x, q). When ℘_0 = ℘_G, the stationary distribution of G, we write ℘_{G,℘_0}(x) as ℘_G(x), or simply as ℘(x), if G is understood.
Definition 11 (Stochastic Process Generated by a PFSA). Let G = (Q, Σ, δ, π̃) be a PFSA and let ℘_0 be a distribution on Q. The Σ-valued stochastic process {X_t}_{t∈N} generated by G and ℘_0 satisfies that X_1 follows the distribution ℘^T_0 Π̃ and X_{t+1} follows the distribution ℘^T_{G,℘_0}(X_1 ⋯ X_t) Π̃ for t ∈ N.
For the rest of this paper, we will assume ℘_0 = ℘_G if not specified otherwise. We can show that, when initialized with ℘_G, the process generated by a PFSA G is stationary and ergodic. We also note that, for the process generated by G, we have Φ_x = ℘_G(x)^T Π̃. Since ℘_G(λ) = ℘_G, the symbolic derivative of the empty sequence is the stationary distribution on the symbols.
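The stationary distribution and the symbolic derivative of the empty sequence can be computed numerically. A small sketch using plain-Python linear algebra and a hypothetical two-state model (matrices as nested lists; power iteration is one of several ways to solve v^T Π = v^T):

```python
def stationary(P, iters=200):
    """Stationary distribution of a row-stochastic matrix P, by power
    iteration starting from the uniform distribution."""
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
    return v

def symbolic_derivative_empty(P, P_tilde):
    """Phi_lambda = stationary(Pi)^T Pi-tilde: the stationary distribution
    over the symbols, per the identity in the text."""
    v = stationary(P)
    n, m = len(v), len(P_tilde[0])
    return [sum(v[i] * P_tilde[i][k] for i in range(n)) for k in range(m)]
```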

Definition 12 (Synchronizable PFSA and Synchronizing Sequence).
A synchronizing sequence is a finite sequence that sends an arbitrary state of the PFSA to a fixed state 16 . To be more precise, let G = (Q, Σ, δ, π̃) be a PFSA; we say a sequence x ∈ Σ* is a synchronizing sequence to a state q ∈ Q if δ(q', x) = q for all q' ∈ Q.
A PFSA is synchronizable if it has at least one synchronizing sequence. Given a sample path generated by a PFSA, we say the PFSA is synchronized if a synchronizing sequence transpires in the sample path.
Definition 13 (Equivalence and Irreducibility). Two PFSA G and H are equivalent if they generate the same stochastic process. A PFSA G is said to be irreducible if there is no other PFSA with a smaller state set that is equivalent to G.

Definition 14.
Consider a PFSA G over state set Q. For a given ε > 0, we say a sequence x is an ε-synchronizing sequence to a state q ∈ Q if ||℘_G(x) - e_q||_∞ ≤ ε, where e_q is the unit vector indexed by q. While there exist PFSA that are not synchronizable, we can show that an irreducible PFSA always has an ε-synchronizing sequence to some state q for arbitrarily small ε > 0. Moreover, we can show that as length increases, sequences produced by a PFSA become uniformly ε-synchronizing. These two are the underpinning properties for the inference algorithm of PFSA (See Alg. 1), because they imply that Φ_x can be used to approximate π̃(q) if x is properly prefixed and long enough.
Definition 15 (Joint ε-Synchronizing Sequence). Let G and H be two PFSA over state sets Q_G and Q_H, respectively. For a fixed ε, a sequence x is said to be jointly ε-synchronizing to (q, r) ∈ Q_G × Q_H if x is ε-synchronizing to q and to r simultaneously.
A pair of states (q, r) ∈ Q_G × Q_H is called a G-joint pair of states if ℘_G(q, r) > 0. We also define Q_∩ ≜ {(q, r) ∈ Q_G × Q_H : (q, r) is a G-joint pair}. The inference algorithm for PFSA is called GenESeSS, for Generator Extraction Using Self-similar Semantics.
With an input sequence x and a hyperparameter ε, GenESeSS outputs a PFSA in the following three steps: 1) approximate an almost synchronizing sequence; 2) identify the transition structure of the PFSA; 3) calculate the transition probabilities of the PFSA. See Alg. 1 and ref. 2 for details.

Algorithm 1: GenESeSS
Data: A sequence x over alphabet Σ, 0 < ε < 1
1. Approximate an ε-synchronizing sequence;
2. Identify the transition structure of the PFSA;
3. Calculate the transition probabilities of the PFSA.

The per-symbol negative log-likelihood of a sequence x under an inferred PFSA is computed recursively: with σ_i denoting the i-th entry of x, initialize p^T = ℘_G and L = 0; at each step set q^T = p^T Π̃, update L ← L - log q_{σ_i}, and update p^T ← p^T Γ_{σ_i} (normalized), where Γ is defined in Definition 9; finally, return L/|x|.
Theorem 2 (Convergence of log-likelihood). Let G and H be two reduced PFSA, and let x ∈ Σ^d be a sequence generated by G. Then, writing B̂_{x,n} for the per-symbol negative log-likelihood of the first n symbols of x, we have lim_{n→∞} B̂_{x,n} = h(G) when the likelihood is computed under G, consistent with Eq. (4).

XIV. PIPELINE OPTIMIZATION: HYPER-TRAINING, TRAINING, & VALIDATION
Our pipeline comprises a network of individually trained light gradient boosting machine (LGBM) 20 classifiers that focus on complementary aspects of the problem and operate on different categories of input features, as described next. Importantly, some of these features need to be generated non-trivially from the raw data, and these feature generators have parameters that need to be trained as well (or comprise models that need to be inferred). We call this inference of the feature generators hyper-training. Importantly, this is different from the more common notion of hyper-parameters: hyper-parameters are one or more variables whose scalar values are commonly tuned by grid-search or via some meta-heuristics to optimize classifiers, whereas hyper-training produces generators of features, not simply a set of numbers.

Hyper-training & Training
Trinary Quantization of Medical Histories: The medical histories are mapped into trinary disease-phenotype-specific data-streams to enable generation of some of the features described below, as outlined in Section IV (Step 1).

Feature Categories:
The features used in the pipeline may be categorized as follows: PFSA scores: The PFSA scores are computed on the basis of the inferred PFSA models as described in the previous sections. The generation of the PFSA models from the trinary data-streams is the first hyper-training step. These scores consist of the negative and positive log-likelihood of a phenotype-specific quantized medical history being generated by the PFSA models for the positive cohort and the control cohort of sex-stratified patients, and the corresponding sequence likelihood defects (See Section V). Recall that PFSAs are specialized HMMs, and these measures encode the dynamics of the underlying processes, and are sensitive to the ordering and frequency of the codes at the resolution of the disease phenotypes. Also, recall that disease phenotypes are broad categories of diagnostic codes, and that we generate PFSA models for each category, separately for the sexes and for the positive and control cohorts.
Prevalence scores (p-scores): The p-scores focus on individual diagnostic codes, and we create a dictionary of the ratio of relative prevalence of each code (relative to the set of all codes present) in the positive category (for each sex) to the control category. This is the second hyper-training step. In the later steps of the pipeline, we use dictionary look ups to map codes to their p-scores, and also their aggregate measures such as mean, median, and variance to train a downstream LGBM.
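A p-score dictionary of the kind described can be built in a few lines. This is a sketch with a hypothetical input layout (flat lists of codes per cohort); the actual pipeline computes these per sex over the full claims vocabulary:

```python
from collections import Counter

def p_scores(positive_codes, control_codes):
    """Ratio of the relative prevalence of each code in the positive cohort
    to its relative prevalence in the control cohort. Codes absent from the
    control cohort are skipped to avoid division by zero."""
    pos, ctrl = Counter(positive_codes), Counter(control_codes)
    n_pos, n_ctrl = sum(pos.values()), sum(ctrl.values())
    return {code: (pos[code] / n_pos) / (ctrl[code] / n_ctrl)
            for code in pos if code in ctrl}
```

Downstream, individual codes are mapped to their p-scores by dictionary lookup, and aggregates (mean, median, variance) feed the corresponding LGBM.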
Rare scores: These scores consist of a subset of p-scores which correspond to codes with particularly high and low relative prevalences (p-score > 2 or < 0.5). Thus, this feature category depends on the p-score dictionary generated in the second hyper-training step.
Sequence scores: Sequence scores are relatively straightforward statistical measures, such as mean, median, variance, and time since last occurrence, computed on the trinary phenotype-specific sex-stratified histories. No hyper-training is required for the generation of the sequence features.
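The sequence scores admit an equally simple sketch; the feature names below are hypothetical labels for the statistics the text lists:

```python
from statistics import mean, median, pvariance

def sequence_scores(series):
    """Simple statistical features of one trinary phenotype series,
    including weeks since the last non-zero (any-diagnosis) entry."""
    nonzero = [i for i, z in enumerate(series) if z != 0]
    since_last = len(series) - 1 - nonzero[-1] if nonzero else len(series)
    return {
        "mean": mean(series),
        "median": median(series),
        "variance": pvariance(series),       # population variance
        "weeks_since_last_event": since_last,
    }
```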
Thus we require three splits of the training dataset. The first split is used to carry out hyper-training of the PFSA models and the p-score dictionary. The second split is used to train the score-category-specific LGBMs, one for each feature category. The third split is used to train the final LGBM that takes inputs from the outputs of the four LGBMs in the previous layer. The network layout is shown in Fig. 1.

Validation
In validation, or actual prediction of patient fate, we use the trinary mapping, generate the features using the PFSA models and the p-score dictionary, and calculate the raw-risk via the trained LGBM network. The relative score is then obtained by a choice of the operating point reflecting the specificity/sensitivity trade-off discussed before.

Data Splits: Training and Validation Hold-out
All eligible patients are randomly split into the Training set (≈75% of data) and the Test set (≈25% of data). The Training set is then split into 3 subsets: 1) the hyper-training set (See Fig. 1A) is used to train the PFSA models and the p-score dictionary, 2) the second split (referred to as the pre-aggregation split, see Fig. 1B) is used to train the four feature-category-specific LGBMs, and 3) the final split (referred to as the aggregation split, see Fig. 1C) is used to train the aggregate LGBM which uses outputs from the trained LGBMs in the previous layer. This trained pipeline is then validated on the held-out validation split (≈25% of data).

Table SI- (fragment):
       Unspecified asthma with (acute) exacerbation
B39.5  Histoplasmosis duboisii
425.8  Cardiomyopath in oth dis
I05.9  Rheumatic mitral valve disease unspecified
411    Post MI syndrome