3.1 Calculating disease trajectories
We extracted the data field ‘41270: Diagnoses – ICD10’ for participants in the UK Biobank, containing diagnoses from inpatient admissions records coded in ICD10 and converted the ICD10 codes to groupings of ICD10 blocks and ICD10 chapters. We extracted the data field ‘41280: Date of first in-patient diagnosis - ICD10’ from the inpatient admissions records, containing the date of each ICD10 diagnosis. Each participant’s diagnostic history was described as a vector of ICD10 blocks ordered by date of diagnosis. Only the first recorded instance of each block was used, and duplicates were removed. We filtered the data using the field ’31: Sex’ and for ICD10 blocks with more than 1% prevalence within sexes. Blocks not describing diseases, but procedures/interactions with healthcare were also removed, corresponding to ICD10 blocks beginning with letters O, P, Q, R, S, T, U, V, W, X, Y, Z. On each trajectory, the order of diagnosis was determined and the period to each subsequent diagnosis was calculated from first diagnosis in days.
3.2 Identifying diagnoses belonging to rapidly accruing or slowly accruing disease trajectories
To identify diagnoses that were late-in-order/early-in-time or early-in-order/late-in-time, we calculated the median order of each diagnosis and median time to each diagnosis in days. After plotting median order against median time, we calculated the line of best fit, the orthonormal basis vectors parallel and perpendicular to the line of best fit. Projecting the data points on to the basis vectors yielded coordinates along these directions. We then calculated the standard deviation and identified the blocks lying outside one and two standard deviations of the mean as the late-in-order/early-in-time or early-in-order/late-in-time diagnoses.
3.3 Clustering multimorbidities
We clustered all participants with more than one ICD10 block diagnosis to explore the transitions between clusters over time. The period of record availability was divided into three periods in which diagnoses accumulated equally using the 33% and 66% quantile thresholds of the distribution of time of diagnosis in days from first diagnosis. Morbidity combinations were calculated at the start of each period and at the end of the last period.
Clustering was subsequently performed on the data obtained by merging female and male disease combinations and merging across all time points. The tabulated combined data was represented as a diagnosis matrix with rows representing individuals and columns representing diagnoses. Dimensionality reduction was first performed using Multiple Correspondence Analysis (MCA) from the ‘FactoMineR’ R package. The number of components to retain was determined with an elbow plot (Supplementary Fig. 2A). The resulting space of coordinates was used in k-means clustering using base R. Optimal cluster parameters were selected using elbow and Calinski-Harabasz (Caliński and Harabasz, 1974) plots (Supplementary Fig. 2B-C). Calinski-Harabasz scores were calculated using the ‘fpc’ R package. The movement of individuals between clusters across the four timepoints was visualised as a Sankey diagram using SankeyMATIC1.
3.4 Identifying diagnoses sharing trajectories
Heatmaps describing the coincidence of chapters and blocks in disease trajectories were based on the Jaccard index which expresses the proportion of overlap between two groups. If \({N}_{A\cap B}\) denotes the number of participants with trajectories containing both diagnoses A and B and \({N}_{A\cup B}\) denotes the number of participants with trajectories containing either diagnosis A or diagnosis B, the Jaccard Index is defined to be \({N}_{A\cap B}/{N}_{A\cup B}\). Jaccard values were calculated using the ‘jaccard’ package in R. ICD10 chapters were clustered using complete hierarchical clustering on the mean Jaccard values across blocks and blocks were further hierarchically clustered within chapters. Clustering and associated heatmaps were generated with the ‘Heatmap()’ function of the ‘ComplexHeatmap’ package in R.
3.5 Building diagnosis trajectories
In order to build diagnosis trajectories, we adapted an approach developed elsewhere (Jensen et al., 2014). We identified pairs of diagnoses that were significantly coincident using a bootstrapping test from the ‘jaccard’ R package (Chung et al, 2019). For each diagnosis pair, A and B, we calculated the frequency of occurrences where the time to diagnosis of A is less than B, greater than B and equal to B. We subsequently performed double one-tailed binomial tests of whether A or B more frequently occurs earliest, including the cases where A and B are equal in the comparison group. From the disease pairs with both a significant Jaccard coefficient and a significant ordering, we assembled longer trajectories incrementally. Binomial tests were repeated for each longer trajectory, identifying new diagnoses that significantly occurred before or after the longer trajectory in the population of participants with the longer trajectory. Trajectory that applied to fewer than 20 individuals were removed. This process was repeated until no further trajectories were identified. P-values for all statistical tests were Bonferroni-Holm adjusted (Holm, 1979).
3.6 Calculating 1-year post-trajectory age-relative rates of mortality and hospitalisation of trajectories
We identified the subpopulations with each significant trajectory, correctly ordered and with at least 14 days between consecutive diagnoses. The age at which participants received their final diagnosis within trajectories was approximated from their age at assessment (data field ‘21003: - Age when attended assessment centre’) adjusted by the number of days between their assessment centre visit (obtained from data-field ’53: Date of attending assessment centre’) and the date at which the final disease was diagnosed plus 188 days (midpoint of a year).
For each member of the trajectory subpopulation, we identified three random controls from the whole UK biobank with at least one diagnosis, but without any of the diseases involved in the trajectory, and set the starting point for comparison as the estimated age of the case at which the final diagnosis in the trajectory occurred. With this 1:3 case-control ratio, we used Accelerated Failure Time (AFT) models (Wei, 1992) to calculate the fold increase in the mortality and hospitalisation rate of subpopulation members in the year following the final diagnosis of the trajectory compared to controls of the same age. Mortality models were censored for available follow up in the UK Biobank while hospitalisation models were censored for both available follow up and death. Hospitalisation events were derived from unfiltered hospital diagnoses data including ICD10 codes below 1% prevalence and codes not describing diseases.
3.7 Deriving triage rules
To identify trajectories associated with high 1-year post-diagnosis age-relative mortality and/or hospitalisation, we grouped trajectories by endpoint.. From the total range of trajectory mortality/hospitalisation risk values, we were able to calculate quantile thresholds to establish which trajectories of each endpoint belonged in which risk categories.
Triage tables (Fig. 4 and Supplementary Tables 18 and 19) allow us to interpret increased risks in a clinically useful manner. The endpoint of a trajectory represents a presenting diagnosis and a prior diagnoses to rule in or out whether a patient finds themselves on a high risk trajectory. Hence, the triage tables are grouped by endpoints and subgrouped by prior diagnoses. In Supplementary Tables 18 and 19, risk can be filtered for single prior diagnoses or combinations of prior diagnoses. For each endpoint and prior set of diagnoses, we state the risk category for which the estimated age-relative risk fell within .
3.8 Comparing female and male outcomes of trajectories
To identify disease histories which exhibited significant interactions with sex on future risk, we compared the estimated mortality and hospitalisation rate estimates of trajectories of length 2 which were identified in both females and males. We scaled the estimates without centring within sexes using the scale() function in base R. Estimates were thus expressed in standard deviations of the total distribution of estimates from trajectories of length 2 for each sex and allowed for direct comparison. To identify the largest divergence between sexes of these scaled estimates, we subtracted each male scaled estimate from its corresponding female scaled estimate and visualised the differences grouped by disease endpoint. For mortality, we visualised differences which were greater than 2 standard deviations in absolute terms. Hospitalisation differences were generally smaller and thus we used the lower threshold of 1 standard deviation for visualisation of differences.
To identify high risk disease histories that were unique to each sex, we visualised the estimates of single disease histories that were only identified in either females or males and which were within the top 10% of estimates (Very high risk).
3.9 Code/scripts
Scripts reproducing analysis are available upon request.