Application of Machine Learning to the Prediction of Cancer-Associated Venous Thromboembolism

Venous thromboembolism (VTE) is a common and impactful complication of cancer. Several clinical prediction rules have been devised to estimate the risk of a thrombotic event in this patient population; however, they have limitations. We aimed to develop a predictive model of cancer-associated VTE using machine learning as a means to better integrate all available data, improve prediction accuracy and allow applicability regardless of the timing of systemic therapy administration. A retrospective cohort was used to fit and validate the models, consisting of adult patients who had next-generation sequencing performed on their solid tumor between 2014 and 2019. A deep learning survival model limited to demographic, cancer-specific, laboratory and pharmacological predictors was selected based on results from training data for 23,800 individuals and was evaluated on an internal validation set of 5,951 individuals, yielding a time-dependent concordance index (C-index) of 0.72 (95% CI = 0.70–0.74) for the first 6 months of observation. Adapted models also performed well overall compared to the Khorana Score (KS) in two external cohorts of individuals starting systemic therapy: in an external validation set of 1,250 patients, the C-index was 0.71 (95% CI = 0.65–0.77) for the deep learning model vs 0.66 (95% CI = 0.59–0.72) for the KS, and in a smaller external cohort of 358 patients the C-index was 0.59 (95% CI = 0.50–0.69) for the deep learning model vs 0.56 (95% CI = 0.48–0.64) for the KS. The proportions of patients accurately reclassified by the deep learning model were 25% and 26%, respectively. In this large cohort of patients with a broad range of solid malignancies and at different phases of systemic therapy, the use of deep learning resulted in improved accuracy for VTE incidence predictions. Additional studies are needed to further assess the validity of this model.


Introduction
Cancer has long been known to confer an increased risk of venous thromboembolism (VTE). 1 The pathophysiological mechanisms are complex and remain incompletely elucidated. 2 Cancer-associated VTE is common, as approximately 20-30% of VTE episodes are associated with a malignancy. 3 Those events are clinically important, as they are a leading cause of mortality in patients with cancer. 4 Several randomized trials have demonstrated the effectiveness of pharmacological prophylaxis. However, applicability has been limited by currently available VTE risk stratification tools. 5,6 The most commonly used approach to estimate the risk of cancer-associated VTE is the Khorana Score (KS), a clinical prediction rule based on cancer type, peripheral blood cell counts and body mass index. 7 The KS was originally derived from a cohort of patients who had completed at least one cycle of a new chemotherapy regimen. Using this prediction rule, patients are assigned to one of three categories denoting their risk of VTE at 6 months. The KS has been extensively validated in multiple different healthcare systems. 8 In one large review, for a KS greater than or equal to 2 and 3, sensitivity was 55.2% (95% CI = 47.5%-62.6%) and 23.4% (95% CI = 18.4%-29.4%) respectively, while positive predictive value was 8.9% (95% CI = 7.3%-10.8%) and 11.0% (95% CI = 8.8%-13.8%) respectively. 8 Several other clinical prediction rules have been derived by different groups. [9][10][11][12][13][14][15][16] They tend to be based on the predictors already included in the KS, in addition to other clinical or tumor-specific characteristics, routine laboratory test results, presence of germline thrombophilia mutations and chemotherapy administered. In most cases, those algorithms have been derived in patients at the time new systemic therapy was started.
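To make the structure of a point-based clinical prediction rule such as the KS concrete, the following is a minimal sketch in Python. The text above only states that the KS combines cancer type, peripheral blood cell counts and body mass index into three risk categories; the specific point values, cutoffs and tumor site groupings below follow the commonly cited published version of the score and are assumptions that should be checked against reference 7. The names `KhoranaInputs`, `khorana_score` and `khorana_category` are hypothetical.

```python
from dataclasses import dataclass

# Assumed site groupings (commonly cited version of the score; verify against reference 7).
VERY_HIGH_RISK_SITES = {"stomach", "pancreas"}
HIGH_RISK_SITES = {"lung", "lymphoma", "gynecologic", "bladder", "testicular"}


@dataclass
class KhoranaInputs:
    cancer_site: str
    platelets_10e9_per_l: float   # pre-chemotherapy platelet count
    hemoglobin_g_per_dl: float
    on_esa: bool                  # erythropoiesis-stimulating agent use
    leukocytes_10e9_per_l: float  # pre-chemotherapy leukocyte count
    bmi_kg_per_m2: float


def khorana_score(p: KhoranaInputs) -> int:
    """Sum the points of the rule (assumed cutoffs, see lead-in)."""
    score = 0
    site = p.cancer_site.lower()
    if site in VERY_HIGH_RISK_SITES:
        score += 2
    elif site in HIGH_RISK_SITES:
        score += 1
    if p.platelets_10e9_per_l >= 350:
        score += 1
    if p.hemoglobin_g_per_dl < 10 or p.on_esa:
        score += 1
    if p.leukocytes_10e9_per_l > 11:
        score += 1
    if p.bmi_kg_per_m2 >= 35:
        score += 1
    return score


def khorana_category(score: int) -> str:
    """Map the score to the three risk categories (assumed: 0, 1-2, 3 or more)."""
    if score == 0:
        return "low"
    if score <= 2:
        return "intermediate"
    return "high"
```

A rule of this kind can be computed by hand at the bedside, which is its main practical advantage and also the reason it can only include a handful of predictors.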
Machine learning (ML) is a computational approach where algorithms are derived automatically from data. In the last decade, ML has found multiple real-world applications including in the medical field. 17 ML methods are well suited to integrate large amounts of information and derive risk estimation models. Given the proper conditions, ML algorithms can automatically identify complex interactions between predictors which would otherwise be very difficult to elucidate by traditional statistical methods under human supervision. ML will tend to outperform clinical prediction rules, because ML models can easily include multiple predictors and interactions, while clinical prediction rules must be simplistic as they are limited by reliance on human computation. 18 Recent evidence suggests that tumor somatic genetic alterations influence the risk of VTE. Notably, in some cases gene-specific effects appear to be conditional on tumor type. Additionally, available data seem to indicate an interaction of multiple genes, each contributing a small amount of information to risk prediction, rather than a single gene mediating a large part of the risk. Given those elements of interaction and the need to integrate data on multiple covariates, an ML approach could conceivably help optimize VTE risk prediction based on tumor genomic alterations. In this work, we aimed to derive an ML model to estimate the risk of cancer-associated VTE, incorporating cancer-specific genetic information.

Patient Cohorts
Approval was obtained from the Memorial Sloan Kettering (MSK) institutional review board before initiating this project. The use of data from the ONCOTHROMB 12-01 study was authorized by the institutional review board of the Hospital General Universitario Gregorio Marañón (Madrid, Spain). Three cancer patient cohorts were derived: the first one (main MSK cohort) served to train and internally validate the main model, while the two additional sets (external MSK cohort and ONCOTHROMB cohort) were used for external validation and evaluation of transfer learning. The main MSK cohort consisted of all adults who had MSK-IMPACT™ (Memorial Sloan Kettering Integrated Mutation Profiling of Actionable Cancer Targets) sequencing performed on their solid tumor malignancy between 2014 and 2019. Patients were included regardless of cancer stage, time from cancer diagnosis or ongoing treatment with an anticoagulant or antiplatelet agent. Individuals entered the cohort once their MSK-IMPACT™ result was reported in the clinical information system and were censored at the time of their last clinical note. They were included in the analysis without any restriction based on timing of chemotherapy administration, as reporting a more generalizable model was considered desirable. They were excluded if they had sustained an episode of cancer-associated thrombosis before the MSK-IMPACT™ result was reported. All sequencing included a patient-specific peripheral blood normal control to differentiate between cancer somatic and germline genetic alterations. VTE was defined as pulmonary embolism or lower extremity deep vein thrombosis (DVT). Lower extremity DVT included thrombi involving the common iliac vein, external iliac vein, common femoral vein, superficial femoral vein, deep femoral vein, popliteal vein, peroneal vein, anterior tibial vein, posterior tibial vein or a deep calf vein. All such events were included regardless of the presence of symptoms.
A VTE episode was considered cancer-associated if it occurred after or within the 365 days preceding a diagnosis of solid neoplasm. Events were detected using a review of anticoagulant prescriptions, keyword searches of radiology studies and the Clinical Event Detection and Recording System (CEDARS) natural language processing (NLP) pipeline for patients who were included in the cohort between 2014 and 2016, as described elsewhere. 40 Patients who had MSK-IMPACT™ performed between 2017 and 2019 were assessed only using CEDARS as applied to clinical notes and radiology reports. 42 Briefly, clinical notes and radiology reports were parsed with the spaCy NLP pipeline to derive individual word tokens and associated negation. Documents including any non-negated token combination from a predetermined reference list were presented in chronological order via a custom graphical user interface and reviewed manually. Token combinations for this second CEDARS VTE event detection step are listed in Supplementary Table 1. All detected events were reviewed by two adjudicators, always including a hematologist. A random subset of patients was audited manually to estimate sensitivity and specificity of the automatic event detection algorithms (see Supplementary Information).
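The screening logic described above (flag documents containing a non-negated keyword, then present them chronologically for manual review) can be illustrated with a deliberately simplified sketch. This is not the CEDARS pipeline and it does not use spaCy: tokenization is a plain regular expression and negation is approximated by looking for a cue word in a fixed window before the keyword, ignoring sentence boundaries. The keyword list, the negation cues and the window size are hypothetical; the actual token combinations are those listed in Supplementary Table 1.

```python
import re
from datetime import date

# Hypothetical keyword list and negation cues, for illustration only.
KEYWORDS = {"embolism", "thrombosis", "dvt"}
NEGATION_CUES = {"no", "not", "without", "negative", "denies"}
NEGATION_WINDOW = 6  # number of preceding tokens searched for a negation cue (assumption)


def has_non_negated_keyword(text: str) -> bool:
    """Return True if any keyword occurs without a negation cue shortly before it."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for i, token in enumerate(tokens):
        if token in KEYWORDS:
            preceding = tokens[max(0, i - NEGATION_WINDOW):i]
            if not NEGATION_CUES.intersection(preceding):
                return True
    return False


def documents_for_review(documents):
    """Keep flagged (date, text) documents and sort them chronologically."""
    flagged = [doc for doc in documents if has_non_negated_keyword(doc[1])]
    return sorted(flagged, key=lambda doc: doc[0])


notes = [
    (date(2018, 3, 1), "CT shows acute pulmonary embolism."),
    (date(2017, 5, 2), "No evidence of deep vein thrombosis."),
    (date(2016, 1, 1), "New DVT in left popliteal vein."),
]
review_queue = documents_for_review(notes)
```

In this toy example the negated report is dropped and the two remaining notes are queued in date order. A production system would rely on a proper NLP pipeline for negation detection, since a fixed token window misses long-range and post-posed negation.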
The external MSK cohort was aggregated separately for a retrospective study of the association of the KS with overall survival. 43 Patients were included if they had an active malignancy and were newly started on chemotherapy between October 2017 and November 2019, provided they had sufficient information to compute the KS. All solid tumor types were included. Event detection was conducted using International Classification of Diseases (ICD) 10 codes, a search of pharmacy records for full-dose anticoagulant prescriptions and a text search of radiology reports. VTE was defined as pulmonary embolism or lower extremity DVT. Positive findings were reviewed manually to ascertain the date of a VTE episode. The ONCOTHROMB cohort was prospectively accrued at several hospitals in Spain and was used to derive the TiC-Onco risk-assessment model as part of the ONCOTHROMB 12-01 study. 14
The main MSK cohort dataset was randomly partitioned into a training set comprising 80% of individuals and a validation set with the remainder, stratifying by outcome (VTE or death). The subset of patients who could have their KS calculated was also evaluated separately ("KS subset" of the validation set). In this group, the KS was assessed when an individual was prescribed systemic cancer treatment if this occurred in the first 6 months after the MSK-IMPACT™ report and there had been no treatment in the past year. The 6-month window restriction was applied to allow for a reliable comparison with models featuring genetic predictors. In the KS subset, predictions were made using laboratory and pharmacy data updated at the time the KS was derived; genomic data were the same as reported in the index MSK-IMPACT™ report.
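The 80/20 partition stratified by outcome can be sketched as follows. This is an illustrative stand-in rather than the procedure actually used in the study: the function name `stratified_split`, the outcome labels and the fixed seed are assumptions, and the split is done within each outcome group so that VTE, death and censored patients keep roughly the same proportions in both sets.

```python
import random
from collections import defaultdict


def stratified_split(outcomes, train_fraction=0.8, seed=0):
    """Split patient ids into training and validation lists, stratified by outcome.

    outcomes: dict mapping patient id -> outcome label (e.g. "vte", "death", "censored").
    """
    rng = random.Random(seed)
    by_outcome = defaultdict(list)
    for patient_id, outcome in outcomes.items():
        by_outcome[outcome].append(patient_id)

    train, validation = [], []
    for outcome in sorted(by_outcome):
        ids = sorted(by_outcome[outcome])  # sort first so the shuffle is reproducible
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train.extend(ids[:cut])
        validation.extend(ids[cut:])
    return train, validation


# Toy cohort: 10 VTE events, 20 deaths, 70 censored patients.
labels = ["vte"] * 10 + ["death"] * 20 + ["censored"] * 70
toy_outcomes = {f"patient_{i:03d}": label for i, label in enumerate(labels)}
train_ids, validation_ids = stratified_split(toy_outcomes)
```

Stratifying matters here because VTE is a relatively rare outcome; an unstratified random split could leave the validation set with too few events for stable concordance estimates.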
We used three machine learning algorithms to model the cumulative incidence function of cancer-associated VTE adjusting for the competing risk of death in the main MSK cohort training set: Fine-Gray regression, random survival forests and DeepHit. 44,45 Details on the choice of algorithm, computing environment, statistical packages and main functions can be found in Supplementary Information. Multivariate feature imputation was used to handle missing data. Continuous predictors used for the DeepHit model were standardized to zero mean and unit variance. Model features were selected based on prior knowledge of their potential contribution to predicting VTE events. Available features could be broadly classified into four groups: basic, lab, chemo and genetic (see Table 1). Those included age, gender, cancer type, metastatic status, time from tumor sampling (i.e. biopsy or surgical resection of tissue sample), time from cancer diagnosis, time elapsed since last systemic therapy administered (stratified by pharmacological class), routine laboratory test results (most recent value available in the prior 3 months for hemoglobin, total protein, albumin, sodium, potassium, chloride, blood urea nitrogen, creatinine, carbon dioxide, glucose, calcium, aspartate transaminase (AST), alanine transaminase (ALT), total bilirubin and alkaline phosphatase), tumor mutational burden and cancer somatic alterations in oncogenes or tumor suppressor genes included in the first generation of the MSK-IMPACT™ panel. This assay was described in detail elsewhere. 46 Only oncogenic or potentially oncogenic alterations were retained, including mutations, copy number alterations and fusions. We decided not to use the white blood count, platelet count, activated partial thromboplastin time and prothrombin time because those values tend to change daily under the influence of chemotherapy (for blood cell counts) and anticoagulation (for clotting times).
We felt that even though those predictors might seemingly improve accuracy, the final model could be less generalizable to other healthcare systems with different approaches to laboratory testing. Features were combined into elementary subsets, and the latter were used to derive 11 final feature sets destined to be included in models (see Supplementary Information). Optimal hyperparameters for random survival forests and DeepHit were determined using a grid search and tree-structured Parzen estimators respectively. Metrics for all three model types were derived using four iterations of five-fold cross-validation, producing 20 values of the metric that were averaged to generate the overall metric. The confidence interval was estimated with bootstrapping. The main metric selected to evaluate models was the time-dependent concordance index as originally derived by Antolini et al. 47 This measurement quantifies the ability of the predictive model to discriminate among subjects with different event times along the cumulative incidence function continuum. An index of 1 indicates perfect concordance between model predicted risk and actual survival, while a value of 0.5 means random concordance. Calibration was assessed with plots of predicted vs observed risk of VTE at 6 months. Observed risk was computed with the Aalen-Johansen estimator in order to account for censoring and the competing risk of death. Patients were categorized into 5 predicted risk groups using quantile cutoff points. Models were compared to the KS using the C-index and the concordance/reclassification tables.
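The idea behind the time-dependent concordance index can be sketched in a few lines of Python. For every patient i who experienced VTE at time T_i, and every patient j still under observation after T_i, the pair counts as concordant when the model assigns patient i the higher predicted cumulative incidence of VTE at T_i. The sketch below is a simplified illustration under stated assumptions: it omits inverse-probability-of-censoring weights, it excludes patients whose competing event or censoring occurred before T_i from the comparison, and it credits ties in predicted risk with 0.5. The function name and the event coding (0 = censored, 1 = VTE, 2 = death) are hypothetical and this is not the implementation used in the study.

```python
import math


def time_dependent_c_index(times, events, predicted_cif):
    """Simplified time-dependent concordance for the event of interest (code 1).

    times: observed follow-up times.
    events: 0 = censored, 1 = VTE, 2 = death (assumed coding).
    predicted_cif: one callable per patient; predicted_cif[i](t) returns the
        predicted cumulative incidence of VTE by time t for patient i.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # only patients with an observed VTE anchor a comparison
        for j in range(n):
            if j == i or times[j] <= times[i]:
                continue  # j must still be under observation after T_i
            comparable += 1
            risk_i = predicted_cif[i](times[i])
            risk_j = predicted_cif[j](times[i])
            if risk_i > risk_j:
                concordant += 1.0
            elif risk_i == risk_j:
                concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def exponential_cif(rate):
    """Toy predicted cumulative incidence curve, for the example only."""
    return lambda t: 1.0 - math.exp(-rate * t)


toy_times = [1.0, 2.0, 3.0, 4.0]
toy_events = [1, 1, 0, 0]
well_ordered = [exponential_cif(r) for r in (0.4, 0.3, 0.2, 0.1)]
c_index = time_dependent_c_index(toy_times, toy_events, well_ordered)
```

Because both risks are evaluated at the event time of patient i rather than at a single fixed horizon, the metric remains meaningful for models such as DeepHit whose predicted risk curves may cross over time.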

Model Validation
The best model was selected based on the C-index and potential usefulness in clinical practice. This model was re-fitted on the whole training set and evaluated on the validation set. All other models were considered secondary. Secondary models designed to account for unavailable predictors were validated on the external MSK cohort and the ONCOTHROMB cohort. We compared those models to the KS in external cohorts using concordance/classification tables. Using the KS, the high-risk group was defined as having a score of 2 or more because this threshold was used in prior studies of pharmacological VTE prophylaxis. 5,6,50 Risk was dichotomized for the DL models using a threshold of 9% risk of VTE at 6 months, because this was the observed risk for individuals with a KS of 2 or more in a large review. 8 As a means to further delineate the role of transfer learning in updating VTE prediction models, we evaluated the first secondary model in its original state and after fine-tuning the weights of the output layer on a dedicated transfer learning set from the external MSK cohort.
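The dichotomization and cross-tabulation described above can be sketched as follows. The two thresholds (KS of 2 or more, predicted 6-month risk of 9% or more) come from the text; everything else is an assumption made for illustration, including the function name and the use of crude event counts per cell. The study reports observed risk with the Aalen-Johansen estimator, which accounts for censoring and the competing risk of death, so the crude counts below are only a simplification.

```python
KS_HIGH_RISK_THRESHOLD = 2         # KS of 2 or more is high risk (from the text)
MODEL_HIGH_RISK_THRESHOLD = 0.09   # predicted 6-month VTE risk of 9% or more (from the text)


def reclassification_table(khorana_scores, predicted_risk_6m, had_vte):
    """Cross-tabulate KS risk group against model risk group.

    Returns a dict keyed by (ks_group, model_group) with the number of
    patients and the crude number of VTE events in each cell.
    """
    table = {}
    for ks, risk, vte in zip(khorana_scores, predicted_risk_6m, had_vte):
        ks_group = "high" if ks >= KS_HIGH_RISK_THRESHOLD else "low"
        model_group = "high" if risk >= MODEL_HIGH_RISK_THRESHOLD else "low"
        cell = table.setdefault((ks_group, model_group), {"n": 0, "vte": 0})
        cell["n"] += 1
        cell["vte"] += int(vte)
    return table


# Toy data: four patients, two of whom are reclassified by the model.
toy_table = reclassification_table(
    khorana_scores=[0, 1, 2, 3],
    predicted_risk_6m=[0.03, 0.12, 0.05, 0.20],
    had_vte=[False, True, False, True],
)
```

The off-diagonal cells, ("low", "high") and ("high", "low"), contain the patients reclassified by the model relative to the KS.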

Model Development and Selection
See Fig. 1 for a flow diagram of the selection process for all three cohorts and Fig. 2 for an overview of data flow. A total of 29,751 individuals from the main MSK cohort were included in the final analysis. The characteristics of patients in the main MSK cohort are shown in Table 2. The median age was 62 years. The most frequent tumor type was lung, representing 16% of patients. Less than half of samples were from a metastatic site, with 38% of cases falling into this category. The median time from cancer diagnosis upon cohort entry was 256 days (IQR = 79-1075 days), see Fig. 3. The median observation time was 239 days. Cancer-associated VTE occurred during the first 6 months of observation in 1,338 (4.5%) of the patients. Cumulative incidence functions for this outcome were derived using Kaplan-Meier and competing risk estimators (Fig. 4). The 6-month cumulative VTE estimates using the Kaplan-Meier method and the competing risk estimator were almost identical (5.0% vs. 4.9%), but the difference was more apparent when considering the full observation period (14.6% vs. 13.5%). Eleven models using distinct covariate sets were derived for each ML approach (see Supplementary Information for detail of the feature sets used). The three approaches (Fine-Gray regression, random survival forests and DeepHit) were applied to each feature set on the main MSK cohort training set (n = 23,800) using five-fold cross-validation. The time-dependent C-index results are provided in Supplementary Table 3. The highest value was noted for the DeepHit model using the "extensive" feature set, including demographics, cancer-specific characteristics, laboratory values, systemic treatment types and genomic predictors (C-index = 0.74, 95% CI = 0.71-0.76). This result was similar to the ones obtained with random survival forests (C-index = 0.73, 95% CI = 0.70-0.76) or Fine-Gray regression (C-index = 0.71, 95% CI = 0.69-0.74).
The DeepHit model using the same predictors but excluding genomic information performed similarly (C-index = 0.73, 95% CI = 0.70-0.75). This "limited" set included: age, sex, cancer type, presence or absence of metastatic disease, time from tumor sampling, time from cancer diagnosis, time from last systemic therapy administered for 13 drug classes, albumin, hemoglobin, sodium, potassium, chloride, calcium, carbon dioxide, glucose, urea, creatinine, total protein, AST, ALT, total bilirubin and alkaline phosphatase. Given the absence of a significant improvement in concordance using genomic predictors, we selected the limited feature set. The DeepHit approach was retained, considering that increased complexity was justified by the potential to use transfer learning in the future.
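The gap between the Kaplan-Meier and competing risk estimates of cumulative VTE incidence reported earlier in this section arises because the Kaplan-Meier method treats deaths as censored observations, implicitly assuming those patients could still develop VTE later. The following sketch computes both quantities on toy data to show the direction of the difference. It is an unoptimized illustration under assumed event coding (0 = censored, 1 = VTE, 2 = death), with a hypothetical function name, and is not the code used for the analysis.

```python
def cumulative_incidence(times, events, horizon, cause=1):
    """Return (one_minus_kaplan_meier, aalen_johansen) for `cause` at `horizon`.

    times: observed follow-up times.
    events: 0 = censored, 1 = VTE, 2 = death (assumed coding).
    """
    event_times = sorted({t for t, e in zip(times, events) if e != 0 and t <= horizon})
    survival_any = 1.0  # probability of being free of any event just before t
    survival_km = 1.0   # Kaplan-Meier survival, competing events treated as censoring
    cif = 0.0           # Aalen-Johansen cumulative incidence of the cause of interest
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        d_cause = sum(1 for u, e in zip(times, events) if u == t and e == cause)
        d_any = sum(1 for u, e in zip(times, events) if u == t and e != 0)
        cif += survival_any * d_cause / at_risk
        survival_any *= 1.0 - d_any / at_risk
        survival_km *= 1.0 - d_cause / at_risk
    return 1.0 - survival_km, cif


# Toy data: VTE at t=1 and t=3, death at t=2, two censored patients.
km_estimate, aj_estimate = cumulative_incidence(
    times=[1, 2, 3, 4, 5], events=[1, 2, 1, 0, 0], horizon=5
)
```

In this toy example the Kaplan-Meier based estimate exceeds the Aalen-Johansen estimate, and the two coincide when no competing events occur. This mirrors the pattern in the cohort, where the difference grows over longer follow-up as deaths accumulate.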

Internal Validation
Using the optimal hyperparameters derived from cross-validation, all models were re-fitted on the entirety of the main MSK cohort training set and final metrics computed on the corresponding validation set (n = 5,951). Confidence intervals were estimated with bootstrapping. Results using DeepHit for the 11 feature sets are shown in Supplementary Table 4.
Table 4; 26% of patients were reclassified accurately by secondary model B. See Fig. 9 for the cumulative incidence of VTE stratified by predicted risk group. The calibration plot is shown in Supplementary Fig. 3. Predicted risk estimates were outside the confidence

Reclassified    Low-Risk     High-Risk    158    11
                High-Risk    Low-Risk     153    5
*High-risk group includes patients with a Khorana Score greater than or equal to 2.
†High-risk defined as a predicted cumulative incidence of VTE at 6 months of 9% or greater, using Secondary model A.
‡Observed risk of VTE at 6 months computed using the Aalen-Johansen estimator.

                High-Risk    Low-Risk     46     7
**High-risk group includes patients with a Khorana Score greater than or equal to 2.
†High-risk defined as a predicted cumulative incidence of VTE at 6 months of 9% or greater, using Secondary model B.
‡Observed risk of VTE at 6 months computed using the Aalen-Johansen estimator.

Discussion
This is the first report of a deep learning model for the prediction of cancer-associated VTE. Our approach is novel in several aspects. We have included a broad range of solid tumors and considered a diverse group of patients at all phases of their cancer journey, regardless of systemic treatment status. The latter allows VTE risk estimation for a much larger population of patients than what is currently possible using the KS and related prediction rules, as the majority of risk assessment models reported in the literature only consider patients started on a new chemotherapy regimen, limiting generalizability and applicability to everyday clinical practice.
Additionally, we purposefully selected a model estimating the cumulative incidence function of VTE adjusting for the competing risk of death, as opposed to computing event probability at a fixed 6-month time point. This approach minimizes the potential for bias and allows end users to compute the risk of VTE at any arbitrary time point during the validated observation period, providing added flexibility in clinical applications.
The network architecture of the final model is conducive to the application of transfer learning, a methodology which has been studied and applied extensively for neural networks. 51 We evaluated transfer learning with one external dataset. In this case, there was only a trend toward improvement in model accuracy, arguably because concordance and calibration were already excellent with the original model; this limited the ability of transfer learning to improve prediction modeling. Transfer learning could potentially improve calibration when the model is ported to different cohorts and attenuate inconsistencies in absolute risk estimates, a problem identified in several studies evaluating the KS. 52 This approach opens the door to a new paradigm for the prediction of cancer-associated VTE, a world where models can be adapted to new healthcare settings in order to maximize external validity.
ML has been used by several groups to predict cancer-associated VTE, with encouraging results. [53][54][55][56][57] The models presented in those reports were limited to a combination of demographic, cancer-specific and routine laboratory assay predictors. This is the first attempt to use ML to estimate the risk of cancer-associated thrombosis (CAT) based on somatic genomic predictors in a large cohort of individuals with a solid tumor. There was no significant benefit to adding genomic predictors to a model already including demographic, cancer-specific, laboratory and pharmacological predictors in the cohort including all individuals regardless of systemic therapy status. These findings suggest that even though cancer somatic genetic alterations contain information about the risk of VTE, redundancy exists with other predictors and there is a point at which adding more covariates yields no marginal benefit. The gene-specific information was limited to a binary marker for oncogenes and tumor suppressor genes. It is possible that future work using more granular information (e.g. alteration type, variant allele frequency) and including other genes (e.g. coagulation factors, cytokines) would result in improved prediction accuracy. Interestingly, albumin was the most important feature in the final model. An association between a decreased serum albumin level and an increased risk of cancer-associated VTE has been reported previously. 58 On the other hand, while plasma electrolyte levels were important features in the model, those markers have not been previously reported to be associated with the risk of cancer-associated thrombosis. However, for a cohort of this size (29,751 individuals) prospective VTE event capture would be prohibitively costly. We feel satisfactory precautions have been taken to ensure reliability of event capture for this cohort. Notably, the cumulative incidence of VTE was consistent with values reported in other studies, suggesting a low rate of missed cases.
VTE cases were identified in the main MSK cohort using a novel NLP workflow, which can conceivably be more sensitive than the use of billing data to find relevant clinical events. Data missingness for covariates is unavoidable for large cohorts and can be problematic when attempting to fit predictive models. The possible consequences in this case include decreased model accuracy and more rarely a biased model if missingness is informative and not accounted for properly during imputation. In this regard, we used multivariate feature imputation, which uses the entire set of available features to estimate the missing values. Also, missing data were not common and were limited to laboratory predictors (missingness between 10% and 14% for most of the values, with only the carbon dioxide predictor missing for 36% of its values), so the impact on the final model is expected to be low in the MSK cohorts. However, the substantial rate of missingness noted for albumin in the ONCOTHROMB cohort might have contributed to inferior performance of the DeepHit model, for which this laboratory value was an important feature. Ultimately the value of the final model will greatly depend on its external validity, i.e. its performance in other healthcare systems. As discussed in a recent set of guidelines for the standardization of risk prediction model reporting in cancer-associated thrombosis, additional work will be necessary before implementation in other healthcare systems. 59

Conclusion
VTE is an important complication of cancer for which effective pharmacological prophylaxis methods exist. Currently available prediction rules have limited accuracy in stratifying patients for VTE risk. Future avenues to improve the overall benefit of VTE prophylaxis in this group will be contingent on better methods to quantify risk, a task for which ML is well-suited. The work presented here suggests that deep learning for survival analysis can be used to estimate the risk of cancer-associated VTE with accuracy.
Future external validation studies are needed to assess generalizability of the model derived with this cohort. The use of genomic predictors and transfer learning should be further explored and developed.

have no potential conflicts of interest to report.

Data Availability
The MSK data that support the findings of this study are available from the corresponding author upon reasonable request. The dataset for the ONCOTHROMB cohort is not openly available.

Code Availability
The code is available online at https://github.com/MOTREC-AI/ML-CAT.

Cancer-Associated VTE Cumulative Incidence Functions in the Main MSK Cohort. Cumulative incidence functions were derived from the Kaplan-Meier and the competing risk estimators, the latter using the Aalen-Johansen method.