This is the first report of a deep learning model for the prediction of cancer-associated VTE. Our approach is novel in several respects. We included a broad range of solid tumors and considered a diverse group of patients at all phases of their cancer journey, regardless of systemic treatment status. The latter allows VTE risk estimation for a much larger population of patients than is currently possible using the KS and related prediction rules, as the majority of risk assessment models reported in the literature only consider patients starting a new chemotherapy regimen, which limits generalizability and applicability to everyday clinical practice. Additionally, we purposefully selected a model that estimates the cumulative incidence function of VTE while adjusting for the competing risk of death, as opposed to computing event probability at a fixed 6-month time point. This approach minimizes the potential for bias and allows end users to compute the risk of VTE at any time point during the validated observation period, providing added flexibility in clinical applications.
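To make the notion of a cumulative incidence function under competing risks concrete, the following is a minimal sketch of the nonparametric Aalen-Johansen estimator in plain Python. It is an illustration only and is not the deep learning model described in this work; the function names, the event coding (0 = censored, 1 = VTE, 2 = death) and the toy follow-up data are all assumptions introduced for the example.

```python
from typing import List, Sequence, Tuple


def cumulative_incidence(
    times: Sequence[float], events: Sequence[int], cause: int = 1
) -> List[Tuple[float, float]]:
    """Aalen-Johansen estimate of the cumulative incidence of `cause`.

    Hypothetical event coding: 0 = censored, 1 = VTE, 2 = death.
    Returns (time, CIF) pairs at each time where any event occurs.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0  # probability of being free of ANY event just before t
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_cause = d_any = 0
        j = i
        # Group all subjects sharing the same follow-up time.
        while j < len(data) and data[j][0] == t:
            if data[j][1] == cause:
                d_cause += 1
            if data[j][1] != 0:
                d_any += 1
            j += 1
        if d_any > 0:
            # Increment = P(event-free before t) * cause-specific hazard at t.
            cif += surv * d_cause / n_at_risk
            surv *= 1.0 - d_any / n_at_risk
            curve.append((t, cif))
        n_at_risk -= j - i  # events and censored subjects leave the risk set
        i = j
    return curve


def cif_at(curve: List[Tuple[float, float]], t: float) -> float:
    """Read the step function at an arbitrary time point t."""
    value = 0.0
    for time, cif in curve:
        if time > t:
            break
        value = cif
    return value


# Toy data (hypothetical): follow-up in months and event codes.
months = [2, 3, 4, 5, 6, 8]
status = [1, 2, 0, 1, 2, 0]
curve = cumulative_incidence(months, status)
print(cif_at(curve, 6))
```

Because the estimate is a step function over the whole follow-up period, `cif_at` can be queried at any time point rather than at a single fixed horizon. Deaths reduce the event-free probability without being counted as VTE events, which is what distinguishes this estimator from a naive analysis that treats death as censoring.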
The network architecture of the final model is conducive to transfer learning, a methodology that has been studied and applied extensively for neural networks.51 We evaluated transfer learning with one external dataset. In this case, there was only a trend toward improvement in model accuracy, arguably because concordance and calibration were already excellent with the original model, which left little room for transfer learning to improve prediction. Transfer learning could potentially improve calibration when the model is ported to different cohorts and attenuate inconsistencies in absolute risk estimates, a problem identified in several studies evaluating the KS.52 This approach opens the door to a new paradigm for the prediction of cancer-associated VTE, in which models can be adapted to new healthcare settings in order to maximize external validity.
ML has been used by several groups to predict cancer-associated VTE, with encouraging results.53–57 The models presented in those reports were limited to a combination of demographic, cancer-specific and routine laboratory assay predictors. This is the first attempt to use ML to estimate the risk of cancer-associated thrombosis (CAT) based on somatic genomic predictors in a large cohort of individuals with a solid tumor. In the cohort including all individuals regardless of systemic therapy status, adding genomic predictors to a model already including demographic, cancer-specific, laboratory and pharmacological predictors yielded no significant benefit. These findings suggest that even though cancer somatic genetic alterations contain information about the risk of VTE, redundancy exists with other predictors, and there is a point at which adding more covariates yields no marginal benefit. The gene-specific information was limited to a binary marker for oncogenes and tumor suppressor genes. It is possible that future work using more granular information (e.g. alteration type, variant allele frequency) and including other genes (e.g. coagulation factors, cytokines) would result in improved prediction accuracy. Interestingly, albumin was the most important feature in the final model. An association between a decreased serum albumin level and an increased risk of cancer-associated VTE has been reported previously.58 On the other hand, while plasma electrolyte levels were important features in the model, those markers have not previously been reported to be associated with the risk of cancer-associated thrombosis.
Concordance was preserved in the KS subset of the main MSK cohort validation set for the selected deep learning model using a set of covariates including widely available clinical, pathological and laboratory predictors. Concordance for this model was superior to that obtained using the KS. This satisfactory performance in the subset of patients starting systemic therapy was confirmed with a similar model in the larger external MSK cohort validation set. The latter findings are important because this group of patients commencing cancer treatment is currently the focus of pharmacological VTE prophylaxis and has been featured prominently in other studies of predictive models.
The main limitation of this work is the retrospective nature of the model derivation cohort. Relying on medical records can affect sensitivity and increase the risk of bias in capturing events of interest. However, for a cohort of this size (29,751 individuals), prospective VTE event capture would be prohibitively costly. We believe satisfactory precautions have been taken to ensure reliability of event capture for this cohort. Notably, the cumulative incidence of VTE was consistent with values reported in other studies, suggesting a low rate of missed cases. VTE cases were identified in the main MSK cohort using a novel NLP workflow, which can conceivably be more sensitive than the use of billing data to find relevant clinical events. Data missingness for covariates is unavoidable for large cohorts and can be problematic when attempting to fit predictive models. The possible consequences in this case include decreased model accuracy and, more rarely, a biased model if missingness is informative and not accounted for properly during imputation. In this regard, we used multivariate feature imputation, which uses the entire set of available features to estimate the missing values. Also, missing data were not common and were limited to laboratory predictors (missingness between 10% and 14% for most of the values, with only the carbon dioxide predictor missing for 36% of its values), so the impact on the final model is expected to be low in the MSK cohorts. However, the substantial rate of missingness noted for albumin in the ONCOTHROMB cohort might have contributed to inferior performance of the DeepHit model, for which this laboratory value was an important feature. Ultimately, the value of the final model will greatly depend on its external validity, i.e. its performance in other healthcare systems.
As discussed in a recent set of guidelines for the standardization of risk prediction model reporting in cancer-associated thrombosis, additional work will be necessary before implementation in other healthcare systems.59
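As an illustration of the multivariate feature imputation mentioned in the limitations above, the following is a minimal sketch of regression-based iterative imputation using NumPy. It is not the implementation used in this work: the function name `iterative_impute`, the use of ordinary least squares as the per-feature model, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def iterative_impute(X, n_iter: int = 10) -> np.ndarray:
    """Fill NaNs by regressing each incomplete feature on all other features.

    Hypothetical helper: starts from column means, then repeatedly refits an
    ordinary least squares model per incomplete column (round-robin).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Initial guess: replace each missing entry with its column mean.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where column j was observed, predict the rest.
            coef, *_ = np.linalg.lstsq(
                design[~rows], filled[~rows, j], rcond=None
            )
            filled[rows, j] = design[rows] @ coef
    return filled


# Synthetic example (hypothetical values): the third feature depends on the
# other two and is missing for roughly a third of the rows.
rng = np.random.default_rng(0)
labs = rng.normal(size=(200, 3))
labs[:, 2] = 0.8 * labs[:, 0] - 0.5 * labs[:, 1] + 0.1 * rng.normal(size=200)
observed = labs.copy()
observed[rng.random(200) < 0.36, 2] = np.nan
completed = iterative_impute(observed)
print(np.isnan(completed).sum())
```

Observed entries are left untouched and only missing entries are replaced. A sketch like this assumes the missing values can be predicted from the observed features; if missingness is informative, as discussed above, the imputed values may still be biased.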