This study aimed to construct a predictive model of VTE for hospitalized cancer patients using machine learning (ML). The logistic regression, Support Vector Machine, Random Forest, and XGBoost models included 33, 24, 32, and 35 features, respectively, which demonstrates that ML models can handle a large number of clinical features. In validation, all ML models effectively predicted the risk of VTE, and the XGBoost algorithm performed best, with the highest AUROC. The model can provide a reference for VTE prediction in hospitalized cancer patients, and early prediction and identification could help reduce VTE in cancer inpatients. We found that XGBoost and the Support Vector Machine were superior to traditional logistic regression for developing a prediction model.
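To make the AUROC comparison concrete, the following is a minimal sketch of how the metric can be computed and used to rank candidate models. It is not the study's code: the labels, the scores, and the names `model_a` and `model_b` are invented for illustration, and the function uses the pairwise (Mann-Whitney) definition of AUROC rather than any particular library.

```python
from typing import Dict, List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random VTE case is scored above a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels (1 = VTE) and made-up predicted risks.
y_true: List[int] = [1, 0, 1, 0, 0, 1, 0, 0]
model_scores: Dict[str, List[float]] = {
    "model_a": [0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.2],
    "model_b": [0.6, 0.5, 0.4, 0.3, 0.7, 0.8, 0.2, 0.1],
}

# Compute AUROC per model and pick the one with the highest value.
aurocs = {name: auroc(y_true, s) for name, s in model_scores.items()}
best_model = max(aurocs, key=aurocs.get)
```

The pairwise loop is quadratic in the number of patients, which is acceptable for a sketch; a rank-based implementation would be preferable for large validation sets.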
The results identified several risk predictors of VTE in hospitalized cancer patients. The features selected in all four ML models were diabetes, renal failure, distant metastasis, lymph node metastasis, number of chemotherapy treatments, D-dimer, fibrin degradation products, international normalized ratio, vascular access carrier, pleural metastasis, hematological malignancies, gynecological tumour, and squamous cell carcinoma. Diabetes and renal failure are important risk factors for VTE, consistent with previous evidence.[15, 16] Cancer-related features also influenced VTE risk in the cancer population, including distant metastasis (especially pleural metastasis), lymph node metastasis, type of tumour (especially hematological malignancies and gynecological tumour), squamous cell carcinoma, and number of chemotherapy treatments. This finding is supported by Weitz et al.[17] The coagulation indicators fibrin degradation products, international normalized ratio, and D-dimer showed predictive value for VTE, consistent with Posch et al.[18]
Some interesting findings emerged from the feature importance rankings of the XGBoost model. Pleural metastasis was identified as an important risk factor for VTE for the first time. One possible explanation is that pleural metastasis carries a worse prognosis than other distant metastases, and a worse prognosis is associated with coagulation dysfunction. Joubert et al.[19] found that patients with pleural metastasis had significantly shorter survival than those with other distant metastases (25 vs 52 months; HR = 0.49). D-dimer, fibrin degradation products, international normalized ratio, antithrombin III, and high-sensitivity C-reactive protein were predictive factors in the XGBoost model. This suggests that adding laboratory indicators to risk assessment models is promising, in line with the conclusion of Ferroni et al.[20] Open abdominal procedures carried a higher weight for VTE risk than other types of open surgery.
No risk assessment model has been designed explicitly for inpatients with cancer.[10] The Khorana score is currently recommended by many guidelines as a risk stratification tool for cancer patients.[21] Compared with the Khorana score, the XGBoost model has several advantages. First, laboratory indicators on admission, rather than those before chemotherapy, were collected as features, so the risk of cancer-associated venous thromboembolism can be evaluated from admission laboratory indicators. This avoids the problem of obtaining pre-chemotherapy data to calculate the Khorana score. Second, our model identified more risk factors for cancer-associated venous thromboembolism. For example, open abdominal surgery was added to the model, supported by the American Society of Hematology 2019 guidelines.[22] The number of chemotherapy treatments was also added. Comorbidities, including hypertension, hyperthyroidism, serious infection, diabetes, chronic heart failure, and renal failure, were added to the model; these were risk factors for VTE in previous studies.[23] Beyond confirming them as risk factors, the model also reports their permutation scores. Identifying significant clinical risk factors could help clinicians and nurses quickly adopt measures targeting modifiable factors. Ranking the importance of risk factors could also help patients understand the severity of each risk factor.
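The idea behind ranking risk factors by permutation score can be illustrated with a small sketch, assuming the permutation scores refer to permutation feature importance: a feature's values are shuffled across patients, and the resulting drop in predictive performance is taken as its importance. The code below is an illustration only; `toy_model`, the two-feature data, and the use of accuracy as the metric are assumptions made for brevity and do not reproduce the study's model or data.

```python
import random
from typing import Callable, List

Row = List[float]


def accuracy(model: Callable[[Row], int], X: List[Row], y: List[int]) -> float:
    """Fraction of patients whose predicted label matches the true label."""
    return sum(model(row) == target for row, target in zip(X, y)) / len(y)


def permutation_importance(
    model: Callable[[Row], int],
    X: List[Row],
    y: List[int],
    n_repeats: int = 20,
    seed: int = 0,
) -> List[float]:
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row: Row) -> int:
    """Hypothetical stand-in classifier: uses feature 0 only, ignores feature 1."""
    return int(row[0] > 0.5)


# Made-up data: feature 0 determines the outcome, feature 1 is noise.
X = [[0.9, 1.0], [0.1, 0.0], [0.8, 0.0], [0.2, 1.0], [0.7, 1.0], [0.3, 0.0]]
y = [1, 0, 1, 0, 1, 0]
scores = permutation_importance(toy_model, X, y)
```

In this toy setting the ignored feature receives a score of exactly zero, while the feature the model relies on receives a positive score. In practice the scores are usually computed on held-out data so that they reflect generalization rather than fit to the training set.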
Although the model contains many features, most of them can be captured from the Hospital Information System (HIS). It is feasible to calculate the VTE risk level through a small program embedded in the HIS. This would reduce the clinical burden of assessing VTE risk manually and could effectively support practitioner decisions.
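As a rough illustration of what such a program might look like, the sketch below pulls features from a patient record, passes them to a prediction function, and maps the predicted probability to a risk level. Everything specific in it is hypothetical: the field names, the cut-offs of 0.1 and 0.3, and `stub_model`, which merely stands in for a trained model, are placeholders and are not taken from the study.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical HIS field names; real names depend on the hospital's system.
FEATURES = ["d_dimer", "inr", "diabetes", "pleural_metastasis"]


def extract_features(record: Dict[str, Optional[float]]) -> List[float]:
    """Read the model inputs from a record, refusing to guess missing values."""
    missing = [name for name in FEATURES if record.get(name) is None]
    if missing:
        raise ValueError(f"missing HIS fields: {missing}")
    return [float(record[name]) for name in FEATURES]


def risk_level(probability: float, low_cut: float = 0.1, high_cut: float = 0.3) -> str:
    """Map a predicted probability to a level; the cut-offs are placeholders."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability >= high_cut:
        return "high"
    if probability >= low_cut:
        return "intermediate"
    return "low"


def assess(
    record: Dict[str, Optional[float]],
    predict_proba: Callable[[List[float]], float],
) -> str:
    """Full pipeline: record -> features -> probability -> risk level."""
    return risk_level(predict_proba(extract_features(record)))


def stub_model(features: List[float]) -> float:
    """Stand-in for a trained model; returns fixed values for demonstration."""
    return 0.35 if features[3] == 1.0 else 0.05


record = {"d_dimer": 2.1, "inr": 1.2, "diabetes": 1.0, "pleural_metastasis": 1.0}
level = assess(record, stub_model)
```

Passing the prediction function as a parameter keeps the record handling separate from the model, so a retrained or externally validated model could be swapped in without changing the rest of the program.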
Limitations
First, data were obtained retrospectively from a single hospital. Second, some values were missing, which may have affected model performance. Third, overfitting occurred in the Random Forest model. One possible explanation is the insufficient sample size of the training set; a larger database would produce a more stable predictive model. Finally, the outcome variable (VTE) was identified using discharge diagnosis codes. Because most asymptomatic VTE events usually go undetected, the predictive model may have limited ability to identify asymptomatic VTE. External validation is required to evaluate the generalizability of the models.