Study design and patients
This retrospective study included patients with gastrointestinal cancer who underwent CTPA at a tertiary center (Seoul National University Hospital, Seoul, Korea) between 2010 and 2020, identified from an electronic medical records database (Figure 1). The diagnosis of gastrointestinal cancer was confirmed by pathological results. Hepatocellular carcinoma diagnosed on imaging without pathological confirmation was also included (16). Patients were excluded for any of the following: suspected PTE associated with causes other than cancer or other malignancy, no evidence of malignancy at the time of CTPA, ambiguous diagnosis of PTE on CTPA, or missing data. The diagnosis of PTE was confirmed by trained experts based on CTPA.
This study was approved by the institutional review board (IRB) of Seoul National University Hospital, Korea (IRB No. 2009-146-1159). The need for informed consent was waived by the IRB.
Data collection and definition
Patient characteristics were collected retrospectively, including age, sex, cancer diagnosis, the Wells score and its components, and D-dimer. Gastrointestinal cancers were defined as cancers of the gastrointestinal tract from the esophagus to the anus and cancers of the liver and pancreatobiliary system (17). The Wells score was calculated as previously described for patients with suspected PTE (18). Its components are signs and symptoms of VTE, an alternative diagnosis less likely than pulmonary embolism, heart rate > 100 beats per minute, history of VTE, immobilization, malignancy, and hemoptysis. Of these, we did not include malignancy because all subjects had been diagnosed with cancer. History of VTE was defined as a previous history of PTE or DVT (18). A low C-PTP was defined as a Wells score of 0 to 1.5, a moderate C-PTP as 2.0 to 6.0, and a high C-PTP as 6.5 or higher (18). The number needed to diagnose (NND) was defined as the number of patients who need to be examined to diagnose one patient with the disease. NND is the inverse of the Youden index (sensitivity + specificity – 1) (19).
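The C-PTP categories and the NND definition can be illustrated with a short sketch. The function names below are hypothetical and are not taken from the study code. The thresholds are the ones stated in this section, and the sketch takes an already calculated Wells score rather than assuming component weights.

```python
def c_ptp_category(wells_score: float) -> str:
    """Map a Wells score to the C-PTP category defined in this section."""
    if wells_score < 0:
        raise ValueError("Wells score cannot be negative")
    if wells_score <= 1.5:   # low: 0 to 1.5
        return "low"
    if wells_score <= 6.0:   # moderate: 2.0 to 6.0
        return "moderate"
    return "high"            # high: 6.5 or higher


def number_needed_to_diagnose(sensitivity: float, specificity: float) -> float:
    """Return NND as the inverse of the Youden index."""
    youden_index = sensitivity + specificity - 1.0
    if youden_index <= 0:
        # A test no better than chance has no meaningful NND.
        raise ValueError("Youden index must be positive")
    return 1.0 / youden_index
```

For illustrative values of sensitivity 0.75 and specificity 0.75, the Youden index is 0.5 and the NND is 2. These values are chosen for the example only and are not results of the study.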
Study outcome measures
The primary outcomes of this study were the area under the receiver operating characteristic curve (AUROC) and the accuracy of the ML model for the diagnosis of PTE in patients with gastrointestinal cancer.
As a secondary outcome, we compared the number of CTPA examinations performed for PTE under the ML model with that under the conventional model. We also investigated the NND and the feature importance of the ML model.
Statistical analysis
To compare baseline characteristics, Student’s t-test and the chi-square test were used for continuous and dichotomous variables, respectively. If any subgroup had fewer than four subjects, Fisher’s exact test was used instead of the chi-square test.
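The test selection rule can be sketched in Python with SciPy. This is an illustration only, since the study itself used SPSS for these comparisons. The helper names are hypothetical, "subgroup" is interpreted here as a cell of the 2x2 table, and the equal-variance t-test and the uncorrected chi-square statistic are assumptions.

```python
import numpy as np
from scipy import stats


def compare_continuous(group_a, group_b) -> float:
    """P-value of Student's t-test (equal variances assumed)."""
    return stats.ttest_ind(group_a, group_b, equal_var=True).pvalue


def compare_dichotomous(table, min_count: int = 4):
    """Return (test name, p-value) for a 2x2 contingency table.

    Fisher's exact test is used when any cell has fewer than
    `min_count` subjects; otherwise the chi-square test is used.
    """
    table = np.asarray(table)
    if table.shape != (2, 2):
        raise ValueError("expected a 2x2 table")
    if (table < min_count).any():
        return "fisher", stats.fisher_exact(table)[1]
    # correction=False gives the Pearson chi-square without Yates' correction.
    chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
    return "chi-square", p_value
```

For example, `compare_dichotomous([[1, 9], [8, 12]])` falls back to Fisher's exact test because one cell contains a single subject.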
ML was performed using 10-fold cross-validation. In each fold, 90% of the subjects were assigned to the training group and 10% to the validation group. The model was trained using the demographic information and labels of the training group, and predictions were made from the demographic information of the validation group. The model parameters were then reset, and training and prediction were repeated on a new split of the data. This process was repeated 10 times, with the splits arranged so that every subject was included in the validation group exactly once. As a result, the model made one prediction for every subject, which makes efficient use of the data.
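The cross-validation procedure described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the study code: the data are synthetic, the number of trees and the random seeds are arbitrary, and stratified splitting is an assumption because the splitting strategy is not specified here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the patient features and PTE labels.
X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=4, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
oof_prob = np.full(len(y), np.nan)  # one out-of-fold prediction per subject

for train_idx, val_idx in cv.split(X, y):
    # A fresh model per fold corresponds to resetting the parameters.
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oof_prob[val_idx] = model.predict_proba(X[val_idx])[:, 1]

# Every subject was in the validation group exactly once, so the
# metrics can be computed over the whole cohort.
auroc = roc_auc_score(y, oof_prob)
accuracy = accuracy_score(y, (oof_prob >= 0.5).astype(int))
print(f"AUROC={auroc:.3f}, accuracy={accuracy:.3f}")
```

Pooling the out-of-fold predictions is one common way to obtain a single AUROC. Averaging the per-fold AUROC values is an alternative that can give slightly different numbers.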
The random forest model has shown good performance in regression and classification, especially in healthcare, where deep learning is often not feasible because of limited data (20, 21). Although PTE is an important disease, the number of cases is relatively small, so we used the random forest model (22). The random forest model can also track which features it mainly relied on when making decisions. This is a form of explainable artificial intelligence, which can help address the reliability problem that most deep learning algorithms currently have. In this study, we used impurity-based feature importance (23). The importance score of a feature is the sum of the information gains of all nodes split on that feature divided by the total information gain of all nodes. Using this score, we determined how much each feature contributed to classifying the data.
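Impurity-based importance scores can be read from a fitted scikit-learn random forest through its `feature_importances_` attribute, as in the sketch below. The data and feature names are synthetic placeholders, and `criterion="entropy"` is an assumption chosen to match the information-gain description (the scikit-learn default is Gini impurity). scikit-learn also weights each node's impurity decrease by the fraction of samples reaching it, then normalizes the scores to sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names for a synthetic data set.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

model = RandomForestClassifier(n_estimators=100, criterion="entropy",
                               random_state=0)
model.fit(X, y)

# Normalized impurity-based importance, averaged over all trees.
importances = model.feature_importances_

ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based scores are computed on the training data and can favor features with many distinct values, so they are best read as a relative ranking rather than an effect size.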
A P-value lower than 0.05 was considered statistically significant. Statistical calculations were performed with SPSS and the random forest model of scikit-learn. Feature importance was measured with the impurity-based feature importance algorithm. The ROC curves of the models were compared using the DeLong test.