Patient Selection
This was a retrospective, single-centre study of patients who underwent surgery for primary lung cancer between January 2008 and December 2020 at the Department of Thoracic Surgery, Chiba Cancer Centre. Exclusion criteria were as follows: (i) a history of lung cancer; (ii) preoperative chemotherapy; (iii) surgery for multiple lung cancers; (iv) no 18F-FDG-PET/CT examination with a Siemens Biograph 6 LSO scanner (Siemens, Erlangen, Germany) within three months before surgery; (v) missing analysable imaging data; (vi) poorly integrated PET/CT images unsuitable for analysis; (vii) partial resection; and (viii) a pathology report insufficient to diagnose pathologically highly invasive lung cancer. Cases of partial resection were excluded because the lymph nodes were not evaluated. A flowchart of the selection criteria is shown in Fig. 1. Ultimately, 873 of 1668 patients met the criteria and were eligible. This study was approved by the Ethics Review Committee at our institution (No. R04-134), and the requirement for written informed consent was waived owing to the retrospective design.
Pathological Findings
Pathologically highly invasive lung cancer was defined as cases with any of the following: (i) lymph node metastasis; (ii) vascular invasion; (iii) lymphatic invasion; (iv) pleural invasion; or (v) intrapulmonary metastasis. The control group was defined as cases of lung cancer without any of the above findings.
Image Acquisition
18F-FDG PET/CT images were obtained using a Biograph 6 PET/CT scanner (Siemens Healthcare) at the Chiba Cancer Centre, following the guidelines of the Japanese Society of Nuclear Medicine. 18F-FDG (4 MBq/kg body weight) was injected, and image acquisition started 60 minutes after injection. PET was acquired for 2 min per bed position, with shorter acquisition times per bed for higher doses and longer times for lower doses to reduce the effect of dose differences. The CT slice thickness was 5 mm, and images were reconstructed using a point-spread-function algorithm.
Tumour Segmentation
PET and CT images were retrieved from the electronic medical record, loaded into 3D Slicer (version 4.11), and used for lung cancer segmentation. The Grow from seeds tool implemented in 3D Slicer was used to segment lung cancer on CT images; it achieves segmentation by starting from pixels marked as lung cancer or background and growing the region of interest [19]. PET Tumor Segmentation, a semi-automatic method available as a 3D Slicer extension, was used for PET image segmentation; it is faster and more consistent than manual segmentation [20]. The SUVmax, SUVmean, MTV, and TLG of the segmented regions were calculated using PET-IndiC, another 3D Slicer extension.
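The four PET metrics named above follow standard definitions: SUVmax and SUVmean are the maximum and mean SUV inside the tumour mask, MTV is the segmented volume, and TLG is SUVmean × MTV. A minimal sketch of these definitions (our own illustration with a toy volume, not the PET-IndiC implementation):

```python
import numpy as np

def suv_metrics(suv, mask, voxel_volume_ml):
    """Compute SUVmax, SUVmean, MTV (mL) and TLG from a SUV image
    and a binary tumour mask (illustrative helper, not PET-IndiC)."""
    vals = suv[mask > 0]                      # SUVs of tumour voxels only
    suv_max = float(vals.max())
    suv_mean = float(vals.mean())
    mtv = float(vals.size * voxel_volume_ml)  # metabolic tumour volume
    tlg = suv_mean * mtv                      # total lesion glycolysis
    return suv_max, suv_mean, mtv, tlg

# toy 2x2x2 volume with a 4-voxel "tumour" (hypothetical values)
suv = np.array([[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [0.5, 0.5]]])
mask = np.array([[[1, 1], [1, 1]], [[0, 0], [0, 0]]])
print(suv_metrics(suv, mask, voxel_volume_ml=0.008))  # 2x2x2 mm voxels
```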
Radiomics Feature Extraction
The Python package Pyradiomics (version 3.0.1, https://github.com/Radiomics/pyradiomics) was used for feature extraction. Both PET and CT images were resampled to a uniform voxel size of 2.0 × 2.0 × 2.0 mm³ using B-spline interpolation. Discretization was set to a bin width of 0.5 for PET images and 25 for CT images [21]. A total of 3190 features (1595 from each modality) were extracted: the 107 original features were calculated for each PET and CT image, and the remaining 1488 features were calculated with the square, square root, logarithm, exponential, wavelet, and Laplacian of Gaussian filters, with sigma values of 2.0, 3.0, and 4.0 used for the Laplacian of Gaussian filter.
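These extraction settings can be expressed as a Pyradiomics parameter file. The following is an illustrative sketch for the CT settings (binWidth would be 0.5 for PET), based on Pyradiomics' documented YAML schema; it is not the exact file used in the study.

```yaml
# Illustrative Pyradiomics parameter file (CT settings described above).
imageType:
  Original: {}
  Square: {}
  SquareRoot: {}
  Logarithm: {}
  Exponential: {}
  Wavelet: {}
  LoG:
    sigma: [2.0, 3.0, 4.0]   # three sigma values for Laplacian of Gaussian
setting:
  binWidth: 25                        # 0.5 for PET images
  resampledPixelSpacing: [2.0, 2.0, 2.0]
  interpolator: sitkBSpline           # B-spline resampling
```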
Model Development
The workflow of this study consists of two steps (Supplementary Fig. 1). The first step involves analysing CT-only, PET-only, and combined PET/CT features (hereafter denoted as PET/CT) to evaluate the prediction performance and stability of each machine learning model. The patient cohort was divided into a training set (70%) and a test set (30%), and feature standardization and selection were performed on the training set and then applied to the test set. Feature selection was performed with Boruta [22], a feature selection method based on the variable importance of Random Forest (RF). Boruta selects variables by creating random shadow variables and repeatedly comparing variable importance against them. Although several feature selection methods exist, Boruta has been validated in many studies, and Degenhardt et al. concluded that Boruta was more effective than other methods at selecting a small subset containing the best predictor variables in omics data [23]. In the present study, the parameters of Boruta were set to n_estimators = 300, perc = 100, and alpha = 0.05.
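The shadow-variable principle behind Boruta can be sketched as follows. This is a toy illustration only: it uses a simple correlation-based importance in place of Random Forest importance, and omits Boruta's repeated fits and statistical test; the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = X[:, 0] + 0.1 * rng.normal(size=n)   # only feature 0 is informative

# Shadow variables: permuting each column destroys any real association
# with the outcome while preserving the marginal distribution.
shadows = rng.permuted(X, axis=0)

def importance(cols, target):
    # stand-in importance: absolute Pearson correlation with the target
    return np.abs([np.corrcoef(c, target)[0, 1] for c in cols.T])

real_imp = importance(X, y)
shadow_max = importance(shadows, y).max()

# Keep features whose importance beats the best shadow. Boruta repeats
# this comparison over many Random Forest fits and applies a test at alpha.
selected = np.where(real_imp > shadow_max)[0]
print(selected)
```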
The machine learning models used in our study were logistic regression (LR), support vector machine (SVM), K-Nearest Neighbour (KNN), RF, Light Gradient Boosting Machine (LGB) [24], deep neural network (DNN), and TabNet [25]. As deep learning models, not only a DNN but also TabNet, which is specialised for tabular data, was used. We also established an ensemble model (ENS), which averaged the prediction probabilities of all models in the test set. For machine learning models other than TabNet, hyperparameters were optimised using five-fold cross-validation with the area under the curve (AUC) as the evaluation metric. TabNet was optimised using the validation set after pretraining with fixed parameters. Details of the hyperparameter settings are shown in Supplementary Table 1. To evaluate prediction performance and stability, the entire procedure, from the division of patients to the construction of the machine learning models, was treated as one iteration and repeated for 100 iterations with different random seeds. The mean and standard deviation (SD) of the AUC, accuracy, F1 score, precision, and recall were calculated as evaluation metrics.
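The ENS averaging described above can be sketched as follows, together with a rank-based AUC (equivalent to the Mann-Whitney U statistic). The per-model probabilities are hypothetical toy values, not study data; the example also shows how averaging can improve on an individual model's ranking.

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked
    correctly, with ties counting half."""
    pos = scores[y_true == 1][:, None]
    neg = scores[y_true == 0][None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

y = np.array([0, 0, 1, 1])
# hypothetical test-set probabilities from two of the models
p_lr = np.array([0.2, 0.4, 0.6, 0.8])
p_rf = np.array([0.3, 0.7, 0.6, 0.9])
p_ens = np.mean([p_lr, p_rf], axis=0)   # ENS: average of model probabilities

print(auc(y, p_rf), auc(y, p_ens))
```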
For the second step of the analysis, a calibration plot and decision curve analysis (DCA) [26] based on the predicted probabilities were generated with clinical use in mind. A calibration plot visualises the reliability of the predicted probabilities by comparing them with the observed proportions. A DCA uses the theoretical relationship between the relative harms of false positives and false negatives to indicate the range and magnitude of benefit gained by changing the threshold on which treatment selection is based. The Net Benefit of a model is given by:
$$\text{Net Benefit}= \frac{\text{TruePositiveCount}}{n}-\frac{\text{FalsePositiveCount}}{n}\left(\frac{p_{t}}{1-p_{t}}\right)$$
Here, \(p_{t}\) is the threshold probability, which is varied across a range of values; the higher the Net Benefit, the more beneficial the model. To calculate predicted probabilities for all patients, all models except TabNet were analysed by nested five-fold cross-validation with an inner loop of five-fold cross-validation; TabNet was analysed with an inner hold-out loop. In addition, we performed analyses limited to Adc, Sqc, and tumours with a horizontal section diameter ≤ 3 cm or ≤ 2 cm, based on the predicted probabilities calculated in the analysis of all patients, and also compared the models with the CTR. The CTR was measured on thin-slice CT when it had been taken within three months before surgery.
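The Net Benefit formula above translates directly into code. A minimal sketch with a hypothetical four-patient cohort (function and variable names are our own):

```python
import numpy as np

def net_benefit(y_true, prob, p_t):
    """Net Benefit at threshold probability p_t:
    TP/n - FP/n * p_t / (1 - p_t)."""
    n = len(y_true)
    pred = prob >= p_t                 # treat patients at or above threshold
    tp = np.sum(pred & (y_true == 1))  # true positives
    fp = np.sum(pred & (y_true == 0))  # false positives
    return tp / n - fp / n * (p_t / (1 - p_t))

# toy cohort: 2 of 4 patients are highly invasive (hypothetical values)
y = np.array([1, 1, 0, 0])
prob = np.array([0.9, 0.6, 0.4, 0.1])
for p_t in (0.25, 0.5):                # a decision curve sweeps p_t
    print(p_t, net_benefit(y, prob, p_t))
```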
Statistical Analyses
Categorical and continuous variables were compared with Fisher's exact test and the t-test or Mann-Whitney U test, as appropriate. All tests were two-tailed, with P < 0.05 indicating a significant difference. Statistical analyses of the patient background and the DCA were performed using the R software program (version 3.6.3, http://www.R-project.org). Python (version 3.7) with the scikit-learn package (version 1.0.2) and PyTorch (version 1.10.2) was used to build the machine learning models and evaluate their predictive performance.