Study design and participants
The Institutional Review Board of Chang Gung Medical Foundation approved this case-control study. From February 2017 through August 2018, a total of 836 consecutive asymptomatic participants who underwent both CXR and LDCT at Chiayi Chang Gung Memorial Hospital, Taiwan, for lung cancer screening were prospectively enrolled. The inclusion criteria were an age of 40 to 80 years and willingness to participate in follow-up imaging or diagnostic workup. Subjects were excluded if a pulmonary nodule was detected on CXR, or if they had a known medical history of any malignant disease. Serial imaging reports, basic patient information, and demographic data were obtained. Each participant had at least 1 year of follow-up after the LDCT baseline scan. The diagnosis of lung cancer was confirmed based on surgical resection or lung biopsy and was recorded in a hospital-based cancer registry. Patients who had confirmed lung cancer prior to the index date of July 30, 2019 were classified as lung cancer patients (category 1); all other patients were classified as controls (category 0). Figure 1 shows the flowchart of the study.
LDCT image acquisition and interpretation
All LDCT scans were performed with a 64-slice multidetector computed tomography (CT) scanner (Somatom Sensation 64; Siemens Healthcare, Erlangen, Germany) in a low-dose setting without contrast enhancement (volumetric CT dose index ≤2.0 mGy for a standard patient). The scan parameters were 120 kVp, 25 effective mAs, soft-tissue kernel (B30f), and 3 mm slice thickness. All equipment specifications and acquisition parameters followed the recommendations of the ACR Society of Thoracic Radiology Practice Parameters for the Performance and Reporting of Lung Cancer Screening Thoracic CT [12]. Each LDCT baseline scan was reported by one thoracic radiologist with 7 years of experience. The standardized structured reports described the size, shape, location, and texture of the lung nodules, as well as other incidental findings. The density of each lung nodule was reported according to the definition from the Fleischner Society guidelines [13, 14]. The size of each lung nodule was measured on lung windows and recorded as recommended by Lung-RADS.
Development of the ANN
Each baseline LDCT report consists of a description of the intra- and extra-pulmonary findings, and a Lung-RADS risk category. The reports were designed to aid lung cancer screening. Using data scraping techniques, 22 input features were automatically extracted from the descriptive parts of the baseline LDCT reports and used to develop the ANN. Four of the inputs constituted clinical information or LDCT parameters. Another seven inputs pertained to nodule patterns and sizes based on the Lung-RADS standardized lexicon. The remaining inputs were extra-pulmonary interpretations, which consisted of 11 descriptive features. These inputs were in binary form (0 or 1). The Lung-RADS classification was not included among the input features. Table 1 lists all 22 input features, and shows the distribution of the baseline Lung-RADS categories in the derivation and validation cohorts.
Feed-forward neural networks trained with the back-propagation algorithm were constructed using Keras version 2.2.4 [15], a high-level neural network application programming interface that simplifies ANN construction. The inputs for the ANN were normalized such that they fell between 0 and 1. The ANN consisted of two hidden layers, followed by a dropout layer to prevent over-fitting and a dense layer as the output layer [16]. Each of the two hidden layers contained 10 hidden units, and a rectified linear unit was used as the activation function. We also tested networks with different numbers of hidden units in each layer; none of these proved superior to the 10-unit network. Figure 2 shows the structure of the ANN. An adaptive learning rate optimizer based on the adaptive moment estimation method was used to facilitate convergence [17]. The network weights were randomly initialized between -1 and 1. The learning rate was 0.001 and the dropout rate of the dropout layer was set to 0.1. The output layer generated a number between 0 and 1 using the sigmoid activation function. The predictive performance of the models was monitored during training to optimize the hyperparameters.
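For illustration only, the layer structure described above can be sketched as a forward pass in plain Python. This is not the Keras code used in the study. The helper names (init_layer, build_network, forward, min_max_scale), the zero-initialized biases, and the use of inverted dropout are assumptions made for this sketch.

```python
import math
import random

N_INPUTS, N_HIDDEN = 22, 10  # 22 report features, 10 units per hidden layer


def min_max_scale(column):
    """Rescale one feature column so that its values fall between 0 and 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def init_layer(n_in, n_out, rng):
    """Weights drawn uniformly from [-1, 1]; zero biases are an assumption."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def build_network(rng):
    """Two hidden layers of 10 units each and a single-unit output layer."""
    return [
        init_layer(N_INPUTS, N_HIDDEN, rng),
        init_layer(N_HIDDEN, N_HIDDEN, rng),
        init_layer(N_HIDDEN, 1, rng),
    ]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    # Numerically stable form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense(x, layer, activation):
    weights, biases = layer
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers, rng=None, dropout_rate=0.1):
    """Return a score between 0 and 1; pass rng to apply dropout (training mode)."""
    h = dense(x, layers[0], relu)
    h = dense(h, layers[1], relu)
    if rng is not None:
        keep = 1.0 - dropout_rate
        # Inverted dropout: drop units at random and rescale the survivors.
        h = [v / keep if rng.random() < keep else 0.0 for v in h]
    return dense(h, layers[2], sigmoid)[0]
```

Calling forward without rng corresponds to inference, where dropout is inactive and the output is deterministic for a given set of weights.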
The dataset used in this study is unbalanced, and ANNs are sensitive to class imbalance. Due to the iterative nature of the training, ANNs are prone to converge to the majority class. Thus, to achieve a cost-sensitive neural network, we used the class weighting approach, which assigns error weights to samples based on their class [18]. A 2:1 class weight ratio between lung cancer cases (category 1) and controls (category 0) was used in the ANN. We also explored networks with other class weight ratios (5:1, 10:1, 20:1, 25:1, 29:1 and 35:1). However, in terms of sensitivity, specificity and AUC, none of these performed significantly better than the network with a 2:1 class weight ratio.
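As a minimal sketch of how class weighting changes the training objective, the following plain-Python function applies a 2:1 weight to a per-sample loss. The loss function is not stated in this section, so binary cross-entropy is an assumption here, as are the names weighted_bce and CLASS_WEIGHT and the clipping constant eps.

```python
import math

# 2:1 weighting of lung cancer cases (category 1) over controls (category 0).
CLASS_WEIGHT = {0: 1.0, 1: 2.0}


def weighted_bce(y_true, y_prob, class_weight=CLASS_WEIGHT, eps=1e-7):
    """Mean binary cross-entropy with each sample's loss scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += class_weight[y] * loss
    return total / len(y_true)
```

Under this weighting, a missed lung cancer case contributes twice the penalty of an equally confident false alarm on a control, which counteracts the pull toward the majority class.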
Validation and risk group identification
In the training process, the ANN was internally validated via “three-fold cross-validation” [19]. The dataset was divided into three equal parts. At each cycle, one of the three parts was selected as the test set and removed from the dataset, while the remaining cases were used as the training set of the ANN. This process was repeated until the entire dataset had been used once as the test set. Finally, the ANN was validated with the prospective validation cohort.
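The fold rotation described above can be sketched in plain Python as follows. The function name three_fold_splits, the random shuffling before splitting, and the fixed seed are assumptions for illustration; the section does not state whether the folds were shuffled or stratified.

```python
import random


def three_fold_splits(n_samples, n_folds=3, seed=0):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    # Deal the shuffled indices into n_folds parts of (nearly) equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx
```

Each iteration would train a fresh network on train_idx and evaluate it on test_idx, so that every case in the derivation data serves as test data exactly once.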
To investigate the determining factors for predicting lung cancer, we applied a permutation feature importance method proposed by Leo Breiman [20]. The permutation feature importance for each feature used in the ANN was evaluated with the validation cohort, and the performance metric was AUC. At each iteration, one of the features was randomly shuffled, and the permutation feature importance score was calculated to show how much the performance metric decreased. Therefore, a high score revealed a feature with a great contribution to the discriminative ability of the model.
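A minimal plain-Python sketch of this procedure is given below. The function names auc and permutation_importance, the averaging over n_repeats shuffles, and the fixed seed are assumptions; predict stands for any fitted model that maps one feature row (a list) to a score.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a case outscores a control (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in AUC when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = auc(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - auc(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model ignores receives a score of exactly 0, because shuffling it leaves every prediction unchanged.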
Statistical analyses
Statistical analyses were performed using MedCalc 18.9.1 (MedCalc Software, Ostend, Belgium). Observed distributions were tested against the hypothesized normal distribution (Kolmogorov–Smirnov test). Data are reported as the mean ± standard deviation or number (%) unless otherwise indicated. To determine and compare the performance of the Lung-RADS and ANN, the sensitivity and specificity of the lung cancer classification at different thresholds were analysed based on the results of the area under the receiver operating characteristic (ROC) curve analyses. The optimal diagnostic threshold of each ROC curve was determined by maximizing Youden’s index [21]. ROC curves were compared using the method described by DeLong et al. [22]. The sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR+), and negative likelihood ratio (LR–) of each model for lung cancer diagnosis were calculated [23]. In all analyses, P<0.05 was considered to indicate statistical significance.
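The analyses themselves were run in MedCalc; the following plain-Python sketch only illustrates the standard definitions behind the threshold selection and the reported metrics. The function names and the convention that a score at or above the threshold counts as positive are assumptions, and the inputs are assumed to contain at least one case and one control.

```python
def confusion(y_true, scores, threshold):
    """Counts of true/false positives and negatives at a given threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return tp, fp, tn, fn


def youden_threshold(y_true, scores):
    """Threshold maximizing Youden's index J = sensitivity + specificity - 1."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(y_true, scores, t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, NPV, LR+ and LR- from a 2x2 table."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "lr_pos": sens / (1.0 - spec) if spec < 1.0 else float("inf"),
        "lr_neg": (1.0 - sens) / spec if spec > 0.0 else float("inf"),
    }
```

For example, a hypothetical 2x2 table with 8 true positives, 10 false positives, 90 true negatives, and 2 false negatives gives a sensitivity of 0.8, a specificity of 0.9, and an LR+ of about 8.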