Patients
Data were collected on 149 patients with lung cancer who were treated at the Affiliated Jinling Hospital, Medical School of Nanjing University between June 2015 and December 2019. The institutional review board of the Affiliated Jinling Hospital, Medical School of Nanjing University approved this retrospective study and waived the requirements for informed consent from the patients. However, consent was obtained from the patients on the publishing of their images and clinical information. The inclusion criteria were as follows: patients of 18 years of age or older; a diagnosis of lung cancer, as confirmed by histopathology according to the eighth edition of the American Joint Commission on Cancer TNM classification and staging system; and a recipient of immunotherapy for the cancer. The exclusion criteria were as follows: patients with missed follow-up (n = 20); patients with a treatment duration of less than 6 months before progressive disease (PD) (n = 25); and patients with partial loss of images (n = 12). In total, 94 patients met the inclusion criteria and were randomly divided at a 7:3 ratio into the training cohort (n = 64) and the validation cohort (n = 28) (Figure 1). For each patient, baseline clinicopathological characteristics were obtained from their medical records; namely, age, sex, smoking status, family history, histological subtype, TNM classification and staging, blood cell counts (platelets, white blood cells, neutrophils, lymphocytes, and monocytes), and levels of thyroid transcription factor 1 (TTF-1), Ki-67, C-reactive protein (CRP), carcinoembryonic antigen, and neuron-specific enolase. The time from the beginning of immunotherapy to the date of disease progression was defined as progression-free survival (PFS). The endpoint of this study was the clinical benefit of immunotherapy, which was defined as either a durable clinical benefit (DCB: complete response, partial response (PR), or stable disease (SD) lasting >6 months) or no durable clinical benefit (NDB: PD or SD that lasted ≤6 months).
Image acquisition
All patients underwent non-enhanced CT imaging of the lungs with one of three multidetector row CT systems (SOMATOM Definition Flash, SOMATOM Emotion, and SOMATOM Perspective, all from Siemens Healthineers AG, Erlangen, Germany). After removing any metallic foreign bodies from the chest area, the patient was placed in the supine position with both hands raised. The CT scan was conducted using the spiral scanning mode and ranged from the thoracic entrance to the underlying layer of the lung, with a single breath-hold scan at the end of inspiration.The following CT acquisition parameters were used: 120 or 130 kVp; 160 mAs; detector collimation: 6 × 1.25 mm, 64 × 0.625 mm, or 64 × 0.6 mm; rotation time: 0.5 or 0.8 s; matrix size: 512 × 512; field of view: 350 × 350 mm). Each patient received a whole-lung scan, and the CT image retrieved from the picture archiving and communication system was reconstructed with a standard kernel. The slice thickness of the CT images ranged between 1 and 1.25 mm.
Image segmentation and radiomic feature extraction
This study followed and adhered to the Image Biomarker Standardization Initiative (IBSI) guidelines [29] and the radiomics prototype software program used (Radiomics, Frontier, Siemens) was IBSI compliant. A volume of interest was drawn semiautomatically around the tumor by a chest radiologist (Y.B., 9 years of experience) in the lung diagnosis using Radiomics and confirmed by another chest radiologist (Z.J.,15 years of experience). Both radiologists were blinded to the patients’ clinical information. First, we imported the CT images into Radiomics. In the segmentation module of Radiomics, a few segmentation tools are available for the semiautomatic delineation of the tumor in three dimensions. The segmentation is semiautomatically produced by drawing a line across the boundary of the tumor. Then, through an automatic algorithm, the tool finds neighboring voxels with the same gray level in three-dimensional (3D) space, generating random walker-based lesion segmentation for solid and subsolid lung lesions [30]. If the segmentation is incorrect, the operators could correct it manually in the 3D domain using the Radiomics prototype. As a result, a total of 111 features (viz., 18 first-order, 75 texture, and 18 size and shape features) were extracted from the CT images using Radiomics. To test the intraclass reproducibility, the data for 25 randomly selected patients were segmented twice by one radiologist (Y.B.) within a 1-month period. To test the interclass reproducibility, the same 25 sets of data were segmented by two radiologists (Y.B. and Z.J.). Spearman correlation analysis was used to assess the differences between the features generated at different times and by different radiologists as well as between the twice-generated features by the same radiologist. Interclass and intraclass correlation coefficients (ICCs) were used to evaluate the intra- and inter-observer agreement of feature extraction, where an ICC value of greater than 0.80 indicated good agreement. As a result, 88 features were retained for further analysis (Figure 2).
Predictive model construction and model testing
A model for predicting the clinical benefits of immunotherapy was constructed on the basis of the CT-derived radiomic features, using the random forest (RF) method. Patients with DCB were labeled as positive and those with NDB as negative. The RF algorithm has a comparably low tendency to overfit and is well suited for datasets with a large number of heterogeneous predictors and cluster-correlated observations; thus, it was adopted for the machine learning-based prediction model. The RF method was used to construct the prediction model because of its high variance-bias trade-off capability. It is a classification algorithm consisting of many decision trees, where each tree represents a weak classifier. A combination of trees can achieve improved model performance. The split quality was measured according to the Gini impurity. Hyperparameter optimization was performed. The RF model was evaluated with an additional independent cohort study, which was randomly selected with potentially unseen test patients. The performance of the model was assessed on the basis of its receiver operating characteristic (ROC) curve, area under the ROC curve (AUC), accuracy, sensitivity, and specificity. Finally, radiomics model 1 was established on the basis of 15 important radiomic features (Figure 3A). The forest consisted of five trees, and the maximum depth of the tree was set as 2. Three features were considered when looking for the best split. Finally, a radiomics score (rad-score) was output for each patient.
Clinicopathological factors plus radiomics model development and radiomics nomogram model construction
The clinicopathological factors were analyzed using univariate logistic regression analysis, in which the predictors with a p value of less than 0.10 were included to find those of significance. Through multivariate logistic regression analysis, the multimodal features,rad-score and significant predictors were then integrated into a single predictive model, whereupon radiomics nomogram model 1 was constructed for the training cohort. Calibration curves were plotted to assess the calibration of radiomics nomogram model 1. Decision curve analysis was conducted to determine the clinical usefulness of the radiomics model and radiomics nomogram model by quantifying the net benefits at different threshold probabilities for the training and validation cohorts.
Clinicopathological factor analysis
The clinicopathological factors were analyzed using univariate Cox proportional hazards (CPH) regression analysis. Those factors with a p value of less than 0.10 and the rad-score were then combined for multivariate CPH regression analysis to identify the independent risk factors, which were subsequently analyzed using the Kaplan-Meier curve and log-rank test. The final model was selected by backward stepwise elimination, with the Akaike information criterion as the stopping rule [31].
Construction of radiomics nomogram model 2
The same rad-score was used to construct radiomics model 2. The multimodal features and parameters, including the rad-score and independent risk factors, were integrated into a single predictive model through multivariate CPH regression analysis, where upon radiomics nomogram model 2 was developed for the training cohort. The prognostic abilities of the generated models were evaluated using the training cohort and validated using the validation cohort. The discrimination performance of the prognostic models was assessed using Harrell’s concordance index (C-index), which ranges between 0.5 (indicating a random distribution of the data) and 1.0 (indicating perfect prediction of the observed survival information by the model). Calibration curves of the nomogram were subsequently drawn for the 2-year PFS of the patients. The calibration curves were used to determine the independent risk factors as well as to indicate both the PFS probabilities predicted by the prognostic models and the observed probabilities.
Statistical analysis
The differences in clinicopathological factors between the training and validation datasets were assessed using the Mann-Whitney U test for continuous variables and the χ2 test for categorized variables. The discrimination power of the models was measured from the AUC and the C-index. The Delong test was used to compare two AUCs, and the log-likelihood ratio was used to assess the increase in predictive power of the C-index. The radiomics nomogram model 2 score was calculated and used to assess risk stratification. Survival curves were generated with the Kaplan-Meier method and compared using the two-sided log-rank test. The optimal cutoff point for continuous prognostic markers in the survival analysis was determined using X-tile software (version 3.6.1; Yale University School of Medicine, New Haven, CT, USA). A calibration curve was generated to demonstrate the goodness of fit, which is a graphical representation of the relationship between observed and predicted survival. The Greenwood-Nam-D’Agostino (GND) method was applied to measure the statistical significance of the goodness-of-fit test results.The prediction error of the models, which was assessed using the “Boot632plus” split method with 100 iterations to calculate estimates of the prediction error curves, was summarized as the integrated Brier score, which represents a valid measure of the overall model performance and can range from 0 (for a perfect model) to 0.25 (for a noninformative model with a 50% incidence of the outcome). The RF model was conducted using Python software (Python Scikit-learn package comprising Python version 3.7 and Scikit-learn version 0.21; http://scikit-learn.org/). The construction of radiomics nomogram model 2, assessment of the model, and decision curve analysis(DCA) were performed using R software (version 3.4.4; http://www.r-project.org). All the codes are available at https://github.com/tomato08217/immune. All statistical tests were two-sided, with a significance level of 0.05.