Combination of CT Imaging-based Radiomics and Clinicopathological Characteristics for Predicting the Clinical Benet of Immune Checkpoint Inhibitors in Lung Cancer

Purpose In this study, we tested whether a combination of radiomic features extracted from baseline pre-immunotherapy computed tomography (CT) images and clinicopathological characteristics could be used as novel noninvasive biomarkers for predicting the clinical benets of non-small cell lung cancer (NSCLC) patients treated with immune checkpoint inhibitors (ICIs). Data on 92 consecutive patients with lung cancer who had been treated with ICIs were retrospectively analyzed. In total, 88 radiomic features were selected from the pretreatment CT images for the construction of a random forest model. Radiomics model 1 was constructed based on the rad-score.Through multivariate logistic regression analysis, the rad-score and signicant predictors were integrated into a single predictive model (radiomics nomogram model 1) to predict the durable clinical benet from ICIs. Radiomics model 2 was developed based on the same rad-score as radiomics model 1. Through multivariate Cox proportional hazards regression analysis using the rad-score and independent risk factors, radiomics nomogram model 2 was constructed to predict the progression-free survival (PFS).


Introduction
According to the latest epidemiological data, lung cancer has become the most common of the malignant cancer types, with the highest morbidity and mortality rates worldwide [1]. Among the common subtypes of lung cancer, non-small cell lung cancer (NSCLC) accounts for more than 85% of patients with lung cancer [2]. Despite signi cant advances in cancer diagnosis and treatment techniques, the long-term survival rate of patients with lung cancer is only 10%-15%, and once metastasis occurs, the 5-year survival rate is less than 5% [3]. Therefore, enhancing the e cacy of treatments is key to improving the overall prognosis of patients with advanced NSCLC.
Although cytotoxic therapy remains an important component of systemic therapy, many chemotherapy regimens are associated with signi cant toxic effects [4]. Compared with traditional cytotoxic agents, tyrosine kinase inhibitors (TKIs) have led to more favorable outcomes and have thus become the rst-line treatment for patients with actionable driver mutations [5]. However, there are still many patients who do not have genetic mutations and would therefore not bene t from TKIs [6]. For such patients, immune checkpoint inhibitors (ICIs) have become a good option, particularly antibodies targeting the programmed cell death-1 (PD1)-programmed cell death ligand-1 (PD-L1) axis, which can restore the body's antitumor immune response by blocking the immune suppression signals between the tumor and T cells [7]. The ICIs that are approved or in development for the treatment of lung cancer are the anti-PD1 antibodies nivolumab and pembrolizumab and the anti-PD-L1 antibodies atezolizumab, avelumab, and durvalumab [8]. Intriguingly, the uses of ICIs have shifted from the advanced stage to earlier stages of lung cancer in an attempt to reach the ultimate goal of curing the disease completely [9].
However, despite the substantial progress made in immunotherapy research and development, the majority of patients still do not derive bene ts from ICIs [10]. Currently, PD-L1 and microsatellite instability/defective mismatch repairs are approved for clinical use as predictive biomarkers of a tumor's response to ICIs [11]. Additionally, multicohort studies such as CheckMate 026 have shown that although a high tumor mutation burden is positively correlated with the e cacy of ICIs [12], this parameter is not a perfect biomarker for the selection of patients for ICI therapy [13]. Hence, there is an important and urgent need to identify and develop predictive biomarkers for immunotherapy. Moreover, an increasing number of studies have pointed out that combining different biomarkers to reduce the assumptive risk associated with each one can improve the performance of the ultimate predictive biomarker for ICIs [14].
Medical imaging is a noninvasive inspection method with prognostic value. The quantitative imaging features extracted from conventional medical images can be used to predict the clinical prognosis or monitor the treatment response of different types of cancer [15][16][17]. Radiomics is a quantitative research method that involves the conversion of image data into high-throughput image feature data that can then be mined and used to describe the intensity, shape, and texture of tumors and quantify the heterogeneity of time and space of tumor tissue. Essentially revealing disease features that cannot be recognized by the naked eye. The technique effectively transforms images into a high-dimensional recognizable feature space and uses statistical and/or machine learning methods to select the most valuable radiomic features for analyzing clinical information. The end goal is to construct a model with diagnostic, prognostic, or predictive value that will provide valuable information for enabling the accurate individualized diagnosis and treatment of diseases [18][19][20].
The nomogram is a graphical diagram that is easy to interpret and contains different kinds of predictors. In recent years, it has become an important tool for cancer research experts [21,22]. In several studies, computed tomography (CT) or positron emission tomography (PET/CT)-based radiomics has been used to improve the prediction of the performance of various cancers, and it has been found that the combination of radiomic features with the traditional tumor staging system and other clinicopathological factors may improve the accuracy of tumor prognosis prediction [23][24][25]. Overall, however, the prediction performances of current prognostic prediction models are generally not very good [26][27][28].
Therefore, in this study, radiomics nomogram models for predicting the tumor response and prognosis of patients with NSCLC following ICI treatment were constructed on the basis of CT radiomic features and clinicopathological characteristics. The ultimate aim is to develop prediction models that can guide individualized immunotherapy in patients with lung cancer.

Methods
Patients Data were collected on 149 patients with lung cancer who were treated at the A liated Jinling Hospital, Medical School of Nanjing University between June 2015 and December 2019. The institutional review board of the A liated Jinling Hospital, Medical School of Nanjing University approved this retrospective study and waived the requirements for informed consent from the patients. However, consent was obtained from the patients on the publishing of their images and clinical information. The inclusion criteria were as follows: patients of 18 years of age or older; a diagnosis of lung cancer, as con rmed by histopathology according to the eighth edition of the American Joint Commission on Cancer TNM classi cation and staging system; and a recipient of immunotherapy for the cancer. The exclusion criteria were as follows: patients with missed follow-up (n = 20); patients with a treatment duration of less than 6 months before progressive disease (PD) (n = 25); and patients with partial loss of images (n = 12). In total, 94 patients met the inclusion criteria and were randomly divided at a 7:3 ratio into the training cohort (n = 64) and the validation cohort (n = 28) ( Figure 1). For each patient, baseline clinicopathological characteristics were obtained from their medical records; namely, age, sex, smoking status, family history, histological subtype, TNM classi cation and staging, blood cell counts (platelets, white blood cells, neutrophils, lymphocytes, and monocytes), and levels of thyroid transcription factor 1 (TTF-1), Ki-67, Creactive protein (CRP), carcinoembryonic antigen, and neuron-speci c enolase. The time from the beginning of immunotherapy to the date of disease progression was de ned as progression-free survival (PFS). The endpoint of this study was the clinical bene t of immunotherapy, which was de ned as either a durable clinical bene t (DCB: complete response, partial response (PR), or stable disease (SD) lasting >6 months) or no durable clinical bene t (NDB: PD or SD that lasted ≤6 months).

Image acquisition
All patients underwent non-enhanced CT imaging of the lungs with one of three multidetector row CT systems (SOMATOM De nition Flash, SOMATOM Emotion, and SOMATOM Perspective, all from Siemens Healthineers AG, Erlangen, Germany). After removing any metallic foreign bodies from the chest area, the patient was placed in the supine position with both hands raised. The CT scan was conducted using the spiral scanning mode and ranged from the thoracic entrance to the underlying layer of the lung, with a single breath-hold scan at the end of inspiration.The following CT acquisition parameters were used: 120 or 130 kVp; 160 mAs; detector collimation: 6 × 1.25 mm, 64 × 0.625 mm, or 64 × 0.6 mm; rotation time: 0.5 or 0.8 s; matrix size: 512 × 512; eld of view: 350 × 350 mm). Each patient received a whole-lung scan, and the CT image retrieved from the picture archiving and communication system was reconstructed with a standard kernel. The slice thickness of the CT images ranged between 1 and 1.25 mm.
Image segmentation and radiomic feature extraction This study followed and adhered to the Image Biomarker Standardization Initiative (IBSI) guidelines [29] and the radiomics prototype software program used (Radiomics, Frontier, Siemens) was IBSI compliant. A volume of interest was drawn semiautomatically around the tumor by a chest radiologist (Y.B., 9 years of experience) in the lung diagnosis using Radiomics and con rmed by another chest radiologist (Z.J.,15 years of experience). Both radiologists were blinded to the patients' clinical information. First, we imported the CT images into Radiomics. In the segmentation module of Radiomics, a few segmentation tools are available for the semiautomatic delineation of the tumor in three dimensions. The segmentation is semiautomatically produced by drawing a line across the boundary of the tumor. Then, through an automatic algorithm, the tool nds neighboring voxels with the same gray level in three-dimensional (3D) space, generating random walker-based lesion segmentation for solid and subsolid lung lesions [30]. If the segmentation is incorrect, the operators could correct it manually in the 3D domain using the Radiomics prototype. As a result, a total of 111 features (viz., 18 rst-order, 75 texture, and 18 size and shape features) were extracted from the CT images using Radiomics. To test the intraclass reproducibility, the data for 25 randomly selected patients were segmented twice by one radiologist (Y.B.) within a 1-month period. To test the interclass reproducibility, the same 25 sets of data were segmented by two radiologists (Y.B. and Z.J.). Spearman correlation analysis was used to assess the differences between the features generated at different times and by different radiologists as well as between the twice-generated features by the same radiologist. Interclass and intraclass correlation coe cients (ICCs) were used to evaluate the intra-and inter-observer agreement of feature extraction, where an ICC value of greater than 0.80 indicated good agreement. As a result, 88 features were retained for further analysis ( Figure 2).

Predictive model construction and model testing
A model for predicting the clinical bene ts of immunotherapy was constructed on the basis of the CTderived radiomic features, using the random forest (RF) method. Patients with DCB were labeled as positive and those with NDB as negative. The RF algorithm has a comparably low tendency to over t and is well suited for datasets with a large number of heterogeneous predictors and cluster-correlated observations; thus, it was adopted for the machine learning-based prediction model. The RF method was used to construct the prediction model because of its high variance-bias trade-off capability. It is a classi cation algorithm consisting of many decision trees, where each tree represents a weak classi er. A combination of trees can achieve improved model performance. The split quality was measured according to the Gini impurity. Hyperparameter optimization was performed. The RF model was evaluated with an additional independent cohort study, which was randomly selected with potentially unseen test patients. The performance of the model was assessed on the basis of its receiver operating characteristic (ROC) curve, area under the ROC curve (AUC), accuracy, sensitivity, and speci city. Finally, radiomics model 1 was established on the basis of 15 important radiomic features ( Figure 3A). The forest consisted of ve trees, and the maximum depth of the tree was set as 2. Three features were considered when looking for the best split. Finally, a radiomics score (rad-score) was output for each patient.
Clinicopathological factors plus radiomics model development and radiomics nomogram model construction The clinicopathological factors were analyzed using univariate logistic regression analysis, in which the predictors with a p value of less than 0.10 were included to nd those of signi cance. Through multivariate logistic regression analysis, the multimodal features,rad-score and signi cant predictors were then integrated into a single predictive model, whereupon radiomics nomogram model 1 was constructed for the training cohort. Calibration curves were plotted to assess the calibration of radiomics nomogram model 1. Decision curve analysis was conducted to determine the clinical usefulness of the radiomics model and radiomics nomogram model by quantifying the net bene ts at different threshold probabilities for the training and validation cohorts.

Clinicopathological factor analysis
The clinicopathological factors were analyzed using univariate Cox proportional hazards (CPH) regression analysis. Those factors with a p value of less than 0.10 and the rad-score were then combined for multivariate CPH regression analysis to identify the independent risk factors, which were subsequently analyzed using the Kaplan-Meier curve and log-rank test. The nal model was selected by backward stepwise elimination, with the Akaike information criterion as the stopping rule [31].
Construction of radiomics nomogram model 2 The same rad-score was used to construct radiomics model 2. The multimodal features and parameters, including the rad-score and independent risk factors, were integrated into a single predictive model through multivariate CPH regression analysis, where upon radiomics nomogram model 2 was developed for the training cohort. The prognostic abilities of the generated models were evaluated using the training cohort and validated using the validation cohort. The discrimination performance of the prognostic models was assessed using Harrell's concordance index (C-index), which ranges between 0.5 (indicating a random distribution of the data) and 1.0 (indicating perfect prediction of the observed survival information by the model). Calibration curves of the nomogram were subsequently drawn for the 2-year PFS of the patients. The calibration curves were used to determine the independent risk factors as well as to indicate both the PFS probabilities predicted by the prognostic models and the observed probabilities.

Statistical analysis
The differences in clinicopathological factors between the training and validation datasets were assessed using the Mann-Whitney U test for continuous variables and the χ 2 test for categorized variables. The discrimination power of the models was measured from the AUC and the C-index. The Delong test was used to compare two AUCs, and the log-likelihood ratio was used to assess the increase in predictive power of the C-index. The radiomics nomogram model 2 score was calculated and used to assess risk strati cation. Survival curves were generated with the Kaplan-Meier method and compared using the two-sided log-rank test. The optimal cutoff point for continuous prognostic markers in the survival analysis was determined using X-tile software (version 3.6.1; Yale University School of Medicine, New Haven, CT, USA). A calibration curve was generated to demonstrate the goodness of t, which is a graphical representation of the relationship between observed and predicted survival. The Greenwood-Nam-D'Agostino (GND) method was applied to measure the statistical signi cance of the goodness-of-t test results.The prediction error of the models, which was assessed using the "Boot632plus" split method with 100 iterations to calculate estimates of the prediction error curves, was summarized as the integrated Brier score, which represents a valid measure of the overall model performance and can range from 0 (for a perfect model) to 0.25 (for a noninformative model with a 50% incidence of the outcome). The RF model was conducted using Python software (Python Scikit-learn package comprising Python version 3.7 and Scikit-learn version 0.21; http://scikit-learn.org/). The construction of radiomics nomogram model 2, assessment of the model, and decision curve analysis(DCA) were performed using R software (version 3.4.4; http://www.r-project.org). All the codes are available at https://github.com/tomato08217/immune. All statistical tests were two-sided, with a signi cance level of 0.05.

Clinicopathological characteristics of the patients
The baseline clinical characteristics of the patients in the training and validation cohorts are listed in Table 1. There were no signi cant differences in sex, age, smoking status, family history, histology type, cancer stage, TTF-1, and Ki-67 (p = 0.070-1.000) between the two cohorts.
Prognostic model performance and development of radiomics nomogram model 1 In total, 88 CT-derived radiomic features were used to build radiomics model 1. Because the logistic regression analysis had identi ed the rad-score, age, stage n1, stage n2, stage n3, and stage M as being independent predictors (Table 2), they were used to build the model, which was designated as radiomics nomogram model 1. After 10-fold cross-validation, radiomics model 1 was found to have an AUC value of 0.848, a sensitivity value of 0.625, a speci city value of 0.906, and an accuracy value of 0.766 for the training cohort, whereas it had an AUC value of 0.795, a sensitivity value of 1.000, a speci city value of 0.704, and an accuracy value of 0.714 for the validation cohort ( Figure 3B). For radiomics nomogram model 1, the AUC value was 0.902, the sensitivity value was 0.857, the speci city value was 0.884, and the accuracy value was 0.875 for the training cohort, whereas the AUC was 0.877, the sensitivity value was 0.800, the speci city value was 0.944, and the accuracy value was 0.893 for the validation cohort (Figure 3Cand Table 3). The independent predictors were presented as radiomics nomogram model 1 ( Figure 4A). The calibration curve for the probability of DCB and NDB in the training and validation cohorts demonstrated good agreement between the predicted and actual values ( Figure 4B). The Brier scores also showed good performance of the model, with values of 0.178 for the train dataset and 0.187 for the test dataset. Decision curve analysis was performed to determine the clinical usefulness of radiomics nomogram model 1 by quantifying the net bene ts at different threshold probabilities. The results showed that radiomics nomogram model 1 had a higher overall net bene t than radiomics model 1 across the majority of the range of reasonable threshold probabilities ( Figure 4C).

Clinicopathological factor analysis
The clinicopathological factors were analyzed using univariate CPH regression analysis to test the hazard ratio of each factor and to determine its signi cance in the probability of PFS. The factors with a p value of less than 0.10 were then combined for multivariate CPH regression analysis to identify independent risk factors, whereupon the rad-score, CRP level, and M stage were found to be the independent risk factors for the patients (Table 4). These independent risk factors were then analyzed using the Kaplan-Meier curve and log-rank test, with the latter indicating signi cant discrimination between the two groups. Figure 5A-C shows the PFS probability of the patients in the high-risk or low-risk cohorts.
Prognostic prediction model performance and development of radiomics nomogram model 2 For radiomics model 2, the C-indexes were 0.717 and 0.760 for the training and validation cohorts, respectively, whereas for radiomics nomogram model 2, the values were 0.749 and 0.791, respectively ( Table 5). The independent risk factors were presented as the nomogram ( Figure 6A). The calibration curve showed that the predicted probability was signi cantly close to the actual PFS of patients, with p = 0.21 in the GND goodness-of-t test ( Figure 6B). The Brier scores also showed good performance of the model, with values of 0.187 for the train dataset and 0.209 for the test dataset. The decision curve analysis of the clinical usefulness of radiomics nomogram model 2 showed that the model had a higher overall net bene t than radiomics model 2 across the majority of the range of reasonable threshold probabilities ( Figure 6C).

Discussion
In this study, an RF model based on the radiomic features of CT images and a radiomics model (radiomics model 1) based on the rad-score were established, following which multivariate logistic regression analysis of the rad-score combined with clinicopathological factors was carried out to construct a radiomics nomogram model (radiomics nomogram model 1) for distinguishing patients who received DCB from those who received NDB after ICI treatment. Then, a second radiomics model (radiomics model 2) based on the rad-score was established, following which multivariate CPH regression analysis of the rad-score combined with independent risk factors was conducted to establish a second radiomics nomogram model (radiomics nomogram model 2) for predicting PFS after immunotherapy. The ultimate goals are to use these models to identify patients who can bene t from immunotherapy and to provide both guidance for individualized immunotherapy and a reference for the advancement of precision medicine.
Multivariate logistic regression analysis of the rad-score and signi cant clinicopathological factors combined showed that the rad-score, age, stage n1, stage n2, stage n3, and stage M were independent predictors, among which all (except age and stage n1) can bene t from immunotherapy. The prediction models built had good prediction performances. Radiomics model 1 had an AUC value of 0.848, a sensitivity value of 0.625, a speci city value of 0.906, and an accuracy value of 0.766 for the training cohort, whereas it had an AUC value of 0.795, a sensitivity value of 1.000, a speci city value of 0.704, and an accuracy value of 0.714 for the validation cohort. The independent predictors (rad-score, age, stage n1, stage n2, stage n3, and stage M) were used to build radiomics nomogram model 1, the performance of which was good, with an AUC value of 0.902, a sensitivity value of 0.857, a speci city value of 0.884, and an accuracy value of 0.875 in the training cohort, and an AUC value of 0.877, a sensitivity value of 0.800, a speci city value of 0.944, and an accuracy value of 0.893 in the validation cohort. Mu et al. [32] conducted radiomics of pretreatment 18 F-FDG PET/CT images to predict the clinical bene t of checkpoint blockade immunotherapy for patients with advanced NSCLC. Their results showed that the multiparametric radiomics signature could predict the patients who received DCB, with AUC values of 0.86 (95%CI 0.79-0.94), 0.83 (95%CI 0.71-0.94), and 0.81 (95%CI 0.68-0.92) for the training, retrospective test, and prospective test cohorts, respectively. Trebeschi et al. [33] performed an arti cial intelligencebased characterization of each lesion on pretreatment contrast-enhanced CT imaging data to develop and validate a noninvasive machine learning biomarker for distinguishing between anti-PD1 immunotherapy responding and non-responding patients with advanced melanoma and NSCLC. Their results showed that the biomarker achieved signi cant performance for NSCLC lesions (AUC = 0.83, p < 0.001) and borderline signi cance for melanoma lymph nodes (AUC = 0.64, p = 0.05). After combining these lesion-wide predictions at the patient level, the immunotherapy response could be predicted with an AUC value of up to 0.76 for both cancer types (p < 0.001). Thus, the performance of our prediction model is more encouraging than that of previous studies, with the possible reasons being as follows. First, it may be attributed to our implementation of a cross-validation approach and uses of a performancedriven feature selection strategy and the RF algorithm for model training (which is robust to over tting) to build a reliable model with higher performance. These strategies have been proposed in previous studies to achieve better model performance [34,35]. Additionally, independent predictors were selected to construct the models, and combining these clinical predictors may have improved the model performance.
Moreover, the same rad-score was used to construct radiomics model 2, the C-indexes of which were 0.717 and 0.760 for the training and validation cohorts, respectively. The rad-score was combined with signi cant factors (p <0.10) for multivariate CPH regression analysis to identify independent risk factors, whereupon the rad-score, CRP level, and M stage were identi ed as such for patients. It was demonstrated that patients with a CRP level greater than or equal to 1.22 and stage M1 cancer were more likely to progress positively after receiving immunotherapy. Radiomics nomogram model 2 was established using the same three independent risk factors and the C-indexes were 0.749 and 0.791 for the training and validation cohorts, respectively. Our model has better predictive performance than that of previous studies [32].The independent predictors and independent risk factors were presented as nomograms, where the calibration curves for the probability of response prediction or survival analysis demonstrated good agreement between prediction and observation. Moreover, the decision curve analysis showed that the radiomics nomogram models had a higher overall net bene t than the radiomics models across the majority of reasonable threshold probabilities. This demonstrates that our models not only have better predictive performance than that of other studies but are also more robust and stable.
To our best knowledge, this is the rst study to have used the noninvasive radiomics approach based on CT images and clinicopathological characters to predict the e cacy of ICIs in Chinese patients with lung cancer. Our study has some limitations; namely, the relatively small sample size due to the relatively late date of approval of ICIs (in 2018) by the China Food and Drug Administration; the use of only a singlecenter cohort; the retrospective nature of the data; and the lack of external validation, which may have introduced selection bias. However, we plan to rapidly expand the sample size and recruit multicenter cohorts for validation of the models in the near future. Additionally, we did not conduct research on overall survival but will add this in future studies.
In conclusion, by combining CT image-based radiomics and clinicopathological factors, radiomics nomogram models were constructed for identifying patients with NSCLC who would most likely bene t from immunotherapy and for assessing their PFS post therapy. The good performances of the models suggest that they could be used to provide more precise guidance for the accurate diagnosis and treatment of NSCLC in clinical practice.