Machine Learning to Improve Treatment Selection for NSCLC Patients Treated with Immunotherapy Using Real World and Translational Data


Background: In advanced Non-Small Cell Lung Cancer (NSCLC), Programmed Death Ligand 1 (PD-L1) remains the only biomarker used to select candidates for immunotherapy (IO), and it has many limitations. Given the complex dynamics of the immune system, it is improbable that a single biomarker can predict outcome with high accuracy. A promising way to cope with this complexity is provided by Artificial Intelligence (AI) and Machine Learning (ML), techniques able to analyse and interpret large multifactorial datasets. The present study aims to use AI tools to improve response and efficacy prediction in NSCLC patients treated with IO.
Methods: Real-world data (clinical data, PD-L1, histology, molecular, lab tests) and the blood microRNA signature classifier (MSC), which includes 24 different microRNAs, were used. Patients were divided into responders (R), who obtained a complete or partial response or stable disease as best response, and non-responders (NR), who experienced progressive or hyperprogressive disease or died before the first radiologic evaluation. Moreover, we used the same data to determine whether the overall survival of a patient was likely to be shorter or longer than 24 months from baseline IO. A literature review and a forward feature selection technique were used to extract a specific subset of the patients' data. To develop the final predictive model, different ML methods were tested, i.e., Feedforward Neural Network (FFNN), Logistic Regression (LR), K-nearest neighbors (K-NN), Support Vector Machines (SVM), and Random Forest (RF).
Results: 200 patients were included. 164 out of 200 (i.e., only those patients with PD-L1 data available) were considered in the model; 73 (44.5%) were R and 91 (55.5%) NR.
Overall, the best model was the LR and included 5 features: 2 clinical features (ECOG performance status and IO line of therapy); 1 tissue feature (PD-L1 tumour expression); and 2 blood features (the MSC test and the neutrophil-to-lymphocyte ratio, NLR). The model predicting R/NR achieved accuracy ACC=0.756, F1 score F1=0.722, and Area Under the ROC Curve AUC=0.82. PD-L1 alone had ACC=0.655. The accuracy of the ML models excluding some of the features was as follows: without PD-L1 (ACC=0.726), without MSC (ACC=0.750), and without both PD-L1 and MSC (ACC=0.707), i.e., considering only clinical features. At data cut-off (Nov 2020), median Overall Survival (mOS) was 38.5 months (m) (95% CI 23.9 - 53.1) for R vs 3.8 m (95% CI 2.8 - 4.7) for NR, with p<0.001. LR was the best-performing model in predicting patients with long survival (24-month OS), achieving ACC=0.839, F1=0.908, and AUC=0.87.
Conclusions: The results suggest that the integration of multifactorial data provided by ML techniques is a useful tool to improve personalized selection of NSCLC patients who are candidates for IO. Compared with PD-L1 alone, the expected improvement in accuracy was around 10%. In particular, the model shows that the higher the ECOG, NLR value, IO line, and MSC test level, the lower the response, and the higher the PD-L1, the higher the response. Considering the difference in survival between the R and NR groups, these results suggest that the model can also be used to indirectly predict survival. Moreover, a second model was able to predict long-survival patients with good accuracy.


Introduction
Lung cancer is the leading cause of cancer-related death worldwide, with around 470,000 new cases and 390,000 deaths in Europe. Non-Small Cell Lung Cancer (NSCLC) is the most common histology, accounting for around 85% of cases (1). Until 2015 the median overall survival (OS) of patients with metastatic NSCLC was around 12 months (2). The advent of Immunotherapy (IO) has radically changed the treatment paradigm of many cancers, including NSCLC, prolonging the median survival of metastatic patients from 12 to around 24 months (2). Some patients who responded better to IO reached survival of 5 years or more (3). However, only 30-50% of patients will benefit from IO in the long term (4)(5)(6).
Indeed, it is implausible that a single biomarker can predict response or prognosis with high accuracy, since the immune system displays complex dynamics when interacting with the tumour microenvironment (TME). To handle the density of the available data, Artificial Intelligence (AI) frameworks and, more specifically, Machine Learning (ML) techniques provide efficient, pioneering, and theoretically sound approaches to construct decision-making tools providing individualized prediction (16).
Among molecular biomarkers, the plasma microRNA signature classifier (MSC), reflecting an immunosuppressive host status, was considered here (10). It was previously trained in lung cancer screening cohorts to evaluate the individual risk of developing the aggressive form of the disease (17;18).
More recently, the prognostic value of the MSC was also validated in advanced NSCLC patients treated with single-agent IO (19), and its combination with different clinical scores confirmed its independence from other prognostic features in this setting (20).
This study aims to integrate real-world data and the MSC test to develop a machine learning algorithm to predict response and efficacy of IO in NSCLC patients. The study also investigates the role of the MSC test and its added value to the predictive capability of the algorithm, since this test is costly and not yet included in standard clinical practice as a predictive/prognostic biomarker.

Study population
From July 2015 to Nov 2020, we conducted a prospective observational study (Apollo, INT 22_15) enrolling 200 consecutive advanced NSCLC (aNSCLC) patients receiving single-agent anti-PD-(L)1 inhibitors as first-line (n=70) or further-line (n=130) therapy. Complete real-world data and whole blood samples were collected as per clinical practice. The MSC test was prospectively assessed in plasma samples collected at baseline IO.
Inclusion criteria were as follows: cytological/histological diagnosis of advanced NSCLC (relapsed or stage IIIB to IV) and at least one infusion of single-agent IO received in first or further line. Patients without baseline IO MSC test information were excluded from the study.
This prospective study was conducted at Fondazione IRCCS Istituto Nazionale Tumori of Milan, Italy, in collaboration with Politecnico di Milano for the data analytics. The study was approved by the ethical committee of Fondazione IRCCS Istituto Nazionale Tumori of Milan, and all included patients signed informed consent prior to plasma and data collection, which was carried out in accordance with the Declaration of Helsinki, Good Clinical Practice and local ethical guidelines.

Real World Data Collection: clinical, blood, and tissue data
For this study, demographic, medical history, tumour stage, PD-L1, molecular and radiological data, concomitant medications, treatment responses and survival follow-up were collected and integrated to develop a new predictive model of response and efficacy of IO in NSCLC.

Omic Collection: MSC blood test
Whole blood was collected in 10 ml K2EDTA Vacutainer tubes and the plasma separated by two centrifugation steps. Total RNA was extracted from 200 µl plasma samples. MicroRNA expression was determined by quantitative reverse transcription PCR (RT-qPCR) as previously described (19,21).
The MSC algorithm, using 24 miRNAs, defined four different classes: low (L), intermediate (I) and high (H) risk (18), and highly haemolysed (E). The fourth category, E, comprising plasma samples that could not be analysed because of the unspecific release of miRNAs in the presence of blood cell lysis, was also included (10) (Figure 1). Patients in this category were previously observed to have a prognosis intermediate between that of patients with H and I risk (20).

Treatment administration
IO was administered intravenously (IV) as monotherapy. Nivolumab was administered initially at a dose of 3 mg/kg and later, since May 2018 in Italy, at a fixed dose of 240 mg every 2 weeks (w). Pembrolizumab was administered at a fixed dose of 200 mg as first line and at a dose of 2 mg/kg every 3 w in the second- or third-line setting, atezolizumab at a fixed dose of 1200 mg every 3 w, and durvalumab at a dose of 10 mg/kg every 2 w. Therapy was continued until progressive disease (PD), intolerable toxicity, withdrawal or death. Treatment beyond PD was allowed if there was a clinical benefit according to the clinician's decision.

Radiological response evaluation
Radiological evaluation included a baseline Total Body Computed Tomography (TB-CT) scan, subsequently repeated every 3-4 cycles or every 9-12 weeks as per standard of care, or whenever progression was clinically suspected. Six categories of radiological response were considered in this study to assess tumour response. Four of them (standard categories) are included in the Response Evaluation Criteria in Solid Tumours (RECIST 1.1): Complete Response (CR), Partial Response (PR), Stable Disease (SD), and Progressive Disease (PD). Two additional categories were included. The first was Hyperprogressive Disease (HPD), an atypical pattern of response to single-agent IO (an acceleration of progression compared to the natural history of the disease) as defined by Ferrara et al. (22) and Lo Russo et al. (23). The second, not evaluable (NE), comprised those patients who died due to PD before the first radiological evaluation.

Statistical analysis
Of the 200 patients included in the present study, the 164 with available PD-L1 expression were used as the dataset for the ML algorithms, since PD-L1 is the only biomarker used in clinical practice. Conversely, all 200 patients were included in the survival analysis. The first endpoint of the study was the prediction of responder (R) and non-responder (NR) patients. The R group included patients who obtained a CR, PR or SD as per RECIST 1.1, while the NR group included those patients who obtained a PD per RECIST 1.1, an HPD, or an NE response (as described above). Other endpoints were 24-month Overall Survival (OS), median progression-free survival (mPFS) and median OS (mOS). mOS was measured from the date of IO start until death or last follow-up. mPFS was calculated from the date of IO start until PD or death due to any cause, or last follow-up visit for patients alive without PD.
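As an illustration of the endpoint definition above, the grouping of the six response categories into R and NR can be written as a small lookup. This is a minimal Python sketch, not the study's Matlab code: the function name `response_label` and the string codes are hypothetical, and the 1/0 encoding is an assumption chosen to match the y_i convention used in the ML section.

```python
# Response categories as named in the text. The grouping follows the
# study's definition of responders (R) and non-responders (NR).
RESPONDER = {"CR", "PR", "SD"}
NON_RESPONDER = {"PD", "HPD", "NE"}


def response_label(best_response: str) -> int:
    """Return 1 for R and 0 for NR (hypothetical helper, assumed encoding)."""
    code = best_response.strip().upper()
    if code in RESPONDER:
        return 1
    if code in NON_RESPONDER:
        return 0
    raise ValueError(f"unknown response category: {best_response!r}")
```

Raising an error on unknown codes, rather than defaulting to NR, keeps data-entry mistakes from silently entering the training labels.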
The Kaplan-Meier method was used to calculate mPFS and mOS with their 95% confidence intervals and to generate survival curves. Cox proportional hazards models were used to calculate the Hazard Ratio (HR) between the R and NR groups for OS and PFS.
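To make the survival estimate concrete, the following is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. It is illustrative only: the text does not state which software computed the survival curves, the function names are hypothetical, and confidence intervals and the Cox model are omitted.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up durations (e.g. months from IO start)
    events : 1 if the event (death or progression) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        # Both events and censored patients leave the risk set after t.
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which S(t) <= 0.5, or None if the median is not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

Censored patients (alive at last follow-up) reduce the risk set without lowering the curve, which is why they cannot simply be dropped or treated as deaths.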

MACHINE LEARNING METHODS
After data collection, descriptive analysis and data processing were performed. A first step consisted of selecting a set of 21 features determined to be the most relevant based on the published literature on NSCLC patients treated with IO and on clinician experience. Finally, when a pair of features showed a linear correlation higher than 0.8, we removed one of them, as is customary in ML studies. The result is the set of M=15 most relevant features, provided in Table 1. The problem of predicting R and NR was modelled as a binary classification problem, where we want to learn an approximation f̂(x_i) of the real relationship y_i=f(x_i) between the i-th patient's feature vector x_i and the response y_i∈{0,1}, where a patient has y_i=0 for NR and y_i=1 for R. The same modelling was applied to the problem of estimating survival at 24 months, i.e., a patient has y_i=0 if the patient does not survive at least 24 months, and y_i=1 if she/he does. Data corresponding to the 40 patients alive with less than 24 months of follow-up were excluded from this second analysis.
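The correlation-based pruning step described above can be sketched as follows. This is a hedged Python illustration rather than the authors' Matlab procedure. It assumes that the absolute Pearson correlation is compared with the 0.8 threshold, that the first feature of each correlated pair is the one kept (the text does not say which one is removed), and that no column is constant. The names `pearson` and `drop_correlated` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.8):
    """Greedy filter over a dict {feature name: values}.

    A feature is kept only if its absolute correlation with every
    feature already kept does not exceed the threshold (assumption:
    the earlier feature of a correlated pair survives).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Because the filter is greedy, the result depends on the order of the columns. In practice the order would reflect clinical priority, so that the more interpretable feature of a pair is retained.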
We selected a set of appropriate techniques from the ML literature to perform the above-mentioned classification task. More specifically, we tested Feedforward Neural Network (FFNN), Logistic Regression (LR), K-nearest neighbors (K-NN), Support Vector Machines (SVM), and Random Forest (RF). We applied a feature selection approach to select the subset of the original M features appropriate for each method. More specifically, we used forward feature selection with the Akaike Information Criterion (AIC) as the metric to select the most appropriate set of features for each method, as well as the best method. The 5-fold cross-validation ACC and F1 scores for the analysed methods, as well as the leave-one-out AUC, were computed together with the corresponding 95% confidence intervals (reported in brackets), which were obtained using the bootstrap approach. The procedure was implemented in Matlab, and the code performing all the ML procedures is available at https://trovo.faculty.polimi.it/downloads.html.
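The forward selection loop described above can be sketched in a model-agnostic way. The Python below is an illustration under stated assumptions, not the published Matlab code. `fit_loglik` is a hypothetical callback that fits the chosen model (for instance a logistic regression) on a given feature subset and returns its maximised log-likelihood. The parameter count is assumed to be the number of features plus one intercept, and the search is assumed to stop as soon as no single added feature lowers the AIC.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def forward_select(candidates, fit_loglik):
    """Greedy forward feature selection driven by AIC.

    candidates : list of feature names
    fit_loglik : hypothetical callback, subset -> maximised log-likelihood
    Returns (selected features in order of inclusion, final AIC).
    """
    selected = []
    best_aic = aic(fit_loglik(selected), 1)  # intercept-only model
    while True:
        trials = []
        for feature in candidates:
            if feature in selected:
                continue
            subset = selected + [feature]
            trials.append((aic(fit_loglik(subset), len(subset) + 1), feature))
        if not trials:
            break  # every candidate is already in the model
        trial_aic, trial_feature = min(trials)
        if trial_aic >= best_aic:
            break  # no remaining feature improves the criterion
        selected.append(trial_feature)
        best_aic = trial_aic
    return selected, best_aic
```

The penalty of 2 per parameter is what stops the loop: a feature is added only if it raises the log-likelihood by more than 1, which keeps small datasets from accumulating weakly informative features.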
A total of 164 patients were included in this analysis and divided into two major groups: 73 belonged to the R group (CR, PR, or SD), and 91 to the NR group (PD, HPD, or NE).

Predicting Responder and Non-Responder patients
For each model, the confusion matrix is presented in Figure 3 to show its performance in terms of true/false positives/negatives.
Logistic Regression, the best model, achieves ACC=0.756, F1=0.722, and AUC=0.83. PD-L1 alone has ACC=0.655 (its performance is illustrated by the red circle in Figure 4). We also evaluated the accuracy of the LR models excluding PD-L1, MSC, and both PD-L1 and MSC, i.e., considering only clinical features. Moreover, we excluded the ECOG, as it is the only physician-dependent feature. The results of these models are shown in Table 3, and the ROC curves are provided in Supplementary Figure 1. Removing PD-L1 decreases the accuracy of the corresponding model to ACC=0.726, confirming the high importance of this feature, as reported in the literature. Removing the MSC from the features decreases the accuracy to ACC=0.750, suggesting that the predictive power of this index is lower than that of PD-L1. Removing both, we achieve ACC=0.707. Finally, removing the ECOG decreases the accuracy of the LR model to ACC=0.726; therefore, the importance of the physician's clinical evaluation in the prediction is comparable to that of PD-L1. These findings are confirmed by the values of the F1 score and the average AUC (Table 3). The ROC curve obtained by the leave-one-out method is presented in Figure 4.
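For readers less familiar with the three metrics reported above, the following Python sketch shows how ACC and F1 follow from the entries of a confusion matrix and how AUC can be computed from predicted scores. It is a generic illustration with hypothetical function names. The pairwise (Mann-Whitney) formulation of AUC is one standard equivalent of the area under the ROC curve and is not necessarily the routine used in the study.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of patients classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class (R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(scores_pos, scores_neg):
    """Probability that a random responder scores higher than a random
    non-responder, counting ties as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Unlike ACC and F1, AUC does not depend on a decision threshold, which is why a single-threshold predictor appears as one point on a ROC plot rather than as a curve.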

Predicting long-survival patients (≥24-months OS)
To predict whether a patient is a long survivor (≥24-month OS), another ML binary classification analysis was performed.
Note that, since we are solving a different classification problem, we need to reconsider the use of the above-mentioned methods from scratch. Table 4 reports the full feature selection procedure. In this case too, the LR method proved to be the most promising according to the AIC. It achieves ACC=0.855, F1=0.908, and AUC=0.87. The features included in the model were ECOG, histology, NLR and IO line. The ROC curves computed using the leave-one-out approach are provided in Supplementary Figure 2.
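The labelling rule behind this second task, including the exclusion of patients still alive with less than 24 months of follow-up, can be sketched as below. This is an illustrative Python helper under stated assumptions: the name `survival_label` and its arguments are hypothetical, and follow-up is assumed to be expressed in months from IO start.

```python
def survival_label(os_months, died, horizon=24.0):
    """Label for the long-survival task (hypothetical helper).

    Returns 1 if follow-up reached the horizon, 0 if the patient died
    before it, and None if the patient is alive with shorter follow-up
    and therefore has to be excluded from the analysis.
    """
    if os_months >= horizon:
        return 1
    if died:
        return 0
    return None
```

Returning `None` instead of 0 for censored patients avoids counting patients with short follow-up as short survivors, which would bias the model towards the negative class.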

Discussion
The use of AI is attracting great interest in the medical field and, in particular, in oncology. The recent literature contains a wide range of publications on AI applied to NSCLC, especially focusing on real-world data, genomics, circulomics, and radiomics. In our study, we aimed to find an algorithm to predict response and efficacy of IO using real-world data (i.e., clinical, tumour, and treatment data) and translational data (i.e., the result of the MSC test). Combining the current medical literature, the clinical experience of the physicians, and ML tools, we developed an algorithm including 5 important features that discriminates with good accuracy (ACC=0.756, F1=0.722, and AUC=0.83) between R and NR patients. The model achieved a significantly better result than PD-L1 alone, which is currently the only biomarker used by physicians in clinical practice to select NSCLC patients for IO and which has an accuracy of ACC=0.655 on the analysed dataset. To understand whether the algorithm maintains its accuracy using only real-world data, we excluded PD-L1 from the model features. In this case the accuracy of the model decreased, suggesting that, even if PD-L1 alone is not enough to provide an effective response prediction, it remains an essential feature for IO prediction in clinical practice. We did the same with the MSC, since this test is expensive and time-consuming and its introduction in clinical practice therefore needs to be justified. When the MSC is left out, the model accuracy decreases, although less than when PD-L1 is excluded, again suggesting that the MSC has a role in our model. We also tested the model after removing the patient's ECOG, which is a physician-dependent value, and the results demonstrated a significant impact, analogous to that of PD-L1. Since the model was able to discriminate between the R and NR groups, we were also able to indirectly predict the PFS and OS of these patients.
With a binary classification approach, we provided a method to identify and predict those patients with  (27), used a deep patient graph convolutional network to investigate the IO benefit in NSCLC patients. By integrating real-world data (age, sex, race, histology, stage, ECOG score, smoking status and previous treatment, blood analyses) and genomics in 1937 patients, the algorithm was able to divide patients into two subgroups, beneficial and non-beneficial, with an mOS of 20.35 and 9.42 months respectively. Although our sample was smaller, our model was also able to predict survival and response with comparable results. That model also demonstrated the positive role of TMB and KRAS mutations in IO patients (27). The study by Tian et al. (28) has a dual purpose: first, predicting a PD-L1 signature (PD-L1ES) using CT images (in 939 patients) and, second, predicting IO response in NSCLC patients by combining PD-L1ES and clinical features (in 77 patients). PD-L1ES was able to distinguish patients with a better PFS from those with a worse PFS. However, the results of the combined model (PD-L1ES and clinical data) were superior to those of the clinical and PD-L1ES models taken individually (28). Our study also confirmed the importance of PD-L1 and its added value to clinical features.
The development and validation of a 12-gene immune-relevant prognostic signature for lung adenocarcinoma through ML strategies has been investigated in 954 patients to predict IO. From the discovery dataset of 204 observations, including microarray expression data for 1811 genes, a Cox regression was used to decrease the number of features to 336. Random Forest was then used to extract the final 12 genes used to compute the risk score. Patients were classified into high- or low-score groups (AUC = 0.854, 95% CI = 0.79-0.92). Patients with a high risk score experienced shorter survival than those with a low one (HR = 10.6, 95% CI = 3.21-34.95, P < 0.001) (29). Independently of IO, ML and DL techniques are now used in research to predict the prognosis of NSCLC treated with different therapies, to better address precision medicine; however, these techniques are still far from being introduced in clinical practice. An interesting study used DL to implement OS prediction of NSCLC patients by integrating microarray and clinical data. A list of 15 relevant genes was built using 7 known relevant biomarker genes and 8 other, less known genes. Expression data of the 15 genes and the clinical data were combined to develop an integrative deep NN predicting the 5-year survival status of NSCLC patients with high accuracy (AUC: 0.8163, accuracy: 75.44%); these data are consistent and comparable with our results (30). Another study developed an algorithm to predict NSCLC survival time in 1000 patients treated with different types of therapy. Thirteen features were included in the algorithm, e.g., number of primaries, tumour size, age and stage. Random forest was the best model for predicting short-term survival (< 6 months) (31).
Finally, IO biomarker prediction, as mentioned above, is an unmet clinical need for other cancer types as well. In fact, as in NSCLC, various efforts have been made to find predictive biomarkers of IO response using ML or DL methodology in other cancers. An interesting report on melanoma patients integrates histologic and clinical data to predict IO response. The algorithm consists of a segmentation classifier that takes as input the whole-slide image of the patient (haematoxylin and eosin tissue). These results were then combined through a multivariable logistic regression with clinical characteristics such as age, gender, histologic subtype, etc. The classifier accurately stratified patients into high versus low risk for disease progression with an AUC=0.80 (32). Lastly, in another work on IO prediction in bladder cancer, CT scans were used to develop an ML model according to RECIST methodology, and the ROI were processed to extract radiomic features. Considering a dataset of 43 subjects, the model reaches an accuracy of ACC=0.861 (34).
Our study has several limitations: first, the limited sample size. Second, we did not use radiomic features, and no genomic data were included except for the molecular data requested as per standard of care.
Many studies are trying to extract more information from imaging (radiomics) and genomic data. Radiomics is a very important frontier but is still at an early stage, and more time will be needed to include it in clinical practice. The same holds for genomics. The approach used in this paper includes routine information from imaging (e.g., RECIST) and real-world genetic data already investigated as per standard of care, both of which, added to the clinical data, allow better extraction of predictive multifactorial information. Such data can be cheaper and easier to collect.

Conclusion
In conclusion, the results suggest that the data integration provided by AI techniques is a good tool to improve prediction for NSCLC patients treated with IO. More specifically, the model shows that higher ECOG, NLR value, IO line, and MSC test level correlate negatively with the response to IO therapy and, conversely, that higher PD-L1 correlates positively with the response. It also confirms that PD-L1 and MSC are relevant biomarkers to improve the accuracy of the model. Moreover, considering the difference in survival between the R and NR groups, these results suggest that the model can also be used to indirectly predict survival (PFS and OS).
Finally, a second binary model was able to identify long survival patients with a high accuracy.

DECLARATION OF INTERESTS
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:
-C.P. declares personal fees from BMS and MSD, outside the submitted work.
-G.LR. declares personal fees from BMS, MSD and Astra Zeneca outside the submitted work.
-D.S. declares personal fees from AstraZeneca, Boehringer Ingelheim and BMS, outside the submitted work.
The other authors report no conflict of interest.

Process and methods used in this study.

Figure 3
Confusion matrix for the analysed ML models for Responders (R) and Non-Responders (NR).

Figure 4
ROC curves (True Positive Rate (TPR) vs. False Positive Rate (FPR)) for the analysed ML models. The performance of PD-L1 is represented as a red circle. As suggested by the AUC confidence intervals, no method significantly outperforms the others.