Using Nomogram and Machine Learning Models to Predict Non-Small Cell Lung Cancer Prognosis


Background
Patients with non-small cell lung cancer (NSCLC) often have a poor prognosis. Predicting overall survival (OS) early, at the time of diagnosis, has many benefits, such as allowing providers to design the best treatment plan for patients. In this study, we aimed to evaluate the prognostic factors in NSCLC patients, construct a nomogram, and develop machine learning models to predict OS. We also conducted a feature importance analysis to understand how relevant patient factors affect OS.

Results
Multiple machine learning models were applied to a retrospective cohort of patients from 2010 to 2015 in the Surveillance, Epidemiology, and End Results (SEER) database. Independent prognostic factors for NSCLC were determined using Cox proportional hazards regression analysis. We modeled OS and vital status as the outcomes, and constructed and validated a nomogram to predict the OS of NSCLC patients. Furthermore, we applied logistic regression, random forest, XGBoost, decision tree, multilayer perceptron, and LightGBM to predict the patients' vital status. We tested the predictive ability of the models and evaluated their performance using accuracy, sensitivity, specificity, precision, and the area under the receiver operating characteristic curve. A total of 34,567 patients from the SEER database who met our criteria were included in this study. The nomogram visualized the OS predictions of the Cox regression model. Among the classifiers, XGBoost had the best prediction performance, with an area under the curve of 0.733.

Conclusions
The results demonstrate that machine learning-based classifiers are capable of predicting the outcomes of patients with NSCLC. In addition, the nomogram based on the Cox regression model interpreted the results well and supports potential medical applications.


Background
Lung cancer is the leading cause of death in cancer patients, and non-small cell lung cancer (NSCLC) accounts for 83% of all lung cancer cases, with an incidence rate of 40.60 per 100,000 and a 5-year survival rate of 22.1% [1]. Since NSCLC has a high incidence and poor prognosis, it is particularly important to determine its prognosis. Currently, clinicians usually determine prognosis based on surgical pathological staging, but this staging only takes into account the primary tumor, regional lymph node involvement, and distant metastasis, ignoring the role of other prognostic factors; this method has a poor predictive effect [2]. There is an urgent need for a clinical prognostic assessment system for patients with non-small cell lung cancer that is based on a large amount of data, with high reliability and good predictive effect.
The National Cancer Institute established the Surveillance, Epidemiology and End Results (SEER) database in 1973. It is recognized as one of the most authoritative sources of oncology patient follow-up data in the world and provides reliable data support for clinical research [3].
Machine learning is a suitable method to address this problem because algorithms can learn quickly from a large number of patients to produce more accurate predictions than those of any set of clinical experts [4]. Many studies have assessed the survival of lung cancer patients by analyzing large datasets, such as those in the SEER database, using machine learning techniques, including logistic regression and support vector machine [5], as well as methods based on integrated clustering [6]. An artificial neural network was used to predict the survival of patients with NSCLC, with an overall accuracy of 83% [7].
Naive Bayes and decision trees have also been used to predict survival time in lung cancer with 90% accuracy [8].
In this study, we used the SEER database to build an analysis cohort from a large sample of NSCLC patients, and least absolute shrinkage and selection operator (LASSO) regression was used to identify the factors that have a significant impact on patient prognosis. A Cox proportional hazards regression model was used to predict the 1-, 3-, and 5-year overall survival of patients, and the results are presented in a nomogram. Multiple machine learning methods were also used to construct NSCLC survival prediction models, and their performance was compared to select the best one for supporting clinicians' decisions during patient treatment.

Dataset and Patient Inclusion
Anonymized patient data were obtained from the SEER database using the National Cancer Institute's SEER*Stat software [3]. The SEER program collects data on cancer incidence and survival from US regions that account for 28% of the national population [9]. We screened patients aged 15-64 years whose primary site was diagnosed at the trachea, main bronchus, or lung between 2004 and 2015. We included patients with histologically diagnosed non-small cell lung cancer who had complete survival time and active follow-up data. Tumor stage was coded according to the American Joint Committee on Cancer TNM staging system, 6th edition [10]. Patients whose cancer was diagnosed only at autopsy or from a death certificate were excluded. To improve data completeness, we also excluded patients whose stage, grade, or race was missing. The total number of cases included was 34,567.

Feature Selection
Twenty-seven patient characteristics were selected from the SEER database. Because some variables were not in numeric format, they were assigned numeric values for analysis according to Table S1. Some categorical variables, such as histological classification, race, surgery status, and radiotherapy status, were processed by one-hot encoding. Because patient age is recorded as a range in the SEER database, the median value of the range was used for this variable.
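As a minimal sketch of the preprocessing described above, the following Python snippet replaces an age range with its midpoint and one-hot encodes a categorical field. The range format ("60-64 years"), the category names, and the helper names are illustrative assumptions, not the study's actual coding scheme, which is given in Table S1.

```python
import re

def age_range_midpoint(label):
    """Return the midpoint of an age range such as '60-64 years' (hypothetical format)."""
    low, high = (int(x) for x in re.findall(r"\d+", label)[:2])
    return (low + high) / 2

def one_hot(value, categories):
    """Encode `value` as a 0/1 list with one position per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

# Hypothetical category list, for illustration only.
SURGERY_STATUS = ["none", "performed", "unknown"]

age = age_range_midpoint("60-64 years")
surgery = one_hot("performed", SURGERY_STATUS)
```

One-hot encoding avoids imposing an artificial order on unordered categories such as race or histological type.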
To simplify the model and remove irrelevant variables, we used LASSO regression to select the 10 variables most relevant to the outcome. The coefficient values are listed in Table 1. Among them, "regional nodes examined" and "regional nodes positive" are strongly correlated with N, and N is widely used in clinical practice. Therefore, we retained only N of the three. The remaining eight variables were used for the subsequent prediction and analysis.
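To illustrate how LASSO performs variable selection, here is a small, self-contained coordinate-descent sketch. It assumes a plain squared-error LASSO on toy data; the study does not state which loss function or software was used for this step, and the function names, the penalty value `alpha`, and the data are all hypothetical.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + alpha * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return beta

# Toy data: the outcome depends strongly on feature 0, weakly on feature 1,
# and not at all on feature 2.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.05, 1.95, -1.95, -2.05]

beta = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
```

The L1 penalty drives the coefficients of weakly related features exactly to zero, so the features with non-zero coefficients form the selected subset.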

Multivariate Cox Regression Analysis
The multivariate Cox regression analysis showed that the 8 variables selected by LASSO regression, which were significantly related to the outcome, were all independent factors affecting the prognosis of NSCLC patients (all P<0.001) (Table 2). Figure 1 shows the distribution of the number of patients and the hazard ratio (HR) and P values of the eight groups of variables.
The HR was used to evaluate the relative risk of a variable. A hazard ratio greater than 1 indicates that the variable is positively correlated with the probability of death and negatively correlated with survival time, and a hazard ratio of less than 1 indicates the opposite. For example, compared with female patients, male NSCLC patients have a higher probability of death, and the difference is significant.
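This interpretation follows from the standard form of the Cox proportional hazards model. As a brief sketch, with symbols introduced here for illustration rather than taken from the study ($h_0(t)$ for the baseline hazard, $x_j$ for the $j$-th covariate, and $\beta_j$ for its coefficient):

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right),
\qquad
\mathrm{HR}_j = \exp(\beta_j)
```

Here $\mathrm{HR}_j$ is the factor by which the hazard changes for a one-unit increase in $x_j$ with all other covariates held fixed, so $\beta_j > 0$ gives $\mathrm{HR}_j > 1$ (higher hazard of death) and $\beta_j < 0$ gives $\mathrm{HR}_j < 1$.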

Construction and Verification of Prognostic Prediction Nomogram
The nomogram was based on the Cox proportional hazards regression model constructed above and displays the overall survival rates of patients at 1, 3, and 5 years in graphical form. For the C-index, a value greater than 0.7 usually indicates relatively good discrimination, and the closer the C-index is to 1, the better the prediction accuracy. The calibration curve was used to compare the predictions of the nomogram with the Kaplan-Meier estimates. If the two are identical, the calibration curve coincides with the diagonal. In this study, calibration curves at 1, 3, and 5 years were plotted for patients undergoing NSCLC surgery. As shown in Figure S1, the calibration curve is relatively close to the diagonal, indicating that the predictions of the nomogram model are highly consistent with the Kaplan-Meier estimates.
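To make the C-index concrete, the sketch below computes Harrell's concordance index for right-censored data in plain Python. It is an illustrative implementation on made-up data, not the R routine used in the study; the function name and the toy values are assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: among comparable pairs, the fraction in which the patient
    with the shorter observed survival has the higher predicted risk.
    Ties in predicted risk count as 0.5."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died (event == 1)
            # before the observed time of patient j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: survival time in months, event flag (1 = death, 0 = censored),
# and a predicted risk score (higher = worse prognosis).
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risk)
```

In this toy example four of the five comparable pairs are ordered correctly, giving a C-index of 0.8; a value of 0.5 would correspond to random ordering.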

Evaluation Results
Several machine learning methods, including logistic regression, random forest, XGBoost, decision tree, and LightGBM, were selected to build the prognostic models. The prediction accuracy and other performance measures of the five machine learning models are compared in Table 3 and Figure 3.

Discussion
The results of the multivariate Cox regression analysis (Figure 1) show that the older the patient and the lower the degree of tumor differentiation, the lower the overall survival rate of NSCLC patients. The degree of distal lymph node involvement has a particularly large impact on patient survival, as indicated by its high HR. In addition, the prognosis of male patients is worse than that of female patients.
As a new statistical prediction model [11], the nomogram has many advantages over traditional prognostic models, such as high accuracy, flexibility, and ease of generalization, and it has been widely used for various tumor types, including liver, bladder, prostate, cervical, and gastric cancers [12][13][14][15][16]. In this study, a nomogram was constructed to predict the 1-, 3-, and 5-year OS of NSCLC patients using the eight variables that showed statistical significance in the multivariate Cox regression analysis. The accuracy and reliability of the model were evaluated using the C-index and calibration curve analysis. The C-index, which ranges from 0.5 (no discrimination) to 1 (complete discrimination), measures the agreement between the nomogram's predictions and the observed outcomes. The C-index of the nomogram for OS in our study was 0.709, which shows that the model has predictive ability.
For the calibration curve, the horizontal axis is the probability predicted by the model, with 0 to 1 corresponding to a probability of 0-100% that the event occurs. The vertical axis is the actual (observed) probability. The blue, red, and green fitted lines correspond to 1-, 3-, and 5-year OS, respectively. A fitted line (colored) completely overlaps the reference line (black) if the predicted value equals the actual value at every probability; if the predicted value is greater than the actual value, that is, the risk is overestimated, the fitted line lies below the reference line. In this study, all three fitted lines overlapped well with the reference line, indicating the high accuracy of the prediction model. The 3- and 5-year OS predictions were most accurate at predicted probabilities of 60-70%, while the 1-year OS prediction was most accurate at around 95%.
Logistic regression, random forest, XGBoost, decision tree, and LightGBM were selected to build the prognostic models (Table 3 and Figure 3). Among them, XGBoost obtained the highest accuracy, precision, and AUC, and is considered the best-performing model. LightGBM had the highest sensitivity, and the logistic regression model had the highest specificity. We also experimented with a deep learning model, multilayer perceptron, which underperformed with accuracy, sensitivity, specificity, precision, and AUC of 0.666, 0.598, 0.726, 0.659, and 0.723, respectively.
Cox regression is a time-to-event model, whereas the machine learning models used in this study are all classifiers. Most studies use only one of these approaches to analyze and predict patient survival. In this study, we not only compared the strengths and weaknesses of the five classifiers in predicting the vital status of NSCLC patients but also used a nomogram to visualize the Cox regression model's predictions of 1-, 3-, and 5-year OS. Owing to limitations of the data, there is still considerable room for improvement in the final prediction results of both the Cox regression model and the best-performing machine learning model.
This study has limitations. We did not include all tumor prognostic factors in this preliminary study, so the indicators selected for modeling may not cover all the prognostic factors. In future work, we will apply this model to a stratified analysis of Asian patients in the SEER database, compare the results with datasets from local hospitals in Asia, explore the characteristic variables that have the greatest impact on the survival outcomes of Asian patients, and verify the model in real clinical settings.

Conclusions
In this study, we took preprocessed data on NSCLC patients derived from the SEER database, screened the characteristics by LASSO regression, and selected the 8 independent variables with the greatest impact on survival outcomes from among 27 characteristic variables. We then used the Cox regression model to analyze these eight variables and produced nomograms to predict the 1-, 3-, and 5-year OS of patients. The C-index and calibration curves showed that the nomograms had a good predictive effect. We also compared the prediction results of the five machine learning models for patient survival outcomes, and XGBoost achieved the best performance. We believe these results are useful for predicting patient outcomes and may help improve clinical decision making.

Analysis Methods
Feature selection method
The LASSO method was applied to select the features most relevant to the outcome. Here, the top ten features were chosen from the twenty-seven extracted features. Furthermore, since there were three strongly correlated variables among the top ten features, we chose the most commonly used one and discarded the other two. Finally, eight features were used in both the machine learning models and the nomogram.

Multivariate Cox regression analysis
Multivariate Cox regression analysis was applied to the variables selected by LASSO regression to determine whether they were all independent prognostic factors. A nomogram based on the Cox regression model was constructed to predict the 1-, 3-, and 5-year overall survival (OS) of NSCLC patients.

Construction and verification of prognostic prediction nomogram
We used R software (version 4.0.4) to construct a survival prediction model, based on Cox proportional hazards regression, for patients undergoing surgery for non-small cell lung cancer. The concordance index (C-index) was used to test the accuracy of the nomogram prediction, and a calibration curve was used to evaluate the predictive ability of the nomogram. Bootstrapping (resampling with replacement) was used to validate the model internally and to reduce overfitting. The "rcorrp.cens" function of the "Hmisc" package in R was used in this study.
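The bootstrap validation step can be sketched as follows. This is one common scheme (optimism correction), shown under the assumption that something similar was done; the study does not describe its exact procedure. The helper names `fit_mean` and `neg_mse` are hypothetical stand-ins for "fit the Cox model" and "compute the C-index".

```python
import random

def optimism_corrected_score(data, fit, score, n_boot=200, seed=0):
    """Estimate out-of-sample performance by subtracting the average optimism.

    fit(sample) returns a model; score(model, sample) returns a number
    where higher is better."""
    rng = random.Random(seed)
    apparent = score(fit(data), data)
    total_optimism = 0.0
    for _ in range(n_boot):
        # Draw a bootstrap sample of the same size, with replacement.
        boot = [rng.choice(data) for _ in range(len(data))]
        model = fit(boot)
        # Optimism: performance on the sample used for fitting
        # minus performance on the original data.
        total_optimism += score(model, boot) - score(model, data)
    return apparent - total_optimism / n_boot

# Toy stand-ins: the "model" is the sample mean, scored by negative mean squared error.
def fit_mean(values):
    return sum(values) / len(values)

def neg_mse(mean, values):
    return -sum((v - mean) ** 2 for v in values) / len(values)

corrected = optimism_corrected_score([2.0, 4.0, 4.0, 5.0, 7.0], fit_mean, neg_mse)
```

The corrected score is typically lower than the apparent score, which is how the procedure guards against an overfitted model looking better than it is.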

Machine learning methods
All features were normalized before being entered into the classification models. With survival status (dead/alive) as the target class, the following classification methods were used: logistic regression, random forest, XGBoost, decision tree, multilayer perceptron, and LightGBM. Because the dataset (dead: 16,230; alive: 18,337) was not highly imbalanced, no resampling method was needed to balance it. The data were randomly split into training and testing datasets containing 80% and 20% of all patients, respectively. All analyses were performed using Python 3.6.
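The normalization and 80/20 split can be sketched in plain Python as below. Min-max scaling is an assumption, since the study does not name the normalization method; the function names, the seed, and the toy rows are likewise illustrative.

```python
import random

def min_max_normalize(rows):
    """Scale each column of `rows` (equal-length numeric lists) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def train_test_split(rows, labels, test_fraction=0.2, seed=42):
    """Randomly split rows and labels into training and testing parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data: 10 patients, two numeric features, label 1 = dead, 0 = alive.
rows = [[i, 10 * i] for i in range(10)]
labels = [i % 2 for i in range(10)]

scaled = min_max_normalize(rows)
x_train, x_test, y_train, y_test = train_test_split(scaled, labels)
```

A fixed seed keeps the split reproducible, so that all classifiers are compared on the same training and testing patients.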

Evaluation Methods
The classification models were assessed using the area under the receiver operating characteristic curve (AUC), sensitivity (also known as recall), specificity, accuracy, and precision. These measures are all derived from the four possible outcomes of a binary classifier (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives). Accuracy is the fraction of correct predictions (true positives and true negatives) among all subjects, and is suitable for a balanced dataset. The receiver operating characteristic curve plots sensitivity as a function of 1-specificity, and the area under it is widely used as a performance measure for classification problems across threshold settings. A higher AUC value indicates a model that better distinguishes between the classes. In this study, a false positive (a survivor predicted to be a non-survivor) may lead to overtreatment, while a false negative (a non-survivor predicted to be a survivor) may mean that no extra action is taken for early prevention. Both cases should be avoided. Precision and recall are useful measures when the costs of false positives and false negatives, respectively, are high.
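The measures above can be computed directly from the four outcome counts. The sketch below uses toy labels and scores (not study data), treats death as the positive class (1), and assumes a 0.5 decision threshold; the function names are illustrative. The AUC is computed from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and precision from binary labels.
    Assumes both classes occur and at least one positive prediction is made."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = dead (positive class), 0 = alive.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Unlike the other four measures, the AUC does not depend on the chosen threshold, which is why it is convenient for comparing classifiers.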

Declarations
Ethics approval and consent to participate
The SEER database is a public database from which sensitive patient information has been removed. The data can be used only after a request for research use has been submitted and has passed review.

Consent for publication
Not applicable.

Authors' contributions
H. L, H. Z and Y. W were responsible for study design and conception; C. L and N. H collected the data; C. L, Z. X, X. L and W. Z were responsible for data processing and data analysis; M. G, G. W and Y. W interpreted the results. All authors drafted the manuscript and revised it for important intellectual content.

Figure 1
Multivariate Cox regression analysis of overall survival of non-small cell lung cancer patients.

Figure 2
The nomogram of 1, 3, and 5-year overall survival prediction for non-small cell lung cancer patients by Cox risk regression model.