Risk factors and prognostic nomogram for patients with second primary cancers after lung cancer using classical statistics and machine learning

Previous studies have revealed an increased risk of second primary cancers (SPC) after lung cancer. Prognostic prediction models for patients with SPC after lung cancer are particularly needed to guide screening. Therefore, this study retrospectively analyzed the Surveillance, Epidemiology, and End Results (SEER) database using classical statistics and machine learning to explore the risk factors and construct a novel overall survival (OS) prediction nomogram for patients with SPC after lung cancer. Data on patients with SPC after lung cancer, covering 2000 to 2016, were gathered from the SEER database. The incidence of SPC after lung cancer was assessed with standardized incidence ratios (SIRs). Cox proportional hazards regression, machine learning (ML), Kaplan–Meier (KM) methods, and log-rank tests were used to identify the important prognostic factors for predicting OS. These significant prognostic factors were used to develop an OS prediction nomogram. In total, 10,487 SPC samples from the SEER database were randomly divided into training and validation cohorts (model construction and internal validation). In the random forest (RF) and extreme gradient boosting (XGBoost) feature importance rankings, age was the most important variable, which was also reflected in the nomogram. The models that combined machine learning with Cox proportional hazards regression had better predictive performance than the model that used Cox proportional hazards regression alone (AUC = 0.762 for RF, AUC = 0.737 for XGBoost, AUC = 0.722 for Cox). Calibration curves and decision curve analysis (DCA) curves also indicated that our nomogram has excellent clinical utility. The web-based dynamic nomogram calculator is accessible at https://httseer.shinyapps.io/DynNomapp/. The prognostic characteristics of SPC following lung cancer were systematically reviewed. The dynamic nomogram we constructed can provide survival predictions to assist clinicians in making individualized decisions.


Introduction
Lung cancer has attracted much attention because of its high morbidity worldwide [1]. With the promotion of screening programs and advances in lung cancer treatment, the number of long-term survivors is gradually increasing [2][3][4]. However, the risk of SPC among lung cancer survivors is also increasing, and the prognosis and survival of these patients cannot be neglected [5].
According to recent studies, second primary lung cancers or other malignancies occur in 33% of patients with non-small cell lung carcinoma [6]. A National Cancer Institute (NCI) report on second cancers among cancer survivors stated that the probability of a second cancer after a first cancer is higher than the probability of cancer in the general population [7]. The data show that a second primary cancer is highly likely to occur in lung cancer survivors within 20 years of diagnosis, but the outcomes and survival of these SPCs are not fully understood. Therefore, prognostic prediction models for patients with SPC after lung cancer are particularly needed to guide screening.
Lianxiang Luo and Haowen Lin have equally contributed to this work.
Nomograms are widely used as prognostic devices in oncology and medicine [8]. At present, they are widely used to predict the recurrence or prognosis of various cancers, such as prostate cancer [9] and liver cancer [10], and of many other diseases. Machine learning has emerged as a field critical for providing tools and methodologies for analyzing the high-volume, high-dimensional, and multi-modal data generated by the biomedical sciences [11,12]. The Cox proportional hazards regression model not only predicts survival but also shows the relationship between variables and survival outcome in an easily interpretable way. Hence, we combined Cox regression with a random forest model [13,14] and an XGBoost model [15,16] to rank the variables by importance. We then established the dynamic nomogram and used calibration curves, ROC curves, ML evaluation metrics, and decision curve analysis (DCA) curves to verify its accuracy. In short, we aimed to investigate the risk factors for patients with SPC after lung cancer and to construct a novel dynamic nomogram that predicts the survival of SPC patients.

Collection of patient data
Cases of SPC were extracted from the SEER database. The tumor information in the database is unified and standardized through SEER*Stat software (accession number: 10799-Nov2021). Lung cancer diagnoses were defined using the SEER site recode based on the International Classification of Diseases for Oncology, third edition (ICD-O-3), with site codes C34.0-C34.9, excluding histology codes 9050-9055, 9140, and 9590-9992. Patients who developed SPC after initial primary lung cancer (IPLC) were enrolled. Several exclusion criteria were then applied to SPC after lung cancer. Records flagged as first malignant primary (Yes) with histology codes 9052, 9874, 9699, and 9863, as well as records flagged as first malignant primary (No) with site recodes C34.0-C34.9, were dropped. The diagnostic criteria for multiple primary cancers described by Warren and Gates in 1932 are still used without exception to identify SPC: the tumor must be histologically distinct from the primary cancer, with a latency of no less than 6 months to exclude errors caused by metastasis and recurrence [3,17,18]. Our data also followed these criteria. In addition, only patients with complete survival and follow-up information and patients with "positive histology" (that is, patients considered to have a pathologic diagnosis) were included. For the SIR analysis of SPC, follow-up started at the date of IPLC diagnosis recorded in the SEER database, and patients had to remain under active follow-up. The SIR was a calculation of the risk of developing second cancer in patients with lung cancer by comparison with the incidence of cancers in the general population (in the same period as lung cancer patients). Overall survival (OS) was calculated from the diagnosis date of the initial primary lung cancer (with a latency period between IPLC and SPC of more than 6 months) to the date of the last follow-up or death in the SEER database.
The criteria for data inclusion were manually selected in SEER*Stat, and the SIR was calculated using the MP-SIR module in SEER*Stat software.

Patient cohorts
Relevant demographic and clinicopathological information was extracted, including age, sex, grade, race, TNM stage (the most common tumor staging system in the world), stage (tumor metastasis stage), lymph node dissection, radiation, chemotherapy, site (SPC), histological type, marital status, SPC surgery history, IPLC surgery history, follow-up time, and latency time between IPLC and SPC. All variables were derived from the same group of patients with IPLC. In the SPC risk study, the study population was patients with lung cancer. In the survival analysis, the study population was patients with SPC after lung cancer. The purpose of the SPC risk study was to identify the population in which SPC is prone to develop, which in turn motivated our research on the prognosis of SPC in IPLC survivors. The SPC risk study and the overall survival study in SPC were carried out step by step and complemented each other.

SIR calculation
The incidence of SPC in the cohort of patients with IPLC was compared with the expected incidence of lung malignancy in the general population using the SIR. SIR analyses were completed using the MP-SIR module in SEER*Stat software. Each variable was selected individually before calculation, and the variables were not distinguished by ethnicity. For the study cohort, patients whose first primary malignancy registered in the SEER database was a lung malignancy were selected. The latency period between IPLC and SPC was more than 6 months to exclude errors caused by metastasis and recurrence. The detailed MP-SIR session process is shown in Supplementary Fig. 1. Finally, the SIR table of SPC after lung cancer was output to compare the differences in risk relative to the general population.
According to clinical experience, the diagnosis of second primary cancer in children is inconsistent with that in adults; hence, patients younger than 20 years were excluded. Samples in the "unknown" age subgroup were also excluded. Comprehensive conclusions can be drawn from horizontal and vertical comparisons, such as the comparison of males with females. A P-value ≤ 0.05 was considered statistically significant.
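As an illustration of the quantity described above, the SIR is the ratio of the observed number of second cancers to the number expected from general-population rates. The following minimal sketch is not the MP-SIR implementation: it assumes Byar's approximation for the 95% confidence interval (a substituted technique; the method used inside SEER*Stat is not stated here), and the function name `sir_with_ci` and the counts are hypothetical.

```python
import math

def sir_with_ci(observed, expected, z=1.96):
    """Return (SIR, lower, upper) for observed vs. expected case counts.

    The interval uses Byar's approximation to the exact Poisson limits.
    This is an assumption made for illustration, not the SEER*Stat method.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    sir = observed / expected
    if observed == 0:
        lower = 0.0
    else:
        lower = observed * (1 - 1 / (9 * observed)
                            - z / (3 * math.sqrt(observed))) ** 3 / expected
    o1 = observed + 1
    upper = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3 / expected
    return sir, lower, upper

# Hypothetical counts, not taken from the SEER data
sir, lo, hi = sir_with_ci(observed=120, expected=100.0)
print(f"SIR = {sir:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An SIR whose interval lies entirely above 1 would indicate more second cancers than expected in the general population.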

Prognostic risk factor selection
All prognosis-related analyses were performed using R software (version 4.0.5). Samples were randomly divided into a training cohort (70%, n = 7340) and a validation cohort (30%, n = 3147). The training cohort was used to establish the model, and the validation cohort was used to test and verify its accuracy. Internal validation was conducted in the validation cohort.
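The random 70/30 division can be sketched as follows. This is only an illustration in Python, whereas the analyses in this study were run in R; the function name `split_cohort`, the seed, and the use of integer sample identifiers are assumptions. Rounding the training size down reproduces the cohort sizes reported above for 10,487 samples.

```python
import random

def split_cohort(sample_ids, train_fraction=0.7, seed=2021):
    """Shuffle sample identifiers and split them into training and validation sets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)          # seeded for reproducibility (seed is arbitrary)
    n_train = int(train_fraction * len(ids))  # round down, as in 7340 of 10,487
    return ids[:n_train], ids[n_train:]

train_ids, validation_ids = split_cohort(range(10487))
print(len(train_ids), len(validation_ids))
```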
Univariate and multivariate analyses of the training cohort were carried out to identify risk factors. A two-sided P-value < 0.05 was considered statistically significant. Based on the Cox regression analysis, we chose the significant prognostic factors to evaluate the OS rate with Kaplan-Meier curves. Differences in survival were assessed using the log-rank test. In particular, the KM curve for age was stratified by risk score and risk level: patients with a risk score larger than 1 were assigned to the high-risk group. Because few patients in the data had a definite M stage, M stage was not analyzed. Similarly, race and marital status were also excluded. The "plyr" and "survminer" packages for R were used to establish the Cox regression model and the KM survival analysis, respectively. Subsequently, the significant prognostic features were passed on to machine learning for further screening.
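To make the two survival tools named above concrete, the sketch below implements the Kaplan-Meier product-limit estimator and a two-group log-rank test in plain Python. It is an illustration only and does not reproduce the R workflow of this study; the function names and the toy follow-up times are hypothetical, and events are coded 1 for death and 0 for censoring.

```python
import math

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (1 = death, 0 = censored)."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:   # group tied times
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk        # product-limit step
            curve.append((t, survival))
        at_risk -= removed
    return curve

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test on lists; returns (chi_square, p_value), 1 degree of freedom."""
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)     # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)     # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_square / 2))  # chi-square tail, 1 df
    return chi_square, p_value

# Hypothetical follow-up times (months) for a high-risk and a low-risk group
high_t, high_e = [2, 3, 5, 7, 8], [1, 1, 1, 1, 0]
low_t, low_e = [6, 9, 12, 15, 20], [1, 0, 1, 0, 0]
print(kaplan_meier(high_t, high_e))
print(logrank_test(high_t, high_e, low_t, low_e))
```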
In our research, two machine learning (ML) models were employed to evaluate which indicators were more important for the prognosis of SPC patients. Using the randomForest, caret, and pROC packages of R software, the training cohort was used to establish a random forest model. With the out-of-bag (OOB) error rate as a reference, we selected the values of "ntree" and "mtry" that gave a relatively low OOB error rate. On this basis, the selected variables were ranked according to the decline of the "IncNodePurity" value. The larger the decrease in impurity after a certain split, the more informative the corresponding input variable [19]. Similarly, we performed feature importance analysis in the XGBoost model using the "xgboost" and "Matrix" packages. Then, combining classical statistics and machine learning, the key features filtered by the model with better prediction performance were used to establish the prognostic nomogram model. Based on the trained models, the ROC curves of the Cox and ML models were drawn on the validation set, and the area under the curve (AUC) was calculated. The prediction accuracy of each model was judged by the AUC (range 0.5-1) of the receiver operating characteristic curve. The closer the AUC was to 1, the better the diagnostic effect [20]. Additionally, machine learning metrics including accuracy, recall, precision, and F1-score were applied to evaluate the predictive power of the two models.
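The evaluation metrics listed above can be written out explicitly. The sketch below is a minimal Python illustration, not the pROC or caret code used in this study; the function names, the toy labels and scores, and the 0.5 classification threshold are assumptions. The AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def classification_metrics(labels, predictions):
    """Accuracy, precision, recall and F1-score for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical validation labels (1 = event) and model scores
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]   # assumed threshold of 0.5
auc = roc_auc(labels, scores)
metrics = classification_metrics(labels, predictions)
print(round(auc, 3), metrics)
```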

Establishment and verification of nomogram
A graphical nomogram model was developed into an online calculator, called the dynamic nomogram, to calculate the risk score of SPC patients, using the "rms", "survival" and "DynNom" packages in R software. To assess the predictive accuracy of the nomogram model, the concordance index (C-index) was calculated with the "Hmisc" package in R software. The value of the C-index ranged from 0.5 to 1, where 0.5 indicated random chance and 1 indicated that the model discriminated the outcome perfectly [21]. In addition, a calibration curve was drawn to compare the consistency between the predicted probability and the actual result. The abscissa of the calibration curve was the predicted probability, with 0 to 1 corresponding to an event probability of 0 to 100%. The ordinate was the actual probability, that is, the observed event rate of the patients. The red line was the fitted line, indicating the actual value corresponding to each predicted value. The accuracy of the model was shown intuitively by how closely the fitted line coincided with the reference line (blue line). If the predicted value was equal to the actual value, the red line coincided completely with the blue line. If the predicted value was greater than the actual value, the risk was overestimated and the red line was below the blue line. If the predicted value was less than the actual value, the risk was underestimated and the red line was above the blue line. In addition, DCA, a newer method for evaluating prediction models in terms of clinical net benefit, was carried out with the "stdca" package to evaluate the clinical net benefit of the prognostic nomogram, and this new model was compared with the 8th edition AJCC TNM staging system [22,23]. All internal validation of the prognostic models involved the "rms", "foreign" and "survival" packages in R software.
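The C-index described above can be illustrated with Harrell's pairwise definition. The following is a minimal Python sketch rather than the "Hmisc" routine used in this study; the function name and the toy data are hypothetical. A pair of patients is treated as comparable when the patient with the shorter follow-up time had an observed event, and the pair is concordant when that patient also has the higher predicted risk.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data (1 = event, 0 = censored)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue                      # a censored patient cannot anchor a pair
        for j in range(n):
            if times[i] < times[j]:       # patient i failed before patient j's last follow-up
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0     # higher risk failed earlier
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5     # tied predictions count one half
    return concordant / comparable

# Hypothetical survival times (months), event indicators and predicted risks
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = harrell_c_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1 to perfect ordering, matching the interpretation given in the text.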

SIR analysis
The SIR of SPC among IPLC survivors is listed in Supplementary table 1. An SIR greater than 1 indicated an increased risk of developing another type of cancer. Except for the 85+ age group, the SIR of every age group was greater than 1. The SIR values of females in the age groups from 50 to 84 years were higher than those of males. Among patients without chemotherapy, the SIR of females (SIR: 1.34, 95% CI 1.32-1.36) was greater than that of males (SIR: 1.09, 95% CI 1.08-1.1), while the SIR of males was highest in the chemotherapy group (SIR: 1.83, 95% CI 1.78-1.88). In the group with unknown radiotherapy status or no radiotherapy, the SIR of females (SIR: 1.36, 95% CI 1.33-1.38) was larger than that of males (SIR: 1.10, 95% CI 1.08-1.1). Combination of beam with implants or isotopes, radioactive implants, and radioisotopes all increased the risk of SPC in females.

Analysis of patient data
Ultimately, a total of 10,487 patients with a second primary cancer after being diagnosed with lung cancer were identified. The detailed patient selection process is shown in Fig. 1.

Screening of important variables
The results of univariate and multivariate Cox regression analysis are listed in Table 1. The results revealed that chemotherapy (CT), grade, histology type, lymph node removal (LNR), TNM stage, radiation, stage, surgery history of IPLC, surgery history of SPC, age, race, sex, and marital status were associated with the OS of SPC patients. For OS, the hazard ratio (HR) was used to compare the risk of death. An HR greater than 1 with a P-value < 0.05 indicated that the variable was a risk factor. In multivariate Cox regression analysis, age was a risk factor (HR: 1.02, 95% CI 1.02-1.03) that increased the risk of death. For race, the HR comparing Black and White patients was 0.95 (95% CI 0.86-1.04), an interval that includes 1. The risk was greater for males (HR: 1.2, 95% CI 1.13-1.27) than for females. In addition, LNR, radiation, LC surgery and SPC surgery all reduced the risk of death, and all were statistically significant. In univariate analysis, chemotherapy increased the risk of death, but in multivariate Cox regression analysis it was a protective factor.
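The way a hazard ratio and its interval are read can be sketched numerically. In a Cox model the HR is the exponential of the regression coefficient, and the 95% confidence limits are the exponentials of the coefficient plus or minus 1.96 standard errors; a variable is significant at the 5% level when the interval excludes 1. In the Python sketch below, the function names are hypothetical and the standard error of 0.03 is an assumed value chosen only so that the interval falls close to the one reported for sex; it is not taken from Table 1.

```python
import math

def hazard_ratio(beta, se, z=1.96):
    """Return (HR, lower, upper) from a Cox coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def excludes_one(lower, upper):
    """True when the confidence interval lies entirely above or below 1."""
    return lower > 1 or upper < 1

# Assumed coefficient: log of the reported HR for males; se = 0.03 is hypothetical
hr, lo, hi = hazard_ratio(beta=math.log(1.2), se=0.03)
print(f"HR = {hr:.2f} (95% CI {lo:.2f}-{hi:.2f})")

# Intervals as reported in the text
print(excludes_one(1.13, 1.27))   # sex: interval above 1
print(excludes_one(0.86, 1.04))   # race: interval spans 1
```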

Kaplan-Meier analysis for prognostic factors
Based on the above analysis, twelve variables were selected for KM analysis. Figure 2A-L depicts the Kaplan-Meier survival curves for OS of SPC patients based on risk scores in the training set. The log-rank P-values for these prognostic factors in the training set were all < 0.01, suggesting that these characteristics significantly affected patient survival. Survival tended to be longer in patients who received treatments such as LC surgery, SPC surgery or radiation. Chemotherapy, however, was a risk factor associated with worse survival. In terms of SPC histology, ductal, lobular and medullary neoplasms had better OS. SPCs of the male genital system and breast appeared to have longer OS than SPCs at other sites.

Analysis of machine learning clinical prognosis model
Twelve significant variables were selected based on Cox regression and KM survival curves. Furthermore, based on the permutation importance principle [19], these 12 variables were sorted by importance in the RF and XGB models (Fig. 3A, Supplementary Fig. 2A). According to the importance ranking of the variables in the two models, the top nine variables were selected to construct the prognostic model. The filtered variables were age of SPC, grade of SPC, site of SPC, Stage_T of SPC, Stage_N of SPC, histology of SPC, LCsurgery (surgery history of IPLC), LNR of SPC and SPCsurgery (surgery history of SPC). Because sex is basic information about a patient, it was still included in the dynamic nomogram prediction model. Both the RF and XGB models revealed that the age of SPC patients had the greatest influence on patient prognosis. The details of our model are shown in supplementary table 4 and supplementary table 5. In Fig. 3B, with ntree = 500 and mtry = 7, the three lines were smooth and showed no marked fluctuation, indicating that the random forest was able to discriminate the survival status of SPC patients well. The red line was the error rate for class 1, the green line was the error rate for class 0, and the black line was the OOB error rate. The area under the curve (AUC) of the RF and XGB models was 0.762 (95% CI 0.655-0.868) and 0.737 (95% CI 0.650-0.825), respectively, which indicated reasonable agreement between predicted probability and observation in the two models (Fig. 2C, Supplementary Fig. 2B). The detailed prediction performance of the two machine learning models is shown in Supplementary table 6. Comparing AUC values between models shows that the two combined models outperformed the Cox proportional hazards model alone (Supplementary Fig. 3), supporting the accuracy of our model.

Construction of prognostic nomogram
Based on the Cox analysis and machine learning models, a web-based dynamic nomogram application was also developed. The calculator could predict the survival of individual patients according to their clinical features. Because sex was retained as an input, the nomogram was applicable to both male and female patients. After entering a participant's Age, Sex, Grade, Stage_T, Stage_N, LNR, LCsurgery, SPCsurgery, Site and Histology on https://httseer.shinyapps.io/DynNomapp/, the corresponding probability of SPC survival could be obtained immediately (shown in Fig. 4).

Calibration and validation of the nomogram
The calibration curves used to evaluate the nomogram in both the training set and the validation set are shown in Fig. 5. The discriminative ability and predictive accuracy of the nomogram were evaluated with the concordance index (C-index), calibration curves and DCA curves. The nomogram showed good predictive performance, with a C-index of 0.711 (95% CI 0.699-0.722) in the training cohort and 0.718 (95% CI 0.710-0.726) in the validation cohort. The calibration curves displayed favorable consistency between the nomogram predictions and the actual outcomes for 3-year (Fig. 5A, C) and 5-year (Fig. 5B, D) survival in both the training cohort and the validation cohort. These results indicated that our nomogram had good discrimination and prediction abilities in the internal validation. In addition, DCA was implemented in the training cohort and validation cohort to ascertain the clinical usefulness of the nomogram (Fig. 6).
The results showed that both the 3-year survival (Fig. 6A, C) and 5-year survival (Fig. 6B, D) predictions of our model performed significantly better than the 8th edition AJCC TNM staging system, with higher net benefit in predicting the OS of SPC patients.
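For readers unfamiliar with the net benefit plotted in a decision curve, the quantity can be sketched for a simple binary outcome. At a threshold probability pt, net benefit equals the true-positive rate per patient minus the false-positive rate per patient weighted by pt / (1 - pt). The Python sketch below is an illustration under that simplification and is not the "stdca" procedure used in this study, which handles censored survival data; the function names and toy values are hypothetical.

```python
def net_benefit(labels, predicted_risks, threshold):
    """Net benefit of acting on patients whose predicted risk is at or above the threshold."""
    n = len(labels)
    flagged = [(y, p) for y, p in zip(labels, predicted_risks) if p >= threshold]
    tp = sum(1 for y, _ in flagged if y == 1)
    fp = sum(1 for y, _ in flagged if y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

def net_benefit_treat_all(labels, threshold):
    """Reference strategy: treat every patient as high risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical outcomes (1 = death within the horizon) and predicted risks
labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
risks = [0.9, 0.8, 0.6, 0.7, 0.3, 0.2, 0.1, 0.4, 0.35, 0.05]
nb_model = net_benefit(labels, risks, threshold=0.5)
nb_all = net_benefit_treat_all(labels, threshold=0.5)
print(round(nb_model, 3), round(nb_all, 3))
```

A model is considered clinically useful over the range of thresholds where its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is 0.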

Discussion
The findings of our study revealed that our model was superior to the 8th edition AJCC TNM staging system, indicating that our nomogram had good clinical applicability in predicting 3-year and 5-year survival of SPC patients. To facilitate the application of the nomogram model in clinical practice, we developed a dynamic nomogram, a web-based calculator accessible for free at https://httseer.shinyapps.io/DynNomapp/.
The SIR analysis shows that patients with SPC after IPLC tend to be younger. Patients younger than 40 years generally had a high SIR; such early disease may be related to some extraordinary and rare genetic or environmental conditions. Overall, women have a greater risk of SPC than men. In particular, different radiotherapy methods affect the incidence of SPC; for example, radioactive implants could reduce the risk of SPC in males. Hence, it is crucial to make individualized radiotherapy plans and to maintain active follow-up. In univariate analysis, chemotherapy was a risk factor for survival, but after adjustment for other factors, all treatment methods, including surgery, radiotherapy and chemotherapy, were beneficial to patients. These findings show that it is necessary to pay attention to the personalized use of surgery, radiotherapy and chemotherapy. Notably, "Other" was significant for OS in TNM stage. This suggests that the definition criteria for the primary tumor in the TNM staging system are imperfect and should be revised based on the analysis of more clinical data as well as clinical experience. By further screening variables with ML, we found that the major factor affecting the prognosis of patients was age. Cancer is widely regarded as a disease of aging, and age is a common risk factor for almost all types of cancer, which may be related to the age-related decline in immune function and reduced capacity for gene repair [24].
SPC, second primary cancer; LNR, lymph node removed; CT, chemotherapy; Grade I, well differentiated; Grade II, moderately differentiated; Grade III, poorly differentiated; Grade IV, undifferentiated or anaplastic. "Other" in "Grade" includes "unknown, T cell, B cell, null cell". "Other" in "Stage_T" includes "Tis, TX, Ta, NA, Blank". "Other" in "Stage_N" includes "NX, NA, Blank". When the P-value is much smaller than the computer displayable range, it takes the value of 0.
Although previous clinical data analysis reports have discussed second primary cancer following lung cancer [25,26], they did not construct a nomogram that can be used to predict 3-year and 5-year survival rates from clinical features. To our knowledge, the use of a nomogram to study SPC after IPLC has not been reported. Compared with nomograms for bone metastasis of lung cancer [27] or for predicting the death rate among non-small cell lung cancer (NSCLC) patients treated with surgery [28], the validation and accuracy of our model were similar, which supports the reliability of its establishment and verification. The important difference is that we added decision curve analysis (DCA) to assess the potential clinical utility of the predictive nomogram and used RF and XGBoost models to test the importance of variables.
We also showed that models combining Cox proportional hazards regression with machine learning had better predictive performance than Cox proportional hazards regression alone. Our dynamic nomogram, based on the machine learning models and the Cox proportional hazards model, greatly improved clinical practicability. After a patient's clinical information is entered into the dynamic nomogram, the survival probability of SPC and its 95% confidence interval for that individual patient are displayed automatically. The dynamic nomogram simplified the use of the model through a web-based calculator accessible for free, which makes it more convenient for clinicians to tailor clinical decisions to an individual. In other words, our nomogram showed good discrimination and good clinical value.
Nevertheless, our study has several limitations. First, the data were extracted from the SEER database, and the amount of data was imbalanced across groups. For example, the numbers of cases of Stage T0 (only 8 cases) and Stage N3 (only 28 cases) were far smaller than those of the other groups, introducing errors into the model's predictions for T0 and N3. Therefore, in this study the results for T0 and N3 were regarded as unreliable. Second, data extracted from the SEER database may carry inherent biases. SEER has highlighted the limitations of incomplete and biased treatment data at https://seer.cancer.gov/data-software/documentation/seerstat/nov2020/treatment-limitations-nov2020.html. According to the website, on the one hand, it is improper that the SEER program places "no/none" and "unknown" in the same group. On the other hand, the SEER program records patients who did not receive treatment ambiguously, which may reduce the credibility of clinical prediction. These limitations may also have biased the screening results and contributed to the low importance of the treatment variables. In addition, the use of unimportant or ambiguous variables for nomogram prediction can skew physicians' judgments. We suggest that the SEER database's definition criteria for treatment information be upgraded and refined further. Third, the SEER database lacks information on targeted therapies and immunotherapy, so our model could not consider all influencing factors. Finally, even though we verified the reliability of this model with the C-index, calibration curves, DCA curves and ML evaluation metrics, it is still limited and needs to be validated further in prospective clinical trials.
In conclusion, the prognostic characteristics of SPC following lung cancer were systematically reviewed. Our dynamic nomogram can also be used as a simple clinical prediction tool to provide personalized service for patients.