A Molecule Based Nomogram Optimized the Prediction of Relapse in Stage I NSCLC

Early-stage non-small cell lung cancer (NSCLC) is being diagnosed increasingly, and in 30% of diagnosed patients, recurrence will develop within 5 years. Thus, it is urgent to identify recurrence-related markers in order to optimize the management of patient-tailored therapeutics. The aim of the study was to develop a feasible tool to optimize the recurrence prediction of stage I NSCLC. The eligible datasets were downloaded from TCGA and GEO. In discovery phase, two algorithms, Least Absolute Shrinkage and Selector Operation and Support Vector Machine-Recursive Feature Elimination, were used to identify candidate genes. Recurrence associated signature was developed by penalized cox regression. The nomogram was constructed and further tested via two independent cohorts. Results: In this retrospective study, 14 eligible datasets and 7 published signatures were included. In discovery phase, 42 signicant genes were highlighted as candidate predictors by two algorithms. A 13-gene based signature was generated by penalized cox regression categorized training cohort into high-risk and low-risk subgroups (HR = 8.873, 95% CI:4.228–18.480 P < 0.001). Furthermore, a nomogram integrating the recurrence related signature, age, and histology was developed to predict the recurrence-free survival in the training cohort, which performed well in the two external validation cohorts (concordance index: 0.737, 95%CI:0.732–0.742, P < 0.001; 0.666, 95%CI: 0.650–0.682, P < 0.001; 0.651, 95%CI:0.637–0.665, P < 0.001 respectively).


Introduction
With the adoption of low-dose spiral computed tomography screening for the high-risk individuals, the proportion of stage I non-small cell lung cancer (NSCLC) rise sharply. Surgical resection is the curative treatment for stage I NSCLC. However, local relapse and metastasis impede the therapeutic effect, and approximately 30% patients will suffer recurrence within 5 years of diagnosis (1,2). Postoperative adjuvant therapy is a valid approach to prevent the relapse and metastasis, but the effect is still unclear in stage I NSCLC (3,4). Several large randomized studies failed to show a survival bene t from stage I NSCLC patients who received adjuvant systemic treatment after surgery due to side effects of therapy (5).
Thus, in view of the high rate of relapse and the lack of useful biomarkers, a feasible prediction model is needed to predict the relapse free survival of stage I NSCLC patients and identify the high risk patients who may bene t from postoperative adjuvant therapy.
Plenty of studies have been published to identify clinical risk factors for survival or relapse(6). However, it is not enough to predict the relapse by only considering clinical risk factors, because stage I NSCLC is a heterogeneous disease, which comprising different subgroups with distinct molecular alterations (7).
Expression pro ling has showed great promise in getting insight of molecular mechanisms and biomarkers through analysis of thousands of genes(8, 9). And increasing molecular markers have been incorporated into clinical decision-making (10). Previous studies highlighted several relapse related signatures for stage I NSCLC, which showed molecular information and provided a more robust biomarker of relapse (11)(12)(13). Unfortunately, over tting on small discovery data sets and lack of su cient external validation impeded them to be widespread adopted into clinical practice. Therefore, developing and testing a relapse related signature in a large-scale study is needed, and available public gene expression data sets with relapse status brings the opportunity to identify more reliable signature for relapse. In addition, there is an increasing trend that combining molecular features and clinicopathologic features to predict the disease status or prognosis by recent investigations (14).
In this study, we aimed to develop and validate a predicting signature related to the relapse by multiple gene expression datasets, and leverage the clinical features to build a nomogram predicting the relapse of stage I NSCLC, which would improve capacity of relapse prediction and might have tremendous value in guiding the management of stage I NSCLC patients.

Study population and study design
Comprehensive search for eligible expression pro le datasets related to recurrence of NSCLC were performed in Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). Datasets were ltered by the criteria such as containing stage I NSCLC patients, recurrence status, and relapse-free survival time. After primary ltration, we excluded those patients who had received neoadjuvant therapy, adjuvant chemotherapy, or other pharmaceutical therapy in each dataset. Tumors were pathologically classi ed as stage I according to the seventh edition of the TNM classi cation. After removing patients with no clinical or subtype information, demographic and clinical characteristics of the patients were recorded. Simultaneously, searching for published prognostic signatures related to recurrence were also conducted in Pubmed, Embase, and MEDLINE with the key words including "relapse NSCLC", "recurrence NSCLC", " relapse stage I NSCLC ", " recurrence stage I NSCLC ", "relapse lung cancer", "recurrence lung cancer", "relapse stage I lung cancer", and "recurrence stage I lung cancer". Three steps were designed to develop and test the recurrence-related nomogram for NSCLC patients with stage I. In the discovery phase, eligible datasets and published prognostic signatures were used to screen signi cant genes related to recurrence by two algorithms (LASSO and SVM-RFE), which obtained the candidate features for the further selection. In the training phase, patients from TCGA were included in training cohort due to the completed clinical records and respectively large sample size. Penalized Cox regression was performed to develop a recurrence signature in training cohort, which was subsequently tested in two independent cohorts. Then, a nomogram was constructed by combining recurrence signature and signi cant clinical features. In the validation phase, two independent GEO datasets were selected to test the performance of the nomogram. The study design was displayed by a ow chat in Fig. 1. Recurrence-free survival was de ned as the time from the date of diagnosis to the date of recurrence or last follow-up. The schedule of follow-up and examination were described by the corresponding studies. Forest plot analyses were performed using R package, "meta" (https://github.com/guido-s/meta http://meta-analysis-with-r.org). A heterogeneity test for the combined HR was carried out using the I 2 statistic.

Statistical analysis
The expression pro le datasets were normalized and quali ed using the R package ("affy"). Gene IDs were mapped to genes using the corresponding annotation les. Multiple probes corresponding to the same gene ID were averaged to one measurement for each sample. Differential expression genes between recurrence samples and non-recurrence samples were generated by "limma" package of R software with adjusted false discovery rate less than 0.05 and fold-change greater than 2. Recurrence related genes were identi ed by cox univariable analysis with signi cant P values. Signi cant genes from each dataset and genes from these prognostic signatures were submitted to the Least Absolute Shrinkage and Selector Operation (LASSO) algorithm in order to select candidate features with optimal lambda value, which was de ned by 10 fold cross-validation (R package, glmnet). Meanwhile, another method called Support Vector Machine-Recursive Feature Elimination (SVM-RFE) was used for selection (R package, e1071). Then, a recurrence related signature was developed from all candidate features selected by two algorithms using penalized Cox analysis. The nomogram was developed by "rms" package of R software. The variable included into nomogram were identi ed by backward wise step method and the clinical signi cance. The concordance index (C-index) were used to evaluate the discrimination of the nomogram, and calibration plots revealed the accuracy of the nomogram.
"SurvivalROC" package of R software was used to draw the receiver operating characteristic (ROC) curves. All steps were conducted in R version 3.3.3 software. The cut-off values were de ned by X-tile software with the highest χ2 value in each set (15).

Characteristics of patients
Retrospectively analysis of the gene expression pro les related to recurrence identi ed 14 eligible cohorts, including 13 microarray datasets from GEO (https://www.ncbi.nlm.nih.gov/geo/) and 1 RNA-Seq dataset from The Cancer Genome Atlas (TCGA) (https://cancergenome.nih.gov/). A total of 1414 stage I NSCLC patients were included from 14 published cohorts retrieved by the systematic search. The characteristics of these datasets were listed in Table 1. We didn't integrate these datasets together due to different platforms and different racial groups and segregated them into three phrases. TCGA cohort with the largest sample size and complete clinical records were assigned into the training phase, and we selected another 2 individual datasets with moderate sample size and matching clinical records for independent validation, named GEO50081 and GEO30219. In addition, 7 published prognostic signatures related to recurrence were identi ed, which were also incorporated into the discovery phase (Table S1).
The median recurrence-free survival time of the patients in TCGA was 508 days (range, 2 to 6812 days).
During follow-up, 19.9% of the patients (70 of 351) developed recurrence. The median follow-up time of the patients in the testing cohort were 1729 days and 1935 days (range, 44 and 4393 days and 30 days and 7680 days, respectively). 25% and 24.6% of the patients from two validation cohort showed relapse. In TCGA cohort, the 1-and 5-year RFS rates were 67% and 8.8%, respectively. And the 1-and 5-year OS rates were 85.5% and 43.5% in GEO50081 and 87.3% and 52.8% in GSE30219, respectively.

Construction and validation of the recurrence signature
We conducted a univariable Cox regression analysis to identify 1058 genes associated with recurrencefree survival from differential expression genes in each dataset. After analyzing the published recurrence associated signatures, we obtained 66 signi cant genes and incorporated them with signi cant genes from microarray datasets. LASSO analysis identi ed 36 candidate genes related to recurrence by the optimal lambda which was determined through 10-times cross validations ( Fig. 2A &Fig. S1A).
Simultaneously, another algorithm, named SVM-RFE, selected 13 candidate genes through ranking the features based on their weights and eliminating the feature with the lowest weight ( Fig. 2B &Fig. S1B). Combining the results from two algorithms, 42 candidate genes were selected, including 3 overlapped genes (Table S2). Then, we submitted them into penalized Cox analysis, and a risk signature was built by a 13-mRNA with the corresponding coe cients in the training cohort (Fig. 2C). A cluster plot showed the expression pro le of the 13 mRNAs (Fig. 2D). Among the 13-mRNAs, DUSP4, LDOC1, NRIP3, B3GNT7, and CCBP2 showed positive effect in predicting the recurrence, while the rest 8 genes showed negative effect in prediction, including PCSK6, TPSB2, HSF4, CPNE7, CAPN12, GABRE, HLF, and AGER (Table S3).
After calculating the risk score of each patient by recurrence related signature, we dichotomized patients in two group at the median. Kaplan-Meier (KM) survival analysis showed distinct different survival between high risk group and low risk group (Hazard Ratio (HR): 8.837, 95% con dence interval: 4.228-18.480, P < 0.001) (Fig. 3A). Prediction ability of the signature was evaluated by time independent ROC curve, showing well performance (AUC:0.79 at 1-year recurrence-free survival).
The robustness of the 13-mRNA signature was further con rmed in two independent cohorts (GSE50081 and GSE30219). Survival curves were remarkably different between the high risk group and the low risk group based on the signature, showing recurrence-free survival in 51.6% and 34.3% and 56.3% and 45%, respectively. (HR: 3.556, 95% CI: 1.587-7.966; P = 0.001 and HR: 2.586, 95% CI: 1.226-5.286; P = 0.007) (Fig. 3B&C). Time independent ROC curve was used to demonstrate the predictive capability of survival prediction in two datasets. The AUCs at the 5-year recurrence-free survival was 0.73 in GSE50081 cohort and 0.72 in GSE30219 cohort, respectively. In addition, recurrence associated signature was further validated in another three cohorts that contained the 13-mRNA expression information (GSE37745, GSE8894, and GSE31210) by a xed effects meta-analysis model due to their limited sample size (Fig.   S2). There was low heterogeneity or inconsistency across these cohorts (HR = 1.87, 95%CI:1.29-2.70, P < 0.001; Heterogeneity: I 2 = 0%, τ 2 = 0, P = 0.37), suggesting that these data are not the result of selection bias. A strong concordance was exhibited with previous results by meta-analysis, indicating that the recurrence pattern was a reliable and generalizable signature in predicting the recurrence of stage I NSCLC.
Building a nomogram Univariable analysis revealed that histology and risk score were two signi cant risk factors for recurrencefree survival in training cohort. After multivariable analysis, it remained as an independent risk factor ( Table 2). For the purpose of clinically application, we developed a nomgram in the training cohort to individually predict the probability of recurrence by incorporating the recurrence related signature with age and histology selected by backwards wise step method with least AIC value (Fig. 4A). The Concordance index (C-index) of the nomogram was 0.737 with 95%CI ranged from 0.731 to 0.743 (P < 0.001), indicating the good discrimination ability. Calibration plots of 1-year, 3-year, and 5-year illustrated the excellent accuracy of the prediction in the training cohort (Fig. 4B). X-tile software was used to generate the optimal cut-off value by the highest χ2 value, which categorized patients into tertiles (low risk group: 0.528-2.6, medium risk group: 2.6-11.7, and high risk group: 11.7-13.6). KM survival analysis revealed signi cant distinctions between each risk subgroup (HR = 8.837, 95%CI:4.228-18.480, P < 0.001) (Fig. 4C).

Validation of the nomogram
To better validating the performance of the nomogram in predicting recurrence, two previously used cohorts were utilized for further testing. The favorable calibration plots indicated the nomogram retained the accuracy of prediction in the validation cohorts (Fig. 4D&E). The discriminable ability was assessed by C-index, which was 0.667 for the GSE50081 and 0.651 for the GSE30219 (95%CI: 0.652-0.682, P = 0.0002 and 95%CI: 0.637-0.665, P = 0.0004, respectively). Remarkably, the nomogram was separately applied to predict prognosis for stage IA and IA patients, and survival analysis yielded the signi cant distinctions between three subgroups, which was consistent with previous result (Fig. 5A&B). Furthermore, we utilized the tertiles method in the validation cohorts. In GSE50081, low risk group and medium risk group showed signi cant different survival compared to high risk group (Fig. 5C). But no statistical signi cance was achieved between low risk group and medium risk group (P = 0.857). While low risk group showed the better recurrence-free survival than medium and high risk subgroup in GSE30219 (Fig. 5D). However, medium risk subgroup had the similar survival with high risk subgroup (P = 0.939).

Discussion
Recurrence of stage I NSCLC is a challenging clinical issue, which shortens the survival time and reduces the effect of surgery(16). Adjuvant therapy has been recommended for postoperative therapy for stage II or III NSCLC patients according to guidelines(3), however, it remains controversial in stage I NSCLC.
Published clinical trials have not showed a consistent survival bene t due to drug toxicity and side effect (17). Thus, how to select the patients at high risk of recurrence is an unsolved problem in management of patients with stage I NSCLC.
Increasing investigations were devoted to seek the risk factors for recurrence of early stage NSCLC, which provided the clues for decision-making of postoperative management(18). Brandt et al identi ed that pT stage and lymphovascular invasion were correlated with distant recurrence and declined the disease-free survival (19). Yu and his colleagues developed and validated a radiomics model for prediction of clinical outcome, which bene ted to the choice of treatment (20). In addition, numerous molecular signatures, based on the expression pro les, have been proposed to predict the recurrence-free survival in recent years, which reveal the heterogeneous differences between individuals. Noro et al proposed a two-gene prognostic classi er to predict the recurrence for lung squamous cell carcinoma after surgical resection (21). Shuta's study built a relapse-related molecular signature for lung adenocarcinomas to identify patients at high risk of relapse(22). However, as the problems of small sample size, different microarray platforms, heterogeneity from patients, and diverse range gene selection algorithms, few molecular signatures were broadly adopted in clinic for early stage lung cancer. Compared with previous studies, this study strengthened several aspects and made up the de ciencies. This novel nomogram maximized the potential of molecular biology and clinical factors. In addition, the reliability and robustness of the nomogram was tested in multi-scaled cohorts from different patients and platforms, which showed well performance. As a functional tool derived from molecular biomarkers and clinical variables, it may help optimize patient care by providing better prediction of recurrence, selecting patients for postoperative adjuvant therapy, and stratifying patients in prospective clinical trials. To our knowledge, this is the rst study assessing recurrence for stage I NSCLC by combining molecular signature with clinical variables.
In our preliminary work, we found 14 expression pro le datasets related to recurrence from GEO database and TCGA. By checking the clinical records and the sample size, we found that TCGA contained the largest sample size and completed clinical records, GSE31210 only involved lung adenocarcinoma samples, and either of GSE41271 or GSE68465 lacked of the expression values of TPSB2 due to different microarray platforms. Thus, we assigned TCGA into the training cohort and selected GSE50081 and GSE30129 into the validation cohort due to moderate sample size and matching clinical records. The diverse racial group and wide geographic distribution of patients made themselves representativeness and generalizability, which enhanced the reliability of the model. Candidate genes were screened by two routine algorithms in order to minimize the possibility of missing or ignoring key markers. L1 penalized Cox regression analysis, a broadly adopted method, was utilized to construct the 13-mRNA signature from candidate genes by yielding the corresponding coe cients (14,23). Our 13-mRNA signature exhibited favorable discrimination in the training and the validation cohorts, with an AUC of 0.79, 0.73, and 0.72, respectively. The cutoff values of different datasets, used to de ne the high risk and low risk groups by recurrence associated signature, were determined by the corresponding median risk scores owing to different platforms. Published studies presented evidence supporting that a series of clinical variables, such as age, histology, and differentiation, were associated with recurrence-free survival of early stage NSCLC (10). Therefore, we considered the clinical variables and constructed a nomogram by incorporating clinical variables with our molecular signature to provide an easy-to-use tool for clinicians, showing good calibration and discrimination in the training and validation cohorts. Univariable regression analysis revealed that histology showed statistically signi cant results in the training cohorts. However, after adjusting with the 13-mRNA signature, it wasn't signi cant, which might be related to the respective small sample size. And our meta-analysis of the entire cohort revealed that age and histology were two key variable associated with recurrence-free survival (Fig. S3). Backward wise step method demonstrated that age, histology, and signature were eligible variables with the least AIC value, which could be incorporated into the nomogram. Validation analysis con rmed the reliability and generalization of the nomogram. Tertile strati ed method allowed the remarkably distinctions between survival curves. Notably, we found no statistically signi cant differences between low and medium risk groups in GSE50081 and medium and high risk groups in GSE30219. This might be the lower samples in low risk group in GSE50081 and lower samples in medium risk group, which couldn't discriminate themselves from other groups.
There are some limitations of this study should be mentioned. First of all, this was a multi-scaled study based on the datasets from GEO and TCGA, but prospective studies are required to further validate our nding. In addition, several signi cant clinical variables were not recorded in some datasets, which hampered the accuracy of our model. Furthermore, different platforms of the datasets hindered the integrated analysis of these datasets, which reduced the power of the model.

Conclusion
In this study, we developed a reliable and robust nomogram which could be reproducible in ethnically and geographically diverse cohorts. It will provide classi cation of patients at high risk for recurrence is important for identifying those who may bene t from adjuvant chemotherapy and optimize the design of prospective clinical trials.