Early-stage NSCLC Survival Prognosis Model Prediction of Overall Survival for Early-stage Non-small Cell Lung Cancer via a Novel 7-gene Prognostic Signature and Allele Frequency Deviation (AFD)

Background: Due to the late and poor prognosis of non-small lung cancer(NSCLC), the mortality of patients is high, underlines the need to identify a credible prognostic marker for NSCLC patients. The aim of our study is to examine the association of allele frequency deviation (AFD) with the patient's survival, as well as identification and validation of a new prognostic signature to predict NSCLC overall survival(OS). Methods: First, we developed a new algorithm to calculate AFD from whole-exome sequencing(WES) data, then we compared the predictability of the patient's survival between AFD, tumor mutation burden (TMB) and change of variants allele frequency (dVAF). Second, we overlapped the differentially expressed genes (DEGs) from our data with the genes associated with the survival of The Cancer Genome Atlas (TCGA) database to confirm all genes significantly related to the survival of lung cancer. We identified 149 genes, 31 of which are new genes and have not been reported for lung cancer, that was used to develop a new prognostic model. Lung cancer adenocarcinoma (LUAD) data from the TCGA database was used to validate the gene-signature model. The prognostic model relating to the genes was established and validated in training and LUAD validation groups. Results: There was a significant association found between the high AFD value and poor survival among non-small cell lung cancer (NSCLC) patients. A novel seven genes (UCN2, RIMS2, CAVIN2, GRIA1, PKHD1L1, PGM5, CLIC6) were obtained through multivariate Cox regression analysis and significantly associated with NSCLC patients survival. Cox regression analysis confirmed that AFD and 7-gene signature are an independent prognostic marker in NSCLC patients. The AUC for 5-year survival in AFD and the AUC for 3-year survival in both training and validation groups were greater than 0.7. Conclusion: As a result, AFD and 7-gene signatures were identified as new independent predictive factors used for predicting the survival among NSCLC patients.


Background
Despite of any advancement in lung cancer treatment, it remains one of the most common types and the leading cause of cancer-associated mortality among men and women worldwide [1]. Non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC) are identified as two major types of lung cancer. The two main types of non-small cell lung cancer are lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) [2], so these histological subtypes may determine the choice of treatment [2,3]. In recent years, the absolute and relative frequencies of lung cancer's incidence and mortality have risen dramatically around the world [4,5]. Overall, the five-year survival rate for lung cancer is 19% [6]. A total of 228,820 new lung cancer cases and 135,720 lung cancer deaths were expected to occur in 2020 [7].
However, more than half of NSCLC patients are diagnosed with either advanced or metastatic stage(third or fourth stage) disease, significant and longer survival rates can be obtained for those who are diagnosed at an early stage, but in advanced stages, curative treatment options are prolonged and limited, resulting in poor prognosis and low survival rates [8]. Time is certainley the crucial factor for all cancer patients, in addition to the fact that NSCLC is a heterogeneous group of diseases, all these reasons have led to the creation of a clear unmet medical need for the new marker that can aid the clinicians through facilitating the accurate and early diagnosis of lung cancer, enhancing predictive clinical outcomes, and guide the customized treatment [9,10].
Recently, a variety of studies have been conducted to identify predictive biomarkers to guide long-term NSCLC patients prognosis. Such biomarkers are segregated into: 1) single biomarker like SLC2A1, PKM, EPCAM, ALCAM, CADM1, HIF1A, and PTK7 [11][12][13][14][15][16][17], as well as molecular biomarker such as Tumor mutation burden(TMB), blood Tumor Mutation Burden(bTMB) [18], or some of the other recent markers currently being studied to predict prognostic condition or metastasis; and 2) gene markers that are found through the study of high-throughput gene expression profiles, they are built through using several prognostic genes. Many studies have demonstrated TMB as a biomarker for NSCLC patients [19], for example, Rizvi et al. [20] demonstrated that high TMB levels were correlated with improved ORR, prolonged PFS in retrospective analysis of NSCLC patients. Chae YK et al. [21] reported that treating with PD-1/PD-L1 inhibitors leads to a longer OS among patients with high TMB in comparison to those with low TMB. Owada-Ozaki Y et al. [22] confirmed that high TMB is a poor prognosis factor in NSCLC patients. In addition, dVAF has also been identified as a predictor of clinical outcomes at NSCLC and UC [23], Moreover, several studies bagan to identify gene biomarkers related to NSCLC prognosis [24][25][26]. Another study on AFD involved cervical cancer patients revealed that AFD is positively correlated with therapy response and helps in expecting the progression-free survival 27 .
Based on these results, the investigation of a new prognosis biomarker for OS in early-stage NSCLC patients will be critical in the rapid assessment of diagnosis and therapeutic efficacy. Therefore, our study is considered the first to study the direct association of AFD with prognostication, as there is a lack of studies that paid due attention to the comparision of TMB and dVAF with AFD in order to predict the progression of NSCLC patients. We have conculded that AFD is an independent prognostic factor to be used to predict the survival of NSCLC patients (primary aim), we reached this by developing a new algorithm to determine the deviation of tumor from normal allele frequency through WES of tumor and normal samples. Moreover, we analyzed the RNA.seq data to identify the NSCLC-related gene biomarkers for better identification of a reliable gene signature relevant to NSCLC prognosis. This approach has helped us to recognize a novel 7-gene signature which can be used for successfully prediction of prognostic risk in NSCLC patients (secondary aim).  [70].

Alignment and quality control
In-house pipelines were used to process the sequencing of 54 WES data. Kemer content and over-represented sequences [27]. Sequencing readings were aligned with the human reference genome (hg38) using the Burrows-Wheeler Aligner (BWA) software package with the default parameters [28], and we removed the reads that were mapped in multiple genome positions. Then we assessed the quality of the map using SAMtools flagstat [29]; and All the genome sites for somatic variants were called by using VarScan [30]software with patameters of base quality higher than 30 and supporting reads ≥ 200 (Supplementary Figure1).

Calling of SNV from the whole-exome sequence
After mapping all the readings to the human reference genome(hg38) using (BWA) [28], Picard 1.67 was used to mark the duplicate readings that re-aligned around the known indels. We performed the base quality recalibration by using GATK version 3.7 [31]. Somatic mutations were called using Mutect2 when each of the following criteria was met: first, the frequency of allele variants was 1% in tumor samples and 1% in standard samples; second, both standard and tumor samples had sequencing coverage of 200; third, the alternative readings in tumor samples were10; fourth, the corrected p-value was less than 0.05. SNVs have been annotated by using ANNOVAR in multiple databases [32] and further filtered with population frequency in ExAC,1000 Genomes,dbSNP138,clinvar_20170130 and avsnp147.

Allele Frequency Deviation (AFD)
Allele frequency was calculated for 54 samples and WBC was used as a control to avoid the factors that could affect the AF at site of tumor sample. As displayed in ρk equal points number in bin k/all point, point density By using density, we can avoid outlier and noise that would be given less weight due to their low density, in this case the AFD can reflect patient clinical situation.
Finally, the AFD of a patient was calculated by : Where di represent distance values of all points in a grid,  i represent density of grid, n represent the total number of grid.

Tumor Mutation Burden (TMB) calculation
The number of nonsynonymous single-nucleotide variants, insertion or deletion variants is known as the burden of tumor mutation [33]. It is known that WES is the gold and standerd method to determine TMB [34], it is used to calculate TMB from tumor samples and their matched WBC normal samples. We used quantile method based on measurements of TMB to divided patients into quantiles [35].

Variant allele frequencies changes (dVAF) calculation
We calculated variant allele frequencies(VAF) of somatic mutations for all genes in WES for tumor samples and its corresponding WBC in our cohort consisting of 54 patients. Non-synonymous and synonymous variants were included in this calculation, synonymous mutations were considerd purposefully to reduce sampling noise [36].
Only variants observed in normal samples were used for the tumor mean VAF calculation. The change in mean VAF(dVAF) was calculated as: The association between the AFD and dVAF was tested. The statistical analysis of dVAF was performed to assess its ability to predict the progression of overall survival.

RNA.seq analysis
For Gene expression analysis of lung cancer RNA-seq data and TCGA dataset, the reads were mapped against the human genome(hg38) using STAR2 software [37].
The mapped reads with quality more than 10 were selected using Samtools. The read counts per gene were defined using featurecount [38] as the reference transcriptome.
Differential expression analysis was performed using edgeR [39], comparing tumor samples to their matched normal samples. The selected genes are the ones statisticall significant differentially expressed between tumor and normal samples and their FDR <0.05 and abs(logFC) > 1(Supplementry Figure 3).

Construction of a prognostic gene signature
We first determined the candidate prognostic genes which are significantly associated with OS, and select the common ones for constructing the gene signature model. An unvariate Cox proportional hazard regression analysis as well as lasso regression were implemented with each gene to identify the relationship of genes with OS in the data [40]. P-value less than 0.05 was used as a cutoff to define and select the candidate genes related to patients survival. Finally, a multi Cox proportional hazard regression analysis was carried out to select the final list of genes-related to OS and then create the model of prognostic signature. The hazard ratio from the analysis of multivariate cox regression was used to assess the protective genes(HR < 1) and risky genes(HR > 1). The risk scoring for each patient was then estimated using the following equation to calculate the expression values pertaining to the selected genes weighted by regression coefficients in multivariate cox regression analysis.
Where n is the number of selected prognostic genes, xp i  is the expression value of the prognostic gene i and C HR i represents the estimated regression coefficient for the corresponding gene i in the multivariate cox regression analysis. Subsequently, the median prognostic score was used to differentiate between the high-risk and low-risk groups. The patients with lower risk than median value were assigned to a low-risk group while the others were assigned to a high-risk group. The prognostic performance of the prognostic score model was measured by using the ROC curve by comparing the area under the respective receiver operating characteristic curve.
Finally, the prognostic score model was examined to check its association with the survival of the NSCLC patient.

Statistical analysis
The Spearman correlation test was analyzed among the variables.

Patients characteristic
The  Overall, these findings indicate that the AFD values are significantly different and independent from the other two factors.

Allele frequency deviation shows an active power to predict patient outcomes
To evaluate the sensitivity and specificity of AFD, TMB and dVAF for OS predicting in NSCLC patients, we implemented a time-dependent ROC curve.
The TMB and AFD significantly achieved almost same AUC values 0.721 and 0.713 respectively (Figure 1. a-right & b-right panel). In contrast, the AUC for dVAF was not high(0.56), which means that the dVAF remains a challenge, this creates the need for more verifications of this factor through future investigation.
In particular, AUC values of more than 0.7 for AFD in these data demonstrate a high OS predictive performance, and AUC values of less than 0.6 for dVAF show poor OS predictive performance (Figure 1.c-right panel). These results demonstrated that AFD has the power to predict the overall survival which is reflected by the AUC performance.

Overall survival
According respectively (Figure 1 a-left panel). The one-sided stratified log-rank P-value was 0.0064, indicating a significant difference between the two groups as well as an increase in genetic variants in high-value AFD groups with any increase in the AFD value in patients; therefore, the risk of death in this group of patients increased. In contrast, in the group of AFD low-value, the decrease of AFD values reflects the low genetic variants, so the death risk observed was lower than in the other group. Besides, the median of overall survival could not be estimated in both groups of AFD.
On the other hand, the Kaplan-Meier curve for TMB showed that high-level Patients had significantly shorter overall survival than the low-level patients, with 35 Notably, the one-sided stratified long-range P-value was 0.03, which indicates the difference between the two groups in regards to overall survival. Similarely, in the AFD, the median overall survival of the high-level group and low-level group could not be estimated (Figure 1. b-left panel).
We also performed the Kaplan-Meier estimation for the dVAF. Our finding showed that there is no statistical significance between the high-level and low-level groups of dVAF. The p-value was 0.17 due to the one-side stratified long-rank estimation (Figure 1. c-right panel). The total number of patients in every group was the same while the number of patients in the case of AFD groups was 14 and 40 patients for the high and low-level groups respectively. Moreover, the Kaplan-Meier estimation revealed that patients in low and high-level of dVAF groups have the overall survival at 13 months 92.5%(95% CI, 84.7 to 100) and 78.6% (95% confidence interval, 59.8 to 100) respectively, while the overall survival at 35 months was 84.4%(95% CI,73.6 to 69.7) for the low-level of dVAF group and in the high-level group, there is no event at 35 months, therefore, the overall survival could not be estimated( Table 2).

The Allele frequency deviation is an independent prognostic factor.
Herein, we conducted univariate and multivariate cox regression models in the NSCLC data. The AFD, TMB, and dVAF with other clinicopathological factors, including gender, smoking, age, pT, and tumor-size were used as covariates.  Table 3).

Identification of gene sets associated with overall patient survival
For NSCLC training set data, we used the KM analysis to establish the association between the gene expression and overall survival. We identified 409 genes associated with overall survival, we downloaded the TCGA data to screen and confirm the genes that really associated with survival in NSCLC, we identified 1177 genes. By overlapping the two datasets, 149 genes log-rank P value < 0.01 were identified to be associated with NSCLC survival. Of those, 31 genes have not been reported in NSCLC patients, which were used to conduct the next analyses to develop a prognostic signature model (Supplementary Figure 3).

Identification of a 7-gene prognostic signature
To identify NSCLC gene signature association with OS, we selected unreported 31 prognosis-related genes which associated with survival, followed by univariate Cox regression and lasso regression analysis for further selection, we identified 24 genes then 11 genes from both analysis respectively. Out of these genes, 7 genes were finally identified and established using a multivariate cox regression analysis to participate in overall survival, which was used to construct a prognostic model, and a multivariate cox regression analysis was conducted using a stepwise regression method (Supplementary Figure 3). The formula(1) is as follows: The information related to 7 genes is shown in the Table 4. Finally, a set of 7 genes including(n=3) the risky gene(HR>1) and(n=4) the protective genes(HR<1) were examined. Table 4. displayed the prognostic correlation of 7 genes with the NSCLC patients survival.

The validation of 7-gene prognostic signature
Based on the gene expression as well as regression coefficients of the 7 genes from the multivariate cox analysis, we built a prognostic model for predicting the prognosis using the risk score approach. A risk score for each patient was given in the prognostic model. The median risk score of(0.7334 and 0.9367) were used as the cutoff points to classify the patients into high-risk and low-risk groups in the training and LUAD-TCGA validation groups respectively. (Figure 2. B&E) shows the predictive power of overall survival through a 7-gene signature for patients. (Figure 2 C&F) showed the distribution of the gene risk score, the level of gene expression, and he survival status of patients in both data.
Patients who belong to the high-risk group were found to have a significantly shorter OS than patients belonging to the low-risk group, as shown in Kaplan-Meier curves, with 29.4% higher risk and 3.9% lower risk of death for high and low risk groups respectively(HR =1.042 , 95% CI, 1.02 to 1.06, P= 0.0002 )( Table 3 & Table   5). The P-value of one-side stratified log-rank was 0.00037, confirming a significant difference between the high and low-risk groups, therefore, the clinical outcomes of patients in the low-risk group are found to be better than those in the high-risk group (Figure 2-A). The overall survival at 13 months was 98%(95% CI, 94.2 to 1) and 84.3%(95% CI 74.9 to 94.9) in the low and high-risk groups respectively; 68.6%( 95% CI, 56.4 to 83.5) in the high-risk group at 31 months ( Table 5).
For the TCGA validation group, Kaplan-Meier curves showed that overall survival was significantly longer in the low-risk group compared to the high-risk group, with 23.8% lower risk and 47.9% higher risk of death in the low-risk and high-risk groups, respectively (HR= 2.01 , 95% CI, 1.57 to 2.557, P <0.001 ) ( Table   5). The single-sided stratified log-rank p-value was 0.0001, indicating the difference between the two groups (Figure 2-D) (Figure 2-B&E), suggesting a strong OS prediction efficiency.

Clinical independence of the 7-gene signature model
To evaluate the contribution of the 7-gene signature as an independent prognostic biomarker in the NSCLC training data and LUAD TCGA data set, univariate and multivariate Cox regression models were implemented. The 7-gene signature and other clinicopathological factors, including gender, age, stage, tumor size, smoking were included as covariates. Table 3 shows the correlation between the OS and these factors. It was found out through univariate regression analysis that risk score, stage and tumor size were significantly associated with patients survival in NSCLC training set (Figure 3. A); risk score, T, N, M and stage were identified to have significantly correlation with OS of LUAD-TCGA validation set (Figure 3.B) and (Table 3).
Interestingly, the corresponding multivariate cox regression analysis revealed that  Table 3). These results indicate that our prognostic model of 7-gene signature is a highly prognostic independent biomarker and it presents independent predictive performance through clinical application.

Discussion
When considering prognosis, NSCLC is believed to be an extremely heterogeneous disease where survival time among patients differs based on their pathological stages.
Traditional clinicalopathological variables like TNM level, tumor size, sex, age, as well as tumor factors such as cell differentiation, vascular invasion, and vascularity have been used in a broad framework to predict patient outcomes for diagnosis and treatment of NSCLC patients at early stages. Predicting outcomes was found to be insufficient due to the difference in effectiveness from among treatment strategy [41][42][43][44][45]. Consequently, an inspection of molecular prognostic markers that reliably represent the biological traits of tumors should be crucial for NSCLC patients treatment as well as for individualized prevention.
In the current study, we first developed a new algorithm to determine the genetic Previous studies have shown that TMB is significantly correlated with the Immune Checkpoint Inhibitor (ICI) such as PD-L1 and PD-1, as well as other biomarkers like EGFR and TP53 [46][47][48]. In our research, we evaluated the relationship between AFD and TMB as well as dVAF and found that there was no correlation between AFD and the other two factors. Further, the AUC of the prediction for patients survival in both AFD and TMB was high and almost the same while AUC in dVAF was lower, suggesting that AFD had a substantial capacity not less than the capacity of TMB to predict OS. In addition, these results are consistent with our findings in the Kaplan-Meiren analysis of NSCLC patients, where AFD showed more significance than TMB in OS prediction and patients were also defined into high risk group and low-risk group, the patients with high AFD value had shorter OS compared to the low AFD group. On the contrary, in both univariate and multivariate analysis, TMB tended to be a non-independent prognostic factor for predicting NSCLC survival, and there was no significant association between TMB and OS, which is consistent with previous studies [49,50]. Interestingly, AFD displayed a prognostic ability in both univariate and multivariate cox regression analysis, and emerged as an independent prognostic factor.
A number of studies have reported that tumor size is a prognostic factor used to predict patient progression and outcomes [51]. A Previous study related to AFD demonstrated the effectiveness of AFD in predicting the benefit and response of cervical cancer patients to treatment and the predicted evidence of metastases was better than tumor size [27]. AFD performed in our study was independent of the tumor size, patients with high AFD had worse progression compared to patients with low AFD values. As a result, we concluded that AFD has the potential ability to predict patient survival, consequently suggesting the use of AFD for clinical purpose instead of tumor size due to its accuracy. In contrast to our finding, Rajiv Raja at el. [23] has reported a close correlation of dVAF with clinical outcomes in NSCLC and UC.
Our result has shown that AFD is strongly correlated with patients survival better than dVAF, where dVAF has not shown statistically significant predictability to predict NSCLC patient's survival.
Recently, molecular biomarkers and gene signatures occupied a great deal of interest from researchers and are used in clinical practice for many aspects of cancer including tumorigenesis, progression and prognosis [52]. Gene signatures [53] as well as TMB and bTMB [54,55]  The seven genes in our signature consist of UCN2, RIMS2 and CAVIN2 as risk factors, and GRIA1, PKHD1L1, PGM5 and CLIC6 as a protective factors. It has been reported that CLIC6 is a member of the intracellular chloride channels consisting one of the dopamine receptor-mediated signaling pathways and has changed its expression in breast cancer [59,60]. And there are no previous reports related to the prognosis of patients' cancer outcomes. Chen Zheng et al. [61] has been reported that PKHD1L1 may be a PTC-associated tumor suppressor gene and maybe a potential molecular biomarker useful as a therapeutic target in the coming years. PGM5 has been reported as a diagnostic and prognostic biomarker independently associated with the survival of patients with liver cancer [62] and colorectal cancer [63]. Sloane K Tilley, et al. [64] reported that increased expression and hypermethylation of GRIA1 was correlated with survival patients with Basal-Like Bladder Cancer and was used as a Prognostic Biomarker. Another report for Guodong Yang, et al. [65] showed that GRIA1 is one of the top 10 target genes in the protein-protein interaction network present in the five-miRNA signature model used as a novel prognosis biomarker and therapeutic target for CRC patients. Silvia Codenotti,et al. [66] reported that Cavin2 is a useful marker for discriminating the degree of differentiation in liposarcoma(LPS) tumors.
Another study conducted by Bayader Annabi, et al. [67] highlights the role of cavin2 in the regulation of each inflammatory and angiogenic for TNF-activated MSC.
Where there are no previous reports related to the prognosis of cancer outcomes in patients. Stephane Esnault, et al. [68] reported that ucn2 had the downstream function of inflammation, tissue remodeling, and lipid synthesis in human lung fibroblasts(HLFs). Where no previous survival prediction reports have been reported for cancer patients. RIMS2 has been reported to be mutated in melanoma [69] and there are no other reports related to the prediction of cancer patient's outcomes.
7-gene signature and AFD are still a new biomarker and factor respectively and has not yet been used as a prognostic marker for the prediction of clinical outcomes in lung cancer or any other type of cancer, moreover, the 7 genes are novel genes in lung cancer have not been reported . That is said our study is the first to demonstrate that 7-gene signature is an independent prognostic biomarker and AFD is also an independent prognostic factor in NSCLC, this is due to that AFD is more comprehensive to identify not only the selected mutations but also any genetic alteration at every single site of the

Ethics declarations Ethical approval and consent to participate
The data analysis process was conducted according to the ethical standerds (Fudan University Shanghai Cancer Center Institutional Review Board No. 090977-1).
Informed consents of patients or their relatives were obtained while donating a samples to the tissue bank of Fudan University Shanghai Cancer Center [70].

Consent for publication
Not applicable.