Combination the Gene Expression of Adjacent Normal and Tumor Makes more Robust Gene Signature as Cancer Prognosis Biomarker

Purpose: We tried to explore new gene signature via the combination of tumor-derived expression prole and the adjacent normal-derived expression prole to nd more robust cancer biomarker. Methods: Log2 transformed ratio of tumor tissue and the adjacent normal tissue (Log2TN) expression, tumor-derived expression, and normal-derived expression were used to do univariate Cox regression in The Cancer Genome Atlas (TCGA) lung squamous cell carcinoma (LUSC) respectively. Then, we used factor analysis and least absolute shrinkage and selection operator Cox (LASSO-Cox) to select gene signature in TCGA LUSC for Log2TN, tumor, and adjacent normal respectively. Results: By comparing Log2TN with tumor and adjacent normal in LUSC, we found that genes derived from Log2TN show more robust (p = 0.006 and p = 0.001) and have lower p-values (p < 0.001). Gene signature selected from Log2TN shows the best generalization in the three GEO datasets even though only tumor-derived expression proles were available in the three datasets. Enrichment analysis showed that the tumor cells mainly focus on proliferation with losing functional of metabolism. Conclusions: These results indicate that (1) Log2TN could get more robust genes and gene signature than tumor-derived expression proles used traditionally; (2) the adjacent-normal tissue may also play an important role in the progress and outcome of the tumor. Implications By combined of tumor-derived expression prole and the adjacent normal-derived expression prole, we could nd more robust gene signature than traditionally method. Using these robust gene signatures, robust cancer biomarkers could be constructed and will do great help to improve cancer prognosis.


Introduction
Cancer is the second leading cause of disease-related death worldwide. In 2018, about 9.6 million people died because of cancer according to the GLOBOCAN 2018 estimates of cancer incidence and mortality produced by the International Agency for Research on Cancer 1 . Lung cancer is the most commonly diagnosed cancer and leading cause of cancer death 1, 2 . One reason for the high mortality of cancer is that when cancer is diagnosed, it has been the advanced stages 3,4 . Another reason is recurrence and metastasis after drug-treatment or surgery, especially in patients with advanced tumor stages [5][6][7] . As time goes on, nowadays, cancer is no longer a type of aged disease, people who get cancer become younger and younger 1 . Early diagnosis, choice of appropriate therapeutic strategies, and e cient monitoring can have a pivotal role in reducing cancer-related mortalities. Diagnosis biomarker and prognosis biomarker could help to diagnose cancer in the early stage and predict progression in the future, thus patients could get appropriate therapy in the beginning.
Nowadays, a lot of biomarkers were developed for types of cancer. Before the high-throughput sequencing developed, it was majorly single molecule-based biomarkers. For example, prostate-speci c antigen (PSA) has somewhat revolutionized the assessment of prostate cancer in 1987 8 . In 1998, it was found that %fPSA (free PSA) was an independent predictor of prostate cancer 9 . Prostate cancer antigen 3 (PCA3) was found in 1999 by Bussemakers et al and promoted the diagnosis of prostate cancer 10 . Then, it was found that PCA3 was also a prognosis biomarker for prostate cancer 11 . Glypican-1 (GPC-1) is a member of heparan sulfate proteoglycans. A lot of researches had found that GPC-1 is a prognosis biomarker in many cancers such as pancreatic cancer, colorectal cancer, and prostate cancer 12 . Human epidermal growth factor receptor 2 (HER2) is a diagnosis and prognosis biomarker for many cancers, especially for breast cancer 13 . With the development of high-throughput sequencing technology, multiple gene expression-based prognosis signatures were developed. MammaPrint was a 70-gene-based signature used for prediction recurrence risk of breast cancer 14 . MammaPrint is the rst successful prognostic signature that was marketed by Agendia (the Netherlands). It was developed in 2002 and approved by US food and drug administration (FDA) in 2007. PAM50 is another gene-based signature for prediction recurrence risk of breast cancer, also approved by the FDA 15 . Yang et al. developed a 28 hypoxia-related gene-based recurrence prognosis signature for prostate cancer 16 . In addition to these gene expression-based signatures, other signatures based on micro RNA (miRNA), long non-coding RNA (lncRNA), and gene methylation were also very popular. For example, a combination of miR-331 and miR-21 could be a diagnostic and prognostic signature for gastric cancer 17 . Three lncRNAs including AI364715, GACAT1, and GACAT2 could be constructed as a signature for diagnostic and prognostic of gastric cancer 18 . Deleted in Split hand/Split foot 1 (DSS1) promoter hypomethylation could predict poor prognosis in melanoma and squamous cell carcinoma patients 19 .
Signatures like described above are innumerable and couldn't be list here one by one. These genes (miRNA, lncRNA) expression-based prognosis signatures were developed based on gene (miRNA, lncRNA) expression in tumor tissue. This faced a big problem: there is no stander or baseline for expression. Different methods such as an array, sequencing, and Q-PCR get different expression values in different scales. Different instruments or operators also get different results. Another problem is that extrapolation and robustness of these signatures are not enough for widely using 20,21 . Besides, the combination of expression in tumor tissue and adjacent-normal tissue maybe could get better performance.
In this study, we explored these questions in lung squamous cell carcinoma (LUSC) on recurrence prognosis issue. Instead of using expression pro les in tumor tissue, we used the log2 transformed ratio of tumor tissue and the adjacent normal tissue (Log2TN) to explore the prognosis signature. For a single gene, compared with the tumor tissue expression pro le, Log2TN based was more signi cant. Log2TN based signature was also more signi cant. Furthermore, the signature explored from Log2TN was very robust in the test dataset even if using the tumor tissue expression pro le.

Materials And Method
Data collection: To investigate the Log2TN based recurrence-related prognosis signature in LUSC, the LUSC dataset of The Cancer Genome Atlas (TCGA) was downloaded from UCSC Cancer Browser (UCSC Xena; https://xenabrowser.net/datapages). Then we searched the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) for LUSC datasets that ful lled the following criteria: included samples were hybridized to the Affymetrix Human Genome U133 Plus 2.0 Array (GEO accession number GPL570) platforms; information on the recurrence event was available. The expression matrix was downloaded and the mRNA expression pro les were log2 transformed for further analysis.
Genes selection and Cox regression: Univariate Cox regression was used to investigate the recurrence prognosis related genes. Factor analysis (FA) and least absolute shrinkage and selection operator (LASSO) Cox were both used to reduce the dimensionality and to select the most signi cantly relapse-related genes to build a recurrence prognostic model using the multivariate Cox regression method. Survival rates were calculated by the Kaplan-Meier method, and the signi cance of differences between survival curves was determined using the log-rank test. Uni-and multivariate analyses were performed using Cox proportional hazard models. Cox regression and survival curves were performed through the R language "survival" package. Factor analysis was performed through the R language "psych" package. LASSO-Cox was performed through the R language "SIS" packages.
Gene set enrichment analysis: We used gene set enrichment analysis (GSEA) to investigate the potential mechanisms in the c2 (c2.cp.kegg.v7.0.symbols) of the molecular signatures database (MSigDB) using the JAVA program (http://software.broadinstitute.org/gsea/index.jsp). The pre-ranked method was used to perform GSEA analysis. The number of random sample permutations was set at 1000, and the gene set size was set from 5 to 500. The signi cance threshold was set at p<0.05.

Statistical analysis:
All statistical analyses including Fisher's exact test and student's t-test are based on R language 3.6.2 version and attached packages. R is a free software environment for statistical computing and graphics Genes explored via Log2TN are more robust and signi cant: In the TCGA LUSC dataset, we selected the 40 tumor-normal paired sample with recurrence event information and calculated Log2TN for every gene. Then univariate Cox regression analysis was performed for each gene using Log2TN, expression pro le of tumor tissue, and expression pro le of normal tissue within these 40 individuals respectively. For the three datasets from GEO, even if only the expression pro le of the tumor is available, we performed univariate Cox regression analysis for each gene using the tumor-derived expression pro le in each dataset. Signi cant genes were picked out using P < 0.05. Then we investigated the overlapping between the TCGA LUSC and the other three datasets. Signi cant genes in TCGA LUSC dataset in anyone of the three datasets of GEO are de ned as overlapped genes. The overlapping of signi cant genes between the TCGA LUSC dataset and the other three datasets are shown in Table 1.  Cox regression model based on the consistent genes obtained via Log2TN is robust: We acquired 35, 25, 94 consistent genes from tumor, normal and Log2TN respectively. The 40 tumornormal paired samples in the TCGA LUSC dataset were used in the following analysis. For each of them, we selected the top 25 signi cant genes for further analysis. In the following analysis, we ran the same process for tumor, normal, and Log2TN using the corresponding expression value, for example, when using the 25 genes from normal, we used the expression pro le from normal. First, factor analysis (FA) was performed to nd out the nonredundant genes. Eigenvalues above zero was used to con rm the number of factors. Then, we selected the genes with the absolute value of factor loading above 0.5. At last, the LASSO was used to identify the prognostic genes and built the Cox regression model. For Log2TN, 10 factors were retained. In the 10 factors, there were 22 genes with the absolute value of factor loading above 0.5. Then after the LASSO selection, 14 genes (ZNF275, PODXL2, SLCO4C1, POMZP3, ACAD11, MAPK4, BCAR3, DKKL1, FRK, TRPM3, NRP1, PAEP, KLHL13, HRSP12) remained to build the Cox regression model. As shown in Table 3  Then, we tested the selected genes in the three datasets from GEO for each of them. The selected genes were used to build the Cox regression model in each of the three datasets. The results were summarized in Table 3. Although the expression pro le of tumor tissue was used to build the Cox model in the three datasets from GEO (only expression pro le was obtained), we found that genes selected via Log2TN had the best performance. The Cox regression models based on these genes in the four datasets were all signi cant. In GSE8894 and GSE30129, Cox models based on these genes were the most signi cant. In GSE37745 and TCGA LUSC, the Cox models of normal based genes got the lowest p-value while the Log2TN based Cox models were the second. For each Cox model in each dataset, we calculated the prognostic index (PI) for each sample. We divided a dataset into two groups with the median of PI as the cutoff. Then, Kaplan-Meier curves were drawn and p-values were obtained by the log-rank test. The results of the three conditions in four datasets were shown in Fig. 1. The signi cant level of them was summarized in supplementary table 1. Except the TCGA LUSC, Log2TN based Cox models in the three datasets from GEO got the most difference between the two survival curves. It is unbelievable that tumorderived genes from TCGA LUSC got the worst performance in the three datasets of GEO while the expression pro le of the three datasets was tumor-derived. It seems that the adjacent-normal tissue of tumor plays an important role in the progress of tumor. Maybe even if it is adjacent-normal tissue, something has happened, which makes it not the same as the non-tumor normal tissue. Hence, Log2TN that calculated based on adjacent-normal and tumor could get the most robust genes.
In the previous analysis of this study, we used only the 40 tumor-normal paired samples with recurrence event information in TCGA LUSC because Log2TN needed paired samples to calculate the value. For comparable, these 40 tumor samples were used for tumor and normal too previously. To test the robustness of the selected genes via tumor, normal, and Log2TN, we built the Cox model with the 386 tumor samples that have recurrence information in the TCGA LUSC using tumor-derived expression pro le based on the selected genes for them respectively. For normal-derived genes, it was not signi cant (p = 0.461, likelihood ratio test). It is reasonable because genes were selected based on the expression pro le of normal while tested using tumor-derived expression pro le. However, compared to the tumor, Log2TN derived genes got a more signi cant Cox model (tumor: p = 0.0315, Log2TN: p = 0.0191, likelihood ratio test). For each of the three models, we calculated the PI and divided the 386 samples into two groups using median of PI. Then, we drew Kaplan-Meier curves and assessed the difference between two survival curves using log-rank test. As shown in Fig. 2, although it isn't signi cant in the Cox model using normal-derived genes, the survival curve is discriminating. The power of their distinguishing for survival is almost the same. Thus, not only the tumor tissue, but also the adjacent-normal tissue may in uence the outcome.
Gene set enrichment analysis of Log2TN genes: Log2TN derived genes show the best robust performance. Thus we investigated the biological functions of genes ranked according to Log2TN univariable Cox regression via gene set enrichment analysis of KEGG pathways. For each gene, we multiply the sign of coe cient by -log10 of the p-value as its prognosis score. For example, if the coe cient and p-value of the Log2TN univariable Cox regression for a gene is -0.35 and 0.01, the score for this gene − 1 * (-log(0.01, 10)) = -2. Then, we ranked genes with this score and performed GSEA analysis. As shown in Fig. 3, these positively correlated pathways include multiple biological pathways, such as cell cycle, DNA replication, homologous recombination, ribosome, primary immunode ciency, and p53 signaling pathway. The high-level Log2TN of genes in these pathways results in worse outcomes. These negatively correlated pathways mainly include metabolismrelated pathways such as complement and coagulation cascades, fatty acid metabolism, retinol metabolism, primary bile acid biosynthesis, tryptophan metabolism, drug metabolism cytochrome P450, valine leucine and isoleucine degradation, and glycine serine and threonine metabolism. It seems that the tumor cells mainly focus on proliferation with losing functional metabolism.

Discussion
With the development of high-throughput technology, tremendous data of biomedicine accumulated. It provides the opportunity to explore new biomarkers of diagnosis and prognosis. Biomarkers evolved from a single gene or few genes before to multiple genes now. Maybe it would develop towards entire omics or even the combination of multi-omics in the future with reducing cost. These genes that are aberrantly expressed in cancer tissue have attracted substantial interest in the construction of biomarkers using expression pro les of tumors 13,[22][23][24][25] . However, it assumed that all the difference between tumor and adjacent-normal comes from tumor tissue and the adjacent-normal tissue has nothing changed during the tumor development. If an aberrantly expressed gene mainly results from the change of adjacentnormal tissue, when constructed the biomarker using tumor-derived expression pro les, it may not achieve good performance. Furthermore, it is hard to get the absolute expression pro les in tumor tissue. Different methods and technicians could also bring in error. In the present study, we didn't use the aberrantly expressed genes in cancer. Instead of it, we used the log2-transformed ratio between tumor and adjacent-normal to do Cox regression and lter prognosis related genes in the TCGA LUSC. The signi cantly prognosis related genes ltered via Log2TN had more overlapped genes with the three GEO datasets, even if the tumor-derived expression pro les were used. In the TCGA LUSC, Cox regression pvalues via Log2TN were lower than the other two. Moreover, genes ltered via Log2TN show more robust with fewer inconsistent genes. These LASSO-Cox selected Genes that were used for the construction of the Cox regression model also showed better performance for Log2TN in the three GEO datasets. From our study, we could conclude that: (1) combination of the expression pro les of tumor and adjacentnormal could obtain more robust recurrence prognosis related genes; (2) the adjacent-normal tissue may also play an important role in the progress and outcome of the tumor. This nding opens a door to construction of a more robust prognostic strategy from the combination of the tumor and normal. We hope that in this way, constructed biomarkers could be used in any way, such as Q-PCR, array, and RNAseq in the future because what it needed is the relative value.
Nevertheless, there are inevitably some limitations in our study that should be acknowledged. First, there are only 40 tumor-normal paired samples in TCGA LUSC could be used in our study for analyzing the Log2TN. It may not provide enough power to support our conclusion. Second, all of the three GEO datasets provide only the tumor-derived expression pro le. Good veri cation couldn't be got via these three datasets. Third, although genes Log2TN show more robust, it isn't a huge gap between Log2TN and Declarations