A novel prognostic predictor for clear cell renal cell carcinoma constructed by weighted gene co-expression correlation network analysis

Clear cell renal cell carcinoma (ccRCC) accounts the largest proportion of all types of renal tumor, it is highly invasive leading to shorter survival time of patients suffer from tumor. This study aims at constructing a new prognostic signature to provide a new reference for clinic work to adjust treatment protocols. Material and Methods Comprehensive gene expression profiles of ccRCC were downloaded from The Cancer Genome Atlas (TCGA). Weighted gene co-expression network analysis (WGCNA), a genome-wide profiling analysis, was applied to investigate key genes associated with clinical traits. Cox regression analysis was conducted to screen prognostic gene in ccRCC. LASSO analysis was used to enhance the robust of correlated genes. Then, a prognostic signature was developed as an independent factor to predict outcomes of ccRCC patients. The co-expression network was constructed by WGCNA, 26 modules were identified in total, including a module (grey) showed no significance. By associating the features with clinic features, we selected three modules (pink, purple and turquoise) significantly associated with the overall survival(OS) of ccRCC patients. Univariate Cox analysis for genes from each module separately and then we used LASSO regression to screen the significant genes for selecting potential prognosis element-genes. 8 genes (CAPRIN2, IFI44, LTV1, ZNF320, MTHFR, XPOT, BCL3 and PAX2) finally reserved by multivariate COX regression. Score W was defined as a prognosis modelestablished by the final potential prognosis 8 genes. Multivariate regression suggested the Score W was an independent prognosis predicting element. ROC curve method was performed and the AUC value of Score W showed satisfied and attractive clinical prognosis importance. Our findings provided the framework of co-expression gene modules of ccRCC and identified valuable prognostic method might be detection for ccRCC patients.


Introduction
Renal cell carcinoma (RCC), derived from different types of renal cell, were classified to clear cell renal carcinoma, papillary carcinoma and chromophobe renal carcinoma [1,2] . Clear cell renal cell carcinoma (ccRCC) is the main histologic subtype of renal cancer, accounts for approximately 70% of RCC cases [3,4] . ccRCC can be invasive, nearly 30% patients present metastases when the first time of diagnosis, and quite a few, about 30%, patients develop metastases eventually [5,6] . Metastatic ccRCC patients exhibit dismal 8% five-year overall survival (OS)for most ccRCCs being radio-chemo-resistant [5,7] . Targeting therapy represents the best option to cancer treatment except for surgery [8,9] . Because of the radio-chemo-resistant, the prognosis of the ccRCC can also be unpredictable. So novel molecular biomarkers and new prognosis predicting method are urgently needed.
WGCNA is defined as an effective method frequently applied to investigate the complicate relationships between phenotypes and gene expression profiles [10,11] . The WGCNA can group gene expression data into a model or network, known as modules, similar behaving genes with a common biological relationship or function were included in one module [12] . WGCNA has been successfully used to study various biological processes and was demonstrated to be quite valuable for exploring potential biomarkers and therapeutic targets [13,14] .
Topological overlap measure (TOM) was used to be a proximity measure to establish network modules that combine the adjacency of two genes and the connection strengths with which these two genes interact with other neighbor genes [15,16] . The correlation between module eigengenes, which are defined as principal components of the expression profiles, and all clinical traits is calculated, it helps in the process of comparing differentially expressed genes and visualizing the interactions between different co-expression modules.
In this study, we aimed to provide new insight of therapeutic targets and provide a better prognosis method for ccRCC patients. Expression data and clinical profiles of ccRCC samples were downloaded from TCGA, WGCNA was used to identity key gene has relationship with clinical traits such as age, gender, survival time, stage. Then, we select three modules which contain multiple genes that most correlate with clinical traits. Functional and pathway enrichment was used to analysis genes selected from WGCNA. LASSO regression, univariate and multivariable Cox regression were used to select the potential prognostic genes to establish a prognosis signature for ccRCC.

Data collection
RNA sequence data and clinic information of ccRCC patients were collected from The Cancer Genome Atlas (TCGA) database (https://cancergenome.nih.gov/). We filter out duplicate samples and unanalyzable samples, 507 samples remained in total.

Construction of co-expression modules of ccRCC
We used R package "WGCNA" to establish gene co-expression network in ccRCC. The soft-thresholding power β was calculated in the construction of each module using the pickSoftThreshold function of the WGCNA. Power value was screened using gradient algorithm to test the independence and the power value of different modules ranges from 1 to 20. After determining suitable power value when the index value for the reference dataset exceeded 0.8, the gene modules were constructed. The minimum number at 40 was set for each module, and the heatmap tool package was used to analyze and visualized the strength of correlation between each module. In order to generate dendrogram plot, a cut-line (0.25) was chosen.

Construction of module-clinic trait relationships
Modules form WGCNA were identified according to gene expression similarities in samples In order to acquire the module we are interested in, the significance between clinical traits (survival time, status, age, gender, historical stage) and each module was calculated. The gene modules significantly correlated with "futime" namely survival time in TCGA dataset were retained for next step.

Functional enrichment analysis of co-expression modules
The intra-modular connectivity indicates the gene-fit of modules. The genes inside the same co-expression module have high connectivity. The genes within one module may have similar function. To identify the biological function of the significant modules, the corresponding gene annotation information was mapped to the database for Annotation, Visualization, and Integrated Discovery. The "ClusterProfiler" function of R was used to conduct Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis for each module. P<0.05 after correction was chosen as the thresholds. The R package "GOplot" were used to visualize the gene function identified by GO and KEGG analysis.

Univariate Cox analysis and LASSO analysis in certain modules.
The genes in the module whose P-value between the gene module and "futime" was less than 0.01 were selected to receive univariate Cox analysis, which removes genes that have no significant correlation with OS. The results of univariate analysis was included in least absolute shrinkage and selection operator (LASSO) regression to eliminate the effects of collinearity. The LASSO scales the size of the parameter vector, β, and that unimportant variables (variables whose β coefficients are close to zero) are deleted from the model. The LASSO is usually applied to reduce the false discovery rate in high-dimensional Cox regression models.

Construction of the Prognostic Signature
Multivariable Cox analysis was performed to develop a predictive model of OS, basing on the prognostic genes from LASSO regression analysis of certain modules. Combine the expression values of the prognostic genes weight by their regression coefficients to establish a prognosis predicting risk score for ccRCC patients as follows: n was the number of prognostic genes, exp i was the expression of gene i, β i the regression coefficient of gene i in the multivariable analysis. Median score value was defined as a cutoff value, classifying ccRCC patients into low risk and high risk groups.
Combine the Score W with clinic factors by using univariate and multivariable COX regression to identify whether it can be the independent factor to predict survival time.

Statistical Analysis
Score W was included in multivariable Cox regression together with other clinic information to evaluate whether it can be an independent feature for survival prediction. The Kaplan-Meier curve was used to estimate the differences in survival time of low-and high-risk ccRCC patients. The R package "survivalROC" was used to estimate the prognostic signature. Statistical analysis was conducted by SPSS 25.0 (Chicago, IL, USA) and R 3.6.1(https://www.r-project.org/). Independent sample t-tests were used to analysis correlation between clinical traits and predicting factors. All statistical significance was defined as a P-value less than 0.05 unless otherwise stated.

Co-expression gene modules form WGCNA
R package of "WGCNA" was used to construct co-expression network. All genes were included in cluster analysis to generate gene modules. We chose the power of β =6 (scale free R 2 =0.92) as the soft thresholding to guarantee scale-free network (Fig.1A). The result of hierarchical clustering analysis is presented in Fig.1B. We merged some modules whose height is lower than 0.25, because these modules are not considered to be capable of acting as a standalone module. Finally, we identified 25 modules ranging from 845 genes (Turquoise) to 40 genes (White) in size ( Fig.2A). Meanwhile, genes not classified into any one module were included in an insignificant module (Grey), and it not belongs to any other 25 modules. Interactions of the 25 co-expression modules were analyzed (Fig.2B), light color represents low overlap while darker red color indicates higher overlap.

Gene co-expression modules associated with clinic traits
Download clinic traits dataset from TCGA database. Associations between clinic traits we were interested in and modules were identity based on correlation between modules eigengene and the clinic traits (Fig.2C). We investigated correlation coefficient and P value between each module and each clinical feature to select modules we are interested in. As we chose P<0.01 as a significant threshold, results showed that there are 3 modules (pink, purple, turquoise) were positively correlated to both survival time and survival state at the same time. Furthermore, the dendrogram and heatmap was used to quantify groups of correlated eigengenes (Fig.3).

Functional enrichment analysis
GO enrichment analysis was conducted in 3 positively correlated modules (Fig.4). Genes in pink module mainly enriched in regulation of mRNA metabolic process, RNA splicing, nuclear speck, spliceosomal complex, catalytic step 2 spliceosome and pre-mRNA binding. Genes in purple module mainly enriched in nuclear chromosome segregation, DNA replication, chromosome segregation, mitotic nuclear division, condensed chromosome, chromosomal region, chromosome, centromeric region and DNA-dependent ATPase activity. Genes in turquoise module mainly enriched in membrane docking, organelle localization by membrane tethering, RNA splicing, golgi vesicle transport, nuclear speck, SNARE complex, SNAP receptor activity and spliceosomal complex.

Univariate Cox proportional hazards regression and LASSO regression analysis
For next step to establish a prognosis model to predict the outcome of ccRCC patients, the genes inside these 3 modules were included in univariate Cox proportional hazards regression. In order to improve the robustness and eliminate collinearity between genes as much as possible, we select LASSO regression method to process OS correlated genes data of three modules. When the partial likelihood deviance reached its minimum, remove the gene if its coefficient is 0. In pink module, the log (Lambda) reached its minimum of -3.45 (Fig.5A). In purple module, the log (Lambda) reached its minimum of -3.05 (Fig.5B). In turquoise module, the log (Lambda) reached its minimum of -3.175 (Fig.5C).

Construction of signature
The application of LASSO reduces the dimension of multivariable Cox regression analysis to an acceptable level. We merged all the correlated genes of 3 modules, and put them into multivariable Cox regression analysis. The regression coefficients from the multivariable Cox analysis were used to construct Score W and develop a prognostic model to predict OS. 8 genes (CAPRIN2, coefficient =1.16E-05; IFI44, coefficient = 1.33E-06; LTV1, coefficient =3.10E-06; ZNF320, coefficient =-3.85E-06; MTHFR, coefficient =-5.75E-06; XPOT, coefficient =2.02E-06; BCL3, coefficient =8.62E-07; PAX2, coefficient =7.89E-07) were included in the signature. The information of 8 genes was showed in Table 1. The K-M survival curves of 8 genes were plotted in Fig.6, the high expression of 6 genes (CAPRIN2, IFI44, LTV1, MTHFR, XPOT and BCL3) can significantly influence OS of ccRCC patients. We used the median score value as the cutoff value to classify patients into low or high risk. The Score W distribution, gene expression, survival status of each patient was ranked by the Score W values of prognostic signature (Fig.7A). K-M curve was plotted to indicate the correlation between Score W and survival time of patients with ccRCC(P<0.001, Fig.7B). The time-dependent ROC curve suggest a considerable effective performance of the signature for OS prediction (P<0.05, AUC=0.796, Fig.7C). Correlation between Score W and clinical traits was showed in Fig.7D. Results indicated that higher Score W often appeared in the advanced cancer stage, higher Score W means the worse status of patients.
We bring Score W as one element which was similar to other clinic traits (age, gender, histological grade, and pathological stage) in univariable and multivariable Cox analysis. The results showed that predictive ability of Score W has significance in univariable and multivariable Cox analysis for OS of ccRCC patients(  Fig.7F]. The details of each element by time-dependent ROC analysis were presented in Fig.7G. Only 6 genes (CAPRIN2, IFI44, LTV1, MTHFR, XPOT and BCL3) were significant with clinical pathological parameters (Fig.8). Results indicated CAPRIN2, IFI44, LTV1, XPOT and BCL3 were the risky genes for procession of ccRCC while MTHFR was a protective gene.

Discussion
In recent decades, ccRCC accounts for about 3% of all malignant tumors and 85% of all primary kidney cancer, reaching the most proportion in all types of renal cancer [17,18] . It is the third most common cancer in the urinary system, and it has the maximum fatality rate. ccRCC is highly invasive, nearly 60% patients present metastasis at the procedure of diagnosis and therapy [19] . Meanwhile, metastatic cancer is not sensitive to chemotherapy and radiotherapy [20] . So, 5-year survival rate of ccRCC becomes negative and is of great significance to the therapeutic effect of tumor. As we already know, histologic grade, pathological stage, local scope of the tumor, presence of regional nodal metastases showed significant effects on 5-year survival rate and occurrence of metastatic cancer [21] . However, these factors were not able to be detected early for ccRCC patients, resulting in 20-30% of the ccRCC patients were diagnosed late in advanced and metastasis cancer. Investigating potential molecular markers and construct an effective prognostic module is necessary for the diagnosis, treatment and prognosis of patients with ccRCC.
WGCNA has been widely used to investigate potential biomarkers of various diseases. Previous studies only focus on other clinic traits such as Clarke et al. associated gene module with histologic grade [22] , Tian et al.
preferred the relationship of metastasis [23] and He et al. showed interest in pathological stage, few studies combine the module eigengene with survival time [24] . In this study, WGCNA was used to farmwork gene coexpression network of ccRCC gene expression profiles downloaded from TCGA, and then calculated the stability of gene module. The relationship between stable modules and clinical features were assessed after gene modules were identified. Three modules (pink, purple and turquoise) were selected out for the significant correlation with survival time. The results of functional enrichment analysis indicated that the module pink was mainly enriched in the regulation of RNA metabolic process and different way of RNA splicing. The module purple was mainly associated with chromosome and DNA replication. The module turquoise was more enrich in the function of membrane of organelle in cell.
Previous studies have identified many molecular signatures that classify patients into different prognostic groups. However, those studies were lack of stability. In this study, LASSO regression was used to enhance the robustness of prognosis model. Of note, the result of the AUC analyses in our study showed that the AUC values of the combination of 8 genes were more 0.796 in the OS, suggesting that the Score W could be regarded as a novel factor that may serve as a prognosis indicator for ccRCC patients All the elements in our model of potential prognosis 8 genes (CAPRIN2, IFI44, LTV1, ZNF320, MTHFR, XPOT, BCL3 and PAX2) were widely explored in previous literature. Jia et al. said that CAPRIN2 activate the Wnt pathway to influence cancer cell growth [25] . Huang et al. demonstrated that IFI44 is a novel tumor suppressor, and affect cancer stemness, metastasis, and drug resistance via regulating met/Src signaling pathway [26] . Meanwhile, Puig-Butille et al. showed the over-expression of IFI44 influence the procession of cancers [27] . Ghalei et al. showed LTV1 is necessary for ribosome assembly and cell growth [28] . In Yu's study, ZNF was identified as a key gene in microsatellite instability in certain cancer [29] . Liew et al. focuses on the recent evidence-based reports on the associations of the MTHFR C677T polymorphism and the various diseases globally, showed that MTHFR is associated with various diseases with the epidemiology [30] . Gonzales et al. present the most comprehensive analyses to date of MTHFR polymorphism-mutations and breast cancer risk [31] . Lin et al. demonstrate that XPOT Down-regulation of XPOT significantly inhibited tumor proliferation and invasion in vitro and vivo, and XPOT may affect tumor progression through cell cycle and ubiquitin-mediated proteolysis [32] . Wang et al. links Akt and MAPK pathways to NF-κB through Bcl3 and provides mechanistic insight into how Bcl3 functions as an oncoprotein through collaboration with IKK1/2, Akt, and Erk2 [33] . Boudjadi et al. investigated that PAX3 contributes to the pathogenesis of the soft tissue cancers alveolar rhabdomyosarcoma and biphenotypic sinonasal sarcoma [34] . However, CAPRIN2, IFI44, LTV1, ZNF320, XPOT, are rarely explored, therefore, we believe these genes mentioned in the context have more experimental significance in further studies.
In conclusion, by combining molecular signature and clinical characteristics, we constructed a novel risk score model, Score W , for ccRCC patients' survival. The risk score showed significantly correlation with survival time and acts as an independent predicting factor. We provided the framework of co-expression gene modules of ccRCC and identified valuable prognostic method might be detection for ccRCC patients. The Score W can be used as a new prognostic method which may conduce to treatment decisions for individuals. However, further verification in a large clinic samples are needed and that is what we plan to do next step and hold promise for clinical practice in the near future.

Ethics approval and consent to participate
The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study complied with the Declaration of Helsinki and was approved by Ethics Committees of The Fourth Hospital of Hebei Medical University.

Conflicts of interest
All the authors declare that they have no competing interests.

Consent for publication
The authors have no interests ethical, legal and financial conflicts related to the article. All authors read and approved the manuscript to publish.

Funding
There is no Funding.  Figure 1 Network topology for different soft-thresholding powers (A) Analysis of the scale-free fit index for various softthresholding powers (β) and the mean connectivity for various soft-thresholding powers. The approximate scalefree topology can be attained at the soft-thresholding power of 6. (B) Dendrogram of consensus module eigengenes obtained by WGCNA on the consensus correlation. The red line is the merging threshold, and groups of eigengenes below the threshold represent modules whose expressions profiles should be merged due to their similarity.  The eigengene dendrogram and heatmap identify groups of correlated eigengenes.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download. Dataandmaterials.rar