Candidate Genes Identied by Constructing a Weighted Gene Co-Expression Network and Key Prognostic Factors of Pancreatic Adenocarcinoma

Background Pancreatic adenocarcinoma(PAAD) is a kind of terribly malignant tumor of the digestive tract with high mortality globally, however, the pathogenic molecular mechanisms of which remain unclear currently. Therefore, it is very important to explore the underlying mechanisms and screen for novel prognostic markers and treatment targets. We performed a weighted gene co-expression network analysis(WGCNA) to identify hub genes associated with clinical parameters of interest using The Cancer Genome Atlas(TCGA) data. In addition, Gene ontology(GO) enrichments and survival analyses were conducted. Methods The 14 datasets including gene expression proles and clinical phenotypes for PAAD patients in TCGA were analyzed by WGCNA in R. After constructing the scale-free network of gene co-expression, modules with clinical signicance were distinguished, and hub genes were validated, which also key genes regulated relevant traits. GO enrichments were performed on the modules. Furthermore, the signicance of hub genes was conrmed via survival analysis. We discovered 12 candidate genes in total which had signicant effects on the overall survival of PAAD patients. The expression of DUSP26, TCEAL2, CELF3, CDK5R2, NAP1L2, INA had strong and positive inuences on the pathological T stage, while the expression of CEP55, HJURP, KIF23, CDCA5, SAC3D1, ATP6L had great negative impacts on the histological stage. Meanwhile, we also found that the tumor primary location and age of onset were key factors affecting the prognosis of PAAD. These candidate genes probably play important roles in proliferation, differentiation and metastasis of PAAD cells, which may be acted as novel prognostic markers and effective therapeutic targets. Besides, most of them are rst reported in PAAD and deserved in-depth research. information was from the original le downloaded by Illumina. Phenotype data contained the pathologic TNM stage, survival information, histologic grade, tumor growth location, history of exposure to hazardous factors and so on, items of incomplete information were not included. some of which is the rst to be discovered. They probably serve as novel prognostic markers and effective therapeutic targets.


Introduction
released the latest estimates of new cancer cases and deaths in the United States. In 2019, PAAD would be expected to strike approximately 56,770 people and to kill 45,750 others, making it the second most frequently diagnosed cancer in the digestive tract just behind colon tumor and the second highest mortality rate among all types of tumors. In contrast to the steady increase in survival for most cancer types, improvements have been so slow for PAAD, in large part because greater than a half of cases are diagnosed at a advanced stage since its special anatomical structure and location, meanwhile, patients do not have unique symptoms (Modolell et al., 1999). In recent years, screening is mainly based on annual magnetic resonance imaging(MRI) and endoscopic ultrasound(EUS), but it is only be used for high-risk individuals, and the acceptability and cost limit its spread (Diane et al., 2019). Given there are no standard protocol for screening currently (Kamisawa et al., 2016;He et al., 2014), chemoresistance and impaired drug access (Schober et al., 2014), and more than 80% of patients have locally advanced or metastatic tumors that are unresectable when diagnosed (Higuera et al., 2016), it is extremely vital to discover novel candidate genes, which play important roles in tumorigenesis and could be new targets for screening and treatment.
Development of high-throughput sequencing had brought light on the research of screening and molecular mechanisms of tumors. TCGA database, which provided numerous publicly available genomic and clinical data of many cancer types, could help researchers to further study and understand the molecular level pathogenesis of each cancer. To explore the underlying mechanisms and identify novel prognostic markers and treatment targets, we performed WGCNA on genomic and clinical data of PAAD downloaded from TCGA to identify hub genes within interested phenotypes and analyze functional enrichment pathways which located in to nd early biomarkers and provide new targets for future molecular level treatments, ultimately bene ts the patients.

Acquiring and preparing genetic and clinical data
The 182 standardized gene expression RNAseq and 222 phenotype data of the 14 datasets for PAAD patients were downloaded from UNIVERSITY OF CALIFORNIA SANTA CRUZ(UCSC Xena)(https://xenabrowser.net/). Due to the absence of some clinical phenotypic data, the sample TCGA-HZ-A8P0-01A was deleted, ultimately 181 gene expression RNAseq samples with corresponding clinical phenotypic data were selected to study. The gene expression level was measured as fragments per kilobase of transcript per million mapped reads(FPKM), the matching annotation information was from the original le downloaded by Illumina. Phenotype data contained the pathologic TNM stage, survival information, histologic grade, tumor growth location, history of exposure to hazardous factors and so on, items of incomplete information were not included.
R was used to preprocess the sequencing data. The gene lter package was applied to remove the low expression value usually represent noise, and the limma package was served to normalize and adjust the data between row groups (Fig.S2). The median absolute deviation (MAD) of the rst 5,000 genes were selected.
Constructing gene co-expression network Gene co-expression network was constructed by the WGCNA package in R. After detecting whether there were missing value and outlier samples (Fig. S3), squared Euclidean distance of each sample was calculated by function adjacency, and whole sample network connectivity according to distance was standardized by function scale. Function pickSoftThreshold was used to calculate scale-free topology tting indices R 2 corresponding to different soft thresholding powers β (Fig. S1). The β value was chosen as long as R 2 reached 0.85. After that, the gene expression matrix was transformed into an adjacency matrix and a Topological Overlap Matrix (TOM), and then the corresponding dissimilarity of TOM (dissTOM) was calculated. For module detection, hierarchical clustering was used to produce a hierarchical clustering tree (dendrogram) of genes by function hclust based on dissTOM. The Dynamic Tree Cut method was performed for branch cutting to obtain different gene modules, represented by the branches of the cluster tree and different colors (Fig. 1). A relatively large minimum module size of minClusterSize = 30 was chosen to avoid generating too many small modules. The module eigengene(ME), which can be considered as a representative of the gene expression pro les of a module, was de ned as the rst principal component of a given module. It was calculated by function moduleEigengenes. Modules would be merged if their correlation of MEs was greater than 0.75, which means they have similar expression pro les. Co-expression networks of key genes were visualized by Cytoscape (Fig. 7).

Identifying the genes associated with interested clinical phenotype
The correlations between MEs and clinical phenotypes of interest were evaluated by Pearson's correlation tests and p < 0.05 was considered to be signi cantly correlated, subsequently the relationships between gene correlation with trait associated modules and gene signi cance for trait were plotted (Fig.3). Genes that were signi cantly related to both a trait and a model may be very important.

Screening for candidate genes with clinical signi cance
Module member-ship(MM) of genes represents the membership with respect to the module, the closer its value is to 1, the more relevant it is to the module. Gene signigicancer(GS) re ects the correlation between gene expression and phenotypic data. In our study, at least MM-value 0.8 & GSvalue 0.2 were selected as key genes, which had higher degree of connectivities than other genes in the module.

GO enrichment and survival analysis
To explore the potential biological themes of genes in the modules, the clusterpro ler package in R was used to annotate and visualize GO terms (Yu et al., 2012). In order to validate our conclusion, the R package survival was used to carry out log-rank tests, risk form and plot Kaplan-Meier survival curves to con rm signi cant genes. Besides, we also analyzed the survival of patients affected by different levels of in uencing factors, including age, gender, alcohol intook, tumor growth location, smoking index and surgical procedures.

Results
Gene co-expression network of PAAD The 5,000 most variant genes according to MAD value were selected for further analysis from the original over 18,000 protein-coding genes. When the value of soft thresholding power β was 7, the connectivity between genes met a unexceptionable scale-free network distribution (Fig. S1). Thirteen modules were identi ed by hierarchical clustering and the Dynamic branch Cutting (Fig. 1). Each module was assigned a unique color as an identi er. The ME of each module was calculated (Table S1). The number of genes in different modules ranged from 38 to 819 (Table 1), the name of genes were also listed (Table S2), the grey module represented a gene set that was not assigned to any of the modules. Notes.

*
The grey module genes with module number 0 were unclassi ed.
Clustering dendrogram of genes.
The hierarchical clustering tree was produced by hierarchical clustering based on dissTOM of genes. Thirteen modules were identi ed by Dynamic Tree Cutting method with a medium sensitivity(deepSplit = 2) to branch splitting. Each module was assigned a color as an identi er. The colored rows below the dendrogram represented the gene similarity expression modules.
Gene ontology enrichment analysis of each module 4,621 genes in 13 modules were mapped to GO database, be in order to obtain their potential functions in the following three aspects, biological process(BP), cellular component(CC) and molecular functio (MF) (Tables S3-S5). Notes.
There were no statistically signi cant results in GO enrichment analysis of purple module.
Functional enrichment results of each module(took the rst three terms of the minimum FDP value ).
The length of bars represented the number of genes, the color of bars corresponded to module color according to legend, on the left side of the label represented functional enrichment results and corresponding GO encoding.

Modules associated with the clinical phenotypes of interest
We analyzed the correlation and signi cance between MEs and clinical phenotypes, including pathologicstage, historyofdiabetes, tumordiameter, histologicgrade, ageatinitialpathologicdiagonosis and so on (Fig. 3). Meaningful and interesting clinical phenotypes such as pathologicstage, tumordiameter and histologicgrade were selected for in-depth analysis.

Module -trait associations were evaluated via correlations and p-value between MEs and clinical traits of interest.
Each row and column corresponds to a module eigengene and a trait, respectively. Each cell contains the corresponding correlation( rst line) and pvalue(second line). The table is color-coded by correlation according to the color legend. Pink, blue and salmon modules negatively correlated to maximum_tumor_dimension(p 0.05). Magenta and black modules positively and red module negatively correlated to neoplasm_histologic_grade(p 0.01). Tan module positively and red module negatively correlated to pathologic_T(p 0.05).
Although the correlation between traits and modules had been found, the module itself still contained many genes, and which needed to be further studied were the most important genes. The genes highly correlated with trait associated modules were also greatly signi cant for the trait (Fig. 4).

Figure 4
The relationships between gene correlation with trait associated modules and gene signi cance for trait.  -24). Correlation between tan module and pathologic_T , salmon module and maximum_tumor_dimension were excluded according to Module membership vs. gene signi cance(p 0.05). We searched for the hub genes in trait associated modules, which were also the key genes to regulate trait.

Screened and veri ed candidate genes of interested clinical phenotypes
A total of three interested clinical phenotypes and their ve corresponding modules were selected to study, we extracted the key genes from modules with signi cant correlation with phenotypes( Fig. 4, Table 2), which were core regulatory genes in the gene regulatory networks, then used NCBI database(https://www.ncbi.nlm.nih.gov/gene/) to query their functions for further screening. We tried our best to select higher MM & GS value in order to obtain more accurate target genes. Overall, 22 candidate genes were identi ed that played regulatory or pivotal roles in the pathological T stage, histological stages and tumor diameter of PAAD, respectively. Notes.
* MM the full name is module member-ship, the closer its absolute value is to 1, the more correlated it is with the module. GS, the abounding name is gene signigicancer, re ects gene expression and phenotypic data correlation.
Survival analysis was used to verify and screening for more accurate and e cient candidate genes in the modules (Table 2). We set the critical expression values of candidate genes based on their median. According to the p value of log-rank test in Kaplan-Meier curve (Fig. 5) We also analyzed the effects of different exposure factors in the phenotypic data on survival and discovered that the overall survival varies with episode age and tumor growth site, divided by onset at age 67 years, the overall survival was obviously increased when comparing younger vs. older group(p 0.01), pancreatic body tumors had the highest overall survival, followed by the tail, the head was the lowest(p 0.01). Alcohol Cytoscape of the nal selected candidate genes Mutations in human genes occur at every moment. The key genes have essentially greater connectivities and play crucial pivotal roles in genetic network. Many studies had shown that the human gene networks conform to scale-free characteristics, which means that it can still work as long as the key genes are not abnormal.
The rst 30 weighted neighbor genes of each candidate genes Nodes and edges represented genes and correlations between candidate genes and weighted neighbor genes. Light red nodes were the hub genes of the network.

Discussion
WGCNA is a systems biology approach used to describe patterns of genetic association between different samples. It can be used to identify highly synergistically changing sets of genes. The candidate biomarker genes or therapeutic targets were authenticated according to the endogenicity of gene sets and the association between gene sets and phenotypes. Instead of focusing only on differentially expressed genes, WGCNA used information from thousands or nearly 10,000 of the most varied genes, or from all genes, to identify the sets of genes and perform a signi cant association analysis with the phenotypes of interest. On the one hand, it made full use of information, on the other hand, it transformed the association between thousands of genes and phenotypes into the assciation between several gene sets and phenotypes, which avoided the problem of multiple hypothesis testing and correction. Hence, it had been generally and successfully applied in various biological contexts (Amin et al., 2016).
We constructed gene co-expression network, divided the modules according to the similarity of gene expression, performed GO enrichment analysis and then established the gure of module-trait relationships. After verifying that the genes signi cantly correlated with trait were also remarkably related to trait associated module, we get candidate genes that were meaningful for traits. In order to fully screen and validate the accuracy and effectiveness of candidata genes, cox regression analysis was used lines with the total survival time and endpoint events corresponding to each sample in the phenotypic data, we nally obtained 12 candidate genes (Fig. 5). The use of differentially expressed genes in WGCNA analysis is not recommended because it completely invalidates the scale-free topology assumption.
In our study, we used MAD to lter out the rst 5,000 genes with the most variation. We identi ed modules and candidate genes associated with pathologic T stage and histologic grade. The expression levels of which were associated with overall survival of PAAD patients, and they could be prognostic biomarkers and effective therapeutic targets for PAAD.
The enrichment analysis of genes in red module revealed that they could encode proteins including various growth factors functioning as regulation of hormone levels, localization, ion migration, biological process and negative regulation of response to wounding, etc, which were a series of positive neuro-endocrine regulation of pancreatic function, while maintaining the acid-base balance and stability of microenvironment. It had inhibitory effects on the destruction of normal organ structure in pathological stage T, organ dysfunction, and the acid-base imbalance caused by high metabolism in anoxic microenvironment. DUSP26/NEAP caused down-regulation of EGFR (Wang et al., 2008), many types of tumour produce far more EGFR than normal cells do, and one effect of that is to stimulate the proliferation of tumour cells.TCEAL2 encoded a member of the transcription elongation factor A (SII)-like(TCEAL) gene family, the family members contain TFA domains and may function as nuclear phosphoproteins that modulate transcription. Pillutla et al. (1999) found the expression of TCEAL1 was the lowest in hematopoietic cells of tumor origin and affected survival of patients. TCEAL2 may promote the encoding of tumor suppressor genes, its mechanism deserves further study. CELF3 played a role in the carcinogenesis and progression of colorectal cancer (Zhou et al., 2018), but our analysis proved that the high expression of CELF3 had a negative regulatory effect on pathological T and signi cantly prolonged the overall survival of PAAD patients. Genes play diverse roles in different tumors, which needed to be further proved by follow-up experiments. CDK5R2 was involved in positive regulation of cyclin-dependent protein serine/threonine kinase activity, the main ligands were TGF-βs, including TGF-β1~TGF-β5, these members had similar structures and functions that inhibited cell proliferation. It had been demonstrated that decreased expression of CDK5R2 was correlated with poor prognosis of hepatocellular carcinoma (Lu et al., 2011), and Pérez-Morales et al.(2018) indicated CDK5R2 could be a potential lung cancer grading and staging biomarker, it may have similar effects on PAAD patients. NAP1L2 was a kind of nucleosome assembly protein, regulated cell differentiation and tissue maturation. It may control pathologic T by means of promoting the differentiation and maturation of PAAD cells. INA encoded internexin neuronal intermediate lament protein alpha, loss of the expression of alpha-internexin was closely related to aggressiveness of tumors and patient's adverse prognosis (Wang et al., 2018), and alpha-internexin could be independent prognostic biomarker of pancreatic neuroendocrine tumors (Song et al., 2017). The expression of module genes could signi cantly inhibit the increase of tumor volume and the expansion of involvement range of adjacent tissues, providing new target sites for future neoadjuvant chemotherapy.
The enrichment analysis of magenta module showed it was associated with cell proliferation, such as cell cycle process, cell division, mitotic nuclear division, etc. CEP55, upregulated and associated with the Janus kinase/signal transducer, activator of transcription signaling pathway and cell proliferation in clear cell renal cell carcinoma(ccRCC), was signi cantly associated with survival time, and overexpression of CEP55 could promote migration, invasion and EMT of ccRCC cells via the activation of PI3K/AKT/mTOR pathway(Chen et al., 2019). Tumor recurrence and metastasis is the primary cause of cancer-associated mortality in patients, therefore, identi cation of e cient diagnostic and prognostic molecular markers may improve survival times. In advanced-stage serous ovarian carcinoma, high expression level of HJURP was signi cantly associated with lymph node metastases (p=0.018) and lower overall survival, which was identi ed as an independent prognostic biomarker . Montes de Oca et al. (2015) also found HJURP was associated with local and metastatic relapse and poor outcome in breast cancer. KIF23, a nuclear protein and a key regulator of cellular cytokinesis, was associated with glioma malignancy and conferred a worse survival time in glioma, which suggested KIF23 was a new novel prognostic biomarker with potential therapeutic implications (Sun et al., 2016). The expression of CDCA5 was associated with decreased survival in HCC and CDCA5 knockdown arrested cell cycle in the G2/M phase(Tian et al., 2018). Wang et al.(2018) also found the expression of CDCA5 was signi cantly higher in HCC tumour tissues and predicted poor overall survival. These basically explained why the magenta module was related to histological staging of PAAD patients.
GO enrichment analysis of the black module indicated that it was related to regulation of cell migration, cell motility, cell differentiation, vasculature development,etc. Han et al.(2018) demonstrated SAC3D1 was overexpressed in the hepatocellular carcinoma tissues when compared with matched normal tissues. ATP6L was markably overexpressed in the poorly differentiated colorectal cancer (CRC) tissues evidently located in invasive front and metastatic foci via the EMT program (Wang et al., 2020), meanwhile, ATP6L enhanced metastatic capacity in prostate cancer cells.
Six kinds of phenotypes were selected for survival analysis. We found that when 67 years old was chosen as the cut-off point of onset age, survival probability of patients was distinctly decreased when comparing older vs. younger group, the risk increased by 1.66 times, suggesting that the age of onset was able to seriously affect the prognosis, which was probably related to cancer consumption and multi-organ functional disturbance. The pancreatic body tumors had a comparatively preferable prognosis, followed by the pancreatic tail tumors. Compared with the body and tail tumors of pancreas, the risk of death from the pancreatic head tumors was 2.87-fold and 1.32-fold, respectively, which possibly related to the fact that the head was more likely to cause obstructive jaundice and secondary suppurative cholangitis and other lethal complications. However, previous drinking level, smoking index, gender, and different surgical methods had no signi cant impact on the life cycle of patients, which provided references for us to use patient information to judge the prognosis in clinical practice.
We also found the modules related to disease types, which suggested that tumor is a complex regulatory process of multiple genes, and the expression of different genes could develop into various pathological types. In the next step, we will collect relevant cases from the clinic, detect gene expression levels and verify them, so as to provide corresponding markers and therapeutic targets for different disease types and references for pathological classi cation.
The development of tumor is a multifactor and multistep complex process. By comprehensively analyzing the gene expression pro le of patients with pancreatic cancer and their corresponding clinical phenotypes, we had identi ed the genes that played key roles, some of which is the rst to be discovered. They probably serve as novel prognostic markers and effective therapeutic targets. It was an original study that introduced new work that had not been submitted to or accepted by any other journal before. Since we used R software to analyze and study the data in pulic databases, we had received ethical approval and written informed consent.
-Consent for publication All our authors have agreed to be published.
-Availability of supporting data The 182 standardized gene expression RNAseq and 222 phenotype data of the 14 datasets for PAAD patients were downloaded from UNIVERSITY OF CALIFORNIA SANTA CRUZ(UCSC Xena)(https://xenabrowser.net/), the accuracy and availability of supporting data are guaranteed.
-Competing Interests The authors declare there are no competing interests.  Functional enrichment results of each module(took the rst three terms of the minimum FDP value ).

Figure 3
Module -trait associations were evaluated via correlations and p-value between MEs and clinical traits of interest.

Figure 5
Survival analysis for twelve candidate genes based on Kaplan Meier plotter.

Figure 6
Survival analysis for six exposure factors based on Kaplan Meier plotter.

Figure 7
The rst 30 weighted neighbor genes of each candidate genes.