Integrative Bioinformatics Analysis Identied Hub Genes in Association with Development of Lung Adenocarcinoma

Background: Cancer-related death is mainly caused by lung adenocarcinoma (LUAD), an aggressive malignant tumor. Therefore, the identication of important LUAD-related genes and the further analysis of its prognostic signicance are critical for the survival of LUAD patients. Methods: Weighted Gene Co-expression Network Analysis (WGCNA) and differential gene expression analysis methods were adopted to screen out TCGA-LUAD database and the gene expression proles of GSE32863 from GEO. Functional annotation analysis and protein-protein interaction (PPI) network were conducted on differential co-expression genes. Furthermore, survival analysis was carried out on twelve hub genes that were identied by applying the CytoHubba plugin of Cytoscape. Finally, we validated the protein level of survival-related gene in HPA database. Results: A total of 358 differential co-expression genes were extracted from the database of TCGA and GEO. These genes were mainly enriched in extracellular structure organization, cell−substrate adhesion and response to acid chemical (biological process), cell−cell junction and collagen−containing extracellular matrix (cellular component), DNA−binding transcription activator activity, glycosaminoglycan binding, and growth factor binding (molecular function). In the KEGG analysis, the main pathways were Drug metabolism−cytochrome P450. Moreover, in a PPI network, the 12 hub genes (GNG11, ADRB2, ADCY4, TGFBR2, IL6, GPC3, VIPR1, GRK5, CAV1, RAMP3, RAMP2, and CALCRL) were identied. The expression level of ADCY4, VIPR1, and TGFBR2 was corresponded with clinical stages and overall survival (OS) in LUAD patients. Finally, the protein level of ADCY4 and VIPR1 was down-regulated consistently with mRNA levels in LUAD samples. Conclusion: Perhaps, ADCY4, VIPR1 and TGFBR2 play important role in tumorigenesis, they will serve as prognostic biomarkers and therapeutic targets of LUAD in the


Background
Among various cancers, lung cancer accounts for the highest incidence and is the leading cause of cancer-related death worldwide (1). Lung cancer can be classi ed into two pathological forms, including non-small cell lung carcinoma (NSCLC) and small cell lung carcinoma (SCLC). The main subtypes of NSCLC refer to lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LSCC), and large-cell carcinoma (2). Recently, it is demonstrated that the morbidity of LUAD is higher than that of LSCC in the world (3) . Molecular targeted therapy and immunotherapy gradually improved the survival rate of LUAD patients. For instance, tyrosine kinase inhibitors (TKIs) is used as the rst-line clinical treatment for advanced LUAD patients suffering from sensitive epidermal growth factor receptor (EGFR) gene mutations (4) . What's more, the purpose of the immunotherapy for lung cancer reverses immune checkpoints, mainly including programmed death protein-1 (PD-1) and programmed death ligand-1 (PD-L1), and obvious and therapeutic effect was observed in speci c lung cancer patients (5) . However, the 5year survival rate of LUAD patients with pessimistic prognosis remains at a low level (6,7). Due to the lack of studies on the speci c molecular mechanisms underlying the pathogenesis of LUAD, no effective targeted therapies were discovered. Therefore, accurate indicators for the tumorigenesis of LUAD are considered to have the capability to signi cantly improve the prognosis and treatment of LUAD.
With the development of genomic technologies, it is increasingly popular to explore the molecular mechanisms of cancer and discover the accurate biomarkers with gene expression pro les analysis of bioinformatics (8). A systems biology algorithm of Weighted Gene Co-expression Network Analysis (WGCNA) can be adopted to make a signi cant association analysis with clinical traits and identify the potential biological functions of these genes (9). One important aim of WGCNA is to construct coexpression modules of highly correlated genes and evaluate the association between interested modules and clinical traits (10). In addition, differential gene expression analyses were utilized to explore molecular mechanisms underlying genome regulation and discover changes of expression levels between tumor and normal tissues (11). Differential gene expression analyses can contribute to the discovery of potential cancer-related biomarkers. Therefore, the results obtained by WGCNA and differential gene expression analysis, were combined to select highly related genes that can be regarded as candidate biomarkers.
In this study, the LUAD-related mRNA expression pro les from the TCGA and GEO database were analyzed to discriminate different co-expression genes using WGCNA and differential gene expression analysis. The LUAD development was further explored through protein-protein interaction (PPI) analysis, functional enrichment analysis and survival analysis. Our study provided a new method to understand potential molecular mechanisms of LUAD with the analysis of differential co-expression genes.

Methods
The bioinformatics analysis process of hub gene is shown in Figure 1.

Data collection
The gene expression pro les of LUAD were downloaded from database of The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov/) and the Gene Expression Omnibus (GEO) database of the National Center for Biotechnology Research (http://www.ncbi.nlm.nih.gov/geo/). The TCGA data containing information of 535 LUAD samples and 44 normal tissues, were downloaded by R package TCGA biolinks (12). In this study, genes were kept with a cpm (count per million) ≥1. We use function rpkm in edgeR package (13) to lter a total of 15,127 genes with RPKM values for next analysis. Gene expression pro le of GSE32863 were downloaded from and performed on Illumina HumanWG-6 v3.0 expression beadchip (GPL6884). GSE32863 included 58 lung adenocarcinoma and 58 adjacent non-tumor lung tissues from LUAD patients. Data-set GSE32863 was used as a training set to construct co-expression networks and identify hub genes in this study.
Identi cation of Functional Modules In A Co-expression Network WGCNA package in R was used to construct co-expression network for data pro les of TCGA-LUAD and GSE32863 (9). For relating modules to external sample traits, the modules of highly correlated genes among samples were explored by WGCNA. In order to build a scale-free network, we selected soft powers β = 3 and 20 according to the pickSoftThreshold function. Next, the following formula: aij=|Sij| β(aij: adjacency matrix between gene i and gene j, Sij: similarity matrix which is done by Pearson correlation of all gene pairs, β:soft power value) was used to create the adjacency matrix. Then, topological overlap matrix (TOM) and the corresponding dissimilarity(1-TOM) were transformed. Afterwards, the similar gene expressions were classi ed into different gene co-expression modules by constructing a hierarchical clustering of genes based on the 1-TOM matrix. According to the previous study, we calculated the module-trait associations to identify functional modules (14). Therefore, high-related coe cient modules were relevant to clinical traits, and were considered candidates for subsequent analysis. It has a speci c description of these method in the previous study (14).

Differentially Expressed Genes (DEGs) Screening and Interaction With the Functional Modules
Differential expression analyses on RNA-Sequencing and microarray data were integrated using the R package limma (15). Limma was respectively applied in the TCGA-LUAD and GSE32863 dataset to select out differentially expressed genes (DEGs) between LUAD and normal tissues. To control for the false discovery Rate, the Benjamini-Hochberg method was used to adjusted p-value. The threshold for identifying DEGs was set at |logFC| ≥ 1.0 and adj. p < 0.05. Finally, in order to identify potential prognostic genes, we use R package VennDiagram to construct the overlapping genes between DEGs and co-expression genes.
Enrichment Analysis for the potential prognostic genes A Gene Ontology (GO) analysis contains three domains: molecular functions (MFs), cellular components (CCs) and biological processes (BPs). The potential functions of the aberrantly expressed intersecting genes involved in signaling pathways were analyzed using the Kyoto Encyclopedia of Genes and Genomes (KEGG). R package clusterPro ler package(16) was used to explore the functions among genes of interest, with a cut-off criterion of adjusted p<0.05.

Construction of PPI and Screening of Hub Genes
Protein-protein network (PPI) of these candidate hub genes were constructed by STRING online tool(https://string-db.org) (17). In this study, we set a combined score ≥ 0.4 as the cutoff value to build a visualized network model using Cytoscape (v3.7.2) (18). Degree algorithm, calculated by CytoHubba (a plugin in Cytoscape) (19), was used to nd hub nodes in a co-expression network. Subsequently, we selected the gene with the top 12 degree values (with a connectivity degree of≥6) as hub genes.

Clinical relevance and Prognostic Values of Hub Gene
Based on the TCGA dataset, we analyzed the relative expression of hub genes among tumor individual cancer stages. Survival analysis for the hub genes was performed by Kaplan-Meier univariate survival analysis with the TCGA database, using the survival package in R software. Moreover, GEPIA (http://gepia.cancerpku.cn/index.html) was used to evaluate the association between disease-free survival (DFS) and hub genes expressed in LUAD patients (20). Log-rank p<0.05 were considered as statistically signi cant.

Protein Expressions of Survival-Related Hub Genes
The Human Protein Atlas database (HPA, https://www.proteinatlas.org/) were used to validate the expression differences of the survival-related genes between LUAD and normal tissues by immunohistochemistry (IHC).

Result Identi cation of Co-Expression Gene Modules
We used the WGCNA package on R to construct gene co-expression networks from the TCGA-LUAD and GSE32863 datasets. Totally, 15 modules in the TCGA-LUAD and 7 modules in the GSE32863 (Figure 2A and Figure 3A) were identi ed, and a gray module was excluded, because it was assigned into no cluster.
Then, the heatmap of module-trait relationships was plotted to evaluate the relevance between the two clinical traits and each module. The results in Figure 2B and 3B, revealed that the turquoise module in the TCGA-LUAD and brown module in the GSE32863 were proved to be highest associated with normal lung tissue (turquoise module: r = 0.79, p = 5e−127, and brown module: r = 0.97, p = 9e−71). Moreover, compared with normal tissues, the expression of the genes in the turquoise module and brown module was all downregulated in LUAD tissues.

Extraction of Overlapping Genes between the DEGs and Co-expression Networks
We discovered that 3,583 DEGs in the TCGA-LUAD dataset ( Figure 4A) and 957 DEGs in the GSE32863 dataset ( Figure 4B) were dysregulated in tumor tissues using the limma package (cut-off criteria of |logFC| ≥ 1.0, adj. p < 0.05). A total of 5313 and 1699 co-expression genes were enrolled in the turquoise module of TCGA-LUAD dataset and the brown module in GSE32863, respectively. Totally, as shown in Figure 4C, 358 overlapping genes were extracted to verify the genes among co-expression networks.

Functional Enrichment Analysis
In order to further explore the potential functions of the 358 overlapping genes, GO and KEGG enrichment analyses were performed by the cluster pro ler package. Several GO-enriched gene sets were displayed in Figure 5. The results of the biological process (BP) revealed that these genes are mainly enriched in extracellular structure organization, cell−substrate adhesion and response to acid chemical. The cellular component (CC) of these genes were mainly involved in cell−cell junction and collagen−containing extracellular matrix. Most importantly, molecular function (MF) analysis proved that DNA−binding transcription activator activity, glycosaminoglycan binding and growth factor binding were suggested to be associated with these genes. In the KEGG analysis, the main pathways were enriched by these genes such as Drug metabolism − cytochrome P450, Ether lipid metabolism and Arachidonic acid metabolism ( Figure 6).

Construction of PPI Network and Identi cation of Hub Genes
We applied the STRING database to establish the PPI network among the overlapped genes ( Figure 7A). Figure 7B gives information that hub genes were extracted from the PPI network with a connectivity degree ranking. Finally, the top 12 highest-scored genes, including GNG11, ADRB2, ADCY4, TGFBR2, IL6, GPC3, VIPR1, GRK5, CAV1, RAMP3, RAMP2, and CALCRL were selected as hub genes.

Clinical relevance of hub gene expression in LUAD patients
After the 12 hub genes were screened out, based on the dataset in TCGA, we validated that ADCY4 expression had a strong positive correlation with tumor individual cancer stages by applying package cliCor in R (p<0.05; Figure 8A). Then with VIPR1, stage I had a signi cant expression difference with stage IV (p=0.033, Figure 8E). Finally, TGFBR2 expression of stage I and stage III had signi cant differences (p=0.024, Figure 8I).

Prognostic Values and Protein Expression of 12 Hub Genes
We applied the R survival package and the GEPIA2 database to perform OS ( Figure 9) and DFS ( Figure  10) analyses of the twelve hub genes by Kaplan-Meier plotter, respectively. The Kaplan-Meier analysis revealed that the lower expression level of ADCY4, VIPR1, and TGFBR2 was obviously related to the worse OS of the LUAD patients (ADCY4: p=0.003, VIPR1: p=0.019, and TGFBR2: p=0.002) ( Figure 9A, 9E, and 9I). No signi cant difference was found in LUAD patients with the expression level of ADCY4, VIPR1 and TGFBR2 in DFS (p>0.05) ( Figure 10A,10E, and 10I). Furthermore, based on the HPA database, the protein level of the ADCY4 and VIPR1 gene was extremely lower in LUAD tissues, compared with normal tissues (Figure 11). The expression level of TGFBR2 is not discovered in the HPA database.

Discussion
Although molecular targeted therapy and immunotherapy have been gradually applied in LUAD patients, the prognosis of LUAD is generally poor. Different cancer subtypes present different molecular targets. Thus, it is vital to explore better biomarkers for speci c prognosis and progression of LUAD. In this study, through integrated bioinformatic analysis of DEGs and co-expression genes that are associated with LUAD, we nally obtained 358 genes with the same expression trends in the TCGA and GSE32863 databases. GO and KEGG pathway enrichment analyses suggested that differential co-expression genes were mainly enriched in the pathways that are related to cell proliferation, invasion and migration. Furthermore, according to the connectivity degree ranking from the CytoHubba plugin in Cytoscape, we screened out the top 12 LUAD-related genes, including GNG11, ADRB2, ADCY4, TGFBR2, IL6, GPC3, VIPR1, GRK5, CAV1, RAMP3, RAMP2, and CALCRL. Among them, the expression of ADCY4, VIPR1 and TGFBR2 were found to be associated with clinical stages of LUAD. Moreover, the down expression of ADCY4, VIPR1, and TGFBR2 was closely related to poor OS in LUAD. Finally, we carried out survival and immunohistochemical analysis for ADCY4, VIPR1 and TGFBR2.
The cAMP signaling pathway can be activated by speci c adenylyl cyclases. Due to the distinct tissue distribution of speci c AC isoforms, cAMP can selectively regulate effector proteins. Therefore, ACs often integrate cellular signals and are regarded as key enzymes in these signaling pathways (21). For example, high expression of ADCY7 resulted in growth, stimulating apoptosis and increased c-Myc expression in leukemic cells (22). One study illustrated that, served as an oncogene for gastric cancer, ADCY3 could be regulated by DNA methylation (23,24). It was reported that ADCY4 was signi cantly downregulated in breast cancer. ADCY4 was strongly related to G protein coupled receptors and the cAMP signaling pathway. ADCY4 expression was correlated with OS and tumor stages of patients suffering from breast cancer (25). Recent studies reported that VIPR1 can inhibit the development of several cancers, including medulloblastoma(26), prostate cancer (27) and lymphoblastoma (28), suggesting that VIPR1 plays an important role in the inhibition of the growth and development of cancer cells. In hepatocellular carcinoma, DNA methylation of VIPR1 signi cantly inhibited gene transcription and low expression of VIPR1 was related to poor survival and histological differentiation (29). It was reported that VIPR1 was down-regulated in lung adenocarcinoma tissues compared with normal tissues (30). The expression of VIPR1 was lower in lung cancer cells H1299 than that in normal lung epithelial cells BEAS-2B. Moreover, the growth, migration, and invasion of H1299 cells were remarkably inhibited by VIPR1 (30). TGFBR2 can inhibit the malignant progression of various cancers, such as glioblastoma (31), cervical cancer(32), bladder cancer (33) and so on. In the TGF-β signaling, phosphorylation and activation of TGFBR1 begins with the binding of TGF-β ligands to TGFBR2 (34). Afterwards, the activated TGFBR1 activates Smad proteins that can suppress EMT in cancer (34)(35)(36). Li's study proved that the lower expression of TGFBR2 was related to up-regulated abilities of proliferation, migration and invasion in lung cancer cell (A549) (37).
In our work, several limitations were also pointed out. Although the potential diagnostic genes between LUAD and normal tissues were identi ed using comprehensive bioinformatics analysis, it was hard to be very accurate for each LUAD patient. Furthermore, the speci c molecular mechanisms of ADCY4, VIPR1 and TGFBR2 which had a signi cant effect in the prognosis of LUAD patients should be veri ed through several experiments in the future.

Conclusion
This study aimed at identifying survival-related genes that are associated with LUAD, thus further developing a new therapeutic strategy for targeted genes. Through integrated bioinformatic analysis of the DEGs and co-expression genes associated with LUAD, we nally obtained 358 genes with same expression trends. Further analysis suggested that three hub genes, including ADCY4, VIPR1 and TGFBR2, were signi cantly related to tumor clinical stages and overall survival. Therefore, ADCY4, VIPR1 and TGFBR2 may be important in the tumorigenesis and serve as prognostic biomarkers and therapeutic targets of LUAD in the future.

Availability of data and material
The datasets analyzed during the current study are available in the The Cancer Genome Atlas (https://portal.gdc.cancer.gov), The Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo) and The Human Protein Atlas database(https://www.proteinatlas.org).

Competing interests
The authors declare that they have no potential con icts of interest.

Funding
Not applicable.
Authors' contributions XLH has made substantial contributions to the design of the work, the analysis and interpretation of data as well as drafted the work. HLX has agreed to be personally accountable for the author's own contributions and to ensure that questions related to the accuracy or integrity of any part of the work. WJJ and KHT have approved the submitted version All authors have read and approved the manuscript. Figure 1 The bioinformatics analysis process.