Integrative bioinformatic characterization of driver genes and prognostic features in human gastric carcinoma

Background: Gastric cancer (GC) is the second most common cause of cancer death worldwide but could be more curable if diagnosed at an earlier stage. At present, the capability to predict the efficaciousness of molecular diagnosis for GC for each patient remains elusive. The purpose of this study was to identify tumor biomarkers through systems analysis of multigene predictors exploiting the available data resource. Results: Our study showed that 61 shared DEGs were significantly involved in functions such as gastric acid secretion. COL1A1, MMP9, and SPP1 were highly connected nodes in the PPI network. Among the 61 common DEGs, 34 driver genes were identified, including CST1 and MMP9. A total of 34 drug-gene interactions were found, which involved 9 common DEGs. Moreover, CTHRC1 and INHBA were related to clinical outcome in gastric cancer patients. Conclusion: The identification of new driver genes, such as MMP9, CTHRC1, and INHBA, may contribute to the understanding of the etiology of gastric cancer and the development of individualized therapies.

3 related driver genes found function-altering mutations in 30 oncogenes and 18 tumor suppressor genes [7]. Additionally, mutations in a number of other genes, such as tumor protein P53 (TP53) and human epidermal growth factor receptor 2 (HER2), have been identified to be drivers of GC [8], and a few candidate driver genes, including RHOA and CDH1, have been identified using next-generation sequencing technology [9]. Understanding the unique molecular profiles of GC patients and identification of putative new druggable targets may facilitate precision medicine for GC patients and improve the accuracy of cancer prognosis. However, the molecular mechanisms and driver genes in GC development and progression have not yet been completely elucidated.
The Gene Expression Omnibus (GEO) and Cancer Genome Atlas (TCGA) are two available public database repositories that offer the opportunity to study the genomic landscape of cancers by mining gene sequencing data and associated large-scale clinical outcomes data [10]. In this study, we downloaded two datasets from the GEO database and TCGA database that are associated with GC and screened them for common differentially expressed genes (DEGs). Additionally, we performed functional enrichment analysis and protein-protein interaction (PPI) analysis using these common DEGs. Moreover, the identified candidate driver genes were then subjected to functional enrichment analysis, drug-gene interaction analysis, and survival analysis. The identification of novel candidate driver genes offers a more complete view of the genomic processes governing GC development and facilitates precision medicine for GC patients.

Determining DEGs
We obtained a total of 359 DEGs from the GSE54129 dataset that met our screening criteria of pvalue < 0.05 and |log 2 FC| > 2, among which 154 were up-regulated and 205 were down-regulated. A total of 970 DEGs were obtained from TCGA STAD dataset. Thereinto, the expression of 427 genes were up-regulated and 543 genes down-regulate. Heat maps and volcano plots of DEGs identified in GSE54129 and TCGA STAD are shown in Figure 1A and 1B respectively. We found 61 common genes (20 up-regulated genes and 41 down-regulated genes) that were differentially expressed in both datasets ( Figure 1C).

Functional enrichment analysis of common DEGs
Functional enrichment analysis showed that the common DEGs are enriched in 3 KEGG pathways, including metabolic pathways, glycolysis/gluconeogenesis, and gastric acid secretion (Figure 2A), and are involved in 5 GO BP terms, including digestion, potassium ion import, retinoid metabolic processes, oxidation-reduction processes, and cellular aldehyde metabolic processes.

PPI network of common DEGs
The PPI network of the 61 identified common DEGs is shown in Figure 2B, with a total of 59 nodes and 225 interactive relationship pairs. Nodes with a PPI degree greater than 10 are listed in Table 1,

Driver gene prediction
The scores of potential cancer-driven genes in each sample from TCGA are shown in Supplementary   Table 1. The overall rank of each gene in all samples was determined in order to determine the overall strength of potential cancer drivers compared with other drivers using the condorcetRanking function. From the 61 common DEGs, we found 34 potential driver genes. As shown in Figure 3A, these 34 potential driver genes are enriched in 1 KEGG pathway (gastric acid secretion) and 8 GO BP terms (including negative regulation of canonical Wnt signaling pathway, positive regulation of monocyte chemotaxis, regulation of muscle contraction, positive regulation of vascular endothelial growth, cell differentiation, potassium ion import, and embryonic skeletal system development). The aggregate DawnRank scores of these 34 potential driver genes are shown in Supplementary Table 2, and the top 10 potential driver genes with the highest aggregate DawnRank scores are listed in Table   2, including Cystatin SN (CST1), MMP9, Gastrokine 1 (GKN1), Secreted Frizzled Related Protein 4 (SFRP4), and CTHRC1.

Drug-gene interaction prediction
Using DGIdb, we obtained 34 drug-gene interaction pairs, including 9 potential driver genes (3 up-5 regulated genes and 6 down-regulated genes) and 33 drugs. The specific interaction relationships are presented in a network diagram ( Figure 3B). From this drug-gene interaction network, we found that ATPase H+/K+ Transporting Subunit Alpha (ATP4A, down-regulated), Somatostatin (SST, downregulated), MMP9 (up-regulated), and COL1A1 (up-regulated) can each be targeted by several drugs.

Potential driver gene survival analysis
Only up-regulated potential driver genes were used in our survival analysis [6]. We screened two upregulated driver genes related to survival, including CTHRC1 and Inhibin Subunit Beta A (INHBA). A K-M survival curve is shown in Figure 4. The results show that higher expression of these two driver genes predicts shorter overall survival.

Discussion
In the present study, we identified 61 common DEGs in the two GC datasets, of which 34 were found to be candidate driver genes. These driver genes are enriched in the 'potassium ion import' GO term and the 'gastric acid secretion' KEGG pathway. Moreover, the potential driver genes, including the upregulated gene MMP9, had higher degrees in the PPI network, higher aggregate DawnRank scores, and are predicted targets of more drugs. Furthermore, CTHRC1 and INHBA have prognostic value in GC patients.
Proteins in the matrix metalloproteinase (MMP) family are associated with the degradation of extracellular matrix in normal physiological processes. The presence of MMPs have been demonstrated in GC samples [11], and MMP9 expression was enhanced in GC compared to normal mucosa [12]. A previous study reported that the T allele of the MMP-9 promoter affected GC tumor progression and invasiveness [13]. Shan et al. showed that MMP-9 increased during the pathogenesis of GC through interaction with HER2 [14]. High MMP-9 mRNA expression levels were shown to be associated with poor survival in patients with metastatic GC [15]. In line with the previous study, our study found that MMP-9 is up-regulated and may be mutated to become a driver gene associated with GC progression. Future studies are needed to validate this.
Recent studies have demonstrated that the clonal architecture of driver genes has prognostic value in a number of cancers [16,17]. In our study, we found that the overexpression of CTHRC1 and INHBA 6 predicted shorter overall survival of GC patients. Recent studies showed that CTHRC1 was overexpressed after promoter demethylation in metastatic GC and was an independent prognostic marker in GC [18,19]. Wang et al. found that INHBA up-regulation was associated with poor survival in GC [20]. Similarly, Oshima et al. showed that high INHBA gene expression was a useful independent outcome predictor and was associated with significantly poorer 5-year overall survival of GC patients after curative surgery [21]. Together, the expression of these two genes may be associated with the survival outcomes of GC patients, and may be useful for developing personalized GC treatment.

Conclusions
In conclusion, we identified driver genes, including MMP-9, CTHRC1, and INHBA, which may play critical roles in GC development. Furthermore, CTHRC1 and INHBA may be effective prognostic markers for GC. The present study contributes to our understanding of the detailed mechanisms of GC and offers a foundation for the development of personalized cancer therapies.

Collection of public datasets
Gene expression dataset GSE54129 (accession number) was got from the GEO repository, which is an online international resource for the retrieval of functional genomic data [22]. In total, GSE54129 (platform: GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array) includes data from 111 human GC tissues from 111 patients with GC that underwent subtotal gastrectomy and 21 noncancerous gastric tissues from 21 volunteers who underwent gastroscopy for a health examination.
TCGA database (http://firebrowse.org/) is a leading comprehensive repository of cancer genomic profiles that stores data about multi-dimensional major cancer-causing genomic alterations in various cancers [23]. We also downloaded the stomach adenocarcinoma (STAD) patients' clinical and RNA-Seq data (Level_3__RSEM_genes_normalized__data) from TCGA database, which includes a total of 415 GC samples and 35 controls. Copy number variation (CNV) data (CopyNumber_Gistic2.Level_4) was also obtained from TCGA database.

Data processing and identification
The R package affy [24] (version 1.50.0) and the limma package [25] (version 3.10.3l) were applied to process the data downloaded from the GEO database. The Robust Multi-array Average (RMA), background correction, normalization, and expression calculation was performed. The platform annotation file was used for probe annotation and a few probes that were not annotated with gene symbols were removed. In the case of multiple probes targeting the same gene, the mean expression value of those probes was considered as the final expression of the gene. For the data downloaded from TCGA database, the expression profile data was presented as the processed RSEM value, which can be used directly with a log2 conversion.
Subsequently, an empirical Bayesian method was implemented in the limma package [25] for identification of DEGs between GC samples and controls in GSE54129 and the dataset from TCGA database, respectively. A gene was considered differentially expressed when this analysis satisfied the conditions of Benjamini & Hochberg (BH)-corrected p-value < 0.05 and |log fold change (FC)|> 2.

Functional and pathway enrichment analysis
The commonly used enrichment analysis tool DAVID [26] (version 6.8) was applied to perform functional enrichment analysis of common DEGs, including Gene Ontology (GO) Biological Process (BP) enrichment analysis [27] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis [28]. The cut-off criteria for included genes was a p-value<0.05 and a gene count ≥2.

PPI network analysis
We determined the PPI relationships of the identified common DEGs using the Search Tool for the Retrieval of Interacting Genes (STRING) (version: 10.0) [29]. The PPI score was set to medium confidence (0.4) and the organism was set to Homo sapiens. Cytoscape software was used to visualize the resulting complex networks (version: 3.2.0) [30]. Subsequently, the CytoNCA plug-in [31] (version 2.1.6) was used to perform PPI node topology analysis (parameter ='without weight'). Each 8 node's ranked score was used to identify the key nodes involved in the protein interaction relationships in the PPI network.

Driver gene prediction
It is generally thought that nodes in a gene interaction network are more important and more likely to be key genes in a network if they are more closely related to other genes [32]. DawnRank [33] applies this theory in its algorithm and assigns high score values to genes that significantly affect the abnormal expression of downstream genes. A gene's rank score can then be used to determine which genes in the sample can be used as driver genes.
We extracted gene expression data from the common DEGs in the TCGA STAD cancer samples and control samples, as well as gene mutations in tumor samples. A PPI network was constructed as an adjacency matrix (e.g., node i and node j were connected, Aij was considered to be 1, otherwise Aij = 0) in order to predict and analyze cancer driver genes. The DawnRank algorithm was used to obtain the gene rank score for each gene in each patient. The higher the score, the more likely it is that the gene is a driver gene. Finally, functional enrichment analysis of the identified driver genes was conducted.

Prediction of small molecule drugs
The Drug-Gene Interaction database (DGIdb) mines multiple existing resources for drug-gene interactions and generates assumptions about how genes may be developed as therapeutic targets or prioritized for drug development [34]. In the present study, we used DGIdb2.0 [34] to predict druggene interactions for the identified GC driver genes. The preset filter was set to 'FDA Approved.' Cytoscape was then used to build a drug-gene network map.

Survival analysis
To perform survival analysis, we used TCGA clinical information, including overall survival (OS), survival status, and disease samples screened. The STAD samples were divided into high expression group and low expression group according to the mean gene expression. Log-ranktest was done and the statistical significance threshold set to p < 0.05, and the relationship between survival and driver genes were determined.      Survival analysis of candidate driver genes identified two potential prognostic factors.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.