Identication of Potential Diagnostic Genes for Bladder Cancer by Bioinformatic Analysis

Background: Cytology and transurethral cystoscopy constitute the gold standard for the diagnosis of bladder cancer (BC). However, some minor lesions cannot be detected in time with these techniques, resulting in a high rate of missed diagnosis. Finding biomarkers that are economical, convenient, sensitive, and specic has become an urgent priority. Methods: Gene expression prole data from BC and normal bladder tissue were downloaded from the Gene Expression Omnibus (GEO) database and used as a training set to screen for differentially expressed genes (DEGs). The bladder gene expression and related clinical data derived from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) databases were used as a validation set. The effectiveness of the DEGs as diagnostic criteria was veried in terms of gene expression, gene mutation and diagnostic eciency. Results: Two upregulated and eight downregulated hub genes were identied by screening. In terms of gene expression, the expression levels of these genes were signicantly different between bladder cancer tissues and normal tissues. In terms of clinical diagnostic ecacy, TOP2A had the highest single diagnostic value, while the combinations of TOP2A/CNN1, TOP2A/ISG15/CNN1 and TOP2AISG15/ACTG2 had the largest area under the curve (AUC) among two- or three-indicator combinations. Conclusion: either diagnostic However, this still conrmed in a larger sample with further biological experiments.


Introduction
Worldwide, bladder cancer (BC) accounts for 3% of all incident cases of cancer. Approximately 150 thousand patients die from BC each year [1][2]. Although cytology and transurethral cystoscopy constitute the gold standard for diagnosing BC, BC is easily missed due to tumour heterogeneity and some subtle pathological changes such as small papillary tumours and micrometastasis [3]. Therefore, it has been observed that approximately 1/4 of newly diagnosed BC cases in clinical practice involve muscle in ltration [4][5]. In addition, non-muscle-invasive BC has a high risk of recurrence after treatment, at more than 50%. More seriously, 30% of cases of noninvasive bladder cancer transform into invasive bladder cancer [6]. Because of the high risks of recurrence and progression due to the uncertainty of diagnosis, BC leads to repeated traumatic examination and treatment, resulting in a heavy nancial burden for patients [2,7].
In light of this situation, molecular markers, which can safely and effectively diagnose BC without trauma, have become another ideal choice. However, to date, few widely accepted molecular markers have been applied for clinical diagnosis. For example, nuclear mitotic apparatus protein (NMP22), which has been approved as a BC biomarker by the US Food and Drug Administration (FDA), still has the problem of relatively low sensitivity and even needs to be combined with cystoscopy [8-10]. Conversely, a screening analysis of 1502 people with a high risk of BC showed that only 85 subjects were positive for NMP22. A further examination of 69 suspected BC patients who were willing to be tested revealed that only 3 patients were diagnosed with BC [11]. Just as importantly, NMP22 is also positive in cystitis, urinary calculi and other pathological conditions [12]. This highlights the desperate need to identify novel reliable molecular markers.
With the great advancement of high-throughput technologies and the establishment of various public medical databases such as the Cancer Genome Atlas (TCGA https://cancergenome.nih.gov) and Gene Expression Omnibus (GEO https://www.ncbi.nlm.nih.gov) online databases, tissue differentially expressed gene (DEG) analysis and functional enrichment analysis have been applied to many human cancers. In the current study, the gene expression pro le data of BC and normal bladder tissue were extracted from the GEO database as a training set, and the DEGs between the tissues were taken as the research objects. Expression pro les and corresponding clinical data related to bladder tissue were obtained from the databases of TCGA and Genotype-Tissue Expression (GTEx, https://gtexportal.org) and used as a validation set. Using the Human Protein Atlas (HPA https://www.proteinatlas.org) and other databases to systematically study the e ciency of screening genes in diagnosis, we identi ed potential biomarkers for the diagnosis of BC.

Microarray data
In this study, three sets of human genetic screening data were downloaded from the GEO database (www.ncbi.nlm.nih.gov/geo/); these datasets were based on the PL96 (Affymetrix Human Genome U133A Array), PL570 (Affymetrix Human Genome U133 Plus 2.0 Array) and GSE13507 datasets, respectively. Table 1 details the speci cs of each of the three datasets, including the numbers of normal and BC specimens and the year of the trial. For different GEO data on the same platform, we followed a standard operating procedure and normalization method based on other scholars' studies [13][14].
Meanwhile, 407 BC and 28 normal samples collected from TCGA and GTEx were used as a validation set [15]. In the screening process, we removed the probes corresponding to multiple molecules. In this case, only the probe with the largest signal value was retained.

Screening of differentially expressed genes (DEGs)
After the normal control group and the BC group were compared and analysed using the limma package in R (v.3.5.3), DEGs were obtained, and the results were deduplicated. Screening criteria were as follows: absolute value of log 2 gene expression fold change (FC) ≥ 1, adjP < 0.05. Volcano plots were drawn with the "ggplot2" R package. To eliminate the in uence of mixed false-positive expression genes and screen the prediction biomarkers for subsequent clinical diagnosis, only common differentially expressed genes were selected for analysis. In the process, the "VennDiagram" R package was used to draw Venn diagrams to achieve this purpose.

PPI network construction and module analysis
The online database STRING (https://string-db.org/) was used to analyse the protein interaction network of common DEGs (minimum required interaction score=0.40), and the results were imported into Cytoscape software for visualization and correlation analysis. The hub genes were screened by using Cytohubba, a plug-in of Cytoscape software.

Gene Ontology and pathway enrichment analyses
The online Gene Ontology (GO) database (http://geneontology.org/) is an effective bioinformatics tool that provides a systematic understanding of the biological functions of gene products at the human cell level. KEGG (https://www.kegg.jp/) is a database that uses genetic information to calculate and speculate on higher-level and more complex cell activity as well as biological behaviour. Before analysis, wide genome annotation for the human R package "org.Hs.eg.db" was used to transform the gene symbol codes for all data into Entrez IDs. KEGG pathway analysis and Gene Ontology (GO) analysis were employed using the clusterPro ler package to better understand the intrinsic biomedical functions and essential characteristics of genes.

Oncomine database analysis
Oncomine (https://www.oncomine.org/resource/login.html) is the largest oncogene chip database and integrated data mining platform in the world [16], which aims to mine the key genes of various cancers. Using the public datasets available, DEG expression in BC tissues and noncancerous tissues was compared. The P value was generated using Student's t-test (p value ≤ 0.01, fold change 1).
2.6 GEPIA database analysis GEPIA (Gene Expression Pro ling Interactive Analysis) is a web server that uses integrated and standardized TCGA and GTEx gene expression information to identify the differences and characteristics between cancer and normal gene expression [17]. In the present study, screening gene expression values of human BC tissue originating from TCGA normal and GTEx databases were matched with those in normal tissues, and we veri ed the gene expression difference among different tissues through the online GEPIA platform.

Gene expression veri cation with R software
This stage includes three parts: (1) The TCGA expression difference in paired samples (cancer and adjacent tissues) from BC patients; (2) The expression discrepancy among cancer samples with different pathological stages and control normal samples; (3) Expression differences among cancer samples with different histological grades and normal control samples. To achieve the above objectives, validation data were downloaded from level 3. HTSeq-TPM RNAseq in the TCGA BLCA (bladder urothelial carcinoma) project. After log2 conversion, it was calculated and visualized using paired t-test or Kruskal-Wallis test and ggplot2 package of R statistical software.

Genetic alteration and HPA database analysis
The alteration frequency and mutation type of screening genes in BC were compiled using cBioPortal (https://www.cbioportal.org/). By inputting the target genes in the "quick select" option, details and summaries of mutation types and gene ampli cation in BC can be observed in the "Cancer Type Summary" and "OncoPrint" modules, respectively. In addition, the "Comparison/Survival" module was used to obtain the difference in overall survival between the altered and unaltered observed gene groups, and a Kaplan-Meier diagram with the log-rank test P value was drawn. To verify the protein expression of the studied genes, biopsy immunohistochemical micrographs of tumour or normal tissues were acquired from the Human Protein Atlas (HPA http://www. proteinatlas.org/).

Propensity score matching (PSM) and random forest analysis
The specimens collected from the above TCGA and GTEX databases were divided into a normal group and a BC group. The gender and age in the data were taken as the variables to be matched, and the genes to be selected were taken as the screening indicators for cancer diagnosis. Since the age in the GTEX database is presented in 10-year bins, we divided all data into ve groups (patients under the age of 40 years were in the rst group, and those over the age of 70 years were in the fth group). In addition, since the number of people in the cancer group was much larger than that in the normal group, the two groups were matched by propensity score (PSM) at a ratio of 1:5 according to the principle of statistical proportion selection. After the matchit, tableone and random forest packages were downloaded [18], random forest analysis was carried out in R software.

Diagnostic ROC
The aforementioned samples collected from TCGA and GTEx databases were divided into a normal group and a BC group. First, ROC analysis of a single gene was carried out. After log 2 transformation of TPM RNA-seq expression data of the target gene in each sample, the transformed values were included in each group for calculation. The area under the curve (AUC) and visualization of each gene predicting cancer were calculated by using the pROC (version 1.17.0.1) and ggplot2 R software packages, respectively. Then, a logistic regression model was used for all genes. Since the screening genes had been determined at this step, the formula is as follows: glm (formula = status ~ TOP2A + ISG15 + CNN1 + MYH11 + ACTA2 + MYLK + LMOD1 + ACTC1 + ACTG2 + TAGLN, data= data, family = binomial).
A model was established according to the regression coe cient of each gene, representing the value generated by all indicators through the model, and the value was used to predict a nal outcome, which was the joint indicator ROC analysis. Finally, two or three random genes were combined for ROC analysis.
Under the working conditions of R software, merged polygenic data can be modelled by the glm function, joint ROC analysis by the pROC package, and visualization by the ggplot2 package. The prediction ability of genes was identi ed by the numerical value of AUC.
2.11 GSEA and gene correlation analysis GSEA (gene set enrichment analysis) is an analytical method that derives gene sets to ascertain diverse biological functions between two phenotypes [19][20]. After loading the clusterpro ler (version 3.14.3), stat (version 3.14.3) and ggplot2 packages under the working state of R software, GSEA, Pearson correlation analysis and visualization operation were executed for the gene screened by random forest analysis and joint diagnosis ROC curve analysis. It is generally accepted that the conditions of false discovery rate (FDR) < 0.25 and i < 0.05 indicate signi cant enrichment.

Identi cation of DEGs
To exclude the batch effect, the gene expression levels of multiple GSE sets in the same PL570 or PL96 platform were standardized. The differences before and after standardization are shown in Fig. 1A and 1B. After limma calculation and inclusion condition screening, 271 upregulated genes and 42 downregulated genes were detected in the PL96 platform dataset. In addition, 172, 66 upregulated and 680,392 downregulated DEGs were selected from the PL570 and GSE13507 datasets, respectively. Then, volcano plots using Prism software (Fig. 1C) showed the meaningful DEGs screened from the three target datasets, and Venn diagrams (Fig. 1D) presented the 2 upregulated and 16 downregulated overlapping genes, which were calculated by the VennDiagram software package.

PPI network analysis
A network diagram was established for 16 downregulated genes through the PPI online network (Fig. 2A). These genes were intersected using Cytohubba in Cytoscape to identify hub genes. To ensure the rationality of the study, 12 methods, including the MCC method, were used to calculate the top 10 ranked genes. After different methods of summary and analysis, 8 downregulated genes were included for further research (Additional le1). In the TCGA database, Spearman correlation analysis of 8 downregulated hub genes and 2 upregulated genes (Fig. 2B) was performed using the R software pheatmap package. The downregulated genes were closely related, but the correlation with the upregulated genes was not obvious. Figure 2C showed the log2 expression FC values of the screened hub genes in different data sets

Gene Ontology and pathway enrichment analyses
With Homo sapiens as the background, the clusterPro ler software package analysed the screened genes and assessed their enrichment patterns. Next, BP, CC and MF enrichment analysis was performed on the downregulated DEGs (Fig. 2D). The downregulated genes were mainly enriched in muscle contraction and muscle system processes (ontology: BP), contractile bres (ontology:CC) and structural constituents of muscle (ontology: MF). KEGG pathway analysis showed that these integrated DEGs were mainly enriched in vascular smooth muscle contractions.

Gene expression analysis
First, the Oncomine database was searched to compare the mRNA transcription levels of target genes in BC samples (Fig. 3A). The results showed that the expression of TOP2A and ISG15 increased in cancer samples and decreased in control samples, while the expression of CNN1 and other genes showed the opposite outcome. In Sanchez-Carbayo's study [21], TOP2A expression in BC was signi cantly increased and was 9.402 times higher than that in normal tissues (Fig. 3B). Another typical study [22] also compared the expression of CNN1 among normal (58 cases) and benign tissues (10 cases), superior BC (126 cases) and in ltrating bladder urothelial carcinoma (62 cases) and found that its expression level decreased in turn (Fig. 3C). The expression results calculated based on the GEPIA database are shown in Figue 3D. TOP2A and ISG15 were highly expressed in cancer tissues, and other genes were poorly expressed. This was exactly the same as the result found by using the Oncomine database. Interestingly, the following paired t-test results of gene expression in adjacent and cancer tissues by R software (Fig. 3E) were still completely consistent with the rst two studies, and the P values were all less than 0.001. To further test our hypothesis, we compared the expression of screening genes in cancer samples with different pathological stages and histological grades with that in normal controls. The results showed that although TOP2A was not overexpressed in low-grade human bladder cancer versus diseasefree tissue, there were signi cant statistical expression divergences of all genes between normal tissue and different pathological stages (Fig. 3F) or histological grades of tumour tissue (Fig. 3G).

Genetic alteration and HPA database analysis
As shown in Figue 4A, the mutation rate of the target genes was between 1% and 8%. Among them, the missense mutation rate of NYH11 was the highest, which was more obvious than other mutations, such as deletion mutations. From the total mutation rate of related genes, the mutation rate of the Cornell & Trento study in 72 patients was 43.06%. Likewise, in the TCGA database containing 412 patients, the proportion still reached 27.91% (Fig. 4B). Within the "Comparison/Survival" module, the impact of ISG15 and all downregulated genes on health and survival can be explained by gene variation (Fig. 4C-D). At the protein level, immunohistochemical staining of the screened genes in the HPA database showed that the expression of TOP2A and ISG15 increased in BC, while the expression of MYH11 and ACTA2 decreased (Fig. 5). In addition, the expression of expected downregulated genes, such as CNN1, was not obvious in either cancer tissues or normal tissues; therefore, it was extremely di cult to compare them.

PSM and random forest analysis
After matching, Figue 6A shows that the average difference in propensity scores for gender and age decreased to 0. All 28 original normal samples were included, while only 122 cancer tissue samples remained. In age Group 1, due to a shortage of samples, only 22 cases were included in the analysis instead of the intended 40. The similarity of the two matched groups was identi ed by Figure 6B, and it could be found that they have good similarity. The matched data were then subjected to random forest analysis. The results indicated that the mean decrease in Gini (MDG) results of TOP2A were the largest, followed by MYH11 and MOD1 (Fig. 6C). It is generally accepted that the MDG is directly proportional to the prediction ability of the screened gene. Therefore, it could be considered that TOP2A has greater value in the diagnosis of BC than other genes. Fig. 6D came from the multidimensional scale diagram of R software. The red and blue points represent the cases in the cancer group and control group, respectively. The more obvious the aggregation of points of the same colour, and the greater the distance between different colour points in the diagram, the more signi cant the calculation result is.

Diagnostic ROC
The results of independent diagnostic ROC curve analysis showed that each gene had valuable predictive ability (Fig. 7A). Logistic regression models of all genes showed that TOP2A, ISG15, MYLK and TAGLN had statistical diagnostic value ( Table 2). The AUC calculated by from the diagnostic ROC of all genes combined was 0.996 (Fig. 7B). In view of clinical practicality, we tested combinations of 2 or 3 genes for joint diagnosis and found that the combinations of TOP2A/CNN1, TOP2A/ISG15/ACTG2 and TOP2A/ISG15/CNN1 had the largest area under the ROC curve ( Fig. 7C-D).

GSEA and gene correlation analysis
Because TOP2A holds a special position in the diagnosis of BC, TOP2A was speci cally studied. GSEA showed that the REACTOME_SIGNALING_BY_RHO_GTPASES and REACTOME_M_PHASE gene sets were signi cantly enriched (Fig. 7E). A gene correlation study (Fig. 7F) showed that 10 genes, FABP7 and SOX14, were most signi cantly correlated, of which 8 genes were statistically signi cant.

Discussion
Because BC is prone to relapse, easily progresses and has high treatment costs, it is particularly imperative to identify cancer biomarkers that can re ect the biological functions of BC. Konety et al. posited that the ideal bladder cancer marker would not only be safe, economical and stable but also have low false-positive and false-negative rates [23]. Meanwhile, diagnostic biomarkers can take many forms and may be lipids, proteins or DNA in this study, and their expression levels are highly correlated with the risk of bladder cancer [24].
In the current research, the datasets of the same PL570 and PL96 platforms were merged and standardized, and then the limma package was used to nd DEGs on these two platforms combined with the GSE13507 dataset. After the intersection of the DEGs of the three datasets, two common upregulated genes (TOP2A and ISG15) and 11 downregulated DEGs were found. Thereafter, GO and KEGG enrichment analyses of downregulated DEGs were carried out, revealing that the DEGs were associated with a decline in muscle system contraction. Next, eight downregulated genes identi ed through the PPI network, together with the above two upregulated genes, were selected as hub genes. To verify the diagnostic validity, the transcriptional level, mutation rate and protein expression of hub genes in BC tissues were compared with those of normal tissues in TCGA and GTEx datasets, and satisfactory results were found.
To validate their clinical applicability, the hub genes were analysed by random forest analysis and combined ROC analysis, and TOP2A was found to have conspicuous diagnostic advantages. In random forest analysis, TOP2A had the highest MDG value, which meant that its separate diagnostic value was the most signi cant. Among combinations of two indexes, TOP2A and CNN1 were the highest-performing pair. Among three-gene combinations, TOP2A combined with CNN1 plus ISG15 or ISG15 plus ACTG2 had the largest AUC.
TOP2A, also known as topoisomerase II α, is an enzyme that changes the topological state of DNA by breaking double-stranded DNA during mitosis and participates in the process of DNA transcription and replication in cell proliferation [25][26]. Generally, cancer is a disease characterized by malignant proliferation and death regulation disorder of tumour cells. Therefore, abnormal TOP2A overexpression is closely related to cancer. Previously, its overexpression was reported in ovarian cancer, gastric cancer, colorectal cancer and other cancer tissues [27][28][29]. In the present study, the expression of TOP2A in BC and normal tissues was compared in the GEPIA, Oncomine, HPA and R software platforms, and it was found that the expression of TOP2A in distinct stages of bladder cancer was signi cantly different from that in normal tissues. GSEA also showed mitotic phase and Rho GTPase enrichment. These results were consistent with those of Gao et al [30][31]. Moreover, similar ndings were found in mouse bladder cancer models [32]. On this basis, Zeng [33] further demonstrated that the invasiveness, proliferation and migration ability of TOP2A knockout BC cells were markedly restrained by xenograft tumour formation in nude mice and other tests. Regarding the clinical application of TOP2A for BC, Kim's report is the only one to date regarding urinary cell-free DNA [34]. His research showed that the expression level of TOP2A in BC patients was higher than that in normal persons and haematuria patients. In addition, the areas under the ROC curves of TOP2A for BC, non-muscle-invasive BC and muscle-invasive BC were 0.741, 0.701 and 0.838, respectively.
Although few studies have been performed on BC, CNN1 was found to be expressed at low levels in an array of cancer tissues [35][36][37]. Menéndez [38] reported that CNN1 can inhibit tumour formation by mediating a variety of angiogenic factors. Scratch and wound healing tests also showed that the invasion and migration ability of CNN1-overexpressing lung squamous cell carcinoma cells decreased signi cantly [39]. ISG15 overexpression has been reported to promote distant metastasis of mouse liver cells. However, in fact, its effects on different tumour cells are contradictory [40][41][42] and still need to be further clari ed by scholars.
Finally, it is worth pointing out that our research may have some limitations. First, DEGs were detected in resected tissues using microarray or sequencing technology. Unfortunately, it is unreasonable for these two techniques to be widely used in clinical practice. The search must continue for an ideal technology that can nd genetic differences in blood or urine samples economically and quickly. Another intrinsic limitation is that the speci c cancer-promoting mechanisms of the screening biomarkers are still obscure; thus, further in vivo and in vitro experiments are necessary. Last but not least, there were relatively few clinical diagnostic tests for screening gene biomarkers. In Kim's study on the diagnostic function of TOP2A in bladder cancer, the sample size of 83 cancer patients still proved to be insu cient. External validation in diverse and much larger-scale populations with longer follow-up periods may be bene cial in further studies.
In summary, the purpose of the present study was to identify the hub genes involved in the development of BC through a comprehensive bioinformatics analysis. The diagnostic effectiveness of the screened genes was veri ed by expression, mutation rate and clinical application. Declarations Figure 1 The    In the HPA database, the immunohistochemical staining of each gene in normal and cancerous tissues.
The left staining picture showed the cancerous tissue and the right demonstrated the normal one. Each line corresponded to a common gene name and antibody number.  Figure 7C showed the joint diagnostic e cacy of any two genes of 10 candidate genes, and the combination mode of obtaining the maximum AUC value was TOP2A add CNN1; Likewise, the maximum e ciency modes of joint diagnosis of three random genes were shown in Figure 7D, the combination were TOP2A + ISG15 + cnn1 and TOP2A + ISG15 + ACTG2 respectively. (E)