Controversial T1G3 bladder cancer is the key to revealing the changes in the biological functions of bladder cancer cells

Background T1G3 shows a higher chance of recurrence and progression among early bladder cancer types and the available treatment option is controversial. High recurrence and progression are the problems that need to be explored and solved. Changes in the internal signals of bladder cancer cells and differential genes may be the root cause of these problems. Methods GSE120736, GSE19915, GSE19423, GSE32548 and GSE37815 datasets were obtained from Gene Expression Omnibus (GEO ) to identify differentially expressed genes (DEGs). Bladder cancer transcript data from The Cancer Genome Atlas (TCGA) were clustered into different cell-specific gene sets according to weighted gene co-expression network analysis (WGCNA). Multiple sets of databases were used for gene expression comparison, functional enrichment, and protein interaction analysis, including The Human Protein Atlas, Cancer Dependency Map, Metascape, Gene set enrichment analysis, and DisNor. Results DEGs were obtained through GEO data comparison and intersection. After WGCNA was proven to recognise cell-specific gene sets, candidate DEGs were selected and shown to be specifically expressed in cancer cells. Candidate DEGs were related to mitosis and cell cycle. Further, 12 functional candidate markers were identified from the sequencing data of 30 bladder cancer cell lines. These genes were all up-regulated and previously shown to be closely related to bladder cancer progression. Conclusions Twelve functional genes with specific differential expression in bladder cancer cells were identified. WGCNA can identify the relatively specific expression sets of different cells in bladder cancer with greater tumour heterogeneity, which provides new perspectives for future cancer research.

Results DEGs were obtained through GEO data comparison and intersection. After WGCNA was proven to recognise cell-specific gene sets, candidate DEGs were selected and shown to be specifically expressed in cancer cells. Candidate DEGs were related to mitosis and cell cycle. Further, 12 functional candidate markers were identified from the sequencing data of 30 bladder cancer cell lines.
These genes were all up-regulated and previously shown to be closely related to bladder cancer progression.
Conclusions Twelve functional genes with specific differential expression in bladder cancer cells were identified. WGCNA can identify the relatively specific expression sets of different cells in bladder cancer with greater tumour heterogeneity, which provides new perspectives for future cancer research.

Background
Bladder cancer (BC) is the ninth most frequently diagnosed cancer worldwide and 13th in terms of cancer-related deaths [1]. Most cases (75%) are diagnosed with non-muscle-invasive bladder cancer (NMIBC), yet 70% of NMIBCs typically recur and 25% progress to muscle-invasive disease [2]. In NMIBC, the best post-operative treatment of high-grade T1 (T1HG) or T1G3 has always been 3 controversial, and no consensus has been reached yet. T1G3 with less differentiated cancer cells are more likely to relapse and progress than other T1 stage BCs [3]. Bladder preservation with bacillus Calmette-Guerin (BCG) induction and maintenance, second transurethral resection (TUR), or radical cystectomy are the current treatments for T1G3 [4,5]. Progression rate according to the stage at re-TUR in T1 BC patients treated with BCG can reach approximately 25% [6]. As a result, the search for the optimal management of T1G3 continues [7].
There is, therefore, a need to identify biomarkers that are capable of distinguishing between progressive and non-progressive T1 BC or identify multiple targets as the basis for combination therapy. A retrospective multicentre study, in which a total of 96 patients with T1G3 urothelial carcinoma of bladder were included, used Biomark Fluidigm Arrays based on the quantitative polymerase chain reaction (qPCR), allowing validation of differentially expressed genes in a larger series. ANXA10, DAB2, HYAL2, SPOCD1, and MAP4K1 were identified to predict the progression in T1G3 BC patients [8]. Another study containing specimens from TUR of 92 patients with T1HG BC identified TOPO-2α, p16, survivin, and E-CAD to assist the accurate prediction of survival in single T1 high-grade urothelial carcinoma, that is less than 3 cm in diameter [9]. In the last few years, marker research has been conducted not only in pathological tissues but also in liquid biopsies from blood and urine collected from patients with T1G3 [10,11].
Based on the computational principles, weighted gene co-expression network analysis (WGCNA) is not only used for topology network construction to identify highly co-expressed genes [12] but is also well suited to decompose high-throughput sequencing data into subsets of genes with cell-specific expression from high heterogeneity tissues. Fluctuations in the proportion of various cells within the tumour, as well as differences in the collection of pathological tissue sites, can result in significant differences in the sequencing results. The change in gene expression is mainly affected by a specific dominant cell population within the tumour. Genes are not specifically expressed in cells that are easily affected by the proportion of other cells. The expression level of the cell-specific genes is not easily affected by the proportion of other types of cells, changing with the change in cell ratio and having a high co-expression feature. Based on this, WGCNA used tumour heterogeneity to find 4 specific gene subsets derived from cancer cells in BC.
In order to resolve the controversy surrounding T1G3 BC, it is important to understand the abnormal signals and find aberrantly expressed genes in cancer cells, rather than finding the differentially expressed genes (DEGs) obtained after comparison of samples. With the increased use of the highthroughput sequencing and microarray data, the application of bioinformatics in BC is increasing [13][14][15]. However, few studies are focused on T1G3 either due to the insufficient pathology or the absence of normal tissue for comparison. In Gene Expression Omnibus (GEO), we selected 5 items that met the requirements, including T1HG/T1G3 and low-grade T1 (T1LG)/ T1G2. By comparing and using the intersection to obtain DEGs, we aimed to bring DEGs into the gene sets obtained by WGCNA to identify DEGs in cancer cells and used BC cell lines to predict functional candidate markers.

GEO data
The GSE120736, GSE19915, GSE19423, GSE32548, and GSE37815 gene expression profile matrix normal samples and 411 cancer samples were combined into a matrix file using the Perl language merge script (www.perl.org). The gene name is then converted from Ensembl id to a matrix of gene symbols via the Ensembl database (asia.ensembl.org/index.html). Using the edgeR R package, the genes that had the same name were combined, and the genes with an average expression of more than 1 were selected and normalised.

Cell fraction of various cells in the tumour environment (TME)
EPIC (gfellerlab.shinyapps.io/EPIC_1-1/) application is designed to estimate the proportion of immune and cancer cells from the bulk tumour gene expression data [16]. This is done by fitting gene expression reference profiles from the main non-malignant cell types and simultaneously accounting for an uncharacterized cell type without prior knowledge about it (e.g. cancer cells in solid tumours samples). EPIC established reference gene expression profiles for major tumour invasive immune cell types (CD4T, CD8T, B, NK, macrophages) and further deduced reference spectra of cancer-associated fibroblasts (CAFs) and endothelial cells. We used EPIC to calculate the cell fraction of cells in TME.

WGCNA and module preservation
WGCNA was performed using the WGCNA R package [17]. Because some genes with no significant expression changes between samples were highly correlated in WGCNA, the genes with the most differential expression were used in subsequent WGCNA. Genes with the highest 25% of DEG variance were chosen, guaranteeing the heterogeneity and accuracy of bioinformatics statistics for further coexpression network analysis [18,19]. First, RNA-seq data were filtered to reduce outliers. The coexpression similarity matrix consisted of the absolute values of the correlation between transcript expression levels. A Pearson correlation matrix was constructed for paired genes. We constructed a weighted adjacency matrix using the power function amn = |cmn|β (cmn = Pearson correlation between gene m and gene n; amn = adjacency between gene m and gene n). The parameter β emphasized a strong correlation between genes and penalized a weak correlation. Next, an appropriate β value was selected to increase the similarity matrix and achieve a scale-free coexpression network. The adjacency matrix was then converted into a topological overlap matrix (TOM), which measures the network connectivity of genes defined as the sum of adjacent genes generated by all other networks [20]. Average linkage hierarchical clustering was performed on TOMbased dissimilarity measurements, and the minimum size (genome) of the gene dendrogram was 30.
Through further analysis of modules, we calculated their dissimilarity and constructed module dendrograms.

Identify cell-related modules
Gene significance (GS) was calculated to measure the correlation between genes and sample traits and calculated to measure the correlation between the module and sample traits [21]. Statistical significance was determined using the relevant P values. In order to increase the capacity of the modules, we selected a cutoff value (< 0.25) to merge modules with similar heights. Here, we choose cell fraction of various cells calculated by EPIC as phenotype.

Immunohistochemistry (IHC)
In order to confirm the protein expression, immunohistochemistry data are provided on The Human Protein Atlas (www.proteinatlas.org), which aims to map all the human proteins in cells, tissues and organs [22]. After finding the pathology of urothelial carcinoma, we compared the expression of genes in cancer cells and stroma.

Expression of genes in cell lines
The

Pathway, process and protein-protein interaction enrichment analysis
Metascape (metascape.org) provides a variety of functions, such as gene enrichment analysis and protein interaction network analysis [23]. For each given gene list, pathway and process enrichment analysis were carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways and CORUM. All genes in the genome were used as the enrichment background. Terms with a P-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 were collected and grouped into clusters based on their membership similarities. A subset of enriched terms was selected and rendered as a network plot, where terms with a similarity > 0.3 were connected by edges to capture the relationships between the terms 8 further. Protein-protein interaction enrichment analysis was carried out to construct the subset of proteins that form physical interactions with at least one other member in the list. If the network contained between 3 and 500 proteins, the Molecular Complex Detection (MCODE) algorithm [24] was applied to identify densely connected network components.
DisNor (https://disnor.uniroma2.it/) was used to generate and investigate a protein interaction network, linking disease genes from the causal interaction information annotated in SIGNOR and protein interaction data in Mentha. The Multi-Protein Search module was selected. The first neighbour with physical interactions (score > 0.4) was chosen as the complexity level, which analysed the causal relationship between proteins.

Gene set enrichment analysis (GSEA)
Data from GSE97768 30 BC cell lines were divided into two groups according to the median expression of cancer cell candidate DEGs. GSEA (http://software.broadinstitute.org/gsea/index.jsp) was used to identify the potential functions correlated with key genes by assessing whether a series of previously defined biological processes were enriched in the gene rank derived from whole genes between the two groups. In the molecular signature database v6.2, curated hallmark gene sets were used. Terms with |normalised enrichment score| > 1, nominal P-value < 0.05, and false discovery rate of q-value < 0.25 were identified.

Identification of DEGs
In total, 86 T1G3/T1HG samples and 90 T1G2/T1LG samples were considered in this study. The online GEO2R tool was utilized to determine the DEGs based on cut-off values: P values <0.05 and | logFC | ≥1. We obtained overlapping DEGs of the five or random four datasets, including 19 up-regulated and 13 down-regulated genes ( Figure 1) to confirm the reliability of DEGs in BC.

DEGs in modules constructed by WGCNA
Although the tumour purity was high in the T1 tumour tissue, we do not think that the identified DEGs are not specifically or highly expressed in cancer cells. We used TCGA urothelial carcinoma data with high tumour heterogeneity (mainly stage ii/iii /iv) to cluster previously identified DEGs by WGCNA.
After clustering, we found that DEGs were mainly concentrated in several modules (Figure 2A).
Among them, the brown and green modules had the most genes, with 9 and 8 genes, respectively ( Figure 2B). In order to further prove that the modules recognised by WGCNA are cell-specific expression gene sets, the different cell fractions calculated by EPIC (Table S1) were correlated to the modules obtained by WGCNA. The green module had a correlation with CD4 T cell components of more than 50% ( Figure 2C). Therefore, these genes are specifically expressed in CD4 T cells, and not in cancer cells. The brown, red, sky-blue, steel-blue, and tan modules were positively correlated ( Figure 2A). These modules and candidate DEGs are highly likely to be derived from BC cells and may have relatively high expression specificity ( Figure 2D).

WGCNA identifying cell-specific expressing modules
After analysing the calculation principles, we thought that WGCNA was suitable for identifying cellspecifically expressed genes. In order to find genes specifically expressed in cancer cells in pathological tissues, it was needed to demonstrate that WGCNA can mine cell-specific expression genes. WGCNA-identified modules correlated with various cellular fractions, the correlation between B cells and the cyan module reached 0.9, and the correlation between CAFs and the pink module RNA expression of the 10 genes was significantly higher in fibroblasts than in urothelial cells ( Figure   3D) according to the data downloaded from the Cancer Dependency Map (Table S2). The genes within a module recognised by WGCNA were highly co-expressed and specifically or relatively specifically expressed by certain types of cells. Based on this, we can identify relatively specifically expressed DEGs in cancer cells ( Figure 4A).

Expression of candidate DEGs in cancer cell
The 16 candidate DEGs focus on several highly relevant modules (Figure 2A). These genes are likely to be expressed relatively specifically by cancer cells. In a comparison of IHC and the cell line RNAseq, we confirmed the relative specific expression of these genes at the protein and RNA levels. In IHC images, the staining of urothelial cancer cells was significantly higher than the surrounding tissues or cells ( Figure 4B). The comparison of urothelial cancer cells with fibroblasts (the main component in the matrix) reflected the specific degree of expression of these DEGs in cancer cells better ( Figure   4C). In addition to GSDME, the other 15 genes had higher RNA expression in urothelial cancer cell lines than in surrounding tissues or cells (Table S3).

Protein-protein Interaction Analysis
Based on Metascape, the MCODE networks identified for 16 candidate DEGs were gathered, and components of only one densely connected network (AURKB, CCNB2, CDC20, and CDCA5) were identified ( Figure 5C). Pathway and process enrichment analysis were applied to each MCODE component, and the three best-scoring terms by P-value were retained as the functional description of the corresponding components, shown in the tables underneath corresponding network plots. In DisNor, we imported 16 candidate DEGs and found that there was no regulatory relationship or protein interaction between 11 genes identified ( Figure 6).

BC and candidate markers
We searched Pubmed for studies of candidate DEGs in BC, and the role of 11 genes in BC was clarified. We summarized the 11 genes in terms of chemotherapy resistance, stem cell characteristics, progression/poor prognosis/relapse, migration/invasion/proliferation (Table 1). These 11 genes and SAPCD2 were functionally enriched in single-gene GSEA. Therefore, we defined the 12 functional genes as candidate markers (AURKB, CCNB2, CDC20, CDCA5, CKS2, E2F2, MCM2, MELK, NUSAP1, SAPCD2, TOP2A, and TRIP13) and speculated that these 12 up-regulated genes interfere with biological function signals in BC cells to jointly promote cancer progression.

Discussion
The high recurrence rate in early BC and low survival rate in advanced BC continue to be problems with poor scope for improvement [41]. With the occurrence and progress of tumours, the components and signal exchange in tumours will become more and more complicated, and the possibility of a cure will gradually decrease. Abnormal signals in cancer cells causing tumorigenesis and development have not been identified. Our study aimed to find DEGs in cancer cells, and the poor prognosis of early BC T1G3 is the key to solving the problem.
At present, many studies compare the differences between tissue samples and directly use DEGs as the difference between cancer cells for experiments and analysis. However, most of these DEGs may not be caused by signal differences in BC cells, nor are they relatively specific or highly expressed in BC cells. The signal changes in cancer cells are not singular, and targeted therapy targeting a single gene may not be effective. To date, no studies have identified DEGs inside BC cells. We compared T1 stage tumours with higher tumour purity, and the tumour components of T1G3/T1HG and T1G2/T1LG were very similar, but the DEGs obtained at this time could not be used as candidate genes. The candidate DEGs screened by the WGCNA construction module are relatively specific and have high expression in the cancer cell. Most of candidate markers' functions have been verified in BC, but no known studies have been undertaken to analyse them.
WGCNA is based on the construction of scale-free networks and overlapping topological networks, clustering highly co-expressed genes into one module. The highly co-expressed genes recognised by WGCNA can also be recognised as a class of cell-specifically expressed genes. In highly heterogeneous tumours, the expression levels of these specifically expressed genes will change with the proportion of such cells, showing a high degree of co-expression. The high correlation between the cell ratio of CAFs and the pink module and the differential expression of genes within the pink module in cancer cells and CAFs confirmed our inference. Based on this inference, we continued to identify 16 candidate DEGs that are relatively specifically expressed by cancer cells.
The 12 candidate markers were relatively specific and functional in cancer cells in BC. Genes relatively specific and differentially expressed in BC cells are important to explore changes in biological functions in BC. The function of the 12 candidate markers was related to mitosis and the cell cycle (Fig. 5B), suggesting a close relationship with cancer progression. The 12 markers had similar functional enrichment and were related to BC progression or drug resistance in previous studies, excluding SAPCD2 (Table 1). There may be protein-protein interactions between AURKB, CCNB2, CDC20, and CDCA5 (Fig. 5C). However, according to the results in DisNor, there is no direct relationship between them (Fig. 6), which means that it is difficult to control BC by targeting a single gene effectively, and the relationship between them should be clarified and then combined with 13 targeted therapy.
This study takes advantage of the characteristics of different clinical stages of BC to find out the DEGs inside BC cells with a new perspective, based on the characteristics of WGCNA. Next, data from 30 BC cell lines with monoclonal characteristics were used to predict biologically functional DEGs as candidate markers, excluding interference from TME. Through clever design, we can find out the relatively specific differentially expressed functional genes in BC cells. Perhaps, this can start the era of joint targeted therapy for BC. Such an approach may also provide some help for other cancers.
However, we did not further explore the relationship between these 12 genes through experiments and needed to study this further.

Conclusions
Our study aimed to provide insight into the difficult problem of T1G3 BC. We identified relatively

Funding
The present study was supported by 345 Talent Project (Shengjing Hospital of China Medical University).

Availability of data and materials
The datasets used and/or analysed during the current study are available from GEO (www.ncbi.nlm.nih.gov/geo/) and TCGA (portal.gdc.cancer.gov).

Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable Figure 1 Overlapping DEGs of the five or random four datasets.

Figure 1
Overlapping DEGs of the five or random four datasets.       Entering 16 genes in DisNor, only 11 genes were identified. There is no direct relationship between them.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.