GEO data
The GSE120736, GSE19915, GSE19423, GSE32548, and GSE37815 gene expression profile matrix files were obtained from the Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) database. The platform of the GSE120736 dataset is the GPL10558 Illumina Human HT-12V4.0 expression beadchip, and this dataset contains 27 T1 high-grade BC tissues and 14 T1 low-grade BC tissues. The platform of GSE19915 is the GPL3883 Swegene Human 27K RAP UniGene188 array, and it contains 7 T1G3 BC tissues and 10 T1G2 BC tissues. The platform of GSE19423 is the GPL6102 Illumina human-6 v2.0 expression beadchip, and it contains 9 T1 high BC tissues and 39 T1 low BC tissues. The platform of GSE32548 is the GPL6947 Illumina HumanHT-12 V3.0 expression beadchip, and it contains 38 T1G3 BC tissues and 14 T1G2 BC tissues. The platform of GSE37815 is the GPL6102 Illumina human-6 v2.0 expression beadchip, and it contains 5 T1G3 BC tissues and 13 T1G2 BC tissues.
GSE97768, RNA sequencing data for 30 BC cell lines, can be seen as the result of the sequencing of different monoclonal cancer cells. We used this as a cancer cell isolated from a complex tumour microenvironment and performed a Gene Set Enrichment Analysis (GSEA) in a single gene to identify the biological functions involved in BC cells.
Microarray data processing
GEO2R online software was used to analyse the raw data of microarray in order to recognise the DEGs. P-value <0.05 and |FC| ≥ 1 were used as the cut‐off criteria. Subsequently, Bioinformatics & Evolutionary Genomics (http://bioinformatics.psb.ugent.be/webtools/Venn/) was used to identify overlapping DEGs in five gene expression microarrays. The up-regulated and down-regulated genes were measured.
TCGA data download and pre-processing
RNA-seq and related clinical information for human bladder transitional cell carcinoma and papilloma samples were obtained from the TCGA database website (portal.gdc.cancer.gov), containing 430 tissues and 405 cases. These data were updated on August 26, 2019. RNA-seq data containing 19 normal samples and 411 cancer samples were combined into a matrix file using the Perl language merge script (www.perl.org). The gene name is then converted from Ensembl id to a matrix of gene symbols via the Ensembl database (asia.ensembl.org/index.html). Using the edgeR R package, the genes that had the same name were combined, and the genes with an average expression of more than 1 were selected and normalised.
Cell fraction of various cells in the tumour environment (TME)
EPIC (gfellerlab.shinyapps.io/EPIC_1-1/) application is designed to estimate the proportion of immune and cancer cells from the bulk tumour gene expression data[16]. This is done by fitting gene expression reference profiles from the main non-malignant cell types and simultaneously accounting for an uncharacterized cell type without prior knowledge about it (e.g. cancer cells in solid tumours samples). EPIC established reference gene expression profiles for major tumour invasive immune cell types (CD4T, CD8T, B, NK, macrophages) and further deduced reference spectra of cancer-associated fibroblasts (CAFs) and endothelial cells. We used EPIC to calculate the cell fraction of cells in TME.
WGCNA and module preservation
WGCNA was performed using the WGCNA R package[17]. Because some genes with no significant expression changes between samples were highly correlated in WGCNA, the genes with the most differential expression were used in subsequent WGCNA. Genes with the highest 25% of DEG variance were chosen, guaranteeing the heterogeneity and accuracy of bioinformatics statistics for further co-expression network analysis[18, 19]. First, RNA-seq data were filtered to reduce outliers. The co-expression similarity matrix consisted of the absolute values of the correlation between transcript expression levels. A Pearson correlation matrix was constructed for paired genes. We constructed a weighted adjacency matrix using the power function amn = |cmn|β (cmn = Pearson correlation between gene m and gene n; amn = adjacency between gene m and gene n). The parameter β emphasized a strong correlation between genes and penalized a weak correlation. Next, an appropriate β value was selected to increase the similarity matrix and achieve a scale-free co-expression network. The adjacency matrix was then converted into a topological overlap matrix (TOM), which measures the network connectivity of genes defined as the sum of adjacent genes generated by all other networks[20]. Average linkage hierarchical clustering was performed on TOM-based dissimilarity measurements, and the minimum size (genome) of the gene dendrogram was 30. Through further analysis of modules, we calculated their dissimilarity and constructed module dendrograms.
Identify cell-related modules
Gene significance (GS) was calculated to measure the correlation between genes and sample traits and determine the significance of each module. Module eigengenes were considered the main components in the principal component analysis of each gene module, and the expression patterns of all genes were summarized as a single feature expression profile within a given module. Next, GS was defined as the log10 conversion of the p-value in the linear regression between gene expression and clinical data (GS = lgP). Module significance (MS) was defined as the average GS within the module and calculated to measure the correlation between the module and sample traits[21]. Statistical significance was determined using the relevant P values. In order to increase the capacity of the modules, we selected a cutoff value (< 0.25) to merge modules with similar heights. Here, we choose cell fraction of various cells calculated by EPIC as phenotype.
Immunohistochemistry (IHC)
In order to confirm the protein expression, immunohistochemistry data are provided on The Human Protein Atlas (www.proteinatlas.org), which aims to map all the human proteins in cells, tissues and organs[22]. After finding the pathology of urothelial carcinoma, we compared the expression of genes in cancer cells and stroma.
Expression of genes in cell lines
The Cancer Dependency Map Project at Broad Institute (depmap portal, depmap.org/portal/) systematically identifies genetic and pharmacologic dependencies and the predictive biomarkers. It contains 56202 gene expression data of 1201 cell lines from the Cancer Cell Line Encyclopedia (CCLE). We searched and compared gene expression between 36 urinary tract cell lines and 39 fibroblast cell lines.
Pathway, process and protein-protein interaction enrichment analysis
Metascape (metascape.org) provides a variety of functions, such as gene enrichment analysis and protein interaction network analysis[23]. For each given gene list, pathway and process enrichment analysis were carried out with the following ontology sources: KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways and CORUM. All genes in the genome were used as the enrichment background. Terms with a P-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 were collected and grouped into clusters based on their membership similarities. A subset of enriched terms was selected and rendered as a network plot, where terms with a similarity > 0.3 were connected by edges to capture the relationships between the terms further. Protein-protein interaction enrichment analysis was carried out to construct the subset of proteins that form physical interactions with at least one other member in the list. If the network contained between 3 and 500 proteins, the Molecular Complex Detection (MCODE) algorithm[24] was applied to identify densely connected network components.
DisNor (https://disnor.uniroma2.it/) was used to generate and investigate a protein interaction network, linking disease genes from the causal interaction information annotated in SIGNOR and protein interaction data in Mentha. The Multi-Protein Search module was selected. The first neighbour with physical interactions (score > 0.4) was chosen as the complexity level, which analysed the causal relationship between proteins.
Gene set enrichment analysis (GSEA)
Data from GSE97768 30 BC cell lines were divided into two groups according to the median expression of cancer cell candidate DEGs. GSEA (http://software.broadinstitute.org/gsea/index.jsp) was used to identify the potential functions correlated with key genes by assessing whether a series of previously defined biological processes were enriched in the gene rank derived from whole genes between the two groups. In the molecular signature database v6.2, curated hallmark gene sets were used. Terms with |normalised enrichment score| > 1, nominal P-value < 0.05, and false discovery rate of q-value < 0.25 were identified.