Constructing the Brain-UMAP / Clustering of gene expression data identifies diverse disease types
To characterize and better understand the molecular intricacies of brain tumors, we downloaded uniformly processed RNA-seq abundances values from recount-brain, a curated repository for human brain RNA-seq datasets, for three different uniformly processed datasets − 702 adult glioma samples from TCGA1, 270 adult glioma samples from CGGA5,6, 1409 healthy normal brain samples from GTEx4 across 12 GTEx-defined brain regions (Supplementary Table 1a). Retrieving data from recount7 ensured that consistent bioinformatic pipelines were used for these three datasets thus resulting in no batch effects between the three datasets.
The most common solid tumors in children are brain tumors with approximately 1.15 to 5.14 cases per 100,000 children in the United States alone8. To adequately represent a wide range of CNS tumors in our reference landscape, we additionally included 802 pediatric tumor samples (Supplementary Table 1b) from the Children Brain Tumor Network (CBTN)3. Figure 1a represents an overview of data sources (details in Supplementary Table 1c)
Gene expression data from each of the datasets was converted to units of Transcripts per million (TPM) to avoid inter-pipeline difference and were limited to a common set of 19142 protein-coding genes9. While both CBTN and recount used different bioinformatic pipelines (Supplementary Table 1c), in order to ensure that there were no batch effects we used combat10 method from the R package “sva” to remove unwanted variation in our combined dataset.
We explored three different dimension reduction techniques (Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (tSNE) and uniform manifold approximation and projection (UMAP) for data visualization. We chose UMAP to build a Brain-UMAP (Fig. 1b) on batch corrected TPM integrated counts, as UMAP segregated the mini clusters well and was very effective in visualizing clusters and their relative proximities (Supplementary Fig. 1a).
While 2-dimenisonal representation of the data is helpful, we also provide a 3-dimensional representation of the data on the interactive web-based platform Oncoscape5) where users can easily toggle among and compare different patient groups, while using a suite of interoperable tools.
At a first glance, distinct clustering of samples is observed. The adult glioma samples from the different datasets, TCGA-LGG, TCGA-GBM and CGGA, co-localized closely within the same region of the UMAP, whereas the pediatric tumor samples clustered between the GTEx healthy normal brain and the adult glioma samples. (Fig. 1b).
Distinct gene expression profiles in normal human brain
The 1409 normal brain samples segregated into two distinct clusters of multiple supratentorial regions and cerebellum, as we have previously shown11 (Fig. 1c). The supratentorial regions further revealed three anatomically distinct regions for basal ganglia (caudate, nucleus accumbens, putamen), cortex (amygdala, Brodmann Area 24, cerebral cortex), and the midline structures (spinal cord, substantia nigra and hypothalamus). We confirmed that different classifiers such as postmortem interval (PMI), age (years), sex, Hardy score and type of sample preparation (Supplementary Fig. 1b) were not associated with distinct clustering patterns, suggesting that the sample clusters we observe are based on actual biological differences between these brain regions rather than sample preparation parameters. However, we observed that a few samples with much lower RNA integrity number (RIN) compared to all the other samples from different brain regions converged at a point (Supplementary Fig. 1b).
Clustering of transcriptomic glioma datasets reveals distinct glioma subtype
Within the adult glioma clusters, we observe that while the samples from the two TCGA datasets cluster together, there is a more continuous pattern when looking at gene expression profiles for samples from TCGA-GBM and TCGA-LGG (Fig. 2a). This contrasts with the distinct clusters observed by analyzing whole exome single nucleotide alterations (SNAs) and whole genome copy number alterations (CNAs) from the same patients, as shown by Bolouri12 et al. In line with previous reports, we observed that the age of the patients at diagnosis for the TCGA-GBM and TCGA-LGG samples revealed a sharp gradient illustrating the known correlation between age and outcome (Fig. 2b). By contrast, patient gender was not associated with any specific clusters (Supplementary Fig. 2a). Regarding chromosomal alterations, tumors with a gain of chromosome 7 and hemizygous deletion of chromosome 10 (Fig. 2c) or co-gain of chromosome 19 and 20 (Supplementary Fig. 2b) co-localized in the top area of the UMAP containing the TCGA-GBM samples (Fig. 2c). Tumors exhibiting a co-deletion of chromosome 1p and 19q and mutation in isocitrate dehydrogenase 1 and 2 genes (IDH-mut) (Fig. 2d) were concentrated in the lower half of the UMAP containing the TCGA-LGG samples. Tumors containing mutation in IDH1 (Fig. 2e), TP53 (Fig. 2f) and ATRX (Fig. 2g) were also found to be concentrated and clustered together in specific regions of the adult glioma UMAP. Using the supervised DNA methylation classification, transcriptional subtype, MGMT promoter status and TERT promoter status (Supplementary Fig. 2c-f) to color the adult gliomas from TCGA-LGG and TCGA-GBM also revealed distinct patterns across the adult glioma landscape. Selecting for common glioma mutations and copy number alterations clearly shows three distinct subtypes of glioma - IDHmut-1p19q co-deleted oligodendrogliomas, the IDH mutated astrocytomas with p53 and ATRX mutations, and the wild-type IDH (IDH-wt) glioblastomas with gain of chromosome 7 and loss of chromosome 10 molecular GBM. (Fig. 2h).
Adult glioma subtypes from TCGA and CGGA show similar gene expression profiles
We next assessed if glioma samples of similar molecular subtypes from the TCGA and CGGA datasets exerted similar gene expression profiles and co-localized in their respective clusters. Similar to the TCGA samples, using grade, IDH mutation status and chromosome 1p 19q codeletion status for the CGGA samples, three distinct subtypes of adult glioma from the CGGA samples were observed. Interestingly, we observed that the oligodendrogliomas, astrocytomas and IDH-wt tumors from both CGGA and TCGA colocalized (Fig. 3a-f). Separated from the large cluster containing adult glioma subtypes, there were two small clusters of gliomas from the CGGA dataset. Based on their grade and IDH mutation status, one cluster consisted of a mix of IDH mutant grade 2 and grade 3 oligodendrogliomas while the second cluster consisted of grade 4 IDH-wt glioblastomas. For the remainder of the paper, we will refer to the adult glioma datasets from TCGA and CGGA by their molecular subtypes (oligodendrogliomas, astrocytomas and IDH-wt glioblastomas).
We analyzed the survival for each subtype for all three glioma subtypes and observed similar survival rates between TCGA and CGGA datasets for the respective subtypes. (Fig. 3g). We then used a nearest neighbor approach to predict the survival for different UMAP subregions. Survival was predicted for samples with the median survival of its nearest neighbors, present in close proximity on the UMAP landscape (Fig. 3h).
We observed a small number of glioma samples formed an isthmus connecting to the normal brain samples. These samples were characterized by a low amount of copy number alterations and were a mix of IDH-wt glioblastomas, IDH-mut oligodendrogliomas or astrocytoma. These tumors were characterized by longer survival compared to other gliomas of their respective molecular subtype. It is conceivable that in some samples this region may represent either early forms of gliomas or a mix of glioma and normal brain.
Taken together, our results show that the transcriptomic data from different datasets can be combined to generate a population of adult gliomas, creating a landscape where the location of specific tumor sample can be predictive of subtype and outcome.
Pediatric Tumors cluster by disease type
We added the 802 pediatric tumors from the CBTN dataset to our analysis to compare their gene expression patterns to both normal brain and adult glioma samples (Fig. 4a-b). We observed the formation of distinct subclusters for several tumor types that correlated with established molecular subgroups. As an example, we observed that medulloblastoma samples were split into three distinct clusters that correlated with known Medulloblastoma subtypes13 (Wnt, Sonic hedgehog (SHH) and groups 3,4, Supplementary Fig. 4a). Similarly, Ependymomas (EPN) samples formed several clusters that correlated with the anatomic tumor location (supratentorial (ST)-EPN, spinal-EPN, and posterior fossa (PF)-EPN) 13(Fig. 4a, Supplementary Fig. 4b). Pediatric pilocytic astrocytomas (PAs) and pediatric low-grade gliomas clustered closely together suggesting that they exert similar gene expression patterns. The schwannomas separated from the pediatric tumors and were located near the neurofibromas, which were localized adjacent to the malignant peripheral nerve sheath tumors (MPNST). The subependymal giant cell astrocytoma (SEGA) form a tight cluster as do the meningiomas. Interestingly, we observed that neurocytomas, DNET, ganglioglioma clustered near normal brain samples, specifically the hypothalamus and amygdala samples. Building a UMAP with just normal brain samples and the pediatric tumors, showed similar clustering profile (Supplementary Fig. 4c). Taken together, these results suggest that pediatric tumors cluster by disease type and also form subclusters made by subtypes in the case of medulloblastomas and ependymomas.
Using the reference landscape to understand pathway regulation
Alterations in signaling pathways are a hallmark of cancer and understanding the extent to which these pathways are dysregulated in tumor samples compared to healthy normal brain can help inform researchers about the underlying mechanisms of different cancer types.
Bulk gene expression from adult glioma, pediatric brain tumors and healthy brain samples was subjected to a Gene Set variation Analysis (GSVA) and the gene set variation scores for each pathway were used to color in the Brain-UMAP. A score closer to 1 represented up-regulation of pathway in the given samples, whereas a score closer to -1 represented down-regulation of the pathway. We calculated GSVA scores for all pathways present in Reactome14, KEGG15 and biocarta16 pathways. We then tested whether there is a difference between the GSVA enrichment scores from different pair of phenotypes using a linear model and moderated t-statistic.
As examples, we found that 605, 589, and 529 (Supplementary Table 2a-c) pathways were up-regulated in IDH-wt glioblastomas, IDH-mut astrocytomas and oligodendrogliomas respectively compared to healthy brain samples from GTEx. A total of 456 pathways (Supplementary Table 2d) were up regulated in all three adult glioma subtypes compared to healthy brain. The top pathways which were enriched in all adult glioma subtypes were pathways enriched for cell cycle, DNA repair, translation, splicing, oncogenic signaling pathways such as RAS pathway, Notch pathway, MHC pathway, PI3K/AKT Signaling, Wnt pathway, SHH pathway (Fig. 5, Supplementary Fig. 5a). Additionally, neurotransmitter pathways, calcium channel, potassium channel opening pathways were up regulated in healthy brain regions compared to pediatric tumors and adult gliomas (Supplementary Fig. 5b).
Interestingly, we noted that the two small clusters (IDH mutated grade 2 and grade3 oligodendroglioma and grade 4 IDH-wt glioblastomas) from the CGGA dataset were enriched in pathways related to olfaction, glucoronidation, ascorbate and aldarate metabolism and xenobiotics (Supplementary Fig. 5c) in comparison to the main adult glioma cluster.
While visualizing pathways across the reference Brain-UMAP is extremely informative, researchers exploring targets for drug development may be also interested in investigating individual genes of a particular pathway. For example, Reactome’s mismatch repair pathway (R-HAS-5358508) and Biocarta RELA pathway (M10183) were both found to be up-regulated in all adult glioma subtypes compared to healthy normal brain, coloring in the gene expression for individual genes over the Brain-UMAP show different gene expression patterns across members of the same pathway. (Fig. 6, Supplementary Fig. 6). For example, almost all the genes in the mismatch repair pathway have elevated gene expression levels in medulloblastoma, except RPA3 and POLD4.
Studying candidate genes at multiple genomic levels
While bulk gene expression exhibited illuminating patterns in our data, analyzing other genomic information such as copy number alteration, gene fusions, or somatic alteration for each of these tumors can further enhance our understanding for a given gene of interest. Since processed copy number, gene fusions and somatic variants were publicly available only for two out of five datasets (TCGA and CBTN), we first built a much smaller UMAP using only the bulk gene expression data from these two datasets. The resulting UMAP (Fig. 7a) showed a similar clustering pattern as our original Brain-UMAP. Next, we downloaded the copy number calls, gene fusion calls and somatic variants for adult gliomas from Genomic Data Commons (GDC) and the pediatric tumors from CBTN.
Of note, the IDH-mut oligodendrogliomas, astrocytoma and pediatric high-grade gliomas have the highest mutational burden and number of gene fusions across all brain diseases (Fig. 7b-c, Supplementary Fig. 7a-b) in comparison to copy number profiles (Fig. 7d-e) which show a mixture of genes with amplified and lost copy number profiles across all brain diseases. Gene fusions are another class of potential oncogenic drivers in cancer, including pediatric cancers.17 We observed that specific gene fusions and/or gene fusion partners were enriched in specific cancer subtypes. Adult IDH-wt glioblastoma frequently harbored gene fusions involving EGFR (17.7% of tumors, most commonly EGFR − PSPHP1, EGFR − LINC01445, EGFR − SEC61G − DT), whereas pediatric lower-grade gliomas frequently harbored BRAF fusions (KIAA1549 − BRAF in 32% of low-grade gliomas and in 60% of pilocytic astrocytomas). In addition, supratentorial ependymomas most commonly harbored the C11orf95 − RELA fusion (71% of tumors, also known as ZFTA-RELA) and meningiomas most frequently harbored fusions in NF2 and YAP1 (both 15% of tumors) (Supplementary Fig. 7c).
Armed with additional genomic information such as gene fusions, copy number variation and somatic alternation, we investigated members of the Reactome mismatch repair pathway (R-HAS-5358508) (Supplementary Fig. 8a-c). We observe that different genes belonging to this pathway exhibit different trends, for example genes such as POLD1(chr19), RPA2(chr1), and LIG1(chr19) loose a copy in IDH-mut oligodendrogliomas, while genes RPA3(chr19), PMS2(chr19), PCNA(chr20) and POLD2(chr7) gain a copy in IDH-wt GBM. While most genes show elevated gene expression levels for all brain tumors, of interest is EXO1 and RPA3 which are only up-regulated in IDH-wt GBM samples. While members of this pathway do not form gene fusions, they get mutated in different brain tumors. For example, in pediatric high grade tumors, we observed mutations in all the members of the mismatch repair pathway - MSH2 (22%), MSH6 (15.71%), POLD3 (14.29%), MSH3 (12.86%), LIG1(11.43%), EXO1(10%), PMS2(10%), PCNA(10%), POLD1 (8.57%), RPA2(8.57%), MLH1 (7.14%), RPA1(7.14%), POLD2(5.71%), POLD4(5.71%) and RPA3(5.71%).
Oncogenes show altered gene expression in tumor samples, leading to abnormal phenotype in samples. Understanding gene expression patterns across various cancers of the nervous system can further our understanding of the disease. When studying known oncogenes such as EGFR, PTEN and CIC at a gene level (Fig. 8), as expected, we observe high number of mutations, high number of gene fusions, amplified gene expression values and copy number gains for EGFR across IDH-wt GBM. This contrasts with PTEN which shows loss of 1 copy in IDH-wt GBM samples. CIC, a transcriptional repressor, shows high number of mutations and copy number loss in oligodendrogliomas. For pediatric tumors, we observe BRAF gene fusions (Fig. 8) in 63% pilocytic astrocytoma tumors and 34% of low grade pediatric tumors (Supplementary Table 3a) ALK (Fig. 8) mutations are also observed in 38% high-grade pediatric tumors, 33% spinal cord ependymomas, 22% ATRT and 21% of the medulloblastoma tumors (Supplementary Table 3b).This reference landscape is a useful research tool for the scientific community, where researchers can explore existing data to increase their understanding of oncogenic pathways and individual genes that make up these pathways, potentially uncovering candidates for novel therapeutic targets. By providing access to this reference landscape via an open source website like Oncoscape, we provide an interactive tool to researchers doing CNS research.