Patient recruiting and characterization
Glioblastoma tissues, corresponding genomic data and patients' clinical information (diagnosis, gender, age, WHO grade and overall survival (OS)) were obtained from 471 GBM patients (GBMs). 422 GBM patients were recruited from Beijing Tiantan Hospital while 49 GBM patients recruited from Sun Yat-Sen University Cancer Center and Jilin University First Hospital. Among all these participants, 109 patients’ tumors were performed with whole exon sequencing (WES) and 358 patients and 20 normal brain controls were sequenced by RNA-Seq. After quality control, 463 samples thus remained for the following analysis procedure, 62 patients have both WES and RNA-seq data. All experiments were performed in accordance with the guidelines of the institution ethics committee (AAALAC standard), and were conducted in accordance with principles expressed in the Declaration of Helsinki. This research was approved by ethic committees of the participating institutes. Besides, informed consent was obtained before any performance.
According to the World Health Organization (WHO) 2016 and 2021 classification of central nervous system tumors, patients in this study were diagnosed by three independent experienced pathologists based on histological characteristics combined with molecular characteristics 29,30. In this study, glioblastoma were further divided into PRS types, which refers to primary GBM (pGBM), secondary GBM (sGBM) and recurrent glioblastoma (rGBM). pGBM is primary GBM without any clinical or histologic evidence related to less malignant lesions. sGBM in this study refers to glioblastoma that developed from low grade glioma (LGG) by definition, usually accompanied with IDH mutation. In addition, rGBM refers to a patient whose GBM has been diagnosed to relapse again. Considering the impact of IDH mutation status on the prognosis of GBMs, whole exome sequencing (WES) and RNA sequencing (RNA Seq) were performed on 471 individuals of East Asian descent, which were grouped by EAS IDH1wt and IDH1 mutant groups. IDH1 mutation status in this study was detected by Sanger sequencing independently, 1p19q co-deletion status was checked by Fluorescence in situ hybridization (FISH), MGMT promoter methylation status was determined following previous protocol 31.
Sample preparation and sequencing
Glioblastoma tissues were snap-frozen in liquid nitrogen immediately after surgical resection and preserved in liquid nitrogen. First, we extracted genomic DNA from the tumor and matched blood samples, and confirmed its high integrity by 1% agarose gel electrophoresis. DNA was then fragmented for quality control, and paired libraries were prepared. For whole exome sequencing, Agilent SureSelect kit v5.4 was used for target capture. Sequencing was performed on the Illumina HiSeq 4000 platform using paired-end sequencing strategy. Total RNA was extracted from each tumor sample, followed by removing tRNA and rRNA. After quality control and quantification, mRNA was reverse transcribed into cDNA, then followed by library preparation and sequencing.
Somatic variant identification
Short sequence reads were mapped to the human reference genome GRCh37 using BWA-MEM (v.0.7.17) with default parameters. Following GATK best practice, PCR duplicates were first removed and subsequently realigned and recalibrated (available at https://github.com/gis-rpd/pipelines, GATK v.4.1.9.0). Similar steps, including marking duplication, realignment and recalibration, were performed for BAM files. SNVs were identified using MuTect (v.1.1.7). Tumor samples were then used to call somatic mutations against the paired normal. In addition, based on the validation results, we further filtered out false positive mutations and retained only those supported by at least five mutation reads with variant allele frequency (VAF) > 0.05. Somatic insertions and deletions (Indels) were double checked using Strelka (v.1.0.15) with default parameters. All qualified somatic variants were annotated using Oncotator (v.1.9.9.0).
Driver genes identification
To identify driver mutations in EAS and EUR cohorts, the MutSigCV (v.1.41) and Maftools R package were used to infer significantly mutant driver genes (q<0.1 in any caller, and non-silent mutations n≥5). We annotated GBM drivers with a combination of driver lists: (1) Significantly mutated genes in GBM proposed by previous publications; (2) GBM associated proto-oncogene and tumor suppressor genes in Genomic Data Commons (GDC) database 32; (3) New drivers identified from the EAS cohort in this study.
Mutational signatures deconvolution and identification
To uncover etiological factors that drive GBM oncogenesis and pathogenesis, Maftools and YAPSA R package was used for de novo discovery of mutational signatures in EAS-GBM cohort. The MutationalPatterns R package was then employed to characterize and visualize results from de novo mutational signature inference. The same protocol was performed on the TCGA cohort. Bases mutational patterns in each category were then assigned to single base substitutions (SBS) features as previously reported in Catalogue of Somatic Mutations in Cancer (COSMIC), proportions of each signature were then estimated to investigate its contribution to GBM formation.
Copy number variation (CNV) of EAS and EUR-GBMs
Sequenza (v.3.0.0) was firstly employed to estimate copy number profile (including allele-specific copy number), tumor purity and ploidy for each patient (18). Default settings were used following Sequenza documentation. GISTIC (v.2.0.23) was used to identify significantly amplified and deleted regions. Output segmentation files generated from Sequenza were then used as input for GISTIC (19). GISTIC parameters were set as default according to the manual. Chromosome arms were labeled as ‘altered’ in each cohort if GISTIC q<0.1. GII was calculated as the percentage of a tumor genome showing a copy number deviated from the median of the genome (all CNVs are integers as inferred by Sequenza). GII deletion was defined as the percentage of a tumor exhibiting significant decreased copy number. Then, microsatellite instability (MSI) indexes were evaluated by MSISensor via inputting alignment bam files.
Transcriptome clustering using non-negative matrix factorization
Raw RNA-seq reads from 388 tumors and 20 normal samples of EAS cohort and 160 tumor samples of EUR cohort were aligned to the reference genome (UCSC hg19 with annotations from GENCODE v.37) using STAR (v.2.7.3a). 30 RNA-Seq samples were removed from the cohort since PCA results detected that potential RNA degradation occurred in these samples. Gene expression profile was subsequently quantified using RSEM (v.1.3.3). Expression counts were estimated by RSEM, TPM were subsequently calculated by dividing the read counts by the length of each gene in kilobases. For clustering of tumor samples, the top 3000 most variable coding genes (based on median absolute deviation) were selected from tumor transcriptome data. To obtain robust clusters consensus, an unsupervised clustering method named non-negative matrix factorization (NMF) was used (NMF R-package v. 0.23.0). Optimal rank parameters were first determined using 50 runs of random starting points with default settings, followed by 300 runs with optimal ranks to acquire the final NMF clustering solutions. Silhouette analysis was performed to assess the consistency of proposed clustering solutions. Clustering significance was evaluated using SigClust, and all class boundaries were proved to be statistically significant.
Principal component analysis and clustering
Principal Component Analysis (PCA) was first employed to detect RNA decomposition of these samples. Next, PCA were performed again to check correlation between proposed molecular subtypes and batches based on top 3000 variable genes used in NMF clustering, PCA was realized using R built-in function prcomp. Sample points were colored respectively based on their NMF clusters and batches, while shapes were annotated by PRS type. Besides, t-SNE clustering was performed to check grouping efficiency by applying the t-SNE R package. Phylogenetic tree of the EAS cohort was analyzed to plot the genetic distance among different molecular subtypes using the monocle R package (v.0.2.3).
Molecular subgroup mapping across EAS and EUR Cohort
To investigate correlation among molecular subtypes identified from EAS and EUR cohorts, an unsupervised subclass mapping method SubMap (https://www.genepattern.org/) was used to identify correspondence or commonality of subtypes from the two cohorts. SubMap analysis was performed on the GenePattern online platform using default settings with a random seed of 1. The intersection of genes selected for NMF clustering (3,000 most variable genes in each cohort) were input to perform analysis. Significant correspondences were identified with a cutoff of Bonferroni adjusted P < 0.01.
Marker genes selection and pathway enrichment analysis
To identify marker genes for each GBM subtypes, Comparative Marker Gene Selection algorithms on the GenePattern online platform were performed with default settings and a threshold score > 0.5. Top 500 differential expressed genes (DEGs) were selected as marker genes for each subtype, DESeq2 (3.12) in R were then conducted to validate results. To further compare similarity across EAS and EUR subtypes, VENN plots were performed to check the overlapped marker genes on Venn webtools (http://bioinformatics.psb.ugent.be/webtools/Venn/).
To uncover Gene Ontology (GO) and pathways (Kyoto Encyclopedia of Genes and Genomes, KEGG) enrichment of each GBM subtypes, marker genes of each subtype were respectively uploaded to online tools: Metascape (online websites with gene annotation, visualization: https://metascape.org/gp/index.html). Using the fgsea R package (v.3.12), GSEA was performed with assistance of hallmark gene set from MSigDB (v.7.4). Significant enriched gene sets were filtered based on a cutoff of q < 0.01.
Immune profile analysis of RNA subtypes
In decomposing immune components of tumor samples, enrichment levels of the 29 immune signatures were quantized using single sample gene set enrichment analysis (ssGSEA v.10.0.2) score, resulting in higher immune scores denoting a greater immune component in a tumor 33. ESTIMATE R package (v.9.1.0) along with CIBERSORTX (https://cibersort.stanford.edu/), MCP-Counter (v.1.1) R package as well as online tools TIMER (http://cistrome.org/TIMER/) were then employed to check data consistency using default settings. To predict Immune Checkpoint Blockade (ICB) therapy responses, an 18-gene T cell inflamed gene expression profile (GEP) was calculated, it refers to a weighted sum of these 18 genes after normalization with 11 housekeeping genes following previously published methods 14.
HLA supertype and haplotype identification
Human leukocyte antigen (HLA) molecules are highly polymorphic. They contain various supertypes (also known as serotypes) and haplotypes even within the same race, and across different races. Different HLA supertypes and haplotypes have shown a significant disparity in antigen presenting ability and peptide ligands affinity 34. Expression of HLA genes among GBM subtypes were obtained and compared from transcriptome TPM value. HLA supertypes and haplotypes were identified using two independent software tools, arcasHLA (v.0.2.5) and seq2HLA with default parameters 35,36. Only haplotypes and supertypes confirmed by the two softwares were retained for future analysis. HLA loss of heterozygous (LOH) was considered as an individual that only one allele gene was detected on his or her allele loci. Association between HLA status and clinical features were then investigated by univariate and multivariate analysis.
Detection and analysis of TCR, BCR CDR3 sequences from GBMs RNA-Seq data
TRUST (v.3.0.1) was applied (https://bitbucket.org/liulab/trust) to extract TCR/BCR sequence from 345 GBMs RNA-seq data 33. CDR3 sequences with significantly fewer reads (less than 3) were excluded in this analysis. The number of total sequencing reads was obtained from each bam file using Samtools, each variable (V), joining (J), or constant (C) region of CDR3 sequences were annotated with the help of TRUST. In order to compare the richness of TCR/BCR among different subtypes, we normalized CDR3s counts by total reads and tumor purity in each sample. Clonotype diversity of T/B cells was estimated by TCR/BCR CDR3s per kilo TCR/BCR reads (CPK) in each sample as previously described 37. Shannon entropy calculated from the frequencies of TCR β chain and IgH CDR3 sequences is another index evaluating diversity of TCR/BCR, and such an index represents the adaptive ability of patients' immune systems. γδ T cell fraction, which plays a pivotal role in tumor cytotoxic immunity 38, was estimated by the total number of γ or δ CDR3s divided by the total number of TCR CDR3s in each sample 39. In identifying B cell lineage clusters in each sample, immunoglobulin heavy/light chain subtypes were aligned by the constant (C) region of IgH CDR3 sequences or Ig light chain; the fraction of each subtype was calculated by dividing heavy/light chain total reads in each sample. TCR, BCR and tumor clones Shannon Index (SI) are calculated with the formula: SI =
, where
is the proportion of clone i 40.
Survival analysis and correlation of multilayer features
Univariate and multivariate Cox analysis were performed to check correlations between prognosis and features (clinical features, including Chemo-/ Radiotherapy Status, Molecular Subtypes and PRS Types; genomic features, including somatic mutation of genes, Ploidy and GII). Survival outcomes of different groups were estimated by the Kaplan-Meier method and log-rank tests were used to calculate P values. To uncover the correlation of multilayer features in EAS cohort, a list of multiple layers features was curated, including baseline information (including age, sex and others), treatment regimen (chemotherapy status, radiotherapy status), genome indices (Ploidy, GII, GII deletion), somatic mutation of diver genes, molecular subtype. Pairwise correlations of all features were surveyed using the GGally R package (v.1.5.0) and corrgram R package (v.1.13.). MATH (Mutant-Allele Tumor Heterogeneity) scores were calculated to assess intratumor heterogeneity among GBM samples, it’s calculated as:
, where
is a vector of the VAFs of all mutations from sample i and MAD denotes median absolute deviation 41.
Construction and validation of prognostic signature and nomogram
Based on patients' clinical, genomic features and molecular subtypes, an innovative nomogram specific to EAS-GBM was first developed by rms r package (v.6.2-0.). Calibration charts were performed to determine 1-, 3- and 5- year OS prediction accuracy of our nomogram. Sensitivity and specificity of nomogram were assessed using survivalROC R package. Then, a machine learning predicting model was constructed. Cox regression coefficients were firstly utilized to estimate risk scores for each patient. Patients were then divided into two groups (low risk and high risk) according to their risk score: The boundary value was set by the gaurvminer R package (v.0.4.9). Subsequently, survival outcomes of the two groups were evaluated by Kaplan-Meier prognosis plot and a log-rank test was used to calculate P values. Receiver operating characteristic (ROC) curve and area under curve (AUC) value were calculated via the survivalROC R package (version 1.0.1). Consistency indexes (C index) were then calculated with randomly performing 10-fold cross-validation 100 times to assess model prediction efficiency in each input setting.
Somatic mutation pathway annotation
Somatic mutation pathway enrichment was annotated and summarized by Maftools, then the results were visualized by the online tool Pathway Mapper (http://www.pathwaymapper.org/). Homologous recombination deficiency (HRD) refers to an incorrect alignment of chromosomes during meiosis, which is strongly linked to cancer formation. We selected a panel of genes, including BRCA1/2, CDK12, ATM, MLH1/3, MSH2/3/6, PMS1/2 and PALB2 mutation to evaluate patients' HRD status. DNA mismatch repair (MMR) pathways play key roles in proofing of errors during DNA replication, MMR associated genes including MLH1, MLH3, MSH2, MSH6, PMS1 and PMS2 were investigated to determine MMR status.
Statistics
The Wilcoxon rank-sum test and Kruskal-Wallis test were used to check the difference in significance of continuous variables among different groups. Spearman’s rank correlation was employed to check correlation between continuous variables. Chi square or fisher’s exact tests were applied to test for enrichment of categorical variables among different groups. Multiple testing was accounted for by using false discovery rate q-values unless otherwise indicated 42. P<0.05 was set to be statistically significant across all the tests above.