A multistep computational procedure to identify candidate master Transcriptional Regulators (TRs) of glioblastoma (GBM)


 We describe a multistep computational procedure to identify candidate master Transcriptional Regulators (TRs) of glioblastoma (GBM) from glioblastoma single cell gene expression profiles and tissue-specific transcription factors.


Introduction
To identify candidate master transcriptional regulators of glioblastoma, we assembled a 3-step computational work ow named "Rhabdomant" that: (i) reconstructs the gene regulatory network (GRN) from GBM single cell gene expression pro les and GBM-speci c transcription factors; (ii) anchors regulatory interactions to the GBM epigenetic landscape, through the association of target genes to transcriptional regulators (TRs) that are potentially bound to their cis-regulatory elements; (iii) scores of candidate master TRs from the differential enrichment of their regulons in different cell states.

Procedure
Step 1: network reconstruction. To reconstruct the GBM regulatory network, we used ARACNe-AP 1 . ARACNe-AP requires in input a gene expression matrix and a prede ned list of transcriptional regulators.
As gene expression data, we used the single cell transcriptomes (normalized gene counts) of all 883 neoplastic cells from Darmanis et al. 2 . To limit the effects of scRNA-seq data sparsity in GRN reconstruction 3 , we removed from the input matrix genes detected only in a limited number of cells (normalized counts >0 in at least 100 cells, i.e., slightly over the number of cells in which a gene is detected on average in this dataset). The lter retained a total of 9,845 genes. As candidate transcriptional regulators (TRs), we used a list of transcription factors whose DNA-binding motifs were found enriched in the open chromatin regions of brain cancers and their dedicated transcriptional cofactors. Brie y, taking advantage of the epigenetic analyses provided by a recent large-scale ATAC-seq pro ling of several human tumor types, including GBM 4 , we carried out a DNA binding motif enrichment analysis using the HOMER ndMotifsGenome.pl function on the open chromatin regions of GBMs, and then selected the transcription factors whose DNA-binding motifs were highly enriched (FDR<0.0001) in at least 10% of these genomic regions. We excluded few factors (i.e., BATF, JUNB, PRDM1), since they are known to work mainly as transcriptional repressors. We manually curated this list with partner transcriptional co-factors (e.g., NOTCHs for RBPJ, etc) for a total of 151 TRs (Supplementary Table 1). ARACNe-AP was run with 100 bootstrap iterations setting the DPI (data processing inequality) tolerance to 0 and the MI (mutual information) P value threshold to 10 -8 . This resulted a single cell GBM network comprising 32,016 interactions between 70 TRs and 6,079 genes.
Step 2: anchoring of regulatory interactions to GBM epigenetic landscape. Since the network inferred in the rst step is based only on GBM single cell expression data, we intersected the gene interactomes of the GBM network with epigenetic data to eliminate spurious associations generated by the noise intrinsic to scRNA-seq data and to retain only biologically relevant associations between TRs and putative target genes. Speci cally, we rst analyzed the epigenetic data provided by Corces et al. 4 using the HOMER annotatePeaks.pl function to map the TRs potentially able to bind each of the ATAC-seq peak of GBMs. Then, we used the associations between each ATAC-seq peak and its target genes (Data S2 of Corces et al. 4 ) to map each TR to a set of putative target genes (Supplementary Table 2). Finally, we intersected this set of putative direct target genes with the interactomes of the GBM network, thus removing upstream regulators, indirectly interacting genes, and false positives and retaining only the lists of target genes with binding motif for a TR in their cis-regulatory elements (regulons).
Step 3: scoring of candidate master TRs. To identify master regulators of GSCs, we applied the VIPER algorithm 5 to the GBM regulons resulting from Step 2. VIPER uses the expression level of genes that are most directly regulated by a given protein, such as the targets of a transcription factor, to infer the activity of the transcription factor on a set of samples. We used the VIPER algorithm as implemented in the msviper function of the Bioconductor viper package (v. 1.16.0) in R 3.5.0. The msviper requires in input i) a gene expression signature resulting from the comparison of two types of samples, ii) a null model composed by a set of signatures obtained after permuting the samples at random, and iii) a gene regulatory network. To reduce the effect of outlier samples on the gene expression signature, we performed the msviper analysis with bootstrap, i.e., imputing the gene expression signature as a matrix of bootstrapped signatures. Speci cally, we generated the bootstrapped signature matrix comparing, in 100 bootstrap iterations, the expression pro les of GSC versus GDC cells with the bootstrapTtest function. We de ned the null model as a set of gene expression signatures obtained through the ttestNull function by randomly shu ing 1,000 times the samples among the GSC and GDC sets and we used the GBM regulons from Step 2 as the gene network. Finally, we used the bootstrapmsviper function to integrate the regulator activity results of the msviper analysis across all bootstrapped iterations. Based on VIPER analysis, we de ned as candidate master TRs those regulomes (i.e., the TR and its regulon) with an FDR ≤0.05 and a number of target genes >70. This resulted in 27 candidate MRs, of which 15 were candidate master TRs of the GSC state and 7 were candidate master TRs of the DGC state.