Clustering-based functionality analysis of genes using CRISPR knockout screening data subsets and HDBSCAN
We first designed a study to examine the functional similarity of genes at a more precise level using CRISPR knockout screening data. On the one hand, analyzing each screening dataset independently has several limitations, such as insufficient data size and inherent biases originating from the experimental settings, which limit model performance. On the other hand, using all screening datasets at once may reduce prediction accuracy, either because of imbalanced sample sizes among the experiments or because meaningful signals coming from a small subset of the data are diluted.
To overcome these shortcomings, we devised an analysis workflow that merges two screening datasets from separate studies into one, followed by a clustering analysis using HDBSCAN. This approach may address several of the issues above: merging two datasets increases the length of the feature vector assigned to each gene, which partly neutralizes biases that depend on the experimental settings and thereby increases the model's robustness. Furthermore, we could generate many unique subsets from the available data, establishing a 'population of clustering results' that enables statistical analysis. This may allow us to better examine differences between clusters and similarities among genes that share a label within a cluster.
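The merging step can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function names, the toy scores, and the choice to keep only genes present in both screens are assumptions, and the HDBSCAN clustering that would follow is omitted.

```python
from itertools import combinations


def merge_screens(screen_a, screen_b):
    """Concatenate per-gene feature vectors for genes found in both screens.

    Keeping only the shared genes is an assumption made for this sketch.
    """
    shared = sorted(set(screen_a) & set(screen_b))
    return {gene: list(screen_a[gene]) + list(screen_b[gene]) for gene in shared}


def pairwise_subsets(screens):
    """Build one merged data subset for every unordered pair of studies."""
    return {
        (name_a, name_b): merge_screens(screens[name_a], screens[name_b])
        for name_a, name_b in combinations(sorted(screens), 2)
    }


# Toy knockout scores (hypothetical values and names, not taken from any study).
screens = {
    "study_A": {"GENE1": [-1.2, -0.8], "GENE2": [0.1, 0.0], "GENE3": [0.3, 0.2]},
    "study_B": {"GENE1": [-1.0], "GENE2": [0.2], "GENE4": [0.5]},
    "study_C": {"GENE1": [-0.9, -1.1], "GENE2": [0.0, 0.1], "GENE3": [0.4, 0.1]},
}

subsets = pairwise_subsets(screens)
print(len(subsets))                               # 3
print(subsets[("study_A", "study_B")]["GENE1"])   # [-1.2, -0.8, -1.0]
```

Each merged subset would then be clustered independently, so that every gene receives one cluster label per subset.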
We were able to analyze 137 unique subsets using this workflow (Supplementary table 2) and identified several notable instances. By clustering the data subset generated by merging the experimental results of two cell proliferation-related studies, Behan [18] and Tzelepis [19], we found several interesting distribution patterns within t-SNE space. We analyzed 104 genes with cluster label 9, which were grouped into one spot (Figure 2A). Enrichment analysis based on the Gene Ontology database [20] revealed that 32 of the 104 genes were associated with cytoplasmic translation (Fold Enrichment = 50.60, P-value = 2.38E-43). An additional 11 genes were associated with U2-type prespliceosome assembly (Fold Enrichment = 89.87, P-value = 1.26E-17) (Figure 2D). In this gene list, FAM86B1 was the only gene with no known function.
Interestingly, we found that CRISPR-Cas9 knockout of the genes with cluster label 9 resulted in very high levels of cell death. To compare this with current knowledge about cell death and survival, we examined where genes known to be specifically associated with the cell death pathway are located within the t-SNE [21] space (Figure 2B). Contrary to our initial assumption, the distribution of cell death-associated genes within t-SNE space did not provide any meaningful information, implying a possible discrepancy between mechanistic association and actual phenotypic involvement.
A similar clustering analysis was performed on a data subset generated by merging the experimental results of two studies from different categories, a DNA damage response-related screen [22] and a cell growth screen [23] (Figure 2C). We examined 1022 genes with cluster label 21. Of these, 471 were linked to nucleobase-containing compound metabolic processes according to the Gene Ontology database (Fold Enrichment = 3.31, P-value = 2.04E-125). Moreover, 373 additional genes were associated with the RNA metabolic process (Fold Enrichment = 4.53, P-value = 5.24E-131) (Figure 2E). We also found the EBNA1BP2 gene, whose function is unknown.
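For readers unfamiliar with the enrichment statistics quoted above, the sketch below shows one common way to compute a fold enrichment and an over-representation P-value. It assumes a one-sided hypergeometric test; the source does not state which test the enrichment tool applied, and the background sizes used here are toy values, not those behind the reported results.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Observed fraction of annotated genes divided by the background fraction.

    k: annotated genes in the cluster, n: cluster size,
    K: annotated genes in the background, N: background size.
    """
    return (k * N) / (n * K)


def hypergeom_pvalue(k, n, K, N):
    """One-sided P(X >= k) when drawing n genes from N, of which K are annotated."""
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Toy numbers (hypothetical): 5 of 10 cluster genes carry a term that
# annotates 50 of 1000 background genes.
print(fold_enrichment(5, 10, 50, 1000))   # 10.0
print(hypergeom_pvalue(5, 10, 50, 1000))  # small, since 0.5 genes are expected by chance
```

Integer arithmetic is used before the final division so that large binomial coefficients do not overflow floating-point range.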
Functionality analysis of genes based on whole CRISPR data using agglomerative hierarchical clustering
In addition to the functionality analyses based on data subsets, we analyzed the whole CRISPR knockout data using an agglomerative hierarchical clustering algorithm in order to investigate its features as a whole. Each gene was assigned to one of seven clusters (Figure 3A, 3B). Knockouts of genes in clusters 3, 4, and 6 evidently triggered cell death in almost all cell types (strong red and blue signals in Figure 3B), whereas the patterns were less distinct for the other clusters.
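The agglomerative procedure can be illustrated with a naive pure-Python sketch. The average-linkage criterion, the Euclidean distance, and the toy points are assumptions made for illustration only; the text does not specify the linkage or distance used, and a real analysis of genome-scale data would rely on an optimized library implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Naive average-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two clusters
    with the smallest mean pairwise distance until n_clusters remain.
    Returns one integer label per point.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                dist = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy feature vectors (hypothetical): two tight pairs and one isolated point.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
labels = agglomerative(points, n_clusters=3)
print(labels)  # [0, 0, 1, 1, 2]
```

In the setting described above, each point would be a gene's knockout profile across all screens and the number of clusters would be seven.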
To further visualize and assess the functional characteristics of the gene clusters, we used Appyters [24], which projects input gene lists onto a pre-computed UMAP space and detects associations with WikiPathways biological pathways [25]. We found that Cluster 1 had the strongest association with three biological processes: adipogenesis, the IL17 signaling pathway, and the DNA damage response (Figure 3C, F). We also confirmed the presence of C1QTNF4, which is predicted to have cytokine activity [26], within this cluster. On the other hand, genes within Cluster 2 were grouped in the pink and orange clusters on the WikiPathways UMAP, indicating the strongest correlation with phosphodiesterases involved in neural function (Figure 3D, G). The presence of the DBNL gene in Cluster 2 supports this result, as it is believed to be involved in activities such as nervous system development, Rac protein signaling, and podosome assembly [27]. Clusters 3, 4, and 6 showed largely similar patterns, and genes in these clusters were mostly grouped together in the green clusters on the UMAP (Figure 3E, H). These clusters showed the strongest connection with the mitochondrial electron transport chain and the oxidative phosphorylation system. Interestingly, we found the AURKAIP1 gene within these clusters, which had previously only been associated with the upstream or positive regulation of proteolysis [28].
Functionality analysis of genes using similarities of data subset-wise clustering patterns
As mentioned in the previous sections, a pool of clustering results can be generated by using subsets of the available data. Based on the assumption that genes with similar functions are likely to be close together in the feature space, we compared the clustering labels (n=137) of each gene and searched for gene pairs with high label overlap values, to see whether these pairs could represent interesting emergent properties.
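One way to read the label overlap value is as the fraction of clustering results in which two genes receive the same label. The sketch below follows that reading, which is an assumption; in particular, how HDBSCAN noise labels were treated is not stated, and here they are compared like any other label. The gene names and label vectors are hypothetical.

```python
from itertools import combinations


def label_overlap(labels_a, labels_b):
    """Fraction of clustering runs in which two genes share the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label vectors must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


def high_overlap_pairs(labels_by_gene, threshold=0.8):
    """Return (gene1, gene2, overlap) for every pair whose overlap exceeds threshold."""
    pairs = []
    for g1, g2 in combinations(sorted(labels_by_gene), 2):
        score = label_overlap(labels_by_gene[g1], labels_by_gene[g2])
        if score > threshold:
            pairs.append((g1, g2, score))
    return pairs


# Toy labels from five clustering runs (hypothetical genes and values).
labels_by_gene = {
    "GENE_A": [9, 3, 1, 0, 2],
    "GENE_B": [9, 3, 1, 0, 2],
    "GENE_C": [9, 3, 1, 0, 7],
    "GENE_D": [4, 4, 4, 4, 4],
}
print(high_overlap_pairs(labels_by_gene))  # [('GENE_A', 'GENE_B', 1.0)]
```

Under this reading, with 137 clustering results, 113 matching labels would correspond to an overlap of 82.48% and 110 matching labels to 80.29%.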
Using more than 20,000 gene pairs with clustering label overlap values greater than 80% (Supplementary table 5), we generated a functional similarity network that revealed several potential functional clusters (Figure 4). To analyze each cluster of genes and discover distinct annotation patterns, we employed the Enrichr platform [29], which performs gene enrichment analysis using multiple biological databases. We identified 29 genes in the dark blue cluster that are associated with cholesterol metabolism and related signaling pathways. Similarly, the other gene clusters showed the strongest association with purine metabolism (green, 60 genes), nephrotic syndrome (yellow, 107 genes), vitiligo (pink, 121 genes), and Notch signaling (gray, 55 genes).
Further investigation of these mechanisms revealed an association between each of them and cell death. High cholesterol is known to trigger both apoptosis and autophagy [30], and purine metabolism plays a role in apoptosis [31]. Nephrotic syndrome increases the apoptosis rate of circulating lymphocytes [32], and vitiligo, an autoimmune disease of the skin, involves a gradual loss of melanocytes [33]. Finally, high Notch signaling is associated with altered cellular fate [34]. Together, these findings suggest that cell death is a recurrent theme among the mechanisms related to the gene list obtained from our network analysis.
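The network construction can be sketched by treating each high-overlap gene pair as an edge. Grouping genes by connected components, as done below, is a simple stand-in chosen for illustration; the method used to delineate the colored clusters in Figure 4 is not described here. The example pairs are taken from Table 1, with overlap values as rounded there.

```python
from collections import defaultdict


def build_network(pairs, threshold=0.8):
    """Undirected adjacency sets from (gene1, gene2, overlap) tuples above threshold."""
    graph = defaultdict(set)
    for gene1, gene2, overlap in pairs:
        if overlap > threshold:
            graph[gene1].add(gene2)
            graph[gene2].add(gene1)
    return graph


def connected_components(graph):
    """Group genes that are reachable from one another through network edges."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(sorted(component))
    return components


# Gene pairs and rounded overlap values as listed in Table 1.
pairs = [
    ("SFRP1", "SLIT3", 0.825),
    ("SOX8", "SFRP1", 0.825),
    ("HERC6", "SFRP1", 0.825),
    ("AJAP1", "SOX8", 0.825),
    ("GIMAP4", "STAT4", 0.825),
]
components = connected_components(build_network(pairs))
print(components)
# [['AJAP1', 'HERC6', 'SFRP1', 'SLIT3', 'SOX8'], ['GIMAP4', 'STAT4']]
```

Each resulting gene group could then be submitted to an enrichment tool as a separate gene list.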
Case studies of gene pairs
We conducted a case study using gene pairs from the analysis that had a high clustering label overlap (Table 1). For the HRH1/CHRM3 pair, which shows an 82.48% clustering label overlap, we identified explicit functional associations in all four databases (Gene Ontology, Reactome reactions, Reactome pathways [35], and KEGG pathways [36]). For the ONECUT1/IRF7 pair, which shows a slightly lower clustering label overlap of 80.29%, we found a research paper stating that these are two of the three genes most closely associated with inflammatory responses related to head and neck cancer [37]. One interesting case is the KCNA1/SCN9A gene pair, which shows an 82.48% clustering label overlap and a known functional commonality, as both genes encode voltage-gated ion channels. On further examination, we found that both genes are associated with oncogene-induced senescence (OIS) [38, 39], an association that has not been described in the available databases.
Table 1 Case study gene pair list
| Gene 1 | Gene 2 | Clustering label overlap | Publication title |
| ONECUT1 | IRF7 | 80.29% | Pilot Study of Combined Aerobic and Resistance Exercise on Fatigue for Patients with Head and Neck Cancer: Inflammatory and Epigenetic Changes |
| SFRP1 | SLIT3 | 82.5% | Comprehensive DNA Methylation and Mutation Analyses Reveal a Methylation Signature in Colorectal Sessile Serrated Adenomas |
| KCNA1 | HRH1 | 82.5% | Different genes may be involved in distal and local sensitization: A genome‐wide gene‐based association study and meta‐analysis |
| HRH1 | LDLRAD4 | 82.5% | 1. Histamine Induces Upregulated Expression of Histamine Receptors and Increases Release of Inflammatory Mediators from Microglia; 2. Epigenetics of neuroinflammation: Immune response, inflammatory response and cholinergic synaptic involvement evidenced by genome-wide DNA methylation analysis of delirious inpatients |
| GIMAP4 | STAT4 | 82.5% | Genetics of Behçet's disease lessons learned from genomewide association studies |
| SOX8 | SFRP1 | 82.5% | RNA-seq reveals downregulated osteochondral genes potentially related to tibia bacterial chondronecrosis with osteomyelitis in broilers |
| HERC6 | SFRP1 | 82.5% | DNA methylation in demyelinated multiple sclerosis hippocampus |
| AJAP1 | SOX8 | 82.5% | Methylomics analysis identifies epigenetically silenced genes and implies an activation of β-catenin signaling in cervical cancer |
* Tables are sorted alphabetically. For data with fewer screens (for example, SARS-CoV-2 data or protein/peptide accumulation data), data with the same phenotype are combined into a single dataset. If the number of genes used in one experimental paper differs between screens, the number of genes is given.