A novel data-driven approach reveals gene networks and biological processes underlying autism

Background: Developments in gene-hunting techniques identied several ASD associated genes. The considerable signicance of cluster analysis associated with gene network studies has led to reveal many disrupted key pathways in ASD, even if its genetic underpinnings remain a challenging task. This study aims to determine, through a novel data-driven approach, how networks of mutated genes impact biological processes underlying autism. We analyzed the VariCarta dataset, which presents more than 200,000 genomic variant events collected from 13,069 people with ASD. Firstly, we created a whole-genome and an exome sequencing subset. Then, for each subset we compared pairwise patients of each group to build “patient similarity matrices”. Hierarchical-agglomerative-clustering and heatmap were performed to identify clusters of patients with common occurrences of gene networks within these matrices. The subsequent enrichment analysis (EA) highlighted biological processes that might be impacted by the mutated genes of each subgroup. The FE biological of each are An additional full biological

prioritized the disrupted processes that occur in possibly genetically related clusters of patients. The availability of VariCarta [21], an ASD speci c database, allowed us to execute our analysis on a signi cant amount of patients.
We rstly de ned a metric to measure the genetic similarity between patients according to their mutations and then we applied hierarchical clustering in order to identify groups of genetically similar individuals [22]. Afterwards, we proceeded with the enrichment analysis upon each cluster of patients, as to identify ASD-related biological pathways, which might have been disrupted by the mutations. Finally, we discussed the implications of autism gene networks knowledge for clinical practice.

Gene Database
For this research work we used the VariCarta dataset from British Columbia University, a web-based database of human DNA genetic variants identi ed in individuals with an ASD diagnosis [21]. It also presents also a list of genes both with a set of mutational events and the reference to the patient affected by the mutation. This information was fundamental for the cluster analysis we carried out.
VariCarta was developed with the aim to identify rare, possibly causative genomic variants in ASD individuals. To address this challenge, it is necessary to collect a large number of subject information also through the aggregation of data, with the risk of methodological inconsistencies and subject overlap across studies. VariCarta developers tackled this demanding task by collecting and cataloging literaturederived genomic variants found in ASD subjects, through the use of an ongoing semi-manual curation and with a robust data import pipeline. Thus, while developing the database, it was possible to nd and correct errors, to convert variants into a standardized format, to identify and harmonize cohort overlaps and to document data origin. The database is constantly updated with new relevant gene-targeted ASD research papers. The current version contains 184,212 variant events from 13,069 subjects, collected across 69 publications. The version used in the present paper is the one dated 12/11/2019. It consists of 211,669 records, each one containing a mutative event, as reported in the paper from where it was retrieved. Since a single mutative event can be reported by more than one paper, we removed duplicated events during the analysis.

Analysis of the Dataset
The dataset was accessible both using a web interface or by downloading the whole dataset in csv format. As the web interface allows limited research, we downloaded the whole dataset in csv format. Each row of the dataset is a variant event including, among others, the symbol of the affected gene, the category of mutation (synonymous SNV and nonsynonymous SNV, frameshift insertion, etc.), the adopted sequencing type (whole genome sequencing, exome sequencing, targeted sequencing) and the subject id that is a unique identi er of the patient presenting the variant. The dataset is also provided with reference information allowing to trace the paper where the information has been gathered from. Since the number of variants detected in each patient might be affected by the used sequencing type, we handled whole-genome and exome sequencing separately. In the current study we did not consider targeted analysis sequencing, because this technique is focused on identifying speci c genes highly related to a disease, assuming that these are known. Such an assumption can be possibly made only for well documented causes of ASD, such as tuberous sclerosis or Fragile X syndrome, diagnosed in about 15% of individuals with ASD [23]. This is not the case for the "idiopathic ASD", which represents the majority of all ASD diagnoses. Therefore, limiting the analysis to some genes, while ignoring the others, could lead to a limited detection of gene mutations in a single subject. Furthermore, in the database the number of mutative events revealed by targeted sequencing is only about 2% of all events (3,698/184,212 mutations).
From the remaining variant events we created two subsets, one for whole-genome sequencing and one for exome sequencing; they were composed, respectively, by 84.6% (155,799/184,212 variants) and 13.4% (24,715/184,212) of all mutative events collected in the dataset. For each subset we selected the two features "Gene Symbol" and "Subject id" and made a pairwise comparison of the patients of each group to build a "patient similarity matrix" de ned as follows: Let A be a matrix NxN where N is the number of patients; let G i be the set of genes affected by a mutation in the subject i and G j the set of genes affected by a mutation in the subject j; we de ned each element a i,j of A as the intersection between G i and G j .. A Log2 transformation has been applied to each element ai,j of the matrix A in order to normalize the results.
The whole-genome matrix numbered 2,062 patients and the exome matrix 7,427. In order to sort rows and columns to highlight clusters underlying the common occurrence of genes' networks in patients, we used the clustermap function of the Python library Seaborn [24]. This function leverages hierarchical clustering [25] and heatmaps [26] to identify clusters inside the rows and columns of the input matrix that can be either a rectangular observation matrix or a square distance matrix. Hierarchical clustering is a widely used clustering algorithm that is able to identify hierarchical relations between groups [27]. It is particularly effective when some hierarchical structure (like a taxonomy) is expected to be identi ed and when the number and nature of groups and subgroups are not known in advance. Hierarchical clustering has been used in biology [28] since the 70 s. Nowadays it is applied to genetics, combined with heatmaps for microarray analysis [29], and, recently, to psychiatry to identify subgroups of patients with ASD based on comorbidity [30] or phenotype analysis [31] [32].

Hierarchical Agglomerative Clustering (HAC)
The speci c algorithm we used is the Hierarchical Agglomerative Clustering (HAC) [33], a bottom up iterative process. It tries to progressively identify sets of similar subsets of data leveraging a distance function, also called hierarchical linkage. In the HAC process items are iteratively aggregated using the distance function to evaluate similarities between subgroups.
The HAC we used during the analysis was the Python implementation provided by the Seaborn library in clustermap function. Seaborn's clustermap combines HAC to sort rows and columns of its heatmaps and shows dendgrograms on the axes to highlight the hierarchical structure. The implementation of HAC included in clustermap begins with a forest of clusters that have yet to be used in the hierarchy being formed. When two clusters B and C from the forest are combined into a single BC cluster, B and C are removed from the forest, and BC is added to the forest. When only one cluster remains in the forest, the algorithm stops, and this cluster becomes the root. A distance matrix is maintained at each iteration. The d[i,j] entry corresponds to the distance between cluster i and j in the original forest. At each iteration, the algorithm has to update the distance matrix to re ect the distance of the newly formed cluster BC with the remaining clusters in the forest. The computation of the distance d[i,j] depends on the method used.
In that event we adopted the "complete" method that applies the following function: This function, also called Farthest Point Algorithm or Voor Hees Algorithm, implies that the distance between two clusters u and v is the maximum distance between the farthest points. We applied this method in order to try to maximize the differences between clusters. The function used to measure the distance between the points is the Euclidean distance (2-norm). We then used the partitions to try to identify genes networks that could be characteristic of each subgroup.

Enrichment Analysis
In order to identify biological processes that could be impacted by mutated genes of each subgroup we used enrichment analysis. Gene Set Enrichment Analysis [34] (GSEA) (also called functional enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may be associated to disease phenotypes. The method uses statistical approaches to identify signi cantly enriched or depleted groups of genes. This can be done by comparing the input gene set to each of the bins (terms) in the gene ontology. A statistical test can be performed for each bin to evaluate if it is enriched for the input genes. Results for each pathway are expressed in terms of Fold Enrichment (FE), ie the ratio between the number of genes present in the cluster list belonging to that pathway and the number of genes expected to belong to that pathway in a random set of genes of the same size. The setting used for the enrichment analysis was the GO Ontology database (Released 2020-02-21) [35]. The applied reference list of expected genes was the one of homo sapiens.

Statistical Analysis
PantherDB [36] has been applied for the Enrichment analysis (PANTHER Overrepresentation Test; Released 07/04/2020). To verify the statistical signi cance of the submitted set of genes we applied the Chi-square/Fisher's Exact Test. The obtained raw p-values were adjusted for multiple comparisons by means of the False Discovery Rate method (FDR) [37]. A statistical signi cance threshold of p < 0.005 (two-tailed) was applied for all analyses. As both raw and FDR-adjusted p-values are strongly dependent on sample size, once the statistically signi cant terms were identi ed, we ranked the biological processes by Fold Enrichment, a measure of the effect-size [38]. In order to highlight interactions between the set of identi ed genes we used String DB [39] Gene Network View. In the produced network interaction graph, the number of edges represents the existing interactions between the submitted genes. It was compared with the expected number of edges, ie the expected interactions between a set of random genes of the same size. STRING DB was used also to compute protein-protein interaction (PPI) enrichment p-values on partitions identi ed by the intersection of genes sets identi ed by the clustering.

Results
Cluster heatmap for whole-genome sequencing The cluster heatmap of the dissimilarity matrix of the patients whose DNA has been analyzed using whole-genome sequencing is presented in Fig. 1a. All 2,062 patients belonging to the group are distributed along the two axes. Each number is the subject identi cation (id) assigned to a speci c patient as it is presented in the VariCarta database. Each cell of the matrix is the log2 transformation of the number of the common genes mutated between two patients. The usual diagonal line that should show the comparison of each patient with himself was suppressed by the clustermap function.
Further analyses were performed on the macro-clusters represented in Fig. 1b. The dendrogram showing their relationships is extracted from the one visible in the cluster heatmap axes limiting it to the third hierarchy level.
The whole-genome matrix presented a high number of shared variations between patients. Three clusters have been identi ed (A, B, and C). We labeled the three clusters according to their density -ie the amount of shared variations between the patients -from the highest to the lowest. We considered as a density factor (DF) the average value of all the elements of a cluster. Each value of the matrix is the log2 transformation of the number of mutated genes in common between the two patients representing, respectively, the row and the column of the matrix. Cluster A is composed by 574 patients sharing 8,357 mutated genes (DF: 1.429); Cluster B is composed by 507 patients sharing 6,818 mutated genes (DF: 1.014); Cluster C is composed by 650 patients sharing 7,704 mutated genes (DF: 0.720). Beyond the three clusters, we also identi ed 331 patients who shared none or very few mutated genes (all together 41 genes). We included them in the "mixed group" D, whose DF was just 0.001.

Cluster heatmap for exome sequencing
The same similarity matrix computed only on exome data is presented in Fig. 2. In this case the similarity matrix is almost empty, meaning that most of the patients do not share any mutations or few mutations between each other. Only a single dense cluster (Cluster E) is detectable in the top left corner of the matrix. This cluster is composed of 218 patients, which is 2.9% of all the patients whose DNA has been analyzed using exome sequencing (218/7,427). The density of this identi ed cluster is very high (DF: 2.551).
Enrichment analysis for whole-genome sequencing For each of the clusters shown in Fig. 1b we extracted the list of common mutated genes -genes mutated in at least two patients of the cluster -and then we executed the enrichment analysis. The results of the 40 biological processes with the highest FE values are presented in Table 1. Only clusters A, B, and C have been considered, as the mixed group D did not return any statistically signi cant result. The involved biological processes are ranked by Fold Enrichment to highlight the most speci c processes involved.

Cluster comparisons
Most of the processes of the three clusters are common, even though some differences can be detected either because of the different order in the list that indicates a different weight of that process in the cluster or because of the presence of cluster-speci c processes. Therefore, for each cluster we extracted mutations shared by at least two subjects in the same cluster. As shown in Fig. 3, by intersecting the three clusters we identi ed seven partitions. We executed enrichment analysis on all the partitions separately.
None of the partitions, except the intersection between the three clusters returned signi cant results. Table 2 shows the enrichment analysis on the ABC intersection. The resulting network interaction graph, generated by String-DB, is presented in Fig. 4. For the sake of clarity, the analysis has been limited to genes, which were present in at least 50 patients. The

Enrichment analysis for exome sequencing
For the cluster shown in Fig. 2 we extracted the list of genes mutated in at least two patients and then we executed the enrichment analysis. The results of the biological processes with the highest FE values are presented in Table 3. The involved biological processes are ranked by Fold Enrichment. The full list of biological processes with a FE ≥ 1.5, as well as the number of reference genes, the expected and observed numbers of genes, and the raw p-values for each biological process of the clusters and of the intersection are provided in the Additional le 1, while the list of occurrences of mutated genes can be retrieved from the Additional le 2.

Discussion
We used the VariCarta dataset to apply a metric to measure the genetic similarity between patients, followed by a hierarchical clustering analysis. We identi ed three main genetic clusters of ASD patients, each one characterized by a set of common mutated genes. The subsequent enrichment analysis (EA), performed upon the clusters' genes, allowed us to pinpoint disrupted biological processes both common to the three clusters and cluster-speci c.
Most of the processes that were common to the three clusters were involved in neuron projections guidance and morphogenesis. Proper plasticity of axon and dendrites is a highly dynamic process leading functional circuits formation [40], which is crucial in the very early stages of brain development [41] [42]. These processes involve numerous ASD related genes associated with disrupted synaptic connectivity and function [43] [44]. Neuronal migration and morphogenesis defects in ASD contribute to an altered cortical connectivity in different brain regions [45] [46], as the well-known prefrontal area [47]. Therefore, in the developing brain, a premature alteration in neuronal plasticity, affecting mostly cortical regions involved in cognitive and behavioral functions, might trigger the autistic phenotype [48] [49].
Furthermore, also fMRI [50] and electrophysiological [51] studies have emphasized the role of an atypical interconnection of speci c brain areas in ASD.
Considering these ndings, the biological processes we underlined for each cluster give further support to the theory that neurogenesis and its molecular microenvironment are associated with autism. This pathogenetic background was pointed out also by the enrichment analysis performed on the intersection of the three clusters, which con rmed the role in the ASD etiopathology of atypical gene networks related to biological processes affecting neuron development as, for example, "axon guidance", "neuron projection guidance", or "dendrite morphogenesis".
As further evidence of a disrupted connectivity in ASD, the EA on the single clusters and on their intersection showed the importance of processes like "cell junction assembly" and "synapse assembly".
The integrity of synaptic proteins and cell adhesion molecules is crucial for synaptic formation, signal transduction and transmission [52]. Thus, synaptic dysfunction due to molecular damage and the consequently altered signal conduction are consistent with ASD etiopathology [53]. Overall, the presence of shared genetic networks between the three clusters and, consequently, of common atypical biological processes involving both neurite and synaptic formation and function might be the underlying cause of comparable ASD phenotypes among different individuals.
Besides the shared genetic networks, the enrichment analysis signaled also some processes that differentiate the three clusters and that could possibly bring to phenotypic dissimilarities. Actually, the analysis brought to light the presence of cluster-speci c processes, which were not shared with the other clusters; this is, for example, the case of the biological process "kidney development" in Cluster A, or "cardiac septum development" in Cluster C. This observation should be considered with caution, as the relative FDR p-values -though statistically signi cant -are close to the threshold of < 0.005. Nevertheless, it is noteworthy that even if ASD-related processes regarding heart and kidney are still poorly characterized in literature, more recently they have become a feature of scienti c and clinical interest [54] [55].
Our ndings might prompt to speculate that shared mutations could underlie the core ASD-symptoms, while the cluster-speci c ones may be one of the possible explanations for the ASD phenotype heterogeneity and autism continuum. Therefore, both cluster-speci c and common atypical biological processes might interact to shape the speci c phenotype of each individual. At any rate, these observations point to the fact that when considering the disrupted processes of the different ASD phenotypes we should not only look at the CNS but at other systems as well. Thus, this is an important evidence supporting the polygenic model that assumes that ASD is the result of rare and common variants combination [56].
It is noteworthy to mention that we also identi ed a group of 331 people composed of individual patients that have either no or just a few mutated genes in common with the rest of the sample. They could not be included in any cluster. These ndings further support an ASD heterogeneity, as it may indicate that the clinical phenotype might be also the outcome of genetic common variants that could interact with a main mutation and/or with epigenetic [56] or environmental factors [57].
We also performed an analysis of VariCarta exome data, collected from more than 7,400 patients, in order to evaluate if there is a difference in terms of information provided by whole-genome compared to exome sequencing. Even though a single dense cluster -involving less than 3% of the patients -has been identi ed, data resulting from exome sequencing seems not to be su cient to identify signi cant similarities between ASD patients. The comparison of the results we obtained by analyzing the wholegenome with those of the exome draw the attention to the major contribution of noncoding mutations to the development of autistic traits; a nding in line with previous studies [58].
A relevant element of our work was the use of patient similarity analytics, followed by a hierarchical clustering analysis. In this way, we identi ed the genetic clusters and the related biological processes. As already highlighted in scienti c literature [59], the use of a similarity metrics between patients in order to identify "patient similarity networks" is becoming a signi cant aspect of big data analytics in healthcare systems. By assisting patients' clustering, it allows researchers to detect homogeneous subgroups of individuals sharing similar characteristics. This data-driven approach has been already applied in medicine to the study of different diseases, including behavioral disorders [60] and other diseases of the CNS [61]. The main condition to develop patient similarity algorithms is the existence and availability of large datasets containing detailed information of each individual. VariCarta is an example of this kind.
Speci cally, in ASD research and clinical medicine patient similarity algorithms could play a role in structuring autism heterogeneity, i.e., in identifying ASD patient subgroups who share the same etiopathology. Subgroups of patients can be then assessed by additional ne-grained strati cations, based -for example -on their genetic characteristics or biological processes. Once subsets of patients have been isolated and characterized, it becomes possible to perform systematic individualized analyses, by evaluating the distance of a single patient from each subgroup.
Enumerating both the common and speci c processes that characterize different ASD subpopulations might allow understanding which one, among a large number of autism genetic variants, are those that play a major role in the etiopathology of ASD. This could help to bridge the crucial gap between the detection of new genetic risk variants for ASD [62] and the clinical translation of these discoveries.
Although patient similarity analytics is still in development, it promises to be helpful in predicting patients' trajectory over time [63], in providing clinical decision support [64], and in tailoring individual treatments [65]. Therefore, stratifying ASD patients into clinical subgroups through the identi cation of common and speci c disrupted biological processes might be bene cial both for a more precise prognosis and for choosing tailored therapeutic approaches to ASD. This is because the identi cation of pathways underlying a speci c subpopulation could give us the right information about the natural history of the disorder and, consequently, the speci c clinical intervention for that subpopulation. Thus, this methodological approach could provide a fresh insight into precision medicine applied to ASD [65], as it can help guide clinical management and support the best therapeutic choice that ts a speci c patient.

Limitations
Our ndings should be viewed with some limitations in mind. Firstly, as the database we used does neither contain genetic data of family members of the ASD patients nor of neurotypical subjects we did not perform the same analysis on a group of non-affected individuals. Therefore, we did not compare our results with a control group. It will be important for future research to replicate our ndings, comparing them with genetic data of neurotypical subjects. Secondly, the absence in the dataset of the description of each phenotype, including the IQ, did not allow us to con rm our clustering results. We could not verify if the biological processes impacted by the mutation networks of each cluster actually produced different phenotypes. A comparison of the identi ed gene networks and related biological processes with associated phenotypes will be necessary to con rm the clinical validity and usefulness of our results. Finally, we included in the analysis all mutations, without differentiating their type (base substitution, deletion, or insertions), category of nucleotide mutation, or sequence variation (exonic, intronic). However, at this stage, we preferred to focus on the identi cation of networks of mutated genes characterizing ASD subpopulations, rather than on the impact of mutations on a speci c ASD patient.

Conclusions
To the best of our knowledge, this is the rst time that clustering analysis has been applied to patient similarity in an ASD study. We deem that an important asset of the proposed methodology is determined by the fact that identifying ASD subpopulations using genetic sequencing, and not only phenotypic clustering, might support the diagnostic process by more precisely depicting and differentiating the speci c autistic traits and phenotypic characteristics of each person.
To improve our understanding of ASD heterogeneity, neurobiology and genomic architecture, study designs have to increasingly consider innovative methodologies and newly developed biomedical informatics. Integrated genomic approaches, supported by advanced mathematical modeling, might lead to a better comprehension of the etiology and of the pathogenetic mechanisms of ASD. The proposed methodology might represent a novel approach to help disentangle ASD complexity and an instrument to foster more focused genotype-phenotype studies.

Declarations
Ethics approval and consent to participate This study was conducted exclusively on anonymized data, and the study was approved by the Ethical Review Board of our institution. As the study was based on a free-available and anonymous dataset, informed consent was waived by the Board.

Consent for publication
Not applicable.

Availability of data and materials
The VariCarta dataset is freely available at https://varicarta.msl.ubc.ca/index, both using a web interface or by downloading the whole dataset in csv format. All data generated or analyzed during the present study are included in this published article and its additional les.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.