Sample details. Clinical microarray data were collected from an NDD cohort (n = 10,620) (Uddin et al. 2016) with unrelated patients mainly reported with autism, epilepsy, intellectual disabilities and other rare disorders. Our second cohort comprised of unrelated cases (n = 3,176) consisting of a heterogeneous population carrying rare CAs (Uddin et al. 2016). These two cohorts were recruited from Sickkids hospital (total cases n = 8,929; NDD cases n = 7,107; CA cases n = 1,822) in Toronto, and Credit Valley Hospital (total cases n = 4,867; NDD cases n = 3,513; CA cases n = 1,354) in Mississauga, respectively (Fig. 1). Inclusion criteria for NDD samples comprised of the presence of any neurodevelopmental disorder as the primary phenotype, such as autism spectrum disorder, language/speech delays, developmental delay, learning disability, mental retardation, seizures, or hypotonia, which were documented by diagnostic behavior, phenotypes, and chromosomal microarray analysis. Regarding CAs cohort, the primary phenotype was reported as congenital heart defects. There also exists the possibility that the CA patients might have some degree of NDD phenotype as a secondary manifestation that may have been under-reported.
Chromosomal Microarray Analysis. A circular binary segmentation algorithm (Olshen et al. 2004) was applied on obtained clinical microarray data from both hospitals using International Standards for Cytogenomic Arrays ISCA 180 K comparative genomic hybridization array (aCGH) to detect large CNVs. To compare individual probe intensities, we used a pool of 10 samples for reference. Each sample variant was annotated by employing numerous tools, including ANNOVAR (Wang et al. 2010) and Horizon platform of GenomeArc Analytics. The clinical laboratory geneticist manually annotated pathogenicity (pathogenic, likely pathogenic, VUS, likely benign) of each CNV applying American College of Medical Genetics (ACMG) guidelines (Kearney et al. 2011). CNVs smaller than 10Kb and larger than 10Mb were excluded from analysis. The original dataset had de novo variant information, where parent DNA was accessible (Uddin et al. 2016).
Control dataset. In this study, data from 9,692 unrelated samples (Uddin et al. 2016) with no known psychiatric history, has been used as population control (Fig. 1). This was collected from several major population-scale studies that utilized high-resolution microarray platforms. Illumina 1 M from the Study of Addiction Genetics and Environment (SAGE) (Bierut et al. 2010) and the Health, Aging, and Body Composition (HABC) (Coviello et al. 2012) assayed 4,347 control samples; Illumina Omni 2.5 M from the Cooperative Health Research in the Region of Augsburg KORA projects (Verhoeven et al. 2013) and Collaborative Genetic Study of Nicotine Dependence (COGEND) (Bierut et al. 2007) assayed 2,988 control samples; and Affymetrix 6.0 from the PopGen project (Krawczak et al. 2006) and the Ottawa Heart Institute (Stewart et al. 2009) assayed 2,357 control samples. Using a high-resolution control will allow us to improve false positive calls from the ISCA low resolution case cohorts and will provide convincing association signals.
Gene set curation and overlap analysis. We used the GRCh37/hg19 build and unique coding sequence (CDS) ids for identifying regions of the DNA that encode for proteins, and removing repeats or duplicates, to analyse our data. CNVs were interpreted based on probable clinical significance or pathogenicity, variant type (deletion, duplication), inheritance (familial, de novo), gender (male, female), phenotype (NDD, CA), gene density and content (Fig. 2). First, we extracted the genes from the control dataset with frequency > 0.001 using the respective CDS ids, and similarly, we extracted the genes from the respective CNVs using the CDS ids. Subsequently, all gene overlaps between the control gene lists and patient gene lists were removed. And the remaining genes extracted from CNVs based on gender, pathogenicity, and type were compiled (Fig. 3). We performed Fisher’s exact test (FET) using the R package (GeneOverlap) to measure statistical significance (P-value < 0.05).
Proteomic and multi-tissue transcriptome expression analysis. We used proteomic data from human protein expression studies at different developmental stages and expression data from multiple tissues to further analyse the genes that were extracted from NDD CNVs and had no overlap with genes from the CA CNVs. Proteomic and multi-tissue transcriptome datasets are described in detail in the following section.
Proteomic data analysis. To analyse protein expression levels at two developmental stages in human tissues, we used high-resolution genome-wide Fourier-transform mass spectrometry data (downloaded from the Human Proteome Map) (Kim et al. 2014), including in-depth proteomic profiling of 30 histologically normal human samples: 7 fetal tissues (heart, liver, gut, ovary, testis, brain, and placenta), and 18 adult tissues (frontal cortex, spinal cord, retina, heart, liver, ovary, testis, lung, adrenal, gallbladder, pancreas, kidney, esophagus, colon, rectum, urinary bladder, and prostate), and 6 hematopoietic adult cells (B cells, CD4 cells, CD8 cells, NK cells, monocytes, and platelets) (Additional file 1: Suppl. Figure 3) (Kim et al. 2014). For processing the data, fragmentation (high-high mode) was applied using the high-resolution Fourier transform mass spectrometers, identifying the proteins encoded by 17,294 genes, which accounts for 84% of annotated protein-coding human genes (Kim et al. 2014). For measuring protein expression, we used spectral counts per gene per sample. We performed Fisher’s exact t-test for the overlapped NDD and CA gene lists with CE or pLI enrichment using the R package (GeneOverlap) to measure statistical significance.
Multi-tissue transcriptome analysis. We measured expression levels (in triplicate) using Affymetrix GeneChip Human Exon 1.0 ST array (Gardina et al. 2006) and transcriptomes from cerebellum, breast, heart, liver, muscle, kidney, thyroid, pancreas, prostate, spleen, and testis, removing probes prone to multiple hybridizations. We used the Robust Multi-array Average (RMA) algorithm (Irizarry et al. 2003) to subtract the background signal and normalized the log2 expression values for each exon. Expression of 16,713 RefSeq genes were surveyed in all 11 tissues. A log2-transformed intensity threshold of ≥ 6 to define the expression (Kang et al. 2011) was used to detect 16,411 genes with at least one exon expressed in a tissue sample. Reads per kilobase of transcript per million (RPKM) was used as the expression unit for exons from the mapped reads (Additional file 1: Suppl. Figure 4).
The CNVs chosen for proteomic and multi-tissue transcriptome data were genes from NDD pathogenic deletion CNVs that were not overlapped with CA gene list.
Constraint gene analysis and data filtering. We have defined ‘constraint genes’ in our analysis if a gene present in both NDD and CA CNVs had a significant overlap with either critical exon (CE) or pLI ≥ 0.9 (probability of being Loss of Function intolerant). CE and pLI filtering methods are described in detail in the following section.
Spatiotemporal expression data from human brain and Critical Exons (CE): Critical exons are highly expressed exons with low mutation burden. For this project, we have recalculated critical exon matrix based on our previous work (Uddin et al. 2016). For deleterious genes that harbor de novo mutations, critical exons were significantly enriched in individuals with ASD relative to their siblings without ASD (Uddin et al. 2014). We utilized these highly specific set of genes (critical exon genes) derived from computing exon level spatiotemporal RNA-seq expression of 388 tissue samples (derived from 42 different brain donors). RPKM was used as the expression unit for exons from the mapped reads. The selection of donors was made to include at least two sex and aged-matched donors, and each developmental period: prenatal (8–37 weeks post-conception), early childhood (10 months to 15 years), and adulthood (> 17 years). We derived the expression data of 16 brain regions within 3 developmental periods for each donor (Fig. 4 and Additional file 1: Suppl. Figure 5). We used gnomAD to identify the non-synonymous rare (< 0.01 frequency) mutation burden. An exon is categorized as ‘critical exon’ if its expression is high (> 75th percentile) and gnomAD population non-synonymous mutation burden is low (< 75th percentile) compared to the entire dataset. A gene is considered a ‘critical exon gene’ if one or more exons were annotated as ‘critical exon’ for at least 50 RNA-seq brain samples.
pLI
As a second filtering criteria for constrained genes, we obtained pLI scores from Exac database to identify the tolerance of a susceptible gene to loss of function and pLI ≥ 0.9 are extremely LoF intolerant (Lek et al. 2016). We performed Fisher’s exact t-test for the overlapped gene lists (NDD and CA) with CE or pLI enrichment using the R package (GeneOverlap) to measure statistical significance.
Pathway enrichment analysis. We performed the enrichment analysis of the most significant gene overlaps from the respective type of CNVs to determine the major pathways in which the constrained genes were expressed. We scanned the KEGG pathway database which comprises of an assembly of the up-to-date interactions, reactions, and relations of molecular networks (http:// www.genome.jp/kegg/pathway.html) and GO database (http://geneontology.org/) to identify all the pathways in which five or more genes (from the constraint gene set) were expressed. Only the pathways having more than 50 genes and less than 1000 genes were considered for this analysis. We called a gene set enriched if it overlapped between our gene set and the KEGG-GO pathway database with significance (Fischer Exact Test (FET)). The pathways were identified by their unique KEGG ID and name. The significant pathways (p < 0.05) with a false discovery rate (FDR) < 0.01 were used to construct the pathway network map using Cytoscape (https://cytoscape.org/) for visualization.