Selection of samples from SFARI-SSC database
Brain circumferences of ASD probands from 2760 Simons Simplex Collection (SSC) families (10) were normalized by the age of the measurement. Probands with larger than 2 standard deviations (SD) above average were labeled as ASD-macro (ASD probands with macrocephaly comorbidity) and those with 2 SD below average were labeled as ASD-micro (ASD with microcephaly). This standard resulted in the identification of 47 and 52 probands with macrocephaly and microcephaly, respectively. We next selected 52 probands from the same database with the smallest difference from average and labeled them as “ASD-other” (probands with no brain size phenotype).
Genomic variation data from SSC dataset
SNV/INDEL data from the 41 ASD-macro, 37 ASD-micro and 38 ASD-other probands and their fathers were downloaded from the Simons Foundation Autism Research Initiative (SFARI) SSC database (10). Fathers from ASD-other families (n=38) were used as controls for ASD macro and ASD micro probands. Fathers from the ASD micro group (n=37) were used as controls for ASD-other probands for subsequent genomic variation analysis (Table S1A).
In total 320,710 SNV/INDEL loci were downloaded. Since the minimum sample number for each group for specific loci was required to be 8 (nsample >=8), a total of 60,617 loci were annotated using ANNOVAR (27). Variations in the intergenic, upstream, or downstream regions (> 10 Kb distance from TSS or TTS) were identified and excluded for further analysis. For each of the remaining loci, alternative allele (ALT) frequencies were compared between each ASD group and controls. Loci with an increased ALT frequency of 10% or more in ASD than in CTRL were considered “ASD-enriched loci.” There were 7,373, 5,233 and 2,458 ASD-enriched SNV/INDEL loci in ASD-macro, ASD-micro and ASD-other groups, respectively (Table S1B-D). These loci affected 457, 411 and 562 genes in each of the groups (Table S1E). These gene lists were the input for GO analysis using Gene Set Enrichment Analysis (GSEA) (28).
Structural variation data was also obtained from the SFARI-SSC database. Only deletions were used for subsequent analysis. After filtering by nsample >=8 in each group, 1,044, 1,330 and 1,223 deletion loci were included for ASD-macro, ASD-micro and ASD-other, respectively (Table S1 G-I). These loci were annotated by AnnotSV (29). The same standard identifying ASD-enriched loci as applied to SNV/INDEL data was used. GO analyses were performed for genes intersected with the ASD- or control-enriched deletions (Table S1F) using GSEA (see above).
Mapping genomic variations to GOs
The GO and gene lists were downloaded from a GSEA database (“msigdb.v7.1.symbols.gmt” under “All gene sets”). Genomic variation rate per GO was calculated based on these loci (in the formula listed below). Specifically, for each selected ASD or control individual from the SSC dataset, the number of ASD-enriched loci (increase of alternative allele frequency in probands over controls >=0.1) in each type of variation (SNV/INDELs and SVs) for each gene (genei) were summed for all genes in each GO (GOj) and normalized by the total number of genes in that GO to get variation_rateGOj. [FS1]
The variation-rateGO was calculated for all 10,131 GOs. GO with zero counts for all samples (n=1845) were excluded from subsequent analysis. The individuals from each of the 6 groups (ASD-macro, ASD0micro and ASD-other, CTRL-macro, CTRL-micro and CTRL-other) were randomly divided into group 1 and group 2, with 115 and 114 samples, respectively (Table S3A).
gWGCNA analysis of genomic variation data
We employed a weighted correlation network analysis (WGCNA), a data mining method used for studying biological networks based on pairwise correlations between variables (30), to analyze GO correlations. With WGCNA default settings selected (soft threshold power=6; minimum module size=30), the sample tree was first generated for group 1 based on inter-sample Spearman correlation using variation_rateGO. Next, the WGCNA network was constructed with dissimilarities among modules to be at least 15%. [FS2]
After the construction of the network, the eigengene for each module was associated with each of the two “phenotypes:” ASD_vs_CTRL (labeled as “ASD”) (all ASD samples coded as 1 and controls as 0); and brain size (labeled as “Brain”) (all SSC probands with macrocephaly coded as 2, SSC proband with microcephaly as 0 and probands from “ASD-other” group as 1, all control individuals (fathers) coded as 1) using Pearson correlation. Modules with significant correlation (p<0.05) for either phenotype were detected for group 1 samples. Based on the correlation with ASD and brain phenotypes, GO modules were defined as several “module groups” such as ASD-macro, ASD-micro, etc. (Table S3B).
The same settings were then applied to network construction for group 2 samples to detect GO modules associated with ASD and brain size. Venn diagrams were generated to find results from group 1 that could be confirmed by those from group 2 for each module group.
In the “confirmed” GOs, if a few representative GO groups (e.g., GOs related to “WNT signaling”) were enriched in each module group, were tested using a proportions test (31).
Correlation among GOs related to ASD
The enriched representative GOs detected above were used as seeding GOs (sGOs). Pearson correlation among these seeding GOs and all other GOs for ASD and Control samples were calculated for each of 3 sets of samples: ASD-macro, ASD-micro and ASD-other. The correlation matrix was flattened using a R library “corrplot”, and the p value for each GO pair was calculated. Using p<0.05 as the cutoff, GO pairs that displayed positive and significant correlation in ASD samples and negative or insignificant correlation in control samples were considered “correlated GO” (cGO) for subsequent analysis.
Detection of loci enriched in seeding and correlated GOs
Total occurrence of each gene in each of the 8 GO groups was calculated (4 sGO groups and 4 cGO groups, Table S3F-I), with 4 seeding GO groups (GOs related to “neuron”, “WNT”, “organ morphogenesis” and “synapse”, Fig.5) and 4 groups of GOs correlated to each of the sGO. The occurrence value was standardized into a z score.
The ALT allele frequency difference between ASD and CTRL (>=0.1, see above) was also standardized into a z score. The “combined enrichment score” for each locus was defined as the product of the z score for gene frequency in a specific GO group and the z score for ALT allele enrichment. Loci with positive combined enrichment score were selected.
Detection of SNV/INDEL loci enriched in ASD or controls in validation dataset
Whole genome DNA was extracted from fibroblast cell lines of 8 ASD with macrocephaly and 5 gender/age matched control individuals previously published with the clinical phenotype (22). Genomic variations detected in these cell lines were called “validation dataset” in this paper. Exome libraries were produced in the CWRU Genome Core using the Illumina ultra-sensitive exome library protocol. The libraries were sequenced on Illumina HiSeq 2500 at 8 libraries/run, which yielded ~ 100-150 million reads per library. The reads were aligned to the hg19 reference genome using BWA (32) with default settings (Gap open penalty=6, Mismatch penalty=4, etc) and yielded about 20X coverage of the human genome in each library. VCFtools (33) were used to call variants. The single nucleotide variant (SNV) lists were generated after filtering out loci with low reading depth (<10), low number of reads in support of variant (n<3), low alignment quality (q<20) and low base quality (Q<30). ANNOVAR (27) was used to annotate the SNVs. The ASD-enriched SNV/INDELs were those with significant p value for alternative allele frequency difference between ASD and control [5,000 times resampling using R, similar to previously reported method (34)]. All SNV/INDELs were compared with records in dbSNP. Filters used for calling INDELs were the same as those used for SNVs. The annotation of INDEL was performed using SeattleSeq Annotation 138 (35). GO analysis was performed on genes harboring selected SNV/INDELs using GSEA. SNVs under negative selection were found using Funseq (36).
Matepair sequencing and calling of deletions in validation dataset
“Jumping libraries” for matepair sequencing (37) from 8 ASD samples with macrocephaly and 2 control samples (described above) were built using Illumina Nextera Mate Pair Sample Prep Kit. The libraries were sequenced in the CWRU sequencing core using an Illumina HiSeq 2500 machine with 8 samples/run, which yield ~ 60 million reads per library. The reads were first trimmed off adapter sequences and reverse complemented using Nxtrim (38), and then aligned to hg19 using BWA, resulting in an average physical coverage of about 2X and actual coverage of human genome of ~ 50X.
The bam files were used as input for DELLY (39) and LUMPY (40) (version 0.2.13) to call deletions. The start and end intervals of each structural variant (SV) from both algorithms were intersected using BEDTools (41). The matched intervals were used for the final inference of deletion breakpoints (Fig. S4). The deletions from all samples were collapsed if they overlapped. The frequency of deletions among ASD samples was calculated for these collapsed deletion events. The deletions from 2 control samples were inferred following the same process (“control deletion list”). The deletions were classified as “ASD-Control-shared deletions” if the ASD deletions overlapped with the control list or “ASD-specific deletions” otherwise. Deletions were then annotated using AnnotSV (29). GO analysis for genes intersected with the deletions was performed using GSEA.
To test if the variations selected by our pipeline from SFARI-SSC samples were enriched by putative targets of any transcription factors, we applied CHEA analysis (42) for the genes that carried the 1,514 loci selected from SFARI-SSC ASD-macro samples and 410 loci selected from SFARI-SSC ASD-other samples. Using the CHEA online (https://maayanlab.cloud/chea3/#top) function “ChIP-Seq-> Literature” library. Similarly, genes affected by the 515 ASD-specific deletions in the validation dataset were tested with CHEA online tools. The results from the 3 input sets were compared using a Venn diagram (Fig. S5).
RNASeq for neural progenitor cells (NPCs) from validation dataset
Total RNA was extracted from NPCs from 8 ASD-macro and 5 control lines as mentioned above. One control line (COVE) was excluded from subsequent analysis as its karyotype was abnormal. The RNA was purified and quantified, and samples of high quality (RIN>= 7.0) were used. The Illumina TruSeq Stranded Total RNA kit with Ribo Zero Gold (for rRNA removal by hybridization/bead capture) was used for library preparation. Optimized libraries were then loaded onto the HiSeq 2500 flowcell (8 libraries/lane) for 50 bp single-end sequencing.
Adapter sequences were trimmed and filtered. Reads that passed quality filter were aligned to hg38 using HISAT2 (43) and converted to sorted BAM files with samtools (44). Identification of differentially expressed genes and statistical analyses was performed using DESeq2 (45).
ChIPSeq for NPCs from validation dataset
ChIPSeq libraries for BRN2 were prepared using NPCs from 3 ASD lines and 3 control lines. About 10 million NPCs for each library were collected between passages 6-9 at day 3 (about 80% confluency) and cross linked using 4% formaldehyde for 10 mins at room temperature. The cells were resuspended in lysis buffer, sonicated, and incubated with antibody (POU3F2 (D2C1L) from Cell Signaling) linked DynaBeads Protein G overnight at 4 ◦C. The DynaBeads were washed and then reverse crosslinked at 65 degrees for 12 hours. RNA and antibody were removed with RNase A (Ambion Cat # 2271) and proteinase K (Invitrogen, 25530-049). The pulled-down DNA was end-repaired, and a ploy-A tail was added, linked with adapter and PCR amplified. The PCR product was gel purified and fragments in the 250–400 bp were excised and purified with Qiagen MinElute Gel Extraction Kit (Qiagen, 28606). The Bioanalyzer DNA 1000 assay (Agilent) was used to access the quality of the libraries. Seventy-five bp single-end reads were generated for high quality libraries using the HiSeq 2500 (8 libraries/lane on flowcell) sequencing pipeline. Reads were adaptor trimmed with fastx_clipper, aligned with hg19 using BWA (32), and further processed using SamTools (44). Peaks were called by MACS14 (46) with default settings using sorted bam files with redundant reads removed. Called peaks were overlapped with published BRN2 on human NPC (47) and ATAC-Seq data from sample NPC lines (48) (see below).
Further bioinformatic processing of RNASeq, ChIPSeq and HiC data
The genes with selected genomic variation loci and differentially expressed between ASD and control NPCs were illustrated using Venn diagrams (https://bioinformatics.psb.ugent.be/webtools/Venn/). The ATAC-Seq data for 3 ASD and 2 control NPC lines were previously published (48) and obtained from the Gage lab through collaboration. These lines were a subset of the NPC lines on which we performed RNASeq and ChIPSeq experiments. Overlap between ATAC-seq peaks and BRN2 ChIPSeq were found using BedTools (41). Published BRN2 ChIPSeq data at NPC stage (with 2 samples) (47) were downloaded. To ascertain that the BRN2 binding positions were in actively transcriptional sites, BRN2 peaks from at least 2 control lines or published BRN2 peaks from both NPC samples needed to overlap with ATAC-Seq peak from both control lines.
Similarly, published β-catenin ChIPSeq from hESCs (human embryonic stem cells) (49) were obtained from GEO and overlapped with control NPC ATAC-Seq data. The ATAC-Seq confirmed BRN2 and β-catenin ChIPSeq peaks were annotated using HOMER (50).
Published HiC data from human NPCs were obtained (51). Both sides of significantly associated HiC intervals were overlapped with selected genomic variations using BedTools.