Identication of Autism Risk Genes in a Chinese Cohort via Whole-Exome Sequencing with the Joint Calling Analysis

Autism spectrum disorder (ASD) is a highly heritable neurodevelopmental disorder characterized by decits in social interactions and repetitive behaviors. Although hundreds of ASD risk genes, implicated in synaptic formation, transcriptional regulation, and chromatin remodeling, have been identied, the genetic analysis on east Asian ASD cohorts in the whole-geome or whole-exome level is still limited(1-5). Here we performed whole-exome sequencing on 168 ASD probands with their unaffected parents of Chinese origin. We applied a joint calling analytical pipeline based on GATK best practices and identied numerous de novo variants including single nucleotide variants (SNVs) and insertion or deletions (INDELs). By querying the Simons foundation autism research initiative (SFARI) gene database, we found that there were potential novel ASD risk genes in East Asian cohorts, which did not exist in European American populations. Furthermore, our analysis pipeline identied de novo copy number variations (CNVs) of known ASD-related gene based on a suciently large sample size, validated by quantitative PCR. Our work indicated that there may be differences in potential ASD genetic components existing across different geographical populations, suggesting that genomic analysis over large cohorts are required for each population in order to precisely identify ASD risk genes.


Introduction
Current prevalence of ASD has approximately increased to 1 in 49 children in the United States, and males are four times more susceptible for ASD than females (6). Recently, tremendous efforts in ASD genetic studies using whole-exome and whole-genome sequencing have enabled high-throughput assessment of protein-disrupting variants in large ASD cohorts, in which de novo single nucleotide variants (SNVs), insertions and deletions (INDELs) and copy number variants (CNVs), as well as rare inherited variants are major contributors of genetic risks for ASD (1,7,8). Although genomic information of large cohorts consisting of tens of thousands ASD patients have been collected, east Asian populations are still underrepresented groups. Whether geographical factors may contribute to genetic causes of ASD remained to be addressed (5).
In this study, we collected 168 ASD probands with their parents and performed whole-exome sequencing analysis. 150 bp paired-end sequencing short reads were mapped against human reference genome build 38 (GRCh38/hg38). SNVs and INDELs were jointly called across all samples and ltered by GATK Variant Quality Score Recalibration (VQSR) and Convolutional Neural Network (CNN) tools. CNVs were called using a cohort mode GATK pipeline detecting germline copy number variants.
Interestingly, we found that potential ASD risk genes identi ed in this study are largely distinct from the result in SFARI databases and ASD gene candidates in the Japanese population, suggesting that geographical difference may play a critical role in genetic variations leading to psychiatric diseases including ASD.  Figure S1A Population strati cation using genotyping data of common exonic SNPs

Materials And Methods
To de ne a set of common exonic SNPs, we rst selected variants that are; 1) on the In niumExome-24v1-1_A1 genotyping array, 2) with MAF > 0.05 in East Asian (EAS) population of ExAC(5) annotated by VEP and 3) biallelic in EAS. After combining the information of these SNPs in our cohort (OWN) with the data of the same SNPs in African (AFR), American (AMR), East Asian (EAS), European (EUR) and South Asian (SAS) individuals in the 1000 Genomes Project(6), we performed further ltering and linkage disequilibrium (LD)-based pruning using PLINK v1.9(7) with the following options and parameters; --maf (minor allele frequency) 0.05, --mind (maximum per-person missing) 0.2, --geno (maximum per-SNP missing) 0.2, --hwe (Hardy-Weinberg disequilibrium p-value) 1×10 -10 and --indep (SNP window size, number of SNPs to shift and variance in ation factor threshold) 50 5 2. By using the data of 1064 SNPs that passed the lters described above, we performed multidimensional scaling with PLINK.

Real-time Quantitative PCR Validation
To con rm de novo CNVs detected by WES, quantitative PCR (qPCR) was performed using DNA from probands, their parents and controls. The comparative CT method (delta-delta CT method) was used for relative quanti cation, with data normalized against an endogenous control sequence (glyceraldehyde 3phosphate dehydrogenase, GAPDH) with two normal copies. Genomic DNA was ampli ed using SYBR Green (Thermo Fischer Scienti c) and qPCR was performed on a StepOnePlus™ Real-Time PCR System (Applied Biosystems). Data were analyzed using R version 3.6.3 and pcr package (19

Statistical Analyses
We statistically evaluated the observed number of dDNMs in each gene using TADA-Denovo (20). We included IMPACT HIGH and CD missense mutations in the TADA-Denovo analysis. Parameters for this analysis were determined by following the TADA manual. Per-gene mutation rates for LOF and CD missense DNMs were obtained from mirDNMR based on sequence context (21). After performing variant ltering, we discovered a set of 442 de novo mutations (DNMs) (Table S1). We classi ed SNV and INDELs into three classes, including HIGH-impact. MODERATE-impact, and Possible damaging. The HIGH-and MODERATE-impact were de ned by VEP (Ensembl Variant Effect Predictor, https://asia.ensembl.org/info/docs/tools/vep/index.html). Brie y, the HIGH-impact variants usually lead to truncation of protein product, such as gain or loss of STOP codons as well as frameshift-causing INDELs. We identi ed 11 HIGH-impact SNVs and 8 HIGH-impact INDELs ( Figure. (Table S1). To further categorize the severity of missense variants, we annotated missense into a new class, named Possible damaging missense DNMs, which were de ned as the variants predicted to be damaging by at least two of the seven following prediction algorithms: SIFT(10), PolyPhen-2 HumVar(11), PolyPhen-2 HumDiv(11), LRT(12), Mutation Taster(13), Mutation Assessor (14) and PROVEAN(15) annotated by dbNSFP4.0a (16,17). We found 64 Possible damaging missense DNMs (Table S1).

Results
Next we statistically assessed the observed number of de novo variants in each gene using Transmission and De Novo Association Test-Denovo (TADA-Denovo) and identi ed one gene signi cantly enriched for de novo mutations (SYNGAP1 q val < 0.05) (Table S2). Overall, de novo ASD risk genes detected in ASD probands from the Chinese cohort showed little overlapped with the list of de novo ASD risk genes in ASD probands from the Japanese cohort( Figure. 1D) (5). Only a few SFARI genes, including SYNGAP1, POGZ and NCOA6 were found in both East Asian cohorts.
Odds Ratio may not be a good measure of genetic risks for ASD.
We next re-annotated DNM data from 4872 ASD probands and 1943 unaffected siblings originally from db-denovo v.1.6.1 with the same pipeline as used for our dataset. By comparing the proportion of individuals carrying one or more HIGH, MODERATE, LOW and MODERATE impact mutations in the case groups with controls, We con rmed that carriers of HIGH impact DNM were signi cantly enriched in both our cohort and the db-denovo ASD cohort (p= 2.792 × 10 -11 , odds ratio [OR] = 3.182105 in our ASD cohort; Furthermore, the odds ratio of MODIFIER impact DNM carriers in the db-denovo ASD cohort suggests that a type of mild mutations inhibit the onset of ASD(p= 0.7444, odds ratio [OR] = 1.185185 in our ASD cohort; p= 2.2 × 10 -16 , odds ratio [OR] = 0.1182785 in the db-denovo ASD cohort, Figure. S1B). Taken together, these results indicate that odds ratio can only be partially used to determine the effect of different mutation types on the incidence of ASD.

Identi cation of CNVs in ASD risk genes with the WES dataset
Although the gold standard for copy number variations detection is the chromosomal microarray analysis (CMA), various toolkits has emerged to identify CNVs with the whole-exome sequencing (WES) dataset (18). However, the current reported algorithms for CNV detection is not optimal for the WES dataset and incompatible with the GRCh38/hg38 reference genome.
We applied a germline CNV calling protocol based on GATK cohort mode (version 4.1.4.1) (See Supplementary Methods) and identi ed numerous de novo CNVs in the probands (Table S3). To exclude the false positive hits, we set 2 standards for CNV screening. First, selection of duplication or deletion signals appearing in more than 2 continuous exons. Second, CNVs should ful ll the HIGH-impact criterial, leading to protein truncation, such as deletion of START or STOP codons.
To prioritize ASD risk genes, we rst examine CNVs happened in the known SFARI genes ( Figure. 2A-G). We found 8 CNVs exhibiting duplication or deletions in known SFARI genes, such as duplications of AMT, RAI1, TBC1D23, and deletions of TBR1, SHANK3, MECP2, GIGYF1 ( Figure. 2A-G). We further validated the CNV results by performing quantitative PCR ( Figure. 2H), con rming the feasibility and faithfulness of our new methods.
Importantly, we further identi ed de novo large CNVs, containing multiple genes ( Figure. S2A-H, Figure. S3A-H). To investigate whether these candidate genes may be involved in brain development, we examine the expression pattern of candidiate genes which exhibited either duplications or deletions in ASD patients in the GTEx Analysis Release V8 database (dbGaP Accession phs000424.v8.p2). We found that numerous candidate genes indeed were expressed in the central nervous system (Figure. S4), suggesting that genes implicated in these de novo large CNVs may contribute to pathogenesis of ASD.

Discussion
With accumulating genomic studies on autism cohorts world-wide, the genetic architecture of ASD has emerged over the last decade. Composed of de novo and rare inherited mutations, genetic variants play a decisive role in determining the etiology of ASD. Although the rapid development of DNA sequencing technology, precise identi cation of genetic variants in the large scale genome sequencing over hundreds and thousands of ASD core trios is still very challenging.
In this work, we applied the latest GATK package (v4.1.4.1) and the GRCh38/hg38 dataset, which is compatible for ongoing update of Ensembl genome database. We focused on the identi cation of de novo variants, including SNVs, INDELs and CNVs, with the customized joint calling pipeline. Importantly, we found several critical CNVs containing ASD-risk genes, such as SHANK3, TBR1 and MECP2, indicating that screening CNVs with the WES dataset would be very valuable for ASD genetic studies.
Interestingly, about 40% genes carried de novo HIGH-impact variants (18), existed in the SFARI gene list, suggesting that there are potentially novel ASD genes in the Chinese cohorts. Consistently, among the 46 genes carried de novo HIGH-impact variants in the Japanese ASD cohorts, only 8 genes appeared in the SFARI gene list, suggesting that geographical factors may play a critical role in contributing the ASD. Taken together, we suggest that although the overall genetic architecture of ASD remains similar across different populations, the genetic components may vary due to geographic isolations. Thus, in order to comprehensively acquire the ASD risk genes, genome-wide sequencing in large cohorts from different populations would be required. . Ethical review number of our study is 2016-4, and committee members of IRB who approved this study was Dr. Yi-Feng Xu. Patients were collected from outpatient Department of the Child and Adolescent Psychiatry, Shanghai Mental Health Center. Written informed consent was obtained from parents for all minor children and those who were unable to give consent. All participants were ascertained using the protocol approved by the appropriate Institutional Review Boards.

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
All authors contributed to the work and meet the criteria for authorship. Study concept and design: B Yuan and Z Qiu. Experiment and data analysis: B Yuan. Acquisition of Clinical information: PP Cheng, YS Du. Interpretation of WES data: B Yuan. Experiment: B Yuan, PP Cheng, R Zhang, Drafting of the manuscript: Study supervision: Z Qiu and YS Du.