SV calling
The cohort was whole-genome sequenced and read alignment and SV calling using Isaac50 and Manta21, respectively, were performed by the Genomics England Bioinformatics team20. The details of sequencing and variant calling have been previously described. Manta VCF files were converted to BEDPE format using SVtools51 and then BEDTools52 was utilised to extract proband-specific SVs with ≥ 50 bp in length for each family (Figure S12). We removed SVs having evidence of clipped reads (i.e., split reads) at breakpoints in either parent. Specifically, SVs supported by ≥ 4 clipped reads at either breakpoint or ≥ 2 clipped reads at both breakpoints in either parent were excluded. The genomic coordinates of SVs called from hg19-aligned genomes (1915 probands) were converted to hg38 genomic coordinates using LiftOver53. SVs found in 3+ samples were removed because such SVs were likely germline SVs and artifacts. We selected SVs flagged as “PASS” or “MGE10kb” and further narrowed down SVs with the Manta score > 30 and supporting discordant reads > 10. We rescued SVs with imprecise breakpoints if they were supported by CANVAS54. We excluded SVs with VAF < 0.1 to remove mosaic SVs. Translocation through retrotransposon-mediated 3′ transduction was excluded to focus on dnSVs. All SVs were manually validated to identify high-confident de novo SVs using IGV browser55. Long-read sequencing data, RNA-seq (whole blood RNA sequencing from 5,546 samples) and diagnostic SNV/Indelswere obtained from Genomic England20. The previous dnCNV calls from array-based or exome-seq were obtained from DDD study22. Insertion events called by Manta21 were further classified into retrotranspositions using RepeatMasker. This process resulted in a lower sensitivity because Manta with Issac-based alignment is not optimized to call retrotranspositions. Retrotransposition-specific identification tools, such as MELT56 or xTea57, are needed to increase sensitivity for retrotransposition detection.
SV classification
We used ClusterSV2 (https://github.com/cancerit/ClusterSV) to group rearrangements (i.e., breakpoints) into rearrangement clusters. We defined complex SVs as those with ≥ 2 clustered breakpoints except for simple SVs involving reciprocal inversion, balanced translocation, templated insertion, and dispersed duplication. In general, we classified the types of complex SVs according to the previous study that comprehensively characterised somatic complex SVs using thousands of cancer genomes2. In short, complex SVs involving two inversions were categorised into Loss-invDup, Dup-Trp-Dup, Inv-Loss (i.e., inversion with flanking deletion), Loss-invLoss (i.e., paired deletion inversion), and Dup-invDup (i.e., paired duplication inversion) according to read patterns and copy numbers (Figure S7). Complex SVs involving two deletions were classified as Multi-Loss. Bridge deletion (i.e., bridge of templated insertion) and Translocation-Loss (i.e., translocation with deletion) were classified using the previously described criteria2,58. Local-3 jumps involving three local rearrangements were discovered according to the read patterns described in the previous cancer study2. Breakpoints filtered out near unresolved SV classes were rescued if they could resolve the unresolved SV classes according to the types of SV defined. Complex SVs that didn’t fit into the described classes were classified as ‘Unclassfied’. All complex SVs were manually validated using IGV browser55, Samplot59, or BamSnap60.
SV phasing to identify parent of origin and estimate the timing of triplication from maternal origin
We used unfazed61, which employs both read-based phasing and SNV allele-balance phasing, to identify parent of origin for dnSVs. To classify the timing of maternally-derived triplication into meiosis I and II, we first identified triplications (including those in complex SVs) from maternal origin (step 1) and further classified them into meiosis I and II (step 2) using a set of informative genotypes (Figure S5). At least five SNPs were required for each step.
Evaluation of clinical relevance of dnSVs
The identified SVs disrupting exons were reviewed for potential clinical relevance by NHS clinical scientists and/or Genomics England. We considered SVs as being potential (likely) pathogenic SVs if at least one of the following criteria were fulfilled: i) the variant had been clinically assessed as likely pathogenic or pathogenic by an NHS genomic laboratory hub ii) the variant had been reviewed on a research basis and considered appropriate to return through the Diagnostic Discovery pathway to an NHS genomic laboratory hub for evaluation. To estimate the fraction of probands with complex SVs in the previous study, we obtained 19 probands25 with complex SVs in Table S1 where “sv_type” was “CPX” and “role” was “proband”.
Enrichment test of non-coding dnSVs in known pathogenic genes in NN disorders
We first extracted the non-coding dnSVs (i.e., intronic and intergenic dnSVs) for which genomic coordinates didn’t include any exons in NN disorders based on the Gencode basic V45 GTF file and then obtained the known pathogenic genes associated with NN disorders from the Gene2Phenotype developmental disorders panel42. Specifically, we kept all genes with organ specificity equal to “Brain/Cognition”, allelic requirement equal to “monoallelic_autosomal”, and a confidence category equal to “strong” or “definitive” (n=190 genes). We then computed the observed-over-expected ratio for the overlap between the non-coding dnSVs and known pathogenic genes in NN disorders using the Genome Association Tester software62. Intronic and intergenic regions were obtained based on the canonical transcript of protein-coding genes in the Gencode basic V45 GTF file using BioMart and GencoDymo R packages, and bedtools52. These two regions were used as a workspace in GAT to test the over-representation of intronic and intergenic dnSVs in intronic and intergenic regions of the known pathogenic genes, respectively. We added a window of 5-500kb (5, 10, 25, 50, and 500kb) up- and down-stream to each intergenic dnSVs to perform the enrichment test. Number of random samples (“--num-samples”) for each GAT run was set to 1000.
Distribution of dnSVs across genomic properties
Genomic property metrics and fragile site (FS) information were downloaded from PCAWG structural variation paper based on hg182. Only 1,362 out of 1,377 GEL de novo simple deletions were able to liftover from hg38 to hg18 for further association analysis with the genomic properties. Telomere (chromInfo.txt.gz) and centromere (cytoBand.txt.gz) information were downloaded from UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/). To test for associations between dnSVs and the above library of genomic properties, the median of real dnSV positions (one random side from each breakpoint were chosen to reduce dependence) were compared to 1,000 median random positions using Monte Carlo simulation method. For each genomic property, the frequencies of random medians generated by the simulation will form a normal distribution. The empirical p values were calculated by the sum of random medians lower or higher than real median, then divided by 1,000. To calculate the median shift, the real observations from each genomic property were normalised on a scale from 0 to 1. Then the the shift of the median are calculated by the difference between the observed median and 0.5, assuming a uniform distribution has a median of 0.5
Enrichments near telomeres and centromeres
We equally partitioned the genome into 5-Mb bins based on their distance to the telomere ends. For comparison, we also partitioned the genome based on their distance to the centromeres. We identified that dnSVs are enriched near the telomere ends. However, the dnSVs are more evenly distributed to the centromeres. For the validation cohort, we downloaded the all_dnsv.csv file from Belyeu et al. 202125. In total, there are n=309 CEPH and SFAI dnSVs across autosomes after removing chrX and chrY. And finally, only n=192 dnDELs were used in validation analysis.
Replication timing profiles
The meiotic replication timing (RT) profile was downloaded from Pratto et al., 202144 using table expRT_MeiS_ALL_S2_2to4C_SCP3_DMRT1, which was profiled from S-phase nuclei sorted from human testis. The reference mitotic RT was downloaded from Koren et al., 201463, which was profiled from unsynchronised, flow-sorted lymphoblastoid cell lines.The Replication timing skew (RTS) values were calculated as the difference between the proportions of early (E) and late (L) replication regions per chromosome as: