Population WGS-based spinal muscular atrophy carrier screening in a cohort of 1076 healthy Polish individuals

Spinal muscular atrophy is a severe neuromuscular disorder with an autosomal recessive inheritance pattern. The disease-causing gene is SMN1, and its paralogue, SMN2, is a disease course modifier. SMN1 and SMN2 share over 99.9% sequence identity, and the genomic region shows a high rate of crossing over. For this reason, SMN1/SMN2 is usually excluded from whole-genome sequencing (WGS) analysis and investigated with traditional methods, such as MLPA and qPCR. Recently, novel bioinformatic algorithms dedicated to analyzing this particular genomic region have been developed. Here, we analyze the SMN1/SMN2 genomic region with a dedicated program, SMNCopyNumberCaller. We report a prevalence of SMN1 gene deletion carrier status (1 per 41 people) similar to that published for the Polish population (1 per 35 people). Additionally, SMNCopyNumberCaller can identify SMN2 CNVs and SMN2Δ7-8, which was present in 153 healthy Polish individuals. Two other programs for CNV analysis in standard genomic regions were not able to provide reliable results. Using WGS-based tools for SMN1/2 genomic region analysis is not only time-efficient but will also enable more complex analyses, such as screening for markers associated with silent carrier status and identification of further genetic modifiers. Although still an experimental method, WGS-based SMN1/SMN2 carrier identification may soon become a standard method for patients screened with WGS for other purposes.

Spinal muscular atrophy (SMA) is a disease caused by the degeneration of motor neurons located in the anterior horn of the spinal cord. The disease manifests with symmetrical muscular weakness, poor sucking, respiratory problems, muscular atrophy, and scoliosis. The course of the disease varies from very severe to mild, depending on the amount of functional SMN protein. The severity of the disease and the age of onset depend on the number of copies of the SMN2 gene, which can partially compensate for the missing SMN1 gene function (Prior et al. 2000). Additionally, some individuals carry two SMN1 copies on one chromosome (duplication allele) and no copies on the other (deletion allele). They are therefore silent (2 + 0) carriers (Luo et al. 2014). Silent carrier status can be detected neither with qPCR nor with multiplex ligation-dependent probe amplification (MLPA); however, the majority of silent carriers carry the g.27134 T > G variant, which is associated with a duplication haplotype and is present across different populations (Luo et al. 2014).
Population-wide SMA screening to quantify the SMN1 copy number (CN) is recommended by the American College of Medical Genetics and Genomics (Gregg et al. 2021). SMA is a good candidate for population-based screening because it is a relatively common lethal autosomal recessive disorder and because treatment is available (Ben-Shachar et al. 2011, Prior 2008, Sugarman et al. 2012, Zhao et al. 2021). SMA carrier frequency is estimated to be 1 in 25-50 in the general population, depending on the ethnic group, with the highest carrier frequency identified among Caucasians. In ancestrally diverse Asian genomes, the carrier frequency estimated from whole-genome sequencing (WGS) data with SMNCopyNumberCaller was 1 in 52 (Chan et al. 2022). In the Polish population, the carrier frequency is 1 in 35 (Jedrzejowska et al. 2010). In Poland, the standard method recommended by the Polish Society of Human Genetics for diagnostic purposes is MLPA. Neither PCR-restriction fragment length polymorphism (RFLP) nor WGS-based analysis is considered feasible for analysis of the SMN1/SMN2 genomic region (Gos and Jedrzejowska 2022).

Communicated by Michal Witt. Mateusz Sypniewski and Dominika Kresa contributed equally to this work.
Analyzing this region from WGS data with bioinformatic methods is challenging. The disease-causing gene, SMN1, and its paralog, SMN2, reside in a ~2-Mb region on 5q with a large number of complicated segmental and inverted segmental duplications (Prior et al. 2008). Because of the high sequence homology between SMN1 and SMN2 and frequent crossing over, the SMN region is difficult to resolve with both short- and long-read sequencing (Mandelker et al. 2016). Therefore, the region is usually excluded from WGS analysis. To address this issue, several bioinformatic tools dedicated to the analysis of this genomic region have recently been developed (Chen et al. 2020).
The purpose of this study is to estimate the carrier frequency of SMN1 and SMN2 duplications/deletions and of the common partial deletion SMN2∆7-8 in the Polish population using WGS data. To obtain this information, we applied a program dedicated to the SMN1/SMN2 genomic region: SMNCopyNumberCaller. Additionally, we analyzed the region with two widely used tools for CNV detection: CNVnator and LUMPY.
The cohort consisted of 1076 unrelated Polish individuals without symptoms of neuromuscular disorders or spinal muscular atrophy, described in Kaja et al. (2022). The study was approved by the Ethics Committee of the Central Clinical Hospital of the Ministry of Interior and Administration in Warsaw (decision no. 41/2020 of 3 April 2020 and no. 125/2020 of 1 July 2020). WGS data processing is described in detail by Kaja et al. (2022). Briefly, reads were mapped to the GRCh38 human reference genome using the Speedseq framework v.0.1.2; structural variants in the SMN1 and SMN2 regions were called with SMNCopyNumberCaller version 1.1.2 (Chen et al. 2020). The SMN1/SMN2 coordinates were taken from the ENSEMBL transcripts ENST00000380707.9 (SMN1) and ENST00000380743.9 (SMN2).

The results of the analysis are presented in Table 1 and Fig. 1. With SMNCopyNumberCaller, we obtained an SMN1 heterozygous gene deletion carrier prevalence of 2.41%, meaning that about 1 in 41 individuals carries only one copy of SMN1. Three copies of SMN1 were found in 5.3% of all individuals. A genotype with two SMN2 copies was the most common, followed by a genotype with one SMN2 copy. A truncated SMN2 variant (SMN2Δ7-8) was identified in 153 individuals (Fig. 1). Additionally, SMNCopyNumberCaller reports the copy number of g.27134 T > G, an SNP associated with the 2 + 0 silent carrier status. Three individuals from our cohort carried g.27134 T > G: one had two SMN1 copies (a potential 2 + 0 genotype, i.e., a silent carrier) and two had three SMN1 copies (2 + 1 genotype). For LUMPY and CNVnator, only the total number of individuals with CNVs within SMN1 and SMN2 was analyzed. This number was much lower than the prevalence obtained both from screening based on dosage analysis (Jedrzejowska et al. 2010) and from the dedicated program SMNCopyNumberCaller.
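The carrier prevalence above follows directly from per-sample SMN1 copy-number calls. The following is a minimal Python sketch of that tally, assuming the caller's output is a tab-separated table with `Sample`, `SMN1_CN`, and `g.27134T>G_CN` columns; these column names are an assumption and should be checked against the output of the version actually used. The function names and the toy rows are hypothetical and are not data from this cohort.

```python
import csv
import io
from collections import Counter

# Hypothetical toy table; column names are assumed, not taken from the paper.
EXAMPLE_TSV = (
    "Sample\tSMN1_CN\tSMN2_CN\tg.27134T>G_CN\n"
    "S1\t2\t2\t0\n"
    "S2\t1\t2\t0\n"
    "S3\t2\t1\t1\n"
    "S4\t3\t2\t1\n"
    "S5\tNone\t2\t0\n"  # no-call for SMN1, skipped below
)

def tally_smn1(tsv_text, cn_col="SMN1_CN", snp_col="g.27134T>G_CN"):
    """Count samples per SMN1 copy number and list potential 2+0 carriers."""
    counts = Counter()
    possible_silent = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            cn = int(row[cn_col])
        except (TypeError, ValueError):
            continue  # skip samples without an integer SMN1 call
        counts[cn] += 1
        try:
            snp_cn = int(row.get(snp_col, 0))
        except (TypeError, ValueError):
            snp_cn = 0
        # Two SMN1 copies plus the g.27134 T > G marker: potential 2+0 genotype.
        if cn == 2 and snp_cn > 0:
            possible_silent.append(row["Sample"])
    return counts, possible_silent

def carrier_frequency(counts):
    """Return (fraction of called samples with one SMN1 copy, 'one in N')."""
    total = sum(counts.values())
    carriers = counts.get(1, 0)
    if total == 0 or carriers == 0:
        return 0.0, None
    freq = carriers / total
    return freq, 1 / freq

counts, possible_silent = tally_smn1(EXAMPLE_TSV)
freq, one_in = carrier_frequency(counts)
```

As a check on the arithmetic, a prevalence of 2.41% corresponds to 1/0.0241, i.e., roughly 1 in 41. A sample flagged as a potential 2 + 0 carrier in this sketch is only a candidate, since the marker is associated with, not diagnostic of, the duplication haplotype.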
In approximately 95% of all patients, SMA is caused by a homozygous deletion of exon 7 of the SMN1 gene. In most of the remaining cases, a heterozygous exon 7 deletion is accompanied by a small pathogenic variant on the other allele (compound heterozygosity) (Prior et al. 2020). The transcripts of SMN1 and SMN2 differ by only five base pairs. Although these differences do not change amino acids, a single nucleotide difference at position +6 in exon 7 causes the inclusion of exon 7 in SMN1 transcripts and the exclusion of this exon in SMN2 transcripts. Although the amino acid sequence is maintained, the binding of a splicing enhancer is affected in SMN2, resulting in transcripts lacking exon 7. As a result, only 10-15% of the functional SMN protein is produced from the SMN2 gene (Prior 2008).

Three methods dedicated to the analysis of the SMN1/SMN2 genomic region have been developed. The first is based on a Bayesian hierarchical model and computes the probability that the fraction of SMN1-derived reads is equal to or smaller than 1/3 at three base differences between SMN1 and SMN2. However, this method does not call copy numbers, which makes it unsuitable for this analysis (Larson et al. 2015). The second method is based on the PCR approach. Its main limitation is insufficient performance on relatively low-depth NGS data, as the method relies on a single locus (Feng et al. 2017). The recently developed method, SMNCopyNumberCaller, enables more complex analysis (Chen et al. 2020). This includes the detection of copy-number changes, both whole-gene deletion/duplication and a partial deletion of a region that includes exons 7 and 8, and of small variants (Vijzelaar et al. 2019). Additionally, with this program, it is possible to analyze haplotypes and identify two SMN1 copies located on the same allele. Such events are relatively common and occur in nearly 5% of the healthy population (Prior 2008).
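The question asked by the first approach, how probable it is that the SMN1-derived read fraction is at most 1/3, can be illustrated with a deliberately simplified calculation (for instance, one SMN1 copy alongside two SMN2 copies gives an expected fraction of 1/3). The Python sketch below considers a single distinguishing base with a binomial read model and a uniform Beta(1, 1) prior. This is an assumption made for illustration and is not the hierarchical model of Larson et al. (2015); the function name and the read counts are hypothetical.

```python
from math import comb

def prob_fraction_at_most(smn1_reads, total_reads, threshold=1 / 3):
    """Posterior P(fraction <= threshold) given SMN1-supporting read counts.

    With a uniform prior the posterior is Beta(k + 1, n - k + 1). For integer
    parameters the Beta CDF equals a binomial upper tail:
    I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    k, n = smn1_reads, total_reads
    if not 0 <= k <= n:
        raise ValueError("smn1_reads must lie between 0 and total_reads")
    m = n + 1  # a + b - 1 with a = k + 1 and b = n - k + 1
    x = threshold
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(k + 1, m + 1))

# Hypothetical read counts at one SMN1/SMN2-distinguishing base.
p_low = prob_fraction_at_most(5, 30)    # few SMN1 reads: fraction likely <= 1/3
p_high = prob_fraction_at_most(15, 30)  # balanced reads: fraction unlikely <= 1/3
```

The output is a probability rather than a copy number, which illustrates why an approach of this kind flags likely carriers without calling SMN1 or SMN2 copy numbers.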
In Poland, the prevalence of SMN1 gene deletion carriers is estimated to be 1 per 35 individuals (17/600) and the incidence of SMA is 1 per 4900 (Jedrzejowska et al. 2010). We measured a similar prevalence with SMNCopyNumberCaller (1:41). Furthermore, with SMNCopyNumberCaller we obtained an SMN2 distribution similar to that reported from dosage analysis, whereas two standard CNV-calling programs did not give reliable results for the SMN1/SMN2 regions and showed a much lower prevalence. We identified a potential silent carrier (2 + 0) in one case. Jedrzejowska et al. (2010) noted a discrepancy between the SMA incidence calculated from the carrier frequency and the incidence observed in real data. We argue that this may be due to the presence of silent carriers in the population.
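The link between carrier frequency and expected incidence can be made explicit. Under Hardy-Weinberg-style assumptions (random mating, and every affected individual inheriting one deletion allele from each carrier parent), a child is affected when both parents are carriers, with probability c², and both transmit the deletion allele, with probability 1/4. The short Python sketch below applies this approximation to the figures quoted above. The function name is illustrative, and the calculation ignores silent carriers, small pathogenic variants, and de novo events.

```python
from fractions import Fraction

def expected_incidence(carrier_one_in):
    """Approximate expected incidence for a carrier frequency of 1 in N."""
    c = Fraction(1, carrier_one_in)
    # Both parents carriers (c * c), both transmit the deletion allele (1/4).
    return c * c * Fraction(1, 4)

print(expected_incidence(35))  # 1/4900
print(expected_incidence(41))  # 1/6724
```

A carrier frequency of 1 in 35 gives an expected incidence of 1 in 4900, matching the figure quoted above, while the 1 in 41 estimate from this cohort would correspond to roughly 1 in 6700 under the same assumptions.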
Early detection of SMA is crucial given the availability of treatment: Nusinersen, Risdiplam, and Zolgensma (Romanelli Tavares et al. 2021). In Poland, an SMA screening program for newborns was approved in April 2021 and has covered the whole country since March 2022. The diagnosis is based on genetic tests, i.e., a two-stage screening test with qPCR and MLPA. Thanks to the introduction of newborn screening for SMA, the time from birth to diagnosis can be significantly reduced (Gailite et al. 2022).
WGS-based analysis of the SMN1/SMN2 genomic region has several benefits. WGS allows screening for multiple disorders and is being considered for implementation in all newborns in the UK (Burki 2022). Once implemented, it will allow SMA screening from the WGS data; therefore, efficient methods are needed. Additionally, participants in population-based WGS studies could benefit from such screening. Another advantage of the WGS-based method is the possibility of performing more complex analyses, including the identification of additional genetic markers and modifiers and the detection of combinations of CNVs and single nucleotide variants in the same run (Chen et al. 2020).
In this study, we investigated SMN1/SMN2 carrier frequency with a dedicated program, SMNCopyNumberCaller. The obtained results are comparable with state-of-the-art screening methods, while two standard CNV-calling programs were not reliable for SMN1/SMN2 analysis. Our study confirmed the necessity of using dedicated tools for the analysis of this genomic region. Although analysis of the SMN1/SMN2 genomic region in WGS data is seldom performed nowadays, with dedicated tools it should become more common soon.

Fig. 1 Number of CN for SMN1, SMN2, and SMN2∆7-8 detected using SMNCopyNumberCaller