Spinal muscular atrophy (SMA; type I, MIM #253300; type II, MIM #253550; type III, MIM #253400; type IV, MIM #271150, SMA5, SMN1x2, SMN2x0) is a relatively common neuromuscular genetic disease caused by the degeneration of motor neurons located in the anterior horn of the spinal cord. The disease manifests with symmetrical muscle weakness, poor sucking, respiratory problems, muscular atrophy and scoliosis. Its course varies from very severe to mild, depending on the amount of functional SMN protein. Severity and age of onset depend on the number of copies of the SMN2 gene, which can partially compensate for the missing SMN1 gene function (Prior et al. 2000).
Population-wide SMA screening that quantifies the SMN1 copy number (CN) is recommended by the American College of Medical Genetics and Genomics (Gregg et al. 2021). SMA is a good candidate for population-based screening because it is a relatively common lethal autosomal recessive disorder and because treatment is available (Ben-Shachar et al. 2011, Prior 2008, Sugarman et al. 2012, Zhao et al. 2021). The SMA carrier frequency is estimated at 1 in 25-50 in the general population, depending on the ethnic group, with the highest carrier frequency identified among Caucasians. In the Polish population, the carrier frequency is 1 in 35 (Jedrzejowska et al. 2010). In Poland, the standard methods for carrier screening, recommended by the Polish Society of Human Genetics, are Multiplex Ligation-dependent Probe Amplification (MLPA) and quantitative Polymerase Chain Reaction (qPCR). Neither PCR-restriction fragment length polymorphism (PCR-RFLP) nor whole genome sequencing (WGS)-based analysis is feasible for the SMN1/SMN2 genomic region (Gos and Jedrzejowska 2022).
Analyzing this region from WGS data with bioinformatic methods is challenging. The disease-causing gene, SMN1, and its paralog, SMN2, reside in a ~2-Mb region on 5q that contains a large number of complicated segmental and inverted segmental duplications (Prior et al. 2008). Because of the high sequence homology between SMN1 and SMN2 and multiple crossing-over events, the SMN region is difficult to resolve with both short- and long-read sequencing (Mandelker et al. 2016). The region is therefore usually excluded from WGS analysis. To address this problem, several bioinformatic tools dedicated to this genomic region have recently been developed (Chen et al. 2020).
The purpose of this study is to estimate the carrier frequency of SMN1 and SMN2 duplications/deletions and of the common partial deletion SMN2∆7–8 in the Polish population using WGS data. To obtain this information, we applied a program dedicated to the SMN1/SMN2 genomic region, SMNCopyNumberCaller. Additionally, we analyzed the region with two widely used tools for CNV detection, CNVnator and LUMPY.
The cohort consisted of 1076 unrelated healthy Polish individuals described in Kaja et al. (2022). The study was approved by the Ethics Committee of the Central Clinical Hospital of the Ministry of Interior and Administration in Warsaw (decisions no. 41/2020 of 3 April 2020 and 125/2020 of 1 July 2020). WGS data processing is described in detail in Kaja et al. (2022). Briefly, reads were mapped to the GRCh38 human reference genome using the SpeedSeq framework v.0.1.2; structural variants in the SMN1 and SMN2 regions were called with SMNCopyNumberCaller version 1.1.2 (Chen et al. 2020). The SMN1/SMN2 coordinates were taken from the ENSEMBL transcript coordinates; the chosen transcript was ENST00000380707.9 for SMN1 and ENST00000380743.9 for SMN2.
The results of the analysis are presented in Table 1 and Figure 1. With SMNCopyNumberCaller, we obtained a prevalence of heterozygous SMN1 deletion carriers of 2.41%, meaning that 1 in 41 individuals carries only one copy of SMN1. Three copies of SMN1 were found in 5.3% of all individuals. The genotype with 2 SMN2 copies was the most common, followed by the genotype with 1 SMN2 copy. A truncated SMN2 variant (SMN2Δ7-8) was identified in 153 individuals, with the following distribution: 0 copies, 910; 1 copy, 149; 2 copies, 3; 3 copies, 1. For LUMPY and CNVnator, only the total number of individuals with CNVs within SMN1 and SMN2 was analyzed. This number was much lower than the prevalence obtained both from screening based on dosage analysis (Jedrzejowska et al. 2010) and from the dedicated program SMNCopyNumberCaller.
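As a rough illustration of how per-sample calls could be tallied into counts of this kind, the sketch below cross-tabulates SMN1 and SMN2 copy numbers from a tab-separated table. The column names `Sample`, `SMN1_CN` and `SMN2_CN`, the `None` marker for no-calls, and the toy rows are assumptions made for illustration; they are not taken from the study data or from the processing scripts used here.

```python
import csv
import io
from collections import Counter

def tabulate_calls(tsv_text):
    """Cross-tabulate SMN1 and SMN2 copy numbers from a per-sample TSV.

    Assumes columns named Sample, SMN1_CN and SMN2_CN (hypothetical here);
    rows where either value is missing or non-numeric count as no-calls.
    """
    table = Counter()
    no_calls = 0
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        try:
            smn1 = int(row["SMN1_CN"])
            smn2 = int(row["SMN2_CN"])
        except (KeyError, TypeError, ValueError):
            no_calls += 1
            continue
        table[(smn1, smn2)] += 1
    return table, no_calls

def carrier_fraction(table):
    """Fraction of called samples that carry exactly one SMN1 copy."""
    called = sum(table.values())
    carriers = sum(n for (smn1, _), n in table.items() if smn1 == 1)
    return carriers / called if called else float("nan")

# Toy input, not study data.
example = "Sample\tSMN1_CN\tSMN2_CN\nA\t2\t2\nB\t1\t3\nC\tNone\tNone\nD\t2\t1\n"
table, no_calls = tabulate_calls(example)
print(dict(table), no_calls, carrier_fraction(table))
```

Keeping no-calls out of the denominator mirrors the way samples without calls are left out of the copy number table.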
Table 1. SMN1 and SMN2 CN detected using SMNCopyNumberCaller (modified after Jedrzejowska et al. 2010). The 12 samples with no calls are not included in the table.
| SMN1 CN \ SMN2 CN | 0 | 1 | 2 | 3 | 4 | overall |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 6 | 10 | 8 | 1 | 26 |
| 2 | 80 | 401 | 486 | 13 | 1 | 981 |
| 3 | 6 | 21 | 27 | 2 | 0 | 56 |
| 4 | 0 | 1 | 0 | 0 | 0 | 1 |
| overall | 87 | 429 | 523 | 23 | 2 | 1064 |
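The marginal totals and carrier fractions can be recomputed directly from the counts in Table 1. The following is a minimal sketch of that arithmetic; the names `TABLE_1`, `row_totals` and `column_totals` are illustrative and not part of any tool used in the study.

```python
# Counts transcribed from Table 1: keys are SMN1 CN, lists are SMN2 CN 0 to 4.
TABLE_1 = {
    1: [1, 6, 10, 8, 1],
    2: [80, 401, 486, 13, 1],
    3: [6, 21, 27, 2, 0],
    4: [0, 1, 0, 0, 0],
}

def row_totals(table):
    """Number of individuals per SMN1 copy number."""
    return {cn: sum(counts) for cn, counts in table.items()}

def column_totals(table):
    """Number of individuals per SMN2 copy number (0 to 4)."""
    return [sum(col) for col in zip(*table.values())]

smn1_totals = row_totals(TABLE_1)
smn2_totals = column_totals(TABLE_1)
n_called = sum(smn1_totals.values())

carriers = smn1_totals[1]
print(f"called samples: {n_called}")
print(f"one SMN1 copy: {carriers} ({carriers / n_called:.2%}), "
      f"about 1 in {n_called / carriers:.0f}")
print(f"three SMN1 copies: {smn1_totals[3]} ({smn1_totals[3] / n_called:.2%})")
```

Dividing by the 1064 called samples gives a carrier fraction marginally higher than the 2.41% quoted in the text, which is closer to what one obtains with the full cohort of 1076 as the denominator. Either way the result rounds to roughly 1 in 41.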
In approximately 95% of all patients, SMA is caused by a homozygous deletion of exon 7 of the SMN1 gene. In most of the remaining cases, a heterozygous exon 7 deletion is accompanied by a small pathogenic variant on the other allele (compound heterozygosity) (Prior et al. 2020). The transcripts of SMN1 and SMN2 differ by only five base pairs. None of these differences changes the amino acid sequence, but a single nucleotide difference at position +6 of exon 7 affects the binding of a splicing enhancer, so that exon 7 is included in SMN1 transcripts and excluded from most SMN2 transcripts. As a result, only 10-15% of the functional SMN protein is produced from the SMN2 gene (Prior 2008).
Three methods dedicated to the analysis of the SMN1/SMN2 genomic region have been developed. The first is based on a Bayesian hierarchical model and computes the probability that the fraction of SMN1-derived reads is equal to or smaller than 1/3 at three base differences between SMN1 and SMN2. However, this method does not call copy numbers, which makes it unsuitable for this analysis (Larson et al. 2015). The second method is based on the PCR approach. Its main limitation is insufficient performance on relatively low-depth NGS data, because the method relies on only a single locus (Feng et al. 2017). The most recently developed method, SMNCopyNumberCaller, enables more complex analysis (Chen et al. 2020). It detects whole-gene deletions and duplications, a partial deletion of the region that includes exons 7 and 8, and small variants (Vijzelaar et al. 2019). Additionally, with this program it is possible to analyze haplotypes and identify two SMN1 copies located on the same chromosome. Such events are relatively common and occur in nearly 5% of the healthy population (Prior 2008).
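To make the read-fraction idea concrete, the sketch below computes the posterior probability that the fraction of SMN1-derived reads at a single distinguishing base is at most 1/3. It is a deliberately simplified single-site illustration with a flat Beta(1, 1) prior, not the hierarchical model of Larson et al. (2015) and not the algorithm of SMNCopyNumberCaller; the function name and the read counts are hypothetical. A fraction of 1/3 or less is what one would expect, for example, when a single SMN1 copy is accompanied by two or more SMN2 copies.

```python
from math import comb

def prob_smn1_fraction_at_most(k_smn1, n_total, threshold=1 / 3):
    """Posterior P(theta <= threshold) for the SMN1 read fraction theta.

    Uses a flat Beta(1, 1) prior and a binomial likelihood, so the posterior
    is Beta(k + 1, n - k + 1). For integer parameters its CDF equals a
    binomial tail sum, which avoids any external dependency.
    """
    if not 0 <= k_smn1 <= n_total:
        raise ValueError("need 0 <= k_smn1 <= n_total")
    m = n_total + 1
    return sum(
        comb(m, j) * threshold**j * (1 - threshold) ** (m - j)
        for j in range(k_smn1 + 1, m + 1)
    )

# Hypothetical read counts at one SMN1/SMN2-distinguishing base.
print(prob_smn1_fraction_at_most(8, 40))   # few SMN1-supporting reads
print(prob_smn1_fraction_at_most(20, 40))  # balanced SMN1 and SMN2 reads
```

A probability of this kind flags likely carriers but does not by itself yield a copy number.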
In Poland, the prevalence of SMN1 gene deletion carriers is estimated at 1 in 35 individuals (17/600) and the incidence of SMA at 1 in 4,900 (Jedrzejowska et al. 2010). We measured a similar prevalence with SMNCopyNumberCaller (1 in 41). Furthermore, the SMN2 distribution obtained with SMNCopyNumberCaller was similar to that reported on the basis of dosage analysis, whereas the two standard CNV-calling programs did not give reliable results for the SMN1/SMN2 region and showed a much lower prevalence.
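One way to see how close the two carrier estimates are is to compare approximate 95% intervals for 17/600 (dosage analysis) and 26/1064 (Table 1, called samples only). The sketch below uses the Wilson score interval; this check is offered as an illustration and was not part of the analysis reported here, and the function name is illustrative.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

dosage = wilson_interval(17, 600)   # Jedrzejowska et al. 2010
wgs = wilson_interval(26, 1064)     # Table 1, called samples only
overlap = dosage[0] <= wgs[1] and wgs[0] <= dosage[1]
print(f"dosage analysis: {dosage[0]:.3%} to {dosage[1]:.3%}")
print(f"WGS caller:      {wgs[0]:.3%} to {wgs[1]:.3%}")
print("intervals overlap:", overlap)
```

Overlapping intervals are consistent with the two estimates being similar, although they do not amount to a formal test of equality.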
Early detection of SMA is crucial given the availability of treatment: nusinersen, risdiplam and Zolgensma (Teveres et al. 2021). In Poland, an SMA screening program for newborns was approved in April 2021 and has been implemented since then. At present, it covers around 70% of live births and is to be extended to the whole of Poland by the end of 2022. The diagnosis is based on genetic testing, i.e. a two-stage screening test with qPCR and MLPA. Thanks to the introduction of newborn screening for SMA, the time from birth to diagnosis can be significantly reduced.
WGS-based analysis of the SMN1/SMN2 genomic region has several benefits. WGS allows screening for multiple disorders and is being considered for implementation in all newborns in the UK (Burki 2022). Once implemented, it will allow SMA screening from WGS data, so efficient methods are needed. Additionally, participants in population-based WGS studies could benefit from such screening. Another advantage of the WGS-based method is the possibility of performing more complex analysis, including the identification of additional genetic markers and modifiers and the detection of CNVs and single nucleotide variants in the same run (Chen et al. 2020).
In this study, we investigated SMN1/SMN2 carrier frequency with a dedicated program, SMNCopyNumberCaller. The results obtained are comparable with those of state-of-the-art screening methods, whereas two standard CNV-calling programs were not reliable for SMN1/SMN2 analysis. Our study confirms the need for dedicated tools to analyze this genomic region. Although analysis of the SMN1/SMN2 genomic region in WGS data is seldom performed at present, dedicated tools should make it more common soon.