Our CNV calling pipeline identified 5 062 CNVs ranging from 200 kbp to 75 260 kbp (median size 320 kbp). Altogether, 4 042 individuals (31.19%) carried at least one CNV; of these, 79.56% carried only one CNV and 23.44% carried at least two CNVs. Moreover, one woman from the Slovak population carried as many as 32 CNVs, suggesting genomic instability. The gains-to-losses ratio was approximately 2.5:1 in all populations (Table 1).
Table 1
Data summary for individual populations of pregnant women undergoing NIPT analysis.
Population | Samples | Samples with at least 1 CNV | CNVs | Gains | Losses |
Slovak | 9 230 | 2 900 | 3 585 | 2 578 (72%) | 1 007 (28%) |
Czech | 1 583 | 510 | 622 | 460 (74%) | 162 (26%) |
Hungarian | 1 919 | 632 | 855 | 611 (71%) | 244 (29%) |
Sum | 12 732 | 4 042 | 5 062 | 3 649 (72%) | 1 413 (28%) |
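The gains-to-losses ratio quoted above can be recomputed directly from the counts in Table 1. The following is a minimal sketch of that arithmetic only; the function name `gain_loss_summary` is an illustrative assumption and not part of the calling pipeline.

```python
# Counts copied from Table 1: (gains, losses) per population.
TABLE_1 = {
    "Slovak": (2578, 1007),
    "Czech": (460, 162),
    "Hungarian": (611, 244),
}


def gain_loss_summary(gains, losses):
    """Return (share of gains in percent, gains-to-losses ratio)."""
    total = gains + losses
    return 100.0 * gains / total, gains / losses


# Per-population summaries, followed by the pooled "Sum" row.
summary = {pop: gain_loss_summary(g, l) for pop, (g, l) in TABLE_1.items()}
total_gains = sum(g for g, _ in TABLE_1.values())
total_losses = sum(l for _, l in TABLE_1.values())
summary["Sum"] = gain_loss_summary(total_gains, total_losses)

for pop, (share, ratio) in summary.items():
    print(f"{pop}: {share:.0f}% gains, ratio {ratio:.2f}:1")
```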
Excluding chromosome X, chromosome 6 contained the most gains: 11.6%, 10.7%, and 10.8% of all gains found in the Slovak, Czech, and Hungarian populations, respectively. In contrast, the highest count of losses was observed on chromosome 7 in all three populations (Slovak 10.0%, Czech 14.2%, and Hungarian 12.3%). With a few exceptions, the overall count of CNVs decreased with the length of the chromosomes (Fig. 1a, Supplementary Table 1). To determine the length distribution of the variants, we divided them into size ranges of 100 kbp. The most frequent CNV size was 200 kbp to 500 kbp; this range contained around 70–85% of all CNVs. Larger CNVs were rare, and their count decreased with increasing size (Fig. 1b, Supplementary Table 2).
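The division into 100 kbp size ranges can be sketched as follows. This is an illustration under stated assumptions: the lengths in `example_lengths` are hypothetical values, not data from the cohort, and the convention that each range includes its lower bound and excludes its upper bound is our assumption, since the text does not specify it.

```python
from collections import Counter

BIN_KBP = 100  # width of each size range, as described in the text


def size_bin(length_kbp, width=BIN_KBP):
    """Return (start, end) of the range containing length_kbp.

    Assumed convention: start is inclusive, end is exclusive.
    """
    start = (length_kbp // width) * width
    return start, start + width


def length_distribution(lengths_kbp):
    """Return the percentage of CNVs in each size range, sorted by range."""
    counts = Counter(size_bin(length) for length in lengths_kbp)
    total = len(lengths_kbp)
    return {b: 100.0 * n / total for b, n in sorted(counts.items())}


# Hypothetical CNV lengths in kbp, for illustration only.
example_lengths = [200, 240, 320, 320, 480, 510, 1200, 3400]
dist = length_distribution(example_lengths)

for (start, end), share in dist.items():
    print(f"{start}-{end} kbp: {share:.1f}%")
```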
By comparing the distributions of CNV distances to chromosome ends and to centromeric regions, we found that CNVs were overrepresented close to telomeres and centromeres (Fig. 2). The average frequency of CNVs per 1 Mbp of random genome sequence was 0.041%, whereas the average CNV frequencies within 1 Mbp proximal to centromeres and telomeres were 8.48% and 7.70%, respectively (Table 2). However, this overrepresentation may result from a combination of technical and biological effects, since the detection method has reduced precision in regions with low mappability, which usually include regions near centromeres and chromosome ends.
Table 2
Average CNV frequency within 1 Mbp of different genomic regions.
Region | Slovak | Czech | Hungarian | Average |
1 Mbp of random haploid genome sequence | 0.040% | 0.038% | 0.044% | 0.041% |
1 Mbp proximal to centromere | 5.76% | 7.04% | 12.66% | 8.48% |
1 Mbp proximal to telomeres | 7.31% | 7.41% | 8.39% | 7.70% |
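One possible way to decide whether a CNV lies within 1 Mbp of a centromere or a chromosome end is sketched below. The exact distance definition used in the analysis is not given in the text, so measuring the gap between the nearest interval edges is an assumption, and the chromosome length, centromere coordinates, and CNV coordinates are hypothetical values chosen for illustration.

```python
WINDOW = 1_000_000  # 1 Mbp window, as in Table 2


def distance_to_region(cnv_start, cnv_end, region_start, region_end):
    """Gap in bp between a CNV and a region; 0 if the two intervals overlap."""
    if cnv_end < region_start:
        return region_start - cnv_end
    if cnv_start > region_end:
        return cnv_start - region_end
    return 0


def nearest_distances(cnv_start, cnv_end, chrom_length, cen_start, cen_end):
    """Return (distance to nearest chromosome end, distance to centromere)."""
    to_telomere = min(cnv_start, chrom_length - cnv_end)
    to_centromere = distance_to_region(cnv_start, cnv_end, cen_start, cen_end)
    return to_telomere, to_centromere


# Hypothetical chromosome: 100 Mbp long, centromere at 40-43 Mbp.
CHROM_LENGTH = 100_000_000
CEN = (40_000_000, 43_000_000)

# Hypothetical CNVs as (start, end) in bp.
cnvs = [(300_000, 620_000), (39_200_000, 39_600_000), (70_000_000, 70_400_000)]

flags = []
for start, end in cnvs:
    tel, cen = nearest_distances(start, end, CHROM_LENGTH, *CEN)
    # (near a chromosome end?, near the centromere?)
    flags.append((tel <= WINDOW, cen <= WINDOW))

print(flags)
```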
Using the Chi-square test, we compared population differences in the count of CNVs across all chromosomes and found a statistically significant difference in CNV gains (p-value = 0.0113). A comparison of the individual population pairs showed a significant difference between the Slovak and Hungarian populations (p-value = 0.0396, Chi-square test). However, when comparing population differences in the count of CNVs on individual chromosomes, we did not find any significant difference after Bonferroni correction (0.05/23 = 0.0022).
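For readers unfamiliar with the test, the Pearson Chi-square statistic for a contingency table of counts can be sketched as below. This is not the analysis code used in the study: the counts in `table` are hypothetical, and the closed-form p-value is valid only for exactly 2 degrees of freedom (for example a 2 x 3 table), whereas a comparison across all chromosomes would have more degrees of freedom and would normally rely on a statistics library.

```python
import math


def chi_square_statistic(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_dof2(stat):
    """Upper-tail probability of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# Hypothetical counts: rows are two chromosomes, columns are three populations.
table = [
    [120, 90, 60],
    [80, 110, 100],
]

stat, dof = chi_square_statistic(table)
assert dof == 2  # the closed-form p-value below applies only in this case
p = p_value_dof2(stat)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p:.2e}")
```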
We found a statistically significant difference in CNV length distribution between populations (p-value = 8.69 × 10⁻¹⁴) when we compared the count of CNV gains in individual length ranges (Fig. 1b). The comparison of individual population pairs showed a significant difference between the Slovak and Hungarian populations (p-value = 8.88 × 10⁻¹⁶). When we compared the population pairs within each individual CNV length range, we found a significant difference between the Slovak and Hungarian populations in the length ranges 200–300 kbp (p-value = 0.000315), 3–4 Mbp (p-value = 1.86 × 10⁻¹⁸), and 4–5 Mbp (p-value = 0.000225), and between the Czech and Hungarian populations in the length range 3–4 Mbp (p-value = 0.000758), all after Bonferroni correction (0.05/23 = 0.0022). We did not find any significant population difference in the count of CNV losses, either overall or in individual length ranges.
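The Bonferroni threshold and its application to the per-range p-values reported above can be checked with a few lines. This sketch only restates the arithmetic in the text; the dictionary keys are illustrative labels, and the number of comparisons (23) is taken from the correction quoted in the paragraph.

```python
ALPHA = 0.05
N_TESTS = 23  # number of comparisons used in the correction quoted in the text

threshold = ALPHA / N_TESTS

# p-values for CNV gains as reported in the paragraph above.
reported = {
    ("Slovak-Hungarian", "200-300 kbp"): 0.000315,
    ("Slovak-Hungarian", "3-4 Mbp"): 1.86e-18,
    ("Slovak-Hungarian", "4-5 Mbp"): 0.000225,
    ("Czech-Hungarian", "3-4 Mbp"): 0.000758,
}

# A comparison is significant if its p-value is below the corrected threshold.
significant = {key: p < threshold for key, p in reported.items()}

print(f"threshold = {threshold:.4f}")
for (pair, size_range), is_sig in significant.items():
    print(f"{pair}, {size_range}: significant = {is_sig}")
```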
We continued by searching for the most prevalent CNVs in the population, specifically those with a frequency exceeding 1%, which can be considered copy number polymorphisms [23]. We found 7 gains and 8 losses with an allelic frequency ≥ 1% in at least one population (Supplementary Table 3). When we compared these variants with the publicly available database gnomAD SVs v2.1 (European) [24], we found no comparable range in six cases (gains: 8:2340000–2580000; 15:32020000–32420000; 22:22280000–22580000; losses: 7:64680000–64900000; 9:11840000–12200000; 15:22760000–23080000) (Supplementary Table 4). After applying the automated ACMG guidelines available at https://genovisio.com, 8 variants were classified as variants of uncertain significance (VUS), which have no known clinical relevance, and 7 variants as benign. According to the ISV tool [25], 4 variants were VUS and 11 variants were benign. Using the artificial intelligence integrated in the X-CNV predictive tool [26], we identified 10 variants as benign, 2 as likely benign, 2 as VUS, and 1 as pathogenic. For 7 variants, the predictions of all three tools matched (Supplementary Table 4).
Considering the counts of variants between populations, we found a difference in the representation of variants 8:2260000–2640000 (p = 2.18 × 10⁻⁸), 8:2340000–2580000 (p = 2.29 × 10⁻¹³), and 12:20960000–21400000 (p = 1.63 × 10⁻³; statistically significant after Bonferroni correction; Supplementary Table 3). These CNVs were absent from at least one population, so we considered their occurrence to be zero in that population. When comparing such CNVs only between the two populations with non-zero counts, we observed a different representation of 12:20960000–21400000 (SK-HU, p = 0.00168) (Supplementary Table 3).
Since CNVs can overlap different genomic regions, we explored the representation of protein-coding genes, long non-coding RNAs (lncRNAs), and microRNAs (miRNAs) in our cohorts. Coordinates for the individual genomic regions, also known as biotypes, were downloaded from the Ensembl genome database [27]. Subsequently, following Woodwark and Bateman [28], three types of overlap were defined for each biotype (Fig. 3): I.) biotype in CNV (the CNV entirely encompasses the genomic region), II.) biotype partially overlapping the CNV (the start position of the CNV is upstream of the region while its end position lies within the region, or the start position of the CNV lies within the region while its end position is downstream of it), and III.) CNV in biotype (the genomic region entirely encompasses the CNV). The total sum of all CNV-biotype overlaps in the studied populations for gains and losses is shown in Table 3.
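The three overlap types defined above can be expressed as a simple interval classification. The sketch below is illustrative only: the function name `classify_overlap`, the use of closed (inclusive) coordinates, and the decision to report a CNV with exactly the same coordinates as the region as type I are assumptions not stated in the text, and the example coordinates are hypothetical.

```python
def classify_overlap(cnv_start, cnv_end, region_start, region_end):
    """Classify the overlap between a CNV and a genomic region (biotype).

    Returns "I" (biotype in CNV), "II" (partial overlap),
    "III" (CNV in biotype), or None when the intervals do not overlap.
    Coordinates are assumed to be closed intervals on the same chromosome.
    """
    if cnv_end < region_start or cnv_start > region_end:
        return None  # no overlap at all
    if cnv_start <= region_start and cnv_end >= region_end:
        return "I"  # the CNV entirely encompasses the region
    if region_start <= cnv_start and cnv_end <= region_end:
        return "III"  # the region entirely encompasses the CNV
    return "II"  # only one CNV boundary falls inside the region


# Hypothetical CNV and regions, for illustration only.
cnv = (1000, 5000)
regions = {
    "inside_cnv": (2000, 3000),
    "partial": (4000, 6000),
    "around_cnv": (500, 7000),
    "elsewhere": (6000, 7000),
}
result = {name: classify_overlap(*cnv, *coords) for name, coords in regions.items()}
print(result)
```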
Table 3
The ratio of CNV-biotype overlaps in a given population for gains and losses.
Biotype | Gains, Slovak | Gains, Czech | Gains, Hungarian | Losses, Slovak | Losses, Czech | Losses, Hungarian |
gene* | 30.59% | 32.56% | 26.84% | 11.74% | 7.82% | 6.69% |
lncRNA | 57.42% | 64.90% | 46.90% | 25.03% | 20.99% | 18.40% |
miRNA | 0.003% | 0.003% | 0.003% | 0.001% | 0.001% | 0.000% |
*gene represents protein-coding sequences, including both exons and introns; lncRNA: long non-coding RNA; miRNA: microRNA.
On average, 39% of CNV sequences overlapped protein-coding genes (30% in gains and 9% in losses). Moreover, more than half of all CNV sequences (on average 78%) overlapped lncRNAs (56% in gains, 22% in losses). In contrast, CNV-miRNA overlaps were near zero, since miRNAs constitute only a small portion of the genome. Each type of CNV-biotype overlap, calculated separately, is listed in Supplementary Table 5 and Supplementary Fig. 1.