Population genetic studies depend strongly on variant calling, and MS analysis, like the analysis of other variant types, is affected by the genotyping method. Because most MS calling methods analyze only predefined MS regions, the number of target MS differs among studies: 700,000 loci in Willems et al. (2014) and Gymrek et al. (2017), and 1.6 million loci in Jakubosky et al. (2020). In this study, we analyzed a larger number of MS regions (8,468,218 MS) than previous studies to provide a comprehensive analysis of human MS polymorphisms. Compared with conventional PCR-based MS studies (Rosenberg et al. 2002), this study has another advantage: the MS were not subject to ascertainment bias. Most conventional studies analyzed pre-screened MS marker sets, so their results are influenced by how the markers were selected. In contrast, we did not select MS based on allele frequencies in particular populations, and could therefore analyze the features of MS and compare variation among populations without the influence of ascertainment bias.
We first selected parameters for the analysis based on an evaluation of monozygotic twins in the KPGP. The concordance rate for this parameter set was estimated to be 99.87%, which is sufficiently accurate for population studies (Table S1). We then selected 8,468,218 MS based on the call rate and HWE in 81 Korean individuals from the KPGP. Our MS calling identified 253,114 MS with MAF ≥ 1% in the SGDP and HGDP (Supplementary Fig. 1). Since the SGDP and HGDP datasets represent human genome diversity, these MS polymorphisms can be used in future population studies. Previous genome-wide studies reported that MS polymorphisms can influence gene expression patterns and human traits (Gymrek et al. 2015; Jakubosky et al. 2020). Therefore, applying our MS set and MS calling method may contribute to the discovery of novel disease susceptibility genes (Press et al. 2014). In particular, the 696 genes with non-3n MS are good targets for disease studies (Table S5).
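The twin-based evaluation can be illustrated with a minimal sketch. It assumes that concordance is defined as the fraction of loci called in both twins whose unordered genotypes are identical; the exact definition used for Table S1 is not stated here, and the function name, locus names, and genotype values below are hypothetical.

```python
def concordance_rate(calls_a, calls_b):
    """Fraction of loci called in both samples with identical genotypes.

    calls_a, calls_b: dict mapping a locus ID to a pair of allele lengths.
    Assumption: allele order within a genotype is ignored.
    """
    shared = [locus for locus in calls_a if locus in calls_b]
    if not shared:
        raise ValueError("no loci called in both samples")
    matches = sum(
        sorted(calls_a[locus]) == sorted(calls_b[locus]) for locus in shared
    )
    return matches / len(shared)


# Toy genotype calls for a hypothetical monozygotic twin pair.
twin1 = {"ms1": (12, 14), "ms2": (8, 8), "ms3": (20, 21), "ms4": (5, 6)}
twin2 = {"ms1": (14, 12), "ms2": (8, 8), "ms3": (20, 22), "ms5": (9, 9)}

rate = concordance_rate(twin1, twin2)
print(f"concordance over shared loci: {rate:.4f}")
```

Loci called in only one twin are excluded from the denominator in this sketch, so the call rate and the concordance rate are kept as separate quality measures.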
The amount of genetic variation reflects the effective population size (N). A comparison of heterozygosity across regions showed that Africa had the highest heterozygosity and that America, which has been estimated to have a very small effective population size (Hey 2005), had the lowest (Fig. 3, Table S4). Such patterns have been observed for other types of genetic variation and for MS in previous studies (Willems et al. 2014; Auton et al. 2015), suggesting that the heterozygosity of MS reflects the size of each population. Theoretical population genetics also predicts that the effectiveness of natural selection depends on the selection coefficient (s) of the genetic variation and on the population size (Charlesworth 2009). Therefore, a comparison of the genetic variation of MS in the CDS may provide additional information about the evolution of MS. Most 3n MS should not severely damage protein function and should have neutral or nearly neutral effects, whereas non-3n MS cause a frameshift and should be deleterious. To evaluate the strength of negative selection among populations, we compared the average heterozygosity of non-3n MS in the CDS with that of 3n MS in the CDS (Fig. 3f, Table S4). The African population showed the lowest ratio, and populations with lower heterozygosity tended to have a higher ratio (Fig. 3f, Table S4). This pattern indicates that the African population has the largest effective population size and that stronger natural selection has acted to remove deleterious non-3n MS. In Central Asia and Siberia, the heterozygosity was not the lowest, but the heterozygosity ratio of non-3n MS to 3n MS in the CDS was the highest (Fig. 3f, Table S4). A previous study showed that subdivided populations have a higher effective population size and a lower selection coefficient (Cherry and Wakeley 2003). The Central Asia and Siberia population may be composed of subpopulations, which may affect the selection pressure against MS. These results indicate that MS have evolved through a combination of population history and natural selection.
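The heterozygosity ratio discussed above can be sketched as follows. This is only an illustration under stated assumptions: heterozygosity is computed as the expected heterozygosity 1 − Σp², and a locus is treated as 3n when its repeat unit length is a multiple of 3. Neither choice is confirmed as the estimator behind Fig. 3f, and all names and values are hypothetical.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity 1 - sum(p_i^2) from observed allele copies."""
    counts = Counter(alleles)
    n = len(alleles)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def non3n_to_3n_ratio(loci):
    """Ratio of mean heterozygosity of non-3n MS to that of 3n MS.

    loci: list of (unit_length, alleles) tuples for MS in the CDS.
    Assumption: a locus is "3n" when unit_length is a multiple of 3.
    """
    h_3n = [expected_heterozygosity(a) for u, a in loci if u % 3 == 0]
    h_non3n = [expected_heterozygosity(a) for u, a in loci if u % 3 != 0]
    if not h_3n or not h_non3n:
        raise ValueError("need at least one locus in each class")
    return (sum(h_non3n) / len(h_non3n)) / (sum(h_3n) / len(h_3n))


# Toy data for one hypothetical population: allele lengths per chromosome.
toy_loci = [
    (3, [5, 5, 6, 6]),      # 3n locus, two alleles at equal frequency
    (2, [10, 10, 10, 11]),  # non-3n locus, one rare allele
]
ratio = non3n_to_3n_ratio(toy_loci)
print(f"non-3n / 3n heterozygosity ratio: {ratio:.3f}")
```

Under this reading, a lower ratio means that variation at frameshifting MS is reduced relative to in-frame MS, which is the signal interpreted above as stronger negative selection.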
MS have been used to infer genetic structures because of their high genetic diversity (Rosenberg et al. 2002). In a previous study, PCA was conducted using 53,002 MS, but the genetic structures revealed by the MS PCA were less clear than those revealed by the SNP PCA (Mallick et al. 2016). In the present study, we used a larger number of MS (253,114 MS with MAF ≥ 1%) and obtained results highly concordant with those from SNPs (Fig. 4a, b). Although the overall patterns were similar between MS and SNPs (Fig. 4, Supplementary Fig. 3), small differences were observed in the African and Oceanian populations (Fig. 4c-f). In the Oceanian populations, NAN-Melanesians (NAN; Non-Austronesian) and Bougainville, both of which belong to Melanesians, clustered together in the SNP PCA but not in the MS PCA (Fig. 4e, f). In the African populations, the Biaka and Mbuti populations showed different patterns between the SNP and MS PCAs (Fig. 4c, d), which may be caused by hidden population structures. Although the usefulness of MS for inferring genetic structures should be evaluated with a larger number of samples, these results indicate that MS can serve as an additional marker set and may detect hidden population structures in human populations.
In addition to the modern human samples, we analyzed a deeply sequenced ancient human sample (F23) (Kanzawa-Kiriyama et al. 2019). In ancient genome sequencing, DNA fragmentation and the library construction process are expected to affect the quality of the sequence reads. To evaluate the quality of the MS calls, we compared the distribution of the VAF in this sample with that in modern human samples (Fig. 5a-d). A clear skew of the VAF was observed for MS with unit lengths ≤ 2 bp, suggesting that MS with short units are strongly affected by the quality of the DNA sample. However, the distributions for unit lengths ≥ 3 bp did not differ, and we therefore used these MS for the analysis. In the PCA, F23 was close to East Asians, which is consistent with the SNP PCA in a previous study (Kanzawa-Kiriyama et al. 2019) (Fig. 5e, f). This result suggests that MS analysis is applicable to ancient human samples.
We found 22 novel highly polymorphic MS for personal identification. Using the allele frequencies in a Japanese population, the discriminative power was estimated to be 1 × 10⁻¹⁷, which is sufficient for personal identification. Although the discriminative power of our MS set is slightly lower than that of the Globalfiler kit, a standard MS set, in a Japanese population (5.6 × 10¹⁸) (Fujii et al. 2015), our MS are shorter and can be genotyped with short-read sequencers. Additionally, the PCR success rate of MS is known to be affected by the length of the MS (Schneider et al. 2004), so our shorter MS may be more robust to DNA degradation.
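For readers unfamiliar with how such a figure is obtained from allele frequencies, the following is a minimal sketch of a combined random match probability. It assumes Hardy-Weinberg proportions within each locus and independence among loci, which is the standard forensic formulation; whether the value reported above was computed in exactly this way is an assumption, and the function names and frequencies are hypothetical.

```python
from itertools import combinations
from math import prod


def locus_match_probability(freqs):
    """Probability that two random individuals share a genotype at one locus.

    freqs: allele frequencies at the locus (should sum to 1).
    Assumption: genotype frequencies follow Hardy-Weinberg proportions,
    p_i^2 for homozygotes and 2*p_i*p_j for heterozygotes.
    """
    homozygous = sum(p ** 4 for p in freqs)
    heterozygous = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygous + heterozygous


def combined_match_probability(loci):
    """Product of per-locus match probabilities (loci assumed independent)."""
    return prod(locus_match_probability(freqs) for freqs in loci)


# Toy example: two loci, each with two equally frequent alleles.
toy_panel = [[0.5, 0.5], [0.5, 0.5]]
pm = combined_match_probability(toy_panel)
print(f"combined match probability: {pm:.6f}")
```

Because the per-locus probabilities multiply, each additional highly polymorphic locus lowers the combined value, which is why a panel of 22 loci can reach very small match probabilities.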
This study provides a comprehensive catalog of MS in human populations and shows the applicability of MS to modern and ancient human population studies. Nevertheless, our study has several limitations. First, the genotyping of MS needs reads that cover MS regions. Therefore, the amount of data and read length strongly affect the results. For example, we removed 824,459 MS and 395 samples from the SGDP and HGDP due to insufficient depth. Deeper sequence data would improve the quality of the MS calling. Second, long MS cannot be analyzed using short-read data. A recent study using a long-read sequencer reported high genetic variation in long repeat regions (Audano et al. 2019). In the future, the application of our algorithm to long-read data should detect a larger number of polymorphic MS.
In conclusion, we analyzed MS polymorphisms using large, publicly available human genome sequencing datasets. This study revealed the pattern of MS polymorphisms and identified polymorphic MS in human populations. The comparison of heterozygosity among populations suggests that MS have evolved under random genetic drift and negative selection. The PCA suggests that MS can detect the genetic structures of human populations. Large-scale sequencing projects are currently ongoing worldwide, and in these projects the analysis of MS, in addition to SNPs, should provide a deeper understanding of human genetic variation and benefit genomic medicine.