Determining population structure from k-mer frequencies.

Abstract

demographic history, and natural selection 2 . Population structure is thus observed as a systematic difference in allele frequencies among populations due to non-random mating among individuals.
Genetic differences within and among populations are examined by studies that investigate changes in frequencies of alleles and genotypes over time 3 .
Identification of population structure and gene flow among populations is informative for genetic ancestry and provides information about both demographic history and geographic origins 4,5−7 . For example, gene flow, or a gene transfer from one population to another, is indicative of migration processes 8 . When individuals of a single population possess recent ancestry from two or more separate sources, this population is considered admixed. Admixed populations contain high levels of genetic diversity that reflect contributions of the intermixture of source populations with different genetic variants 9 .
Understanding gene flow among populations informs a diversity of studies of species 10 . For example, studies in population structure across marine species have analyzed connectivity among populations, leading to the establishment of networks of marine protected areas 11,12 . Understanding this connectivity among populations, which entails evaluation of population structure across taxa, is a key factor for the effective design of these networks and preserving biodiversity on a large scale 11,12 . This knowledge is fundamental for ensuring the long-term survival of ecosystems inhabited by these species 13 .
Population structure is also an important confounding variable in GWAS due to the possibility of inaccurate associations between genotype and the trait of interest in a genetic study 14 . The presence of population structure may cause false positive or negative associations between genotype and trait due to differences in local ancestry that are not related to disease risk or trait variance 15 . Thus identification of population structure, and controlling for it, removes the confounding factors 16 . This step enables GWAS to find new genetic associations and improve the detection, treatment, and prevention of certain diseases 17-20 .
There are three types of population inference approaches: model-based, distance-based, and statistical (sometimes referred to as algorithmic 21 ). An example of a model-based population inference approach is Structured Association, which assigns samples to subpopulation clusters (possibly allowing fractional cluster membership) using a model-based clustering program such as structure 22,23 . However, the applicability of this approach to large genome-wide data sets is limited by its high computational cost when allowing fractional cluster membership 24 . Faster model-based approaches, such as admixture 21 , fastStructure 25 , and frappe 26 adopt the likelihood model embedded in structure but incorporate relaxation methods for improving computational efficiency.
However, because these approaches are based on genetic assumptions about the data, including Hardy-Weinberg equilibrium (HWE) within populations and linkage equilibrium (LE) between loci, violating these assumptions may lead to misleading results 22,26,27 . Marker deviation from equilibrium can also signify a possible sequencing 28-31 or genotyping error and thus such markers should be excluded from further analysis 32 . Incorrect inference of genotypes is known to occur due to low coverage of DNA sequencing 33 . The assumptions of Hardy-Weinberg proportions, which must be met for the marker to be included in the analysis, sometimes are not met due to genotyping errors 34 . Markers that do not meet this requirement should not be included in the analysis 32 .
Thus, marker genotypes of SNPs or microsatellites, or haplotype frequencies generated from the sequence data, require careful data preprocessing steps 32 . Additionally, before running these methods, the number of populations (K) must be set but may not be known in advance. While model-based approaches are very powerful in population structure identification, they are thus limited by computational cost, operate on genetic assumptions that must hold, and are sensitive to sample size 35 .
Alternative distance-based population inference approaches adopt a pairwise distance matrix computed among each pair of individuals. Some examples of implemented distance-based approaches are Genetic Similarity Score Matching (GSM) 36 , Spectral-GEM 37 , and FastPop 38 . GSM and Spectral-GEM require high computational intensity when the sample size is large. FastPop results in complex computation and has not been established when inferring genetic ancestry among more than 4 populations 39 .
Statistical approaches, such as PCA, can be applied to genotyped data (individual allele frequencies, single-nucleotide polymorphisms (SNPs)) to extract linear combinations of individuals that share the greatest similarities. A graphical overview (scatter plot) of the population structure can be shown using principal components as axes of variations. PCA is efficient and has been implemented for ancestry inference in eigenstrat 23,62 and smartpca 62 . Statistical approaches are able to handle large-scale genomic datasets and are not restricted by genetic assumptions 32 .
PCA has advantages over the model-based approaches in that it is a non-parametric method (it does not require a predefined number of populations) and does not rely on modeling assumptions (HWE, LE). PCA is also computationally efficient. However, current PCA-based approaches, just as model-based approaches, operate on genomic markers, which require careful identification to be useful for population structure analysis.
Overall, existing methods to identify population structure are computationally intensive. The number of available markers grows as the number of samples included in the analysis increases, thus reducing the efficiency of computation. Identification of these genotypes requires rigorous steps, and reducing the number of informative markers is often desirable for efficient population structure determination 32,40 .
Ancestry informative markers are usually determined as a set of minimum markers needed to determine the population structure and lower the genotyping cost. Selection of informative markers using the supervised method relies on self-reported ancestry information from individuals, while the unsupervised approach applies PCA to determine markers that are associated with the significant principal components and then score each marker 41 .
In this work, we investigate our ability to determine population structure with PCA using frequencies of k-mers present in a genome. Information on the presence and absence of k-mers has shown promising results in population differentiation using whole-genome sequencing reads from two distinct superpopulations 42 . K-mers are shorter substrings that can be "overlapped" to reconstruct the full sequence and deliver equivalent genomic information as a whole sequence 43 . K-mers or k-mer profiles of a sequence (k-mer and its frequency in the genome) can be generated efficiently 44 . Then the structure of the genome can be investigated directly from the k-mer profiles of the genome. The problem of sequencing or genotyping errors may be reduced because PCA aggregates the k-mer frequency information; therefore extra counts of frequencies that could potentially accrue due to sequencing errors should not substantially influence the PCA projection. Additionally, sequencer errors can be identified and removed by filtering out k-mers of frequency one (singletons), which are generally considered a result of sequencer errors 45,46 and are not included in the final PCA computation.
In this study, we examined the ability to determine population structure based on k-mer frequencies present in a genome, with the goal of formulating a quick and accurate approach for population structure identification. We used samples from across human populations from the 1000 Genomes Project 47 . The structure of these populations has been established previously and includes both the separation of the five superpopulations, independent populations within these, and populations with mixed ancestry.
We were able to apply PCA to frequencies of k-mers present in genomes to accurately separate superpopulations and populations, and identify admixed individuals. For comparison, we investigated population stratification based on the number of k-mer matches between pairs of genomes using a popular alignment-free sequence comparison tool, mash 48 . We were able to build accurate population trees using this approach; however, the results depended on the parameter selection and it was difficult to identify a priori the k value (k-mer length) and sketch (reduced representation of a sequence) size needed for accurate results. Thus, the practicality of this approach is limited compared to our PCAs of k-mer frequencies.

Results
Population structure of five human superpopulations shows two major groups
To differentiate samples from the five human superpopulations identified by the 1000 Genomes Project 35 we used principal component analysis with a K-Means clustering algorithm. 80% of the variance was captured by 21 PCs. The first two PCs contained 14.5% of the variance. Using 21 PCs, which hold 80% of the variance, caused undecidability in the K-Means clustering algorithm where no obvious inflection (elbow) point was observed (Fig. 1a). However, using the first 2 PCs, which hold 14.5% of the variance, showed deterministic results in K-Means clustering (Fig. 1b). Thus, we applied K-Means clustering to the first 2 PCs throughout this study. The African superpopulation was strongly differentiated from all other samples along PC1. The "elbow point" of the K-Means scree plot (when the change in the value of inertia is no longer significant) indicated the samples grouped into three clusters. There was a strong differentiation of samples of African Ancestry (AFR_LWK) from all other populations on PC1, and differentiation within this population along PC2 (Fig. 2).

Population structure from four human superpopulations shows four groups
To determine whether we could differentiate the non-African populations we repeated the PCA analysis excluding AFR_LWK samples. 80% of the variance was captured by 18 PCs. The first two PCs contained 13% of the variance. Plotting the first two PCs, we observe four distinct groups corresponding to the four superpopulations (Fig. 3). The elbow point of the K-Means scree plot indicated the samples grouped into four clusters (Supplementary material Appendix 1, Figure A1).

Population structure from four human superpopulations including samples of admixed origin
To differentiate the samples of admixed origin we repeated the PCA analysis with additional samples of admixed origin from EAS and EUR superpopulations. 80% of the variance was captured by 36 PCs. The first two PCs contained 8.4% of the variance. Plotting the first two PCs, we observe four distinct groups corresponding to the four superpopulations; individual populations were not clearly differentiated (Fig. 4). The "elbow point" of the K-Means scree plot indicated the samples grouped into four clusters (Supplementary material Appendix 1, Figure A2).

Population structure from three populations of admixed and non-admixed origin
To differentiate the samples of admixed origin we repeated the PCA analysis with samples from the East Asian superpopulation including one population of admixed origin. 80% of the variance was captured by 14 PCs. The first two PCs contained 13% of the variance. Plotting the first two PCs, we observe three distinct groups corresponding to the three populations CDX, CHB, and JPT (superpopulation EAS), with the CHB population placed in between CDX and JPT populations. The "elbow point" of the K-Means scree plot was not prominent and indicated the samples of admixed origin formed a continuous grouping (Supplementary material Appendix 1, Figure A3). However, we observed a strong differentiation of samples by population on PC1 (Fig. 5).
Additionally, we repeated the PCA analysis to differentiate samples of admixed origin with samples from the European superpopulation including one population of admixed origin. 80% of the variance was captured by 14 PCs. The first two PCs contained 13% of the variance. Plotting the first two PCs, the three populations CEU, FIN, and TSI (superpopulation EUR) appear visually differentiated along PC 1, with CEU (admixed) placed in between FIN and TSI (Fig. 6). However, the "elbow point" of the K-Means scree plot was not prominent, indicating a lack of strong grouping of samples (Supplementary material Appendix 1, Figure A4). While the scatter plot of the PCA provides a graphical overview of the population structure along the axes of variations of the first 2 PCs, K-Means clusters individuals into genetically homogeneous subpopulations by assigning each observation to the cluster with the nearest mean 49 . Consequently, there is a visual discordance between sample populations and K-Means assignments.

Accuracy of K-Means clustering
The AMI score for K-Means clusters and the expected clusters from the analysis of the five superpopulations of non-admixed origin was 0.

Memory usage when analyzing population structure with k-mer frequencies
These analyses (using k-mer length of 21) stored a dictionary data structure with k-mer content for each sequence. The dictionary of 48 vectors of k-mers and their frequencies occupied 41Gb of space in a pickled (compressed) format. Reduction of space to 38Gb was possible by calculating the intersection of k-mers across all vectors. Generating these data structures required an HPC node with at least 250Gb of RAM for the dictionary compression step. Additionally, calculating PCA on this dataset required ~ 239Gb of RAM, and took 2 hrs 36 minutes and 51 seconds of job wall-clock time on a 36-core HPC node.

Population structure from k-mer presence alone
For comparison with our PCA approach, we used mash to estimate population structure from the same samples. Monophyly of the superpopulation groups was observed on the unrooted tree for various parameters of k-mer length (k) and sketch size (s) (Supplementary material Appendix 1, Figure A15-18). Specifically, the trend of accurate grouping by population was observed with shorter k-mer length and higher sketch size, and conversely lower sketch size and longer k-mer length produced trees with monophyletic groupings of samples by population (Fig. 7).
While we saw a general trend of improved accuracy with low k-mer length and higher sketch size, and conversely longer k-mers and smaller sketch size, we saw exceptions in the accuracy trend for various parameters. For example, k-mer length 24 and sketch size 3000 produced accurate results; however, the accuracy dropped with setting k = 24 and s = 5000 and 7000, and the accuracy picked up again when k = 24 and s = 8000-30000. While this test case allowed us to compare results to the ground truth, it is difficult to select a priori the parameters that produce accurate results.

Discussion
In this work, we showed that population structure can be accurately detected from k-mer frequencies using PCA, including accurate representation of samples from admixed origin. Application of PCA to vectors of k-mer frequencies reduces the dimensionality of the data, which makes the dataset manageable for the following step of applying a clustering algorithm to detect structure in the dataset. We suggest that k-mer frequencies may be easier to calculate compared to accurate genotypes and allow the estimation of population structure hierarchically.

Population identification
We hypothesize that our initial observation of two larger clusters in the data from the five human superpopulations (Africa and all others) is likely due to individuals from African populations possessing the greatest genetic variation, as predicted by the out-of-Africa model of human origins 35 . Thus, the initial dominating force in the signal differentiates this superpopulation in a way that is accurate although not described by the 1000 Genomes Project. Excluding this superpopulation, PCA groups the samples by superpopulation. Again the majority of variation is among superpopulations, leading us to take a hierarchical approach to examine clusters of samples individually for substructure (i.e. separation joining the centers of the established populations as described by 50 . While it is not surprising that these samples are not accurately assigned to a third population, as this same result was found by the 1000 Genomes Project, when samples were grouped into three populations, there did appear to be a third intermediate population, which was largely comprised of the expected admixed samples.

K-mers versus marker genotypes
Our analysis of k-mer frequencies using PCA produced comparable results to those produced by the model-based approach using marker genotypes 35 . However, k-mer frequencies can be simpler to count and errors may be more readily identified. When analyzing marker genotype data, a number of preprocessing steps to evaluate data quality are necessary 21, 27 , as population structure estimation using genetic markers is susceptible to genotyping errors 56 . This evaluation includes an assessment of SNP call rates, MAFs, verification of the HWE assumptions, and relatedness between individuals. Additionally, the identification of ancestry informative markers, which constitute a minimal number of markers needed to obtain population structure, is necessary to ensure the accuracy of the results 27 . In contrast, k-mer frequencies can be viewed as summary statistics of a genome resulting from SNPs. While errors in k-mer counts occur due to sequencing or genotyping errors, these can easily be identified as k-mers of low frequency. The overwhelming majority of k-mers of frequency 1 are not found in a genome and thus are most likely due to sequencing errors and therefore can easily be discarded 59 .
Model-based population analysis methods can also be limited by inadequate sample sizes and the number of markers analyzed 60 . These methods depend on an estimation of allele frequency that is sensitive to small samples 32,61 . In contrast, our PCA-based k-mer frequency approach does not depend on allele frequency estimation and thus is not affected by sample size 27 . We were able to produce accurate results with a number of samples as low as six per population.
The PCA-based approach also has the advantage of operating without a preset number of populations and no modeling assumption requirements 50,55 . This makes the analysis of population structure using a non-model-based approach an appealing choice. The application of PCA and K-Means to k-mer profiles of genomes makes it easy to detect a number of populations (clusters) present in the dataset, which is a major parameter in the model-based method that is required to be set in advance.
However, the two approaches deliver different types of information. While a more complex but sophisticated method such as structure describes the population structure by probabilistic assignment to classes, the PCA combined with K-Means, provides a graphical and quantitative representation of population structure along axes of variation in the dataset. Thus, the goal of the investigation of population structure should be taken into consideration when deciding which approach is more applicable for the analysis.

Consideration of computational efficiency
The PCA-based approach is computationally efficient and can handle genomic marker data for thousands of individuals 34 . PCA was also shown to be efficient when applied to k-mer frequencies in our work. The major computational cost comes with the data preprocessing step when analyzing either marker genotypes or k-mer frequencies. The preprocessing of genetic data can consume computational time. On the other hand, when analyzing k-mer frequencies, the memory requirements needed to store k-mer profiles grow with k-mer length, as does the computational power needed to calculate distances between sequences based on k-mer frequencies. However, shorter k-mers are less informative while longer k-mers are more unique. Thus, when analyzing k-mers, there is a tradeoff between specificity, computational efficiency, and memory requirements.
Identification of population structure based on the number of shared k-mers
mash's identification of all five superpopulations suggests that mash has an adequate amount of sensitivity to differentiate samples by superpopulation even with the presence of samples with greater variation (AFR). However, grouping individuals into populations based on k-mer presence using mash distances was more difficult than using PCA and k-mer frequencies. mash produced accurate groupings (samples were placed on the tree according to the superpopulation) only for particular combinations of k-mer length and sketch size, and accuracy was not necessarily predictable. We were able to identify the k-mer length that produced accurate results by checking with the results produced by the 1000 Genomes Project; however, finding the right parameter for k-mer length would be hard without knowing the correct population structure in advance. Thus, while mash is a robust tool for identifying genome relationships on a species level 51 , on a population level, in comparison to the PCA approach, mash showed less viable performance for differentiating populations due to its high sensitivity to parameter selection, which is unknown in advance.

Conclusion
In sum, principal component analysis together with K-Means clustering appears to successfully identify population structure based on the k-mer frequencies present in genomes. This approach is robust in differentiating samples at the superpopulation and population levels. Notably, this approach differentiated samples hierarchically, and PCA was able to discern the population signal in samples of admixed and non-admixed origin within a single superpopulation. These results are comparable to model-based approaches that identify populations using genotypes, and which provide information on fractional membership of a sample to a population. However, using k-mer frequencies does not depend on genetic assumptions or the process of marker selection and curation. In contrast, the method using k-mer presence to group samples lacked sensitivity to consistently identify populations. With the increasing availability of whole-genome data, we anticipate that the use of k-mer frequencies combined with PCA and K-Means clustering will be a strong addition to population structure investigations.

Differentiating human superpopulations
To examine the use of k-mer-based methods in identifying population structure we first obtained genome data for humans from the International Genome Sample Resource (IGSR) 62 . These data have been used to identify human population structure 35 , thus providing a comparison for the population structure we identify with our alternative approaches. This dataset contains genome data for individuals classified by superpopulation as Africa (AFR), East Asia (EAS), Europe (EUR), South Asia (SAS), and the Americas (AMR). Each superpopulation contains multiple populations. For our initial analysis, we analyzed one population per superpopulation. We downloaded sequenced reads that were aligned to the GRCh38 human reference sequence from 6 samples from a single population per superpopulation identified as having the least admixed origin (referred to here as "non-admixed") 62 :
1. Luhya population in Webuye, Kenya (LWK) of African Ancestry
2. Peruvian population in Lima, Peru (PEL) of American Ancestry
3. Toscani population in Italy (TSI) of European Ancestry
4. Indian Telugu population in the UK (ITU) of South Asian Ancestry
5. Japanese population in Tokyo, Japan (JPT) of East Asian Ancestry
All samples were sequenced using PCR-free high coverage technology and listed under the "1000 Genomes 30x on GRCh38" data collection. Data were accessed as cram files. Each cram file was converted into a bam file using samtools. We then used the bcftools mpileup command to filter out regions with low-quality scores, call the variants, and perform pileups. Finally, the fasta files were built using the bcftools consensus command.
Population structure from k-mer frequencies using PCA
For each sample, we built profiles of canonical k-mers (k-mer or its reverse complement, whichever comes first lexicographically) from the fasta file using Jellyfish 44 , a tool for fast, memory-efficient counting of k-mers in the DNA sequence, for k-mer length 21. When choosing the length of k-mers we were guided by a general rule used by the alignment-free methods for sequence comparisons: shorter k-mers are more likely to be present in a sequence (e.g. 1-mers), and thus are less informative in analyzing closely related genomes 63 ; however, longer k-mers are more unique to particular species and are therefore more useful for similarity identification across species 64 . Moreover, widely used alignment-free sequence comparison tools such as mash found a k-mer length of 21 to give accurate estimates of sequence similarities 47 . We filtered out singletons (k-mers of frequency one) to account for possible errors produced by the sequencer 59 . For this purpose, the flag -L 2 is used when analyzing the k-mer content of the sequence with Jellyfish (applicable to our method) to only include k-mer frequencies above 1. We then sorted each k-mer profile in alphabetical order and calculated the intersection of k-mer profiles across all samples.
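The profile-building steps can be illustrated with a small pure-Python sketch. It is a toy stand-in for Jellyfish, not a description of how Jellyfish works internally. The function names (`canonical`, `kmer_profile`, `shared_matrix`), the example sequences, and k = 3 are assumptions made only for illustration; the study itself used k = 21 on whole genomes.

```python
from collections import Counter

# Hypothetical helper names; the study used Jellyfish for this step.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the k-mer or its reverse complement, whichever sorts first."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_profile(seq, k, min_count=2):
    """Count canonical k-mers and drop those seen fewer than min_count times.

    min_count=2 removes singletons, mirroring the effect of a lower-count
    threshold of 2. Windows containing symbols other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer).issubset("ACGT"):
            counts[canonical(kmer)] += 1
    return {km: c for km, c in counts.items() if c >= min_count}

def shared_matrix(profiles):
    """Keep only k-mers present in every profile; return sorted k-mers and rows."""
    shared = sorted(set.intersection(*(set(p) for p in profiles)))
    rows = [[p[km] for km in shared] for p in profiles]
    return shared, rows

# Toy sequences (assumed data), k = 3 to keep the output readable.
profiles = [kmer_profile(s, k=3) for s in ("ACGACGACGT", "ACGACGTTTT")]
kmers, matrix = shared_matrix(profiles)
print(kmers, matrix)
```

Each row of `matrix` is one sample's frequency vector over the shared k-mers, which is the shape of input the PCA step expects.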
We performed a Principal Component Analysis (PCA) on vectors of k-mer frequencies of each sample in python using the scikit-learn (sklearn) library 49 . We normalized the data (frequencies of k-mers) by standardizing each variable to zero mean and unit variance using the StandardScaler function from the scikit-learn (sklearn) library 49 . Normalization of the dataset is a necessary step because PCA calculates a new projection of the dataset with new axes based on the standard deviation of the variables 65 . Variables with different standard deviations (high versus low) will have different weights for axes calculation. Normalization of data allows for uniform standard deviation across all variables, thus PCA calculates axes with all variables having equal weight. We visualized the projection of the two PCs with the most variance using a scatter plot from the matplotlib package 66 .
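As a rough illustration of this step, the sketch below standardizes a small frequency matrix and projects it onto its first two principal components. It uses NumPy only, as a stand-in for scikit-learn's StandardScaler and PCA, so it should be read as an approximation of the workflow rather than the code used in the study. The Poisson-generated count matrix and the function names are assumptions.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit variance (population std)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero after centering
    return (X - X.mean(axis=0)) / std

def pca(X, n_components=2):
    """PCA via SVD of the centered matrix.

    Returns the sample scores on the leading components and the
    explained variance ratio of every component.
    """
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    ratio = S ** 2 / np.sum(S ** 2)
    return U[:, :n_components] * S[:n_components], ratio

# Assumed toy data: 6 samples x 50 k-mers, two groups of three samples.
rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(6, 50)).astype(float)
counts[3:, :10] += 15  # shift ten k-mers in the second group

scores, ratio = pca(standardize(counts))
print(scores.shape, ratio[:2].sum())
```

The two columns of `scores` are the coordinates one would pass to a scatter plot, and `ratio` gives the share of variance carried by each component.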
To identify populations, we used K-Means clustering based on the PCs 55,67 . We found the optimal number of principal components that capture the greatest amount of variance in the data by plotting the explained variances in a scree plot. We used explained variance ratio as a metric to evaluate the usefulness of the principal components and to choose how many components to use in the model 68 . The explained variance ratio is the percentage of variance that is attributable to each of the selected components. To avoid overfitting the model we chose the number of components to include in the model by adding the explained variance ratio of each component until we reach a total of around 80%, which is considered an adequate amount of variance to derive informative results 65 . We used these PCs to cluster samples into different numbers of groups (k from 1-10). We determined the optimal number of clusters (K) by using the "elbow method" heuristic approach 69 . The sum of the squared distances to the nearest cluster center (aka inertia) was measured using the K-Means model for each k. From the scree plot, we determined the "elbow point", i.e. the point after which the inertia starts decreasing in a linear fashion. We clustered the dataset by fitting the numbers of PCs with 80% of the variance into the K-Means model with k number of clusters.
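The two selection rules described here, a cumulative explained-variance threshold and an inertia curve for the elbow method, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the K-Means routine is a plain Lloyd's algorithm with a deterministic farthest-point start, not scikit-learn's KMeans, and the function names and toy points are hypothetical.

```python
import numpy as np

def n_components_for(ratio, target=0.80):
    """Smallest number of leading PCs whose cumulative ratio reaches target."""
    cum = np.cumsum(ratio)
    n = int(np.searchsorted(cum, target - 1e-12)) + 1
    return min(n, len(cum))

def kmeans_inertia(X, k, n_iter=100):
    """Run Lloyd's algorithm and return the final inertia.

    Inertia is the sum of squared distances from each sample to its
    nearest cluster center.
    """
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):  # farthest-point initialisation (deterministic)
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d.min(axis=1).sum())

# Assumed toy scores: three well-separated groups of three samples each.
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10],
                   [20, 0], [20, 1], [21, 0]], dtype=float)
inertias = [kmeans_inertia(points, k) for k in range(1, 6)]
print(n_components_for([0.5, 0.2, 0.2, 0.1]), inertias)
```

On these toy points the inertia falls steeply up to k = 3 and only slightly afterwards, which is the kind of bend the elbow method looks for when the curve is inspected by eye.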
Because the PCs with 80% of the variance caused undecidability in the K-Means clustering algorithm (Fig. 1), we hypothesized that the dataset contains noise that suppresses the true biological signal, and only a small fraction of k-mers "drive" (dominate) the data. From the scree plots of PC variance and K-Means inertia plots, we saw that the variance is spread out, mostly equally, through all the PCs.
Commonly, the first three PCs contain most of the variance of the data (around 80%); however, our dataset shows that the first three PCs have only a slightly higher variance than the rest of the PCs.
Nevertheless, when using the first two PCs we saw deterministic results by K-Means and they matched the expected results. Thus we used the first two PCs throughout this work.
Human population structure and substructure
The analysis of the dataset including all five superpopulations showed strong differentiation of the AFR superpopulation from the rest of the superpopulations. Thus, we repeated the analysis for four superpopulations excluding AFR, with 24 samples of non-admixed origin (AMR_PEL, EAS_JPT, EUR_TSI, and SAS_ITU).
Differentiating human populations with k-mer frequencies using PCA
Because we were able to differentiate human superpopulations, we examined whether this approach could differentiate populations within those superpopulations. We obtained an additional 12 samples (6 per population) of European Ancestry and East Asian Ancestry of non-admixed origin (EUR_FIN and EAS_CDX). Genomes for each individual were obtained as described above. We repeated our analysis with these six populations from four superpopulations (excluding AFR).

Admixed populations
To determine whether we could identify populations of admixed origin, we selected samples from populations that were previously identified as comprising roughly equal parts of fractional membership of two non-admixed populations 35 from the same superpopulation (we refer to these as "admixed" populations). We obtained six samples from the Han Chinese population in Beijing, China (CHB) of East Asian ancestry that appears to be a mixture of JPT and CDX ancestry, and another six samples from Utah residents (CEU) of Northern and Western European ancestry that appear to be a mixture of FIN and TSI ancestry. Genomes for each individual were obtained as described above. We repeated our analysis with these eight populations from four superpopulations (AMR_PEL, EAS_CDX, EAS_CHB, EAS_JPT, EUR_CEU, EUR_FIN, EUR_TSI, and SAS_ITU; excluding AFR).
Because we continued to differentiate the four superpopulations, we focused a subsequent analysis on the single EAS superpopulation with 18 samples (12 of non-admixed and 6 of admixed origin; EAS_CDX, EAS_CHB, and EAS_JPT) to determine our ability to identify admixture. We repeated our PCA analysis on these samples alone. Additionally, we repeated the analysis on the single EUR superpopulation with 18 samples (12 of non-admixed and 6 of admixed origin; EUR_CEU, EUR_FIN, EUR_TSI).
Population structure from a number of shared k-mers between sequence pairs
For comparison with our k-mer frequency-based PCA approach, we examined population identifiability using mash. mash can be used to build phylogenies for family-level data and shows promise for population genetic analyses of polyploid sequences 70 . The principle behind mash is that each sequence is converted into a MinHash sketch, a vastly reduced representation of a sequence, then two sketches are compared by calculating the fraction of shared k-mers between a pair of sequences (Jaccard index).
Finally, the mash distance is calculated, which estimates the rate of sequence mutation under a simple evolutionary model 47 . mash has been investigated for basic population genetic analyses of polyploid and diploid species 70 and showed some promising results in the population stratification of plants.
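The sketch-and-compare principle can be illustrated in a few lines of Python. This is a simplified bottom-s MinHash, not mash's implementation: SHA-1 is used here as an assumed stand-in for the hash function, and the function names and toy k-mer sets are hypothetical. The distance follows the published Mash formula D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index and k the k-mer length.

```python
import hashlib
import math

def bottom_sketch(kmers, s):
    """Keep the s smallest hash values of a k-mer set (bottom-s sketch)."""
    hashes = {int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
              for km in kmers}
    return sorted(hashes)[:s]

def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches.

    The s smallest hashes of the merged sketches are inspected and the
    fraction found in both sketches is returned.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(h in shared for h in merged) / len(merged)

def mash_distance(j, k):
    """Mash distance from a Jaccard estimate j and k-mer length k."""
    if j == 0:
        return 1.0  # no shared hashes: maximal distance
    return -math.log(2 * j / (1 + j)) / k

# Assumed toy k-mer sets (k = 3); real sketches use thousands of hashes.
a = {"AAA", "AAC", "ACG"}
b = {"AAC", "ACG", "CGA", "GAC"}
s = 10
j = jaccard_estimate(bottom_sketch(a, s), bottom_sketch(b, s), s)
print(j, mash_distance(j, k=3))
```

Because the sketch size here exceeds the number of distinct k-mers, the estimate equals the exact Jaccard index; with real genomes the sketch is far smaller than the k-mer set, which is one reason results can vary with sketch size.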
For each sample we built mash sketches using a mash sketch command with -m 2 flag to filter out single k-mers, -k N flag to analyze the k-mer length of N, and -s M flag to build sketches of size M. We repeated the process for parameters of -k N = 21, 24, 27, 29, 32 and -s M = 1000, 3000, 5000, 7000, 8000, 9000, 10000, 12000, 15000, 18000, 20000, 23000, 25000, 28000, 30000 to compare the results for a different set of parameters. We calculated pairwise distances between each pair of samples using mash and used the distances to build a Neighbor-Joining tree (NJ) for each set of parameters (k-mer length and sketch size). mash does not assign samples to populations; thus, to verify grouping by superpopulation, we checked for monophyly of each of the groups in the NJ tree by superpopulation label. We used the is.monophyletic function in the R package 71 to check whether each population was monophyletic in the resulting tree and ggplot 72

Figure 4
PCA generated using k-mer frequencies from four superpopulations (America (AMR), East Asia (EAS), Europe (EUR), South Asia (SAS)) including samples of admixed and non-admixed origin in EAS and EUR using 2PCs. Samples are labeled by population. K-Means algorithm identified four clusters present in the data (circled).

Figure 5
PCA generated using k-mer frequencies from the EAS superpopulation including samples of admixed and non-admixed origin (CDX, CHB, and JPT) using 2PCs. Samples are labeled by population. K-Means algorithm identified three clusters present in the data (circled) corresponding closely to the expected populations.

Figure 6
PCA generated using k-mer frequencies from the EUR superpopulation including samples of admixed and non-admixed origin (CEU, FIN, and TSI) using 2PCs. Samples are labeled by population. K-Means algorithm identified three clusters present in the data (circled). While populations appear to separate along PC 1, K-Means clustering differentiates the two non-admixed populations (FIN and TSI) but finds overlap between CEU and TSI, as well as between CEU and FIN.

Figure 7
Heatmap plot for the presence of trees with all superpopulations being monophyletic built from pairwise mash distances for different k-mer length and sketch size parameters.
