Of the 1,098 individuals analysed, 336 were typed with the MassARRAY® EUROFORGEN NAME assay [20], and 762 were typed with the custom AmpliSeq™ EUROFORGEN NAME panel [21]. Only the 102 loci included in both assays were used for the population genetic and ancestry analyses below. The physical positions and rs-numbers of the loci included in the AmpliSeq™ design are shown in Supplementary Table S1. All samples were also typed for the 165 AIMs of the Precision ID Ancestry Panel, either previously [6,21,24] or in this work. Two AIMs, rs12913832 and rs4833103, were present in both the EUROFORGEN NAME panel and the Precision ID Ancestry Panel. These two AIMs performed better in the EUROFORGEN NAME panel, so the results from the Precision ID Ancestry Panel were not used for these loci. Of the 1,098 individuals, 28 had no genotype calls at more than 10% of the loci. These individuals were excluded, leaving data from 1,070 individuals for further analyses.
The data obtained with the EUROFORGEN NAME and Precision ID Ancestry panels were tested separately for Hardy-Weinberg equilibrium (HWE). In the EUROFORGEN NAME panel, the AIM rs7873963 was in Hardy-Weinberg disequilibrium in five populations (Pcor = 4.9E-04). There was an excess of homozygotes for the T allele, caused by a deletion downstream of the locus that was associated with the C allele. Only samples typed with the MassARRAY® assay were affected by the deletion; the locus was in Hardy-Weinberg equilibrium in the populations typed with the AmpliSeq™ EUROFORGEN NAME panel. The locus, which was also in LD with another locus (see below), was excluded from further population genetic analysis.
HWE was also assessed for the markers in the Precision ID Ancestry Panel. After Bonferroni correction, the AIM rs310644 was in Hardy-Weinberg disequilibrium in the Pakistani and Portuguese populations (Pcor = 3.07E-04). Among the Portuguese individuals, 74 had the TT genotype and two had the CC genotype, while no heterozygous individuals were observed. Among the Pakistani individuals (N = 72), 43 had the TT genotype, 13 the CC genotype, and 16 the CT genotype.
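The genotype counts reported above can be compared with Hardy-Weinberg expectations using a one-degree-of-freedom chi-square goodness-of-fit test. The following is a minimal sketch only: the test actually used in this work and the Bonferroni correction behind the Pcor values are not reproduced here, and the function name `hwe_chi_square` is hypothetical. The chi-square approximation is also unreliable when expected counts are small (as for the Portuguese CC genotype), where an exact test would normally be preferred.

```python
import math

def hwe_chi_square(n_tt, n_ct, n_cc):
    """Chi-square test of HWE for one biallelic SNP (illustrative helper).

    Returns (chi2, p_value) with 1 degree of freedom.
    Monomorphic loci (an allele frequency of 0) are not handled.
    """
    n = n_tt + n_ct + n_cc
    p = (2 * n_tt + n_ct) / (2 * n)   # frequency of the T allele
    q = 1.0 - p                       # frequency of the C allele
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_tt, n_ct, n_cc)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Genotype counts for rs310644 as reported in the text
pakistani = hwe_chi_square(n_tt=43, n_ct=16, n_cc=13)
portuguese = hwe_chi_square(n_tt=74, n_ct=0, n_cc=2)
print("Pakistani: chi2 = %.2f, P = %.2e" % pakistani)
print("Portuguese: chi2 = %.2f, P = %.2e" % portuguese)
```

Both populations show fewer heterozygotes than expected. The P-values from this sketch are uncorrected and are therefore not directly comparable with the reported Pcor.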
LD analysis was performed on the combined dataset of 265 AIMs, corresponding to 34,980 pairs of loci. Besides LD most likely due to physical linkage, LD between alleles on different chromosomes was also observed. Supplementary Tables S4 and S5 show the pairs of loci that were in statistically significant LD in the different populations. Several loci in the EUROFORGEN NAME panel showed statistically significant LD. The HaploView software was used to evaluate whether these loci could belong to haplotype blocks. The analysis showed that two groups of markers on chromosome 4 (rs4975193 - rs1757928 - rs337277 - rs1699387, and rs17616434 - rs4833103), one group on chromosome 7 (rs9649356 - rs1227171), one group on chromosome 10 (rs2031581 - rs2765650), and one group on chromosome 12 (rs10862511 - rs10506882) seemed to form haplotype blocks. In addition, two pairs of loci typed with different panels were in LD (Supplementary Table S4): rs1406045 (EUROFORGEN NAME panel) and rs4463276 (Precision ID Ancestry Panel) on chromosome 6, and rs621341 (EUROFORGEN NAME panel) and rs6754311 (Precision ID Ancestry Panel) on chromosome 2. To ensure marker independence, one locus in each pairwise comparison was eliminated from the population genetic analyses (Supplementary Table S6). The performance of the loci was evaluated in terms of heterozygote balance, locus balance, noise level, and the number of genotype drop-outs. For each pair, the locus with the best performance was retained. If the loci performed equally well, preference was given to the locus with the shortest read length. After the LD evaluation (Supplementary Table S6), the final numbers of loci for further genetic analysis were 72 for the EUROFORGEN NAME panel and 161 for the Precision ID Ancestry Panel. The combined dataset included 233 SNP markers.
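The locus and pair counts quoted above are consistent with simple counting, as the short check below illustrates. It assumes that the two AIMs shared between the panels are counted once in the combined dataset; the variable names are illustrative only.

```python
import math

n_euroforgen = 102     # loci common to both EUROFORGEN NAME assays
n_precision_id = 165   # AIMs in the Precision ID Ancestry Panel
n_shared = 2           # rs12913832 and rs4833103, present in both panels

# Combined dataset before LD pruning (shared AIMs counted once)
n_combined = n_euroforgen + n_precision_id - n_shared
# Number of unordered pairs of loci tested for LD: n * (n - 1) / 2
n_pairs = math.comb(n_combined, 2)
print(n_combined, n_pairs)   # 265 34980

# Combined dataset after LD pruning, using the counts reported in the text
n_final = 72 + 161
print(n_final)               # 233
```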
Genetic structure
The population variation of reference groups from Sub-Saharan Africa (N = 606), Europe (N = 604), the Middle East (N = 134), South-Central Asia (N = 689), and East Asia (N = 504) was analysed. Figure 2 shows a PCA plot of the combined dataset with 233 AIMs. The Sub-Saharan African, European, South-Central Asian, and East Asian clusters were separated from each other by PC1 and PC2. The Middle Eastern cluster was located between the East Asian and the European clusters with a small overlap with the European cluster. The North African cluster was situated between the Sub-Saharan African and the Middle Eastern clusters, while the North-East African cluster was found between the North African and Sub-Saharan African clusters. Supplementary Figure S1 shows a similar analysis based on the 72 EUROFORGEN NAME markers only. The PCA showed that the Middle Eastern cluster had a larger overlap with the Southern European populations from Greece and Albania than with the Danish population. There was a substantial overlap between the Middle Eastern and South-Central Asian populations, mainly involving individuals from Afghanistan.
To evaluate the genetic structure of the populations, STRUCTURE analyses were performed using K = 3 to K = 7. Figure 3 shows the results for K = 4 to K = 6 for the 233 loci in the combined dataset. The most likely number of clusters was K = 4, corresponding to the Sub-Saharan, East Asian, South-Central Asian, and European populations. Co-ancestry contribution from Sub-Saharan, European, and South-Central Asian populations was observed among individuals from North-East Africa and North Africa, whereas the Middle Eastern individuals shared cluster memberships primarily with the European populations and, to a smaller degree, South-Central Asians. With K = 6, an additional cluster corresponding to Middle Eastern individuals was observed. The cluster differed from those of the North-East African and North African populations mainly due to the Sub-Saharan contribution to the latter populations, and it differed from the clusters of the European populations due to the South-Central Asian contribution to the cluster. Some variation within the European cluster was also observed at K = 6. South Europeans shared more cluster membership with the Middle Eastern, North-East African, and North African populations than the North Europeans did. The STRUCTURE analysis performed with EUROFORGEN NAME markers showed a similar pattern (Supplementary Figure
Population assignment based on z-score and LR
Based on the STRUCTURE and PCA results, the 14 populations typed in this work were grouped into five meta-populations: 1) a European meta-population including individuals from Albania, Denmark, Greece, Portugal, and Slovenia, 2) a Middle Eastern meta-population including individuals from Afghanistan, Iran, Iraq, Syria, and Turkey, 3) a North-East African meta-population including individuals from Eritrea and Somalia, 4) a North African meta-population including individuals from Morocco, and 5) a South-Central Asian meta-population including individuals from Pakistan.
A z-score test was performed for each of the 1,070 individuals using the GenoGeographer software [22] and the cross-validation method (https://en.wikipedia.org/wiki/Cross-validation_(statistics)). This was done for the EUROFORGEN NAME panel (72 loci), the Precision ID Ancestry Panel (161 loci), and the combined dataset (233 loci). The AIM profiles were tested against both the individual’s meta-population of origin and the four other meta-populations. Table 1 shows the results of the z-score tests. The results of the test of each AIM profile against each meta-population with the three sets of AIMs were categorised as either “Accepted”, “Ambiguous”, or “Rejected” (Figure 1).
Table 1 shows the biogeographic categorisations of the individuals based on the AIM results obtained with the Precision ID Ancestry Panel, the EUROFORGEN NAME panel, and the combined panel. Irrespective of the origin of the sample, the number of AIM profiles categorised as “Ambiguous” was lower with the combined set of markers than with the Precision ID Ancestry Panel. The reduction in the number of ambiguous profiles was most pronounced for individuals from the Middle East and South-Central Asia. In both cases, the population assignments primarily changed from ambiguous to concordant. For the Middle Eastern individuals, this was the result of fewer profiles being accepted as possibly belonging to the European meta-population. For the South-Central Asian individuals, it was the result of fewer profiles being accepted as possibly belonging to the Middle Eastern meta-population.
For the North African and the North-East African meta-populations, the number of profiles assigned to the “Rejected” category increased when the combined panel was used. Among the North African individuals, four profiles categorised as “Accepted” and two profiles categorised as “Ambiguous” with the Precision ID Ancestry Panel were categorised as “Rejected” with the combined panel. Among the North-East African individuals, three profiles (one categorised as “Accepted” and two as “Ambiguous”) were categorised as “Rejected” when the combined panel was used. These AIM profiles were outliers in all reference populations (z-scores > 1.64; P < 0.05).
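The pairing of z-scores > 1.64 with P < 0.05 matches a one-sided test against a standard normal distribution. The sketch below illustrates that relationship only. It assumes the z-score is approximately standard normal for a profile that belongs to the tested population; it is not GenoGeographer's implementation, and the function names are hypothetical.

```python
import math

def upper_tail_p(z):
    """One-sided (upper-tail) P-value of a standard normal z-score."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def is_outlier(z, alpha=0.05):
    """True if the profile would be rejected at significance level alpha."""
    return upper_tail_p(z) < alpha

# The rounded threshold 1.64 gives a P-value just above 0.05;
# the exact 5% critical value is close to 1.645.
for z in (1.0, 1.64, 1.645, 2.5):
    print("z = %.3f  P = %.4f  outlier = %s" % (z, upper_tail_p(z), is_outlier(z)))
```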
Figure 4 shows the distribution of the log LRs for all individuals with z-scores ≤ 1.64 (P ≥ 0.05) for their populations of origin. Overall, the combined panel (red distribution in Figure 4) led to an increase in LRs compared to those of the two panels separately. The increase in LR for the combined panel was greatest when the AIM profiles of individuals from North Africa and North-East Africa were compared with those from individuals from Europe, the Middle East, and South-Central Asia, while it was smallest when the AIM profiles of individuals from 1) Europe and the Middle East and 2) the Middle East and South-Central Asia were compared.
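To illustrate why combining panels tends to increase the LRs, the sketch below computes a log10 LR for one AIM profile under two candidate populations. It assumes Hardy-Weinberg proportions within each population and independence between loci, so that every additional informative locus adds its own term to the log LR. The allele frequencies and function names are hypothetical and chosen for illustration; they are not taken from the reference data, and the frequency adjustments a production tool would apply (for example, to avoid zero frequencies) are omitted.

```python
import math

def genotype_prob(count, freq):
    """P(genotype) under HWE; count = copies of the reference allele (0, 1 or 2)."""
    return math.comb(2, count) * freq ** count * (1.0 - freq) ** (2 - count)

def log10_lr(profile, freqs_a, freqs_b):
    """log10 of P(profile | population A) / P(profile | population B).

    Loci are treated as independent, so per-locus log ratios are summed.
    Frequencies of exactly 0 or 1 are not handled in this sketch.
    """
    total = 0.0
    for count, fa, fb in zip(profile, freqs_a, freqs_b):
        total += math.log10(genotype_prob(count, fa))
        total -= math.log10(genotype_prob(count, fb))
    return total

# Hypothetical three-locus example (illustrative values, not study data)
profile = [2, 0, 1]          # reference-allele counts per locus
freqs_a = [0.9, 0.2, 0.5]    # reference-allele frequencies in population A
freqs_b = [0.5, 0.5, 0.5]    # reference-allele frequencies in population B
print("log10 LR = %.3f" % log10_lr(profile, freqs_a, freqs_b))
```

In this example the third locus has the same frequency in both populations and contributes nothing to the LR. This mirrors the observation that the gain from the combined panel was smallest for pairs of meta-populations that are genetically similar.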