3.1 Y haplogroups
The current inter-population differences in Y haplogroup frequencies in Europe, the Near East, and North Africa are expressed by PCA analyses in Figs. 2A-2B. A simplified picture showing the most frequent Y haplogroup in each population is shown in Fig. 3A. The geographical distribution of individual Y haplogroups and their combinations is displayed in Supplementary Figs. 1–13. In Europe, the minimum spanning tree branches out from Hungary into three major regions: The first one (Western and Northern Europe) is characterized by the dominance of R1b-S116, I1, and R1b-U106; the second one (East-Central and Eastern Europe) is rich in N and/or R1a; the third one (Southeastern Europe) has the highest frequency of I2a-P37.2, E1b-M78, and J2. Other Y haplogroups have too low frequencies, but a noteworthy phenomenon is the disproportionate presence of G2a* & G2a2 in the Central Mediterranean.
In the Near East and North Africa, the root of the minimum spanning tree is centered in Lebanon and we can clearly see a separation of North African countries - an effect of the autochthonous Berber lineage E1b-M81. The Near Eastern nations significantly diverge along the Factor 2 axis: Whereas the Caucasus (non-Arab) region is characterized by the high frequency of G, J2, and R1b, in the Arabian Peninsula, we mostly find the predominance of J1. Y haplogroups E1b-M78, E1b-M123, I, L, R1a, and T have a central position.
3.2 Ancestry components
The geographical distribution of autosomal ancestry components is displayed in Supplementary Figures 14-18. The most frequent pre-Holocene ancestry components (in the k=11 model) are shown in Figure 3B, which clearly illustrates the dominance of Anatolian Neolithic ancestry. Other autosomal clusters have a more peripheral distribution: EHG in Northeastern Europe, CHG in Georgia, Iranian Neolithic in Iran, and Natufian in Arabia and Egypt. The Villabruna component is a minor element with a regional peak in the Baltic region and the Taforalt component is concentrated in Northwestern Africa. The Yamnaya component (a mixture of CHG and EHG, not included in this model) is the most frequent in the northern zone of Europe, from Iceland and Ireland through Scandinavia to the Baltic region and northwestern Russia.
Nevertheless, it should be emphasized that in contrast with Y haplogroups, the frequencies of these components are not fixed numbers and depend both on their choice and their number. As a result, absolute frequencies will always change when the ancestry model is changed, and only relative proportions may remain more or less constant. However, even relative proportions can be significantly skewed when the model includes ancestry components that are closely related. The meaningfulness of the ancestry models can, therefore, be verified by their mutual comparison (Supplementary Table 1) or by correlations with ancestry-specific markers – in this case Y haplogroups (Supplementary Tables 2-8).
The k=7 model (Figure 4A) distinguishes only five components relevant for 39 European countries (Villabruna, EHG, Yamnaya, Anatolian Neolithic, CHG). As a result, the situation is simplified and three missing ancestries (Iranian Neolithic, Natufian, Taforalt) are naturally merged with their most closely related counterparts: Iranian Neolithic with CHG, and Natufian and Taforalt with Anatolian Neolithic. In addition, the k=7 model does not differentiate a specific Villabruna component that once introgressed into the gene pool of West European farmers (designated as ‘WEF Villabruna’ in the present study) and that is apparently combined with Anatolian Neolithic. Since Yamnaya is a mixture of CHG and EHG [25], the residual frequencies of CHG and EHG reflect their population history unrelated to the Yamnaya formation.
The more detailed k=11 model (Figure 4B) is available for 56 countries and includes seven components with meaningfully high frequencies (Villabruna, EHG, Anatolian Neolithic, CHG, Iranian Neolithic, Natufian, Taforalt). It does not distinguish Yamnaya and offers the ‘purest’ proportions of EHG and CHG, which are, inevitably, much higher than in the k=7 model. However, some CHG in the Mediterranean is merged with other Near Eastern clusters. The Villabruna component now incorporates WEF Villabruna, which explains why Villabrunak=11 frequencies in Western Europe are disproportionately higher when compared with Villabrunak=7 (Figure 5A). On the other hand, Villabrunak=11 frequencies in Eastern Europe are somewhat deflated, which must be ascribed to the broadly distorting effect of EHG ancestry: EHG is originally an Upper Paleolithic mixture of Villabruna and Ancient North Eurasian (ANE) ancestry [31], and EHG populations later mixed with Villabruna populations in the Mesolithic Balkans and particularly in Scandinavia and the Baltic region [14, 32], creating a specific ‘Baltic hunterer-gatherer’ ancestry (BHG).
The most detailed k=12 model (Figures 4C-4F) is likewise available for 56 countries. Similar to the k=7 model, it separates Yamnaya as an independent genetic cluster and in addition, it replaces EHG with the above-mentioned BHG. The geographical distribution of BHG is most similar to EHGk=11 (r = 0.95, p < 0.001), confirming that EHGk=11 does not represent ‘pure’ EHG and includes a significant proportion of Villabruna ancestry. Given that the frequency of BHG is higher than the sum of EHGk=7 and Villabrunak=7, BHG may also include some proportion of WEF Villabruna, a component that is now singled out in Western Europe and whose frequencies appear to be underestimated, being non-zero in only seven countries. In both models k=11 and k=12, we find practically the same absolute frequencies of the Natufian and Taforalt components (r = 1.00, p < 0.001), and the Iranian Neolithic component (r = 0.99, p < 0.001). Furthermore, we can also find a very high mutual concordance regarding Anatolian Neolithic ancestry in all three models in Europe, although the absolute frequency of Anatolian Neolithick =12 is ~8% lower.
3.3 Relationships between ancestry components and Y haplogroups
In general, the proportion of Anatolian Neolithic, Iranian Neolithic, Natufian, and Taforalt components is the most consistent across all models and the meaningfulness of these data can be further demonstrated by the impressive relationships of Natufian and Taforalt with their ancestry-specific Y haplogroups (Figures 5B-5C). Besides J1, Natufian is also significantly associated with E1b-M123 and T, and all these three Y haplogroups can be designated as typically ‘Arabic’. On the other hand, the connection between Anatolian Neolithic and its original paternal signature Y hg G (especially G2a2) [33] is completely diluted in the Near East. A relatively strong relationship is retained only in Europe, but even here, Anatolian Neolithic is more strongly correlated with Y hg J2 (Supplementary Figures 19A-19D).
CHG ancestry and Iranian Neolithic ancestry share a relatively recent common origin [34] and both were originally accompanied mainly by Y hgs J2 and L [35-36]. In today’s Near East and North Africa, the situation is very different and CHG is mainly correlated with Y hg G2 in the Caucasus area, whereas Iranian Neolithic shows the only noteworthy connection with Yamnaya-associated R1a (Supplementary Figures 20A-20B). Since Y hg G has low diversity in the Caucasus and is not supposed to be autochthonous [37], we must assume a strong founder effect and extensive replacement of local CHG-associated paternal lineages caused by the intrusion of populations with Anatolian Neolithic ancestry. This process may not have included only Y hg G but also other Y haplogroups, especially I and J2 (Supplementary Figures 20C-20D), as indicated by their tight clustering in Figure 4E.
The results regarding EHG and Villabruna (and their Holocene derivatives BHG and Yamnaya) are the least coherent, which stems from their long historical interconnection. The Villabruna component was typical of Mesolithic Europe west of the Carpathians and its original Y haplogroups were I and R1b-V88 [24-25, 38]. These lineages also prevailed in the mixed EHG & Villabruna (BHG) populations between the Balkans and Northern Europe during the Mesolithic. At present, the highest frequency of BHG-derived Villabruna (Villabrunak=7) can be found in the Baltic region (Latvia and Lithuania), but its original Y haplogroups were already replaced by R1a during the Late Eneolithic (Corded Ware) period and by the Uralic lineage N during the Iron Age [33, 39]. Consequently, Villabrunak=7 is currently associated mainly with N and R1a, and the combination of these two Y haplogroups is significantly complementary in this regard (r = 0.90, p < 0.001) (Supplementary Figure 21A). This means that Y hg N is practically interchangeable with R1a and represents the same autosomal ancestry. The mismatch between N & R1a and Villabruna ancestry later spread with Slavic-speaking populations to Central Europe (Figure 5D). However, there are two other populations with a notable proportion of Villabruna ancestry that retained Y hg I and experienced extensive geographic expansion. One is typical of Scandinavia (especially Sweden), with the predominance of Y hg I1, and the other can be found in the Western Balkans, with the overwhelming dominance of I2a-M423 (a subbranch of I2a-P37.2) [37].
The population history of WEF Villabruna in Western Europe was very different: This ancestry component correlates positively with Anatolian Neolithick=7 (r = 0.46, p = 0.003), a West European Y haplogroup R1b-S116 (r = 0.70, p < 0.001), and especially with I2a-M26 (r = 0.79, p < 0.001 in 29 countries) (Supplementary Figures 21B-21D). I2a-M26 is a subbranch of I2a-P37.2, which introgressed into the gene pool of West European farmers and is most widespread in Sardinia (38.9%) [33, 40].
The EHG component dominated east of the Carpathians during the Mesolithic and its main Y haplogroups were Q, R1a, and R1b [15, 25, 41]. Almost all non-Yamnaya EHG ancestry in today’s Europe is descended from the mixed BHG population, which can be illustrated by a very high geographical correlation between EHGk=7 and Villabrunak=7 (r = 0.87, p < 0.001) (Supplementary Figure 21E). Consequently, EHGk=7 shares many relationships with Villabrunak=7, including a very strong association with N & R1a (r = 0.94, p < 0.001) (Supplementary Figure 21F). The combination of EHGk=7 and Villabrunak=7 components is, therefore, practically interchangeable with BHGk=12, although their sum is lower than the frequency of BHGk=12 (Supplementary Figures 22A-22F).
During the Eneolithic period (5th-4th millennium BC), the mixture of EHG and CHG in the steppe gave rise to the Yamnaya component, whose paternal signatures were likewise Q, R1a, and R1b. However, in the k=7 model, Yamnaya correlates most strongly with Y hg I1 (r = 0.60, p < 0.001) (Supplementary Figure 23A). Although until recently, finds of Y hg I1 prior to the Nordic Bronze Age (1700-500 BC) were very rare and its history in Scandinavia was enigmatic, Posth et al. [32] documented this branch in a male from northern Germany (~3233 cal. BC), who was assigned to the local Funnelbeaker culture and had a Villabruna-like genetic profile. This shows that Yamnaya-associated Y hgs were largely replaced during admixture with the indigenous population of Scandinavia. In addition, Yamnaya was also part of the I2a-P37.2 expansion in the Balkans (Supplementary Figure 23B). As a whole, Yamnaya is almost perfectly linearly correlated with five ‘European’ Y haplogroups (I, N, Q, R1a, R1b) (r = 0.96, p < 0.001) and with indigenous European components EHG and Villabruna (Supplementary Figures 23C-23E). On the other hand, it is strongly mutually exclusive with non-European Y haplogroups and non-European ancestry components (Supplementary Figure 23F).
3.3 Male height vs. Y haplogroups
The relationships between male height and Y haplogroups are displayed by factor analyses in Figures 4A-4F, in Table 1, and in great detail in Supplementary Figures 24-32. In Europe, these comparisons confirm previous findings [7] and identify Y hg I as the main predictor of tallness (r = 0.57, p < 0.001). At the same time, this is true not only in Europe but even in the entire sample of 60 countries (r = 0.76, p < 0.001) (Figure 6A). At present, Y hg I has two main frequency peaks in Sweden (46.3%, mostly I1) and in Bosnia and Herzegovina (55.3%, mostly I2a-P37.2 and its subbranch I2a-M423). The strength of the positive relationship slightly increases when Y hg I is combined with R1b-U106 (r = 0.80, p < 0.001) (Figure 6B), whose geographical distribution is more limited, with a frequency peak in the Netherlands (34.2%). A combination of five European Y haplogroups (I, N, Q, R1a, R1b) improves the correlation coefficient even further (r = 0.83, p < 0.001) (Supplementary Figure 30A). Among ancestry-specific Y hgs, those associated with Germanic nations (I1, I2a-M223, R1b-U106) are the most noteworthy (r = 0.64, p < 0.001) (Supplementary Figure 30E). The factor analyses in Figures 4A-4C also help to identify a more subtle combination of six ‘height-related’ Y haplogroups, which have the most specific relationship to male height in Europe: I1, I2a-P37.2, I2a-M223, N, Q, and R1b-U106 (r = 0.72, p < 0.001) (Figure 6C). Nevertheless, the additive effect of I2a-M223 is negligible and ambiguous in other combinations.
Given that the relationships between Y haplogroups and height can be influenced by environmental factors, the correlation coefficients must have been adjusted (Table 2). In Europe, these potentially confounding variables consist of nutrition (protein supply) and three socio-economic factors with the most direct causal effect (child mortality, total fertility, inequality-adjusted Human Development Index). Interestingly, the strength of the partial correlations is often even greater: Y hg I increases its predictive power to r = 0.67 (p < 0.001), I2a-P37.2 to r = 0.64 (p < 0.001), and R1a to r = 0.42 (p = 0.011), reflecting the economic underdevelopment of Eastern Europe, which hinders the full expression of the genetic potential. In contrast, I1 largely loses significance (to r = 0.39, p = 0.019) and R1b-U106 becomes an insignificant factor (to r = 0.08, p = 0.64), although its partial correlation is partly retained in the total sample (r = 0.27, p = 0.049). This must be ascribed to the fact that both I1 (r = 0.65, p < 0.001) and R1b-U106 (r = 0.68, p < 0.001) are most strongly associated with high protein quality in Europe (the ratio between the proteins from dairy & pork / wheat). Also noteworthy are the amplified negative tendencies of R1b-S116 (r = -0.39, p = 0.021) and partly even I2a-M223 (r = -0.28, p = 0.11), suggesting that their role is influenced by higher living standards in Western European countries.
The most parsimonious combination of Y haplogroups after adjusting is I1 & I2a-P37.2 (r = 0.70, p < 0.001) and this result improves only slightly (to r = 0.72, p < 0.001) with five Y hgs (I1, I2a-P37.2, Q, R1a, R1b-U106) (Figure 6D), not to mention that it is practically the same as the combination of mere four lineages (I1, I2a-P37.2, R1a, R1b-U106). The six ‘height-related’ Y hgs partly lose their importance (r = 0.66, p < 0.001). Although Y hg N does not contribute to these relationships, its outlier frequency in Finland (Figure 6C) indicates that its phenotypic effect in the Finnish population may be highly specific. This would not be surprising given Finland’s isolated genetic history [42]. After excluding Finland, it is the combination of I1, I2a-P37.2, N, Q, and R1b-U106, which gives the highest partial correlation in 38 European countries (r = 0.81, p < 0.001), but I1, I2a-P37.2, N, and R1b-U106 reach nearly the same value (r = 0.80, p < 0.001) (Supplementary Figures 31A-31B). Although N and R1a are otherwise mutually complementary, adding R1a to these combinations decreases the partial r-value because it leads to overestimating male height in the Baltic region and Eastern Europe as a whole (Supplementary Figures 31C-31D).
In contrast with European Y haplogroups, non-European Y haplogroups (E, G, J, L, T) correlate strongly negatively with height in the total sample (r = -0.82, p < 0.001), which can be ascribed mainly to Y hg J1 (r = -0.80, p < 0.001). This negative relationship reaches a maximum when J1 is combined with E1b-M123 and T (r = -0.84, p < 0.001) (Figure 6E). As already mentioned, these three Y haplogroups are typical of the Arab peninsula and reach the highest frequency in Yemen (78.4%). Interestingly, we can also observe that Y haplogroups J2 and L, which are widespread in the Near East and correlate negatively with male height in Europe (r = -0.57, p < 0.001), are strongly associated with tall statures in the Near East (r = 0.77, p < 0.001). This result further slightly increases to r = 0.78, when J2 & L are combined with E1b-M78 or G2a*& G2a2. After adjusting for variables specific to the Near East, the combination of E1b-M78, J2, and L reaches a very high partial correlation of r = 0.92 (p < 0.001), although this comparison does not include Bahrain and Qatar.
When similar adjustments are performed in the total sample (in 58 countries, again without Bahrain and Qatar), the importance of European Y haplogroups markedly decreases (to r = 0.50, p < 0.005), confirming the expected influence of better living conditions in Europe. As already mentioned, this confounding effect is most evident in the Western European lineage R1b-S116, whose relationship to height is largely reversed. The role of Y hg I and various Y haplogroup combinations also decreases but still remains highly significant, and the highest partial correlation can be found in four Y haplogroups: I1, I2a-P37.2, Q, R1b-U106 (r = 0.66, p < 0.001) (Figure 6F). After the exclusion of Finland, the best result is again achieved with the same five Y haplogroups as in Europe (I1, I2a-P37.2, N, Q, R1b-U106) (r = 0.72, p < 0.001). The negative relationship of the three ‘Arabic’ Y hgs E1b-M123, J1, and T is also retained (r = -0.62, p < 0.001), but the drop in the r-value is far greater, suggesting that the association between height and these markers may be more strongly distorted by the lower quality of life.
3.4 Male height vs. ancestry components
Table 3 shows the relationship between male height and ancestry components according to all three models tested. This comparison identifies two ancestries that are consistently associated with tallness: Villabruna and Yamnaya. A significant positive relationship can be found even with EHGk=11 (a component in Yamnaya) and BHGk=12 (a mixed EHG-Villabruna ancestry) (Figures 7A-7D). In all comparisons, Yamnaya or EHGk=11 appear to be stronger predictors of height than Villabruna or BHG. However, the role of non-Yamnaya EHGk=7, which does not include the Villabruna component, is non-significant. Furthermore, in the k=7 model after adjusting, the difference between Villabruna and Yamnaya in Europe nearly disappears (r = 0.58 vs. r = 0.61) (Table 4), and it even reverses (r = 0.57 vs. r = 0.55), when only three elementary factors (nutrition, child mortality, total fertility) are used as potential confounders. Disregarding the k=11 model (in which EHG includes a large proportion of Villabruna), BHG and Villabruna also correlate more strongly in the more developed Western Europe, where we generally observe highly linear relationships approaching r = 0.90 (Table 5). In contrast, correlations between height and ancestry in Eastern Europe are much weaker, suggesting a greater role of the environment. This is reminiscent of the observation made in the case of Y hgs I2a-P37.2 and R1a.
These data suggest that the genetic potential of the Villabruna cluster, which is more represented in Eastern Europe, may actually be higher than that of Yamnaya. This is further supported by the fact that Villabrunak=7 correlates more strongly with key height-related, adjusted combinations of Y haplogroups (Supplementary Table 2). In fact, the deficit of Villabruna ancestry in Western European populations descended from the Bell Beaker culture (Y hg R1b-S116) can explain why they are ~3 cm shorter than other Europeans with a similar proportion of Yamnaya ancestry. This can be illustrated by the example of Ireland on the one hand, and Sweden on the other hand (Figures 7A-7B). The EHGk=7 component likewise gains in importance after adjusting (r = 0.49, p = 0.003), but does not combine well with Villabruna or Yamnaya, indicating its secondary role (Supplementary Figures 33A-33B).
Figures 7E-7F and Supplementary Figures 33-34 illustrate that height in Europe is in an inverse relationship with all five Near Eastern ancestry components and this remains true even after controlling for the environment (Table 4). In the Near East and North Africa, Anatolian Neolithic predicts tall statures and similar tendencies can be observed in the CHG component. On the other hand, Natufian has a negative role. These relationships (albeit weaker) are likewise retained after adjustments.
The situation in the total sample changes more: After controlling for environmental factors, negative partial correlations in the k=12 model can be found in Natufian (r = -0.50, p < 0.001), WEF Villabruna (r = -0.36, p = 0.009), and partly in Iranian Neolithic (r = -0.27, p = 0.057) and Anatolian Neolithic (r = -0.27, p = 0.055). The r-values in Anatolian Neolithic radically change from positive to negative, which must be ascribed to the high frequencies of this cluster in affluent Western European countries. The fact that WEF Villabruna (Supplementary Figure 34E) decreases a positive correlation coefficient when combined with BHG in Europe suggests that it carries predispositions for short height that were typical of Anatolian Neolithic.
A noteworthy anomaly that can be observed in virtually all graphic comparisons in Europe is the eccentric trend in the Western Balkan area of the Dinaric Alps (former Yugoslavia and Albania). Here, we find the tallest statures in the world and the highest frequencies of Villabruna-associated Y hg I, yet Villabruna and Yamnaya ancestry are insufficient to explain this phenomenon and local heights appear to increase with the proportion of Anatolian Neolithic and CHG ancestry (cf. Figure 7E). This is in striking contrast to the situation in other regions of Europe. Inevitably, correlation coefficients in Europe profoundly increase when Western Balkan countries are excluded (Table 5). The same anomalous tendency can be seen in Supplementary Figure 30A, where Montenegro is the tallest European country, despite a relative deficit of European Y haplogroups (I, N, Q, R1a, R1b). A lineage that improves the position of Montenegro in this graph and significantly increases the correlation coefficient in Europe (from r = 0.61 to r = 0.68, p < 0.001) is E1b-M78. Another slight increase occurs after the inclusion of J2 (r = 0.70, p < 0.001). The specific role of E1b-M78 and J2 in the Western Balkans can also be seen in Supplementary Figures 32A-32B, which compare non-European Y haplogroups.
These findings raise interesting questions regarding the well-known Holocene founder effect of E1b-M78 in the Balkans, which included the subbranch E1b-V13 [43-44]. The frequency of E1b-V13 currently reaches its maximum in the Ghegs from North Albania (37.8%) [44] and Kosovo (up to ~44%) [45]. Interestingly, Cruciani et al. [43] demonstrated a high geographical correlation between E1b-V13 and the subbranch J2b in Europe and their similar times to the most recent common ancestor (TMRCA) ~2700-2000 BC, supporting a long shared history. This observation can be confirmed even in the present study because E1b-M78 and J2 mutually overlap both in the Balkans (r = 0.75, p = 0.008) and in the seven countries of the Dinaric Alps (r = 0.84, p =0.017).
3.5 Lactose tolerance vs. Y haplogroups
Lactose tolerance-associated alleles included in this study consist of 13910*T, 13915*G, 14009*G, 13907*G, and 14010*C, but only 13910*T and 13915*G were represented in high frequencies. The typically European allele 13910*T was available for 35 out of 39 European countries and for 52 countries from the total sample (Figure 8A). It has two main frequency peaks in Ireland (86.6%) and Iceland (85.3%) but is also widespread in Scandinavia (~75%) and in the United Kingdom (74.5%). Table 6, Supplementary Table 9, and Figure 9A illustrate that the distribution of 13910*T in Europe is most closely tied to three ‘Germanic’ Y haplogroups I1, I2a-M223, and R1b-U106, whose combination is complementary (r = 0.79, p < 0.001) and is clearly responsible for the elevated 13910*T frequency in Eastern Europe. Although the presence of 13910*T in Eastern Europe can also be linked to N and R1a, their role is considerably weaker. Besides the Germanic Y hgs, the most important lineage is obviously R1b-S116 in Western Europe. Six Y haplogroups (I1, I2a-M223, Q, R1a, R1b-S116, R1b-U106) are the strongest correlates of 13910*T both in Europe (r = 0.89, p < 0.001) and in the entire sample of 52 countries (r = 0.94, p < 0.001) (Figure 10A). Although adding Y hg N improves the outlier position of Finland, it overestimates 13910*T frequency in other Baltic countries (Supplementary Figures 35A-35F). Expectably, all non-European Y hgs have a negative relationship with 13910*T in Europe and this applies especially to E1b-M78 & J2 (r = -0.78, p < 0.001), which are concentrated in the Balkans.
In the Near East and North Africa, the situation is much simpler. 13910*T is represented in low frequencies, although a notable exception is Northwestern Africa (~10-20% 13910*T). The dominant lactose tolerance-associated allele is 13915*G, whose frequencies are available for 30 countries (13 in the Near East and 16 in the Near East & North Africa). Its occurrence reaches a peak in Saudi Arabia (58.7%) and Yemen (54.9%), but outside the Arabian Peninsula, it abruptly decreases to 3.7% in Egypt and 2.9% in Syria (Figure 8B). The only Y haplogroup correlating consistently with 13915*G is J1 (r = 0.87, p < 0.001 in the Near East and r = 0.89, p < 0.001 in the complete sample of 30 countries), which is graphically demonstrated in Figure 9B. In the Near East & North Africa, E1b-M123 is also gaining importance, and Y hg T shows a decently positive role in the whole sample (Table 6, Supplementary Table 10). Nevertheless, E1b-M123 and T have only a slightly additive effect when combined with J1, showing that J1 is definitely the most important factor (Figure 10B, Supplementary Figures 36A-36D). Figure 10B also shows that 13915*G starts to increase in a population only when the cumulative frequency of E1b-M123, J1, and T reaches > 30%. This suggests that not all subbranches of these Y haplogroups are associated with 13915*G and a higher resolution would be needed. In any case, 13915*G is not positively correlated with any other Y haplogroup, confirming that its present distribution is closely related to the expansion of pastoral populations from the Arabian Peninsula.
3.6 Lactose tolerance vs. ancestry components
In contrast with the diverse spectrum of Y haplogroups connected with 13910*T, there is only one ancestry correlating consistently positively with this allele in Europe: Yamnaya (r = 0.83, p < 0.001 in the k=7 model) (Table 7, Figure 11A). However, Villabrunak=11 and BHGk=12 are also significantly correlated, and Villabrunak=7 reaches significance when Europe is divided into a western and eastern half (Figure 11B, Supplementary Figures 37-38). These relationships reflect the involvement of Villabruna-associated lineages I1 and I2a-M223, as well as the weaker 13910*T selection in Eastern Europe (mirrored by the less significant role of Y hgs N and R1a) and the weaker relationship of 13910*T with WEF Villabruna ancestry in Western Europe (which consequently weakens the relationship between 13910*T and Villabrunak=11 in Western Europe and brings it closer to the Eastern European correlation line).
The Near Eastern allele 13915*G shows the only significant connection with Natufian ancestry, but the correlations are even more linear than those observed in the three ‘Arabic’ Y haplogroups: In both k=11 and k=12 models, they reach r = 0.91 (p < 0.001) in the total sample of 27 countries, r = 0.93 (p < 0.001) in 13 countries from the Near East & North Africa, and r = 0.96 (p < 0.001) in 10 countries from the Near East (Supplementary Figures 39A-39B). Still, even here, we can see that the selection for 13915*G postdates the spread of the Natufian component, as Levantine countries (Lebanon, Syria) have nearly zero 13915*G levels despite 20-25% Natufian ancestry.
3.7 Male height vs. lactose tolerance
Previous studies [7-8] documented a paradoxical relationship of lactose tolerance with male height, which is positive in Europe but negative in the Near East. The present study unequivocally confirms these findings (Figures 12A-12B). In Europe, this positive relationship can be explained by the association between 13910*T and the Yamnaya component because Eastern European countries (and particularly those from the Western Balkans) deviate from the correlation line. The negative correlation between 13915*G and male height is likewise easy to explain due to the strong multicollinearity of 13915*G with the ‘Arabic’ Y haplogroups and Natufian ancestry, which consistently predict the shortest heights.