Lactase deficiency in Russia: multiethnic genetic study

Lactase persistence, the ability to digest lactose through adulthood, is closely related to evolutionary adaptations and has affected many populations since the beginning of cattle breeding. Nevertheless, the contrasting ancestral phenotype, lactase non-persistence or adult lactase deficiency, is still observed in large numbers of people worldwide. We performed a multiethnic genetic study of lactase deficiency on 24,439 people, the largest in Russia to date. The percentage of each population group was estimated according to the local ancestry inference results. Additionally, we calculated frequencies of the rs4988235 GG genotype in Russian regions using the current location and birthplace data from the clients' questionnaires. The results show that among all studied population groups, the frequency of the GG genotype in rs4988235 is higher than the average in European populations. In particular, the prevalence of the lactase deficiency genotype in the East Slavs group was 42.8% (95% CI: 42.1–43.4%). We also investigated the regional prevalence of lactase deficiency based on the current place of residence. Our study emphasizes the significance of genetic testing for diagnostics, specifically for lactose intolerance, as well as the scale of the problem of lactase deficiency in Russia, which needs to be addressed by the healthcare and food sectors.


INTRODUCTION
Lactose is one of the most common disaccharides in milk and dairy products consumed by humans. The lactose content ranges from 0.1 to more than 7 g per 100 mL, depending on the source of milk, the category of dairy products, the fat content in the product, and the type of starter used at the fermentation stage, due to which the final sugar concentration is significantly reduced [1][2][3]. The digestion of lactose occurs in the small intestine in the presence of a β-galactosidase enzyme called lactase-phlorizin hydrolase (LPH), or lactase in short, the activity of which determines the ability of the body to successfully break down milk sugar into glucose and galactose monomers [4].
The lactase enzyme is encoded by the LCT gene located on chromosome 2. Its upstream regulatory region in MCM6 modulates LCT expression through epigenetic modifications. Polymorphisms in MCM6, particularly rs182549 (−22018 G/A) and rs4988235 (−13910 C/T), are associated with a high age-related methylation level in both the MCM6 and LCT regions, which leads to lactase non-persistence by silencing LCT [5,6]. Such decreased lactase activity generally occurs after the weaning phase. It is considered to be the normal "wild type" and occurs in about two-thirds of the world's population, with wide variation between countries [7][8][9][10]. The ability to digest lactose in adulthood is provided by the maintenance of a high level of lactase activity due to continued post-weaning expression of the lactase gene, and results from at least 5 different mutations in a lactase regulatory region within an intron of MCM6 [11]. It is assumed that lactase persistence (LP) alleles underwent positive selection simultaneously with the spread of agriculture [11][12][13]. The average frequency of lactase persistence is high in northern Europe, moderate in southern Europe and the Middle East, and low in some Asian and African communities [12,14,15]. The evolutionary advantage of LP alleles lies in access to nutrient-dense dairy products and a source of liquid, which is critical for populations living in hot, arid environments [14].
Decreased lactase activity in adulthood commonly leads to lactose malabsorption, the extent of which depends on the amount of unabsorbed lactose. A limited amount of lactose intake does not cause recognizable symptoms in people with lactose maldigestion: the majority of subjects with lactose maldigestion will tolerate, with no or minor symptoms, up to 12 g of lactose as a single dose, or more if lactose-containing products are spread over multiple doses throughout the day [16,17]. For this reason, worldwide rates of lactase deficiency and lactose malabsorption may be much higher than those of lactose intolerance. Lactose intolerance depends not only on the lactase genotype and lactose dose, but also on age, the microbiota, the specifics and sensitivity of the gastrointestinal (GI) tract, and the capacity for lactose digestion [18,19]. In the large intestine, uncleaved lactose impairs water absorption and is fermented by the intestinal microbiota with the release of gases (H₂, CO₂, CH₄) and other metabolites. These factors negatively affect the function of the GI tract [20,21] as well as a person's well-being. Thus, one may face digestive problems (diarrhea, abdominal pain, bloating, nausea) after consuming lactose-containing food [21]. This leads to a decrease in dairy product consumption, even though dairy products are a key source of calcium in a balanced diet, as well as an important source of protein and a set of micronutrients [22,23]. One way to avoid digestive symptoms while keeping dairy products in the diet may be to choose lactose-free and low-lactose dairy [24]. Most guidelines and public health organizations suggest consuming up to three servings of dairy products per day to ensure adequate nutrient intake and optimal bone health [25][26][27][28][29][30].
During the last decades, a number of studies have analyzed various national management strategies [17], symptom evaluation [31], and the global and region-focused situation [7,8]. Specifically, for Russia, it was concluded that published figures may overestimate the prevalence of deficiency because the samples are not representative of the total population of the country [7]. Indeed, the Russian Federation is a country of more than 146 million people and at least 190 ethnic groups, which tend to be genetically diverse and may include people similar to those of European descent as well as individuals almost indistinguishable from Asians. The population is unevenly distributed throughout the country, which is divided into 85 federal subjects or regions. Accurate evaluation of regional patterns of lactase deficiency is essential to guide the management of GI symptoms. It is also a crucial driver of growth in the market for lactose-free and low-lactose dairy products.
To evaluate the actual level of lactase deficiency in Russia, we focused on genotype-related hypolactasia. We aimed to compare the frequency of the GG genotype in rs4988235 (−13910 C/T), located in the regulatory region (MCM6) of the LCT lactase enzyme gene, across populations living in Russia and across Russian regions. This genotype is in strong linkage disequilibrium (D' = 0.9985, R² = 0.9825) with another known lactase non-persistence genotype, CC in rs182549 (−22018 G/A), as well as with the rs4954490, rs41380347, and rs145946881 variants, which were previously mentioned for one of the Russian populations [14]. The other two variants, rs41525747 and rs869051967 [14], are absent from the 1000 Genomes reference panel, and their allele frequencies are low in the Russian population (<0.1%). Therefore, we decided to estimate the lactase deficiency frequency solely via rs4988235, which is also the variant most widely investigated in studies of the Russian population, so that we could compare our results with existing ones.
We also analyzed the prevalence of lactose intolerance phenotypic manifestations based on previously published data with the frequency of the presence of lactase persistence genotype which determines the genetic risk of lactase deficiency manifestation for the Russian population.

MATERIALS & METHODS

Study populations
In our study, we analyzed the genetic data of 40,164 individuals from the database of Genotek, a Russian consumer genetics and research company (www.genotek.ru). Genotek clients included in our analyses provided informed consent for their data to be used for research purposes and completed an online questionnaire. The current research was approved by the Genotek Ethics Committee (protocol No. 8 "Lactose intolerance in the Russian population") and performed in accordance with the Declaration of Helsinki. DNA was extracted from saliva samples and genotyped on Illumina Infinium Global Screening Array v.1-v.3 microarrays (~650,000 SNPs). We also included samples from the 23andMe genetic company (www.23andme.com) that had been uploaded to the Genotek website; these were genotyped on the 23andMe v5 custom microarray chip. All 23andMe users who shared their data with us signed informed consent for their data to be used for research purposes and completed the online questionnaire. All samples in the Genotek cohort were processed in batches (192-576 samples per batch). The GenomeStudio software (Illumina, San Diego, CA) and manually created cluster files were used to cluster the raw signals and call the genotypes. SNPs with a call rate <0.9 within the batch were removed. Then we removed the calls on the Y chromosome for women and the heterozygous calls on the X chromosome for men. We calculated the MCD (maximum cluster distance) metric for each genotype as the maximum (over all probes for the SNP) of the normalized distances between the individual signal and the corresponding cluster center in the plane with coordinates Norm Theta and Norm R. We considered all heterogeneous genotypes with MCD > 4 as missing calls. Finally, we removed samples with a call rate < 0.9 after all previous steps.
Statistical analysis of the genotypes was performed using R scripts. Confidence intervals were constructed using the binconf function from the Hmisc package (alpha = 0.05).
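To illustrate how such a binomial confidence interval can be obtained, the following is a minimal Python sketch of the Wilson score interval, which is the default method of binconf in Hmisc. It is not the authors' R code; the function name wilson_interval and the counts in the example are hypothetical and serve only as an illustration.

```python
import math

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    z is the standard normal quantile for a two-sided 95% interval
    (alpha = 0.05). Returns (lower, upper).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

if __name__ == "__main__":
    # Hypothetical counts, not taken from the study.
    lower, upper = wilson_interval(successes=428, n=1000)
    print(f"point estimate 0.428, 95% CI: {lower:.3f}-{upper:.3f}")
```

With cohorts of tens of thousands of samples, the Wilson interval and the simple normal approximation give nearly identical bounds; the difference matters mainly for small regional samples.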

Ancestry estimation
An in-house algorithm for local ancestry inference was used to assign the ancestry label to each individual. Our algorithm first used Positional Burrows-Wheeler Transform (PBWT) [32] for initial population prediction and then a custom Hidden Markov Model for phasing error correction and smoothing. We used the reference panel with 17,559 genomes with known ancestry. Each individual from the reference panel corresponds to one of the 101 populations. All populations were clustered into 23 population groups based on their genetic similarity.
For each individual in the current study, we estimated the percentage of each population group according to the local ancestry inference results. We assigned a population group label to individuals whose maximum population group percentage exceeded 70% and included them in further analysis. To ensure there were enough samples for frequency estimation, we selected only those population groups that contained 100 or more people. After such filtering, seven population groups remained: Central Asia, East Slavs, Ashkenazi Jews, North Caucasus, Siberia, Volga-Ural region, and Western Asia. The total number of populations in all groups was 56. The full list of all regions and populations used in the current research is presented in Table S1. In order to show the adequacy of our approach and to visually validate the selected clusters, we constructed a PCA plot (see below).
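The labeling rule described above can be sketched as follows. This is a minimal Python illustration, not the in-house pipeline: the input format (a dictionary of per-individual ancestry fractions) and the function names assign_group and select_cohort are assumptions made for the example.

```python
from collections import Counter

def assign_group(fractions, threshold=0.70):
    """Return the dominant population group if its fraction exceeds the
    threshold, otherwise None (the individual is treated as admixed)."""
    group, fraction = max(fractions.items(), key=lambda item: item[1])
    return group if fraction > threshold else None

def select_cohort(ancestry, threshold=0.70, min_group_size=100):
    """Label individuals and keep only groups with enough members.

    ancestry maps a sample ID to a dict {population group: fraction}.
    Returns a dict {sample ID: population group}.
    """
    labels = {}
    for sample_id, fractions in ancestry.items():
        group = assign_group(fractions, threshold)
        if group is not None:
            labels[sample_id] = group
    sizes = Counter(labels.values())
    return {s: g for s, g in labels.items() if sizes[g] >= min_group_size}

if __name__ == "__main__":
    # Toy data; min_group_size is lowered so the example is not empty.
    toy = {
        "s1": {"East Slavs": 0.90, "Siberia": 0.10},
        "s2": {"East Slavs": 0.75, "Volga-Ural region": 0.25},
        "s3": {"Siberia": 0.80, "East Slavs": 0.20},
        "s4": {"East Slavs": 0.60, "Siberia": 0.40},
    }
    print(select_cohort(toy, min_group_size=2))
```

In this sketch the size filter is applied after labeling, so a group is dropped as a whole when too few non-admixed individuals carry its label.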
Only representatives of the selected population groups were considered for further analysis. Among them, 29,559 individuals were genotyped on the Illumina GSA microarray and 835 on 23andMe v5. All subsequent data preprocessing was performed using BCFtools (version bcftools 1.12-75-g5329f29) and PLINK (version v1.90b6.10). Sex chromosomes were removed. Then all SNPs with missing calls in more than 5% of samples (plink --geno 0.05) and all samples with missing calls in more than 3% of SNPs (plink --mind 0.03) were filtered out. After this filtration, 29,104 samples remained.
All missing calls were simply replaced with the homozygous major allele (plink --fill-missing-a2). To identify biases caused by distinct platforms, a basic χ² test of association between SNPs and chip type was performed. We identified 11,023 SNPs passing the p-value threshold of 0.00001, which were responsible for the major differences (plink --assoc --pfilter 0.00001). All those SNPs were removed from consideration.
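For intuition, the platform-bias check can be viewed as a 1-degree-of-freedom χ² test on a 2×2 table of allele counts by chip type. The Python sketch below illustrates this test under that assumption; it is not PLINK's implementation, and the function names and the allele counts in the example are hypothetical.

```python
import math

def chi2_pvalue_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def allelic_chi2(a1_chip_x, a2_chip_x, a1_chip_y, a2_chip_y):
    """Chi-square test on a 2x2 table of allele counts (rows: chip type,
    columns: allele). Returns (chi2, p_value)."""
    row_x = a1_chip_x + a2_chip_x
    row_y = a1_chip_y + a2_chip_y
    col_1 = a1_chip_x + a1_chip_y
    col_2 = a2_chip_x + a2_chip_y
    n = row_x + row_y
    if 0 in (row_x, row_y, col_1, col_2):
        return 0.0, 1.0  # degenerate table: no evidence of association
    diff = a1_chip_x * a2_chip_y - a2_chip_x * a1_chip_y
    chi2 = n * diff * diff / (row_x * row_y * col_1 * col_2)
    return chi2, chi2_pvalue_1df(chi2)

if __name__ == "__main__":
    # Hypothetical allele counts for one SNP on two platforms.
    chi2, p = allelic_chi2(5200, 4800, 150, 250)
    flagged = p < 0.00001  # threshold used in the text
    print(f"chi2 = {chi2:.2f}, p = {p:.3g}, remove SNP: {flagged}")
```

A SNP whose allele frequencies differ between platforms far more than chance would allow is more likely to reflect a genotyping artifact than biology, which is why such SNPs are dropped before merging the cohorts.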
For a proper analysis, we needed to filter out all people with at least one close relative (1st or 2nd degree) in the cohort. For this stage, we excluded all variants with minor allele frequency below 0.01 (plink --maf 0.01). Then the pairwise comparison of samples and the PI_HAT calculation were performed with plink --genome. Thus, 4,635 samples with at least one close relative (PI_HAT > 0.125) were removed from the data.
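The relatedness filter can be sketched in Python as below, assuming the pairwise output has already been parsed into (sample A, sample B, PI_HAT) tuples. The function names and the toy data are hypothetical. Following the description above, every sample that appears in at least one related pair is dropped, rather than one member per pair.

```python
def samples_with_close_relatives(pairs, threshold=0.125):
    """Collect every sample that appears in at least one pair whose
    PI_HAT exceeds the threshold."""
    flagged = set()
    for sample_a, sample_b, pi_hat in pairs:
        if pi_hat > threshold:
            flagged.add(sample_a)
            flagged.add(sample_b)
    return flagged

def drop_related(samples, pairs, threshold=0.125):
    """Return the samples that have no close relative in the cohort,
    preserving the input order."""
    flagged = samples_with_close_relatives(pairs, threshold)
    return [s for s in samples if s not in flagged]

if __name__ == "__main__":
    # Toy pairwise estimates: s1 and s2 look like first-degree relatives.
    samples = ["s1", "s2", "s3", "s4"]
    pairs = [("s1", "s2", 0.50), ("s1", "s3", 0.02), ("s3", "s4", 0.10)]
    print(drop_related(samples, pairs))
```

Removing both members of a related pair is more conservative than keeping one of them; it trades sample size for a cohort in which no retained individual has a detected close relative.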
Eventually, we obtained data from 24,469 samples and pruned them (plink --indep 50 5 2). We applied principal component analysis (PCA) in PLINK to the pruned data. A PCA plot was constructed from the principal components, and the HDBSCAN clustering algorithm was run.

Evaluation of frequencies in different Russian regions
To calculate the frequency of rs4988235 GG genotype (which is in strong linkage disequilibrium with rs182549 CC genotype) in Russian regions we used the information of current location and birthplace in client's questionnaire, as well as the data about the birthplace of their close relatives (parents and grandparents). Since we aimed to focus primarily on Russian residents, we filtered out not only people born abroad, but also people whose ancestors were born abroad. Thus, the data for the frequency estimation included 11,325 individuals. Finally, 10,622 East Slavs clients were selected from this subcohort.
In the last decades, the intensity of migration has increased significantly [33]. Since we aimed to study the prevalence of genetic variants of lactase deficiency in the context of population differences, we focused on the information about the ancestral birthplaces of an individual rather than on their current location. Thus, if a grandparent was born in a region, the observation was given a weight of 1; if a mother or father was born in a region, the observation was given a weight of 0.5; for the individual themselves, the observation was given a weight of 0.25. Hence, we calculated the total weight for each region by summing the weights of all observations. If the total weight was less than 13 (e.g., 52 individuals, each with a minimum weight of 0.25) and there were fewer than 20 individuals in the region, the region was not considered for further analysis. Thus, 69 Russian regions were included in the research for estimating the rs4988235 GG genotype frequency among all samples and 66 for East Slavs only. The list of Russian regions used in the current research can be found in Tables S4-S5. The data for the map plotting were downloaded from gadm.org (version 3.6) and then visualized in R using the package ggplot2 version 3.3.5.
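The weighting scheme can be illustrated with the following Python sketch. The weights and the exclusion rule follow the text; the assumption that the regional frequency is the weighted share of GG carriers, as well as the record layout and the function name regional_gg_frequency, are ours and are not stated explicitly in the description above.

```python
from collections import defaultdict

# Weights per birthplace observation, as described in the text.
WEIGHTS = {"grandparent": 1.0, "parent": 0.5, "self": 0.25}

def regional_gg_frequency(individuals, min_weight=13.0, min_individuals=20):
    """Weighted GG genotype frequency per region.

    individuals is a list of dicts with keys:
      "genotype"    - rs4988235 genotype, e.g. "GG", "AG", "AA"
      "birthplaces" - list of (relation, region) tuples, where relation
                      is "self", "parent" or "grandparent"
    A region is skipped when its total weight is below min_weight and
    fewer than min_individuals people contribute to it.
    """
    total_weight = defaultdict(float)
    gg_weight = defaultdict(float)
    contributors = defaultdict(set)
    for index, person in enumerate(individuals):
        for relation, region in person["birthplaces"]:
            weight = WEIGHTS[relation]
            total_weight[region] += weight
            contributors[region].add(index)
            if person["genotype"] == "GG":
                gg_weight[region] += weight
    frequencies = {}
    for region, weight in total_weight.items():
        if weight < min_weight and len(contributors[region]) < min_individuals:
            continue  # too little data for a stable estimate
        frequencies[region] = gg_weight[region] / weight
    return frequencies

if __name__ == "__main__":
    # Toy records; thresholds are set to zero so both regions are reported.
    toy = [
        {"genotype": "GG",
         "birthplaces": [("self", "A"), ("parent", "A"),
                         ("parent", "B"), ("grandparent", "A")]},
        {"genotype": "AG",
         "birthplaces": [("self", "A"), ("grandparent", "B")]},
    ]
    print(regional_gg_frequency(toy, min_weight=0, min_individuals=0))
```

Giving grandparents the largest weight shifts the estimate toward where a family lived two generations ago, which dampens the effect of recent migration on the regional frequencies.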

RESULTS
To study the prevalence of lactase deficiency in the Russian population, we evaluated the frequencies of rs4988235 and rs182549 genotypes among 40,164 individuals. Taking into account that the GG genotype in rs4988235 is in strong linkage disequilibrium with the CC genotype in rs182549 (Tables S4-S5), we concentrated on the former, as it is more commonly used in lactose intolerance research. Since the Russian population is multiethnic, we aimed to estimate the frequencies in groups of populations genetically distinct from each other. For this purpose, the proportion of each population group in each individual's genome was estimated using the LAI algorithm. People with evidence of recent admixture events were excluded from consideration. After QC filtering and exclusion of admixed individuals (see Methods), the final sample consisted of 29,559 people.
We constructed a PCA plot clustering people in the filtered 56 populations according to their ethnicity (Fig. 1).
We obtained a single map of 7 population groups (Central Asia, East Slavs, Ashkenazi Jews, North Caucasus, Siberia, Volga-Ural region, and Western Asia) using genetic distance among population pairs, each assigned to one of the groups. The distribution of populations into clusters was as expected. A total of 24,469 people were included in the subsequent analysis after the PCA plot was built (12,410 females, 12,059 males). The final cohort comprised 24,439 individuals with genotypes available for both lactase-related SNPs (rs4988235 and rs182549). The characteristics of the final cohort are summarized in Table 1.
Among the population groups, the highest prevalence of the GG genotype was observed in Western Asia, Siberia, and Central Asia (96.3%, 88.7%, and 83.8%, respectively) (Fig. 2, Table S2). According to our observations, the lowest percentage of the GG genotype (42.8%; 95% CI: 42.1-43.4%) was found in the East Slavs group, which consists of Russians, Belarusians, and Ukrainians. We calculated the GG genotype frequency in individuals from different Russian regions. The highest percentage of people with the lactase deficiency genotype was found in North Ossetia (83.2%), followed by the Chechen Republic (77.8%) and Dagestan (74.3%) (Fig. 2b, Table S3). A high prevalence of the GG genotype was also detected in Buryatia (65%), the Sakha (Yakutia) Republic (63.4%), and Tomsk Oblast (60.5%).
The lowest percentage of people with the genotype associated with lactase deficiency was observed in Zabaykalsky Krai (22.8%). The percentage of the GG genotype in the largest cities and regions with populations exceeding one million is shown in Table 2.
The values in Russian regions were obtained not by ethnicity or ancestry estimation but by the current place of residence. Hence, the results may differ from the values inherent in the natives of the region.

DISCUSSION
The prevalence of the LP genotype in Europe remained low until the Bronze Age [34,35]. There are at least 2 hypotheses for the origin of lactase persistence. The first one is explained by the appearance of LP in Europeans after the mass migration of the eastern steppe populations associated with the Yamnaya culture; the second one is explained by the appearance of LP initially in the pastoral steppe population when farming replaced hunting and gathering in the Middle East and then was brought to Western Europe at the beginning of Corded Ware culture [36]. Each of these hypotheses has strong arguments for their existence; nevertheless, one of the recent studies on the evolutionary origin of the LP genotype concludes that the increase in the percentage of LP in the European population was unlikely related to steppe movements and their genotypes, and began no earlier than after 3000 BP [15]. Other studies showed that the large-scale migrations of steppe nomads were connected precisely with the beginning of milk consumption and domestication of horses: the advantage of a permanent source of protein combined with the potential epicenter of horse domestication allowed steppe people to begin large-scale migrations from their Pontic-Caspian steppe across Eurasia [37]. In addition to genetic variations in humans, high genetic diversity has been observed in cows from those regions where dairy farming is common and humans are lactose persistent [38]. However, at the same time, there are a number of cases of nomads keeping the high prevalence of the reference genotype of lactase nonpersistence alongside actively using fermentation. This prompts the puzzling question of the initial need for positive genetic selection of lactase persistence in adulthood for humans considering that its prevalence is lower than non-persistence and that humans can consume fermented milk products successfully.
The geographic specificity of the Caucasus primarily affected the genetic gap between the East European Plain and the Caucasus. A body of genetic and linguistic evidence indicates a predominantly Southern, Middle Eastern origin of the Caucasus populations in the process of Neolithic migrations from the steppe and a subsequent genetic drift in a geographically distinct landscape [39,40]. One may speculate that the spread of lactase persistence from Eastern Europe could not cross this geographical barrier, while the main route of ancient gene flow and admixture remained from the Anatolian side, where cattle breeding originated. It is worth mentioning the cultural peculiarities of this region, which may have been both the cause and the consequence of the high prevalence of lactase deficiency (for instance, the use of different milk sources and many variants of milk fermentation: dozens of types of cheese, cottage cheese, sour cream, drained sour milk, fermented dairy drinks, yogurts, etc.).
The Siberian cluster of populations deserves special attention. The vast majority of Siberian populations, even from the far north, are genetically of southern origin and show multilayered complex admixture, as emphasized by several studies [41][42][43][44]. In particular, it is assumed that the Sakha (Yakut) people descended from a common ancestral population from the Lake Baikal region, genetically and culturally close to Mongols and Buryats, until they were displaced by the latter [43]. The northward migration can be explained by the arrival of horse-breeding peoples from the Asian steppes at the beginning of the first millennium BC, at a time when reindeer breeding flourished in the Eastern Sayan region [42,45]. Thus, the high percentage of lactase deficiency in Siberian populations, in addition to the genetic component, can be explained by the following factors: widespread milk fermentation practices (the lactose content of mare's milk is 1.5 times higher than that of cow's milk [46]), later domestication, and admixture with steppe herders. Another explanation could be that reindeer milk contains the lowest concentration of lactose among herd animals (less than 3 g), so there was no need to digest large amounts of lactose [23].
The results of our study indicate the prevalence of lactase deficiency in East Slavs populations (42.8%; 95% CI: 42.1-43.4%) is higher than the average in European populations (28%; 95% CI: 19-37%) [7]. Moreover, the individuals from the North Caucasus are more susceptible to lactase deficiency than those from Central Russia, which generally coincides with the frequency of the GG genotype from north to south, as well as the history of cattle breeding. The moderate percentage of lactase deficiency across the territory of the Russian Federation persisted even after we analyzed its prevalence only among the East Slavs.
The obtained data on the lactase deficiency genotypes among Ashkenazi Jews is unique due to the limits, outdatedness, and a lack of a genetic affiliation with Ashkenazi Jews of the previous statistics for this population. Our statistics turned out to be similar to previously published data (60-80%) [19,47,48]. The prevalence of the lactase deficiency genotype is comparable with that of the North Caucasus and Central Asia. However, according to mtDNA data [49,50], the ancestors of most recent Ashkenazi Jews lived in Southern Europe and not in the Middle East or the Caucasus. They likely originated as a result of the migration of Mediterranean Jews northward to Germany and later underwent expansion eastward.
According to the results of an earlier published meta-analysis, the prevalence of lactase deficiency among the Russian population was 61% (95% CI: 59-64%) [7]. Indeed, most of the studies conducted on the territory of Russia are either limited to a narrow cohort of people of a certain age (for example, students from 17 to 26 years old) [8] or to a specific region [51,52], or the assessment was carried out in small ethnic groups of the Russian population [53][54][55]. Thus, according to the data on rs4988235 prevalence [8,53–56] among Russians, lactase deficiency occurs in 35-54% of cases (Table S5). The prevalence of lactase deficiency based on analysis of rs4988235 is 42.8% among East Slavs populations, which lies within the previously reported range. Depending on the ethnic group, the frequency of the rs4988235 GG genotype varied from 28% to 94% of cases (see Table S5). For some populations, published values of lactase deficiency agree with our estimates. For instance, the figures differ only slightly (within 5%) in the case of Bashkortostan, Buryatia, Perm oblast, Kostroma oblast, Dagestan, Mordovia, and Arkhangelsk oblast. This could be explained by the fact that one ethnicity is predominant in the region (as Buryats in Buryatia and Mokshas in Mordovia) or, given the chosen approach based on ancestral birthplaces, that there is a high probability of interethnic marriages. The differences in the prevalence of lactase deficiency of East Slavs across Russian regions (e.g., in Yakutia and Ivanovo oblast) are based on the respectively low or high representation of East Slavs in a particular region compared to the other populations. In the vast majority of cases, the estimate for an ethnic group is much higher than the value of lactase deficiency found for the region where this ethnic group lives.
This tendency is relevant for Sami (the Kola Peninsula) (48% vs 33.8% in Murmansk), Mari people (74% vs 56.6% in Mari El), Udmurts (55 or 40% vs 30.9%), Russians in Kursk (54% vs 37.9%), Russians in Rostov (54% vs 43.85%), Komi (70% vs 30.9% in Komi republic), Erzas (57% vs 40.7% in Mordovia). As it was mentioned before, general investigations were carried out within a particular ethnic group. However, the aim of the current study was to estimate the average prevalence of a lactase deficiency in Russian regions based on the historical aspects of a person's ancestry considering the fact of multiculturalism in Russia. It is known that the frequency of the genotype associated with lactase deficiency is affected by migration and interethnic marriages. For example, the prevalence of an adult-type hypolactasia (rs4988235) among Nenets who had four Nenets grandparents was 90%. In cases of three, two, and one grandfather or grandmother of Nenets origin, the frequency of the genotype decreases sharply: 72%, 60%, and 28%, respectively [54]. Thus, accurate national estimates depend to a large extent on the proper representation of all population groups in a country.
In the current study we discovered the initial prevalence of lactase deficiency in non-admixed individuals from various populations and regions. Application of local ancestry inference methods might be useful to study the frequency of lactase deficiency among admixed individuals.
Despite the high prevalence of lactase deficiency in Russia, there is a rising trend in dairy product consumption. In 2020, dairy consumption reached 272 kg per capita [57]. Nevertheless, the annual amount of dairy products consumed in Russia is still insufficient according to the Ministry of Health of the Russian Federation (340 kg per capita) [58]. This could potentially be due to the fact that people experience gastrointestinal symptoms after consuming lactose-containing products and therefore more frequently exclude dairy products from their diet and replace them with lactose-free alternatives [8,56]. Genotyping for lactase deficiency has shown excellent correlation with the lactose hydrogen breath test (H₂ test) and with the measurement of glucose and urinary galactose/creatinine levels, which leads to the conclusion that the analysis of the lactase non-persistence variants may be considered an accurate and convenient test for predicting the presence of this disease in a patient with suspected lactose malabsorption [24,59–63].
Our cohort study's genetic result highlights sizeable lactase deficiency across Russian regions and can testify in favor of the healthcare system paying more attention to this problem as well as for the food system to rebalance the dairy market towards low lactose and lactose-free products. We hope that such an accessible method of determination of lactase deficiency as genotyping will be implemented in a wide range of organizations for preventing nutritional imbalance.

CONCLUSION
We conducted the most extensive study of lactase deficiency genotype prevalence in the Russian population. We compared frequencies in different populations and in different regions, thus obtaining almost complete regional coverage. We also obtained the largest genetic data on lactase deficiency among Ashkenazi Jews to date.
Our results are in accord with the observations from the past decades and can be explained by the historical ethnic migration and the cultural characteristics of the Russian regions. The geographical features of residence and climate, animal species diversity, and ancestral migration played a major role in the use of lactose-containing products in these regions, as discussed in the current study (e.g., the geographical barrier of the Caucasus, later cattle domestication and milk fermentation in Siberia, low-lactose reindeer milk consumption in the northern part of Russia, etc.).
From the practical standpoint, public awareness of the actual lactase deficiency conditions may positively impact the nutritional behavior and status of the population, as well as support the growth of the market of lactose-free and low-lactose products. These features highlight the importance of such studies from scientific, epidemiological, and economic standpoints.

DATA AVAILABILITY
For the Genotek dataset, the user agreement (available at https://www.genotek.ru) states that disclosure of individual-level genetic information and/or self-reported information to third parties for research purposes will not occur without explicit consent, and such consent was not obtained from the individuals. Due to the user agreement, the individual-level data cannot be made directly available, as the dataset could pose a threat to confidentiality. Data have to be accessed indirectly via Genotek Ltd, https://www.genotek.ru/. Data requests should be sent to Genotek Ltd at info@genotek.ru.