Allele Frequency Analysis Suggests Potentially Protective Effect in the Lithuanian Population

Background. In the scientific literature, a wide range of effect variants that protect against complex disease phenotypes has been identified. Analysis of these variants and of the overall genetic structure of isolated or, in our case, small populations is important in association analysis. When admixed populations are analysed in GWAS, inaccuracies can be expected; these could be avoided by choosing distinct populations as a focus of study. Population genetic structure determines similarities and differences between individuals or groups of individuals and the factors that may lead to those differences. Results. In our study, we identified six missense effect variants in the Lithuanian population whose frequencies differed significantly from those in other European populations. Three of these effect variants may potentially protect against type 2 diabetes and coronary heart disease. Conclusions. The high rates of these diseases in the Lithuanian population and other populations nevertheless indicate the presence of environmental factors and a lack of knowledge about the interactions between regulatory regions and other effect variants. Identification of these effect variants is important not only to provide a better understanding of the microevolutionary processes and etiopathogenetic mechanisms, but also to develop disease prevention programs and novel, personalised therapies using genome editing or other genetic tools.


Background
Specific genomic loci and variants associated with survival vary between populations due to microevolutionary processes. In a changing environment, variants that once were protective may become deleterious, and therefore microevolutionary processes are ongoing and lead to transformations in the genetic architecture of a population that is adapting [1]. Natural selection mediates the adaptation process, which can have various consequences, such as the prevalence of complex diseases leading to high mortality: hypertension [2], coagulation changes [3] and hyperlipidemia [4]. Such research findings have implications for population-specific (geographically and ethnically) diagnosis worldwide [5], which is important for developing new prevention and treatment strategies in an era of personalised medicine.
To understand the mechanisms of complex diseases and traits, it is important to address the question of natural selection and adaptation through the fluctuation of genomic variation in the population over time. Some genomic variants cannot simply be categorized as 'risk' or 'protective' because of the conflicting interpretation of their effect. Thus, we refer to these variants as 'effect variants'.
Effect variants are usually rare, but those that provide a selective advantage tend to become common in the population. That is why our analysis includes rare or previously rare effect variants. In addition, most of these variants are likely to be common in biologically redundant genes, thereby escaping the effects of purifying selection, which preserves these variants at high frequencies in different populations [6]. For example, if a person has an effect variant that protects against obesity, it is possible that this person will be less likely to be obese and more likely to pass this variant to their offspring due to positive selection. Based on this logic, complex disease rates in the population would drop in the future.
However, complex disease rates are steady, and one of the reasons is the exploding growth of the human population, which results in an accumulation of extremely rare variants [7]. GWASs under-represent low-frequency (0.5% ≤ MAF < 5%) and rare (MAF < 0.5%) variants that could underlie much of the unexplained heritability of many complex traits [8]. In addition, minor alleles are more likely to be characterised as risk alleles in the published GWASs on complex diseases because minor alleles are more easily detected as risk alleles in GWASs [9].
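The frequency classes quoted above can be expressed as a small helper. This is only an illustrative sketch: the function name `classify_maf` is hypothetical, and the label "common" for MAF of 5% or more is an assumption, since the text names only the low-frequency and rare classes.

```python
def classify_maf(allele_freq: float) -> str:
    """Classify a variant by minor allele frequency (MAF).

    Thresholds follow the text: rare (MAF < 0.5%) and
    low-frequency (0.5% <= MAF < 5%). The "common" label for
    MAF >= 5% is an assumption made for this sketch.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    # The minor allele is the less frequent of the two alleles,
    # so fold frequencies above 0.5 back onto the minor allele.
    maf = min(allele_freq, 1.0 - allele_freq)
    if maf < 0.005:
        return "rare"
    if maf < 0.05:
        return "low-frequency"
    return "common"
```

For example, `classify_maf(0.003)` falls in the rare class, while a frequency of 0.97 is folded to a MAF of about 0.03 and is treated as low-frequency.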
Frequency was not the only criterion for selecting effect variants. Mostly non-synonymous single nucleotide effect variants were chosen for the study, in order to analyse those that affect the structure of the protein and may have a function-altering effect.
Many effect variants protect against disease by disrupting protein function, typically via loss-of-function or gene knockout effects, and have an impact on clinically relevant phenotypic effects. In this case, most of the functionally relevant loss-of-function variants should be removed through purifying selection. One example is effect variants in the IFIH1 and IL23R genes, which are thought to protect against immune-mediated disease through impairment of the host-pathogen response. These variants are likely to have been subjected to negative selection pressures in the past, which would account for their current scarcity in the population [10]. Recent studies have, however, shown that synonymous mutations can influence the amount of protein that is produced; so-called optimal codons are faster for cells to process and lead to increased protein production [11]. This reveals that synonymous mutations most likely play an underappreciated role in human variation. That is why we also included some synonymous effect variants.
The identification of effect variants and a better understanding of interactions between them could provide the possibility to characterise candidate genomic regions and specify their functional significance across different populations [12][13][14]. The aim of this study was to identify and analyse single nucleotide effect variants (risk and protective) in the Lithuanian population.

Results
The sequencing and genotyping data of our sample group were filtered using the compiled catalogue of effect variants (144 variants in total) as a reference. Filtered variants were tested for Hardy-Weinberg equilibrium, and 70 genome variants passed (39 variants from genotyping and sequencing data; 7 variants from genotyping data alone; 24 variants from sequencing data alone). Sample sizes used for calculations of allele frequencies in the Lithuanian and various European populations are presented in Table 1. Frequencies of six missense variants were significantly different between the study group and other European populations (Table 2). According to the scientific literature, these variants may protect against alcohol dependence (ADH1C, rs698, p = 0.05), type 2 diabetes (PPARG, rs1801282, p = 0.05; SLC30A8, rs13266634, p = 0.03 (FIN), p = 0.03 (GBR), p = 0.04 (CEU)), coronary heart disease (ZC3HC1, rs11556924, p = 0.04 (GBR), p < 0.01 (FIN)), obesity (SH2B1, rs7498665, p = 0.03), and oesophageal cancer (PLCE1, rs2274223, p = 0.04). Distribution of variant genotypes in the studied Lithuanian population is presented in Table 2.
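As an illustration of the Hardy-Weinberg filtering step, the sketch below applies a standard Pearson chi-square goodness-of-fit test (1 degree of freedom) to genotype counts for a biallelic variant. The text does not state which test or significance threshold was used, so this choice, the function name `hwe_chi_square` and the example counts are assumptions made for illustration only.

```python
import math
from typing import Tuple


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Chi-square test for Hardy-Weinberg equilibrium (biallelic variant).

    Takes observed counts of the three genotypes (AA, AB, BB) and
    returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype must be observed")
    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, not taken from the study data.
chi2, p_value = hwe_chi_square(25, 50, 25)
passes_hwe = p_value >= 0.05  # the 0.05 threshold is an assumption
```

For small samples or rare alleles, an exact test is often preferred over the chi-square approximation; the sketch keeps to the approximation for brevity.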
Filtered effect variants were compared with primate species to ascertain which allele is derived and which is ancestral, in order to avoid the erroneous assumption in some cases that the rare allele is the derived allele for common variants. The analysis showed that several of our catalogue-selected effect variants (in the PLCE1, ADH1C, and SH2B1 genes) in humans are in fact ancestral.

Discussion
According to freely accessible in silico analysis tools, five of the six effect variants for which frequencies in the Lithuanian population differed significantly from European populations are considered benign (according to Varsome or UniProt) or a risk factor (Ensembl). All five of these variants (PPARG: rs1801282, SLC30A8: rs13266634, ZC3HC1: rs11556924, PLCE1: rs2274223, SH2B1: rs7498665) were selected from scientific articles for our catalogue of effect variants. In these articles, these variants were identified as candidate protective genome variants after GWAS data were filtered for nonsynonymous SNPs to increase the likelihood of them being functional, and after bioinformatic analyses were performed to detect evidence of positive natural selection for the effect variant and to estimate the probability of the mutation being damaging. In addition, a variant was considered protective when it was more frequent in controls than in cases [5].
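Comparing allele frequencies between two groups (two populations, or cases and controls) can be sketched as a Pearson chi-square test on a 2x2 table of allele counts. The text does not specify which statistical test was applied, so the test, the function name `compare_allele_counts` and the counts below are assumptions for illustration, not the method used in the study.

```python
import math
from typing import Tuple


def compare_allele_counts(
    alt_a: int, total_a: int, alt_b: int, total_b: int
) -> Tuple[float, float]:
    """Pearson chi-square test (1 df, no continuity correction).

    alt_a / alt_b are counts of the effect allele in groups A and B;
    total_a / total_b are total allele counts (2 x number of
    individuals for an autosomal variant).
    Returns (chi-square statistic, p-value).
    """
    ref_a = total_a - alt_a
    ref_b = total_b - alt_b
    if min(alt_a, ref_a, alt_b, ref_b) < 0:
        raise ValueError("allele counts must be non-negative")
    n = total_a + total_b
    row_alt = alt_a + alt_b
    row_ref = ref_a + ref_b
    if 0 in (total_a, total_b, row_alt, row_ref):
        raise ValueError("table is degenerate; the test is undefined")
    # Closed form of the 2x2 Pearson statistic: N(ad - bc)^2 / margins.
    chi2 = n * (alt_a * ref_b - alt_b * ref_a) ** 2 / (
        total_a * total_b * row_alt * row_ref
    )
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 10 of 200 alleles in group A, 60 of 200 in group B.
chi2, p_value = compare_allele_counts(10, 200, 60, 200)
```

With small expected counts, Fisher's exact test would be the more cautious choice, and comparisons against several reference populations would normally call for a multiple-testing correction.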
Variant rs698 in the ADH1C gene is known as protective (according to Ensembl, ClinVar and OMIM) and has an impact on ethanol metabolism. Even though databases define this variant as protective, various studies suggest that this variant is associated with slower ethanol metabolism, which could lead to a longer period of consuming alcohol and the consumption of greater quantities. Therefore, people carrying the variant have a higher risk of heavy and excessive drinking [15,16]. According to one study, common SNPs are responsible for as much as 30% of the variance in alcohol dependence, but few have been identified [17]. Power analyses, however, indicate that additional SNPs associated with alcohol dependence are likely to have small effect sizes and are more with more common psychiatric disorders [18]. This shows that the molecular mechanisms involved in excessive alcohol consumption and other complex conditions are still not fully understood and that the collection of large numbers of well-characterised cases and controls is needed.
Besides function, the origin of effect alleles must also be addressed. Every disease-associated SNP consists of two alleles, of which one is considered as risk-associated and the other as disease-protective. A common practice to ascertain whether a nonsynonymous SNP is protective (i.e. the respective derived allele is protective) is to deduce which allele is derived and which is ancestral, since a minor allele does not necessarily equal the derived (mutant) one. The origin of the allele could be determined by using genomic alignments with primate species. The effect variants we analysed did not have a very low (< 1%) minor allele frequency, and we cannot assume that the rare allele is the derived allele.
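The logic of assigning ancestral and derived states from a primate alignment can be sketched as follows. This is a deliberately simplified illustration that assumes a single outgroup allele (for example, the base seen in one primate species at the aligned position) and ignores recurrent mutation and alignment uncertainty; the function names `polarise_alleles` and `minor_is_derived` are hypothetical.

```python
from typing import Optional, Tuple


def polarise_alleles(
    allele_1: str, allele_2: str, outgroup_allele: Optional[str]
) -> Optional[Tuple[str, str]]:
    """Return (ancestral, derived) for a biallelic SNP.

    The human allele that matches the outgroup (primate) allele is
    taken as ancestral. Returns None when the outgroup allele is
    missing or matches neither human allele.
    """
    if outgroup_allele is None:
        return None
    a1, a2 = allele_1.upper(), allele_2.upper()
    out = outgroup_allele.upper()
    if out == a1:
        return a1, a2
    if out == a2:
        return a2, a1
    return None


def minor_is_derived(
    minor: str, major: str, outgroup_allele: Optional[str]
) -> Optional[bool]:
    """Check whether the minor allele is the derived one.

    Returns None when the variant cannot be polarised.
    """
    polarised = polarise_alleles(minor, major, outgroup_allele)
    if polarised is None:
        return None
    _, derived = polarised
    return derived == minor.upper()
```

A call such as `minor_is_derived("T", "C", "T")` returns `False`: the minor allele matches the outgroup and is therefore treated as ancestral, which is the situation described for several of the variants analysed here.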
However, if a derived allele provides a protective function and gives an individual a selective advantage, one might expect positive selection to sweep it to become the most common allele in the population [5]. This may be the reason why the effect variants we analysed have allele frequencies greater than 1%. Moreover, this could be the reason why databases and SNP analysis tools call these variants polymorphisms.
Comparison with primate species showed that the variants analysed in PLCE1, ADH1C and SH2B1 are indeed ancestral. A genomic variant can be considered protective only when the allele is derived, which is why we did not interpret these variants as protective. Despite contradictory data, the significant variants may have some effects on the aetiopathogenesis of particular complex diseases.
In our study, effect variants may have an impact on protection against type 2 diabetes (variants in the PPARG and SLC30A8 genes) and coronary heart disease (the ZC3HC1 variant). It is important to keep in mind the effects of environmental factors. According to data from the Lithuanian Department of Statistics (Statistics Lithuania) [19], the highest number of deaths (55.4%) in 2018 was caused by diseases of the cardiovascular system. In 2014, 4.4% of the population had type 2 diabetes and 7.5% had coronary heart disease. Even though our population has effect variants that may protect against these diseases, lifestyle and other environmental factors may influence the frequency of morbidity. Also, many studies concentrate on effect variants in coding parts of the genome, but interactions between coding and non-coding variants are equally important and have not been examined sufficiently. Although these effect variants may reduce the risk of disease (or maintain health), there are additional genetic mechanisms that control this process. Not only are the effects of single genomic variants important, but their interactions and the interactions between regulatory regions are also consequential [9].
Butler et al. [5] estimated an integrated haplotype score for the effect variants that we analysed in the PPARG, SLC30A8, and ZC3HC1 genes, which showed that these variants may have undergone recent positive selection [20]. This suggests that the derived allele is beneficial for an individual's fitness and may be protective. However, the functional impact of these variants has to be confirmed and additional analysis is needed. According to Plenge et al., most alleles associated with complex diseases (approximately 85%) fall outside the protein-coding sequence, and thus each disease-associated allele should be evaluated to see whether it is in linkage disequilibrium with a variant that changes protein structure. If it is, then these findings should be fast-tracked for functional studies in human cells and animal models to assess the gain-of-function or loss-of-function. For non-coding effect variants, the effect on gene expression should be evaluated in a relevant human cell type. For example, if a risk allele is associated with higher gene expression, then pharmacological inhibition may be effective in treating the disease [21].

Conclusions
During our study, we identified three effect variants in the Lithuanian population group that may protect against type 2 diabetes and coronary heart disease. A better understanding of common variants and their effects can help build better databases, because sometimes the effect of the variant can be incorrectly described, as was previously demonstrated in this study regarding the variant in the ADH1C gene. Detection of these effect variants is important not only for a better understanding of the aetiopathogenetic mechanisms and microevolutionary processes; it could also broaden knowledge about the differences between populations and in this way move towards personalised medicine.

Table 1
Sample sizes used for calculations of allele frequencies in the studied Lithuanian and European populations.

Table 2
Distribution of variant genotypes in the studied Lithuanian population.