A Common Founder Effect of the Splice Site Variant c.-23+1G>A in GJB2 Gene Causing Autosomal Recessive Deafness 1A (DFNB1A) in Eurasia

The mutations in the GJB2 gene are known to be a major cause of autosomal recessive deafness 1A (OMIM 220290). The most common pathogenic variants of the GJB2 gene have high ethno-geographic specicity in their distribution that being attributed to a founder effect related with Neolithic migration routes of Homo sapiens. Curiously, the c.-23+1G>A splice site variant is frequently found among deaf patients of both Caucasian and Asian origin. It is currently unknown whether this mutation did spread across Eurasia as a result of the founder effect or it could have multiple local centers of origin. To determine the origin of the c.-23+1G>A we reconstructed 𝑓 2-haplotypes by genotyping SNPs on the Illumina OmniExpress 730K platform in 23 deaf individuals homozygous for this variant from different populations of Eurasia (Yakuts, Tuvinians, Evenk, Kumyk, Armenian, Russians and Slovak). The analysis revealed that the homozygosity regions in different individuals overlapped in one short region with the length of ~5.2 kb. These data support the hypothesis of the common founder effect in distribution of the c.-23+1G>A variant of GJB2 gene. Based on the published data on the c.-23+1G>A prevalence among 16,177 deaf people and calculation of TMRCA of the 𝑓 2-haplotypes carrying this variant we reconstructed the potential migration routes of the c.-23+1G>A carriers around the world. This analysis indicates that the c.-23+1G>A variant may have originated approximately 6,000 years ago in the territory of the Caucasus or Middle East, followed by spread throughout Europe, South and Central Asia and other regions of the world.

The haplotype analysis of the chromosomes with c.-23+1G>A mutation in Mongolian population revealed different mutant haplotypes suggesting multiple origin of this mutation in Mongolia (Tekin et al., 2010). However, only one mutant haplotype was found in Turkish patients, which was similar to a major haplotype found in Mongolia (Tekin et al., 2010). In the recent study, ve different haplotypes in deaf patients from six Eurasian populations homozygous for mutation c.-23+1G>A (Yakuts, Russians, Evenks, Tuvinians, Mongols and Turks) were reconstructed based on the analysis of variability of nine SNP markers (Tekin et al., 2010;Solovyev et al., 2017). The structure of the mutant haplotypes was similar in all studied populations and one haplotype (GTACCAGAC) was prevalent: Mongolians (62.5%), Yakuts (99.1%), Russians (100%), Evenks (100%), Tuvinians (100%) and Turks (100%). These data suggest the distribution of this mutation across Eurasia was a result of the founder effect (Solovyev et al., 2017). However, the studied SNP markers do not have high variability in studied populations, so the issue of de ning migration routes of the c.-23+1G>A mutation carriers requires a further in-depth study with an increased resolution. Thereby, it is currently unknown whether this mutation did spread across Eurasia as a result of the founder effect or could it have had several local centers of origin?
The aim of this work to determine haplotypes with the c.-23+1G>A mutation of GJB2 gene based on the genome-wide analysis of deaf patients from different regions of Eurasia to determine the origin and reconstruct the possible migration routes of the c.-23+1G>A mutation.

Samples and genotyping
To increase the reliability of the mutant haplotypes detection and avoid Phasing probabilities for this study, we focused only on the individuals with hearing loss diagnosis and who were homozygous for the c.-23+1G>A mutation in the GJB2 gene. To answer the question about the common origin of the mutation, we collected 23 DNA samples of deaf individuals from different regions of Eurasia (14 Yakuts, 3 Tuvinians, 2 Russians, 1 Evenk, 1 Kumyk, 1 Armenian, and 1 Slovak). Anthropological, linguistic, and geographical a liations of deaf individuals are presented in Table 1. DNA was extracted from blood leukocytes using the phenol-chloroform method. All participants gave written informed consent for participation in the study.
All 23 individuals were genotyped using the Illumina OmniExpress 730K array according to the manufacturer's speci cations. The missing data value was less than 5%. The absence of cryptic relatedness corresponding to the rst and second-degree relatives in our dataset was con rmed using the KING software (Manichaikul et al., 2010). A clean data set of 693,521 autosomal variants only was obtained, with chromosome 13 having 26,626 variants.

PC and ADMIXTURE analysis
To conduct PC and ADMIXTURE analysis, we combined newly generated genotypes with the data from previous studies (Li et al., 2008;Rasmussen et al., 2010;Behar et al., 2010;Yunusbayev et al., 2012;Metspalu et al., 2011;Fedorova et al., 2013;Raghavan et al, 2014;Behar et al., 2013). Individuals with more than 1.5% missing genotypes were excluded from the combined dataset. Only markers with a 97% genotyping rate and minor allele frequency (MAF) > 1% were retained. The marker set was thinned by excluding SNPs in strong LD (pairwise genotypic correlation r2 > 0.4) in a window of 1000 SNPs, sliding the window by 150 SNPs at a time. The nal dataset included 242,229 SNPs presented in 391 individuals from 24 different world populations and in 23 individuals homozygous for the c.-23+1G>A. The PCA was performed for the autosomal dataset using the "smartpca" software of the EIGENSOFT package (Patterson et al., 2006). We inferred population structure in our dataset using a model-based clustering method implemented in the ADMIXTURE software (Alexander et al., 2009). We performed ADMIXTURE assuming K = 2 to K = 10 genetic clusters in 100 replicates and assessed convergence between individual runs. We found that the best clustering solution that met our selection criteria was K = 8.

Analysis of the region anking the c.-23+1G>A mutation
The search for a likely common region in the sample was carried out by comparing the allele frequencies according to the beta distribution function. The beta distribution can accurately describe the stochastic behavior in time of allele frequencies and the in uence of evolutionary pressures, such as mutation and selection (Tataru et al., 2015). By this method, we compared the frequency of the allele "A" with the frequency of the allele "B". If the frequency of these alleles is signi cantly different, the most frequent allele is selected. Then algorithm checks the next position values of the beta distribution, alleles are included in one common chromosomal region if the values overlap. This continues until there are alleles without statistically signi cant values. To evaluate the reliability of identi ed common chromosomal regions, we used the following equation: where, R -rating of reliability of the common chromosomal region; L -length of the tested common chromosomal region in bp; SNP -number of SNPs in the tested common chromosomal region.
This reduces credibility of long homozygous region with a small number of SNPs and increases credibility of the homozygous region with high number of SNPs. In the end we get a le where indicated: the start position, the P value, and the end position of the common chromosomal regions in chromosome. This analysis is similar to a combination of RoH and GWAS analyses, but it is more sensitive and does not require a large number of samples. This method does not use a moving window that potentially introduces arti cial runs and fails to capture runs larger than the window, instead it analyses all alleles and chromosomes in parallel, which also increases the calculation speed.
For a detailed analysis of homozygosity regions, we scanned chromosome 13 in each individual starting from the c.-23+1G>A mutation position in the left and right directions until the rst heterozygous allele position. We did not consider the physical distances between SNPs.
The 2-haplotypes method and age dendrogram The 2-haplotype was found by pairwise comparison of individuals with each other. From the point of the c.-23+1G>A mutation, where we scan left and right along the chromosome until we reach a point where two individuals have inconsistent homozygote genotypes, which gives us an (over-) estimate of the distance to the rst recombination breaking the haplotype (Mathieson and McVean, 2014). This approach is highly scalable and nds shared haplotypes directly from genotype data, which avoids the need for statistical phasing. Then we use the length of, and the number of non-matching markers on this haplotype to infer their ages, and therefore a lower bound for the age of the variants they carry.
The 2-haplotype age was calculated by the TMRCA (the Time to Most Recent Common Ancestor) software by the University of Illinois at Urbana-Champaign available in the public domain (http://faculty.scs.illinois.edu/~mcdonald/tmrca.htm). We considered the mutation rate as 1·10 -8 of per-base per-generation. The mutation rate was given to each model for converting TMRCA from coalescence unit to years, taking into account the average mutation rate probability based on physical lengths of 2-haplotypes and numbers of SNP-markers. It should be noted that an accurate age calculation of the GJB2 region was obstructed because of many uncertainties. The age dendrogram was created with the UPGMA method based on the ages of each 2-haplotype.

Results
Ancestral background To answer the question about the common origin of the c.-23+1G>A mutation, we collected 23 DNA samples of deaf people from different regions of Eurasia (14 Yakuts, 3 Tuvans, 2 Russians, 1 Evenk, 1 Kumyk, 1 Armenian and 1 Slovak) (Fig. 1A). To con rm the ethnic identity of patients, we compared their genetic components with different populations of the world using PC analysis (Patterson et al., 2006) and performed a global ancestry inference with ADMIXTURE (Alexander et al., 2009). For the PC and ADMIXTURE analysis, we combined our patients genotypes with the data from previous studies (Li et al., 2008;Rasmussen et al., 2010;Behar et al., 2010;Yunusbayev et al., 2012;Metspalu et al., 2011;Fedorova et al., 2013;Raghavan et al, 2014;Behar et al. 2013). Figure 1 shows that all individuals homozygous for c.-23+1G>A clustered according to their ethnic identity and geographic origin (Fig. 1B). The ADMIXTURE analysis (K = 8) con rms that the ancestral composition of each studied individual corresponds to the ones of a respective population (Fig. 1C). Thus, the results of both analyses (PC and ADMIXTURE) demonstrate that genetic pro les of all samples with c.-23+1G>A substitution correspond to their geographic location and ethnic origin.

Common chromosomal region
In this study, we used a new method that considers the likelihood of allele frequencies by the function of the beta distribution. This analysis detects the most probable extended regions of linked alleles in the studied samples and resemble the GWAS analysis but considers the length of the common chromosomal region making it possible to work on a small sample. On the gure 2A it is shown that the study sample has a long common region anking the c.-23+1G>A mutation on chromosome 13 (R = 14.11). No such long common chromosomal regions were detected on other autosomes ( Fig. 2A).
Detailed analysis of the homozygosity regions anking the c.-23+1G>A mutation showed that their length varied between individuals with different ancestral backgrounds (Fig. 2B). Shorter homozygous regions were found in individuals from Eastern and Central Europe (Russian -18.9 kb, Slovak -296.3 kb) and from Caucasus populations (Armenian -29.2 kb, Kumyk -96.6 kb), while these regions were longer in Siberian individuals (Tuvinians, Southern Siberia) having the longest homozygosity regions (from 2.7 to 7.6 Mb). Despite the differences in the length of the homozygosity regions for each individual, all of them overlap in thẽ 5.2 kb chromosomal region (5 SNPs: rs7329857, rs3751385, rs2274083, rs2274084 and rs7987144). But this haplotype (GCTCC) is typical for most of the populations presented in the NCBI database (The 1000 Genomes Project Consortium).

Mutation age
In our case, we cannot use the LD structure to calculate the age of the mutation, because the LD structure will be calculated based on the majority of people in our sample (Yakuts). To avoid this, we used a more simpli ed method of 2-haplotype (Mathieson and McVean, 2014) for the calculation of the time to the most recent common ancestor (TMRCA). This method is suitable for the hypothetical estimation of the closest common ancestor between two haplotypes. We determined the lengths of the 2-haplotypes by pairwise comparison of each individual from the position of the c.-23+1G>A mutation to the position where both individuals have different homozygous alleles (Fig. 3). The samples from the Caucasus (Kum, Arm) and Europe (Rus2) have short 2-haplotypes being equidistant from other studied samples and composing the European cluster.
Using both the genetic and physical lengths of the region, and the number of the conditional number of heterozygous alleles that our genotyping resolution allows, we have calculated approximate haplotype age. The TMRCA results based on the 2-haplotype method give an approximate age of the divergence of the mutant haplotypes but not of individuals in general. The obtained results indicated the more distant mutant 2-haplotypes in the samples of Caucasus (Kum, Arm) and Europe (Rus2) (Fig. 4). Common ancestor for all samples lived ~ 245 generations ago (if duration of one generation is considered as 25 years, then the approximate age of the mutation ~ 6125 years).

Discussion
For the genome-wide reconstruction of haplotypes with the c.-23+1G>A mutation in the GJB2 gene, we collected DNA samples from 23 deaf individuals with con rmed c.-23+1G>A mutation in the homozygous state. The samples have different anthropological and linguistic a liations (Yakuts, Tuvinians, Evenk, Kumyk, Armenian, Russians and Slovak) from four regions of Eurasia -Siberia, Caucasus, Eastern and Central Europe (Fig. 1A). Thus, to con rm the geographic location and ethnic origin reported by the sample donors, as the rst stage of our study, we performed the PC and ADMIXTURE analyses in the context of worldwide populations. In the PC analysis, all samples with c.-23+1G>A mutation clustered according to their ethnic identity and geographic origin (Fig. 1B). The ADMIXTURE analysis (K = 8) con rmed the ancestral composition of each studied individual corresponds to a respective population (Fig. 1C). The results of both analyses (PC and ADMIXTURE) demonstrated that genetic pro les of all samples with c.-23+1G>A mutation correspond to their geographic location and ethnic origin and exclude the likelihood of distant relationship between observed patients (Fig. 1B, C).
On the next step, the analysis of the homozygosity on chromosome 13 for each sample revealed a common homozygous segment of ~ 5.2 kb (from 20762929 to 20768144 position), anking the c.-23+1G>A mutation ( Fig. 2B). However, this common homozygous segment containing 5 SNPs (rs7329857, rs3751385, rs2274083, rs2274084 and rs7987144) is typical for most worldwide populations (GCTCC) (The 1000 Genomes Project Consortium). Recombination in the chromosome is random and the distribution of runs of homozygosity (ROH) on samples is likely to be uneven and there is a high probability that the common chromosomal region with the mutation in the homozygous state will be shorter than it actually is. In this study, to con rm the common chromosomal region associated with the c.-23+1G>A mutation in a small number of samples, we applied a new approach, which takes into account the state of the SNP position relative to the probability of divergence of alleles by the beta distribution function and its length.  Table). This analysis showed that c.-23+1G>A mutation is present with various frequencies on almost all continents of the world (from 0.6% in East Asia to 58.1% in Siberia and Central Asia), except for South-East Asia and Sub-Saharan Africa (Supplementary Table). Based on the data about the extensive accumulation of the c.-23+1G>A mutation, there are three possible geographic regions from which this mutation could begin to spread: Central Asia, the Caucasus or Middle East and South Asia (Fig. 5). However, it is unlikely that the center of origin may be in South Asia because this mutation was found almost in all regions of the world (Fig. 5). In addition, earlier we considered that the c.-23+1G>A mutation could begin to spread from Central Asia across Eurasia and only after then spread out around the world (Solovyev et al., 2017). This hypothesis was based on the highest observed frequency of this mutation in the territory of Central Asia and Siberia and the highest diversity of haplotypes with the c.-23+1G>A in patients from Mongolia. However, the results of the current study do not support this hypothesis, since we have found shorter homozygosity regions in individuals from Europe (Russian -18.9 kb, Slovak -296.3 kb) and the Caucasus (Armenians -29.2 kb, Kumyks -96.6 kb) (Fig. 2B). A relatively short homozygosity region can indicate on the more ancient haplotypes with this mutation in populations of the Caucasus and Europe, and on the contrary, the relatively recent distribution of this pathogenic variant among Central Asian and Siberian populations.
To test this hypothesis, we analyzed the 2-haplotypes using both the genetic and physical lengths of the haplotype and the number of the heterozygous alleles, which allow us to observe changes in the structure of the haplotype over time and display the demographic events. The analysis of the 2-haplotypes length data showed that individuals from populations of the Caucasus and Europe (Kum, Arm and Rus2) carried more divergent haplotypes than other studied individuals (Fig. 3). The TMRCA calculation of the 2-haplotypes showed that the most recent common ancestor with c.-23+1G>A lived not less than 6,000 years ago (Fig. 4). Thereby, the data on the more ancient c.-23+1G>A haplotypes from populations of the Caucasian origin (Kum, Arm and Rus2) points to the fact that c.-23+1G>A mutation originated somewhere in the regions of the Caucasus or Middle East and then spread throughout Europe, Central Asia and South Asia. The possible geographical center of origin of the c.-23+1G>A, its frequency among all mutant GJB2 chromosomes in different world populations and assumed migration routes of the c.-23+1G>A carriers are presented in Figure 5.
It is interesting that the prevalence of c.-23+1G>A in the world is quite comparable to the hypothetical routes of expansion of Indo-European languages (~4000 -~2500 BC) across Europe, Middle East, the Caucasus and South Asia (Tassi et al., 2017). Moreover, the world prevalence of the c.23+1G>A corresponds to the expansion of populations of European descent, which are widely presented in North America and Australia, and also had a signi cant impact on the genetic structure in modern populations of South America and Siberia. The extremely high frequency of c.-23+1G>A among indigenous populations in Central Asia and Siberia may represent a random population effect and can be explained by the genetic structure and demographic events in the history of these populations.

Conclusion
The obtained results support the hypothesis of common origin of mutation c.-23+1G>A in GJB2 gene. It is most likely that c.-23+1G>A originated in the territory of the Caucasus or Middle East and only then spread throughout in Europe, South Asia, Central Asia and Siberia.