Correlation analysis between language gene polymorphisms and geography/society parameters from twenty-six countries

Human language diversity, as a biological phenotype, should be genetically linked with language gene polymorphisms. Meanwhile, this phenotype has been historically shaped by local geographical and social factors. How many language gene polymorphisms correlate directly with geography/society characteristics over the long-run evolution of human languages is an interesting question that remains largely uninvestigated. This study selected a series of geography/society factors (13 geographical factors and 21 social factors) from 26 countries and 111 single nucleotide polymorphisms (SNPs) randomly selected from 13 language genes. Principal component analysis (PCA) was performed to explore their potential correlations. Preliminary but interesting results were obtained as follows. (1) Most geographical parameters are concentrated into one cluster in the PCA diagram; the cluster contains 12 parameters that are positively correlated with each other. (2) PCA diagrams divide social parameters into four clusters, among which both positive and negative correlations exist. (3) The strongest positive correlations were observed at one SNP of the ATP2C2 gene (ATP-1: rs78371901); the strongest negative correlations were found at one SNP of the NFXL1 gene (NFX-6: rs1440228); and the weakest correlations with language gene SNPs were observed for four geography/society factors: aash (Annual average rainfall), fore (Forest coverage), pden (Population density of the country) and rway (Runway traffic mode).


Introduction
It is commonly accepted that biological phenotypes are shaped by geographical factors, and that a typical phenotype is fixed and inherited through a genotype [1][2]. Natural selection means the selective preservation of phenotypes in a specific geographical environment, so human languages (there are at least 5000 different human languages worldwide) also represent marvelous results of natural selection, at least in the early stage of language emergence [3][4][5][6]. Different geographical environments yield different local languages, along with different language gene polymorphisms. But how many language gene polymorphisms correlate directly with geography/society characteristics over the long-run evolution of human languages is an interesting question that remains largely uninvestigated.
During the early stage of human evolution, human beings and other animals must have been very similar in how language emerged and was used for vocalization and communication. Later, owing to the uniqueness of human language genes and the evolutionary advantages of the human brain, language gradually became a powerful characteristic of human intelligence. There is some evidence that the FOXP2 gene of Neandertals had already acquired key mutations otherwise found only in modern humans [7], suggesting that language genes had finished their main mutational acoustic adaptations before Homo ergaster and other early Homo sapiens migrated out of Africa [8][9]. Later, Homo sapiens had to adapt to novel geographical conditions around the world, in northern Europe, East Asia, South Asia, North America, South America, etc. Thus language genes had many chances to become polymorphic in order to micro-adjust the vocal system for better acoustic adaptation. Many studies [10][11][12] have found that the design of acoustic communication systems within species appears to be directly shaped by environmental factors, and the same holds for the phonological structure of human languages. Environments in which higher frequencies are less faithfully transmitted (for example, denser vegetation, higher ambient temperatures, heavier precipitation, specific geomorphology, etc.) would favor greater use of sounds with lower frequencies [13]. Besides, this environmental influence would happen in a much shorter timespan than the interval between "speciation events" [14,15].
Theoretically, the early stage of human evolution permits environmental situations in which vowels are not essential (full opening of the mouth is not beneficial or not possible). For example, desiccation, cold, or many predators nearby would all promote the phonological formation of consonant-prone languages. One example is Hebrew, whose alphabet has 22 consonant letters and no vowel letters. The historical reasons may include important societal events, but the natural environment in the early phase of the Hebrew language is a more probable cause. A growing number of studies indicate that low temperature and dryness significantly influence vowel usage [10][11][12]. The effectiveness of a given phoneme profile is determined by the anatomical structure of the phonatory organs, which in turn is determined by genes. So, by postulation, geographical conditions should also influence the formation of language gene polymorphisms over long-term evolution.
Of course, human languages, especially modern ones, often contain a large artificially designed component introduced in the course of standardization, and the newly designed part may have nothing to do with any geographical or even historical influence. But the geographical influence on historical adaptation, as reflected in local language gene polymorphisms, should leave a distinctive mark. Besides, natural selection happens in both the natural (geographical) environment and the cultural (societal) environment, though in the early stage of human civilization more selective adaptation may have been implemented through geographical factors. Geographical factors exert a driving force on a human population, while societal factors may impose on individuals in that population an adaptive pressure no less than that of geographical factors [16][17].
This study focused on correlation analysis between a total of 111 single nucleotide polymorphisms (SNPs) from 13 language genes and a series of geography/society factors collected from 26 countries. The author did find some interesting correlations, including several of the strongest positive and negative ones.

Results And Discussion
General correlations among all selected geography parameters
Figure 1 shows the general correlations among all 34 selected geography/society parameters, in which the geographical factors have specific intra-correlation characteristics. First, most geographical parameters are concentrated into one cluster. This cluster contains 12 parameters that correlate positively with each other: aash, cula, gres, mrvr, area, road, gnei, airp, port, geft, rway, and agrc. Only one parameter, fore, is not in the cluster; on the contrary, it correlates negatively with all of the above 12 parameters. Second, within this cluster, the port parameter has the least correlation with the other parameters, which is somewhat unexpected. It is relatively easy, however, to understand why the fore parameter correlates negatively with all other geographical parameters: the more forest, the less human activity, and the lower the values of the geographical factors related to human activity.

General correlations among all selected society parameters
When we look at the correlation profiles among the 21 society parameters, we find that they are basically divided into four clusters: cluster 1 (popu, army, pold, cnpl, indu, mepr, relg, mipr, weft, tova, aash), cluster 2 (hdi, agdp, pden), cluster 3 (mrta, fert, regi, rupo) and cluster 4 (crim, ceex, race). Cluster 1 has 11 parameters that correlate positively with each other; it has little correlation with cluster 2 and cluster 4, but a strong negative correlation with cluster 3. Cluster 2 and cluster 3 have an even stronger negative correlation. Cluster 2 contains hdi (Human development index) and agdp (Average GDP per person). This means that the higher the level of economic development, the lower the values of the parameters in cluster 3. Economic development may thus decrease the diversity of human society.
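As a rough illustration of how groups of mutually positively correlated parameters could be extracted from a correlation matrix, the following Python sketch forms clusters by single linkage on a thresholded correlation graph. This is a substitute technique shown for illustration only, not the PCA-diagram reading used in this study. The function name correlation_clusters, the 0.5 threshold and all correlation values are hypothetical.

```python
def correlation_clusters(names, corr, threshold=0.5):
    """Group variables by single linkage: two variables share a cluster when a
    chain of pairwise correlations >= threshold connects them."""
    clusters = []
    unassigned = list(names)
    while unassigned:
        seed = unassigned.pop(0)
        cluster = [seed]
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            # Collect first, then remove, so the list is not mutated mid-scan.
            linked = [n for n in unassigned
                      if corr[frozenset((current, n))] >= threshold]
            for n in linked:
                unassigned.remove(n)
                cluster.append(n)
                frontier.append(n)
        clusters.append(cluster)
    return clusters

# Hypothetical pairwise correlations (NOT values from this study).
names = ["hdi", "agdp", "mrta", "fert"]
corr = {
    frozenset(("hdi", "agdp")): 0.90,
    frozenset(("mrta", "fert")): 0.80,
    frozenset(("hdi", "mrta")): -0.70,
    frozenset(("hdi", "fert")): -0.60,
    frozenset(("agdp", "mrta")): -0.65,
    frozenset(("agdp", "fert")): -0.70,
}
clusters = correlation_clusters(names, corr)
```

With these made-up values, hdi and agdp fall into one group and mrta and fert into another, mirroring the opposition between cluster 2 and cluster 3 described above. A different threshold would generally give different groups, so the choice of cut-off should be reported with any such result.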

Correlations between geography and society parameters
The hdi and agdp parameters in cluster 2 have 8-9 negatively correlated factors: fore (Forest coverage), rupo (Rural population), regi (Country/regions that speak the same language), mrta (Mortality rate), fert (Fertility rate), crim (Country and region for importation), ceex (Country and region for exportation), race (Race in the country) and aash (Annual average rainfall), of which fore, regi and aash are geographical factors. So hdi and agdp also correlate negatively with some geographical parameters (though the correlation level is not high), which suggests that economic progress also diminishes the diversity of geographical elements.
Two parameters, port and fore, are specially positioned in all PCA diagrams (supplementary file 4). These two parameters are strongly negatively correlated with each other, and neither has strong positive correlations with most other geography/society parameters. Nine parameters correlate negatively with port: pden, fore, rupo, mrta, fert, regi, crim, ceex and race. All other parameters either correlate positively with port or have little correlation with it. So the port parameter influences the geography/society environment broadly, though not strongly.
Correlation between a specific gene SNP and geography/society parameters
PCA was undertaken for the 13 language genes one by one against the 34 geography/society parameters. Each language gene contributes around 10 different SNPs (see SNP number in Table 2). All PCA results were quantified and are shown in Figure 2. The strongest negative correlations were seen at NFX-6~area, NFX-6~army, NFX-6~gres, NFX-6~mrvr, and NFX-6~road. NFX-6 (rs1440228) is one of the SNPs of the NFXL1 gene, which encodes a nuclear transcription factor (X-box binding-like 1). Gene Ontology annotations related to this gene include DNA-binding transcription factor activity and proximal promoter DNA-binding transcription repressor activity, plus RNA polymerase II-specific activity. The gene is associated with Specific Language Impairment. Of the five parameters involved, area (Area of the country), army (Active duty army), gres (Geographical resource), mrvr (Main river) and road, four are geographical factors. Interestingly, gres and road are involved in both the strongest positive and the strongest negative correlations.
Four geography/society parameters showed the weakest correlations with language gene SNPs (Figure 2C): aash (Annual average rainfall), fore (Forest coverage), pden (Population density of the country) and rway (Runway traffic mode). Several other parameters showed the second weakest correlations with language gene SNPs: ceex (Country and region for exportation), crim (Country and region for importation), agrc (Agriculture, forestry, husbandry and fishery) and relg (Religion in the country).
Figure 2C shows another interesting point. For each geography/society parameter, the number of SNPs correlating positively with it is almost the same as the number correlating negatively. This suggests that each parameter is balanced, perhaps coincidentally, by similar numbers of language gene SNPs with opposite correlations.
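The balance described above can be checked by simply counting signs per parameter. The following Python sketch shows one way to do so, assuming the SNP-parameter correlations are available as a mapping from (SNP, parameter) pairs to correlation values. The function name sign_balance, the SNP labels, the 0.1 cut-off for ignoring near-zero values and all numbers are hypothetical.

```python
from collections import defaultdict

def sign_balance(correlations, threshold=0.0):
    """Count, for each parameter, the SNPs whose correlation with it is
    above +threshold (positive) or below -threshold (negative)."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for (snp, parameter), r in correlations.items():
        entry = counts[parameter]  # register the parameter even if r is near zero
        if r > threshold:
            entry["positive"] += 1
        elif r < -threshold:
            entry["negative"] += 1
    return dict(counts)

# Hypothetical correlation values (NOT results from this study).
correlations = {
    ("SNP-A", "road"): 0.62,
    ("SNP-B", "road"): -0.58,
    ("SNP-C", "road"): 0.05,
    ("SNP-A", "fore"): -0.12,
    ("SNP-B", "fore"): 0.31,
    ("SNP-C", "fore"): -0.40,
}
balance = sign_balance(correlations, threshold=0.1)
```

Whether near-equal counts are meaningful would still need a formal test, for example a sign test per parameter, which is outside the scope of this sketch.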

Discussion
In this study, the basic data include 13 language genes and 111 randomly selected single nucleotide polymorphisms (SNPs) from them, the SNP profiles in 26 countries, and 34 geography/society parameters for the same 26 countries. In order to undertake principal component analysis (PCA), SNP genotypes and all geography/society parameters have to be represented as numerical values.
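To make the numerical representation concrete, the following Python sketch standardizes a few variables, builds their correlation matrix and extracts the first principal component by power iteration. It is a minimal sketch under stated assumptions, not the pipeline used in this study: the manuscript does not specify how genotypes were encoded, so the sketch assumes each SNP is represented by its minor allele frequency per country. The function names and all data values are hypothetical.

```python
import math

def zscore(column):
    """Standardize one variable to mean 0 and unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    if var == 0:
        raise ValueError("a constant column cannot be standardized")
    sd = math.sqrt(var)
    return [(x - mean) / sd for x in column]

def correlation_matrix(columns):
    """Pearson correlation matrix of already standardized columns."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]

def first_component(matrix, iterations=200):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    k = len(matrix)
    vec = [1.0 / (i + 1) for i in range(k)]  # uneven start avoids a trivial orthogonal guess
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the unit vector gives the eigenvalue.
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(k))
                     for i in range(k))
    return vec, eigenvalue

# Hypothetical values for five imaginary countries (NOT data from this study).
snp_freq = [0.10, 0.25, 0.30, 0.45, 0.50]  # assumed encoding: minor allele frequency
area = [1.0, 2.0, 3.5, 4.0, 5.0]           # arbitrary units
fore = [60.0, 45.0, 40.0, 20.0, 15.0]      # forest coverage, percent

columns = [zscore(c) for c in (snp_freq, area, fore)]
corr = correlation_matrix(columns)
loadings, eigenvalue = first_component(corr)
explained = eigenvalue / len(columns)  # share of total variance carried by PC1
```

In this toy data the SNP frequency and area load on the first component with the same sign and fore with the opposite sign, which is how positive and negative correlations appear in a PCA loading plot. Categorical parameters would need their own numerical coding before this step, and that choice can affect the result.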
Why these parameters in Table 3? In fact, this manuscript contains only very general geography/society parameters and clearly lacks a detailed description of geography/society features for the selected countries. More geographical parameters are needed in the future, such as temperature, altitude (including lowest and highest altitude), rugosity, some Google Maps factors, river direction, mountain orientation, precipitation amount, precipitation type, humidity, etc. Parameter combinations should also be described in the future, because synergistic effects have rarely been considered.
One limitation of this study is the 'data age'. Only factors representing geography/society features in the early stage of human evolution can leave significant marks on language gene polymorphisms. Though modern geography/society factors still act on the evolution of language gene polymorphisms, the modern era is too short relative to the whole of human history. The influence of modern times on language gene SNPs may only become steadily visible after a long time in the future. The data in Table 3 are mostly modern parameters; only a few parameters persist for a long time with few changes, such as the geomorphology feature type. But even geomorphology features may change a lot over several thousand years.
A limitation also exists in the language gene polymorphism data. At present, human genome sequencing samples are not balanced across countries, and many developing countries contribute much less genome sequence data than developed countries. So the data from poor countries and most developing countries are not representative enough.
The potential interaction between the ATP2C2 and NFXL1 genes may be worth investigating, since both hold the strongest correlations with geography/society parameters [37]. So far there is no report on their potential interaction, though a bioinformatics search can generate some hints (Figure 3). Figure 3 illustrates known interactions among the proteins encoded by the 13 language genes, most of which have not been experimentally confirmed. ROBO1 interacts with eight other genes; FOXP1, CMIP, CNTNAP2 and ATP2C2 each interact with seven genes; but TPK1, FLNC and TM4SF20 have not been found to interact with other language genes. As for the interaction between ATP2C2 and NFXL1, little is known, because the only interaction between the two genes is of a text-mining nature (Figure 3). But both are co-expressed with the DCDC2 gene, which is associated with reading disability and modulates neuronal development in the brain [38]. By postulation, if a protein interacts with more proteins, its polymorphisms will have more influential biological consequences and thus more chances to correlate with other proteins. Notably, almost all of the 13 language genes are involved in the development of human learning capacity and thus participate in determining one's ability to interact with geography/society factors; so the SNPs that have the most or the strongest correlations with geography/society factors may also hold stronger correlations with the SNPs of other language genes. This point will be investigated using more language gene SNPs in the near future.
This study is based on one assumption: geography/society factors directly influence the acoustic adaptation of Homo sapiens; this adaptation is reflected in gradual or sudden optimization of the anatomical structure of the phonatory organs in a local geographical/societal environment; and continuous mutational evolution of language genes directly facilitates that optimization. So, on the time scale of human evolution, there should be meaningful correlations between some geography/society factors and human language gene polymorphisms. Such 'meaningful' correlations will become more manifest only when both more sophisticated data and better pattern recognition algorithms are available.
On the other hand, this study also depends heavily on one question: what is the relationship between the anatomical structure of the phonatory organs and the language genes themselves? This relationship could be direct or indirect. If it is direct, functional optimization of a phonatory organ will be readily reflected in changes in language gene SNP profiles, and a correlation between geographical/societal factors and language gene SNPs will be relatively easy to pinpoint; if it is indirect, the expected correlation will be much harder to find. There have already been a few studies on the proteomic profiles of (human) vocal tissues/organs, but none of the proteins in Table 1 except FLNC was found in these studies [41][42][43][44]. Biological studies on this matter over the next 5-10 years are expected to be of great help.

Conclusions
This study tackles the question of whether there is some correlation between human language gene polymorphisms and geography-society parameters collected from twenty-six countries. By principal component analysis, the main correlations within geography parameters, within society parameters, between geography and society parameters, and between language gene SNPs and geography-society parameters were analyzed. The main preliminary results are as follows: (1) most geographical parameters are concentrated into one cluster, which contains 12 parameters that correlate positively with each other; (2) society parameters are divided into four clusters in the PCA diagrams, among which both positive and negative correlations exist; (3) the strongest positive correlations were seen at ATP-1~army, ATP-1~gres, ATP-1~pold, ATP-1~popu, and ATP-1~road; the strongest negative correlations were seen at NFX-6~area, NFX-6~army, NFX-6~gres, NFX-6~mrvr, and NFX-6~road; and the weakest correlations with language gene SNPs were observed for aash, fore, pden and rway. The meaning of these correlations will take time to decipher.
The authors have to admit that this study still has several points to improve, as follows. Whole countries are not always an appropriate unit of analysis for language gene polymorphisms and their correlations, especially very large and arguably heterogeneous ones such as "France" or "China". All twenty-six countries have language sampling points as shown in the WALS database [45], but it is not clear whether the SNP data fit those geographical points. Besides, each language gene has hundreds of DNA polymorphism variants, and most of them have not been functionally characterized. Later on we may have to focus on a few SNPs that have been linked with language disorders [46,47].
Importantly, Galton's problem raises questions about the nature of explanation in cross-national research [48], and this problem likely exists in the present version of the manuscript. For more than a decade it has been standard in such studies to include controls at least for Galton's problem and for country contact. Careful steps will have to be taken later to solve the above problems.