The WBBC Pilot Dataset and Variants Identified
The WBBC pilot project sampled 10,376 individuals from 29 of 34 administrative divisions of the People’s Republic of China (Fig. 1a and Supplementary Table 1). We performed whole-genome sequencing (WGS) in 4,535 individuals on NovaSeq 6000 platform. After removing contaminated and duplicated samples, 4,480 individuals were retained for downstream analyses and statistics. The mean sequencing coverage was 13.9×, which covered 99.77% of the genome, with a range between 9.6× and 65.2× (Supplementary Fig. 1a and Supplementary Table 2). Additionally, 6,025 individuals were genotyped by high-density Illumina Asian Screening Array (ASA), including 184 individuals who were also whole-genome sequenced.
Here, we identified 81,498,995 variants after filtration from 103.96 million total raw variants (Ts/Tv = 2.15), including 74,118,191 single-nucleotide variants (SNVs) and 7,380,804 insertions and deletions (INDELs). Of these variants, which the majority of variants (44.2 million, 54.5%) were singletons (Fig. 1b), 93.3% were rare (allele frequency, AF < 0.005) and low-frequency (AF = 0.005-0.05) variants (Supplementary Table 3). Additionally, C>T, G>A, A>G and T>C constituted the most common SNV substitution types (Supplementary Fig. 1b), and the length of INDELs mainly distributed between -10bp and 10bp (Supplementary Fig. 1c). We provided a database of genetic variations for the Han population in four sub-regions (North, Central, South and Lingnan) (https://wbbc.westlake.edu.cn/genotype.html).
Comparing the variants with the WBBC and other existing databases, 45,696,726 variants were found not to present in the 1000 Genome Project (1KG)22, gnomAD23 and UK10K6 (Fig. 1c). Of these, 45.6 (99.79%) million were rare variants (MAF < 0.005). We also found 31.37 (38.5%) million novel variants that were not present in dbSNP Build 15124, including 29,015,419 SNVs and 2,353,726 INDELs. Of these novel variants, singletons accounted for 83.3%, and 99.97% of the variants (31.26 million) were rare with MAF < 0.005.
We assessed the SNV variants calling accuracy and sensitivity by comparison with SNP array data in 184 individuals from the whole genome sequencing samples (13.3× - 54.7×). Supposing the SNP array data as the true genotype, the heterozygote discordance rate of genotypes was reduced 6-fold from 0.134 to 0.022 at 13.3× sequencing depth and 4-fold from 0.004 to 0.001 at 25.3× after genotype refinement (Supplementary Fig. 2a). The non-reference genotype concordance rate extended to 99.88% at 25× with increasing sequencing depth (Supplementary Fig. 2b). The non-reference sensitivity and specificity had an effective increase after genotype refinement with BEAGLE from 0.9211 to 0.9924 and from 0.9931 to 0.9999 (Supplementary Fig. 2c, d).
Trait-associated Variants between Populations
We downloaded the public annotation data of human genome-wide association studies (GWAS) from the NHGRI-EBI GWAS Catalog database (https://www.ebi.ac.uk/gwas/) to compare the difference in the rare and low-frequency alleles in 107,124 shared variants between the Chinese (WBBC) and European (EUR). As we know, GWAS usually reported common variants, we found that some trait-associated common variants in European had much lower allele frequency in Chinese. For example, in the shared 17,781 rare variants in WBBC, 61.04% of the variants were common in EUR (Fig. 1d). Regarding to the number of trait-associated variants with different allele frequency, we found that body height, body mass index, blood protein levels, bone mineral density and educational attainment were among the top five (Supplementary Table 4 listed the top 50 phenotypes), suggesting that it should be careful when interpreting and applying the genetic data of these traits between populations. The 4 SNPs with highest difference in derived allele frequency were in the SLC24A5 (rs1426654_A, WBBC = 0.012 and EUR = 0.997)25, SLC45A2 (rs16891982_G, WBBC = 0.003 and EUR = 0.938)26, EDAR (rs3827760_G, WBBC = 0.926 and EUR = 0.011)27 and SULT1C4 genes (rs4149433_T, WBBC = 0.884 and EUR = 0.011)28, which were associated with the traits of body mass index, eye color, hair morphology, and lobe attachment. These biological traits between the Chinese and European were remarkable, which were probably inherited from the ancient ancestors and adapted to local environments in a long period of time.
Further, we compared the difference of allele frequency between WBBC and 1000 Genome Project populations, including East Asian (EAS), South Asian (SAS), Admixed American (AMR), European (EUR) and African (AFR)22. A total of 19,778,701 shared variants were extracted and consequently divided into rare (MAF < 0.5%), low-frequency (0.5% ≤ MAF ≤ 5%) and common variants (MAF > 5%) (Fig. 1e and Supplementary Table 5). We found that 6.06% and 5.05% of the rare variants in the WBBC were the common variants in the AFR and EUR, respectively. In terms of the common variants in the WBBC, 6.83% and 5.79% variants were rare variants in the AFR and EUR, respectively. Further, we assessed the 2,020,520 shared variants among East Asians including Chinese (WBBC), Inner Mongolians (IMG)29, Koreans (KOR)11, Japanese (JPT) and Vietnamese (KHV)22 (Supplementary Fig. 1d and Supplementary Table 6). We observed that 1.52% and 0.56 % common variants in the IMG and JPT populations were rare frequency variants in the WBBC. The low-frequency variants in the WBBC displayed diverse frequency in other East Asian population (Supplementary Fig. 1d). Relatively, the variants frequency of the KHV and KOR populations indicated a minor difference with the WBBC population. Evidently, the East Asian population had a similar frequency spectrum in the common variants (Supplementary Fig. 1d).
Variants Annotation and Individual Genome
To characterize variants with a biological consequence, we annotated all the variants from 4,480 individuals regardless of medical conditions by ANNOVAR tools30. As expected, 77,862,672 (95.5%) variants were in intergenic and intronic regions (Supplementary Table 3). The variants in intergenic and intronic regions comprised 89.64% of novel variants (Supplementary Fig. 1e). In coding and splice regions, the missense accounted for 54.22% of the novel variants, while synonymous and splice variants made up 40.5% of the novel variants (Supplementary Fig. 1e). We also found that the missense, stop-gain, frameshift indels and non-frameshift indels variants were markedly increased among rare variants, compared with low frequency and common variants, which were signatures of population expansion and weak purifying selection. We predicted about 300,000 deleterious variants by SIFT, PolyPhen-2 or MutationTaster in 4,480 individuals, with 97.3% of the variants being rare alleles with MAF < 0.5% (Supplementary Table 3). Interestingly, we also identified 1,842 pathogenic or likely pathogenic variants recorded by ClinVar in our dataset regardless of medical conditions. Of these predicted disease-causing variants, 97.4% variants were rare, 1.7% variants were low frequency, and 0.9% were common variants, which arose from selection pressure subjected on these rare variants. The c.315-48 T>C in FECH gene were common variant (AF > 0.3) in the Chinese and Asian population, however, the AF was only 0.06 in the CEU population and gnomAD. This deep intronic variant was considered as pathogenic variant and might cause erythropoietic protoporphyria (EPP) in the European patients31.
We selected 1,151 healthy individuals for the autosomal variants’ statistic of a personal genome. On average, an individual carried 2,936,012 SNVs and 191,333 INDELs, including 8,915 missense, 10 stop loss, 70 stop gain and 126 frameshift or non-frameshift indels (Supplementary Table 7). In total, 96.5% of the variants were located in intergenic and intronic regions. For the disease-associated variants predicted in silico, about 1,623 variants were deleterious by SIFT32, 1,714 variants were probably or possibly damaging by Polyphen233, and 8,591 variants were disease-causing by MutationTaster34. In these pathogenic and deleterious variants, we observed the higher ratio of heterozygote and non-reference homozygote (Het/Hom) (Supplementary Table 7). The proportions of Het/Hom were also very high in novel SNVs and INDELs variants, which indicated that the majority of novel variants occurred as heterozygotes in the Han Chinese population.
The homozygous pathogenic variants are responsible for recessive Mendelian disorders. To measure the prevalence of pathogenic variants in a healthy individual, we annotated the variants by ClinVar35. In total, we identified 757 pathogenic variants in all the healthy population, and in average each individual carried 11 pathogenic variants (het/hom = 2.08) (Supplementary Table 7 and Supplementary Table 8). Each genome carried 3.6 ± 2.1 (mean ± SD) pathogenic homozygote variants in Han Chinese population. Additionally, we found that 19 pathogenic variants existed in all four sub-regions (North, Central, South and Lingnan population) were mainly relevant to the immune system, metabolic, and hearing impairment diseases, and 16 out of the 19 (84.2%) variants had a higher frequency (MAF > 0.01) in the Han Chinese population (Supplementary Table 8), and lower frequency (MAF < 0.001) in 1000 Genome Project (1KG)22 and gnomAD database23.
Whole Genome-wide Singleton Density Score Analysis and Selection Inference
The Singleton Density Score (SDS) can be applied to infer recent allele frequency changes by calculating the distance between the nearest singletons on either side of a test-SNP using whole-genome sequence data36. We tried to infer the recent allele frequency changes at SNVs of the Han Chinese population by calculating SDS based on 4,258,941 bi-allelic SNVs and 17,951,337 singletons from 4,334 whole-genome sequenced Han individuals. We found a novel significant selection signatures in SNX29 gene (Fig. 2a) on chromosome 16p, which encoded the sorting nexin-29 protein and was ubiquitously expressed in the kidney, lymph node, ovary and thyroid gland tissues37. More than 30 SNPs on SNX29 gene exhibited strong selection signatures (P < 5Í10-8), which indicated significant enrichment of selection in this genomic region. Relatively higher derived allele frequency (DAF) was observed on the top SNP rs75431978 in the Han Chinese population (DAF = 0.181, P = 5.54Í10-16) and East Asian population (DAF = 0.146, Mongolian = 0.18, Korean = 0.187, Japanese = 0.12), compared to the values obtained in 1000 Genome Project SAS (DAF = 0.062), EUR (DAF = 0.003) and AFR (DAF = 0.002) populations (Fig. 2b). SNX29 was reported to be a biomarker for vasodilator-responsive Pulmonary Arterial Hypertension38 and major mental disorders39. We also identified other two novel potential selection signals DNAH1 rs78947691 on chromosome 3 (DAF = 0.270, P = 2.65Í10-8) and WDR1 rs148629931 (DAF = 0.058, P = 5.44Í10-8) on chromosome 4 (Fig. 2a). The top rs78947691 in the intron 16 of DNAH1 gene and rs148629931 in the upstream of WDR1 gene have relatively high DAF in the EAS population, comparing with other populations (Fig. 2b). Although these two SNPs were unreported, the polymorphisms in the DNAH1 gene showed the potential association with male infertility in the Chinese40, 41, while the variation in the WDR1 gene were the risk factors for gout development in the Chinese population42, 43.
We also confirmed several significant natural selection signals at ADH gene clusters (rs1229984, P = 6.07Í10-16), the MHC region (rs9380181, p = 6.43Í10-11), and ALDH2 (rs671, P = 1.68Í10-14) (Fig. 2a). These three selection signature regions have also been identified in the Japanese population44. The alcohol-metabolizing enzymes such as the alcohol dehydrogenase (ADH) genes, including ADH1A, ADH1B, ADH4, ADH5, ADH6, and the aldehyde dehydrogenase (ALDH2) gene, had an effective impact on the alcohol metabolism pathway and the consequent alcoholism protective effect, which strongly indicated diverse ethnic-specific alcohol consumption patterns45, 46, 47, 48, 49. Similarly, the high derived allele frequency (DAF) in this genomic loci, particularly rs1154413 (ADH5), rs4148887 (ADH4), rs2156733 (ADH6), rs3819197 (ADH1A), rs1229984 (ADH1B), and rs671 (ALDH2) (0.726, 0.734, 0.764, 0.790, 0710 and 0.239, respectively), illustrated corresponding alleles associated with alcoholism in the Han Chinese, when compared with other non-East Asian (Fig. 2b and Supplementary Table 9). Interestingly, we observed a higher-level DAF in these SNVs in South and Lingnan regions compared to the North and Central Han, which reflected the recent regional DAF changes and adaptation in this populous ethnicity and articulated different drinking habits or specific-alcohol consumption. In addition, we identified other selection signals in chromosome 12, including rs11066280 (P = 1.41Í10-13) in HECTD4 gene, rs11066015 (P = 2.57Í10-12) in ACAD10 gene and rs3782886 (P = 4.11Í10-12) in BRAP gene, which were limited to EAS ancestry populations. These three genes adjacent to the ALDH2 gene in chromosome 12q were within a large linkage disequilibrium (LD) block50, revealing that the region had been under positive selection for a long time.
We estimated the selection coefficient trajectories for ADH1A and ADH1B genes with a hidden Markov model51 using the ancient allele frequencies from East Asian ancient individuals (9,500 - 300 BP) and present-day allele frequencies in the WBBC. The derived allele of the SNP rs3819197 was present around the 7,000 year ago, but was very rare for a long time (Fig. 2c). The strength of selection at the ADH1A (rs3819197) and ADH1B (rs1229984) gene increased around 4,000 years ago. The derived allele (A) of rs671 in ALDH2 gene was a common variant (MAF = 0.174) and strongly selected in modern East Asian population, yet very rarely in non-East Asian (Fig. 2b), however, the allele A was absent in the East Asian ancient DNA data, suggesting a more recent selection.
Imputation Reference Panel for the Chinese Population
We evaluated the genotype imputation accuracy of the WBBC, 1KG (Phase 3, v5a)22, CONVERGE52, and two combined reference panels (WBBC+EAS and WBBC+1KG) in the Chinese population (Supplementary Fig. 3). The results showed that the WBBC panel, with almost fifteen-fold more Chinese samples than the 1KG Project, yielded substantial improvement for imputation for low-frequency and rare variants (Fig. 3a). The two combined panels, WBBC+EAS and WBBC+1KG, almost tied and possessed both the highest Rsq and number of well-imputed variant in the shared sites with a MAF range of 0.2% to 50%, followed by the WBBC, 1KG and CONVERGE (Fig. 3a). For the rare variants with MAF less than 0.2%, WBBC+EAS panel showed the best performance, and the WBBC panel performed roughly the same as the WBBC+1KG (Fig. 3a). This result indicated that merging EAS individuals of the 1KG to increase the haplotype size of the WBBC could improve panel’s performance across all MAF bins, but merging the whole 1KG cannot yield more improvement than merging-EAS-only and even not equal to it when the imputed variants were quite rare. Taking all shared variants together, the WBBC+EAS yielded the most well-imputed variants in shared sites, while the CONVERGE panel imputed the least (Fig. 3b). The proportion of imputed variants with Rsq ≥ 0.8 for CONVERGE was the only one under 50% across five panels, even it was population-specific to Chinese (Fig. 3c), indicative of the importance of coverage sequencing depth of a reference panel.
To comprehensively evaluate the imputation accuracy for the five panels, we further calculated the non-reference (NR) genotype concordance rate between imputed and genotyped variants by chip array and WGS respectively (imputation vs. chip array and imputation vs. WGS). Two combined panels had the most promising distributions of the NR concordance rates, which were almost coincident with each other, indicating that the NR concordance rates for Chinese imputation could barely benefit from the extra non-East Asian haplotypes of the reference panel (Fig. 3d). Besides, we could know that the peaks of two combined panels in density plots were higher than other panels, indicating that the distributions of NR concordance rates were more concentrated in the two combined panels (Fig. 3d). The performance of the WBBC panel was slightly behind the two combined panels, but was superior to the 1KG and CONVERGE (Fig. 3d). We also calculated the NR allele concordance rate between the imputed genotypes and the directly sequenced genotypes. Not surprisingly, the two combined panels performed best and were approximately coincident and very closely followed by the WBBC (Fig. 3e). This result suggested that the improvement provided by the EAS and 1KG were unremarkable. Considering all variants together, the WBBC+EAS panel showed the highest NR allele concordance rate, followed by the WBBC+1KG, WBBC, 1KG and CONVERGE (Fig. 3f).
Overall, we employed Rsq, and NR allele concordance rate for both WGS and array genotype to measure the imputation accuracy for the five panels. Our results demonstrated the superiority of the WBBC as a reference panel for Chinese population imputation. Compared to the 1KG and CONVERGE, WBBC panel greatly improved the imputation accuracy, especially for the rare and low-frequency variants. Besides, merging EAS/1KG haplotypes into the WBBC could further improve the imputation accuracy.
To facilitate genotype imputation in Chinese population, we developed an imputation server with user-friendly website interface for public use (https://imputationserver.westlake.edu.cn/). Users can register and create imputation jobs freely by uploading their bgzipped array data (VCF-formatted) to our server under a strict policy of data security. To ensure the integrity of array data for next phasing and imputation, some basic QC should be performed, such as removing mismatched SNPs, monomorphism and duplicate SNPs. The server provided a choice of four reference panels to conduct the imputation, including the WBBC, 1KG Phase3, WBBC combined with EAS, and WBBC combined 1KG Phase3. All panels in both GRCh37 and GRCh38 were built to meet different needs. Besides, service of phasing was also provided in our server for users who cannot afford the corresponding heavy computational load. An email of reminder will be sent to the user when the imputation job is finished, and then user can download the imputed genotype data and the corresponding statistics file with an encrypted link. The SHAPEIT and MINIMAC were employed in our server for phasing and imputation, respectively. More details including the policy of data security, statistics of four reference panels, and the reference manual were specified in our website.
Genetic Evidence Supported the Geographical Boundaries of the Qinling-Huaihe Line and Nanling Mountains
To explore the Chinese population structure, we performed principal component analysis (PCA) on 2,056 Han Chinese individuals and 205 minority individuals from 29 of 34 administrative divisions of China (Fig. 4a). PC1 and PC2 revealed the main genetic structure of the Chinese population, with PC1 displaying a population stratification along the north-south cline, reflecting the geographical locations (Fig. 4b). The genetic difference of the Han population corresponded to the geographical boundaries of the Qinling-Huaihe River Line and Nanling Mountains. Based on the PCA analysis and traditional geographical boundaries in China, the Han Chinese could be classified into four groups: the North Han (Gansu, Hebei, Heilongjiang, Henan, Inner Mongolia, Jilin, Liaoning, Ningxia, Qinghai, Shaanxi, Shandong, Shanxi and Tianjin) (Supplementary Fig. 4a and Supplementary Fig. 5), the Central Han (Anhui and Jiangsu) (Supplementary Fig. 4b and Supplementary Fig. 6), where Central Han were closed to North, but embedded in both North and South Han, the South Han (Chongqing, Fujian, Guizhou, Hubei, Hunan, Jiangxi, Sichuan, Yunnan and Zhejiang) (Supplementary Fig. 4c and Supplementary Fig. 7), and the Lingnan Han (Guangxi, Guangdong and Hainan) (Supplementary Fig. 4d and Supplementary Fig. 8). PC3 and PC4 displayed no discernable geographical structure and subpopulations (Supplementary Fig. 4e). When the 104 JPT (Japanese in Tokyo, Japan) and 99 KHV (Kinh in Ho Chi Minh City, Vietnam) samples from the 1000 Genomes Projects (1KG) were included, the KHV population formed a cluster overlapping with Lingnan Han, while the JPT population was closer to the North Han Chinese (Supplementary Fig. 4f).
We estimated ancestral composition in the Han Chinese population from 27 provinces using the ADMIXTURE program. The average number of presumed ancestral populations were calculated in each province with the optimal K = 3. When the value of component 1 was sorted, the four regions were arranged from northern to southern China (Fig. 4c). The ancestry fractions of the North Han accounted for about 66% on component 3. The ancestral component of the Central Han was closer to the North Han with 52.1% on component 3, while the admixture components in the South Han were 46.3% on component 1 and 40% on component 3 respectively, which did not show the predominant ancestral components. We found a distinctly higher proportion of component 1 in Lingnan Han, at 78% of ancestry composition compared to other ancestral components. North Han, South Han, and Lingnan Han showed significantly different clusters, while central Han embodied the ancestral components of both northern and southern populations. In southern China, South Han and Lingnan Han were clearly distinguished from each other, which was consistent with the PCA results.
We combined the 1KG data (CHB, CHS, CDX, JPT and KHV) and conducted ADMIXTURE analysis from K = 2 to K = 8 to explore the admixture with Asian population in the Han Chinese and 1KG population (Supplementary Fig. 9). At K = 4, the cross-validation error was the lowest. Based on the outcome of the admixture analysis, individuals from the Guangdong province displayed a complex genetic pattern consisting of Lingnan Han and South Han. Moreover, Fujian province which was obviously consisted of a genetic mosaic of diverse people showed more genetic differentiation compared to the other southern provinces at K = 4 and K = 5. Exploration of structuring patterns also revealed that the Jiangxi and Hunan provinces shared the same components from K = 2 to K = 6 in the South region, corresponding to the adjacent geographical locations, and resulting from known historical migration events. There were no significant components difference among the Northern provinces. Most components of the CHB population were consistent with North Han, except for part of samples originating from South Han, while the CHS population mainly came from South Han. From K = 2 to K = 6, the JPT population and Han Chinese population were classified into different groups, but more close to the North Han at K = 2. KHV population clustered with Lingnan Han at K = 2, whereas from K = 3 onwards, the KHV population and Lingnan Han clustered separately.
We collected 340 published ancient genomes from 8 countries or regions from 40,000 to 300 years ago and 95 representative present-day genomes [Inner Mongolia (IMG), North, Central, South and Lingnan Han from the WBBC, and CHB, CHS, CDX, JPT and KHV individuals from the 1KG] to reveal the population relationships between modern and ancient individuals in Asia (Fig. 4d). The PCA analysis showed that there was strong genetic divergence of ancient individuals between the Northern (Mongolia and Russia) and Southern area (Taiwan, Thailand and Vietnam). And the ancient samples spread from North to South dispersedly compared to the modern samples. The ancient individuals from North Asia (e.g., Mongolia and Russia) were closer to modern North Han than South Han, while both modern and ancient samples from the Southern area (South, Lingnan, Taiwan, Thailand and Vietnam) were closely clustered together, which was consistently fit well with geographic distribution of the populations. The 88 ancient individuals from the China mainland were mostly close to the modern North Han, and there was population stratification with the modern Southern Chinese population in the PCA analysis, which suggested the human migration and admixture in Northern and Southern China during the long population history of East Asia53, 54.
Population Structure and Demographic History in Four Sub-regions of the Han Chinese Population
Weir-Cockerham FST is an allele frequency-based metric to measure the population differentiation due to genetic structure. We calculated pairwise FST and performed hierarchical clustering for 27 administrative divisions of China and 26 populations of the 1KG. The 27 administrative divisions were mainly clustered into three groups and showed an association with geography (Fig. 5a and Supplementary Table 10). Anhui and Jiangsu provinces, which we designated as Central region of China, were clustered with Northern provinces, indicative of a closer genetic relationship. The other two groups, South and Lingnan, aligned with the regions we designated. Besides, the hierarchical branches suggested that the population differentiation between South and North was smaller than that between the South and Lingnan (Fig. 5a), reflecting the relatively shorter genetic distance. The two most remote regions in geography, North and Lingnan, were also found to have the largest population differentiation (Fig. 5a). Not surprisingly, the pairwise FST clustering results between the WBBC and 1KG populations showed that the four designated regions were clustered into the East Asian (EAS) group (Supplementary Fig. 10 and Supplementary Table 11). In particular, North and Central were clustered with CHB, while South and Lingnan were clustered with CHS (Supplementary Fig. 10). Using the four 1KG continent-level ancestry groups (AFR, EUR, AMR and SAS) as the Non-Chinese population reference, we further investigated the geographical patterns of FST in 27 administrative divisions of China. The AFR group showed the largest FST that ranged from 0.14 to 0.15 (Supplementary Fig. 11b and Supplementary Table 12), indicative of the greatest population differentiation to the WBBC, while the SAS and AMR group yielded the least value (Supplementary Fig. 11a, c). The geographical patterns of FST across the four sub-regions were similar to each other. On average, the Han Chinese in Northern provinces had the relatively closer genetic structure to the Non-Chinese populations of the 1KG. Interestingly, Qinghai province was conspicuously highlighted in the geographic heatmaps, as its pairwise FST value was obviously smaller than that of other administrative divisions (Supplementary Fig. 11), indicative of the genetic structure particularity of the Qinghai Han Chinese.
Next, we detected the IBD segments with the logarithm of the odds (LOD) score > 3 across individuals in the WBBC55. Unlike the Weir-Cockerham FST, IBD analysis is a haplotype-based approach to reveal the genetic structure and investigate the common ancestry of populations. The total IBD segment counts in each pair of administrative divisions were normalized by the corresponding sample size. We then performed the hierarchical clustering based on the matrix of normalized pairwise IBD counts. Similar to the results of FST clustering, 27 administrative divisions were also mainly clustered into three groups, and individuals from Anhui and Jiangsu provinces were clustered in North (Fig. 5a, b). Besides, the results showed that most Southern provinces shared more IBD segments with Northern provinces than with Lingnan (Fig. 5b), just as observed in the Fst analysis (Fig. 5a), suggesting that the Han Chinese in South and North shared more common ancestry than South and Lingnan. The Fujian and Hunan, which we designated as the Southern province, had been found that joined up with Lingnan provinces by the multiple hierarchical branches, indicating that they were more close to Lingnan in ancestry (Fig. 5b), and contiguous to Lingnan geographically (Fig. 5c).
We inferred the history of effective population size for the Han Chinese, and the results across the four regions were shown in Fig. 5d. In the period from 1 million years ago to ~ 6 thousand years ago (kya), the Han Chinese size histories of four regions experienced almost identical dynamics. From 200 kya to ~10 kya, the effective population size experienced a steep decline and then grew rapidly, with the lowest point reached at ~60 kya, which was indicative of a bottleneck, consistent with previous demographic history studies14, 22, 56. Around 6 kya, the size histories of the Han Chinese from the Lingnan began to deviate from the other three regions, potentially reflecting the existence of a population substructure within the Lingnan Han Chinese (Fig. 5d).
Using the Han Chinese in the most northern province (Heilongjiang) of China as the reference, we estimated relative genetic drifts and inferred a rooted maximum likelihood tree between 27 administrative divisions by TreeMix software57. In the result shown in Fig. 5c, the relative drift of the provinces and municipalities were in line with the geographic location. To gain a better understanding of the result, we further drew a geographic heatmap that suggested a general genetic drift trend from the North to Lingnan, with the drift parameter increasing as the latitude decreased (Fig. 5c). To judge the confidence in the trend and tree topology, we performed ten bootstrap replicates by resampling blocks of SNPs. The trend was repeated in all replicate results (Supplementary Fig. 12). Besides, we found that the tree topology of administrative divisions in Central, South and Lingnan was stable. In the North, however, the tree topology was slightly different across the replicates, indicating that the genetic structures of the Northern administrative divisions were very similar and could not be precisely presented in the tree topology (Supplementary Fig. 12).
Enlightened by the genetic drift estimation results, we further investigated the homogeneity degree in the genetic structure of the Northern and Southern Han Chinese respectively. We performed the Wilcoxon rank-sum tests58 for Northern and Southern administrative divisions using their respective pairwise FST values, normalized IBD segments counts and relative drift parameters. The results showed that the Han Chinese from North had smaller population differentiation (P = 4.6e-10) and genetic drifts (P = 2.5e-11), and shared more IBD segments with each other (P = 1.9e-13) than those from South (Fig. 5e). These results suggested that the genetic structure of the Han Chinese in North was significantly homogeneous than those in South.
Signatures of Recent Positive Selection in Four Sub-regions of the Han Chinese Population
We employed the integrated haplotype score (iHS) test to identify recent natural signatures of positive selective sweeps in the North, Central South, and Lingnan Han populations59. The top 1% genomic regions with higher |iHS| scores were found in each population (Supplementary Table 13-16). The numbers of overlapping genomic windows of selective sweep regions across the four populations were shown in supplementary Fig. 13. Only 34 (26%) sweep regions were found in all the four populations. Most regions were shared in two or three of the four subgroups. Averagely, 23.2% of the regions were independent in North, Central and South Han. However, the Lingnan Han had distinctly excess independent sweeps (50, 38.5%), which might be inherited from separate ancestral components, consistent with the conclusion from our demographic history analysis. Importantly, we observed the EDAR gene in the first three sweep regions in all four subgroups. EDAR is genetic determinant of hair thickness and has been under the strong selection pressure in East Asian60, 61, 62. A large genomic region extending for at least 285 kb on the chromosome 7 in the top 10 window regions in all four subgroups contained eight contiguous robust selection genes (EPHB6, KEL, LLCFC1, MTRNR2L6, PRSS1, PRSS3P2, TRPV5 and TRPV6; Supplementary Table 13-16). The signature of positive selection of this region has been described only in European-Americans and not in African-Americans population63. Our finding indicates that the interesting candidate region is not population-specific.
We conducted Gene Ontology (GO) and KEGG pathway analysis for candidate genes in the top 1% genomic regions with signals of recent selection by iHS (Supplementary Table 17 and Supplementary Table 18). The terms were selected according to p-value (< 0.05). The results of the GO analysis showed a significant enrichment of positively selected genes for ethanol metabolic process and ethanol oxidation in four sub-regions (Supplementary Table 17), consistent with the selective signatures by whole genome-wide singleton density score (SDS) analysis in the Han Chinese population. We also observed intriguing enrichment of keratinocyte differentiation, epidermal cell differentiation and skin development (CTSL, COL7A1, PKP3, SEC24B, SLITRK6, WNT10A, KRT10, KRT12, KRT20 and KRT23) in the South and Lingnan Han, which were not present in the North and Central Han populations. The KEGG analysis found that 23 pathways were enriched in the Han population with adjusted p-values < 0.05 (Supplementary Table 18). Of these pathways, Southern individuals displayed significantly enriched terms more than the northern population. Tyrosine metabolism, retinol metabolism and fatty acid degradation were identified in four sub-regions.