Genomic analyses of 10,376 individuals provides comprehensive map of genetic variations, structure and reference haplotypes for Chinese population

Here, we initiated the Westlake BioBank for Chinese (WBBC) pilot project with 4,535 whole-genome sequencing individuals and 5,481 high-density genotyping individuals. We identified 80.99 million SNPs and INDELs, of which 38.6% are novel. The genetic evidence of Chinese population structure supported the corresponding geographical boundaries of the Qinling-Huaihe Line and Nanling Mountains. The genetic architecture within North Han was more homogeneous than South Han, and the history of effective population size of Lingnan began to deviate from the other three regions from 6 thousand years ago. In addition, we identified a novel locus (SNX29) under selection pressure and confirmed several loci associated with alcohol metabolism and histocompatibility systems. We observed significant selection of genes on epidermal cell differentiation and skin development only in southern Chinese. Finally, we provided an online imputation server (https://wbbc.westlake.edu.cn/) which could result in higher imputation accuracy compared to the existing panels, especially for lower frequency variants.


Introduction
Understanding the architecture of the human genome has been a fundamental approach to precision medicine. Over the past decade, great progress has been made to unravel either the genetic basis of complex traits and diseases (Timpson et al., 2018) or the human evolutionary history (Nielsen et al., 2017). The in-depth analysis of global populations with diverse ancestry could improve the understanding of the relationship between genomic variations and human diseases (2019). However, genetic studies exhibited a vast imbalance in global population, with individuals of European descent took up ~79% of all genome wide association study (GWAS) participants (2019; Martin et al., 2019). Similarly, most of the whole-genome sequencing (WGS) efforts were predominantly conducted on European populations, such as Dutch (Genome of the Netherlands, 2014), UK (Consortium et al., 2015) and Icelandic population (Gudbjartsson et al., 2015). Even in larger genomic projects such as the Trans-Omics for Precision Medicine (TOPMed) program, which consisted of ~155k participants from >80 different studies, only 9% of samples were of Asian descent (Kowalski et al., 2019). Therefore, large-scale genomic data are required to understand the genetic basis in Asian population. Recently, some studies have sequenced and analyzed the Asian populations including Japanese (Nagasaki et al.,

Population Genetic Structure and Demographic History in Four Sub-regions of the Han Chinese Population
Weir-Cockerham FST is an allele frequency-based metric to measure the population differentiation due to genetic structure. We calculated pairwise FST and performed hierarchical clustering for 27 administrative divisions of China and 26 populations of the 1KG. The 27 administrative divisions were mainly clustered into three groups and showed an association with geography ( Figure 3A and Table S5). Anhui and Jiangsu provinces, which we designated as Central region of China, were clustered with Northern provinces, indicative of a closer genetic relationship. The other two groups, South and Lingnan, aligned with the regions we designated. Besides, the hierarchical branches suggested that the population differentiation between South and North was smaller than that between the South and Lingnan ( Figure 3A), reflecting the relatively shorter genetic distance. The two most remote regions in geography, North and Lingnan, were also found to have the largest population differentiation ( Figure 3A).
Not surprisingly, the pairwise FST clustering results between the WBBC and 1KG populations showed that the four designated regions were clustered into the EAS group ( Figure S14 and Table S6). In particular, North and Central were clustered with CHB, while South and Lingnan were clustered with CHS ( Figure S14). Using the four 1KG continent-level ancestry groups (AFR, EUR, AMR and SAS) as the Non-Chinese population reference, we further investigated the geographical patterns of FST in 27 administrative divisions of China. The AFR group showed the largest FST that ranged from 0.14 to 0.15 ( Figure S15B and Table S7), indicative of the greatest population differentiation to the WBBC, while the SAS and AMR group yielded the least value ( Figure S15A and S15C). The geographical patterns of FST across the four sub-regions were similar to each other. On average, the Han Chinese in Northern provinces had the relatively closer genetic structure to the Non-Chinese populations of the 1KG. Interestingly, Qinghai province was conspicuously highlighted in the geographic heatmaps, as its pairwise FST value was obviously smaller than that of other administrative divisions ( Figure S15), indicative of the genetic structure particularity of the Qinghai Han Chinese.
Next, we detected the IBD segments with the logarithm of the odds (LOD) score > 3 across individuals in the WBBC (Lander and Schork, 1994). Unlike the Weir-Cockerham FST, IBD analysis is a haplotype-based approach to reveal the genetic structure and investigate the common ancestry of populations. The total IBD segment counts in each pair of administrative divisions were normalized by the corresponding sample size (see Methods). We then performed the hierarchical clustering based on the matrix of normalized pairwise IBD counts. Similar to the results of FST clustering, 27 administrative divisions were also mainly clustered into three groups, and individuals from Anhui and Jiangsu provinces were clustered in North (Figures 3A and 3B). Besides, the results showed that most Southern provinces shared more IBD segments with Northern provinces than with Lingnan ( Figure 3B), just as observed in the Fst analysis ( Figure 3A), suggesting that the Han Chinese in South and North shared more common ancestry than South and Lingnan. The Fujian and Hunan, which we designated as the Southern province, had been found that joined up with Lingnan provinces by the multiple hierarchical branches, indicating that they were more close to Lingnan in ancestry ( Figure 3B), and contiguous to Lingnan geographically (Figures 3C and S15).
We inferred the history of effective population size for the Han Chinese, and the results across the four regions were shown in Figure 3D. In the period from 1 million years ago to ~ 6 thousand years ago (kya), the Han Chinese size histories of four regions experienced almost identical dynamics. From 200 kya to ~10 kya, the effective population size experienced a steep decline and then grew rapidly, with the lowest point reached at ~60 kya, which was indicative of a bottleneck, consistent with previous demographic history studies (Genomes Project et al., 2015;Terhorst et al., 2017;Wu et al., 2019). Around 6 kya, the size histories of the Han Chinese from the Lingnan began to deviate from the other three regions, potentially reflecting the existence of a population substructure within the Lingnan Han Chinese ( Figure 3D).
Using the Han Chinese in the most northern province (Heilongjiang) of China as the reference, we estimated relative genetic drifts and inferred a rooted maximum likelihood tree between 27 administrative divisions by TreeMix software (Pickrell and Pritchard, 2012). In the result shown in Figure 3C, the relative drift of the provinces and municipalities were in line with the geographic location. To gain a better understanding of the result, we further drew a geographic heatmap that suggested a general genetic drift trend from the North to Lingnan, with the drift parameter increasing as the latitude decreased ( Figure 3C). To judge the confidence in the trend and tree topology, we performed ten bootstrap replicates by resampling blocks of SNPs. The trend was repeated in all replicate results ( Figure S16). Besides, we found that the tree topology of administrative divisions in Central, South and Lingnan was stable. In the North, however, the tree topology was slightly different across the replicates, indicating that the genetic structures of the Northern administrative divisions were very similar and could not be precisely presented in the tree topology ( Figure S16).
Enlightened by the genetic drift estimation results, we further investigated the homogeneity degree in the genetic structure of the Northern and Southern Han Chinese respectively. We performed the Wilcoxon rank-sum tests (Wilcoxin, 1947) for Northern and Southern administrative divisions using their respective pairwise FST values, normalized IBD segments counts and relative drift parameters. The results showed that the Han Chinese from North had smaller population differentiation (p-value = 4.6e-10) and genetic drifts (p-value = 2.5e-11), and shared more IBD segments with each other (p-value = 1.9e-13) than those from South ( Figure 3E).
These results suggested that the genetic structure of the Han Chinese in North was significantly homogeneous than those in South.
Whole Genome-wide Singleton Density Score Analysis biallelic SNVs and 17,943,790 singletons from 4,395 Han individuals were conducted for the SDS computation. On chromosome 16p, we found novel significant selection signatures in SNX29 gene ( Figure 4A), which encoded the sorting nexin-29 protein and was ubiquitously expressed in the kidney, lymph node, ovary and thyroid gland tissues (Fagerberg et al., 2014). In SNX29 gene, more than 30 SNPs showed strong selection signatures (p < 510 -8 ), which indicated significant enrichment of selection in this genomic region. Relatively higher DAF was observed on the top SNP rs75431978 (DAF = 0.176, p = 1.3110 -15 ) in the Han Chinese population, compared to the values obtained in 1000 Genome Project EUR (DAF = 0.003) and AFR (DAF = 0.002) populations. SNX29 was reported to be a biomarker for vasodilator-responsive Pulmonary Arterial Hypertension (Thayer et al., 2020). Although the function of the SNX29 gene remained unknown, it could be considered a biological target of nature selection pressure in the Han population.
We also confirmed several significant natural selection signals at ADH gene clusters (rs1229984, p = 5.5110 -16 ), the MHC region (rs9380181, p = 2.0410 -10 ), and BRAP-ALDH2 (rs3782886, p = 4.2910 -12 ) ( Figure 4A). These three selection signature regions have also been identified in the Japanese population, but the strongest signal loci were at different variants (Okada et al., 2018). The alcohol-metabolizing enzymes such as the alcohol dehydrogenase (ADH) genes, including ADH1A, ADH1B, ADH4, ADH5, ADH6, and the aldehyde dehydrogenase (ALDH2) gene, had an effective impact on the alcohol metabolism pathway and the consequent alcoholism protective effect, which strongly indicated diverse ethnic-specific alcohol consumption patterns (Bierut et al., 2012;Choi et al., 2005;Druesne-Pecollo et al., 2009;Edenberg, 2007;Ehlers et al., 2012). Similarly, the high derived allele frequency (DAF) in this genomic loci, particularly rs1154414 (ADH5 gene), rs4148887 (ADH4 gene), rs2156733 (ADH6 gene), rs975833 (ADH1A gene), rs1229984 (ADH1B gene), and rs4646776 (ALDH2 gene) (0.729, 0.734, 0.763, 0.789, and Table S8). Interestingly, we observed a higher-level DAF in these SNVs in South and Lingnan regions compared to the North and Central Han, which reflected the recent regional DAF changes and adaptation in this populous ethnicity and articulated different drinking habits or specific-alcohol consumption. The top SNP rs12374406 in LOC100507053 gene, as the strongest selection signature on chromosome 4, showed a higher DAF in non-East Asian population, which was also an alcohol dependence risk gene in European-American and African-American populations (Gelernter et al., 2014). In addition, we identified other selection signals in chromosome 12, including rs11066280 (p = 1.2210 -13 ) in HECTD4 gene, rs11066015 (p = 2.3010 -12 ) in ACAD gene and rs3782886 (p = 4.2910 -12 ) in BRAP gene, which were limited to EAS ancestry populations. The SNP rs11066280 showed a potential association with blood pressure level (Eom et al., 2017), while rs3782886 may cause a predisposition to an alcohol dependence disorder (Kim et al., 2019). The SNP rs11066015 indicated a strong linkage disequilibrium with rs671 (p = 4.5910 -11 ), which was a well-studied variant relevant to alcohol metabolism in East Asian populations (Igarashi et al., 2019). These three genes adjacent to the ALDH2 gene in chromosome 12q were within a large linkage disequilibrium (LD) block (Levy et al., 2009), revealing that the region had been under positive selection for a long time.

Signatures of Recent Positive Selection
We employed the iHS test to identify recent natural signatures of positive selective sweeps in the North, Central South, and Lingnan Han populations (Voight et al., 2006). For the iHS test, we computed the fraction of SNVs with |iHS| >2 in the 200 kb non-overlapping genomic windows (Pickrell et al., 2009). The top 1% of genomic clusters were identified as the candidate regions under natural selection. We determined the adaptive candidate genes near SNVs within the 100 kb regions. In total, 130 genomic regions with higher |iHS| scores were found in each population (Table S9-S12). The numbers of overlapping genomic windows of selective sweep regions across the four populations were shown in Figure S17. Only 34 (26%) sweep regions were found in all the four populations. Most regions were shared in two or three of the four subgroups. Averagely, 23.2% of the regions were independent in North, Central and South Han. However, the Lingnan Han had distinctly excess independent sweeps (50, 38.5%), which might be inherited from separate ancestral components, consistent with the conclusion from our demographic history analysis. Importantly, we found the EDAR gene in the first three sweep regions in all four subgroups, which have showed the strong signatures of positive selection in East Asians (Mou et al., 2008;Riddell et al., 2020;Tan et al., 2013).
In addition, we conducted Gene Ontology (GO) and KEGG pathway analysis for candidate genes in the top 1% genomic regions with signals of recent selection (Tables 2 and S13). The terms were selected according to p-value (< 0.05). The results of the GO analysis showed a significant enrichment of positively selected genes for ethanol metabolic process and ethanol oxidation in four sub-regions, consistent with the selective signatures by whole genome-wide singleton density score (SDS) analysis in the Han Chinese population. We also observed intriguing enrichment of keratinocyte differentiation, epidermal cell differentiation and skin development (CTSL,COL7A1,PKP3,SEC24B,SLITRK6,WNT10A,KRT10,KRT12,KRT20 and KRT23) in the South and Lingnan Han, which were not present in the North and Central Han populations. KEGG analysis found that 23 pathways were enriched in the Han population with adjusted p-values < 0.05 (Table S13). Of these pathways, Southern individuals displayed significantly enriched terms more than the northern population. Tyrosine metabolism and retinol metabolism and fatty acid degradation were identified in four sub-regions.

Imputation in the Chinese Population
We evaluated the genotype imputation accuracy of the WBBC, 1KG (Phase 3, v5a) (Genomes Project et al., 2015), CONVERGE (CONVERGE, 2015), and two combined reference panels (WBBC+EAS and WBBC+1KG) in the Chinese population ( Figure S18). For the comparison purpose, we extracted 729,958 imputed variants that were shared by the five panels. The variants were then grouped into nine MAF bins (< 0.1%, 0.1%-0.2%, 0.2%-0.3%, 0.3%-0.5%, 0.5%-1%, 1%-2%, 2%-5%, 5%-20% and 20%-50%) to differentiate the detailed imputation performance for variants with different MAF, especially for low-frequency and rare variants, which are usually difficult to impute accurately (Das et al., 2018). We obtained average R-square values (Rsq) from Minimac4 info files (Das et al., 2016) and counted the well-imputed variants (Rsq ≥ 0.8) in each MAF bin. The results showed that the WBBC panel, with almost fifteen-fold more Chinese samples than the 1KG Project, yielded substantial improvement for imputation for low-frequency and rare variants ( Figure 5A). The two combined panels, WBBC+EAS and WBBC+1KG, almost tied and possessed both the highest Rsq and well-imputed variant counts for variants with a MAF range of 0.2% to 50%, followed by the WBBC, 1KG and CONVERGE ( Figure 5A). For the rare variants with MAF less than 0.2%, WBBC+EAS panel showed the best performance, and the WBBC panel performed roughly the same as the WBBC+1KG ( Figure 5A). This result indicated that merging EAS individuals of the 1KG to increase the haplotype size of the WBBC could improve panel's performance across all MAF bins, but merging the whole 1KG cannot yield more improvement than merging-EAS-only and even not equal to it when the imputed variants were quite rare. Taking all shared variants together, the WBBC+EAS yielded the most well-imputed variants, while the CONVERGE panel imputed the least ( Figure 5B). The proportion of imputed variants with Rsq ≥ 0.8 for CONVERGE was the only one under 50% across five panels, even it was population-specific to Chinese ( Figure 5C), indicative of the importance of coverage sequencing depth of a reference panel.
To comprehensively evaluate the imputation accuracy for the five panels, we further calculated the genotype concordance rate between imputed and genotyped variants by chip array and WGS respectively (imputation vs. chip array and imputation vs. WGS).
Since the reference allele was not informative for the evaluation of imputation accuracy, the concordance rate was only calculated in the non-reference (NR) alleles.
We randomly masked one-fifteenth sites on QCed chip array for 5,679 individuals, resulting in 2,600 SNPs (chromosome 2 only). The concordance rates of NR alleles were then computed for each sample after imputation. To gain a better understanding of the distribution of the genotype concordance, we separated the NR alleles into homozygote and heterozygote. The result was shown in Figure 5D, with each dot representing an individual and the density plots around the scatter diagram showing the distributions for both NR allele homozygote and heterozygote concordance rates.
Two combined panels had the most promising distributions of the concordance rates, which were almost coincident with each other, indicating that the concordance rates for Chinese imputation could barely benefit from the extra population-diverse haplotypes of the reference panel ( Figure 5D). Besides, we could know that the peaks of two combined panels in density plots were higher than other panels, indicating that the distributions of concordance rates were more concentrated in the two combined panels ( Figure 5D). The performance of the WBBC panel was slightly behind the two combined panels, but was superior to the 1KG and CONVERGE ( Figure 5D). We also calculated the NR allele concordance rate between the imputed genotypes and the directly sequenced genotypes. There were 184 samples that were used for both the chip array genotyping and WGS of which 179 met our quality control criteria. All of them had been held-out from the WBBC to avoid the potential artifacts. We divided the variants that shared by the five panels into three variant groups, including rare, low-frequency and common variants. Not surprisingly, the two combined panels performed best and were approximately coincident and very closely followed by the WBBC ( Figure 5E). This result suggested that the improvement provided by the EAS and 1KG were unremarkable. Considering all variants together, the WBBC+EAS panel showed the highest NR allele concordance rate, followed by the WBBC+1KG, WBBC, 1KG and CONVERGE ( Figure 5F).
Overall, we employed Rsq, and NR allele concordance rate for both WGS and array genotype to measure the imputation accuracy for the five panels. Our results demonstrated the superiority of the WBBC as a reference panel for Chinese population imputation. Compared to the 1KG and CONVERGE, WBBC panel greatly improved the imputation accuracy, especially for the rare and low-frequency variants.
Besides, we found that merging EAS haplotypes into the WBBC could improve the imputation accuracy, while the extra diverse haplotypes of the 1KG could barely contribute to it.

The WBBC Genotype Imputation Server
To facilitate genotype imputation in Chinese population, we developed an imputation server with user-friendly website interface for public use (https://imputationserver.westlake.edu.cn/). Users can register and create imputation jobs freely by uploading their bgzipped array data (VCF-formatted) to our server under a strict policy of data security. To ensure the integrity of array data for next phasing and imputation, some basic QC should be performed, such as removing mismatched SNPs, monomorphism and duplicate SNPs. The server provided a choice of four reference panels to conduct the imputation, including the WBBC, 1KG Phase3, WBBC combined with EAS, and WBBC combined 1KG Phase3. All panels in both GRCh37 and GRCh38 were built to meet different needs. Besides, service of phasing was also provided in our server for users who cannot afford the corresponding heavy computational load. An email of reminder will be sent to the user when the imputation job is finished, and then user can download the imputed genotype data and the corresponding statistics file with an encrypted link. The SHAPEIT v2 (Delaneau et al., 2011) andMINIMAC v4 (Das et al., 2016) were employed in our server for phasing and imputation, respectively. More details including the policy of data security, statistics of four reference panels, and the reference manual were specified in our website.

Discussion
We initiated the Westlake BioBank for Chinese ( Finally, using R-square and NR-allele concordance rate metrics, we evaluated and compared the genotype imputation performance of the WBBC pilot with two existing panels, the 1KG Phase3 and CONVERGE. Besides, given that the haplotype size of a panel and the genetic background between the panel and array are two crucial factors for imputation accuracy (Bai et al., 2019;Das et al., 2018), we built and evaluated two more combined panels that merged the WBBC with the 1KG and EAS group by the reciprocal imputation approach (Huang et al., 2015). The 1KG Project, which consisted of 2,504 individuals from 26 worldwide populations, is the most diverse and commonly used panel for genotype imputation due to its high quality (Genomes Project et al., 2015). The CONVERGE is the largest and population-specific reference panel for Chinese imputation so far. The quality of variants, however, is not very reliable because of the low-coverage sequencing depth (CONVERGE, 2015). In our study, the WBBC panel yielded substantial improvement for imputation accuracy for low-frequency and rare variants than these two existing panels. The WBBC+EAS and WBBC+1KG panels performed better than WBBC panel alone, and the WBBC+EAS panel yielded highest imputation accuracy for rare variants, the most well-imputed variants and the highest proportion of well-imputed variants. This observation was consistent with and further expanded our previous finding that population-specificity between reference panel and the imputed array was reasonably rigorous for the Han Chinese genotype imputation, and the accuracy benefited from the increasing of haplotype size via extra diverse individuals was limited, especially for rare variants (Bai et al., 2019). Here, to maximize utilization of the WBBC pilot, we provided a large population-specific Genotype Imputation Server, which included the WBBC, 1KG and the two combined reference panels for Chinese sample imputation.
In summary, we characterized large-scale genomic variations in Chinese population.
Our finding provided the comprehensive genetic evidence for the geographical boundaries of the Qinling-Huaihe line and Nanling Mountains to divide the Han Chinese population into four subgroups. We elucidated the regional genetic structure and signatures of recent positive selection differences among the Han Chinese ethnic.
We also created a user-friendly website and high-performance genotype imputation server for Asian samples. The online resource would practically be important for the genomic variants filtration of monogenic diseases and consequent association with complex traits in the population genetics field.

AUTHOR CONTRIBUTIONS
H  (Table S1). All the participants signed the consent forms. The research program was approved by the Institutional Review Board of the Westlake University.

Whole genome sequencing and variants calling
Genomic DNA was extracted from peripheral blood samples collected from all the participants using the blood DNA extraction kit (TianGen Biotech, China). We  For the X chromosome, we called the genotype in the pseudo-autosomal region (PAR) and non-pseudo-autosomal region (non-PAR), separately. We defined the parameter ploidy with 2 for females and 1 for males in non-PAR, while the parameter was 2 for all individuals in PAR. Then we merged the data together. We only called the genotypes on the Y chromosome for males as the haploid chromosome.

Sample and variant filtrations
The sex estimation and confirmation were analyzed by the ratio of sequencing depths aligned to the X chromosome and autosomes (Wu et al., 2019). In the raw calling set (

Variant annotations
The functional annotation of variants were performed with the ANNOVAR tool (Wang et al., 2010). and were consequently retained for further analyses.

Evaluation of genotype concordance
We applied the sequencing and genotype data from 184 individuals to estimate the whole genome sequencing calling accuracy. The genotype from SNP arrays were considered as the reference allele and our calling variants were test set. We also conducted the LD-based genotype refinement for the low confidence genotypes and missing sites via BEAGLE 5.1 with default settings (Browning et al., 2018). We computed the heterozygote disconcordance, non-reference genotype concordance, homozygote genotype concordance, specificity and non-reference sensitivity for the shared variants ( Figure S20) (Linderman et al., 2014).

PCA, ADMIXTURE and effective population size inference
We removed the variants in imputed dataset by Rsq ≤ 0.95, and merged it with our sequencing dataset by GATK v4.  (Patterson et al., 2012). To obtain the optimal K value, we analyzed the ADMIXTURE with 10 random seeds for each K ranging from 2 to 8. The default 5-fold cross-validation procedure was carried out to estimate prediction errors. The K value with the highest log-likelihood was selected as the most probable model. We further estimated the history of effective population size for four regions using SMC++ (Terhorst et al., 2017). Using the ancestral components analyzed by ADMITURE with K = 4, we designated 10 most representative samples with the high sequence-depth as the distinguished lineage sample for each region. We followed the suggestion of SMC++ authors and masked all low-complexity regions of the genome using the 1KG Phase3 supported data

FST statistics, IBD analysis and genetic drift estimation
We next performed FST statistics (Weir and Cockerham, 1984), genetic drift estimation and identity-by-descent (IBD) analysis. We calculated weighted and 'B' respectively. The hierarchical clustering was then performed based on the matrix by using the same method as FST clustering.
We computed relative genetic drift estimates between each province using TreeMix v1.13 with default settings on the same SNPs as the FST analysis used (Pickrell and Pritchard, 2012). The genetic drift was represented by a 'drift parameter' in TreeMix, more details were described elsewhere in the study (Pickrell and Pritchard, 2012). A maximum likelihood tree for the Han Chinese population from 27 administrative divisions was then plotted. Note that the Heilongjiang province, which was located in the northern most part of China, was set as the reference point. For judging the confidence in our tree topology, ten bootstrap replicates were generated by setting the -bootstrap -k flag ranging from 10 to 100 (step-size = 10) to resample blocks of contiguous SNPs for drift parameter estimation. Plink version 1.9 was used in this part to calculate allele counts of SNPs for reformatting of input data that the software required (Chang et al., 2015).

Calculation of the singleton density score
The Singleton Density Score (SDS) can be applied to infer recent allele frequency changes in the past 2,000-3,000 years by calculating the distance between the nearest singletons on either side of a test-SNP using whole-genome sequence data (Field et al., 2016

Calculation of iHS values
To detect the genomic signatures of recent positive selection, we computed the integrated haplotype score (iHS) using the R package rehh v3.1.0 (Gautier et al., 2017;Voight et al., 2006). The data from 2,860 North, 148 Central, 5,274 South and 92 Lingnan Han Chinese individuals were extracted from the imputed and phased files. SNVs (Pickrell et al., 2009). The genes located in the top 1% of windows were considered to be significant regions. The genes or genomic regions were defined within 100 kb of the identified non-overlapping SNVs. We performed the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis for the adaptive candidate genes using R package clusterProfiler v3.16.0 (Yu et al., 2012).

Reference panel construction
The multi-allelic sites were split into bi-allelic sites via the BCFtools norm tool

Quality control, pre-phasing and imputation
The rigorous variant-level and sample-level quality control were then performed as

Reference panel evaluation for imputation in the Chinese population
We evaluated the accuracy of genotype imputation for five reference panels in the Chinese population. These panels included the most widely used panel, the 1KG (Genomes Project et al., 2015), and the largest Chinese-specific panel CONVERGE (CONVERGE, 2015), and our own WBBC panel, and two combined panels that merged the WBBC datasets with the 1KG and EAS respectively. The imputation accuracy of these panels was then compared with each other by three different metrics.
The design for the entire evaluation was detailed in Figure S18. We transformed the format of five panels into M3VCF and performed genotype imputation by jointly using Minimac3/4 (Das et al., 2016). The length of chunks for imputation was set to 20MB with 4MB overlapped between contiguous chunks. The accuracy of different reference panels was evaluated by three metrics. In the first one, the estimated value of the squared correlation between imputed genotypes and true, unobserved genotypes (i.e., R-square) (Das et al., 2016), was calculated based on the imputed dosage and produced with the imputation results by Minimac4. This value was also the most commonly used metric. In this study, an imputed variant with the Rsq ≥ 0.8 was considered as 'well-imputed'. The second metric was non-reference allele (NR-allele) concordance. The variants that had been masked in the beginning were imputed by different panels. We then calculated the NR-allele concordance between imputed genotypes and the original ones in chip array for each individual (Imputed vs. Array) (Huang et al., 2009). The third metric was similar to the second, but the NR-allele concordance was calculated between imputed genotypes and WGS genotypes by the samples that we hold-out (Imputed vs. WGS). The definition of concordance and corresponding formula was specified in Figure S20 (Linderman et al., 2014).

Genotype imputation server
Using the WBBC Phase 1 WGS data and 1KG Phase 3 data (Genomes Project et al., 2015), we developed a genotype imputation server for public use. We included the WBBC and 1KG reference panel in the server and re-constructed two combined panels, the WBBC+EAS and WBBC+1KG. All panels were built in both GRCh37 and GRCh38  We specified a policy of data security to protect the user's data across the entire interaction process with the server. Also, we wrote a help manual and illustrated all processes of our pipeline to facilitate users. Detailed information could be found in our website (https://wbbc.westlake.edu.cn).

Data and Code Availability
The allele frequencies of all variants and genotype imputation server are available via    Benjamini and Hochberg (BH) correction was applied to enrichment p values for multiple testing.
"-" represents that the adjusted p-value is >0.05.     The 1KG means 1000G Phase3 and EAS means East Asian group in 1000G Phase 3.
All imputations were conducted on chromosome 2.

54
The BWA and GATK4 recommendation pipeline and variant analysis strategy were performed on the WBBC-cohort dataset with the GRCh38 and GRCh37 human reference genome, respectively.              The bootstrap replicates were generated by the -bootstrap -k flag of TreeMix software.
The plots were y-axis free. The scale bar shows ten times the average standard error of the entries in the sample covariance matrix for the estimated drift parameter.  The red dots denoted the samples removed by the FREEMIX scores (> 0.05).

Figure S20. Definition of genotype concordance, specificity, and sensitivity
Letter A represents reference allele while B represents non-reference (NR) allele. The dot means missing called allele. For each metric, the value equals to the corresponding red rectangles divided by all rectangles with a blue border.