Cohorts and sample preparation
We collected whole exome sequencing (WES) and whole genome sequencing (WGS) data for Korean individuals from independent research groups in Korea (Supplementary Table 1). All sequencing data were obtained from normal tissues or blood samples following standard protocols [5]. This project was performed with the approval of the Institutional Review Board of each group (Seoul National University and others), and written informed consent was obtained from all donors where applicable. All experiments were performed on de-identified samples and in accordance with relevant guidelines and regulations.
Variant calling
We used BWA-MEM v0.7.17 [8] with default options to map raw reads to the GRCh38 + decoy reference sequence. After marking duplicates and sorting by coordinate with MarkDuplicatesSpark, base quality scores were recalibrated with BQSRPipelineSpark, implemented in GATK version 4.1.3.0 [9]. Qualimap v2.2.1 [10] was used to generate quality-control metrics for the mapped sequence data. Single nucleotide variants (SNVs) and small insertions and deletions (indels) were then called for each sample using GATK HaplotypeCaller with the option ‘-ERC GVCF’. To jointly genotype samples, we created a GenomicsDB workspace using GenomicsDBImport in GATK and followed the GATK Best Practices guidelines. Briefly, SNVs and indels were recalibrated with GATK’s VQSR model to select 99.7% and 99.0% of true sites from the training set, respectively. The detailed workflow is described in Supplementary Fig. 1. Further analyses were mostly performed with Hail [11], an open-source Python library for genome data analysis. After merging the WES and WGS data using Hail, we excluded multi-allelic variants and variants that had genotype quality (GQ) < 20, read depth (DP) < 10, or allelic balance (AB) < 0.2, or that fell in low-complexity regions [12] (Supplementary Fig. 2).
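The genotype-level hard filters described above (GQ ≥ 20, DP ≥ 10, AB ≥ 0.2) can be sketched as a simple predicate. This is illustrative only; the actual filtering was done with Hail expressions, and `passes_genotype_filters` is a hypothetical helper name, not part of any library API.

```python
# Sketch of the genotype-level hard filters: GQ >= 20, DP >= 10, and, for
# heterozygous calls, allelic balance >= 0.2. Names are illustrative.

def passes_genotype_filters(gq, dp, ad_alt, is_het, min_gq=20, min_dp=10, min_ab=0.2):
    """Return True if a called genotype survives the hard filters."""
    if gq < min_gq or dp < min_dp:
        return False
    if is_het:
        # allelic balance: fraction of reads supporting the alternate allele
        ab = ad_alt / dp if dp > 0 else 0.0
        if ab < min_ab:
            return False
    return True
```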
Sex inference
We inferred the sex of each sample by calculating sex chromosome ploidy, defined as the coverage of a sex chromosome divided by the coverage of chromosome 21. To assign X and Y ploidy cutoffs, we calculated F-stat scores based on linkage disequilibrium (LD)-pruned bi-allelic SNVs (MAF > 0.05, call rate > 0.99, inbreeding coefficient score ≥ −0.03, and R² for LD pruning < 0.1) using the ‘annotate_sex’ function of the gnomAD Hail library with parameters ‘male_threshold = 0.8, female_threshold = 0.5’. An XX karyotype was assigned if X chromosome ploidy fell within [1.7, 3.4] and [1.55, 2.45] for WES and WGS, respectively, while an XY karyotype was assigned when Y chromosome ploidy fell within [0.2, 2.3] and [0.45, 1.11] and X chromosome ploidy was below 1.65 and 1.50 for WES and WGS, respectively (Supplementary Fig. 3). Only samples assigned an XX or XY karyotype were used in subsequent analyses; 92 samples were excluded as being of ambiguous sex.
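The karyotype-assignment rules above can be written out as a small decision function. The cutoffs below are those stated in the text; the function itself is a sketch, not the gnomAD/Hail implementation.

```python
# Sketch of the karyotype assignment rules: ploidy is chromosome coverage
# divided by chromosome 21 coverage; cutoffs are the WES/WGS values stated
# in the text. Not the actual gnomAD/Hail implementation.

CUTOFFS = {
    "WES": {"xx_x": (1.7, 3.4),   "xy_y": (0.2, 2.3),   "xy_x_max": 1.65},
    "WGS": {"xx_x": (1.55, 2.45), "xy_y": (0.45, 1.11), "xy_x_max": 1.50},
}

def infer_karyotype(x_ploidy, y_ploidy, platform):
    """Assign 'XX', 'XY', or 'ambiguous' from sex chromosome ploidies."""
    c = CUTOFFS[platform]
    x_lo, x_hi = c["xx_x"]
    if x_lo <= x_ploidy <= x_hi:
        return "XX"
    y_lo, y_hi = c["xy_y"]
    if y_lo <= y_ploidy <= y_hi and x_ploidy < c["xy_x_max"]:
        return "XY"
    return "ambiguous"  # such samples were excluded (92 in total)
```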
Relatedness inference
To remove close relatives, we estimated kinship and the probability of identity-by-descent (IBD) being zero for every pair of samples, based on LD-pruned variants with MAF ≥ 0.001, call rate > 0.99, HWE P > 1.0 × 10⁻⁸, inbreeding coefficient score > −0.025, and R² for LD pruning < 0.1. After calculating kinship with the ‘pc_relate’ function [13] in Hail, we selected the maximal independent set of samples with kinship < 0.1 using ‘maximal_independent_set’ [14], also in Hail. From each related sample pair, we retained the sample with the higher coverage depth.
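The pruning step can be illustrated with a greedy stand-in for Hail's `maximal_independent_set`: for each pair at or above the kinship threshold, drop the lower-coverage sample. This is a simplification for illustration, not the Hail algorithm.

```python
# Greedy illustration of relatedness pruning: from each pair with
# kinship >= 0.1, discard the lower-coverage sample. A stand-in for
# Hail's 'maximal_independent_set', not its actual implementation.

def prune_relatives(pairs, coverage, threshold=0.1):
    """pairs: iterable of (sample_a, sample_b, kinship);
    coverage: dict mapping sample -> mean depth.
    Returns the set of samples to KEEP."""
    keep = set(coverage)
    # resolve the most closely related pairs first
    for a, b, k in sorted(pairs, key=lambda p: -p[2]):
        if k >= threshold and a in keep and b in keep:
            keep.discard(a if coverage[a] < coverage[b] else b)
    return keep
```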
Population structure analysis
All bi-allelic autosomal SNVs from our dataset and the 1000 Genomes Project Phase 3 (KG) [15] were merged and filtered; variants were retained if they had MAF > 0.001, call rate > 0.99, HWE P > 1.0 × 10⁻⁸, and inbreeding coefficient score > −0.025. We then pruned the variants to those with LD R² < 0.1. To perform a principal component analysis (PCA) on the Hardy-Weinberg-normalized variants, we used the ‘hwe_normalized_pca’ function of Hail with k = 30. Each sample was assigned the ancestry with the maximum probability emitted by a Random Forest model trained on the KG PCA result. We removed non-Korean or Korean-outlier samples iteratively until the Chinese, Japanese, Korean, and Vietnamese populations became distinguishable on PCs 1 and 2.
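The assignment step can be pictured as classifying each sample in PC space. The paper uses a Random Forest trained on KG coordinates; the nearest-centroid rule below is a deliberately simplified stand-in (using only the standard library) that conveys the same assign-to-most-probable-ancestry idea.

```python
# Simplified stand-in for the ancestry-assignment step: the paper trains a
# Random Forest on KG PCA coordinates; here a nearest-centroid rule in PC
# space illustrates the idea. Centroid values are illustrative only.
import math

def assign_ancestry(sample_pcs, centroids):
    """sample_pcs: tuple of PC coordinates for one sample;
    centroids: dict mapping ancestry label -> mean PC coordinates.
    Returns the label of the nearest centroid."""
    return min(centroids, key=lambda a: math.dist(sample_pcs, centroids[a]))
```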
Sample QC
The overall process is summarized in Supplementary Fig. 2. First, we excluded samples with ambiguous clinical status or with mean coverage depth < 40× (WES) or < 10× (WGS). Samples with ambiguous or abnormal sex were then excluded, as were duplicated samples and closely related samples. We further removed samples with ambiguous ethnicity, followed by samples with a Het/Hom ratio > 1.8 (Supplementary Figs. 2, 4, and 5). Finally, after combining the WES and WGS data, we re-ran the relatedness inference procedure to remove WES samples that overlapped with, or were related to, WGS samples.
Variant quality control
The overall process is summarized in Supplementary Fig. 2. Variants were considered to violate Hardy-Weinberg equilibrium (HWE; P < 1.0 × 10⁻⁶) when their allele frequency was > 0.01 or their inbreeding coefficient score was < −0.03, and such variants were removed. Functional annotation was performed with the Variant Effect Predictor (VEP) version 101 [16]. For each variant, we selected the most severe functional consequence using the gnomAD package of Hail. Ti/Tv and Het/Hom scores were computed using the ‘compute_sample_qc_metric’ function implemented in Hail (Supplementary Fig. 4).
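An HWE check of the kind applied above can be sketched with a 1-d.f. chi-square goodness-of-fit test on genotype counts (the exact test used in practice differs, but the threshold logic is the same; the P < 10⁻⁶ cutoff is the one stated in the text).

```python
# Minimal sketch of an HWE test: chi-square goodness of fit (1 d.f.) of
# observed genotype counts against Hardy-Weinberg expectations. A stand-in
# for the exact test used in practice.
import math

def hwe_chisq_p(n_hom_ref, n_het, n_hom_alt):
    """Return the chi-square(1) P-value for departure from HWE."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)          # reference-allele frequency
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # survival function of chi-square with 1 d.f.: P = erfc(sqrt(x/2))
    return math.erfc(math.sqrt(chi2 / 2))
```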
Phasing
After carrying out sample-level and variant-level quality control, WGS data were phased with SHAPEIT4 version 4.2.2 [17]. As input to SHAPEIT4, we converted the VCF file to PLINK file format with the options ‘--geno 0.1 --maf 0.001’ to keep SNVs with missingness < 10% and MAF > 0.001. We used the genetic maps for reference version hg38 that are provided with SHAPEIT4 [18]. We also phased our data with Beagle 5.2 (beagle.21Apr21.304.jar) [19], for which we used the hg38 genetic map available on the Beagle website [20] and the reference panel created by the 1000 Genomes Project.
Runs of homozygosity (ROH)
PLINK v1.90b6.12 [21,22] was used to call ROH regions from SHAPEIT-phased data with the options ‘--maf 0.05 --hwe 0.00005 --homozyg --homozyg-snp 50 --homozyg-kb 500 --homozyg-density 10 --homozyg-gap 10 --homozyg-window-snp 50 --homozyg-window-missing 5 --homozyg-window-het 1 --homozyg-window-threshold 0.05’. To ensure a fair comparison of ROH intervals from KOVA 2 with other populations in the KG, the regions were called from randomly selected sets of 105 samples from KOVA 2. After merging the ROH results from the KOVA 2 and KG data, we calculated FROH scores, which represent inbreeding levels, using the ‘Froh_inbreeding’ function of the detectRuns package version 0.9.6 [23].
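The FROH statistic itself is straightforward: the fraction of the autosomal genome covered by a sample's ROH segments. The sketch below assumes an approximate autosome length of 2.88 Gb as an illustrative constant; the actual calculation was done with detectRuns' `Froh_inbreeding`.

```python
# Sketch of the FROH calculation: total ROH length divided by autosomal
# genome length. The 2.88 Gb constant is an illustrative assumption, not
# a value taken from the text.

AUTOSOME_BP = 2_880_000_000  # approximate autosomal genome length (assumption)

def froh(roh_segments_bp, genome_bp=AUTOSOME_BP):
    """roh_segments_bp: iterable of ROH segment lengths in base pairs.
    Returns the fraction of the genome in runs of homozygosity."""
    return sum(roh_segments_bp) / genome_bp
```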
Regions of positive selection
Variants under positive selection sweeps were identified from phased KOVA 2 and KG data using iSAFE v1.0.7 [24] with default options (--MaxRegionSize 6000000 --window 300 --MaxRank 15 --MaxFreq 0.95 --IgnoreGaps), plus the performance-improving parameter ‘--vcf-cont’ with random outgroup (non-target) samples comprising 10% of the data.
Effective population size estimation
To estimate historical effective population size, we used the IBDNe software [25] according to the recommended protocol. Briefly, after detecting IBD segments with hap-IBD (hap-IBD.jar) [26], we refined them by removing breaks and short gaps from the segments using merge-ibd-segments.17Jan20.102.jar [27]. Finally, we used ibdne.23Apr20.ae9.jar [25] with default options to estimate the effective population size from the refined IBD segments.
Allele ages
Genealogical Estimation of Variant Age (GEVA) v1beta [28] with parameters ‘--Ne 10000 --mut 1e-8 --maxConcordant 500 --maxDiscordant 500’ was used to estimate the ages of variants from autosomal haplotype data phased by SHAPEIT4. Allele ages were computed with the joint clock model, which combines the mutation and recombination clock models. To compare allele ages estimated from our data with those estimated from 1000 Genomes data, we downloaded the Atlas of Variant Age from the developers’ website [29]. Chimpanzee variants called from 25 individuals were downloaded from the Great Ape Genome Project [30].
Imputation of array data
Imputation of variants based on KOVA 2 was performed as previously described [7]. Variants present on the Infinium Global Screening Array (GSA-24v3-0_A1) were extracted from WGS data of 197 COVID-19 patients and imputed using IMPUTE2 [31]. Panel imputation accuracy was compared using the aggregated squared Pearson correlation coefficient (R²) between the imputed genotype dosages and the true genotypes from the genome data.
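The accuracy metric above is the squared Pearson correlation between imputed dosages and true genotypes, which can be computed directly; the function below is a standard-library sketch of that calculation (the function name is illustrative).

```python
# Sketch of the imputation accuracy metric: squared Pearson correlation (R²)
# between imputed genotype dosages and true genotypes, aggregated over calls.
import math

def dosage_r2(dosages, truths):
    """dosages: imputed dosages in [0, 2]; truths: true genotypes {0, 1, 2}."""
    n = len(dosages)
    mx = sum(dosages) / n
    my = sum(truths) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(dosages, truths))
    var_x = sum((x - mx) ** 2 for x in dosages)
    var_y = sum((y - my) ** 2 for y in truths)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either side is constant
    return cov * cov / (var_x * var_y)
```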
Calling of Structural Variants (SVs)
Manta v1.6 [32] was used to call structural variants for individual WGS samples. The convertInversion.py script provided with Manta was applied to represent inversion events in the manner of gnomAD SV v2.1 [33]. Slightly different SV representations across VCF files were merged using svimmer [34]. An SV was defined as known if it overlapped with any entry in the gnomAD SV v2.1 dataset.
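The "known SV" criterion reduces to an interval-overlap check against the gnomAD SV catalog. The sketch below uses simple half-open intervals; SV-type matching and reciprocal-overlap thresholds are omitted for brevity, and the function name is illustrative.

```python
# Sketch of the "known SV" criterion: an SV is known if its interval overlaps
# any catalog entry on the same chromosome. Half-open [start, end) intervals;
# SV-type and reciprocal-overlap rules are intentionally omitted.

def is_known_sv(sv, catalog):
    """sv: (chrom, start, end); catalog: iterable of (chrom, start, end).
    Returns True if sv overlaps any catalog entry."""
    chrom, start, end = sv
    return any(c == chrom and start < e and s < end for c, s, e in catalog)
```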