Cohorts and sample preparation
We collected whole exome sequencing (WES) and whole genome sequencing (WGS) data for Korean individuals from independent research groups in Korea (Supplementary Table 1). All sequencing data were obtained from normal tissues or blood samples following standard protocols [5]. This project was performed with the approval of the Institutional Review Board of each group (Seoul National University and others), and written informed consent was obtained from all donors where applicable. All experiments were performed on de-identified samples and in accordance with relevant guidelines and regulations.
Variant calling
We used BWA-MEM v0.7.17 [8] with default options to map raw reads to the GRCh38 + decoy reference sequence. After marking duplicates and sorting by coordinate with MarkDuplicatesSpark, base quality scores were recalibrated with BQSRPipelineSpark, implemented in GATK version 4.1.3.0 [9]. Qualimap v2.2.1 [10] was used to generate quality-control metrics for the mapped sequence data. Single nucleotide variants (SNVs) and small insertions and deletions (indels) were then called for each sample using GATK HaplotypeCaller with the option ‘-ERC GVCF’. To jointly genotype samples, we created a GenomicsDB workspace using GenomicsDBImport in GATK and followed the GATK Best Practices guidelines. Briefly, SNVs and indels were recalibrated with GATK’s VQSR model to select 99.7% and 99.0% of true sites from the training set, respectively. The detailed workflow is described in Supplementary Fig. 1. Further analyses were mostly performed with Hail [11], an open-source Python library for genome data analysis. After merging the WES and WGS data using Hail, we excluded multi-allelic variants and variants that had genotype quality (GQ) < 20, read depth (DP) < 10, or allelic balance (AB) < 0.2, or that fell in low-complexity regions [12] (Supplementary Fig. 2).
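The genotype-level hard filters described above (GQ ≥ 20, DP ≥ 10, AB ≥ 0.2) can be sketched as a simple predicate. This is illustrative only; the actual filtering was done with Hail expressions, and `passes_genotype_filters` is a hypothetical helper name, not part of any library API.

```python
# Sketch of the genotype-level hard filters: GQ >= 20, DP >= 10, and, for
# heterozygous calls, allelic balance >= 0.2. Names are illustrative.

def passes_genotype_filters(gq, dp, ad_alt, is_het, min_gq=20, min_dp=10, min_ab=0.2):
    """Return True if a called genotype survives the hard filters."""
    if gq < min_gq or dp < min_dp:
        return False
    if is_het:
        # allelic balance: fraction of reads supporting the alternate allele
        ab = ad_alt / dp if dp > 0 else 0.0
        if ab < min_ab:
            return False
    return True
```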
Sex inference
We inferred the sex of each sample by calculating sex chromosome ploidy, defined as the coverage of a sex chromosome divided by the coverage of chromosome 21. To assign X and Y ploidy cutoffs, we calculated F-stat scores based on linkage disequilibrium (LD)-pruned bi-allelic SNVs (MAF > 0.05, call rate > 0.99, inbreeding coefficient score ≥ −0.03, and R² for LD pruning < 0.1) using the ‘annotate_sex’ function of the gnomAD Hail library with parameters ‘male_threshold = 0.8, female_threshold = 0.5’. An XX karyotype was assigned if X chromosome ploidy fell within [1.7, 3.4] and [1.55, 2.45] for WES and WGS, respectively, while an XY karyotype was assigned when Y chromosome ploidy fell within [0.2, 2.3] and [0.45, 1.11] and X chromosome ploidy was below 1.65 and 1.50 for WES and WGS, respectively (Supplementary Fig. 3). Only samples assigned an XX or XY karyotype were used in subsequent analyses; 92 samples were excluded as being of ambiguous sex.
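The karyotype-assignment rules above can be written out as a small decision function. The cutoffs below are those stated in the text; the function itself is a sketch, not the gnomAD/Hail implementation.

```python
# Sketch of the karyotype assignment rules: ploidy is chromosome coverage
# divided by chromosome 21 coverage; cutoffs are the WES/WGS values stated
# in the text. Not the actual gnomAD/Hail implementation.

CUTOFFS = {
    "WES": {"xx_x": (1.7, 3.4),   "xy_y": (0.2, 2.3),   "xy_x_max": 1.65},
    "WGS": {"xx_x": (1.55, 2.45), "xy_y": (0.45, 1.11), "xy_x_max": 1.50},
}

def infer_karyotype(x_ploidy, y_ploidy, platform):
    """Assign 'XX', 'XY', or 'ambiguous' from sex chromosome ploidies."""
    c = CUTOFFS[platform]
    x_lo, x_hi = c["xx_x"]
    if x_lo <= x_ploidy <= x_hi:
        return "XX"
    y_lo, y_hi = c["xy_y"]
    if y_lo <= y_ploidy <= y_hi and x_ploidy < c["xy_x_max"]:
        return "XY"
    return "ambiguous"  # such samples were excluded (92 in total)
```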
Relatedness inference
To remove close relatives, we estimated kinship and the probability of identity-by-descent (IBD) being zero for every pair of samples, based on LD-pruned variants with MAF ≥ 0.001, call rate > 0.99, HWE P > 1.0 × 10⁻⁸, inbreeding coefficient score > −0.025, and R² for LD pruning < 0.1. After calculating kinship with the ‘pc_relate’ function [13] in Hail, we selected the maximal independent set of samples with kinship < 0.1 using ‘maximal_independent_set’ [14], also in Hail. From each related sample pair, we retained the sample with the higher coverage depth.
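The pruning step can be illustrated with a greedy stand-in for Hail's `maximal_independent_set`: for each pair at or above the kinship threshold, drop the lower-coverage sample. This is a simplification for illustration, not the Hail algorithm.

```python
# Greedy illustration of relatedness pruning: from each pair with
# kinship >= 0.1, discard the lower-coverage sample. A stand-in for
# Hail's 'maximal_independent_set', not its actual implementation.

def prune_relatives(pairs, coverage, threshold=0.1):
    """pairs: iterable of (sample_a, sample_b, kinship);
    coverage: dict mapping sample -> mean depth.
    Returns the set of samples to KEEP."""
    keep = set(coverage)
    # resolve the most closely related pairs first
    for a, b, k in sorted(pairs, key=lambda p: -p[2]):
        if k >= threshold and a in keep and b in keep:
            keep.discard(a if coverage[a] < coverage[b] else b)
    return keep
```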
Population structure analysis
All bi-allelic autosomal SNVs from our dataset and the 1000 Genomes Project Phase 3 (KG) [15] were merged and filtered; variants were retained if they had MAF > 0.001, call rate > 0.99, HWE P > 1.0 × 10⁻⁸, and inbreeding coefficient score > −0.025. We then pruned the variants to those with LD R² < 0.1. To perform a principal component analysis (PCA) on the Hardy-Weinberg-normalized variants, we used the ‘hwe_normalized_pca’ function of Hail with k = 30. Each sample was assigned the ancestry with the maximum probability emitted by a Random Forest model trained on the KG PCA result. We removed non-Korean or Korean-outlier samples iteratively until the Chinese, Japanese, Korean, and Vietnamese populations became distinguishable on PCs 1 and 2.
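The assignment step can be pictured as classifying each sample in PC space. The paper uses a Random Forest trained on KG coordinates; the nearest-centroid rule below is a deliberately simplified stand-in (using only the standard library) that conveys the same assign-to-most-probable-ancestry idea.

```python
# Simplified stand-in for the ancestry-assignment step: the paper trains a
# Random Forest on KG PCA coordinates; here a nearest-centroid rule in PC
# space illustrates the idea. Centroid values are illustrative only.
import math

def assign_ancestry(sample_pcs, centroids):
    """sample_pcs: tuple of PC coordinates for one sample;
    centroids: dict mapping ancestry label -> mean PC coordinates.
    Returns the label of the nearest centroid."""
    return min(centroids, key=lambda a: math.dist(sample_pcs, centroids[a]))
```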
Sample QC
The overall process is summarized in Supplementary Fig. 2. First, we excluded samples with ambiguous clinical status or with mean coverage depth < 40× (WES) or < 10× (WGS). Samples with ambiguous or abnormal sex were then excluded, as were duplicated samples and closely related samples. We further removed samples with ambiguous ethnicity, followed by samples with a Het/Hom ratio > 1.8 (Supplementary Figs. 2, 4, and 5). Finally, after combining the WES and WGS data, we re-ran the relatedness inference procedure to remove WES samples that overlapped with, or were related to, WGS samples.
Variant quality control
The overall process is summarized in Supplementary Fig. 2. Variants were considered to violate Hardy-Weinberg equilibrium (HWE; P < 1.0 × 10⁻⁶) when their allele frequency was > 0.01 or their inbreeding coefficient score was < −0.03, and such variants were removed. Functional annotation was performed with the Variant Effect Predictor (VEP) version 101 [16]. For each variant, we selected the most severe functional consequence using the gnomAD package of Hail. Ti/Tv and Het/Hom scores were computed using the ‘compute_sample_qc_metric’ function implemented in Hail (Supplementary Fig. 4).
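An HWE check of the kind applied above can be sketched with a 1-d.f. chi-square goodness-of-fit test on genotype counts (the exact test used in practice differs, but the threshold logic is the same; the P < 10⁻⁶ cutoff is the one stated in the text).

```python
# Minimal sketch of an HWE test: chi-square goodness of fit (1 d.f.) of
# observed genotype counts against Hardy-Weinberg expectations. A stand-in
# for the exact test used in practice.
import math

def hwe_chisq_p(n_hom_ref, n_het, n_hom_alt):
    """Return the chi-square(1) P-value for departure from HWE."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)          # reference-allele frequency
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # survival function of chi-square with 1 d.f.: P = erfc(sqrt(x/2))
    return math.erfc(math.sqrt(chi2 / 2))
```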
Phasing
After carrying out sample-level and variant-level quality control, WGS data were phased with SHAPEIT4 version 4.2.2 [17]. As input to SHAPEIT4, we converted the VCF file to PLINK file format with the options ‘--geno 0.1 --maf 0.001’ to keep SNVs with missingness < 10% and MAF > 0.001. We used the genetic maps for reference version hg38 that are provided with SHAPEIT4 [18]. We also phased our data with Beagle 5.2 (beagle.21Apr21.304.jar) [19], for which we used the hg38 genetic map available on the Beagle website [20] and the reference panel created by the 1000 Genomes Project.
Runs of homozygosity (ROH)
PLINK v1.90b6.12 [21,22] was used to call ROH regions from SHAPEIT-phased data with the options ‘--maf 0.05 --hwe 0.00005 --homozyg --homozyg-snp 50 --homozyg-kb 500 --homozyg-density 10 --homozyg-gap 10 --homozyg-window-snp 50 --homozyg-window-missing 5 --homozyg-window-het 1 --homozyg-window-threshold 0.05’. To ensure a fair comparison of ROH intervals from KOVA 2 with other populations in the KG, the regions were called from randomly selected sets of 105 samples from KOVA 2. After merging the ROH results from the KOVA 2 and KG data, we calculated FROH scores, which represent inbreeding levels, using the ‘Froh_inbreeding’ function of the detectRuns package version 0.9.6 [23].
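The FROH statistic itself is straightforward: the fraction of the autosomal genome covered by a sample's ROH segments. The sketch below assumes an approximate autosome length of 2.88 Gb as an illustrative constant; the actual calculation was done with detectRuns' `Froh_inbreeding`.

```python
# Sketch of the FROH calculation: total ROH length divided by autosomal
# genome length. The 2.88 Gb constant is an illustrative assumption, not
# a value taken from the text.

AUTOSOME_BP = 2_880_000_000  # approximate autosomal genome length (assumption)

def froh(roh_segments_bp, genome_bp=AUTOSOME_BP):
    """roh_segments_bp: iterable of ROH segment lengths in base pairs.
    Returns the fraction of the genome in runs of homozygosity."""
    return sum(roh_segments_bp) / genome_bp
```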
Regions of positive selection
Variants under positive selection sweeps were identified from phased KOVA 2 and KG data using iSAFE v1.0.7 [24] with default options (--MaxRegionSize 6000000 --window 300 --MaxRank 15 --MaxFreq 0.95 --IgnoreGaps), plus the performance-improving parameter ‘--vcf-cont’ with random outgroup (non-target) samples comprising 10% of the data.
Effective population size estimation
To estimate historical effective population size, we used the IBDNe software [25] according to the recommended protocol. Briefly, after detecting IBD segments with hap-IBD (hap-IBD.jar) [26], we refined them by removing breaks and short gaps from the segments using merge-ibd-segments.17Jan20.102.jar [27]. Finally, we used ibdne.23Apr20.ae9.jar [25] with default options to estimate the effective population size from the refined IBD segments.
Allele ages
Genealogical Estimation of Variant Age (GEVA) v1beta [28] with parameters ‘--Ne 10000 --mut 1e-8 --maxConcordant 500 --maxDiscordant 500’ was used to estimate the ages of variants from autosomal haplotype data phased by SHAPEIT4. Allele ages were computed with the joint clock model, which combines the mutation and recombination clock models. To compare allele ages estimated from our data with those estimated from 1000 Genomes data, we downloaded the Atlas of Variant Age from the developers’ website [29]. Chimpanzee variants called from 25 individuals were downloaded from the Great Ape Genome Project [30].
Imputation of array data
Imputation of variants based on KOVA 2 was performed as previously described [7]. Variants present on the Infinium Global Screening Array (GSA-24v3-0_A1) were extracted from WGS data of 197 COVID-19 patients and imputed using IMPUTE2 [31]. Panel imputation accuracy was compared using the aggregated squared Pearson correlation coefficient (R²) between the imputed genotype dosages and the true genotypes from the genome data.
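The accuracy metric above is the squared Pearson correlation between imputed dosages and true genotypes, which can be computed directly; the function below is a standard-library sketch of that calculation (the function name is illustrative).

```python
# Sketch of the imputation accuracy metric: squared Pearson correlation (R²)
# between imputed genotype dosages and true genotypes, aggregated over calls.
import math

def dosage_r2(dosages, truths):
    """dosages: imputed dosages in [0, 2]; truths: true genotypes {0, 1, 2}."""
    n = len(dosages)
    mx = sum(dosages) / n
    my = sum(truths) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(dosages, truths))
    var_x = sum((x - mx) ** 2 for x in dosages)
    var_y = sum((y - my) ** 2 for y in truths)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either side is constant
    return cov * cov / (var_x * var_y)
```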
Calling of Structural Variants (SVs)
Manta v1.6 [32] was used to call structural variants for individual WGS samples. The convertInversion.py script provided with Manta was applied to represent inversion events in the manner of gnomAD SV v2.1 [33]. Slightly different SV representations across VCF files were merged using svimmer [34]. An SV was defined as known if it overlapped with any entry in the gnomAD SV v2.1 dataset.
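The "known SV" criterion reduces to an interval-overlap check against the gnomAD SV catalog. The sketch below uses simple half-open intervals; SV-type matching and reciprocal-overlap thresholds are omitted for brevity, and the function name is illustrative.

```python
# Sketch of the "known SV" criterion: an SV is known if its interval overlaps
# any catalog entry on the same chromosome. Half-open [start, end) intervals;
# SV-type and reciprocal-overlap rules are intentionally omitted.

def is_known_sv(sv, catalog):
    """sv: (chrom, start, end); catalog: iterable of (chrom, start, end).
    Returns True if sv overlaps any catalog entry."""
    chrom, start, end = sv
    return any(c == chrom and start < e and s < end for c, s, e in catalog)
```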