Evaluation of optimal methods and ancestries for calculating polygenic risk scores in East Asian population

doi:10.21203/rs.3.rs-2489951/v1

This work is licensed under a CC BY 4.0 License

Journal publication: published 06 Nov, 2023 in Scientific Reports. This is the latest preprint version.

Polygenic risk scores (PRSs) have been studied for predicting human diseases, and various methods for PRS calculation have been developed. Most PRS studies to date have focused on European ancestry, and the performance of PRS has not been sufficiently assessed in East Asia. Herein, we evaluated the best-performing PRSs for East Asian populations using data for seven diseases: asthma, breast cancer, coronary artery disease, glaucoma, hyperthyroidism, hypothyroidism, and type 2 diabetes (T2D). A total of 42 PRSs were generated for East Asian samples by applying three PRS methods [linkage disequilibrium (LD) pruning and P-value thresholding (P + T), PRSice, and PRS-CS] and genome-wide association study (GWAS) data from two biobank-scale datasets [European (UK Biobank) and East Asian (BioBank Japan)] to seven diseases. In most cases, PRS-CS showed better predictive performance for disease risk than the other methods and classified low- and high-risk groups more clearly. In addition, the East Asian GWAS data outperformed those from Europeans for T2D PRS, but neither of the two GWAS ancestries showed a dominant effect on PRS performance for other diseases. For East Asian populations, PRS-CS using large-sample GWAS data is likely to provide superior performance, and a PRS generated with GWAS from other ancestries may also perform well.

Biological sciences/Genetics/Genotype/Genetic predisposition to disease

Biological sciences/Genetics/Genotype

Genome-wide association studies (GWAS) have provided information on a large number of genetic variants that contribute to the risk of complex diseases. The genetic susceptibility of individuals to disease can be estimated by calculating the polygenic risk score (PRS) from these risk variants. Interest in PRS is considerable and the field is growing rapidly, with more than 2,700 PRS algorithms presented in the open resource catalog¹. In addition, evidence for the clinical utility of PRS in diseases such as coronary artery disease (CAD)², breast cancer³, and diabetes⁴ is increasing⁵, and the possibility of applying PRS for efficient diagnosis and personalized treatment has been suggested⁶.

PRSs are calculated from the number of risk alleles an individual carries at each genetic risk variant, typically weighted by the effect of each variant as estimated from GWAS data. Several methods for calculating PRS have been developed, such as linkage disequilibrium (LD) clumping and P-value thresholding (P + T), PRSice⁷, and PRS-CS⁸. To avoid overfitting and to increase performance, these methods estimate the effects of nearby genetic variants by considering LD, reduce the uncertainty of the genetic effects, and tailor the PRS to target populations⁹. The methods differ in two key respects: which genetic variants are included in the score, and how those variants are weighted.

Choosing an appropriate GWAS is one of the most important considerations for optimizing PRS performance¹⁰. When selecting a GWAS, the ancestry of the study population is a key factor, since the transferability of PRSs across populations is poor owing to differences in allele frequencies and LD patterns of genetic variants^11,12. Although the number of GWAS has been increasing in non-European regions¹³, most are still performed in Europe¹⁴. This imbalance has led to twice as many PRS studies for Europeans as for non-Europeans¹². Moreover, the performance of PRS when applying data from GWAS conducted in Europe to populations with non-European ancestry is unclear.

To explore the performance of PRS for those of non-European ancestry, we tested PRSs under various conditions in a South Korean cohort: Korean Genome and Epidemiology Study (KoGES)¹⁵. We generated PRSs for seven diseases: asthma, breast cancer, CAD, glaucoma, hyperthyroidism, hypothyroidism, and type 2 diabetes (T2D) using three PRS methods: P + T, PRSice, and PRS-CS. In addition, we used biobank-scale GWAS summary statistics from European and East Asian cohorts, UK Biobank (UKB) and BioBank Japan (BBJ)^16,17. The PRSs for the seven diseases were evaluated using two performance metrics and compared for each PRS method and each GWAS population. These results could guide the selection of a valid PRS method and its appropriate GWAS, depending on the particular population of interest.

For the analysis, we used data from the Health Examinees (HEXA) study, which consists of South Korean adults over 40 years of age¹⁸. Table 1 presents the descriptive characteristics of the participants for the seven diseases: asthma, breast cancer, CAD, glaucoma, hyperthyroidism, hypothyroidism, and T2D. For each disease, more than 300 cases and 30,000 controls were included, and the average age of the cases was higher than that of the controls (P < 0.05, Student’s t-test). For asthma, hyperthyroidism, and hypothyroidism, the proportion of women was significantly higher among the cases; these diseases are known to affect women more frequently^26,27. For T2D and CAD, the proportion of men was higher among the cases, which is in accordance with previous research^28,29. For asthma, CAD, and T2D, for which body mass index (BMI) is a risk factor^30–32, the average BMI was higher in the cases than in the controls.

Table 1

Basic characteristics of KoGES participants. All data are presented as mean (standard deviation) or number (%). BMI, body mass index; CAD, coronary artery disease; T2D, type 2 diabetes.
		Case	Control
Asthma	N	959	56,702
	Age, years	55.4 (8.4)	53.8 (8)
	Women	682 (71.1%)	37,064 (65.4%)
	BMI	24.3 (3.2)	23.9 (2.9)
Breast Cancer	N	351	30,752
	Age, years	54 (7.1)	52.9 (7.7)
	Women	351 (100%)	30,752 (100%)
	BMI	23.5 (2.8)	23.6 (2.9)
CAD	N	1,643	56,022
	Age, years	59.9 (6.8)	53.6 (8)
	Women	783 (47.7%)	36,965 (66%)
	BMI	24.9 (2.9)	23.9 (2.9)
Glaucoma	N	374	47,028
	Age, years	59.6 (7.5)	53.7 (8)
	Women	204 (54.5%)	31,125 (66.2%)
	BMI	23.9 (2.8)	23.9 (2.9)
Hyperthyroidism	N	836	38,151
	Age, years	54.5 (7.7)	53.7 (8.1)
	Women	725 (86.7%)	24,748 (64.9%)
	BMI	23.4 (2.8)	23.9 (2.9)
Hypothyroidism	N	860	38,151
	Age, years	54.2 (7.4)	53.7 (8.1)
	Women	800 (93%)	24,748 (64.9%)
	BMI	23.6 (3)	23.9 (2.9)
T2D	N	4,886	51,340
	Age, years	57.9 (7.4)	53.4 (8)
	Women	2,424 (49.6%)	34,416 (67%)
	BMI	25 (3.1)	23.8 (2.8)

Performance of PRS in an East Asian population

We calculated PRSs for the seven diseases under various conditions, using three PRS methods (P + T, PRSice, and PRS-CS) and two sets of GWAS summary statistics, from European and East Asian populations. Summary statistics for the European and East Asian populations were obtained from the UKB and BBJ, respectively, and their ancestry was considered in the PRS calculation. A total of 42 PRSs, six for each disease, were generated. All PRSs were significantly and positively associated with their target disease (OR > 1 and P < 0.05, logistic regression).

To quantify and compare the predictive performance of PRS for each disease, we considered several metrics. The cumulative incidence plot visually presents the disease incidence and enables comparison of different PRSs. High- and low-risk groups (the highest 10% and lowest 10% of the PRS distribution) were identified, and the disease incidence was analyzed over time to compare PRS methods, ancestries (of the GWAS summary statistics), and risk groups (Fig. 1). For all PRSs, there were large differences in incidence between the risk groups, with the high-risk group showing a higher incidence than the low-risk group (Figure S1). However, the differences in disease incidence between PRS methods and between GWAS populations varied by disease. For example, for asthma and T2D, PRS-CS outperformed the other PRS methods: its high-risk group had a higher incidence, and its low-risk group a lower incidence, than those of the other methods. In addition, T2D GWAS summary statistics from East Asia (BBJ) showed better performance than those from Europe (UKB) when the same PRS method was applied. For the other diseases, the optimal PRS method and GWAS ancestry for classifying risk groups differed. For CAD, glaucoma, hyperthyroidism, and hypothyroidism, PRS-CS performed optimally when summary statistics from Europe were used, whereas with East Asian summary statistics PRS-CS performed better for only one of the low- or high-risk groups. Among the breast cancer PRSs, PRS-CS with European GWAS data showed the worst performance: its high-risk group had the lowest incidence among the high-risk groups, and its low-risk group had the highest incidence among the low-risk groups. Except for T2D, neither GWAS ancestry showed a dominant effect, and P + T and PRSice did not optimally classify either risk group for any disease.

The receiver operating characteristic (ROC) curve is one of the most common tools for characterizing the accuracy of PRS. The AUC, defined as the area under the ROC curve, estimates the probability that the predicted risk of a randomly selected case is higher than that of a randomly selected control³³. ROC curves and the AUC of each PRS were generated for the subjects of each disease by applying five-fold cross-validation (Fig. 2). For diseases other than breast cancer, PRS-CS showed predictive performance better than or similar to that of the other PRS methods. Consistent with the cumulative incidence plot (Fig. 1), T2D GWAS summary statistics from East Asia outperformed those from Europe for all three PRS methods: P + T (AUC = 0.631 and 0.583 for East Asia and Europe, respectively), PRSice (AUC = 0.639 and 0.567), and PRS-CS (AUC = 0.669 and 0.616). Moreover, summary statistics from the East Asian GWAS also outperformed those from the European GWAS for CAD and hypothyroidism, while the opposite was true for hyperthyroidism. For breast cancer, PRSice and P + T showed the highest AUC when applying summary statistics from East Asia and Europe, respectively.
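The probabilistic reading of the AUC can be checked directly on a handful of scores. The following is a minimal sketch in Python, not the pipeline used in this study (the analyses here were run in R); the function name `auc_by_pairs` and the toy scores are hypothetical. Ties are counted as one half, following the usual Mann-Whitney convention.

```python
def auc_by_pairs(case_scores, control_scores):
    """Fraction of (case, control) pairs in which the case has the higher score.

    Ties contribute 0.5, as in the Mann-Whitney U statistic.
    """
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy PRS values (hypothetical, for illustration only).
toy_cases = [0.9, 0.4, 0.7]
toy_controls = [0.1, 0.5, 0.3, 0.7]
print(auc_by_pairs(toy_cases, toy_controls))
```

An AUC of 0.5 corresponds to a score that ranks cases and controls no better than chance, and 1.0 to a score that ranks every case above every control.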

Applying P-value thresholds in PRS-CS

For breast cancer, PRS-CS showed the poorest performance metrics among the PRS methods (Figs. 1 and 2). One of the biggest differences between the three PRS methods is that P + T and PRSice apply P-value thresholds to decrease noise, whereas PRS-CS applies continuous shrinkage priors to the effect sizes of genetic variants. To identify the effect of this difference on the performance of PRS-CS, we considered four GWAS P-value thresholds (5 × 10^−2, 5 × 10^−4, 5 × 10^−6, and 5 × 10^−8) and generated PRSs using PRS-CS. As in the previous analysis, the AUC of each PRS was calculated by applying five-fold cross-validation, and the incidence in the lowest 10% and highest 10% subgroups of the PRS distribution was calculated.
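As a rough illustration of this thresholding step, the sketch below keeps only the variants whose GWAS P-value does not exceed a cut-off and then sums the remaining effect sizes weighted by allele count. It assumes that the filter is applied to the variant list before scoring; the text does not state at which stage of the PRS-CS workflow the filter was applied. The variant IDs, effect sizes, and function names are hypothetical.

```python
def filter_by_pvalue(weights, gwas_p, threshold):
    """Keep effect sizes only for variants whose GWAS P-value is <= threshold.

    Variants without a recorded P-value are treated as P = 1 (an assumption).
    """
    return {snp: beta for snp, beta in weights.items()
            if gwas_p.get(snp, 1.0) <= threshold}


def score(dosages, weights):
    """Weighted sum of effect-allele counts over the variants in `weights`."""
    return sum(beta * dosages.get(snp, 0) for snp, beta in weights.items())


# Hypothetical shrunken effect sizes and GWAS P-values.
posterior_beta = {"rs1": 0.020, "rs2": -0.010, "rs3": 0.005}
gwas_p = {"rs1": 3e-9, "rs2": 2e-5, "rs3": 0.03}
person = {"rs1": 2, "rs2": 1, "rs3": 0}  # effect-allele counts

# A threshold of 1 keeps every variant, i.e. no thresholding.
for threshold in (1.0, 5e-2, 5e-4, 5e-6, 5e-8):
    kept = filter_by_pvalue(posterior_beta, gwas_p, threshold)
    print(threshold, sorted(kept), round(score(person, kept), 4))
```

Stricter thresholds leave fewer variants in the score, which trades residual noise against lost signal.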

As shown in Table 2, the predictive power of PRS-CS increased when P-value thresholds were applied for glaucoma (AUC 0.570 to 0.579) and hypothyroidism (AUC 0.553 to 0.561) with BBJ summary statistics. Although PRS-CS initially did not outperform the other PRS methods for breast cancer, its predictive power increased with P-value cut-offs (AUC 0.582 to 0.586 in BBJ, 0.551 to 0.589 in UKB), and classification of risk groups improved, in that the incidence decreased in the low-risk group (0.90% to 0.68% in UKB) and increased in the high-risk group (1.74% to 2.09% in BBJ, 1.19% to 1.77% in UKB). As shown in Fig. 3, PRS-CS showed better performance than P + T and PRSice when P-value thresholds were applied, even though it was not the optimal method previously (Fig. 1). For the remaining diseases, applying P-value thresholds did not improve the performance of PRS (Table 2 and Figure S2).

Table 2

Predictive performance of PRS-CS by P-value threshold. AUC values are means over five-fold cross-validation. The lowest 10% and highest 10% subgroups of the PRS distribution are presented as low- and high-risk, respectively. AUC, area under the receiver operating characteristic curve; CAD, coronary artery disease; T2D, type 2 diabetes. ^a SNPs satisfying the threshold were not in the linkage disequilibrium reference panel of PRS-CS.
Disease	Threshold	AUC (BBJ)	Low-risk incidence (BBJ)	High-risk incidence (BBJ)	AUC (UKB)	Low-risk incidence (UKB)	High-risk incidence (UKB)
Asthma	1	0.547	1.06%	2.27%	0.564	1.09%	2.51%
	5×10^−2	0.542	1.25%	2.12%	0.562	1.39%	2.64%
	5×10^−4	0.542	1.37%	1.86%	0.555	1.23%	2.69%
	5×10^−6	0.537	1.54%	2.32%	0.544	1.44%	2.25%
	5×10^−8	0.531	1.30%	2.15%	0.538	1.25%	2.13%
Breast Cancer	1	0.582	0.58%	1.74%	0.551	0.90%	1.19%
	5×10^−2	0.579	0.68%	2.09%	0.520	1.09%	1.16%
	5×10^−4	0.584	0.74%	1.67%	0.581	0.71%	1.58%
	5×10^−6	0.581	0.58%	1.77%	0.589	0.68%	1.77%
	5×10^−8	0.586	0.68%	1.54%	0.578	0.58%	1.67%
CAD	1	0.559	2.08%	3.52%	0.556	2.24%	4.08%
	5×10^−2	0.558	2.10%	3.47%	0.549	2.20%	3.90%
	5×10^−4	0.549	2.25%	3.30%	0.536	2.18%	3.62%
	5×10^−6	0.542	2.10%	3.31%	0.533	2.20%	3.28%
	5×10^−8	0.541	2.01%	3.28%	0.525	2.17%	3.33%
Glaucoma	1	0.570	0.34%	1.22%	0.574	0.49%	1.33%
	5×10^−2	0.575	0.42%	0.97%	0.546	0.51%	1.27%
	5×10^−4	0.579	0.59%	1.10%	0.560	0.53%	1.20%
	5×10^−6	0.562	0.49%	1.22%	0.575	0.38%	1.16%
	5×10^−8	0.535	0.53%	1.08%	0.558	0.55%	1.05%
Hyperthyroidism	1	0.564	1.64%	3.39%	0.551	1.51%	3.39%
	5×10^−2	0.551	1.49%	2.87%	0.540	1.59%	3.10%
	5×10^−4	0.556	1.56%	3.10%	0.549	1.69%	3.18%
	5×10^−6	0.555	1.59%	3.21%	0.550	1.82%	3.00%
	5×10^−8	0.553	1.49%	3.05%	0.536	1.97%	2.85%
Hypothyroidism	1	0.553	1.85%	3.31%	0.593	1.28%	3.56%
	5×10^−2	0.540	1.61%	2.82%	0.584	1.36%	3.74%
	5×10^−4	0.561	1.56%	3.36%	0.588	1.20%	3.74%
	5×10^−6	0.544	1.59%	2.72%	0.587	1.33%	3.28%
	5×10^−8	NA^a	NA^a	NA^a	0.583	1.36%	3.46%
T2D	1	0.669	2.51%	19.83%	0.616	3.98%	15.15%
	5×10^−2	0.660	2.72%	18.98%	0.603	4.53%	14.19%
	5×10^−4	0.647	2.63%	18.16%	0.594	5.32%	14.39%
	5×10^−6	0.638	3.25%	17.41%	0.600	4.43%	15.10%
	5×10^−8	0.634	3.52%	16.54%	0.596	4.73%	14.66%

To date, the utility and performance of PRS methods for disease risk prediction have been predominantly investigated in populations with European ancestry. In addition, the transferability of the European PRS to East Asian populations has remained unclear. Given the scarcity of PRS studies in East Asian populations, we explored which PRS calculation methods are optimal for specific diseases in an East Asian population, and whether a PRS generated using GWAS data from Europe is effective for risk prediction in East Asia. In the current study, we generated 42 PRSs for seven diseases in an East Asian population by applying three PRS methods and GWAS summary statistics from two populations, and compared the performance of the PRSs based on the cumulative incidence and AUC.

In our study, PRSs showed predictive power for overall disease risk in East Asians regardless of the calculation method and ancestry of GWAS data. Although the reduced predictive power of PRS transferred between populations is well known³⁴, a PRS using European GWAS data showed performance similar to, and sometimes better than, that of the PRS using data from East Asia, depending on the disease and calculation method. This finding may be attributable to the power of the European GWAS, which is greater than that of the East Asian GWAS. The power of GWAS is highly affected by sample size and increases as the numbers of both cases and controls increase^35,36. UKB comprised approximately half a million individuals, while BBJ included approximately 200,000, and the number of cases and controls for most diseases was higher in the UKB than in the BBJ (Table S1). Similar to previous studies^37,38, our results highlight an opportunity to use large-scale European GWAS data for the construction of PRSs in East Asia.

Of the three PRS methods tested, PRS-CS exhibited superior performance to the other methods under various conditions. With the exception of breast cancer, PRS-CS optimally classified disease risk groups as either low- or high-risk or both (Fig. 1 and S1) and showed a decent predictive performance for the overall risk of disease (Fig. 2). Unlike the other two methods, PRS-CS applies continuous shrinkage priors to genetic effects, and use of a large training sample size is encouraged because this has a significant effect on the shrinkage parameter⁸. Therefore, the superior predictive power of PRS-CS in our results may be attributable to the GWAS data from BBJ and UKB, which are among the largest cohorts in each population. In addition, the performance of the PRS-CS for breast cancer risk was improved, to have similar or better predictive power than the other methods, by applying P-value thresholds (Table 2 and Fig. 3). One of the reasons for this is that the application of continuous shrinkage priors may not be able to sufficiently eliminate the total noise of genetic effects. Therefore, when the sample sizes of the training data are sufficiently large, better performance can be achieved by choosing PRS-CS as the calculation method and applying various P-value thresholds.

In this study, we provided two performance metrics for PRS: cumulative incidence of low- and high-risk groups and ROC–AUC (Fig. 1 and Fig. 2). Depending on the purpose of the PRS, various performance metrics can be utilized, and the optimal PRS may vary accordingly. Taking a PRS study of breast cancer using BBJ GWAS data as an example, the best PRS method was PRS-CS when the aim of the study was to classify the low-risk group (Fig. 1). On the other hand, P + T is most suitable for classification of the high-risk group at 40 to 50 years of age (Fig. 1). For the overall risk of breast cancer, the predictive performance was best using PRSice (Fig. 2). Therefore, when evaluating the performance of the PRS, practical metrics should be selected and applied.

In summary, we generated PRSs for seven diseases in an East Asian population using three PRS calculation methods and GWAS data from European and East Asian populations. We estimated the predictive performance of the various PRSs using two metrics and showed that a PRS based on GWAS data, not only from East Asian but also from European ancestry, works well as a predictor of disease risk in East Asia. Furthermore, we suggest that PRS-CS will be more effective than P + T and PRSice when the GWAS sample size is large enough, and that applying P-value thresholds can increase the predictive power of PRS-CS. Although a grid search (applying all known PRS methods and GWAS summary statistics) is the best way to find the most appropriate PRS model, the results of our study can help researchers to select a suitable PRS method and GWAS data. Further investigation of various PRS methods, traits, and ancestries of the study population is needed to validate our results.

Study populations

The present study was conducted using community-based genomic cohort data from the HEXA of the Korean Genome and Epidemiology Study¹⁸. The survey for the HEXA study was performed at 38 hospitals and local health-screening centers from 2004 to 2013, following standardized procedures. In total, 65,642 urban participants completed the initial and follow-up surveys. Epidemiological data were provided by the Korea Centers for Disease Control and Prevention. For sample quality control, participants with a genotype relative score greater than 0.125 or a body mass index outside the criteria of 15–50 were excluded.

Genotype data

The genotype data were produced using the Korea BioBank Array, which is optimized for the Korean population and includes 833,535 single nucleotide polymorphisms (SNPs)¹⁹. Imputation was conducted with ShapeIT v2²⁰ and IMPUTE v2²¹ using 1000 Genomes Phase 3 data as the reference panel²². For quality control, SNPs with a minor allele frequency less than 0.01, a Hardy–Weinberg equilibrium P-value less than 10^−6, or a missing rate greater than 0.05 were excluded. A total of 7,915,509 SNPs remained.
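The three exclusion criteria can be expressed as a single predicate. The sketch below is illustrative only and is not the PLINK workflow used in the study; the function name, the dictionary layout, and the toy variants are hypothetical, while the default cut-offs are the ones quoted above.

```python
def passes_qc(maf, hwe_p, missing_rate,
              min_maf=0.01, min_hwe_p=1e-6, max_missing=0.05):
    """Return True if a variant survives all three filters.

    A variant is excluded when its minor allele frequency is below min_maf,
    its Hardy-Weinberg P-value is below min_hwe_p, or its missing rate
    exceeds max_missing.
    """
    return maf >= min_maf and hwe_p >= min_hwe_p and missing_rate <= max_missing


# Hypothetical per-variant summary values.
toy_variants = [
    {"id": "rs1", "maf": 0.250, "hwe_p": 0.4,  "missing": 0.01},  # passes
    {"id": "rs2", "maf": 0.005, "hwe_p": 0.7,  "missing": 0.00},  # rare
    {"id": "rs3", "maf": 0.100, "hwe_p": 1e-8, "missing": 0.02},  # fails HWE
    {"id": "rs4", "maf": 0.300, "hwe_p": 0.9,  "missing": 0.08},  # too sparse
]

kept_ids = [v["id"] for v in toy_variants
            if passes_qc(v["maf"], v["hwe_p"], v["missing"])]
print(kept_ids)
```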

Phenotype definition

Participants who constituted the T2D case and control groups were identified by their answers to the questionnaire on T2D diagnostic history and fasting glucose level. Those who replied ‘Yes’ to the questionnaire or had a fasting glucose level above 126 mg/dL were classified into the case group, and those who answered ‘No’ to the questionnaire and had a fasting glucose level less than 126 mg/dL constituted the control group.
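The T2D case and control rules can be written as a small decision function. This is a sketch under stated assumptions rather than the study's code: the function name is hypothetical, a missing answer or glucose value is assumed to leave the participant unclassified unless another rule applies, and a 'No' answer with a fasting glucose of exactly 126 mg/dL is left unclassified because the text defines only values above and below that level.

```python
def classify_t2d(diagnosed, fasting_glucose_mg_dl):
    """Classify a participant as 'case', 'control', or None (unclassified).

    diagnosed: True for 'Yes', False for 'No', None if unanswered.
    fasting_glucose_mg_dl: fasting glucose in mg/dL, or None if unmeasured.
    """
    glucose = fasting_glucose_mg_dl
    # Case: self-reported diagnosis OR fasting glucose above 126 mg/dL.
    if diagnosed is True or (glucose is not None and glucose > 126):
        return "case"
    # Control: no diagnosis AND fasting glucose below 126 mg/dL.
    if diagnosed is False and glucose is not None and glucose < 126:
        return "control"
    return None


print(classify_t2d(True, 95))    # diagnosed, normal glucose
print(classify_t2d(False, 140))  # undiagnosed, high glucose
print(classify_t2d(False, 100))  # undiagnosed, normal glucose
```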

For breast cancer classification, participants who answered ‘Yes’ to the questionnaire on cancer diagnosis and ‘Breast cancer’ to the questionnaire on cancer type were classified as the case group. The breast cancer control group was defined as those who answered ‘No’ to the questionnaire on cancer diagnosis.

For the other diseases (asthma, CAD, glaucoma, hyperthyroidism, and hypothyroidism), the participants were classified using a diagnostic history questionnaire for each disease. Those who answered ‘Yes’ were defined as disease cases, and those who answered ‘No’ were defined as controls. The case group for all the diseases comprised more than 300 individuals. The characteristics of the samples are listed in Table 1.

PRS calculations

For the PRS calculation, GWAS summary statistics from the UKB and BBJ were selected. We used a total of 14 sets of summary statistics (seven diseases from each of the two biobanks), each of which contained one or more SNPs reaching genome-wide significance (P < 5 × 10^−8). GWAS summary statistics were obtained from the NHGRI-EBI GWAS Catalog (www.ebi.ac.uk/gwas) and JENGER (jenger.riken.jp/result)^17,23. The GWAS data are listed in Table S1.

We applied three methods for PRS calculation: P + T, PRSice, and PRS-CS. For the P + T method, subsets of independent SNPs were obtained by applying LD clumping (r² < 0.1), a P-value threshold (< 5 × 10^−8), and a physical distance threshold (> 1000 kb). LD clumping was conducted using 1000 Genomes Phase 3 data as the LD reference panel²², taking into account the ancestry of the GWAS summary statistics. The individual risk score for the P + T method was then calculated using the following equation:

$$\mathrm{PRS} = \beta_{1} \times \mathrm{SNP}_{1} + \beta_{2} \times \mathrm{SNP}_{2} + \cdots + \beta_{n} \times \mathrm{SNP}_{n}$$

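where β_i is the coefficient that represents the association between the i-th SNP and the disease, SNP_i is the number of effect alleles that the individual carries at that SNP, and n is the number of SNPs included in the score.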
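The P + T procedure and the scoring equation can be sketched together as follows. This is a simplified greedy clumping in plain Python, not the PLINK implementation used in the study: SNPs passing the P-value threshold are visited from most to least significant, and a SNP is dropped if it lies within the distance window of an already retained SNP and is in LD with it (r² at or above the cut-off). The function names, the `r2` lookup, and the toy data are hypothetical.

```python
def clump_and_threshold(snps, r2, p_max=5e-8, r2_max=0.1, window_kb=1000):
    """Greedy P + T selection.

    snps: list of dicts with keys 'id', 'chrom', 'pos' (bp), 'p', 'beta'.
    r2:   function (id_a, id_b) -> squared correlation from an LD reference.
    """
    candidates = sorted((s for s in snps if s["p"] < p_max),
                        key=lambda s: s["p"])
    kept = []
    for snp in candidates:
        in_ld_with_kept = any(
            snp["chrom"] == k["chrom"]
            and abs(snp["pos"] - k["pos"]) <= window_kb * 1000
            and r2(snp["id"], k["id"]) >= r2_max
            for k in kept
        )
        if not in_ld_with_kept:
            kept.append(snp)
    return kept


def prs(dosages, kept):
    """PRS = sum of beta_i * SNP_i over the retained variants."""
    return sum(s["beta"] * dosages.get(s["id"], 0) for s in kept)


# Hypothetical summary statistics and LD values.
toy_snps = [
    {"id": "rsA", "chrom": 1, "pos": 100_000,   "p": 1e-12, "beta": 0.10},
    {"id": "rsB", "chrom": 1, "pos": 150_000,   "p": 1e-9,  "beta": 0.08},
    {"id": "rsC", "chrom": 1, "pos": 5_000_000, "p": 2e-8,  "beta": -0.05},
    {"id": "rsD", "chrom": 2, "pos": 100_000,   "p": 1e-3,  "beta": 0.20},
]
toy_ld = {frozenset(("rsA", "rsB")): 0.8}


def toy_r2(a, b):
    return toy_ld.get(frozenset((a, b)), 0.0)


selected = clump_and_threshold(toy_snps, toy_r2)
print([s["id"] for s in selected])
print(prs({"rsA": 2, "rsB": 1, "rsC": 1}, selected))
```

In this toy example rsB is removed because it is in strong LD with the more significant rsA, and rsD is removed because it does not reach the P-value threshold.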

The second PRS method, PRSice, is a P + T method that tests PRS at a large number of thresholds and applies the best-fit PRS to the study samples. PRSice-2 and its default parameter settings (physical distance > 250 kb and r² < 0.1)⁷ were used. For the third method, we used PRS-CS, which is based on continuous shrinkage priors⁸. In the analysis, an LD reference panel based on external European and East Asian samples from the 1000 Genomes Project was used with default parameter settings. Since the results of the PRS-CS provide the effect of SNPs, the individual risk scores for PRS-CS were calculated using the same equation as for P + T above.
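The idea of scanning many thresholds and keeping the best-fitting score can be illustrated with a short sketch. It is a simplified stand-in, not PRSice-2 itself: clumping is omitted, and the fit measure is the squared Pearson correlation between score and phenotype (equal to the R² of a simple linear regression), which is an assumption of this example rather than a description of the software's exact model. All names and data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant score or phenotype explains nothing
    return sxy * sxy / (sxx * syy)


def best_threshold(snps, genotypes, phenotype, thresholds):
    """Return (fit, threshold) for the best-fitting P-value threshold.

    snps:      list of (snp_id, beta, p_value) tuples.
    genotypes: one dict per individual mapping snp_id -> effect-allele count.
    phenotype: one 0/1 value per individual.
    """
    results = []
    for t in thresholds:
        selected = [(snp_id, beta) for snp_id, beta, p in snps if p <= t]
        scores = [sum(beta * g.get(snp_id, 0) for snp_id, beta in selected)
                  for g in genotypes]
        results.append((r_squared(scores, phenotype), t))
    return max(results)


# Hypothetical data: rs3 has a weak GWAS P-value and mostly adds noise.
toy_stats = [("rs1", 0.5, 1e-9), ("rs2", 0.3, 1e-4), ("rs3", -0.4, 0.2)]
toy_genotypes = [
    {"rs1": 2, "rs2": 1, "rs3": 0},
    {"rs1": 2, "rs2": 2, "rs3": 1},
    {"rs1": 1, "rs2": 1, "rs3": 2},
    {"rs1": 0, "rs2": 0, "rs3": 2},
    {"rs1": 1, "rs2": 0, "rs3": 0},
    {"rs1": 0, "rs2": 1, "rs3": 1},
]
toy_phenotype = [1, 1, 1, 0, 0, 0]

fit, threshold = best_threshold(toy_stats, toy_genotypes, toy_phenotype,
                                thresholds=(5e-8, 1e-3, 1.0))
print(threshold, round(fit, 3))
```

Because the threshold is chosen on the same individuals used to measure fit, the selected score is optimistic, which is one reason held-out evaluation such as cross-validation is used.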

Statistical analysis

SNP quality control, sample filtering, and PRS calculation were performed using PLINK v1.9.0²⁴. For the evaluation of PRS performance, we generated ROC curves by plotting the true positive rate against the false positive rate of the PRS at various thresholds, using the R package ‘ggplot2’. Furthermore, the AUC was calculated by applying five-fold cross-validation to the subjects of each disease²⁵. The values used to plot the ROC curve and calculate the AUC were averages over the folds. For each disease and PRS method, we divided the subjects into high-risk and low-risk groups, representing the highest and lowest 10% of the PRS distribution, respectively, and determined the disease incidence by age. The incidence of diseases was presented as a cumulative incidence plot using the R package ‘ggplot2’. Student’s t-tests and regression analyses were performed using the base packages of R version 4.0.5.
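The risk-group comparison can be sketched as follows. The example ranks individuals by PRS, takes the lowest and highest 10%, and reports the proportion of cases in each group. It ignores age, so it gives an overall incidence rather than the cumulative incidence by age plotted in the study, and it is written in Python rather than the R used for the actual analysis. The function name and toy data are hypothetical.

```python
def risk_group_incidence(scores, is_case, fraction=0.10):
    """Proportion of cases in the lowest and highest `fraction` of scores.

    Returns (low_risk_incidence, high_risk_incidence).
    """
    n = len(scores)
    group_size = max(1, int(n * fraction))
    order = sorted(range(n), key=lambda i: scores[i])
    low_idx, high_idx = order[:group_size], order[-group_size:]

    def incidence(indices):
        return sum(is_case[i] for i in indices) / len(indices)

    return incidence(low_idx), incidence(high_idx)


# Hypothetical cohort of 20 people; 1 marks a disease case.
toy_scores = [float(i) for i in range(20)]
toy_status = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
              1, 0, 0, 0, 0, 0, 0, 1, 1, 1]

low, high = risk_group_incidence(toy_scores, toy_status)
print(low, high)
```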

Competing interests

All authors declare no competing interests.

Data availability

UKB and BBJ summary statistics used in this study were downloaded from NHGRI-EBI GWAS Catalog (www.ebi.ac.uk/gwas) and JENGER (jenger.riken.jp/result). This paper does not report custom code.

Acknowledgements

This study was conducted using the resources of Korean Genome and Epidemiology Study (application code: 21060104-01-01).

Author contributions

Dong Jun Kim wrote the original draft and contributed to methodology, formal analysis and data curation. Sun Bin Kim contributed to software. Myeong Jae Cheon and Young Kee Lee contributed to validation and visualization. Joon Ho Kang and Ji-Woong Kim contributed to investigation. Byung-Chul Lee provided supervision and contributed to project administration. All authors reviewed and edited the manuscript.

Lambert, S. A. et al. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat Genet 53, 420–425, doi:10.1038/s41588-021-00783-5 (2021).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet 50, 1219–1224, doi:10.1038/s41588-018-0183-z (2018).
Mars, N. et al. The role of polygenic risk and susceptibility genes in breast cancer over the course of life. Nat Commun 11, 6383, doi:10.1038/s41467-020-19966-5 (2020).
Wilson, P. W. et al. Prediction of incident diabetes mellitus in middle-aged adults: the Framingham Offspring Study. Arch Intern Med 167, 1068–1074, doi:10.1001/archinte.167.10.1068 (2007).
Polygenic Risk Score Task Force of the International Common Disease Alliance. Responsible use of polygenic risk scores in the clinic: potential benefits, risks and gaps. Nat Med 27, 1876–1884, doi:10.1038/s41591-021-01549-6 (2021).
Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nat Rev Genet 19, 581–590, doi:10.1038/s41576-018-0018-x (2018).
Choi, S. W. & O'Reilly, P. F. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience 8, doi:10.1093/gigascience/giz082 (2019).
Ge, T., Chen, C. Y., Ni, Y., Feng, Y. A. & Smoller, J. W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 10, 1776, doi:10.1038/s41467-019-09718-5 (2019).
Choi, S. W., Mak, T. S. & O'Reilly, P. F. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc 15, 2759–2772, doi:10.1038/s41596-020-0353-1 (2020).
Page, M. L. et al. The Polygenic Risk Score Knowledge Base offers a centralized online repository for calculating and contextualizing polygenic risk scores. Commun Biol 5, 899, doi:10.1038/s42003-022-03795-x (2022).
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet 51, 584–591, doi:10.1038/s41588-019-0379-x (2019).
Duncan, L. et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat Commun 10, 3328, doi:10.1038/s41467-019-11112-0 (2019).
Popejoy, A. B. & Fullerton, S. M. Genomics is failing on diversity. Nature 538, 161–164, doi:10.1038/538161a (2016).
Sirugo, G., Williams, S. M. & Tishkoff, S. A. The Missing Diversity in Human Genetic Studies. Cell 177, 26–31, doi:10.1016/j.cell.2019.02.048 (2019).
Kim, Y., Han, B. G. & KoGES group. Cohort Profile: The Korean Genome and Epidemiology Study (KoGES) Consortium. Int J Epidemiol 46, e20, doi:10.1093/ije/dyv316 (2017).
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 12, e1001779, doi:10.1371/journal.pmed.1001779 (2015).
Nagai, A. et al. Overview of the BioBank Japan Project: Study design and profile. J Epidemiol 27, S2-S8, doi:10.1016/j.je.2016.12.005 (2017).
Health Examinees Study Group. The Health Examinees (HEXA) study: rationale, study design and baseline characteristics. Asian Pac J Cancer Prev 16, 1591–1597, doi:10.7314/apjcp.2015.16.4.1591 (2015).
Moon, S. et al. The Korea Biobank Array: Design and Identification of Coding Variants Associated with Blood Biochemical Traits. Sci Rep 9, 1382, doi:10.1038/s41598-018-37832-9 (2019).
Delaneau, O., Marchini, J. & Zagury, J. F. A linear complexity phasing method for thousands of genomes. Nat Methods 9, 179–181, doi:10.1038/nmeth.1785 (2011).
Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39, 906–913, doi:10.1038/ng2088 (2007).
1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74, doi:10.1038/nature15393 (2015).
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res 47, D1005-D1012, doi:10.1093/nar/gky1120 (2019).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81, 559–575, doi:10.1086/519795 (2007).
Wald, N. J. & Old, R. The illusion of polygenic disease risk prediction. Genet Med 21, 1705–1707, doi:10.1038/s41436-018-0418-5 (2019).
Mulder, J. E. Thyroid disease in women. Med Clin North Am 82, 103–125, doi:10.1016/s0025-7125(05)70596-4 (1998).
Fuseini, H. & Newcomb, D. C. Mechanisms Driving Gender Differences in Asthma. Curr Allergy Asthma Rep 17, 19, doi:10.1007/s11882-017-0686-1 (2017).
Chen, L., Magliano, D. J. & Zimmet, P. Z. The worldwide epidemiology of type 2 diabetes mellitus–present and future perspectives. Nat Rev Endocrinol 8, 228–236, doi:10.1038/nrendo.2011.183 (2011).
Jamee, A., Abed, Y. & Jalambo, M. O. Gender difference and characteristics attributed to coronary artery disease in Gaza-Palestine. Glob J Health Sci 5, 51–56, doi:10.5539/gjhs.v5n5p51 (2013).
Peters, U., Dixon, A. E. & Forno, E. Obesity and asthma. J Allergy Clin Immunol 141, 1169–1179, doi:10.1016/j.jaci.2018.02.004 (2018).
Powell-Wiley, T. M. et al. Obesity and Cardiovascular Disease: A Scientific Statement From the American Heart Association. Circulation 143, e984-e1010, doi:10.1161/CIR.0000000000000973 (2021).
Eckel, R. H. et al. Obesity and type 2 diabetes: what can be unified and what needs to be individualized? J Clin Endocrinol Metab 96, 1654–1663, doi:10.1210/jc.2011-0585 (2011).
Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36, doi:10.1148/radiology.143.1.7063747 (1982).
Scutari, M., Mackay, I. & Balding, D. Using Genetic Distance to Infer the Accuracy of Genomic Prediction. PLoS Genet 12, e1006288, doi:10.1371/journal.pgen.1006288 (2016).
Spencer, C. C., Su, Z., Donnelly, P. & Marchini, J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet 5, e1000477, doi:10.1371/journal.pgen.1000477 (2009).
Hong, E. P. & Park, J. W. Sample size and statistical power calculation in genetic association studies. Genomics Inform 10, 117–122, doi:10.5808/GI.2012.10.2.117 (2012).
Ho, W. K. et al. European polygenic risk score for prediction of breast cancer shows similar performance in Asian women. Nat Commun 11, 3833, doi:10.1038/s41467-020-17680-w (2020).
Fritsche, L. G. et al. On cross-ancestry cancer polygenic risk scores. PLoS Genet 17, e1009670, doi:10.1371/journal.pgen.1009670 (2021).

Editorial decision: Major revision
06 Mar, 2023
Reviews received at journal
08 Feb, 2023
Reviewers agreed at journal
27 Jan, 2023
Reviewers invited by journal
19 Jan, 2023
Editor assigned by journal
19 Jan, 2023
Editor invited by journal
18 Jan, 2023
Submission checks completed at journal
18 Jan, 2023
First submitted to journal
17 Jan, 2023
