Tuning Parameters for Polygenic Risk Score Methods Using GWAS Summary Statistics from Training Data

Predicting genetic risks for common diseases may improve their prevention and early treatment. In recent years, various additive-model-based polygenic risk scores (PRS) methods have been proposed to combine the estimated effects of single nucleotide polymorphisms (SNPs) using data collected from genome-wide association studies (GWAS). Some of these methods require access to another external individual-level GWAS dataset to tune the hyperparameters, which can be difficult because of privacy and security-related concerns. Additionally, leaving out partial data for hyperparameter tuning can reduce the predictive accuracy of the constructed PRS model. In this article, we propose a novel method, called PRStuning, to automatically tune hyperparameters for different PRS methods using only GWAS summary statistics from the training data. The core idea is to first predict the performance of the PRS method with different parameter values, and then select the parameters with the best prediction performance. Because directly using the effects observed from the training data tends to overestimate the performance in the testing data (a phenomenon known as overfitting), we adopt an empirical Bayes approach to shrinking the predicted performance in accordance with the estimated genetic architecture of the disease. Results from extensive simulations and real data applications demonstrate that PRStuning can accurately predict the PRS performance across PRS methods and parameters, and it can help select the best-performing parameters.


PRStuning
AUC that needs to be evaluated on another individual-level genotype dataset. 120 We incorporate empirical Bayes (EB) theory to shrink the effect sizes of SNPs, 121 which leads to the attenuation of the predicted AUC so as to overcome the 122 overfitting phenomenon [23]. In PRStuning, we adopt a point-normal mixture the best predicted AUC. 135 We applied PRStuning to GWAS datasets of three common diseases, 136 including coronary artery disease (CAD), type 2 diabetes (T2D), and inflam- Here M is the total number of pre-selected SNPs used for constructing PRS. is the cumulative distribution function of a standard normal distribution. We 162 use R m1,m2 to denote the LD coefficient between SNP m 1 and SNP m 2 . 163 We can calculate τ 2 j (j = 0, 1) by directly plugging in the observed values 164 of allele frequencies and LD coefficients since τ 2 j is not directly related to the 165 SNPs' effects on the disease. In GWAS, z-scores from the allele frequency difference test are usually used to assess the association of each SNP with the disease. Each z-score is calculated with the following formula: wheref j,m is the observed allele frequency for each group, s 2 j,m is the 180 variance of the genotype in the controls or cases, and n 0 , n 1 are the 181 sample sizes of the two groups. To simplify this expression, we define N (δ m /s m , 1) given the allele frequency difference δ m . 184 We will further demonstrate in the Supplementary Note that δ = (δ 1 , . . . , δ M ) is related to the LD pattern among the pre-selected SNPs and the underlying effects of the risk SNPs in terms of the allele frequency differences between the two groups, i.e., where S is a diagonal matrix with the m-th diagonal element equal to s m , β = (β 1 , . . . , β M ) with β m being the underlying effect of SNP m, and R is the LD coefficient matrix. Given δ, the joint distribution of the z-scores (6) We further assume that the standardized effect β m /s m follows a point-normal distribution, i.e., Here δ 0 is a point mass at zero, π represents the prior proportion of SNPs We first consider the case where the pre-selected SNPs are independent. In it is reasonable to assume σ 2 ∝ n.

235
In total, we simulated M = 10, 000 independent SNPs and varied the sam-236 ple size from 4, 000 to 10, 000 in the training GWAS to explore the performance 237 trend across different sample sizes. Each sample size setting was replicated 50 238 times. And for each replication, we simulated additional 1000 cases and 1000 239 controls as testing data. 240 We used the AUC evaluated on the testing data as the benchmark, and  Since all SNPs are independent, we only considered P+T as the PRS method. problem. In contrast, with the same summary statistics from the training data,

251
PRStuning was able to shrink the estimates of allele frequency differences and 252 produce AUC estimates comparable to those from the testing data.  We changed the p-value threshold from {1, 5e−1, 5e−2, 5e−3, 5e−4, 5e−5, 5e−6} and the sample sizes of training data from 4, 000 to 10, 000. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC evaluated on testing data, and the unadjusted AUC directly estimated by plugging in the training summary statistics, respectively. The AUC evaluated on the testing data is the benchmark. PRStuning is able to yield AUC estimates comparable to the benchmark results.
In order to further demonstrate the accuracy of PRStuning, we summarize the average correlation of the AUC estimates ρ AUC and the average relative 255 difference of the best-performing AUC estimates rd AUC in Table 1 SNPs are linked as reflected in their LD. 269 We first performed simulations with SNPs with an AR(1) auto-regressive 270 LD structure. We fixed the auto-regressive coefficient ρ to 0.2, which is the   We further evaluated PRStuning with simulations based on real genotype  We changed the p-value threshold from {1, 5e−1, 5e−2, 5e−3, 5e−4, 5e−5, 5e−6} and the sample sizes of training data from 4, 000 to 10, 000. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC evaluated on testing data, and the unadjusted AUC directly estimated by plugging in the training summary statistics, respectively. The AUC evaluated on the testing data is the benchmark. PRStuning is able to yield AUC estimates comparable to the benchmark results.  We changed the proportion of risk SNPs from {1, 3e−1, 1e−1, 3e−2, 1e−2, 3e−3, 1e−3, 3e−4, 1e−4, 3e−5, 1e−5} and the sample sizes of training data from 4, 000 to 10, 000. The grey, yellow, and red panels represent AUC predicted from PRStuning, AUC calculated from testing data, and the unadjusted AUC, respectively.
the UK Biobank (UKBB) [28], which collected genetic and health records from around 500, 000 participants in the UK. The quality control procedure is sum-  Table 4 The predicted AUC values for C+T with different p-value thresholds in the simulation experiment based on the UKBB data. We randomly selected 80% of individuals to calculate the summary statistics as training data and the rest as testing data. We used the data collected from the 1KG as the reference panel for calculating LD. The AUC estimates from PRStuning were very close to the actual AUC values obtained from the testing data. The correlation ρ AUC reached 0.996 and the relative difference rd AUC was 3.7%.

334
We applied PRStuning to GWAS summary statistics from three diseases, 335 including coronary artery disease (CAD), type 2 diabetes (T2D), and inflam-336 matory bowel disease (IBD). Table 6 summarizes the sources of the publicly  Table 5 The predicted AUC values for LDpred with different risk SNP proportion π in the simulation experiment based on the UKBB data. We selected 80% of individuals as training data and the rest as the testing data. The data from the 1KG were used as the reference panel. The correlation ρ AUC was 0.998 and the relative difference rd AUC was 1.3%.
available GWAS summary statistics and their corresponding sample sizes. Note    To further explain why there were double modes for AUC with different parameter values, we refer back to the calculation of ∆ in Eq. (3) since AUC is monotonically increasing with respect to ∆. The numerator of ∆ is a linear combination of the weights ω = (ω 1 , . . . , ω M ) T used in PRS, whereas the denominator is the square root of a quadratic function of ω, which can be further expressed as  (2), we know that AUC is monotonically increasing with respect to ∆, and we of ρ AUC and rd AUC are summarized in compared to GWAS summary statistics due to privacy and security concerns.

422
Additionally, leaving out partial data for hyperparameter tuning can also 423 reduce the predictive accuracy of the PRS model. the testing data and selecting the best-performing parameters.

435
The core of PRStuning is to estimate the allele frequency differences among 436 SNPs. To do so, we need to input the sample sizes of the cases and con- where M is the total number of the pre-selected SNPs used for constructing we have g i,m ∼ Bino(2, f 1,m ) if the individual i is from the case group.

479
By the central limit theorem, PRS approximately follows a normal distribution in each group when the SNP number M is adequately large. For PRS methods involving SNP selection steps unrelated to the SNPs' associations with the disease, such as P+T, M varies from ∼ 10 to ∼ 10K depending on the selection threshold. For PRS methods with genome-wide pre-selected SNPs, M ranges from ∼ 100K to ∼ 1M determined by the number of SNPs genotyped or imputed in the training data. Based on the central limit theorem, the PRS variables from the two groups follow normal distributions: where and for j = 0 or 1. Here R m1,m2 corresponds to the correlation between SNP m 1 480 and SNP m 2 , which is known as the LD coefficient.

481
For a binary phenotype, we usually use AUC as the criterion for evaluating the prediction performance of PRS. AUC is defined as the area under the ROC curve, which can also be calculated as the probability that a random PRS from the case group is larger than a random PRS from the control group [33]. Based on this fact and the distributions of PRS, Song, etc.[21] formulated AUC as where Here δ m := f 1,m − f 0,m records the difference between the allele frequencies of  In GWAS, we usually use the z-score calculated from the allele frequency difference test to assess the association of each SNP with the disease. Since z-scores are standardized values following a standard normal distribution N (0, 1) under the null hypothesis, we will use z-scores as surrogates to derive the posterior distribution of δ m . The z-score is calculated with the following formula: wheref j,m is the observed allele frequencies among controls or cases, and s 2 j,m is the variance of genotypes in each group. We use n 0 and n 1 to respectively denote the sample sizes of controls and cases in the GWAS. To simplify the expression, we use s m to denote the denominator of the z-score, i.e., s m := s 2 1,m /4n 1 + s 2 0,m /4n 0 , and denote s = (s 1 , . . . , s M ). We use z to encode the z-scores of all the pre-504 selected SNPs.

505
Based on this definition, we have z m |δ m ∼ N (δ m /s m , 1) given the allele frequency difference δ m . We will further prove that δ = (δ 1 , . . . , δ M ) is actually related to the LD among the pre-selected SNPs and the underlying effects of the risk SNPs in terms of changing allele frequencies between two groups. We denote the effect of SNP m as β m and β = (β 1 , . . . , β M ). If the SNP has no effect on the disease, then β m = 0. For the risk ones, β m ̸ = 0. We assume that the standardized effect β m /s m follows a point-normal distribution, i.e., Here δ 0 is a point mass at zero and π represents the prior proportion of the 506 SNPs having effects on the disease. We use σ 2 to denote the variance of β m /s m 507 in the risk SNPs.
In the following two subsections, we will prove the relationship between δ m 509 and β m in two different scenarios and demonstrate how the empirical Bayes are approximately independent because they are selected after an LD pruning 516 step.

517
In this scenario, we have δ = β and the joint distribution of z-scores follows a multivariate normal distribution with the covariance matrix equaling to the identity matrix I M , i.e., where S = diag(s) is a diagonal matrix with diagonal elements encoding the 518 standard errors of the observed allele frequency differences.

519
With the point-normal prior (17) on each entry of β, the log-likelihood of the z-scores is the summation of the log-likelihood for each individual z-score, i.e. log P (z|π, σ 2 ) = M m=1 log P (z m |π, σ 2 ).
With this property, we can use an EM algorithm to get estimates of π and σ 2 520 by maximizing the likelihood P (z|π, σ 2 ).

521
After getting estimates of parameters π and σ 2 , we can derive a closed-form solution for the posterior distribution of δ m : where Here φ(·) is the probability density function of a standard normal distribution By making some simple changes to the originally derived sampler, we can get another Gibbs sampler for simultaneously sampling π, σ 2 and D artificial replicates of the nuisance parameters {β(d), γ(d)} D d=1 , for whom the joint probability is proportional to Based on this probability, the generated replicates of {β, γ} in the sampler are conditionally independent. With this new sampler, the marginal probability of (π, σ 2 ) can be calculated by integrating/summing over all replicates of {β, γ}: In other words, (π, σ 2 ) is actually sampled from q D π, σ 2 |z ∝ P (z|π, σ 2 ) D in the sampler. We further denote (π,σ 2 ) = arg max (π,σ 2 ) P (z|π, σ 2 ) and (π,σ 2 ) as another set of parameters. If we let D increase to infinity, the relative probability of sampling (π,σ 2 ) compared to sampling (π,σ 2 ) will become q D π,σ 2 |z q D π,σ 2 |z = P (z|π,σ 2 ) P (z|π,σ 2 ) Therefore, the sampled (π, σ 2 ) will converge to their maximum likelihood 551 estimates (π,σ 2 ) in the end.