SNP characteristics and validation success in genome wide association studies

Genome wide association studies (GWASs) have identified tens of thousands of single nucleotide polymorphisms (SNPs) associated with human diseases and characteristics. A significant fraction of GWAS findings can be false positives. The gold standard for true positives is an independent validation. The goal of this study was to identify SNP features associated with validation success. Summary statistics from the Catalog of Published GWASs were used in the analysis. Since our goal was an analysis of reproducibility, we focused on the diseases/phenotypes targeted by at least 10 GWASs. GWASs were arranged in discovery-validation pairs based on the time of publication, with the discovery GWAS published before validation. We used four definitions of the validation success that differ by stringency. Associations of SNP features with validation success were consistent across the definitions. The strongest predictor of SNP validation was the level of statistical significance in the discovery GWAS. The magnitude of the effect size was associated with validation success in a non-linear manner. SNPs with risk allele frequencies in the range 30–70% showed a higher validation success rate compared to rarer or more common SNPs. Missense, 5’UTR, stop gained, and SNPs located in transcription factor binding sites had a higher validation success rate compared to intergenic, intronic and synonymous SNPs. There was a positive association between validation success and the level of evolutionary conservation of the sites. In addition, validation success was higher when discovery and validation GWASs targeted the same ethnicity. All predictors of validation success remained significant in a multivariate logistic regression model indicating their independent contribution. To conclude, we identified SNP features predicting validation success of GWAS hits. These features can be used to select SNPs for validation and downstream functional studies.


Introduction
Genome-wide association studies (GWASs) revolutionized the study of genetic control of human phenotypes and diseases (Tam et al. 2019; Visscher et al. 2012). GWASs test millions of SNPs in phenotypically different individuals to identify genotype-phenotype associations. Thousands of associations between SNPs and diseases/traits have been detected (Bosse and Amos 2018; Gallagher and Chen-Plotkin 2018; Horwitz et al. 2019; Liang et al. 2020). Despite using the strict genome-wide threshold for statistical significance (p < 5 × 10⁻⁸, or equivalently −log10(p) > 7.3), a considerable number of detected SNP-phenotype associations fail independent validation (Brzyski et al. 2017; Marigorta et al. 2018). Identifying SNP characteristics predicting validation success (true positives) is important for prioritizing SNPs for targeted validation and downstream functional studies. We and others identified a number of SNP characteristics associated with validation success (Gorlov et al. 2014; Merelli et al. 2013; Xu and Taylor 2009).
Here we present results of an updated analysis of associations between SNP characteristics and validation success.

Data used
... included to test whether they are enriched for true positives. We focused on diseases/traits that were targeted by at least 10 studies. The results of the published GWASs were included regardless of the sample size used in the study. A total of 40 diseases/traits were analyzed in the study (Table 1). Disease/trait labels were used exactly as reported in the Catalog of Published GWASs (CPG).

Validation attempts
For each disease/trait, GWASs were arranged into pairs according to the publication date. Each pair was considered to be a validation attempt, with the earlier GWAS treated as the discovery and the later one as the validation. The complete list of discovery-validation pairs can be found in Supplementary Table S1. The supplementary table also includes pairwise linkage disequilibrium (LD) for three major ethnic groups: Europeans, Africans and Asians.
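The pairing step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that every earlier/later combination of GWASs for a disease/trait counts as one validation attempt and that studies published on the same day are not paired. The function name and study identifiers are hypothetical.

```python
from datetime import date
from itertools import combinations


def discovery_validation_pairs(studies):
    """Return (discovery_id, validation_id) pairs for one disease/trait.

    `studies` is a list of (study_id, publication_date) tuples.
    Assumption: each earlier/later combination is one validation attempt;
    studies sharing a publication date are not paired with each other.
    """
    ordered = sorted(studies, key=lambda s: s[1])  # oldest first
    return [(a[0], b[0]) for a, b in combinations(ordered, 2) if a[1] < b[1]]


# Hypothetical studies for a single trait, given in arbitrary order.
studies = [
    ("gwas_C", date(2015, 3, 1)),
    ("gwas_A", date(2009, 6, 15)),
    ("gwas_B", date(2012, 1, 20)),
]
pairs = discovery_validation_pairs(studies)
for discovery, validation in pairs:
    print(discovery, "->", validation)
```

With n studies for a trait, this scheme yields up to n(n − 1)/2 validation attempts.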

Definitions of successful validation
We used four definitions of successful validation that differ in stringency. Under the strict definition, a SNP was considered validated when the validation GWAS detected the same SNP at the genome-wide level of significance (p < 5 × 10⁻⁸). Under the relaxed definition, a SNP was considered validated when the same or a linked SNP (r² > 0.8 in the validation population) was detected at the genome-wide level of significance. LD data were downloaded from the LDLink database (Myers et al. 2020). Under the soft definition, a SNP was considered validated if the original SNP or a SNP in tight LD with it was detected in the validation GWAS at the liberal level of significance of p < 10⁻⁵. Finally, under the ultra-soft definition, a SNP was considered validated if the original SNP or a tightly linked SNP reached the GWAS level of significance in at least one out of at least three subsequent GWASs (attempts). The principal difference between the ultra-soft definition and the other three is therefore that the latter assess validation success per single attempt, whereas the ultra-soft definition requires at least three validation attempts and considers the SNP validated if at least one of them is successful.
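The four definitions can be expressed as a small classifier. The sketch below makes two assumptions: "tight LD" in the soft and ultra-soft definitions is read as the same r² > 0.8 cut-off used in the relaxed definition, and a SNP with fewer than three subsequent attempts is treated as not evaluable under the ultra-soft definition. The `Hit` record and function names are hypothetical.

```python
from typing import NamedTuple

GENOME_WIDE_P = 5e-8  # genome-wide significance threshold
LIBERAL_P = 1e-5      # liberal threshold of the soft definition
TIGHT_LD_R2 = 0.8     # assumed r^2 cut-off for a linked SNP


class Hit(NamedTuple):
    """One SNP reported by a validation GWAS (hypothetical record layout)."""
    same_snp: bool  # True if this is the discovery SNP itself
    r2: float       # LD with the discovery SNP in the validation population
    p: float        # p value in the validation GWAS


def validate_attempt(hits):
    """Classify a single validation attempt under the three per-attempt definitions."""
    linked = [h for h in hits if h.same_snp or h.r2 > TIGHT_LD_R2]
    return {
        "strict": any(h.same_snp and h.p < GENOME_WIDE_P for h in hits),
        "relaxed": any(h.p < GENOME_WIDE_P for h in linked),
        "soft": any(h.p < LIBERAL_P for h in linked),
    }


def validate_ultra_soft(attempts):
    """Ultra-soft: at least three attempts, at least one passing the relaxed rule.

    Returns None when fewer than three attempts exist (assumed: not evaluable).
    """
    if len(attempts) < 3:
        return None
    return any(validate_attempt(hits)["relaxed"] for hits in attempts)


attempts = [
    [Hit(same_snp=False, r2=0.9, p=3e-9)],  # linked SNP, genome-wide significant
    [Hit(same_snp=True, r2=1.0, p=2e-6)],   # same SNP, liberal threshold only
    [],                                     # nothing detected near the SNP
]
first = validate_attempt(attempts[0])
second = validate_attempt(attempts[1])
ultra = validate_ultra_soft(attempts)
```

By construction, every SNP validated under the strict definition is also validated under the relaxed and soft ones, which is consistent with the ordering of stringency described above.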

Predictors of the validation success
The following SNP characteristics were used as predictors of validation success: (1) the level of statistical significance in the discovery GWAS expressed as −log(p), where p is the p value; (2) the effect size (either the original odds ratio (OR) or, for ORs < 1, the OR transformed to 1/OR to keep it on the same scale as ORs > 1); (3) risk allele frequency; (4) the type of the SNP (see below); (5) the level of evolutionary conservation of the site estimated by the PhyloP method (Pollard et al. 2010).
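The first two predictors are simple transformations of the reported summary statistics. A minimal sketch, assuming the logarithm is base 10 (which matches the equivalence of p < 5 × 10⁻⁸ and −log(p) > 7.3 given in the Introduction); the function names are hypothetical.

```python
import math


def neg_log10_p(p_value):
    """Significance on the -log10 scale; larger means more significant."""
    return -math.log10(p_value)


def folded_odds_ratio(odds_ratio):
    """Map ORs below 1 to 1/OR so that all effect sizes are >= 1."""
    return 1.0 / odds_ratio if odds_ratio < 1.0 else odds_ratio


print(round(neg_log10_p(5e-8), 2))   # genome-wide threshold on the -log10 scale
print(folded_odds_ratio(0.8))        # protective OR 0.8 is treated like OR 1.25
print(folded_odds_ratio(1.25))       # ORs above 1 are left unchanged
```

After folding, an OR of 0.8 and an OR of 1.25 describe effects of the same magnitude and fall into the same decile.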
PhyloP uses the distribution of nucleotide substitutions in an evolutionary tree of 44 vertebrate species to estimate the expected number of substitutions per site under the assumption of neutral evolution. The observed number of substitutions at the site is compared with the number expected under selective neutrality. A higher PhyloP score means a higher level of evolutionary conservation.

Table 1 (continued) Diseases/traits and number of GWASs
No.  Disease/trait  Number of GWASs
29  Lung cancer  14
30  Telomere length  13
31  Adiponectin levels  12
32  Attention deficit hyperactivity disorder  12
33  Fasting plasma glucose  12
34  Glycated hemoglobin levels  12
35  Age-related macular degeneration  11
36  Atrial fibrillation  11
37  Bilirubin levels  11
38  QT interval  11
39  Venous thromboembolism  11
40  Pancreatic cancer  10

Statistical analysis and visualization
To visualize the associations of quantitative features, e.g. −log(p), with validation success, we stratified predictors by deciles: we first ranked SNPs by the corresponding characteristic and then split them into ten groups. The validation success rate was estimated for each group separately. To compare different types of SNPs by validation success, we used the SNP types reported by the CPG. The most frequent SNP types used in the analysis were: "intron variant", "intergenic variant", "missense variant", "non-coding transcript exon variant", "3′UTR variant", "TF binding site variant", "5′UTR variant", and "synonymous variant". To estimate the effect of the same/different ethnicities in the discovery and the validation GWASs, we used the CPG data. The most frequently reported ethnicities are Europeans, East Asians, African American, Hispanic/Latino and Ashkenazi Jews. Associations were first estimated using univariate analyses. Features significant in the univariate analyses were then included in a multivariate logistic regression model, with validation status (validated/not validated) as the outcome, to evaluate their independent effects. We present the two extreme definitions of validation success: ultra-soft and strict. The results for the two other definitions were similar. All statistical analyses were performed using STATISTICA (TIBCO Software Inc.) and Origin (OriginLab Corporation, Northampton, MA, USA).

Results
Figure 1 shows validation success rates across diseases/traits. We observed more than an order of magnitude of variation among the phenotypes. Those with the lowest validation success rates included "Major depressive disorder", "Attention deficit hyperactivity disorder", "Bone mineral density", "Alcohol dependence", "Coronary heart disease" and "Bipolar disorder".
Diseases/traits with the highest validation success rates included "Breast cancer", "Asthma", "Venous thromboembolism", "QT interval", "Atrial fibrillation", and "Age-related macular degeneration".

Validation success rates for different definitions
The overall average validation success rate for SNPs across all phenotypes varied depending on the definition of validation success: 6.42 ± 0.07% under the strict definition, 6.66 ± 0.07% under the relaxed definition, 7.87 ± 0.08% under the soft definition, and 50.87 ± 0.16% under the ultra-soft definition.

The level of statistical significance in the discovery GWAS is positively associated with validation success
We observed a strong positive association of −log(p) in the discovery GWAS with validation success under all four definitions of validation success. Figure 2 shows the mean validation success rate for SNPs categorized by deciles of −log(p) in the discovery study. The proportion of validated SNPs increases with −log(p) over the range 5–7.5; for higher −log(p) deciles the slope is less steep. Similar shapes were observed for all definitions of validation success, including the ultra-soft definition, whose validation rate dwarfs that for a single validation attempt (Figs. 2b, 3).
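The decile stratification behind this kind of plot can be sketched in a few lines. The version below is an illustration only: it assumes ties are broken by input order when ranking and that groups are made as equal in size as possible, neither of which is specified in the text. The function name and toy data are hypothetical.

```python
import statistics


def validation_rate_by_decile(values, validated, n_groups=10):
    """Rank SNPs by `values`, split into `n_groups`, return (median, rate) per group.

    `validated` holds 1 for validated SNPs and 0 otherwise.
    Assumption: ties keep their input order (Python's sort is stable).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    groups = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not idx:  # fewer SNPs than groups
            continue
        median_value = statistics.median(values[i] for i in idx)
        rate = sum(validated[i] for i in idx) / len(idx)
        groups.append((median_value, rate))
    return groups


# Toy data: 20 SNPs with -log10(p) from 5.0 to 9.75; the upper half validated.
neg_log_p = [5.0 + 0.25 * i for i in range(20)]
status = [1 if i >= 10 else 0 for i in range(20)]
deciles = validation_rate_by_decile(neg_log_p, status)
for median_value, rate in deciles:
    print(f"median -log10(p) = {median_value:.3f}, validated = {rate:.0%}")
```

Plotting the group medians on the x-axis against the per-group rates gives a curve of the kind described for Fig. 2.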

Odds ratios in the discovery GWAS and validation success
Overall, negative correlations between the OR and SNP validation success were detected under all four definitions of validation success (Spearman rank-order correlations: strict, ρ = −0.03, N = 60,166, p = 7.1 × 10⁻¹⁵; relaxed, ρ = −0.01, N = 60,166, p = 2.8 × 10⁻³; soft, ρ = −0.02, N = 60,166, p = 5.1 × 10⁻⁵; ultra-soft, ρ = −0.1, N = 57,352, p = 5.6 × 10⁻²⁵). Decile stratification reveals a more complex relationship between OR and validation success. Validation success was highest for SNPs with ORs in the range 1.06–1.3, while SNPs with reported ORs < 1.06 or > 1.3 had lower validation success. Details on the ranges of ORs for each of the ten groups defined by the deciles are shown in Supplementary Table S2.
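A rank correlation summarizes only the monotone part of a relationship, which is why a small negative ρ can coexist with a peak at intermediate ORs. The sketch below computes Spearman's ρ from scratch (average ranks for ties, then a Pearson correlation of the ranks) on invented toy data; it is not the authors' STATISTICA analysis, and all names and values are hypothetical.

```python
import math


def average_ranks(xs):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy data: validation peaks at intermediate ORs and fails at both extremes.
odds_ratios = [1.05, 1.1, 1.2, 1.5, 2.0, 3.0]
validated = [0, 1, 1, 1, 0, 0]
rho = spearman_rho(odds_ratios, validated)
print(round(rho, 3))
```

In this toy example the correlation is negative even though the lowest OR also fails validation, so the coefficient alone would hide the non-monotone shape that the decile view exposes.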

Association between the risk allele frequency and validation success
Under the strict, relaxed, and soft definitions of validation success, there is a tendency for common risk-associated alleles (allelic frequency close to 0.5) to have a higher validation success rate (Fig. 4a). The association is more evident under the ultra-soft definition of validation success (Fig. 4b). When we used the MAF of the reported SNP (Fig. 5), we found that SNPs with MAF 0.3–0.5 tended to be validated more often than rarer SNPs.
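The relation between the two frequency scales used here is a simple fold: a risk allele frequency f corresponds to a minor allele frequency of min(f, 1 − f). The sketch below also prints the expected heterozygosity 2f(1 − f), which assumes Hardy-Weinberg proportions and is offered only as one common intuition for why alleles near 0.5 are the most informative; the function names are hypothetical.

```python
def minor_allele_frequency(risk_allele_freq):
    """Fold a risk allele frequency onto the 0-0.5 minor allele frequency scale."""
    return min(risk_allele_freq, 1.0 - risk_allele_freq)


def expected_heterozygosity(freq):
    """Expected heterozygote fraction 2f(1-f), assuming Hardy-Weinberg proportions."""
    return 2.0 * freq * (1.0 - freq)


for raf in (0.05, 0.3, 0.5, 0.7, 0.95):
    maf = minor_allele_frequency(raf)
    het = expected_heterozygosity(raf)
    print(f"risk allele freq {raf:.2f} -> MAF {maf:.2f}, heterozygosity {het:.3f}")
```

Risk allele frequencies of 0.3 and 0.7 map to the same MAF, so a peak in the 0.3–0.7 range on the risk-allele scale corresponds to the 0.3–0.5 range on the MAF scale.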

Different types of SNPs differ by validation success
We compared different types of SNPs by validation success (Fig. 6). Intergenic and intron variants had the lowest validation success rates. The highest validation success rates were observed for missense and stop-gained SNPs and for SNPs located in TF binding sites or 5'UTRs.

SNPs are more likely to be validated when the same race/ethnicity is targeted by discovery and validation GWASs
When the discovery and the validation GWASs target the same race/ethnicity, the validation success rate is higher compared to the situation when the ethnicities in the discovery and validation GWASs are different. This is true regardless of the definition of the validation success (Fig. 8).

Multivariate logistic regression analysis
We analyzed the predictors simultaneously using a binary logistic regression model with validation status as the outcome. All predictors remained significant for both the strict and ultra-soft definitions of validation (Tables 2, 3).
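To make the model concrete, the sketch below fits a binary logistic regression with two predictors by plain gradient ascent on the log-likelihood. It is a stand-in for illustration, not the STATISTICA procedure used in the study: the data are invented, only two of the predictors are included, −log10(p) is centered at 7.3 purely to keep the optimization well conditioned, and no standard errors or p values are computed.

```python
import math


def fit_logistic(X, y, learning_rate=0.1, n_iter=5000):
    """Fit logit P(y=1) = w0 + w1*x1 + ... by gradient ascent; returns [w0, w1, ...]."""
    weights = [0.0] * (len(X[0]) + 1)  # weights[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * len(weights)
        for row, target in zip(X, y):
            z = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
            residual = target - 1.0 / (1.0 + math.exp(-z))
            grad[0] += residual
            for j, x in enumerate(row):
                grad[j + 1] += residual * x
        weights = [w + learning_rate * g / len(X) for w, g in zip(weights, grad)]
    return weights


# Invented toy data: (centered -log10(p) in discovery, same ethnicity 0/1).
X = [(-2.0, 0), (-1.5, 1), (-1.0, 0), (-0.5, 1), (0.0, 0), (0.5, 1),
     (1.0, 0), (1.5, 1), (2.0, 0), (3.0, 1), (4.0, 0), (-1.0, 1)]
y = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0]  # 1 = validated

intercept, coef_significance, coef_same_ethnicity = fit_logistic(X, y)
print("odds ratio per unit of -log10(p):", round(math.exp(coef_significance), 2))
print("odds ratio for same ethnicity:", round(math.exp(coef_same_ethnicity), 2))
```

A predictor contributes independently in the sense used above when its coefficient stays away from zero while the other predictors are in the model; in this toy fit both coefficients are positive.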

Discussion
Compared to our previous study (Gorlov et al. 2014), the current analysis is based on a larger sample size and includes more predictors. We confirmed the previous associations and added new ones. Validation success rate per single validation attempt was similar for the strict, relaxed and soft definitions, in the range of 6–8%. One of the possible reasons for the low validation rate for SNPs could be that our analysis included gray zone SNPs. However, when such SNPs were excluded from the analysis, the validation success increased only marginally for the strict definition, from 6.42 ± 0.07% to 6.46 ± 0.08%. A similarly slight increase in validation success after removing gray zone SNPs was observed for the relaxed and soft definitions of validation. It is unlikely, also, that differences in genotyping platforms are a major contributor to the low validation success. By definition, validation GWASs are performed later than discovery, and later GWASs tend to use denser genotyping platforms. Besides, the investigators usually impute SNPs that were detected as significant earlier if they were not on the genotyping platform (Li et al. 2009; Shi et al. 2018). We found that having the same genotyping platform increased chances of validation only by 1.1%.

[Figure caption, beginning missing: ... Table S2). X-coordinates of the dots represent the median OR in each group. The bars represent 95% CI for the proportion of validated SNPs. a Strict, relaxed, and soft definitions of the validation success. b The same as a plus the ultra-soft definition of the validation success]
Ethnicity is an important factor to consider in GWASs since many SNPs show significant population variation. However, to assess the external validity of the associations of SNPs with diseases/traits, we used validation studies that did not exactly match the characteristics of the corresponding discovery studies. Thus, the effect of race on SNP reproducibility was one of the factors we wanted to explore. We showed that targeting the same ethnicity in the discovery and validation GWASs has a profound effect on validation success (about two-fold, Fig. 8). This indicates that targeting genetically similar populations is important for successful validation. Note that we have used only major population categories: Europeans, East Asians, African American, Hispanic/Latino and Ashkenazi Jews. The major population groups are genetically heterogeneous. There are, for example, significant genetic differences among European subpopulations, which also can impact reproducibility (Lao et al. 2008).
Not surprisingly, the level of statistical significance in the discovery GWAS was the strongest predictor of validation success. The association between validation success and OR was markedly nonlinear. The highest validation success rate was in the group of SNPs with ORs in the range from 1.1 to 1.3, suggesting that "real" ORs tend to be within this range. Compared to these, SNPs with ORs > 2 in the discovery GWAS are 40% less likely to be validated. This may be because the initial discoveries tend to overestimate the effect sizes, a "winner's curse" (Lohmueller et al. 2003; Shi et al. 2016; Xiao and Boehnke 2011). Validation success rate was highest for the most polymorphic SNPs, likely because statistical power is highest for SNPs with a frequency close to 0.5 (Hong and Park 2012; Sham and Purcell 2014).
Intronic, intergenic and synonymous SNPs showed lower validation rates compared to the missense SNPs, SNPs located in TF binding sites or in 5'UTR regions. The most likely explanation for this is that some GWAS-detected SNPs are causal (Caballero et al. 2015; Schaid et al. 2018; Wang et al. 2020). Functional SNPs affect the level of expression and/or protein function, including protein folding. Missense SNPs and SNPs located in TF binding sites or 5'UTR regions (often loaded with regulatory elements) are likely to be functional (Buroker 2014; Huo et al. 2019; Lou et al. 2017). It is accepted that the level of evolutionary conservation of the site reflects its functional importance (O'Connor et al. 2019; Zeng et al. 2018), suggesting that the positive association between the level of evolutionary conservation of the site and replication success that we found is due to the presence of functional causal SNPs among GWAS top hits. All predictors of validation success detected in univariate analysis remained significant in the multivariate logistic regression analysis. The most significant predictors in multivariate analysis were the level of statistical significance in the discovery, followed by SNP type and PhyloP score (Tables 2, 3).
The results of this study suggest that SNP features may help to select SNPs with the highest chances of being validated. Indeed, when we selected SNPs based on the five major predictors of validation success, namely (1) the SNP is genome-wide significant in the discovery GWAS; (2) the risk allele frequency is between 0.1 and 0.9; (3) the SNP is missense, or is located in a TF binding site or in a 5'UTR region; (4) the SNP has a high level of evolutionary conservation; and (5) the discovery and validation GWASs target the same ethnicity, the resulting SNPs showed a validation success rate of 32.6 ± 5.8% under the strict definition, which is much higher than the overall average. Surprisingly, we found that the validation success rate of the gray zone SNPs (5 × 10⁻⁸ < p < 10⁻⁵) was lower than, but still comparable to, that of SNPs with a genome-wide level of statistical significance in the discovery: 4.19 ± 0.09% versus 12.26 ± 0.08% under the strict definition of validation success (Fig. 2, first 4 points versus other points). This indicates that gray zone SNPs are enriched for true positives.
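The five selection criteria translate directly into a filter. In the sketch below the record layout, the rsIDs and the SNP type spellings are hypothetical, the frequency bounds are taken as inclusive, and the PhyloP cut-off is a placeholder parameter: the text does not state the threshold that defines a "high" level of conservation.

```python
from dataclasses import dataclass

GENOME_WIDE_P = 5e-8
# Type labels follow the list in the Methods; exact spellings are an assumption.
FUNCTIONAL_TYPES = {"missense variant", "TF binding site variant", "5'UTR variant"}


@dataclass
class Snp:
    rsid: str
    discovery_p: float       # p value in the discovery GWAS
    risk_allele_freq: float
    snp_type: str
    phylop: float            # PhyloP conservation score
    same_ethnicity: bool     # discovery and validation target the same ethnicity


def is_priority(snp, phylop_min=1.0):
    """Apply the five criteria; `phylop_min` is a placeholder, not the study's cut-off."""
    return (
        snp.discovery_p < GENOME_WIDE_P
        and 0.1 <= snp.risk_allele_freq <= 0.9
        and snp.snp_type in FUNCTIONAL_TYPES
        and snp.phylop >= phylop_min
        and snp.same_ethnicity
    )


# Hypothetical candidates.
candidates = [
    Snp("rs_example_1", 1e-12, 0.45, "missense variant", 3.2, True),
    Snp("rs_example_2", 1e-12, 0.45, "intron variant", 3.2, True),   # wrong type
    Snp("rs_example_3", 2e-6, 0.45, "missense variant", 3.2, True),  # gray zone p
    Snp("rs_example_4", 1e-9, 0.05, "5'UTR variant", 2.0, True),     # rare allele
]
selected = [snp.rsid for snp in candidates if is_priority(snp)]
print(selected)
```

Because the criteria are combined with a logical AND, the selected set shrinks quickly, which fits the wide uncertainty (± 5.8%) reported for the filtered group.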

Limitations of the study
Subsequent GWASs targeting the same phenotype were considered in this study as independent validations. That is not always the case. In some cases the subsequent GWASs include a subset of samples already used in an earlier GWAS, which is likely to inflate the validation success rate. We do not think, however, that this issue substantially affects the findings on associations between the SNP characteristics and validation rate. Besides, based on our experience with lung cancer GWASs and a limited review of published GWASs, we believe that a typical overlap (if any) does not exceed 20%. We found that the associations were very similar across different definitions of validation success. Another limitation is that we did not handle meta-analysis studies any differently from standard two-phase GWASs. We formally followed the classification adopted by the Catalog of Published GWASs because it reflects the current state of knowledge of disease etiology. We acknowledge that disease classification is a moving target, and a disease once considered genetically homogeneous may later be reclassified into several distinct diseases as it becomes better studied. When assessing the effect of different ethnicities in the discovery and validation GWASs on validation success, we did not take into account possible effects of differences in allele frequency between discovery and validation GWASs on statistical power. For non-validated SNPs, risk allele frequencies in validation GWASs are not available. This precluded us from using differences in MAFs between discovery and validation GWASs as a predictor in the multivariate model, which could have helped to decide whether the effect of ethnicity on validation success is due to differences in allele frequencies between the discovery and validation populations.