Diseases/traits differ in average validation success
Figure 1 shows validation success rates across diseases/traits. We observed more than an order of magnitude of variation among the phenotypes. Those with the lowest validation success rates included “Major depressive disorder”, “Attention deficit hyperactivity disorder”, “Bone mineral density”, “Alcohol dependence”, “Coronary heart disease”, and “Bipolar disorder”. Diseases/traits with the highest validation success rates included “Breast cancer”, “Asthma”, “Venous thromboembolism”, “QT interval”, “Atrial fibrillation”, and “Age-related macular degeneration”.
Validation success rates for different definitions
The overall average validation success rate for SNPs across all phenotypes depended on the definition of validation success: 6.42±0.07% under the strict definition, 6.66±0.07% under the relaxed definition, 7.87±0.08% under the soft definition, and 50.87±0.16% under the ultra-soft definition.
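This section does not define the ± values. If they are binomial standard errors of a proportion (an assumption, not something stated here), a rate and its uncertainty could be computed as in the following minimal sketch. The function name `rate_with_se` and the counts are hypothetical and chosen only for illustration.

```python
import math

def rate_with_se(n_validated, n_total):
    """Return (rate, standard error) of a proportion, both in percent.

    Assumes a simple binomial model: SE = sqrt(p * (1 - p) / n).
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_validated <= n_total:
        raise ValueError("n_validated must lie between 0 and n_total")
    p = n_validated / n_total
    se = math.sqrt(p * (1.0 - p) / n_total)
    return 100.0 * p, 100.0 * se

# Hypothetical counts, not taken from the study.
rate, se = rate_with_se(n_validated=642, n_total=10_000)
print(f"{rate:.2f} ± {se:.2f}%")
```

Under this model the standard error shrinks with the square root of the number of SNPs, so a small ± on a rate of a few percent implies a large SNP count.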
The level of statistical significance in the discovery GWAS is positively associated with validation success
We observed a strong positive association between -log(p) in the discovery GWAS and validation success under all four definitions of validation success. Figure 2 shows the mean validation success rate for SNPs grouped by deciles of -log(p) in the discovery study. The proportion of validated SNPs rises with -log(p) over the range 5-7.5; at higher -log(p) deciles the slope is less steep. Similar shapes were observed for all definitions of validation success, including the ultra-soft definition, under which the validation rate dwarfs that for a single validation attempt (Fig. 2b).
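The decile stratification described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function name `rate_by_decile`, the equal-count binning rule, and the toy data are all hypothetical.

```python
from statistics import mean

def rate_by_decile(neg_log_p, validated, n_bins=10):
    """Mean validation rate within equal-count bins of -log(p).

    neg_log_p: list of floats; validated: list of 0/1 flags (same length).
    Returns one mean per bin, ordered from lowest to highest -log(p).
    Tied -log(p) values may be split across neighbouring bins.
    """
    if len(neg_log_p) != len(validated):
        raise ValueError("inputs must have the same length")
    if len(neg_log_p) < n_bins:
        raise ValueError("need at least one SNP per bin")
    # Indices of SNPs sorted by increasing -log(p).
    order = sorted(range(len(neg_log_p)), key=lambda i: neg_log_p[i])
    n = len(order)
    rates = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        rates.append(mean(validated[i] for i in idx))
    return rates

# Toy data (not from the study): 20 SNPs, validated more often at high -log(p).
toy_x = [5.0 + 0.25 * i for i in range(20)]
toy_y = [0] * 8 + [1, 0, 1, 0] + [1] * 8
decile_rates = rate_by_decile(toy_x, toy_y)
```

Plotting `decile_rates` against the bin index gives a curve of the kind shown in Figure 2, with one point per decile.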
Odds ratios in the discovery GWAS and validation success
Overall, the OR was negatively correlated with SNP validation success under the strict, relaxed, soft, and ultra-soft definitions of validation success (Spearman rank-order correlations, respectively: ρ=-0.03, N=60,166, p=7.1×10⁻¹⁵; ρ=-0.01, N=60,166, p=2.8×10⁻³; ρ=-0.02, N=60,166, p=5.1×10⁻⁵; ρ=-0.1, N=57,352, p=5.6×10⁻²⁵). Stratifying the ORs by decile reveals a more complex relationship: validation success was highest for SNPs with ORs in the range 1.06-1.3, while SNPs with reported ORs <1.06 or >1.3 had lower validation success.
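For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration with hypothetical function names and toy data; it is not the software used in the study.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length inputs with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5

# Toy data (not from the study): odds ratios and 0/1 validation flags.
toy_or = [1.05, 1.10, 1.20, 1.35, 1.50, 2.00]
toy_validated = [0, 1, 1, 1, 0, 0]
rho = spearman_rho(toy_or, toy_validated)
```

Because ρ captures only monotonic trends, a relationship that peaks at intermediate ORs, as described above, can produce a weak overall ρ even when the decile pattern is pronounced.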
Association between the risk allele frequency and validation success
Under the strict, relaxed, and soft definitions of validation success, common risk-associated alleles (allele frequency close to 0.5) tend to have a higher validation success rate (Fig. 4a). The association is more evident under the ultra-soft definition of validation success (Fig. 4b). When we used the MAF of the reported SNP (Fig. 5), we found that SNPs with MAF 0.3-0.5 tended to be validated more often than rarer or more common SNPs.
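The link between the two frequency measures used above can be made explicit. For a biallelic SNP (an assumption of this sketch), the minor allele frequency is the smaller of the risk allele frequency and its complement, so risk allele frequencies near 0.5 correspond to the highest MAF values. The function name is hypothetical.

```python
def minor_allele_frequency(risk_allele_freq):
    """MAF of a biallelic SNP given the frequency of its risk allele."""
    if not 0.0 <= risk_allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    # The minor allele is whichever of the two alleles is less frequent.
    return min(risk_allele_freq, 1.0 - risk_allele_freq)

# A risk allele at frequency 0.8 and one at 0.2 both give MAF 0.2.
maf_common_risk = minor_allele_frequency(0.8)
maf_rare_risk = minor_allele_frequency(0.2)
```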
Different types of SNPs differ by validation success
We compared validation success across SNP types (Fig. 6). Intergenic and intronic variants had the lowest validation success rates. Validation success rates were highest for missense and stop-gained variants and for SNPs located in TF binding sites or the 5’ UTR.
Validation success is higher for SNPs located in evolutionarily conserved sites
We found positive correlations between the level of evolutionary conservation (PhyloP score) and validation success under the strict, relaxed, soft, and ultra-soft definitions of validation success (Spearman rank-order correlations, respectively: ρ=0.04, N=125,087, p<10⁻²⁵; ρ=0.04, N=125,087, p<10⁻²⁵; ρ=0.04, N=125,087, p<10⁻²⁵; ρ=0.07, N=117,643, p<10⁻²⁵). Figure 7 shows that SNPs with evidence of evolutionary conservation are more likely to be validated.
SNPs are more likely to be validated when the same race/ethnicity is targeted by discovery and validation GWASs
When the discovery and validation GWASs target the same race/ethnicity, the validation success rate is higher than when they target different races/ethnicities. This holds regardless of the definition of validation success (Figure 8).
Multivariable logistic regression analysis
We analyzed the predictors simultaneously using a binary logistic regression model with validation status as the outcome. All predictors remained significant under both the strict and ultra-soft definitions of validation (Tables 2a and 2b).
Table 2a. Multivariable prediction of SNP validation success in GWASs (strict definition of validation).
Predictor | P-value | OR | 95% CI lower | 95% CI upper
SNP MAF | 1.233E-02 | 1.494 | 1.091 | 2.046
-log p-value at discovery | 1.670E-60 | 1.013 | 1.012 | 1.015
Different population in validation and discovery* | 1.061E-93 | 0.390 | 0.356 | 0.427
PhyloP score | 5.647E-41 | 1.202 | 1.170 | 1.235
OR groups stratified by deciles: 1, reference | | | |
2 | 4.699E-04 | 0.745 | 0.632 | 0.879
3 | 1.359E-02 | 0.833 | 0.721 | 0.963
4 | 3.456E-03 | 0.778 | 0.657 | 0.921
5 | 8.721E-12 | 0.558 | 0.472 | 0.660
6 | 8.752E-05 | 0.718 | 0.608 | 0.847
7 | 2.305E-07 | 0.665 | 0.569 | 0.776
8 | 3.642E-08 | 0.627 | 0.531 | 0.741
9 | 6.647E-07 | 0.652 | 0.551 | 0.772
10 | 2.022E-15 | 0.474 | 0.395 | 0.570
SNP type categories**: likely non-functional, reference | | | |
Other | 1.435E-01 | 1.089 | 0.971 | 1.222
Likely functional | 4.233E-22 | 1.854 | 1.636 | 2.101
*Reference, the same population in the discovery and validation
**Non-functional: intergenic, synonymous, intronic; functional: 5’ UTR, missense, nonsense, located in transcription factor binding sites; other – non-coding exonic, 3’ UTR
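As a reading aid for Table 2a, the sketch below shows how an odds ratio, its 95% CI, and the p-value relate in a logistic regression when the interval is a Wald interval on the log-odds scale. That is the usual software output but is assumed here, not stated in the text. The function name `wald_from_ci` is hypothetical.

```python
import math

def wald_from_ci(odds_ratio, ci_lower, ci_upper, z_crit=1.959964):
    """Recover beta, its standard error, and a two-sided Wald p-value
    from an odds ratio and its 95% confidence interval.

    Assumes OR = exp(beta) and CI = exp(beta -/+ z_crit * SE).
    """
    if odds_ratio <= 0 or not 0 < ci_lower < ci_upper:
        raise ValueError("need a positive OR and 0 < lower < upper")
    beta = math.log(odds_ratio)
    # The CI width on the log scale equals 2 * z_crit * SE.
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * z_crit)
    z = beta / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value

# SNP MAF row of Table 2a: OR 1.494, 95% CI 1.091 to 2.046.
beta, se, p_value = wald_from_ci(1.494, 1.091, 2.046)
```

For this row the recovered p-value is approximately 0.012, in line with the tabulated 1.233E-02, which supports the Wald reading. Small differences are expected because the table values are rounded.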
Table 2b. Multivariable prediction of SNP validation success in GWASs (ultra-soft definition of validation).
Predictor | P-value | OR | 95% CI lower | 95% CI upper
SNP MAF | 1.76E-20 | 2.415 | 2.005 | 2.91
-log p-value at discovery | 1.81E-188 | 1.078 | 1.073 | 1.082
Different population in validation and discovery* | 9.34E-11 | 0.808 | 0.757 | 0.862
PhyloP score | 3.00E-54 | 1.164 | 1.142 | 1.187
OR groups stratified by deciles: 1, reference | | | |
2 | 0.001 | 1.172 | 1.063 | 1.292
3 | 1.74E-17 | 1.491 | 1.36 | 1.635
4 | 0.011 | 1.146 | 1.032 | 1.273
5 | 3.82E-08 | 0.771 | 0.702 | 0.846
6 | 0.748 | 1.017 | 0.92 | 1.123
7 | 3.317E-10 | 0.741 | 0.674 | 0.813
8 | 6.999E-51 | 0.481 | 0.437 | 0.529
9 | 8.602E-118 | 0.294 | 0.265 | 0.326
10 | 1.848E-43 | 0.488 | 0.441 | 0.541
SNP type categories**: likely non-functional, reference | | | |
Other | 1.170E-13 | 0.776 | 0.726 | 0.83
Likely functional | 2.165E-54 | 2.204 | 1.994 | 2.435
*Reference, the same population in the discovery and validation
**Non-functional: intergenic, synonymous, intronic; functional: 5’ UTR, missense, nonsense, located in transcription factor binding sites; other – non-coding exonic, 3’ UTR
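The three-level SNP type predictor defined in the footnote can be written as a simple lookup. The grouping is taken from the footnote; the exact annotation strings and the names `SNP_TYPE_GROUPS` and `classify_snp_type` are assumptions made for illustration.

```python
# Grouping follows the table footnote; the label spellings are assumed.
SNP_TYPE_GROUPS = {
    "likely non-functional": {"intergenic", "synonymous", "intronic"},
    "likely functional": {"5' UTR", "missense", "nonsense", "TF binding site"},
    "other": {"non-coding exonic", "3' UTR"},
}

def classify_snp_type(annotation):
    """Return the footnote's group for an annotation, or None if unlisted."""
    for group, members in SNP_TYPE_GROUPS.items():
        if annotation in members:
            return group
    return None

# Example: a missense variant falls in the "likely functional" group.
example_group = classify_snp_type("missense")
```

In the regression, "likely non-functional" serves as the reference level, so the "Other" and "Likely functional" odds ratios are read relative to it.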