The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population-level polygenic risk score (PRS) can explain phenotypic variation among geographic populations based solely on risk allele frequencies. We applied a population-specific PRS (psPRS) to 26 populations from the 1000 Genomes Project for four phenotypes: lactase persistence (LP), melanoma, multiple sclerosis (MS) and height. Our models assumed additive genetic architecture among the polymorphisms in the psPRS, as is conventional. Linear psPRSs explained a significant proportion of trait variance, ranging from 0.32 for height in men to 0.88 for melanoma. The best models for LP and height were linear, while those for melanoma and MS were nonlinear. Because not all variants in a PRS may confer similar, or even any, risk among diverse populations, we also filtered out SNPs to assess whether psPRSs with fewer SNPs explained more variance. Variance explained usually improved with fewer SNPs in the psPRS and was as high as 0.99 for height in men using only 548 of the initial 4208 SNPs. That reducing SNPs improves psPRS performance may indicate that missing heritability is partially due to complex architecture that does not mandate additivity, undiscovered variants, or spurious associations in the databases. We demonstrated that PRS-based analyses can be used across diverse populations and phenotypes for population prediction and that these comparisons can identify universal risk variants.
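To make the idea of a score built "solely on risk allele frequencies" concrete, the following is a minimal sketch, not the authors' implementation. It assumes an unweighted additive score (the expected number of risk alleles per diploid individual) and an ordinary least-squares fit of a population-level trait value on that score. The population labels, frequencies, and trait values are invented toy data, and the function names `ps_prs` and `linear_fit` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def ps_prs(risk_allele_freqs: Sequence[float]) -> float:
    """Unweighted additive score: expected risk-allele count per diploid individual.

    Assumption: every SNP contributes equally (no effect-size weights).
    """
    return sum(2.0 * p for p in risk_allele_freqs)


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Toy data (invented for illustration): risk-allele frequencies at three SNPs
# in three hypothetical populations, plus a population-level trait value.
toy_freqs: Dict[str, List[float]] = {
    "POP_A": [0.1, 0.2, 0.3],
    "POP_B": [0.2, 0.3, 0.4],
    "POP_C": [0.4, 0.5, 0.6],
}
toy_trait: Dict[str, float] = {"POP_A": 0.10, "POP_B": 0.20, "POP_C": 0.40}

populations = sorted(toy_freqs)
scores = [ps_prs(toy_freqs[pop]) for pop in populations]
trait_values = [toy_trait[pop] for pop in populations]
slope, intercept, r_squared = linear_fit(scores, trait_values)
```

Each population contributes a single point to the regression, so the r² here measures how well the score orders and spaces populations, not how well it predicts individuals.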
Figure 1
Figure 2
Figure 3
Figure 4
Supplementary files associated with this preprint:
Figure S1. Lactase persistence separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Full model by super population: AFR (r² = 0.0021, p-value: 0.9314), AMR (r² = 0.2608, p-value: 0.659), EAS (r² = 0.077, p-value: 0.6514), EUR (r² = 0.9734, p-value: 0.00185) and SAS (r² = 0.3847, p-value: 0.2643). B) Super populations after maximization: AFR (r² = 0.0177, p-value: 0.8017), AMR (r² = 0.2284, p-value: 0.683), EAS (no data), EUR (r² = 0.9747, p-value: 0.00172) and SAS (r² = 0.3914, p-value: 0.0580).
Figure S2. Melanoma separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Full model by super population: AFR (r² = 0.1178, p-value: 0.5718), AMR (r² = 0.6664, p-value: 0.1837), EAS (r² = 0.4958, p-value: 0.1844), EUR (r² = 0.0421, p-value: 0.7949) and SAS (r² = 0.5914, p-value: 0.1285). B) Super populations after maximization: AFR (r² = 0.1767, p-value: 0.481), AMR (r² = 0.7766, p-value: 0.1187), EAS (r² = 0.2399, p-value: 0.4022), EUR (r² = 0.6268, p-value: 0.2083) and SAS (r² = 0.4324, p-value: 0.2278).
Figure S3. Multiple sclerosis separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Full model by super population: AFR (r² = 0.4336, p-value: 0.155), AMR (r² = 0.1958, p-value: 0.5575), EAS (r² = 0.0459, p-value: 0.7293), EUR (r² = 0.3676, p-value: 0.3937) and SAS (r² = 0.3775, p-value: 0.270). B) Super populations after maximization: AFR (r² = 0.7781, p-value: 0.02003), AMR (r² = 0.9821, p-value: 0.008995), EAS (r² = 0.8775, p-value: 0.0189), EUR (r² = 0.9988, p-value: 0.000617) and SAS (r² = 0.2356, p-value: 0.407).
Figure S4. Male height separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Super populations with full model: AFR (r² = 0.7835, p-value: 0.00806), AMR (r² = 0.8628, p-value: 0.0522), EAS (r² = 0.1003, p-value: 0.9254), EUR (r² = 0.578, p-value: 0.551) and SAS (r² = 0.1162, p-value: 0.2812). B) Super populations after maximization: AFR (r² = 0.5534, p-value: 6.549 × 10⁻⁷), AMR (r² = 0.8556, p-value: 0.001876), EAS (r² = 0.0163, p-value: 2.052 × 10⁻⁵), EUR (r² = 0.9158, p-value: 0.01064) and SAS (r² = 0.0475, p-value: 0.002888).
Figure S5. Female height separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Super populations with full model: AFR (r² = 0.5534, p-value: 0.05523), AMR (r² = 0.8556, p-value: 0.07501), EAS (r² = 0.0163, p-value: 0.8379), EUR (r² = 0.2222, p-value: 0.4229) and SAS (r² = 0.0475, p-value: 0.7246). B) Super populations after maximization: AFR (r² = 0.9533, p-value: 0.0001627), AMR (r² = 0.9917, p-value: 0.004174), EAS (r² = 0.963, p-value: 0.003058), EUR (r² = 0.754, p-value: 0.05619) and SAS (r² = 0.9761, p-value: 0.001584).
Table S1. Lactase persistence full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S2. Melanoma full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S3. Multiple sclerosis full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S4. Height full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S5. PRS values for each population, before and after maximization.
Table S6. Female height r² values from the maximization analysis.
Table S7. Lactase persistence filtered data set.
Table S8. Melanoma filtered data set.
Table S9. Multiple sclerosis filtered data set.
Table S10. Male height filtered data set.
Table S11. Female height filtered data set.
Figure S6. Lactase persistence maximization analysis r² values.
Figure S7. Melanoma maximization analysis r² values.
Figure S8. Multiple sclerosis maximization analysis r² values.
Figure S9. Male height maximization analysis r² values.
Figure S10. Female height maximization analysis r² values.
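The supplementary tables list SNPs "in order of removal" and the figures report r² values from a maximization analysis, but this page does not state the exact procedure. As one plausible reading, the sketch below uses greedy backward elimination: at each step it drops the SNP whose removal most increases the across-population r², and stops when no removal helps. The greedy rule, the stopping tolerance `min_gain`, the function names, and the toy data are all assumptions made for illustration and are not taken from the preprint.

```python
from typing import Dict, List, Sequence, Tuple


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def score_r2(freqs: Dict[str, Sequence[float]],
             trait: Dict[str, float],
             keep: Sequence[int]) -> float:
    """r² between the trait and an unweighted additive score over the kept SNPs."""
    pops = sorted(trait)
    scores = [sum(2.0 * freqs[pop][i] for i in keep) for pop in pops]
    return r_squared(scores, [trait[pop] for pop in pops])


def greedy_filter(freqs: Dict[str, Sequence[float]],
                  trait: Dict[str, float],
                  min_gain: float = 1e-9) -> Tuple[List[int], List[int], float]:
    """Drop SNPs one at a time while doing so raises r² by more than min_gain.

    Returns (kept SNP indices, removed indices in order of removal, final r²).
    """
    n_snps = len(next(iter(freqs.values())))
    keep = list(range(n_snps))
    removed: List[int] = []
    best = score_r2(freqs, trait, keep)
    while len(keep) > 1:
        # Evaluate the r² obtained by leaving out each remaining SNP in turn.
        candidates = [
            (score_r2(freqs, trait, [k for k in keep if k != i]), i) for i in keep
        ]
        cand_r2, cand_index = max(candidates)
        if cand_r2 <= best + min_gain:
            break  # no single removal improves the fit
        keep.remove(cand_index)
        removed.append(cand_index)
        best = cand_r2
    return keep, removed, best


# Toy data (invented): SNPs 0 and 1 track the trait, SNP 2 is unrelated noise.
toy_freqs = {
    "POP_A": [0.1, 0.1, 0.9],
    "POP_B": [0.2, 0.2, 0.1],
    "POP_C": [0.3, 0.3, 0.8],
    "POP_D": [0.4, 0.4, 0.2],
}
toy_trait = {"POP_A": 0.2, "POP_B": 0.4, "POP_C": 0.6, "POP_D": 0.8}

kept, removed, final_r2 = greedy_filter(toy_freqs, toy_trait)
```

A search like this selects SNPs on the same populations it is scored on, so the resulting r² is an in-sample figure; an independent set of populations or phenotypes would be needed to judge how well the filtered score generalizes.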
Posted 25 Mar, 2021
Received 20 Mar, 2021
On 19 Mar, 2021
Invitations sent on 19 Mar, 2021
On 17 Mar, 2021