Estimating Prevalence of Human Traits Among Populations From Polygenic Risk Scores

doi:10.21203/rs.3.rs-341766/v1

Research Article

This work is licensed under a CC BY 4.0 License

Journal Publication: published 01 Dec, 2021 in Human Genomics

Version 1

The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population-level polygenic risk score (PRS) can explain phenotypic variation among geographic populations based solely on risk allele frequencies. We applied a population-specific PRS (psPRS) to 26 populations from the 1000 Genomes Project for four phenotypes: lactase persistence (LP), melanoma, multiple sclerosis (MS) and height. Our models assumed additive genetic architecture among the polymorphisms in the psPRS, as is conventional. Linear psPRSs explained a significant proportion of trait variance, ranging from 0.32 for height in men to 0.88 for melanoma. The best models for LP and height were linear, while those for melanoma and MS were nonlinear. Because not all variants in a PRS may confer similar, or even any, risk across diverse populations, we also filtered out SNPs to assess whether psPRSs with fewer SNPs explained more variance. Variance explained usually improved with fewer SNPs in the psPRS and was as high as 0.99 for height in men using only 548 of the initial 4208 SNPs. That reducing SNPs improves psPRS performance may indicate that missing heritability is partially due to complex architecture that does not mandate additivity, undiscovered variants, or spurious associations in the databases. We demonstrated that PRS-based analyses can be used across diverse populations and phenotypes for population prediction and that these comparisons can identify the universal risk variants.
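The idea of scoring whole populations from risk allele frequencies and regressing a trait on that score can be illustrated with a minimal sketch. It assumes an additive score of the form 2 × frequency × weight summed over SNPs; the exact scoring formula is not given in this text, and every population name, frequency, and trait value below is made up for illustration.

```python
# Toy illustration only: populations, allele frequencies, and trait values
# are hypothetical and are not taken from the study.

def population_prs(freqs, weights=None):
    """Additive population-level score: sum of 2 * risk-allele frequency * weight."""
    if weights is None:
        weights = [1.0] * len(freqs)  # unweighted: every risk allele counts equally
    return sum(2.0 * f * w for f, w in zip(freqs, weights))

def r_squared(x, y):
    """Squared Pearson correlation, i.e. r^2 of a simple linear fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation, so nothing can be explained
    return sxy * sxy / (sxx * syy)

# Hypothetical risk-allele frequencies for three SNPs in four populations.
freqs_by_pop = {
    "pop_A": [0.10, 0.20, 0.05],
    "pop_B": [0.30, 0.25, 0.10],
    "pop_C": [0.55, 0.40, 0.35],
    "pop_D": [0.80, 0.60, 0.50],
}
# Hypothetical trait prevalence per population.
trait_by_pop = {"pop_A": 0.12, "pop_B": 0.30, "pop_C": 0.55, "pop_D": 0.85}

pops = sorted(freqs_by_pop)
scores = [population_prs(freqs_by_pop[p]) for p in pops]
traits = [trait_by_pop[p] for p in pops]
fit_r2 = r_squared(scores, traits)
print(f"r^2 = {fit_r2:.3f}")
```

Passing per-SNP effect sizes as `weights` would turn the unweighted count into a weighted score; which of the two the study used is not stated here.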

Medical Genetics

Population Genetics

Disease prevalence

PRS transferability

Universal risk variants

Genetic architecture

Table 1. Relative trait distribution among populations.
Super population	Population	Lactase Persistence	Melanoma	Multiple Sclerosis	Male Height (cm)	Female Height (cm)
AFR	ACB	-	4.2x10^-5 (Ferlay J 2020)	1.36x10^-4 (Collaborators 2019; United Nations 2019)	175.92 (Collaboration 2016)	165.28 (Collaboration 2016)
	ASW	0.25 (Bayless et al. 2017)	2.9x10^-5 (Group 2020)	-	175.5 (Fryar et al. 2018)	162.6 (Fryar et al. 2018)
	ESN	0.13 (Storhaug et al. 2017)	5.7x10^-6 (Ferlay J 2020)	3.71x10^-5 (Collaborators 2019; United Nations 2019)	165.91 (Collaboration 2016)	156.32 (Collaboration 2016)
	GWD	0.43	0 (Ferlay J 2020)	3.35x10^-5 (Collaborators 2019; United Nations 2019)	165.40 (Collaboration 2016)	160.94 (Collaboration 2016)
	LWK	0.61 (Storhaug et al. 2017)	1.4x10^-5 (Ferlay J 2020)	3.30x10^-5 (Collaborators 2019; United Nations 2019)	169.64 (Collaboration 2016)	158.16 (Collaboration 2016)
	MSL	0.52	5.6x10^-6 (Ferlay J 2020)	2.89x10^-5 (Collaborators 2019; United Nations 2019)	164.41 (Collaboration 2016)	156.60 (Collaboration 2016)
	YRI	0.13 (Storhaug et al. 2017)	5.7x10^-6 (Ferlay J 2020)	3.71x10^-5 (Collaborators 2019; United Nations 2019)	165.91 (Collaboration 2016)	156.32 (Collaboration 2016)
AMR	CLM	0.2 (Storhaug et al. 2017)	1.08x10^-4 (Ferlay J 2020)	5.53x10^-5 (Collaborators 2019; United Nations 2019)	169.50 (Collaboration 2016)	156.85 (Collaboration 2016)
	MXL	0.52 (Storhaug et al. 2017)	6.9x10^-5 (Ferlay J 2020)	1.08x10^-4 (Collaborators 2019; United Nations 2019)	169.00 (Collaboration 2016)	156.85 (Collaboration 2016)
	PEL	0.06 (Bayless et al. 2017)	8.3x10^-5 (Ferlay J 2020)	6.98x10^-5 (Collaborators 2019; United Nations 2019)	165.23 (Collaboration 2016)	152.93 (Collaboration 2016)
	PUR	-	1.11x10^-4 (Ferlay J 2020)	1.9x10^-4 (Collaborators 2019; United Nations 2019)	172.08 (Collaboration 2016)	159.20 (Collaboration 2016)
EAS	CDX	0.15 (Storhaug et al. 2017)	1.5x10^-5 (Ferlay J 2020)	7.30x10^-5 (Collaborators 2019; United Nations 2019)	171.83 (Collaboration 2016)	159.71 (Collaboration 2016)
	CHB	0.15 (Storhaug et al. 2017)	1.5x10^-5 (Ferlay J 2020)	7.30x10^-5 (Collaborators 2019; United Nations 2019)	171.83 (Collaboration 2016)	159.71 (Collaboration 2016)
	CHS	0.15 (Storhaug et al. 2017)	1.5x10^-5 (Ferlay J 2020)	7.30x10^-5 (Collaborators 2019; United Nations 2019)	171.83 (Collaboration 2016)	159.71 (Collaboration 2016)
	JPT	0.27 (Storhaug et al. 2017)	5.0x10^-5 (Ferlay J 2020)	3.62x10^-4 (Collaborators 2019; United Nations 2019)	170.82 (Collaboration 2016)	158.31 (Collaboration 2016)
	KHV	0.02 (Storhaug et al. 2017)	4.3x10^-6 (Ferlay J 2020)	4.41x10^-5 (Collaborators 2019; United Nations 2019)	164.45 (Collaboration 2016)	153.59 (Collaboration 2016)
EUR	CEU	0.87	1.3x10^-3 (Group 2020)	-	177.4 (Fryar et al. 2018)	163.3 (Fryar et al. 2018)
	FIN	0.81 (Storhaug et al. 2017)	1.04x10^-3 (Ferlay J 2020)	1.49x10^-3 (Collaborators 2019; United Nations 2019)	179.59 (Collaboration 2016)	165.90 (Collaboration 2016)
	GBR	0.92 (Storhaug et al. 2017)	9.39x10^-4 (Ferlay J 2020)	1.61x10^-3 (Collaborators 2019; United Nations 2019)	177.49 (Collaboration 2016)	164.40 (Collaboration 2016)
	IBS	0.71 (Storhaug et al. 2017)	3.92x10^-4 (Ferlay J 2020)	9.41x10^-4 (Collaborators 2019; United Nations 2019)	176.59 (Collaboration 2016)	163.40 (Collaboration 2016)
	TSI	0.28 (Storhaug et al. 2017)	7.12x10^-4 (Ferlay J 2020)	1.19x10^-3 (Collaborators 2019; United Nations 2019)	177.77 (Collaboration 2016)	164.61 (Collaboration 2016)
SAS	BEB	0.175	6.5x10^-6 (Ferlay J 2020)	1.42x10^-4 (Collaborators 2019; United Nations 2019)	163.81 (Collaboration 2016)	150.79 (Collaboration 2016)
	GIH	0.39 (Storhaug et al. 2017)	5.4x10^-6 (Ferlay J 2020)	1.54x10^-4 (Collaborators 2019; United Nations 2019)	164.95 (Collaboration 2016)	152.59 (Collaboration 2016)
	ITU	0.39 (Storhaug et al. 2017)	5.4x10^-6 (Ferlay J 2020)	1.54x10^-4 (Collaborators 2019; United Nations 2019)	164.95 (Collaboration 2016)	152.59 (Collaboration 2016)
	PJL	0.42 (Storhaug et al. 2017)	4.6x10^-6 (Ferlay J 2020)	1.46x10^-4 (Collaborators 2019; United Nations 2019)	166.95 (Collaboration 2016)	153.84 (Collaboration 2016)
	STU	0.27 (Storhaug et al. 2017)	1.4x10^-5 (Ferlay J 2020)	3.35x10^-5 (Collaborators 2019; United Nations 2019)	165.69 (Collaboration 2016)	154.56 (Collaboration 2016)

Table 2. 1000 Genomes populations.
Super population	Population Code	Population Description
East Asian	CHB	Han Chinese in Beijing, China
	JPT	Japanese in Tokyo, Japan
	CHS	Southern Han Chinese
	CDX	Chinese Dai in Xishuangbanna, China
	KHV	Kinh in Ho Chi Minh City, Vietnam
European	CEU	Utah Residents (CEPH) with Northern and Western European Ancestry
	TSI	Toscani in Italia
	FIN	Finnish in Finland
	GBR	British in England and Scotland
	IBS	Iberian Population in Spain
African	YRI	Yoruba in Ibadan, Nigeria
	LWK	Luhya in Webuye, Kenya
	GWD	Gambian in Western Divisions in the Gambia
	MSL	Mende in Sierra Leone
	ESN	Esan in Nigeria
	ASW	Americans of African Ancestry in SW USA
	ACB	African Caribbean in Barbados
Admixed American	MXL	Mexican Ancestry from Los Angeles USA
	PUR	Puerto Ricans from Puerto Rico
	CLM	Colombians from Medellin, Colombia
	PEL	Peruvians from Lima, Peru
South Asian	GIH	Gujarati Indian from Houston, Texas
	PJL	Punjabi from Lahore, Pakistan
	BEB	Bengali from Bangladesh
	STU	Sri Lankan Tamil from the UK
	ITU	Indian Telugu from the UK

Table 3. Sensitivity analysis.
Phenotype	GWAS SNPs	1000 Genomes SNPs	Reduced SNPs
Lactase Persistence	11	NA	4
Multiple Sclerosis	372	368	131
Melanoma	39	37	16
Height male	4388	4209	547
Height female	4388	4209	188
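
The SNP reduction summarized in Table 3 can be pictured as a backward elimination that keeps dropping SNPs while the fit improves. The sketch below is one plausible reading, not the documented procedure: it assumes an unweighted additive score and a greedy rule that removes, at each step, the SNP whose removal raises r² the most. All frequencies and trait values are made up.

```python
# Toy illustration only: a greedy backward elimination under assumed rules.
# The data are hypothetical and the removal criterion is an assumption.

def r_squared(x, y):
    """Squared Pearson correlation between score x and trait y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def score(freq_rows, keep):
    """Unweighted additive score per population using only the SNPs in keep."""
    return [sum(row[j] for j in keep) for row in freq_rows]

def greedy_reduce(freq_rows, trait):
    """Drop one SNP at a time while doing so strictly improves r^2."""
    keep = list(range(len(freq_rows[0])))
    best = r_squared(score(freq_rows, keep), trait)
    removed = []
    while len(keep) > 1:
        # Evaluate the fit obtained by leaving out each remaining SNP in turn.
        trials = [
            (r_squared(score(freq_rows, [k for k in keep if k != j]), trait), j)
            for j in keep
        ]
        trial_r2, drop = max(trials)
        if trial_r2 <= best:
            break  # no single removal helps any further
        best = trial_r2
        keep.remove(drop)
        removed.append(drop)
    return keep, removed, best

# Rows are populations, columns are SNPs (hypothetical risk-allele frequencies).
freq_rows = [
    [0.1, 0.2, 0.9],
    [0.3, 0.3, 0.2],
    [0.6, 0.5, 0.7],
    [0.8, 0.7, 0.1],
]
trait = [0.1, 0.3, 0.6, 0.8]

full_r2 = r_squared(score(freq_rows, [0, 1, 2]), trait)
keep, removed, best_r2 = greedy_reduce(freq_rows, trait)
print(f"full r^2 = {full_r2:.3f}, reduced r^2 = {best_r2:.3f}, removed = {removed}")
```

A greedy search of this kind selects SNPs against the same trait values it is later scored on, so the improved r² is optimistic unless it is checked on independent data.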

Table 4. P-value enrichment analysis.
Phenotype	Melanoma		Multiple Sclerosis		Male Height		Female Height
Threshold	unfiltered	filtered	unfiltered	filtered	unfiltered	filtered	unfiltered	filtered
GWAS significant¹	27	12	199	71	3552	478	3552	188
GWAS not significant²	10	4	169	60	656	69	656	0
p-value	1		1		0.07644		6.73E-14

¹ P ≤ 5 x 10^-8

² 5 x 10^-8 < P < 1 x 10^-5
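
Each phenotype in Table 4 forms a 2x2 table of counts (GWAS significant or not, by unfiltered or filtered), and one common way to test such a table is Fisher's exact test. The sketch below assumes that test and that layout; the test the authors actually used is not named in this text. The helper `fisher_exact_two_sided` is a hypothetical stdlib-only implementation; the counts are copied from Table 4.

```python
from math import exp, lgamma

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_p(x):
        # Log-probability that the top-left cell equals x given fixed margins.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(n, col1)

    observed = log_p(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    log_probs = [log_p(x) for x in range(lo, hi + 1)]
    # Small tolerance so tables tied with the observed one are included.
    total = sum(exp(lp) for lp in log_probs if lp <= observed + 1e-7)
    return min(1.0, total)

# (significant unfiltered, significant filtered,
#  not significant unfiltered, not significant filtered), from Table 4.
counts = {
    "Melanoma": (27, 12, 10, 4),
    "Multiple Sclerosis": (199, 71, 169, 60),
    "Male Height": (3552, 478, 656, 69),
    "Female Height": (3552, 188, 656, 0),
}

p_values = {name: fisher_exact_two_sided(*cells) for name, cells in counts.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3g}")
```

If the filtered SNPs are a subset of the unfiltered ones, the two columns are not independent samples, so a result from this layout is descriptive at best.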

SupFig1.pdf
Figure S1. Lactase persistence separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Full model by super population: AFR (r² = 0.0021, p-value: 0.9314), AMR (r² = 0.2608, p-value: 0.659), EAS (r² = 0.077, p-value: 0.6514), EUR (r² = 0.9734, p-value: 0.00185) and SAS (r² = 0.3847, p-value: 0.2643). B) Super populations after maximization: AFR (r² = 0.0177, p-value: 0.8017), AMR (r² = 0.2284, p-value: 0.683), EAS (no data), EUR (r² = 0.9747, p-value: 0.00172) and SAS (r² = 0.3914, p-value: 0.0580).
Supfig2.pdf
Figure S2. Melanoma separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Full model by super population: AFR (r² = 0.1178, p-value: 0.5718), AMR (r² = 0.6664, p-value: 0.1837), EAS (r² = 0.4958, p-value: 0.1844), EUR (r² = 0.0421, p-value: 0.7949) and SAS (r² = 0.5914, p-value: 0.1285). B) Super populations after maximization: AFR (r² = 0.1767, p-value: 0.481), AMR (r² = 0.7766, p-value: 0.1187), EAS (r² = 0.2399, p-value: 0.4022), EUR (r² = 0.6268, p-value: 0.2083) and SAS (r² = 0.4324, p-value: 0.2278).
Supfig3.pdf
Figure S3. Multiple sclerosis separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Full model by super population: AFR (r² = 0.4336, p-value: 0.155), AMR (r² = 0.1958, p-value: 0.5575), EAS (r² = 0.0459, p-value: 0.7293), EUR (r² = 0.3676, p-value: 0.3937) and SAS (r² = 0.3775, p-value: 0.270). B) Super populations after maximization: AFR (r² = 0.7781, p-value: 0.02003), AMR (r² = 0.9821, p-value: 0.008995), EAS (r² = 0.8775, p-value: 0.0189), EUR (r² = 0.9988, p-value: 0.000617) and SAS (r² = 0.2356, p-value: 0.407).
Supfig4.pdf
Figure S4. Male height separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Super populations with full model: AFR (r² = 0.7835, p-value: 0.00806), AMR (r² = 0.8628, p-value: 0.0522), EAS (r² = 0.1003, p-value: 0.9254), EUR (r² = 0.578, p-value: 0.551) and SAS (r² = 0.1162, p-value: 0.2812). B) Super populations after maximization: AFR (r² = 0.5534, p-value: 6.549 x 10^-7), AMR (r² = 0.8556, p-value: 0.001876), EAS (r² = 0.0163, p-value: 2.052 x 10^-5), EUR (r² = 0.9158, p-value: 0.01064) and SAS (r² = 0.0475, p-value: 0.002888).
Supfig5.pdf
Figure S5. Female height separated by super population. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Super populations with full model: AFR (r² = 0.5534, p-value: 0.05523), AMR (r² = 0.8556, p-value: 0.07501), EAS (r² = 0.0163, p-value: 0.8379), EUR (r² = 0.2222, p-value: 0.4229) and SAS (r² = 0.0475, p-value: 0.7246). B) Super populations after maximization: AFR (r² = 0.9533, p-value: 0.0001627), AMR (r² = 0.9917, p-value: 0.004174), EAS (r² = 0.963, p-value: 0.003058), EUR (r² = 0.754, p-value: 0.05619) and SAS (r² = 0.9761, p-value: 0.001584).
SupplimentaryTables.xlsx
Table S1. Lactase persistence full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S2. Melanoma full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S3. Multiple sclerosis full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S4. Height full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S5. PRS values for each population, before and after maximization.
Table S6. Female height r² values from the maximization analysis.
Table S7. Lactase persistence filtered data set.
Table S8. Melanoma filtered data set.
Table S9. Multiple sclerosis filtered data set.
Table S10. Male height filtered data set.
Table S11. Female height filtered data set.
suppfigs610.pdf
Figure S6. Lactase persistence maximization analysis r² values.
Figure S7. Melanoma maximization analysis r² values.
Figure S8. Multiple sclerosis maximization analysis r² values.
Figure S9. Male height maximization analysis r² values.
Figure S10. Female height maximization analysis r² values.
