Diverse Functions Associate With Non-Coding Trans-species Polymorphisms in Humans

14 Background: Long-term balancing selection (LTBS) can maintain allelic variation at a locus over 15 millions of years and through speciation events. Variants shared between species, hereafter “trans - 16 species polymorphisms” (TSPs), often result from LTBS due to host -pathogen interactions. For 17 instance, the major histocompatibility complex (MHC) locus contains TSPs present across 18 primates. Several hundred candidate TSPs have been identified in humans and chimpanzees; 19 however, because many are in non-coding regions of the genome, the functions and adaptive roles 20 for most TSPs remain unknown. 21 Results: We integrated diverse genomic annotations, with a focus on non-coding regions, to 22 explore the functions of 125 previously identified regions containing multiple TSPs in humans and 23 chimpanzees. We analyzed genome-wide functional assays, expression quantitative trait loci 24 (eQTL), genome-wide association studies (GWAS), and phenome-wide association studies 25 (PheWAS). We identify functional annotations for 119 TSP regions, including 71 with evidence 26 of gene regulatory function from GTEx or genome-wide functional genomics data and 21 with 27 evidence of trait association from GWAS and PheWAS. TSPs in humans associate with many 28 immune system phenotypes, including response to pathogens, but we also find associations with a 29 range of other phenotypes, including body mass, alcohol intake, urate levels, chronotype, and risk- 30 taking behavior. Conclusions: The diversity of traits associated with non-coding human TSPs further support 32 previous hypotheses that functions beyond the immune system are subject to LTBS. Furthermore, 33 several of these trait associations provide support and candidate genetic loci for previous 34 hypothesis about behavioral diversity in great ape populations, such as the importance of variation 35 in sleep cycles and risk sensitivity.


Results:
We integrated diverse genomic annotations, with a focus on non-coding regions, to 22 explore the functions of 125 previously identified regions containing multiple TSPs in humans and 23 chimpanzees. We analyzed genome-wide functional assays, expression quantitative trait loci 24 (eQTL), genome-wide association studies (GWAS), and phenome-wide association studies 25 (PheWAS). We identify functional annotations for 119 TSP regions, including 71 with evidence 26 of gene regulatory function from GTEx or genome-wide functional genomics data and 21 with 27 evidence of trait association from GWAS and PheWAS. TSPs in humans associate with many 28 immune system phenotypes, including response to pathogens, but we also find associations with a 29 range of other phenotypes, including body mass, alcohol intake, urate levels, chronotype, and risk-30 taking behavior. 31 Conclusions: The diversity of traits associated with non-coding human TSPs further support 32 previous hypotheses that functions beyond the immune system are subject to LTBS. Furthermore, 33 several of these trait associations provide support and candidate genetic loci for previous 34 hypothesis about behavioral diversity in great ape populations, such as the importance of variation 35 in sleep cycles and risk sensitivity. intermediate frequency alleles suggestive of balancing selection. They also recently updated the 106 ß statistic to consider both polymorphism and substitution data (Siewert & Voight, 2020). 107 Among the highest scoring windows in these two analyses, they highlighted three genes 108 (CADM2, WFS1, and ACSBG2) with functions outside the immune system. 109 Trans-species polymorphisms, especially when more than one falls on a haplotype, 110 suggest the action of LTBS. Leffler  Determining the functional roles of these polymorphisms in human adaptation and health will 120 deepen our understanding of the dynamics of balancing and positive selection and their roles in 121 adaptation to new environments. We identify potential functions of the TSPs in humans by 122 applying several genome-wide functional annotations and association tests. Our results identify 123 diverse functions, including effects unrelated to the immune system, that may underlie LTBS on 124 the human and chimpanzee lineages. 125 126

Results 127
Human-chimpanzee TSPs 128 We consider 125 human genomic regions containing multiple TSPs segregating in both humans 129 and chimpanzees in close proximity and high LD from Leffler et al. (2013). This set is composed 130 of 263 variants with strong evidence of identity-by-descent; i.e., the divergence between the 131 ancestral and derived alleles is deeper than the human-chimp speciation event. These TSPs were 132 identified based on the observation from coalescent theory (Ségurel et al., 2012) that pairs of 133 TSPs within 4 kb in the human genome are extremely unlikely to result from neutral processes, 134 and thus are strong candidates for LTBS ( Figure 1). Hereafter, we refer to these as "TSP 135 regions". While these criteria do not guarantee that all the TSP regions are the result of LTBS 136 (Gao et al., 2015), this set is likely strongly enriched for LTBS. We also quantify the evidence 137 for balancing selection on these loci provided by BetaScan2 and NCD, two additional recent 138 genome-wide scans ( selection. In addition to the 263 TSPs, we also considered functional associations with 9,996 159 variants in high LD (r 2 > 0.8) with a TSP at least one population from the 1000 Genomes Project 160 (Supplementary Figure 1).

162
Trans-species polymorphisms overlap diverse functional annotations 163 We intersected the TSPs with diverse lines of functional evidence from large-scale genomic 164 studies, including genome-wide functional genomics assays, eQTL, GWAS, and PheWAS. We 165 found at least one functional annotation for 95% of the TSP regions (119 out of 125) covering 166 130 TSPs and 4,807 LD SNPs ( Figure 2). Here, we provide an overview of the overlap with 167 these annotations. In future sections, we provide details about each of these annotations. We hypothesized that many of the non-coding TSPs in our set perform gene regulatory 188 functions. To evaluate this possibility, we intersected the TSPs and variants in high LD with 189 maps of functional regulatory regions from the Ensembl regulatory build (Zerbino et al., 2015). 190 We found 58 TSPs with regulatory annotations in 40 TSP regions, and additionally 1334 LD 191 promoter flanking regions, enhancers, promoters, and known TF binding sites (Supplementary  193  Table 1). 194 Overlap of a variant with a regulatory annotation does not necessarily imply a regulatory 195 function. To consider additional evidence of regulatory function, we examined eQTL in GTEx 196 from 50 tissues for overlap with TSPs. At least one eQTL was found for 51 of the TSP regions 197 (40%). Among these 51 regions, 29 TSPs are themselves eQTL; for the remainder, variants in 198 high LD with TSPs were eQTL. The eQTL were found across diverse tissues ( Figure 3A), and 199 there was no enrichment for specific gene ontology (GO) terms among the set of genes 200 influenced by TSP eQTL. This suggests that the targets of balancing selection have functions in 201 gene regulation across diverse tissues beyond the immune system. 202 Next, we tested if the TSP regions are enriched for overlap with any specific types of 203 regulatory regions. We compared the observed overlap between TSP regions and each type of 204 regulatory annotation to the distribution of overlaps expected if TSP regions were randomly 205 distributed across the genome. We shuffled the TSP regions 1,000 times maintaining their length 206 and chromosome distributions and avoiding genome assembly gaps and ENCODE blacklist 207 regions and counted the number of overlaps observed with regulatory elements for each random 208 permutation. The TSPs overlap slightly fewer base pairs annotated with regulatory functions than 209 expected by chance, with significant (P < 0.05) depletion for promoter and CTCF sites ( Figure  210 3B). Since variants in these regions are likely to influence gene regulation in many tissues (e.g., 211 compared to enhancers which are often context-specific), this suggests that individual TSPs may 212 be less pleiotropic than expected by chance.

Genome-wide association studies link TSPs to traits 224
Genome-wide association studies have identified thousands of associations between genetic 225 variants and human traits. We intersected the TSP regions with associations reported in the 226 GWAS Catalog (downloaded 2020/11), which is composed of 227,262 associations. Since TSPs 227 themselves were not always directly tested in GWAS studies, we also include genome-wide 228 significant (p < 1E-8) associations with the tag variants in high LD with TSPs. We found 229 significant associations for 29 different variants ( Figure 4; Supplementary Table 2). Two main 230 functional categories were identified in the GWAS associations for these variants: 231 immunological functions and neurological/behavioral traits. The associations with immune traits 232 were expected given the results of previous balancing selection studies and the few well-233 characterized instances of LTBS. We identified many variants in LD with TSPs as associated 234 with blood measurement phenotypes and diseases related to immune response failure 235 (Supplementary Tables 2, 3, 4). Some of these traits include sarcoidosis (chromosome 4 near 236 TSPs rs114383553 and rs17007061), ulcerative colitis and chronic inflammatory disease (chr2 237 near TSPs rs13426764/rs11694806), and Behçet's disease (chromosome 11 near TSPs rs2249268 238 and rs60427879). 239 We also found many neurological/behavioral related associations among TSP region 240 variants. These traits include cognitive performance (chromosome 2 near rs13426764 and 241 rs11694806 and chromosome 3 near rs9869178/rs2118072), chronotype (chromosome 4 near 242 rs1887944 and rs7147645), addiction (alcohol use: chromosome 16 near rs9933768 and 243 rs57790054; smoking: chromosome 2 near rs13426764 and rs11694806), risky behavior 244 (automobile speeding propensity: chromosome 3:rs9869178/rs2118072), and experiencing mood 245 swings (chromosome 2 near rs13426764 and rs11694806). In addition to the immune response 246 and neurological categories, we observed associations in reproductive traits (polycystic ovary 247 syndrome, testosterone levels), urate levels, pancreatic cancer, hepatocyte growth factor levels, 248 and gut microbiota. We discuss several of these associations in more detail in following sections. 249 250 Phenome-wide association studies link TSPs to additional diverse traits 251 The growth of biobanks with linked genetic and phenotypic data has enabled the testing of the 252 association of genetic variants with diverse traits within a single cohort. This PheWAS approach 253 enables exploration of the functional and potentially pleiotropic effects of variants of interest 254 (Bush et al., 2016). Using published associations from the UK Biobank, we analyzed the 255 association of TSPs with 778 traits; all 125 of the TSP regions were tested. Overall, 21 TSPs in 256 16 regions had at least one genome-wide significant association (P < 1E-8, Figure 4; 257 Supplementary Table 3). Though testing different phenotypes than the GWAS, these associations 258 were qualitatively similar to the GWAS results, in that blood and immune system phenotypes 259 had many associations with TSPs, but the TSPs were also associated with a more diverse set of 260 phenotypes. We found associations in three major categories: immune response traits, body and 261 physical measurements, and neuropsychiatric traits related to addiction. We also observed 262 associations with many other phenotypes, for example: walking habits, heart disorders, renal and 263 kidney problems, pleural plaques, transient ischemic attack, cancerous tumor, and rupture of 264 synovium and tendon (Supplementary Tables 3 and 4). with a wider variety of phenotypes including osseous, neurological, and nervous system traits. 274 Five extreme associations with immunological and blood traits (P < 1E-60) were truncated for 275 this visualization. Since few TSPs themselves were directly tested in GWAS, we include GWAS 276 Catalog associations with tag variants in high LD (r 2 > 0.8) with TSPs. All associations are listed 277 in Supplementary Tables 2-4.  278  279 Evidence of protein-coding function for TSPs 280 The TSPs were originally filtered by Leffler  pancreatic cancer (p=7E-10). Though elevated uric acid in the blood is associated with many 325 conditions, it is a marker for pancreatic cancer (Stotz et al., 2014). This locus also contains two 326 non-synonymous variants in the gene HNF4G, a transcription factor expressed in the liver, 327 kidney, and pancreas, in high LD with the TSPs (rs1805098 and rs2943549). The TSPs are also 328 expression and splicing QTL for HNF4G (P = 1.9E-4 and 8.6E-6, respectively). Variants in 329 (threshold p ≤ 1E-08), regulatory annotations from Ensembl, and eQTLs from GTEx. One of the 337 TSPs (rs57790054, orange) is associated with alcohol intake and several growth and body mass 338 phenotypes in the UK Biobank. A variant in high LD (rs72771074, green) has been associated 339 with alcohol use disorder in a previous GWAS. The TSP is also strongly associated with growth 340 (comparative body size at age 10, 9.6e-21) and body mass index (3.5e-12). The TSPs are nearby 341 GPR139, a gene encoding a G-protein coupled receptor expressed in the brain, whose expression 342 levels influence alcohol drinking behavior in rats. The TSP region also contains several CTCF 343 binding sites. TSP are shown in bold text. 344 (B) TSPs in 8q21.11 are associated with urate levels. Regional association plot showing 345 statistically significant genome-and phenome-wide associations (P ≤ 1E-08), eQTL, regulatory 346 and coding SNPs. LD SNPs in this region are associated with urate levels and pancreatic cancer. 347 A TSP (rs1839333) is also associated with gout, although the p-value did not meet our strict 348 threshold. TSP are shown in bold text. chromatin (rs10662845 and rs6766439), promoter (rs1898263, rs1992094, rs4431106, 374 rs7631704, rs7651567, and rs7653431), and CTCF binding sites (rs7628282, rs7650239, 375 rs7650332, rs9840519, rs9840971, rs9878070, and rs9840157). Furthermore, the TSPs (and 159  376 high LD variants) are significant eQTLs (P ≤ 1E-5) for the gene DIPK2A (C3orf58) across four 377 GTEx tissues (small intestine terminal ileum, transformed fibroblasts, skin from the lower leg, 378 and suprapubic skin). DIPK2A has not been comprehensively functionally characterized, but it 379 contains a protein kinase domain and is broadly expressed, including in the developing and adult 380 brain. Deletion of this gene has been linked to autism, and its expression is responsive to 381 neuronal activity (Morrow et al., 2008). Associations with behavioral and cognitive traits must 382 be interpreted with caution as these traits are very challenging to quantify and strongly 383 influenced by social factors that may vary with other characteristics. Nonetheless, these 384 associations point to an influence of the TSPs on behaviors relevant to risk tolerance. Thus, it is 385 possible that maintaining a diversity of risk tolerance in human and chimpanzee populations has 386 been beneficial. 387 388

389
In this study we aimed to characterize the function of genomic regions with two or more TSPs in 390 LD and close proximity. Our results also raise the intriguing possibility that variants that modulate urate levels 415 have been under LTBS. Uricase, the enzyme that metabolizes uric acid into an easily excreted 416 water-soluble form in most mammals, has been lost in great apes. This gene was disabled by a 417 series of mutations that slowly decreased activity over primate evolution, increasing the levels of 418 uric acid in blood (Kratzer et al., 2014). It has been hypothesized that this loss of uricase activity 419 was driven by increase fructose in primate diets due to fruit eating (Johnson et al., 2009;Kratzer 420 et al., 2014). It has also been proposed that high levels of uric acid, a potent antioxidant, played 421 an important role in the evolution of intelligence, acting as antioxidant in the brain (Álvarez-422 Lario & Macarrón-Vicente, 2010). However, as reflected in the associations with this locus, 423 elevated uric acid levels contribute to many common diseases in modern humans, including 424 chronic hypertension, cardiovascular disease, kidney and liver diseases, metabolic syndrome, 425 diabetes, and obesity (Gustafsson & Unwin, 2013). This suggests potential functional tradeoffs at 426 this locus. However, we emphasize that proving the environmental drivers of past selection is 427 challenging. 428 Some of the phenotype associations we discovered may reflect manifestations of 429 variation on traits in modern environments that could not be long-term drivers of balancing 430 selection. As an extreme example, influence on smoking behavior could not have been the cause 431 of LTBS given the relatively recent wide availability of nicotine. Though we note that there is 432 some evidence of ethanol consumption in chimpanzees (Hockings et al., 2015).  This could potentially introduce false positives if the variant also tags a different causal non-TSP  457  variant. Given the long-term selection on TSPs, they are strong candidates for casual variants,  458 but functional studies are needed to confirm these statistical associations. Finally, our analyses 459 have focused on the human context; due to lack of functional data, it is not possible to explore 460 the function of TSPs in chimpanzees. Nonetheless, we feel that our integration of genome-scale 461 annotations and biobank data highlight the diversity of functions associated with LTBS. 462 463

464
In conclusion, we assign putative functions for TSP regions that likely persisted due to balancing 465 selection dating back to at least the common ancestor of humans and chimpanzees. These 466 annotations expand beyond immune functions to traits relevant to behavior, cognition, and body 467 shape. Notably, we also find that most regions with multiple TSPs overlap gene regulatory 468 annotations suggesting LTBS on gene expression levels. As methods improve for quantifying the 469 effects of variants on gene regulation in different tissues and how these relate to organism-level 470 phenotypes, we anticipate deeper mechanistic understanding of the functions and potential 471 evolutionary pressures on these regions. 472 473 474

Methods 475
Trans-species polymorphisms 476 The initial set of 125 regions containing 263 TSPs analyzed in this study was published by 477 Leffler et al. (2013). The set is composed of regions that: 1) contain at least two trans-species 478 polymorphisms-i.e., variants that are segregating in both 51 Yoruba individuals in the 1000 479 Genomes Pilot 1 and 10 chimpanzees from the PanMap project-within 4 kb of each other in 480 both species; and 2) are in high LD in humans and chimpanzees. 481 To increase our ability to identify annotations in each locus, the dataset was expanded to 482 include variants in high LD (threshold R 2 =0.8) with each of the TSPs as is common in 483 association studies. Genome-and Phenome-wide associations 503 The GWAS Catalog collects variant-trait associations from published genome-wide association 504 studies. The database is currently composed of more than 200,000 associations. We used the 505 GWAS Catalog version October 2020 to find functional associations for the LTBS variants. The 506 search was done using the BEDTools intersect function between the GWAS catalog and the LD-507 expanded TSP dataset (Quinlan, 2014). 508 PheWAS is an analysis strategy built on top of medical records with information about 509 patient phenotypes and associated variants. The geneAtlas catalog 510 (http://geneatlas.roslin.ed.ac.uk/phewas/) takes advantage of the data provided by the UK 511 Biobank cohort, which contains medically relevant data from nearly 500,000 British individuals 512 of European ancestry. This database contains 3 million variants in 778 traits. We matched our set 513 of variants against the geneAtlas database to search for traits associated with LTBS. 514 515 GTEx eQTL data 516 To evaluate potential gene regulatory effects of TSPs in non-coding regions, we analyzed data 517 from GTEx, a project developed to quantify the consequence of genetic variation on expression 518 at the tissue level (https://www.gtexportal.org/home/). The GTEx project v8 data have identified 519 eQTL across 50 tissues based on analyses of nearly 1,000 individuals to identify differential 520 expression through SNP variation. The intersection between the TSPs and LD SNPs and the 521 GTEx eQTL returned a large collection of TSP eQTL. 522 523 Other functional categories 524 We       Genome-and phenome-wide association studies link TSPs to diverse traits. Genome-wide signi cant (P < 1E-8) associations from the GWAS Catalog (yellow), a PheWAS over the UK Biobank from the geneAtlas (purple), and other studies summarized in the GWAS Atlas (green) (Watanabe et al., 2019). Each dot represents an association between a TSP region and a trait. Many immune-related traits are associated with TSPs, but there are also associations with a wider variety of phenotypes including osseous, neurological, and nervous system traits. Five extreme associations with immunological and blood traits (P < 1E-60) were truncated for this visualization. Since few TSPs themselves were directly tested in GWAS, we include GWAS Catalog associations with tag variants in high LD (r2 > 0.8) with TSPs. All associations are listed in Supplementary Tables 2-4. Illustrative examples of non-immune functions associated with TSPs. (A) A TSP in 16p12.3 is associated with body mass index (BMI) and alcohol intake. Regional association plot showing statistically signi cant genome-and phenome-wide associations (threshold p ≤ 1E-08), regulatory annotations from Ensembl, and eQTLs from GTEx. One of the TSPs (rs57790054, orange) is associated with alcohol intake and several growth and body mass phenotypes in the UK Biobank. A variant in high LD (rs72771074, green) has been associated with alcohol use disorder in a previous GWAS. The TSP is also strongly associated with growth (comparative body size at age 10, 9.6e-21) and body mass index (3.5e-12). The TSPs are nearby GPR139, a gene encoding a G-protein coupled receptor expressed in the brain, whose expression levels in uence alcohol drinking behavior in rats. The TSP region also contains several CTCF binding sites. TSP are shown in bold text. (B) TSPs in 8q21.11 are associated with urate levels. Regional association plot showing statistically signi cant genome-and phenome-wide associations (P ≤ 1E-08), eQTL, regulatory and coding SNPs. LD SNPs in this region are associated with urate levels and pancreatic cancer. A TSP (rs1839333) is also associated with gout, although the p-value did not meet our strict threshold. TSP are shown in bold text. Figures created using LocusZoom (Pruim et al., 2011). (C) TSP locus on 3q24. Regional association plot showing statistically signi cant genome-and phenome-wide associations (threshold p ≤ E-08), regulatory and eQTLs. This locus is characterized by neurological traits involved in educational attainment, cognitive performance, and risky behavior (automobile speeding propensity).

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. Supplement.pdf supplementarytables.xlsx