Linkage disequilibrium analysis of global populations conrms presence of regulatory SNP rs11615992 of human P2RX7 gene and uncovers rs61083578 as potential alternative in xed allele populations

The P2RX7 gene is ubiquitously expressed throughout the human body and encodes for ligand-gated ion channels. P2RX7 also contains the non-synonymous single nucleotide polymorphism (SNP) rs3751143 which has been linked to decreased ion channel function and diseases such as prostate cancer, tuberculosis, bipolar disorder, inammation and Parkinson's disease amongst others. While there have been many studies on this gene, few dive into the allele-specic expression (ASE) of P2RX7. Previous research has shown over-expression of the rs3751143 wild-type A allele over the C allele in lung tissues of heterozygous individuals correlated with the presence of downstream SNP variant rs11615992. Linkage disequilibrium was demonstrated between these two SNPs in populations in China. This article aims to expand current knowledge for which SNPs near the P2RX7 gene are in linkage disequilibrium with rs3751143; expanding upon ndings of Peng et al (2019). By using various bioinformatic tools on variant data gleaned from the 1000 Genomes Project, this analysis conrms that the downstream SNP rs11615992 is in strong LD with rs3751143 in populations across the world, outside of the initially studied Chinese groups. Furthermore, in the case of African populations, where rs11615992 seems xed, rs61083578 was identied as a potential downstream regulatory element to P2RX7 allele-specic expression. The goal of this study is to contribute to the understanding of P2RX7 gene expression on a global scale with respect to regulation by SNPs surrounding the P2RX7 gene itself, strengthening the relevance of the research performed by Peng et al (2019).


Introduction
Purinergic receptor P2RX7 (P2RX7, NCBI Gene ID: 5027) is a protein coding gene that produces an inotropic class of ligand-gated ion channels and is expressed in many tissues throughout the human body; though mostly expressed in brain and skin cells (NCBI 2020). The protein encoded from P2RX7 differs from other P2X receptor genes, as well as other ligand-gated channels, since it can transport molecules up to 900 Da (Benzaquen et al 2019). The P2RX7 protein has been shown to be associated with many biological processes, and is a potential target for anti-in ammatory therapy (Peng et al 2019).
As P2RX7 is expressed in many cell lines, including immune and non-immune cells, and is involved with genetic expression signaling, macrophage immune response, and membrane voltage potential of both the cell plasma membrane and the nuclear membrane (Fuller et al 2009). Numerous splicing variants of this gene have been identi ed, many of which seem to be involved with nonsense-mediated decay (NCBI 2020). Furthermore, single nucleotide polymorphisms in P2RX7 have been shown to play a role in a myriad of diseases, ranging from mood disorders, bipolar disorder, and unipolar depression ( Previous research has been performed on many different SNPs within P2RX7. One of the most highly researched SNPs is rs3751143, which was also associated with multiple phenotypes including IL-8 level in serum, a chronic obstructive pulmonary disease biomarker, Toxoplasma gondii killing, and mineral density (Peng et al 2019). ASE results from the lung tissue of 48 donors in China found the rs3751143 wild-type A allele to be overexpressed over the mutant C allele in heterozygous individuals. The expression ratio (C/A) had a mean ± SD of 0.84 ± 0.16 with 95% con dence interval 0.76-0.92; P = 0.001 (Peng et al 2019). This presented the possibility that SNPs in linkage disequilibrium with rs3751143 play a role in the regulation of the expression of P2RX7. Peng  SNPs of interest (rs3751143 and rs11615992) can be found in Table 1. The Gambian in Western Division (GWD) population of Africa was intentionally selected due to there being no presence of the rs11615992 minor allele. This population would serve as a control group as no LD with rs3751143 and rs11615992 would be possible. Variant data was ltered down to a 100 kb window surrounding rs3751143 (50 kb upstream and downstream) to a data set of 3,240 SNPs. This window size was selected as it is just under double the size of the P2RX7 gene (~ 54 kb). From there, sample lists for each population of study were downloaded as TSV les. These two data sets were merged using python code to generate input les for ldSelect (Nickerson 2004) or Genome Variation Server (http://gvs.gs.washington.edu/GVS150/). Additionally, in the generation of input les, data on the pairing of SNP alleles was also summarized for later queries. Table 2 presents an overview of the number of samples provided by the 1000 Genomes Project for each population of study and the total number of samples found within the chromosome 12 phase 3 variant data set.

Regulation analysis
In order to computationally determine the potential regulatory effects of SNPs, NONCODE (Zhao 2016) was utilized to search for SNP proximity to non-coding RNA. Additionally, the MATCH web-app (for querying the TRANSFAC database (Matys 2006)) was used to search for nearby transcription factor binding sites.

Results
Linkage Disequilibrium pattern near P2RX7 gene in global populations 1000 Genomes Project data for the region surrounding P2RX7 in the populations described above were analyzed to identify SNP(s) in LD with rs3751143. For reference, Peng et al (2019) reported rs11615992 as the only SNP with strong LD to rs3751143 in their two populations of study (r 2 = 0.900 in CHB and r 2 = 0.835 in CHS). In ltering for SNPs with strong LD to rs3751143, a minimum r 2 of 0.800 was used as a score threshold. LD analysis utilizing the Genome Variation Server con rmed that rs11615992 is the only SNP with strong LD to rs3751143 in all populations, excluding the control (GWD) and Puerto Rican in Puerto Rico (PUR). In PUR, no SNPs were in signi cantly strong LD with rs3751143 as the r 2 of rs3751143 being in LD with rs11615992 fell below the signi cance threshold to a value of 0.713. In the GWD control population, LD between rs3751143 and rs11615992 was impossible due to the lack of individuals surveyed for variants displaying the rs11615992 minor allele. Nevertheless, in analysis of the GWD data set, SNP rs61083578 was seen to be in strong LD with rs3751143 (r 2 = 0.807). Table 3 captures r 2 linkage disequilibrium scores between rs3751143 and rs11615992 as well as between rs3751143 and rs61083578. Cells containing N.I. (Not Identi ed) report that the allele pair did not appear in the results of linkage disequilibrium analysis. Two results of note are the scores of the two populations described in the Peng et al (2019) article. This analysis was able to recover the same r 2 values that were previously reported.

Allele pair types in strong linkage disequilibrium
In an effort to better understand the nature of rs3751143 LD with rs11615992, the counts of each variant pair were summed to determine which alleles for each SNP appeared in pairs. Peng et al (2019) observed that "the wild-type A allele of rs3751143 was in strong LD with A of rs11615992, while the mutant one C of rs3751143 was in LD with G of rs11615992". This pattern remained consistent across all 6 populations that displayed both alleles for each SNP as shown in Table 4 which shows the count of per strand genotypes for allelic variant pairs of rs3751143 and rs11615992 across all populations of study. The GWD population variant data, being absent of rs11615992 minor allele, did not present LD between that SNP and rs3751143. However, as noted above, this population revealed that rs3751143 was in strong LD with rs61083578. The counts of these two allele variant pairs closely resembles those found in the other data sets as shown in Table 5 which captures the count of per strand genotypes for allelic variant pairs of rs3751143 and rs61083578 across all populations where the two were in strong LD. TSI and PUR populations both had a single case where mutant rs3751143 was paired with mutant rs61083578 but no LD was found. All other populations only showed wild-type pairings. Potential regulatory effects of rs61083578 The region surrounding rs61083578 was surveyed for potential regulatory elements that might contextualize the relevance of Peng et al's (2019) ASE ndings (that the A allele of rs3751143 is overexpressed relative to the C) to the GWD population. NONCODE data viewed in the UCSC genome browser placed rs61083578 inside lnc-RNA (ID: NONHSAT232452.1). Furthermore, the TRANSFAC database reported a potential binding site for CCAAT Enhancer Binding Protein alpha in close proximity to rs61083578.

Discussion
In their original article, Peng et al (2019) identify the novel regulatory SNP rs11615992 as a driving force behind the observed allele-speci c expression of the clinically signi cant P2RX7 gene. The goal of this analysis was to study additional populations, from different geographical regions, in an attempt to uncover the same linkage disequilibrium between rs3751143 and rs11615992; a key characteristic of the cis-regulatory capability of rs11615992. In doing so, the two populations of focus in the Peng et al (2019) article were reanalyzed for LD to ultimately uncover the same results. Furthermore, ve of six additionally studied populations uncovered the same LD pattern with strong r 2 scores. The Sri Lankan Tamil in the UK (STU) population fell below the 0.800 cut-off (with an r 2 score of 0.713), which may be explained by the relative closeness in frequency of the rs3751143 major and minor allele frequencies (0.593 and 0.407 respectively). The LD pattern of SNP speci c alleles for each variant discovered by Peng et al (2019) (the wild-type of rs3751143 in LD with wild-type of rs11615992), remained consistent across all data sets, where the minor allele of rs11615992 was present. Finally, in all but one population, no other SNPs were found to be in LD with rs3751143.
Interestingly, in the GWD population, a new SNP was uncovered to be in strong LD with rs3751143 in the absence of rs11615992. This data set contains a xed A allele for rs11615992, similar to the YRI African population as noted in Peng et al's paper. In the GWD population, rs3751143 was in strong LD with rs61083578 (r 2 = 0.807), a variant ~ 14 kb downstream. This is almost triple the distance between rs3751143 and rs11615992. Nonetheless, there is reason to believe rs61083578 potentially stands in as a proxy for rs11615992 as a cis-regulatory factor of P2RX7 expression. In the Peng et al (2019) publication, they present 3C assay data that supports rs11615992 interacting with the promoter region of P2RX7 as shown by high interaction frequency between the P2RX7 promoter region and randomly selected chr12:12163138 position. This region randomly selected for their assay is 4050 bases downstream of rs11615992 and 5409 bases upstream of rs61083578. To quote Peng et al (2019) summarizing their results, "When one-sample t-test was used to roughly compare the ligation e ciency, a signi cant deviation was obtained (P < 10 − 6 ), thus indicating that P2RX7 promoter could interact with the enhancer [chr12:12163138 region in close proximity to rs11615992] and P2RX7 should be the target gene of this enhancer". This strengthens the possibility that the P2RX7 promoter region might interact with rs61083578, given that the region used in the 3C assay was nearly as close to rs11615992 as it was to rs61083578. There are additional opportunities for rs61083578 to have regulatory effects on P2RX7, considering it is near a CCAAT Enhancer Binding Protein alpha binding site and contained within an annotated long non-coding RNA region.

Conclusion
The goal of this analysis was to expand upon the relevance of the thorough and extensive research conducted by Peng et al (2019) on allele-speci c expression of P2RX7 explained by the downstream SNP rs11615992 being in LD with intronic non-synonymous SNP rs3751143. In their initial paper, analysis was focused on the study of Chinese populations. In an effort to broaden the scope of their ndings, this analysis has con rmed the relevance of their discovery in Europe, America, and South Asia, by uncovering comparably strong LD between these two variants in populations from these regions. Furthermore, in the case of the selected African population, SNP rs61083578 was discovered to be in strong LD with rs3751143. This downstream variant shares some characteristics to rs11615992, in that it sits near a predicted transcription factor binding site, and a region used in Peng et al's (2019) 3C assay which showed potential for interaction between downstream enhancer regions and the P2RX7 promoter. While the scope of the analysis submitted in this paper is con ned to inquiry and predictions performed in a purely computational environment, it opens the possibility for clinically relevant rs3751142 allele speci c expression of P2RX7 to not only be in uenced by the downstream SNP rs11615992 but also further downstream SNP rs61083578. Ethics approval and consent to participate

Abbreviations
The 1000 Genomes data is made available according to the Fort Lauderdale Agreement.