We conducted an exploratory study of SNP and gene interactions with the SNP rs16969968 on daily cigarette consumption. In the single-SNP G×GWAS interaction analysis, no individual SNP reached genome-wide significance. In the gene-level analysis, however, one gene, PCNA, achieved genome-wide significance when we aggregated the rs16969968×SNP p-values at the gene level. This result was consistent with the individual SNP analysis, in which some SNPs in the same region (tagged by rs73586411) had p-values approaching significance. Importantly, we replicated this gene-level finding in an independent dataset of five Finnish samples by specifically testing for an interaction between rs16969968 and three genes (PCNA, CDS2, and TMEM230) and meta-analyzing the results. Collectively, the replication samples confirmed our novel finding for all three genes, with p-values ranging from 0.0017 to 3.67 × 10⁻⁷, depending on the model used. That all three genes were statistically significant in the Finnish replication analyses supports our conclusion that a region tagged by lead SNP rs73586411 and shared across these three genes significantly modulates the effect of the rs16969968 risk allele on daily cigarette consumption.
A caveat is that neither the SNP-level nor the gene-level interaction was significant for log10-transformed cigarettes per day (CPD). At the SNP level, the p-value for rs73586411 was 9.66 × 10⁻⁵ using log10-transformed CPD, compared to 7.50 × 10⁻⁸ for raw CPD. At the gene level, however, the interaction between rs16969968 and PCNA for log10-CPD was suggestive (p = 2.71 × 10⁻⁵ for the SNP-wise mean model, p = 2.21 × 10⁻⁵ for the SNP-wise top model). Therefore, while there is some evidence to suggest that the interaction disappears on the multiplicative scale, we believe that our replication using an independent sample supports our initial findings of a significant interaction between rs16969968 and one or more SNPs found near the PCNA gene.
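The scale dependence noted above can be illustrated with a minimal sketch. The cell means below are hypothetical and are not values from our data: they are chosen so that each risk genotype multiplies CPD by a fixed factor. Under that assumption, the difference-of-differences contrast is nonzero on the raw scale and essentially zero after a log10 transform.

```python
import math

# Hypothetical mean CPD for carrying (1) or not carrying (0) the risk
# genotype at two loci. Illustrative values only, not study estimates.
mean_cpd = {(0, 0): 10.0, (1, 0): 12.0, (0, 1): 15.0, (1, 1): 18.0}


def interaction_contrast(means, transform=lambda x: x):
    """Difference of differences: departure from additivity on the chosen scale."""
    t = {cell: transform(value) for cell, value in means.items()}
    return t[(1, 1)] - t[(1, 0)] - t[(0, 1)] + t[(0, 0)]


raw_interaction = interaction_contrast(mean_cpd)              # raw CPD scale
log_interaction = interaction_contrast(mean_cpd, math.log10)  # log10-CPD scale

print(f"raw-scale interaction contrast:   {raw_interaction:.3f}")
print(f"log10-scale interaction contrast: {log_interaction:.3f}")
```

In real data the two scales rarely separate this cleanly, so the sketch only shows why a raw-scale interaction need not survive log transformation.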
We explored the LD structure of the SNPs in the PCNA gene and conducted conditional analyses, which indicated that this is a single signal coming from an LD block containing 11 SNPs. We note that because we used a 25kb window, all 11 of these nominally significant SNPs driving the interaction with PCNA also span part of the CDS2 and TMEM230 genes [51]. PCNA likely reached statistical significance in our UKB analyses, while CDS2 and TMEM230 did not, because the PCNA gene boundary contained 48 SNPs, whereas the CDS2 and TMEM230 gene boundaries contained 221 and 67 SNPs, respectively. We therefore hypothesize that the higher number of SNPs in the CDS2 and TMEM230 genes diluted the interaction signal between rs16969968 and rs73586411. None of the SNPs in high linkage disequilibrium are located within coding regions of any of the three genes. Most are located within intronic regions of CDS2, but current information on possible epigenetic areas or other known gene regulatory elements provides no evidence of functional impact. In sum, we emphasize that this interaction is due to a single signal within the PCNA, CDS2, and TMEM230 region of chromosome 20, but our analysis cannot prioritize possible functional SNPs.
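The dilution hypothesis can be made concrete with a rough sketch. It uses a simplified SNP-wise mean (the average per-SNP chi-square in a gene window) and is not the statistic computed by the gene-level software. The SNP counts (48, 221, 67) and the 11-SNP block come from the text; the signal chi-square of 20 is hypothetical, non-signal SNPs are given the 1-df null expectation of 1, LD is ignored, and all 11 SNPs are assumed to fall inside each gene window, which may overstate the overlap for CDS2 and TMEM230.

```python
def snpwise_mean(n_snps, n_signal, signal_chisq, null_chisq=1.0):
    """Average per-SNP chi-square in a gene window (simplified; ignores LD)."""
    if n_signal > n_snps:
        raise ValueError("n_signal cannot exceed n_snps")
    total = n_signal * signal_chisq + (n_snps - n_signal) * null_chisq
    return total / n_snps


# SNP counts per gene window are taken from the text.
SNPS_PER_GENE = {"PCNA": 48, "CDS2": 221, "TMEM230": 67}
N_SIGNAL = 11          # size of the LD block reported in the text
SIGNAL_CHISQ = 20.0    # hypothetical per-SNP interaction chi-square

gene_means = {
    gene: snpwise_mean(n_snps, N_SIGNAL, SIGNAL_CHISQ)
    for gene, n_snps in SNPS_PER_GENE.items()
}

for gene, mean_stat in gene_means.items():
    print(f"{gene}: mean chi-square = {mean_stat:.2f}")
```

Under these assumptions, the same 11-SNP signal yields the largest mean statistic for the gene with the fewest SNPs, which is the pattern the hypothesis predicts.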
PCNA encodes proliferating cell nuclear antigen, which is widely expressed across many tissues and is involved in leading strand synthesis of DNA during replication. According to the GWAS catalog [52], height is the only phenotype with evidence of association with PCNA [53]. In contrast to GWAS, animal and transcriptomic studies have linked PCNA to smoking. For example, animal studies have linked nicotine exposure to PCNA damage in lung and kidney cell cultures in a dose-dependent fashion [54]. Interestingly, PCNA expression levels were higher in hepatic and pancreatic cells of rats exposed to both ethanol and tobacco compared to tobacco alone [55]. According to GeneWeaver [56], PCNA has previously been linked to tobacco smoke pollution in humans, and the Comparative Toxicogenomics Database lists a couple of publications linking PCNA to nicotine. CDS2 codes for CDP-diacylglycerol synthase 2, an enzyme that regulates levels of phosphatidylinositol and is therefore involved in second messenger signaling for regulating cell growth, calcium metabolism, and protein kinase C activity. Notably, there are two genes that code for this enzyme, the other of which is located on chromosome 4q21. CDS2 has emerged in four GWAS reports: two studies of height [57, 58], one of the Ebbinghaus illusion, an inability to contextualize relative size perception [59], and, most relevant to the present study, one identifying gene-gene interactions with pathological hallmarks of Alzheimer’s disease [60]. TMEM230, transmembrane 230, is expressed in neurons, as well as many other tissues, and may be involved in synaptic vesicle trafficking and recycling. It was identified in a GWAS of acute myeloid leukemia [61] and in another of hair morphology [62], and there is ongoing debate about whether it may be associated with Parkinson’s disease [63].
In short, of the three genes encompassing our epistatic region of interest, to our knowledge, PCNA is the only one previously linked to smoking behaviors.
Our two-step approach of conducting a genome-wide interaction study and then aggregating these signals within genes increased our power to detect genome-wide interactions while keeping the type I error rate low when evaluating unlinked SNPs; we recognize that LD among interacting SNPs can lead to false-positive tests of epistasis [64, 65]. Moreover, it retained the flexibility to follow up identified SNP×SNP results for further examination. The approach developed here will be useful for other researchers attempting to discover genome-wide interactions for a wide range of complex traits. We used a 25kb window around the start and end of each gene, but there is no clear standard in the field for window size. When using genes discovered in model organisms associated with nicotine consumption, Palmer et al. found that heritability for human nicotine consumption was enriched in genomic regions surrounding the genes compared to the protein-coding regions of these genes. In addition, after comparing 5, 10, 25, and 35kb gene windows, they found that enrichment began decreasing after 10kb [66]. These findings suggest that it is beneficial to use a gene window, although the best window size merits further investigation and could vary across traits and across genes. In general, we recommend pooling data from multiple datasets to increase sample size, limiting SNP×SNP epistatic analyses to common variants, and using a 10–25kb upstream and downstream gene window when aggregating SNP×SNP results at the gene level. These recommendations may serve as a guide for others studying epistatic interactions at the SNP level.
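The effect of window size on which SNPs are aggregated into a gene can be sketched as follows. This is an illustrative helper rather than the annotation step of our pipeline: the function name, gene labels, and coordinates are hypothetical, and all positions are assumed to lie on the same chromosome.

```python
from typing import Dict, List, Tuple


def assign_snps_to_genes(
    snp_positions: Dict[str, int],
    gene_bounds: Dict[str, Tuple[int, int]],
    window_kb: int = 25,
) -> Dict[str, List[str]]:
    """Map each gene to the SNPs within window_kb of its start or end."""
    pad = window_kb * 1_000
    assignments: Dict[str, List[str]] = {}
    for gene, (start, end) in gene_bounds.items():
        lower, upper = start - pad, end + pad
        assignments[gene] = sorted(
            snp for snp, pos in snp_positions.items() if lower <= pos <= upper
        )
    return assignments


# Hypothetical base-pair coordinates on a single chromosome.
genes = {"GENE_A": (100_000, 110_000), "GENE_B": (130_000, 150_000)}
snps = {"snp1": 80_000, "snp2": 106_000, "snp3": 118_000, "snp4": 170_000}

wide = assign_snps_to_genes(snps, genes, window_kb=25)
narrow = assign_snps_to_genes(snps, genes, window_kb=10)

print("25kb window:", wide)
print("10kb window:", narrow)
```

With the wider window, neighboring genes share SNPs, which mirrors how a single LD block can contribute to several adjacent genes; the narrower window reduces that sharing at the cost of excluding nearby regulatory variants.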
In summary, this is the first study to report an interaction between rs16969968 and any genome-wide loci influencing cigarette consumption. Five of our nominally significant SNPs, such as rs73586411 and rs6053152, previously failed to reach significance for cigarettes per day in GSCAN, with sample sizes roughly 3–10 times the size used here [7]. This highlights the power of interaction studies to detect novel variants that would not be found otherwise. Future work could expand on our current pipeline to investigate interactions between rs16969968 and genome-wide loci for other smoking behaviors such as smoking cessation. In addition, one could apply our two-stage pipeline to SNP hits from large scale meta-analyses such as GSCAN to investigate other potential genome-wide interactions influencing smoking behaviors. These findings will help inform the work of basic scientists who are working on characterizing epistatic effects influencing smoking behaviors using animal models. Understanding how well-established risk variants such as rs16969968 alter risk for smoking behaviors in conjunction with the rest of the genome is increasingly important with the rise of precision medicine.