Loci on chromosome 20 interact with rs16969968 to influence cigarettes per day in European ancestry individuals

.


Introduction
Smoking cigarettes is the leading cause of preventable death in the United States (Alberg et al., 2014).One in five deaths in the United States can be attributed to smoking (Alberg et al., 2014).Smoking also affects the economy; smoking-related health costs are around $300 billion per year in the United States alone (S.Federal Trade Commission (FTC), 2019).Previous work has demonstrated a substantial genetic component to smoking behaviors, and twin studies estimate the heritability of smoking quantity and nicotine dependence to be between 40% and 75% (Kaprio, 2009;Lessov-Schlaggar et al., 2006) in adults across multiple ancestries.Recent genome-wide association studies (GWAS) have identified several hundred individual variants associated with various smoking-related behaviors (Liu et al., 2019), but the majority of the SNP heritability for smoking quantity remains to be accounted for (Evans et al., 2021;Quach et al., 2020).Moreover, the variants within these genes and their regulatory elements are likely to influence a complex trait via minute perturbations across a complex, non-linear set of physiological networks (i.e., transcriptional, neuronal, and developmental) (Kauffman, 1993).The physiological intricacy in which complex traits such as smoking develop suggests that interactions between loci or whole genes (i.e., epistasis) are likely, since there are numerous ways and stages at which these interactions could arise.Furthermore, while current evidence of epistatic effects in humans has been limited (Hill et al., 2008), work on model organisms further suggests that epistatic effects are common (Mackay, 2013) and may be particularly important for predicting an individual's genetic risk to disease such as nicotine dependence (Mackay and Moore, 2014).
To detect novel variants and improve our understanding of the biological processes involved in nicotine dependence, we investigated SNP-SNP interactions influencing nicotine use.However, testing all pairwise, genome-wide SNP-SNP interactions is computationally infeasible and hampered by the stringent multiple testing correction required for such analyses.Instead, by reducing the total number of tests by selecting a SNP or set of well-replicated SNPs of large effect (which are more likely to harbor interactions) with known functional impact as interactors, studying epistasis is possible.We therefore selected a well-replicated SNP previously associated with nicotine behaviors as our moderator, namely, rs16969968.
SNP rs16969968 in the CHRNA5/A3/B4 gene cluster of neuronal nicotinic receptor genes is the most widely replicated genetic variant associated with smoking behaviors (Chen et al., 2018;Picciotto and Kenny, 2021;Wen et al., 2016), emerging from early GWAS studies of lung cancer and smoking behaviors (Amos et al., 2008;Hung et al., 2008;Thorgeirsson et al., 2008).Nicotine is an agonist for neuronal nicotinic acetylcholine receptors (CHRN genes) and repeated nicotine use leads to their upregulation (Fowler et al., 2020).rs16969968 was the original top SNP identified in the CHRNA5 gene and has been the major focus of further study because it changes an amino acid (aspartate to asparagine; D398N) and has been shown to confer functional effects using cell culture methods in vitro (Bierut et al., 2008) and behavioral effects in a mouse genetic model (Buck et al., 2021;Koukouli et al., 2017;O'Neill et al., 2018).Absent balanced cross-over interactions, interaction effects are likely to be associated with at least some additive effects estimated in a typical GWAS single locus, additive model.If epistatic interactions do underlie any of the variation for heaviness of smoking or nicotine dependence, rs16969968 is therefore a reasonable a priori candidate locus to study as an interactor.In addition, this SNP is relatively common in some ancestral groups, with a minor allele frequency (MAF) of 0.37-0.43 in populations of European and Middle Eastern descent according to dbSNP (NCBI, 1999).Although rs16969968 is rarer in other ancestral groups such as East Asian and African, with MAF of 2% and 7% respectively (Bierut et al., 2008), this SNP has been associated with smoking behaviors in trans-ancestry analyses (Adjangba et al., 2021;Olfson et al., 2015).In sum, because rs16969968 is a highly replicated, trans-ancestral, and common signal of large effect with known functional consequences on smoking, we hypothesized that G×GWAS investigations using rs16969968 would be better powered than other SNPs in our search for epistatic effects influencing nicotine use.
Further investigation of statistically independent SNPs within the CHRNA5/A3/B4 gene cluster has previously suggested that rs16969968 moderates the effect of other SNPs on nicotine use.In a meta-analysis of smoking quantity led by Saccone et al., the authors identified at least two signals within the region that are statistically independent of rs16969968, tagged by rs578776 and rs588765 (Saccone et al., 2010).The major (risk) allele of rs578776 is in phase with the minor (risk) allele of rs16969968.In the case of rs16969968, the minor allele increases risk for nicotine dependence, but for rs578776 the minor allele is protective against it.Consequently, although the risk loci are correlated with each other, the minor alleles are out of phase, and when controlling for rs16969968, rs578776 is no longer genome-wide significant.At a second SNP, rs588765, in linkage disequilibrium (LD) with rs16969968, the minor allele is protective in single-locus models.However, when controlling for rs16969968, the minor allele is associated with an increased risk for smoking quantity (Saccone et al., 2010).In addition, a more recent study found that women who were carriers of the rs16969968 risk allele had increased odds of stopping smoking if they had the minor allele of CHRNA3 SNP rs578776 (Jones et al., 2023).In short, previous studies have demonstrated that controlling for rs16969968 or including it in interaction models has the potential to uncover new associations with smoking behaviors and provide further nuance to previously discovered ones.
However, to date, no study has tested for interactions of rs16969968 with other genetic loci on smoking intensity using a systematic, hypothesis-free approach.Using open-sourced software, we developed a two-step approach to explore genome-wide interactions across multiple levels of analysis, namely, at the single SNP and gene-level.This approach is flexibleallowing us to pool data from multiple sources or ancestries as well as granularthe SNP-level results from step 1 can be used to pinpoint the specific region driving any significant gene-level interactions.Using this approach, we investigated genome-wide interactions with rs16969968 at the single SNP and gene-level underlying smoking quantity.

Discovery sample: UK Biobank individuals of European and South Asian ancestry
We conducted our primary analyses in the UK Biobank (Sudlow et al., 2015), a biorepository with approximately 500 000 individuals.The initial analysis was limited to individuals of European ancestry, as detailed below.Following our replication, we included individuals of the second largest ancestry group in the UKB, namely, unrelated South Asian ancestry individuals.Details about selecting unrelated individuals of South Asian ancestry and meta-analyzing across the European and South Asian subsamples can be found under Supplementary Methods.
All unrelated participants of European ancestry who reported currently or formerly smoking and had genotype data that passed quality controls were used (N European = 116 442).Participants were years of age or older.Around 46% of our sample of unrelated individuals who reported smoking were female.Different ancestral populations differ in their allele frequencies; these allele frequency differences can increase false positives or decrease power in GWAS (Tian et al., 2008).To minimize such confounding, we performed all analyses only on individuals of the same ancestral population, namely, European (Euro) or and South Asian (S_Asian) descent, and then meta-analyzed across them.To identify individuals of European ancestry, we performed principal component analysis and retained those whose top scores on the first four principal components fell within the range of European ancestry previously determined by the UK Biobank (field 22006).Similar details for identifying individuals of South Asian ancestry can be found under Supplementary Methods.
All data analysis and cleaning were performed using PLINK2 (Chang et al., 2015).We first removed 849 individuals whose self-reported sex differed from their chromosomal sex determination (UKB data fields and 22001) due to their increased probability of being a sample mix-up, 46 people with irregularly high inbreeding coefficients (|F het | > 0.2), and 159 individuals who requested their information be redacted from the UKB, as well as 1 029 individuals whose genetic data did not pass quality controls identified by Affymetrix (549 individuals) and the UK Biobank (480 individuals, fields 22010 and 22051).Then, we used MAFand LD-pruned array markers (plink2 command: -maf 0.01 -hwe 1×10 − 8 -indep-pairwise 50 5 0.2) to identify unrelated individuals among all individuals of a given ancestry who reported smoking using the Genome-wide Complex Trait Analysis (GCTA) software (Yang et al., 2011) (gcta command: -grm-singleton 0.05).For our analyses, we used the HRC-imputed dosage data provided by the UK Biobank's full release, which used the HRC reference panel v. 1.1 (McCarthy et al., 2016) and an information score greater or equal to 0.9.We filtered MAF > 1% and tested ~10 M SNPs across the 22 autosomal chromosomes.
Smoking quantity was measured by CPD; we included individuals who reported previously or currently smoking (UKB field IDs 2887, 3456, and 6183; average = 18.22,median = 20, range 1-140, inclusive).Overall, most people tend to underestimate the amount they smoke, and this is particularly pronounced in individuals who report formerly smoking in whom telescoping can partly explain why our measure of smoking quantity was right-skewed (Krall et al., 1989) (Fig. S1A).To assess whether changes to scale influence the tests of the interactions, we also investigated log10 transformed CPD (Fig. S1B).

rs16969968×SNP analyses
All models included the following covariates: sex (UKB field 31), age at time of assessment (field 21003), genotyping batch (field 22000), assessment center (field 54), and the first 10 genetic principal components generated across the UKB to control for population-stratification. We accounted for the two different SNP chips (Bioleve and Axiom) by controlling for batch, and for local geographical effects by controlling for assessment center.To calculate these 10 genetic PCs, we used flashpca (Abraham and Inouye, 2014) on common LD-pruned array markers.To reduce collinearity in our full set of covariates, we ran principal component analysis using the prcomp function in R (R Foundation for Statistical Computing, 2018) to remove the axes that explained less than 1% of the total variance.Previous research has shown that failing to include the interactions between a moderator and covariates can inflate estimates of an interaction (Keller, 2014).As such, we also included all interactions between rs16969968 and our covariates.and each additional interacting SNP and the covariates.
We used PLINK2 to run a linear regression model (plink2 command: -linear interaction) to estimate SNP j -by-rs16969968 interaction associations with CPD.We included all rs16969968×covariate and SNP j ×covariate interactions to avoid potential confounding (Keller, 2014).Because covariate scales varied widely, all covariates and their products were standardized (plink2 command: -covar-variance-standardize).We used a standard GWAS threshold of 5×10 − 8 for this analysis.Our regression model took the following form: Where X p indicates the 1.q covariates, G indicates the number of risk alleles at rs16969968, Z j indicates the j th SNP in the G×GWAS, ε denotes environmental noise and measurement error.

rs16969968×gene analyses
To investigate rs16969968 interactions with gene level effects, we fed the resulting rs16969968-by-SNP j p-values into the multi-marker analysis of genomic annotation (MAGMA) (de Leeuw et al., 2015) v.1.09to test gene-level interaction associations for CPD and log10-transformed CPD.Using MAGMA, one can employ either the "SNP-wise mean" or the "SNP-wise top" model to aggregate genome-wide signals at the gene level.The SNP-wise mean model is more powerful when several SNPs within a gene show a moderate association with the outcome of interest; the SNP-wise top model, on the other hand, is more powerful when a single SNP is strongly associated with the trait (de Leeuw et al., 2018(de Leeuw et al., , 2016)).To ensure our analyses would be sensitive to varying unknown genetic architectures, we used both MAGMA's top and mean p-value models separately (MAGMA commands -model SNP-wise top and -model SNP-wise mean, respectively).To our knowledge, this was the first time MAGMA has been used to perform G×GWAS interaction analyses.We investigated the likelihood of getting spurious results from using MAGMA in this novel fashion by simulating a random phenotype and running our rs16969968×SNP and subsequently our rs16969968×Gene analyses genome wide (See Supplementary Methods).While we did see slight deflation of the p-values in the SNP-wise top model, no genes were significant after controlling for multiple testing via a Bonferroni correction in either model (Fig. S9).
In all the MAGMA analyses, variants were annotated to genes using a 25Kb window around the start and end of a gene.SNPs were successfully mapped onto a total of 18 573 genes using genome build 37. We used the SNP×rs16969968 interaction p-values for each SNP from the original GWAS, which accounted for the appropriate main effects, covariates and covariate interactions as described above, and included MAGMA's default covariates in the analysis (gene size, density, inverse minor allele count, per-gene sample size, plus the log value of each).We used a Bonferroni multiple testing correction significance threshold based on the number of genes tested (p = 0.05/18 573 = 2.70×10 − 6 ), which is conservative given LD structure and overlapping gene regions.

Finnish replication sample
To replicate any significant interactions, we chose five Finnish subsets with genetic and cigarette use data available as a replication sample (N = 40 140).These five subsets include: FinHealth 2017 study (Fin-Health) (National FinHealth Study -THL, n.d.), FINRISK (The National FINRISK Study, n.d.), Finnish Twin Cohort (FTC) (Kaidesoja et al., 2019;Kaprio et al., 2019), GeneRISK (Widén et al., 2022), and the Health-2000-2011(T2000-2011) (Health-2000-2011, n.d.).These datasets varied in sample size (ranging from around 994 smoking individuals in GeneRISK up to 26 751 in FINRISK) and the granularity of the cigarette use outcome (i.e., FTC used binned CPD while the rest of the subsets used raw CPD).For more information on these samples, please see Supplementary Methods.We confirmed our Finnish sample was an appropriate replication sample by comparing the Finnish linkage disequilibrium patterns of any gene regions of interest to those in our original UKB European sample, the largest sample in our study (Fig. S8).
To replicate any significant SNP or gene signals from the UKB analysis using the Finnish samples, we defined a replication region as all SNPs within 250 kb of the lead SNP in a significant interaction from the discovery analyses.This ensured that all SNPs in common between the Finnish and UKB samples in our region of interest plus any new SNPs that were likely to be in linkage disequilibrium with our SNPs of interest would also be included.
We performed rs16969968×SNP and rs16969968×Gene interaction analyses for any replication regions in each of the five Finnish samples as described previously (see 2.2 rs16969968×SNP Analyses and 2.3 rs16969968×Gene Analyses).We meta-analyzed the results from the rs16969968×SNP analyses across only the Finnish subsets (labeled Fin_Meta-analysis), across all European samples (labeled Euro_Metaanalysis), and trans-ancestrally across the UKB and Finnish samples (labeled All_Meta-analysis) using METAL's inverse variance weighing model (Willer et al., 2010).
To determine the number of independent tests conducted in the replication analyses, we performed principal component analysis using the UKB on all the SNPs within any replication regions of interest using R. To identify the maximum number of independent signals in a replication region, we ran principal component analysis on the genotypes of all the SNPs in our region of interest and counted the number of principal components whose standard deviations were greater than 1.To determine a significance threshold for any replication SNPs, we then adjusted for multiple testing in each region by dividing 0.05 by the effective number of independent loci the PCA analysis revealed.For example, the number of independent loci in our region of interest on chromosome 20 was 3; dividing 0.05 by 3 yielded a corrected alpha of 0.017 (0.05/3) for replicating our SNPs of interest on chromosome 20.

Characterizing significant interactions
For any statistically significant genes from the gene-level MAGMA analysis (p < 2.70×10 − 6 ), we sought to understand what drove any significant rs16969968×Gene interactions.To do this, we inspected the linkage disequilibrium patterns and performed conditional and functional analyses on any SNPs of interest within any significant genes.

P.N. Romero Villela et al.
SNPs of interest within significant genes were defined as SNPs with a suggestive significance of p < 1×10 − 5 , which is about one order of magnitude lower than our gene-level significance of p < 2.70×10 − 6 .This ensured that our follow-up analyses included all SNPs that might be driving any significant interactions observed at the gene level.
We used HaploView (Barrett et al., 2005) as well as LocusZoom (Pruim et al., 2010) to visualize the linkage disequilibrium pattern of the SNPs of interest for any genes that reached statistical significance.To test whether a significant gene contained a single or multiple signals, we conducted rs16969968×SNP interaction analyses on all SNPs of interest while conditioning on the top SNP for that gene.Our multiple testing correction threshold for these conditional analyses was defined by the number of effectively independent SNPs within a significant gene (see Section 2.4 Finnish replication sample).To test the interactive effect of each SNPs in the gene with rs16969968 while conditioning on the top SNP and its rs16969968 interaction, we exported the additive coding of all SNPs in the gene within MAGMA's 25 kb window using PLINK (plink flag: -recode A), and included in the conditional model the interaction between the top SNP and rs16969968 as well as the main effect of the top SNP and its interactions with the rest of our covariates.The conditional analysis followed the following regression model:

Results
We used the UKB European and South Asian samples as our discovery samples.Since our largest discovery subsample was of European descent, we then used the Finnish samples to replicate any significant results.For conciseness, the main text will focus on the results from the UKB European discovery subset and replications using the Finnish samples.For additional results on the South Asian subsample or the meta-analyzed European and South Asian discovery sample, please see Supplementary Results.

Finnish replication of rs16969968×SNP analyses
Using the Finnish replication samples, we additionally tested this region tagged by rs73586411 on chromosome 20.When meta-analyzing across the Finnish samples, for purposes of multiple testing correction, we identified four independent tests in our SNPs of interest using principal component analysis (see 2.4 Finnish Replication Sample).Across the Finnish subsets, nine SNPs were nominally significant (p < 0.05 Table S1), but no SNP reached significance after adjusting for multiple comparisons (p < 0.05/4 = 1.25×10 − 2 ).We did not detect evidence of study heterogeneity across the Finnish subsets (p < 0.05, all p > 0.278, Table S1).When meta-analyzing across all Finnish and UKB European subsets, no SNP reached genome-wide significance, but the interaction between rs1696969 and rs73586411 remained suggestively significant (p < 5×10 − 8 , p = 6.5×10 − 6 , Table S1).We did not detect evidence of study heterogeneity between the UKB and Finnish samples (p < 0.05, p = 2.42×10 − 1 , Table S1).Fig. 3A shows the estimated effect sizes for this interaction within individual samples and across all samples where rs73586411 was available.The estimated effect size for the rs16969968×rs73586411 interaction was consistently negative across the four meta-analyses and four of the seven samples (Fig. 3A), spanning UKB European ancestry, South Asian ancestry, and Finnish ancestry.For example, while underpowered, we note that the direction of the interaction effect between rs1696969 and rs73586411 for the South Asian UKB sample is consistent with the European samples (Fig. 3A) and its standard error for the interaction is smaller than in the European samples because rs73586411's effect allele is more common in South Asians than in individuals of European descent (Table S2).

Finnish replication of rs16969968×gene analyses
We used the Finnish samples to replicate our significant interaction between rs16969968 and the PCNA gene.The strongest SNP-level interaction associations in this region were physically located within the CDS2 gene, but were within 25KB of PCNA and TMEM230, and therefore included in MAGMA gene analyses for all three genes.We consequently included all three genes in our Finnish replication study.In our rs16969968×Gene meta-analysis of the Finnish samples, all three genes (CDS2, TMEM230, and PCNA) were significant after multipletesting correction (p < 1.67×10 − 2 ) across both the SNP-wise Mean and SNP-wise Top models (Table 1A and B,respectively), successfully replicating our results of the PCNA gene from the UKB's European subset.

Exploring the rs16969968×SNP interactions
We used LocusZoom and HaploView to visualize the pattern of associations as a function of their linkage disequilibrium (LD) with the lead SNP (rs73586411) in the PCNA gene and our significant SNP rs1892967 in chromosome 11 for log 10 CPD.All our suggestively significant interactions (p < 1×10 − 5 ) from the rs16969968×SNP analyses for the PCNA gene were highly correlated with one another (Fig. 3B) and aggregated in a single LD block, block 3 (Fig. S4).To confirm whether this was a single signal, we conducted rs16969968×SNP interaction analyses for the SNPs within PCNA, conditioning on the rs16969968×rs73586411 interaction, the interaction with the lowest pvalue in the PCNA gene.No SNPs were significant after controlling for multiple comparisons (p > 0.05/3 effectively independent SNPs in the region = 0.017).In addition, we explored the LD patterns around SNP rs1892967 which was the single significant hit from our rs16969968×SNP analyses for log 10 CPD.rs1892967 lies within the GRAMD1B gene and is in high linkage disequilibrium (LD > 0.8) with SNP rs1892966, which neared the genome-wide significance of 5×10 − 8 (p < 5×10 − 5 , p = 6.91×10 − 8 , Fig. S10).

Discussion
We conducted an exploratory study of SNP and gene interactions with the SNP rs16969968 on daily cigarette consumption.In the SNP×SNP interaction analysis, no SNP reached genome-wide significance when analyzing only European individuals.However, when we meta-analyzed across European and South Asian populations, one SNP, rs1892967, reached genome-wide significance for log 10 CPD (p = 3.18×10 − 8 ).Nevertheless, the gene analyses for genes near this SNP were insignificant across both the MAGMA mean and top models across all discovery subsets and the discovery meta-analysis.Further analyses with increased sample sizes and greater ancestry diversity can help clarify the role of this locus and its interaction with rs16969968 on smoking quantity.
At the gene-level, one gene, PCNA, did achieve genome-wide significance within a single ancestry.This result was consistent with the SNP-level analysis, where some SNPs within this region (tagged by rs73586411 and including two other genes, CDS2 and TMEM230) had pvalues approaching significance.Importantly, we replicated this genelevel finding in an independent dataset of five Finnish samples (all p < 6.16×10 − 3 ), followed by a meta-analysis of the results (6.62E-4 > p >5.42E-8), confirming our novel finding for all three genes.The fact that all three of these genes were statistically significant in our replication analyses using the Finnish samples supports our conclusion that a region tagged by lead SNP rs73586411 and shared across these three genes significantly modulates the effect of the risk allele of rs16969968 and its effects on daily cigarette consumption.
We conducted conditional analyses to determine the number of independent signals within our region of significance.The LD structure PCNA gene and conditional analyses revealed that this is a single signal coming from an LD block containing 11 SNPs.Because we used a 25 kb

Table 1
Results for the gene-level interaction analyses with rs16969968 for cigarettes per day for Europeans and South Asians in the UKB, meta-analyzed Finnish subsets, and the meta-analysis across both ancestries and all datasets (UKB and Finnish).(A) TMEM230 and PCNA reached significance in the UKB European subsample using the SNP-wise Mean model.All three overlapping genes, CDS2, TMEM230, and PCNA were significant in the meta-analysis of the Finnish subsets and in the meta-analysis combining all UKB and Finnish sub-samples.(B) Considering the most significant signal within a gene (SNP-wise Top model), TMEM230 and PCNA reached significance in the UKB discovery meta-analysis of Europeans and South Asians; PCNA was also significant in the UKB European sub-sample.All three overlapping genes, CDS2, TMEM230, and PCNA were significant in the Finnish and the mega meta-analysis.Values in bold indicate significance after multiple testing (p < 2.63×10 − 6 ).window, all these 11 nominally significant SNPs driving the interaction with PCNA also span part of the CDS2 and TMEM230 genes (Sherry et al., 2001).Only PCNA was statistically significant in our UKB Euro analyses while CDS2 and TMEM230 were additionally significant in our replication; we hypothesize that this discrepancy was due to CDS2 and TMEM230 harboring more non-significant SNPs, which diluted the signal in the CDS2 and TMEM230 genes.To illustrate, the PCNA gene boundary contained 48 SNPs, whereas the CDS2 and TMEM230 gene region boundaries contained 221 and 67, respectively.In sum, we emphasize that this interaction is due to a single signal within the PCNA, CDS2, and TMEM230 region of chromosome 20.None of the SNPs in the LD block driving our significant gene results are located within coding regions of PCNA, CDS2, or TMEM230.Most are located within intronic regions of CDS2, but there is no evidence for functional impact based on current information available for possible epigenetic areas or other known gene regulatory elements.Therefore, prioritization of possible functional SNPs could not be identified in this study.
To our knowledge, of the three genes encompassing our epistatic region of interest, PCNA is the only one previously linked to smoking behaviors.PCNA encodes for proliferating cell nuclear antigen, which is widely expressed across many tissues and involved in leading strand synthesis of DNA during replication.According to the GWAS catalog ("GWAS Catalog," 2023), height is the only phenotype with evidence of association with PCNA (Barton et al., 2021).However, animal and transcriptomic studies have linked PCNA to nicotine.For example, animal studies have linked nicotine exposure to PCNA damage in lung and kidney cell cultures in a dose-dependent fashion (Salama et al., 2014).In addition, PCNA expression levels increased in hepatic and pancreatic cells of rats exposed to both ethanol and tobacco compared to tobacco alone (Wang et al., 2014).PCNA gene expression is up-regulated in response to complex environmental mixtures, including cigarette smoke, and is clearly involved in DNA repair following exposure to hazardous chemicals (Sen et al., 2007).Cumulatively, this previous research links PCNA to smoking behaviors.On the other hand, TMEM230 and CDS2 have been associated with a variety of other traits.For example, TMEM230 has been previously associated with acute myeloid leukemia (Lv et al., 2016), hair morphology (Medland et al., 2009), and Parkinson's Disease (Wang et al., 2021).CDS2 has emerged in four GWAS reports: two studies of height (Kichaev et al., 2019;Sakaue et al., 2021), one on Ebbinghaus illusion, an inability to contextualize relative size perception (Zhu et al., 2020), and another identifying gene-gene interactions with pathological hallmarks of Alzheimer's disease (Wang et al., 2020).While all three genes (PCNA, TMEM230, and CDS2) were significant in our replication, only PCNA was significant across both our discovery and replication analyses, and therefore functional follow-up of this interaction should prioritize the PCNA gene.
There are several caveats and limitations to our study.First, since it was computationally unfeasible for us to investigate all pair-wise genetic interactions, we only investigated genetic interactions with a single SNP of interest, rs16969968.There are likely other interactions involving SNPs other than rs169969968 influencing smoking quantity that were unexplored.Second, smoking quantity is a good proxy for nicotine dependence, but participants tend to underreport how much they smoke (Gorber et al., 2009); this is especially problematic in individuals who no longer actively smoke (Soulakova et al., 2012).While most of our sample was composed of individuals who reported actively smoking, underreporting the number of cigarettes smoked per day might have been especially pronounced on individuals who no longer report actively smoking, thereby introducing non-random bias in our outcome of interest.However, this potential bias in CPD self-reporting is unlikely to be jointly correlated with the genotype at rs16969968 and with other genes across the genome; while underreporting might bias the average CPD estimate, it would not have led to false interaction signals.Third, we estimated smoking quantity through cigarettes per day since it is the most common form of nicotine consumption, thereby excluding other ways our participants might have ingested nicotine (i.e., vapes, pipes, hookahs, etc.).In our results, both the SNP and gene level interactions for log 10 -transformed cigarettes per day were insignificant for the chromosome 20 region surrounding PCNA and tagged by rs73586411.For example, at the SNP level using log 10 -transformed CPD in Europeans, the interaction p-value for rs73586411 was 4.33×10 − 4 compared to 1.79×10 − 5 for CPD.At the gene level, the interaction between rs16969968 and PCNA for log 10 -CPD was also insignificant in Europeans in the UKB (p = 2.71×10 − 5 for SNP-wise mean, p = 2.21×10 − 5 for SNP-wise top model).In addition, the 25 kb window we chose around the start and end of each gene is arbitrary; there is no clear standard in the field for this.When using genes discovered in model organisms associated with nicotine consumption, Palmer et al. found that heritability for human nicotine consumption was enriched in genomic regions surrounding the genes compared to the protein-coding regions of these genes.In addition, after comparing 5, 10, 25, and 35 kb gene windows, they found that enrichment decreased after 10 kb (Palmer et al., 2021).These findings suggest that it is beneficial to use a gene window, though the best window size still merits further investigation and could vary across traits and across genes.
Despite these limitations, our study has multiple strengths.Our twostep approach of conducting a genome-wide interaction study and aggregating these signals within genes increased our power relative to the initial SNP×SNP analysis to detect interactions by decreasing the multiple testing burden.In addition, based on our simulations, this approach kept our type I error rate low when evaluating unlinked SNPs.Second, this approach is flexiblewe pooled data from multiple sources and ancestries with low sample sizes that would normally be disregarded in interaction studies due to low power and instead "balance statistical needs with fairness" following Ben-Eghan et al.'s call to include data from minority populations despite their low sample sizes (Ben-Eghan et al., 2020).While flexible, our approach is also granularwe pinpointed the specific LD block driving our significant SNP×Gene results by inspecting any regions of interest using the SNP×SNP results from step one.Third, our method is agnostic to functional characteristics.Previous extensions to MAGMA such as H-MAGMA (Sey et al., 2020) are limited to investigating interactions that have well documented chromatin interaction profiles, such as neural tissue.Similarly, transcriptome-wide association studies combine GWAS results and gene expression to elucidate probable mechanistic relationships for the SNP and genes associated with a trait of interest.Previous work has expanded the TWAS approach to investigate gene-gene interactions and demonstrated that gene interactions influencing complex traits are pervasive and therefore important to further investigate (Evans et al., 2023).One of the limitations of our approach is that it does not elucidate biological function; however, unlike TWAS, our approach is unaffected by limitations of existing gene expression datasets, including a lack of diversity, small sample sizes, and tissue bias present (Wainberg et al., 2019).The approach developed here (pooling data from multiple datasets to increase sample size and representativeness, limiting SNP×SNP epistatic analyses to common variants, and using a 10-25 kb upstream and downstream gene window when aggregating SNP×SNP results at the gene-level) will be useful for other researchers in the field attempting to discover genome-wide interactions with a wide range of complex traits.Potential interactions discovered using our approach can then be followed-up using H-MAGMA or TWAS to elucidate biological networks or functions underlying such interactions.These results serve as a guide for others in the field as they also attempt to study epistasis at the SNP level.
In summary, this is the first study to report an interaction between rs16969968 and any genome-wide loci influencing cigarette consumption.Five of our nominally significant SNPs, such as rs73586411 and rs6053152, previously failed to reach significance for cigarettes per day in GSCAN, with sample sizes roughly 3-10 times the size used here (Liu et al., 2019).This highlights the power of interaction studies to detect novel variants that would not be found otherwise and the importance for larger studies like GSCAN to increase access to individual-level data for the feasibility of this work.Future studies could implement our pipeline to investigate interactions between other well-replicated common polymorphisms and genome-wide loci for further characterization of the genetic factors underlying complex traits.Our approach could be especially helpful in understanding how different variants within the same gene might have different or opposing effects on a trait of interest.Like the CHRNA5 receptor gene rs16969968 is located in, the CYP2A6 gene harbors multiple polymorphisms that have been previously associated with smoking quantity but with opposing effects (Pan et al., 2015).For example, while rs16969968 has consistently been shown to increase smoking quantity, our current results show that in the presence of the minor allele of tag SNP rs73586411, smoking quantity was reduced.In short, our current findings expand our understanding of how a well-characterized and long-established SNP influencing smoking quantity alters risk for smoking behaviors in conjunction with the rest of the genome and showcases a novel way for other scientists to continue more detailed characterization of other strongly associated SNPs underlying smoking behaviors.

Fig. 2 .
Fig. 2. (A) Two genes, PCNA and TMEM230, reached significance (p < 2.63×10 − 6 ) for interacting with rs16969968 influencing CPD after adjusting for multiple testing when using the SNP-wise Mean model in MAGMA.(B) One gene, PCNA, reached significance (p < 2.63×10 − 6 ) for interacting with rs16969968 influencing CPD after adjusting for multiple testing when using the SNP-wise Top model in MAGMA.All gene analyses shown above are using only the UKB European subset.

Fig. 3 .
Fig. 3. (A) Estimated rs16969968×rs735864111effect sizes, alongside their standard error for those estimates across samples.The sample size of each sample is denoted in parentheses; samples are ordered according to decreasing sample size.(B) Locus Zoom plot for region of interest, tagged by rs73586411.