Gene-gene Interaction of AhR with and within the Wnt Cascade Affects Susceptibility to Lung Cancer

Background: Aberrant Wnt signalling, regulating cell development and stemness, influences the development of many cancer types. The Aryl hydrocarbon receptor (AhR) mediates tumorigenesis of environmental pollutants. Complex interaction patterns of genes assigned to AhR/Wnt-signalling were recently associated with lung cancer susceptibility. Aim: To assess the association and predictive ability of AhR/Wnt-genes with lung cancer in cases and controls of European descent. Methods: Odds ratios (OR) were estimated for genomic variants assigned to the Wnt agonist and the antagonistic genes DKK2, DKK3, DKK4, FRZB, SFRP4 and Axin2. Logistic regression models with variable selection were trained, validated and tested to predict lung cancer; other previously identified SNPs that have been robustly associated with lung cancer risk could also enter these models. Further, decision trees were created to investigate variant x variant interaction. All analyses were performed for overall lung cancer and for subgroups. Results: No genome-wide significant association of AhR/Wnt-genes with overall lung cancer was observed, but within the subgroups of ever smokers (e.g. marker rs2722278 SFRP4; OR=1.20; 95%-CI: 1.13-1.27; p=5.6 × 10⁻¹⁰) and never smokers (e.g. marker rs1133683 Axin2; OR=1.27;


reported. Classification And Regression Tree (CART) analysis revealed an interaction of DKK2 and SFRP4 polymorphisms to be the best predictor of LC among all those investigated, especially within smokers. The authors also reported several high-risk subgroups among smokers, e.g. characterised by DKK2 (rs17037102 / rs419558) and Axin2 (rs9915936). A similar picture was observed in a sample of 270 subjects from Istanbul, Turkey.
[24] A two-way interaction between DKK3 (rs3206824) and SFRP4 (rs1802074) was found to be predictive of LC.

We aimed to assess a possible association of the AhR pathway and the Wnt signalling cascade with LC within the large-scale series of cases and controls of European descent held by the International Lung Cancer Consortium (ILCCO) / Integrative analysis of Lung Cancer Etiology and Risk (INTEGRAL). To this end, we also evaluated the contribution of these genes to the genetic prediction of LC as a complement to known LC-related markers.

Methods
The work presented has been reviewed and approved by the ILCCO Steering Committee.

Cases and Controls
Phenotype and genotype data of 58,181 entries of the ILCCO data repository were extracted. Details of the repository have been described previously.[1, 25] Quality-control samples, individuals without information on smoking status or age, and samples with poor genotyping quality or sex discrepancies were excluded. To avoid population stratification, this analysis focuses on the European-ancestry population (defined as more than 95% probability of being of European descent). In total, 14,068 incident LC-cases and 12,390 cancer-free controls of European descent remained for analysis. Those genotyped with another genome-wide array in addition to OncoArray were separated to form an independent validation set (2nd validation set; n=4,359, including 2,360 LC-cases and 1,999 controls).

Selected Markers
For this investigation we extracted the genotypes of 113 genomic variants (markers) assigned to 58 genes previously robustly associated with the risk of LC or one of its histological subtypes [1][2][3][4][5][6][7][8], or proxies thereof (called LC-markers), and 296 markers assigned to 7 genes involved in Wnt signalling and listed in Bahl et al. [22,23] and Yilmaz et al. [24] (called AhR/Wnt-markers). Thus, we restricted this analysis to genes previously investigated with respect to LC. Fifty of these 409 markers were eliminated before analysis because of a minor allele frequency (MAF) <1%, departure from Hardy-Weinberg equilibrium (HWE) in genotypes (unaffected p<10⁻⁷, affected p<10⁻¹²), or low imputation accuracy (info<0.8). Seventy-eight of the remaining LC-markers were genotyped with the OncoArray (44 of these are proxy SNPs identified using LDlink [26]) and 32 had to be imputed. Two hundred twenty-one of the remaining AhR/Wnt-markers were genotyped and 28 were imputed.
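The three exclusion criteria for markers can be written as a simple filter. The following is only a minimal sketch: the thresholds are those quoted in the text, while the function name, the argument names and the example markers are hypothetical and do not reflect the actual ILCCO pipeline.

```python
# Thresholds quoted in the text; all names below are illustrative assumptions.
MAF_MIN = 0.01          # markers with MAF < 1% are dropped
HWE_P_CONTROLS = 1e-7   # HWE departure among unaffected: p < 10^-7
HWE_P_CASES = 1e-12     # HWE departure among affected: p < 10^-12
INFO_MIN = 0.8          # imputation accuracy (info) below 0.8 is dropped


def passes_qc(maf, hwe_p_controls, hwe_p_cases, info=1.0):
    """Return True if a marker survives all three exclusion criteria.

    Assumption: directly genotyped markers are given info = 1.0.
    """
    if maf < MAF_MIN:
        return False
    if hwe_p_controls < HWE_P_CONTROLS or hwe_p_cases < HWE_P_CASES:
        return False
    if info < INFO_MIN:
        return False
    return True


# Made-up markers, for illustration only.
markers = {
    "rsA": dict(maf=0.25, hwe_p_controls=0.40, hwe_p_cases=0.10),
    "rsB": dict(maf=0.005, hwe_p_controls=0.40, hwe_p_cases=0.10),
    "rsC": dict(maf=0.30, hwe_p_controls=1e-9, hwe_p_cases=0.10),
    "rsD": dict(maf=0.12, hwe_p_controls=0.40, hwe_p_cases=0.10, info=0.7),
}
kept = [rsid for rsid, stats in markers.items() if passes_qc(**stats)]
print(kept)
```

In this toy input only the first marker is retained; the others each fail exactly one criterion (MAF, HWE in controls, imputation accuracy).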
A list of these markers extracted from the ILCCO OncoArray repository is given in the appendix.

Association analysis
We first performed association analysis for each marker separately using the program PLINK. [27, 28] Crude (model 1) and adjusted odds ratios (ORs) were estimated along with 95% confidence intervals within log-additive models. The adjusted models included sex, age, smoking status and the first 3 principal components (PCs) to adjust for population stratification (model 2) and, in addition, the 6 most significantly associated LC-markers (rs55781567, 15q25.1 CHRNA5; rs11780471, 8p21.2 CHRNA2; rs7705526, 5p15.33 TERT; rs56113850, 19q13.2 CYP2A6; rs71658797, 1p31.1 AK5; rs11571833, 13q13.1 BRCA2) (model 3). ORs were estimated for overall LC and for the subgroups small cell LC (SCLC), squamous cell LC (SqCLC), adenocarcinoma LC (adenoLC), ever smokers, never smokers and young individuals (21 to 55 years of age). We generated QQ-plots for the AhR/Wnt-markers and estimated the genomic inflation factor λ. To account for multiple testing, genome-wide statistical significance was considered to correspond to a p-value of 10⁻⁷ or lower, suggestive significance to a p-value between 10⁻⁵ and 10⁻⁷, and nominal significance to a p-value between 0.05 and 10⁻⁵.

Logistic Regression - Predicting models with model selection
We fitted logistic regression models with variable selection to find appropriate polygenic risk scores (PRS) for predicting LC status (affected or unaffected). Any AhR/Wnt-marker or LC-marker could be included in the model without preference. To avoid multicollinearity, we removed one of each pair of SNPs in LD with each other (R²>0.8, pruning). The remaining markers entered the models as potential predictors.
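The pruning step just described (dropping one of two SNPs with R²>0.8) can be illustrated as follows. This is a sketch under stated assumptions, not the procedure actually used: it takes R² to be the squared Pearson correlation of allele dosages and prunes greedily in input order, and the function names and toy genotypes are hypothetical.

```python
from statistics import mean

R2_MAX = 0.8  # pruning threshold quoted in the text


def r_squared(x, y):
    """Squared Pearson correlation between two allele-dosage vectors (0/1/2)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic marker: correlation undefined
        return 0.0
    return sxy * sxy / (sxx * syy)


def ld_prune(dosages, r2_max=R2_MAX):
    """Greedy pruning (assumed strategy): keep a marker only if its r^2
    with every marker kept so far does not exceed r2_max."""
    kept = []
    for name, vec in dosages.items():
        if all(r_squared(vec, dosages[k]) <= r2_max for k in kept):
            kept.append(name)
    return kept


# Toy dosages for eight individuals; snp2 differs from snp1 in one person only.
dosages = {
    "snp1": [0, 1, 2, 1, 0, 2, 1, 0],
    "snp2": [0, 1, 2, 1, 0, 2, 1, 1],
    "snp3": [2, 0, 1, 0, 2, 1, 0, 1],
}
print(ld_prune(dosages))
```

Here snp2 is removed because its r² with snp1 is about 0.82, whereas snp3 (r² of about 0.19 with snp1) is kept. Which member of a correlated pair is retained depends on the input order in this sketch.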
We performed forward selection until the Bayesian information criterion (BIC, most stringent selection), the Akaike information criterion (AIC, less stringent selection, which in general retains more predictors) or the sample-size-corrected AIC (AICC) indicated a best solution (plus 10 more selection steps). The resulting PRSs are called BIC-, AIC- and AICC-scores. Note that, for the purpose of model building, AIC selection is asymptotically equivalent to cross-validation (CV). [29,30] To avoid overfitting, we assigned individuals to a training set, a validation set (both used to build a score) or a testing set (used to examine score performance) with a probability of 1/3 each. For comparison, we also generated a BIC_LC-score with at least one marker, allowing only LC-markers to enter the model building. To compare the importance for LC prediction of the sets of [...]

[...] formed by spurious epistasis we removed one of each pair of SNPs in LD with each other (R²>0.8, pruning). Since overfitting is a point of concern when building decision trees, the complexity parameter was first optimized by 10-fold cross-validation, grading the performance on the validation set by Somers' D (concordance of true and predicted LC-status). The ability of the optimal trees to predict the LC-status was then tested within the independent sample of 4,359 cases and controls. True positive (TP) and true negative (TN) rates are given.
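The performance measures named above can be computed as in the following sketch. It assumes one common definition of Somers' D for a binary outcome, (concordant pairs minus discordant pairs) divided by the number of case-control pairs, with tied predictions counting as neither; the exact variant used in the analysis is not stated in the text. The function names and the toy data are hypothetical.

```python
def somers_d(y_true, y_pred):
    """Somers' D of the prediction with respect to true status (1 = case,
    0 = control): (concordant - discordant) / number of case-control pairs."""
    concordant = discordant = pairs = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # only pairs with different true status count
            pairs += 1
            case, control = (i, j) if y_true[i] > y_true[j] else (j, i)
            if y_pred[case] > y_pred[control]:
                concordant += 1
            elif y_pred[case] < y_pred[control]:
                discordant += 1
    return (concordant - discordant) / pairs


def tp_tn_rates(y_true, y_label):
    """True positive and true negative rates for hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_label) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_label) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg


# Toy data: four cases followed by four controls, with predicted labels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_label = [1, 1, 1, 0, 0, 0, 1, 1]
print(tp_tn_rates(y_true, y_label), somers_d(y_true, y_label))
```

Under this definition, D for hard 0/1 predictions equals TP-rate + TN-rate - 1, so a tree that assigns nearly all individuals to the same class yields a D close to zero.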

LC-markers and
All statistical analyses were performed with SAS® 9.4, PLINK 1.90 [...]

The analysed sample consists of 14,068 LC-cases and 12,390 controls with a median age of 63. Sixty-three percent were male; 52% of cases and 28% of controls were current smokers. The most frequent histological subtype is adenocarcinoma (38%), followed by squamous cell carcinoma (SqCLC) (26%) and small cell lung cancer (SCLC) (10%). The proportion of never smokers was largest within the subgroup of adenocarcinoma cases (14%), but almost the same between younger (<55 years; 10%) and older (9%) cases. Details on smoking status and histological subtypes are presented in Table 1.

Table 1 Smoking by LC status and subgroups

Association analysis
We first performed association analysis for each Wnt/AhR-marker separately. [...] Table 2). When dividing the cases and controls according to their smoking behaviour (ever and never smokers), genome-wide significance (p ≤ 10⁻⁷) was achieved for 7 and 8 markers, respectively. Another 12 and 3 markers, respectively, were suggestively significant (10⁻⁷ < p ≤ 10⁻⁵) for ever and never smokers (see Additional file 1: S-Figure 1). The markers associated among ever smokers were mainly directly genotyped and are assigned to SFRP4 and DKK4. E.g. for marker rs2722278 we estimated an OR=1.20 (95%-CI: 1.13-1.27), yielding a p-value of 5.6 × 10⁻¹⁰. The markers associated among never smokers were mainly imputed and are mostly assigned to Axin2, but also to AHR, FRZB and DKK2. Marker rs17037102, assigned to DKK2, was the only one found associated with LC both by Bahl et al. and in this analysis (see Table 2 and Additional file 1: S-Table 3). Interestingly, the ORs of these markers estimated by model 3 (additionally adjusted for selected LC-markers) differ from those estimated by model 2: they are closer to one and no longer significant. E.g.
for rs1133683 (Axin2) we observe an OR=1.27 (95%-CI: 1.19-1.35, p=1 × 10⁻¹²) when fitting model 2, but an OR=0.95 (95%-CI: 0.86-1.06, p=0.3586) when fitting model 3.

Table 2 Significantly associated AhR/Wnt-markers within never and ever smokers

Logistic Regression - Predicting models with model selection
We further fitted logistic regression models with variable selection to evaluate the contribution of AhR/Wnt-markers to polygenic risk scores (PRS), but without postulating the usefulness of the scores as such. Eight LC-markers from only eight LC-genes (CYP2A6, CHRNA5, TERT, AMICA1, CHRNA3, COPS2, HCG4 and CHRNA2) were selected for the BIC-score (most stringent selection) to predict overall LC. Hence, the BIC-score and the BIC_LC-score are identical. In contrast, the AIC-score (for overall LC identical to the AICC-score) includes 20 LC-markers and, remarkably, 17 AhR/Wnt-markers, with LC-markers being more important than the AhR/Wnt-markers (importance ratio 0.56:0.34) (see Figure 2, Additional file 1: S-Figure 3 and S-Table 4). As expected, the ability to distinguish cases from controls based on susceptibility genes alone was poor for each of the scores (see Additional file 1: S-Table 5). In the training set, the performance of the AIC/AICC-score (AUC=0.607) significantly exceeded that of the BIC/BIC_LC-score (AUC=0.582; p<0.001). Within the test set (AUCs: 0.577 and 0.576) and the 2nd validation set (AUCs: 0.553 and 0.548), the higher complexity with additional AhR/Wnt-markers did not improve discriminability for overall LC (p=0.87 and p=0.35).

Similar score composition and performance were observed for most subgroups. The BIC-scores in the subgroups adenoLC (included markers LC:AhR/Wnt = 6:--), SCLC (3:--) and smokers (7:--) contained LC-markers only, whereas AhR/Wnt-markers were included even under this stringent variable selection in the subgroups SqCLC (5:1) and Young (2:2).
However, between 14 and 31 AhR/Wnt-markers entered these subgroups' AIC-scores. For these subgroups, the importance of the LC-markers for the AIC-score is higher than that of the included AhR/Wnt-markers.

[...] within the test set (see Additional file 1: S-Figure 4). For this subgroup, the selected AhR/Wnt-markers contribute to the AIC-score more than twice as much as the LC-markers (importance ratio 0.60:1.49). The precision-recall plot of Figure 3 indicates that a positive SCLC prediction based on the AIC-score can be trusted more than one based on LC-markers alone (BIC_LC-score). In the 2nd validation set the score-specific AUCs were similar but no longer significantly different (p=0.08; AUC_AIC=0.564 vs. AUC_BIC=0.531). The AIC-score of this SCLC subgroup is composed of 12 LC-markers (assigned to CHRNA5, HCG4, DNAJB4 (4x each), CYP2A6, CHRNA3, CHRNA2, AMICA1, KCNJ4, AS1, BRCA2, EGFL8 and WNK1 (2x each)) and 27 AhR/Wnt-markers (assigned to all AhR/Wnt-genes except DKK3). However, only one LC patient in the test set (n=434) and one in the 2nd validation set (n=164) were recognized as patients at a threshold of 50% case probability.

Decision trees
Finally, we generated decision trees, which allow for a complex interaction structure, to evaluate the contribution of AhR/Wnt-markers to LC prediction, but without postulating the usefulness of the trees as such. The decision tree for overall LC (whole sample) consists of only a single decision node (rs55781567 assigned to CHRNA5), achieving a Somers' concordance index D=0.0565 in the 2nd validation set (see Additional file 1: S-Table 6 and S-Figure 2). A single-node decision tree was also found optimal for young participants (split: rs1051730 assigned to CHRNA3), achieving a Somers' concordance index D=0.096. These two unsophisticated trees are characterised by balanced TP-rates (about 62%) and TN-rates (about 44%).
The decision trees for ever smokers, SCLC and SqCLC were more complex, achieving Somers' concordance indexes D of 0.007, -0.0005 and 0.0126, respectively. The trees for SCLC and SqCLC are characterised by an extreme TP-rate <5% and TN-rate >99%; the tree for ever smokers by a TP-rate >99% and TN-rate <5%. Remarkably, a marker assigned to CHRNA5 was always chosen as the first and most important split in the trees for ever smokers, SCLC and SqCLC. However, markers assigned to AhR/Wnt-genes (smokers: DKK2; SCLC: FRZB; SqCLC: DKK2 and DKK3) appear at lower-level decision nodes (Additional file 1: S-Figure 5, 6, 7 and 8). With the same program settings, no decision tree could be created for adenocarcinoma.

Most notable is the optimal decision tree for the 5,242 never smokers (75% LC-cases, 25% controls), the only one that does not contain a marker belonging to the CHRN (cholinergic receptor nicotinic subunit) gene group (see Figure 4). The tree is built from only two LC-markers but 7 AhR/Wnt-markers, achieving a Somers' concordance index D=-0.002. Three branches of this tree can be distinguished. Branch I covers two thirds of individuals (n=754, 66% of 1141 in the 2nd validation set): all of these are graded as "unaffected" based on only the two LC-markers, at the first decision node (rs885518 assigned to MTAP) and the second decision node (rs7705526 assigned to TERT, which links telomerase activity to Wnt signalling). For branch II an additional node (rs17214897 assigned to DKK2) is taken into account, covering a further tenth (9.9%) of never smokers. In this branch, very few subjects of the training set (1.7% within branch II, equivalent to 0.17% of all never smokers) are graded "affected". However, one in four individuals of the 2nd validation set belonging to branches I and II is truly "affected" but has not been detected (TP-rate=0%, TN-rate=100%).
A rating of "affected" occurs in the test set only in branch III, which covers the remaining quarter of never smokers (n=284 of the 2nd validation set). This third branch requires genotypes of several AhR/Wnt-markers assigned to AHR, Axin2, DKK2 and/or SFRP4. Here, one in three (n=97 of the 2nd validation set) is truly "affected" and has a chance of being correctly identified, which happens for 8 LC-cases (TP-rate=9%, TN-rate=88%). We also noted that the histological subtypes are equally distributed between the branches (see Additional file 1: S-Table 7).

[...] Multifactor dimensionality reduction (MDR) for overall LC, while two markers of SFRP4 were closely placed within smokers. In contrast, markers assigned to Axin2, but also to AHR, FRZB and DKK2, were observed as associated within never smokers. According to Bahl et al., markers of Axin2 and DKK2 were also closely placed by MDR in never smokers. The discrepancy between the association estimates for the total sample and for the subsamples points to smoking-mediated associations.

Our analysis agrees with both previous studies in that complex interaction patterns between the investigated genes contribute to LC susceptibility overall or within specific subgroups. To discover patterns of AhR/Wnt-genes involved in LC genesis, we further shifted the focus from significance of association to inclusion in prediction models, and followed two approaches. First, we searched for polygenic risk scores (PRS). In doing so, we add up marker main effects to construct multidimensional scores, optimising model fit (instead of preselecting markers by a p-value below some threshold), in order to discriminate cases from controls as well as possible. Complex gene x gene (GxG) interactions are not modelled.
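The additive construction described above (marker main effects summed into one score, with no GxG terms) can be sketched as follows. This is an illustration only: the marker names, weights and intercept are made up and are not taken from the fitted BIC- or AIC-scores, and the logistic link and the 50% threshold are used here simply because the text refers to logistic regression and to a case probability of 50%.

```python
import math


def prs(dosages, weights):
    """Additive score: sum over markers of weight * allele dosage (0, 1 or 2).
    Weights play the role of per-allele log odds ratios (hypothetical values)."""
    return sum(weights[m] * dosages[m] for m in weights)


def case_probability(score, intercept=0.0):
    """Logistic transform of intercept + score into a case probability."""
    return 1.0 / (1.0 + math.exp(-(intercept + score)))


def classify(score, intercept=0.0, threshold=0.5):
    """Label an individual 'affected' if the case probability exceeds the threshold."""
    return "affected" if case_probability(score, intercept) > threshold else "unaffected"


# Hypothetical per-allele weights and one individual's genotypes.
weights = {"snp_a": math.log(1.2), "snp_b": math.log(0.9)}
individual = {"snp_a": 2, "snp_b": 1}

score = prs(individual, weights)
print(round(score, 4), round(case_probability(score), 4), classify(score))
```

Because the score is a plain sum, each marker shifts the log odds by the same amount regardless of the genotypes at other markers, which is precisely why interaction patterns need a different tool such as the decision trees.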
Nevertheless, the proportion of AhR/Wnt-genes entering some of the predictive models was remarkably large, given that these markers, unlike all other candidates, are not genome-wide significantly associated with LC. This was particularly noticeable for SCLC, since AhR/Wnt-markers contribute more than twice as much to the prediction score as LC-markers. It is known that, among current smokers, tobacco consumption is most strongly associated with SCLC.
[42] Moreover, within never smokers, a stringently defined score is made up of only two AhR/Wnt-markers, assigned to