Materials
The imputed genotypes (reference panel: HRC r1.1) from 46,783 individuals from European population, with lung cancer phenotype, smoking (ever- vs never-smokers) and sex information, in the INTEGRAL (Integrative Analysis of Lung Cancer Etiology and Risk)-ILCCO (International Lung Cancer Consortium) lung cancer consortium were analyzed in this study19. The subjects came from 9 independent studies: the OncoArray Consortium Lung Study (OncoArray), including 16,845 lung cancer cases and 13,057 controls, was used as the discovery dataset20–21. Individuals from another 8 smaller independent studies: Affymetrix Axiom Array Study (AFFY), the Genetic Epidemiology of Lung Cancer (GELCC) Consortium, the Environment and Genetics in Lung cancer Etiology study (EAGLE), the International Agency for Research on Cancer (IARC), MD Anderson Cancer Center Study (MDACC), NCI Lung Cancer and Smoking Phenotypes in African-American Cases and Controls (NCI), the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO), and Samuel Lunenfeld Research Institute Study (SLRI), were combined together and used as replication dataset including 7,378 lung cancer cases and 9,503 controls (Table 1, Supplementary Table S1)22–25. There were 12,084 women and 17,818 men in discovery dataset; and 6,366 women and 10,515 men in replication dataset. 60.4% of the lung cancer patients were male in the discovery study compared with 64.0% in replication study. 81.6% of individuals in both discovery and validation studies were ever-smokers. Stringent quality controls were applied on SNPs and SNPs selected for the analysis had to qualify by two criteria: 1, had imputation information score > = 0.7 in discovery data; 2, had information score > 0.2 in each of the eight studies and sample-size weighted information score > = 0.7 in replication data. About ~ 20,000,000, out of 39,000,000 imputed SNPs, with information score > = 0.7 were analyzed in discovery study; and ~ 10,000,000 SNPs were analyzed in replication study (Fig. 1). 193,050 markers, common to the 9 studies and with linkage disequilibrium r2 value less than or equal to 0.5, were selected to calculate principal components in PLINK. Detailed information about data collection, genotype imputation and quality control procedures can be found from our earlier publication19.
Table 1
Characteristics of participants that were studied during the Discovery and Replication phases.
|
Discovery1
|
Replication2
|
|
Male (%)
|
Female (%)
|
Total
|
Male (%)
|
Female (%)
|
Total
|
Controls
|
7646 (58.6)
|
5411 (41.4)
|
13057
|
5796 (61.0)
|
3707 (39.0)
|
9503
|
Cases
|
10172 (60.4)
|
6673 (39.6)
|
16845
|
4719 (64.0)
|
2659 (36.0)
|
7378
|
Ever-smokers
|
15584 (63.9)
|
8813 (36.1)
|
24397
|
9230 (67.0)
|
4553 (33.0)
|
13783
|
Never-smokers
|
2234 (40.6)
|
3271 (59.4)
|
5505
|
1285 (41.5)
|
1813 (58.5)
|
3098
|
Samples with European ancestry in INTEGRAL-ILCCO consortium were analyzed in the study. Number of overall lung cancer cases was provided in the table. 1, genotype data from Oncoarray study was used in Discovery study; 2, genotype from the other 8 studies were used in replication study. |
Statistical Methods For Gwgi
Following the two-step analysis strategy, case-only analysis was first performed between SNP dosage (additive model) and sex (male and female were coded as 1 and 2, respectively) phenotype using lung cancer patients from discovery study (N = 16,845) and replication study (N = 7,378) (Case-only model, S denotes smoking status). Fixed-effect meta-analysis was conducted to combine information from both studies. Variants selected for further case-control analysis had to satisfy all three criteria: 1, with case-only joint p-value < 5x10− 8; 2, case-only p < 0.1 in replication study; and 3, same direction of effect in both studies. All the samples in discovery and replication study including 24,223 lung cancer patients and 22,560 controls were applied in case-control analysis (Case-control model, D denotes disease status). The SNPs with case-control gene-sex interaction p-value < 0.05 and same direction of interaction effect in both discovery and replication datasets were reported as significant findings. For the significant variants with low allele frequency (MAF < 0.05), we further validated the signals using firth logistic regression, a method designed for rare variants association test to reduce small-sample bias in regular logistic regression (Fig. 1)26. The statistical analysis was adjusted for smoking status (ever- and never-smokers) and the first 5 principal components. The analysis was conducted in overall lung cancer as well as adenocarcinoma (No. Cases = 9,791) and squamous (No. Cases = 6,107) lung cancer.
Eqtl Analysis
Genotype dosage and gene expression rpkm (Reads Per Kilobase Million) data from lung tissue were downloaded from GTEx (phs000424.GTEx.v7.p2). There were 377 individuals with European ancestry available for the analysis, including 244 men and 133 women. Average rpkm for the gene was used if there were duplicated samples. Individuals with rpkm < 0.25 were removed from the analysis. For imputed variants, we applied a best guess algorithm to assign most likely genotypes to each individual to transform the dosage into genotype. For variants with minor allele frequency (MAF) >= 0.1, we adopted an additive model: dosage values <=0.2 were coded as 0; dosage between 0.4-0.6 were coded as 1; and dosage >=0.8 was coded as 1. For variants with MAF < 0.1, we adopted a dominant model: dosage values <=0.4 were coded as 0 (non-carrier group with no risk alleles) and dosage >=0.6 was coded as 1 (carrier group with at least 1 risk allele). For each candidate SNP, the boxplot of log2(rpkm) across different genotype were displayed for all the individuals, men and women groups. Student’s test was performed to compare the mean log2(rpkm) across different genotype groups and general linear regression was conducted to test SNP-sex interaction in gene expression prediction.
Functional annotation analysis
Two public functional annotation tools, CADD and RegulomeDB were applied for functional inference. CADD was designed to predict functional variants by integrating diverse information from wide range of function categories16. It computed a score inferring the functional ranking of the variants which is helpful for fine mapping. RegulomeDB is a tool designed to predict regulatory DNA elements in the Human genome based on information from GEO, ENCODE and public literature17. It adopted SURF (Score of Unified Regulatory Features) model, trained on data from massively parallel reporter assays, to predict the functional variants in enhancer and promoter elements. RegulomeDB provides a probability score for the variant being transcription factor binding site (TFBS), promotor or DNase hypersensitivity, etc. PROMO, an online tool to perform computational-based searches for gene expression regulatory sequence motif based on TFBS database, was used to search for the functional motif with sequence containing the candidate variants as input18.
Table 3
Functional annotation of candidate SNPs from significant regions.
SNP
|
rsID
|
Case-only
|
CADD
|
RegulomeDB
|
POS
|
SNP
|
P
|
PHRED1
|
Function Annotation
|
Probability2
|
17:49639139:G:A
|
rs17662871
|
4.29x10− 8
|
2.534
|
TF binding or DNase peak
|
0.49
|
20:15642704:G:A
|
rs76314075
|
1.33x10− 7
|
1.445
|
Other
|
0.18
|
20:15644218:T:C
|
rs79942605
|
2.81x10− 8
|
8.228
|
TF binding or DNase peak
|
0.04
|
21:18776825:C:T
|
rs208900
|
5.87x10− 6
|
0.668
|
TF binding or DNase peak
|
0.13
|
21:18779654:G:A
|
rs184089
|
5.79x10− 6
|
1.066
|
Motif hit
|
0.48
|
21:18783833:G:A
|
rs11088636
|
4.24x10− 6
|
0.039
|
Other
|
0.18
|
21:18784296:T:C
|
rs423598
|
4.78x10− 6
|
2.243
|
TF binding or DNase peak
|
0.13
|
21:18785818:G:A
|
rs208908
|
4.54x10− 8
|
5.773
|
TF binding or DNase peak
|
0.92
|
21:18787462:T:C
|
rs208914
|
5.01x10− 6
|
2.944
|
TF binding + DNase peak
|
0.61
|
21:18787572:T:C
|
rs9637031
|
1.34x10− 7
|
4.018
|
TF binding + DNase peak
|
0.61
|
21:18787948:A:G
|
rs6517771
|
1.76x10− 6
|
5.095
|
Motif hit
|
0.34
|
21:18788445:A:C
|
rs1389157
|
1.30x10− 6
|
6.257
|
TF binding or DNase peak
|
0.03
|
SNPs with case-only joint p-value < 1x10− 5 were selected for annotation analysis. eQTL displayed the information from GTEx Portal. The annotation information for rs208908 was bolded. 1, scaled CADD score by expressing the rank in order of magnitude terms; 2, RegulomeDB probability score is ranging from 0 to 1, with 1 being most likely to be a regulatory variant. |
|