In this section, we describe how we developed the ES-MDR method to improve computational efficiency without sacrificing accuracy when analyzing SNP interactions (i.e., the joint effects of two SNPs) in association with age of disease onset.
Incorporating Martingale Residuals for Age-of-Onset Survival Analysis
ES-MDR improves on the efficiency of Surv-MDR by applying the QMDR algorithm to analyze age of disease onset in association with genetic interactions. Our novel ES-MDR approach combines survival analysis with QMDR's continuous-outcome analysis in two steps. In the first step, we replace event time and status with covariate-adjusted Martingale Residuals as a new continuous score. In the second step, we apply QMDR to efficiently categorize the genotype combinations into high-risk and low-risk groups. The best model is determined in the same way as in QMDR: the t-test statistic computed from a continuous attribute (here, the Martingale Residuals) selects the best interaction and overall model.
The novel algorithm for ES-MDR is performed as follows:
- Select K SNPs from all the SNPs in the data set and create a contingency table over all genotype combinations of the K SNPs.
- For each multi-locus genotype combination cell, sum the Martingale Residuals of the samples carrying that genotype combination.
- Label a cell “high-risk” if the sum of its Martingale Residuals is nonnegative; otherwise, label it “low-risk”.
- Pool all the high-risk labeled cells into one group and all the low-risk labeled cells into another group to create a new one-dimensional variable.
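The four steps above can be sketched as follows. This is an illustrative Python sketch, not the ES-MDR software; the function name, data layout, and the use of a Welch t-statistic as the continuous-attribute test are our assumptions:

```python
from collections import defaultdict
from math import sqrt

def es_mdr_score(genotypes, residuals):
    """Classify genotype-combination cells by summed Martingale residuals,
    pool them into high/low-risk groups, and return a Welch t-statistic
    comparing the two groups' residuals.

    genotypes : one K-SNP genotype tuple per sample
    residuals : one Martingale residual per sample
    """
    # Steps 1-2: sum residuals within each genotype-combination cell.
    cell_sums = defaultdict(float)
    for g, r in zip(genotypes, residuals):
        cell_sums[g] += r
    # Step 3: label each cell by the sign of its summed residuals.
    high = [r for g, r in zip(genotypes, residuals) if cell_sums[g] >= 0]
    low = [r for g, r in zip(genotypes, residuals) if cell_sums[g] < 0]
    # Step 4: pool the cells and compare the groups with a Welch t-statistic.
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    mh, vh = mean_var(high)
    ml, vl = mean_var(low)
    return (mh - ml) / sqrt(vh / len(high) + vl / len(low))
```

For example, two cells whose residuals sum to 0.8 and -0.6 would be pooled into the high- and low-risk groups, respectively, before the t-statistic is computed.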
Using Martingale Residuals to determine high- and low-risk groups for survival data analysis is comparable to using the log-rank test statistic in Surv-MDR, but it classifies genotype combinations more efficiently. It can be shown that the sum of the Martingale Residuals is a good surrogate for the log-rank test statistic for the purpose of determining high/low-risk groups for each genotype combination. Next, we compare the similarities of the equations for the Martingale Residuals and the log-rank test statistic. The sign and magnitude of the Martingale Residuals depend on the association of the SNPs with the hazard function in the following equation:

Mi(t) = δi(t) - H0(t)exp(βxi + γyi)
In this equation, δi(t) denotes the number of observed events occurring by survival time t, and the expected number of events is calculated using the Cox proportional hazards model with x as the genetic factor and y as the adjusted covariate. The log-rank test statistic is defined as follows:

Z = Σj (Oj - Ej) / sqrt(Σj Vj)

where Oj and Ej are the observed and expected numbers of events at the jth distinct event time and Vj is the corresponding variance.
Here, we show that the Martingale Residual is equivalent to the numerator of the log-rank test statistic; hence, the sum of the Martingale Residuals equals the log-rank test statistic when the variance is set to 1. This implies that using the Martingale Residuals as a substitute for the log-rank test statistic to evaluate genotype combinations associated with survival outcomes yields the same data-reduction and categorization process as Surv-MDR.
Evaluation through Simulations
The purpose of our simulation study is to evaluate how well ES-MDR performs, both on its own and compared with Surv-MDR. To demonstrate the strength of the proposed method, we designed two simulations: one to evaluate the null distribution of the testing score for type I error, and one to analyze power.
Simulation I
The first simulation study was designed to estimate the 5% type I error threshold by evaluating an empirical null distribution with independent, non-interacting SNPs and quantitative outcome values. We created sets of SNPs (m = 10, 20, 50) with additive coding and sample sizes n = {200, 400, 800, 1600}. For every combination of m and n, we simulated m SNPs with minor allele frequencies (MAF) drawn from the uniform distribution U(0.1, 0.5) and n continuous outcomes from a standard normal distribution. The SNP and continuous outcome data were created independently to ensure that there were no associations between the SNPs and the outcome. These steps were repeated to create 1,000 null data sets for each of 24 different groups varied by the number of SNPs, sample size, and MAF, for a total of 24,000 data sets. Simulations were conducted in R 3.0.0 (R Foundation for Statistical Computing, Vienna, Austria). To determine whether the type I error rate is close to 5%, we computed the percentage of times that ES-MDR randomly identified two interacting SNPs in a null data set.
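A minimal sketch of generating one null data set, shown here in Python rather than R and with illustrative names, might look like the following (genotypes are drawn as two independent Bernoulli allele draws per sample, an assumption corresponding to Hardy-Weinberg equilibrium):

```python
import random

def simulate_null_dataset(n, m, seed=None):
    """One null data set: m independent SNPs (additive coding 0/1/2, with
    MAF drawn from U(0.1, 0.5)) and n standard-normal outcomes, generated
    independently of the genotypes so no association exists."""
    rng = random.Random(seed)
    mafs = [rng.uniform(0.1, 0.5) for _ in range(m)]
    snps = []
    for maf in mafs:
        # Additive coding: count of minor alleles (0, 1, or 2) per sample.
        snps.append([(rng.random() < maf) + (rng.random() < maf)
                     for _ in range(n)])
    outcomes = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return snps, outcomes
```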
Simulation II
The second simulation study was created to evaluate the power of ES-MDR with data sets that included quantitative outcome variables, a pair of functional interacting SNPs, and 18 non-interacting SNPs. Surv-MDR was also run on the same data to evaluate whether ES-MDR is as effective in identifying the functional SNPs.
The simulation data sets used different penetrance functions that describe the probabilistic relationship between the quantitative outcome variable and the functional SNPs, generated with additive coding. We considered two MAFs (0.2 and 0.4) and seven broad-sense heritabilities (0.01, 0.02, 0.05, 0.1, 0.2, 0.3, and 0.4), for a total of 14 unique model combinations, with the two functional SNPs associated with the outcome evenly distributed across the seven heritabilities. To create a purely epistatic model, each of the 14 unique models had functional SNPs (MAF 0.2 or 0.4) with no main effects. The 14 MAF-heritability combinations were each replicated five times to generate 70 models with varying sample sizes, n = {400, 800, 1600}.
Assume that SNP1 and SNP2 are the two functional SNPs. Let fij be an element from the ith row and jth column of a penetrance function. We generated the binary variable from a Bernoulli distribution with the following:
P (high risk|SNP1 = i, SNP2 = j) = fij
We randomly selected 200 high-risk subjects and 200 low-risk subjects from each of the 70 probabilistic models to create one simulated data set. We repeated this simulation 100 times to obtain a total of 7,000 data sets.
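The Bernoulli sampling above can be sketched as follows; this is an illustrative helper with names of our own choosing, not the actual simulation code:

```python
import random

def sample_risk_labels(penetrance, genotypes, seed=None):
    """Draw Bernoulli high-risk labels with P(high risk | SNP1=i, SNP2=j) = f_ij.

    penetrance : 3x3 nested list, penetrance[i][j] = f_ij for genotypes (i, j)
    genotypes  : list of (i, j) genotype pairs for SNP1 and SNP2
    Returns 1 for high-risk, 0 for low-risk.
    """
    rng = random.Random(seed)
    return [int(rng.random() < penetrance[i][j]) for i, j in genotypes]
```

The 200 high-risk and 200 low-risk subjects would then be sampled from the labels this step produces.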
To generate the survival time, we used the Cox-proportional hazards (Cox ph) model:
h(t|x) = h0(t)exp(βx)
In this equation, h0(t) is the baseline hazard function, specified as a Weibull distribution with shape parameter 5 and scale parameter 2. The x is the genetic factor, fixed at 1 for high-risk subjects and 0 for low-risk subjects, and β represents the effect size, that is, the log hazard ratio for a one-unit increase in x (all other covariates held constant). Censoring status was sampled from a Bernoulli distribution with censoring probability 0.4, resulting in 40% censoring. Finally, we merged the survival time and censoring status with the SNP data.
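Survival times under this model can be drawn by inverse-transform sampling, since the Weibull baseline gives the cumulative hazard H0(t) = (t/scale)^shape. The sketch below is illustrative only; the function name is ours, and the effect size β passed in by the caller is an assumption, not a value specified in the study:

```python
import math
import random

def weibull_cox_time(x, beta, shape=5.0, scale=2.0, rng=random):
    """Draw a survival time under h(t|x) = h0(t) * exp(beta * x) with a
    Weibull baseline via inverse-transform sampling: the baseline cumulative
    hazard is H0(t) = (t / scale) ** shape, so
    T = scale * (-log(U) / exp(beta * x)) ** (1 / shape), U ~ Uniform(0, 1].
    """
    u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    return scale * (-math.log(u) / math.exp(beta * x)) ** (1.0 / shape)
```

With beta > 0, the high-risk group (x = 1) has a larger hazard and hence stochastically shorter survival times than the low-risk group (x = 0), as the power simulation requires.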
We used Martingale Residuals in our novel ES-MDR method to classify each multi-locus genotype combination into high-risk and low-risk groups. The Martingale Residual is the stochastic component of the counting process, written in residual form as follows:

M(t|x) = δ(t) - H0(t)exp(βx)

where H0(t) is the cumulative baseline hazard.
Here, δ(t) denotes the observed event indicator at survival time t. Assuming a null model with no covariate effects (β = 0), this residual is the difference between the observed and the expected number of events. The sign and magnitude of the Martingale Residuals depend on the association of the SNPs with the hazard rate function. Each individual genotype with a nonnegative Martingale Residual (i.e., greater than or equal to 0) is classified as high-risk; otherwise, a negative Martingale Residual is classified as low-risk. For every multi-locus genotype combination of SNPs, we computed the sum of the Martingale Residuals to obtain a new variable used to classify the combination into the high-risk or low-risk group.
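Under the null model (β = 0), the residual for each subject can be computed with the Nelson-Aalen estimate of the cumulative baseline hazard. The following is an illustrative sketch under that assumption, not the ES-MDR implementation:

```python
def null_martingale_residuals(times, events):
    """Martingale residuals under the null model (beta = 0):
    M_i = delta_i - H0_hat(t_i), where H0_hat is the Nelson-Aalen
    estimate of the cumulative baseline hazard.

    times  : observed times (event or censoring) per subject
    events : 1 if the event was observed at times[i], else 0 (censored)
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    cum_hazard, at_risk = 0.0, n
    H = [0.0] * n
    k = 0
    while k < n:
        # Subjects sharing the same time get the same cumulative hazard.
        t = times[order[k]]
        tied = [i for i in order[k:] if times[i] == t]
        d = sum(events[i] for i in tied)
        cum_hazard += d / at_risk  # Nelson-Aalen increment d_j / r_j
        for i in tied:
            H[i] = cum_hazard
        at_risk -= len(tied)
        k += len(tied)
    return [events[i] - H[i] for i in range(n)]
```

A useful sanity check is that these residuals sum to zero over the whole sample, so positive and negative residuals necessarily balance across genotype cells.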
To estimate the power of the proposed method, we ran ES-MDR on each of the 7,000 data sets and searched for the best model over all possible one-way (i.e., single-locus), two-way (i.e., two interacting loci), and three-way (i.e., three interacting loci) interaction models, using the t-statistic testing score. We used the 95th percentile of the testing score from the null models as a threshold to guard against non-significant findings, corresponding to a significance level of 0.05. Power was estimated as the percentage of times ES-MDR correctly included the two functional interacting SNPs in the best model across each set of 7,000 data sets. For comparison, we ran Surv-MDR on the simulated data to estimate its power. Training and testing scores for ES-MDR were computed using two-fold cross-validation, and the best model was selected as the one with the smallest prediction error and the largest consistency in including the two functional interacting SNPs.
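The power estimate itself reduces to a simple proportion over the selected best models; an illustrative sketch with names of our own:

```python
def estimate_power(best_models, functional_pair):
    """Power = fraction of simulated data sets whose selected best model
    contains both functional SNPs (order and extra loci do not matter)."""
    target = set(functional_pair)
    hits = sum(1 for model in best_models if target <= set(model))
    return hits / len(best_models)
```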
OncoArray-TRICL Genotyping and Quality Control
A total of 533,631 SNPs from 57,775 individuals in the OncoArray-TRICL Consortium, selected from 29 studies across North America, Europe, and Asia, were genotyped using the Illumina OncoArray-500K BeadChip platform, which includes a genome-wide backbone and selected loci known to be associated with cancer phenotypes. To facilitate efficient genotyping and minimize variability that might arise from genotyping at multiple sites, genotyping was conducted at five institutions: the Center for Inherited Disease Research, the Beijing Genome Institute, the Helmholtz Zentrum München, Copenhagen University Hospital, and the University of Cambridge. Quality control steps described previously were followed for this OncoArray-TRICL data set.16 We excluded participants who lacked lung cancer status (because they were not part of the lung cancer studies), smoking status, or age and gender at diagnosis; close relatives (second-degree or closer); duplicate individuals; participants of non-European ancestry; participants with low-quality extracted DNA or low genotype call rates; and participants who did not pass quality control measures. A total of 14,935 lung cancer cases and 12,787 controls remained in the current study. We kept SNP filtering to a minimum to retain more SNPs for analysis, including SNPs with MAF greater than 0.01% and a genotyping rate of at least 50%.
OncoArray-TRICL Data Analysis
We applied ES-MDR to the OncoArray-TRICL Consortium study to identify genetic interactions associated with lung cancer age of onset. The OncoArray-TRICL Consortium is a collaboration among world leaders in the field to investigate common causes of cancer susceptibility and progression.16 Lung cancer cases and controls were genotyped using the OncoArray genotyping array, which tags cancer trait and susceptibility loci in addition to the GWAS backbone and consists of approximately 533,000 tagged SNPs. We identified 27,722 participants of European ancestry, ages 15-96 years: 14,935 lung cancer cases and 12,787 healthy controls. All participants provided informed consent, and each study site obtained approval from its ethics committee. In this analysis, the survival outcome data, consisting of lung cancer age of onset with cases treated as events at diagnosis age, controls censored at interview age, and smoking status as a covariate, were substituted by Martingale Residuals. We randomly sampled 2/3 of the data into a training set and the remaining 1/3 into a testing set. We applied our novel ES-MDR method to perform an exhaustive one-way and two-way model search. As a pre-filtering step, we used PLINK to identify uncorrelated, independent SNPs; SNPs in linkage disequilibrium were removed using a stringent correlation threshold of 0.1, leaving 108,254 SNPs. We searched over all one-way and two-way interactions in the training set to identify models consistently selected with the largest training score determined by two-fold cross-validation and analyzed the prediction error of the top 10 models in the testing set. In our real data study, we also considered joint detection of two SNPs with main effects to be successful detection of a functional interaction model.
To build a predictive model that combines the strengths of both one-way and two-way models, we took all the SNPs involved in the top 1,000 one-way models and all the SNPs from the top 1,000 two-way interaction models and used a penalized Cox regression method to filter and select the best predictive models for evaluating genetic factors associated with age of lung cancer onset. We ranked the test scores from highest to lowest and picked the top SNPs that best predict lung cancer onset.
To construct predictive models linking SNPs to censored survival data, we used Lasso penalized estimation for the Cox regression model to select the top SNPs relevant to patients’ age of lung cancer onset, yielding a prediction model with a parsimonious set of SNPs and good prediction accuracy.17 The Lasso procedure is a popular method for variable selection when the number of samples is significantly smaller than the number of predictor variables.18 Briefly, Lasso is similar to the forward stepwise method in that it performs variable selection, and it additionally shrinks coefficients, driving nonsignificant coefficients in a regression model to zero.18 Therefore, Lasso is a valuable tool for filtering out SNPs that are not associated with the outcome or are highly correlated, especially when the sample size is small relative to the number of SNP predictors.
Survival plots were generated using the Kaplan-Meier method to visualize differences in age of lung cancer onset between high-risk and low-risk groups defined by the top identified SNPs associated with lung cancer risk. To adjust for additional factors related to patient survival, smoking status was included as a covariate in the Cox ph regression model.
To assess the performance of our model in predicting lung cancer onset at different age intervals, we applied the time-dependent receiver operating characteristic (ROC) curve and area under the curve (AUC), introduced by Heagerty et al. (2000),19 to evaluate the predictive performance of the best models. For a given score function f(X), the time-dependent sensitivity and specificity are defined as follows:
sensitivity(c, t|f(X)) = Pr{f(X) > c|δ(t) = 1},
specificity(c, t|f(X)) = Pr{f(X) ≤ c|δ(t) = 0}.
We defined the corresponding ROC(t|f(X)) curve for any time t as the plot of sensitivity(c, t|f(X)) versus 1 - specificity(c, t|f(X)) with the cut-off point c varying. The AUC is the area under the ROC(t|f(X)) curve, denoted AUC(t|f(X)).17 Here, δ(t) is the event indicator at time t. In this study, a larger AUC at time t based on the score function f(X) indicates better predictability of time-to-event at time t, as measured by sensitivity and specificity evaluated at time t.
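An empirical version of AUC(t) can be sketched as follows. This simple illustration is ours, not Heagerty et al.'s estimator: it drops subjects censored before t rather than reweighting them, and estimates AUC(t) as the Mann-Whitney concordance probability between cases and controls at time t:

```python
def time_dependent_auc(scores, times, events, t):
    """Empirical AUC(t) for a risk score f(X): cases are subjects with an
    observed event by time t; controls are subjects still event-free at t
    (subjects censored before t are dropped in this simple sketch).
    Estimated as P(score_case > score_control), counting ties as 1/2."""
    cases = [s for s, u, e in zip(scores, times, events) if e == 1 and u <= t]
    controls = [s for s, u, e in zip(scores, times, events) if u > t]
    pairs = concordant = 0.0
    for c in cases:
        for k in controls:
            pairs += 1
            if c > k:
                concordant += 1
            elif c == k:
                concordant += 0.5
    return concordant / pairs if pairs else float("nan")
```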