In this section, we describe how we developed the ES-MDR method to improve computational efficiency without sacrificing accuracy when analyzing SNP interactions (i.e., the joint effects of two SNPs) in association with age of disease onset.
Incorporating Martingale Residuals for Age-of-Onset Survival Analysis
ES-MDR improves on the efficiency of Surv-MDR by applying the QMDR algorithm to analyze age of disease onset in association with genetic interactions. Our novel ES-MDR approach combines survival analysis with QMDR's continuous-outcome analysis in two steps. In the first step, we replace event time and status with covariate-adjusted Martingale Residuals as a new continuous score. In the second step, we apply QMDR to efficiently categorize the genotype combinations into high-risk and low-risk groups. The best model is determined in the same way as in QMDR: the t-test statistic computed from a continuous attribute (here, the Martingale Residuals) selects the best interaction and overall model.
The novel algorithm for ES-MDR is performed as follows:
- Select K SNPs from all the SNPs in the data set and create a contingency table over all genotype combinations of the K SNPs.
- For each multi-locus genotype combination cell, sum the Martingale Residuals of the samples carrying that genotype combination.
- Label a cell “high-risk” if the sum of its Martingale Residuals is nonnegative; otherwise, label it “low-risk”.
- Pool all the high-risk labeled cells into one group and all the low-risk labeled cells into another group to create a new one-dimensional variable.
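The four steps above can be sketched as follows. This is an illustrative Python sketch, not the ES-MDR software; the function name, data layout, and the use of a Welch t-statistic as the continuous-attribute test are our assumptions:

```python
from collections import defaultdict
from math import sqrt

def es_mdr_score(genotypes, residuals):
    """Classify genotype-combination cells by summed Martingale residuals,
    pool them into high/low-risk groups, and return a Welch t-statistic
    comparing the two groups' residuals.

    genotypes : one K-SNP genotype tuple per sample
    residuals : one Martingale residual per sample
    """
    # Steps 1-2: sum residuals within each genotype-combination cell.
    cell_sums = defaultdict(float)
    for g, r in zip(genotypes, residuals):
        cell_sums[g] += r
    # Step 3: label each cell by the sign of its summed residuals.
    high = [r for g, r in zip(genotypes, residuals) if cell_sums[g] >= 0]
    low = [r for g, r in zip(genotypes, residuals) if cell_sums[g] < 0]
    # Step 4: pool the cells and compare the groups with a Welch t-statistic.
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    mh, vh = mean_var(high)
    ml, vl = mean_var(low)
    return (mh - ml) / sqrt(vh / len(high) + vl / len(low))
```

For example, two cells whose residuals sum to 0.8 and -0.6 would be pooled into the high- and low-risk groups, respectively, before the t-statistic is computed.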
Using Martingale Residuals to determine high- and low-risk groups for survival data analysis is comparable to using the log-rank test statistic in Surv-MDR, but it classifies genotype combinations more efficiently. It can be shown that the sum of the Martingale Residuals is a good surrogate for the log-rank test statistic for the purpose of determining high/low-risk groups for each genotype combination. Next, we compare the similarities of the equations for the Martingale Residuals and the log-rank test statistic. The sign and magnitude of the Martingale Residuals depend on the association of the SNPs with the hazard function in the following equation:

Mi(t) = δi(t) - H0(t)exp(βxi + γyi)
In this equation, δi(t) denotes the number of observed events occurring by survival time t, and the expected number of events is calculated using the Cox proportional hazards model with x as the genetic factor and y as the adjusted covariate. The log-rank test statistic is defined as follows:

Z = Σj (Oj - Ej) / sqrt(Σj Vj)

where Oj and Ej are the observed and expected numbers of events at the jth distinct event time and Vj is the corresponding variance.
Here, we show that the Martingale Residual is equivalent to the numerator of the log-rank test statistic; hence, the sum of the Martingale Residuals equals the log-rank test statistic when the variance is set to 1. This implies that using the Martingale Residuals as a substitute for the log-rank test statistic to evaluate genotype combinations associated with survival outcomes yields the same data-reduction and categorization process as Surv-MDR.
Evaluation through Simulations
The purpose of our simulation study is to evaluate how well ES-MDR performs, both on its own and compared with Surv-MDR. To demonstrate the strength of the proposed method, we designed two simulations: one to evaluate the null distribution of the testing score for type I error, and one to analyze power.
Simulation I
The first simulation study was designed to estimate the 5% type I error threshold by evaluating an empirical null distribution with independent, non-interacting SNPs and quantitative outcome values. We created sets of SNPs (m = 10, 20, 50) with additive coding and sample sizes n = {200, 400, 800, 1600}. For every combination of m and n, we simulated m SNPs with minor allele frequencies (MAF) drawn from the uniform distribution U(0.1, 0.5) and n continuous outcomes from a standard normal distribution. The SNP and continuous outcome data were created independently to ensure that there were no associations between the SNPs and the outcome. These steps were repeated to create 1,000 null data sets for each of 24 different groups varied by the number of SNPs, sample size, and MAF, for a total of 24,000 data sets. Simulations were conducted in R 3.0.0 (R Foundation for Statistical Computing, Vienna, Austria). To determine whether the type I error rate is close to 5%, we computed the percentage of times that ES-MDR randomly identified two interacting SNPs in a null data set.
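A minimal sketch of generating one null data set, shown here in Python rather than R and with illustrative names, might look like the following (genotypes are drawn as two independent Bernoulli allele draws per sample, an assumption corresponding to Hardy-Weinberg equilibrium):

```python
import random

def simulate_null_dataset(n, m, seed=None):
    """One null data set: m independent SNPs (additive coding 0/1/2, with
    MAF drawn from U(0.1, 0.5)) and n standard-normal outcomes, generated
    independently of the genotypes so no association exists."""
    rng = random.Random(seed)
    mafs = [rng.uniform(0.1, 0.5) for _ in range(m)]
    snps = []
    for maf in mafs:
        # Additive coding: count of minor alleles (0, 1, or 2) per sample.
        snps.append([(rng.random() < maf) + (rng.random() < maf)
                     for _ in range(n)])
    outcomes = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return snps, outcomes
```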
Simulation II
The second simulation study was created to evaluate the power of ES-MDR with data sets that included quantitative outcome variables, a pair of functional interacting SNPs, and 18 non-interacting SNPs. Surv-MDR was also run on the same data to evaluate whether ES-MDR is as effective in identifying the functional SNPs.
The simulation data sets used different penetrance functions that describe the probabilistic relationship between the quantitative outcome variable and the functional SNPs, generated with additive coding. We considered two MAFs (0.2 and 0.4) and seven broad-sense heritabilities (0.01, 0.02, 0.05, 0.1, 0.2, 0.3, and 0.4), for a total of 14 unique model combinations, with the two functional SNPs associated with the outcome evenly distributed across the seven heritabilities. To create a purely epistatic model, each of the 14 unique models had functional SNPs (MAF 0.2 or 0.4) with no main effects. The 14 MAF-heritability combinations were each replicated five times to generate 70 models with varying sample sizes, n = {400, 800, 1600}.
Assume that SNP1 and SNP2 are the two functional SNPs. Let fij be an element from the ith row and jth column of a penetrance function. We generated the binary variable from a Bernoulli distribution with the following:
P (high risk|SNP1 = i, SNP2 = j) = fij
We randomly selected 200 high-risk subjects and 200 low-risk subjects from each of the 70 probabilistic models to create one simulated data set. We repeated this simulation 100 times to obtain a total of 7,000 data sets.
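The Bernoulli sampling above can be sketched as follows; this is an illustrative helper with names of our own choosing, not the actual simulation code:

```python
import random

def sample_risk_labels(penetrance, genotypes, seed=None):
    """Draw Bernoulli high-risk labels with P(high risk | SNP1=i, SNP2=j) = f_ij.

    penetrance : 3x3 nested list, penetrance[i][j] = f_ij for genotypes (i, j)
    genotypes  : list of (i, j) genotype pairs for SNP1 and SNP2
    Returns 1 for high-risk, 0 for low-risk.
    """
    rng = random.Random(seed)
    return [int(rng.random() < penetrance[i][j]) for i, j in genotypes]
```

The 200 high-risk and 200 low-risk subjects would then be sampled from the labels this step produces.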
To generate the survival time, we used the Cox-proportional hazards (Cox ph) model:
h(t|x) = h0(t)exp(βx)
In this equation, h0(t) is the baseline hazard function, specified as a Weibull distribution with shape parameter 5 and scale parameter 2. The x is the genetic factor, fixed at 1 for high-risk subjects and 0 for low-risk subjects, and β represents the effect size, that is, the log hazard ratio for a one-unit increase in x (all other covariates held constant). Censoring status was sampled from a Bernoulli distribution with censoring probability 0.4, resulting in 40% censoring. Finally, we merged the survival time and censoring status with the SNP data.
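Survival times under this model can be drawn by inverse-transform sampling, since the Weibull baseline gives the cumulative hazard H0(t) = (t/scale)^shape. The sketch below is illustrative only; the function name is ours, and the effect size β passed in by the caller is an assumption, not a value specified in the study:

```python
import math
import random

def weibull_cox_time(x, beta, shape=5.0, scale=2.0, rng=random):
    """Draw a survival time under h(t|x) = h0(t) * exp(beta * x) with a
    Weibull baseline via inverse-transform sampling: the baseline cumulative
    hazard is H0(t) = (t / scale) ** shape, so
    T = scale * (-log(U) / exp(beta * x)) ** (1 / shape), U ~ Uniform(0, 1].
    """
    u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
    return scale * (-math.log(u) / math.exp(beta * x)) ** (1.0 / shape)
```

With beta > 0, the high-risk group (x = 1) has a larger hazard and hence stochastically shorter survival times than the low-risk group (x = 0), as the power simulation requires.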
We used Martingale Residuals in our novel ES-MDR method to classify each multi-locus genotype combination into high-risk and low-risk groups. The Martingale Residual is the stochastic component of the counting process, written in residual form as follows:

M(t|x) = δ(t) - H0(t)exp(βx)

where H0(t) is the cumulative baseline hazard.
Here, δ(t) denotes the observed event indicator at survival time t. Assuming a null model with no covariate effects (β = 0), this residual is the difference between the observed and the expected number of events. The sign and magnitude of the Martingale Residuals depend on the association of the SNPs with the hazard rate function. Each individual genotype with a nonnegative Martingale Residual (i.e., greater than or equal to 0) is classified as high-risk; otherwise, a negative Martingale Residual is classified as low-risk. For every multi-locus genotype combination of SNPs, we computed the sum of the Martingale Residuals to obtain a new variable used to classify the combination into the high-risk or low-risk group.
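Under the null model (β = 0), the residual for each subject can be computed with the Nelson-Aalen estimate of the cumulative baseline hazard. The following is an illustrative sketch under that assumption, not the ES-MDR implementation:

```python
def null_martingale_residuals(times, events):
    """Martingale residuals under the null model (beta = 0):
    M_i = delta_i - H0_hat(t_i), where H0_hat is the Nelson-Aalen
    estimate of the cumulative baseline hazard.

    times  : observed times (event or censoring) per subject
    events : 1 if the event was observed at times[i], else 0 (censored)
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    cum_hazard, at_risk = 0.0, n
    H = [0.0] * n
    k = 0
    while k < n:
        # Subjects sharing the same time get the same cumulative hazard.
        t = times[order[k]]
        tied = [i for i in order[k:] if times[i] == t]
        d = sum(events[i] for i in tied)
        cum_hazard += d / at_risk  # Nelson-Aalen increment d_j / r_j
        for i in tied:
            H[i] = cum_hazard
        at_risk -= len(tied)
        k += len(tied)
    return [events[i] - H[i] for i in range(n)]
```

A useful sanity check is that these residuals sum to zero over the whole sample, so positive and negative residuals necessarily balance across genotype cells.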
To estimate the power of the proposed method, we ran ES-MDR on each of the 7,000 data sets and searched for the best model over all possible one-way (i.e., single-locus), two-way (i.e., two interacting loci), and three-way (i.e., three interacting loci) interaction models, using the t-statistic testing score. We used the 95th percentile of the testing score from the null models as a threshold to guard against non-significant findings, corresponding to a significance level of 0.05. Power was estimated as the percentage of times ES-MDR correctly included the two functional interacting SNPs in the best model across each set of 7,000 data sets. For comparison, we ran Surv-MDR on the simulated data to estimate its power. Training and testing scores for ES-MDR were computed using two-fold cross-validation, and the best model was selected as the one with the smallest prediction error and the largest consistency in including the two functional interacting SNPs.
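The power estimate itself reduces to a simple proportion over the selected best models; an illustrative sketch with names of our own:

```python
def estimate_power(best_models, functional_pair):
    """Power = fraction of simulated data sets whose selected best model
    contains both functional SNPs (order and extra loci do not matter)."""
    target = set(functional_pair)
    hits = sum(1 for model in best_models if target <= set(model))
    return hits / len(best_models)
```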
OncoArray-TRICL Genotyping and Quality Control
A total of 533,631 SNPs from 57,775 individuals in the OncoArray-TRICL Consortium, selected from 29 studies across North America, Europe, and Asia, were genotyped using the Illumina OncoArray-500K BeadChip platform, which includes a genome-wide backbone and selected loci known to be associated with cancer phenotypes. To facilitate efficient genotyping and minimize variability that might arise from genotyping at multiple sites, genotyping was conducted at five institutions: the Center for Inherited Disease Research, the Beijing Genome Institute, the Helmholtz Zentrum München, Copenhagen University Hospital, and the University of Cambridge. Quality control steps described previously were followed for this OncoArray-TRICL data set.16 We excluded participants who lacked lung cancer status (because they were not part of the lung cancer studies), smoking status, or age and gender at diagnosis; close relatives (second-degree or closer); duplicate individuals; participants of non-European ancestry; participants with low-quality extracted DNA or low genotype call rates; and participants who did not pass quality control measures. A total of 14,935 lung cancer cases and 12,787 controls remained in the current study. We kept SNP filtering to a minimum to retain more SNPs for analysis, including SNPs with MAF greater than 0.01% and a genotyping rate of at least 50%.
OncoArray-TRICL Data Analysis
We applied ES-MDR to the OncoArray-TRICL Consortium study to identify genetic interactions associated with lung cancer age of onset. The OncoArray-TRICL Consortium is a collaboration among world leaders in the field to investigate common causes of cancer susceptibility and progression.16 Lung cancer cases and controls were genotyped using the OncoArray genotyping array, which tags cancer trait and susceptibility loci in addition to the GWAS backbone and consists of approximately 533,000 tagged SNPs. We identified 27,722 participants of European ancestry, ages 15-96 years: 14,935 lung cancer cases and 12,787 healthy controls. All participants provided informed consent, and each study site obtained approval from its ethics committee. In this analysis, the survival outcome data, consisting of lung cancer age of onset with cases treated as events at diagnosis age, controls censored at interview age, and smoking status as a covariate, were substituted by Martingale Residuals. We randomly sampled 2/3 of the data into a training set and the remaining 1/3 into a testing set. We applied our novel ES-MDR method to perform an exhaustive one-way and two-way model search. As a pre-filtering step, we used PLINK to identify uncorrelated, independent SNPs; SNPs in linkage disequilibrium were removed using a stringent correlation threshold of 0.1, leaving 108,254 SNPs. We searched over all one-way and two-way interactions in the training set to identify models consistently selected with the largest training score determined by two-fold cross-validation and analyzed the prediction error of the top 10 models in the testing set. In our real data study, we also considered joint detection of two SNPs with main effects to be successful detection of a functional interaction model.
To build a predictive model that combines the strengths of both one-way and two-way models, we took all the SNPs involved in the top 1,000 one-way models and all the SNPs from the top 1,000 two-way interaction models and used a penalized Cox regression method to filter and select the best predictive models for evaluating genetic factors associated with age of lung cancer onset. We ranked the test scores from highest to lowest and picked the top SNPs that best predict lung cancer onset.
To construct predictive models linking SNPs to censored survival data, we used Lasso penalized estimation for the Cox regression model to select the top SNPs relevant to patients’ age of lung cancer onset, yielding a prediction model with a parsimonious set of SNPs and good prediction accuracy.17 The Lasso procedure is a popular method for variable selection when the number of samples is significantly smaller than the number of predictor variables.18 Briefly, Lasso is similar to the forward stepwise method in that it performs variable selection, and it additionally shrinks coefficients, driving nonsignificant coefficients in a regression model to zero.18 Therefore, Lasso is a valuable tool for filtering out SNPs that are not associated with the outcome or are highly correlated, especially when the sample size is small relative to the number of SNP predictors.
Survival plots were generated using the Kaplan-Meier method to visualize differences in age of lung cancer onset between high-risk and low-risk groups defined by the top identified SNPs associated with lung cancer risk. To adjust for additional factors related to patient survival, smoking status was included as a covariate in the Cox ph regression model.
To assess the performance of our model in predicting lung cancer onset at different age intervals, we applied the time-dependent receiver operating characteristic (ROC) curve and area under the curve (AUC), introduced by Heagerty et al. (2000),19 to evaluate the predictive performance of the best models. For a given score function f(X), the time-dependent sensitivity and specificity are defined as follows:
sensitivity(c, t|f(X)) = Pr{f(X) > c|δ(t) = 1},
specificity(c, t|f(X)) = Pr{f(X) ≤ c|δ(t) = 0}.
We defined the corresponding ROC(t|f(X)) curve for any time t as the plot of sensitivity(c, t|f(X)) versus 1 - specificity(c, t|f(X)) with the cut-off point c varying. The AUC is the area under the ROC(t|f(X)) curve, denoted AUC(t|f(X)).17 Here, δ(t) is the event indicator at time t. In this study, a larger AUC at time t based on the score function f(X) indicates better predictability of time-to-event at time t, as measured by sensitivity and specificity evaluated at time t.
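An empirical version of AUC(t) can be sketched as follows. This simple illustration is ours, not Heagerty et al.'s estimator: it drops subjects censored before t rather than reweighting them, and estimates AUC(t) as the Mann-Whitney concordance probability between cases and controls at time t:

```python
def time_dependent_auc(scores, times, events, t):
    """Empirical AUC(t) for a risk score f(X): cases are subjects with an
    observed event by time t; controls are subjects still event-free at t
    (subjects censored before t are dropped in this simple sketch).
    Estimated as P(score_case > score_control), counting ties as 1/2."""
    cases = [s for s, u, e in zip(scores, times, events) if e == 1 and u <= t]
    controls = [s for s, u, e in zip(scores, times, events) if u > t]
    pairs = concordant = 0.0
    for c in cases:
        for k in controls:
            pairs += 1
            if c > k:
                concordant += 1
            elif c == k:
                concordant += 0.5
    return concordant / pairs if pairs else float("nan")
```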