Data
Patients in this study were drawn from the Cancer and Leukemia Group B (CALGB; now a part of the Alliance for Clinical Trials in Oncology) and SWOG 80405 (Alliance) trial. The trial was initiated in September 2005 with a total of 2,326 patients randomized to the three treatment arms (bevacizumab, cetuximab, or their combination in addition to chemotherapy with FOLFIRI or FOLFOX).
Genotyping: DNA was extracted from peripheral blood. The first genotyping batch was performed on the Illumina HumanOmniExpress-12v1 platform at the Riken Institute (Tokyo, Japan) and included 731,412 genotyped variants. The second genotyping batch was performed on the Illumina HumanOmniExpress-8v1 and included 964,193 SNPs. A total of 719,461 SNPs from HapMap from batch 1 were also on the chip from batch 2. The QC was performed to remove SNPs with mismatched annotation between the two platforms, genotyping call rates less than 99%, departure from Hardy–Weinberg equilibrium (P<10-8), allele frequencies less than 0.05, and individuals with genotyping call rate < 0.90. Passing the filters, 540,021 SNPs genotyped for 1,165 samples were remained17.
Tumor RNA sequencing: Tumor RNA was extracted from formalin-fixed paraffin-embedded (FFPE) tumor blocks (96% primary, 2% metastatic, 2% unknown) from 584 CALGB/SWOG 80405 patients at the baseline. TruSeq RNA Access target enrichment and library preparation protocol was performed using 250 ng of template RNA. Sequencing was done using synthesis chemistry targeting 50 M reads with a read length of 2x100 bp per sample on the HiSeq 2500. Data processing was conducted using standard procedures18.
Clinical outcomes and covariates: The primary endpoint of OS was calculated from the time of study entry to death or last known follow-up for those without reported death. The median follow-up of 640 samples in bevacizumab, cetuximab was 65.7 months (95% confidence interval, 63.5-70). In addition, BRAF and all RAS mutation status were determined by BEAMing (beads, emulsion, amplification, magnetics; Hamburg, Germany) technology19 and included in the analysis as covariates in addition to age and gender.
Methods
Data preprocessing
Among 584 samples with RNA-seq data, 86% were Caucasian, 9% African American, 5% from other ethnicities. Therefore, we focused on primary tumor samples from Caucasians to avoid analysis being confounded due to population stratification. We excluded genes with low expression variation across samples (standard deviation < 0.5) and genes with low counts across the samples (> 30% zeros). The remaining was 8301 genes for the analysis and we applied upper quartile normalization to make gene expression values comparable among different samples. We removed duplicated samples (n=5) and tumors with low gene expression across the genome (> 50% genes with zero counts, n=1). We then transformed the RNA-seq data into the log2 scale for the analysis. We performed principal component analysis to assess for batch effect or hidden population stratification in RNA-seq data (Figure S5). To verify the self-reported gender, we applied k-mean clustering using the expression of genes in chromosome Y, resulting in 5 samples with mismatched biological gender and recorded gender7 (Figure S6).
We used 1,055 tumor samples (Caucasian), genotyped at 540,021 SNPs for imputation. We used phased haplotypes from the Haplotype Reference Consortium (HRC) panel through the Michigan server20. Phasing was done using Eagle v2.4 algorithm21. The HRC panel combines sequence data across > 32,000 individuals from >20 medical sequencing studies. The imputed genotype data with imputation score > 0.7 and minor allele frequency (MAF) > 0.05 included 5,539,144 common SNPs. These SNPs were used in all the analyses.
Immune cell type abundance
Since the RNA-seq data in this study has been generated from heterogeneous tumor samples composed of multiple cell types, correcting for the abundance of different cell types in the data is crucial to avoid the analysis being confounded. We estimated the abundance of immune cell types in our RNA-seq data using CIBERSORTx22with the validated leukocyte gene signature matrix as a reference. We defined a cell phenotype to be enriched in our data if at most 30% of its estimated scores across samples are zero and its standard deviation is greater than 0.12. As a result, 9 hematopoietic cell phenotypes were enriched in our data: naive and memory B cells, plasma cells, CD8+ T cells, resting and activated memory CD4+ T cells, M0 and M2 macrophages, and activated mast cells27.
cis-eQTL analysis
To identify germline genetic variants associated with tumor gene expression, we previously focused on cis-eQTL23,24. For all pairs of genes and SNPs within 1 Mb upstream and downstream of the gene’ transcription start site (TSS), we applied a linear regression model while accounting for covariates (gender, age, BRAF V600E mutations and all RAS mutation status, batch effects, enriched cell type abundance). We performed cis-eQTL mapping using FastQT25. We applied the adaptive permutation mode of FastQTL while setting for 10,000 permutations and selected genes with at least one cis-eQTL with an adjusted p-value < 0.05 at the gene level. The result of cis-eQTL analysis is published elsewhere7 (Figure S7-8). Hereafter, for simplicity, we use eQTL to refer to cis-eQTL.
Mendelian randomization (MR) study. To apply MR technique and fulfill the assumptions, we take the following steps focusing on the genes with eQTLs.
1. Gene-OS association: To investigate if genes with eQTL are associated with OS (i.e, the outcome), we applied multivariable time-variant additive hazard regression model. This additive hazard regression model relaxes the assumption of the time invariant effect of the genes on OS and allows for estimating causal effects8. To enhance the feasibility of conducting multivariable analysis and to prevent overfitting, we employed k-mean clustering and clustered all genes in 4 clusters7. Using these clusters, we partitioned 352 genes into 4 clusters and performed multivariable analysis by including all the genes in each cluster in the model. The analyses were performed separately for the patients under bevacizumab treatment and cetuximab treatment while accounting for the tumor microenvironment's influence by adjusting for enriched immune cell abundances in our RNA-seq data in addition to the set of covariates gender, age, BRAF V600E mutations and all RAS mutation status. Gene associated with OS (p-value < 0.1) were considered for causal study using the MR technique reviewed in the next steps.
2. Instrumental variables and the lack of pleiotropy assessment: MR techniques aim at distinguishing causations from associations in observational studies by using genetic factors as instrumental variables (IVs). All 352 genes considered in the step 1 for association study were genes with eQTLs. We considered eQTLs as potential IVs and staringly assessed the lack of pleiotropic action. We conducted a GWAS and estimated the effect of IVs on OS via a Cox proportional hazard model while adjusting for covariates gender, age, tumor location, all-RAS, and KRAS mutation as well as the first principal component accounting for the batch effect (Figure S10). The significance threshold in GWAS is typically set at a p-value of less than 5×10-8. Any association above this threshold for IV-OS associations can be assumed as a lack of pleiotropy. However, we employed a stringent threshold of p-value < 1×10-4 and excluded IVs with p-values < 1×10-4 to ensure that the assumption of lack of pleiotropy holds.
3. Gene-OS causal association: We evaluated the causal impact of associated genes with OS by predicting the expression level of each gene using its IVs. If the predicted expression value is associated with OS, the corresponding gene is considered as a putative cause of OS.
To predict the expression level of a gene, we initially clustered the IVs of the gene based on their correlation (r2 > 0.1 within each cluster) using hierarchical clustering. We then selected one IV from each cluster as the proxy to avoid overfitting. We then predicted the expression level of each gene (ĝ) as follows
Replication of one-sample MR findings by a two-sample MR study: For the replication of causal genes from the previous section, we used 602 samples in the validation cohort (296 under bevacizumab treatment and 306 under cetuximab treatment) with genotype and clinical outcome and no RNA-seq data. We first compared the clinical characteristics and demography of patients in discovery and validation cohorts1 (Table S4) to ensure compatibility of the cohorts. We then predicted gene expression levels of the causal genes in the validation cohort using summary statistics of eQTLs from discovery cohort as:
treatment-specific additive hazard model, we considered it a replication for the one-sample MR. This analysis has some similarities with transcriptome-wide association studies (TWAS). But the main difference is that, in the two-sample MR analysis, genetic variants as predictors satisfy MR assumptions, which might not be the case in TWAS.