Patient Classification:
From the Cancer Genome Atlas (TCGA) Genomic Data Commons (GDC) portal21, we retrieved 587 high-grade serous ovarian carcinoma (HGSOC) patients with available clinical data using the TCGAbiolinks R/Bioconductor package.22 We selected for patients who received platinum-based adjuvant chemotherapy, the majority of which (96%) also received taxane treatment (see Supplemental Table 1 for characteristics of the cohort). A small percentage of the cohort has received additional adjuvant therapies in combination with platinum-based compounds, such as gemcitabine (9%), doxorubicin (2.6%), topotecan (2.6%), bevacizumab (2.2%), and tamoxifen (2.2%) (Supplemental Table 2). The interval between a patient’s last primary platinum treatment and the onset of a recurrent tumor or progression of an existing tumor was used as a metric for determining chemotherapy sensitivity. Patients who developed a new tumor in less than six months following their last platinum treatment were defined as resistant (N=110). In contrast, those who did not have a recurrent tumor event for over a year after their last primary platinum treatment were defined as sensitive (N=160). Individuals who had a recurrent tumor event between six months to one year following chemotherapy were excluded from the study. This strategy for dichotomizing resistant from sensitive patients was used to enrich for the genetic differences.
Transcriptomics Data Processing and Analysis:
Expression Microarrays: Of the 270 HGSOC subjects classified as sensitive or resistant to chemotherapy, 238 (138 sensitive, 100 resistant) had primary tumor microarray expression data available (Affymetrix ht_hg_u133a chip) in the GDC portal. The robust multi-array average (RMA) method23 in the affy package from Bioconductor24 was used for background correction, log-transformation, and quantile normalization of the probe intensities. Two potential outliers and two duplicated samples were removed from the study during the quality control step using the arrayQualityMetrics25 package (see Supplemental Data 1 for steps of pre-processing), resulting in 135 sensitive and 99 resistant HGSOC subjects in the expression set. Next, probes were filtered using the median absolute deviation (MAD) whereby the top 50% with highest variation (n=11,107) were selected for analysis. This non-specific filtering step removed probes with low variability in expression across the cohort, which are not likely to be differentially expressed between sensitive and resistant patients, reducing the number of multiple testing corrections and therefore, the likelihood of false positives.
Covariates: We assessed multiple potential confounders for correlation with therapeutic outcome including age, race, surgery (cytoreductive) outcome, cancer grade, and cancer stage (Supplemental Table 1). With the exception of age (p = 0.0041), all factors showed no significant difference between chemo-sensitive and chemo-resistant patients. For this reason, age at diagnosis was included as a covariate in all subsequent analyses.
Differential Gene Expression Analysis: The Limma26 package in Bioconductor27 was used to identify differentially expressed genes between chemo-sensitive and resistant groups using linear models. The false discovery rate (FDR) method was employed as a measure for multiple testing correction to control for type I error.
Weighed Gene Co-expression Network Analysis (WGCNA): We performed hierarchical clustering of genes using the R package WGCNA20, which groups genes based on their similarity in expression. This was achieved by first creating a similarity matrix using Pearson correlations of expression among all genes. The resulting matrix was raised to a power of 9, as suggested by the soft-thresholding power estimation plot (Supplemental Figure 1). Raising the correlation matrix to a power enriches for differences between weak and strong signals, allowing for better quantification of gene-gene interactions. The similarity matrix was transformed to a Topological Overlap Matrix (TOM), where the strength of association between a pair of genes is reinforced by the common neighbors shared by them. To avoid excessive splitting of genes into smaller modules, minimum module size was set to 30, split sensitivity (deep split) was set to 4, and modules with similar expression profiles were merged at a height of 0.5 (Supplemental Figure 2). Using principal component analysis, we calculated the module eigengene for each co-expression cluster to summarize module gene expression with a single measure. Each module eigengene was tested for association with platinum chemotherapy response using generalized linear models. Finally, we used Cytoscape28, an open source bioinformatics platform, to visualize significant gene co-expression networks.
Gene function and pathway annotations: The Database for Annotation, Visualization and Integrated Discovery (DAVID) 29 was employed to identify biological pathways and functions that were enriched in each significant gene co-expression module. We also screened significant genes in the GeneMANIA30 database to identify for functional connections reported in published literature. Next, we searched the UCSC transcription factor binding site (TFBS) conservation sites track with DAVID to identify enriched motifs of transcription factors that may co-regulate genes within each cluster. Finally, we used the Drug–Gene Interaction database (DGIdb)31, a public database with curation of data describing relationships between genes, chemicals, drugs, and pathological phenotype, to identify genes with prior reported associations with chemotherapeutic agents.
Validation of differentially expressed gene: The Kaplan–Meier plotter tool was used to cross-validate the differential expression of VCP in an independent ovarian cancer cohort (Geo accession identifier: GSE9891).32 This replication cohort included gene expression profiling (Affymetrix Human Genome U133 Plus 2.0 Array) of 285 ovarian tumor samples. Patients were filtered to include those with cancer histology of serous carcinoma and who received chemotherapy containing a platinum compound to allow close comparison with the TCGA ovarian cancer cohort. This step omitted a total of 60 subjects from analysis, which included 21 with endometrioid carcinoma cases and 43 who did not receive platinum therapy (4 overlapping subjects). Thus, 225 patients remained for replication analysis. Patient survival was evaluated using a Cox proportional hazards model and progression-free survival (PFS) was the primary outcome used in the replication analysis.33
Validation of co-expression networks:
For validation of co-expression networks, the SurvExpress database was used, which allows users to validate the combined effect of multiple gene expression measures with a target trait.34 The same cohort and filtering steps were used for validation (GSE9891, N = 225). The survival curve was evaluated using a Cox proportional hazards model and PFS was the primary outcome used in the replication analysis.
Genomics Data Processing and Analysis:
Genomics Data: Single nucleotide polymorphisms (SNPs) data from germline tissues (DNA extracted from blood or solid non-tumor ovarian tissue) were obtained from the TCGA legacy database. The Affymetrix Genome-Wide Human SNP Array 6.0 was used to capture genetic variations, which detected 906,600 SNPs. Of the 270 subjects from TCGA classified as resistant or sensitive to platinum-based chemotherapy, 266 (157 sensitive and 109 resistant) had genotype data available.
Imputations: The imputation of autosomal chromosomes was performed using the Michigan imputation server pipeline35. We used the 1000 Genome Project phase 3 sequencing data (version 5)36 reference panel for the imputation of missing genotypes. We then used Eagle v.2.337 for phasing of the genotypes to their respective chromosomes. For the imputation of variants on the X chromosome, SHAPEIT38 was used for phasing in combination with the 1000 genomes project phase 3 (version 5) reference panel (Supplemental Data 2).
Quality control:
Subject level: Two pairs of individuals had a relatedness coefficient (pi-hat) > 0.9, which are likely duplicated samples. One subject from each pair was randomly removed from the dataset. Next, inbreeding coefficients (F) were computed for each subject using PLINK39. A total of 18 subjects with high homozygosity (F>0.05) or heterozygosity (F<-0.05) rates were excluded. Moreover, genetic sex was estimated based on heterozygosity rates (F) of the X chromosome, and four subjects who had undefined genetic sex (F>0.2) were removed from the study.
SNP level: SNPs with minor allele frequencies (MAF) less than 1% or with genotyping call rate less than 90% were removed. This step removed 38,430,595 SNPs with MAF < 0.01, resulting in 9,528,963 SNPs to be used for further analysis.
Genome-wide Association Study: After imputations and quality control, 240 subjects (N= 142 sensitive, 98 resistant) and a total of 9,528,963 SNPs (MAF > 0.1) remained available for analysis (Supplemental Data 3). Plink (v.1.90) was used to compute genome wide and BRCA1/2 targeted association analysis using a logistic regression model. We pruned variants in strong (r2 > 0.8) linkage disequilibrium (LD) within the BRCA1/2 loci to determine independent association signals.
Variant annotations: Variant Effect Predictor (VEP)40 was used to predict the functional consequence of the identified variants. Similarly, the database of Genome-Wide Repository of Associations Between SNPs and Phenotypes (GRASP)41 and Clinvar42 were used to identify variants with known phenotype associations.
Expression Quantitative Trait Loci (eQTL) Analysis:
Common SNPs (MAF > 0.01) were tested for association with gene expressions of BRCA1, BRCA2, and co-expression networks using the matrixeQTL R package43. The correlation of a genotype with nearby gene expression indicates potential regulatory function of the SNP on the corresponding gene. These regulatory SNPs known as cis-expression Quantitative Trait Loci (cis-eQTL). Cis-eQTLs are defined as correlated SNPs found within 1 Mb from the gene transcriptional start site (TSS).