2.1 Data collection
Used GDC Data Transfer Tool to download corresponding data (RNA-sequence data, clinical data, and drug data) of 379 patients diagnosed with serous ovarian cancer from the TCGA database (https://portal.gdc.cancer.gov/). The download time was Feb 2022. Based on clinical and drug information, 332 patients who had received platinum-based chemotherapy and had follow-up data were selected. As previously defined(15), patients with platinum-free interval (PFI) ≥ 6 months were classed as platinum-sensitive, whereas those with PFI < 6 months as platinum-resistant.
GSE32062 was collected from Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) by the GEOquery package. This chip, based on the GP570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array and GPL6480 Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version), consists of 270 ovarian cancer samples, 260 of them based on GPL6480 were selected.
2.2 Identification of Differentially Expressed Genes Associated with Chemosensitivity Status
To obtain the differentially expressed chemosensitivity-related genes (DECRGs), the R package “limma” was used to screen the DECRGs between the chemo-sensitive and resistant groups from TCGA RNA-seq. P values below 0.05, and log 2-Fold Change (logFC) over 0.5 were considered the cutoff criteria.
2.3 Identification of Differentially Expressed Genes Associated with Immune Status and Chemosensitivity Status.
To obtain the immune phenotype-relevant genes, first, single sample gene set enrichment analysis (ssGSEA) was applied to evaluate the level of immune infiltration, which was recorded as ssGSEA scores based on the expression levels of immune cell-specific marker genes(16). Gene signatures of 28 tumor-infiltrating lymphocytes were obtained from TISIDB database (http://cis.hku.hk/TISIDB).
Based on the obtained immune cell proportion, the R package “ConsensusClusterPlus” was used to perform an unsupervised hierarchical clustering analysis(17). The range of K values was set from 2 to 5. Afterward, limma was performed for screening differentially expressed immune-relevant genes (DEIRGs). P < 0.05 and logFC > 1 was considered to denote statistical significance. DEGs with immune status and chemosensitivity status from TCGA RNA-seq were intersected to get the integrated DECIRGs.
2.4 Validation of the Plausibility of the Immune Phenotype Grouping
Varieties of bioinformatics tools were applied to characterize the immune microenvironment of samples using gene expression data. We used three different tools to estimate immune cell populations in distinct cancer subtypes.
ESTIMATE algorithm, a tool used for inferring the level of infiltrating stromal and immune cells in tumor samples and tumor purity based on gene expression signatures(18), was applied to evaluate the tumor microenvironment (TME) of TCGA. The pheatmap and ggpubr packages visualized the output.
CIBERSORT is a helpful approach for high throughput characterization of diverse cell types, such as TILs, from complex tissues(19). Thus, to obtain a comprehensive picture of immune cell type infiltration, CIBERSORT was used for quantifying cell fractions.
Tumor alterations in HLA class I antigens' surface expression and function frequently lead to immune escape(20). We extract the expression matrix of HLA genes of TCGA and implement visualization using the ggpubr package.
2.5 Construction of Prognostic Predictive Model Based on the DECIRG Pairs
LASSO is a linear regression method using L1 regularization. We construct a 0-or-1 matrix through an iterative loop based on the DECIRGs, which was noted as “A|B” and calculated as follows: If ExprA < ExprB, A|B = 1, else: A|B = 0.(where ExprA represent the expression level of gene A, and ExprB represent the expression level of gene B). With this strategy, we consider the relative expression of the genes rather than the absolute values, thus reducing the data proceeding workload before applying the model. UnicCox proportional hazard regression was used to reduce the number of pairs and build a risk assessment model(21). Least absolute shrinkage and selection operator (LASSO) regression was run 1,000 times with the glmnet package to screen the DECIRG pairs preliminarily(22), and we removed DECIRGs with constant values with setting total pair value/sample numbers ratio in the range of 0.2–0.8. After these steps, the stepAIC algorithm in the MASS R package was used to perform multivariate cox regression. The model with the lowest Akaike Information Criterion (AIC) value was then chosen to construct a prognostic signature. The formula was established as follows(where k is the number of gene pairs):
We used the performance of this model to plot the receiver operating characteristic (ROC) curves. In addition, survminer package was used to calculate the best cut-off value of the risk score(23).
2.6 Evaluation of the Prognostic Model
The DECIRGs in GSE32062 were treated with the same cyclic pair, and the DECIRG pair was obtained as the validation set. Following the optimal cut-off risk score of the train set, patients were classified into high-risk and low-risk groups. In addition, the distribution of high- and low-risk groups and the survival state per case were plotted. Moreover, to compare the overall survival(OS) between the high-risk and low-risk groups, Kaplan–Meier (K-M) survival curves were drawn by the "survival" package using the log-rank test. In addition, to ensure the predictive efficiency of the prognostic signature, we conduct the time-dependent ROC curve by "survivalROC" package.
2.7 Independent Prognostic Value of the Risk Score
Clinicopathological data of the TCGA cohort was collected, including tumor grade, Federation International of Gynaecology and Obstetrics (FIGO) stage, age and chemosensitivity information of patients. We performed wilcoxon signed-rank tests to acess the relation of the risk score and the clinical factors. In addition, univariate and multivariate Cox analysis were performed to explore the impact of the risk score and clinical features on patient outcomes, which was presented as a Forest plot.
2.8 Construction and Evaluation of Nomogram
The independent prognostic characteristcs were integrated into a nomogram plot to predict the prognosis for clinical convenience (24). Moreover, the calibration curves was plotted to assess the calibration of the models. The concordance index (C index) was used to evaluate the predictive accuracy of the signature. The ROC curves of various independent prognostic clinical characteristics were drawn, and the AUC was calculated.
2.9 The Correlation Between the Risk Subgroups and the Chemotherapeutic Drugs and Immune Checkpoints.
The pRRophetic package was used to predict the half-maximal inhibitory concentration (IC50) of chemotherapy agents among the risk subgroups in ovarian cancer patients. Block the immune checkpoint pathways is one extremely promising approach to achieve anti-cancer immunity(25). We identified potential immune checkpoint genes based on published literature(26) to explore the relationship between the gene signature and immunotherapeutic efficacy.
2.10 The Correlation Between the Risk Subgroups and Somatic Mutations and Tumor Mutational Burden (TMB).
Somatic mutation data of the TCGA-OC cohort was used to explore the mutation differences among the risk groups. Maftools package was utilized to perform the analysis of the tumor mutation burden (TMB) and draw waterfall for different risk subsets of TCGA(27). Based on the best cutoff value by “survminer” R package, the samples were divided into high-TMB and low-TMB groups. The correlation between the TMB and patient survival probability were investigated. In addition, Spearman’s correlation analysis was performed to acess the correlation between TMB and riskscore. Moreover, to explore the relationship between the risk groups and the tumor-infiltrating immune cells, CIBERSORT was applied to quantifying fractions of immune infiltration cell.
2.11 Gene Set Enrichment Analysis
Gene set enrichment analysis in the two risk subgroups was conducted by GSEA software (version: 4.2.3). C2. CP. KEGG.v7.5.1 symbols. gmt and Hallmark collections served as the reference gene set. NOM p-value < 0.05 was considered as significant.