A framework to jointly predict clinical outcome and model applicability
Due to the heterogeneous nature of cancer, most biomarkers are only applicable to a subset of tumors with a particular set of clinical and genomic features. A large proportion of cancer biomarkers are proposed to predict the clinical outcomes of patients under standard or a particular therapeutic treatment. Here we call the models that are developed to apply biomarkers clinical outcome prediction models (CPMs). Depending on the type of clinical outcome variable, a CPM can be formularized as a classification problem (Classification-CPM) or a survival analysis (Cox regression-CPM). The former is used to classify patients into different groups, such as recurrence versus non-recurrence within five years, and pathological complete response (pCR) versus residual disease (RD) after neoadjuvant therapy. The latter is used to predict the survival times (e.g., recurrence-free survival times) of patients.
As shown in Fig. 1, we propose a framework to determine the applicability of CPMs to different tumor samples. First, we apply a CPM to a benchmark dataset, and by comparing the predicted results with known clinical outcomes, divide tumor samples into correctly versus incorrectly predicted groups for Classification-CPMs, or well- versus poorly-predicted groups for Cox regression-CPMs. We then identify the genomic or clinical features that are associated with the applicability of CPMs, namely, features that discriminate the two groups. Finally, we construct a classification model (denoted as a predictability prediction model, PPM) based on these features to determine the predictability of tumor samples by a CPM.
Following this procedure, a pair of models, a CPM and a PPM, are developed to implement a biomarker. For each tumor sample, the CPM calculates a prognostic score (P-score) to predict the clinical outcome of the patient, and the PPM calculates a confidence score (C-score) to indicate the reliability of this prediction. Based on the C-scores, patients can be sorted with higher clinical outcome prediction accuracy expected for those with higher C-scores.
In the next two sections, we will demonstrate the efficacy of this framework using three multi-gene signatures developed for predicting clinical outcomes in breast cancer. In one section, we will apply these signatures to predict patient response to neoadjuvant therapy using a combination of PPMs and classification-CPMs. In the other section, we will apply them to predict recurrence-free survival of patients using a combination of PPMs and Cox regression-CPMs.
Application to classification models
To test the applicability of our framework to classification problems, we used the Hatzis breast cancer microarray dataset, which contains gene expression profiles from pre-treatment tumor biopsies of 508 patients treated with neoadjuvant taxane-anthracycline chemotherapy [23]. After treatment, these patients were classified as pCR or RD based on their clinical response. The dataset consists of a discovery dataset and a validation dataset with 310 and 198 patients, respectively.
Using the discovery cohort, we constructed a random forest model (CPM) to classify pCR and RD patient groups based on the Oncotype DX scores of each patient, as well as several clinical variables including age, ER status and tumor stage. This model predicted patient response with a fairly good accuracy (AUC = 0.746) as estimated by cross-validation (Fig. 2A). Out of the 305 patients (5 patients were excluded for the lack of ER information), 236 (77.4%) were correctly predicted by the CPM. We then constructed another random forest model (PPM) to classify patients that were correctly and incorrectly predicted. In this PPM, we used the following clinical variables as predictors: age, ER status, tumor stage, and PAM50 molecular subtype. Interestingly, cross-validation results indicated a higher accuracy for the PPM (AUC = 0.846) than the CPM (Fig. 2A), indicating that the applicability of the CPM is strongly determined by patient clinical features and therefore is highly predictable by the PPM. In Fig. 2B and 2C, we compared the relative importance of each predictor in the CPM and the PPM. As shown, to predict patient response to neoadjuvant therapy, the age is the most important variable followed by Oncotype score (Fig. 2B). In contrast, age, PAM50 subtype and stage contribute equally to the prediction of CPM applicability to patients in the PPM (Fig. 2C).
Oncotype DX assay is recommended for ER + but not ER- breast cancer. However, other clinical variables that determine its applicability have not been carefully investigated. Here we systematically investigated the association of CPM prediction accuracy with all clinical and genomic variables available from the Hatzis data. As shown in Fig. 2D-G many clinical features were correlated with predictability of patients by the CPM, in which Oncotype DX score was included as one of predictors. Notably, 95.3% of ER + patients were correctly predicted while this was the case in only 53.5% of ER- patients, consistent with the reported applicability of Oncotype DX assay [18] (Fig. 2E). It was also notable that the CPM was less applicable to tumor samples with higher grade (Fig. 2F). The applicability of the CPM also depended on the molecular subtypes with better performance in the Lum A and Normal-like tumors, whereas the prediction accuracy for Basal-like and Her2-enriched subtypes was significantly lower (Fig. 2G). These results indicated that recommending the application of a biomarker simply based on a single or subset of clinical variables may not be valid. To better implement biomarkers, a more sophisticated model like the PPM proposed here may be required to quantitatively measure the applicability of biomarkers to different tumor samples.
We then applied the CPM and the PPM trained using the discovery dataset to the validation cohort of the Hatzis data. For each patient in the validation cohort, we calculated a P-score and a C-score that indicated the likelihood of this patient to achieve pCR and the confidence of the prediction, respectively. A patient was predicted to be pCR if they had a P-score > 0.5, and RD, otherwise. We sorted patients based on their C-scores and calculated the average prediction accuracy in the top patients. Our results indicate that patients with higher C-scores were associated with higher accuracy, with average accuracy decreasing from 95–65% (Fig. 2H). The same results were obtained when we trained the CPM/PPM using the validation cohort and test in the discovery cohort (Fig. 2I).
We next examined two other gene signatures for clinical outcome prediction in breast cancer, the MammaPrint and the E2F4 signatures. Both signatures have been shown to be predictive of recurrence risk of patients and patient response to neoadjuvant therapy in breast cancer. We applied the random forest method to construct the CPMs for MammaPrint and E2F4 separately by using the MammaPrint/E2F4 score and the same set of clinical variables (age, ER status and stage) as included in the CPM for the Oncotype DX signature. Similarly, we constructed the PPMs for them using age, ER status, stage, and PAM50 subtype as predictors. The CPMs and PPMs for Oncotype DX, MammaPrint, and E2F4 signatures all achieved comparable AUC scores as shown in Fig. 3A and 3B. The CPMs and PPMs were trained using the discovery cohort and then used to calculate P-scores and C-scores for patients in the validation cohort, and vice versa. As shown in Fig. 3C and 3D, we obtained a similar trend as observed from the Oncotype DX analysis–patients with higher C-scores are more likely to be accurately predicted.
Following that, we applied the CPMs and PPMs trained from the Hatzis data (including patients from both discovery and validation cohorts) to two independent neoadjuvant therapy microarray datasets (GSE20194 and GSE22093). In both datasets, we confirmed that our framework could be used to determine the applicability of clinical outcome prediction models to each patients based on their clinical features (Fig. 4).
Application to survival prediction models
Oncotype DX has been widely used to predict the recurrence risk of patients with breast cancer. In this section, we used the Curtis breast cancer dataset to demonstrate how to apply our framework to improve prognostic prediction. The Curtis dataset includes a discovery cohort and a validation cohort, containing tumor gene expression profiles from 997 and 995 patients with breast cancer as well as detailed clinical information [24]. We calculated the Oncotype DX score for each sample, and constructed a univariate Cox regression model (CPM) to predict the recurrence-free survival of patients in the discovery cohort. The concordance index of this model is 0.679. That is, for 67.9% of all effective patient pairs (patient pairs for which survival times can be compared with certainty), the patient with the longer survival time is also predicted by the Cox regression model to have longer survival time. The prediction accuracy associated with different patients varies substantially. We calculated a sample-specific concordance for each patient as the fraction of correct predictions in all effective pairwise comparisons involving this patient. Figure 5A shows the number of effective comparisons and sample-specific concordance associated with each patient in the Curtis discovery cohort. As shown, some patients associated with a small number of effective comparisons due to short follow-up time and absence of recurrence event during follow-up. Even for patients with a large number of effective comparisons, the sample-specific concordances showed a high variation, ranging from 10–95%. We chose patients with at least 20% of comparisons being effective (997*0.2 = 199), and divided them into a well-predicted group and a poorly-predicted group, using the overall concordance 0.679 as the cut-off value (Fig. 5B).
Following that, we constructed a random forest model to classify well-predicted versus poorly-predicted patients (PPM). In the model, we include the following clinical variables: age, tumor size, grade, stage, lymph node status, PAM50 subtypes, ER-status, HER2 status, chemotherapy, radiotherapy, and hormone therapy. The prediction accuracy of this PPM was estimated using cross-validation, and an AUC score of 0.651 and 0.734 were achieved in the discovery and the validation cohort, respectively (Fig. 5C). Clinical variables with high relative importance in the model included age, tumor size, PAM50 subtype, stage and grade, whereas ER, HER2, lymph node, and treatment are associated with low relative importance (Fig. 5D).
We trained the CPM (the univariate Cox regression using Oncotype DX score as the predictor) and the PPM (the above-described random forest model) using the discovery cohort from the Curtis data and applied them to the validation cohort. Although the two cohorts both have similar number of patients (997 in discovery and 995 in validation), the discovery dataset is relatively complete while many clinical variables have missing information in the validation dataset. We thus excluded patients with too much missing information. For each patient in the validation cohort, we calculated a P-score, a C-score, and sample-specific concordance. We then sorted all patients in decreasing order of their C-scores, and calculated the average concordance in the top ranked patients. As shown in Fig. 5E, we observed a clear trend that patients with higher C-scores were more correctly predicted from their Oncotype DX scores by the CPM. Similarly, we trained the CPM/PPM using the validation cohort, applied them to patients in the discovery cohort, and obtained consistent results (Fig. 5F). These results indicate that the applicability of Oncotype DX varies substantially between patients, which can be correctly determined using the PPM based on the patient clinical features.
Next, we extended the above-analysis to the MammaPrint and the E2F4 signatures. Specifically, we used the CPM/PPM trained from the Curtis discovery or validation dataset to another breast cancer dataset compiled by Ur Rehman et al [25]. The original Ur Rehman dataset contains 1570 breast cancer gene expression profiles, from which we selected 880 profiles with recurrence-free survival information. In the PPM, we used a subset of clinical variables that were shared by the Curtis and the Ur Rehman data, including age, tumor size, ER status, grade, lymph node status, and PAM50 subtype. For each patient in the Ur Rehman data, we calculated a P-score, a C-score, and sample-specific concordance. We performed the same analyses for the Oncotype DX, the MammaPrint and the E2F4 signatures. Our results are summarized in Fig. 6. As shown, the Cox regression-based CPM trained from the Curtis discovery data achieves a concordance index of 0.679 for Oncotype DX, 0.680 for MammaPrint, and 0.638 for the E2F4 signature, respectively (Fig. 6A). The PPM model trained from the Curtis discovery data achieves an AUC score of 0.651, 0.624, and 0.646, respectively (Fig. 6B). As shown in Fig. 6C and 6D, patients with higher C-scores are associated with higher average concordance for all three multi-gene signatures. These results confirmed the effectiveness of the CPM and the PPM trained from Curtis data in the independent Rehman dataset, supporting the potential clinical application of them.