This dataset is freely available in the spBayesSurv R package and was first analyzed by Henderson et al. (2002) (22). It includes 1043 patients diagnosed with acute myeloid leukemia (AML), with the time until death or last known follow-up \(T\) (in years) and the corresponding event indicator \(\delta\) (0 for censored patients and 1 for deaths). We aim to predict survival after AML diagnosis from the following covariates: age at diagnosis (continuous), sex, white blood cell count (WBC) at diagnosis (\(50\times {10}^{9}/L\); continuous) and the Townsend deprivation score (continuous), an area-based measure of deprivation for the enumeration district of residence (higher values indicating more deprived areas).
Method
Hazard regression models
In this section, we present the two regression models that we consider as potential alternatives to the Cox PH model.
Flexible hazard-based regression model using regression splines
This model uses regression spline functions to model the baseline hazard function and (possibly) time-dependent and/or non-linear effect(s) of covariates. Spline functions are flexible mathematical functions built from degree-\(D\) polynomial segments joined at \(K\) junction points called knots (6)(10)(23)(24)(25). A simple example with two covariates \({(x}_{1},{x}_{2})\), where \({x}_{2}\) is a continuous variable (such as age), illustrates the parametrisation of the model for the hazard
$$\lambda \left(t|{x}_{1},{x}_{2}\right)= {\lambda }_{0}\left(t\right)*\text{exp}\left({\beta }_{1}{x}_{1}+ g\left(t\right)*{x}_{2}+j\left({x}_{2}\right)\right),$$
where \({\beta }_{1}\) is the regression coefficient for \({x}_{1}\), while \(g\left(t\right)\), \(j\left({x}_{2}\right)\) and the log baseline hazard \(\text{log}\left({\lambda }_{0}\left(t\right)\right)\) are B-spline functions. The degree of the spline (\(D\)) as well as the number and locations of the (\(K\)) knots need to be specified by the user and may differ for \(g\), \(j\) and \({\lambda }_{0}\).
The use of flexible functions to describe the baseline hazard, non-proportional effects of variables and non-linear functional forms for continuous variables allows smooth and more realistic hazard shapes to be modelled while keeping the model parsimonious. One drawback of this model is that the number and position of the knots for the spline functions must be specified. To have enough flexibility, we choose a cubic B-spline with one knot located at the median of the observed event times for the spline functions of time (i.e., the baseline hazard \({\lambda }_{0}\left(t\right)\) and the time-dependent effect \(g\left(t\right)\)). This knot location ensures that the same amount of information is available on both sides of the knot. For the non-linear functional form of a continuous variable, the user may choose another type of spline, for example a restricted cubic spline with 2 knots located at the 33rd and 66th percentiles of the distribution of \({x}_{2}\). The more knots there are, the greater the flexibility of the model, but at a higher risk of overfitting.
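To make the spline construction concrete, the cubic B-spline basis with a single interior knot at the median event time can be evaluated with the Cox-de Boor recursion. The paper's analyses are in R; the sketch below is a minimal illustration in Python, and the function name and toy event times are our own:

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions at points x (Cox-de Boor recursion).

    `knots` is the full knot vector: boundary knots repeated degree+1 times,
    with the interior knot(s) in between ("clamped" spline).
    """
    x = np.asarray(x, dtype=float)
    k = np.asarray(knots, dtype=float)
    # degree 0: indicator function of each knot span
    B = np.zeros((len(x), len(k) - 1))
    for i in range(len(k) - 1):
        B[:, i] = (k[i] <= x) & (x < k[i + 1])
    # include the right endpoint in the last non-degenerate span
    last = int(np.max(np.where(k[:-1] < k[1:])[0]))
    B[x == k[-1], last] = 1.0
    # raise the degree one step at a time
    for d in range(1, degree + 1):
        B_new = np.zeros((len(x), len(k) - d - 1))
        for i in range(len(k) - d - 1):
            left = (x - k[i]) / (k[i + d] - k[i]) * B[:, i] if k[i + d] > k[i] else 0.0
            right = ((k[i + d + 1] - x) / (k[i + d + 1] - k[i + 1]) * B[:, i + 1]
                     if k[i + d + 1] > k[i + 1] else 0.0)
            B_new[:, i] = left + right
        B = B_new
    return B

# toy observed event times (years); one interior knot at the median, as in the text
times = np.array([0.1, 0.4, 0.9, 1.5, 2.2, 3.0, 4.8, 6.0])
knots = np.r_[[0.0] * 4, np.median(times), [times.max()] * 4]
basis = bspline_basis(times, knots, degree=3)   # one column per basis function
```

Each column of `basis` enters the linear predictor with its own coefficient; a standard sanity check is that the rows sum to 1 (partition of unity) over the observation window.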
General hazard structure model with a flexible baseline hazard
This general hazard (GH) structure (15) represents a rich class of hazard-based regression models, which is formulated in terms of the hazard function (exemplified with two covariates \({x}_{1}\) and \({x}_{2}\) here):
$${\lambda }^{GH}\left(t;{x}_{1},{x}_{2}\right)= {\lambda }_{0}\left(t\text{exp}\left({\beta }_{11}{x}_{1}+{\beta }_{12}{x}_{2}\right)\right)\text{exp}\left({{\beta }_{21}x}_{1}+{\beta }_{22}{x}_{2}\right),$$
where \({\lambda }_{0}\left(t\right)\) is the baseline hazard function, which can be modelled using a pre-specified parametric distribution (see below), and \(\{{\beta }_{11},{\beta }_{21}\}\) and \(\{{\beta }_{12},{\beta }_{22}\}\) are the regression coefficients of the variables \({x}_{1}\) and \({x}_{2}\), respectively. The parameters \(\{{\beta }_{11},{\beta }_{12}\}\) allow for the inclusion of time-level effects of the covariates \({x}_{1}\) and \({x}_{2}\).
The GH model includes as special cases: the proportional hazards model (PH, when \({\beta }_{1j}=0\)), the accelerated failure time model (AFT, when \({\beta }_{1j}= {\beta }_{2j}\)), the accelerated hazards model (AH, when \({\beta }_{2j}=0\)) and the hybrid hazard model (HH, when the covariates included in the argument of the baseline hazard differ from those in the exponential term multiplying the baseline hazard). The GH model is identifiable provided that the baseline hazard is not the hazard function of the Weibull distribution (14).
To complete the specification of the GH model, we need to select a parametric distribution for the baseline hazard. Here, we will use the Power Generalized Weibull (PGW) distribution instead of Exponentiated Weibull distribution originally proposed (15), as it is easier to implement while being equally flexible (26). The PGW distribution is a 3-parameter distribution which is defined according to the following equations for the density, survival, and hazard functions:
$$f\left(t;\sigma ,\nu ,\gamma \right)=\frac{\nu }{{\gamma \sigma }^{\nu }}{t}^{\nu -1}{\left[1+{\left(\frac{t}{\sigma }\right)}^{\nu }\right]}^{\left(\frac{1}{\gamma }-1\right)}\text{exp}\left\{1-{\left[1+{\left(\frac{t}{\sigma }\right)}^{\nu }\right]}^{\frac{1}{\gamma }}\right\},$$
$$S\left(t;\sigma ,\nu ,\gamma \right)= \text{exp}\left\{1-{\left[1+{\left(\frac{t}{\sigma }\right)}^{\nu }\right]}^{\frac{1}{\gamma }}\right\},$$
$$\lambda \left(t;\sigma ,\nu ,\gamma \right)=\frac{\nu }{{\gamma \sigma }^{\nu }}{t}^{\nu -1}{\left[1+{\left(\frac{t}{\sigma }\right)}^{\nu }\right]}^{\left(\frac{1}{\gamma }-1\right)},$$
where \(f\left(t;\sigma ,\nu ,\gamma \right)\) is the probability density function, \(S\left(t;\sigma ,\nu ,\gamma \right)\) is the survival function and \(\lambda \left(t;\sigma ,\nu ,\gamma \right)\) is the hazard function; \(\sigma >0\) is a scale parameter and \(\nu ,\gamma >0\) are shape parameters.
The hazard function associated with the PGW distribution can capture several shapes: constant, increasing, decreasing, bathtub and unimodal, which are the shapes usually observed in cancer survival applications (15). Implementations of the GH model with PGW baseline hazard (among others) can also be found in the R packages GHSurv and HazReg.
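The PGW functions above, and the GH hazard built on them, translate directly into code. The following is a minimal Python sketch (the implementations cited in the text are R packages; all names here are our own), where `gh_hazard` evaluates \({\lambda }^{GH}\left(t;x\right)={\lambda }_{0}\left(t\,\text{exp}\left({x}^{T}{\beta }_{1}\right)\right)\text{exp}\left({x}^{T}{\beta }_{2}\right)\) with a PGW baseline:

```python
import numpy as np

def pgw_hazard(t, sigma, nu, gamma):
    """PGW hazard: (nu / (gamma * sigma^nu)) * t^(nu-1) * [1 + (t/sigma)^nu]^(1/gamma - 1)."""
    t = np.asarray(t, dtype=float)
    return nu / (gamma * sigma**nu) * t**(nu - 1) * (1 + (t / sigma)**nu)**(1 / gamma - 1)

def pgw_survival(t, sigma, nu, gamma):
    """PGW survival: exp{1 - [1 + (t/sigma)^nu]^(1/gamma)}."""
    t = np.asarray(t, dtype=float)
    return np.exp(1 - (1 + (t / sigma)**nu)**(1 / gamma))

def gh_hazard(t, x, beta1, beta2, sigma, nu, gamma):
    """GH hazard: PGW baseline evaluated at t * exp(x'beta1), multiplied by exp(x'beta2)."""
    return pgw_hazard(t * np.exp(x @ beta1), sigma, nu, gamma) * np.exp(x @ beta2)
```

With \(\nu =\gamma =1\) the PGW hazard is constant at \(1/\sigma\) (the exponential special case), and setting `beta1 = 0` in `gh_hazard` recovers a PH model, consistent with the special cases listed above.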
Model-building strategy
To select between different regression models within a given class of models (i.e., spline-based models or GH models), we use the Akaike Information Criterion (AIC). We start by fitting several pre-specified candidate models, based on clinical knowledge of the disease, defined with or without time-dependent effects for some covariates (specified in the Results section), and then we choose the model with the smallest AIC among the set of candidate models. For comparison purposes, we also fit the Cox model assuming proportional hazards for each covariate in the model.
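Given the maximized log-likelihood and the number of parameters of each fitted candidate, the AIC comparison amounts to the following small sketch in Python (the candidate names, log-likelihoods and parameter counts below are purely hypothetical placeholders, not results from the paper):

```python
# AIC = 2k - 2*log-likelihood; the model with the smallest AIC is retained.
# All numbers below are made-up placeholders for illustration only.
candidates = {
    "PH effects only":                {"loglik": -1520.3, "k": 5},
    "time-dependent effect for age":  {"loglik": -1511.8, "k": 8},
    "time-dependent effects for all": {"loglik": -1510.9, "k": 17},
}
aic = {name: 2 * m["k"] - 2 * m["loglik"] for name, m in candidates.items()}
best = min(aic, key=aic.get)  # candidate retained for prediction
```

Note that the richest model does not necessarily win: the extra parameters must buy enough log-likelihood to offset the \(2k\) penalty.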
Evaluating the predictive abilities of hazard-based regression models for survival data
Several performance measures have been developed to compare regression models in terms of their ability to predict individual survival probabilities at a pre-specified time horizon \({t}^{*}\). In this paper, because we aim to focus on describing the predictive performance measures, we assume that the predictive models have been built on a different dataset and are then validated on the LeukSurv dataset. If a model were built and validated on the same dataset, specific techniques such as cross-validation (16) would be needed to account for the optimism of the performance measures.
The performance measures can be broadly classified into three classes (18)(2): overall performance measures, discrimination measures and calibration measures. Overall performance measures estimate the accuracy of prediction at time \(t\) by calculating the difference between predicted and observed outcomes for all individuals at time \(t\) using a loss function (that is, the “prediction error”). Discrimination measures reflect the model’s ability to separate individuals who experienced the event from those who did not. Calibration measures compare the survival predictions for specific subgroups to their observed survival. We describe these performance measures in some detail before using them in our illustrative example.
Overall performance measure: prediction error estimated with the Brier Score
The Brier Score at time \(t\) (27) measures the mean quadratic difference between the predicted survival probability for subject \(i\) at time \(t\) and the observed event status for the same subject \(i\) at time \(t\). It quantifies the degree to which the predicted values coincide with the observed outcomes. Without censoring it can be estimated at time \(t\) by (28)(29)
$$\widehat{BS\left(t\right)}= \frac{1}{N}\sum _{i=1}^{N}{\left(\mathbb{I}\left\{{T}_{i}>t\right\}-{\widehat{S}}_{i}\left(t\right)\right)}^{2},$$
where \({\widehat{S}}_{i}\left(t\right)\) is the model-based predicted probability of surviving beyond time \(t\) for individual \(i\), and the indicator \(\mathbb{I}\left\{{T}_{i}>t\right\}\) is equal to 1 when patient \(i\) is known to be alive at time \(t\) and equal to 0 if patient \(i\) died before or at \(t\).
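The censoring-free estimator is a one-liner; a small Python sketch (the function name and the four-patient toy data are our own):

```python
import numpy as np

def brier_score_uncensored(event_times, surv_pred, t):
    """BS(t) = (1/N) * sum_i (I{T_i > t} - S_hat_i(t))^2, censoring-free case."""
    alive_at_t = (np.asarray(event_times, dtype=float) > t).astype(float)
    return float(np.mean((alive_at_t - np.asarray(surv_pred, dtype=float)) ** 2))

# toy example: 4 patients, predicted survival evaluated at horizon t = 2.5 years
bs = brier_score_uncensored([1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.8, 0.9], t=2.5)
```

As a point of reference, a non-informative model predicting 0.5 for everyone gets \(BS\left(t\right)=0.25\) whatever the outcomes.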
When the sample contains right-censored observations, the indicator \(\mathbb{I}\left\{{T}_{i}>t\right\}\) cannot always be computed. Indeed, if patient \(i\) is censored before \(t\), \(\mathbb{I}\left\{{T}_{i}>t\right\}\) is unknown because \({T}_{i}\) is not observed. A proposed solution to handle right censoring is to weight each contribution to the Brier Score (30), assuming that censored patients can be represented by patients with complete information. These weights are defined through the Kaplan-Meier estimate of the survival function of the censoring times, which represents the probability of being uncensored at time \(t\), \(\widehat{{S}_{C}}\left(t\right)=\widehat{P}(C>t)\). An estimator of the Brier Score that accounts for censoring (assuming random censoring) is then given by:
$$\widehat{BS\left(t\right)}=\frac{1}{N}\sum _{i=1}^{N}\left[\frac{{\mathbb{I}}_{\left({\stackrel{\sim}{T}}_{i}\le t,{\delta }_{i}=1\right)}}{\widehat{{S}_{C}}\left({\stackrel{\sim}{T}}_{i}\right)}{\left(\mathbb{I}\left\{{\stackrel{\sim}{T}}_{i}>t\right\}-{\widehat{S}}_{i}\left(t\right)\right)}^{2}+\frac{{\mathbb{I}}_{\left({\stackrel{\sim}{T}}_{i}>t\right)}}{\widehat{{S}_{C}}\left(t\right)}{\left(\mathbb{I}\left\{{\stackrel{\sim}{T}}_{i}>t\right\}-{\widehat{S}}_{i}\left(t\right)\right)}^{2}\right],$$
where \(\stackrel{\sim}{T}=\text{min}\left(T,C\right)\), \(C\) is the right-censoring time and \(\delta\) is the event indicator (equal to 1 if \(\stackrel{\sim}{T}=T\) and 0 if \(\stackrel{\sim}{T}=C\)).
The contribution of patients who had the event before time \(t\) is \({\left(0-{\widehat{S}}_{i}\left(t\right)\right)}^{2}\), weighted by the inverse probability of being uncensored at time \({\stackrel{\sim}{T}}_{i}\) (first term in the equation above). The contribution of patients who are still at risk at time \(t\) is \({\left(1-{\widehat{S}}_{i}\left(t\right)\right)}^{2}\), weighted by the inverse probability of being uncensored at time \(t\) (second term in the equation).
If the probability of being uncensored beyond \({\stackrel{\sim}{T}}_{i}\) or \(t\) is close to 1, the weight is only slightly bigger than 1. Conversely, if this probability is low, meaning that many patients have been censored before time \({\stackrel{\sim}{T}}_{i}\) or \(t\), the weight is far bigger than 1, thus upweighting the contributions of patients still under observation. Notice that patients who are censored before time \(t\) contribute to neither of the two terms detailed above (both indicators \({\mathbb{I}}_{\left({\stackrel{\sim}{T}}_{i}\le t,{\delta }_{i}=1\right)}\) and \({\mathbb{I}}_{\left({\stackrel{\sim}{T}}_{i}>t\right)}\) are equal to 0 for them); however, censored patients contribute through the estimation of the censoring distribution \(\widehat{{S}_{C}}\). The closer the Brier Score is to 0, the better the overall model accuracy.
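The weighted estimator can be sketched as follows in Python. `km_survival` is our own minimal Kaplan-Meier helper, applied to the censoring indicator \(1-\delta\) to estimate \(\widehat{{S}_{C}}\); the data in the test are hypothetical:

```python
import numpy as np

def km_survival(times, events):
    """Minimal Kaplan-Meier estimator; returns a step function t -> S(t)."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    step_t, step_s, s = [], [], 1.0
    for u in np.unique(times[events == 1]):
        at_risk = np.sum(times >= u)
        d = np.sum((times == u) & (events == 1))
        s *= 1.0 - d / at_risk
        step_t.append(u)
        step_s.append(s)
    def S(t):
        i = np.searchsorted(step_t, t, side="right") - 1
        return step_s[i] if i >= 0 else 1.0
    return S

def ipcw_brier(t, obs_times, delta, surv_pred):
    """IPCW Brier score at horizon t (random censoring assumed)."""
    obs_times = np.asarray(obs_times, float)
    delta = np.asarray(delta, int)
    surv_pred = np.asarray(surv_pred, float)
    S_C = km_survival(obs_times, 1 - delta)        # censoring distribution
    contrib = np.zeros(len(obs_times))
    for i, (ti, di, si) in enumerate(zip(obs_times, delta, surv_pred)):
        if ti <= t and di == 1:                    # died before t: (0 - S_i)^2
            contrib[i] = (0.0 - si) ** 2 / S_C(ti)
        elif ti > t:                               # still at risk: (1 - S_i)^2
            contrib[i] = (1.0 - si) ** 2 / S_C(t)
        # patients censored before t contribute only through S_C
    return float(contrib.mean())
```

Without any censoring, \(\widehat{{S}_{C}}\equiv 1\) and this reduces to the uncensored estimator above, which is a useful sanity check.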
Discrimination
Intuitively, models that can distinguish patients who die shortly after diagnosis from those who survive longer discriminate well.
The C-Index
The C-Index (also known as the Concordance Index) (31)(32)(33) is a commonly used discrimination measure. In the context of the Cox PH model, the C-Index is the probability that a randomly selected subject \(i\) has a higher estimated linear predictor than subject \(j\), given that subject \(i\) died before subject \(j\) (34). In the Cox PH model, the linear predictor \(\eta\) is the quantity combining the regression coefficients and the covariates in the hazard regression equation, \(\lambda \left(t;X\right)={\lambda }_{0}\left(t\right)\text{exp}\left({X}^{T}\beta \right)\) (that is, \(\eta ={X}^{T}\beta\)), where \(X\) is the vector of covariates.
The linear predictor from a Cox PH model uniquely determines the ranking of the risks between two individuals at any time \(t\), since no time-dependent regression coefficients are involved. With a more general model including one or more time-dependent regression coefficients, the ranking is not preserved over time. One can then use the predicted risk (i.e., \(\widehat{Risk}\left({t}^{*},i\right)= 1-\widehat{S}\left({t}^{*};{X}_{i}\right)\)) instead of the linear predictor to calculate the C-index,
$$C= \text{Prob}\left(\widehat{Risk}\left({t}^{*},i\right)>\widehat{Risk}\left({t}^{*},j\right)\mid \text{individual } i \text{ had the event before } j\right),$$
where \(\widehat{Risk}\left({t}^{*},i\right)\) is the estimated risk at time horizon \({t}^{*}\) for individual \(i\) derived from the prognosis model under study.
For a Cox PH model, the ordering of the predicted risks \(\widehat{Risk}\left({t}^{*},i\right), i=1,\dots ,N\) will remain the same for any time horizon \({t}^{*}\), while the ordering might change using a model that includes time-dependent regression coefficient(s).
Calculating the C-index amounts to calculating the proportion of concordant pairs among all usable pairs (34). A pair is concordant when the predicted risk of an event is lower for subject \(j\), who has the event at a later time point than subject \(i\). A concordance probability of 1 means that the model has perfect discrimination, while a value of 0.5 indicates that the model provides no more information than random ordering.
In the absence of censoring, the C-index is easy to calculate as all the required quantities are directly available. However, censoring is more the rule than the exception in most real-life survival applications, which complicates the estimation of the C-index. Indeed, it is impossible to order the event times of interest (e.g., death) for a pair of patients when the smaller observed time of the pair is censored. These pairs of observations are discarded, as they are non-informative for the estimation of the C-index. Therefore, the C-index is affected by the proportion of censored observations. Moreover, it has been shown that the C-index is not a proper measure when interest lies in evaluating the risk of an event after \({t}^{*}\) years (35).
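A naive \(O({n}^{2})\) sketch of the usable-pairs computation in Python (the toy risks and times are our own; ties in predicted risk are counted as half-concordant, which is one common convention among several):

```python
import numpy as np

def c_index(risk, obs_times, delta):
    """C-index: concordant pairs / usable pairs.

    A pair is usable when the subject with the shorter observed time actually
    had the event; pairs where that subject was censored are discarded.
    """
    risk, obs_times, delta = map(np.asarray, (risk, obs_times, delta))
    concordant = usable = 0.0
    n = len(risk)
    for i in range(n):
        for j in range(n):
            if obs_times[i] < obs_times[j] and delta[i] == 1:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1          # higher risk died first
                elif risk[i] == risk[j]:
                    concordant += 0.5        # tie convention
    return concordant / usable

# higher predicted risk for shorter survivors: perfect concordance
c = c_index([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], [1, 1, 1])
```

The dependence on censoring discussed above is visible here: censoring a subject removes every pair in which that subject has the shorter observed time.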
The time-dependent Area Under the Receiver Operating Characteristic Curve
Another tool for assessing discrimination is the time-dependent Area Under the Receiver Operating Characteristic (ROC) Curve, \(AUC\left(t\right)\) (36).
A ROC curve at a specific time point \(t\), \(ROC\left(t\right)\), displays the “true positive rate” (sensitivity) on the y-axis against the “false positive rate” (i.e., 1-specificity) on the x-axis for successive threshold values. Each threshold value is used in turn to classify the patients into two groups at time \(t\), based on their marker \(Q\): among cases, those predicted to have experienced the event of interest by \(t\); and among controls, those predicted not to experience the event by \(t\). The Area Under the ROC Curve, \(AUC\left(t\right)\), represents the probability that the marker or model-based quantity \({Q}_{i}\) of a randomly selected subject \(i\) among cases is greater than \({Q}_{j}\) of a randomly selected subject \(j\) among controls (by convention, we assume that a greater \(Q\) leads to a higher risk). The closer the \(AUC\left(t\right)\) is to 1, the better the discrimination.
The way patients are defined as cases or controls is important for clear definitions of sensitivity and specificity (37). We define cases at time \(t\) as patients who have experienced the event of interest prior to time \(t\), i.e., \(D\left(t\right)=1\) if \(T\le t\) and \(\delta =1\), and controls at time \(t\) as patients who have not yet had the event at time \(t\), i.e., \(D\left(t\right)=0\) if \(T>t\). Following the terminology proposed by Heagerty and Zheng (38), this definition corresponds to the cumulative case/dynamic control \(AUC\left(t\right)\), which will be our focus in the remainder of this paper. Alternatively, one could use the incident case/static control or incident case/dynamic control definitions, but these appear less suitable for evaluating model performance. More details on these definitions can be found in (37).
If we assume a sample without censoring, the steps to estimate \(AUC\left(t\right)\) are as follows. We start with a threshold value \({c}_{1}\). Among those who have had the event prior to time \(t\), we classify patients in the group of “patients predicted by the prognosis model to have the event prior to time \(t\)” if \(Q>{c}_{1}\); this is used to estimate the sensitivity, \(sen\left({c}_{1},t\right)= P\left\{Q>{c}_{1} \right| D\left(t\right)=1\}\). Among those who have not had the event at time \(t\), we classify patients in the group of “patients predicted by the prognosis model not to have the event by time \(t\)” if \(Q\le {c}_{1}\); this is used to estimate the specificity, \(spe\left({c}_{1},t\right)= P\left\{Q\le {c}_{1} \right| D\left(t\right)=0\}\). Repeating these steps with different thresholds yields the ROC curve at time \(t\), \(ROC\left(t\right)\), and hence \(AUC\left(t\right)\).
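In the censoring-free case, sweeping the threshold and integrating the resulting ROC curve is equivalent to comparing all case/control pairs directly. A small Python sketch of both views (function names and marker values are our own illustrations):

```python
import numpy as np

def sens_spec(Q, event_times, t, c):
    """Sensitivity and specificity at horizon t for one threshold c (no censoring)."""
    Q, event_times = np.asarray(Q, float), np.asarray(event_times, float)
    cases, controls = Q[event_times <= t], Q[event_times > t]
    return float((cases > c).mean()), float((controls <= c).mean())

def auc_uncensored(Q, event_times, t):
    """Cumulative/dynamic AUC(t) = P(Q_case > Q_control); ties counted as half."""
    Q, event_times = np.asarray(Q, float), np.asarray(event_times, float)
    cases, controls = Q[event_times <= t], Q[event_times > t]
    wins = sum((qi > qj) + 0.5 * (qi == qj) for qi in cases for qj in controls)
    return float(wins) / (len(cases) * len(controls))
```

The pairwise form makes the probabilistic interpretation of \(AUC\left(t\right)\) given above explicit, without constructing the curve.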
In the definitions above, the quantity \(Q\) can be a model-based quantity, such as the predicted value of the linear predictor \(X\widehat{\beta }\) derived from the prognosis model. It may also be a time-varying quantity, such as \(Q\left(t\right)=X\widehat{\beta \left(t\right)}\) or \(Q\left(t\right)= 1-\widehat{S}\left(t;X\right)\): this proves useful when a general model with time-dependent regression coefficient(s) has been fitted.
We use the estimator of the cumulative/dynamic AUC detailed in (37),
$${\widehat{AUC}}^{C,D}\left(\text{t}\right)= \frac{\sum _{i=1}^{n}\sum _{j=1}^{n}\mathbb{I}\left( {T}_{i}\le t\right)\mathbb{I}\left( {T}_{j}>t\right)\mathbb{I}\left({Q}_{i}\left(t\right)>{Q}_{j}\left(t\right)\right)* \frac{{\delta }_{i}}{\widehat{{S}_{C}}\left({T}_{i}\right)\widehat{{S}_{C}}\left(t\right)}}{{n}^{2}\widehat{S}\left(t\right)\left[1-\widehat{S}\left(t\right)\right]},$$
where \(\widehat{{S}_{C}}\) is the Kaplan-Meier estimator of the survival function of the censoring time \(C\) (thus handling censored observations through Inverse Probability of Censoring Weights (40)), \(\widehat{S}\) is the Kaplan-Meier estimator of \(\mathbb{P}(T>t)\), and \(n\) is the number of patients in the sample.
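This estimator can be transcribed directly, reusing a minimal Kaplan-Meier helper of our own for both \(\widehat{{S}_{C}}\) (events replaced by censorings) and \(\widehat{S}\). A naive \(O({n}^{2})\) Python sketch; the test data are hypothetical:

```python
import numpy as np

def km_survival(times, events):
    """Minimal Kaplan-Meier estimator; returns a step function t -> S(t)."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    step_t, step_s, s = [], [], 1.0
    for u in np.unique(times[events == 1]):
        s *= 1.0 - np.sum((times == u) & (events == 1)) / np.sum(times >= u)
        step_t.append(u)
        step_s.append(s)
    def S(t):
        i = np.searchsorted(step_t, t, side="right") - 1
        return step_s[i] if i >= 0 else 1.0
    return S

def ipcw_auc(t, obs_times, delta, Q):
    """Cumulative/dynamic AUC(t) with IPCW weights, following the formula above."""
    obs_times, delta, Q = (np.asarray(a, float) for a in (obs_times, delta, Q))
    S_C = km_survival(obs_times, 1 - delta)   # censoring survival S_C-hat
    S = km_survival(obs_times, delta)         # event-time survival S-hat
    n = len(obs_times)
    num = 0.0
    for i in range(n):
        if obs_times[i] <= t and delta[i] == 1:          # case by time t
            w = 1.0 / (S_C(obs_times[i]) * S_C(t))       # IPCW weight
            for j in range(n):
                if obs_times[j] > t and Q[i] > Q[j]:     # concordant with a control
                    num += w
    s_t = S(t)
    return num / (n**2 * s_t * (1.0 - s_t))
```

Without censoring, \(\widehat{{S}_{C}}\equiv 1\) and the expression reduces to the empirical \(P({Q}_{case}>{Q}_{control})\), matching the censoring-free pairwise estimator.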
Calibration
Calibration refers to the agreement between observed risks and model-based predicted risks, that is, probabilities of death. In other words, calibration assesses how closely the predicted risks agree with the observed ones (41). A prognosis model is considered well calibrated when the number of observed deaths corresponds to the number of deaths predicted by the model; good calibration is essential for reliable risk prediction. To assess a model’s calibration (42), one can draw a calibration plot: divide the predicted risks into, say, 5 or 10 groups using percentiles of the predicted probabilities of death and compare, for each group, the average predicted risk to the observed risk. Good calibration is achieved when the plotted points lie close to the diagonal. The different calibration measures available in the context of the Cox model can be found in McLernon et al. (43).
A complementary way to assess calibration for a prognosis model is to compare Kaplan-Meier survival estimates to model-based prediction of survival curves among pre-defined subgroups as defined by covariate values (for example, dividing the sample into 5 age-groups). Model-based prediction for a given subgroup is obtained by averaging the model-based survival predictions of each individual from that subgroup.
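The grouped comparison behind a calibration plot can be sketched as follows (for simplicity this ignores censoring; with censored data the observed risk per group would instead come from one minus the group's Kaplan-Meier estimate at the horizon). The function name and data are our own:

```python
import numpy as np

def calibration_groups(pred_risk, died_by_t, n_groups=5):
    """Mean predicted vs observed risk within quantile groups of predicted risk."""
    pred_risk = np.asarray(pred_risk, float)
    died_by_t = np.asarray(died_by_t, float)     # 1 if dead by horizon t, else 0
    order = np.argsort(pred_risk, kind="stable") # sort patients by predicted risk
    return [(float(pred_risk[g].mean()), float(died_by_t[g].mean()))
            for g in np.array_split(order, n_groups)]
```

Plotting the resulting (predicted, observed) pairs gives the calibration plot described above: points near the diagonal indicate good calibration.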
Implementation
All computations were carried out using the R programming language (44). The C-index and the Brier score were already implemented for the Cox model in the package dynpred (45), and we adapted these functions for the B-spline-based model and the GH model. We also implemented the calculation of the AUC for these models. These functions are provided in a GitHub repository: https://github.com/margueritefournier/Perf_measures_survival_models.