This dataset is freely available in the spBayesSurv R package and was first analyzed by Henderson et al. (2002) (22). It includes 1043 patients diagnosed with acute myeloid leukemia (AML), with the time until death or last known follow-up \(T\) (in years) and the corresponding event indicator \(\delta\) (0 for censored patients and 1 for deaths). We aim to predict survival after AML diagnosis from the following covariates: age at diagnosis (continuous), sex, white blood cell count (WBC) at diagnosis (\(50\times {10}^{9}/L\); continuous) and the Townsend deprivation score (continuous), an area-based measure of deprivation for the enumeration district of residence (higher values indicating more deprived areas).
Method
Hazard regression models
In this section, we present the two regression models that we consider as potential alternatives to the Cox PH model.
Flexible hazard-based regression model using regression splines
This model uses regression spline functions to model the baseline hazard function and (possibly) time-dependent and/or non-linear effect(s) of covariates. Spline functions are flexible mathematical functions built from degree-\(D\) polynomial segments joined at \(K\) junction points called knots (6)(10)(23)(24)(25). A simple example with two covariates \({(x}_{1},{x}_{2})\), where \({x}_{2}\) is a continuous variable (such as age), illustrates the parametrisation of the model for the hazard
$$\lambda \left(t|{x}_{1},{x}_{2}\right)= {\lambda }_{0}\left(t\right)*\text{exp}\left({\beta }_{1}{x}_{1}+ g\left(t\right)*{x}_{2}+j\left({x}_{2}\right)\right),$$
where \({\beta }_{1}\) is the regression coefficient for \({x}_{1}\), while \(g\left(t\right)\), \(j\left({x}_{2}\right)\) and the log baseline hazard \(\text{log}\left({\lambda }_{0}\left(t\right)\right)\) are B-spline functions. The degree of the spline (\(D\)) as well as the number and locations of the (\(K\)) knots need to be specified by the user and may differ for \(g\), \(j\) and \({\lambda }_{0}\).
The use of flexible functions to describe the baseline hazard, non-proportional effects of variables and non-linear functional forms for continuous variables allows smooth and more realistic hazard shapes to be modelled while keeping the model parsimonious. One drawback of this model is that the number and position of the knots for the spline functions must be specified. To have enough flexibility, we choose a cubic B-spline with one knot located at the median of the observed event times for the spline functions of time (i.e., the baseline hazard \({\lambda }_{0}\left(t\right)\) and the time-dependent effect \(g\left(t\right)\)). This knot location ensures that the same amount of information is available on both sides of the knot. For the non-linear functional form of a continuous variable, the user may choose another type of spline, for example a restricted cubic spline with 2 knots located at the 33rd and 66th percentiles of the distribution of \({x}_{2}\). The more knots there are, the greater the flexibility of the model, but at a higher risk of overfitting.
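To make the spline construction concrete, the cubic B-spline basis with a single interior knot at the median event time can be evaluated with the Cox-de Boor recursion. The paper's analyses are in R; the sketch below is a minimal illustration in Python, and the function name and toy event times are our own:

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions at points x (Cox-de Boor recursion).

    `knots` is the full knot vector: boundary knots repeated degree+1 times,
    with the interior knot(s) in between ("clamped" spline).
    """
    x = np.asarray(x, dtype=float)
    k = np.asarray(knots, dtype=float)
    # degree 0: indicator function of each knot span
    B = np.zeros((len(x), len(k) - 1))
    for i in range(len(k) - 1):
        B[:, i] = (k[i] <= x) & (x < k[i + 1])
    # include the right endpoint in the last non-degenerate span
    last = int(np.max(np.where(k[:-1] < k[1:])[0]))
    B[x == k[-1], last] = 1.0
    # raise the degree one step at a time
    for d in range(1, degree + 1):
        B_new = np.zeros((len(x), len(k) - d - 1))
        for i in range(len(k) - d - 1):
            left = (x - k[i]) / (k[i + d] - k[i]) * B[:, i] if k[i + d] > k[i] else 0.0
            right = ((k[i + d + 1] - x) / (k[i + d + 1] - k[i + 1]) * B[:, i + 1]
                     if k[i + d + 1] > k[i + 1] else 0.0)
            B_new[:, i] = left + right
        B = B_new
    return B

# toy observed event times (years); one interior knot at the median, as in the text
times = np.array([0.1, 0.4, 0.9, 1.5, 2.2, 3.0, 4.8, 6.0])
knots = np.r_[[0.0] * 4, np.median(times), [times.max()] * 4]
basis = bspline_basis(times, knots, degree=3)   # one column per basis function
```

Each column of `basis` enters the linear predictor with its own coefficient; a standard sanity check is that the rows sum to 1 (partition of unity) over the observation window.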
General hazard structure model with a flexible baseline hazard
This general hazard (GH) structure (15) represents a rich class of hazard-based regression models, which is formulated in terms of the hazard function (exemplified with two covariates \({x}_{1}\) and \({x}_{2}\) here):
$${\lambda }^{GH}\left(t;{x}_{1},{x}_{2}\right)= {\lambda }_{0}\left(t\text{exp}\left({\beta }_{11}{x}_{1}+{\beta }_{12}{x}_{2}\right)\right)\text{exp}\left({{\beta }_{21}x}_{1}+{\beta }_{22}{x}_{2}\right),$$
where \({\lambda }_{0}\left(t\right)\) is the baseline hazard function, which can be modelled using a pre-specified parametric distribution (see below), and \(\{{\beta }_{11},{\beta }_{21}\}\) and \(\{{\beta }_{12},{\beta }_{22}\}\) are the regression coefficients of the variables \({x}_{1}\) and \({x}_{2}\), respectively. The parameters \(\{{\beta }_{11},{\beta }_{12}\}\) allow for the inclusion of time-level effects of the covariates \({x}_{1}\) and \({x}_{2}\).
The GH model includes as special cases: the proportional hazards model (PH, when \({\beta }_{1j}=0\)), the accelerated failure time model (AFT, when \({\beta }_{1j}= {\beta }_{2j}\)), the accelerated hazards model (AH, when \({\beta }_{2j}=0\)) and the hybrid hazard model (HH, when the covariates included in the argument of the baseline hazard differ from those in the exponential term multiplying the baseline hazard). The GH model is identifiable provided that the baseline hazard is not the hazard function of the Weibull distribution (14).
To complete the specification of the GH model, we need to select a parametric distribution for the baseline hazard. Here, we will use the Power Generalized Weibull (PGW) distribution instead of Exponentiated Weibull distribution originally proposed (15), as it is easier to implement while being equally flexible (26). The PGW distribution is a 3-parameter distribution which is defined according to the following equations for the density, survival, and hazard functions:
$$f\left(t;\sigma ,\nu ,\gamma \right)=\frac{\nu }{{\gamma \sigma }^{\nu }}{t}^{\nu -1}{\left[1+{\left(\frac{t}{\sigma }\right)}^{\nu }\right]}^{\left(\frac{1}{\gamma }-1\right)}\text{exp}\left\{1-{\left[1+{\left(\frac{t}{\sigma }\right)}^{\nu }\right]}^{\frac{1}{\gamma }}\right\},$$
$$S\left(t;\sigma ,\nu ,\gamma \right)= \text{exp}\left\{1-{\left[1+{\left(\frac{t}{\sigma }\right)}^{\nu }\right]}^{\frac{1}{\gamma }}\right\},$$
$$\lambda \left(t;\sigma ,\nu ,\gamma \right)=\frac{\nu }{{\gamma \sigma }^{\nu }}{t}^{\nu -1}{\left[1+{\left(\frac{t}{\sigma }\right)}^{\nu }\right]}^{\left(\frac{1}{\gamma }-1\right)},$$
where \(f\left(t;\sigma ,\nu ,\gamma \right)\) is the probability density function, \(S\left(t;\sigma ,\nu ,\gamma \right)\) is the survival function and \(\lambda \left(t;\sigma ,\nu ,\gamma \right)\) is the hazard function; \(\sigma >0\) is a scale parameter and \(\nu ,\gamma >0\) are shape parameters.
The hazard function associated with the PGW distribution can capture several shapes: constant, increasing, decreasing, bathtub and unimodal, which are the shapes usually observed in cancer survival applications (15). Implementations of the GH model with PGW baseline hazard (among others) can also be found in the R packages GHSurv and HazReg.
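The PGW functions above, and the GH hazard built on them, translate directly into code. The following is a minimal Python sketch (the implementations cited in the text are R packages; all names here are our own), where `gh_hazard` evaluates \({\lambda }^{GH}\left(t;x\right)={\lambda }_{0}\left(t\,\text{exp}\left({x}^{T}{\beta }_{1}\right)\right)\text{exp}\left({x}^{T}{\beta }_{2}\right)\) with a PGW baseline:

```python
import numpy as np

def pgw_hazard(t, sigma, nu, gamma):
    """PGW hazard: (nu / (gamma * sigma^nu)) * t^(nu-1) * [1 + (t/sigma)^nu]^(1/gamma - 1)."""
    t = np.asarray(t, dtype=float)
    return nu / (gamma * sigma**nu) * t**(nu - 1) * (1 + (t / sigma)**nu)**(1 / gamma - 1)

def pgw_survival(t, sigma, nu, gamma):
    """PGW survival: exp{1 - [1 + (t/sigma)^nu]^(1/gamma)}."""
    t = np.asarray(t, dtype=float)
    return np.exp(1 - (1 + (t / sigma)**nu)**(1 / gamma))

def gh_hazard(t, x, beta1, beta2, sigma, nu, gamma):
    """GH hazard: PGW baseline evaluated at t * exp(x'beta1), multiplied by exp(x'beta2)."""
    return pgw_hazard(t * np.exp(x @ beta1), sigma, nu, gamma) * np.exp(x @ beta2)
```

With \(\nu =\gamma =1\) the PGW hazard is constant at \(1/\sigma\) (the exponential special case), and setting `beta1 = 0` in `gh_hazard` recovers a PH model, consistent with the special cases listed above.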
Model-building strategy
To select between different regression models within a given class of models (i.e., spline-based models or GH models), we use the Akaike Information Criterion (AIC). We start by fitting several pre-specified candidate models, based on clinical knowledge of the disease, defined with or without time-dependent effects for some covariates (specified in the Results section), and then we choose the model with the smallest AIC among the set of candidate models. For comparison purposes, we also fit the Cox model assuming proportional hazards for each covariate in the model.
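Given the maximized log-likelihood and the number of parameters of each fitted candidate, the AIC comparison amounts to the following small sketch in Python (the candidate names, log-likelihoods and parameter counts below are purely hypothetical placeholders, not results from the paper):

```python
# AIC = 2k - 2*log-likelihood; the model with the smallest AIC is retained.
# All numbers below are made-up placeholders for illustration only.
candidates = {
    "PH effects only":                {"loglik": -1520.3, "k": 5},
    "time-dependent effect for age":  {"loglik": -1511.8, "k": 8},
    "time-dependent effects for all": {"loglik": -1510.9, "k": 17},
}
aic = {name: 2 * m["k"] - 2 * m["loglik"] for name, m in candidates.items()}
best = min(aic, key=aic.get)  # candidate retained for prediction
```

Note that the richest model does not necessarily win: the extra parameters must buy enough log-likelihood to offset the \(2k\) penalty.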
Evaluating the predictive abilities of hazard-based regression models for survival data
Several performance measures have been developed to compare regression models in terms of their ability to predict individual survival probabilities at a pre-specified time horizon \({t}^{*}\). In this paper, because we aim to focus on describing the predictive performance measures, we assume that the predictive models have been built on a different dataset and are then validated on the LeukSurv dataset. If a model were built and validated on the same dataset, specific techniques such as cross-validation (16) would be needed to account for the optimism of the performance measures.
The performance measures can be broadly classified into three classes (18)(2): overall performance measures, discrimination measures and calibration measures. Overall performance measures estimate the accuracy of prediction at time \(t\) by calculating the difference between predicted and observed outcomes for all individuals at time \(t\) using a loss function (that is, the “prediction error”). Discrimination measures reflect the model’s ability to separate individuals who experienced the event from those who did not. Calibration measures compare the survival predictions for specific subgroups to their observed survival. We describe these performance measures in some detail before using them in our illustrative example.
Overall performance measure: prediction error estimated with the Brier Score
The Brier Score at time \(t\) (27) measures the mean quadratic difference between the predicted survival probability for subject \(i\) at time \(t\) and the observed event status for the same subject \(i\) at time \(t\). It quantifies the degree to which the predicted values coincide with the observed outcomes. Without censoring it can be estimated at time \(t\) by (28)(29)
$$\widehat{BS\left(t\right)}= \frac{1}{N}\sum _{i=1}^{N}{\left(\mathbb{I}\left\{{T}_{i}>t\right\}-{\widehat{S}}_{i}\left(t\right)\right)}^{2},$$
where \({\widehat{S}}_{i}\left(t\right)\) is the model-based predicted probability of surviving beyond time \(t\) for individual \(i\), and the indicator \(\mathbb{I}\left\{{T}_{i}>t\right\}\) is equal to 1 when patient \(i\) is known to be alive at time \(t\) and equal to 0 if patient \(i\) died before or at \(t\).
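The censoring-free estimator is a one-liner; a small Python sketch (the function name and the four-patient toy data are our own):

```python
import numpy as np

def brier_score_uncensored(event_times, surv_pred, t):
    """BS(t) = (1/N) * sum_i (I{T_i > t} - S_hat_i(t))^2, censoring-free case."""
    alive_at_t = (np.asarray(event_times, dtype=float) > t).astype(float)
    return float(np.mean((alive_at_t - np.asarray(surv_pred, dtype=float)) ** 2))

# toy example: 4 patients, predicted survival evaluated at horizon t = 2.5 years
bs = brier_score_uncensored([1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.8, 0.9], t=2.5)
```

As a point of reference, a non-informative model predicting 0.5 for everyone gets \(BS\left(t\right)=0.25\) whatever the outcomes.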
When the sample contains right-censored observations, the indicator \(\mathbb{I}\left\{{T}_{i}>t\right\}\) cannot always be computed. Indeed, if patient \(i\) is censored before \(t\), \(\mathbb{I}\left\{{T}_{i}>t\right\}\) is unknown because \({T}_{i}\) is not observed. A proposed solution to handle right censoring is to weight each contribution to the Brier Score (30), assuming that censored patients can be represented by patients with complete information. These weights are defined through the Kaplan-Meier estimate of the survival function of the censoring times, which represents the probability of being uncensored at time \(t\), \(\widehat{{S}_{C}}\left(t\right)=\widehat{P}(C>t)\). An estimator of the Brier Score that accounts for censoring (assuming random censoring) is then given by:
$$\widehat{BS\left(t\right)}=\frac{1}{N}\sum _{i=1}^{N}\left[\frac{{\mathbb{I}}_{\left({\stackrel{\sim}{T}}_{i}\le t,{\delta }_{i}=1\right)}}{\widehat{{S}_{C}}\left({\stackrel{\sim}{T}}_{i}\right)}{\left(\mathbb{I}\left\{{\stackrel{\sim}{T}}_{i}>t\right\}-{\widehat{S}}_{i}\left(t\right)\right)}^{2}+\frac{{\mathbb{I}}_{\left({\stackrel{\sim}{T}}_{i}>t\right)}}{\widehat{{S}_{C}}\left(t\right)}{\left(\mathbb{I}\left\{{\stackrel{\sim}{T}}_{i}>t\right\}-{\widehat{S}}_{i}\left(t\right)\right)}^{2}\right],$$
where \(\stackrel{\sim}{T}=\text{min}\left(T,C\right)\), \(C\) is the right-censoring time and \(\delta\) is the event indicator (equal to 1 if \(\stackrel{\sim}{T}=T\) and 0 if \(\stackrel{\sim}{T}=C\)).
The contribution of patients who had the event before time \(t\) is \({\left(0-{\widehat{S}}_{i}\left(t\right)\right)}^{2}\), weighted by the inverse probability of being uncensored at time \({\stackrel{\sim}{T}}_{i}\) (first term in the equation above). The contribution of patients who are still at risk at time \(t\) is \({\left(1-{\widehat{S}}_{i}\left(t\right)\right)}^{2}\), weighted by the inverse probability of being uncensored at time \(t\) (second term in the equation).
If the probability of being uncensored beyond \({\stackrel{\sim}{T}}_{i}\) or \(t\) is close to 1, the weight is only slightly bigger than 1. Conversely, if this probability is low, meaning that many patients have been censored before time \({\stackrel{\sim}{T}}_{i}\) or \(t\), the weight is far bigger than 1, thus upweighting the contributions of patients still under observation. Notice that patients who are censored before time \(t\) contribute to neither of the two terms detailed above (both indicators \({\mathbb{I}}_{\left({\stackrel{\sim}{T}}_{i}\le t,{\delta }_{i}=1\right)}\) and \({\mathbb{I}}_{\left({\stackrel{\sim}{T}}_{i}>t\right)}\) are equal to 0 for them); however, censored patients contribute through the estimation of the censoring distribution \(\widehat{{S}_{C}}\). The closer the Brier Score is to 0, the better the overall model accuracy.
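The weighted estimator can be sketched as follows in Python. `km_survival` is our own minimal Kaplan-Meier helper, applied to the censoring indicator \(1-\delta\) to estimate \(\widehat{{S}_{C}}\); the data in the test are hypothetical:

```python
import numpy as np

def km_survival(times, events):
    """Minimal Kaplan-Meier estimator; returns a step function t -> S(t)."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    step_t, step_s, s = [], [], 1.0
    for u in np.unique(times[events == 1]):
        at_risk = np.sum(times >= u)
        d = np.sum((times == u) & (events == 1))
        s *= 1.0 - d / at_risk
        step_t.append(u)
        step_s.append(s)
    def S(t):
        i = np.searchsorted(step_t, t, side="right") - 1
        return step_s[i] if i >= 0 else 1.0
    return S

def ipcw_brier(t, obs_times, delta, surv_pred):
    """IPCW Brier score at horizon t (random censoring assumed)."""
    obs_times = np.asarray(obs_times, float)
    delta = np.asarray(delta, int)
    surv_pred = np.asarray(surv_pred, float)
    S_C = km_survival(obs_times, 1 - delta)        # censoring distribution
    contrib = np.zeros(len(obs_times))
    for i, (ti, di, si) in enumerate(zip(obs_times, delta, surv_pred)):
        if ti <= t and di == 1:                    # died before t: (0 - S_i)^2
            contrib[i] = (0.0 - si) ** 2 / S_C(ti)
        elif ti > t:                               # still at risk: (1 - S_i)^2
            contrib[i] = (1.0 - si) ** 2 / S_C(t)
        # patients censored before t contribute only through S_C
    return float(contrib.mean())
```

Without any censoring, \(\widehat{{S}_{C}}\equiv 1\) and this reduces to the uncensored estimator above, which is a useful sanity check.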
Discrimination
Intuitively, models that can distinguish patients who die shortly after diagnosis from those who survive longer discriminate well.
The C-Index
The C-Index (also known as the Concordance Index) (31)(32)(33) is a commonly used discrimination measure. In the context of the Cox PH model, the C-Index is the probability that a randomly selected subject \(i\) has a higher estimated linear predictor than subject \(j\), given that subject \(i\) died before subject \(j\) (34). In the Cox PH model, the linear predictor \(\eta\) is the quantity combining the regression coefficients and the covariates in the hazard regression equation, \(\lambda \left(t;X\right)={\lambda }_{0}\left(t\right)\text{exp}\left({X}^{T}\beta \right)\) (that is, \(\eta ={X}^{T}\beta\)), where \(X\) is the vector of covariates.
The linear predictor from a Cox PH model uniquely determines the ranking of the risks between two individuals at any time \(t\), since no time-dependent regression coefficients are involved. With a more general model including one or more time-dependent regression coefficients, the ranking is not preserved over time. One can then use the predicted risk (i.e., \(\widehat{Risk}\left({t}^{*},i\right)= 1-\widehat{S}\left({t}^{*};{X}_{i}\right)\)) instead of the linear predictor to calculate the C-index,
$$C= \text{Prob}\left(\widehat{Risk}\left({t}^{*},i\right)>\widehat{Risk}\left({t}^{*},j\right)\mid \text{individual } i \text{ had the event before } j\right),$$
where \(\widehat{Risk}\left({t}^{*},i\right)\) is the estimated risk at time horizon \({t}^{*}\) for individual \(i\) derived from the prognosis model under study.
For a Cox PH model, the ordering of the predicted risks \(\widehat{Risk}\left({t}^{*},i\right), i=1,\dots ,N\) will remain the same for any time horizon \({t}^{*}\), while the ordering might change using a model that includes time-dependent regression coefficient(s).
Calculating the C-index amounts to calculating the proportion of concordant pairs among all usable pairs (34). A pair is concordant when the predicted risk of an event is lower for subject \(j\), who has the event at a later time point than subject \(i\). A concordance probability of 1 means that the model has perfect discrimination, while a value of 0.5 indicates that the model provides no more information than random ordering.
In the absence of censoring, the C-index is easy to calculate as all the required quantities are directly available. However, censoring is more the rule than the exception in most real-life survival applications, which complicates the estimation of the C-index. Indeed, it is impossible to order the event times of interest (e.g., death) for a pair of patients when the smaller observed time of the pair is censored. These pairs of observations are discarded, as they are non-informative for the estimation of the C-index. Therefore, the C-index is affected by the proportion of censored observations. Moreover, it has been shown that the C-index is not a proper measure when interest lies in evaluating the risk of an event after \({t}^{*}\) years (35).
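A naive \(O({n}^{2})\) sketch of the usable-pairs computation in Python (the toy risks and times are our own; ties in predicted risk are counted as half-concordant, which is one common convention among several):

```python
import numpy as np

def c_index(risk, obs_times, delta):
    """C-index: concordant pairs / usable pairs.

    A pair is usable when the subject with the shorter observed time actually
    had the event; pairs where that subject was censored are discarded.
    """
    risk, obs_times, delta = map(np.asarray, (risk, obs_times, delta))
    concordant = usable = 0.0
    n = len(risk)
    for i in range(n):
        for j in range(n):
            if obs_times[i] < obs_times[j] and delta[i] == 1:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1          # higher risk died first
                elif risk[i] == risk[j]:
                    concordant += 0.5        # tie convention
    return concordant / usable

# higher predicted risk for shorter survivors: perfect concordance
c = c_index([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], [1, 1, 1])
```

The dependence on censoring discussed above is visible here: censoring a subject removes every pair in which that subject has the shorter observed time.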
The time-dependent Area Under the Receiver Operating Characteristic Curve
Another tool for assessing discrimination is the time-dependent Area Under the Receiver Operating Characteristic (ROC) Curve, \(AUC\left(t\right)\) (36).
A ROC curve at a specific time point \(t\), \(ROC\left(t\right)\), displays the “true positive rate” (sensitivity) on the y-axis against the “false positive rate” (i.e., 1-specificity) on the x-axis for successive threshold values. Each threshold value is used in turn to classify the patients into two groups at time \(t\), based on their marker \(Q\): among cases, those predicted to have experienced the event of interest by \(t\); and among controls, those predicted not to experience the event by \(t\). The Area Under the ROC Curve, \(AUC\left(t\right)\), represents the probability that the marker or model-based quantity \({Q}_{i}\) of a randomly selected subject \(i\) among cases is greater than \({Q}_{j}\) of a randomly selected subject \(j\) among controls (by convention, we assume that a greater \(Q\) leads to a higher risk). The closer the \(AUC\left(t\right)\) is to 1, the better the discrimination.
The way patients are defined as cases or controls is important for clear definitions of sensitivity and specificity (37). We define cases at time \(t\) as patients who have experienced the event of interest prior to time \(t\), i.e., \(D\left(t\right)=1\) if \(T\le t\) and \(\delta =1\), and controls at time \(t\) as patients who have not yet had the event at time \(t\), i.e., \(D\left(t\right)=0\) if \(T>t\). Following the terminology proposed by Heagerty and Zheng (38), this definition corresponds to the cumulative case/dynamic control \(AUC\left(t\right)\), which will be our focus in the remainder of this paper. Alternatively, one could use the incident case/static control or incident case/dynamic control definitions, but these appear less suitable for evaluating model performance. More details on these definitions can be found in (37).
If we assume a sample without censoring, the steps to estimate \(AUC\left(t\right)\) are as follows. We start with a threshold value \({c}_{1}\). Among those who have had the event prior to time \(t\), we classify patients in the group of “patients predicted by the prognosis model to have the event prior to time \(t\)” if \(Q>{c}_{1}\); this is used to estimate the sensitivity, \(sen\left({c}_{1},t\right)= P\left\{Q>{c}_{1} \right| D\left(t\right)=1\}\). Among those who have not had the event at time \(t\), we classify patients in the group of “patients predicted by the prognosis model not to have the event by time \(t\)” if \(Q\le {c}_{1}\); this is used to estimate the specificity, \(spe\left({c}_{1},t\right)= P\left\{Q\le {c}_{1} \right| D\left(t\right)=0\}\). Repeating these steps with different thresholds yields the ROC curve at time \(t\), \(ROC\left(t\right)\), and hence \(AUC\left(t\right)\).
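In the censoring-free case, sweeping the threshold and integrating the resulting ROC curve is equivalent to comparing all case/control pairs directly. A small Python sketch of both views (function names and marker values are our own illustrations):

```python
import numpy as np

def sens_spec(Q, event_times, t, c):
    """Sensitivity and specificity at horizon t for one threshold c (no censoring)."""
    Q, event_times = np.asarray(Q, float), np.asarray(event_times, float)
    cases, controls = Q[event_times <= t], Q[event_times > t]
    return float((cases > c).mean()), float((controls <= c).mean())

def auc_uncensored(Q, event_times, t):
    """Cumulative/dynamic AUC(t) = P(Q_case > Q_control); ties counted as half."""
    Q, event_times = np.asarray(Q, float), np.asarray(event_times, float)
    cases, controls = Q[event_times <= t], Q[event_times > t]
    wins = sum((qi > qj) + 0.5 * (qi == qj) for qi in cases for qj in controls)
    return float(wins) / (len(cases) * len(controls))
```

The pairwise form makes the probabilistic interpretation of \(AUC\left(t\right)\) given above explicit, without constructing the curve.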
In the definitions above, the quantity \(Q\) can be a model-based quantity, such as the predicted value of the linear predictor \(X\widehat{\beta }\) derived from the prognosis model. It may also be a time-varying quantity, such as \(Q\left(t\right)=X\widehat{\beta \left(t\right)}\) or \(Q\left(t\right)= 1-\widehat{S}\left(t;X\right)\): this proves useful when a general model with time-dependent regression coefficient(s) has been fitted.
We use the estimator of the cumulative/dynamic AUC detailed in (37),
$${\widehat{AUC}}^{C,D}\left(\text{t}\right)= \frac{\sum _{i=1}^{n}\sum _{j=1}^{n}\mathbb{I}\left( {T}_{i}\le t\right)\mathbb{I}\left( {T}_{j}>t\right)\mathbb{I}\left({Q}_{i}\left(t\right)>{Q}_{j}\left(t\right)\right)* \frac{{\delta }_{i}}{\widehat{{S}_{C}}\left({T}_{i}\right)\widehat{{S}_{C}}\left(t\right)}}{{n}^{2}\widehat{S}\left(t\right)\left[1-\widehat{S}\left(t\right)\right]},$$
where \(\widehat{{S}_{C}}\) is the Kaplan-Meier estimator of the survival function of the censoring time \(C\) (thus handling censored observations through Inverse Probability of Censoring Weights (40)), \(\widehat{S}\) is the Kaplan-Meier estimator of \(\mathbb{P}(T>t)\), and \(n\) is the number of patients in the sample.
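This estimator can be transcribed directly, reusing a minimal Kaplan-Meier helper of our own for both \(\widehat{{S}_{C}}\) (events replaced by censorings) and \(\widehat{S}\). A naive \(O({n}^{2})\) Python sketch; the test data are hypothetical:

```python
import numpy as np

def km_survival(times, events):
    """Minimal Kaplan-Meier estimator; returns a step function t -> S(t)."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    step_t, step_s, s = [], [], 1.0
    for u in np.unique(times[events == 1]):
        s *= 1.0 - np.sum((times == u) & (events == 1)) / np.sum(times >= u)
        step_t.append(u)
        step_s.append(s)
    def S(t):
        i = np.searchsorted(step_t, t, side="right") - 1
        return step_s[i] if i >= 0 else 1.0
    return S

def ipcw_auc(t, obs_times, delta, Q):
    """Cumulative/dynamic AUC(t) with IPCW weights, following the formula above."""
    obs_times, delta, Q = (np.asarray(a, float) for a in (obs_times, delta, Q))
    S_C = km_survival(obs_times, 1 - delta)   # censoring survival S_C-hat
    S = km_survival(obs_times, delta)         # event-time survival S-hat
    n = len(obs_times)
    num = 0.0
    for i in range(n):
        if obs_times[i] <= t and delta[i] == 1:          # case by time t
            w = 1.0 / (S_C(obs_times[i]) * S_C(t))       # IPCW weight
            for j in range(n):
                if obs_times[j] > t and Q[i] > Q[j]:     # concordant with a control
                    num += w
    s_t = S(t)
    return num / (n**2 * s_t * (1.0 - s_t))
```

Without censoring, \(\widehat{{S}_{C}}\equiv 1\) and the expression reduces to the empirical \(P({Q}_{case}>{Q}_{control})\), matching the censoring-free pairwise estimator.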
Calibration
Calibration refers to the agreement between observed risks and model-based predicted risks, that is, probabilities of death. In other words, calibration assesses how closely the predicted risks agree with the observed ones (41). A prognosis model is considered well calibrated when the number of observed deaths corresponds to the number of deaths predicted by the model; good calibration is essential for reliable risk prediction. To assess a model’s calibration (42), one can draw a calibration plot: divide the predicted risks into, say, 5 or 10 groups using percentiles of the predicted probabilities of death and compare, for each group, the average predicted risk to the observed risk. Good calibration is achieved when the plotted points lie close to the diagonal. The different calibration measures available in the context of the Cox model can be found in McLernon et al. (43).
A complementary way to assess calibration for a prognosis model is to compare Kaplan-Meier survival estimates to model-based prediction of survival curves among pre-defined subgroups as defined by covariate values (for example, dividing the sample into 5 age-groups). Model-based prediction for a given subgroup is obtained by averaging the model-based survival predictions of each individual from that subgroup.
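The grouped comparison behind a calibration plot can be sketched as follows (for simplicity this ignores censoring; with censored data the observed risk per group would instead come from one minus the group's Kaplan-Meier estimate at the horizon). The function name and data are our own:

```python
import numpy as np

def calibration_groups(pred_risk, died_by_t, n_groups=5):
    """Mean predicted vs observed risk within quantile groups of predicted risk."""
    pred_risk = np.asarray(pred_risk, float)
    died_by_t = np.asarray(died_by_t, float)     # 1 if dead by horizon t, else 0
    order = np.argsort(pred_risk, kind="stable") # sort patients by predicted risk
    return [(float(pred_risk[g].mean()), float(died_by_t[g].mean()))
            for g in np.array_split(order, n_groups)]
```

Plotting the resulting (predicted, observed) pairs gives the calibration plot described above: points near the diagonal indicate good calibration.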
Implementation
All computations were carried out using the R programming language (44). The C-index and the Brier score were already implemented for the Cox model in the package dynpred (45), and we adapted these functions for the B-spline-based model and the GH model. We also implemented the calculation of the AUC for these models. These functions are provided in a GitHub repository: https://github.com/margueritefournier/Perf_measures_survival_models.