A Comparison of Methods for Disease Progression Prediction Through a GoDARTS Study

Background: In recent years, a variety of machine learning methods, e.g. random forests or neural networks, have been employed to predict disease progression, but how do they compare to, and are they direct substitutes for, more traditional statistical methods such as the Cox proportional hazards model? In this paper, we compare three of the most commonly used approaches to modelling disease progression. We consider a type 2 diabetes case from a cohort-based population in Tayside, UK. In this study, the time until a patient goes onto insulin treatment is of interest; in particular, discriminating between slow and fast progression. This means that we are interested both in the results as a raw time-to-insulin prediction and in a dichotomized outcome, which turns the prediction into a classification. Methods: Three different methods for prediction are considered: a Cox proportional hazards model, a random forest for survival data, and a neural network on the dichotomized outcome. The performance is evaluated using survival performance measures (concordance indices and the integrated Brier score) and using the accuracy, sensitivity, specificity, and Matthews correlation coefficient for the corresponding classification problems. Results: We found no improvement when using the conditional inference forest over the Cox model. The neural network outperformed the conditional inference forest in the classification problem. We discuss the limitations of the three approaches and where each excels in terms of prediction performance, interpretation, and how they handle data imbalance.


Background
In the medical literature, there are typically three approaches to modeling a time-to-event study where it is of interest to predict the time until a patient receives treatment, e.g. goes on insulin. The first approach is to use parametric or semi-parametric models, of which the most common is the Cox proportional hazards model, which directly models the time-to-event for each subject [1]. Limiting ourselves to the diabetes literature, as our case concerns diabetes, this approach has for example been used to predict incident diabetes in [2], the risk of diabetes in [3], the incidence of cardiovascular events, diabetes, and death in [4], the risk of type 2 diabetes in [5], and progression to diabetes in [6].
The second approach is to use newer developments like decision trees or random forest survival models [7][8][9]. These model the survival time in a non-parametric way by splitting the data into smaller subsets. A survival tree has been used to analyse disease progression in type 1 diabetes in [10]. Random forests for survival data have for example been used to predict the risk of diabetes complications [11] and the risk of diabetic retinopathy [12].
The third approach is to predict a given time step ahead, e.g. 1, 2, or 5 years, using simpler linear models or non-linear machine learning methods. Linear regression models have for example been used to study 1-year HbA1c reduction [13] and to predict asymmetric dimethylarginine levels in patients with type 2 diabetes mellitus [14]. Additionally, logistic regression has been used to predict (classify) diabetes and cardiovascular disease at the time of the study [15], drug-treated diabetes diagnosed during a 5-year follow-up [16], 5-year diabetes risk after gestational diabetes mellitus [17], and membership of the highest quartile of asymmetric dimethylarginine levels in patients with type 2 diabetes mellitus [14].

[Table 1: The table shows the median and quantiles (first and third) of the number of measurements per person (MPP), the mean and standard deviation of the baseline measurement (Baseline), the number of people with a baseline measurement (Total), and the number of people with at least three measurements within the year around diagnosis (total w. ≥ 3 m.).]

Linear regression and logistic regression models come with the additional advantage of hypothesis tests for the covariate coefficients. Non-linear machine learning methods, on the other hand, often give better predictions because they handle non-linearities as well as collinearities, typically at the cost of hypothesis testing for individual covariates. This makes the models harder to interpret. Machine learning methods have seen increased use in recent years, for example in [18], where several methods including support vector machines and random forests were used to predict diabetes complications at 3, 5, and 7 years from the first visit. They find that random forests perform best but choose logistic regression for application in the clinic because it is easier to interpret. Recently, there have been efforts to interpret machine learning methods [19,20] and tests for inclusion of variables in a random forest [21,22].

We compare these three approaches to elucidate their pros and cons for the analysis of medical time-to-event data, with a particular focus on a diabetes case. We believe this comparison is a useful basis for other medical areas with similar cohort studies. First, we choose one representative method from each of these approaches and consider how well they handle data imbalance, interpretation of the model, prediction performance, and the effects of dichotomization. We then apply the methods to a subset of the Genetics of Diabetes Audit and Research (GoDARTS) data set [23]. We use this data set to investigate the time until a type 2 diabetes patient goes onto insulin from the day of diagnosis. It comprises around 7000 patients and contains both clinical and biochemical variables.
The first method is the Cox proportional hazards model, which has previously been applied to this data set [24][25][26]; the second is a random forest approach for survival data; and the last is a neural network, which has also been applied to this data set [27]. We consider two versions of the data set: one using the biochemical values closest to diagnosis and one using features extracted from the year around diagnosis. This is done both to investigate whether this feature extraction strategy is worthwhile and to have two versions of the data set for the method comparisons. Finally, we give recommendations based on this study.
The rest of the paper is organised as follows. In Section 2, we present the data set as well as the methods used in the analysis. The results of the analysis of the data set are presented in Section 3, a discussion of the methods is given in Section 4, and conclusions based on the study are given in Section 5.
The data consist of biochemical markers, which were obtained during regular patient visits and have been recorded at different times and with varying frequency for each patient (Table 1); clinical variables, including anthropometric, life-style, and drug prescription variables (Table 2); the time from diagnosis to first insulin treatment or to leaving the study; and whether insulin treatment was given.

[Table 2: The table shows the mean and standard deviation of the variables or, for categorical variables, the number of observations in each category, and the total number of observations.]

The diagnosis is confirmed when HbA1c > 6.5% or the first drug is received, and the date of first insulin is defined as when insulin was first received or when two measurements of HbA1c > 8.5% were taken within two months. In this data set, around 58% of the observations are censored, meaning the patients did not receive insulin treatment while participating in the study. We only consider patients with type 2 diabetes. In order to minimize the number of type 1 diabetes patients, only patients diagnosed after 35 years of age are included. In addition, only patients diagnosed in 2010 or earlier are included. This leaves 6324 patients.

Feature Extraction
We define two data sets extracted from the biochemical markers listed in Table 1. The first, the baseline data set, consists of one measurement per variable j and person i: the measurement closest to the day of confirmed diagnosis and no more than 182.5 days before or after it (Table 1).
The second data set, the time model data set, is created as follows. For each variable j and observation i, we consider all K_{ij} measurements within 182.5 days of diagnosis and extract three features describing the time-dependent behaviour. A linear trend for each observation i is estimated by solving the least-squares problem [28]

(\hat{a}_{ij0}, \hat{a}_{ij1}) = \arg\min_{a_{ij0}, a_{ij1}} \sum_{k=1}^{K_{ij}} \left( x_{ijk} - a_{ij0} - a_{ij1} t_{ijk} \right)^2,

where t_{ijk} is the k'th time point for variable j, x_{ijk} is the k'th measurement value of variable j, and a_{ij0} and a_{ij1} are the two parameters for variable j. On the detrended residuals r_{ijk} = x_{ijk} - \hat{a}_{ij0} - \hat{a}_{ij1} t_{ijk}, we then fit a first-order autoregressive model by solving [29]

\hat{\tau}_{ij} = \arg\min_{\tau_{ij}} \sum_{k=2}^{K_{ij}} \left( r_{ijk} - \tau_{ij} r_{ij,k-1} \right)^2,

where \tau_{ij} is the autoregressive parameter. This gives three extracted features for each biochemical variable j: \hat{a}_{ij0} describing the general level, \hat{a}_{ij1} describing a linear trend, and \hat{\tau}_{ij} capturing the autoregressive aspect. These variables are extracted for all biochemical variables that have at least 3 measurements (Table 1).
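The two least-squares fits above can be sketched in a few lines of pure Python. The closed-form OLS step matches the linear-trend equation; applying the AR(1) fit to the detrended residuals is our reading of the setup rather than something the paper spells out, and the data below are purely illustrative.

```python
def trend_features(t, x):
    """Least-squares fit of x_k = a0 + a1 * t_k (closed form)."""
    n = len(t)
    mt, mx = sum(t) / n, sum(x) / n
    sxx = sum((ti - mt) ** 2 for ti in t)
    sxy = sum((ti - mt) * (xi - mx) for ti, xi in zip(t, x))
    a1 = sxy / sxx
    return mx - a1 * mt, a1

def ar1_coefficient(r):
    """Least-squares AR(1) coefficient tau for r_k = tau * r_{k-1} + noise."""
    num = sum(r[k] * r[k - 1] for k in range(1, len(r)))
    den = sum(rk ** 2 for rk in r[:-1])
    return num / den

t = [0.0, 30.0, 60.0, 90.0]          # days relative to diagnosis
x = [7.0, 7.6, 7.9, 8.6]             # an HbA1c-like series (illustrative)
a0, a1 = trend_features(t, x)        # general level and linear trend
resid = [xi - (a0 + a1 * ti) for ti, xi in zip(t, x)]
tau = ar1_coefficient(resid)         # autoregressive feature
```

A real pipeline would run this per patient and per biochemical variable, skipping series with fewer than three measurements.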

Prediction
A Cox proportional hazards model [1] and a random forest model for survival data (specifically, the conditional inference forest) [30][31][32][33] are fitted to the survival data (X_i, T_i, \delta_i), where X_i contains the explanatory variables, T_i is the time-to-event, and \delta_i is the event status, i.e. whether the event happens within the study period (\delta_i = 1) or not (\delta_i = 0), for observation i.
Besides this, a neural network is fitted to the dichotomized data

Y_i = I(T_i \leq t_c),

where t_c is the cut-off and I is the indicator function, i.e. Y_i indicates whether the event happens before a set time point t_c.
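The dichotomization can be sketched as follows. Observations censored before t_c have an unknown label; dropping them is one common choice and an assumption on our part, since the paper does not state how it handles them here.

```python
def dichotomize(times, events, t_c):
    """Label each observation by whether the event occurred before t_c."""
    labels = []
    for T, d in zip(times, events):
        if d == 1 and T <= t_c:
            labels.append((T, 1))     # event observed before the cut-off
        elif T > t_c:
            labels.append((T, 0))     # still event-free at the cut-off
        # censored before t_c: label unknown, observation dropped
    return labels

times, events = [0.5, 2.0, 4.0, 1.0], [1, 0, 1, 0]
labels = dichotomize(times, events, t_c=3.0)
```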

Cox Proportional Hazards Model
The Cox proportional hazards model [1] is given by

\lambda(t \mid X_i) = \lambda_0(t) \exp(X_i \beta),

which models the hazard at time t (the event rate at time t given survival until at least time t) for observation i with explanatory variables X_i. Here \beta contains the model parameters and \lambda_0 is the unknown baseline hazard. From this model it is not possible to give predicted survival times, because the baseline hazard is unknown. The predictions can therefore be given either as the linear predictors in the test data set (X_{new} \hat{\beta}), where \hat{\beta} is the estimate of \beta and X_{new} is the test data, or as an estimate \hat{S}(t) of the survival function S(t) = P(T > t), where T is a continuous random variable. The model is fitted using the rms package in R [34,35].
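To make the role of the partial likelihood concrete, here is a deliberately minimal single-covariate illustration, maximised by grid search rather than the usual Newton-Raphson, and without handling tied event times. It is a sketch only; the model in the paper is fitted with the rms package in R.

```python
import math

def partial_loglik(beta, times, events, x):
    """Cox partial log-likelihood for one covariate (no ties)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    ll = 0.0
    for pos, i in enumerate(order):
        if events[i] == 1:
            # risk set: everyone still event-free just before time T_i
            risk = sum(math.exp(beta * x[j]) for j in order[pos:])
            ll += beta * x[i] - math.log(risk)
    return ll

times  = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
x      = [3.0, 1.0, 2.0, 0.5, 1.5, 0.0]   # larger x tends to mean earlier event

grid = [b / 10 for b in range(-30, 31)]
beta_hat = max(grid, key=lambda b: partial_loglik(b, times, events, x))
```

Because larger x is associated with earlier events in this toy data, the estimated beta comes out positive, i.e. a higher hazard for larger x.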

Random Forest for Survival Data
The random forest model fits a number of trees to form a forest [30]. A tree is created by recursively splitting the data into smaller subsets based on some criterion (see an example in Figure 1). The starting point, which contains all the data, is called the root node. When the data are split, two child nodes are formed. These can also be split into child nodes, meaning that the subset of data is split again. This can continue until only one observation is left in each child node or until a stopping criterion is reached. The final nodes are called terminal nodes. Trees are unbiased predictors but have high variance. Other advantages are that they can handle mixed variable types and missing data, and that they perform variable selection.
[Figure 1: Tree example. Example of a tree where the data are first split on variable x_1 (at value a) and then on x_2 (at value b).]

A forest is created by bootstrapping the data and building a single tree for each bootstrap sample of the data set, in a process called bagging (Algorithm 1) [36]. This is done to decorrelate the trees, which lowers the variance of the predictions. The forest then inherits many of the advantages of trees but reduces the variance. In the random forest of [30], the trees are further decorrelated by selecting the splitting variable at each split from a random subset of the variables.
Algorithm 1 Pseudo-algorithm for bagging of trees.
1. From 1 to the number of trees:
   (a) Take a bootstrap sample.
   (b) Build a tree on the sample.
2. Aggregate the results from all trees.

The conditional inference forest (CIF) by [7], which builds each tree by testing a global hypothesis, is chosen over the random survival forest by [9], which adheres strictly to the random forest conditions laid out by [30], because it performs at least as well as the latter [37].
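Algorithm 1 (bagging) can be written as a runnable sketch. The "tree" below is a stand-in that just predicts the mean of its bootstrap sample, keeping the example self-contained; any tree learner could be plugged in via the fit argument.

```python
import random

def bagging(data, n_trees, fit, rng):
    """Fit one model per bootstrap sample (Algorithm 1, steps 1a-1b)."""
    models = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]   # bootstrap sample
        models.append(fit(sample))                  # build "tree" on sample
    return models

rng = random.Random(0)
data = [1.0, 2.0, 3.0, 4.0]
models = bagging(data, n_trees=50, fit=lambda s: sum(s) / len(s), rng=rng)
estimate = sum(models) / len(models)                # step 2: aggregate
```

Averaging over many bootstrap fits is what reduces the variance relative to a single tree.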
A conditional inference tree is built by recursively repeating two steps [8] (Algorithm 2).
Algorithm 2 Pseudo-algorithm for conditional inference tree.
Let each node be defined by a non-negative case weight vector w ∈ R^n_+ such that observations which are elements of the node have positive weights and the weights are otherwise zero.
Repeat the following steps:
1. For case weights w, test the global null hypothesis of independence between any of the variables and the response. If the hypothesis cannot be rejected, stop; otherwise select the variable with the strongest association to the time to event.
2. Split the sample space of the chosen variable into two disjoint sets, each assigned to one child node.
The case weight vectors of the two new child nodes are calculated by multiplying each case weight with 1 if the observation is in the node's corresponding set and 0 otherwise.
The conditional inference forest consists of a number of conditional inference trees. The prediction can be given either as an estimate \hat{T}_i of the survival time T_i or as an estimate of the survival function S(t). The survival function S(t) is estimated by a Kaplan-Meier estimate on the aggregation of the terminal nodes into which a new observation falls [7]. The estimate of the survival time is calculated as the observation-weighted average over the trees. The model can handle missing data using surrogate splits, meaning that if a variable is missing, it splits on another variable that leads to the same subsets [38]. It is fitted using the R package party [39].
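The Kaplan-Meier estimate used to turn the pooled terminal-node observations into a survival curve can be sketched as follows; this is a minimal pure-Python version, not the party implementation.

```python
def kaplan_meier(times, events):
    """Return (t, S(t)) at each distinct event time."""
    pairs = sorted(zip(times, events))
    n_at_risk, s, curve, i = len(pairs), 1.0, [], 0
    while i < len(pairs):
        t = pairs[i][0]
        d = sum(1 for tt, ee in pairs if tt == t and ee == 1)  # events at t
        m = sum(1 for tt, _ in pairs if tt == t)               # leaving risk set at t
        if d > 0:
            s *= 1 - d / n_at_risk       # multiply by conditional survival
            curve.append((t, s))
        n_at_risk -= m
        i += m
    return curve

# Five pooled observations: events at 1, 2, 3; censorings at 2 and 4
curve = kaplan_meier(times=[1, 2, 2, 3, 4], events=[1, 0, 1, 1, 0])
```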

Neural Network
A single-hidden-layer neural network (NN) is used for the dichotomized response [40]. This model consists of three layers of artificial neurons (also known as units or nodes): input, hidden, and output (Figure 2). Each layer consists of a number of neurons that receive input from all neurons in the previous layer and send their output to all neurons in the next layer. For binary classification, the output layer has only one neuron. Neurons in the hidden and output layers process their inputs by multiplying each by a weight, summing them, adding a constant, and applying a fixed activation function. The result is

\hat{y} = \varphi_0\left( \alpha + \sum_h w_h \, \varphi_h\left( \alpha_h + \sum_j w_{jh} X_j \right) \right),

where \varphi_0 is the output layer activation function, \varphi_h is the hidden layer activation function, w_h and w_{jh} are weights, \alpha and \alpha_h are constants, h runs over the hidden units, and j runs over the input variables. This function is fitted using, for example, the logistic cost function as a loss function (in the two-class case)

E = - \sum_i \left[ Y_i \log p_i + (1 - Y_i) \log(1 - p_i) \right] + \lambda \sum w^2,

where p_i = \exp(\hat{y}_i) / (\exp(\hat{y}_i) + \exp(1 - \hat{y}_i)) and \lambda is a decay parameter. Y_i was defined previously as whether the event happened before a set time point.

[Figure 2: Example of a neural network with input, hidden, and output layers.]
We optimise the number of neurons in the hidden layer and the decay parameter in an inner cross-validation loop according to the area under the receiver operating characteristic curve [41]. Further, for this model all continuous variables are standardised by mean and standard deviation within the cross-validation. The model is fitted using the R package nnet [42].
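A forward pass of the single-hidden-layer architecture described above, with logistic activations for both layers, looks like this. The weights are arbitrary illustrative numbers, not fitted values, and the sketch is independent of the nnet implementation used in the paper.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One observation through input -> hidden -> output."""
    hidden = [logistic(b + sum(w * xi for w, xi in zip(ws, x)))
              for ws, b in zip(W_hidden, b_hidden)]
    return logistic(b_out + sum(w * h for w, h in zip(w_out, hidden)))

x = [0.5, -1.2, 2.0]                             # one standardised observation
W_hidden = [[0.1, -0.3, 0.2], [0.4, 0.1, -0.2]]  # input -> hidden weights
b_hidden = [0.0, 0.1]                            # hidden constants (alpha_h)
w_out, b_out = [0.7, -0.5], 0.05                 # hidden -> output weights, alpha
y_hat = forward(x, W_hidden, b_hidden, w_out, b_out)   # value in (0, 1)
```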

Missing Data Strategies
The Cox proportional hazards model cannot handle missing data. In the baseline data set, only the biochemical variables cholesterol, creatinine, HbA1c, and high-density lipoprotein are included together with the clinical variables, as the others have more than 50% missing values; all biochemical variables except high-density lipoprotein are log-transformed due to skewness. Similarly, in the time model data, we only include extracted features with less than 50% missing values. These are the extracted features for cholesterol, creatinine, HbA1c, and high-density lipoprotein.
We have then removed all observations with missing values, leaving only 1063 observations. This is done for simplicity. It is generally not advisable, as it drastically reduces the number of observations, and in cases where the data are not missing completely at random, the results are generally biased [43,44]. However, the alternative of imputing the missing data using single imputation often makes us overly certain about the results, and more advanced multiple imputation methods are generally better but have their own problems [45]. The neural network model used here cannot handle missing data either. In this case, we use the imputation strategy of replacing missing values with "-999". This strategy can be used if there is a belief that the values are not missing at random, because it makes the missing values very different from any observed values. However, this method also has its problems, e.g. it changes the distribution of the data.
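The "-999" flagging strategy is simple enough to show in full: replace missing values with a constant far outside the observed range, so that "missing" becomes its own clearly separated signal. The variable name and values below are illustrative.

```python
def impute_flag(values, flag=-999.0):
    """Replace missing (None) entries with an out-of-range flag value."""
    return [flag if v is None else v for v in values]

hba1c = [7.2, None, 8.1, None, 6.9]   # illustrative series with gaps
imputed = impute_flag(hba1c)
```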
As previously mentioned, the random forest model can handle missing data using surrogate splits and is therefore fitted on the data as is. However, we note that the biochemical variables cholesterol, creatinine and HbA1c are log-transformed in the baseline data set as they were for the Cox model.

Data Imbalance
Class imbalance is a well studied area in classification [46]. A variety of sampling methods can be used to address the imbalance, e.g. downsampling, upsampling, and synthetic oversampling techniques [47,48].
We have used downsampling, as it is the simplest method. It is used both to balance the classes created by dichotomization (denoted Y1, Y3, and Y5 if downsampled based on the dichotomization over year 1, 3, or 5, respectively) and to balance the censoring (denoted cens.), i.e. downsampling the censored observations such that we have an equal number of censored and uncensored observations. The latter is done because recent work on dealing with imbalance by resampling shows that it improves performance [49]. If no downsampling is done, it is denoted none.
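Downsampling itself amounts to randomly dropping majority-class observations until all classes have the size of the smallest one, as in this sketch:

```python
import random

def downsample(items, labels, rng):
    """Keep a random subset of each class of the minority-class size."""
    by_class = {}
    for item, lab in zip(items, labels):
        by_class.setdefault(lab, []).append(item)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for lab, group in by_class.items():
        balanced += [(item, lab) for item in rng.sample(group, n_min)]
    return balanced

rng = random.Random(1)
items = list(range(10))
labels = [0] * 8 + [1] * 2           # 8 vs 2: heavy imbalance
balanced = downsample(items, labels, rng)
```

As noted in the text, this is applied only to the training folds inside the cross-validation, never to the test data.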

Model Evaluation

Cross-Validation
The performance is evaluated in a 5-fold cross-validation which has been repeated twice [50]. Cross-validation is stratified according to either the censoring or the dichotomized class on which the downsampling has also been done. For the neural network, stratification on the class is also done for the non-downsampled results. All downsampling is done only on the training set and within the cross-validation.

Performance Measures
The performance measures for the survival models have been chosen because they are either routinely reported or account for censoring, while being computable for models other than the Cox model [51].
The discriminative power of the survival models is evaluated using two different concordance indices (C-indices). The C-index by [52] is calculated as the number of concordant pairs of observations, i.e. pairs where the observation with the lowest survival time is also predicted to be lowest and the event happened for that observation, over the number of comparable pairs, i.e. pairs where the event happened for the observation with the lowest survival time:

C_H = \frac{ \sum_{i \neq j} I(\eta_i < \eta_j) \, I(T_i < T_j) \, \delta_i }{ \sum_{i \neq j} I(T_i < T_j) \, \delta_i },

where \eta_i \in R is a one-dimensional score computed for each observation, i.e. the predicted survival time or the linear predictor, T_i is the time-to-event, and \delta_i is the event status. The C-index by [53] is given by

C_U = \frac{ \sum_{i \neq j} \hat{G}(T_i)^{-2} \, I(\eta_i < \eta_j) \, I(T_i < T_j) \, \delta_i }{ \sum_{i \neq j} \hat{G}(T_i)^{-2} \, I(T_i < T_j) \, \delta_i },

where \hat{G}(\cdot) is the Kaplan-Meier estimator of the censoring distribution. This C-index takes censoring into account. Both measures lie between 0 and 1, where 0.5 corresponds to random guessing and 1 is best.

Besides this, we report the integrated Brier score (IBS) [54], for which 0 represents a perfect fit and smaller values are better. The classification performance is evaluated by the accuracy, which measures the proportion of correctly classified observations; by the sensitivity, which measures the proportion of positives correctly identified,

sensitivity = \frac{TP}{TP + FN},

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives; by the specificity, which measures the proportion of negatives correctly identified,

specificity = \frac{TN}{TN + FP};

and by the Matthews correlation coefficient,

MCC = \frac{ TP \cdot TN - FP \cdot FN }{ \sqrt{ (TP + FP)(TP + FN)(TN + FP)(TN + FN) } },

which gives a value between -1 and 1, where 0 means no better than random and 1 means perfect prediction.
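Minimal versions of Harrell's C-index and the Matthews correlation coefficient can be written directly from their definitions. Here eta is a survival-type score, higher meaning longer predicted survival, matching the pair definition above; the inputs are illustrative.

```python
import math

def harrell_c(eta, times, events):
    """Concordant comparable pairs over all comparable pairs."""
    conc = comp = 0
    for i in range(len(times)):
        for j in range(len(times)):
            if times[i] < times[j] and events[i] == 1:   # comparable pair
                comp += 1
                conc += eta[i] < eta[j]                  # concordant pair
    return conc / comp

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a 2x2 confusion matrix."""
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den

c = harrell_c(eta=[1.0, 2.0, 3.0], times=[1, 2, 3], events=[1, 1, 0])
perfect = mcc(tp=5, tn=5, fp=0, fn=0)
```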

Results
The performance of the survival models is improved, in terms of C-indices and IBS, by including the extracted time model features (Table 3). There is no improvement from using the random forest model, even though more data are utilized and it can model interactions. Downsampling based on censoring gave a small improvement in C_U, which takes censoring into account, but not in C_H or the IBS. Downsampling on censoring did not affect the classification performance, but this is because the models already just predict the majority class (Table 4). When downsampling is based on one of the classes (year 1, 3, or 5), the performance in terms of C_H is largely unchanged, and in terms of IBS worse, possibly due to the reduced number of observations. In terms of C_U, however, it improves compared to no resampling. The classification accuracy of the survival models decreases when the majority class (high survival time) is downsampled. This is because they no longer just predict a high survival time, which can be seen from the increasing specificity. MCC also increases when the majority class is downsampled.
The performance of the neural network is also improved by downsampling. The accuracy increases, and the sensitivity and specificity become more balanced. MCC, however, only increases for year 5, due to the high specificity for the other years before downsampling. For all three years, the neural network performs better than the conditional inference forest.

Discussion
The Cox proportional hazards model is the most commonly used model for survival data [56]. The model is easily interpretable due to its simplicity and because it allows hypothesis testing for the relevance of the included variables [1]. However, being a proportional hazards model with unknown baseline hazard, it cannot give predictions of the actual time-to-event. It further has the issue that it cannot handle missing data.

[Table 4: Classification results in percent (except for MCC, which is on a scale from -1 to 1); means and standard deviations. Two methods are compared, the conditional inference forest (CIF) and a neural network (NN), on two data sets, baseline (B) and time model (T) (see Section 2.2). The sampling refers to whether the data were downsampled on censoring (cens.), year 1 (Y1), 3 (Y3), or 5 (Y5), or not at all (None). Class is the year at which the dichotomized response is cut off.]
The latter two issues are addressed by the conditional inference forest model. It handles missing data using surrogate splits [38]. Because the methods have different strategies for missing values, their performance is compared on different subsets of the data. This means that the comparison is not controlled in the sense that differences in the results are due to a single difference in approach. However, this is the reality of clinical data, so a comparison of methods has to handle missing data differently for different methods. The conditional inference forest can furthermore natively give predictions of survival times. These advantages come at the expense of a more complicated model. It is not easily interpretable, though an experimental variable importance measure is available for the conditional inference forest [39] and a variable importance measure is available for the random survival forest [57]. In our data set, the Cox model performs slightly better than the conditional inference forest. The reason could be that the assumptions of the Cox model are approximately satisfied, or that the flexibility of the random forest is not needed.
The conditional inference forest model can give a prediction of the time-to-event, which makes it possible to compare its performance to methods for the dichotomized response. Generally, dichotomizing is not advisable [58]. However, one might have non-statistical reasons for being interested in this form of the response, e.g. easier communication of results to clinicians. If one chooses this approach, then survival models are no longer a natural choice of model, as the problem becomes binary classification. However, one advantage of using a conditional inference forest for survival data is that it can be used to classify for any cut-off.
We consider a neural network for classification. This implementation cannot handle missing data, but the main limitation of the approach is that it cannot utilize the full information in the time to insulin, because this has been encoded as a binary class label. It does have the advantage that the model is optimized for classification rather than for survival time prediction.
If the dichotomization is done such that one class is much larger than the other, then there is a need for resampling in order to balance the classes. When downsampling is chosen, we lose some data, and therefore lose information on top of the information lost through dichotomization. However, dealing with the imbalance of the data is necessary to produce generalizable results. In our data, the neural network outperforms the conditional inference forest for the survival data in all three years when resampling is done. This means that if the dichotomized response is of interest, we are better off fitting a model directly to this response. If there is no resampling, both methods perform poorly. The same downsampling is required in order to get comparable performance from dichotomizing the survival time predictions from the conditional inference forest, meaning that using a single model for an arbitrary cut-off did not, by itself, produce desirable results (Table 4).

Conclusions
In this final section, we give our recommendations and conclusions based on this study.

Imbalance: Based on this study, downsampling on censoring in survival prediction has a small but positive effect and might be relevant to investigate. Downsampling in classification, or otherwise dealing with the imbalance, is however important to produce generalizable results.

Interpretation: The Cox model performed well in our case. This model is preferable if the interpretation of the model parameters is of interest.

Prediction: We found that models trained specifically for a setting performed best in that setting, resulting in the conditional inference forest being outperformed in both settings.

Dichotomization: If interested in the dichotomized response, it is preferable to build a model on this response directly rather than dichotomizing after training.

Ethics approval and consent to participate
The GoDARTS study was approved by the Tayside Committee on Medical Research Ethics, and informed consent was obtained from all patients (REC reference 053/04).

Availability of data and materials

Data are not publicly available, but may be available on request. Ewan Pearson is the data guarantor.