Predicting Response to Tocilizumab Monotherapy in Rheumatoid Arthritis: A Real-World Data Analysis Using Machine Learning

Introduction
An expanding treatment armamentarium means more treatment options for patients with rheumatoid arthritis (RA); however, clinicians face difficult decisions when attempting to make evidence-based recommendations regarding which disease-modifying antirheumatic drug (DMARD) treatment will be most effective in a given patient. While the majority of patients with RA will find an effective treatment, not all do; many spend months trying medications that may not work for them. 1 Prior investigations have attempted to find biomarkers which can help personalize treatments, but most efforts have not produced useful results. 2 While an exhaustive comparison between all treatment options is a desirable goal, a natural first step is to identify and understand predictors of a single drug's success.
We recently examined clinical data from several randomized controlled trials (RCTs) and were able to derive and validate a prediction score for remission among patients using tocilizumab monotherapy (TCZm). 3 Monotherapy with TCZ has been found more effective than monotherapy with some targeted therapies for RA. 4 However, RCT findings may not always replicate in typical practice with real-world data. 5 These differences may derive from different patient populations, different treatment patterns, or other more subtle differences. 5 Real-world data (RWD) offer important advantages over RCTs in that patients are more heterogeneous, with a greater variety of clinical characteristics as well as experience with other biologic and targeted DMARD treatments. We examined the performance of our original prediction score 3 for remission among patients using TCZm in Corrona, a large RWD set from the US. 6 We employed various machine learning algorithms to take advantage of the copious data contained within Corrona.

Study Design
Our study sought to answer the following questions regarding remission in patients with RA using TCZm: A) to what extent do the findings of Collins et al 3 hold when the baseline model is evaluated in real-world data; B) can predictive performance be improved by expanding the variable set and derivation population; and C) can machine learning algorithms improve prediction of remission?

Derivation and validation of original prediction score in RCTs
As a baseline model, we used the remission model derived by Collins et al., 3 derived from patients on TCZm from two RCTs. 7,8 In the Collins study, twelve variables representing demographics, basic RA characteristics and treatment history were included in a logistic regression (LR) model based on three alternative criteria: an a priori baseline set of covariates, the model odds ratio (OR), and the model fit (expressed as the Akaike Information Criterion, AIC). In the current analyses, we restricted our comparison to the set of variables determined by ORs, as this was deemed the most successful model, referred to here as LR-OR.
In the analysis by Collins et al., 3 two RCTs were used for derivation and two additional RCTs were used for validation. 4,7−9 In the current analyses, our focus is predictive discrimination in RWD. Hence, all four trials were used for derivation of models validated in the RWD. Access to the RCT and RWD data was granted following de-identification and IRB approval from the Partners Healthcare Human Studies.
Variables from the four RCTs were harmonized by Collins et al. The fewer than 5% of subjects with missing values were removed from the study. Here, only TCZm patients were used for validation.

Question A. Evaluating baseline model in real-world data
Following the development of the initial LR-OR model in the RCTs, we evaluated the model using patients from the RWD dataset. This was pursued both by using the parameter values fit in the RCTs and by refitting the parameters of the LR-OR model to RWD. Using these two complementary methods serves to estimate the extent to which the quality of the LR-OR model is affected by the cohort discrepancies between RCT and RWD.

Question B. Expanding the variable set and derivation population
The original variable set used in LR-OR was limited by the covariates collected in the four RCTs underlying its derivation. 4,7−9 In the Corrona RWD used in the current analyses, we had access to a greatly expanded feature set not available in the RCTs, including additional demographic information, comorbidities and treatment history. All models fit with this expanded set were trained and evaluated using only RWD. Adding covariates comes at a "statistical power price", however, since they contribute to increased variance if the cohort remains of fixed size. To address this, we used a regularized logistic regression model (LR-Reg) which penalizes models with many large coefficients.
To further reduce the variance of our model parameters, we evaluated models fit in an expanded population by including patients on monotherapy with any biologic DMARD, including TCZm (bDMARDm). Our motivation for this is that variables that predict remission in RA are likely to be predictive for patients on different therapies. By including an indicator for TCZm therapy, this design choice greatly increased the cohort size while enabling the model to remain predictive for our cohort of interest. Validation was then pursued for the cohort including only patients using TCZm.
Question C. Applying ML algorithms in prediction of remission
A limitation of using logistic regression for predicting remission is that it models the log-odds of an event as a linear function of the covariates. It is standard practice in statistics to overcome this limitation by introducing features representing interactions (e.g., products of variables) or transformations (e.g., the square or logarithm) of variables. Effective transformations are often hard to know in advance, and exhaustive enumeration of all possibilities is not feasible, both for computational reasons and due to the exponential model size and problems with parameter variance it entails.
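To make the interaction-feature approach concrete, the following sketch (not part of the study's pipeline; the covariates are purely illustrative) uses scikit-learn to enumerate pairwise products of two toy variables:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy covariate matrix with two hypothetical baseline variables
# (e.g., baseline CDAI and disease duration; names are illustrative).
X = np.array([[10.0, 2.0],
              [20.0, 5.0]])

# degree=2 with interaction_only=True adds only pairwise products,
# not squares; include_bias=False drops the constant column.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = poly.fit_transform(X)

# Columns are now [x1, x2, x1*x2].
print(X_int)
```

Even in this two-variable example a column is added per pair; with dozens of covariates and higher-order transformations, the enumerated feature set grows combinatorially, which is the infeasibility noted above.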
As an alternative, we used non-parametric random forest estimators: ensembles of tree-structured decision rules. 10 Random forests are universal function approximators capable of discovering meaningful interactions and transformations of variables while mitigating increased variance through the use of bootstrapping. A drawback of the random forest estimator is that it is often difficult to describe ensembles of learned decision rules concisely. 11 While the resulting predictions could serve as a computer-aided score, a more transparent description is often preferable. For this reason, we use random forests primarily to get an indication of how limited linear models are in this task.

Study Populations
Following Collins et al., 3 we used four RCTs, ACT-RAY, FUNCTION, ADACTA and AMBITION, for derivation of our baseline model. 4,7−9 Our RWD patient cohort was extracted from the Corrona RA registry, the largest prospective cohort study of RA in the world. 6 The registry comprises medical history including conditions, diagnoses, labs and treatments as well as demographic and lifestyle data. Records are collected at regular visits through two questionnaires filled in by the patient and their physician, respectively. We used a version of the registry exported on February 4, 2018, containing 54,646 patients. This cohort was further restricted to patients who had been on bDMARDm therapy at some point recorded in the registry. The final dataset included records of visits from October 2001 to December 2017.
Patients in the Corrona RWD were deemed eligible for inclusion if they were on bDMARDm for a minimum of three months with at least one follow-up visit no later than nine months after initiation. Patients starting a bDMARD in combination with other DMARDs were included if monotherapy with the target drug was started at most three months after target drug (not necessarily monotherapy) initiation; this occurred when the non-bDMARD was stopped, resulting in bDMARDm. A list of considered bDMARDs can be found in the Appendix.
Initially, to match the design of the RCTs, the Corrona cohort was restricted to patients older than 18 years on TCZm. Included patients had to be on monotherapy for at least 3 months and have a follow-up at most 9 months after initiation. Patients who started TCZ in combination with other DMARDs were included if they switched to monotherapy within 3 months of TCZ initiation. Follow-up duration was evaluated with respect to the start of monotherapy. If patients were eligible at multiple time-points, only the first instance was included. For models fit to the full cohort of bDMARD subjects, an indicator variable for TCZ treatment was added. Finally, the most striking difference between the RWD and RCT cohorts was the average disease activity at baseline. For this reason, we evaluated models both for all RWD TCZm patients and for the subset of patients with baseline CDAI > 20 (see Tables 1, 2).

Table 1. Baseline characteristics. N (%) or median (IQR). Variables in the original set across RWD (All, TCZm) and RCTs, following imputation. RWD is also stratified by baseline CDAI > 20 to achieve a closer comparison to the RCTs. The extended variable set used in our analysis included variables representing: additional disease activity scores; history of cancer, hypertension, rheumatoid factor, joint erosions & deformity; comorbidities; prescriptions of NSAIDs and steroids; work status; education; general medical problems; physical disability; current and number of previous DMARDs.

The primary outcome of interest was disease remission at 24 weeks following initiation of monotherapy, to match the outcome of the RCTs used by Collins et al. 3

Remission was defined by a Clinical Disease Activity Index (CDAI) < 2.8. 12 A benefit of the CDAI remission criterion is that it does not require access to a laboratory measurement and is thus widely available in RWD. In the Corrona RWD, remission status was evaluated at the follow-up visit closest to 24 weeks after the start of monotherapy, but no sooner than 3 months or later than 9 months after initiation.
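The CDAI is the sum of the 28-joint tender and swollen joint counts plus the patient and physician global assessments (each on a 0-10 scale), which is why no laboratory value is required. A minimal sketch of the outcome definition used here:

```python
def cdai(tjc28, sjc28, patient_global, physician_global):
    """Clinical Disease Activity Index: 28-joint tender count +
    28-joint swollen count + patient and physician global assessments
    (each rated on a 0-10 visual analog scale)."""
    return tjc28 + sjc28 + patient_global + physician_global

def in_remission(score, threshold=2.8):
    # CDAI remission is defined as CDAI < 2.8.
    return score < threshold

score = cdai(tjc28=1, sjc28=0, patient_global=1.0, physician_global=0.5)
print(score, in_remission(score))  # 2.5 True
```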

RCT Data
The variables used in the baseline model derived from the RCTs were identical to those of Collins et al. 3 These included demographic variables (age, sex, geographic region, and body mass index (BMI)), as well as several RA characteristics: baseline CDAI, disease duration, the Health Assessment Questionnaire (HAQ-DI), C-reactive protein (CRP), erythrocyte sedimentation rate (ESR), and hematocrit. In addition, previous use of DMARDs (biologic or non-biologic) was recorded as MTX monotherapy, MTX plus another DMARD, any DMARD but not MTX, or no DMARDs. The baseline variable set is listed in its entirety in Table 1.

Real-world Data
Most of the RWD was collected through a questionnaire filled out at each Corrona patient visit (typically once every six months) for each enrolled patient. Baseline features for the Corrona subjects were defined as the last recorded measurements taken prior to initiation of the target drug (TCZm or bDMARDm). In particular, if the target drug was prescribed for the first time between patient visits, the data of the last visit before prescription were used. To start, the variables of the baseline model LR-OR were extracted from the registry data underlying the RWD. In Corrona, the entire population resides in North America, so the geographic-region predictor was omitted from models fit to the RWD.
In the case of missing data, variables were first forward-imputed based on visits prior to baseline. Remaining variables were either 0-imputed, in the case of indicators of certain comorbidities or lifestyle factors (e.g., smoking), or mean-imputed, in the case of continuous variables such as ESR. In the Appendix, we give statistics on the missingness of variables in the RWD before and after forward-imputation.
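A minimal pandas sketch of this three-step imputation scheme (the column names are hypothetical, not the registry's):

```python
import pandas as pd

# Toy longitudinal records for one patient (columns are illustrative).
visits = pd.DataFrame({
    "patient_id": [1, 1, 1],
    "esr":        [30.0, None, None],   # continuous lab value
    "smoker":     [None, 1.0, None],    # lifestyle indicator
})

# Step 1: forward-impute within each patient from earlier visits.
visits[["esr", "smoker"]] = (
    visits.groupby("patient_id")[["esr", "smoker"]].ffill()
)

# Step 2: remaining indicator gaps -> 0; Step 3: continuous gaps -> mean.
visits["smoker"] = visits["smoker"].fillna(0)
visits["esr"] = visits["esr"].fillna(visits["esr"].mean())
print(visits)
```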
DMARD treatment status was determined using an algorithm previously used in the Corrona registry. 13 Patients were determined to be on a drug at a visit recorded in the registry if: i) the drug was prescribed to them at the prior and next visits, ii) the patient reported using the drug at the current and next visits, and iii) the rheumatologist did not stop the drug at the visit or report initiating it at a later visit. Additionally, if the above held for the previous and next visits and these were within 3 months of the current visit, the patient was determined to be on the drug at the current visit, as long as the rheumatologist did not report stopping the drug or initiating it at a later visit.
To investigate the predictive power of additional covariates, the baseline set was extended significantly using variables from the RWD. These variables included additional disease activity scores; history of cancer, hypertension, rheumatoid factor, joint erosions & deformity; additional comorbidities; prescriptions of non-steroidal anti-inflammatory drugs and glucocorticoids; work status; education; general medical problems; physical disability; current and number of previous DMARDs. A full description of this extended variable set (EXT) is provided in the Appendix.

Statistical Analyses
The envisioned clinical use-case for the developed risk scores is to aid in decision-making in treatment of new patients. Therefore, out-of-sample and out-of-distribution generalization is a primary concern. To address the former, as is customary, we use sample splitting of the cohort when training and validating models on RWD which are also evaluated on RWD. For a single experiment, the full sample was first split at random into a derivation set and a validation set, the former used for fitting model parameters and the latter only for evaluation. The overall quality of each method was then computed as the average quality on the validation sets over a large number of repeated experiments. To assess out-of-distribution generalization, we evaluate the baseline model fit to the RCT cohorts on the RWD. The reverse (RWD to RCT) was not considered here as the extended variable set is not available in the RCTs.
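The repeated-splitting protocol can be sketched as follows, here on synthetic data standing in for the cohort (the real analysis used the models, split proportions and cohorts described elsewhere in this section):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the cohort: 12 covariates, binary remission outcome.
X = rng.normal(size=(500, 12))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

aucs = []
for seed in range(20):  # many repeated derivation/validation splits
    X_der, X_val, y_der, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    model = LogisticRegression(max_iter=1000).fit(X_der, y_der)
    aucs.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# Overall quality: validation AUROC averaged over repeated experiments.
print(f"mean validation AUROC: {np.mean(aucs):.2f}")
```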
The primary quality metric used was the area under the receiver-operating characteristic curve (AUROC). The AUROC assesses the extent to which models successfully discriminate (or rank) subjects in terms of their probability of remission. Standard errors in the AUROC were computed using the classical model of Hanley & McNeil. 14 A limitation of the AUROC is that it is not informative of the calibration of predicted probabilities, that is, their absolute error with respect to the true probability. Typically, regularized classifiers, such as the ones considered in this work, are not trained to achieve perfect calibration, but to minimize the expected out-of-sample classification error. However, given a well-discriminating model, calibration is often ensured or improved by applying, e.g., Platt scaling fit to held-out data. 15 We adopt this strategy here and assess calibration using standard methods.
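The Hanley & McNeil standard error depends only on the AUROC and the numbers of subjects with and without the outcome. A sketch of the computation (the example counts are illustrative, not the study's):

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the AUROC under the Hanley & McNeil (1982) model.
    n_pos / n_neg: numbers of subjects with / without the outcome."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

# Hypothetical example: AUROC 0.73 with 45 remitters, 181 non-remitters.
se = hanley_mcneil_se(0.73, 45, 181)
print(f"AUROC 0.73 +/- {1.96 * se:.2f} (95% CI half-width)")
```

In scikit-learn, Platt scaling corresponds to `CalibratedClassifierCV(..., method="sigmoid")` fit on held-out data.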
Three different predictive models were fit to data and evaluated: logistic regression (LR), ridge-regularized logistic regression (LR-Reg) and random forests. These estimators were chosen to illustrate the potential differences between parametric (LR, LR-Reg) and non-parametric models (random forests) and between regularized (LR-Reg) and unregularized models (LR). A significantly better fit for random forests compared to LR would point to the existence of important nonlinearities or covariate interactions in the target function. We describe our chosen models below.

Logistic Regression
Our baseline model was the (unregularized) logistic regression (LR) risk score developed by Collins et al. 3 with variables selected based on the odds ratio in the RCTs, while forcing the inclusion of sex, age and baseline CDAI into the model. To assess the value of including additional covariates from the RWD, we fit the same LR estimator to an extended variable set (EXT), as described in the previous section.

Regularized Logistic Regression with Ridge Penalty
Fitting a large number of model parameters to a small sample makes analyses susceptible to high variance. 16 Extending our variable set to include more covariates from the RWD runs this risk. However, under the assumption that only a few potential predictors have large predictive power, we may limit this using regularization. 16 Regularization trades off bias and variance by encouraging coefficients to be small in aggregate. Here, we used logistic regression with a ridge penalty, LR-Reg, which forces the sum of squared coefficients to be small. A regularization parameter controls the strength of the tradeoff. For small samples, a large penalty may be required to mitigate variance, but will introduce significant bias in coefficients. As our goal is primarily predictive capability, bias is of less concern. In larger samples, a smaller penalty is needed since variance is naturally controlled.
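In scikit-learn terms, the ridge penalty corresponds to `penalty="l2"`, with `C` the inverse regularization strength (smaller `C` means a stronger penalty). A toy illustration of the shrinkage effect, on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Synthetic extended feature set: 50 covariates, only 3 truly predictive.
X = rng.normal(size=(300, 50))
beta = np.zeros(50)
beta[:3] = 1.5
y = (X @ beta + rng.normal(size=300) > 0).astype(int)

# Weak vs. strong ridge penalty (C is the INVERSE penalty strength).
weak = LogisticRegression(penalty="l2", C=100.0, max_iter=2000).fit(X, y)
strong = LogisticRegression(penalty="l2", C=0.01, max_iter=2000).fit(X, y)

# Stronger regularization shrinks coefficients in aggregate.
print(np.abs(weak.coef_).sum(), np.abs(strong.coef_).sum())
```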

Random Forests
Random forests are ensembles of decision trees, each fit to a random subset of variables and subjects. The resulting estimator has a high capacity to discover interactions between variables and non-linear dependencies in the target variable, while estimating and mitigating variance by aggregating the predictions of many classifiers. 10,17 They are the first choice of non-parametric estimator in many application areas.
Like many other non-parametric estimators, random forests control sample variance (overfitting) by restricting the capacity of the model through a number of tuning parameters associated with the ensemble itself and with each tree. We describe these and the process for selecting them later in this section.

Weighting of Subjects in Extended Cohort
Extending the cohort to include patients on bDMARDs other than TCZ (see Question B) induces a distributional shift between the development (all bDMARDm) and evaluation cohorts (only TCZm). In particular, TCZm patients make up a fairly small proportion of the overall RWD cohort and would have limited impact on model fit in a standard model. If the remission probability, as a function of patient covariates, is different for different therapies, this could cause a bias in model fit that is disadvantageous when applied to TCZm patients specifically. To minimize this bias, we made use of inverse propensity reweighting, as is standard practice for handling distributional shift between treatment groups or, more generally, between different populations. A logistic regression model was fit to estimate the propensity of patients in the RWD to receive TCZm treatment compared to receiving any bDMARDm therapy (including TCZm). This propensity was used to compute weights for samples in the training and validation sets which up-weight patients that had a higher propensity to be put on TCZm. The weights were then used to fit weighted logistic regression and weighted random forest models tailored to predicting remission for TCZm patients.
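A sketch of this reweighting scheme on synthetic data (the real propensity model used the study's covariates; here, per the description above, each subject's weight is proportional to their estimated probability of receiving TCZm, which up-weights TCZm-like subjects):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Synthetic bDMARDm cohort: covariates, TCZm indicator, remission outcome.
n = 1000
X = rng.normal(size=(n, 5))
is_tczm = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
remission = (X[:, 1] + rng.normal(size=n) > 0.5).astype(int)

# Step 1: propensity model P(TCZm | covariates) over the full cohort.
prop_model = LogisticRegression(max_iter=1000).fit(X, is_tczm)
propensity = prop_model.predict_proba(X)[:, 1]

# Step 2: weights proportional to the TCZm propensity (normalized to
# mean 1), then a weighted outcome model fit on all bDMARDm subjects.
weights = propensity / propensity.mean()
outcome_model = LogisticRegression(max_iter=1000).fit(
    X, remission, sample_weight=weights)
print(outcome_model.coef_.shape)
```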

Model tuning and feature importance
The LR-Reg and random forest models have so-called tuning parameters, common in machine learning, which control the way that model coefficients are fit to data. In particular, these are used to influence the bias-variance tradeoff to improve out-of-sample performance. For LR-Reg, the only tuning parameter controls the strength of regularization: the degree of preference for smaller squared coefficients. Random forests have many potential parameters, the details of which vary between implementations. Here, we tuned the maximum depth (1, 2, 5, 20, or no max), the maximum number of features (1, 2, 5, 10, or no max), the minimum samples per leaf (5, 10, 20, 50, 100), and the splitting criterion (Gini or entropy) used by each tree in the forest. 18

Following standard practice, tuning parameters for random forests and LR-Reg were selected based on 3-fold cross-validation within the derivation set using the area under the ROC curve (AUROC) as the selection criterion. Following selection of these parameters, all models were fit to the entire derivation set. All experiments were implemented in Python. The LR models were fit using the statsmodels 18 package, and LR-Reg and random forests with scikit-learn. 17

For linear models fit to standardized variables, the magnitudes of regression coefficients are often interpreted as measuring the variables' importance. For tree and forest models, it is common to rank variables by their ability to discriminate between subjects with different outcomes when used as splitting nodes in the trees. Here, we measure this so-called feature importance using the mean decrease in impurity (MDI), which is the standard in scikit-learn.
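A condensed sketch of the tuning and feature-importance computation with scikit-learn (a reduced grid on synthetic data; the study's full grid is listed above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
# Synthetic data: 8 covariates, only the first two drive the outcome.
X = rng.normal(size=(300, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A reduced version of the tuning grid described in the text.
grid = {
    "max_depth": [2, 5, None],
    "max_features": [2, 5, None],
    "min_samples_leaf": [5, 20],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    grid, cv=3, scoring="roc_auc")  # 3-fold CV, AUROC as criterion
search.fit(X, y)

# Mean-decrease-in-impurity importances from the refit best model.
mdi = search.best_estimator_.feature_importances_
print(search.best_params_, np.argsort(mdi)[::-1][:2])
```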

Patient Sample Characteristics
From the RCTs, a total of 853 subjects were enrolled in the TCZm arms and had complete data. Among these, 80% were female and 80% were white. At baseline, the mean CDAI was 40.1, and 52% of subjects had been treated previously with both MTX and another DMARD. In the RWD, out of 54,646 subjects, 3204 subjects were identified as fitting our criteria for bDMARDm. 76% of these subjects were female and 93% white. The mean baseline CDAI was 17.4, and 83% were previously treated with both MTX and another DMARD. In the bDMARDm cohort, 452 were treated with TCZm.
Missingness at baseline in the RWD was low (< 2%) for variables in the original feature set, with the exception of disease duration (11%), ESR (30%) and HAQ-DI (25%). For the extended feature set, indicator variables representing certain past comorbidities, joint erosion, rheumatoid factor, smoking and previous pregnancy had high (> 30%) missingness. For evaluation of models fit to the RWD, the RWD was repeatedly split into a validation set, containing 50% (n = 226) of TCZm subjects and 20% (n = 550) of other bDMARDm subjects, and a derivation set containing the remaining subjects.

Derivation and Validation of the Prediction Model
The full results of our evaluation of different sets of predictors (original/extended), models (LR, LR-Reg, random forest) and derivation sets (RCT, RWD TCZm, RWD bDMARDm) are presented in Table 2. Each combination is evaluated on the validation sets within two cohorts: the RWD TCZm population and the subset of these with CDAI > 20. Training models with the extended feature set on the RCT cohort was infeasible as many covariates were not available. For a closer comparison with RCTs, we report results also for the subset of 240 RWD subjects with baseline CDAI > 20. The cohorts were comparable on demographic characteristics (see Table 1). The calibration of the LR and random forest models, following Platt scaling, is illustrated in Fig. 1. The comparison suggests that there are gains to be made in predictive performance from including additional comorbidity, lifestyle and treatment variables in the risk score. Note, however, that the CIs overlap. We saw no advantage of using the random forest model over the regularized logistic regression model in either setting, indicating that the remaining variance in the outcome is unlikely to be due to underutilized interactions or nonlinearities.
We found that expanding the derivation cohort to include non-TCZ bDMARDm patients improved the AUROC for both the TCZm and the TCZm high-CDAI cohorts (see bottom of Table 2). Compare, for example, an AUROC of 0.73 (95% CI 0.64, 0.82) for LR (Original) trained on bDMARDm to 0.68 (95% CI 0.58, 0.78) for the same model fit to TCZm patients only. For LR-Reg, the AUROCs were 0.77 (95% CI 0.68, 0.85) and 0.73 (95% CI 0.64, 0.82) when fitting to bDMARDm and TCZm, respectively. Comparable gains were seen for the CDAI > 20 group, but due to its small validation set, the variance in these results is high. The original model, derived in the RCTs, performed substantially worse in the high-CDAI cohort than in the full TCZm cohort, even though the criterion CDAI > 20 was meant to increase the similarity to the RCTs. However, as we can see in Table 1, significant differences between the cohorts remained. The results may be partially explained by higher variance in outcome for the high-CDAI cohort, after controlling for baseline disease severity.
For all models and derivation sets, different measures of disease severity (e.g., DAS28 and CDAI) at baseline were consistently highly predictive of remission. In Table 3, we list the features with the highest estimated importance in the LR-Reg and random forest models trained on the extended feature set of the full bDMARDm cohort, ordered by feature importance (random forests) or coefficient magnitude (LR-Reg). The highest-ranked features are mostly unsurprising: the majority pertain to measures of disease severity either explicitly (CDAI, MDAS, global assessment) or implicitly (larger number of previous DMARDs). For the LR-Reg model, education up to high school or less was associated with lower chances of remission. The remission rate in this group (13%) was smaller than for subjects with more than high-school education (20%). The average reduction in CDAI between baseline and follow-up was almost identical for the two groups (−5.7 and −5.8). Confidence intervals for LR-Reg coefficients were computed using the empirical bootstrap over the derivation set. This method was chosen due to the inclusion of regularization and sample weighting in the procedure. Subjects were resampled with replacement, and the propensity and outcome models were fit to each bootstrap sample. The stated results are for the best tuning parameters from the experiment presented in Table 2.

Discussion
Machine learning applied to real-world data may offer new opportunities to better define the course of disease and to identify better treatment strategies. In RA, the expanded treatment options present a challenge for clinicians and patients, as predictors of response to specific treatments are lacking. In prior work, we used RCT data to derive and validate predictors of remission among patients initiating TCZm. 3 In the current study, we tested this prediction rule using RWD and attempted to refine the prediction rule using machine learning. We found that the original prediction rule held up well in RWD from Corrona, despite notable differences between the RCT and RWD populations. This, and the fact that the original rule contains only commonly available variables, points toward the feasibility of implementing these rules in clinical practice.
An expanded set of predictive variables was identified using machine learning algorithms, which led to small gains in performance. However, for the features and number of samples available, we saw no gains from using a random forest model over logistic regression. This may be explained either by a lack of bias (that the conditional probability of remission is well approximated by the logistic model) or by noise and variance (that the available observations are insufficient to learn a better approximation without further assumptions). A larger (or broader) study, including comparison of a larger set of machine learning models, could help decide which of these explanations dominates the behavior.
The implications of this work are several. First, we examined the validity of prediction models derived and validated in trials in RWD. RCTs, while appropriate for estimating average treatment effects on the selected cohort, are limited in their generalizability to a broader population. RWD offers an insight into how patients are treated in the healthcare system, and what their outcomes are in the absence of strict inclusion criteria and potential experiment effects. In this work, for example, we found that the RCT and RWD cohorts differ substantially in terms of disease duration and severity, as well as treatment history.
For example, the FUNCTION trial enrolled subjects that were MTX-naïve with short disease duration, 4 while the other RCTs enrolled subjects that showed inadequate response to MTX. 7−9 Despite this, the discrimination between patients was good for the transferred model in terms of AUROC. This indicates that a) good predictors in the RCTs are good predictors in the RWD, and b) the variables observed at baseline appropriately controlled for the cohort differences. To further validate the models we derived from the RWD, we could consider how well they generalize to an RCT cohort. We refrained from this here as the RCT data did not contain the extended variable set used in the RWD analysis, nor are the RCT cohorts as representative as the RWD of current treatment practices.
Second, we developed and validated in RWD several prediction models for remission with TCZm. As noted above, TCZ when given as monotherapy seems to be more effective than other bDMARDs as monotherapy. 4 The variables identified in our prediction rule were disabled work status, educational attainment, prior DMARDs, and baseline CDAI. These variables are not surprising, but they have never been put together in one prediction rule that may have utility in the clinic. They also point out the value of several non-clinical variables. While we anticipate further refining this rule and testing it among other bDMARDs, it may be that future iterations of this rule could be programmed into an electronic medical record and help clinicians and patients identify therapy likely to be effective in patients with a given set of characteristics.
Finally, this set of analyses used a robust RWD dataset, Corrona, with an expanded set of variables.
Because of the presence of many potentially correlated variables, we used several machine learning algorithms to analyze these data. In high-dimensional settings such as these, machine learning may be used together with sample splitting to discover models with reduced predictive variance at the cost of a small increase in bias. 16 This is appropriate particularly in applications where prediction and out-of-sample generalization are the goal, rather than parameter identification. The benefits of machine learning are smaller when domain knowledge is strong enough to identify a successful model without the need to search over a large set of variables. Additionally, in some cases, a model with slightly lower predictive accuracy may be preferred if it is easier to interpret, explain or communicate. 19 Strengths of the current analyses include the validation of a previously derived and validated algorithm using RWD as an external validation dataset. However, several limitations should be noted. The work needs to be expanded to consider the prediction rule across other bDMARDs; this is planned future work.
We had significant rates of missing data in the RWD; this is typical but likely has some impact on model fit. In particular, different imputation strategies could be considered. Corrona encompasses patients from North America only; while more generalizable and much larger than many RCTs, this is a limitation. The generalizability of the prediction models developed here from Corrona to other RWD should be explored further.

Conclusion
In conclusion, we were able to test a prediction rule for remission with TCZm among patients with RA in RWD and found that it worked well. Additional variables enhanced the prediction rule further. Moreover, using data from other bDMARDs allowed us to improve the model fit. We encourage other investigators to derive and validate prediction models for RA treatment across RCTs and RWD. Machine learning algorithms may play important roles in optimizing prediction rules.

Declarations
Ethics approval and consent to participate: All patients in the study database gave written informed consent.
Consent for publication: Not applicable.
Availability of data and material: The data used herein are property of Corrona. All qualified investigators are free to contact Corrona and inquire about use of the data.

Figure 1. Calibration of logistic regression (LR) and random forest models trained on all bDMARDs in RWD using the extended feature set and evaluated on held-out TCZm patients from the RWD. The predictions of each model have been adjusted using Platt scaling. Calibration is assessed in the four quartiles of predicted remission probability. We note that the majority of patients (75%) have a predicted probability of remission around or below 0.1.