This analysis used simulated data with a binary outcome. We started by generating an **event probability** for each of 5000 simulated patients: an **event log-odds** value was randomly drawn from a normal distribution with a mean of 0 and a standard deviation of 0.75 and then converted to an event probability using the formula \(e^{\text{log-odds}}/(1+e^{\text{log-odds}})\). As a result, the overall event probability in the patient cohort was 50%. Each person was then deemed to have had an event if a uniformly distributed random number between 0 and 1 was below their event probability.
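This data-generating step can be sketched in Python (the seed and variable names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
n_patients = 5000

# Event log-odds ~ Normal(0, 0.75), converted to a probability with the
# inverse logit, then compared to a Uniform(0, 1) draw to assign events.
log_odds = rng.normal(loc=0.0, scale=0.75, size=n_patients)
event_prob = np.exp(log_odds) / (1.0 + np.exp(log_odds))
event = (rng.uniform(size=n_patients) < event_prob).astype(int)
```

Because the log-odds distribution is centred at 0, the cohort event rate lands near 50%.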

**Model-based event probabilities** were then generated by systematically modifying each person’s event log-odds. Fifteen distinct model-based event probabilities (reflecting probability estimates from 15 different models with varying accuracy) were created for each patient in three steps:

i) a *bias parameter* (-0.8, -0.2, 0, 0.2, or 0.8) was added to the event log-odds;

ii) a random number from a normal distribution with a mean of 0 and a standard deviation equal to the *precision parameter* (0, 0.25, or 0.50) was added to the output from i);

iii) the output from ii), denoted log-odds′, was used to generate a model-based expected event probability using the formula \(e^{\text{log-odds}'}/(1+e^{\text{log-odds}'})\).
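Steps i) to iii) can be sketched as follows (the true log-odds are regenerated here so the snippet stands alone; the parameter grids come from the text):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n_patients = 5000
log_odds = rng.normal(0.0, 0.75, size=n_patients)  # true event log-odds

bias_params = [-0.8, -0.2, 0.0, 0.2, 0.8]
precision_params = [0.0, 0.25, 0.50]

model_probs = {}  # (bias, precision) -> model-based event probabilities
for bias, precision in product(bias_params, precision_params):
    shifted = log_odds + bias                                   # step i
    noise = rng.normal(0.0, precision, size=n_patients) if precision > 0 else 0.0
    noisy = shifted + noise                                     # step ii
    model_probs[(bias, precision)] = np.exp(noisy) / (1 + np.exp(noisy))  # step iii
```

The 5 bias values crossed with the 3 precision values yield the 15 models; the (0, 0) model reproduces the true event probability exactly.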

These 3 steps generated, for each simulated person, a total of 15 model-based expected event probabilities (5 bias parameters × 3 precision parameters) with varying bias and precision. Notably:

i) One model (the one with a bias parameter of 0 and a precision parameter of 0) returned a model-based event probability equal to the true event probability used to generate the observed events. This model should be the most accurate.

ii) We had model pairs with the same precision parameter but bias parameters of equal magnitude and opposite sign (e.g. −0.2 and +0.2), shifting the event probability equally above and below its true value. The accuracy of these model pairs should be the same.

**Model applicability** was then determined for all 15 models. A model was deemed applicable to a particular patient if a randomly selected number from a uniform distribution between 0 and 1 was less than 0.9, with an independent draw for each patient–model combination. This process produced distinct model applicability patterns across patients.
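A minimal sketch of the applicability step, assuming an independent draw per patient–model pair (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n_patients, n_models = 5000, 15

# Each model applies to each patient independently with probability 0.9.
applicable = rng.uniform(size=(n_patients, n_models)) < 0.9

# An "applicability pattern" is the subset of models applying to a patient;
# patients sharing a pattern are analysed together later.
patterns = {}
for row in applicable:
    key = tuple(row)
    patterns[key] = patterns.get(key, 0) + 1
```

With 15 models, most patients fall into patterns missing one or two models, so many distinct patterns arise per iteration.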

These processes were repeated 1000 times (i.e. 1000 **iterations** of 5000 patients each).

## Analysis

Within distinct model applicability patterns of each iteration, the accuracy of each model was quantified with the Scaled Brier Score (SBS):

$$\left[1\right]\quad SBS = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(p_i - y_i\right)^2}{\frac{1}{n}\sum_{i=1}^{n}\left(\bar{y} - y_i\right)^2}$$

Where: n = number of people with that applicability pattern; \(p_i\) = model-based event probability; \(y_i\) = event status (‘1’ if the event occurred and ‘0’ if it did not); and \(\bar{y}\) = event rate in the n patients. The SBS ranges from −∞ to 1 and increases with model accuracy; a model with an SBS of 0 is as accurate as predicting the average event rate for all people. The SBS was used instead of the Brier Score because values of the latter vary with event rates3 (and we expected event rates to vary between applicability patterns). We calculated the Brier Score variance using methods from Bradley et al.4; this was divided by the denominator of the second term of the SBS (see equation [1]) to obtain the SBS variance.
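Equation [1] translates directly into code (function name is ours):

```python
import numpy as np

def scaled_brier_score(p, y):
    """Scaled Brier Score (equation [1]): 1 minus the ratio of the model's
    Brier score to the Brier score of predicting the mean event rate."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    brier = np.mean((p - y) ** 2)
    brier_ref = np.mean((y.mean() - y) ** 2)  # undefined when all y are equal
    return 1.0 - brier / brier_ref
```

Perfect predictions give an SBS of 1, and predicting the event rate for everyone gives an SBS of 0; the reference term is why the score is undefined when the event prevalence is 0% or 100%.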

Except in applicability patterns where all 15 prediction models applied, at least one of the 15 SBS values was missing. This data structure is like that seen in network meta-analysis (NMA), in which each study includes only a subset of the therapeutic interventions. We therefore applied fixed-effects methods described by Loveman et al.5 for NMA with a continuous outcome. Fixed-effects methods, rather than random-effects methods, were used because data heterogeneity was not a concern: all prediction models were simulated using the same methods. The fixed-effects model (PROC GENMOD, SAS 9.4, Cary NC) predicted SBS as a function of two class variables: i) the model applicability pattern; and ii) the prediction model. Observations in the fixed-effects analysis were weighted by the inverse of the SBS variance. The analysis excluded applicability patterns with 5 or fewer patients and those with an observed event prevalence of 0% or 100%; the latter condition was imposed because the Brier Score variance is undefined in the absence of events.4
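The analysis itself used PROC GENMOD; a rough Python analogue, using inverse-variance-weighted least squares with dummy-coded class variables (function name and structure are ours, a sketch rather than the paper's implementation), is:

```python
import numpy as np

def weighted_fixed_effects(sbs, pattern_idx, model_idx, weights):
    """Weighted least squares for SBS ~ applicability pattern + prediction model
    (both dummy-coded; the first level of each is the reference).
    All arguments are numpy arrays; weights should be 1 / SBS variance.
    Returns the prediction-model effects relative to the reference model."""
    n = len(sbs)
    n_pat = int(pattern_idx.max()) + 1
    n_mod = int(model_idx.max()) + 1
    X = np.zeros((n, 1 + (n_pat - 1) + (n_mod - 1)))
    X[:, 0] = 1.0                                       # intercept
    for i in range(n):
        if pattern_idx[i] > 0:
            X[i, pattern_idx[i]] = 1.0                  # pattern dummies
        if model_idx[i] > 0:
            X[i, n_pat - 1 + model_idx[i]] = 1.0        # model dummies
    w = np.sqrt(weights)                                # weighted least squares
    beta, *_ = np.linalg.lstsq(X * w[:, None], sbs * w, rcond=None)
    return beta[n_pat:]                                 # model effects only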

In the fixed-effects model, the unbiased and precise prediction model served as the reference. Parameter estimates for all other prediction models were therefore the absolute difference in SBS between each model and the unbiased, precise model (after accounting for the different applicability patterns). The fixed-effects model was repeated within each of the 1000 iterations. The mean SBS for each model was then calculated, with 95% confidence intervals created using the percentile method.6
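The percentile-method interval is simply the empirical 2.5th and 97.5th percentiles across iterations; a sketch with synthetic placeholder estimates:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical per-iteration parameter estimates for one prediction model
# across 1000 iterations (placeholder values, not study results).
iteration_estimates = rng.normal(loc=-0.05, scale=0.01, size=1000)

mean_sbs = iteration_estimates.mean()
ci_low, ci_high = np.percentile(iteration_estimates, [2.5, 97.5])
```

No distributional assumption is needed beyond the spread of the 1000 iteration-level estimates themselves.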

Similar to the Surface Under the Cumulative RAnking (SUCRA) score in network meta-analysis,7 we quantified each prediction model’s relative accuracy using the **NEtwork Relative Model Accuracy (NeRMA)** score as:

$$\left[2\right]\quad \text{Model NeRMA score} = 1 - \frac{\text{Relative Model SBS}}{\text{Random Risk Prediction SBS}}$$

Where: Relative Model SBS is the prediction model’s SBS relative to the most accurate prediction model (i.e. the model with the greatest parameter estimate in the fixed-effects model); and Random Risk Prediction SBS is the SBS for a randomly generated prediction model. The random risk prediction SBS was measured by adding to the analytical dataset a risk estimate for each person drawn from a uniform distribution between 0 and 1. The model NeRMA score therefore ranges from 1 (the most accurate prediction model) through 0 (a model as accurate as randomly estimating risk) to below 0 (a model less accurate than randomly estimating risk).
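Equation [2] as a one-line function (argument names mirror the terms above; both inputs are SBS differences relative to the most accurate model, so both are typically ≤ 0):

```python
def nerma_score(relative_model_sbs, random_prediction_sbs):
    """Equation [2]: 1 for the most accurate model (relative SBS of 0),
    0 for a model no better than random risk prediction, and negative
    for a model worse than random risk prediction."""
    return 1.0 - relative_model_sbs / random_prediction_sbs
```

A model midway between the best model and random prediction would score 0.5.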

Anchoring the NeRMA score at 0 to indicate the accuracy of a random model required adding data to the analytical dataset (namely, randomly generated event probabilities). These additional observations could artificially narrow confidence intervals around parameter estimates in the fixed-effects model. To account for this, we first measured the 95% confidence interval width for each prediction model’s parameter estimate in the fixed-effects model without random risk predictions. We then calculated each model’s confidence interval width relative to its parameter estimate:

$$\left[3\right]\quad 0.5 \times ABS\left(\frac{\text{Upper 95\% CI} - \text{Lower 95\% CI}}{PE}\right)$$

Where: ABS is the absolute value function; Upper 95% CI is the upper confidence limit; Lower 95% CI is the lower confidence limit; and PE is the model’s parameter estimate. In the final fixed-effects model (with the random-model data included), this value was multiplied by each parameter estimate of the final fixed-effects model, and the product was subtracted from and added to that parameter estimate to create 95% confidence intervals.
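The confidence-interval construction can be sketched as follows (function and argument names are ours; sorting the limits guards against a negative parameter estimate flipping them):

```python
def adjusted_ci(pe_final, pe_no_random, ci_low_no_random, ci_high_no_random):
    """Rescale the relative CI width (equation [3]) from the model fit without
    random risk predictions onto the final model's parameter estimate."""
    half_rel_width = 0.5 * abs((ci_high_no_random - ci_low_no_random) / pe_no_random)
    delta = half_rel_width * pe_final  # multiplied by the final estimate, per text
    lo, hi = sorted([pe_final - delta, pe_final + delta])
    return lo, hi
```

This preserves each estimate's *relative* precision from the unaugmented model rather than the artificially narrow intervals produced after the random-prediction rows are added.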