Causal Deep Learning on Real-world Data Reveals the Comparative Effectiveness of Antihyperglycemic Treatments

Type 2 Diabetes is associated with severe health outcomes, the effects of which are responsible for approximately 1/4 of total U.S. healthcare spending. Current treatment guidelines endorse a massive number of potential anti-hyperglycemic treatment options in various permutations and combinations. Personalized strategies for optimizing treatment selection are lacking. We analyzed real-world data from a nationwide population of over one million diabetics to evaluate the comparative effectiveness of more than 80 different treatment strategies ranging from monotherapy up to combinations of ve concomitant classes of drugs across each of 10 clinical subgroups dened by age, insulin dependence, and number of other chronic conditions. Our causal deep learning approach developed on such data allows for more personalized recommendations of treatment selection. We observe signicant differences in blood sugar reduction between patients receiving high vs low ranked treatment options and that less than 2% of the population is on a highly ranked treatment. This method can be extended to explore treatment optimization of other chronic conditions.


Introduction
Recent data from the Center for Disease Control and Prevention (CDC) estimates that approximately 13 percent of the adult population of the United States, or about 34 million people, have been diagnosed with diabetes mellitus 1 . When insu ciently managed, diabetes leads to complications including cardiovascular disease, kidney disease, neuropathy, and blindness, any of which can dramatically impair an individual's quality of life. The high incidence of diabetes and concomitant complications put a major burden on the US health care system in terms of care utilization and costs, with one recent report estimating that one of every four healthcare spending dollars in the United States can be directly attributed to diabetes 2 .
Diabetes is typically managed by a combination of lifestyle interventions and pharmacological treatments. For the latter, current guidelines stipulate that unless otherwise contra-indicated, initial therapy for type II diabetes mellitus (T2DM) should be metformin 3 . If this rst-line therapy is insu cient, combination therapy with anti-hyperglycemic drugs from two or more classes is suggested. There are multiple second-line choices with various risks and bene ts, and a clinician may therefore need to attempt multiple treatment combinations before nding one that works for their patient. There have been efforts to determine sequential treatment of diabetes, both with data-driven informatics methods 4 and with expert-curated guidelines 5 , however, both of these approaches take into account few patient-speci c characteristics and can be ambiguous in suggesting the next best option for an individual patient. Even when glycemic control is achieved, there is currently no simple way to know whether a different combination might be more optimal for a given patient, either by providing greater glycemic control, by simultaneously managing comorbidities, or by providing equivalent control at a lower cost or with fewer total drugs or with less side effects. Given the complexity of potential treatments for T2DM, patients can often bene t from subspecialist management by an endocrinologist. 6 However, the current shortage of endocrinologists is only projected to grow, thus leaving the majority of T2DM treatment to primary care practitioners (PCPs). Nonetheless, a PCP is often managing several other issues in addition to T2DM and may not be as familiar with diabetes guidelines 7 or the newest treatment modalities 6 . The enormous heterogeneity of treatment decisions observed in daily clinical practice is indicative that optimal treatment regimens have not been identi ed 8 . Therefore, an understanding of the real-world comparative effectiveness of pharmacological diabetes treatment strategies would be highly desirable to help guide T2DM management.
Research on comparative e cacy and effectiveness of anti-hyperglycemic drugs has been expanding.
The ADOPT trial examined the relative e cacies of three monotherapies using a randomized, doubleblind study examining the time to monotherapy failure over multiple years on 4360 relatively healthy patients between the ages of 30 and 75 9 . Causal analysis methods, such as the frameworks developed by Rosenbaum and Rubin 10 for observational data are now robust and widespread in clinical research.
Meta-analysis 11 and Network Meta-Analysis 12 have made it possible to combine results from multiple cohorts to respectively gain effect insights from a combined pool of patients and leverage both direct and indirect comparisons between treatment arms to reduce measurement uncertainty. These approaches have been applied to a growing body of literature on the effects of T2DM pharmacological interventions. RCTs, with an average duration of four months, including 32,000 total participants with an average baseline HbA1C of 8%, comparing dual therapies from ve drug classes added to Metformin 14 . Looking at real-world data, Ryan et al. compared Sodium-Glucose Transport Protein 2 (SGLT2) inhibitors versus non-SGLT2 inhibitor mono-therapies in a meta-analysis on observational data from four large real-world databases 15 and Vashisht et al. evaluated the effectiveness of three second-line monotherapies after metformin treatment by performing meta-analyses on observational data 4 . Although each of these studies has contributed substantially to clinical knowledge, a comprehensive understanding that re ects the realities of daily practice including diverse patients who may be on more than two classes of antihyperglycemics is still missing.
In recent years, there has been a rapid trend towards digitization in the healthcare industry. Patient medical histories are increasingly recorded in electronic format and claims adjudication systems have become streamlined and more automated. This digitization has led to an explosion in the amount of medical data available to learn from. Concurrently, there have been major advances in the elds of arti cial intelligence and machine learning 16 , allowing algorithms to extract complex signals from increasingly larger amounts of data. In medicine, arti cial intelligence models have demonstrated humanlevel performance in interpreting dermatology 17 and ophthalmology 18 images. Deep neural networks trained on electronic health records (EHR) have been used to estimate the risk of disease onset 19 , the risk of hospital readmissions 20 , and to forecast the future health state of individuals with complex diseases 21 . The combination of these factors has made it possible to use arti cial intelligence to extract causal insights from large-scale observational studies, which was previously infeasible.
Here, we demonstrate how an approach that combines deep learning, causal inference, and network meta-analysis (NMA) can be used to estimate the real-word comparative effectiveness of combination therapies for T2DM in clinically strati ed sub-populations. Using the change in levels of glycated hemoglobin (HbA1c) as the primary outcome of interest, we measure effectiveness by estimating confounder-adjusted average treatment effects (ATE) of each treatment strategy observed using a nationwide cohort with more than 1 million patients with T2DM and at least one HbA1c measurement. Our work departs from previous research in several important ways: (i) we go beyond single or dual therapies and compare all treatment regimens observed in the data without imposing restrictions on the number of drug classes; results on combinations of up to ve drug classes are reported here; (ii) we perform this analysis on 10 sub-groups strati ed based on clinical variables to make the rankings more personalized; (iii) we use a recently developed deep-learning-based propensity-score model for causal analysis that scales well to large multi-arm observational studies; and (iv) we perform sensitivity analysis on held-out data in order to assess the extent to which the comparative effectiveness rankings were meaningful and broadly generalizable. These rankings for combination therapies can complement/enhance guideline-based practice and help clinicians make personalized data-driven decisions when formulating treatments for their patients.

Inclusion Criteria
Temporal snapshots of patient trajectories beginning at the start of a patient's health history and ending with each subsequent HbA1c lab measurement were generated for each person from their ve-year history of medications, diagnoses, procedures, and relevant laboratory values to assess the treatment strategies and resulting causal effect calculations ( Figure 1).

Characteristics of the Study Population
Patient variable values (Table 1) show that the study cohort had a mean age of 55, with a baseline HbA1c, EGFR and Creatinine of 10.4%, 95 mL/min/1.72m2, and 0.9 mg/dL respectively. Neighborhood incomes ranged signi cantly with a median of fty-ve thousand dollars annually. White, Black, and Asian populations were well represented with Whites being in the majority. As expected within a population of patients with signi cantly elevated blood sugar, a wide range of comorbid conditions were present, with obesity, heart disease, COPD, and renal disease being most prevalent. The test cohort statistically matched with the study cohort on all variables.

Characteristics of Treatment Pathways
To account for potential differences in patient needs and treatment goals, patient snapshots were assigned to one of ten clinical subgroups on the basis of each patients' age, insulin dependency status, and number of additional chronic health conditions at the time of each index event. The number of snapshots present as well as the number of treatment strategies that subgroups were exposed to tended to decrease with age and disease burden. (Table 2). Though these trends did not decrease monotonically. Subgroup A had the largest number of snapshots (n = 53,532) and treatments (n = 69) and Subgroup H had the fewest snapshots (n = 2,931) and number of treatments (n = 16). There were considerably more insulin naïve snapshots (n=99,116) than insulin dependent (n=24,184). Across insulin status, subgroups <65 years of age had an average of 81,338 snapshots and 162 treatments while those >65 had 24,674 snapshots and 121 treatments. The number of distinct treatment strategies observed in a subpopulation was correlated to the number of patient snapshots present, the larger the population the more unique treatment strategies were observed.

Causal Modeling of Optimized Treatment Ranking by Subpopulation
Signi cant differences existed in the underlying covariate distributions between treatment and control arms in the observational data but were successfully balanced through the BCAUS methodology (Supplemental Figure S1). The confounder-adjusted causal effect of each treatment strategy on each clinical subpopulation was calculated independently for each subgroup ( Figure 2). League plots were produced (Supplemental Figures S12-S21), and treatment strategies were ranked based on causal reduction in HbA1c.
The top 10 most effective treatment strategies for each subgroup (Table 3), revealed that the highest ranked treatment strategy was unique to each subgroup. Indeed, while trends were certainly present, no two clinical subpopulations had a single treatment strategy share the same rank. GLP-1's and metformin, both known to be highly e cacious for blood glucose control 22 , are the only classes to appear as monotherapies in any group's top ten ranked treatments, though they only appear for half of the subgroups and never higher than position ve. While a monotherapy is not the top-ranked treatment for any subgroup, the rankings clearly show that simply adding more drug classes to a patient's regime is not uniformly best for blood sugar reduction. Hypoglycemic classes known to provide secondary cardioprotective bene ts, such as SGLT2's and GLP-1's, feature prominently in the top ten choices for each subgroup, though trends in which is most effective varies by subgroup. Nevertheless, this nding may indicate that there is no need for a trade-off between glucose control and cardio protection and that both can be optimized for simultaneously. Interestingly, in the insulin-naïve groups, insulin-containing regiments tend to rank poorly, suggesting poor real-world effectiveness of a medication known to have high e cacy in Randomized Controlled Trials 22 . A complete listing of the rankings for all treatment strategies across all subpopulations can be found in the Supplement (S22) as well as the measured causal effects, con dence intervals, and observed population sizes for each strategy in each population (S2-S11). When treatments were divided into three groups based on rank (high:1-3, middle: 4-10, low: below 10) signi cant differences in patient outcomes between highly ranked treatments and all other choices were observed for nine out of ten subgroups ( Figure 3). The differences were signi cant clinically as well as statistically, persisted even after controlling for differences between patients that received highly ranked choices versus others, and generalized extraordinarily well to the test subgroups. A sensitivity analysis revealed a consistent dose response relationship between highly ranked, middle ranked, and low ranked treatment strategies (Supplement Figure S23).

Ranking Group Prescription Patterns in Real-World Observational Data
After determining that treatment strategies with higher ranks caused larger reductions in blood sugar, we examined the study population to measure the distribution of high, medium, and low ranked treatment strategies provided to patients in each clinical subgroup (Table 4). We found that a less than two percent of patients were provided a highly ranked treatment choice and the vast majority were on low ranked treatment options. Across all subgroups, the average treatment rank per snapshot was 28. We observed the lowest rates of concordance among the younger or relatively healthier subgroups. The likelihood of switching treatments between snapshots was 47 percent. When patients switched treatments, 49 percent of those switches lead to a new treatment with a better rank (with an average improvement of ten positions of rank), while fty-one percent of switches lead to treatments with a worse rank for the patient (with an average decrease in rank of eleven). Overall, changes did not lead to improvements, the mean change in treatment rank across all changes was decrease of 1.

Discussion
In this study, we examined the anti-hyperglycemic treatment strategies over a ve-year period in a nationwide cohort of patients with type II diabetes whose index HbA1C was 9% or higher. We found over 80 different strategies of drug class combination, ranging from monotherapy to combinations of ve distinct drug classes. This enormous heterogeneity persisted even after accounting for age, number of comorbidities, and status of insulin dependence. We then performed a network meta-analysis using deep causal models on the cohort's observational data to rank treatment strategies for the ten clinical subgroups based on e cacy in lowering HbA1C. We found that the rankings differed between each of the subgroups, were sensitive even for infrequently observed treatments, and that they generalized well to patients in our test set. Highly ranked treatments were clinically and statistically signi cantly better than other choices. There were considerable differences between which treatments where best for each of the clinical subgroups and cardioprotective drug classes featured prominently, though the speci c class and combination was subgroup dependent. We observed that fewer than 2% of patients were on an optimal treatment and that treatment switches, when they occur, move patients into worse strategies as often as they result in better strategies.
That treatment strategies for diabetes mellitus are massively heterogenous is well known. While Hripcsak et al. 8 observed that 10% of diabetic patients in their international study had treatment pathways that were unique speci cally to that individual, the authors hazard that the variability was not a sign of personalization but rather that "it may point to a failure of the eld to converge on an effective treatment". To our knowledge, no previous studies have examined comparative e cacy between all observed treatment strategies in multiple clinically relevant real-world sub-groups. The mono-therapy 9 and dual-therapy 14 results found in this work are reasonably consistent with prior published results, that were limited to those two options. However, differences in cohort sizes and inclusion criteria make direct comparisons di cult. For example, the ADOPT trial contained fewer than 5,000 participants and excluded patients with more advanced disease that would not be eligible for monotherapy. Mearns et al. combined all patients from dual-therapy trials, regardless of age, disease severity, or comorbid conditions, making it impossible to directly compare results to our clinically strati ed subgroups.
We found no treatment strategies that were in violation of current standard of care guidelines provided by UpToDate. While it is reassuring that guidelines are followed, it also reinforces the concern that guidelines may be insu cient at guiding treatment choices for blood sugar reduction. Instead of contradicting current best practices, our ndings provide clarity on which strategies may be best when the guidelines provide many to choose from. It is perhaps not surprising that so few patients are on a treatment that may be optimal for them. Patients with highly elevated HbA1C are, by de nition, those who have not yet found a treatment strategy su cient to control their blood sugar. Additionally, since the current guidelines unilaterally suggest a progressive approach from mono to dual therapy, followed by experimentation within dual-therapy before adding more drug classes, there is a diffusion effect that necessitates a long time until su cient experimentation has occurred to identify a good strategy for many patients.
Although we believe that this work has potentially signi cant clinical value, and may even provide the signal necessary for the eld to identify effective treatments that Hripcsak et al. have called for 8 , there are two important limitations. First, we have de ned effectiveness exclusively on the grounds of HbA1C reduction. This choice is reasonable given that HbA1C as a surrogate endpoint is the most used outcome for clinical trials and that our study population is comprised of patients with highly elevated blood sugar, however, there are additional clinical endpoints, particularly those related to cardiovascular outcomes, that are relevant for patients with diabetes. While many of these clinical endpoints are strongly correlated with HbA1c, conclusions about how treatment protocols that utilized the rankings derived here would impact these endpoints cannot be drawn from this work. That said, the strong presence of treatments with secondary protective effects in the top ranked choices may indicate that while the effect on cardiovascular and renal outcomes is not captured here, never-the-less the treatment choices that are most effective for glycemic control are top choices for secondary protective effects as well. A second limitation is that we calculated impact on HbA1C only at the follow-up measurement after a treatment was assigned (a 6-month median window). This time period is su cient to see the effects of medication changes considering the half-life of hemoglobin, including the slower-acting thiazolidinediones 22 , but may not be perfectly indicative of long-term trends and tolerability. Given the length of time it takes for diabetes-related complications to develop, causal attribution to speci c treatment strategies is clouded by the many patient-related factors that can change over such a length of time, such as the course of treatment. However, a longer study period that tracks clinical endpoints as well as laboratory endpoints is desirable and could be feasible as datasets such as this grow over time. Future studies could leverage our methodology to de ne effectiveness by distance from a speci ed blood sugar value for each clinical sub-population, instead of absolute HbA1c reduction. Alternatively, maximal risk reduction for microvascular or macrovascular outcomes could be used as the endpoint instead of HbA1c. However, we believe such investigations would be most robustly served by a prospective study tracking the impact on multiple clinical endpoints from prescribing high-ranked treatment strategies to achieve subgroup-speci c targets.
Although the treatment rankings presented in this work are xed, they can supplement guidelines to support many different approaches to decision making. For example, based on patient needs and clinician preferences, some may choose not to prescribe the highest ranked treatment but instead the highest-ranked option that involves the smallest change from the current regimen. Alternatively, patients for whom compliance may be a concern, selecting a treatment that optimizes rank with the fewest number of total drugs would be an option. For every highly ranked strategy in our study that contained many different drug classes, there was usually a simpler combination with nearby rank. The enormous variety of ways in which these comparative e cacy rankings could be utilized may be best leveraged by software with a performant, intuitive user interface to return the optimized results for a given patient target. Such software could also provide additional metrics captured in this study, like the number of patients observed on each treatment strategy and the clinical and demographic parameters associated with each person, which are not possible to display in the context of individual patients within a manuscript like this. Additional convenience functions, such as the removal of contraindicated treatments from the rankings list for each individual patient may be desired.
Taken together, these ndings have important implications for personalizing type 2 diabetes mellitus recommendations without any tradeoff for cardiovascular protection. In real-world terms, we may be to provide more effective, more personalized treatment strategies to 98% of established diabetic patients immediately, although future work will have to con rm these hypotheses. We believe that the approach outlined here represents a concrete step towards a functional learning healthcare system, and that it is immediately extensible to other conditions beyond diabetes mellitus that have complex pharmacological treatment patterns such as hypertension, asthma, chronic obstructive pulmonary disease, depression, and congestive heart failure. By forestalling adverse events that arise from unmanaged chronic diseases, such learning systems could greatly reduce patient suffering and lead to signi cant reductions in healthcare costs.

Study Cohort De nition and Data Preparation
We analyzed electronic health records from 56.4 million members in Anthem Inc's subscriber population between December 1 st , 2014 and January 1 st , 2020. The records included approximately ve billion insurance claims (for diagnoses, procedures, and drug prescriptions or re lls) as well as lab test results for the associated patients. Clinical lters were designed to distinguish between major sub-types of diabetes, and patients with Type I, anyone under 18 years of age, or Gestational Diabetes were excluded from the study (Figure 1a). We also excluded individuals with histories of diabetes ketoacidosis, cystic brosis, or solid-organ transplants as a safety precaution because they are highly complex patients who would clearly bene t from subspecialist care and the rankings developed herein are targeted towards PCPs managing typical patients with Type II Diabetes Mellitus (T2DM). We removed patients with recent HbA1c's below 9% to focus on those who were clearly eligible for treatment strategies beyond rst line. This resulted in a study population of 1.2 million individuals with T2DM.
The health status of any individual evolves with time. Since the study period in our work spanned several years, to properly account for this evolution, we split each individual's health history into a series of temporal snapshots as shown in Fig. 1(b). Each snapshot terminated in a pair of HbA1c lab measurements. The time period between the two labs was considered as the observation period. The age of the individual in a particular snapshot and any clinical covariates that were treated as confounders were measured as of the date of the rst lab of the pair. Individuals with only a single HbA1c lab reported were excluded. Only snapshots where the observation period was between 90 and 365 days were retained, and the rest were excluded, resulting in a nal study population of 141,625 patient snapshots.
As shown in Fig 1(c), an individual was considered as treated by a particular anti-hyperglycemic drug at the time of a particular HbA1c lab event if it was prescribed prior to that lab and if the number of days of supply plus a grace period of 30 days (for non-adherence) extended past the lab date. When multiple such drugs existed, the individual was considered as treated by the combination of these drugs. Diabetes drugs were identi ed only by their class names (e.g. SGLT2 inhibitors, sulfonylureas, etc.) and nondiabetes drugs were excluded. Prior treatment was the regimen used to treat the individual in the period prior to the observation window between the two labs. All further analysis was performed on this pseudopopulation of patient snapshots.
Many clinical and social factors are known to be associated with diabetic treatment selection and HbA1c outcomes. For example, kidney function as well as the presence of various comorbid conditions may result in contraindications for certain antihyperglycemic drug classes and may also in uence the HbA1c value that the prescribing clinician targets for an individual. Additionally, Social Determinants of Health (SDoH) such as patient race, income, and location are known to in uence both treatment selection and health outcomes. In order to control for these confounding factors so that an accurate estimate of the causal effect of treatment strategies could be obtained, we included all comorbidities present in the history of each patient using diagnostic de nitions de ned by the Charslon Comorbidity Index (CCI) 23 , as well as the most recent EGFR and Creatinine values at the time of each snapshot. Race is known to be reported at very low levels both within EHRs and claims data. Accordingly, we chose to use censusderived data on the racial and economic pro les of each patient's neighborhood using zip codes. We believe that these are weak surrogates for true SDoH markers, but that to use them is still signi cantly better to ignore SDoH completely from large scale clinical studies. Table 1 provides the summary statistics of all covariates that were treated as confounders for causal inference.
Snapshots were strati ed into ten clinical sub-populations that were de ned based on age, number of additional chronic diseases, using disease de nitions from the unweighted Charslon Comorbidity Index (CCI) 23 , and prior insulin use as evidenced by the presence or absence of insulin treatment as of the rst HbA1c lab of the observation period. Summary statistics for the 10 segments are shown in Table 1 For comparison all summary statistics in Table 1 are reported at the individual as well as patient snapshot levels.

Causal Inference Modeling
We considered several methods for the causal inference analysis presented here. Because there are multiple possible combinations of treatments, the number of head-to-head comparisons that need to be performed is extremely large. Propensity score matching 10 or weighting 24 methods are widely used for observational studies but are considered "do-it-yourself," 25 in that the propensity score model must be checked for correct speci cation after it is trained, and, when incorrectly speci ed, it has to be retrained by modifying model parametrization or feature engineering. Automated methods like Bayesian Additive Regression Trees 26 have yielded good performance on benchmark datasets 25 , but rely on Monte Carlo sampling and are therefore prohibitively slow for the number of comparisons necessary in this study. We recently introduced 27 a technique called BCAUS (Balancing Covariates Automatically Using Supervision) that scales well to massive multi-arm studies. BCAUS consists of a neural-network propensity model that is trained using a joint loss given by The rst term, is a binary cross-entropy loss which penalizes incorrect treatment assignment, while the second, is a loss term which explicitly tries to minimize imbalance between inverse probability weighted covariates. Details of the training process are described in Supplementary Materials and a comparison with other state-of-the-art neural-network-based methods on benchmark datasets has been described elsewhere 27 . For each pairwise comparison between diabetes treatments, a separate BCAUS model was trained. The propensity score outputs of trained models were used to estimate average treatment effects using Inverse Probability of Treatment Weighting (IPTW). A bootstrapping procedure was used to compute standard errors and con dence intervals. The input data for NMA consisted of the estimated ATEs and standard errors.

Network Meta-Analysis
An average treatment effect value measured via a direct causal comparison between two treatments has to be consistent with values that are indirectly estimated (under the transitivity assumption) by comparing each treatment of the pair with intermediary treatments and then computing differences.
Separate network graphs were constructed for the 10 clinical subgroups where every treatment node was connected with every other treatment node. Edges representing observational studies where all confounding covariates were not balanced were trimmed and Bayesian NMA was performed over the resultant graph. Heterogeneity in the treated populations was accounted for by using a random-effects, (hierarchical) model, uninformative priors were set, and a Markov Chain Monte Carlo (MCMC) sampling procedure was used to construct posterior distributions of average treatment effect values for all treatment pairs. To determine relative ranks, samples were drawn from the posterior predictive distributions of average treatment effects of all treatments compared against metformin, which was set as the baseline treatment. For each draw, treatments were ranked in ascending order of average treatment effect values (i.e. higher ranks for more negative values), and a mean rank was computed for each treatment across all draws. This mean rank was normalized to compute the SUCRA score. Treatments were ranked in descending order of SUCRA scores such that the treatment that reduced HbA1c by the largest amount relative to metformin had the highest rank. This ranked list of treatments was returned to all members of the subgroup. Further details of the training procedure are available in Supplementary Materials.

Ranking Algorithm Development
A schematic of the work ow used to generate treatment rankings is shown in Fig. 2. Patient snapshots were split randomly within each sub-group into training (80%) and hold-out (20%) sets such that in each set the relative sizes of the 10 clinical subgroups was approximately the same and that no patient was present in both. In each subgroup of the training set, all treatments combinations with a cohort size greater than 35 were chosen and head-to-head average treatment effect estimation were run comparing each treatment with every other treatment within that subgroup. Average treatment effect estimation was performed using a neural-network-based propensity-score model described in Methods. For the 10 subgroups considered here, a total of 15,198 neural networks were trained, one for each observational study. The propensity-score models were used to estimate pairwise average treatment effects and associated 95% con dence intervals. These values were used to construct a densely connected network graph, where each node represented a treatment and edges connecting two nodes represented the casecontrol study between the respective treatments. Bayesian network meta-analysis was performed (see Methods) to compute network-synthesized meta-analytic average treatment effects and 95% credible intervals compared against a baseline treatment. The baseline treatment was set to metformin since it is the rst-line therapy for T2DM and all ATEs were measured relative to metformin. Samples were drawn from the posterior predictive distributions of the trained model to compute Surface Under the Cumulative RAnking curve (SUCRA) scores for all treatments (see Supplement). Treatments were sorted in decreasing order of this score to generate comparative effectiveness rankings.
Standardized difference plots for confounder adjustment analysis are shown in Supplementary Fig. S1. Forest plots for network meta-analysis and league tables for each subgroup are shown in Supplementary   Figs. S2-S21.

Ranking Validation Procedure
To investigate the degree to which the rankings generalized to new patients while generating an estimate of the improvement to HbA1c over existing practices if rankings were used to guide treatment decisions, we retrospectively compared outcomes between patients whose physicians happened to have prescribed a top-3 ranked treatment choice for them versus selecting any other treatment option. Snapshots in each clinical subgroup were divided into concordant cohorts (where the actual prescribed treatment matched one of the top-3 recommendations) and non-concordant cohorts (where a patient was provided any treatment ranked 4 or lower). Differences in the mean change in HbA1c between and the non-concordant cohorts were calculated for all subgroups for both training and test populations. If the difference in means was found to be statistically signi cant, an additional confounder-adjusted observational study was performed between the cohorts to measure whether the differences in means was directly attributable to the differences in treatment strategy ranks.
To further investigate if the rankings demonstrate an internally consistent effect, we performed a sensitivity analysis by splitting patient snapshots of each subgroup in the training dataset into three concordance cohorts: i) the "top" cohort is concordant with treatments ranked 1-3; ii) the "middle" cohorts is concordant with treatments ranked 4-10; and iii) the "bottom" cohort is concordant with treatments ranked 11 and below. We estimated confounder-adjusted average treatment effect values, comparing the top versus bottom groups and the middle versus bottom groups. If the ranks are internally consistent, we would expect a titration effect where-in the top outperforms the middle and the middle outperforms the bottom.
HbA1c lab measurement. Only snapshots where the duration between the lab pairs was between 90 to 365 days were retained and the rest were excluded, resulting in a nal study population of 141,625 patient snapshots. All further analyses were conducted on the snapshots. (c) A patient was considered to have been treated by a particular anti-hyperglycemic drug at the time of a given HbA1c lab event if it was prescribed prior to the lab and if the number of days supply (blue arrows) extended past the lab date.
When multiple such drugs existed, the individual was considered treated by the combination of these drugs. Prior treatment was the regimen used to treat the individual in the period prior to the observation window between the two labs.

Figure 2
Schematic of Ranking Generation and Analysis Snapshots were split into Training (80%) and Hold-Out (20%) datasets. Patients were strati ed into 10 clinical subgroups based on age, number of comorbidities, and prior insulin use ( Table 2). For each clinical subgroup, all treatments with cohort size > 35 were selected and case-control observational studies were performed comparing every treatment with every other treatment using a neural-network-based propensity-score model for causal inference. A densely connected network graph was constructed with treatments as nodes and edges connecting treatments via measured Average Treatment Effect (ATE) values. Bayesian Network Meta-Analysis (NMA) was performed to compute network-synthesized ATEs compared against a baseline treatment which was set to Metformin (the rst-line therapy for T2DM). Treatments were ranked by their Surface Under the Cumulative RAnking curve (SUCRA) scores and the Top-K were returned to each member of the subgroup.
To gauge the e cacy of the causal recommender engine, recommendations were generated for each patient in the hold-out dataset and the observed change in HbA1c was recorded. This change in HbA1c between the concordant cohort (where the prescribed treatment was one of the top-three ranked treatments) and the non-concordant cohort was used to estimate the confounder-adjusted ATE of the recommendations based on rank.