In order to be comparable with OLS, we use two-stage least squares (2SLS) as the main statistical method to generate the IV estimates of treatment effectiveness. Despite the fact that 2SLS may cause model misspecification for binary outcomes and treatment, the 2SLS is the most common method and a common starting point for the IV method (12). In addition, in many settings, when the outcome is not rare, the 2SLS estimates generates similar estimates to non-linear two stage regression (prevalence between 1.5–50%) (13, 14).

## Treatment and outcome

In this paper, we focus on scenarios where the treatment and outcome are both binary.

The formula for the probability of being prescribed a certain treatment (X = 1) and the probability of the outcome of interest (Y = 1) are listed below:

$$Prob(X=1)={\alpha }_{0}+{\alpha }_{z}PPP+{\alpha }_{1}X1+{\gamma }_{x}X2+{\alpha }_{3}X3$$

$$Prob(Y=1)={\beta }_{0}+ {\beta }_{x}*Prob\left(X=1\right)+{\beta }_{1}X1+{\gamma }_{y}X2+{\beta }_{3}X3$$

PPP stands for instrumental variable. We set PPP 70% of chance equals to 1, 30% of chance equals to 0. This imbalance reflects a common situation that treatment providers tend to prefer one type of treatment than another (perhaps based on following clinical guidelines). X1 is a binary covariate, and X2, X3 are continuous covariates. We assume X1 and X3 are measured covariates and X2 is an unmeasured covariate. In the Data generation process, X1 follows binominal distribution, X2 and X3 follows the normal distribution. These are implemented using R functions rbinom and rnorm (please see code in supplementary material for full details). \({\alpha }_{z}\) controls the strength of association between the instrumental variable and exposure. The PPP is the ‘true’ prescribing preference that in practice is a latent variable and is a binary variable. The parameter values for the data generation process are listed in equation (1) and (2).

The focus of this study is to investigate the impact of unmeasured confounding. Therefore, we keep \({\alpha }_{z}\)=0.4 to ensure the IV strength is fixed.

The parameter value for treatment in equation (2) is 0.1 and this represents the ‘true’ risk difference between the two treatments. \({\beta }_{x}\) is the observed estimate of this risk difference.

$$\left[Prob(X=1)\right]={\alpha }_{z}PPP+0.053X1+0.1 X2+0.02X3 \left(1\right)$$

$$\left[Prob\left(Y=1\right)\right]=0.10*treatment+0.04X1+{\gamma }_{y}X2+0.01X3-0.01 \left(2\right)$$

Draw from the existing literature(1, 2, 4, 6), we constructed the proxies for PPP mainly based on the prescription history. The prior 1 to prior 4 prescriptions are investigated in this study. The prior 1 prescription is the most recent prescription made by the same physician. Likewise, the prior 2 prescription is prior 2 prescriptions from the same physician and the same for prior 3 and prior 4 prescriptions. For example, possible values for prior 4 prescriptions are 0,1,2,3,4. The proportion PPP is the number of certain treatment (X=1) divided by the number of all prescriptions made by this physician (See below).

Proportion PPP=\(\frac{\text{N}\text{u}\text{m}\text{b}\text{e}\text{r} \text{o}\text{f} \text{d}\text{r}\text{u}\text{g} \text{A} \text{m}\text{a}\text{d}\text{e} \text{b}\text{y} \text{o}\text{n}\text{e} \text{p}\text{h}\text{y}\text{s}\text{i}\text{c}\text{i}\text{a}\text{n} }{\text{N}\text{u}\text{m}\text{b}\text{e}\text{r} \text{o}\text{f} \text{a}\text{l}\text{l} \text{p}\text{r}\text{e}\text{s}\text{c}\text{r}\text{i}\text{p}\text{t}\text{i}\text{o}\text{n}\text{s} \text{m}\text{a}\text{d}\text{e} \text{b}\text{y} \text{t}\text{h}\text{e} \text{s}\text{a}\text{m}\text{e} \text{p}\text{h}\text{y}\text{s}\text{i}\text{c}\text{i}\text{a}\text{n}}\)

All analysis is done in R studio with R version 3.6.1. The R code that generates the simulated datasets and the regression models can be seen in the supplementary material.