## Patient movement and antibiotic data

We used retrospective patient movement data from the University Medical Center Groningen (UMCG), one of the largest hospitals in the Netherlands with more than 10 000 employees and almost 1 400 beds. Antibiotic usage and patient movement data are stored in an electronic health record (EHR) database. The study period ran from January 2018 to December 2019. The anonymised data consist of admission and discharge dates for each department within the hospital and antibiotic administration times during admission. These data were used to calculate two covariates for each day of the study period: 1) the number of patients in each ward (pat_num); 2) the number of patients using antibiotics in each ward (pat_num_ant).
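The original analysis was performed in R; as a minimal illustration of how such covariates can be derived, the Python sketch below counts patients (and patients on antibiotics) present in each ward per day from simplified stay records. The record structure and function name are assumptions for illustration, not the authors' pipeline.

```python
from datetime import date, timedelta

def daily_ward_counts(stays, start, end):
    """Compute daily ward-level covariates from stay records.

    `stays` is a list of (ward, admission_date, discharge_date, on_antibiotics)
    tuples -- a simplified stand-in for the EHR extract.
    Returns a dict mapping (day, ward) -> (pat_num, pat_num_ant).
    """
    counts = {}
    day = start
    while day <= end:
        for ward, adm, dis, on_ab in stays:
            if adm <= day <= dis:  # patient present in this ward on this day
                n, n_ab = counts.get((day, ward), (0, 0))
                counts[(day, ward)] = (n + 1, n_ab + (1 if on_ab else 0))
        day += timedelta(days=1)
    return counts
```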

### Spatiotemporal graph

The intrahospital patient movement data can be used to construct a dynamic directed spatiotemporal graph (DG) [17]. The graph nodes are the wards, and the edges between nodes are the patients moving between wards. The DG is spatiotemporal and dynamic because it captures the location of patients across the ward structure over time. We created two DGs using the patient movement data and the antibiotics data: the first graph includes all patient movements between wards; the second includes only the movements of patients using antibiotics.
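The paper built these graphs with igraph in R; the following Python sketch shows the same construction with networkx, assuming ward-to-ward transfer pairs have already been extracted from consecutive admissions of the same patient (a hypothetical preprocessing step).

```python
import networkx as nx

def movement_graph(transfers):
    """Build a directed ward-to-ward graph from patient transfers.

    `transfers` is a list of (from_ward, to_ward) pairs; edge weights
    count how many patients moved along each directed edge.
    """
    g = nx.DiGraph()
    for src, dst in transfers:
        if g.has_edge(src, dst):
            g[src][dst]["weight"] += 1
        else:
            g.add_edge(src, dst, weight=1)
    return g
```

Filtering `transfers` down to patients on antibiotics before calling the same function yields the second graph.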

### PageRank algorithm

The PageRank (PR) algorithm determines the centrality or “importance” of a node from the number of other “important” nodes with edges directed towards it [18]. In the context of this study, the PR algorithm estimates the probability distribution of an arbitrary patient ending up in a particular ward. We calculated daily PageRank probabilities for both DGs using a 30-day rolling time window: 1) PageRank of patient movements between wards (PR_pat_num) and 2) PageRank of movements of patients currently using antibiotics (PR_pat_num_ant). PR_pat_num and PR_pat_num_ant represent the centrality of wards in terms of patients and antibiotic use, respectively.
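A rolling-window PageRank of this kind can be sketched as follows, again using networkx in place of the R igraph implementation. The `daily_transfers` structure (day index mapped to transfer pairs) is an assumed input format for illustration.

```python
import networkx as nx

def rolling_pagerank(daily_transfers, day, window=30):
    """PageRank over all transfers in the `window` days ending at `day`.

    `daily_transfers` maps a day index to a list of (from_ward, to_ward)
    pairs.  Returns a dict of ward -> PageRank probability (sums to 1).
    """
    g = nx.DiGraph()
    for d in range(day - window + 1, day + 1):
        for src, dst in daily_transfers.get(d, []):
            w = g[src][dst]["weight"] + 1 if g.has_edge(src, dst) else 1
            g.add_edge(src, dst, weight=w)
    return nx.pagerank(g, weight="weight") if g.number_of_nodes() else {}
```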

### VRE screening data

The number of VRE tests fluctuated between 100 and 300 per week during the study period. There was a VRE outbreak in the second half of 2018 (Figure 1). Outbreak procedures were implemented and hospital ward screening continued. Between July and December 2018, 141 positive VRE tests were reported, with a peak of 25 positive tests in one week. In total, 48 patients tested positive for VRE over the study period. These data were used to calculate the binary outcome variable for this study (1).

$$Y=\begin{cases}1, & \text{number of VRE-positive patients in ward} > 0\\ 0, & \text{otherwise}\end{cases} \tag{1}$$

## Modelling

We estimated the probability that there is at least one patient with VRE in a specific ward (Y) given the covariates pat_num, pat_num_ant, PR_pat_num and PR_pat_num_ant (2).

$$P\left(Y=1 \mid pat\_num,\; pat\_num\_ant,\; PR\_pat\_num,\; PR\_pat\_num\_ant\right) \tag{2}$$

### Decision trees

A decision tree was used to derive a simple set of rules based on the covariates to estimate the probability of Y [19]. The tree was grown on a random 70% training sample of the complete data set. The data were split incrementally by adding question nodes: each node assesses the ability of every covariate to discriminate between the observed binary outcomes and formulates its question using the covariate that discriminates best [20]. We used the Gini index to quantify the discriminatory ability of each covariate at the question nodes [19]. Continuing in this way creates a branching structure that leads to the final decisions, or leaves, of the tree.
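The paper fitted this model with the rpart package in R; the sketch below illustrates the same setup (Gini splitting, 70% training sample, probability estimates for Y) with scikit-learn on synthetic data. All values, including the `max_depth` choice, are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins for pat_num, pat_num_ant, PR_pat_num, PR_pat_num_ant.
X = rng.random((500, 4))
y = (X[:, 1] + X[:, 3] + 0.2 * rng.standard_normal(500) > 1.0).astype(int)

# 70% random training sample, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
proba = tree.predict_proba(X_test)[:, 1]  # estimated P(Y = 1) per test row
```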

### Random forest

The model performance of decision trees was improved by creating an ensemble of decision trees and using them in unison to predict the outcome variable [20]. We used the same 70% random training sample used to train the decision tree model. To build the random forest (RF) model, 500 random samples with replacement (bootstrap samples) were drawn from the training data, and a decision tree was built for each bootstrap sample, considering two randomly selected covariates at each split. The probability of Y was determined by calculating the proportion of the 500 trees that predicted *Y=1*.
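The paper used the randomForest R package; the corresponding scikit-learn sketch below mirrors the stated settings (500 trees, two covariates per split). Note that scikit-learn averages per-tree class probabilities rather than taking a strict vote proportion, which approximates the same quantity; the data here are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((500, 4))  # four synthetic covariates
y = (X[:, 1] + X[:, 3] > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=1)

rf = RandomForestClassifier(
    n_estimators=500,  # 500 bootstrap samples, one tree each
    max_features=2,    # two randomly selected covariates per split
    random_state=1)
rf.fit(X_train, y_train)
proba = rf.predict_proba(X_test)[:, 1]  # averaged tree estimates of P(Y = 1)
```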

We compared the performance of the decision tree and random forest models using the remaining 30% of the data as a test sample. The area under the receiver operating characteristic (ROC) curve (AUC) was used to measure model performance, as it provides a holistic view of how well the model predicts the outcome variable across different levels of sensitivity and specificity [21]. An AUC between 0.7 and 0.8 is considered acceptable and between 0.8 and 0.9 excellent [123].
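The comparison step can be sketched as follows; the labels and predicted probabilities are made-up illustrative values, not results from the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical test-set labels and model probabilities (illustrative only).
y_true = np.array([0, 0, 1, 1, 0, 1])
p_tree = np.array([0.2, 0.6, 0.5, 0.7, 0.3, 0.4])  # single decision tree
p_rf = np.array([0.1, 0.3, 0.8, 0.7, 0.2, 0.9])    # random forest

auc_tree = roc_auc_score(y_true, p_tree)  # imperfect ranking of positives
auc_rf = roc_auc_score(y_true, p_rf)      # perfect ranking on this toy set
```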

## Software

The R statistical programming language was used to perform the analyses in this study [22]. Graphs were created and evaluated using igraph [23]. The decision trees and random forest models were built using the rpart and randomForest packages [24, 25]. In addition, the tidyverse R package was used to clean and structure the data [26].