Study sample
Patients were drawn from the University of North Carolina Health System, an integrated healthcare delivery system where clinical care is managed via a comprehensive electronic health record (EHR). Patients were included if they were ≥ 18 years old and had an ICU admission lasting more than 24 hours between October 2015 and October 2020. ICUs included medical, surgical, neurosciences, and burn units. The hospitals included both community hospitals and academic medical centers. Only the index ICU admission for each patient was considered in this analysis. The institutional review board at The University of Georgia approved this study with a waiver of consent (PROJECT00002652), and all methods were performed in accordance with the relevant guidelines and regulations.
The EHR was queried for patient demographics, medication information, and patient outcomes. Patient demographics included age, sex, admission diagnosis, ICU type, and Acute Physiology and Chronic Health Evaluation II (APACHE II) score. Medication information, including drug, dose, route, duration, and timing of administration, was recorded. Patient outcomes included mortality, hospital length of stay, development of delirium (defined by a positive CAM-ICU score), duration of mechanical ventilation, duration of vasopressor use, and acute kidney injury (defined by the presence of renal replacement therapy or a serum creatinine greater than 1.5 times baseline).
Feature Extraction
Patient demographics: The dataset contained 30,550 "given" medication entries from a total of 991 patients. A total of 440 unique medications were included when generic drug names were used and dose and route information were excluded (e.g., cefepime 1 g and 2 g were both counted under the feature cefepime). Medication records in the raw dataset carried a variety of medication administration record (MAR) actions, including "given", "missed", and "hold", among others. To ensure that this analysis included only medications that were actually administered to the patient (not just ordered), only entries whose MAR action label was "Given", "New Bag", "Restarted", or "Rate Change" were used. Entries containing free text for ICU personnel communication were discarded, and duplicate and incomplete entries were filtered out. After cleaning, the data were transformed into a binary (Boolean) matrix in which the 440 unique medications were the rows and the 991 patients were the columns. For each patient, a value of 1 indicated that the patient received a particular drug and a value of 0 indicated that the patient did not. For patient outcomes, categorical labels were recoded as numeric values. Unknown or missing entries were replaced with "negative" or "no." The entire mapping of original labels to new labels is provided in Appendix Table 1.
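The filtering and binarization steps described above can be sketched as follows. This is a minimal illustration and not the study's actual pipeline: the record layout (patient identifier, generic drug name, MAR action), the function name build_binary_matrix, and the example records are all assumptions made for demonstration. Only the four retained action labels come from the text.

```python
from collections import defaultdict

# MAR action labels treated as "actually administered" (taken from the text).
ADMINISTERED = {"Given", "New Bag", "Restarted", "Rate Change"}

def build_binary_matrix(records):
    """Return (medications, patients, matrix) where matrix[i][j] is 1 if
    patient j received medication i at least once, and 0 otherwise.

    `records` is an iterable of (patient_id, generic_drug_name, mar_action)
    tuples; this layout is a hypothetical simplification of the raw data.
    """
    received = defaultdict(set)  # generic drug name -> set of patient ids
    patients = set()
    for patient_id, drug, action in records:
        # Drop incomplete entries and actions that do not imply administration.
        if not patient_id or not drug or action not in ADMINISTERED:
            continue
        # Using a set per drug makes duplicate entries harmless.
        received[drug.strip().lower()].add(patient_id)
        patients.add(patient_id)
    meds = sorted(received)
    pats = sorted(patients)
    matrix = [[1 if p in received[m] else 0 for p in pats] for m in meds]
    return meds, pats, matrix

# Made-up example records, for illustration only.
records = [
    ("p1", "Cefepime", "Given"),
    ("p1", "Cefepime", "Given"),       # duplicate entry
    ("p1", "Propofol", "Rate Change"),
    ("p2", "Cefepime", "Missed"),      # not administered
    ("p2", "Fentanyl", "New Bag"),
    ("p3", "", "Given"),               # incomplete entry
]
meds, pats, matrix = build_binary_matrix(records)
```

Note that in this sketch a patient with no retained entries (p3 here) does not appear as a column; whether such patients were kept in the study's matrix is not stated in the text.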
Unsupervised learning approach
Medication clustering: After performing principal component analysis (PCA) on the large, binary medication dataset, the Restricted Boltzmann Machine was utilized to further enrich the latent feature space, which we used as input to the hierarchical clustering algorithm to support the novel discovery of unique pharmacotherapy profiles. 24
Principal Component Analysis. During PCA, each of the 440 unique medications was treated as an independent variable. PCA is a widely used dimensionality reduction technique that reduces a dataset with p random variables to q variables, where q is the desired number of variables. 25 The number of principal components was chosen by plotting the explained variance against the number of principal components (see Appendix Figure 1). A total of 150 principal components was selected to retain a sufficient share of the variance (approximately 75%) while substantially reducing the dimensionality.
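As an illustration of how a component count can be read off a cumulative explained-variance curve, the sketch below computes the smallest number of components that reaches a target share of variance. It is a sketch under stated assumptions: the input is a synthetic random binary matrix (not study data), rows are taken to be samples and columns to be variables, and the helper name n_components_for_variance is hypothetical. The study selected its component count by inspecting a plot, which may not coincide with a strict threshold rule.

```python
import numpy as np

def n_components_for_variance(X, target=0.75):
    """Return (q, cumulative), where q is the smallest number of principal
    components whose cumulative explained variance ratio reaches `target`.
    X has shape (n_samples, n_variables)."""
    Xc = X - X.mean(axis=0)                   # center each variable
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    ratio = (s ** 2) / np.sum(s ** 2)         # explained variance ratio
    cumulative = np.cumsum(ratio)
    # First index where the cumulative ratio is >= target, as a 1-based count.
    q = int(np.searchsorted(cumulative, target) + 1)
    return q, cumulative

# Synthetic binary data standing in for a medication matrix (assumption).
rng = np.random.default_rng(0)
X = (rng.random((200, 40)) < 0.2).astype(float)
q, cumulative = n_components_for_variance(X, target=0.75)
print(q, round(float(cumulative[q - 1]), 3))
```

Plotting `cumulative` against the component index gives the kind of curve described in the text.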
Restricted Boltzmann Machine (RBM). An RBM was used to learn unsupervised feature abstractions, or ‘latent factors’, of the PCA-reduced data. 26 An RBM is a simple, two-layered neural network with one visible layer and one hidden layer. It is typically used for collaborative filtering because it can learn internal representations of the input variables with unsupervised methods, which allows complex relationships to be discovered in the process. For medication clustering purposes, the RBM learned the relationships among medications based on the co-occurrence of medications for each patient. From each patient’s binary assignment of medications, the RBM learned the hidden units and ultimately determined which visible nodes were activated or inactivated for each hidden unit. Each medication is an independent node in the visible layer, and an activated connection to a hidden unit indicates cluster assignment (see Figure 1). For example, if the connection between acetaminophen (in the visible layer) and Cluster 1 (in the hidden layer) was activated, acetaminophen would be assigned to Cluster 1. After medications were assigned to clusters from the learned hidden units, medications that were unassigned (never activated in any of the five hidden units) were grouped as Cluster 6. Table 1 lists the medications assigned to Clusters 1-5, and Table 2 lists the unassigned medications in Cluster 6.
Patient clustering: After principal component analysis and medication clustering were performed on the large, binary medication dataset, agglomerative clustering was used to cluster the patients.
Normalized medication cluster distribution. For each patient, the frequency of each medication cluster was counted (see Figure 1). To obtain a normalized medication cluster distribution for each patient, the frequency table was normalized by the total number of medications taken by each patient. This normalized medication cluster distribution was used as a derived feature for patient clustering.
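The normalization described above amounts to dividing each patient's per-cluster medication counts by that patient's total number of medications. The sketch below illustrates this with made-up cluster assignments. The mapping med_to_cluster, the function name cluster_distribution, and the assumption that each medication belongs to exactly one of six clusters are illustrative and are not taken from the study's tables.

```python
from collections import Counter

def cluster_distribution(patient_meds, med_to_cluster, n_clusters=6):
    """Fraction of a patient's medications that fall in each medication
    cluster, returned as a list indexed by cluster 1..n_clusters."""
    counts = Counter(med_to_cluster[m] for m in patient_meds)
    total = len(patient_meds)
    if total == 0:
        # No medications recorded: return an all-zero distribution.
        return [0.0] * n_clusters
    return [counts.get(c, 0) / total for c in range(1, n_clusters + 1)]

# Hypothetical assignments, for illustration only.
med_to_cluster = {"acetaminophen": 1, "cefepime": 2, "propofol": 2, "fentanyl": 6}
dist = cluster_distribution(
    ["acetaminophen", "cefepime", "propofol", "fentanyl"], med_to_cluster
)
print(dist)  # [0.25, 0.5, 0.0, 0.0, 0.0, 0.25]
```

Each patient's vector sums to 1, so patients receiving very different numbers of medications become comparable in the subsequent clustering step.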
Hierarchical agglomerative clustering. The normalized medication cluster distribution was used to cluster patients with hierarchical agglomerative clustering, which builds a tree by successively merging the most similar clusters. 27 The scikit-learn (version 1.0.2) Python library was used to obtain a total of five cluster labels. The number of clusters, n = 5, was selected by visual inspection of the dendrogram, which illustrates the hierarchical relationship between the entries (see Figure 1). Table 3 describes relevant demographic and outcomes information for each cluster.
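A minimal sketch of this step with scikit-learn is shown below. The input is synthetic (random rows that sum to 1, standing in for normalized medication cluster distributions), and the Ward linkage is an assumption: it is the library default, but the linkage criterion and distance metric used in the study are not stated in the text.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in: 60 "patients" x 6 medication clusters, each row sums to 1.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(6), size=60)

# n_clusters=5 follows the text; linkage="ward" is an assumed setting.
model = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = model.fit_predict(X)

# Number of patients assigned to each of the five cluster labels (0..4).
print(np.bincount(labels))
```

A dendrogram for choosing the number of clusters can be drawn separately, for example with scipy.cluster.hierarchy.linkage and scipy.cluster.hierarchy.dendrogram applied to the same matrix.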
Validation of clusters
Upon selection of the optimal number of clusters, the validity of these clusters as clinically meaningful subgroups was assessed. This surrogate validation was conducted by comparing patient outcomes with medication data to see if clinically relevant characteristics were distinguishable.
Wilcoxon rank sum and signed rank tests were performed for continuous characteristics, and Fisher's exact tests were performed for categorical characteristics. Holm’s adjustment of p-values was applied to the comparisons within each outcome to control the familywise error rate. Significance was assessed at p < 0.05. A notable finding was that two groups of clusters (Patient Clusters 1 and 5, and Patient Clusters 2 and 4) appeared to have a similar length of stay while their mortality rates were significantly different. Permutation multivariate analysis of variance (MANOVA) was also used to test whether the clusters differed significantly when all clinical outcomes were considered simultaneously. 28
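For readers unfamiliar with Holm's adjustment, the sketch below shows the standard step-down procedure applied to one family of p-values. The input p-values are made up for illustration and the function name holm_adjust is hypothetical. In practice an established routine (for example, p.adjust in R or multipletests in statsmodels) would normally be used; the software used in the study is not stated.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted values must be non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Made-up raw p-values from four comparisons within one outcome.
adj = holm_adjust([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.05 for p in adj]
```

With these inputs the adjusted values are approximately 0.03, 0.06, 0.06, and 0.02, so only the first and last comparisons remain significant at the 0.05 level.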