Figure 1 depicts the workflow for this study: 1) data extraction and cleaning; 2) prior knowledge extraction from two pre-trained models, Snomed2vec16 and the Google pre-trained word2vec (Google-word2vec)13 for medical and non-medical words, respectively; 3) model training regularized by prior knowledge (PK-word2vec); and 4) similarity analysis and human evaluation of the model performance. This study was deemed exempt from human subjects research by the Vanderbilt University Institutional Review Board.

## 2.1. Data Collection and Preprocessing

We collected the patient portal messages from the EHR system of Vanderbilt University Medical Center (VUMC), a large, private, non-profit, academic medical center in Nashville, Tennessee. VUMC launched its patient portal, *My Health at Vanderbilt* (*MHAV*), in 2004, which was migrated to Epic’s MyChart platform in 2017. MHAV supports secure messaging between healthcare providers and patients, appointment scheduling, billing management, and access to laboratory results or other EHR data.5

In this study, we focused on the patient portal messages for patients with breast cancer who were prescribed the most common hormonal therapy medications: *anastrozole*, *exemestane*, *letrozole*, *raloxifene*, and/or *tamoxifen*. We masked URLs, email addresses, phone numbers, timestamps, and numerals using the *ekphrasis* package (v0.5.1)17. Afterwards, all personal names and usernames were manually masked, and misspelled words were corrected with the guidance of two domain experts (STR, JLW), who referred to the contexts of randomly selected messages containing these misspelled words. We applied the CLAMP software package (v1.6.4) to detect medical words pertaining to diseases, laboratory tests, and medications in the messages18.
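The masking step can be sketched with plain regular expressions. This is a simplified stand-in for the *ekphrasis* normalizers actually used; the patterns and placeholder tokens below are illustrative, not the study's configuration.

```python
import re

# Illustrative masking patterns; ekphrasis covers these cases more robustly.
MASKS = [
    (re.compile(r"https?://\S+"), "<url>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<email>"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "<phone>"),
    (re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)?\b", re.IGNORECASE), "<time>"),
    (re.compile(r"\d+"), "<number>"),
]

def mask_message(text: str) -> str:
    """Replace URLs, emails, phone numbers, timestamps, and numerals with tokens."""
    for pattern, token in MASKS:
        text = pattern.sub(token, text)
    return text
```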

## 2.2. Skip-Gram Model with Negative Sampling

PK-word2vec is built on a skip-gram model with negative sampling (SGNS)13, a standard word2vec architecture. In SGNS, each word has two vector representations: 1) a center vector, used when it is processed as a center word, and 2) a context vector, used when it serves as a neighbor of a center word. Figure 2 illustrates the center word and neighbor words within a context window of size 1 for the sentence "*Please reschedule my appointment to Friday*". In this example, *reschedule* is one of the neighbors of the center word *appointment*. SGNS frames model training as a binary classification problem, where the objective is to predict whether a given pair of words are neighbors. Formally, given a sequence of words \({w}_{1}, {w}_{2},\cdots, {w}_{T}\) and a context window of size \(m\), the SGNS objective function is defined as:

$${Loss}_{SGNS}=\sum _{t=1}^{T}\sum _{-m\le j\le m,\, j\ne 0}\left[-\log \sigma \left(\mathbf{u}_{t+j}^{\top }\mathbf{v}_{t}\right)-\sum _{k=1,\, w{\prime }_{k}\sim P\left({w}_{t}\right)}^{K}\log \sigma \left(-\mathbf{u}_{k}^{\top }\mathbf{v}_{t}\right)\right]\tag{1}$$

where \({\mathbf{u}}_{t}\) and \({\mathbf{v}}_{t}\) represent the context and center vector of \({w}_{t}\), which are the parameters optimized in the loss function; \({w{\prime }}_{k}\) represents a random word that does not appear in the neighboring context of \({w}_{t}\). \({w{\prime }}_{k}\) is sampled from \(P\left({w}_{t}\right)\), a unigram distribution defined by the frequency of occurrence of each word19; \(K\) is a hyperparameter that specifies the number of negative words sampled for \({w}_{t}\); and \(\sigma\) is the sigmoid function. The first term in Eq. (1) aims to maximize the probability of occurrence for words that are in the context window. By contrast, the second term in Eq. (1) aims to minimize the probability of occurrence for words that are not in the context window.
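The contribution of a single (center, context) pair with \(K\) negative samples in Eq. (1) can be sketched in NumPy as follows. The function and variable names are illustrative, not from the study's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss(v_center, u_context, U_negatives):
    """Eq. (1) contribution of one (center, context) pair:
    -log sigma(u^T v) - sum_k log sigma(-u'_k^T v).

    v_center:    center vector of w_t
    u_context:   context vector of the neighbor w_{t+j}
    U_negatives: (K, d) matrix of context vectors for K sampled negatives
    """
    positive = -np.log(sigmoid(u_context @ v_center))
    negative = -np.sum(np.log(sigmoid(-(U_negatives @ v_center))))
    return positive + negative
```

With all-zero vectors, every dot product is 0 and \(\sigma(0)=0.5\), so the loss reduces to \((1+K)\log 2\), which is a quick sanity check for the signs in the two terms.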

## 2.3. Prior Knowledge Word2vec Model

The challenge of training a word2vec model on a small dataset arises from the insufficient context that such datasets provide. To address this issue, we define the prior knowledge of a word as its most similar words identified from other pre-trained models. Subsequently, the prior knowledge is applied to regularize the PK-word2vec model training. More formally, if a word in the patient portal messages \({w}_{t}\) also appears in a pre-trained embedding vocabulary, the prior knowledge of \({w}_{t}\) is defined by another word \({w}_{j}\) that is sampled with probability \({p}_{tj}\) from the patient portal message vocabulary. In this respect, the more similar \({w}_{j}\) is to \({w}_{t}\) in the pre-trained embedding model, the greater the likelihood that \({w}_{j}\) will be sampled.

We introduce a prior knowledge loss function for the entire dataset, which is defined as follows:

$${Loss}_{PK}=\begin{cases}\sum _{t=1}^{T}\gamma \left({w}_{t}\right)\cdot {\Psi }\left(\mathbf{u}_{t},\mathbf{u}_{j},\mathbf{v}_{t},\mathbf{v}_{j}\right) & {w}_{t}\ \text{with prior knowledge}\\ 0 & {w}_{t}\ \text{without prior knowledge}\end{cases}\tag{2}$$

where \(\gamma \left({w}_{t}\right)=\tau /\mathrm{freq}\left({w}_{t}\right)\) discounts words in proportion to their frequency, \(\tau\) is a hyperparameter, and \({\Psi }\left(\cdot \right)\) is a regularization factor defined in Eq. (4) below. Note that if a word does not exist in the pre-trained model, then it has no prior knowledge and its corresponding term in \({Loss}_{PK}\) is set to zero.

In this work, we calculate word similarity as the cosine similarity of two vectors \(\mathbf{u}\) and \(\mathbf{v}\): \(\cos (\mathbf{u},\mathbf{v})=\mathbf{u}\cdot \mathbf{v}/(\left\|\mathbf{u}\right\|\,\left\|\mathbf{v}\right\|)\). We require that each sampled \({w}_{j}\) have a similarity score with \({w}_{t}\) above a pre-defined threshold \(\theta\). As such, \({p}_{tj}\) is defined as:

$${p}_{tj}=\begin{cases}\mathrm{softmax}\left[\cos (\mathbf{u}_{t},\mathbf{u}_{j})\right] & \cos (\mathbf{u}_{t},\mathbf{u}_{j})\ge \theta \\ 0 & \cos (\mathbf{u}_{t},\mathbf{u}_{j})<\theta \end{cases}\tag{3}$$
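The thresholded softmax in Eq. (3) can be sketched as follows, given the cosine similarities between \(w_t\) and each candidate word in the pre-trained model. The function names are illustrative.

```python
import numpy as np

def prior_knowledge_probs(cos_sims, theta):
    """Eq. (3): softmax over cosine similarities that meet the threshold theta;
    candidates below the threshold get zero sampling probability."""
    cos_sims = np.asarray(cos_sims, dtype=float)
    mask = cos_sims >= theta
    if not mask.any():
        return np.zeros_like(cos_sims)
    exp = np.where(mask, np.exp(cos_sims), 0.0)
    return exp / exp.sum()

def sample_prior_word(cos_sims, theta, rng=None):
    """Sample the index of a prior-knowledge word w_j with probability p_tj."""
    if rng is None:
        rng = np.random.default_rng()
    p = prior_knowledge_probs(cos_sims, theta)
    return rng.choice(len(p), p=p) if p.sum() > 0 else None
```

Because cosine similarity is bounded in \([-1, 1]\), the exponentials cannot overflow, so no max-subtraction trick is needed in this softmax.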

In our experiments, for each pre-trained model, we set \(\theta\) to the average of the top 10 similarity scores for words with prior knowledge in patient portal messages. In practice, each time \({w}_{t}\) is processed during gradient descent optimization, a word \({w}_{j}\) is sampled, and its cosine similarity with \({w}_{t}\) in both the context and center vectors is calculated as its prior knowledge. The corresponding term for \({w}_{t}\) to be minimized in the PK-word2vec loss function in Eq. (2) is then:

$${\Psi }\left(\mathbf{u}_{t},\mathbf{u}_{j},\mathbf{v}_{t},\mathbf{v}_{j}\right)=\left[1-\cos \left(\mathbf{u}_{t},\mathbf{u}_{j}\right)\right]+\left[1-\cos \left(\mathbf{v}_{t},\mathbf{v}_{j}\right)\right]\tag{4}$$

There are several reasons for incorporating context vector similarity into Eq. (4). First, a key characteristic of high-quality word embeddings is that the distance between the context vectors of similar words is relatively small20. Second, since the context vectors are applied to estimate the center vectors of neighboring words, the regularization of the context vectors can propagate to the center vectors of neighboring words without prior knowledge. The overall objective function for PK-word2vec is then defined as:

$$Loss={Loss}_{SGNS}+\alpha \times {Loss}_{PK}\tag{5}$$

where \(\alpha\) is a hyperparameter that determines the degree to which the training of PK-word2vec relies on prior knowledge; a higher \(\alpha\) indicates a stronger dependency on prior knowledge. To evaluate the impact of \(\alpha\), we compared our approach to a baseline model with \(\alpha =0\), a condition corresponding to SGNS. We generated prior knowledge for non-medical and medical words using two different models. Specifically, we applied Google-word2vec, which was trained on the Google News dataset (approximately 100 billion words), to build prior knowledge for non-medical words, and we used the *Snomed2vec* model, which was trained on the SNOMED-CT knowledge graph16, to obtain prior knowledge for medical words.
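The regularizer in Eq. (4) and the combined objective can be sketched as follows; `psi` pulls both vector representations of \(w_t\) toward those of its prior-knowledge word \(w_j\), and setting \(\alpha=0\) recovers plain SGNS. The names are illustrative.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity of two vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def psi(u_t, u_j, v_t, v_j):
    """Eq. (4): regularizer over both the context (u) and center (v)
    vectors of w_t and its prior-knowledge word w_j."""
    return (1 - cos(u_t, u_j)) + (1 - cos(v_t, v_j))

def total_loss(loss_sgns, loss_pk, alpha):
    """Overall PK-word2vec objective; alpha = 0 reduces to SGNS."""
    return loss_sgns + alpha * loss_pk
```

When \(w_j\)'s vectors coincide with \(w_t\)'s, `psi` is zero, so the regularizer adds no penalty for words that already agree with their prior knowledge.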

## 2.4. Hyperparameter Selection

Hyperparameter tuning for word2vec models is challenging because word similarity is difficult to define. Since Google-word2vec was trained on a vast corpus, we selected the hyperparameters, namely the word embedding dimension \(d\) and the prior knowledge weight \(\alpha\), such that PK-word2vec resembles Google-word2vec in terms of the distribution of similarity scores between a given word and its 10 most similar words. For simplicity, we refer to this distribution as the similarity distribution. We did not use Snomed2vec as a reference because it is inherently a node2vec21 model trained on a semantic network in the Unified Medical Language System22 rather than on text data.

We set the embedding size \(d\) to the value that yields the smallest Wasserstein distance23 between the similarity distributions of the two models. To select \(\alpha\), we first categorized words into high-, mid-, and low-frequency ranges based on their tertiles. We then set \(\alpha\) to the value that yields the smallest Wasserstein distance in the similarity distributions across the three word frequency ranges. This enabled an unbiased comparison, since high-frequency words tend to receive better representations. When training PK-word2vec, the threshold \(\theta\) was set to the average word similarity between all words and their 10 most similar words in the pre-trained models. We set the hyperparameter \(\tau\) in Eq. (2) to 0.00001, the batch size to 10,000, and trained the model for 10 epochs.
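The selection of \(d\) can be sketched with SciPy's empirical Wasserstein distance. The function name and the dictionary-based interface are assumptions for illustration, not the study's code.

```python
from scipy.stats import wasserstein_distance

def select_embedding_size(candidate_sims, reference_sims):
    """Pick the candidate dimension d whose similarity distribution is
    closest, in Wasserstein distance, to the reference model's.

    candidate_sims: dict mapping d -> list of top-10 similarity scores
                    from a PK-word2vec model trained with that d
    reference_sims: list of top-10 similarity scores from the
                    reference model (here, Google-word2vec)
    """
    return min(
        candidate_sims,
        key=lambda d: wasserstein_distance(candidate_sims[d], reference_sims),
    )
```

The same pattern applies to selecting \(\alpha\), with the distance computed between the similarity distributions of the three word-frequency ranges rather than against a reference model.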

## 2.5. Human Evaluation

To compare PK-word2vec with SGNS, we randomly sampled 1,000 words from the patient portal message vocabulary, preserving the proportion of medical to non-medical words. For each word of interest, we created two groups of the five most similar words, one generated by PK-word2vec and one by SGNS. To evaluate the systems, we asked each reviewer to indicate which word group was more related to the word of interest. We randomized the order of the two groups in each task to avoid framing biases.

We recruited two groups of reviewers for the model evaluation. The first group consisted of 330 Amazon Mechanical Turk (MTurk) workers24 recognized as "masters" for their consistent submission of high-quality results in past annotations. Each MTurk worker completed 100 of the 1,000 tasks, and each task was answered by 33 different MTurk workers. The second group consisted of 7 medical students recruited through the Vanderbilt Institute for Clinical and Translational Research. Each medical student completed all 1,000 tasks. We set a 30-second time limit for each task; tasks not completed within this limit were reassigned to a new worker.

To analyze the human evaluation data, we first calculated the support rate, defined as the proportion of the 1,000 tasks for which MTurk workers or medical students preferred PK-word2vec by a majority vote. Next, we employed sample skewness25 to examine the distribution of the proportion of reviewers who favored PK-word2vec in each task. This provided insight into the extent to which PK-word2vec was favored over SGNS; a stronger left skew indicates a stronger preference for PK-word2vec. Finally, to account for reviewer heterogeneity, we fitted a mixed-effects logistic regression model to assess which word embedding model was preferred by MTurk workers. The type of model (PK-word2vec or SGNS) and the category of the word of interest (medical or non-medical) were fixed-effect variables. We assigned a value of 1 to PK-word2vec and medical words and a value of 0 to SGNS and non-medical words in the corresponding variables.
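The two summary statistics above can be sketched as follows, with each task represented as a pair (number of reviewers favoring PK-word2vec, total reviewers). The function names and input format are illustrative.

```python
from scipy.stats import skew

def support_rate(votes_per_task):
    """Proportion of tasks where a strict majority of reviewers
    preferred PK-word2vec.

    votes_per_task: list of (n_favoring_pk, n_reviewers) tuples
    """
    majority = [pk > n / 2 for pk, n in votes_per_task]
    return sum(majority) / len(majority)

def preference_skewness(votes_per_task):
    """Sample skewness of the per-task proportion favoring PK-word2vec;
    a more negative value (stronger left skew) indicates a stronger
    overall preference for PK-word2vec."""
    proportions = [pk / n for pk, n in votes_per_task]
    return skew(proportions)
```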