Optimizing Word Embeddings for Patient Portal Message Datasets with a Small Number of Samples

Background: Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small.
Objective: We introduce a novel adaptation of the word2vec model, PK-word2vec, for small-scale messages.
Methods: PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec to patient portal messages in the Vanderbilt University Medical Center electronic health record system sent by patients diagnosed with breast cancer from December 2004 to November 2017. We evaluated the model through a set of 1,000 tasks, each of which compared the relevance of a given word to a group of the five most similar words generated by PK-word2vec and a group of the five most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks.
Results: The dataset was composed of 1,389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7,981 non-medical and 1,116 medical words. In over 90% of the tasks, both groups of reviewers indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference in the evaluation by AMT workers versus medical students was negligible for all comparisons of task choices between the two groups of reviewers (p = 0.774 under a paired t-test).
Conclusions: PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.


Introduction
Patient portals have become standard components of modern electronic health record (EHR) systems [1]. They enable patients to access health information, manage clinical appointments, and communicate securely with care providers in an asynchronous manner [2,3]. The messaging functionality of patient portals has gained in popularity over time [4,5] due to various benefits, such as the ability to increase patient involvement in medical decision-making [6]. At the same time, the information conveyed in patient portal messages can serve as a foundation for clinical research, such as studies of initiation and discontinuation of cancer therapies [7,8], assessments of readmission risk [9,10], and taxonomies of patient engagement [11].
However, maximizing the utility of patient portal messages requires a means to extract meaningful signals from unstructured text [12]. Rule-based methods, which rely on manually predefined rules to perform a task, become ineffective as the volume of data or complexity of the task increases [12]. While specific computational methods, such as topic modeling, can be applied in this regard, these approaches typically rely on word frequency to determine the most representative words for each topic. By contrast, word embedding models, such as word2vec [13], have been widely adopted by natural language processing (NLP) applications. These models can learn vector representations, typically from millions of documents, to capture each word's semantic and linguistic meaning. However, it is challenging to procure a cohort that is sufficiently large to train a specific word embedding model. As an illustration, in this paper, we focus on breast cancer as it serves as an exemplar of a complex care coordination setting between various specialists who often communicate with patients via the patient portal [14].
Current approaches for generating word embeddings on relatively small corpora may learn semantic meaning inaccurately because words do not occur often enough. A natural solution to this problem is to fine-tune a word embedding model that is pre-trained on a larger data corpus. This is because the fitted parameters of a pre-trained model can serve as a better starting point for training than random initialization. Yet this strategy presents certain limitations as well. In particular, pre-trained models cannot handle words that are absent from their vocabulary.
Moreover, inconsistencies arise due to the incompatible linguistic contexts between the data used to fit the pre-trained model and the data for a specific domain of interest [15].
In this study, we introduce a novel approach, PK-word2vec, to train word2vec models on a patient portal message corpus of relatively small sample size. Our model leverages prior information, as defined by the most similar words sampled from pre-trained embedding models, to regularize the model training. We demonstrate the utility of this approach using a dataset of approximately 137,000 patient portal messages from 1,389 patients diagnosed with breast cancer at a large academic medical center [14,16]. Our experiments show that the regularization process of PK-word2vec can guide and constrain the model training to learn word semantic meanings more effectively than standard word2vec models.

Methods
Figure 1 depicts the workflow for this study: 1) data extraction and cleaning; 2) prior knowledge extraction from two pre-trained models, Snomed2vec [16] and the Google pre-trained word2vec (Google-word2vec) [13], for medical and non-medical words, respectively; 3) model training regularized by prior knowledge (PK-word2vec); and 4) similarity analysis and human evaluation of the model performance.
This study was deemed exempt from human subjects research by the Vanderbilt University Institutional Review Board.

Data Collection and Preprocessing
We collected the patient portal messages from the EHR system of Vanderbilt University Medical Center (VUMC), a large, private, non-profit, academic medical center in Nashville, Tennessee. VUMC launched its patient portal, My Health at Vanderbilt (MHAV), in 2004 and migrated it to Epic's MyChart platform in 2017. MHAV supports secure messaging between healthcare providers and patients, appointment scheduling, billing management, and access to laboratory results or other EHR data [5]. In this study, we focused on the patient portal messages for patients with breast cancer who were prescribed the most common hormonal therapy medications: anastrozole, exemestane, letrozole, raloxifene, and/or tamoxifen. We masked URLs, email addresses, phone numbers, timestamps, and numerals using the ekphrasis package (v0.5.1) [17]. Afterwards, all personal names and usernames were manually masked, and misspelled words were corrected with the guidance of two domain experts (STR, JLW) by referring to the contexts of randomly selected messages that contained these misspelled words. We applied the CLAMP software package (v1.6.4) to detect medical words pertaining to diseases, laboratory tests, and medications in the messages [18].
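As a rough illustration of the masking step, the sketch below replaces URLs, email addresses, phone numbers, clock times, and numerals with placeholder tokens using only Python's standard `re` module. It is a simplified stand-in for the ekphrasis pipeline, not a reproduction of it: the regular expressions, the placeholder tokens, and the function name `mask_message` are assumptions made for this example.

```python
import re

# Hypothetical patterns and placeholder tokens (assumptions for illustration;
# ekphrasis ships its own, more thorough, patterns and tags).
# Order matters: URLs, emails, phones, and times are masked before bare numerals.
MASK_PATTERNS = [
    ("<url>", re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)),
    ("<email>", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("<phone>", re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("<time>", re.compile(r"\b\d{1,2}:\d{2}(?:\s?[ap]m)?\b", re.IGNORECASE)),
    ("<number>", re.compile(r"\b\d+(?:\.\d+)?\b")),
]


def mask_message(text: str) -> str:
    """Replace sensitive or uninformative spans with placeholder tokens."""
    for token, pattern in MASK_PATTERNS:
        text = pattern.sub(token, text)
    return text


if __name__ == "__main__":
    print(mask_message("Call 615-555-0100 by 3:30 pm about 2 refills."))
```

Personal names and misspellings are not handled here; in the study those were masked and corrected manually.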

Skip-Gram Model with Negative Sampling
PK-word2vec is built on the skip-gram model with negative sampling (SGNS) [13], a standard word2vec variant. In SGNS, each word has two vector representations: 1) a center vector when it is processed as a center word, and 2) a context vector when it serves as a neighbor of a center word. Figure 2 illustrates the center word and neighbor word within a context window size of 1 in the sentence "Please reschedule my appointment to Friday". In this example, reschedule is one of the neighbors of the center word appointment. SGNS structures the model training as a binary classification problem, where the objective is to predict whether a certain pair of words are neighbors. Formally, given a sequence of words $w_1, w_2, \ldots, w_T$ and a context window of size $c$, the SGNS objective function is defined as:
$$L_{SGNS} = -\sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \left[ \log \sigma\!\left(u_{w_{t+j}}^{\top} v_{w_t}\right) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \log \sigma\!\left(-u_{w_k}^{\top} v_{w_t}\right) \right] \tag{1}$$
where $u_w$ and $v_w$ represent the context and center vector of a word $w$, which are the parameters optimized in the loss function; $w_k$ represents a random word that does not appear in the neighboring context of $w_t$; $w_k$ is sampled from $P_n(w)$, a unigram distribution defined by the frequency of occurrence of each word [19]; $K$ is a hyperparameter that specifies the number of negative words sampled for $w_t$; and $\sigma$ is the sigmoid function. The first term in Eq. (1) aims to maximize the probability of occurrence for words that are in the context window. By contrast, the second term in Eq. (1) aims to minimize the probability of occurrence for words that are not in the context window.
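To make the SGNS objective more concrete, the sketch below computes the loss contribution of one center-neighbor pair together with a few negative samples, written as a negative log-likelihood. It is a minimal illustration in plain Python; the function names and the list-based vectors are assumptions for this example, and a real implementation would draw negatives from the unigram distribution and update the vectors by gradient descent.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def sgns_pair_loss(center_vec, context_vec, negative_context_vecs) -> float:
    """Negative log-likelihood for one (center, neighbor) pair.

    center_vec: center vector of the center word.
    context_vec: context vector of a word inside the window.
    negative_context_vecs: context vectors of sampled non-neighbor words.
    """
    # Positive term: push the true neighbor toward the center word.
    loss = -math.log(sigmoid(dot(context_vec, center_vec)))
    # Negative terms: push each sampled non-neighbor away from the center word.
    for neg_vec in negative_context_vecs:
        loss -= math.log(sigmoid(-dot(neg_vec, center_vec)))
    return loss


if __name__ == "__main__":
    aligned = sgns_pair_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
    opposed = sgns_pair_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
    print(aligned < opposed)
```

The loss is lower when the neighbor's context vector aligns with the center vector and the negative sample's context vector points away from it.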

Prior Knowledge Word2vec Model
The challenge of training a word2vec model on a small dataset arises from the insufficient context that such datasets provide. To address this issue, we define the prior knowledge of a word as its most similar words identified from other pre-trained models. Subsequently, the prior knowledge is applied to regularize the PK-word2vec model training. More formally, if a word $w$ in the patient portal messages also appears in a pre-trained embedding vocabulary, the prior knowledge of $w$ is defined by another word $w_p$ that is sampled with probability $P(w_p \mid w)$ from the patient portal message vocabulary. In this respect, the more similar $w_p$ is to $w$ in the pre-trained embedding model, the greater the likelihood that $w_p$ will be sampled.
We introduce a prior knowledge loss function $L_{PK}$ for the entire dataset, which is defined as follows:
$$L_{PK} = \sum_{t=1}^{T} \delta_{\tau}(w_t)\, R(w_t) \tag{2}$$
where $\delta_{\tau}(w_t)$ discounts the words proportionally to their frequency, $\tau$ is a hyperparameter, and $R(w_t)$ is a regularization factor to be defined in Eq. (4) below. Note that if a word does not exist in the pre-trained model, then it has no prior knowledge and its corresponding term in $L_{PK}$ is set to zero.
In this work, we calculate word similarity as the cosine similarity of two vectors $x$ and $y$, $\cos(x, y) = \frac{x^{\top} y}{\lVert x \rVert \lVert y \rVert}$. We require that each sampled $w_p$ have a similarity score with $w$ above a pre-defined threshold $\theta$. As such, $P(w_p \mid w)$ is defined as:
$$P(w_p \mid w) = \frac{\mathbb{1}\left[s(w, w_p) > \theta\right]\, s(w, w_p)}{\sum_{w' \in V} \mathbb{1}\left[s(w, w') > \theta\right]\, s(w, w')} \tag{3}$$
where $s(w, w')$ is the cosine similarity between $w$ and $w'$ in the pre-trained model and $V$ is the patient portal message vocabulary. In our experiments, for each pre-trained model, we set $\theta$ to the average of the top 10 similarity scores for words with prior knowledge in patient portal messages. In practice, each time $w$ is processed in model training using gradient descent optimization, a word $w_p$ is sampled and its cosine similarity with $w$ in both context and center vectors is calculated as its prior knowledge. Then, the corresponding item for $w$ that needs to be minimized in the PK-word2vec loss function in Eq. (2) is:
$$R(w) = -\left[\cos\!\left(v_w, v_{w_p}\right) + \cos\!\left(u_w, u_{w_p}\right)\right] \tag{4}$$
where $v$ and $u$ denote center and context vectors, respectively. There are several reasons for incorporating context vector similarity into Eq. (4). First, a key characteristic of high-quality word embeddings is that the distance between the context vectors for similar words is relatively small [20]. Second, since the context vectors are applied to estimate the center vectors of neighboring words, the regularization of the context vectors can propagate to the center vectors of neighboring words without prior knowledge.
Finally, the overall objective function for PK-word2vec can be defined as:
$$L = L_{SGNS} + \alpha\, L_{PK} \tag{5}$$
where $\alpha$ is a hyperparameter that determines the degree to which the training of PK-word2vec relies on prior knowledge. A higher $\alpha$ indicates a stronger dependency on prior knowledge. To evaluate the impact of $\alpha$, we compared our approach to a baseline model where $\alpha = 0$, a condition corresponding to SGNS.
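The sampling and regularization steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: a prior-knowledge word is drawn with probability proportional to its pre-trained cosine similarity among candidates above the threshold, and the per-word regularization term is taken to be the negated sum of the center-vector and context-vector cosine similarities. The function names, the dictionary-based vector stores, and both functional forms are hypothetical choices made for this example.

```python
import math
import random


def cosine(a, b) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def sample_prior_word(word, candidates, pretrained, theta, rng=random):
    """Draw a prior-knowledge word for `word`, or None if it has none.

    Assumption: sampling weight equals the pre-trained cosine similarity,
    restricted to candidates whose similarity exceeds `theta`.
    """
    if word not in pretrained:
        return None  # no prior knowledge for out-of-vocabulary words
    cutoff = max(theta, 0.0)  # sampling weights must be positive
    kept = []
    for cand in candidates:
        if cand == word or cand not in pretrained:
            continue
        sim = cosine(pretrained[word], pretrained[cand])
        if sim > cutoff:
            kept.append((cand, sim))
    if not kept:
        return None
    words, weights = zip(*kept)
    return rng.choices(words, weights=weights, k=1)[0]


def prior_knowledge_term(word, prior_word, center, context) -> float:
    """Regularization term to minimize; zero when there is no prior word."""
    if prior_word is None:
        return 0.0
    return -(cosine(center[word], center[prior_word])
             + cosine(context[word], context[prior_word]))


if __name__ == "__main__":
    pretrained = {"chemo": [1.0, 0.0], "chemotherapy": [0.9, 0.1], "friday": [0.0, 1.0]}
    print(sample_prior_word("chemo", ["chemotherapy", "friday"], pretrained, theta=0.5))
```

In training, this term would be scaled by the frequency discount and the weight α before being added to the SGNS loss.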
We generated prior knowledge for non-medical and medical words using two different models. Specifically, we applied Google-word2vec, which was trained on the Google News dataset (about 100 billion words), to build prior knowledge for non-medical words. We turned to a Snomed2vec model, which was trained on the SNOMED-CT knowledge graph [16], to obtain prior knowledge for medical words.

Hyperparameter Selection
Hyperparameter tuning for word2vec models is challenging because it is difficult to define similarity.
Since Google-word2vec was trained on a vast corpus, we selected two hyperparameters, the dimension of the word embedding $d$ and the prior knowledge weight $\alpha$, such that PK-word2vec is similar to Google-word2vec in terms of the distribution of the similarity scores between a given word and its 10 most similar words. For simplicity, we refer to this distribution as the similarity distribution. We did not use Snomed2vec as a reference because it is inherently a node2vec model [21] trained on a semantic network in the Unified Medical Language System [22] rather than on text data.
We set the embedding size $d$ to the value that corresponds to the smallest Wasserstein distance [23] between the similarity distributions of the two models. To select $\alpha$, we first categorized words into high-, mid-, and low-frequency ranges based on their tertiles. Then, we set $\alpha$ to the value that corresponds to the smallest Wasserstein distance in the similarity distribution across the three distinct word frequency ranges. This enabled an unbiased comparison, since high-frequency words tended to receive better representations. When training PK-word2vec, the similarity threshold for sampling prior knowledge was set to the average word similarity between all the words and their 10 most similar words in the pre-trained models. We set the threshold in Eq. (2) to 0.00001 and the batch size to 10,000, and trained the model for 10 epochs.
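The selection rule above can be illustrated with a small sketch that computes the one-dimensional Wasserstein distance between two empirical similarity distributions and picks the candidate setting closest to a reference. It uses only the standard library; the function names and the toy scores are assumptions for this example, and in practice a library routine such as SciPy's `wasserstein_distance` could be used instead.

```python
from bisect import bisect_right


def wasserstein_1d(a, b) -> float:
    """1-D Wasserstein distance: the area between two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    dist = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        dist += abs(cdf_a - cdf_b) * (right - left)
    return dist


def select_by_min_distance(candidate_scores, reference_scores):
    """Return the candidate key whose scores are closest to the reference.

    candidate_scores: dict mapping a hyperparameter value (e.g. an embedding
    size) to the list of top-10 similarity scores produced under that value.
    """
    return min(
        candidate_scores,
        key=lambda k: wasserstein_1d(candidate_scores[k], reference_scores),
    )


if __name__ == "__main__":
    reference = [0.6, 0.7, 0.8]  # toy stand-in for Google-word2vec scores
    candidates = {
        10: [0.90, 0.95, 0.99],
        45: [0.60, 0.70, 0.85],
        300: [0.20, 0.30, 0.40],
    }
    print(select_by_min_distance(candidates, reference))
```

The same helper could be applied separately within each word-frequency tertile when comparing values of α.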

Human Evaluation
To compare PK-word2vec with SGNS, we randomly sampled 1,000 words from the patient portal message vocabulary, with the same proportion of medical to non-medical words. For each term of interest, we created two groups of words: the five most similar words generated by PK-word2vec and the five most similar words generated by SGNS. To evaluate the system, we asked each reviewer to indicate which word group was more related to the word of interest. We randomized the order of the two groups in each task to avoid framing biases.
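The task construction described above can be sketched as follows: for a word of interest, take the five nearest words under each model by cosine similarity, then shuffle the order in which the two groups are shown. The helper names, the dictionary layout of the embeddings, and the toy vectors are assumptions for this example.

```python
import math
import random


def cosine(a, b) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def most_similar(word, vectors, k=5):
    """Return the k words whose vectors are closest to `word` by cosine."""
    scored = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


def build_task(word, pk_vectors, sgns_vectors, k=5, rng=random):
    """Create one comparison task with the two groups in random order."""
    groups = [
        ("PK-word2vec", most_similar(word, pk_vectors, k)),
        ("SGNS", most_similar(word, sgns_vectors, k)),
    ]
    rng.shuffle(groups)  # hide which model produced which group
    return {"word": word, "groups": groups}


if __name__ == "__main__":
    toy = {"pain": [1.0, 0.0], "ache": [0.9, 0.1], "sore": [0.8, 0.3], "friday": [0.0, 1.0]}
    print(build_task("pain", toy, toy, k=2))
```

The model labels would be kept out of what reviewers see and used only when tallying votes.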
We recruited two groups of reviewers for the model evaluation. The first group consisted of 330 Amazon Mechanical Turk (MTurk) workers [24], recognized as "masters" due to their consistent submission of high-quality results in their annotation history. Each MTurk worker completed 100 of the 1,000 tasks, and each task was answered by 33 different MTurk workers. The second group consisted of 7 medical students recruited through the Vanderbilt Institute for Clinical and Translational Research. Each medical student completed all 1,000 tasks. We set a 30-second time limit for each task; tasks not completed within this time limit were reassigned to a new worker.
To analyze the human evaluation data, we first calculated the support rate, defined as the proportion of the 1,000 tasks for which MTurk workers or medical students preferred PK-word2vec by a majority vote.
Next, we employed sample skewness [25] to examine the distribution of the proportion of reviewers who favored PK-word2vec in each task. This provided insights into the extent to which PK-word2vec was favored over SGNS. A larger value for left sample skew indicates a stronger preference for PK-word2vec over SGNS. Finally, to account for the heterogeneity, we fitted a mixed-effects logistic regression model to assess which word embedding model was preferred by MTurk workers. The type of model (PK-word2vec or SGNS) and the category of the term of interest (medical or non-medical words) were modeled as fixed effects. We assigned a value of 1 to the PK-word2vec model and medical words and a value of 0 to SGNS and non-medical words in the corresponding variables.
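The first two summary statistics can be illustrated with a short sketch. The support rate is computed by majority vote per task, and skewness is computed with the moment-based estimator (third central moment divided by the 1.5 power of the second). Which skewness estimator the study used is not stated, so that choice, along with the function names and the toy votes, is an assumption for this example.

```python
def support_rate(votes_per_task) -> float:
    """Share of tasks in which a majority preferred PK-word2vec.

    votes_per_task: list of lists of 0/1 votes, where 1 means the reviewer
    chose the PK-word2vec group.
    """
    wins = sum(1 for votes in votes_per_task if sum(votes) > len(votes) / 2)
    return wins / len(votes_per_task)


def sample_skewness(values) -> float:
    """Moment-based sample skewness; negative values indicate left skew."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


if __name__ == "__main__":
    votes = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [1, 0, 1]]
    print(support_rate(votes))
    # Per-task proportion of reviewers favoring PK-word2vec
    proportions = [sum(v) / len(v) for v in votes]
    print(sample_skewness(proportions))
```

With 33 MTurk workers or 7 medical students per task, the number of voters is odd, so a majority always exists.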

Results

Data Summary
The dataset was composed of 137,554 messages sent by 1,389 patients with breast cancer. After preprocessing, there was a total of 10,683 unique words in the vocabulary, of which 8,878 (80%) and 1,895 (20%) were non-medical and medical words, respectively. Prior knowledge was available for 7,981 (90%) non-medical and 1,116 (59%) medical words.

Hyperparameter Selection
Embedding size ($d$). The word embedding size was selected from a range of 10 to 300. Figure 3 depicts the Wasserstein distance between SGNS and Google-word2vec as a function of the word embedding size. In the following experiments, we present results with an embedding size of 45 for SGNS and PK-word2vec because it corresponds to the smallest distance among all the candidate vector sizes.
Prior knowledge weight ($\alpha$). Figure 4a shows the similarity distribution for PK-word2vec as a function of $\alpha$. When $\alpha = 0$, this model corresponds to SGNS and most of the density concentrates around 0.6. As $\alpha$ increases, the prior knowledge becomes dominant, and the density gradually shifts in two directions, forming two peaks. Figure 4b shows the Wasserstein distance in the similarity distribution between Google-word2vec (three dashed lines at the bottom of the figure) and PK-word2vec (three solid lines at the top of the figure) in different word frequency ranges. The distributions for Google-word2vec were nearly flat and maintained relatively small values. Based on this observation, for the following experiments we selected the value of $\alpha$ that corresponds to the smallest Wasserstein distances as well as the smallest variance across the three pairs of comparisons.

Word Similarity
Context vector regularization. Figure 5 depicts the similarity distribution for the models with (panels a-d) and without (panels e-g) context vector regularization. The blue and orange lines in each graph correspond to words with and without prior knowledge, respectively. The comparisons between the two similarity distributions are illustrated in settings with different pre-selected $\alpha$ values. As extreme examples, Figs. 5d and 5g show that, when $\alpha$ is large, context vector regularization mitigates the skew of the similarity distribution for the words with prior knowledge. While a large $\alpha$ value shifts the mode of the similarity distribution for words without prior knowledge in the model without context vector regularization to the right, the variance of the similarity distribution is not significantly altered. Figures 5c and 5f depict the scenario where $\alpha$ is set to its optimal value, where the difference between the two distributions is in an intermediate state compared with other $\alpha$ values. These results show that context vector regularization mitigates the impact of prior knowledge on the similarity distribution between words with and without prior knowledge.

Human Evaluation
Figure 6 displays the human evaluation results based on a majority vote from the MTurk workers and medical students. In addition to an overall comparison, the results were stratified by the word type and the existence of prior knowledge. The difference was negligible for all comparisons between the two groups of reviewers (p = 0.774 under a paired t-test). In over 90% of the tasks, both groups of reviewers indicated that PK-word2vec generates more similar words than SGNS. For medical words, the support rate for PK-word2vec was 78.8% for MTurk workers and 71.8% for medical students. For non-medical words, the support rate for PK-word2vec was 93.6% for MTurk workers and 93.4% for medical students. For words lacking prior knowledge, the support rate for PK-word2vec was 66.7% for MTurk workers and 61.4% for medical students.
Figure 7 shows the distribution of the proportion of reviewers that favor PK-word2vec in each task. The three curves represent the proportion distribution for the low-, mid-, and high-frequency words, respectively. All distributions are strongly left-skewed and most of the density falls between 0.8 and 1.0, indicating that reviewers consistently favored PK-word2vec. Notably, in the responses from MTurk workers, the sample skew was 1.19, 1.65, and 1.92 for the high-, mid-, and low-frequency words, respectively. In the responses from the medical students, the sample skew was 0.88, 1.46, and 1.84 for the high-, mid-, and low-frequency words, respectively. In other words, the degree to which the distribution was left-skewed was inversely related to the word frequency.

Discussion
... reviewers were similar regarding their preferences for non-medical and medical words. This implies that, for simple medical surveys, engaging a large pool of participants can obtain accurate results, mitigating the need for, and potentially higher costs associated with, specialized professionals.
In the context of customizing word representation models, our PK-word2vec method offers a distinct approach compared to previous studies. Several earlier studies have adapted SGNS to enhance word representation, primarily focusing on aligning the vector space of a general, pre-trained embedding model to a specific problem domain [28,29]. Other refined strategies include improving the embedding quality by retrofitting the pre-trained word vectors using lexical relational resources [30], and augmenting SGNS with additional knowledge-based graph models and anchor context models [31]. Unlike these approaches, our PK-word2vec method does not require a large number of training samples to fine-tune the word embedding model. This significant advantage could substantially facilitate research within the field of biomedical informatics, particularly when working with small study cohorts.
Despite the encouraging findings, this study has several limitations that can serve as a basis for future research. First, PK-word2vec was examined solely on patients with breast cancer. It has yet to be determined whether the method generalizes to clinical communications or notes for other types of patients. Second, we generated prior knowledge from two pre-trained models only. It is unknown whether incorporating additional pre-trained models can enhance the generalizability of PK-word2vec. Third, complex medical concepts are often represented by long phrases rather than words, which raises the question of how to include more complex concepts in human evaluations. Finally, the model should be tested in downstream NLP tasks by jointly tuning the prior knowledge weight and other hyperparameters.

Conclusions
This paper introduces PK-word2vec, an adapted word embedding model trained on a relatively small number of patient portal messages. This was achieved by incorporating prior knowledge from two pre-trained embedding models into the model training. Our evaluation with MTurk workers and medical students demonstrated that PK-word2vec outperformed the standard word2vec model in generating more similar words from a small corpus of patient portal messages. While this study focused on extracting prior knowledge from Google-word2vec and Snomed2vec for patient portal messages, the proposed PK-word2vec can leverage prior knowledge from any pre-trained models, irrespective of the vector size or representation space, to build vector representations for other clinical text data.

Declarations
Author contributions: Q.S. and Z.Y. wrote the main manuscript text and Q.S. prepared the figures. All authors reviewed the manuscript.

Figure 5. The distribution of the words with the 10 highest similarity scores of words with and without context vector regularization during model training.
Figure 6. The number of tasks where one model was preferred over the other based on a majority vote by MTurk workers (upper row) and medical students (lower row).
Figure 7. Distribution of the proportion of reviewers who chose PK-word2vec over SGNS per annotation task (n = 1,000). Different colors indicate the word frequency ranges of the original word in each task. The vertical dashed line corresponds to 0.5, which indicates that there was no difference in the annotator preference.

Figures

Figure 1 Work

Figure 2 An