Data Source
We used EHR data from the Mass General Brigham (MGB) system, the largest healthcare system in the state of Massachusetts. The sampling frame included a total of 1.2 million individuals whose MGB EHRs were deterministically linked to Medicare and Medicaid insurance claims data for the period 2007–2020. For this study, we extracted the free text of the patients’ social history documentation from progress notes using regular expression matching. Because social documentation is added incrementally over time, we used only the most recent social documentation for each patient.
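The extraction patterns themselves are not reproduced here; the sketch below illustrates the general approach under assumed column names (patient_id, note_date, note_text) and a hypothetical section-header pattern.

```python
import re
import pandas as pd

# Hypothetical regex for a social history section header; the actual
# patterns used in the study are not reproduced here.
SOCIAL_HISTORY_PATTERN = re.compile(
    r"SOCIAL\s+HISTORY[:\s]*(?P<body>.*?)(?=\n[A-Z][A-Z /]+:|\Z)",
    flags=re.IGNORECASE | re.DOTALL,
)

def extract_social_history(note_text: str) -> str | None:
    """Return the social history section of a progress note, if present."""
    match = SOCIAL_HISTORY_PATTERN.search(note_text)
    return match.group("body").strip() if match else None

def most_recent_social_documentation(notes: pd.DataFrame) -> pd.DataFrame:
    """Keep only the most recent social documentation for each patient.

    Assumes columns: patient_id, note_date, note_text.
    """
    notes = notes.assign(social_text=notes["note_text"].map(extract_social_history))
    notes = notes.dropna(subset=["social_text"])
    notes = notes.sort_values("note_date")
    return notes.groupby("patient_id", as_index=False).last()[
        ["patient_id", "note_date", "social_text"]
    ]
```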
SDoH Questions and Manual Labels
To identify the SDoH that are frequently reported in social documentation, we first performed a manual review of a random sample of 200 patients’ social documentation and summarized 9 aspects of the patients’ SDoH that appeared in > 5% of the notes. This list included marital status, number of children, employment status, educational status, lifestyle factors (use of tobacco, alcohol, illicit drugs, and exercise), and cohabitation status. Of the 200 reviewed patients, we used the first 100 patients’ social documentation as the validation set to inform prompt engineering (described below) and the remaining 100 patients’ social documentation as the test set to evaluate performance. Following the standard LLM evaluation framework, we framed SDoH extraction as a question-answering problem: for each of the 9 SDoH characteristics, we designed a question and a set of candidate options, provided them together with the note text as the LLM input, and asked the LLM to select one of the candidate options. The SDoH questions, together with the distribution of candidate options in the validation and test sets, are shown in Supplementary Table S1.
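As a minimal sketch of this question-answering setup (the actual question wording, options, and prompt templates are given in Supplementary Tables S1 and S4; the wording below is illustrative only), the question, candidate options, and note text can be assembled into a single LLM input as follows.

```python
# Illustrative only: the real questions, options, and prompt templates are
# defined in Supplementary Tables S1 and S4.
def build_qa_prompt(note_text: str, question: str, options: list[str]) -> str:
    """Assemble the note text, SDoH question, and candidate options into one prompt."""
    option_lines = "\n".join(f"- {opt}" for opt in options)
    return (
        "Social history documentation:\n"
        f"{note_text}\n\n"
        f"Question: {question}\n"
        "Choose exactly one of the following options:\n"
        f"{option_lines}\n"
        "Answer:"
    )

prompt = build_qa_prompt(
    note_text="Patient is married, lives with spouse. Denies tobacco use.",
    question="What is the patient's marital status?",
    options=["Married", "Single", "Divorced", "Widowed", "Not mentioned"],
)
```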
For each of the 200 patients, two human reviewers (B.G. and V.S.) manually labeled the 9 SDoH aspects with one of the predefined choices according to the labeling criteria documentation (Supplementary: Annotation Guide section). Each reviewer labeled the 200 patients independently. Inter-annotator agreement was calculated as the proportion of the 1800 questions (9 SDoH questions × 200 patients) on which the two annotators agreed before discussion, which was 93%. For inconsistently annotated cases, the two annotators discussed them in detail and reached a consensus. New criteria addressing the causes of these inconsistencies were also added to the labeling criteria documentation.
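A minimal sketch of the agreement calculation, assuming the two annotators’ labels are stored as parallel lists of 1800 answers each:

```python
def inter_annotator_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Proportion of questions on which the two annotators gave the same label."""
    assert len(labels_a) == len(labels_b)  # 9 SDoH questions x 200 patients = 1800
    agreed = sum(a == b for a, b in zip(labels_a, labels_b))
    return agreed / len(labels_a)
```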
Experiment Settings
We selected 8 well-performing open-source LLMs from the LLM leaderboard hosted by Hugging Face.[32–41] All LLMs used in this study are publicly available. The details of the LLMs and the links to the models can be found in Supplementary Table S2. A copy of each model’s weights was processed with activation-aware weight quantization (AWQ) to generate quantized model weights.[42–43] Quantization reduces model size to enable faster inference in resource-limited settings.
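The study does not specify the quantization tooling; the sketch below shows one common way to produce 4-bit AWQ weights with the AutoAWQ library, using an illustrative model path and configuration that are assumptions rather than the exact settings used here.

```python
# Illustrative only: model paths and quantization settings are assumptions,
# not the exact configuration used in the study.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/original-llm"      # hypothetical original model location
quant_path = "path/to/original-llm-awq"  # where the quantized copy is saved

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize the weights, then save the quantized copy alongside the tokenizer.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```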
Rule-based Baseline Model
To evaluate the comparative performance of the LLMs against a common baseline, we designed a model that used pattern matching to extract the answers to the SDoH questions from the patients’ social documentation. The matching patterns were designed according to the labeling criteria. If a match was found in the patients’ social documentation, the output answer was guaranteed to be one of the choices of the SDoH question; if no match was found, the output answer was “Not mentioned”. To avoid mismatching (e.g. the choice “No” matching text intended for “Not mentioned”, since “No” is a substring of “Not mentioned”), we sorted the choices by character length in descending order, attempted matches from the longest choice to the shortest, and stopped as soon as a longer choice matched. The specific patterns for each SDoH question are shown in Supplementary Table S3.
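A minimal sketch of this longest-first matching logic, with hypothetical patterns (the actual patterns per SDoH question are listed in Supplementary Table S3):

```python
import re

def rule_based_answer(social_text: str, choice_patterns: dict[str, str]) -> str:
    """Return the first matching choice, trying longer choices before shorter ones.

    choice_patterns maps each candidate choice to an illustrative regex pattern;
    the actual patterns per SDoH question are given in Supplementary Table S3.
    """
    # Sort choices by character length, longest first, so that e.g. "Not mentioned"
    # is tried before "No" and substring collisions are avoided.
    for choice in sorted(choice_patterns, key=len, reverse=True):
        if re.search(choice_patterns[choice], social_text, flags=re.IGNORECASE):
            return choice
    return "Not mentioned"

# Hypothetical patterns for a tobacco-use question.
answer = rule_based_answer(
    "Tobacco: denies smoking. Alcohol: social.",
    {"Yes": r"\b(current|active)\s+smoker\b", "No": r"\bdenies\s+smok"},
)
```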
Pipeline Workflow and Prompt Engineering
We built two pipelines: a default pipeline and a refined pipeline. The default pipeline ran the LLM to extract the SDoH from the unstructured social documentation using the default prompt for all SDoH questions. The refined pipeline instead used engineered prompts for 3 of the 9 SDoH questions (Q2, Q6, and Q7) that most LLMs struggled with in the validation set experiments. We ran both pipelines to compare the effectiveness of the refinement. An illustration of the two pipelines is shown in Fig. 1. The “LLM Response Postprocessing” step used systematic pattern matching to map the model response to one of the choices of the SDoH questions. The “Auto-Grader” took the mapped response from the “LLM Response Postprocessing” step and compared it against the manual labels: when the mapped model response matched the human label exactly, the auto-grader counted the extraction as accurate; otherwise, it counted the extraction as inaccurate. The “Model Comparator” step combined the graded model responses from the “Auto-Grader” and presented the grading results in a single chart to formulate the final benchmark.
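A minimal sketch of the postprocessing and auto-grading logic, reusing the longest-choice-first matching idea described above (the mapping rules shown here are assumptions, not the exact code used in the pipeline):

```python
def map_response_to_choice(raw_response: str, choices: list[str]) -> str | None:
    """Map a free-text LLM response to one of the predefined choices.

    Longer choices are tried first so that e.g. "Not mentioned" is matched
    before "No". Returns None when the response cannot be mapped (invalid).
    """
    text = raw_response.strip().lower()
    for choice in sorted(choices, key=len, reverse=True):
        if choice.lower() in text:
            return choice
    return None

def auto_grade(mapped_response: str | None, human_label: str) -> bool:
    """An extraction counts as accurate only if the mapped response equals the label exactly."""
    return mapped_response == human_label
```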
Additional context for the “Default Prompt” and the “Engineered Prompt” is shown in Supplementary Table S4, which summarizes 4 types of prompts: default prompts (not including the default secondary prompts), premise prompts, special prompts, and secondary prompts. The default pipeline used only the default prompts, while the refined pipeline used all 4 types of prompts.
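As an illustration of how the two pipelines differ in prompt selection (the prompt texts and the exact assignment of prompt types to questions are in Supplementary Table S4; the structure below is an assumption):

```python
# Hypothetical configuration: which prompt variant each pipeline uses per question.
REFINED_QUESTIONS = {"Q2", "Q6", "Q7"}  # questions given engineered prompts

def select_prompt(question_id: str, pipeline: str) -> str:
    """Return the prompt variant to use for a given question and pipeline."""
    if pipeline == "refined" and question_id in REFINED_QUESTIONS:
        return "engineered"  # premise / special / secondary prompts (Table S4)
    return "default"
```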
LLM Performance Evaluation
To evaluate model performance, we used three metrics in the test set: ${Accuracy}_{overall}$, ${Accuracy}_{mentioned}$, and ${Accuracy}_{non\text{-}mentioned}$, which correspond to the overall accuracy of the extractions, the accuracy when a note contained a mention of the specific SDoH, and the accuracy when a note did not contain a mention of the specific SDoH, respectively. The three accuracies are defined as follows, where ${Accuracy}_{overall}$ is a weighted average of ${Accuracy}_{mentioned}$ and ${Accuracy}_{non\text{-}mentioned}$, with the weights dependent on the missingness of the SDoH aspects in the text:
$${Accuracy}_{overall} = \frac{\text{Total \# of accurate extractions}}{\text{Total \# of questions}}$$
$${Accuracy}_{mentioned} = \frac{\text{Total \# of accurate extractions where the ground truth is NOT labeled as "not mentioned"}}{\text{Total \# of questions where the ground truth is NOT labeled as "not mentioned"}}$$
$${Accuracy}_{non\text{-}mentioned} = \frac{\text{Total \# of accurate extractions where the ground truth is labeled as "not mentioned"}}{\text{Total \# of questions where the ground truth is labeled as "not mentioned"}}$$
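A minimal sketch of how the three accuracies can be computed from the graded results, assuming each graded item carries its ground-truth label and a correctness flag:

```python
def accuracy_metrics(items: list[dict]) -> dict[str, float | None]:
    """Compute overall, mentioned, and non-mentioned accuracy.

    Each item is assumed to look like {"label": <ground truth>, "correct": <bool>}.
    """
    mentioned = [it for it in items if it["label"] != "Not mentioned"]
    non_mentioned = [it for it in items if it["label"] == "Not mentioned"]

    def acc(subset: list[dict]) -> float | None:
        # None when the premise is not met (no items in this stratum).
        return sum(it["correct"] for it in subset) / len(subset) if subset else None

    return {
        "overall": acc(items),
        "mentioned": acc(mentioned),
        "non_mentioned": acc(non_mentioned),
    }
```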
We calculated all three metrics for all 9 SDoH questions. To calculate confidence intervals for these accuracies, we used the jackknife resampling technique to generate samples for each model’s accuracy on every question.[44] We then calculated the 95% CI of each accuracy from the generated samples, assuming a t-distribution because our sample size was small (100 per model, per question, per accuracy). Questions that did not meet the premise of a metric (e.g. the human label is “not mentioned” for a patient on an SDoH question when calculating ${Accuracy}_{mentioned}$) were marked as not applicable and excluded from the calculation. To evaluate the performance difference between each LLM and the baseline, we again used jackknife resampling to generate samples for each model, question, and accuracy metric, and then performed a two-sided Welch’s t-test on the accuracy differences between the LLM and the baseline using the generated samples.
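A minimal sketch of the jackknife confidence interval calculation, assuming a 0/1 array of per-note correctness for one model and one question (the exact CI construction used in the study may differ in detail):

```python
import numpy as np
from scipy import stats

def jackknife_ci(correct: np.ndarray, confidence: float = 0.95) -> tuple[float, float]:
    """Jackknife CI for an accuracy, using a t-distribution for small samples."""
    n = len(correct)
    total = correct.sum()
    # Leave-one-out accuracy estimates: accuracy with note i removed.
    loo = (total - correct) / (n - 1)
    mean = loo.mean()
    # Jackknife standard error of the accuracy estimate.
    se = np.sqrt((n - 1) / n * np.sum((loo - mean) ** 2))
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    point = total / n
    return point - t_crit * se, point + t_crit * se

# Comparing an LLM to the baseline with a two-sided Welch's t-test on the
# jackknife samples (equal_var=False gives Welch's test):
# stats.ttest_ind(llm_loo_samples, baseline_loo_samples, equal_var=False)
```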
Additionally, when the post-processing procedure could not map the LLM response to one of the predefined choices, we defined the response as invalid and reported the proportion of invalid responses for all models. We further reported the F1 score, the harmonic mean of precision and recall, using macro averaging and treating each SDoH question as a multi-class classification problem. When calculating F1 scores for LLM responses, we assigned invalid responses to the “not mentioned” category of each question by default, since invalid LLM responses lacked a corresponding manual label.
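A minimal sketch of the macro-F1 calculation for one SDoH question using scikit-learn, with invalid (unmappable) responses folded into the “Not mentioned” class:

```python
from sklearn.metrics import f1_score

def macro_f1(human_labels: list[str], mapped_responses: list[str | None]) -> float:
    """Macro-averaged F1 for one SDoH question, treated as multi-class classification.

    Invalid responses (None, i.e. unmappable LLM outputs) default to "Not mentioned".
    """
    predictions = [r if r is not None else "Not mentioned" for r in mapped_responses]
    return f1_score(human_labels, predictions, average="macro")
```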