To the best of our knowledge, this study represents the first investigation of trauma-related features in a large and representative cohort of psychiatric electronic health records. In this study we developed clinically informed guidelines for annotating these patient health records for instances of traumatic events, created a publicly available gold standard dataset, and demonstrated that data gathered using this annotation scheme are suitable for training an ML model to identify indicators of trauma in novel (unseen) health records. The preprocessing steps, guidelines, and code for the NLP models are available in a repository on our GitHub group (<https://github.com/TraumaML/JAMIA-Materials/>).
We identified five span tags (Event, Perpetrator, Symptom, Substance, and Temporal_Frame) and three relational tags (Perpetrated_By, Grounded_To, and Sub-Event) as important for modeling trauma-related linguistic features in psychiatric EHR data. Sub-Event tags were not included in baseline modeling due to low support and limited agreement among annotators.
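For reference, the tag set can be summarized compactly. The sketch below is purely illustrative: the tag names follow the guidelines described here, but the data structure is our own shorthand rather than the serialization format of the released corpus, and the argument pairings shown for each relation reflect our reading of the examples in the text.

```python
# Illustrative summary of the TraumaML tag set; the layout is shorthand only,
# not the serialization format of the released corpus.

SPAN_TAGS = ["Event", "Perpetrator", "Symptom", "Substance", "Temporal_Frame"]

# Each relation links a pair of span types (pairings inferred from the examples in the text).
RELATION_TAGS = {
    "Perpetrated_By": ("Event", "Perpetrator"),    # who carried out the event
    "Grounded_To": ("Event", "Temporal_Frame"),    # when the event occurred
    "Sub-Event": ("Event", "Event"),               # excluded from baseline modeling (low support)
}
```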
Identifying symptoms and trauma-related features in the mental health domain is challenging because psychiatric clinical narratives exhibit considerable variability, reflecting both the heterogeneity of patient presentations and the heterogeneity in how different clinicians document clinical events. Nuance is intrinsic to the annotation of psychiatric EHRs, and in many cases there is no single, obviously correct way to annotate a span. The following discussion aims to shed light on the consistent points of disagreement and complexity in the annotation process and how we navigated these challenges. Our efforts serve as a guide for future annotation endeavors, providing insight into how to effectively identify symptoms and trauma-related features while accounting for the inherent nuance and variability of clinical narratives.
To guide our annotation process, we therefore relied on three overarching principles: (1) approach the text neutrally; (2) maximize the pertinent clinical information; and (3) optimize model performance by minimizing complexity. Principles #2 and #3 are generally in tension; any increase in annotation complexity (more span tags, attributes, and relations, or an expanded reading frame) comes at the cost of a more difficult learning task with fewer training instances of increasingly specific descriptions. The guidelines were therefore developed with the project scope and the capabilities of the model in mind, so that meaningful conclusions could be drawn from a training corpus of fixed size. To illustrate how these principles were instantiated in the guidelines, we describe the resolutions to several major areas of disagreement during the annotation process.
In some cases, annotators found it difficult to judge the truthfulness or credibility of a traumatic event, for example because the patient was acutely psychotic (e.g., delusional). In these instances, events were assigned the default attribute, factual. Annotators marked the attributes maybe or unlikely only when the sentential context unambiguously indicated that the event in question was uncertain or dubious. Thus, “sexual assault” in the span “possible sexual assault” was attributed as maybe, but in the line “pt noted that she had been assaulted…” further in the same document, the span “assaulted” was attributed as factual. This conservative approach exemplifies principle #1: the text is taken at face value, reflecting the current limitations of language models in inferring nuanced context, especially over long reading frames. Substance use was annotated similarly: unspecified drug or alcohol use was not annotated unless the clinician identified it as a clinical problem.
Additionally, many ambiguous situations during annotation required balancing annotation richness against modeling complexity. For instance, to establish consistency, the extent of a span tag was defined as the minimal extent that captures the information pertinent to the study. An example is the decision to annotate “self-injury” alone as a symptom, rather than the full span “scratch herself superficially as self-injury.” Although this decision loses some clinical context, the additional information would be extraneous if the training data do not provide sufficient support for a model to draw an inference about the symptom in question. In this way principles #2 and #3 are balanced. Importantly, this balance is contingent on the scope of the study. Because we focused specifically on childhood trauma, we decided to annotate only “trauma” within the context “military trauma”, as the gain in specificity would not offset the loss in generalizability.
To simplify the annotation task while still capturing specific temporal links between temporal expressions, such as dates, and events ({sexual abuse} in {2002}), we created the Temporal_Frame tag with five possible temporal type attributes: age, duration, time-of-life, major event, and date. The relation between the temporal frame and the event is expressed with the Grounded_To relation. We do not include other temporal relations, such as the temporal relation (TLINK) links described in the NIMH-THYME (Temporal History of Your Medical Events) guidelines, as they are beyond the project scope. Although a framework of temporal relations to the onset of psychosis symptoms has been reported,[35] that framework focuses on identifying psychosis symptom onset and is based on a limited set of symptom keywords, which is not suited to capturing more complex linguistic variants. Developing temporal NLP systems for mental health records remains a challenge due to the inherent complexity of the task. In future work, we plan to develop a psychiatric temporal relation annotation scheme and build temporal information extraction systems for psychiatric notes using a graph neural network method.
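To make the grounding concrete, consider the example above ({sexual abuse} in {2002}). The encoding below is a hypothetical standoff-style representation used only for illustration; the field names, the attribute name "credibility", and the character offsets are ours and do not correspond to the corpus's actual file format.

```python
# Hypothetical standoff-style encoding of the example "sexual abuse in 2002".
# Field names, the "credibility" attribute name, and offsets are illustrative only.
text = "sexual abuse in 2002"

spans = [
    {"id": "T1", "tag": "Event", "start": 0, "end": 12, "text": "sexual abuse",
     "attributes": {"credibility": "factual"}},   # default event attribute (factual)
    {"id": "T2", "tag": "Temporal_Frame", "start": 16, "end": 20, "text": "2002",
     "attributes": {"type": "date"}},             # one of: age, duration, time-of-life, major event, date
]

relations = [
    {"id": "R1", "tag": "Grounded_To", "from": "T1", "to": "T2"},  # event grounded to its date
]
```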
Model error analysis
The performance of the baseline models demonstrates the practical viability of the gold standard TEPC for machine learning applications. Incorrect predictions made by the baseline models primarily fall into one of three categories: errors in predicting span tag extents, errors in classifying relations, and difficulty distinguishing between syntactic and semantic validity. For example, in the line “mother offered support to patient,” the span “mother” was incorrectly predicted to be a perpetrator. To address these issues, we will finetune the transformer blocks in the models on the entirety of the psychiatric clinical narratives in the RPDR, redesign our model architecture as a joint model that predicts both the text spans and their relations in a single step, and modify the model’s input representation to include temporally preceding information in a document.
In comparing the performance of the baseline models to that of the clinical experts involved in the annotation task, it is difficult to establish a direct comparison for the relation extraction model. This is because the relation extraction model receives as input the span tags from the TEPC, in which span-level disagreements have already been resolved during adjudication. Therefore, errors introduced by span-level disagreement (e.g., one annotator marking <shoved> and <father> with a Perpetrated_By relation between them, and another marking <shoved> and <his father> with the same relation) are absent from the model evaluation but present in the human annotator error rate. As a result, the reported performance of the relation model is expectedly higher than the reported human performance, despite human performance generally being considered an upper bound on achievable accuracy for the task. Computing the relaxed F1 metric ameliorates the issue of span-level disagreement but also significantly reduces the number of instances used to compute the human baseline, as any relation tags with span-level disagreements are excluded.
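For clarity, the sketch below illustrates one common way relaxed matching can be computed for relation tags: the relation label must agree, and each argument span need only overlap its counterpart rather than match exactly. This is a generic illustration under that assumption, not necessarily the exact criterion behind the reported relaxed F1, and the function and field names are our own.

```python
def spans_overlap(a, b):
    """Character-offset spans (start, end) overlap if they share any character."""
    return a[0] < b[1] and b[0] < a[1]

def relaxed_relation_match(pred, gold):
    """A predicted relation matches a gold relation if the labels agree and
    each argument span overlaps its gold counterpart."""
    return (pred["label"] == gold["label"]
            and spans_overlap(pred["arg1"], gold["arg1"])
            and spans_overlap(pred["arg2"], gold["arg2"]))

def relaxed_f1(predictions, gold_relations):
    """Greedy one-to-one matching of predictions to gold relations under the
    relaxed criterion, followed by the usual precision/recall/F1 computation."""
    unmatched_gold = list(gold_relations)
    tp = 0
    for pred in predictions:
        match = next((g for g in unmatched_gold if relaxed_relation_match(pred, g)), None)
        if match is not None:
            unmatched_gold.remove(match)
            tp += 1
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(gold_relations) if gold_relations else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```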
Limitations and related future considerations
There are a few limitations that should be highlighted. First, given the complexity of psychiatric clinical notes and the expertise required of annotators, the annotated corpus is small relative to corpora in other domains (e.g., recipes). This is a major limiting factor for training ML models, and we are developing new methodologies to scale up the amount of de-identified and annotated data in the corpus. Second, within the annotated corpus, 23 notes were annotated by two annotators, whereas the remaining 78 were annotated by a single annotator. However, these 78 singly annotated notes were retroactively updated to reflect changes to the guidelines that arose while resolving discrepancies in the dually annotated notes. Third, TraumaML is designed for the annotation of EHRs for patients with diagnoses of PTSD, psychotic disorders, and mood disorders; it does not cover all psychiatric diagnoses (e.g., addiction). However, the annotation schema and guidelines can be adopted, expanded, and modified to suit psychiatric conditions that extend beyond the scope of this project. Finally, the annotation guidelines were created with the current state of the art in NLP and ML in mind. This choice is reflected in the trade-off of richness in clinical detail for modeling capability, particularly given limitations in reading-frame size and in few- and one-shot learning. Future models may implement new methods that remove these limitations, enabling a gold standard that assumes human parity.