Our study showed that NLP approaches developed to detect complex clinical constructs, such as suicidality, can be successfully ported and shared across institutions, given proper emphasis on clear and effective communication. First, our informative discussions on the technical compatibility of our NLP algorithms (now hosted on GitHub1) made porting the algorithms a seamless experience. Second, once the approaches were applied to the test data sets, a deeper understanding of the underlying details allowed for a more informative evaluation and identification of potential areas for improvement. While traditional quantitative measures of portability, such as accuracy and F1 scores, indicated that the ported NLP approaches failed to achieve the same level of success as at their home institutions, there were inherent reasons for this. Our qualitative error analysis, which took into account each approach's slight differences in objectives, indicated that many of the errors were to be expected given each institution's guidelines for defining and annotating suicidality. In fact, when expected errors were taken out of consideration, the ported algorithms' results improved significantly. For the KCL-neg algorithm on the WCM data, 65 (32%) of the 206 errors were expected, which raised the algorithm's overall F1 score from 0.75 to 0.83. Similarly, of the 1,398 errors that the WCM-si approach made on the KCL data, 587 (42%) were expected based on the algorithm's configuration and objective, raising the overall F1 score from 0.72 to 0.83.
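The rescoring described above amounts to recomputing F1 after subtracting expected errors from the false positive and false negative counts. The sketch below illustrates that arithmetic; the counts in the usage example are hypothetical placeholders, not the actual confusion matrices from either evaluation.

```python
def f1_score(tp, fp, fn):
    """Standard F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_excluding_expected(tp, fp, fn, expected_fp, expected_fn):
    """Rescore after removing errors that were expected given the
    annotating institution's definition of suicidality."""
    return f1_score(tp, fp - expected_fp, fn - expected_fn)

# Hypothetical counts for illustration only:
baseline = f1_score(tp=80, fp=20, fn=20)
rescored = f1_excluding_expected(tp=80, fp=20, fn=20, expected_fp=10, expected_fn=10)
```

Discounting expected errors in this way isolates genuine portability failures from disagreements that stem purely from differing annotation guidelines.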
In addition to enabling a more meaningful evaluation, understanding the underlying details of the NLP approaches may inform how to develop a more generalizable approach. Consistent with prior work, we confirmed in this study that suicidality is interpreted and defined differently across institutions and healthcare systems. Between our two institutions, the biggest differences in the definition of suicidality involved decisions on temporality and the inclusion of indirect suicidality-related terms. While these differences exist, our approaches also had a number of similarities, including terms in the target lexicons and modifier terms within the negation, conditional, and non-experiencer (unrelated) categories. This commonality suggests that with further experimentation and collaboration, we can continue to improve the detection of this complex clinical condition by developing a portable and generalizable approach.
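The shared structure both approaches rely on, a target lexicon checked against modifier cues, can be sketched minimally as follows. The term and cue lists here are hypothetical placeholders for illustration, not the actual KCL or WCM lexicons.

```python
import re

# Illustrative lexicons only -- not the institutions' actual term lists.
TARGET_TERMS = ["suicidal", "suicide attempt", "wish to die"]
NEGATION_CUES = ["denies", "no evidence of", "negative for"]

def classify_sentence(sentence):
    """Return 'positive' if a target term appears without a preceding
    negation cue, 'negated' if a cue precedes it, 'absent' otherwise."""
    text = sentence.lower()
    for term in TARGET_TERMS:
        match = re.search(re.escape(term), text)
        if match:
            preceding = text[:match.start()]
            if any(cue in preceding for cue in NEGATION_CUES):
                return "negated"
            return "positive"
    return "absent"
```

In a full implementation, the conditional and non-experiencer (unrelated) categories would be handled with analogous cue lists; an external institution could port such an approach simply by swapping in its own lexicons.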
Both of our institutions have also implemented more novel, state-of-the-art methods for the detection of suicidality, such as a text classification convolutional neural network (CNN) (23) and support vector machines (SVM) (10). However, we decided to experiment with the more basic lexicon- and rule-based NLP algorithms for two reasons: ease of portability and human interpretability. With the eventual goal of building generalizable and portable NLP approaches to detect suicidality, we determined that rule-based approaches could be more widely implemented across institutions, as they require significantly less technical expertise, computational power, and other resources. In addition, many studies have concluded that simple rule-based approaches achieve levels of success similar to these novel implementations (24–26). A second advantage of rule-based NLP algorithms is their human interpretability. While effective, state-of-the-art machine learning methods are challenging to interpret (27,28) and even more difficult to adjust. With rule-based NLP algorithms, an external institution can determine the most common sources of error and modify the approach as it sees fit for its own use case.
There are several limitations to this study. First, because the suicidality NLP approaches and test data sets were developed completely independently of each other, we recognize that they may not be directly comparable. Although a study with a single objective and manual annotation guideline may have been a more robust approach to evaluating portability, we view this as a real-world experiment that is thus replicable by other institutions. Second, our method of data extraction differed between the two institutions. While the WCM team extracted only notes with explicit mentions of suicidality, such as "suicidal," the KCL team extracted notes with wider search criteria that included both explicit and "implicit" terms, such as "wish to die." Third, incorporating diagnostic codes, such as ICD-9/10 codes, may improve our algorithms' results. However, the aim of this study was to evaluate and assess the portability of our suicidality NLP approaches rather than their prediction results. We hope that results from this study will make NLP algorithms more widely accessible and bolster the results of existing suicide prediction algorithms currently reliant on structured EHR data such as diagnosis codes. Finally, because we did not port and evaluate our more advanced machine learning–based algorithms, we cannot say for certain whether these models would have achieved greater success than our lexicon-based NLP approaches. However, given our goal of developing NLP approaches that can be widely used, we believe this is out of scope for this study and a future area of research.
In an effort to understand patients who are at risk of death by suicide, routinely collected data from healthcare institutions, such as EHRs, can be a valuable resource at scale. Information about suicidal risk behavior is predominantly documented in free-text notes in EHRs, leading to an increase in the development of NLP approaches to detect suicidality in unstructured clinical text. However, due to the clinical complexity of suicidality, the lack of consensus on how to define this condition, differences in how clinical assessments are documented in EHRs, and the various ways the task can be modeled for information extraction, the development of relevant NLP approaches has largely been local to each institution's definitions and interpretations. Thus, these NLP approaches are much less generalizable and portable than phenotype algorithms for well-defined clinical conditions, such as rheumatoid arthritis (11). However, with a well-defined process to understand the underlying details of an approach, institutions can be well equipped to make use of an external approach, allowing a larger number of institutions to participate in suicide-related research. This is a critical step forward in learning how to develop more robust, portable, and generalizable NLP methods that can be applied to any clinical text, regardless of the origin EHR system.