Detecting adverse events in clinical care using natural language processing CURRENT

Background Current methods for retrospective review of medical records require a substantial effort in terms of both time and cost. Therefore, we wanted to find the best method (based on natural language processing (NLP)) to select cases from the medical records for further investigation in search of a (potentially preventable) adverse event (AE), in order to decrease this effort. Methods The basic dataset consisted of 2987 medical records of patients who died during their hospitalization. To gain insight into the signal-to-noise ratio of the various resources, several subsets of our basic dataset were tested. Thereafter, we tested the scalability. After the best subset was chosen, several NLP algorithms were tested to select the best performing algorithm for the detection of AEs. In the last experiment we tested the performance of the computer algorithms in predicting potentially preventable AEs. The results of the NLP were compared with the outcome of the original retrospective medical record review. Results The dataset which contained the last three letters of the medical record showed the greatest potential. The scalability experiment showed that more data leads to a better performance of the algorithm. The best performing algorithm in the third experiment was the one based on a support vector machine (SVM), with a precision of 79%, a negative predictive value (NPV) of 95% and a specificity of 85%. The results of the preventability experiment showed that the performance of the algorithms was almost equal to the results for the AEs. Conclusions In this study, we have shown that the SVM algorithm generates the most accurate results for the selection of cases for further investigation in the search for a (potentially preventable) AE. The sensitivity of the algorithms was around 75%. However, the SVM algorithm selected fewer cases to be examined for AEs compared to the original method. Consequently, this would lead to a lower workload for the committee.
At the same time, a substantial number of cases with potentially preventable AEs were not detected by machine learning.


Background
Current methods for retrospective review of medical records require a substantial effort in terms of both time and cost. Although trigger systems aim to decrease the burden, they still select many cases without adverse events (AEs). To optimize the process, cases are sometimes preselected, leading to the investigation of a specific subset of cases. Examples are: deceased patients, patients of a particular department, or patients with a longer length of stay. Results are therefore not generalizable to all hospital patients. Another issue is the relatively high inter-rater variation and low reproducibility of trigger results and, subsequently, of AE detection with current methods. [1][2][3][4][5][6][7] This suggests that these methods generate results that make translation to clinical practice difficult.
To increase the efficacy, computer-assisted detection of triggers, or even better of AEs, in medical records has been developed. The aim is to reduce the labor-intensive manual record review and to increase the reproducibility. 8 Thus far, text mining software is available for the detection of triggers and a limited selection of AEs. [9][10][11][12][13][14][15][16][17] The software has to be taught what triggers or AEs are, preferably with a large database of well-characterized cases with their triggers and AEs. Until now, the optimal dataset with patient details to be investigated has not been known. For example, incorporating all information about the patients might influence the signal-to-noise ratio negatively and thus also the reliability of the results. Therefore, the optimal dataset has to be determined. Thereafter, different natural language processing (NLP) algorithms can be tested to find the algorithm with the highest sensitivity and specificity for detecting triggers or AEs.
In this study, we wanted to find the optimal method to select cases for further investigation in search of a (potentially preventable) AE. We describe our stepwise approach for developing a computer algorithm to search reliably for AEs in medical records. First, we used a fast NLP algorithm, chosen because of computing power restrictions, to identify the optimal dataset selected from the medical record in our database of well-characterized cases of deceased patients. Second, we evaluated the influence of the number of records to be included on the agreement (true positives plus true negatives divided by the total group). Third, several experiments were performed with different NLP algorithms on this optimal dataset to find the algorithm with the best performance for finding AEs and their preventability. Then, the performance of the best computer algorithm was validated on a different part of the data.

Methods
The usual procedure of the medical record review (our "gold standard") Since 2008, a team of nurses, trained according to the EMGO/NIVEL standards, 18 has screened the medical records of all deceased inpatients (approximately 700 annually) for the presence of one or more of the fifteen triggers (figure 1). Triggers are clues which alert screeners to potential AEs (for example "unplanned transfer to the intensive care unit"). To facilitate the necessary steps in this procedure, a database was introduced in 2010 (Medirede®, Clinical File Search; Mediround BV). We used the triggers originally proposed by the Harvard Medical Practice Study (HMPS) in 1991 19 with a slight adjustment to fit the group of deceased patients: the triggers regarding transfer to another acute care hospital and unplanned inappropriate discharge to home were omitted as they have no relevance in deceased patients.
The medical records with at least one trigger were redirected to the review committee. This committee consisted of both active and retired medical specialists with considerable clinical experience in the field of quality and safety in healthcare and medical record review. After a thorough review by a member specialised in the field of medicine related to the main diagnosis of the case (e.g. a surgeon investigates surgical patients), the case was presented to the other members of the review committee in a regular meeting. A first conclusion on the potential presence of an AE was then established. Subsequently, after consulting the involved specialists, the committee made a final decision on the presence of an AE and the potential preventability of this AE. Since 2012, the composition of the review committee has been stable.
Previous research showed that the average time nurses needed for the manual screening of triggers was 38 minutes (they had no time restrictions), for the reviewers this was on average 60 minutes (excluding the time needed for the discussion in a meeting). Thus, it takes approximately 1.5 hours for the total review of a single medical record.

Data
The total dataset (originating from our "gold standard" procedure) consisted of 2987 medical records of patients who died during their hospitalization. All records between 2011 and 2016 were included.
Of these records, 59% contained one or more triggers after the screening by the nurses. In 742 (42%) of these medical records with a trigger, one or more AEs were detected by the review committee; 208 of these AEs were classified as potentially preventable. For these records, there was full access to surgical reports, discharge letters, patient records, nursing reports, medication use, radiology reports, lab results and the medical history.

Definitions
An AE was defined as an unintended outcome arising from the (non-)action of a caregiver and/or the health care system, with damage to the patient resulting in temporary or permanent disability or death of the patient. 20 The patient file was defined as the document including all reports, letters, lab results, scans, and the medical history.

Modified data
From the basic dataset, several subsets of data were selected. The machine learning used here is based on NLP: the ability of a computer program to understand human language as it is written or spoken, a component of artificial intelligence. In machine learning, it is important to make a selection of data with a high signal-to-noise ratio: preferably, one wants to find the "signal" in the data rather than fit the noise. In this case, the signal was the useful information in the medical record pointing towards the AE, and the noise was the information in the medical record that is not helpful in locating the AE. To gain insight into the signal-to-noise ratio of the various resources, several subsets of our basic dataset were tested (A-F).

A: Last general letter to the general practitioner (GP)
This letter describes the last communication from the hospital to the GP of the patient.
In this selection, all records without a last general GP letter were excluded. There were 476 medical records without this letter, leaving 2511 (84%) for inclusion in the analysis. We have chosen the last GP letter because we assumed this contained the most useful information regarding the hospitalization of the patient.

B: Last general GP letter or last written document
In this selection, the last general GP letter was used, but for the 476 cases in which this letter was missing, the last written document was used instead. Therefore, all of the original 2987 records were included in this analysis.

C: Last three letters
In this analysis, the last three letters of all medical records were included.

Specificity: calculated as the number of true negatives divided by the total number of medical records which were identified as not containing an AE.

Computer NLP algorithms
We tested different computer algorithms and explored the feasibility of this software (Open Mines Platform, supplied by "the Praktijk Index").
The following NLP algorithms were used:
- Naive Bayes (NB) with n-gram input and term frequency-inverse document frequency (tf-idf) scores
- FastText (FT): a 2-layer neural network with hierarchical softmax output
- Linear support vector machine (SVM)
- Convolutional neural network (CNN) based on pre-trained word vectors 21,22

Experiments
As a first step, the six datasets (described in the section Modified data) were provided to the NB algorithm to select the dataset with the highest performance in predicting an AE. This selected dataset was then used for the next experiments. To correct for the variation of the initialization, this experiment was repeated 28 times. Due to time and computing power restrictions, we chose to test the fast NB algorithm for all selections. In the second experiment (scalability), we tested whether the performance would decrease if a smaller training set was available. In the third experiment, all four algorithms were trained, optimized and tested for AE prediction using the best-fitted dataset resulting from the first experiment (see box 1). In the last experiment, we attempted to predict the preventability of an AE. AEs were therefore categorized as "probably not preventable" or "potentially preventable". The results of these tests were compared with the outcome of the gold standard (committee review).

Box 1: Explanation of training, validation and test
1. Training set: a dataset consisting of examples used for the NLP to learn, that is, to fit the parameters (e.g., weights) of, for example, a classifier. [7] [8] In this case the NLP was taught which cases contained an AE and which did not.
2. Validation set: a set of examples used to tune the parameters of a classifier. It is sometimes also called the development set or the "dev set".
3. Test set: a dataset that is independent of the training dataset, but that follows the same probability distribution as the training dataset. A test set is therefore a set of examples used only to assess the performance (i.e. generalization) of a fully specified classifier. [7] [8]

60% of the data was used for training, 20% for validation and 20% for testing.
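The 60/20/20 split and a count-based classifier in the spirit of Naive Bayes can be sketched as follows. This is an illustrative Python sketch only, not the Open Mines Platform implementation used in the study, and the toy records are hypothetical stand-ins for discharge letters.

```python
import random
from collections import Counter

def split_60_20_20(records, seed=42):
    """Shuffle and split labelled records 60/20/20 into train/validation/test."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.6)
    n_val = int(len(shuffled) * 0.2)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def train_nb(records):
    """Count word frequencies per class (AE present / AE absent)."""
    counts = {True: Counter(), False: Counter()}
    for text, has_ae in records:
        counts[has_ae].update(text.lower().split())
    return counts

def predict_nb(counts, text):
    """Classify by comparing smoothed per-class word likelihoods."""
    def score(label):
        total = sum(counts[label].values()) + 1
        return sum((counts[label][w] + 1) / total for w in text.lower().split())
    return score(True) > score(False)

# Hypothetical toy records; real input would be (letter text, committee label).
records = [("unplanned reoperation after sepsis", True),
           ("routine recovery expected discharge", False)] * 50
train, val, test = split_60_20_20(records)
model = train_nb(train)
accuracy = sum(predict_nb(model, t) == y for t, y in test) / len(test)
print(len(train), len(val), len(test), accuracy)  # 60 20 20 1.0
```

In practice the validation portion would be used to tune settings (e.g. n-gram size) before the single final evaluation on the test portion.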

Privacy
To guarantee patients' privacy, information such as names, addresses, and other personal details was deleted from the data. After the anonymization, the data were checked by the privacy officer and researcher DK.
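A minimal sketch of how such rule-based anonymization can work is shown below. The patterns are hypothetical examples for illustration only, not the rules actually applied in the study; a real de-identification pipeline would use far more extensive rules or a dedicated tool.

```python
import re

# Hypothetical redaction patterns (illustration only).
PATTERNS = [
    (re.compile(r"\b\d{2}-\d{2}-\d{4}\b"), "[DATE]"),         # e.g. 03-05-2014
    (re.compile(r"\b[A-Z][a-z]+straat \d+\b"), "[ADDRESS]"),  # Dutch street address
    (re.compile(r"\b(?:Mr\.|Mrs\.) [A-Z][a-z]+\b"), "[NAME]"),
]

def anonymize(text):
    """Replace every matched pattern with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

letter = "Mrs. Jansen, Kerkstraat 12, admitted on 03-05-2014."
print(anonymize(letter))  # [NAME], [ADDRESS], admitted on [DATE].
```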

Results

Experiment 1: Dataset selection
In table 1 the results are presented for the individual datasets. The positive predictive value varied widely between the different datasets. The accuracy was around 75% for all datasets and the specificity was close to 100% for every dataset.
Dataset C, which contained the last three letters, showed the greatest potential and was therefore used in the next experiments to test the performance of the different algorithms. For this first experiment, the NB algorithm was used.

The best performing algorithm was the SVM algorithm, with a precision of 79%, an accuracy of 84% and a specificity of 85%. Figure 2 shows the precision plotted against the recall/sensitivity for the SVM algorithm. This shows that if we were to create a set with 82% precision, this would result in a recall/sensitivity of 40% (percentage of AEs with respect to all AEs). In figure 3, the ROC curve of the SVM algorithm is presented. Table 2 shows the performance of the four algorithms separately. These results are based on 12 repetitions of the experiment, with the same settings for the algorithm but with a different distribution of the training set and the test set.

The results of the preventability experiment show that the performance of the algorithms for preventability is almost equal to the results for the AEs. However, it must be considered that these results are based on a small dataset. This experiment was repeated five times.
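The reported measures follow directly from a confusion matrix. The sketch below uses hypothetical counts, chosen only so that the resulting percentages are close to the values reported above; they are not the study's actual numbers.

```python
def review_metrics(tp, fp, tn, fn):
    """Standard performance measures from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),             # PPV: TP / predicted positives
        "npv": tn / (tn + fn),                   # TN / predicted negatives
        "specificity": tn / (tn + fp),           # TN / actual negatives
        "sensitivity": tp / (tp + fn),           # recall: TP / actual positives
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for illustration only.
m = review_metrics(tp=75, fp=20, tn=115, fn=25)
print({k: round(v, 2) for k, v in m.items()})
# {'precision': 0.79, 'npv': 0.82, 'specificity': 0.85, 'sensitivity': 0.75, 'accuracy': 0.81}
```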

Discussion And Conclusion
In this study, we have shown that the SVM algorithm generates the most accurate results for the selection of cases for further investigation in the search for a (potentially preventable) AE. The sensitivity of the algorithms was around 75%. However, the SVM algorithm selected about 50% fewer cases to be examined for AEs compared to the original method. Consequently, this would lead to a lower workload for the committee. 7 At the same time, a substantial number of cases with potentially preventable AEs were not detected by machine learning (20% for the CNN and FT algorithms, 18% for the SVM algorithm, 24% for the NB algorithm). If we look at accuracy, the SVM algorithm performs best; however, for precision, the algorithms do not differ much. The CNN and the SVM algorithms used the letters and the patient record. By using more structured information which is already generated by hospitals, such as age and length of stay, the algorithms could be improved.
Another option is using more data: the number of examples from which an algorithm learns influences its performance.
Because of an unfavorable signal-to-noise ratio in the total record, the performance of the algorithms depends on the selected dataset. When different dataset selections were compared, it was indeed shown that the use of the unedited total patient record does not improve the output and can even have a negative effect on the results. Different datasets show different results for sensitivity and specificity. This is probably caused by the length of the patient record itself and especially by long sentences.
The use of discharge letters, in particular the letters to the GP, appeared to be useful because of a more favorable signal-to-noise ratio. These letters contain concise information regarding the hospitalization of the patient, in contrast to the extensive patient record, which contains a lot of "noise". The scalability experiment shows that by increasing the dataset, the precision of the algorithms will improve.
Finally, the results of the preventability experiment showed that the preventability can be adequately estimated despite scarce records in the training set.
Until now, the literature concerning the automatic detection of AEs has focused on a short list of prespecified AEs, or only on the triggers for which the data were searched. 9 Our results suggest a better performance compared to these studies.
The strengths of our study were the careful step-by-step analysis with NLP to find the best algorithm, which improved the precision for a potentially preventable AE compared with the current trigger system. Several currently common NLP algorithms were tested. Experiments were repeated until saturation was reached. Also, we tested different datasets to find the most useful one with the most valuable information.
Still, some questions remain because of some weaknesses in our study. First, it is not clear what an optimal cut-off point for sensitivity or specificity could be. This has to be determined by further research, to see whether the data generated with this study lead policymakers to the same conclusions as conventional methods (in this study, our "gold standard"). Second, the number of well-characterized records can be considered rather small for NLP programs to be optimized, especially when we take the potential preventability into account. Finally, this was a selection of deceased patients, and it might be even more interesting to use this method in patients who were alive at discharge from the hospital. Therefore, we think these data are not generalizable to living patients.
We think that if the algorithms' performance is considered acceptable, the method should be tested prospectively and compared to conventional methods. Considering that the conventional method is our "gold standard", the agreement will never be 100%. However, this is not an issue if the conclusions that can be drawn from the final results are comparable to the original ones. Still, we think the word of the committee cannot be fully replaced, because the communication with the involved departments is crucial to the success of this quality and safety instrument.
In conclusion, the use of NLP algorithms to predict potentially preventable AEs in specific datasets from patient records is a promising tool to simplify record review. Further research is necessary to investigate whether the results of this method lead to the same overall conclusions from medical record review as the more expensive and labor-intensive conventional methods.

Ethics approval
The study protocol was evaluated by the ethics committee of our hospital.

Consent for publication
Not applicable

Availability of data and material
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

[Figure: Influence of the amount of data against the accuracy of the algorithm]
[Figure: ROC curve of the SVM algorithm]