Novel Surveillance of Occupational Injury Powered by Machine Learning Using Chief Complaint at Emergency Triage

Background: Underreporting of occupational injuries remains a global health issue and warrants social awareness. To assist surveillance, we developed an automatic screening model that uses the chief complaint recorded at the emergency department to improve the detection and reporting of occupational injuries. Methods: A total of 181,843 emergency visits by patients aged 15 to 65 years at a medical center in Taiwan from April, 2015 to March, 2018 were included. The retrospective cohort comprised 9,307 cases of occupational injuries and 172,536 controls. We used the first 30 months as the training dataset and the following 6 months for prospective testing. Natural Language Processing (NLP) was applied to analyze patients' chief complaints. The sentences were segmented with JiebaR, a Chinese text segmentation tool, reviewed by two occupational physicians, and transformed by a word-embedding model using GloVe. Logistic regression was used to predict suspected cases of occupational injuries. Results: The prediction model using the chief complaint alone achieved an overall AUC of 0.936 [95% confidence interval (CI): 0.931 to 0.941], with high sensitivity (93.1%, 95% CI: 91.8 to 94.4) and adequate specificity (84.8%, 95% CI: 84.4 to 85.2). Patients in the most urgent or severe conditions showed the highest accuracy, around 90%. We also observed increasing trends in referral to occupational medicine specialists and in the reimbursement rate. Conclusions: Interdepartmental coordination and integration of machine learning may augment the detection of occupational injuries at the emergency triage and improve the reporting, compensation, and prevention of occupational injuries.

practice for more than 20 years in Taiwan, there is still a high proportion of occupational diseases that have not been reported in the official system (5-8). Meanwhile, the emergency department (ED) is the main unit for managing urgent injuries and serves as an excellent venue for disease surveillance (6,9-11). However, healthcare professionals typically have to handle an overwhelming number of patients at the ED. This situation makes it difficult to assess occupational injuries efficiently; thus an automatic screening system to facilitate clinical evaluation is needed (12,13). To address under-reporting, joint support by healthcare professionals may serve to detect occupational injuries at the frontline. Finding ways to improve the reporting rate of occupational injuries without burdening the clinical staff in the overcrowded ED is a challenging but important task.
Natural Language Processing (NLP) is a class of algorithms with specific strengths in mining unstructured text, and it has been applied in the surveillance of infectious diseases (12,17,18). We hypothesize that analyzing chief complaints with an NLP algorithm will not only augment the detection of occupational injuries at the ED, but also save time for healthcare professionals.
Chief complaints at the ED are patient-reported outcomes written in Chinese by experienced triage nurses, and comprise key information in the form of brief sentences, including causes of illness, type of injuries or mechanisms, severity, duration of symptoms, time and place of onset (19,20). Starting from February, 2015, a novel occupational surveillance system was established in a tertiary medical center in southern Taiwan that includes universal screening and documentation of work-related injuries and illnesses. Using the system with both unique and universal labels of occupational injuries, we were able to perform chart review and analyze the association between chief complaint and occupational injury.
The specific aim of the study was to build a novel surveillance system that automatically detects occupational injuries by using NLP and statistical machine learning algorithms to facilitate the identification, reporting, compensation and prevention of occupational injuries. Further evaluation of the database may help us analyze risk factors at the workplace and promote workers' overall health.

Methods
This study was approved by the Institutional Review Board of National Cheng Kung University Hospital (NCKUH), approval number B-ER-106-158, before commencement. We retrospectively analyzed chief complaints recorded at the triage station of the emergency department by abstracting electronic records from NCKUH, a tertiary medical center in southern Taiwan, from April, 2015 to March, 2018. An Occupational Diseases Surveillance System (ODSS) had been in place at NCKUH since February, 2015. Three groups of healthcare professionals used the ODSS to detect whether patients visited the ED because of injuries at the workplace: (1) triage nurses, (2) staff at the cashier/registration counter, and (3) ED doctors during consultations. Although the conventional definition of occupational injury requires fulfilling both criteria, "arising out of work" and "in the course of work", the ODSS can only apply the second criterion in daily practice. Briefly, the triage nurses identified injuries that occurred in the course of work and recorded the chief complaints, which were supplemented by complaints identified by doctors and staff at the cashier/registration counter. Injuries resulting from domestic work without payment or without commuting to or from work were not included. Data from the initial two months (February and March, 2015) were excluded to assure validity.
As outlined in Fig. 1(A), there were 309,644 emergency visits in total at the hospital from April, 2015 to March, 2018. ED visits by patients aged 15 to 65 years were included, resulting in a total of 181,843 patients. Among them, 9,307 suspected cases of occupational injuries were identified by our healthcare professionals. Of these 9,307 suspected cases, 6,495 (69.79%) were identified by nurses at the triage station, 2,698 (28.99%) by doctors, and 2,055 (22.08%) by staff at the cashier/registration counter; the three counts sum to more than 9,307 because a case could be identified by more than one group. We excluded duplicated data for cases re-admitted to the ED within 36 hours, and for patients with multiple visits only the chief complaint at the first visit was included. We used events in the first 30 months, from April, 2015 to September, 2017, as the training dataset (7,747 cases and 141,081 controls). The events of the following 6 months, from October, 2017 to March, 2018, were classified as the testing dataset (1,560 cases and 31,455 controls) for prospective evaluation.
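The temporal split described above can be sketched in a few lines. This is a minimal Python illustration and not the authors' code: the record layout (`visit_date`, `is_case`), the helper name `temporal_split`, and the toy records are assumptions made for this example. Only the cutoff (training through September 2017, testing from October 2017) and the counts in the final check come from the text.

```python
from datetime import date

# Visits before the cutoff go to training (April 2015 to September 2017);
# visits on or after it go to testing (October 2017 to March 2018).
CUTOFF = date(2017, 10, 1)


def temporal_split(visits, cutoff=CUTOFF):
    """Split visit records into (training, testing) lists by visit date.

    Each record is assumed to be a dict with a `visit_date` (datetime.date)
    and an `is_case` flag; both field names are illustrative.
    """
    training = [v for v in visits if v["visit_date"] < cutoff]
    testing = [v for v in visits if v["visit_date"] >= cutoff]
    return training, testing


# Toy records, invented for illustration only.
visits = [
    {"visit_date": date(2015, 4, 1), "is_case": True},
    {"visit_date": date(2017, 9, 30), "is_case": False},
    {"visit_date": date(2017, 10, 1), "is_case": True},
    {"visit_date": date(2018, 3, 31), "is_case": False},
]
training, testing = temporal_split(visits)

# Consistency check on the counts reported in the text.
assert 7_747 + 1_560 == 9_307        # cases
assert 141_081 + 31_455 == 172_536   # controls
```

Splitting by calendar time rather than at random keeps every test visit later than every training visit, which is what allows the last 6 months to be treated as a prospective evaluation.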
The main outcome of interest was whether the algorithm could automatically differentiate emergency visits due to occupational injuries from visits due to other causes according to the chief complaints recorded at the triage station. The algorithm of the screening model is summarized in Fig. 1(B). First, the Chinese text segmentation tool JiebaR (21) was applied to segment the chief complaints and identify stop words to be omitted. To assure the quality of the segmentation results and to save computing resources, two occupational physicians in our team carefully selected meaningless words and added them to the pool of stop words. Second, each chief complaint was represented as a vector that considered the summed effect of the remaining words and their relative locations in the sentence, using the GloVe word-embedding model with a dimensionality of 100 (22). Third, a classification model was developed with logistic regression to predict occupational injuries. Three models with different input variables were developed. In Model 1, only the sentence of the chief complaint was used to predict occupational injuries. In Model 2, age and sex were added to the chief complaint. In Model 3, the model included chief complaint, age, sex, triage level, time of ED visit (within working hours versus non-working hours), and weekday versus weekend. Seven metrics, namely sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio [LR (+)], negative likelihood ratio [LR (-)], and area under the curve (AUC), were calculated to evaluate model performance overall and in subgroups stratified by age, gender, triage level and time of visit.
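The three stages (stop-word removal, sentence vectors from word embeddings, logistic regression) can be illustrated with a small self-contained sketch. It is written in Python for illustration, whereas the study used R packages, and it makes several loud assumptions: the complaints are already segmented into tokens, the 3-dimensional `EMBEDDINGS` table stands in for the 100-dimensional GloVe vectors, word position is ignored, and the tokens, stop words, labels, and training settings are all invented. It should not be read as the authors' implementation.

```python
import math

# Toy 3-dimensional embeddings standing in for the 100-dimensional GloVe
# vectors; the tokens and values are invented for illustration.
EMBEDDINGS = {
    "work": [1.0, 0.2, 0.0],
    "machine": [0.8, 0.1, 0.1],
    "cut": [0.5, 0.5, 0.0],
    "fever": [-0.9, 0.3, 0.2],
    "cough": [-0.8, 0.2, 0.4],
}
STOP_WORDS = {"the", "at", "for"}  # hypothetical stop-word pool


def sentence_vector(tokens, dim=3):
    """Sum the embeddings of the tokens left after stop-word removal."""
    total = [0.0] * dim
    for tok in tokens:
        if tok in STOP_WORDS or tok not in EMBEDDINGS:
            continue
        for i, value in enumerate(EMBEDDINGS[tok]):
            total[i] += value
    return total


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Logistic model: probability that the visit is an occupational injury."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Plain batch gradient descent on the logistic log-loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            grad_b += err
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Pre-segmented toy complaints (1 = occupational injury, 0 = other cause).
complaints = [
    (["cut", "at", "work", "machine"], 1),
    (["work", "machine"], 1),
    (["fever", "cough"], 0),
    (["cough", "for", "the", "fever"], 0),
]
X = [sentence_vector(tokens) for tokens, _ in complaints]
y = [label for _, label in complaints]
weights, bias = fit_logistic(X, y)

p_work = predict_proba(weights, bias, sentence_vector(["machine", "cut"]))
p_other = predict_proba(weights, bias, sentence_vector(["fever"]))
```

In this sketch, Models 2 and 3 would correspond to appending extra columns (age, sex, triage level, visit-time indicators) to each sentence vector before fitting.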

Results
In total, there were 9,307 cases of emergency visits of occupational injuries and 172,536 controls of nonoccupational visits for working populations at NCKUH from April 1, 2015 to March 31, 2018 (Table 1). Men accounted for 59.62% of visits related to occupational injuries, while women accounted for 52.71% of other emergency visits during the study period. In the subset of triage level 1 (most urgent in severity), there were more ED visits related to other causes (the control group) than those with occupational injuries (the study group). On average, cases in the study group were generally younger than controls. The time of visit for the study group was mainly during working hours 8:00-17:00 and on weekdays (Monday to Friday), while those of the control group were during non-working hours and on weekends. There was no major difference in the distributions of age, sex, triage severity, and time of visit for both the training and testing datasets. The average length of chief complaints in Chinese character word count was 33 (SD = 12) in the training period and 35 (SD = 13) in the testing period.

Discussion
In this study, we have successfully developed a surveillance system that integrates natural language processing and supervised machine learning to automatically identify occupational injuries through analysis of chief complaints at the emergency department. By using chief complaints only, this system achieved a favorable AUC (0.936) with a sensitivity of 93.1% and specificity of 84.8% (Table 2). It also showed stable performance with accuracies above 84% in both genders, people aged > 25, urgent or severe cases (triage levels I, II, and III), during both working hours and night-time, as well as weekdays and weekends (Table 3). In fact, following the implementation of the occupational surveillance system at the ED, we observed an increasing trend of referral to specialists of occupational medicine at outpatient clinics, and an increasing trend in the reimbursement rate from the National Labor Insurance (Fig. 2), indicating an initial effect in the improvement of identification and surveillance.
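As a worked illustration of how the reported figures relate to one another, the sketch below derives the likelihood ratios from the stated sensitivity and specificity and recomputes 95% confidence intervals. Two assumptions are ours rather than the paper's: that the reported values refer to the testing dataset (1,560 cases and 31,455 controls), and that a normal-approximation (Wald) interval is used. The function names are hypothetical.

```python
import math


def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as fractions."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


# Reported Model 1 values; the denominators are assumed to be the
# testing-set counts (1,560 cases and 31,455 controls).
sens, spec = 0.931, 0.848
lr_pos, lr_neg = likelihood_ratios(sens, spec)
sens_ci = wald_ci(sens, 1_560)
spec_ci = wald_ci(spec, 31_455)

print(f"LR(+) = {lr_pos:.2f}, LR(-) = {lr_neg:.3f}")
print(f"sensitivity 95% CI: {sens_ci[0]:.3f} to {sens_ci[1]:.3f}")
print(f"specificity 95% CI: {spec_ci[0]:.3f} to {spec_ci[1]:.3f}")
```

Under these assumptions the likelihood ratios come out at roughly 6.1 and 0.08, and the recomputed intervals land close to those given in the abstract (91.8 to 94.4 for sensitivity, 84.4 to 85.2 for specificity).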
Chief complaints provided valuable information and served as an essential element in occupational injury surveillance. Table 2 shows that adding more variables (Models 2 and 3) to chief complaints (Model 1) does not seem to improve performance: neither adding age and gender in Model 2 nor adding triage level and time of visit in Model 3 improved the AUC, sensitivity, or specificity. Moreover, since this surveillance tool showed consistent performance in both the training and testing datasets even after stratification into smaller subgroups (23) (Table 3), its validity seems adequate for application in prospective clinical practice (16), at least for the period studied.
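Because the three models are compared mainly by AUC, it may help to recall what that number measures: the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pairwise comparison. The function name and the toy scores are invented for illustration, and the paper does not state how its AUC was computed.

```python
def pairwise_auc(scores_pos, scores_neg):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    A pair counts 1 when the case scores higher, 0.5 on a tie, 0 otherwise.
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy predicted probabilities, invented for illustration.
case_scores = [0.9, 0.8, 0.4]
control_scores = [0.7, 0.3, 0.4, 0.1]
toy_auc = pairwise_auc(case_scores, control_scores)
```

The pairwise loop is quadratic in the number of visits, so for datasets of the size described here a rank-based formulation would be the practical choice; the definition is the same.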
In addition to a standardized, structured electronic medical record (EMR) system, we consider the following major factors to have contributed to the successful application of supervised machine learning algorithms for the surveillance of occupational injuries. First, we had experienced triage nurses who provided short but structured sentences of chief complaints summarizing patient-reported outcomes, signs, time, place, and mechanism of trauma (13,19,24). Second, as much attention has been paid to occupational injuries, all emergency visits are universally checked by three different groups of personnel, namely triage nurses, doctors, and staff at the cashier/registration counter; hence, chief complaints related to such injuries are likely to be picked up by surveillance. Third, two independent domain experts (occupational physicians) carefully reviewed the results of word segmentation and filtered out 28,030 meaningless stop words and phrases. Such efforts provide excellent feedback to machine learning algorithms to improve accuracy during the training and testing periods (25,26). Moreover, this system applied conventional logistic regression, which requires relatively little programming effort and can be carried out on commercial laptops with free R packages available online for multi-language text mining. The automated processing can reduce clinical documentation, increase workflow efficiency, and hence reduce practice time and lower costs (12,20). Supported by an ergonomic design and a user-friendly EMR system, primary care doctors can devote more time to patients and facilitate the identification, referral, reporting, compensation and prevention of occupational injuries (27).
Several limitations must be acknowledged. First, this study applied a logistic regression model to chief complaints only. As the data could be enriched by using detailed electronic medical records, the performance could possibly be improved with non-linear transformation algorithms (28). However, since the current system achieved an AUC above 0.931 across the three models, a more sophisticated system might not improve accuracy significantly while increasing the cost and complexity of computation and facility requirements. Second, the patients in this study were sampled at an emergency department in a tertiary teaching hospital in the Tainan metropolitan area. The patients' demographics, disease severity, and prevalence of injury types might be quite different from those in agricultural townships or rural areas. Thus, our results may not generalize to other communities, though the method could be adopted and tested in other settings. Third, while workers' compensation is the usual gold standard for validating occupational injury, we were unable to apply that criterion in this study because of the low compensation rate, namely 21.69%. The major reason may be related to the experience-rated premium charged by the National Labor Insurance in Taiwan. While medical claims for occupational injuries should be covered by the National Labor Insurance (workers' compensation), the National Health Insurance, available in Taiwan since 1995, can also be used (29,30), which contributes to under-reporting of occupational injuries under the experience-rating system of the National Labor Insurance (31). The difference in copayments is minor, 18 USD for general diseases versus 5 USD for occupational diseases, which further aggravates the discrepancy (32).
To improve the performance of surveillance for occupational injury, we used screening results from health professionals to avoid under-reporting and/or under-detection. As this approach has improved the referral rate, we anticipate a consistent increase in the compensation rate in the future, which would in turn provide feedback to healthcare professionals and improve this surveillance system.

Conclusions
In our study, we successfully developed a useful surveillance system for occupational injuries using natural language processing and supervised machine learning, with high sensitivity and specificity. The system utilized chief complaints collected at the emergency triage to strengthen the conventional surveillance system, reduce underreporting, and improve occupational safety and health (29). We recommend that such a system be tested in more diverse settings to improve its efficiency and generalizability (20). Such research might also trigger studies on detecting injuries among workers with comorbid chronic disease and/or polypharmacy, which is crucial in promoting Total Worker Health (33,34) in the aging workforce (35,36).

Declarations
Ethical approval and consent to participate: The study protocol was approved by the Institutional Review Board of NCKUH (Approval Number: B-ER-106-158) before commencement. Informed consent was waived since this is a retrospective, de-identified analysis.

Figures
Figure 1
(A) Flow diagram of the inclusion of the study population divided into training and testing datasets. Emergency department (ED) visits by patients aged 15 to 65 years were included at a medical center in southern Taiwan from April 2015 to March 2018. We used events that occurred in the first 30 months as the training dataset, and those of the last 6 months as the testing dataset for evaluating the performance of the screening model based on machine learning. (B) The occupational injury screening system with natural language processing (NLP) was implemented in three stages. Word segmentation was applied to Chinese texts with the R package JiebaR, which resulted in 28,030 stop words to be omitted from the training data. Then, the word-embedding model GloVe was used to assign each chief complaint sentence a vector. In the third step, a classification model was developed using logistic regression.

Figure 2
Trends in outpatient referral to occupational medicine and in the National Labor Insurance compensation rate among suspected cases of occupational injuries.