Study Population
In this retrospective cohort study, we randomly selected patients who were at least 18 years of age and were discharged from three acute care facilities in Calgary, Canada, between January 1 and June 30, 2015. Obstetric admissions were excluded because of their typically short lengths of stay and the absence of the conditions of interest. If a patient had multiple discharges during the study period, we randomly selected one hospitalization [15]. Six nurses reviewed charts to determine the presence of CeVD [15].
Data sources
EMR: Sunrise Clinical Manager (SCM)
The EMR data are from SCM, a city-wide, population-level EMR system used in the three acute care hospitals in Calgary. SCM provides patient-level clinical information, including medical and nursing orders, medication records, clinical documentation, diagnostic imaging, and laboratory results [16].
Administrative Discharge Abstract Database: DAD
Inpatients’ administrative, clinical, and demographic information at the time of discharge is coded in the DAD [17]. Clinical coders record up to 25 diagnostic codes for each inpatient based on the information available in patient charts. The DAD, EMR, and chart data were linked using the Personal Health Number (a unique lifetime identifier), the chart number (a unique number associated with a patient’s admission), and the admission date.
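The three-key linkage described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline; the field names (`phn`, `chart_number`, `admission_date`) are illustrative placeholders, not the real schema.

```python
def link_records(dad, emr, chart):
    """Link DAD, EMR, and chart-review records on the three shared keys.

    Each source is a list of dicts; field names are hypothetical. Only
    hospitalizations present in all three sources are kept (inner join).
    """
    def key(r):
        return (r["phn"], r["chart_number"], r["admission_date"])

    emr_by_key = {key(r): r for r in emr}
    chart_by_key = {key(r): r for r in chart}
    linked = []
    for r in dad:
        k = key(r)
        if k in emr_by_key and k in chart_by_key:
            # Merge the three records for this hospitalization into one row.
            linked.append({**r, **emr_by_key[k], **chart_by_key[k]})
    return linked
```

An inner join is the natural choice here, since a hospitalization missing from any one source cannot be used for training or evaluation.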
Phenotyping Algorithm Framework
We trained, validated, and tested an EMR data-driven phenotyping algorithm using NLP techniques to detect CeVD. NLP techniques are used to process and analyze human language and encompass a wide range of tasks, including named entity recognition (NER), information extraction, and text classification [11, 16]. We applied them to analyze free-text clinical notes and derive a CeVD phenotype that detects the disease automatically. As depicted in Figure 1, the general framework consists of 1) input document selection from patients’ clinical notes, 2) model training, and 3) performance evaluation using chart review as the reference standard.
Document selection and feature engineering
Many types of clinical notes can be generated during a patient’s hospitalization, such as nursing transfer reports, inpatient consultations, discharge summaries, and surgical assessments and histories. However, not all document types contribute equally to CeVD detection: noise and redundant information can hamper the performance of ML models [19]. The first step is therefore to identify and select the document type(s) most sensitive for CeVD identification.
We used a forward sequential selection method [20] that iteratively adds the document type contributing most to model performance until performance stops increasing or a predefined criterion is reached, as shown in Figure 2. All documents are first converted into vectors by 1) extracting relevant medical concepts from the text and 2) turning those concepts into numeric features. To examine extraction performance, we compared two commonly used concept extraction methods: Bag of Words (BOW) using ScispaCy [21] and Concept Unique Identifiers (CUIs) from the Unified Medical Language System using cTAKES (see Appendix 1 for a detailed explanation) [22]. We also compared two feature construction methods: Term Frequency-Inverse Document Frequency (TF-IDF) and word count. The resulting vectors are fed into the ML models and evaluated by model performance. To better estimate the generalization of the selected document types, 5-fold cross-validation was applied to the selected patients (i.e., 80% training, n = 2,429, and 20% test, n = 607). Model development is detailed in the following section.
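The forward sequential selection loop can be sketched generically. In this illustration, `score` stands in for the cross-validated model performance obtained when training on a given set of document types; the stopping rule and function names are assumptions for exposition, not the authors' exact implementation.

```python
def forward_select(doc_types, score, min_gain=0.0):
    """Greedy forward sequential selection over document types.

    `score(selected_set)` is assumed to return cross-validated model
    performance for a candidate set of document types. At each step the
    single type yielding the largest improvement is added; selection stops
    when no candidate improves the score by more than `min_gain`.
    """
    selected = []
    remaining = set(doc_types)
    best = float("-inf")
    while remaining:
        # Score every remaining document type added to the current selection.
        cand, cand_score = max(
            ((d, score(frozenset(selected + [d]))) for d in remaining),
            key=lambda pair: pair[1],
        )
        if cand_score <= best + min_gain:
            break  # no document type improves performance enough
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected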
Model development
The model outcome is a binary classification in which hospitalized patients with CeVD are the positive cases. Two supervised ML methods, random forest (RF) and XGBoost, were trained, validated, and tested using the input vectors and chart review labels [19, 20]. Both methods are known for handling high-dimensional datasets with missing data and outliers, and for providing accurate, reliable predictions, particularly in NLP tasks involving thousands of concept features [21, 22].
Combining the methods of concept extraction, vectorization, and ML modeling yields 8 model variations, such as “BOW + TF-IDF + RF” and “CUI + TF-IDF + XGBoost.” Because both RF and XGBoost use decision trees as base models, we set the number of trees to 100 for each. Model performance was then estimated by 5-fold cross-validation, maintaining the same proportion of positive and negative patients in each fold.
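The stratified split that preserves the positive/negative ratio in each fold can be illustrated with a minimal stdlib-only sketch (in practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used):

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds, preserving class proportions.

    A minimal sketch of stratified k-fold cross-validation: indices are
    grouped by class label, shuffled, and dealt round-robin into the folds,
    so each fold keeps (approximately) the overall positive/negative ratio.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)  # deal this class evenly across folds
    return folds
```

Each fold then serves once as the validation set while the remaining folds train the model, and the metrics are averaged across the k runs.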
Performance metrics
To evaluate and compare the models developed, we calculated their sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score using the chart data as the reference standard. We also calculated binomial proportion confidence intervals for all the metrics.
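These metrics follow directly from the confusion matrix counts. The sketch below uses the normal-approximation (Wald) binomial interval as one common choice; the text does not specify which interval construction was used.

```python
import math

def binom_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) binomial proportion confidence interval.

    One common construction for a 95% CI (z = 1.96); an assumption here,
    since the interval type is not specified in the text.
    """
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, NPV, and F1 from confusion-matrix counts."""
    sens = tp / (tp + fn)          # true positive rate
    spec = tn / (tn + fp)          # true negative rate
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * sens / (ppv + sens)  # harmonic mean of PPV and sensitivity
    return {"sensitivity": sens, "specificity": spec,
            "ppv": ppv, "npv": npv, "f1": f1}
```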
We compared the results with ICD-based CeVD identification algorithms applied to the DAD, defining CeVD using ICD-10 codes (e.g., G45-46, I60-69, H34; see Table S1 in the Appendix) [3]. Performance metrics of the developed NLP models were reported at the same specificity level as the ICD-based algorithm.
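One standard way to report model metrics at a matched specificity (the text does not detail the exact procedure used) is to tune the decision threshold on the model's predicted scores until the specificity on the negative class reaches the comparator's level:

```python
import math

def threshold_for_specificity(neg_scores, target_spec):
    """Choose a decision threshold achieving at least `target_spec` specificity.

    `neg_scores` are the model's predicted scores for the true-negative
    patients; a sample is classified positive when its score exceeds the
    threshold. A sketch of threshold matching, not the authors' exact method.
    """
    s = sorted(neg_scores)
    # At least ceil(target_spec * n) negatives must fall at or below the cutoff.
    k = max(1, math.ceil(target_spec * len(s)))
    return s[k - 1]
```

Sensitivity, PPV, and the other metrics can then be recomputed at that threshold, making the comparison with the ICD-based algorithm apples-to-apples.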