Use of machine learning techniques for phenotyping ischemic stroke instead of the rule-based methods: A nationwide population-based study

doi:10.21203/rs.3.rs-2684842/v1


Research Article


This work is licensed under a CC BY 4.0 License

Version 1


Background

Many studies have evaluated stroke using claims data; most of these studies have defined ischemic stroke by using an operational definition following the rule-based method. Rule-based methods tend to overestimate the number of patients with ischemic stroke.

Objective

We aimed to identify an appropriate algorithm for phenotyping stroke by applying machine learning (ML) techniques to analyze the claims data.

Methods

We obtained the data from the Korean National Health Insurance Service database, which is linked to the Ilsan Hospital database (n = 30,897). The performance of prediction models (extreme gradient boosting [XGBoost] or long short-term memory [LSTM]) was evaluated using the area under the receiver operating characteristic curve (AUROC), the area under precision-recall curve (AUPRC), and calibration curve.

Results

In total, 30,897 patients were enrolled in this study, 3,145 of whom (10.18%) had ischemic stroke. XGBoost, a tree-based ML technique, had an AUROC of 93.63% and an AUPRC of 64.05%. LSTM showed results similar to those of the rule-based method, with an F₁ score of 70.01%, and had the highest AUROC (97.10%) and AUPRC (85.70%) among all models.

Conclusions

We propose recurrent neural network-based deep learning techniques to improve stroke phenotyping from claims data. We anticipate that this approach will allow claims-based studies to produce results rapidly and accurately.

Keywords

phenotyping

ischemic stroke

machine learning

deep learning

insurance claim data

Introduction

Stroke is the second leading cause of death worldwide and often causes disability among survivors. The incidence and prevalence of stroke have been increasing over the past 30 years [1, 2] and vary depending on the population structure and the country’s economic level; overall, ischemic stroke accounts for 80% and hemorrhagic stroke for 20% of all cases [2]. In high-income countries, the age-specific stroke incidence has decreased dramatically owing to preventive treatment and lifestyle changes; however, the number of new stroke cases is expected to increase as the population ages [3]. Stroke also imposes a significant burden on healthcare systems because of the long-term costs associated with disability. Therefore, stroke is an important outcome or independent variable in medical research.

Stroke has a high incidence and prevalence; as such, research on stroke is actively conducted using large datasets. Administrative data obtained from large datasets can accurately reflect real-world practice, are population based, and can be used as a basis for long-term follow-up evaluations [4]. However, administrative data are not originally intended for research and have several limitations, one of which is the suboptimal accuracy of the assigned International Classification of Diseases (ICD)-10 diagnostic codes. The selection of diagnosis codes depends on the patient’s primary diagnosis and can be affected by the accuracy of the physician’s diagnosis [5]. Diagnosis coding can also be affected by the financial incentives provided to the corresponding hospitals [6]. To overcome these limitations, several studies have established operational definitions of stroke diagnosed by a neurologist and have validated them [7]. Relevant studies have also constructed algorithms for identifying ischemic stroke in claims data by linking patient information obtained from multicenter registries to the claims data and verifying the key identifiers [8].

Machine learning (ML) is an analytical method that uses computerized algorithms to identify relationships in large amounts of data and to make predictions. In stroke-related studies, ML techniques are used to identify and classify strokes, predict stroke outcomes, and identify stroke subtypes. ML helps researchers to analyze large amounts of data, identify patterns, and make predictions that can be useful for the diagnosis, treatment, and prognosis of patients with stroke [9]. ML techniques have been used to analyze data obtained from electronic health records (EHRs) to differentiate ischemic stroke from hemorrhagic stroke, and studies have examined how accurately these techniques detect the patients confirmed by experts [10]. A previous study used ML techniques to analyze EHR data in order to identify patients with acute ischemic stroke [11]. Although several previous studies have developed ML algorithms for identifying stroke using EHR data, it remains uncertain whether these algorithms can produce the same results in other countries with different healthcare systems. The performance of such algorithms may be affected by various factors such as the availability and quality of data, the specific healthcare system and infrastructure, and cultural and demographic differences.

In South Korea, many studies have used claims data to investigate the incidence of acute ischemic stroke. These studies used the ICD-10 diagnosis code I63 in hospitalized patients, together with the results of imaging tests or drug claims, to define the disease. The use of claims data allows a larger sample size and can provide insight into the incidence and treatment of acute ischemic stroke; however, claims data contain limited clinical information, which can affect the accuracy of the research [12]. Using claims data, such as ICD-10 diagnosis codes, imaging test codes, and medication codes, to identify patients with acute ischemic stroke yields a higher number of patients than other methods do. For example, a study published in 2013 that used these codes to identify patients with acute ischemic stroke reported that the number of patients identified was twofold higher than that in a national registry database constructed by the Korean Stroke Society. This is because claims data can include patients who were not diagnosed with stroke by a neurologist or did not receive treatment for stroke but still had the diagnosis codes in their medical records. It is therefore difficult to identify patients who actually have acute ischemic stroke based on claims data alone, and a new analytical tool using the latest ML technology is required. In this study, we aimed to apply ML techniques to claims data in order to develop an appropriate algorithm for identifying patients with acute ischemic stroke.

Methods

Study participants and the development cohort

We obtained the data from the Korean National Health Insurance Service (NHIS) database, which is linked to the National Health Insurance Service Ilsan Hospital (NHIMC) database. The NHIS provides compulsory health insurance for all citizens in South Korea and offers cost-free annual or biennial health screening examinations to all insured individuals. Because South Korea has a single-payer national health system, all medical records of covered inpatient and outpatient visits and the results of national health examinations are collected in the NHIS database, which includes diagnostic codes, procedures, prescriptions, medical costs, and personal information (e.g., age, sex, residential area, income level, and disability status). In this study, patients diagnosed with ischemic stroke were defined as those who were treated by a neurologist or identified through a review of the medical records of patients who visited Ilsan Hospital between 2015 and 2021. Suspected patients were defined as those who underwent at least one brain magnetic resonance imaging (MRI) or computed tomography (CT) scan, excluding those diagnosed with ischemic stroke. The control group was selected at a 1:1 ratio by matching on sex and age to the patients with suspected or diagnosed ischemic stroke. All methods were performed in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines.

Model Development

In this study, we developed a prediction model for ischemic stroke. The model used 61 features: 1 rule-based operational definition, 5 personal information variables, 21 health examination variables, 4 medical record variables, and 30 word-embedding variables. The embedding variables were built by applying a word-embedding technique to a total of 2,692 codes (633 diagnosis codes, 1,841 procedure records, 100 procedure material codes, and 118 prescription records), on the assumption that codes frequently used in similar medical situations have a higher probability of appearing together. The vector values of each code are determined based on their position relative to one another [13]. The 2,692 variables were reduced to 300 using term frequency and transformed into 30 embedding vectors (Multimedia Appendix 1). Events were identified using the following definition: patients diagnosed with ischemic stroke were those who were treated by a neurologist or identified through a review of the medical records from 2015 to 2021.
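The frequency-based reduction described above can be illustrated with a minimal sketch. It assumes that each patient record is a list of claim-code strings; the function names `build_vocab` and `encode`, the toy codes, and the choice to reserve index 0 for both padding and codes outside the vocabulary are illustrative assumptions, not the authors' implementation. In the study, the vocabulary would hold the 300 most frequent codes, and each index would subsequently be mapped to a 30-dimensional embedding vector.

```python
from collections import Counter

def build_vocab(records, max_size):
    """Map the max_size most frequent codes to indices 1..max_size (0 is reserved)."""
    counts = Counter(code for record in records for code in record)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {code: i + 1 for i, (code, _) in enumerate(ranked[:max_size])}

def encode(record, vocab, length):
    """Turn one record into a fixed-length index list, truncating or zero-padding."""
    # Codes outside the vocabulary share index 0 with padding (an assumption).
    ids = [vocab.get(code, 0) for code in record][:length]
    return ids + [0] * (length - len(ids))

# Toy records; the code strings are made up for illustration.
records = [["I63", "MRI", "ASPIRIN"], ["I63", "CT"], ["I10", "I63", "MRI"]]
vocab = build_vocab(records, max_size=3)
encoded = [encode(r, vocab, length=4) for r in records]
```

Keeping only the most frequent codes bounds the size of the embedding table, at the cost of collapsing rare codes into a single index.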

We evaluated the performance of prediction models built with multiple logistic regression (a statistical model); random forest and extreme gradient boosting (XGBoost), which are tree-based ML techniques; and multi-layer perceptron (MLP), long short-term memory (LSTM), gated recurrent unit (GRU), and convolutional neural network (CNN), which are neural network-based deep learning techniques. In the neural network-based deep learning methods, an embedding model can be created using variables such as diagnosis, test/treatment, and medication codes. A concatenated model, which combines the embedding model with additional variables such as qualifications, number of medical visits, and medical costs, can also be used for prediction (Fig. 1). For the medical usage variables, 2,692 codes were used, including 633 disease codes, 1,841 procedure codes, 100 treatment material codes, and 118 medication prescription codes. These codes were arranged in 402 variable values based on frequency of use and then padded with additional values so that every record had the same number of variables, resulting in 300 variables. The statistical model and the tree-based ML techniques used the 300 variables in one-hot encoding, while the neural network-based deep learning methods used an embedding method to convert the variables into 30-dimensional vectors. The model features were summarized in 1-year intervals, giving 20 repetitions from 2002 to 2021. For the neural network hyperparameters, Adam was used as the optimizer, the number of epochs was 100, the batch size was 64, the loss function was binary cross-entropy, and early stopping was set using the Keras callback function (Multimedia Appendix 2). The XGBoost hyperparameters were tuned by grid search; the LSTM and GRU models were examined using the same model features as the ML models.
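The yearly summarization can be pictured with the following sketch. It assumes that claims arrive as (year, code index) pairs and builds one count vector per calendar year from 2002 to 2021, which gives 20 time steps per patient. The function name, the input layout, and the use of counts rather than binary indicators are assumptions made for illustration only.

```python
def yearly_sequence(claims, n_codes, first_year=2002, last_year=2021):
    """Return one count vector of length n_codes per year (20 steps for 2002-2021)."""
    n_steps = last_year - first_year + 1
    seq = [[0] * n_codes for _ in range(n_steps)]
    for year, code_idx in claims:
        # Claims outside the observation window or code range are ignored.
        if first_year <= year <= last_year and 0 <= code_idx < n_codes:
            seq[year - first_year][code_idx] += 1
    return seq

# The last claim falls before 2002 and is therefore dropped.
claims = [(2002, 0), (2002, 0), (2015, 2), (2021, 1), (1999, 1)]
seq = yearly_sequence(claims, n_codes=3)
```

Years without any claim remain all-zero vectors, so every patient yields a sequence of the same length, which is what a recurrent model such as LSTM or GRU expects as input.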

Missing values were imputed using the last observation carried forward method, which replaces a missing value with the most recent previous observation. Outliers above or below the mean ± interquartile range were replaced with the corresponding endpoint values. Variables were rescaled by min-max normalization, that is, by subtracting the minimum value from the original value and dividing the result by the range (maximum value − minimum value). To correct for imbalance in outcomes, we sampled 10% of the data, regardless of the prevalence of outcomes.
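The three preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, missing values are represented as `None`, leading gaps are left unfilled because no earlier observation exists, and quartiles are taken from `statistics.quantiles` with its default method (Python 3.8 or later), which may differ from the quartile definition the authors used.

```python
import statistics

def locf(values):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def clip_outliers(values):
    """Clip to [mean - IQR, mean + IQR], following the rule stated in the text."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    mean, iqr = statistics.fmean(values), q3 - q1
    lo, hi = mean - iqr, mean + iqr
    return [min(max(v, lo), hi) for v in values]

def min_max(values):
    """Rescale to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

filled = locf([120, None, 130, None, None])  # e.g. yearly blood pressure readings
clipped = clip_outliers([1, 2, 3, 4, 100])   # the extreme value 100 is pulled in
scaled = min_max(clipped)
```

Clipping before rescaling keeps a single extreme value from compressing the rest of the variable into a narrow band near zero.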

Performance Measurements

Model performance was assessed using the AUROC, AUPRC, ROC curve, precision-recall curve, F₁ score, precision, recall, accuracy, specificity, and calibration curve. The classification threshold was defined as the point on the ROC curve at which the sum of the estimated recall and specificity was maximized. The average of the product of recall and specificity was also used (Multimedia Appendix 3).
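Choosing the threshold that maximizes recall plus specificity is equivalent to maximizing Youden's J statistic (recall + specificity − 1). A minimal sketch is given here, assuming binary labels coded 0/1, at least one case and one non-case, and a prediction counted as positive when its score is at or above the threshold. The function name and the tie-breaking rule (the lowest qualifying threshold is kept) are assumptions.

```python
def best_threshold(y_true, y_score):
    """Return (threshold, recall, specificity) maximizing recall + specificity."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best = None
    for t in sorted(set(y_score)):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < t)
        recall, spec = tp / pos, tn / neg
        # Strict comparison: on ties the earlier (lower) threshold is kept.
        if best is None or recall + spec > best[1] + best[2]:
            best = (t, recall, spec)
    return best

# Toy labels and predicted probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
threshold, recall, specificity = best_threshold(y_true, y_score)
```

In this toy example, two thresholds reach the same sum; keeping the lower one favors recall, which matches the screening-oriented use of a phenotyping algorithm, although the source does not state how ties were resolved.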

Validation

We performed the validation task in two ways. First, the data were divided into 80% for model training and 20% for interval validation. Second, 80% of the model training data was used for model development and the remaining 20% for model validation. All analyses were performed using Python (version 3.6.7) [14], and the models were built using the TensorFlow 1.14 [15] deep learning framework.
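The two-stage split can be checked with simple arithmetic. The sketch below assumes that the held-out part is rounded up at each stage; the source does not state how the split was implemented, but this convention reproduces the cohort sizes reported in Table 1. The function name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n, test_fraction=0.2):
    """Return (n_train, n_test), rounding the held-out part up."""
    n_test = math.ceil(n * test_fraction)
    return n - n_test, n_test

n_total = 30897
n_rest, n_interval = split_sizes(n_total)  # hold out 20% for interval validation
n_dev, n_val = split_sizes(n_rest)         # split the remaining 80% again
```

With 30,897 participants this gives 19,773 for development, 4,944 for validation, and 6,180 for interval validation.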

Data Availability

The datasets generated and/or analyzed in this study are not publicly available, in accordance with the National Health Insurance Service regulations for the protection of electronic medical data.

Results

Demographic characteristics

We generated an ischemic stroke cohort based on the National Health Insurance Service Ilsan Hospital data and established a patient group (3,624 people), a disease-suspected group (15,902 people), and a normal control group (19,279 people) matched by sex and age at a ratio of 1:1. The data were combined with the National Health Information Database, and the mapping rate was 80%. The final cohort comprised 30,897 participants. The development cohort included 1,996 patients with ischemic stroke (10.09%), and the rule-based method showed an accuracy of 91.36% in this cohort. The validation cohort included 513 patients (10.38%), with a rule-based accuracy of 91.71%. The interval validation cohort, used as the test data, included 636 patients (10.29%), with a rule-based accuracy of 91.63% (Table 1).

Table 1

Demographic characteristics of the study cohorts.
Characteristic	Development cohort, n	Development cohort, %	Validation cohort, n	Validation cohort, %	Interval validation cohort, n	Interval validation cohort, %
Patients (n, %)	19,773	100.00	4,944	100.00	6,180	100.00
Age (years, %)	67.42	20.19	67.58	19.94	67.36	20.02
Gender (male; n, %)	9,156	46.31	2,318	46.89	2,879	46.59
Income: medical aid (n, %)	465	2.35	104	2.10	155	2.51
Income level 1 (n, %)	2,054	10.39	543	10.98	638	10.32
Income level 2 (n, %)	2,349	11.88	590	11.93	729	11.80
Income level 3 (n, %)	3,748	18.96	942	19.05	1,200	19.42
Income level 4 (n, %)	4,398	22.24	1,077	21.78	1,389	22.48
Income level 5 (n, %)	6,635	33.56	1,668	33.74	2,042	33.04
Death (n, %)	3,731	18.87	919	18.59	1,142	18.48
Rule-based method* (n, %)	18,065	91.36	4,534	91.71	5,663	91.63
Ischemic stroke (n, %)	1,996	10.09	513	10.38	636	10.29
*The rule-based method is for individuals who are hospitalized with a diagnosis of I63 and have received anti-platelet therapy and anti-coagulant therapy within 30 days of diagnosis
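The rule-based definition in the footnote can be expressed as a small function. The sketch below is an illustration under several assumptions that the source does not specify: the record layout (date and code pairs), the drug-class labels, the interpretation of "within 30 days" as 0 to 30 days after the I63 admission, and the `require_both` flag, which follows the literal wording (both drug classes) by default but can be relaxed to either class.

```python
from datetime import date

ANTITHROMBOTIC_CLASSES = {"antiplatelet", "anticoagulant"}

def rule_based_stroke(admissions, prescriptions, window_days=30, require_both=True):
    """Flag a patient whose I63 admission is followed by antithrombotic drugs.

    admissions:    list of (admission_date, icd10_code)
    prescriptions: list of (prescription_date, drug_class)
    """
    for adm_date, code in admissions:
        if not code.startswith("I63"):
            continue
        # Drug classes prescribed within the window after this admission.
        classes = {
            drug for rx_date, drug in prescriptions
            if drug in ANTITHROMBOTIC_CLASSES
            and 0 <= (rx_date - adm_date).days <= window_days
        }
        if classes == ANTITHROMBOTIC_CLASSES or (classes and not require_both):
            return True
    return False

# Hypothetical patient: I63 admission followed by both drug classes.
admissions = [(date(2020, 3, 1), "I63.9")]
prescriptions = [(date(2020, 3, 2), "antiplatelet"),
                 (date(2020, 3, 20), "anticoagulant")]
flag = rule_based_stroke(admissions, prescriptions)
```

A rule of this kind has no notion of clinical confirmation, which is why it can flag patients who carry the diagnosis code without a neurologist-confirmed stroke.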

Model Performance

The rule-based method showed a high recall but relatively low precision in the test data (91.98% and 54.80%, respectively). The XGBoost model had a lower recall (63.56%) but a higher precision (61.46%) and the highest accuracy (92.15%). Its F₁ score was slightly lower (62.50%), but its AUROC was 93.63% and its AUPRC was 64.05%. The LSTM and CNN models that did not use word embeddings exhibited extremely low prediction accuracy. The 402 sparse variables were not appropriate for model fitting, whereas the 30 embedding vectors proved suitable. The LSTM model with embedding showed results similar to those of the rule-based method and had a slightly higher accuracy of 91.59%. Its F₁ score was 70.01%, and its AUROC (97.10%) and AUPRC (85.70%) were the highest of all models. The deep learning model showed the highest prediction performance, and the one-dimensional CNN model demonstrated a fast convergence rate (Table 2). The accuracy and loss values during training indicated no problems, such as overfitting, in the LSTM model. Early stopping and automatic saving of the training model were performed using the Keras callback function; the model converged at epoch 30. When the prediction accuracy of the XGBoost and LSTM models was compared using the ROC curves and precision-recall curves, the LSTM model was a slightly better fit than the XGBoost model. When the actual and predicted probabilities were compared using the calibration plots, the XGBoost model tended toward underestimation, whereas the LSTM model showed no systematic trend, with points randomly distributed (Fig. 2).

Table 2

AUROC, AUPRC, F₁ score, precision, recall, accuracy, and specificity of each model (all values in %).
Operational definitions	AUROC	AUPRC	F₁ score	Precision	Recall	Accuracy	Specificity
All normal (case:10%)	-	-	-	-	0.00	89.71	100.00
Rule-based method	91.47	73.39	68.68	54.80	91.98	91.47	91.41
Statistical models or tree-based machine learning techniques
Logistic	87.33	46.82	0.11	23.33	0.06	89.70	99.98
Random Forest	91.20	55.30	39.02	60.28	28.84	90.72	97.82
XGBoost	93.63	64.05	62.50	61.46	63.56	92.15	95.43
Neural network based deep learning techniques: without embedding
LSTM	55.30	12.00	23.16	15.19	48.76	64.34	66.27
CNN	48.60	9.50	18.80	10.63	81.13	27.88	21.77
Neural network based deep learning techniques: with embedding
MLP	88.80	52.30	50.35	36.75	79.87	83.79	84.24
LSTM	97.10	85.70	70.01	55.28	95.44	91.59	91.14
GRU	96.20	80.60	68.45	54.40	92.30	91.25	91.13
CNN	92.90	53.00	67.97	54.51	90.25	91.25	91.36

AUROC: The area under the receiver operating characteristic curve, AUPRC: The area under precision-recall curve, Rule-based method: individuals who are hospitalized with a diagnosis of I63 and have received anti-platelet therapy and anti-coagulant therapy within 30 days of diagnosis, LSTM: long short-term memory, CNN: Convolutional Neural Networks, MLP: Multi-Layer Perceptron, GRU: Gated Recurrent Unit.

PRC: the precision-recall curve; ROC: the receiver operating characteristic curve.
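The F₁ scores in Table 2 can be reproduced from the reported precision and recall, since F₁ is the harmonic mean of the two. The short check below uses values copied from Table 2; the assumption that the test data correspond to the 6,180-person interval validation cohort with 636 cases (Table 1) is used only for the "All normal" baseline.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall in percent, copied from Table 2.
rule_based_f1 = f1_score(54.80, 91.98)  # reported as 68.68
lstm_f1 = f1_score(55.28, 95.44)        # reported as 70.01

# Accuracy of labelling everyone as "normal" equals 100% minus the case prevalence.
baseline_accuracy = 100 * (1 - 636 / 6180)  # reported as 89.71
```

The baseline shows why accuracy alone is a weak criterion here: with roughly 10% cases, a model that never predicts stroke already reaches about 90% accuracy, so F₁, AUROC, and AUPRC carry most of the information.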

Feature importance

The importance of model features was examined using the gain method, in which importance is based on the performance benefit obtained when a split is made on a specific feature. Age was the most important feature, followed by the rule-based method, death, sex, weight, highest blood pressure, height, income, and fasting blood glucose level. Income and total medical costs had a significant impact, and medical utilization records, such as the number of days of care and hospitalization, were also important variables (Fig. 3).
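For readers unfamiliar with gain-based importance, the following sketch shows the split gain used in gradient-boosted trees of the XGBoost type and how per-feature importance is obtained by averaging the gains of all splits on that feature. It assumes that the "gain method" mentioned above refers to this average-gain importance; the source does not state the exact variant, and the function names, default regularization values, and toy numbers are illustrative.

```python
from collections import defaultdict

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, given gradient (g) and Hessian (h) sums per child."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

def gain_importance(splits):
    """Average gain per feature over (feature, gain) pairs, one pair per tree split."""
    totals, counts = defaultdict(float), defaultdict(int)
    for feature, gain in splits:
        totals[feature] += gain
        counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}

good = split_gain(-4.0, 2.0, 4.0, 2.0)  # children with opposite gradients: useful split
poor = split_gain(2.0, 2.0, 2.0, 2.0)   # identical children: split does not help
importance = gain_importance([("age", 6.0), ("age", 2.0), ("sex", 1.0)])
```

A split is informative when it separates observations whose gradients point in opposite directions, which is why a feature such as age, if it cleanly separates cases from non-cases, accumulates a high average gain.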

Discussion

The rule-based method used for phenotyping ischemic stroke from claims data tends to classify more people as patients, resulting in a higher rate of false positives. In South Korea, where healthcare insurance is provided to all citizens and healthcare accessibility is high, this tendency is even more pronounced. In the past, when medical insurance coverage was extended to the cost of brain MRI examinations, the stroke incidence phenotyped by diagnostic code increased by 150% [16]. In this study, the rule-based method also predicted the stroke diagnosis with high recall but relatively low precision. This tendency is a common phenomenon in hospital data, where individuals are classified as patients if they have any factor that raises suspicion for a disease [17]. In claims data, the likelihood of receiving insurance benefits is relatively high if an individual is diagnosed with a specific disease. We used ML to build a model with higher precision that can predict ischemic stroke with a recall similar to that of the rule-based method. Among the ML models, the XGBoost model exhibited the highest accuracy. This model had a lower recall (63.56% vs. 91.98%) but a higher precision (61.46% vs. 54.80%) than the rule-based method, resulting in an increase of approximately 1 percentage point in prediction accuracy (92.15% vs. 91.47%). The AUROC was also higher for the XGBoost model (93.63% vs. 91.47%), although, according to Table 2, its AUPRC was lower (64.05% vs. 73.39%). In this study, in addition to the ML models, we also evaluated more advanced models, such as RNNs and CNNs, which take repeated measurements and time into account. Among the deep learning models, LSTM showed results similar to those of the rule-based method but with the highest accuracy, F₁ score, AUROC, and AUPRC. This model can be presented as an alternative to the operational definition of ischemic stroke.

To provide the gold standard definition, we used existing registry databases and collected data from patients who were treated directly by a neurologist or identified based on a chart review of medical records. However, this approach is often not feasible in practice or incurs extremely high time and cost. A rule-based method, known as the silver standard, may therefore be necessary as an alternative to the gold standard. In addition, unsupervised learning methods that do not require a gold standard may be used, and more recent approaches, such as self-supervised learning, active learning, and semi-supervised learning, can be considered. For example, one could use a pretrained model to predict labels on a subset of randomly masked data, compare the predicted labels with the actual labels, and finally fine-tune the model using the predicted label data. Notable examples include Bidirectional Encoder Representations from Transformers (BERT) and generative pre-trained transformers (OpenAI), which can serve as alternative methods of performing the task. The model may thus be enhanced using a combination of advanced models, such as transformers and CNNs, rather than relying solely on traditional models. Because the variables were summarized in 1-year intervals in 20 repetitions, it was not possible to observe an improvement in the prediction performance of the latest models at arbitrary repetition points or when missing values occurred.

In previous studies that focused on claims data, the operational definition of the disease varied depending on the research purpose and expertise of researchers. According to the data surveyed in this study, when a disease was used as a feature, simple operational definitions using diagnostic codes alone were often employed. When used as a major label, the proportion of cases using operational definitions that incorporated additional imaging or medication codes in addition to the diagnostic codes was relatively high [17]. The proportion of studies using a definition that included imaging codes for ischemic stroke was 7.1%, which was higher compared with that of studies of other diseases, likely due to the fact that ischemic stroke can be easily diagnosed with imaging. In the previous studies, the diagnostic accuracy of the operational definition of ischemic stroke using ICD codes I60–I64 was approximately 43–64% [18–20]. Other previous studies that used the rule-based method of stroke included codes I64, I65, and I66 in addition to I63. This variability in phenotyping ischemic stroke is a major issue in studies that use claims data. The inclusion of additional codes, such as I65 or I66, in the analysis is less common, but the inclusion of the code I64, which is used when the cause of stroke is unclear, is more common [17]. Including the I64 code in the analysis may result in a slightly decreased positive predictive value for ischemic stroke, but it allows a larger number of patients with ischemic stroke to be included in the study [21]. A previous study that included I63 and I64 codes for the diagnosis of ischemic stroke and used additional criteria such as hospitalization and medication codes for validation found that despite the increased sensitivity achieved by including I64, the specificity remained low (< 50%) [22].

This study is the first to use claims data to construct an ML model and a deep learning model that can accurately identify ischemic stroke, thus surpassing the accuracy of previous studies. Using claims data, the study was able to obtain a large amount of patient information, enabling model validation on a wide variety of cases. The use of advanced ML and deep learning techniques also allows the model to identify complex patterns and relationships in the data, leading to improved diagnostic accuracy. In studies using claims data, the results obtained using traditional rule-based methods of ischemic stroke should be compared with those obtained using the model developed in this study. Such an analysis can then be used to make adjustments to the model, such as incorporating additional variables or using different ML techniques, to improve the diagnostic accuracy. As an increasing number of studies utilize prediction models, the reliability of research results related to ischemic stroke based on claims data may increase.

However, this study has some limitations. First, variables related to healthcare utilization may be affected by the policies and medical insurance fees applied at the time, as well as by advances in medical technology; therefore, this dependence on the era should be taken into consideration when developing a model. This is not a problem only for models that predict operational definitions; it must be sufficiently reflected in any such study. It is therefore necessary to review the generalizability of the models by splitting the data in chronological order or by regularly updating the prediction models. Second, the model was based on data obtained from a single institution in a particular country. The clinical characteristics of patients with stroke may vary depending on the medical institution, and the same model may be difficult to apply to data from other countries with different medical insurance systems. To complement this, future studies should be conducted using databases [23] of various South Korean institutions.

Conclusion

This study found that ML prediction models can improve on the rule-based operational definitions that previous studies using claims data have relied on. Such models can provide processed and refined disease variables, rather than primitive data such as diagnosis codes, calculation special codes, medication codes, test and procedure codes, examination items, and qualification information, when conducting studies based on claims data. Therefore, they can help derive study results quickly and accurately in future studies using big data.

Multimedia Appendix 1

The input features of the model.

[DOCX File , Appendix 1]

Multimedia Appendix 2

Logistic regression, Random Forest, XGBoost, and LSTM hyperparameters.

[DOCX File , Appendix 2]

Multimedia Appendix 3

Threshold values of each model.

[DOCX File , Appendix 3]

Acknowledgements

Funding

This work was supported by an NHIS (National Health Insurance Service) Ilsan Hospital grant (2022–20–010).

Author information

Authors and Affiliations

Department of Research and Analysis, National Health Insurance Service Ilsan Hospital, Goyang, Republic of Korea

Hyunsun Lim & JH Hong

Department of Family Medicine, National Health Insurance Service Ilsan Hospital, Goyang, Republic of Korea

Youngmin Park

Division of Health Administration, Yonsei University, Wonju, Republic of Korea

Ki-Bong Yoo

Department of Neurology, National Health Insurance Service Ilsan Hospital, Goyang, Republic of Korea

Department of Neurology, Graduate School of Medicine, Kangwon National University, Chuncheon, Republic of Korea

Kwon-Duk Seo

Contributions

HSL contributed to conceptualization, methodology, software, validation, formal analysis, data curation, investigation, resources, original draft preparation, visualization, supervision, project administration, and funding acquisition. JHH contributed to software, validation, formal analysis, and data curation. YMP contributed to conceptualization, investigation, resources, review, and editing. KBY contributed to software and formal analysis. KDS contributed to conceptualization and project administration. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Kwon-Duk Seo


Ethics declarations

Ethics approval and consent to participate

We used the NHIS-NSC data (NHIS-2022-1-757) from the NHIS. The authors declare that they have no conflict of interest with the NHIS. This study was approved by the Institutional Review Board of NHIS Ilsan Hospital (NHIMC-2022-01-001-001). Study participants' consent was not required to use data from the electronic health records. The informed consent for this study was waived by the Institutional Review Board of NHIS Ilsan Hospital.

Consent for publication

Not applicable.

References

1. GBD 2019 Stroke Collaborators. Global, regional, and national burden of stroke and its risk factors, 1990-2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet Neurol. 2021 Oct;20(10):795-820. https://doi.org/10.1016/S1474-4422(21)00252-0.
2. Donkor ES. Stroke in the 21st Century: A Snapshot of the Burden, Epidemiology, and Quality of Life. Stroke Res Treat. 2018;2018:3238165. https://doi.org/10.1155/2018/3238165.
3. Li L, Scott CA, Rothwell PM, Oxford Vascular Study. Trends in Stroke Incidence in High-Income Countries in the 21st Century: Population-Based Study and Systematic Review. Stroke. 2020 May;51(5):1372-80. https://doi.org/10.1161/STROKEAHA.119.028484.
4. Ung D, Kim J, Thrift AG, et al. Promising Use of Big Data to Increase the Efficiency and Comprehensiveness of Stroke Outcomes Research. Stroke. 2019 May;50(5):1302-9. https://doi.org/10.1161/STROKEAHA.118.020372.
5. Yu AY, Holodinsky JK, Zerna C, et al. Use and Utility of Administrative Health Data for Stroke Research and Surveillance. Stroke. 2016 Jul;47(7):1946-52. https://doi.org/10.1161/STROKEAHA.116.012390.
6. Iezzoni LI. Assessing quality using administrative data. Ann Intern Med. 1997 Oct 15;127(8 Pt 2):666-74. https://doi.org/10.7326/0003-4819-127-8_part_2-199710151-00048.
7. Park TH, Choi JC. Validation of Stroke and Thrombolytic Therapy in Korean National Health Insurance Claim Data. J Clin Neurol. 2016 Jan;12(1):42-8. https://doi.org/10.3988/jcn.2016.12.1.42.
8. Kim JY, Lee KJ, Kang J, et al. Development of stroke identification algorithm for claims data using the multicenter stroke registry database. PLoS One. 2020;15(2):e0228997. https://doi.org/10.1371/journal.pone.0228997.
9. Aguiar de Sousa D, Katan M. Promising Use of Automated Electronic Phenotyping: Turning Big Data Into Big Value in Stroke Research. Stroke. 2021 Jan;52(1):190-2. https://doi.org/10.1161/STROKEAHA.120.033061.
10. Ni Y, Alwell K, Moomaw CJ, et al. Towards phenotyping stroke: Leveraging data from a large-scale epidemiological study to detect stroke diagnosis. PLoS One. 2018;13(2):e0192586. https://doi.org/10.1371/journal.pone.0192586.
11. Thangaraj PM, Kummer BR, Lorberbaum T, Elkind MSV, Tatonetti NP. Comparative analysis, applications, and interpretation of electronic health record-based stroke phenotyping methods. BioData Min. 2020 Dec 7;13(1):21. https://doi.org/10.1186/s13040-020-00230-x.
12. Choi EK. Cardiovascular Research Using the Korean National Health Information Database. Korean Circ J. 2020 Sep;50(9):754-72. https://doi.org/10.4070/kcj.2020.0171.
13. Kim H, Chung Y. A Study on the Application of Natural Language Processing in Health Care Big Data Focusing on Word Embedding Methods. Health Policy and Management. 2020;30(1):15-25. https://doi.org/10.4332/KJHPA.2020.30.1.15.
14. Python Software Foundation. Python 3 Reference Manual. [2023-01-19]. Available from: https://docs.python.org/3/reference/
15. Abadi M, Agarwal A, Barham P, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. 2015 [01-19].
16. Kim JY, Kang K, Kang J, et al. Executive Summary of Stroke Statistics in Korea 2018: A Report from the Epidemiology Research Council of the Korean Stroke Society. J Stroke. 2019 Jan;21(1):42-59. https://doi.org/10.5853/jos.2018.03125.
17. Lim HS, Oh HC, Park SH, et al. Research on methods to improve the quality of research using the National Health Information DB. National Health Insurance Ilsan Hospital Research Institute. 2021.
18. Leibson CL, Naessens JM, Brown RD, Whisnant JP. Accuracy of hospital discharge abstracts for identifying stroke. Stroke. 1994 Dec;25(12):2348-55. https://doi.org/10.1161/01.str.25.12.2348.
19. Goldstein LB. Accuracy of ICD-9-CM coding for the identification of patients with acute ischemic stroke: effect of modifier codes. Stroke. 1998 Aug;29(8):1602-4. https://doi.org/10.1161/01.str.29.8.1602.
20. Tirschwell DL, Longstreth WT Jr. Validating administrative data in stroke research. Stroke. 2002 Oct;33(10):2465-70. https://doi.org/10.1161/01.str.0000032240.28636.bd.
21. McCormick N, Bhole V, Lacaille D, Avina-Zubieta JA. Validity of Diagnostic Codes for Acute Stroke in Administrative Databases: A Systematic Review. PLoS One. 2015;10(8):e0135834. https://doi.org/10.1371/journal.pone.0135834.
22. Park J, Kwon S, Choi E-K, et al. Validation of diagnostic codes of major clinical outcomes in a National Health Insurance database. International Journal of Arrhythmia. 2019;20(1). https://doi.org/10.1186/s42444-019-0005-0.
23. Jeong HY, Jung KH, Mo H, et al. Characteristics and management of stroke in Korea: 2014-2018 data from Korean Stroke Registry. Int J Stroke. 2020 Aug;15(6):619-26. https://doi.org/10.1177/1747493019884517.

No competing interests reported.
