Study participants and the development cohort
We obtained the data from the Korean National Health Insurance Service (NHIS) database, which is linked to the National Health Insurance Service Ilsan Hospital (NHIMC) database. The NHIS provides compulsory health insurance for all citizens of South Korea and offers cost-free annual or biennial health screening examinations to all insured individuals. Because South Korea has a single-payer national health system, all medical records of covered inpatient and outpatient visits and the results of national health examinations are collected in the NHIS database, which includes diagnostic codes, procedures, prescriptions, medical costs, and personal information (e.g., age, sex, residential area, income level, and disability status). In this study, patients diagnosed with ischemic stroke were defined as those who were treated by a neurologist or identified through a review of the medical records of patients who visited Ilsan Hospital between 2015 and 2021. Suspected patients were defined as those who underwent at least one brain magnetic resonance imaging (MRI) or computed tomography (CT) scan, excluding those diagnosed with ischemic stroke. The control group was matched by sex and age to the patients with suspected and diagnosed ischemic stroke and was selected at a 1:1 ratio. All methods were performed in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines.
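As a rough illustration of 1:1 sex- and age-matching, the sketch below pairs each case with one unused control of the same sex and age. The record layout (dictionaries with `id`, `sex`, and `age`), the function name `match_one_to_one`, and the choice of exact matching without replacement are assumptions made for illustration; the matching procedure actually used is not described at this level of detail.

```python
from collections import defaultdict

def match_one_to_one(cases, candidates):
    """Pair each case with one unused candidate of the same sex and age."""
    # Index candidate controls by their (sex, age) stratum.
    pool = defaultdict(list)
    for person in candidates:
        pool[(person["sex"], person["age"])].append(person)

    pairs = []
    for case in cases:
        stratum = pool[(case["sex"], case["age"])]
        if stratum:  # cases without an available match are left unpaired
            pairs.append((case["id"], stratum.pop(0)["id"]))
    return pairs

# Hypothetical toy records.
cases = [
    {"id": "c1", "sex": "F", "age": 70},
    {"id": "c2", "sex": "M", "age": 65},
    {"id": "c3", "sex": "M", "age": 80},
]
candidates = [
    {"id": "k1", "sex": "M", "age": 65},
    {"id": "k2", "sex": "F", "age": 70},
    {"id": "k3", "sex": "F", "age": 70},
]
pairs = match_one_to_one(cases, candidates)
```

Here `c3` stays unmatched because no candidate shares its stratum; a real analysis would have to decide whether to drop such cases or to relax the age criterion.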
Model Development
In this study, we developed a prediction model for ischemic stroke. The model used 61 features: 1 rule-based operational definition, 5 personal information variables, 21 health examination variables, 4 medical record variables, and 30 word-embedding variables. The embedding variables were based on the assumption that codes frequently used in similar medical situations have a higher probability of appearing together, and were derived by applying a word-embedding technique to a total of 2,692 screened codes (633 diagnosis codes, 1,841 procedure records, 100 procedure material codes, and 118 prescription records). The vector value of each code is determined by its position relative to the other codes [13]. The 2,692 variables were reduced to 300 using term frequency and transformed into 30 embedding vectors (Multimedia Appendix 1). Events were identified using the following definition: patients diagnosed with ischemic stroke were those who were treated by a neurologist or identified through a review of the medical records from 2015 to 2021.
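The term-frequency reduction step can be sketched as follows: count how often each code occurs, keep the most frequent ones, and map each retained code to an integer index that an embedding layer could later consume. The function names, the toy records, the code `HA441`, and the convention of reserving index 0 for padding and unknown codes are hypothetical; learning the 30-dimensional embedding itself is not shown.

```python
from collections import Counter

def build_vocabulary(patient_codes, max_size):
    """Keep the max_size most frequent codes and assign each an integer index."""
    counts = Counter(code for codes in patient_codes for code in codes)
    # Index 0 is reserved for padding and for codes outside the vocabulary.
    # Frequency ties are broken alphabetically so the result is reproducible.
    ranked = sorted(counts, key=lambda code: (-counts[code], code))
    return {code: i + 1 for i, code in enumerate(ranked[:max_size])}

def encode(codes, vocabulary):
    """Map a patient's codes to indices; unknown codes become 0."""
    return [vocabulary.get(code, 0) for code in codes]

# Hypothetical toy records; the study reduced 2,692 codes to 300.
records = [["I63", "E11", "I10"], ["I10", "E11"], ["I10", "HA441"]]
vocab = build_vocabulary(records, max_size=2)
encoded = [encode(r, vocab) for r in records]
```

With `max_size=2`, only the two most frequent codes survive, and every other code collapses to index 0.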
We evaluated the performance of prediction models built with multiple logistic regression, which is a statistical model; random forest and extreme gradient boosting (XGBoost), which are tree-based ML techniques; and multi-layered perceptron, long short-term memory (LSTM), gated recurrent unit (GRU), and convolutional neural network (CNN), which are neural network-based deep learning techniques. In the neural network-based deep learning methods, an embedding model can be created using variables such as diagnosis, test/treatment, and medication codes. A concatenated model, which combines the embedding model with additional variables such as qualifications, number of medical visits, and medical costs, can also be used for prediction (Fig. 1). The medical usage variables comprised 2,692 codes: 633 disease codes, 1,841 procedure codes, 100 treatment material codes, and 118 medication prescription codes. These codes were arranged in 402 variable values based on frequency of use and then padded with additional values to obtain the same number of variables, resulting in 300 variables. The statistical model and the tree-based ML techniques used the 300 variables with one-hot encoding, whereas the neural network-based deep learning methods used an embedding method to convert the variables into 30-dimensional vectors. The model features were summarized in 1-year intervals, giving 20 repetitions from 2002 to 2021. For the hyperparameter settings, Adam was used as the optimizer, the number of epochs was 100, the batch size was 64, the loss function was binary cross-entropy, and early stopping was set using the Keras callback function (Multimedia Appendix 2). The XGBoost hyperparameters were tuned using grid search; for the LSTM and GRU models, the prediction models were also examined using the same ML model features.
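The two input representations mentioned above can be illustrated with a minimal sketch: a fixed-length padded sequence of code indices (the kind of input an embedding layer expects) and a binary indicator vector (the kind of input a logistic regression or tree-based model can use). The helper names, the toy sizes, and the use of index 0 as the padding value are assumptions for illustration, not details taken from the study.

```python
def pad_sequence(indices, length, pad_value=0):
    """Truncate or right-pad a list of code indices to a fixed length."""
    return (indices + [pad_value] * length)[:length]

def multi_hot(indices, vocab_size):
    """Binary vector with a 1 for every code index present (0 = padding, skipped)."""
    vector = [0] * vocab_size
    for i in indices:
        if i > 0:
            vector[i - 1] = 1
    return vector

patient = [3, 1, 3]                            # hypothetical code indices for one patient-year
padded = pad_sequence(patient, length=5)       # input for an embedding layer
indicator = multi_hot(patient, vocab_size=4)   # input for logistic regression or trees
```

The indicator vector discards order and repetition, whereas the padded sequence keeps both, which is one reason sequence models such as LSTM and GRU are paired with the embedding representation.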
Missing values were imputed using the last observation carried forward method, which replaces each missing value with the most recent previous observation. Outliers that fell above or below the mean ± interquartile range were replaced with the corresponding endpoint values. Standardization was performed by min-max scaling, that is, by subtracting the minimum value from the original value and dividing the result by the range (maximum value − minimum value). To correct for imbalance in outcomes, we sampled 10% of the data, regardless of the prevalence of outcomes.
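The three preprocessing steps can be sketched in a few lines of standard-library Python. The function names and toy values are hypothetical, and the quartiles are computed with the default method of `statistics.quantiles` (Python 3.8 or later), which is an assumption because the quartile definition used in the study is not stated.

```python
import statistics

def locf(values):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def clip_outliers(values):
    """Replace values outside mean ± IQR with the nearest bound."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    mean, iqr = statistics.mean(values), q3 - q1
    low, high = mean - iqr, mean + iqr
    return [min(max(v, low), high) for v in values]

def min_max_scale(values):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

filled = locf([120, None, None, 135, None])
clipped = clip_outliers([10, 11, 12, 13, 100])
scaled = min_max_scale([2, 4, 6])
```

In the clipping example the mean is 29.2 and the interquartile range is 46, so the value 100 is pulled down to the upper bound of 75.2 while the other values are unchanged.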
Performance Measurements
Model prediction performance was evaluated using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), the ROC curve, the precision-recall curve, F1 score, precision, recall, accuracy, specificity, and the calibration curve. The threshold value was defined as the point on the ROC curve where the sum of the estimated recall and specificity is maximized. The average of the product of recall and specificity was also used (Multimedia Appendix 3).
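The threshold rule, maximizing the sum of recall and specificity (equivalent to maximizing Youden's J statistic), can be sketched as a direct search over candidate cut-offs. The function name, the toy labels and scores, and the convention that a score at or above the threshold counts as positive are assumptions for illustration; the sketch also assumes both classes are present.

```python
def best_threshold(y_true, scores):
    """Return (threshold, recall, specificity) maximizing recall + specificity."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = None
    for t in sorted(set(scores)):
        # Predict positive when the score is at or above the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        recall, specificity = tp / positives, tn / negatives
        if best is None or recall + specificity > best[1] + best[2]:
            best = (t, recall, specificity)
    return best

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 0.9]
threshold, recall, specificity = best_threshold(y_true, scores)
```

For these toy values the search selects a threshold of 0.7, where recall is 0.75 and specificity is 1.0; ties are resolved in favor of the lowest threshold.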
Validation
We performed the validation task in two ways. The data for model training and internal validation were divided into 80% and 20%, respectively. Of the model training data, 80% was used for model development and 20% was used for model validation. All analyses were performed using Python (version 3.6.7) [14], and the model was built using the TensorFlow 1.14 [15] deep learning framework.
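An 80%/20% partition of this kind can be sketched as a seeded shuffle followed by a cut. The function name, the fixed seed, and the use of a simple random (non-stratified) split are assumptions; the text does not state how the partition was drawn.

```python
import random

def train_validation_split(records, validation_fraction=0.2, seed=0):
    """Shuffle records reproducibly and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Ten hypothetical record identifiers.
train, validation = train_validation_split(range(10))
```

Applying the same function again to `train` would give a nested development and validation partition, if that is how the second 80%/20% division is to be read.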
Data Availability
The datasets generated and/or analyzed in this study are not publicly available, in accordance with the National Health Insurance Service regulations for the protection of electronic medical data.