Dataset
In this study, we used the benchmark dataset of the Kaggle competition for stroke prediction (https://www.kaggle.com/asaumya/healthcare-dataset-stroke-data; accessed December 1, 2018). This dataset is a subset of the original stroke data collected from healthdata.gov and accounts for 1.18% of the original dataset.[15] It comprises 43,400 records, of which 783 correspond to patients with stroke and the remainder to participants without stroke, giving an overall stroke occurrence of 1.8%.
The dataset contains three continuous and seven categorical variables. The continuous variables are age, body mass index (BMI), and average glucose level; the categorical variables are heart disease, hypertension, sex, ever-married status, smoking status, work type, and residence type (Table 1). A detailed list of the variables is also given in a previous study.[15]
Table 1. Descriptive statistics of numerical variables and number of cases of categorical variables for the whole dataset
| Variable | Mean | SD | Minimum | Maximum |
|---|---|---|---|---|
| Continuous variable | | | | |
| Age (years) | 41.84 | 22.48 | 0.08 | 82.00 |
| Body mass index (kg/m2) | 28.61 | 7.77 | 10.10 | 97.60 |
| Average glucose level (mg/dL) | 103.63 | 42.23 | 55.00 | 291.05 |
| Categorical variable | | | | |
| Heart disease, No. (%) | Yes:No | 1,808 (4.4):40,123 (95.6) | | |
| Hypertension, No. (%) | Yes:No | 3,670 (8.8):38,261 (91.2) | | |
| Sex, No. (%) | Female:Male | 24,945 (59.5):16,986 (40.5) | | |
| Ever-married status, No. (%) | Yes:No | 26,781 (63.9):15,150 (36.1) | | |
| Smoking status, No. (%) | NS:FS:Smokes:Missing | 15,746 (37.5):7,093 (16.9):6,226 (14.9):12,866 (30.7) | | |
| Work type, No. (%) | Private:SE:GJ:NW:Children | 23,980 (57.2):6,474 (15.4):5,243 (12.5):176 (0.4):6,058 (14.5) | | |
| Residence type, No. (%) | Urban:Rural | 21,001 (50.1):20,930 (49.9) | | |
Abbreviations: SD, standard deviation; NS, never smoked; FS, formerly smoked; SE, self-employed; GJ, government job; NW, never worked.
Notably, some information was missing from the dataset. Specifically, smoking status was missing for 13,292 cases (approximately 30%) and BMI for 1,462 cases (approximately 3%). To construct the development dataset, we dropped the cases with missing BMI. We also dropped seven cases recorded as “other” for sex to remove ambiguous data.
Finally, we randomly divided the development dataset, which contained 41,931 cases including 643 occurrences of stroke, into three subsets: 60% (N = 25,158) formed the training dataset, 20% (N = 8,386) the validation dataset, and 20% (N = 8,387) the test dataset. We then preprocessed the three datasets, fit the model to the training dataset, validated it with the validation dataset, and evaluated the model performance on the test dataset (Table 1).
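The 60/20/20 random split can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function name `split_indices`, the seed, and the choice to give the remainder to the test set are assumptions made for the example.

```python
import random

def split_indices(n_cases, fractions=(0.6, 0.2), seed=0):
    """Shuffle case indices and cut them into train/validation/test parts.

    The test set receives whatever remains after the first two cuts.
    """
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(fractions[0] * n_cases)
    n_valid = int(fractions[1] * n_cases)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_indices(41931)
print(len(train), len(valid), len(test))  # 25158 8386 8387
```

With 41,931 cases, truncating the first two fractions and assigning the remainder to the test set reproduces the subset sizes reported above.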
Preprocessing
We transformed the 10 variables using the weight of evidence (WoE) method. For each variable, the WoE is computed as the logarithm of the ratio of the proportion of non-strokes to the proportion of strokes as follows:

$$\mathrm{WoE} = \ln\left(\frac{\text{proportion of non-strokes}}{\text{proportion of strokes}}\right)$$

WoE transformation converts categorical variables into numerical values that are linearly related to the log-odds used in logistic regression. Thus, WoE-transformed variables are well suited as input features for a logistic regression model. High positive WoE values indicate a low risk, whereas high negative WoE values indicate a high risk. WoE is widely used in risk management models, such as credit risk models.[16-18]
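As an illustration of the transformation, the sketch below computes a WoE value for each category of a single variable. It assumes that continuous variables have already been binned into categories (the binning scheme is not specified here), and the function name `woe_by_category` and the toy data are hypothetical.

```python
import math
from collections import Counter

def woe_by_category(categories, strokes):
    """Return {category: WoE}, WoE = ln(share of non-strokes / share of strokes).

    categories: category (or bin) label of each case
    strokes:    1 for stroke, 0 for non-stroke, aligned with categories
    """
    stroke_counts = Counter(c for c, s in zip(categories, strokes) if s == 1)
    non_counts = Counter(c for c, s in zip(categories, strokes) if s == 0)
    total_strokes = sum(stroke_counts.values())
    total_non = sum(non_counts.values())
    woe = {}
    for cat in set(categories):
        # WoE is undefined when a category has no strokes or no non-strokes
        if stroke_counts[cat] == 0 or non_counts[cat] == 0:
            raise ValueError(f"WoE undefined for {cat!r}: empty class")
        p_non = non_counts[cat] / total_non
        p_stroke = stroke_counts[cat] / total_strokes
        woe[cat] = math.log(p_non / p_stroke)
    return woe

# Toy data (hypothetical, not from the study)
cats = ["yes"] * 10 + ["no"] * 10
labels = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
print(woe_by_category(cats, labels))
```

In the toy data, the "yes" category holds most of the strokes and therefore receives a negative WoE, in line with the sign convention described above.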
Logistic Regression
We used a logistic regression model to assess stroke occurrence because the non-strokes/strokes odds ratio in logistic regression is easy to calculate and interpret, and because logistic regression has been widely used to build prediction models for various diseases.[19, 20] We first performed univariate logistic regression to evaluate the significance of each of the 10 transformed variables. We then performed multivariable logistic regression using the transformed variables with P < 0.05 in the univariate analysis as inputs, and applied backward selection to build the final model. The significance of each transformed variable in the logistic regression model was determined at a threshold of P < 0.05.
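The univariate screening step can be sketched as below. This is an illustrative implementation only: it fits a one-variable logistic regression by Newton-Raphson and reports a two-sided Wald P value for the slope, which is one common way to obtain such a P value. The statistical software and test actually used are not stated here, and the function name `fit_univariate_logit` is hypothetical.

```python
import math

def fit_univariate_logit(x, y, max_iter=50, tol=1e-10):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson.

    Returns (b0, b1, p_value); p_value is the two-sided Wald test for b1.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p          # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w              # information matrix X^T W X
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det   # Newton step = H^-1 g
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    se_b1 = math.sqrt(h00 / det)  # slope variance from the last information matrix
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return b0, b1, p_value

# Toy data (hypothetical): 10% events when x = 0, 50% events when x = 1
x = [0] * 100 + [1] * 100
y = [1] * 10 + [0] * 90 + [1] * 50 + [0] * 50
b0, b1, p = fit_univariate_logit(x, y)
print(b1, p < 0.05)
```

In the study, `x` would be one WoE-transformed variable; variables passing the P < 0.05 screen would then enter the multivariable model, which this sketch does not cover.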
Model Performance
The developed model was used to predict stroke occurrence in the validation and test datasets. To evaluate the model performance, we computed the area under the receiver operating characteristic curve (AUROC) and the Kolmogorov–Smirnov (KS) statistic for the validation and test datasets. The AUROC measures the discriminatory power of a stroke prediction model and can be interpreted as the probability that a stroke case receives a better score than a non-stroke case.[21] The KS statistic is the maximum difference between the cumulative score distributions of the two groups, strokes and non-strokes, where each score takes a value between 0 and 1.[21]
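Both metrics can be computed directly from the model scores of the two groups. The sketch below is a plain illustration that assumes a higher score means a higher predicted stroke risk; the function names and toy scores are hypothetical, and the pairwise AUROC loop is written for clarity rather than speed.

```python
def auroc(stroke_scores, non_stroke_scores):
    """Probability that a stroke case outscores a non-stroke case (ties count 0.5)."""
    wins = 0.0
    for s in stroke_scores:
        for n in non_stroke_scores:
            if s > n:
                wins += 1.0
            elif s == n:
                wins += 0.5
    return wins / (len(stroke_scores) * len(non_stroke_scores))

def ks_statistic(stroke_scores, non_stroke_scores):
    """Largest gap between the two empirical cumulative score distributions."""
    best = 0.0
    for t in sorted(set(stroke_scores) | set(non_stroke_scores)):
        cdf_stroke = sum(s <= t for s in stroke_scores) / len(stroke_scores)
        cdf_non = sum(n <= t for n in non_stroke_scores) / len(non_stroke_scores)
        best = max(best, abs(cdf_stroke - cdf_non))
    return best

# Toy scores (hypothetical, not from the study)
strokes = [0.9, 0.8, 0.4]
non_strokes = [0.7, 0.3, 0.2]
print(auroc(strokes, non_strokes), ks_statistic(strokes, non_strokes))
```

For the toy scores, 8 of the 9 stroke/non-stroke pairs are ordered correctly, so the AUROC is 8/9, and the largest gap between the cumulative distributions is 2/3.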
Scorecard Model
Borrowing the concept of the credit scoring model,[16, 22] we developed a scorecard model for strokes. In the clinic, a disease score is usually more useful for judging a patient’s health status than a disease probability ranging from 0 to 1. According to the scorecard model, the odds used in the logistic regression can be converted into a disease score as follows:
where A and B are constants that need to be determined through the settings of the specific disease scorecard model, and P0 indicates the user-defined baseline score. The point of double odds (POD), which represents the number of score points that doubles the odds, was used to determine these constants. The sum of POD and P0 corresponds to double the odds as follows:
By solving equations (1) and (2), we calculated the two constants as follows:
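For reference, the standard credit-scorecard relations that fit this description can be written as below. This is a sketch that assumes the usual convention in which the score is linear in the natural logarithm of the odds; the symbol $\mathrm{odds}$, denoting the odds at the baseline score, is notation assumed for the illustration.

```latex
\begin{align}
P_0 &= A + B \ln(\mathrm{odds}) \tag{1} \\
P_0 + \mathrm{POD} &= A + B \ln(2 \cdot \mathrm{odds}) \tag{2} \\
B &= \frac{\mathrm{POD}}{\ln 2}, \qquad A = P_0 - B \ln(\mathrm{odds}) \tag{3}
\end{align}
```

Subtracting (1) from (2) gives $\mathrm{POD} = B \ln 2$, which fixes $B$; substituting $B$ back into (1) gives $A$.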
A detailed description of the scoring method with POD has been provided previously.[23]
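A small numerical sketch of the scorecard calibration follows. It assumes the common formulation in which the score is linear in the log-odds, with P0 = A + B·ln(odds) at the baseline odds and P0 + POD at twice those odds, and it takes the odds as non-strokes to strokes. The function names and the settings (P0 = 600, POD = 20, baseline odds = 50) are hypothetical and are not taken from the study.

```python
import math

def scorecard_constants(p0, pod, base_odds):
    """Solve P0 = A + B*ln(odds) and P0 + POD = A + B*ln(2*odds) for A and B."""
    b = pod / math.log(2.0)
    a = p0 - b * math.log(base_odds)
    return a, b

def score_from_probability(p_non_stroke, a, b):
    """Convert a predicted non-stroke probability into a score via its odds."""
    odds = p_non_stroke / (1.0 - p_non_stroke)  # assumed: non-stroke to stroke odds
    return a + b * math.log(odds)

# Hypothetical settings, not taken from the study
A, B = scorecard_constants(p0=600, pod=20, base_odds=50)
print(score_from_probability(50 / 51, A, B))    # score at the baseline odds
print(score_from_probability(100 / 101, A, B))  # score at doubled odds
```

Under these settings, a case at the baseline odds scores P0, and doubling the odds raises the score by exactly POD points.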