Of the 2,011,813 visits included in the study, 537,422 were no-show visits and 1,474,391 were show visits. The overall proportion of no-shows across all outpatient clinics was therefore 26.71%. Cancelled appointments were excluded from these visits. Each record contains 20 variables, which are summarized in Table 1. As Table 1 shows, male patients were less likely than female patients to miss their appointments. New patients were the most likely to miss their appointments, and follow-up patients were the second most likely.
Table 1: Descriptive characteristics of the dataset (N = 2,011,813)

| Features | No-Show, N (%) | Show, N (%) | Total (N) |
|---|---|---|---|
| **Gender** | | | |
| Male | 213,729 (10.62%) | 564,024 (28.04%) | 777,753 |
| Female | 323,693 (16.09%) | 910,367 (45.25%) | 1,234,060 |
| **Age Group** | | | |
| 0-5 | 66,118 (3.29%) | 150,748 (7.49%) | 216,866 |
| 6-10 | 39,234 (1.95%) | 95,566 (4.75%) | 134,800 |
| 11-15 | 32,949 (1.64%) | 87,269 (4.34%) | 120,218 |
| 16-20 | 32,440 (1.61%) | 87,115 (4.33%) | 119,555 |
| 21-25 | 40,968 (2.04%) | 104,714 (5.20%) | 145,682 |
| 26-30 | 44,580 (2.22%) | 118,409 (5.89%) | 162,989 |
| 31-35 | 45,776 (2.28%) | 123,426 (6.14%) | 169,202 |
| 36-40 | 40,400 (2.01%) | 114,836 (5.71%) | 155,236 |
| 41-45 | 32,026 (1.59%) | 95,960 (4.77%) | 127,986 |
| 46-50 | 31,599 (1.57%) | 97,445 (4.84%) | 129,044 |
| 51-55 | 30,485 (1.52%) | 94,975 (4.72%) | 125,460 |
| 56-60 | 27,602 (1.37%) | 88,763 (4.41%) | 116,365 |
| 61-65 | 23,159 (1.15%) | 74,015 (3.68%) | 97,174 |
| 66-70 | 16,748 (0.83%) | 51,621 (2.57%) | 68,369 |
| 71-75 | 13,857 (0.69%) | 39,771 (1.98%) | 53,628 |
| 76-80 | 10,223 (0.51%) | 27,134 (1.35%) | 37,357 |
| 81-85 | 5,464 (0.27%) | 13,272 (0.66%) | 18,736 |
| > 85 | 3,794 (0.19%) | 9,352 (0.46%) | 13,146 |
| **Nationality** | | | |
| Saudi | 530,112 (26.35%) | 1,451,144 (72.13%) | 1,981,256 |
| Non-Saudi | 6,650 (0.33%) | 21,807 (1.08%) | 28,457 |
| Unknown | 660 (0.03%) | 1,440 (0.07%) | 2,100 |
| **Appointment type** | | | |
| New Patient (NP) | 243,158 (12.09%) | 890,110 (44.24%) | 1,133,268 |
| First visit (FV) | 271,466 (13.50%) | 517,688 (25.73%) | 789,154 |
| Follow up (FU) | 22,798 (1.13%) | 66,593 (3.31%) | 89,391 |
| **Reservation type** | | | |
| Scheduled | 516,300 (25.66%) | 1,278,602 (63.55%) | 1,794,902 |
| Walk-in | 21,122 (1.05%) | 195,789 (9.73%) | 216,911 |
| **Patient type** | | | |
| Patient Service | 530,923 (26.40%) | 1,460,373 (72.59%) | 1,991,296 |
| Business Center | 2,625 (0.13%) | 6,634 (0.33%) | 9,259 |
| VIP | 3,874 (0.19%) | 7,384 (0.37%) | 11,258 |
| **Distance (km)** | | | |
| distance ≤ 100 | 517,591 (25.73%) | 1,421,338 (70.65%) | 1,938,929 |
| 101 ≤ distance ≤ 399 | 10,251 (0.51%) | 28,962 (1.44%) | 39,213 |
| 400 ≤ distance ≤ 799 | 7,462 (0.37%) | 19,229 (0.96%) | 26,691 |
| distance ≥ 800 | 2,118 (0.11%) | 4,862 (0.24%) | 6,980 |
| **Outpatient Clinics** | | | |
| Health Care Specialty Clinic | 236,668 (11.76%) | 656,152 (32.61%) | 892,820 |
| National Guard Comprehensive Specialized Clinic | 102,428 (5.09%) | 264,979 (13.17%) | 367,407 |
| King Abdulaziz City Housing | 106,727 (5.31%) | 304,584 (15.14%) | 411,311 |
| King Saud City Housing | 81,487 (4.05%) | 215,946 (10.73%) | 297,433 |
| Prince Bader Housing City Clinic | 10,112 (0.50%) | 32,730 (1.63%) | 42,842 |
The feature importance process identified the top four predictors as the number of no-show appointments, medical department, lead time, and the number of show appointments. The next four most important predictors were appointment type, patient type, outpatient clinic, and appointment month. Appointment year, distance, gender, reservation type, and nationality were not important predictors and were therefore removed from the models. The remaining factors, such as the number of scheduled appointments, number of walk-in appointments, appointment time, and age, had less influence on no-shows. Factors related to the patients had more impact on patient no-shows than factors related to the appointments. Factors were ranked in the predictive model according to their calculated information gain (Info Gain). The factors, ranked by importance, are listed in Fig. 2; the prediction models were developed using only 14 factors.
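The Info Gain ranking above measures how much knowing a feature's value reduces uncertainty about the show/no-show label. As a minimal sketch (not the study's Spark implementation), the calculation for one categorical feature can be written with the standard library; the toy feature and label values below are hypothetical:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Information gain of a categorical feature with respect to the label:
    H(label) minus the weighted entropy of the label within each feature value."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature_values, labels):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy example: a feature that perfectly separates the classes has
# gain equal to the label entropy (here 1.0 bit).
print(round(info_gain(["NP", "NP", "FU", "FU"], [1, 1, 0, 0]), 3))  # → 1.0
```

Features whose gain is near zero (such as gender or nationality in this study) carry almost no information about the label and can be dropped without hurting the model.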
We evaluated the different models using different validation methods and various evaluation metrics. In general, performance across all models was close on every evaluation metric except time. Tables 2 to 4 describe the experimental results showing the performance of Spark using five machine learning algorithms on the same large dataset. We evaluated the effectiveness of all classifiers in terms of the time to train and evaluate the models, accuracy, precision, recall, F-measure, and ROC. MLP and RF classified visits well; the results show that all metrics are comparable for both classifiers, with MLP showing a further improvement in F-measure over RF. Logistic Regression (LR) and SVM have similar ROC performance, but LR is preferred over SVM because it produces better performance on all metrics with less computational power. SVM likely performs poorly due to the kernel limitation in MLlib: only a linear kernel is available for the SVM algorithm. GB performed best, increasing accuracy and ROC to 79% and 81%, respectively.
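The accuracy, precision, recall, and F-measure values reported in Tables 2 to 4 all derive from the confusion-matrix counts. A minimal stdlib sketch of these definitions (the study itself used Spark's evaluators; the toy labels below are hypothetical, with 1 = no-show):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-measure for a binary classifier,
    computed from confusion-matrix counts (positive class = 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# → 0.8 1.0 0.667 0.8
```

Note that accuracy alone can be misleading on an imbalanced dataset like this one (26.71% no-shows), which is why the tables also report precision, recall, F-measure, and ROC area.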
Table 2: Evaluation metrics for different models predicting outpatient no-shows using the 70/30 holdout method

| Model | Accuracy | Precision | Recall | F-measure | ROC Area |
|---|---|---|---|---|---|
| Random Forest | 0.76 | 0.76 | 0.76 | 0.68 | 0.77 |
| Gradient Boosting | 0.79 | 0.77 | 0.79 | 0.76 | 0.81 |
| Logistic Regression | 0.75 | 0.73 | 0.75 | 0.70 | 0.73 |
| SVM | 0.73 | 0.70 | 0.73 | 0.62 | 0.73 |
| Multilayer Perceptron | 0.77 | 0.75 | 0.77 | 0.72 | 0.78 |
Table 3: Evaluation metrics for different models predicting outpatient no-shows using the 80/20 holdout method

| Model | Accuracy | Precision | Recall | F-measure | ROC Area |
|---|---|---|---|---|---|
| Random Forest | 0.75 | 0.76 | 0.75 | 0.68 | 0.77 |
| Gradient Boosting | 0.79 | 0.77 | 0.79 | 0.77 | 0.81 |
| Logistic Regression | 0.75 | 0.72 | 0.75 | 0.70 | 0.73 |
| SVM | 0.73 | 0.54 | 0.73 | 0.62 | 0.72 |
| Multilayer Perceptron | 0.77 | 0.75 | 0.77 | 0.74 | 0.78 |
Table 4: Evaluation metrics for different models predicting outpatient no-shows using 10-fold cross validation

| Model | Accuracy | Precision | Recall | F-measure | ROC Area |
|---|---|---|---|---|---|
| Random Forest | 0.76 | 0.76 | 0.76 | 0.68 | 0.77 |
| Gradient Boosting | 0.79 | 0.77 | 0.79 | 0.77 | 0.81 |
| Logistic Regression | 0.75 | 0.73 | 0.75 | 0.70 | 0.73 |
| SVM | 0.73 | 0.70 | 0.73 | 0.62 | 0.73 |
| Multilayer Perceptron | 0.77 | 0.75 | 0.77 | 0.72 | 0.78 |
To better understand efficiency, Fig. 2 presents the ROC curves of the five models to illustrate the discriminative ability of each classifier. The five models achieved nearly identical ROC curves across the different validation methods. From the plot, we can see that Gradient Boosting is the best model (area = 0.81). SVM with a linear kernel and Logistic Regression returned comparable classification results. Currently, MLlib supports linear SVMs only; with non-linear kernels, SVM might outperform Logistic Regression.
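The ROC area values in Tables 2 to 4 equal the probability that a classifier scores a randomly chosen no-show visit above a randomly chosen show visit. A minimal stdlib sketch using this rank-based (Mann-Whitney) formulation, with hypothetical predicted no-show probabilities:

```python
def roc_auc(labels, scores):
    """ROC area via the Mann-Whitney formulation: the fraction of
    (positive, negative) pairs where the positive is scored higher
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical no-show probabilities for six visits (1 = no-show)
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(roc_auc(labels, scores), 3))  # → 0.889
```

An AUC of 0.5 corresponds to random guessing, so the study's best value of 0.81 for Gradient Boosting indicates a substantially better-than-chance ranking of at-risk appointments.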
As evaluation criteria, we employed the overall training and test times (in seconds) for all five algorithms, as shown in Tables 5 and 6. Since performance is close on all metrics, time is the key factor in selecting the best validation method. Unlike the other metrics, the times differ between algorithms, with a very large difference in training time. GB achieved its best trade-off using the 70/30 holdout method, which significantly outperformed the other validation methods on training time. For the 70/30 holdout method, we observe that GB is around 15× slower than MLP, although it achieved the best results. SVM, the algorithm closest in performance to LR, takes about 68× as long as LR to train. Logistic Regression is about 4× faster than the next two most accurate algorithms, MLP and RF, with comparable performance. For huge datasets, time favors selecting one of the quicker algorithms, bearing in mind that the time values of the models depend on the choice of algorithm parameters. We showed that exploring and evaluating the performance of machine learning models using various validation methods is critical, as prediction time and accuracy can differ significantly.
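The large gap between holdout and 10-fold cross-validation training times in Table 5 follows from the mechanics of cross-validation: the model is refit once per fold, each time on roughly 90% of the data, so total training cost is at least k times a single holdout fit. A minimal stdlib sketch of the fold bookkeeping (not the Spark CrossValidator used in the study):

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.
    Each of the k iterations refits the model on (k-1)/k of the data,
    which is why 10-fold CV training time is roughly 10x (or more)
    that of a single holdout split."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx, start = list(range(n)), 0
    for size in sizes:
        yield idx[:start] + idx[start + size:], idx[start:start + size]
        start += size

folds = list(kfold_indices(100, k=10))
print(len(folds))                          # → 10
print(len(folds[0][0]), len(folds[0][1]))  # → 90 10
```

For an expensive learner like Gradient Boosting, this multiplicative cost is exactly what Table 5 shows: its 10-fold training time dwarfs its holdout time while the accuracy gain (Tables 2 vs. 4) is negligible.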
Table 5: Training time for each machine learning model (seconds)

| Model | Holdout 70/30 | Holdout 80/20 | 10-fold cross validation |
|---|---|---|---|
| Random Forest | 41.289 | 45.876 | 517.830 |
| Gradient Boosting | 668.882 | 1148.144 | 21170.190 |
| Logistic Regression | 10.033 | 8.805 | 8.805 |
| SVM | 685.782 | 671.625 | 671.625 |
| MLP | 42.444 | 45.627 | 45.627 |
Table 6: Test time for each machine learning model (seconds)

| Model | Holdout 70/30 | Holdout 80/20 | 10-fold cross validation |
|---|---|---|---|
| Random Forest | 31.118 | 25.787 | 57.394 |
| Gradient Boosting | 27.287 | 22.461 | 57.001 |
| Logistic Regression | 24.962 | 20.134 | 43.192 |
| SVM | 23.081 | 19.201 | 45.116 |
| MLP | 23.458 | 19.600 | 44.667 |