Full Model
Fig. 1 displays validation set ROC and PR curves, as well as training set calibration curves, for all algorithms applied to the full patient cohort training set. The fraction of positive classes in the validation and testing datasets was 17.2%. The best-performing models during 5-fold stratified cross-validation were the extra trees and random forest classifiers, with an average AUC of 0.832 ± 0.015, followed by the XGBoost classifier (0.830 ± 0.018), the voting classifier (0.826 ± 0.017), and logistic regression (0.812 ± 0.015) (Fig. 1A). The random forest had the best AP (0.514 ± 0.040), followed by the XGBoost classifier (0.513 ± 0.047), the voting classifier (0.512 ± 0.039), the extra trees classifier (0.511 ± 0.034), and logistic regression (0.492 ± 0.040) (Fig. 1B). The training set calibration curves showed that the random forest, voting classifier, and extra trees were well calibrated, while logistic regression and XGBoost overpredicted COVID-19 infection risk in the bottom half of bins and had the highest Brier scores (0.168 ± 0.007 and 0.204 ± 0.005, respectively) (Fig. 1C). The random forest had the lowest Brier score (0.124 ± 0.005), followed by the voting classifier (0.142 ± 0.007) and the extra trees classifier (0.154 ± 0.005).
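The cross-validation protocol above can be sketched with scikit-learn. The synthetic cohort, the ~17% positive fraction, and the model settings below are illustrative assumptions, not the study's actual data or tuned hyperparameters:

```python
# Sketch: 5-fold stratified cross-validation reporting mean ± SD of AUC,
# average precision (AP), and Brier score. Synthetic data for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# class imbalance roughly mirroring the 17.2% positive fraction in the text
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.83, 0.17], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, cv=cv,
    scoring={"auc": "roc_auc",
             "ap": "average_precision",
             "brier": "neg_brier_score"},  # sklearn negates loss scorers
)
for name in ("auc", "ap", "brier"):
    vals = scores[f"test_{name}"]
    print(f"{name}: {abs(vals.mean()):.3f} ± {vals.std():.3f}")
```

Stratified folds preserve the positive-class fraction in each split, which matters at this level of imbalance.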
On the held-out test set, the random forest produced the highest AUC (0.849), compared with the voting classifier (0.848), extra trees classifier (0.842), XGBoost (0.841), and logistic regression (0.835) (Fig. 2A). The random forest also had the highest AP (0.510), while the extra trees, XGBoost, voting classifier, and logistic regression had APs of 0.509, 0.503, 0.497, and 0.487, respectively (Fig. 2B). Using operating thresholds carried over from the cross-validation step, we calculated the recall, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score of our models (Fig. 2C). By this measure, the extra trees classifier had the highest F1 score (0.565), followed by the voting classifier (0.562). Throughout the evaluation, the random forest produced the most stable results, placing first or second among the models we applied. It also had the lowest Brier score, indicating the best-calibrated probability predictions. We therefore selected the random forest as the full model.
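The threshold-dependent metrics reported in Fig. 2C follow directly from the confusion matrix. A minimal sketch, assuming a generic 0.5 operating threshold (the study's thresholds came from cross-validation and are not reproduced here):

```python
# Sketch: recall, specificity, PPV, NPV, and F1 at a fixed operating
# threshold. The threshold and toy data below are illustrative assumptions.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

def threshold_metrics(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "recall": tp / (tp + fn),       # sensitivity
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # precision
        "npv": tn / (tn + fn),
        "f1": f1_score(y_true, y_pred),
    }

# toy example
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.3, 0.7]
print(threshold_metrics(y_true, y_prob))
```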
After training the tuned random forest classifier on the entire training set, feature importance and model explainability were assessed through SHAP values. Figure 3 displays the SHAP summary plot for the top 20 most important features in the full model. Symptom-related features were regarded as the most important by the model, with asymptomatic status, chills, cough, and headache included in the top five. The model also identified that a person in self-isolation after COVID-19 exposure has a high risk of infection. Common COVID-19 symptoms such as fever, runny nose, sore throat, and muscle pain were also deemed important. Health workers who regularly wore N95 masks, medical gloves, hazmat suits, and face shields were at lower risk of COVID-19 diagnosis. Other behavioral features, such as the frequency and duration of outdoor activity, the frequency of offline meetings, and the density of people in the room most frequented by the health worker, were also associated with higher COVID-19 risk. The high ranking of behavioral features highlights the benefit of adding these factors to complement symptomatic ones.
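The SHAP analysis itself requires the `shap` package; as a lighter, related illustration of ranking a random forest's features globally, the sketch below uses scikit-learn's permutation importance instead. This is a stand-in for SHAP, not the study's method, and the feature names are hypothetical:

```python
# Sketch: global feature ranking for a random forest via permutation
# importance (a stand-in for SHAP). Data and feature names are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           random_state=0)
names = ["cough", "fever", "headache", "n95_use",
         "outdoor_freq", "room_density"]  # hypothetical labels
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0,
                                scoring="roc_auc")
ranking = sorted(zip(names, result.importances_mean),
                 key=lambda t: t[1], reverse=True)
for name, imp in ranking:
    print(f"{name}: {imp:.3f}")
```

Unlike permutation importance, SHAP additionally shows the direction of each feature's effect per sample, which is what allows statements like "regular N95 use lowers predicted risk."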
Jakarta Model
The Jakarta model was trained and cross-validated on health workers from Jakarta who submitted their results in 2021. The model was then tested on two datasets: respondents from Semarang and from Jakarta in 2022. The Semarang and Jakarta 2022 test sets comprised 12.7% and 18.76% of the entire dataset, respectively. The fraction of positive classes was 10.27% in the validation dataset, and 42.2% and 31.3% in the Semarang and Jakarta 2022 testing datasets, respectively. As shown in Fig. 4, the XGBoost classifier had the best predictive performance during cross-validation; its mean AUC (0.857 ± 0.017) outperformed the random forest (0.856 ± 0.016), extra trees (0.856 ± 0.019), voting classifier (0.856 ± 0.017), and logistic regression (0.843 ± 0.015) (Fig. 4A). The random forest (0.434 ± 0.039) and extra trees (0.434 ± 0.052) produced the best AP, followed by the voting classifier (0.429 ± 0.043), XGBoost (0.416 ± 0.045), and logistic regression (0.409 ± 0.041) (Fig. 4B). The calibration curves from training showed that the random forest was the most well-calibrated model, while the other models appeared poorly calibrated in the upper predicted-probability bins (Fig. 4C). The random forest also had the lowest Brier score (0.080 ± 0.0003).
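The calibration assessment behind Fig. 4C can be sketched with scikit-learn's reliability-curve utility. The data below are synthetic and perfectly calibrated by construction, purely to show the mechanics:

```python
# Sketch: reliability (calibration) curve plus Brier score. A well-calibrated
# model has per-bin observed positive fractions close to its mean predictions.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 500)
y_true = rng.binomial(1, y_prob)  # labels drawn from the predicted probs

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.round(frac_pos, 2))   # observed positive fraction per bin
print(np.round(mean_pred, 2))  # mean predicted probability per bin
print("Brier:", round(brier_score_loss(y_true, y_prob), 3))
```

For these perfectly calibrated probabilities the two printed rows track each other closely; overprediction, as described for logistic regression and XGBoost above, shows up as observed fractions falling below the mean predictions.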
On the Semarang test set, the random forest had the best AUC (0.745), followed by extra trees, XGBoost, the voting classifier, and logistic regression with 0.744, 0.743, 0.740, and 0.726, respectively (Fig. 5A). XGBoost had the best AP (0.705), followed by the random forest (0.694), voting classifier (0.694), extra trees classifier (0.689), and logistic regression (0.672) (Fig. 5B). By F1 score, the extra trees classifier scored highest (0.657), followed by logistic regression, random forest, XGBoost, and the voting classifier with 0.651, 0.649, 0.646, and 0.646, respectively (Fig. 5C).
On the Jakarta 2022 test set, the voting classifier produced the best AUC (0.762), with the random forest close behind (0.761); XGBoost, extra trees, and logistic regression achieved 0.760, 0.757, and 0.753, respectively (Fig. 6A). The highest APs were achieved by the voting classifier (0.548) and logistic regression (0.547), followed by the random forest (0.535), XGBoost (0.529), and extra trees (0.524) (Fig. 6B). The random forest's F1 score (0.582) outperformed the other models, with XGBoost (0.502) and the voting classifier (0.481) in second and third place (Fig. 6C). The F1 scores on this dataset were poorer than on the other test sets, highlighting time-based drift in the data, which rendered the chosen operating threshold suboptimal.40
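The effect of drift on a fixed operating threshold can be illustrated with a small sketch: on a (synthetic) shifted test set, the F1 at a threshold carried over from training falls short of the best F1 achievable on that set. The data and the 0.5 carried-over threshold are assumptions for illustration:

```python
# Sketch: a threshold tuned on training data is suboptimal after the score
# distribution shifts. Synthetic drifted test set; all values illustrative.
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

rng = np.random.default_rng(0)
# drifted test set: predicted scores sit lower than they did in training
y_true = rng.binomial(1, 0.3, size=1000)
y_prob = np.clip(rng.normal(0.25 + 0.20 * y_true, 0.15), 0, 1)

carried = 0.5  # threshold chosen on the (different) training distribution
carried_f1 = f1_score(y_true, (y_prob >= carried).astype(int))

# best F1 attainable on this set, scanning all candidate thresholds
prec, rec, _ = precision_recall_curve(y_true, y_prob)
best_f1 = np.max(2 * prec * rec / np.clip(prec + rec, 1e-12, None))
print(f"F1 at carried threshold: {carried_f1:.3f}, "
      f"best attainable: {best_f1:.3f}")
```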
Based on the AUC, AP, and Brier score, the random forest classifier was chosen for the Jakarta model because of its well-calibrated predictions and high training and testing performance. SHAP analysis was then performed on the random forest. Figure 7 displays the SHAP summary plot for the Jakarta model. The feature importance ranking for the Jakarta model closely resembles that of the full model.
Almost half of the top 20 features relate to symptoms of COVID-19, which shows that symptoms are key to predicting infection. Outdoor activities also contribute to the risk of COVID-19 infection, as shown by the high SHAP rankings of the average duration, frequency, and crowd density of outdoor activities. As in the full model, regular use of personal protective equipment (such as N95 masks, medical gloves, hazmat suits, and surgical hoods) was picked as a feature influencing infection risk. Additionally, the Jakarta model recognized washing hands after wearing masks as a predictive feature.