Of the 82,073 individuals in our dataset, we identified 3,305 patients who had COVID-19 and had previously been captured by contact tracing. Among these patients, 2,022 were admitted to an inpatient setting at least once and 217 had at least one COVID-19 related hospitalization. Among all patients, female patients accounted for 60% and male patients for 40%, with a similar distribution when stratified by COVID-19 infection. White patients were the largest racial group in the cohort (54.6%), followed by Black patients (37.1%), Hispanic/Latino patients (4.9%), and other races. In the COVID-19 positive group, Black patients accounted for 52.4%, followed by White patients (30.0%), Hispanic/Latino patients (11.9%), and other races. Table 1 presents the sociodemographic characteristics of the data set.
Two sets of ML models were fitted, each trained with three algorithms (RF, XGB, and SVM). The first set (the general hospitalization set) used whether the patient had been admitted to the hospital as the outcome (n = 2,022 positive cases), while the second set (the COVID-19 related hospitalization set) used whether the patient had a COVID-19 related hospitalization as the outcome (n = 217 positive cases). These 2,022 and 217 individuals, respectively, were matched with controls at a 1:1 ratio to balance the data in each set. With 10-fold cross-validation, RF outperformed the other models (F1 = 0.872 for the general hospitalization set and 0.763 for the COVID-19 related hospitalization set), followed by XGB (F1 = 0.845 and 0.682, respectively) and SVM (F1 = 0.845 and 0.644, respectively). See Table 1 for numeric results. As displayed in Fig. 1, the learning curves indicated that underfitting was minimized in the finalized models.
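The balancing and evaluation steps described above can be sketched as follows. This is a minimal standard-library illustration, not the study's code. It assumes the 1:1 matching can be approximated by randomly drawing one control per positive case (the matching criteria are not specified here), and the function names `balance_one_to_one`, `stratified_folds`, and `f1_score` are hypothetical.

```python
import random


def balance_one_to_one(labels, seed=0):
    """Keep every positive case and an equal number of randomly drawn controls.

    Assumption: random undersampling stands in for the matching procedure.
    Returns the sorted indices of the retained records.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    if len(controls) < len(positives):
        raise ValueError("not enough controls for 1:1 sampling")
    return sorted(positives + rng.sample(controls, len(positives)))


def stratified_folds(labels, k=10, seed=0):
    """Split record indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy labels: 217 positives among a larger, hypothetical pool of controls.
labels = [1] * 217 + [0] * 1000
kept = balance_one_to_one(labels)
balanced = [labels[i] for i in kept]
folds = stratified_folds(balanced, k=10)
```

In a full pipeline, each fold would serve once as the held-out set while a classifier such as RF is trained on the remaining nine, and the ten F1 values would then be averaged.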
Among patients admitted to inpatient settings during the COVID-19 pandemic (general hospitalization set), the most significant predictors of hospitalization came from the contact tracing data and included breathing difficulty, fever, pneumonia, cardiovascular disease, cough, chronic obstructive pulmonary disease (COPD), renal disease, and other respiratory conditions (Fig. 2). Notably, contact tracing features such as fatigue, congestion, headache, and loss of taste were associated with a lower likelihood of hospitalization. Additionally, demographic factors such as being aged 70 or older and Black race were positively associated with hospitalization, whereas younger age (18–29) and White race showed a negative association. Features from medical claims data, including Medicare coverage, primary hypertension, and type 2 diabetes mellitus without complications, were associated with a higher risk of hospitalization, while commercial insurance coverage indicated a lower risk. Among all features, the three most predictive were breathing difficulty, pneumonia, and fever.
For patients who experienced COVID-19 related hospitalization (COVID-19 related hospitalization set), features from medical claims data played a more prominent role than in the general hospitalization set (Fig. 2). The top positive contributors from medical claims data to COVID-19 related hospitalization were fever (unspecified), cough, shortness of breath, type 2 diabetes mellitus without complications, other viral pneumonia, other long-term (current) drug therapy, pneumonia (unspecified organism), primary hypertension, long-term (current) use of oral hypoglycemic drugs, and morbid (severe) obesity due to excess calories. From the contact tracing data, the positive features that contributed significantly to hospitalization were fever, pneumonia, breathing difficulty, cough, Black race, cardiovascular disease, and vomiting. Conversely, the three highest-ranked negative features, all derived from the contact tracing data, were White race, fatigue, and congestion, suggesting that individuals with these features were less likely to be hospitalized.
The step one clinical evaluation reached a CVI of 0.7 for the general hospitalization set and 0.9 for the COVID-19 related hospitalization set, indicating fair to good agreement. We set a significance level of 0.05 for the tests. The Spearman’s rank correlation coefficients between the mean rating and the scaled SHAP values for the top 20 features were 0.57 (p = 0.009) and 0.52 (p = 0.019), respectively, suggesting a significant positive correlation. The Mann-Whitney U test statistics were 199.0 (p = 0.998) and 191.5 (p = 0.828), respectively, indicating no significant difference between the two ranked lists.
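To make the two statistics above concrete, the following sketch computes a Spearman rank correlation and a Mann-Whitney U statistic from two lists of scores. It is an illustrative standard-library implementation under stated assumptions, not the analysis code used in the study: the function names are hypothetical, ties receive average ranks, and p-values are omitted.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return covariance / math.sqrt(var_x * var_y)


def mann_whitney_u(x, y):
    """U statistic for x: pairs with x_i > y_j count 1, ties count 0.5."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )
```

If each list holds 20 values, U ranges from 0 to 400 and equals 200 when neither list tends to exceed the other, so statistics of 199.0 and 191.5 sit close to that midpoint, consistent with the non-significant p-values reported.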
For the step two clinical evaluation, the mean values for FN, FP, TN, and TP were 1.7, 4.1, 2.1, and 3.8, respectively, for the general hospitalization set, in alignment with the ML results. For the COVID-19 related hospitalization set, the mean values for FN, FP, TN, and TP were 2.5, 2.6, 2.6, and 3.2, respectively; these also aligned with the ML results, but more weakly than in the general hospitalization set.
Based on the best-performing RF models, we identified the most robust predictors of adverse progression to hospitalization among individuals with COVID-19 exposure. These include underlying conditions such as primary hypertension, cardiovascular disease, type 2 diabetes mellitus, and pneumonia, as well as nuanced patterns of predictors from both contact tracing and health care history. Two researchers (CL and TL) reviewed these predictors and organized them into several healthcare systems in which public health practitioners or clinicians can respond as soon as these predictors are detected in individuals. Figure 3 shows a diagram of these findings, which was inspired by the Advanced Care Planning framework [16]. Generally, we found that some of the data captured by contact tracing are highly predictive of COVID-19 related hospitalization, such as self-reported fever, pneumonia, breathing difficulty, cough, cardiovascular disease, and vomiting. Although these data were self-reported in this study, there are ample real-world opportunities to capture them. For example, cardiovascular disease could be found in contact tracing data, historical claims data, and EHR data (Fig. 3). In addition, individual-reported fatigue and congestion are strongly associated with disease progression that does not require hospitalization. SC DHEC can use such information to create a surveillance system for prioritizing actionable responses, including individual outreach and referral. Some other features, however, such as cough, breathing difficulty, and cardiovascular conditions, were challenging for the models to predict from accurately and deserve closer investigation that considers individuals’ health history and healthcare utilization after contact tracing.
Conditions such as viral pneumonia, pneumonia with unspecified organism, fever, cough, and shortness of breath, when diagnosed at clinical sites rather than self-reported during contact tracing, are suggestive of COVID-19 related hospitalization. SC DHEC and domestic clinical sites can use these nuances to refine risk factors for better surveillance.