Using Machine Learning to Unveil Demographic and Clinical Features of COVID-19 Symptomatic and Asymptomatic Patients

Background: Demographic and clinical features of COVID-19 patients are critical components in shaping their symptomatic status. However, the relationship between patients' symptomatic status and their features is typically complicated and nonlinear. Methods: We explored important features that drive the symptomatic status of COVID-19 patients and revealed their interactions with other relevant factors. We used an extensive multi-algorithm machine learning (ML) pipeline and 68 demographic and clinical features to fit a predictive model to 3,995 patients in the State of Kuwait between February and June 2020. Our ML pipeline comprised five algorithms: logistic regression (LR), random forest (RF), support vector machine (SVM), gradient boosting (GBM), and extreme gradient boosting (XGB). Results: SVM outperformed all other algorithms (AUC = 0.77 and accuracy = 70.01%), while logistic regression had the lowest predictive power (AUC = 0.65 and accuracy = 66.14%). Our ML model identified C-reactive protein, respiratory rate, transmission dynamics, and other demographics as the most important predictors for symptomatic COVID-19 patients. In contrast, only demographic features were important predictors for asymptomatic patients. Our ML model further revealed that the non-linear relationships between impaired renal function, other clinical biomarkers, and demographic features were critical in shaping the risk of being a symptomatic patient. Conclusions: Our ML model showed better predictive performance than traditional statistical methods in identifying important clinical and demographic features of symptomatic vs. asymptomatic patients. Further application of our ML pipeline in the COVID-19 case definition and in guiding pharmaceutical and non-pharmaceutical interventions may help reduce the public health and economic implications of this devastating virus on local and global scales.


Highlights
-C-reactive protein and community type were the top two important predictors for symptomatic patients.
-Only demographic features were important predictors for asymptomatic patients.
-There are strong non-linear relationships between impaired renal function and other features in symptomatic patients.
-Machine learning outperformed classical logistic regression in predicting symptom status.
-Machine learning can help improve case definition and public health surveillance.

Background
The rapid emergence and spread of the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) led to the catastrophic pandemic of coronavirus disease 2019 (COVID-19). The complexity of COVID-19 infections and their associated clinical course are shaped by the substantial heterogeneity in the characteristics, dynamics, and comorbidities of the affected populations [1]. Therefore, untangling the demographic and clinical features of COVID-19 infected individuals is critical for improving currently implemented intervention strategies. Machine learning (ML) algorithms have previously been used to further our understanding of infectious diseases and have recently become more interpretable, due to newly introduced tools [2]. These tools have enabled us to gain deeper mechanistic insights into the drivers of different clinical outcomes [3].
The spectrum of COVID-19 clinical presentations includes asymptomatic, presymptomatic, and symptomatic cases [4]. Recent studies have estimated that asymptomatic COVID-19 carriers account for 40-45% of COVID-19 cases [5]. These individuals have been hypothesized to be unknowing 'super spreaders' of the virus, as they do not exhibit any overt physical signs of being infected [6]. Although several authors have explored the clinical characteristics of asymptomatic COVID-19 carriers, these studies are limited by small patient numbers [7]. As this population does not exhibit outward physical signs of the disease, it is difficult to study. Consequently, this important group of patients has not been studied at a large scale. In addition, asymptomatic patients may become symptomatic after a few days [8]. Of those patients who become symptomatic, the majority have a milder clinical presentation of COVID-19 [9], while approximately 18-33% may require hospitalization and admission into the intensive care unit [10,11]. The reasons for these widely differing disease manifestations are currently not well understood.
As most current studies comparing the clinical characteristics of symptomatic and asymptomatic patients have relied on small samples and traditional statistical methods, their findings may not be representative of a larger, more general, hospitalized population [12,13]. Further, traditional statistical methods have many underlying assumptions, such as linearity of the relationship between the outcome and its predictors, which neglect the complexity that underpins COVID-19 infections. Statistical linear models tend to underperform with large datasets and are extremely susceptible to overfitting due to the large correlations between variables [14,15]. Conversely, ML algorithms can explore and accommodate large datasets with thousands of variables of different varieties and require minimal statistical assumptions [3]. Additionally, ML algorithms have proven to be superior in making individual-level predictions and are able to interrogate multiple interactions between risk factors, rendering them powerful tools to unveil novel insights into the clinical epidemiology of COVID-19 [15,16].
To the knowledge of the authors, ML algorithms have not yet been widely adopted in exploring the characteristics of COVID-19 infected populations in a clinical setting. Thus, we applied an interpretable multi-algorithm ML ensemble pipeline to a dataset consisting of approximately 4,000 COVID-19 patients detected in the State of Kuwait. We explored demographic and clinical factors that shaped the risk of being asymptomatic or symptomatic. Furthermore, we studied complex interactions between important risk factors that likely contributed to symptom status in infected patients and assessed our ML model in the context of individual-level clinical diagnosis, to address its advantages and limitations.

Methods
Setting and Data source
Due to unique circumstances, Kuwait is ideally suited to further our understanding of the varying spectrum of COVID-19 patients' clinical presentations. In Kuwait, before the first case of COVID-19 was even reported in the country, repatriation efforts were instigated, and screening for SARS-CoV-2 was performed for all incoming passengers. Any travelers who tested positive for the SARS-CoV-2 virus, irrespective of symptomatology, were admitted to Jaber Al-Ahmad Al-Sabah Hospital. In addition, during those early stages of the pandemic, all cases of COVID-19 that were diagnosed in Kuwait, including asymptomatic carriers, were admitted as inpatients at this same hospital. As a result, Jaber Hospital eventually became a hospital solely dedicated to treating COVID-19 patients. Because mass random screening and traveler screening led to a large proportion of diagnosed patients being admitted to Jaber Hospital, this presented an opportunity to study the clinical features of COVID-19 patients, including those who were asymptomatic. Additionally, asymptomatic and presymptomatic patients all received investigations and work-up identical to those of their symptomatic counterparts, all within the same facility, contributing to the homogeneity of this patient population.
Data comprised 3,995 patients admitted between the 24th of February and the 27th of May 2020, and were an extension of a population described elsewhere [17]. All patients included in the study tested positive for the SARS-CoV-2 virus using real-time polymerase chain reaction (RT-PCR). We extracted their demographic and clinical variables (i.e., features) upon admission from the hospital's electronic medical record system (Supplementary Table 1). Patients were classified as being symptomatic or asymptomatic based on their initial hospital presentations. Those who were classified as initially asymptomatic on admission were re-classified as symptomatic if their medical records indicated they developed symptoms later during their clinical course. Thus, the final dataset comprised SARS-CoV-2 symptom status as an outcome and 63 features (Additional file 1). We used the R software environment to conduct all statistical analyses.

Data pre-processing
We used an extensive ML ensemble pipeline [18], which comprises five supervised algorithms, including logistic regression (LR), support vector machine (SVM), random forest (RF), gradient boosting (GBM), and extreme gradient boosting (XGB), to build a predictive model for the SARS-CoV-2 symptom status. We used multiple ML algorithms and compared the differences in their predictive performances. Some ML algorithms are less flexible to missing data (e.g., SVM) [18], and therefore we imputed missing data using the RF-based nonparametric routine implemented in the 'missForest' R package [19]. Also, because some of our selected algorithms (e.g., LR) cannot accommodate strongly collinear features, we removed features with the largest mean absolute correlation (> 0.7) [20]. Additionally, we applied the 'Boruta' R package, which contains an RF-based algorithm, to retain only those features that are relevant for prediction [21]. We used the Boruta procedure to increase the efficiency, predictive performance, and accuracy of our ML algorithms [22]. We controlled for the class imbalance between asymptomatic and symptomatic patients (Supplementary Table 1) using a down-sampling routine, to avoid biased predictions toward the majority outcome class (i.e., symptomatic) [23]. We then partitioned the data randomly into training (80%) and testing (20%) sets.
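The down-sampling and random partitioning steps described above can be illustrated with a minimal sketch. The study itself was conducted in R; the Python code below is not the study code, and the function names (`downsample`, `split_train_test`), the toy cohort of 100 patients, and the random seed are assumptions introduced only for exposition.

```python
import random


def downsample(rows, label_key, seed=0):
    """Randomly drop rows from larger classes so every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    return balanced


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy cohort (assumption): 65 symptomatic and 35 asymptomatic patients.
cohort = [{"id": i, "symptomatic": i < 65} for i in range(100)]
balanced = downsample(cohort, "symptomatic")
train, test = split_train_test(balanced)
print(len(balanced), len(train), len(test))
```

In this toy cohort the majority class is reduced from 65 to 35 patients before the 80/20 split, so neither partition is dominated by symptomatic cases.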

Model training and evaluation
We trained our ML algorithms using the feature set determined by the Boruta procedure. We ran the LR, SVM, GBM, and XGB algorithms using the 'caret' R package [24], while the 'randomForest' R package [25] was used to run the RF algorithm. We used a K-fold cross-validation approach (K = 10) to estimate model performance parameters for each algorithm, including the area under the receiver operating characteristic (ROC) curve (AUC), accuracy (Acc), sensitivity (Se), and specificity (Sp). We calculated these parameters using the average confusion matrix across all folds of the cross-validation and used the default grid parameter settings in the training process of all algorithms. The K-fold approach was selected because it can prevent overfitting and artificial inflation of Acc due to the use of the same data for training and validation [18]. We then selected the best-performing algorithm to predict the probability of COVID-19 symptom status by comparing the estimated validation parameters of each model using the testing dataset.
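To make the validation parameters concrete, the following minimal Python sketch shows how Acc, Se, and Sp follow from a confusion matrix, how AUC can be computed from predicted probabilities, and how K-fold indices can be generated. It is an illustration under stated assumptions, not the study code (which used R): the helper names, the 0.5 classification threshold, and the eight toy patients are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) with 1 = symptomatic and 0 = asymptomatic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def summary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Se": tp / (tp + fn),
        "Sp": tn / (tn + fp),
    }


def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=10):
    """Yield (train_idx, val_idx); each index is used for validation exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]


# Toy predictions (assumption): eight patients with predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]
y_pred = [1 if s > 0.5 else 0 for s in scores]
metrics = summary_metrics(*confusion_counts(y_true, y_pred))
print(metrics, auc(y_true, scores))
```

Note that AUC and Acc measure different things: AUC summarizes how well the predicted probabilities rank symptomatic above asymptomatic patients across all thresholds, whereas Acc depends on the single threshold chosen.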

Model Interpretation
We visualized and examined the results of the best-performing model using the 'iml' R package [26]. Breiman's permutation routine [27] was used to estimate feature importance, which quantifies the expected loss in predictive performance (i.e., how correctly the model classifies symptomatic vs. asymptomatic patients) compared with the full model when the values of a specific feature are switched between a pair of patients [27]. Therefore, if the permutation routine does not affect the selected model's predictive performance, the feature is classified as unimportant. We estimated the global and individual effects of the top six important features using partial dependence (PD) and centered individual conditional expectation (cICE) methods, respectively [28]. We used Friedman's H-statistic method [29] to quantify the overall interaction strength of each feature. We then reused the same method to quantify the interaction strength of the top feature (i.e., the feature with the highest overall interaction strength) with the other features. This method uses a partial dependence decomposition routine to quantify the portion of the variance explained by each interaction [29]. Finally, we computed Shapley values, following a game theory approach, to quantify individual-level predictions and the contribution of each feature to those predictions for two randomly selected patients (one symptomatic and one asymptomatic) [30].
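The permutation importance and Shapley value ideas described above can be sketched in a few lines of Python. This is an illustration only and not the 'iml' implementation used in the study: the toy risk score, its coefficients, the 0.6 threshold, the feature names, and the patient values are all hypothetical, importance is measured here as the mean increase in misclassification error, and the Shapley values are computed exactly by enumerating all feature orderings against a reference patient, which is feasible only for a handful of features.

```python
import random
from itertools import permutations

FEATURES = ["crp", "resp_rate", "noise"]


def risk_score(x):
    """Hypothetical continuous risk score; 'noise' is deliberately ignored."""
    return 0.004 * x["crp"] + 0.02 * x["resp_rate"]


def classify(x):
    return 1 if risk_score(x) > 0.6 else 0


def error_rate(rows, labels):
    return sum(1 for x, y in zip(rows, labels) if classify(x) != y) / len(rows)


def permutation_importance(rows, labels, n_repeats=20, seed=0):
    """Mean increase in misclassification error after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = error_rate(rows, labels)
    importance = {}
    for name in FEATURES:
        total = 0.0
        for _ in range(n_repeats):
            column = [x[name] for x in rows]
            rng.shuffle(column)
            permuted = [dict(x, **{name: v}) for x, v in zip(rows, column)]
            total += error_rate(permuted, labels) - baseline
        importance[name] = total / n_repeats
    return importance


def shapley_values(x, reference):
    """Average marginal contribution of each feature over all feature orderings.

    Features not yet added to the coalition keep the reference patient's value.
    """
    phi = {name: 0.0 for name in FEATURES}
    orderings = list(permutations(FEATURES))
    for order in orderings:
        current = dict(reference)
        previous = risk_score(current)
        for name in order:
            current[name] = x[name]
            updated = risk_score(current)
            phi[name] += (updated - previous) / len(orderings)
            previous = updated
    return phi


# Toy cohort (assumption): labels are generated by the toy model itself.
rows = [{"crp": 100.0 if i % 2 else 5.0, "resp_rate": 14 + 4 * (i % 5), "noise": i}
        for i in range(40)]
labels = [classify(x) for x in rows]
importance = permutation_importance(rows, labels)

patient = {"crp": 90.0, "resp_rate": 24, "noise": 7}
reference = {"crp": 5.0, "resp_rate": 16, "noise": 0}
phi = shapley_values(patient, reference)
print(importance, phi)
```

A feature the model never uses ('noise') receives zero importance and a zero Shapley value, and the Shapley values sum to the difference between the patient's score and the reference score, which is the property that makes them useful for explaining a single prediction.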

Results
Our results indicate that 65% of the infected patients were symptomatic, and 48 of the 63 selected features were significantly associated with the study outcome (Additional file 1). However, the Boruta procedure indicated that 41 features were important for predicting the asymptomatic status of infected patients. Results of the ML pipeline suggest that the SVM algorithm outperformed the other algorithms across all validation parameters (i.e., AUC, Acc, Sp, and Se; Table 1), with an AUC of 0.77; that is, it ranked a randomly chosen symptomatic patient above a randomly chosen asymptomatic patient 77% of the time. In contrast, the LR algorithm was the worst-performing algorithm, with an AUC of 0.65 (Table 1).
Our ML approach revealed that C-reactive protein (CRP), community type, travel history, the presence of diffuse opacification in a chest x-ray (CxR), respiratory rate, and transmission type were the most important features for predicting symptom status (Figure 1). PD plots showed that the risk of being symptomatic increases when the concentration of CRP rises (Figure 2A). There was a slight increase in the risk of being symptomatic for patients who belonged to the citizen-resident community (Figure 2B), had no travel history (Figure 2C), or had diffuse opacification in their CxR (Figure 2D). Our PD plots also show that infected patients with a respiratory rate > 20 breaths/min (Figure 2E) and those who were healthcare workers (Figure 2F) were more likely to be symptomatic.
Estimated glomerular filtration rate (eGFR), followed by creatinine levels, had the strongest overall interactions with other features in shaping the risk of being symptomatic (Figure 3A). Further, the strongest interaction with eGFR was that of urea, followed by the presence of diffuse opacification in CxR, sex, total protein, alkaline phosphatase (ALP), and sodium (Figure 3B). Patients with low eGFR (< 50 mL/min/1.73 m²) and urea (< 20 mmol/L) were most likely to develop symptoms (predicted probability > 0.75; Figure 4A). The risk of developing symptoms was higher in patients with diffuse opacification and a low eGFR (Figure 4B). Similarly, females with a low eGFR were more likely to develop symptoms than males (Figure 4C).
The Shapley values of all selected features that contributed to the risk of developing symptoms in two randomly selected patients are summarized in Figure 5. A patient with high C-reactive protein (= 99.14 mg/L), low eGFR (= 40 mL/min/1.73 m²), low sodium (= 132 mmol/L), and diffuse opacification on CxR is most likely to develop symptoms (Figure 5A). Conversely, an infected patient with a history of recent travel, low CRP (= 4 mg/L), and a normal white blood cell count (= 9.1 × 10⁹/L) is less likely to develop symptoms (Figure 5B).

Discussion
To our knowledge, this study represents the first attempt to use an integrated and interpretable ML analytical framework on a large series of COVID-19 infected patients to uncover more in-depth insights into the risk factors that shaped their symptom status. We identified important demographic and clinical predictors of symptom status and unveiled their complex non-linear relationships. In general, we found that the combination of clinical and demographic features, including targeted inflammatory markers, radiographic findings, laboratory blood tests, and patients' characteristics, was important in predicting the risk of being symptomatic.
Our ML model revealed that CRP levels, diffuse opacification in a CxR, and respiratory rate (Fig. 1) were the most important clinical features in predicting which patients infected with COVID-19 will subsequently develop symptoms. These results are consistent with past studies, as CRP levels have been identified as potential biomarkers for symptomatic COVID-19 patients [31]. Moreover, this finding supports the notion that imaging modalities, such as chest x-rays and computed tomography, are important for diagnosing COVID-19 patients [32].
Community type, travel history, and transmission type were the most important demographic features for predicting the risk of symptomatic status (Fig. 1). Being a citizen-resident of Kuwait is associated with having a higher socioeconomic status compared with non-citizen residents. Despite this, our results indicate that citizen-residents were more likely to develop symptoms (Fig. 2B). This finding is surprising, especially as Alkhamis et al. inferred that significant spreading and cluster events in migrant worker communities were substantially more severe than those among citizen-residents, due to their densely populated areas and poor living conditions [33]. A potential explanation for this may be that migrant workers tend to represent a much younger subset of the population in Kuwait [34]. Our ML model inferred that patients with a recent travel history are less likely to develop symptoms (Fig. 2C). We attributed this finding to the government's extensive intervention measures of testing and forced institutional quarantine of arriving travelers at the beginning of the epidemic in Kuwait [33]. Also, healthcare workers were more likely to develop symptoms despite having access to personal protective equipment during their duties (Fig. 2F). Being in close contact with a large number of COVID-19 patients for a prolonged period of time, whilst performing various high-risk procedures such as aerosol-generating procedures (e.g., intubation), may be a contributory risk factor [35].
Past studies have inferred a linear relationship between evidence of kidney injury on admission and the clinical course of COVID-19 [36]. Using our ML pipeline, we were able to explore this relationship further by modeling nonlinear interactions between features (Fig. 3), and found that eGFR has the strongest overall interactions with other variables in shaping the risk of being symptomatic (Fig. 3A). Indeed, COVID-19 patients with low eGFR may be more likely to be severely ill on admission than patients with normal kidney function, as described elsewhere [37]. The mechanism underlying this has been hypothesized to involve the angiotensin-converting enzyme 2 (ACE-2) receptor, whose expression has been shown to be 100 times greater in the kidney than in the lung. In addition to its function as a receptor for SARS-CoV-2 entry into the alveolar cells of the lung, the ACE-2 enzyme has also been shown to interact with the virus directly, affecting the renin-angiotensin-aldosterone system (RAAS) physiologically. This process might indicate that patients with chronic kidney disease (CKD) may be more susceptible to a complicated COVID-19 infection since they have high RAAS activity, resulting in a systemic increase in the expression of ACE-2, a major entry site for the virus. Feature interaction plots with eGFR show that patients with low urea (Fig. 4A), elevated total protein (Fig. 4D), and hyponatremia (Fig. 4F) are at higher risk of being symptomatic. These results unveil the complexity of the acute disease phase upon admission, in which patients might experience multiple severe inflammatory processes and a negative fluid balance as a result of impaired renal function [38].
General limitations of the present study are the population size and potential selection bias in our study population. That said, our data were collected from the official COVID-19 treatment hospital (i.e., Jaber Al-Ahmad Al-Sabah Hospital), which makes this population representative of the whole State of Kuwait. Furthermore, our ML pipeline cannot adequately characterize the uncertainties in the model predictions. Methods such as Bayesian additive regression trees (BART) are more robust in quantifying such uncertainties, although they are limited by their requirement for larger datasets and demanding computations [39]. An advantage of the present analytical pipeline is the applicability of Shapley values to interpreting, at a finer scale, what our model means in terms of classifying symptom status (i.e., why one specific infected individual developed COVID-19 symptoms while another did not). For example, for a randomly selected patient from our cohort (Fig. 5A), having high CRP, low eGFR, hyponatremia, and diffuse opacification were associated with that patient becoming symptomatic.
By providing deeper insights into the underlying disease processes that dictate patients' clinical course, our ML pipeline can potentially be used to risk-stratify patients. Biomarkers and demographic data can be used as a proxy for disease status, potentially eliminating the need for extensive testing, which has exhausted healthcare resources worldwide, particularly for COVID-19. Also, ML models can be robust tools for COVID-19 case definitions, and therefore may help avoid inaccurate mapping of epidemic trajectories through public health surveillance activities [40]. It is worth noting that while our ML model identified community type as an important feature (Fig. 1), it was not significantly associated with the study outcome (p-value = 0.270; Additional file 1) using traditional statistical methods. Thus, the p-values commonly used to assess the statistical significance of the association between two variables might not be a reliable measure of inference in population-based studies [41].

Conclusions
In conclusion, we have shown that inflammatory markers, respiratory signs, transmission dynamics, and demographics were essential predictors of symptom status in COVID-19 patients. Nevertheless, our ML model further revealed that the non-linear relationships between impaired renal function and other clinical biomarkers were critical in shaping the risk of being a symptomatic COVID-19 patient. We demonstrated a superior predictive performance of our ML model over traditional statistical methods, such as logistic regression. Further application of our ML pipeline in defining COVID-19 cases may help guide public health interventions, which will help reduce the grave implications of this devastating virus globally.

Ethical approval and consent to participate for this study were obtained from the Ministry of Health of Kuwait Ethical Review Board.

Consent for Publication
Due to the retrospective nature of this study, the requirement for obtaining informed consent from study subjects was waived by the IRB (Ministry of Health Kuwait).

Figure 1
Plot showing the importance of features that contribute to the prediction of symptom status of COVID-19 positive patients.

Figure 3
Feature

Table 1.
Cross-validation summary results for the machine learning algorithms. Best performing algorithm is boldfaced.
AUC: Area Under the Curve; SE: Standard Error; LR: Logistic Regression; RF: Random Forest; SVM: Support Vector Machine; GBM: Gradient Boosting; XGB: Extreme Gradient Boosting.