Ethical approval
The Providence Institutional Review Board (IRB) approved this study and waived the requirement for written informed consent.
Data sources
Data for the development and validation data sets were collected from the electronic medical record (EMR) of Providence St. Joseph Health. Records were included for all people from Alaska, Washington, Oregon, Montana, and California who had at least one COVID-19 PCR test result on a nasal swab sample between February 21, 2020 and October 20, 2020. People with at least one positive test were coded as positive for infection; people with exclusively negative tests were coded as negative for infection. Location outcomes were evaluated by linking EMR geocoded data to data from the U.S. Census Bureau’s 2018 American Community Survey at the census block group or tract level, as previously described (2).
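As an illustration of this labeling rule, a minimal pandas sketch (the long-format layout and the column names `patient_id` and `result` are hypothetical, not the study's):

```python
import pandas as pd

# Hypothetical long-format table: one row per PCR test result.
tests = pd.DataFrame({
    "patient_id": [101, 101, 102, 103, 103],
    "result": ["negative", "positive", "negative", "negative", "negative"],
})

# A person with at least one positive test is labeled positive (1);
# a person with exclusively negative tests is labeled negative (0).
labels = (
    tests["result"].eq("positive")
    .groupby(tests["patient_id"])
    .any()
    .astype(int)
    .rename("covid_positive")
)
```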
Data were split into training and test sets in a 75/25 ratio, with a random seed set for reproducibility. Additional modeling was performed with the final selected model using a train/test/validation split (80/10/10). Two sets of training data were also generated: one with clinical symptoms (fever, cough, myalgia, sore throat, chills, and shortness of breath) and one without.
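A minimal sketch of both splits with scikit-learn (the seed values and placeholder data are ours for illustration; the paper does not specify them):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the EMR extract.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 75/25 train/test split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 80/10/10 train/test/validation split, done as two successive splits:
# hold out 20%, then divide the hold-out evenly into test and validation.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_te, X_val, y_te, y_val = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42
)
```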
Statistics
All major statistical analyses were performed in Python: version 3.6.12 on a 64-bit computer and version 3.6.10 on a GPU instance in the Azure Machine Learning ecosystem.
Data cleaning
Continuous variables were standardized or log-normalized to address skew and the influence of large values and outliers on the predictive power of trained models. Categorical variables were encoded, and dummy variables were created for variables with more than two classes. Most variables were treated as missing not at random (MNAR); missing data for MNAR variables were coded as a separate category, e.g., ‘Unknown’. The exceptions were body mass index (BMI) and gender. For BMI, median imputation was used to fill in the large amount of missing data (n = 25,646 from the initial participant pool; approximately 8%). Gender was analyzed as legal sex, and records with missing values were dropped (n = 119; 0.04% of the initial participant pool).
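A minimal sketch of these cleaning steps with pandas and scikit-learn (column names such as `bmi` and `smoking_status` are hypothetical; the study's exact pipeline is not specified):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical slice of the cleaned EMR extract.
df = pd.DataFrame({
    "age": [34, 61, 47, 29],
    "bmi": [27.1, np.nan, 31.4, np.nan],
    "smoking_status": ["never", None, "former", "current"],
})

# Continuous variables: log-normalize to address skew, then standardize.
df["age"] = StandardScaler().fit_transform(np.log1p(df[["age"]])).ravel()

# BMI: median imputation (an exception to the MNAR treatment below).
df["bmi"] = SimpleImputer(strategy="median").fit_transform(df[["bmi"]]).ravel()

# MNAR categoricals: code missing values as their own 'Unknown' level,
# then create dummy variables for categories with more than two classes.
df["smoking_status"] = df["smoking_status"].fillna("Unknown")
df = pd.get_dummies(df, columns=["smoking_status"])
```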
Hyperparameter tuning and cross validation
We used a randomized search approach, with cross-validation, to tune and identify critical hyperparameters for each model (Supplementary Material). The set of hyperparameters that produced the best area under the curve (AUC) on the training set was selected for each classifier in the final ensemble. Tuning was performed with repeated, stratified k-fold cross-validation (10 splits, 3 repeats), with a random seed set for reproducibility. We chose a randomized search because the alternative, a more comprehensive grid search, is computationally intensive. We report the best hyperparameters selected for the best model with symptoms (Supplementary Material).
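A sketch of this tuning step for one of the four classifiers, using scikit-learn's RandomizedSearchCV (the search space below is illustrative; the actual grids are reported in the Supplementary Material):

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold

# Repeated, stratified k-fold CV: 10 splits x 3 repeats, seeded.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

# Illustrative search space; the study's grids are in its supplement.
param_distributions = {
    "n_estimators": randint(100, 1000),
    "max_depth": randint(3, 20),
    "max_features": uniform(0.1, 0.9),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,
    scoring="roc_auc",   # select on training-set AUC
    cv=cv,
    random_state=42,
    n_jobs=-1,
)
# search.fit(X_train, y_train); search.best_params_ feeds the ensemble.
```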
Model training and selection
An ensemble was used as the predictive model in every experiment. Four classifiers – logistic regression, random forest, and two gradient-boosting implementations, XGBoost (XGB) and LightGBM (LGBM) – were used for training. After hyperparameter tuning, the best hyperparameters for each classifier were included as part of the ensemble for the prediction task. We used a soft-voting ensemble because the task requires probabilities of a positive test or event.
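A minimal sketch of the soft-voting ensemble with scikit-learn, XGBoost, and LightGBM (default hyperparameters here stand in for the tuned values):

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Soft voting averages the classifiers' predicted probabilities,
# so the ensemble itself outputs P(positive test).
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
        ("lgbm", LGBMClassifier(random_state=42)),
    ],
    voting="soft",
)
# ensemble.fit(X_train, y_train)
# p_positive = ensemble.predict_proba(X_test)[:, 1]
```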
Data augmentation
Most COVID-19 test results were negative, so data augmentation techniques were applied to address class imbalance by over-sampling the minority class and/or down-sampling the majority class. We used the Synthetic Minority Oversampling Technique (SMOTE) and a case-control approach to augment the training data in multiple modeling experiments. SMOTE creates synthetic minority-class examples by interpolating between nearest neighbors of the minority class in the feature space (7). We also experimented with a case-control (CC) approach, typically used in epidemiological studies, creating a 1:1 match by down-sampling the majority class (COVID-19 negative) to the size of the minority class; negative cases were selected by simple random sampling without replacement. Unlike SMOTE, this strategy uses only real, non-synthetic data for model training. Both approaches produced a 1:1 balance of the negative (majority) and positive classes (Class 0: 19,390; Class 1: 19,390). No augmentation was performed on the validation/test data set.
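Both strategies are available in the imbalanced-learn library, which we assume here for illustration (the paper does not name its implementation):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Placeholder imbalanced data standing in for the training set.
X, y = make_classification(n_samples=5000, weights=[0.85, 0.15],
                           random_state=42)

# SMOTE: synthesize minority-class points between nearest neighbors.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)

# Case-control style 1:1 match: random down-sampling of the
# majority class without replacement (imbalanced-learn's default).
X_cc, y_cc = RandomUnderSampler(random_state=42).fit_resample(X, y)

print(Counter(y), Counter(y_smote), Counter(y_cc))
```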
Twelve experiments were conducted; in each, models were fitted on the training set with or without data augmentation and dimensionality reduction applied to that set (Fig. 1). For dimensionality reduction, we applied principal component analysis (PCA) to compute the minimal set of principal components that explained 95% of the variance in the data. A recursive feature elimination (RFE) approach was also used in some experiments to select the minimal set of predictors most predictive of a positive COVID-19 test. Dimensionality reduction techniques were also applied to the test/validation sets; however, no augmentation was applied to those sets. PCA was not applied to comparative logistic regression models.
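A sketch of both dimensionality reduction techniques with scikit-learn (the placeholder data, RFE base estimator, and retained feature count are illustrative choices, not the study's):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for the training set.
X_train, y_train = make_classification(n_samples=500, n_features=40,
                                        random_state=42)

# PCA: a float n_components keeps the minimal number of components
# explaining 95% of the variance. Fit on training data only, then
# apply the same fitted transform to the test/validation sets.
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)

# RFE: recursively eliminate the weakest features until the
# requested number of predictors remains.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20)
X_train_rfe = rfe.fit_transform(X_train, y_train)
```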
Feature importance
We used the Python implementation of SHAP (SHapley Additive exPlanations) (8) to examine the key predictor variables that contribute to a patient’s probability of a positive COVID-19 test result. The library computes Shapley values, which quantify the marginal contribution of each feature to the predicted outcome for a given instance (9). This approach measures how much each feature pushes the predicted value of that instance away from a baseline, or average, prediction (the expected value). The SHAP methodology thus improves the interpretability of a machine learning model. SHAP values were computed using the final selected model.
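A minimal SHAP sketch (the study computed SHAP values on its final selected model; a single gradient-boosted classifier on placeholder data stands in here, since TreeExplainer takes tree models rather than a voting ensemble):

```python
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Placeholder data standing in for the cleaned EMR feature set.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Stand-in model; the study used its final selected model.
model = XGBClassifier(eval_metric="logloss", random_state=42).fit(X, y)

# Shapley values: each feature's push away from the baseline
# (expected value) toward the instance's predicted value.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Summary plot ranks the key predictors by mean |SHAP value|.
shap.summary_plot(shap_values, X)
```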