Data Source and Cohort
Data for this study were derived from the AoU data enclave (version 7) available in the preproduction environment,9 which consists of EHRs, survey responses, genotyping data, and lifestyle data (collected through mobile devices) for living adults (18 years or older) in the United States. EHR data encompass demographics, health-care visit details, medical conditions, and prescription drug orders for each patient.
Our cohort inclusion criteria largely paralleled those used for the original N3C study.3 Specifically, cohort members were required to have either 1) an International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) COVID-19 diagnosis code (i.e., U07.1) from an inpatient or emergency health-care visit, or 2) a positive SARS-CoV-2 PCR or antigen test.
We required patients to have EHR data available in AoU both before and after the diagnosis date or positive test result. We excluded patients with < 90 days between the positive COVID-19 test or diagnosis date and the end of the study (May 2023) to ensure sufficient follow-up to assess long COVID.3 EHR data in the AoU preproduction environment follow the common OMOP standard data model; to remain compatible with the existing EHR data model, we extracted only the variables used in the pre-existing models and applied exactly the same transformations specified in the previous literature's data-cleaning protocol.3 The only additional inclusion criterion was that patients must have valid data (described below) for at least one other data type (i.e., genetic, survey, or mobile device data).
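As a minimal sketch of these inclusion rules, the snippet below applies them to a simplified event table. The table layout, column names, and `build_cohort` helper are hypothetical stand-ins for illustration, not the actual OMOP/AoU schema.

```python
import pandas as pd

STUDY_END = pd.Timestamp("2023-05-31")  # end of study period (May 2023)

def build_cohort(events: pd.DataFrame) -> pd.DataFrame:
    """Apply the two COVID-19 inclusion criteria plus the 90-day follow-up rule.

    `events` holds one row per qualifying event: `person_id`, `event_date`,
    and `event_type` ('u071_inpatient_ed' for an inpatient/ED U07.1 diagnosis,
    'positive_test' for a positive PCR/antigen result).
    """
    qualifying = events[events["event_type"].isin(
        ["u071_inpatient_ed", "positive_test"])]
    # Index date = earliest qualifying event per person.
    index_dates = qualifying.groupby("person_id")["event_date"].min()
    # Require >= 90 days between the index date and the end of the study.
    enough_followup = (STUDY_END - index_dates).dt.days >= 90
    return index_dates[enough_followup].rename("index_date").reset_index()

events = pd.DataFrame({
    "person_id": [1, 2, 3],
    "event_date": pd.to_datetime(["2022-01-10", "2023-04-15", "2021-11-02"]),
    "event_type": ["positive_test", "u071_inpatient_ed", "positive_test"],
})
print(build_cohort(events))  # person 2 is dropped: < 90 days of follow-up
```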
From the genetic data, we used discrete variables recording the minor allele count (0, 1, or 2) for each genetic variant. We extracted 25 genetic variants from AoU whole genome sequencing (WGS) data version 7 (Supplemental Table 1). These variants were effect alleles identified by the COVID-19 Genetic Consortium and were significantly associated with COVID-19 infection or hospitalization due to COVID-19.6 Participants with genetic data were included only if they had non-null allele counts for each of our 25 pre-specified genetic variants (Supplemental Table 1).
From the health survey information available in AoU, we selected 32 questions that provided information distinct from what was available in the EHR data (Supplemental Table 2). The questionnaire responses were transformed into scalar variables for model training and testing. Participants were considered to have valid survey data only if they completed the AoU core survey questionnaire (The Basics, Lifestyle, and Overall Health) within the time span of their EHR history in AoU. Further, participants who skipped 10% or more of the questions used by our algorithms were excluded.
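The skip-rate exclusion can be sketched as below, assuming a hypothetical wide answer table in which a skipped question is stored as a missing value (the column names are illustrative):

```python
import pandas as pd

def filter_survey_completers(answers: pd.DataFrame, question_cols: list,
                             max_skip_frac: float = 0.10) -> pd.DataFrame:
    """Drop participants who skipped >= 10% of the selected survey questions."""
    skip_frac = answers[question_cols].isna().mean(axis=1)
    return answers[skip_frac < max_skip_frac]

answers = pd.DataFrame({
    "person_id": [1, 2],
    "q_smoking": [0.0, None],  # None marks a skipped question
    "q_alcohol": [1.0, 2.0],
})
# Person 2 skipped 1 of 2 questions (50% >= 10%) and is excluded.
print(filter_survey_completers(answers, ["q_smoking", "q_alcohol"]))
```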
The mobile device data collected in AoU derive from participants' Fitbits or other smart devices, which provide measures such as heart rate, step count, and sleep status. Several factors contributed to variability in data availability. First, patients varied in the consistency of their monitoring. Second, the timing of the COVID-19 diagnosis relative to the span of available data differed. For instance, if a patient received a COVID-19 diagnosis 3 months into a year of data, most of that participant's data would postdate the diagnosis, whereas a patient diagnosed later would have proportionally more data before the diagnosis.
Given this variability, our goal was to select a set of features that made the mobile device data interpretable for our models and that worked regardless of the observation period. We narrowed our inclusion criteria to patients with a positive COVID-19 test or diagnosis available in their EHR history. Long COVID is normally defined by symptoms persisting more than 28 days after a positive COVID-19 test, and tests are relatively accurate within about a week of contracting COVID-19. We therefore excluded all mobile device measurements between 7 days before and 28 days after the COVID-19 test. We then averaged the measurements before and after this excluded time frame. For example, step counts from a patient's mobile device data were extracted into two features for our machine learning model: the first is the patient's average daily step count from the start of the observation period until 7 days before their first positive COVID-19 test, and the second is the patient's average daily step count beginning 28 days after the positive test (Supplemental Table 3).
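The windowing can be sketched as below for one participant's step counts. The per-day table layout and helper name are hypothetical, and the exact boundary handling (strict versus inclusive inequalities) is an assumption.

```python
import pandas as pd

def pre_post_averages(daily: pd.DataFrame, covid_date: pd.Timestamp,
                      metric: str = "steps"):
    """Average a daily mobile device metric before and after the excluded
    window (7 days before through 28 days after the positive test)."""
    days_from_test = (daily["date"] - covid_date).dt.days
    before = daily.loc[days_from_test < -7, metric].mean()
    after = daily.loc[days_from_test > 28, metric].mean()
    return before, after

daily = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=90, freq="D"),
    "steps": range(90),  # toy values
})
print(pre_post_averages(daily, pd.Timestamp("2022-02-01")))
```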
Participants with mobile device data were considered only if they had sufficient data points both before and after the COVID-19 test or diagnosis date in their EHR. We required two or more data points more than 7 days before the COVID-19 test or diagnosis, and two or more data points beyond 28 days after it. Further, participants' measures had to fall within clinically reasonable average ranges for heart rate, step count, and minutes asleep for their mobile device data to be considered valid. Specifically, we defined an upper bound for each averaged metric and excluded participants whose values exceeded it: average heart rate ≤ 100 bpm, minimum single-day heart rate ≤ 125 bpm, median minutes asleep ≤ 600, and median daily steps ≤ 100,000.
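A sketch combining the data-sufficiency and plausibility checks described above; the summary-dictionary keys and helper names are hypothetical:

```python
import pandas as pd

def has_sufficient_points(daily: pd.DataFrame, covid_date: pd.Timestamp) -> bool:
    """Require >= 2 points more than 7 days before, and >= 2 points more than
    28 days after, the COVID-19 test or diagnosis date."""
    days = (daily["date"] - covid_date).dt.days
    return (days < -7).sum() >= 2 and (days > 28).sum() >= 2

def within_reasonable_ranges(summary: dict) -> bool:
    """Upper bounds on averaged metrics, as listed above."""
    return (summary["avg_heart_rate"] <= 100            # bpm
            and summary["min_daily_heart_rate"] <= 125  # bpm
            and summary["median_minutes_asleep"] <= 600
            and summary["median_steps"] <= 100_000)

print(within_reasonable_ranges({
    "avg_heart_rate": 72, "min_daily_heart_rate": 55,
    "median_minutes_asleep": 420, "median_steps": 8_000,
}))  # True: this participant's data passes the plausibility screen
```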
Data cleaning, feature engineering, and model training were completed using AoU Researcher Workbench cloud computing.
Training, Testing, and Validating Sets
We first reserved the patients with all data sources (i.e., EHR, genetic, and survey and/or mobile device data) for validation. We named this cohort the gold standard testing set, and it was used to evaluate the combined prediction produced from all models' outputs. From the remaining participants, we separated patients by available data type and trained each model with 75% of the patients who had that data type available; this set is referenced later as the training data set. The corresponding 25% of patients with that data type were designated as a preliminary testing set, used to test the output of a single model iteratively during the model training process (Fig. 1).
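The splitting logic can be sketched as follows, with hypothetical availability flags standing in for the real participant records:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1_000
ids = np.arange(n)
# Hypothetical availability flags; EHR data is assumed present for everyone.
has_genetic = rng.random(n) < 0.4
has_survey_or_mobile = rng.random(n) < 0.5

# Participants with every data source form the gold standard testing set.
gold_mask = has_genetic & has_survey_or_mobile
gold_standard_ids = ids[gold_mask]
remaining = ids[~gold_mask]

# For each model (here, the genetic model), split the remaining participants
# with that data type 75% training / 25% preliminary testing.
genetic_pool = remaining[has_genetic[~gold_mask]]
train_ids, prelim_test_ids = train_test_split(
    genetic_pool, test_size=0.25, random_state=0)
print(len(gold_standard_ids), len(train_ids), len(prelim_test_ids))
```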
Transformation of Training Data
While long COVID is a common disease, the associated ICD code, U09.9, is rare and under-utilized in clinical practice, and it was not available before October 2021. This rarity of coded true cases makes training machine learning models difficult: given the limited scope of our data, there were too few confirmed long COVID cases to train our models. However, the N3C model was created using a more robust dataset. Therefore, to increase the number of cases in our training set, we also counted as long COVID cases those patients to whom the N3C model assigned a predicted probability > 0.9. Thus, we defined long COVID cases for our training set as participants with either (1) the U09.9 diagnostic code or (2) a predicted probability of having long COVID > 0.9 based on the original N3C EHR algorithm.
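A one-line sketch of this training-label definition, assuming vectors `has_u099` (U09.9 code present) and `n3c_prob` (N3C model probability) have already been assembled; the variable names are illustrative:

```python
import numpy as np

has_u099 = np.array([True, False, False, False])
n3c_prob = np.array([0.20, 0.95, 0.40, 0.91])

# Training label: U09.9 code OR N3C predicted probability > 0.9.
# (The testing sets keep the stricter U09.9-only definition.)
train_label = (has_u099 | (n3c_prob > 0.9)).astype(int)
print(train_label)  # [1 1 0 1]
```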
Although this expanded definition is helpful for training new models on limited data sets, it is still important to maintain accuracy in evaluation. For this reason, the definition of long COVID in our testing sets remained unchanged, and we evaluated the performance of our models using only the U09.9 diagnostic code for long COVID.
Model Design and Machine Learning Infrastructure Selection
We implemented the N3C model to establish baseline performance for all data sets. We then trained two models independently: (1) genetic data alone, and (2) survey and mobile device data combined. We grouped mobile device data and survey data because of their high correlation and propensity for missing elements. Following the N3C approach, we used the existing XGBoost algorithm for EHR data. We then selected the most appropriate ML infrastructure for the remaining data sources from among Logistic Regression, Polynomial Regression, Random Forests, Convolutional Neural Networks, K Nearest Neighbors, Support Vector Machines, and Extreme Gradient Boosting Trees (Fig. 1). We first ran preliminary tests to find the most effective ML infrastructure for each data type individually: we built a model with each candidate infrastructure, calculated its AUROC score on the preliminary testing data, and selected the infrastructure with the leading AUROC score for each data type. Given the missingness of the survey and mobile device data, extreme gradient boosting (XGBoost) was the best performing model type for that data; the genetic data alone achieved the highest AUROC score with a convolutional neural network (CNN).
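The comparison can be sketched as below on toy data. For brevity the sketch omits Polynomial Regression and the CNN, and the real comparison was run separately per data type; the dataset and candidate set here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Toy stand-in for one data type's training / preliminary-testing split.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1_000),
    "random_forest": RandomForestClassifier(random_state=0),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "support_vector_machine": SVC(probability=True, random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}

aurocs = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    aurocs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

best = max(aurocs, key=aurocs.get)
print(best, round(aurocs[best], 3))
```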
Predictive Method
The individual predictive outcomes from our independent models were combined using a weighted average to give a single overall outcome for our combined methodology. Our weighted average determined a person's likelihood of long COVID by multiplying each model's result by a proportional weight and summing the products for overall confidence, using the equation below:

$$P\left(X\right)=\alpha P_{1}\left(X_{1}\right)+\beta P_{2}\left(X_{2}\right)+\gamma P_{3}\left(X_{3}\right)$$
X represents the participant's data, with subsets of the participant's data denoted by subscripts. P represents the probabilistic outcome of a machine learning model, with P1, P2, and P3 representing the outcomes of the EHR, genetic, and survey models, respectively. The non-negative constants α, β, and γ are proportional weights applied to the predictive outcome of each model. Because the total output represents a probability, α, β, and γ are constrained by the equation below.
$$\alpha +\beta +\gamma =1$$
These proportional weights served as fine-tuning parameters for our predictive method. Taking the sum of the weighted products yields a single probability of long COVID for each patient.
To calculate the optimal weights, we designed a function to iterate through possible values of α, β, and \(\gamma\), accurate to the nearest hundredth, and selected the combination that gave the best AUROC score on the preliminary testing set.
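A minimal sketch of this grid search, assuming per-model probability vectors and a hundredth step size; the helper name and toy data are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fit_ensemble_weights(p_ehr, p_genetic, p_survey, y_true):
    """Exhaustively search alpha, beta, gamma (non-negative, summing to 1,
    to the nearest hundredth) for the best AUROC on the preliminary test set."""
    best_auc, best_weights = -np.inf, None
    for a in range(101):          # alpha in {0.00, 0.01, ..., 1.00}
        for b in range(101 - a):  # beta such that alpha + beta <= 1
            alpha, beta = a / 100, b / 100
            gamma = 1.0 - alpha - beta
            combined = alpha * p_ehr + beta * p_genetic + gamma * p_survey
            auc = roc_auc_score(y_true, combined)
            if auc > best_auc:
                best_auc = auc
                best_weights = (alpha, beta, round(gamma, 2))
    return best_weights, best_auc

# Toy usage with noisy synthetic probabilities and labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
p1, p2, p3 = (np.clip(y + rng.normal(0, s, 200), 0, 1) for s in (0.4, 0.6, 0.8))
print(fit_ensemble_weights(p1, p2, p3, y))
```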
Cross Validation
We created a set of parameters for training models and validated our results using a custom cross-validation method. We trained each model separately using 5-fold cross-validation. We then divided our gold standard testing set into 4 groups and cycled through these portions to evaluate the prediction given by each combination of models. As a result, our final validation used 100 unique iterations (5 folds for each of the two trained models × 4 evaluation groups) of our data to train, test, and validate the predictive method (Fig. 2). We assessed AUROC, positive predictive value, specificity, and sensitivity for each iteration, then averaged across the 100 iterations for each model. We used a two-sample t-test for difference in sample means to assess the significance of differences between AUROC values.
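Structurally, the 100 iterations arise as in the sketch below, which uses toy data, a single classifier type in place of the real models, and equal combination weights; all of these stand-ins are illustrative only.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Toy stand-ins for the two trained data types and the gold standard set.
X_gen, y_gen = make_classification(n_samples=500, random_state=0)
X_srv, y_srv = make_classification(n_samples=500, random_state=1)
X_gold, y_gold = make_classification(n_samples=200, random_state=2)

gen_folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X_gen))
srv_folds = list(KFold(n_splits=5, shuffle=True, random_state=1).split(X_srv))
gold_groups = np.array_split(np.arange(len(y_gold)), 4)

aurocs = []
for (g_tr, _), (s_tr, _), gold_idx in product(gen_folds, srv_folds, gold_groups):
    m_gen = LogisticRegression(max_iter=1_000).fit(X_gen[g_tr], y_gen[g_tr])
    m_srv = LogisticRegression(max_iter=1_000).fit(X_srv[s_tr], y_srv[s_tr])
    # Equal weights for illustration; the study used the fitted alpha/beta/gamma.
    combined = 0.5 * m_gen.predict_proba(X_gold[gold_idx])[:, 1] \
             + 0.5 * m_srv.predict_proba(X_gold[gold_idx])[:, 1]
    aurocs.append(roc_auc_score(y_gold[gold_idx], combined))

print(len(aurocs), round(float(np.mean(aurocs)), 3))  # 5 x 5 x 4 = 100 iterations
```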
Further, to compare specificity across our predictive models, we created an optimization function that chose a threshold for each model such that the resulting sensitivity was closest to 0.7. We then performed a two-sample t-test for difference in sample means on the specificity of each predictive model's outcome.
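One way to implement this threshold choice, sketched with a hypothetical helper and toy data:

```python
import numpy as np

def threshold_for_target_sensitivity(y_true, y_prob, target=0.7):
    """Return the cutoff whose sensitivity is closest to `target`, plus the
    specificity at that cutoff."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    pos, neg = (y_true == 1), (y_true == 0)
    best_t, best_gap = 0.5, np.inf
    for t in np.unique(y_prob):  # candidate cutoffs: observed probabilities
        sens = (y_prob[pos] >= t).mean()
        if abs(sens - target) < best_gap:
            best_gap, best_t = abs(sens - target), t
    spec = (y_prob[neg] < best_t).mean()
    return best_t, spec

# Toy usage
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
p = np.clip(y * 0.5 + rng.random(500) * 0.6, 0, 1)
print(threshold_for_target_sensitivity(y, p))
```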
Assessing Individual Predictive Features
Using Shapley values as an analysis tool for ML methods, we determined the most important features in the final predictive method based on each feature's Shapley value and relative direction of effect. To do so, we normalized each model's Shapley values by dividing each value by the sum of all Shapley values for that model. The resulting proportional feature contributions for each model were then combined by multiplying each model's normalized Shapley values by the model's relative contribution to the overall predictive method.
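The normalization and pooling can be sketched as follows; the feature names are hypothetical, the weights stand in for the fitted α, β, and γ, and features are assumed unique to one model:

```python
def combined_feature_importance(shap_by_model: dict, weights: dict) -> dict:
    """Normalize each model's Shapley values to proportions of that model's
    total, then scale by the model's ensemble weight and pool the results.

    `shap_by_model` maps model name -> {feature: Shapley value};
    `weights` maps model name -> its weight in the combined predictive method.
    """
    pooled = {}
    for model, shap_vals in shap_by_model.items():
        total = sum(shap_vals.values())
        for feature, value in shap_vals.items():
            pooled[feature] = weights[model] * (value / total)
    return pooled

# Toy usage with hypothetical features and ensemble weights.
importance = combined_feature_importance(
    {"genetic": {"rs123": 0.6, "rs456": 0.4},
     "survey": {"sleep_quality": 0.7, "smoking": 0.3}},
    {"genetic": 0.4, "survey": 0.6},
)
print(sorted(importance.items(), key=lambda kv: -kv[1]))
```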