Experimental Design
This was a prognostic analysis of patients who (1) were admitted to Michigan Medicine between March 10, 2020 (the date of the first case in the state) and March 31, 2022 (the cutoff date of the released EHR data), (2) tested positive for COVID-19 or were transferred in with a positive diagnosis, and (3) had at least one COVID-related chest X-ray taken. We focused on patients with X-rays because patients without imaging were generally much younger and healthier, and because imaging is valuable for triaging patients and managing resources (Jiao et al, 2021). Our outcome was the time from admission until in-hospital death, censored by discharge or the end of the study. Discharge was treated as a censoring event, with the exception of discharge to hospice: because the median survival of these patients was less than 30 days post-discharge, making hospice discharge a strong precursor to death, we considered both in-hospital death and discharge to hospice as failure events (see Supplement A).
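For concreteness, the following is a minimal sketch of how such an outcome could be constructed with pandas (used here for illustration only; it is not among the packages the study lists). The table and its column names are hypothetical placeholders, not the study's schema.

```python
import pandas as pd

# Hypothetical admissions table; column names are illustrative only.
adm = pd.DataFrame({
    "admit_time": pd.to_datetime(["2020-03-15", "2020-04-02", "2020-04-10"]),
    "end_time": pd.to_datetime(["2020-03-28", "2020-04-20", "2022-03-31"]),
    "disposition": ["in_hospital_death", "discharged_home", "discharged_hospice"],
})

# Follow-up time: days from admission to death, discharge, or the study cutoff.
adm["time"] = (adm["end_time"] - adm["admit_time"]).dt.days

# In-hospital death and discharge to hospice count as failure events;
# all other discharges and administrative end of study are censored.
adm["event"] = adm["disposition"].isin(["in_hospital_death", "discharged_hospice"])
```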
From the EHR database, we extracted and created a set of demographic, socioeconomic, and clinical risk factors (see Supplement B) identified in the literature as being related to COVID-19 (Rod et al, 2020; Wu and McGoogan, 2020; Jordan et al, 2020; Mikami et al, 2021; Kim et al, 2021; Rosenthal et al, 2020; Centers for Disease Control and Prevention, 2022b; Ebinger et al, 2020; Williamson et al, 2020; Alqahtani et al, 2020; Khan et al, 2020; Ssentongo et al, 2020; Yang et al, 2020; Wang et al, 2020; Salerno et al, 2021a). Patient demographics included age, sex, race (Black or non-Black), ethnicity (Hispanic or non-Hispanic), smoking status, alcohol use, and drug use. As patient-level socioeconomic factors were unavailable, we created four composite socioeconomic measures at the US census tract level based on patient residences. These composites, measuring affluence, disadvantage, ethnic immigrant concentration, and education, were defined as the proportion of adults within a census tract meeting the corresponding criterion (Clarke and Melendez, Ann Arbor, MI; Gu et al, 2020; Salerno et al, 2021b) and were further categorized by quartiles. For each of twenty-nine prevalent comorbid conditions commonly used in the literature (Crabb et al, 2020; Elixhauser et al, 1998; van Walraven et al, 2009; Quan et al, 2005), we defined a binary indicator flagging whether the patient had any associated ICD-10 code at admission. Lastly, we obtained physiologic measurements within 24 hours of admission, including body mass index (kg/m²), oxygen saturation, body temperature, respiratory rate, diastolic and systolic blood pressure, and heart rate.
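As an illustration of the comorbidity indicators and neighborhood measures, the sketch below derives binary comorbidity flags from admission ICD-10 codes and quartile-categorizes a census tract-level composite. The code-to-comorbidity mapping, column names, and values are hypothetical, not the study's definitions.

```python
import pandas as pd

# Hypothetical inputs: one row per (patient, ICD-10 code) recorded at admission,
# and a toy mapping from ICD-10 prefixes to comorbidity groups (placeholders only).
dx = pd.DataFrame({"patient_id": [1, 1, 2], "icd10": ["E11.9", "I10", "J44.9"]})
code_to_comorbidity = {"E11": "diabetes", "I10": "hypertension", "J44": "copd"}

# Binary indicator per patient and comorbidity: 1 if any associated code is present.
dx["comorbidity"] = dx["icd10"].str[:3].map(code_to_comorbidity)
flags = (
    dx.dropna(subset=["comorbidity"])
      .assign(flag=1)
      .pivot_table(index="patient_id", columns="comorbidity",
                   values="flag", aggfunc="max", fill_value=0)
)

# Quartile categorization of a tract-level composite (e.g., disadvantage).
tracts = pd.DataFrame({"disadvantage": [0.10, 0.30, 0.20, 0.50, 0.70, 0.40, 0.60, 0.80]})
tracts["disadvantage_q"] = pd.qcut(tracts["disadvantage"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
```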
Because multiple X-rays could be taken for one patient, we chose the one closest to the time of admission and examined its role in predicting patient survival. We pre-processed each image according to the pipeline depicted in Fig. 2. First, prior to feature extraction and selection, we retained only those images taken in the anterior-posterior or posterior-anterior position so that image orientation would be comparable. We then normalized these images so that the pixel intensities of each image conformed to a standard range of 0 (black) to 255 (white). Finally, we used histogram equalization to enhance the contrast of the images (Jain, 1989).
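A minimal sketch of these pre-processing steps is given below. It uses scikit-image for histogram equalization, and the function name and view-position handling are illustrative assumptions rather than the study's implementation.

```python
import numpy as np
from skimage import exposure  # scikit-image; assumed here, not listed in the paper

def preprocess_cxr(img, view_position):
    """Pre-process one chest X-ray as described above (illustrative sketch)."""
    # Keep only frontal (AP/PA) projections so image orientation is comparable.
    if view_position not in ("AP", "PA"):
        return None

    # Min-max normalize pixel intensities to the 0 (black) to 255 (white) range.
    img = img.astype(np.float64)
    img = 255.0 * (img - img.min()) / max(img.max() - img.min(), 1e-8)

    # Histogram equalization to enhance contrast; equalize_hist returns values
    # in [0, 1], so rescale back to 0-255.
    img = exposure.equalize_hist(img.astype(np.uint8)) * 255.0
    return img.astype(np.uint8)
```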
Broadly, there are two potential approaches to feature extraction: (1) artificial intelligence methods, which learn feature representations automatically from the data, and (2) engineered texture features, which are pre-specified and computed directly from the images. While deep learning has been shown to have high prognostic accuracy, learned features are difficult to interpret, not standardized, and often not reproducible, which may limit their reliability (Yip and Aerts, 2016). We therefore extracted a standard panel of engineered texture features according to the PyRadiomics workflow (van Griethuysen et al, 2017). Specifically, we applied six different filters (i.e., transformations) to the pre-processed images to acquire additional information (e.g., at edges or boundaries) and derive different image types (van Griethuysen et al, 2017). From the seven image types (the original plus six transformations), we extracted seven classes of features (e.g., shape features) from each image (Haralick et al, 1973; Chu et al, 1990; Thibault et al, 2013; van Griethuysen et al, 2017), resulting in 1,311 candidate image features. To obtain a short list of predictive clinical and image features, we performed feature screening by fitting a Cox proportional hazards model (Therneau and Grambsch, 2000) to each feature one at a time and retaining those significant at the 0.05 level. Finally, we selected the features with the highest importance and fit a final Cox model quantifying the adjusted associations of important clinical and radiomic features with in-hospital mortality. We used the concordance index (C-index) to assess the predictiveness of the models (Harrell Jr et al, 1996) (see Supplement C). This study was approved by the Michigan Medicine Institutional Review Board (HUM00192931), which waived informed consent given the secondary analysis of deidentified data. All analyses were conducted in accordance with relevant guidelines and regulations.
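As an illustration of the univariable screening step, the sketch below fits one Cox model per candidate feature and keeps those significant at the 0.05 level. It uses lifelines, which is our assumption; the paper does not name the package used for this step.

```python
import pandas as pd
from lifelines import CoxPHFitter  # assumed package; not named in the paper

def screen_features(df, feature_cols, alpha=0.05):
    """Fit a univariable Cox model for each feature and retain those with p < alpha.
    `df` must contain 'time' (follow-up) and 'event' (failure indicator) columns."""
    kept = []
    for col in feature_cols:
        cph = CoxPHFitter()
        cph.fit(df[["time", "event", col]], duration_col="time", event_col="event")
        if cph.summary.loc[col, "p"] < alpha:
            kept.append(col)
    return kept
```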
Statistical Analysis
We implemented five risk prediction algorithms, namely, the Cox proportional hazards model (Therneau and Grambsch, 2000), survival support vector machines (Pölsterl et al, Preprint posted online November 21, 2016), random survival forests (Ishwaran et al, 2008), survival gradient boosting (Hothorn et al, 2006), and ensemble averaging of the first four algorithms (Zhou, 2012). The Cox model, the most widely used method in survival analysis, assumes a risk function that is linear in the predictors. Survival support vector machines (Pölsterl et al, Preprint posted online November 21, 2016) can account for non-linear relationships. Both random survival forests and survival gradient boosting combine predictions from individual survival trees to achieve more powerful predictions (Ishwaran et al, 2008; Hothorn et al, 2006; Salerno and Li, Preprint posted online May 5, 2022). Ensemble averaging combines predictions from multiple models and often performs better than any individual model by averaging out their errors (Zhou, 2012). Supplement D details these methods.
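The first four learners are available in scikit-survival, which the paper reports using; a minimal sketch follows. The hyperparameters shown are defaults rather than the study's tuned values, and the standardize-then-average scheme for the ensemble is our assumption, as the paper does not state how predictions from the learners were combined.

```python
import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.svm import FastSurvivalSVM
from sksurv.ensemble import RandomSurvivalForest, GradientBoostingSurvivalAnalysis

# Base learners; hyperparameters are defaults, not the study's tuned values.
base_learners = {
    "cox": CoxPHSurvivalAnalysis(),
    "svm": FastSurvivalSVM(random_state=0),
    "rsf": RandomSurvivalForest(n_estimators=200, random_state=0),
    "gbm": GradientBoostingSurvivalAnalysis(random_state=0),
}

def ensemble_risk(learners, X_train, y_train, X_test):
    """Fit each learner and average its standardized test-set risk scores.
    Standardizing before averaging is an assumption on our part."""
    scores = []
    for model in learners.values():
        model.fit(X_train, y_train)
        s = model.predict(X_test)  # model-specific risk scores
        scores.append((s - s.mean()) / s.std())
    return np.mean(scores, axis=0)
```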
We used cross-validation to estimate the predictiveness of each method without bias. We randomly split the data into 80% training and 20% testing samples, maintaining the proportion of events in the full sample within each split. We then trained the various predictive models on the training samples and computed the C-index on the testing samples. We repeated this procedure one hundred times and averaged the resulting C-indices to obtain an unbiased estimate of the C-index for each method (Uno et al, 2007). We applied each method first with the demographic and clinical predictors only, and then with the addition of the radiomic features, to assess their incremental prognostic utility via the C-index. Using ensemble averaging, which was the most predictive method (see the Results section), we developed a risk score to predict in-hospital mortality and classified patients into low- and high-risk groups using the median score as the cutoff.
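A sketch of this evaluation loop is shown below, assuming the survival outcome is stored as a scikit-survival structured array (e.g., built with sksurv.util.Surv.from_arrays, so that it has "event" and "time" fields).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sksurv.metrics import concordance_index_censored

def repeated_cindex(model, X, y, n_repeats=100, test_size=0.2):
    """Average test-set C-index over repeated 80/20 splits, stratified on the event
    indicator so each split preserves the proportion of events in the full sample."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y["event"], random_state=seed)
        model.fit(X_tr, y_tr)
        cindex, *_ = concordance_index_censored(y_te["event"], y_te["time"],
                                                model.predict(X_te))
        scores.append(cindex)
    return np.mean(scores)
```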
Lastly, we detail the variable selection process used to build the final Cox model. We selected clinical and image features based on their importance for prediction, defined as the absolute decrease in the C-index upon “removal” of the feature in question (Breiman, 2001). To do so, we randomly split the data into 80% training and 20% testing samples, fit the model on the training data, and calculated feature importance on the testing data (Supplement D.6). We repeated this procedure one hundred times, selected the features that were most important on average across the one hundred experiments, and included them in a multivariable Cox regression to assess their statistical associations with in-hospital mortality. All data processing and analyses were carried out with Python (version 3.8.8), NumPy (version 1.20.1), and scikit-survival (version 0.17.2).
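One common reading of this importance measure, consistent with the Breiman (2001) citation, is permutation importance: permute a feature in the test data and record the resulting drop in the C-index. The sketch below follows that interpretation, which is our assumption.

```python
import numpy as np
from sksurv.metrics import concordance_index_censored

def permutation_importance_cindex(model, X_test, y_test, seed=0):
    """Drop in test-set C-index when each column of X_test (a pandas DataFrame)
    is permuted; larger drops indicate more important features."""
    rng = np.random.default_rng(seed)
    base, *_ = concordance_index_censored(y_test["event"], y_test["time"],
                                          model.predict(X_test))
    importance = {}
    for col in X_test.columns:
        X_perm = X_test.copy()
        X_perm[col] = rng.permutation(X_perm[col].to_numpy())
        cindex, *_ = concordance_index_censored(y_test["event"], y_test["time"],
                                                model.predict(X_perm))
        importance[col] = base - cindex
    return importance
```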
We examined different subgroups to gauge how the prediction performance of the model improved with the addition of the radiomic features. Because, among the clinical factors, age and comorbidity burden were the most relevant to survival, we considered patient subgroups defined by age (> versus ≤ 65 years old) and by the number of comorbidities at admission (> versus ≤ the median of seven comorbidities). We then compared the change in prediction performance upon adding the radiomic features across these subgroups.
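A brief sketch of this subgroup comparison is given below, assuming fitted clinical-only and clinical-plus-radiomics models and a boolean mask identifying one subgroup (e.g., age > 65) in the test set; the function and argument names are illustrative.

```python
from sksurv.metrics import concordance_index_censored

def subgroup_gain(model_clin, model_full, X_clin, X_full, y, mask):
    """Change in test-set C-index from adding radiomic features within one subgroup,
    where `mask` is a boolean array selecting the subgroup's rows."""
    event, time = y["event"][mask], y["time"][mask]
    c_clin, *_ = concordance_index_censored(event, time, model_clin.predict(X_clin[mask]))
    c_full, *_ = concordance_index_censored(event, time, model_full.predict(X_full[mask]))
    return c_full - c_clin
```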