Proteomic analysis of cardiometabolic biomarkers and predictive modeling of severe outcomes in patients hospitalized with COVID-19

Background: The high heterogeneity in the symptoms and severity of COVID-19 makes it challenging to identify high-risk patients early in the disease. Cardiometabolic comorbidities have shown strong associations with COVID-19 severity in epidemiologic studies. Cardiometabolic protein biomarkers, therefore, may provide predictive insight regarding which patients are most susceptible to severe illness from COVID-19. Methods: In plasma samples collected from 343 patients hospitalized with COVID-19 during the first wave of the pandemic, we measured 92 circulating protein biomarkers previously implicated in cardiometabolic disease. We performed proteomic analysis and developed predictive models for severe outcomes. We then used these models to predict the outcomes of out-of-sample patients hospitalized with COVID-19 later in the surge (N=194). Results: We identified a set of seven biomarkers predictive of admission to the intensive care unit and/or death (ICU/death) within 28 days of presentation to care. Two of the biomarkers, ADAMTS13 and VEGFD, were associated with a lower risk of ICU/death. The remaining biomarkers, ACE2, IL-1RA, IL6, KIM1, and CTSL1, were associated with higher risk. When used to predict the outcomes of the future, out-of-sample patients, the predictive models built with these biomarkers outperformed all models built from standard clinical data, including known COVID-19 risk factors. Conclusions: These findings suggest that proteomic profiling can inform the early clinical impression of a patient’s likelihood of developing severe COVID-19 outcomes and, ultimately, accelerate the recognition and treatment of high-risk patients.

Circulating protein biomarkers re ect underlying physiologic or disease states and can serve as measurable indicators of disease severity for prognostication in clinical practice. Interrogating the relationship between COVID-19 outcomes and protein biomarkers previously implicated in cardiometabolic disease may yield actionable insights regarding how common cardiometabolic comorbidities contribute to COVID-19 pathogenesis and, ultimately, improve our ability to identify which patients are most likely to suffer poor outcomes.

Study design
During the rst wave of the pandemic (March 10 to June 1, 2020), we collected discarded blood samples from patients who presented to care (i.e., rst contact with the healthcare system) with COVID-19 symptoms and were subsequently hospitalized with PCR-con rmed SARS-CoV-2 infection at Massachusetts General Hospital (MGH), a large academic hospital in Boston, Massachusetts. In these samples, we measured 92 protein biomarkers in the Olink Target 96 Cardiovascular II panel [27]. We performed proteomic analysis and developed predictive models for the combined outcome of death or admission to the intensive care unit (ICU/death) within 28 days of presentation to care in patients hospitalized early in the surge, between March 10 and April 21, 2020 ("in-sample" group; n = 343). We then used these models to predict the outcomes in a separate sample of patients hospitalized with COVID-19 later in the surge, between April 22 and June 1, 2020 ("out-of-sample" group; n = 194).

Data collection
Details of the hospitalization, past medical history, and demographic characteristics were collected for each patient by manual chart reviews. Hospital laboratory tests performed within three days of presentation to care were retrieved electronically through the Enterprise Data Warehouse, a repository derived from electronic health records. We excluded hospital laboratory tests with a missing rate greater than 30% and imputed those remaining with the median of the non-missing values (mean missing rate = 13%). Study procedures were approved by the Mass General Brigham (formerly Partners) Human Research Committee, the governing institutional review board at MGH.

Plasma collection and proteomic assays
Blood samples were collected in EDTA tubes and processed within three days of collection in a biosafety level 2 + laboratory. Whole blood was centrifuged at 2800 rpm for 10 minutes. Plasma samples were extracted and stored at -80°C. Prior to plating, samples were treated for two hours with 1% TritonX-100 at room temperature for virus inactivation. The Olink reagents were based on Proximity Extension Assay technology. Measurements were translated into normalized protein expression (NPX) units. Sample plates passed quality control if the standard deviations of internal controls were < 0.20 NPX and individual samples were < 0.30 NPX from the median value of the internal controls [27]. A total of 537 samples (95%) passed quality control. All 92 protein biomarkers were detected with intra-assay coe cient of variance (CV) < 15% (mean 4%) and inter-assay CV < 30% (mean 10%).

Statistical analysis
We performed all statistical analyses in Python version 3.9.2 and R version 4.0.5. To account for nonnormality in the raw data, we applied rank-based inverse normal transformation for feature selection, modeling, and P value calculation. To preserve interpretability, the odds ratios (ORs) shown in the volcano plot and correlation matrix were calculated after standardizing the data to have a mean of 0 and standard deviation of 1. To account for multiple testing, the threshold P < 0.05/116 hospital laboratory tests and biomarkers = 4x10 − 4 was used to determine signi cance.
To uncover mechanistic links, we performed a core enrichment analysis using QIAGEN Ingenuity Pathway Analysis (IPA; Ingenuity Systems, Redwood City, CA) by matching a proteomic dataset, consisting of P values and ORs for each protein biomarker, with curated content of the Ingenuity Knowledge Base. We generated reports for molecular networks and canonical pathways, highlighting upstream regulators and pathways relevant to the observed changes, along with their predicted impact on downstream biological or disease processes.

Predictive modeling
To identify the subset of variables with the greatest predictive value for the logistic regression model, we ranked all variables by the frequency with which they were selected by LASSO (least absolute shrinkage and selection operator) regression when repeated across separate, random subsets of the in-sample data and different shrinkage parameters [28]. To identify the best model, we used an all-possible-regressions approach with these top-ranked variables. We used 5-fold strati ed cross-validation within the in-sample data to test the models formed by every possible combination of top-ranked variables. We then plotted the average area under the receiver operating characteristic curve (AUC), with the models sorted by the number of variables they included. The model that formed the elbow at which the plot of increasing AUC versus model size began to plateau was considered the best model, achieving the greatest performance without over-tting the in-sample data. We performed this analysis with and without the 92 protein biomarkers and compared the AUCs of the best models built with and without the biomarkers using a two-sided DeLong test [29]. We repeated the modeling process using random forest models and using the Boruta algorithm to initially rank the variables by signi cance [30]. Data from the out-of-sample patients was not used at any point in the feature selection and modeling process.

Characteristics of the patients
Compared to the in-sample group, the out-of-sample group had a similar proportion of patients who died, but a smaller proportion of patients who were admitted to the ICU (Fig. 1). The out-of-sample group had a larger proportion of self-reported non-Hispanic White and non-Hispanic Black/African-American patients, a smaller proportion of self-reported Hispanic patients, a lower average BMI, higher levels of D-dimer, lower levels of lactate dehydrogenase, and lower levels of creatine kinase relative to the in-sample group.
Patients who died or were admitted to the ICU had a higher average BMI and higher levels of D-dimer, LDH, C-reactive protein (CRP), and ferritin compared to those alive and not admitted to the ICU after 28 days of presentation to care (Table 1). Patients who suffered the ICU/death outcome (de ned as ICU admission or death within 28 days of presentation to care) were compared with those who did not suffer ICU/death across demographic factors, clinical variables, and hospital laboratory tests using a two-sided t-test for continuous variables and chi-square test for categorical variables. All race/ethnicity categories were self-reported. BMI categorization: < 18.5 kg/m 2 for underweight, 18.5-24.9 kg/m 2 for normal weight, 25.0-29.9 kg/m 2 for overweight, and ≥ 30.0 kg/m 2 for obese. Standard deviation, SD; African American, AA; body mass index, BMI; coronary artery disease, CAD; chronic obstructive pulmonary disease, COPD; C-reactive protein, CRP; lactate dehydrogenase, LDH.

Cardiometabolic protein biomarkers
To evaluate the relative importance of each of the 92 protein biomarkers and the 24 hospital laboratory tests with respect to ICU/death, we built a logistic regression model, adjusted for age, gender, BMI, and self-reported race/ethnicity, for each biomarker and hospital laboratory test in the in-sample group (Table   S1). The protein biomarkers comprised 31 of the 36 signi cant associations (P < 4⋅10 − 4 ) with ICU/death ( Fig. 2). The ve hospital laboratory tests with signi cant associations included LDH, CRP, procalcitonin, aspartate aminotransferase, and alanine transaminase, with standardized odds ratios (ORs) ranging from 1.8 to 3.2 (Fig. S1).
We evaluated how the biomarkers that were signi cantly associated with ICU/death correlated with one another and with the hospital laboratory tests, comorbidities, and demographics (Fig. 3). The largest cluster included 16 biomarkers which were positively associated with ICU/death (with standardized ORs ranging from 1.4 to 5.2) and showed positive correlations with type 2 diabetes, chronic kidney disease, and cardiac disease ( Fig. 3: Box A). This cluster also showed signi cant positive correlations with troponin, blood urea nitrogen, creatinine, procalcitonin, and D-dimer ( Fig. 3: Box B) and negative correlations with estimated glomerular ltration rate, albumin, hematocrit, and hemoglobin ( Fig. 3: Box C). A smaller cluster of ve biomarkers (ADAMTS13, SCF, FABP2, VEGFD, and TGM2) was negatively associated with ICU/death and negatively correlated with the majority of the hospital laboratory tests ( Fig. 3: Box D). Among all 36 signi cant biomarkers and hospital laboratory tests, this cluster included the only markers associated with lower risk of ICU/death (with standardized ORs ranging from 0.40 to 0.64). Canonical pathways identi ed by IPA included the tumor microenvironment, IL-10 signaling, airway pathology, granulocyte adhesion, and wound healing. Implicated functional networks included cardiovascular and organismal development, lipid metabolism, and protein synthesis (Tables S2 and S3).

Prediction in out-of-sample patients
For both the logistic regression and random forest, the models built with the protein biomarkers outperformed the models built without the biomarkers in the out-of-sample patients (Fig. 4). The bestperforming models consisted of a common set of nine variables (Table S4), which included two hospital laboratory measurements, procalcitonin and LDH (both of which were included in the best models without the biomarkers), and seven biomarkers: IL-1RA, CTSL1, ADAMTS13, VEGFD, KIM1, ACE2, and IL6 (Fig. 4A). The AUCs of the best models built with these nine variables were greater than that of the best models built without the protein biomarkers (logistic regression: 0.82 versus 0.70; P = 0.001; random forest: 0.83 versus 0.69; P = 3⋅10 − 5 ). We continued to observe superior performance by the models built with the protein biomarkers when excluding patients with more than 14 days between presentation to care and sample collection date (Fig. S2 and Table S5), when excluding patients with sample collection on or after the ICU admission date (Fig. S3), and when randomly splitting patients into the in-sample and out-of-sample group (Fig. S4). The models with the protein biomarkers also outperformed the models without biomarkers in age-strati ed and gender-strati ed analyses (Fig. S5 and S6). When evaluating the model performance in the two most prevalent self-reported categories of race/ethnicity, the improvement in performance of the model with biomarkers was greater in the Hispanic population than the non-Hispanic White population (Fig. S7). We repeated the analysis using ICU admission as the outcome and death as the outcome. The results for the ICU admission outcome were similar to those for the combined ICU/death outcome (Fig. S8 and Table S6), while results for the death outcome showed no signi cant difference in performance between the models built with and without the biomarkers (Fig. S9 and Table  S7).

Discussion
We identi ed protein biomarkers previously implicated in cardiometabolic disease that were signi cantly associated with severe illness from COVID-19, shedding light on biological pathways involved in COVID-19 pathology. We demonstrated that these protein biomarkers, measured early in the disease course, were more predictive of ICU admissions or death than established clinical risk factors. These ndings suggest that proteomic pro ling could improve the triage and treatment of patients hospitalized with COVID-19.
LDH, D-dimer, brinogen, CRP, and low platelets, markers of thrombotic risk, have been reported to be associated with poor prognosis in COVID-19 [44]. This observation is in keeping with the association between lower ADAMTS13, an enzyme that degrades von Willebrand factor, and poor outcomes found in our study and other reports [33]. Low levels of ADAMTS13 have also been described in thrombotic thrombocytopenic purpura and syndromes of thrombotic microangiopathy caused by infection [45]. Microangiopathic thrombosis has been seen in autopsies of patients who have died of COVID-19, similar to what has been observed in other ARDS-causing diseases [46]. Three of the identi ed biomarkers, KIM1, ACE2, and CTSL, are involved in host-virus interactions. KIM1, an indicator of renal insults, plays a role in viral entry and regulation of the host immune response to viral infections [47]. ACE2, the cellular receptor for SARS-CoV-2 [36, 38], undergoes shedding, leading to circulating ACE2, a biomarker of cardiovascular disease, diabetes, and death in patients with and without COVID-19 [48,49]. The association of ACE2 with severity is supported by a recently reported rare genetic variant that is associated with a 37% reduction in ACE2 expression and a 40% reduction in risk of severe COVID-19 [50]. Finally, CTSL is one of the lysosomal proteases that can cleave the SARS-CoV-2 spike protein, a step necessary for cellular entry [37,51].
The logistic regression and random forest models built with these seven biomarkers signi cantly outperformed all models developed from the clinical features and laboratory tests, suggesting that the biomarkers provide unique predictive value not captured by patient data extracted from the electronic health record. The biomarkers replaced known clinical risk factors for severe illness that had been selected in the model built without biomarkers (i.e., BMI, D-dimer, CRP, ALC, and troponin). Notably, BMI was replaced by IL-1RA, a biomarker that was strongly correlated with BMI (Fig. 3). IL-1RA, known to be highly expressed in white adipose tissue [52] and upregulated during in ammation, could serve as a better proxy than BMI for obesity-driven COVID-19 risk.
We recognize that standards of care and resource availability evolved quickly during the rst wave of the pandemic. As data on the e cacy and side effects of COVID-19 therapies accrued, the use of remdesivir and hydroxychloroquine increased and decreased, respectively. Prone positioning, applied heterogeneously early in the pandemic, eventually became standard of care. It is possible that these exogenous factors contributed to differences in outcomes between the in-sample and out-of-sample cohorts. We expect that, as the SARS-CoV-2 virus mutates, the virulence pathways and host responses may change, as noted by both the delta and omicron variants [53]. The patients hospitalized with COVID-19 today are generally younger and consist of both unvaccinated and vaccinated patients with breakthrough infections or repeat infections. Newer COVID-19 therapies have emerged, which could in uence the proteomic pro le of patients and its association with COVID-19 severity.
By evaluating the models in a sample separate from that used to develop the models, we showed that the predictive value of the biomarkers was robust to changes in clinical protocols and the patient characteristics during the highly dynamic study period. Compared to studies that develop and test predictive models within the same patient population, our approach provided a more rigorous assessment of the generalizability of our models and the conclusions derived from our analysis. As one of the largest proteomic analyses performed in COVID-19 patients, we were able to conduct age-strati ed, gender-strati ed, and race/ethnicity-strati ed analyses, demonstrating the strong performance of the model with the protein biomarkers across various demographic strata (Fig. S5, S6, and S7) Similar to previous COVID-19 analyses [54][55][56], this study was limited by the precision with which COVID-19 severity could be captured and COVID-19 related outcomes could be tracked. We used ICU admission and death as proxies for severe illness from COVID-19; however, patients may have died or been admitted to the ICU for reasons independent of their COVID-19 status. Patients discharged alive could still have died outside the hospital or be admitted to ICUs at other hospitals. Another limitation was that the time between symptom onset and blood sample collection was not uniform across all patients. Despite this, we observed similar results when excluding patients with sample collection dates that were on or after the date of ICU admission or greater than 14 days following presentation to care (Fig. S2, Fig. S3, and Table S5). Finally, by only collecting discarded blood samples at a single time point, we were unable to perform longitudinal analyses; however, biomarkers that can be interpreted with single timepoint measurements may be more useful in clinical settings where only one lab draw is available.

Conclusions
In this study, we identi ed a set of biomarkers that yield both mechanistic insight regarding how cardiometabolic disease contributes to COVID-19 pathology, as well as predictive value regarding which patients have the highest risk for severe outcomes. If considered early in the clinical evaluation of patients with COVID-19, these insights can help clinicians estimate a patient's cardiometabolic-driven risk, which, in turn, can inform downstream decisions regarding how to stratify patients across pathways of clinical care (e.g., in-hospital observation or early admission to ICU) and whether to institute treatments that reduce the risk of poor outcomes, such as monoclonal antibodies or novel antiviral therapies.

Consent for publication
All authors consent to this publication.

Availability of data and materials
The datasets generated and analyzed during the current study are not publicly available but may be made available upon reasonable request.  Figure 1 Characteristics of in-sample and out-of-sample patients. Blood samples were collected from 537 patients hospitalized with COVID-19 during an early surge in the outbreak, between March 10 th and June 1 st of 2020. Data from patients who were hospitalized early in the surge (before April 22, 2020) was used to analyze the cardiometabolic protein biomarkers and develop logistic regression and random forest models for severe outcomes. These patients comprised the in-sample group (shown in grey). These models were then used to predict the outcomes of the out-of-sample patients (shown in gold) who were hospitalized later in the surge (starting April 22, 2020). The in-sample and out-of-sample patients were compared across various demographic and clinical variables using a two-sided t-test for continuous variables and chi-square test for categorical variables. All race/ethnicity categories were self-reported.  Volcano plot of the 92 cardiometabolic biomarkers and 24 hospital laboratory tests. The plot includes, for each biomarker and hospital lab, the odds ratio on the x-axis and P value (-log10) on the y-axis resulting from a logistic regression model with ICU/death as the outcome, adjusted for the covariates: age, gender, Hierarchical clustering and correlation matrix with signi cant cardiometabolic biomarkers. A heatmap (top left) and correlation matrix (top right and bottom) for the 31 biomarkers signi cantly associated with ICU/death (P < 0.05/116 hospital laboratory tests and biomarkers = 4×10 -4 ). The correlation matrix shows how the biomarkers, ordered based on hierarchical clustering, correlate with one another (top right) and how they correlate with the demographic factors, clinical variables, and hospital laboratory tests (bottom). The color re ects the magnitude and direction of the Pearson correlation coe cient. The cells corresponding to correlations with P > 0.05 were left blank. The P values and odds ratios (OR) reported for the association of each variable with ICU/death are the same as those shown in Fig. 2. Box A shows the association of the largest cluster, comprised of 16 biomarkers, with type 2 diabetes, chronic kidney disease (CKD), and cardiac disease. Boxes B and C show how this cluster correlates with the hospital labs. Finally, Box D shows correlations between the hospital laboratory tests and a smaller cluster, comprising the ve biomarkers that were negatively associated with ICU/death. Standard deviation, SD; con dence interval, CI; African American, AA; chronic obstructive pulmonary disease, COPD; coronary artery disease, CAD; heart failure with preserved ejection fraction, HFpEF; heart failure with reduced ejection fraction, HFrEF; blood urea nitrogen, BUN; erythrocyte sedimentation rate, ERS; lactate dehydrogenase, LDH; aspartate aminotransferase, AST; white blood cells, WBC; C-reactive protein, CRP; absolute lymphocyte count, ALC; estimated glomerular ltration rate, eGFR. Prediction of ICU/death outcome in out-of-sample patients. (A) Violin plots for the set of seven cardiometabolic protein biomarkers that were included in the best model with biomarkers for both logistic regression and random forest. The gure depicts the distribution and box plot of these seven biomarkers, strati ed by the ICU/death outcome, in the in-sample patient population. The P values shown for each biomarker are based on the rank-inverse normalized data, while the odds ratios (OR) are based on the