Evaluation of the fitness of data for discriminant analysis. Most of our predictor variables deviated, sometimes substantially, from normal Gaussian distribution. Univariate normality of each variable (v = 171) was tested in four datasets: healthy controls, moderate COVID-19, severe COVID-19, and combined moderate/severe COVID-19. Therefore, there were 684 variable/dataset combinations (Vi). A Shapiro-Wilk’s W statistic equal to, or approaching, “1” and a p-value greater than 0.05 indicate normal distribution, while low values of W with p-values lower than 0.05 indicate significant deviation from normality. Vis that did not significantly deviate from normality using a p-value cutoff of 0.05 [245 (35.8%)] showed W values ranging from 0.812 to 0.990. Skewness values for these Vis were mostly within the −1-to-1 range, except for 58 Vis (23.7%); none of the latter Vis, however, fell outside the −2-to-2 range. For kurtosis of these Vis, 109 (44.5%) were outside the −1-to-1 range, of which 20 (8.2%) were also outside the −2-to-2 range. The remainder (55.5%) were within the −1-to-1 range (Supplemental Table S1). This means, using a −1-to-1 cutoff, skewness and kurtosis agreed with Shapiro-Wilk’s test p-value 76.3% and 55.5% of the time. Among the 439 Vis (64.2%) that significantly deviated from normality according to the Shapiro-Wilk’s test, 49 (11.2%) and 90 (20.5%) showed skewness and kurtosis values within the −1-to-1 range, respectively. Therefore, among these Vis, skewness and kurtosis data agreed with Shapiro-Wilk’s test results 88.8% and 79.5% of the time, respectively. These Vis showed W values ranging from 0.273 to 0.920. (Supplemental Table S1). Overall, Shapiro-Wilk’s test showed 84.4% and 70.9% agreement with skewness and kurtosis data, respectively, using a cutoff range of −1-to-1 for the latter two.
The absence of multicollinearity was confirmed using correlation matrices generated using Pearson moment correlation coefficient. There is no precise consensus on the correlation coefficient threshold above which multicollinearity is presumed to exist. Thresholds as low as 0.40 and as high as 0.85 have been reported44, but the most commonly used threshold in our experience ranges from 0.7044 to 0.8045. In this study, we used 0.8 as our threshold and we found that CD69+, CXCR5+, CD38+, HLA-DR+, CD38+ HLA-DR+, Ki67+, and PD1+ subsets of total memory CD4+ T cells were often highly correlated with cell populations carrying the same surface marker among central, effector, and transitional memory CD4+ T cells. Total memory CD8+ T cells expressing these same surface markers were often highly correlated with cell populations carrying the same markers among central, effector, effector CD45RA+, and transitional memory CD8+ T cells. CD69+, CXCR5+, CD38+, HLA-DR+, CD38+ HLA-DR+, Ki67+, and PD1+ subsets of total memory CD8+ T cells were also highly correlated with multiple other cell populations including subsets of CD4+ T cells (Tables S2-S4).
Non-parametric Levene’s test showed that most of the variables in each of our 3 models were homoscedastic with p-values greater than 0.05. Thirteen, 29, and 11 of the 171 variables used in each model showed p-values less than 0.05, suggesting heteroscedasticity among these variables in models 1, 2, and 3, respectively (Table S5). Box’s M test was also performed yielding p-values of 5.7528 × 10–14 and 9.8324 × 10–22 for models 1 and 2, respectively (Table 1). Box’s M test failed in SPSS for Model 3 with an error message indicating that one or both of the group covariance matrices was/were singular. As discussed in more detail in the discussion section, the non-parametric Levene’s test is more reliable in assessing homoscedasticity in non-normal data31,46. Using the criteria described in the materials and methods section, we identified 52, 31, 134, and 216 outliers in the control, moderate, severe, and COVID-19 groups, respectively. Out of 171 variables, there were 45, 31, 79, and 94 variables containing 1 or more outliers in the control, moderate, severe, and COVID-19 groups, respectively (Tables S6-S9).
Construction and evaluation of discriminant models tailored for specific clinical applications. The wide range of presentations that develop following infection with SARS-CoV-2 called for a prognostic algorithm that may enable identifying critical patients early after infection. We therefore evaluated three discriminant models of immune profiles to distinguish healthy controls versus moderate or severe disease presentation. One model was designed to distinguish between the three groups of participants, healthy controls, moderately ill patients, and severely ill patients (Model 1). The predictive model was significant (p = 4.87 × 10–15) with a Wilks’ λ of 0.065, indicating that a majority of the variance contained in the model’s discriminant functions could be explained by differences in group membership. The model was built in five steps, each of which was statistically significant (p = 6.68 × 10–15 – 1.11 × 10–8) and contributed to improving the model as indicated by the incremental decrease of Wilks’ λ from 0.351 in the first step to 0.065 with the fifth (Table 1). Such small Wilks’ λ is consistent with a good fit with 93.5% of model variance (1 – 0.065 = 0.935) geared toward predicting group membership. The model contained two canonical discriminant functions, the first of which was a more important predictor of group membership than the second, as indicated by the first’s higher eigenvalue (4.155 versus 1.985), larger proportion of variance it explained (67.7% versus 32.3%), and greater canonical correlation (0.898 versus 0.815) (Table 1). Five variables were incorporated into the model; NK cells, T cells, CD21−CD27+CD38lo (class-switched, activated memory47,48) B cells (actSMB), activated HLA-DR+ neutrophils49 (actNeut), and Ki67+ immature granulocytes (imGran) were incorporated in steps 1 through 5, respectively. The discriminant score plot of Model 1 showed that the first discriminant function separated the three groups, especially controls and severe COVID-19, while the second discriminant function separated moderate COVID-19 from both severe COVID-19 and controls (Fig. 1d). The first discriminant function was most representative of NK cells and T cells, while the second represented actSMB and actNeut the most. ImGran almost equally contributed to both discriminant functions where their contributions ranked third for both functions. Despite model improvements with the introduction of each of the five variables, only the first three variables showed significantly different group means (by ANOVA with Holm correction and pairwise comparisons using a t-test with Holm-Sidak correction) and had relatively small Wilks’ λs (Table 2), prompting us to conclude that some of the variables ruled nonpromising based on individual biomarker evaluations, may still be useful to the model. Pairwise comparisons using a t-test p-value cutoff of 0.05 showed that NK cells and T cells were less abundant in patients with severe COVID-19 compared to uninfected controls and moderate disease patients, while ActSMB were more abundant in patients with moderate disease compared to those with severe disease and uninfected controls (Fig. 1a). In conclusion, model 1 proposes that reduced frequency of NK cells and T cells are the most distinguishing features separating severe COVID-19 from moderate COVID-19 and controls, high frequency of actSMB and actNeut are the most distinguishing features separating moderate COVID-19 from severe COVID-19 and controls, and imGran play a minor role in separating the three groups.
The second model (Model 2) was designed to distinguish between healthy volunteers and COVID-19 patients, regardless of the latter’s disease severity status. The model was statistically significant (p = 1.10 × 10–11) and had a good fit as a predictive model with a Wilks’ λ of 0.166. Since this model aimed to distinguish between two groups, only one discriminant function was extracted, which explained 100% of variance and had an eigenvalue of 5.03 and a canonical correlation of 0.913 (Table 1). This model was also compiled in five steps, each of which was significant (p = 1.36 × 10–11 – 1.43 × 10–7) and improved the model as indicated by the incremental decline of Wilks’ λ starting at 0.459 in the first step and reaching 0.166 in the last (Table 1). Five variables were included in the model, of which MAIT was the only significantly different variable and was decreased in COVID-19 patients compared to control (Fig. 1b). More than 50% of the variance of this variable could be explained by group membership, as indicated by a Wilks’ λ of 0.459. Having the highest standardized canonical discriminant function coefficient value (1.024), MAIT population was the most impactful on the single discriminant function in Model 2 and, thus, on group separation. Group means did not significantly differ for the remaining four variables [CD38+ NK cells (NK cells with enhanced cytotoxicity and cytokine secretion50) (en38NK), CD21+CD27−Ki67+ B cells (proliferating/activated naïve B cells51,52) (actNB), CD27+ NK cells (NK cells with enhanced function53,54) (en27NK), and CD27−CD45RA+ effector memory CD8+ T cells (terminally differentiated effector T cells55) (Temra)]; these variables had relatively high Wilks’ λs (0.960, 0.852, 0.993, and 0.908, respectively), indicating that only small portions of their respective variances were related to group membership (Table 2). However, these variables were not useless to the model since incorporating each of them resulted in a highly significant boost to the discriminatory ability of the model – as indicated by the p values associated with each step (see above and Table 1) – and a decrease in the model’s Wilks’ λ (Table 1). En38NK and actNB (standardized canonical discriminant function coefficients of 0.968 and 0.684, respectively) impacted the sole discriminant function of the model more than en27NK and Temra did (standardized canonical discriminant function coefficients of −0.626 and −0.443, respectively) (Table 2). In brief, the model suggests that the lower frequency of MAIT is the most prominent distinguishing features that separates COVID-19 patients from controls, while en27NK, en38NK, actNB, and Temra improve a discriminant model despite a lack of significant differences between groups.
The last model (Model 3) aimed to distinguish COVID-19 patients with severe disease from those with moderate presentation. The model was statistically significant (8.73 × 10–11) with an excellent fit (Wilks’ λ of 0.058). The model was constructed in seven steps, all of which were highly significant (p = 2.01 × 10–11 – 3.00 × 10–6) and resulted in a corresponding decrease in Wilks’ λ. The single discriminant function included in the model explained 100% of variance and had an eigenvalue of 16.219 and canonical correlation of 0.971 (Table 1). Multiple aspects of this model were paradoxical. Group means were significantly different for two variables (NK cells and the mean fluorescence intensity (MFI) of HLA-DR on monocytes) when correcting for multiple testing over all variables, and four variables (additional two variables were neutrophils and actSMB) when correcting for multiple testing over the number of variables incorporated in the model. However, these variables were not the most impactful on the discriminant function of the model. In fact, monocytes’ mean fluorescent intensity of HLA-DR and neutrophils had the lowest absolute standardized canonical discriminant function coefficient (0.616) and, thus, the smallest impact compared to other variables in the model. NK cells and actSMB had a standardized canonical discriminant function coefficient of 1.199 and 1.415, making them the second and fourth most impactful in the model, respectively. On the other hand, CXCR5+ CD8+ MAIT and Temra had the strongest and third strongest impacts on the discriminant function (standardized canonical discriminant function coefficients of 1.493 and –1.137), respectively, but none of them had significantly different group means. Even more perplexing is that almost none of CXCR5+ CD8+ MAIT and Temra variance was related to group membership (Wilks’ λ of 0.997 and 1.000, respectively) (Table 2). It is noteworthy that MFI of HLA-DR on monocytes had the smallest Wilks’ λ (0.413) of any variable tested in the study, thus, using it as a nidus around which the model was built was, indeed, appropriate. The fact that six other variables were added in the following steps indicates that each introduced variable lowered the models’ Wilks’ λ the most at the step in which it was introduced, which is mandated by the algorithm. In conclusion, it appears that decreased frequency of NK cells and actSMB, decreased MFI of monocyte HLA-DR, and increased frequency of neutrophils are the main distinguishing features of severe COVID-19 that set it apart from COVID-19 of moderate severity.
To visually observe group separation using the three models, we examined the corresponding canonical score plots. Model 1 clearly separated uninfected controls from moderate and severe patients, while the latter two appeared closer to each other than either of them was to the control. This finding suggested that distinguishing between moderate and severe patients would probably be more challenging than distinguishing between infected and uninfected participants, which will be adHLA-DRessed below (Fig. 1d). Model 2 plot shows complete separation between COVID-19 patients—inclusive of patients with moderate and severe disease—and uninfected controls. Severe and moderate patients overlapped, almost completely, as expected since they were all in one group whose centroid and the centroid of the control group defined the direction of the models’ discriminant function (Fig. 1e). Model 3 shows effective separation between moderate and severe patients (Fig. 1f).
Creating binary logistic regression models for binary dependent variables. Models 2 and 3 were recreated using binary logistic regression resulting in two new models that we named models 2’ and 3’, respectively. According to Chi-square test results, both new models were highly significant (p = 1.18 × 10–10 and p = 2.26 × 10–7, respectively). The Hosmer-Lemeshow null hypothesis of perfect group-membership prediction was retained at p = 1.000 for each of the two steps in model 2’ and the one step of Model 3’. All steps in both models had very high Nagelkerke’s pseudo-R2 values (0.916 – 1.000) (Table 3). These data strongly suggest that each of the two models had a strong predictive power. In Model 2’, there were 27 COVID-19 patients and 11 controls with the COVID-19 group being the target group—meaning we were interested in estimating the odds and probability of having COVID-19 for each of our participants. MAIT was introduced to the model in the first step and the corresponding Chi-square p value was 4.45 × 10–10 (Table 3), implying that the model at this stage, while containing MAIT as the sole predictor, was highly likely to be a better predictor of group membership than the null model containing no predictor variables. In the second step, which was also significant (p = 0.009) CD56dimCD16+ NK cells variable was introduced alongside MAIT (Table 3), meaning the addition of this other variable significantly improved the model’s predictive ability. No more steps or variables were added to the model signaling that it could not be significantly improved by incorporating more variables. Model 3’ had seven moderately ill and twenty severely ill patients, with the severely ill being the target group. Only one predictor variable – MFI of HLA-DR on monocytes – was introduced into the model with a Chi-square p value of 2.26 × 10–7. Regression weights, significance of each predictor variable, and odds ratios were not reliably calculated due to complete or quasi-complete group separation. This issue will be adHLA-DRessed in the discussion section. The differences in the makeup of the logistic regression models compared to the corresponding discriminate models led us to put each of these models to the test and empirically determine their predictive power.
Evaluation of the discriminant and binary logistic regression models. RCC was used to evaluate each models’ ability to correctly assign participants to their respective groups. Our discriminant models were more successful in correctly classifying participants into two groups than three groups. Model 1 achieved 92% overall RCC with RCCs of 91%, 71%, and 100% for healthy controls, the moderately ill, and the severely ill groups, respectively. Model 2 achieved an overall RCC of 97%, 91% for the control group, and 100% for the COVID-19 group. Model 3 achieved 100% RCC for both moderate and severe groups. Model 2’ showed 100% RCC for both the control and COVID-19 groups, which was achieved even with one variable (MAIT) in the model. Model 3’ showed an overall RCC of 93%, 86% for the moderate group, and 95% for the severe groups (Fig. 2).
We also compared the predictive ability of the two-group models (Models 2, 3, 2’, and 3’) to each other and to the use of predictor variables individually using the AUC method. In distinguishing between healthy participants and COVID-19 patients regardless of disease severity status, the largest AUC of any individual variable was that of the frequency of MAIT cells (0.993). Plasmablasts, three populations of CD38+ HLA-DR+ CD8+ T cells (central memory, effector memory, and total memory), and NK cells had the second through sixth largest AUCs (0.966, 0.946, 0.943, 0.939, and 0.931, respectively) among all individual analytes. Combining biomarkers using DA scores or binary logistic regression probabilities resulted in maximum AUCs of 1.000, indicating an improved predictive power with the use of multivariate biomarkers (Fig. 3, Table 4).
For predicting severe COVID-19 disease in a pool of hospitalized SARS-CoV-2-positive patients with moderate or severe disease, the largest AUCs of individual biomarkers were obtained using monocyte HLA-DR MFI, frequency of T cells, NK cells, CD4+ T cells, dendritic cells, and CD56dim CD16+ NK cells with AUCs of 0.993, 0.957, 0.957, 0.936, 0.932, and 0.929, respectively. Combining biomarkers using either DA resulted in a perfect sensitivity and specificity with an AUC of 1.000, while using binary logistic regression was equivalent to monocyte HLA-DR MFI with an AUC of 0.993 (Fig. 4, Table 5).
Next, we wanted to investigate fidelity of prediction of severe COVID-19 in a population of both health individuals and COVID-19 patients with either severe or moderate disease. The largest AUCs of individual biomarkers were obtained using T cells, NK cells, CD4+ T cells, CD56dim CD16+ NK cells, B cells, and neutrophils with AUCs of 0.981, 0.969, 0.947, 0.942, 0.922, and 0.917, respectively. A multivariate biomarker based on the discriminant scores of function 1 had a perfect AUC (Fig. 5, Table 6). The separation of groups illustrated in figure 1d is compatible with the results obtained, since the severe group is well-separated from the moderate disease group and controls on the first discriminant function, but not the second.
Finally, we tested the predictive ability of our models by employing them to classify eight participants by a blinded investigator. These participants included two with mild, four with moderate, and two with severe disease. There were no uninfected controls. Please note that there was no mild group in any of the models we constructed, but we had data for these two patients with mild disease and thought to include them to see how they would be classified. Using the discriminant scores of Model 1, one of the mild patients was classified as uninfected control, while the other was classified as moderate disease. Only one of the moderate patients was correctly classified, while the other three as well as the two severe patients were classified as severe. Counting the classification of mild patient as moderate correct (due to the absence of a mild category in the model), the overall RCC was 50%. Models 2 and 2’ correctly classified seven participants, while a moderate participant and a mild participant were misclassified as control by models 2 and 2’, respectively. Both models showed an RCC of 87.5%. The two participants with mild disease were classified as moderate by both models 3 and 3’. Both patients with severe disease were correctly classified by model 3, while model 3’ misclassified one of them as moderate. For participants with moderate disease, only one of them was classified as such by model 3 – the remaining three were misclassified as severe – and model 3’ correctly classified all four participants.