Stabl: sparse and reliable biomarker discovery in predictive modeling of high-dimensional omic data

High-content omic technologies coupled with sparsity-promoting regularization methods (SRM) have transformed the biomarker discovery process. However, the translation of computational results into a clinical use-case scenario remains challenging. A rate-limiting step is the rigorous selection of reliable biomarker candidates among a host of biological features included in multivariate models. We propose Stabl, a machine learning framework that unifies the biomarker discovery process with multivariate predictive modeling of clinical outcomes by selecting a sparse and reliable set of biomarkers. Evaluation of Stabl on synthetic datasets and four independent clinical studies demonstrates improved biomarker sparsity and reliability compared to commonly used SRMs at similar predictive performance. Stabl readily extends to double- and triple-omics integration tasks and identifies a sparser and more reliable set of biomarkers than those selected by state-of-the-art early- and late-fusion SRMs, thereby facilitating the biological interpretation and clinical translation of complex multi-omic predictive models. The complete package for Stabl is available online at https://github.com/gregbellan/Stabl.


Several machine-learning methods, including sparsity-promoting regularization methods (SRMs) such as the least absolute shrinkage and selection operator (Lasso)5 or elastic net (EN),6 provide predictive modeling frameworks adapted to high-dimensional (p ≫ n) omic datasets, but the selection of a sparse and reliable set of features remains challenging. High-dimensional feature selection methods such as stability selection (SS) or Model-X knockoffs improve reliability by controlling for false discoveries in the selected set of features.12,13 However, in these methods, the threshold for feature selection or the target false discovery rate (FDR) is defined a priori, which uncouples the feature selection from the multivariate modeling process. Without prior knowledge of the data, these methods can lead to suboptimal feature selection, requiring multiple iterations to identify a desired threshold. This limitation also precludes optimal integration of multiple omic datasets into a unique predictive model, as a single fixed selection threshold may not be suited to the specificities of each dataset.

Here we introduce Stabl, a supervised machine learning framework that bridges the gap between multivariate predictive modeling of high-dimensional omic data and the sparsity and reliability requirements of an effective biomarker discovery process. Stabl combines the injection of knockoff-modeled noise or random permutations into the original data, a data-driven signal-to-noise threshold, and integration of the selected features into a predictive model. Systematic benchmarking of Stabl against Lasso, EN, and SS using synthetic datasets, three existing real-world omic datasets, and a newly generated multi-omic clinical dataset demonstrates that Stabl overcomes the shortcomings of state-of-the-art SRMs: Stabl yields highly reliable and sparse predictive models while identifying biologically plausible features amenable to further development into diagnostic or prognostic precision medicine assays.

When applied to a single cohort drawn at random from the population, SRMs will select informative features (i.e., truly related to the outcome) with a higher probability, on average, than uninformative features (i.e., unrelated to the outcome).5,12 However, as uninformative features typically outnumber informative features in high-dimensional omic datasets,1,2,11 the fit of an SRM model on a single cohort can lead to the selection of many uninformative features despite a low probability of selection.12,14 To address this issue, Stabl implements the following strategy (Fig. 1 and methods):

1. Stabl fits SRM models (e.g., Lasso or EN) on subsamples of the data using a procedure similar to SS.12 Subsampling mimics the availability of multiple random cohorts and estimates each feature's frequency of selection across all iterations. However, this procedure does not provide an optimal frequency threshold to discriminate between informative and uninformative features objectively.

2. To define the optimal frequency threshold, Stabl creates artificial features unrelated to the outcome (noise injection) via random permutations1-3 or knockoff sampling,13,15,16 which we assume behave similarly to uninformative features in the original dataset17 (see theoretical guarantees in methods). The artificial features are used to construct a surrogate of the false discovery proportion (FDP+). We define the reliability threshold as the frequency threshold yielding the minimum FDP+ across all possible thresholds (a schematic form of this construction is sketched below). This method for determining the reliability threshold is objective, in that it minimizes a proxy for the FDP. It is also data-driven, as it is tailored to individual omic datasets.
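For illustration, one plausible form of this surrogate and of the resulting threshold (written here in our own notation, with t a candidate frequency threshold and θ the reliability threshold; the exact estimator used by Stabl is specified in the methods) is:

```latex
% Schematic FDP+ surrogate and data-driven reliability threshold (notation assumed here)
\mathrm{FDP}_{+}(t) \;=\;
  \frac{\#\{\text{artificial features with selection frequency} \ge t\}}
       {\max\bigl(1,\ \#\{\text{original features with selection frequency} \ge t\}\bigr)},
\qquad
\theta \;=\; \arg\min_{t \in (0,1]} \mathrm{FDP}_{+}(t).
```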

As a result, Stabl provides a unifying procedure that selects features above the reliability threshold while building a multivariate predictive model. Stabl is amenable to classification and regression tasks and extends to the integration of multiple datasets of different dimensions and from different omic modalities.
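The following minimal sketch illustrates the overall logic described above: permutation-based noise injection, subsampled fits of a base SRM, and a data-driven frequency threshold minimizing an FDP+ surrogate. It is an illustration only, not the released implementation (available at https://github.com/gregbellan/Stabl); the base model (Lasso with a fixed penalty), the number of subsamples, and the exact FDP+ form are simplifying assumptions.

```python
# Illustrative sketch only; not the released Stabl implementation
# (see https://github.com/gregbellan/Stabl). Assumes a regression outcome,
# Lasso as the base SRM, permutation-based noise, and a simple FDP+ surrogate.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def stabl_like_selection(X, y, n_subsamples=100, subsample_frac=0.5, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Inject artificial features (random permutations of each original column),
    # assumed to behave like uninformative features.
    X_noise = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
    X_aug = StandardScaler().fit_transform(np.hstack([X, X_noise]))

    # Step 1: fit the base SRM on many subsamples and record selection frequencies.
    counts = np.zeros(2 * p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=max(2, int(subsample_frac * n)), replace=False)
        fit = Lasso(alpha=alpha, max_iter=5000).fit(X_aug[idx], y[idx])
        counts += fit.coef_ != 0
    freq_orig, freq_noise = counts[:p] / n_subsamples, counts[p:] / n_subsamples

    # Step 2: pick the frequency threshold minimizing the FDP+ surrogate.
    thresholds = np.linspace(0.01, 1.0, 100)
    fdp_plus = [(np.sum(freq_noise >= t) + 1) / max(1, np.sum(freq_orig >= t))
                for t in thresholds]
    theta = thresholds[int(np.argmin(fdp_plus))]

    # Selected features: original features whose selection frequency reaches the threshold.
    return np.where(freq_orig >= theta)[0], theta
```

A downstream predictive model restricted to the returned features can then be fit on the full training data, yielding the unified selection-plus-prediction procedure described above.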

Stabl improves sparsity and reliability while maintaining predictivity: synthetic modeling

We benchmarked Stabl against Lasso and EN using synthetically generated training and validation datasets containing known informative and uninformative features (Fig. 2a). Simulations representative of real-world scenarios were performed, including variations in the sample size (n), total features (p), and informative features (S). Models were evaluated using three performance metrics: sparsity, reliability, and predictivity.
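As a rough illustration of such a benchmark, a dataset with a known informative set can be generated as follows; the authors' simulation settings (feature correlation, effect sizes, noise levels) are described in the methods and may differ from the simplifying assumptions made here.

```python
# Minimal synthetic benchmark with a known informative set; simplifying assumptions:
# i.i.d. Gaussian features, linear signal, Gaussian noise.
import numpy as np


def make_synthetic(n=100, p=1000, s=10, noise_sd=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                      # p total features
    informative = rng.choice(p, size=s, replace=False)   # S truly informative features
    beta = np.zeros(p)
    beta[informative] = rng.uniform(0.5, 2.0, size=s)    # assumed effect sizes
    y = X @ beta + noise_sd * rng.standard_normal(n)     # continuous outcome
    return X, y, np.sort(informative)
```

Because the informative set is known, sparsity (number of selected features), reliability (overlap of the selection with the informative set), and predictivity (performance on held-out data) can all be computed directly.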

Before performing benchmark comparisons, we tested whether the FDP+ defined by Stabl experimentally controls the FDR at the reliability threshold, as the true value of the FDR is known for the synthetic dataset. We observed that FDP+ at the reliability threshold was indeed greater than the true FDR value (Fig. 2c and S1).

Stabl was tested using a random permutation method (Fig. 2 and S2-5) or Model-X knockoffs (Fig. S5) for noise generation. In each case, Stabl achieved higher sparsity compared to Lasso or EN (Fig. S6), as the number of features selected by Stabl was lower across all conditions tested and converged towards the true number of informative features (Fig. 2d). The reliability was also higher for Stabl than for Lasso or EN, such that the features selected by Stabl were closer to the true set of informative features (Fig. 2e). Meanwhile, Stabl had similar or better predictivity compared to Lasso or EN (Fig. 2f).

Further modeling experiments tested the impact of the data-driven computation of the reliability threshold while building the multivariate model, compared to SS (i.e., choosing a fixed frequency threshold a priori). Three representative frequency thresholds were evaluated: 30%, 50%, or 80% (Fig. 2g-i and S7-9). The performance of models built using a fixed frequency threshold varied greatly depending on the simulation conditions. For example, for a small sample size (n < 75), the 30% threshold had the best sparsity and reliability. However, for a large sample size (n > 500), the 80% threshold resulted in better performance. In contrast, Stabl models systematically reached optimal sparsity, reliability, and predictivity performances. Further, we show that the reliability threshold varied greatly with the sample size (Fig. 2j and S10), illustrating how Stabl adapts to datasets of different dimensions to identify an optimal frequency threshold solution.

In sum, synthetic modeling results show that Stabl achieves better sparsity and reliability compared to Lasso or EN while preserving predictivity, and that the set of features chosen by Stabl is closer to the true set of informative features. The results also emphasize the advantage of the data-driven adaptation of the frequency threshold to each dataset's unique characteristics rather than the use of an arbitrarily fixed threshold.

We evaluated Stabl's performance on four independent clinical omic datasets. Three were previously published with standard SRM analyses, while the fourth is a newly generated dataset introduced and analyzed for the first time here. Because clinical omic datasets can vary greatly with respect to dimensionality, signal-to-noise ratio, and technology-specific data preprocessing, we tested Stabl on

For each dataset, Stabl was compared to Lasso and EN on single-omic data, or to early fusion and late fusion on multi-omic data, over 50 random repetitions using a repeated five-fold cross-validation (CV) strategy. As the true set of informative features is not known for real-world datasets, the performance metrics differed from those used for the synthetic datasets.
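For reference, the cross-validation scheme itself (50 repetitions of five-fold CV) can be sketched as follows, here with an L1-penalized logistic regression as a stand-in classifier and AUROC as an example metric; the actual models and metrics are those described in the text.

```python
# Sketch of 50 repetitions of stratified five-fold CV for a binary outcome;
# the fitted model is a stand-in, not the Stabl pipeline itself.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold


def repeated_cv_auroc(X, y, n_splits=5, n_repeats=50, seed=0):
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = []
    for train, test in cv.split(X, y):
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        clf.fit(X[train], y[train])
        scores.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return float(np.median(scores))  # summarize across all folds and repetitions
```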

Stabl's reliability performance was also improved compared to Lasso and EN. The univariate p-values (Mann-Whitney test) for the features selected by Stabl were lower than for those selected by Lasso (Fig. 3i,j) or EN (Fig. S11e,f). Independent evaluation of the COVID-19 validation dataset confirmed these results (Table S1).

Consistent with the synthetic modeling analyses, the predictivity and sparsity performances of SS varied greatly with the choice of threshold, while Stabl provided a solution that optimized sparsity while maintaining predictive performance. For example, using SS with a 30% compared to a 50% threshold

Identification of fewer and more reliable features using Stabl facilitated the biomarker discovery process (Table S3). For the COVID-19 dataset, several features identified by Stabl echoed key pathobiological mechanisms of the host inflammatory response to COVID-19. For example, CCL20 is a known element of the COVID-19 cytokine storm,25,26 CRTAC1 is a newly identified marker of lung function,27-29 PON3 is a biomarker known to decrease during acute COVID-19 infection,30 and MZB1 is a protein associated with high neutralizing antibody titers after COVID-19 infection (Fig. 3j).20 The Stabl model also selected MDGA1, a previously unknown biomarker candidate of COVID-19 severity (Table S4).

Together, the results show that Stabl improves the reliability and sparsity of biomarker discovery in two single-omic datasets of widely different dimensionality while maintaining predictivity performance.

Stabl successfully extends to multi-omic data integration

We extended the assessment of Stabl to complex clinical datasets combining multiple omic technologies. In this case, the algorithm first selects a reliable set of features at the single-omic level, then integrates the features selected for each omic dataset in a final learner algorithm, such as linear or logistic regression.
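A minimal sketch of this integration step, reusing the stabl_like_selection function from the earlier sketch and a logistic regression as the final learner (a linear regression would take its place for continuous outcomes), is shown below; the function names and defaults are illustrative assumptions, not the package's API.

```python
# Per-omic feature selection followed by a final learner on the pooled reliable features.
# Uses the stabl_like_selection sketch defined earlier; illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression


def integrate_omics(omic_blocks, y):
    """omic_blocks: dict mapping omic name -> (n_samples, p_k) feature matrix."""
    selected = []
    for name, X in omic_blocks.items():
        # Each layer gets its own data-driven reliability threshold.
        idx, theta = stabl_like_selection(X, y)
        print(f"{name}: {len(idx)} features kept at threshold {theta:.2f}")
        selected.append(X[:, idx])
    X_final = np.hstack(selected)  # pool reliable features across layers
    return LogisticRegression(max_iter=5000).fit(X_final, y)  # final learner
```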

We compared Stabl to early- and late-fusion Lasso, two commonly employed strategies for multi-omic data integration (Table S5). These results emphasize the advantage of the data-driven threshold, as fixing a common frequency threshold across all omic layers would have been suboptimal, risking over- or under-selecting features in each omic dataset to be integrated into the final predictive model. Features selected by Stabl (Table S6) included a regulated decrease in innate immune cell frequencies (e.g., neutrophils) and in their responsiveness to inflammatory stimulation (e.g., the pSTAT1 signaling response to IFNα in NK cells34,35), along with a synchronized increase in pregnancy-associated hormones (e.g., 17-

Stabl identifies promising candidate biomarkers from a newly generated multi-omic dataset

Application of Stabl to the three existing omic datasets demonstrated the algorithm's performance in the context of biomarker discovery studies with a known biological signal. To complete its systematic evaluation, Stabl was applied to our newly generated multi-omic clinical study performing an unbiased biomarker discovery task. The aim of the study was to develop a model predicting which patients will develop a surgical site infection (SSI) after surgery. Compared to the benchmark methods, Stabl achieved similar predictive performance (Fig. 5b, S14), yet superior sparsity (Fig. 5c) and reliability performance (Fig. 5h,i). As a result of the frequency-matching procedure, there were no differences in major demographic and clinical variables between the two patient groups, suggesting that model predictions were primarily driven by pre-operative biological differences in patients' susceptibility to develop an SSI.

Stabl selected four mass cytometry and 25 plasma proteomic features that were combined into a biologically interpretable immune signature predictive of SSI. Examination of Stabl features revealed cell-type-specific immune signaling responses associated with SSI (Fig. 5h) that resonated with circulating inflammatory mediators (Fig. 5i, Table S8). Notably, the STAT3 signaling response to IL-6 in neutrophils was increased before surgery in patients predisposed to SSI. Correspondingly, patients with SSI had elevated plasma levels of IL-1β and IL-18, two potent inducers of IL-6 production in response to

Altogether, application of Stabl in the setting of a new biomarker discovery study provided a manageable number of candidate biomarkers of SSI, pointing to plausible biological mechanisms that can be targeted for further diagnostic or therapeutic development.

Stabl is a machine learning method for the analysis of high-dimensional omic data, designed to unify the biomarker discovery process by identifying sparse and reliable biomarker candidates within a multivariate predictive modeling framework. Application of Stabl to several real-world biomarker discovery tasks demonstrates the versatility of the algorithm across a range of omic technologies, single- and multi-omic datasets, and clinical endpoints. Results from these diverse clinical use cases emphasize the advantage of Stabl's data-driven adaptation to the specificities of each omic dataset, which enables reliable selection of biologically interpretable biomarker candidates conducive to further clinical translation.

Stabl builds on previous methods, including Bolasso, SS, and Model-X knockoffs. These methods improve the reliability of sparse learning algorithms by employing a bootstrap procedure or by using artificial features.5,12,14,16 However, these methods rely on a fixed or user-defined frequency threshold to discriminate between informative and uninformative features. In practice, in the p ≫ n context, objective determination of the optimal frequency threshold is difficult without prior knowledge of the data, as shown by the results of our synthetic modeling. The requirement for prior knowledge impairs the capacity for predictive model building, limiting these previous methods to feature selection alone.

Stabl improves on these methods by experimentally, and, under certain assumptions, theoretically, controlling the proportion of false discoveries among the selected features.

Stabl also objectively defines a model fit directly from the procedure, without requiring prior knowledge of the data.

On a synthetic dataset, we experimentally demonstrate that Stabl selects an optimal reliability threshold by minimizing the FDP+ and allows for improved reliability and sparsity compared to Lasso or EN at similar predictivity.

Stabl analyzes each omic data layer independently and fits layer-specific reliability thresholds before selecting the most reliable features to be merged in a final layer, thus combining the advantages of both methods.

Multi-omic data integration with Stabl was particularly useful for the analysis of our newly generated dataset in patients undergoing surgery. In this case, the Stabl model comprised several features that were biologically consistent across the plasma and single-cell datasets, revealing a patient-specific immune signature predictive of SSI that appears to be programmed before surgery.

Our study has several limitations. Although we demonstrate the validity and performances of Stabl


UMAP visualization of the proteomic data. Node characteristics as in (b).
Comparison of FDP+ vs. true FDR in synthetic dataset benchmarking.

Extended Data Figure S2 Comparison of Stabl and Lasso sparsity performance on synthetic data.