Title: Effect of biomarker identification on power analysis for diagnostics research

ABSTRACT


Introduction
Over the last few years, there has been considerable emphasis on high-dimensional omics data generation, including untargeted -omics datasets such as transcriptomics [1,2], metabolomics [3,4], proteomics [5,6] and microbiomes [7][8][9], as well as deep phenotyping [10]. As a consequence, massive amounts of data are routinely accumulated that need to be integrated and analysed in order to facilitate the identification of relevant markers. If the markers from omics datasets are robust, reproducible and indicative, they can be used as biomarkers for patient stratification [11,12] and can also be useful as diagnostic or prognostic tools.
Selecting relevant markers or features from a high dimensional dataset is defined as feature or variable selection [13] and requires a robust statistical or computational workflow [14].
In the literature there has been considerable interest in the application of machine learning methods to omics datasets for feature selection [14]. Lately, decision-tree-based statistical machine learning methods, for example Random Forest (RF) [15], have gained prominence. RF is an ensemble learning method that has been applied successfully to multiple high-dimensional omics studies, including transcriptomics [16], metabolomics [17], methylation [18] and proteomics [19]. The objective of these studies has been either prediction or the selection of important features that serve as potential biomarkers and can potentially be employed for patient stratification in future studies. The RF algorithm is a powerful prediction method known to capture complex dependency patterns between the outcome and the covariates, which makes it a promising candidate for developing prediction methods tailored to the challenges of multi-omics data.
In the case of omics analysis, it is important to have feature selection procedures that are systematic and data driven, so as to avoid selection bias. In the literature there is limited evidence that RF can also be employed for this task when coupled with other feature selection methods, for example for selecting genes and metabolites [20] or lipids and metabolites [21]. More recently, Degenhardt et al. (2019) [26] performed a simulation study of different versions of RF methods and also applied them to published experimentally derived (real) datasets.
However, in that instance, the markers that were identified were not satisfactorily related to future translational research. That is, the effect/weight of the identified markers was not assessed and validated for use in similar future studies via study design or power analysis.
There are some earlier studies that have focused on power analysis, for example relating to metabolomics data [22,23] and transcriptomics data [24]. However, those studies are specific to certain -omics datasets and often fail to properly relate power calculations to putative biomarkers identified via stable feature selection procedures.
In this study, we divided our objectives into two main tasks. For the first module, we performed extensive simulation with RF-based feature selection methods, namely Boruta [25], permutation based feature selection [26], permutation based feature selection with correction [26], and backward elimination based feature selection [27], in both a regression and a classification context, to understand their feature selection capabilities and prediction error. We then conducted similar investigations using experimentally derived data, to examine how the methods perform when dealing with disparate -omics data paradigms. For the second module, we developed a workflow to identify the number of samples required for a future study, using the stable biomarkers identified in the first task. Finally, we developed a web interface to streamline power calculations and offer a valuable asset for translational research going forward.

Random Forest (RF)
Random Forest (RF) [15] is an ensemble-based machine learning approach that can handle nonlinear relationships between response and predictor variables and supports both classification and regression analyses. Typically, for each tree, around two-thirds of the samples are used for model training, with the remaining third held out as the out-of-bag (OOB) samples and used to assess the model's performance. For classification, prediction performance can be quantified in terms of the rate at which OOB samples are misclassified, the OOB error. For regression, the average distance between OOB predictions and the true continuous response variable can be quantified using the mean squared error (MSE) metric. The contribution from each variable to the final model is quantified as a ranked measure of variable importance (more information is provided in the Additional file 1 methods section).
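As an illustration, the OOB error and ranked variable importances described above can be obtained as follows. This is a minimal sketch using Python's scikit-learn as a stand-in for the R implementation used in the study; the dataset is synthetic and all parameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # 200 samples, 50 predictor variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # outcome driven by the first two variables

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

oob_error = 1.0 - rf.oob_score_                      # OOB misclassification rate
ranking = np.argsort(rf.feature_importances_)[::-1]  # variables ranked by importance
print(oob_error, ranking[:5])
```

For a regression forest the same pattern applies with `RandomForestRegressor`, where `oob_score_` reports R² on the OOB samples and the MSE can be derived from `oob_prediction_`.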

RF feature selection methods
Four methodologies were utilised for the automatic selection of important features from the aforementioned ranked list RF generates, namely, Boruta [25], permutation based feature selection [26], permutation based feature selection with correction [26], and backward elimination based feature selection [27]. Details of the statistical basis for each of these approaches are outlined in the following sections.

Boruta
Boruta [25] compares the feature importance values estimated for the real predictor variables with those of variables generated by permutation of the real variables across observations. Variables generated by permutation are termed "shadow" variables. For each run, an RF is trained using a double-length set of predictor variables comprised of an equal number of real and shadow variables. For each of the real predictor variables, a statistical test is conducted comparing its importance with the maximum importance value achieved by a shadow variable.
Variables with significantly larger or smaller importance values are declared by the algorithm as important or unimportant, respectively. In subsequent runs, all unimportant and shadow variables are removed, and the process is repeated until all variables have been classified or a specified maximum number of runs have taken place.
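A single Boruta-style run can be sketched as follows. This is a deliberately simplified illustration in Python/scikit-learn: the full algorithm repeats this comparison over many runs and applies a formal statistical test, rather than the single-run threshold used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=150)

shadows = rng.permuted(X, axis=0)   # permute each column across observations
X_aug = np.hstack([X, shadows])     # double-length predictor set

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_aug, y)
imp = rf.feature_importances_
max_shadow = imp[20:].max()         # best importance achieved by a shadow variable

# real variables beating the best shadow are candidate "hits" for this run
hits = np.where(imp[:20] > max_shadow)[0]
print(hits)
```

In the real algorithm, hits are accumulated over runs and a binomial test decides importance; unimportant and shadow variables are then rebuilt or removed as described above.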

Permutation based feature selection
Standard permutation testing is an established approach for approximating a significance level or threshold for the selection of a subset of associated features from a RF model [21,26]. The iterative nature of this approach ultimately establishes a null distribution of importance values expected to be observed by chance alone. Using this null distribution, features are determined to be important only if the probability of achieving their observed importance value or one more extreme (the p-value) is below a specified threshold. Permutation testing was implemented using both raw (uncorrected) p-values as well as p-values corrected for multiple hypothesis testing via the Benjamini-Hochberg procedure (corrected) [28]. For both implementations, a threshold p-value of 0.05 was used to determine statistical significance.
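The permutation null distribution and Benjamini-Hochberg correction can be sketched as follows (a Python/scikit-learn illustration; the permutation count, forest size and data are illustrative, not the study's settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n, p = 120, 15
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + rng.normal(size=n)

obs = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y).feature_importances_

# null distribution: importances from fits against a permuted outcome
null = []
for b in range(30):
    y_perm = rng.permutation(y)
    null.append(RandomForestRegressor(n_estimators=200, random_state=b)
                .fit(X, y_perm).feature_importances_)
null = np.concatenate(null)

# raw p-value: chance of a null importance at least as large as the observed one
pvals = np.array([((null >= o).sum() + 1) / (null.size + 1) for o in obs])

# Benjamini-Hochberg adjustment, then select at the 0.05 threshold
order = np.argsort(pvals)
adj = pvals[order] * p / np.arange(1, p + 1)
bh = np.minimum.accumulate(adj[::-1])[::-1]
selected = order[bh < 0.05]
print(selected)
```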

Backward elimination based feature selection
Recursive feature elimination (RFE) [27] is an approach which endeavours to determine the smallest subset of variables that produce an effective model with good prediction accuracy.
The methodology involves the iterative fitting of an RF, where upon each iteration a specified proportion of the variables with smallest variable importance are discarded. This process is applied recursively until only a single variable remains available for input. At each iteration, model performance is assessed in terms of out-of-bag error, when RF is used in a classification capacity, or mean squared error (MSE) for regression forests. The set of variables leading to the production of the RF with the smallest error, or within a certain range of the minimum, are ultimately selected. This procedure was facilitated for classification forests using the base implementation of the method in the R package varSelRF [27]. Additionally, the constituent functions of varSelRF were modified by the present study to accept a continuous y variable input and to use MSE for model assessment, to facilitate feature selection when RF is used in a regression capacity.
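The elimination loop can be sketched as follows. This is a Python/scikit-learn illustration, not a reproduction of varSelRF; the 20% drop fraction and all other parameters are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 40))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

keep = np.arange(X.shape[1])
history = []                                       # (n_vars, OOB error, variable set)
while keep.size >= 1:
    rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                                random_state=0).fit(X[:, keep], y)
    history.append((keep.size, 1.0 - rf.oob_score_, keep.copy()))
    n_drop = max(1, int(0.2 * keep.size))          # discard the worst 20% each round
    order = np.argsort(rf.feature_importances_)    # ascending importance
    keep = keep[order[n_drop:]]

best_size, best_err, best_set = min(history, key=lambda h: h[1])
print(best_size, best_err)
```

For a regression forest, the OOB error line would be replaced by the MSE of the OOB predictions, mirroring the modification made to varSelRF in the present study.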

Module 1 - Stable feature selection
We divided our analysis methodology into two modules. In module one, we incorporated a double cross-validation (CV) procedure [29][30][31][32], which is summarised in Figure 1. First, a training:testing ratio of 75:25 is used to generate outer train and test data subsets, respectively. The outer train subset is then subject to a further tenfold CV, where one-tenth of the data (inner train) is used for hyperparameter optimisation (additional information is provided in the Additional file 1 methods). The remaining nine-tenths are considered the inner test subset and are used to complete 100 iterations of feature selection, where upon each iteration an RF is trained and each of the aforementioned methods identifies a subset of important features. Stable features are then defined as those identified by a particular method in a number of iterations greater than a specified stringency value. In the present study, features selected in >5/100 iterations are deemed low stringency (LS) stable features, and those selected in at least 90/100 iterations, high stringency (HS) stable features. Stable features are subset from the outer test data and used to properly quantify the predictive power (R²) and prediction error (OOB/MSE) of the model using independent data. The entire procedure is repeated four times, each with a modified outer loop split, such that each sample appears once within the outer test data subset. The values specifying predictive power, prediction error, and the frequency with which each feature is selected by each method are then averaged across the four outer loop repeats. The full double CV procedure was applied to simulated data and three published experimentally derived (real) data cohorts. Due to sample size limitations, CV was not conducted for published experimentally derived classification dataset 2.
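The stability-counting step at the heart of this module can be sketched as follows. To keep the example fast, a simple correlation filter stands in for the RF-based selectors and bootstrap resamples stand in for the CV iterations; the >5/100 and ≥90/100 stringency thresholds follow the text.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
n, p, runs = 100, 30, 100
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=n)

counts = Counter()
for _ in range(runs):
    idx = rng.choice(n, size=n, replace=True)      # resampled iteration
    Xb, yb = X[idx], y[idx]
    # stand-in selector: absolute correlation with y above a fixed cutoff
    r = np.abs([np.corrcoef(Xb[:, j], yb)[0, 1] for j in range(p)])
    counts.update(np.where(r > 0.3)[0].tolist())

ls = sorted(f for f, c in counts.items() if c > 5)     # low stringency (>5/100)
hs = sorted(f for f, c in counts.items() if c >= 90)   # high stringency (>=90/100)
print(hs, ls)
```

Features that persist across resamples survive the HS cutoff, while features selected only occasionally appear solely in the LS set, mirroring the stability definition used above.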

Module 2 - Power Analysis
We implemented a flexible approach to facilitate power analysis and sample size determination, based on functions designed and described by Blaise et al. (2016) [23]. We included both datasets with a continuous outcome variable (regression) and those with a two-group (binary) classification outcome. Furthermore, the correlation structure of the input data was explicitly modelled, in order to capture any multicollinearity between variables. An overview of the approach is represented schematically in Figure 1 (Module 2) and described briefly in the following section (more information is provided in the Additional file 1 methods section).
In summary, data is first simulated using a multivariate log-normal distribution then iteratively subset to produce data with a specified series of sample sizes. In the case of regression, for each of the variables assessed, a continuous outcome is generated that relates to the assessed variable with a Pearson correlation value equal to that of its relation to the real outcome (effect size). The simulated outcome variable is then regressed against each variable to produce a set of p-values describing each variable's association with the outcome.
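The regression procedure above can be sketched as follows (a scipy-only illustration with two correlated variables; the effect size, repeat count and significance threshold are assumptions made for the example, not the study's values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
effect, alpha, repeats = 0.5, 0.05, 200

def power_at(n):
    hits = 0
    for _ in range(repeats):
        cov = [[1.0, 0.7], [0.7, 1.0]]   # correlation on the log scale
        # multivariate log-normal draw, as in the simulation step above
        X = np.exp(rng.multivariate_normal([0, 0], cov, size=n))
        x = stats.zscore(X[:, 0])
        # outcome constructed to have Pearson correlation ~= effect with x
        y = effect * x + np.sqrt(1 - effect ** 2) * rng.normal(size=n)
        result = stats.linregress(x, y)
        hits += result.pvalue < alpha
    return hits / repeats

p10, p60 = power_at(10), power_at(60)
print(p10, p60)
```

Empirical power is then the fraction of repeats in which the regression p-value falls below the threshold, and rises with sample size for a fixed effect.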
For the two-group (binary) classification case, two datasets are produced for each specified sample size (one for each group) and a specified Cohen's d effect size [33] is introduced to one of them for the currently iterated variable and its highly correlated partners. A one-way ANOVA is then conducted for each variable, comparing the intra- and inter-group variances, and producing a set of p-values describing which variables differ significantly between the groups.
In either case, true/false positive metrics are then determined by comparing the set of statistically significant variables to a set containing the variable chosen for analysis and its highly correlated partners.
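The classification case can be sketched as follows (illustrative dimensions and effect size; with two groups the one-way ANOVA is equivalent to a t-test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_per_group, d, p, alpha = 30, 1.5, 10, 0.05
shifted = {0, 1}                                   # variables given the effect

g1 = rng.normal(size=(n_per_group, p))
g2 = rng.normal(size=(n_per_group, p))
g2[:, list(shifted)] += d                          # Cohen's d shift (unit-SD data)

# one-way ANOVA per variable, comparing the two groups
pvals = np.array([stats.f_oneway(g1[:, j], g2[:, j]).pvalue for j in range(p)])
sig = set(np.where(pvals < alpha)[0].tolist())

tp = len(sig & shifted)                            # true positives
fp = len(sig - shifted)                            # false positives
print(tp, fp)
```

Repeating this over a grid of sample sizes yields the true/false positive metrics described above for each candidate biomarker.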

Datasets
We used simulated data and published experimentally derived datasets to understand the methods and their performance in both a classification and a regression context. Please refer to Table 1 for detailed information on the data.

Simulated data
Simulation data featuring correlated predictor variables and a quantitative outcome variable were generated using a nonlinear regression model, as previously reported in the reference literature [26,34]. Specifically, the simulation strategy and equations outlined by Degenhardt et al. (2019) [26] were adapted and implemented in the present study, with minor parameter modifications. A single simulation scheme was incorporated in which sixty (six groups of ten) correlated variables were generated, alongside additional uncorrelated variables, to produce a dataset of 5000 predictor variables.
Variables from the first three correlated groups were also used to generate the quantitative outcome variable, via an equation adapted from Degenhardt et al. (2019) [26], such that y correlates decreasingly with the first, second and third groups.
The simulation scheme outlined above was repeated iteratively to produce a final simulation dataset of 200 observations, 5000 predictor variables and a quantitative outcome variable correlated with only the first three groups of correlated predictor variables. Additionally, the quantitative outcome variable was adapted for a binary classification context, whereby observations with a y value below the mean of the outcome variable set were assigned to group one, and those with a value above the mean assigned to group two. Consequently, the same simulation protocol was used to facilitate the assessment of feature selection and power analysis in both a classification and regression context.
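The scheme can be sketched as follows (dimensions reduced to 500 predictors for speed; the decreasing coefficients 1.0/0.75/0.5 and the shared-factor construction are illustrative assumptions, not Degenhardt et al.'s exact values):

```python
import numpy as np

rng = np.random.default_rng(7)
n, group_size, rho = 200, 10, 0.8

def correlated_group(n, k, rho):
    # equicorrelated variables (pairwise correlation rho) via a shared factor
    shared = rng.normal(size=(n, 1))
    return np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(size=(n, k))

groups = [correlated_group(n, group_size, rho) for _ in range(6)]
noise = rng.normal(size=(n, 440))        # uncorrelated filler variables
X = np.hstack(groups + [noise])          # 200 x 500 predictor matrix

# outcome correlates decreasingly with the first three groups only
y = (1.0 * groups[0][:, 0] + 0.75 * groups[1][:, 0] + 0.5 * groups[2][:, 0]
     + rng.normal(scale=0.5, size=n))

y_bin = (y > y.mean()).astype(int)       # binary adaptation: split at the mean
print(X.shape, y_bin.mean())
```

The final line implements the classification adaptation described above: observations below the mean of y form group one, and those above form group two.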

Published experimentally derived (real) datasets
A summary of the published experimentally derived datasets is presented in Table 1. For more detailed methods, please refer to the methods sections of the respective articles.

Simulation Data
We performed regression considering the complex relations created within the 5000 predictor variables and the continuous response (y). Examination of the power calculations for the non-y-related features revealed similarly negligible effect sizes for V36 (representing correlated group 4) and V4500. Consequently, the power values calculated for these variables remained close to zero across the full range of sample sizes considered. In contrast, V50 and V52 (representing correlated groups 5 and 6, respectively) each produced an effect size of ~0.14 by chance. Consequently, significant power values could be obtained for these variables at sufficiently large sample sizes. We determined a sample size of ~1990 to be necessary to observe power = 1.

Published experimentally derived metabolomics data for regression
We applied similar strategies on a public -omics dataset by Acharjee et al., (2016) [21], featuring lipid metabolites (Positive DI-MS Lipids) and using relative liver weight as a continuous outcome variable (y). Six metabolites of interest were identified by the original work and were thus considered known by the present study.
Boruta selected the largest number of HS features, including three previously known features of interest (Table 3, SF3A). In the LS environment, Boruta selected an additional forty features.
In addition, the HS Boruta model achieved the highest R-squared value, even slightly exceeding the value it achieved under LS (Fig. 3A). Permutation (Raw) also achieved strong stability across iterations (Table 3), exhibited consistent predictive performance (R-squared) across validation models (Fig. 3A), and identified two known metabolites under HS. RFE and Permutation (Corrected) both exhibited poor stability across iterations, retaining only six and three HS features, respectively, despite the methods identifying 94 and 51 features each under LS (Table 3).
The results of the power analysis for -omics dataset 1 are presented in Figure 3.

Published experimentally derived lipidomics data for regression
The results from a second application of the regression approach, using lipidomics data from 3-month-old infants, are summarised in Table 3.

Simulation Data
The simulated dataset was modified to produce a binary classification outcome variable (y) and utilised for feature selection (Table 3).

Published experimentally derived metabolomics data for classification
To further assess the performance of the classification approach, we once again utilised data from Acharjee et al. (2016) [21], with a modified binary classification outcome variable: below or above the mean value of the relative liver weight (Table 3). The HS stable features selected by Boruta in each of the four analyses were provided to the binary classification power functions described in study design (Module 2). In each case, the function simplified the provided features into correlated groups and identified the feature from each group determined to possess the largest Cohen's d effect size. In addition, t-tests were conducted for each feature, and statistically significant differences were observed between the binary classes in all cases (SF8).
Power calculations determined that, at their observed effect sizes, the three features identified in the control vs stage one comparison could achieve maximum power using <10 case samples, whilst the sample sizes necessary for the features selected in the three other comparisons ranged between <5 for the largest-effect features and >1280 for the smallest (SF9). In each comparison scenario, the effect sizes of all features were greater than the 'very large' threshold of 1.0 [39].

Web tool
To streamline power calculations and provide an accessible package fit for translational medicine, we produced 'PowerTools', an interactive open-source web application written in R using the Shiny framework (http://shiny.rstudio.com/) (Figure 5). The tool is capable of performing efficient simulation-based power calculations for regression and two-group classification datasets from various -omics disciplines. The web interface allows easy specification of sample sizes and quick access to function parameters, and is equipped with help information and example datasets to maximise the tool's accessibility. Confusion matrix result values are presented as both a customisable plot and as raw data tables, which can be easily downloaded using the intuitive user interface. A user guide is provided in Additional file 2.

Discussion
In the present study, we first examined the performance of four RF feature selection methods. When applied to published experimental data, it was impossible to distinguish variables truly associated with the outcome from false positives. Consequently, the subsets of features selected by each method were assessed only in terms of their stability, numerosity and the predictive performance of their resultant validation models. Across the two published experimental datasets considered for regression, Boruta once again exhibited the best stability, producing the smallest difference between the feature subsets selected under the two stringency states. In addition, Boruta selected the largest number of stable HS features from the first dataset, consistent with the notion that the method uses an all-relevant approach [25] and has selected the largest number of biomarker candidates in previous comparison studies [40].
In each of the published experimental data applications, RFE identified the smallest subset of HS stable features. However, this was expected, as by design the method seeks the smallest subset of predictive features [27]. Consequently, RFE exhibited the worst stability, producing a 15-fold difference in the number of HS/LS features found for the first dataset and an 11-fold difference for the second. The RF algorithm is capable of compensating for noisy features without a large decrease in model performance, thus validation model performance alone is an ineffective measure of the relative performance of RF feature selection approaches [40,41]. Furthermore, the variance in predictive performance values between CV repeats was greater when using variables selected under HS (e.g. Fig. 2C). This effect can be understood with respect to there being greater differences, between outer loop repeats, in the variables meeting the HS selection criteria, compared to the equivalent differences in those which surpass the LS selection criteria.
For our second module, we sought to combine our stable feature selection protocol with a novel simulation-based approach to facilitate power calculations for future study design. Further to previous efforts by Guo et al. (2010) [24], who compared the power achieved by a multitude of classifiers when presented with diverse sets of data, we focused singularly on RF but expanded into both classification and regression domains. Furthermore, we ensured the true correlation structure of the input data was captured, applying a methodology originally used in the context of metabolic phenotyping [23]. Moreover, we incorporated automated effect size calculations, and grouped similar variables before filtering by effect size, to identify a small subgroup of high-effect putative biomarkers for which to quantify power.
We validated the performance of our power calculations with respect to the values achieved using the simulated regression data. The power values predicted for each feature, at the number of samples actually used in module one for selection, matched with the observed empirical power achieved by the most stable feature selection methods (SF2B). Furthermore, as expected, we consistently observed smaller necessary sample sizes for the features with largest effect values. This observation was well illustrated for CRC stage 3 from the second classification dataset [36] (SF9C), where effect sizes ranged between 4.78 and 1.65, and the sample size necessary to obtain maximal power ranged from <5 to >1280.
We explored general trends in the sample sizes necessary to achieve good power, concluding that fewer than 40 samples are necessary to observe maximal power at an effect size of 0.8 in regression mode, and no more than 10 samples for a Cohen's d effect of 3.0 in classification mode. The values at these effect sizes are reported as they appeared consistently amongst the stable features selected by Boruta in module one of the present study. In almost all cases, fewer samples were necessary to observe maximal power for features selected during classification than for the equivalents chosen in regression mode. This observation corroborates the greater stringency observed for classification-mode feature selection in module one.
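The scale of the classification-mode figures can be sanity-checked with an analytic two-sample t-test power calculation. This is a scipy-only sketch using the noncentral t distribution; the study's simulation-based estimates, which also account for correlation structure, will differ in detail.

```python
import numpy as np
from scipy import stats

def ttest_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test at Cohen's d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)   # noncentrality parameter
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    # probability mass of the noncentral t beyond either critical value
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

print(ttest_power(3.0, 10))    # a 'very large' effect needs few samples
print(ttest_power(0.8, 20))    # a moderate effect needs more
```

As expected, power at a given sample size grows steeply with effect size, consistent with the small sample requirements observed for the largest-effect features.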
Lastly, we developed an interactive open-source web application: 'PowerTools', to facilitate not only the estimation of the number of samples required for future study, but to group similar features and determine the effect size associated with potential biomarkers. Whilst other computational tools for power analysis have been produced previously, many have been limited in their accessibility: providing only raw functions [23,42]; relating only to specific study designs, such as case-control microbiome studies [43]; or are now defunct [24]. We believe that our workflow and approach is generalised across multiple different -omics datasets and will help in the translational community to interpret the stability and future design aspects of the biomarkers.

Conclusion
In this paper, we presented a number of different RF-based stable feature selection methods and compared their performance using simulated as well as published experimentally derived datasets. Across all of the scenarios considered, we found Boruta to be the most stable methodology, whilst Permutation (Raw) offered the largest number of relevant features when allowed to stabilise over a number of iterations. We determined that the decision about which approach to pursue should be weighed against the computational requirements and runtime necessary. Finally, we linked the effect sizes of the stable markers to future study design, using a web-based interface.

Data availability
All data are available in the respective published papers and in the GitHub repository.

Author's contributions
JL carried out data analysis, developed web interface and was involved in the drafting of the manuscript. AA conceived and designed the data analytics strategy. VRC contributed to the Additional file and revised the manuscript. AA and GVG supervised the project and the analysis. All authors contributed to the writing of the manuscript.