Background Biomarker identification is one of the major goals of functional genomics and translational medicine. Large scale omics data are increasingly being accumulated and can provide vital means for identifying biomarkers for the early diagnosis of complex disease and/or for patient/disease stratification in prospective studies. These tasks are clearly interlinked, and it is essential that an unbiased and stable methodology is applied to address them. Although many biomarker identification approaches, primarily machine learning based, have recently been developed, exploring the potential associations between biomarker identification and the design of future experiments remains a challenge.
Methods In this study, using both simulated and published experimentally derived (real) datasets, we compared the performance of four feature selection methods based on Random Forest (RF), a decision tree based machine learning approach: Boruta, permutation based feature selection without correction, permutation based feature selection with correction, and backward elimination based feature selection. Moreover, we conducted power analysis to estimate the number of samples required for potential future studies using the stable features derived in the previous step.
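As a rough illustration of the power analysis step, the sketch below estimates the number of samples needed per group to detect a standardised mean difference for a selected feature. It is not the calculation used in this study or in the web interface: it assumes a two-group comparison, a normal approximation to the two-sample t-test, and a Bonferroni adjustment across the selected features. The function name `samples_per_group` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8, n_features=1):
    """Approximate per-group sample size for a two-sided, two-group comparison.

    Assumptions (illustrative only, not taken from the study):
    - effect_size is a standardised mean difference (Cohen's d);
    - the normal approximation to the two-sample t-test is adequate;
    - alpha is Bonferroni-adjusted across n_features selected features.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")

    adjusted_alpha = alpha / n_features          # Bonferroni correction
    std_normal = NormalDist()                    # mean 0, standard deviation 1
    z_alpha = std_normal.inv_cdf(1 - adjusted_alpha / 2)
    z_power = std_normal.inv_cdf(power)

    # n = 2 * ((z_{1 - alpha/2} + z_{power}) / d)^2, rounded up
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Single feature, medium effect size
    print(samples_per_group(0.5))
    # Same effect size, corrected for 20 selected features
    print(samples_per_group(0.5, n_features=20))
```

Under this approximation, a medium effect (d = 0.5) at a significance level of 0.05 and 80% power requires 63 samples per group for a single feature. The requirement grows as the significance level is corrected for more features or as the expected effect size shrinks.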
Results We presented several RF based stable feature selection methods and compared their performance using simulated as well as published experimentally derived datasets. Across all of the scenarios considered, we found Boruta to be the most stable methodology, whilst Permutation (Raw) offered the largest number of relevant features when allowed to stabilise over a number of iterations. Finally, we developed a web interface (https://joelarkman.shinyapps.io/PowerTools/) to streamline power calculations and aid future study design within a translational medicine context.
Conclusions We developed a pipeline to discover biomarkers using RF methods. The web interface, “PowerTools”, offers the potential for designing appropriate and cost-effective future omics studies.