PowerTools: A web-based, user-friendly tool for future translational study design

Biomarker identification is one of the major goals of functional genomics and translational medicine research. The advent of next-generation sequencing (NGS) has led to a rapid increase in the number of large datasets, which have the potential to enable novel biomarker identification for the early diagnosis of complex diseases and/or for patient/disease stratification. Once a biomarker has been identified, a validation study is necessary to assess its value. A study design that is both appropriate and cost-effective is paramount, and calculating the required sample size is a central challenge in achieving it.


Background
Over the last few years, there has been considerable emphasis on the generation of high-dimensional omics data, including untargeted datasets from transcriptomics [4], [5], metabolomics [6], [7], proteomics [8], [9], microbiomes [10], [11], [12] and deep phenotyping [13]. Vast amounts of data are routinely accumulated and need to be integrated and analysed to facilitate the identification of relevant markers. If the markers identified from the various omics datasets are robust, reproducible and indicative, they can be used as biomarkers for patient stratification [14], [15] and can also be useful as diagnostic or prognostic tools.
To validate those biomarkers experimentally, a study needs to be carefully designed, yet more often than not study designs encompass features that are chosen arbitrarily. Earlier studies have focused on generating power analysis outputs for such scenarios using different omics datasets, for example metabolomics data [16], [1] and transcriptomics data [17].
However, those studies are specific to particular omics data types and often fail to relate power calculations to the relevant biomarkers that were identified. In this study, we developed a web-based tool, termed PowerTools, to streamline power calculations and offer a valuable asset for use in translational research.
PowerTools is a flexible web tool that facilitates power analysis and sample size determination, based on a method described by Blaise et al. (2016) [1]. It accepts two types of response (outcome) variable: continuous outcomes (regression) and class outcomes (binary classification). The correlation structure of the predictor variables is explicitly modelled in order to capture any multi-collinearity between variables. To increase the potential adoption of our tool, we implemented its functionality in the R statistical software environment (https://www.r-project.org/). The redesigned functions incorporate comprehensive progress messages and error notation to improve their usability. Furthermore, the R implementation presented here improves the functionality of the original functions in two key respects. Firstly, each variable is automatically assessed using its true effect size: in the case of regression, the true effect size of a variable is estimated as its correlation with the true outcome variable, whilst in binary classification, the observed Cohen's d effect size [8] is computed. Secondly, highly correlated variables can optionally be grouped together, with only the member of each group that has the largest effect size used for assessment, thereby facilitating the identification of a smaller subset of potential biomarkers.
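The two effect-size measures and the optional grouping step can be illustrated with a minimal sketch. It is written in Python with the standard library only, purely for illustration, and is not the PowerTools R code. The function names, the pooled-standard-deviation form of Cohen's d, the greedy grouping rule, the correlation threshold of 0.9 and the toy data are all assumptions made for this example.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation; taken here as the effect size for a continuous outcome."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cohens_d(group_a, group_b):
    """Cohen's d for a binary outcome, using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd


def group_representatives(variables, effect_sizes, threshold=0.9):
    """Keep one variable per group of highly correlated variables.

    Variables are visited from the largest to the smallest absolute effect
    size. A variable that correlates at or above `threshold` (hypothetical
    default) with an already kept variable is treated as a member of that
    group and dropped, so each group is represented by its strongest member.
    """
    order = sorted(variables, key=lambda name: abs(effect_sizes[name]), reverse=True)
    kept = []
    for name in order:
        in_existing_group = any(
            abs(pearson_r(variables[name], variables[rep])) >= threshold
            for rep in kept
        )
        if not in_existing_group:
            kept.append(name)
    return kept


# Toy regression example: lipid_a and lipid_b are nearly collinear.
outcome = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
variables = {
    "lipid_a": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0],
    "lipid_b": [2.2, 3.9, 6.1, 8.0, 10.1, 12.2],
    "lipid_c": [3.0, 1.0, 4.0, 1.5, 5.0, 2.0],
}
effects = {name: pearson_r(values, outcome) for name, values in variables.items()}
representatives = group_representatives(variables, effects)
print(effects)
print(representatives)
```

In this toy example the two collinear lipids fall into one group, so only the one with the larger correlation to the outcome is carried forward together with the uncorrelated third variable.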

PowerTools Workflow
PowerTools accepts as input a set of -omics biomarkers associated with an outcome variable. Depending on the type of outcome variable (binary or continuous), it performs a simulation in which data are drawn at random from a multivariate normal distribution. The design of the workflow also takes potential correlations between the biomarkers into account.
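The general idea of simulation-based power estimation can be sketched for the simplest regression setting: a single biomarker and a continuous outcome drawn from a bivariate normal distribution. This is a deliberately simplified illustration in Python (standard library only) and not the PowerTools algorithm, which handles multiple correlated variables. The Fisher z significance test, the significance level, the number of simulations, the candidate sample sizes and all function names are assumptions chosen for this sketch.

```python
import math
import random
from statistics import NormalDist


def simulate_pair(n, rho, rng):
    """Draw n (biomarker, outcome) pairs from a bivariate standard normal
    distribution with correlation rho."""
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return xs, ys


def correlation_p_value(xs, ys):
    """Two-sided p-value for zero correlation (Fisher z approximation, n > 3)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    r = sxy / math.sqrt(sxx * syy)
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def estimate_power(n, rho, alpha=0.05, n_sim=500, seed=1):
    """Fraction of simulated studies of size n in which the effect is detected."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_sim):
        xs, ys = simulate_pair(n, rho, rng)
        if correlation_p_value(xs, ys) < alpha:
            detected += 1
    return detected / n_sim


# Estimated power for an assumed effect size (correlation) of 0.5.
power_by_n = {n: estimate_power(n, rho=0.5) for n in (10, 20, 40, 80)}
for n, power in power_by_n.items():
    print(n, power)
```

Reading the smallest sample size at which the estimated power exceeds a chosen target (for example 0.8) gives a sample size recommendation for the assumed effect size.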

Datasets
Our case studies were based on previously published experimental datasets, presented in Table 1.

Software and Code Availability
We used R v3.5.0 for statistical computing [20]. The web interface was constructed using the R Shiny framework [21] and is available online [3]. All our input and supplementary files can be found in our GitHub repository [2].

Web tool
To streamline power calculations and provide an accessible package fit for translational medicine, we produced PowerTools, an interactive open-source web application written in R using the Shiny framework. The tool performs efficient simulation-based power calculations for regression and binary classification datasets from various omics disciplines. The web interface supports the estimation of sample sizes, provides quick access to function parameters, and is complemented by help information and example datasets.
The resulting performance or confusion matrix values are presented both as a customisable plot and as raw data tables, which can be downloaded through the user interface. A screenshot of PowerTools is presented in Figure 2.
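For readers unfamiliar with confusion matrix summaries in a power analysis setting, the following minimal Python sketch shows how such values can be derived when the truly associated variables are known, as they are in a simulation. The text does not list the exact metrics reported by PowerTools, so the selection below (true positive rate, false positive rate, false discovery rate), the function name and the toy variable names are assumptions for illustration.

```python
def confusion_metrics(true_hits, called_hits, all_variables):
    """Confusion matrix counts and rates for one simulated study.

    true_hits:     variables that truly carry an effect in the simulation
    called_hits:   variables declared significant by the analysis
    all_variables: every variable that was tested
    """
    true_hits, called_hits = set(true_hits), set(called_hits)
    tp = len(true_hits & called_hits)   # effect present and detected
    fp = len(called_hits - true_hits)   # detected but no effect present
    fn = len(true_hits - called_hits)   # effect present but missed
    tn = len(set(all_variables)) - tp - fp - fn

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "TPR": ratio(tp, tp + fn),   # sensitivity, i.e. empirical power
        "FPR": ratio(fp, fp + tn),
        "FDR": ratio(fp, tp + fp),
    }


# Toy example with ten hypothetical variables, three of which carry an effect.
all_variables = [f"v{i}" for i in range(1, 11)]
metrics = confusion_metrics(
    true_hits=["v1", "v2", "v3"],
    called_hits=["v1", "v2", "v4"],
    all_variables=all_variables,
)
print(metrics)
```

Averaging such values over many simulated studies at each sample size yields the kind of curves and tables that a power analysis tool can present.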

Case studies
PowerTools was applied to perform power analysis on previously published, freely available omics datasets. To assess the two modes, regression and classification, we employed the data published by Acharjee et al., 2017 [18] and Bravo-Merodio et al., 2019 [19], respectively.

Regression mode case
In this case, the outcome variable was the amount of milk given to infants in the Cambridge Baby Growth Study (CBGS), and PowerTools was used in regression mode. A previous study [18] identified three lipids, PC(35:2), SM(36:2) and SM(39:1), each of which was therefore considered individually for a potential future study design (Figure 3).

Classification mode case
We used three physiological features (decreased neutrophil CD62L and CD63 expression as well as monocyte CD63 expression and frequency) [19] as potential biomarkers for multi  [22], others are directly related to specific study designs, such as case-control microbiome studies [23] and some are not currently being maintained [17].

Conclusion
PowerTools is an interactive open-source web application that uses an intuitive visual representation to support estimation of the number of samples required for potential future studies. We believe that our workflow and approach generalise across multiple -omics datasets and will help the translational and precision medicine community to interpret the stability and future design aspects of potential biomarkers.

Data Availability
All the data used in this study is available from the respective published papers as well as from our GitHub repository [2]. The step-by-step procedure can be found in the supplementary material.

Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable

Availability of data and materials
All the data used in this study is available from the respective published papers as well as from our GitHub repository [2].

Competing interests
The authors declare that they have no competing interests.
