Background: Many metagenomic studies aim to discover associations between the microbial composition of an environment (e.g. gut, skin, oral) and a phenotype of interest. Multivariate analysis (MVA) is often performed in these studies without critical a priori knowledge of which taxa are associated with the phenotype being studied. Consequently, non-parametric MVA methods are applied directly to all surveyed taxa, whether or not they contribute only noise. This approach typically reduces statistical power when the association signal is sparse, that is, when true associations involving only a few taxa are obscured by high dimensionality. At the same time, including all taxa can confound the extraction of key biological insights. Further, these data are subject to low sample sizes and compositional sample-space constraints, which can reduce generalizability beyond the study if not properly accounted for. More powerful association tests are needed that are interpretable, directly account for compositional constraints, and detect sparse association signals.
Methods: We developed Selection-Energy-Permutation (SelEnergyPerm), a non-parametric group association test with embedded feature selection. SelEnergyPerm directly accounts for compositional constraints by selecting parsimonious log-ratio signatures from the set of all pairwise log ratios (PLR) between features (OTUs, taxa, etc.). To do this, network methods are used to rank and select a candidate log-ratio subset that maximizes the between-group association. This process is then repeated within an appropriate permutation testing design to simultaneously determine the significance of the selected signature and of the association.
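As an illustration of the general idea only, the following minimal Python sketch computes all pairwise log ratios, selects a small subset, scores group separation with a two-sample energy statistic, and repeats the selection inside every permutation. It is not the selEnergyPermR implementation or its API: all function names are hypothetical, the toy data are synthetic, and a simple ranking by absolute difference in group means stands in for the network-based ranking described above.

```python
import itertools
import math
import random


def pairwise_log_ratios(sample):
    """All log(x_i / x_j) for i < j; assumes strictly positive parts."""
    pairs = itertools.combinations(range(len(sample)), 2)
    return [math.log(sample[i] / sample[j]) for i, j in pairs]


def mean_dist(a, b):
    """Mean Euclidean distance over every (row of a, row of b) pair."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def energy_statistic(a, b):
    """Two-sample energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return 2 * mean_dist(a, b) - mean_dist(a, a) - mean_dist(b, b)


def select_top_k(rows, labels, k):
    """Placeholder selection (an assumption, not the network method):
    rank log-ratio columns by absolute difference in group means."""
    g0 = [r for r, lab in zip(rows, labels) if lab == 0]
    g1 = [r for r, lab in zip(rows, labels) if lab == 1]

    def score(c):
        m0 = sum(r[c] for r in g0) / len(g0)
        m1 = sum(r[c] for r in g1) / len(g1)
        return abs(m0 - m1)

    return sorted(range(len(rows[0])), key=score, reverse=True)[:k]


def selected_statistic(rows, labels, k):
    """Select k columns, then score group separation on those columns."""
    cols = select_top_k(rows, labels, k)
    a = [[r[c] for c in cols] for r, lab in zip(rows, labels) if lab == 0]
    b = [[r[c] for c in cols] for r, lab in zip(rows, labels) if lab == 1]
    return energy_statistic(a, b), cols


def permutation_test(rows, labels, k=2, n_perm=200, seed=0):
    """Permutation p-value; selection is redone for every shuffled labeling
    so that the selection step cannot inflate significance."""
    rng = random.Random(seed)
    observed, cols = selected_statistic(rows, labels, k)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if selected_statistic(rows, shuffled, k)[0] >= observed:
            exceed += 1
    return observed, cols, (exceed + 1) / (n_perm + 1)


def toy_data(seed=1, n_per_group=10, n_parts=4):
    """Synthetic compositions; part 0 is inflated fourfold in group 1."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label in (0, 1):
        for _sample in range(n_per_group):
            counts = [rng.uniform(5, 15) for _part in range(n_parts)]
            if label == 1:
                counts[0] *= 4
            total = sum(counts)
            rows.append(pairwise_log_ratios([c / total for c in counts]))
            labels.append(label)
    return rows, labels


rows, labels = toy_data()
observed, cols, p_value = permutation_test(rows, labels)
print(f"energy statistic={observed:.3f}, selected columns={cols}, p={p_value:.4f}")
```

Because the selection step is repeated under each permuted labeling, the null distribution reflects the optimism introduced by choosing the best-separating ratios, which is the design consideration the permutation scheme above is meant to address.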
Results: Simulation results show that SelEnergyPerm selects small, independent sets of log ratios that capture strong associations across a range of scenarios with low- and high-dimensional feature spaces. Our simulations also demonstrate that SelEnergyPerm consistently detects associations in synthetic data with sparse or dense association signals and rejects them when no signal is present. We demonstrate the novel benefits of our method in four case studies using publicly available 16S rRNA and whole-genome sequencing datasets.
Conclusions: Tools for analyzing complex, high-dimensional metagenomic datasets with sparse association signals using robust PLR have so far been insufficiently developed. We propose SelEnergyPerm, a novel framework for discovering phenotype-associated metagenomic log-ratio signatures that characterize and help explain alterations in microbial community structure. SelEnergyPerm is implemented in R and is available at https://github.com/andrew84830813/selEnergyPermR.