The effect of microbes on their human host is often mediated through changes in metabolite concentrations. As such, multiple tools have been proposed to predict metabolite profiles from microbial taxa frequencies, assuming a direct relation between gut microbiome composition and blood metabolite concentrations. However, the microbiome-metabolite relation may depend on host demographics or condition.
We show that the relation between microbiome and metabolites is best predicted at the log-concentration level. We further develop LOCATE (Latent Of miCrobiome And meTabolites rElations), a machine learning (ML) tool based on a latent representation, which predicts the log-normalized metabolite composition from the log-normalized microbiome composition. LOCATE has a higher overall accuracy than all current state-of-the-art predictors on both 16S rRNA gene and shotgun gene sequencing data.
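To make the idea of predicting at the log level concrete, the following is a minimal sketch, not the LOCATE model itself. It assumes a particular log-normalization (relative abundance, log10 with a pseudocount, then per-feature z-scoring) and swaps in a plain ridge regression for the latent-representation model. The function names `log_normalize` and `fit_ridge`, the pseudocount, the regularization strength, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def log_normalize(values, pseudocount=1e-6):
    """Assumed normalization: relative abundance -> log10 -> per-feature z-score."""
    values = np.asarray(values, dtype=float)
    # Convert each sample (row) to relative abundances.
    rel = values / values.sum(axis=1, keepdims=True)
    # The pseudocount avoids log(0) for absent taxa or metabolites.
    logged = np.log10(rel + pseudocount)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    return (logged - mean) / std

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression, a stand-in for the latent model."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

# Synthetic placeholder data: 40 samples, 5 taxa, 3 metabolites.
rng = np.random.default_rng(0)
microbiome_counts = rng.integers(1, 1000, size=(40, 5))
metabolite_conc = rng.lognormal(size=(40, 3))

X = log_normalize(microbiome_counts)   # log-normalized microbiome
Y = log_normalize(metabolite_conc)     # log-normalized metabolites
W = fit_ridge(X, Y)                    # weights mapping taxa to metabolites
Y_pred = X @ W                         # predicted log-normalized metabolites
```

Because both inputs and targets are on a log scale, errors are measured as fold-changes rather than absolute concentration differences, which keeps highly abundant metabolites from dominating the fit.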
The accuracy of LOCATE and all other predictors decreases significantly when training on one dataset and testing on a different dataset, or on a different condition within the same dataset, especially for 16S rRNA gene sequencing-based data.
We propose an intermediate representation between the microbiome and the metabolite concentrations and show that this representation can be used to predict the host phenotype better than either the microbiome or the metabolome. This representation is strongly correlated with host demographics, including age, gender and diet, and can be used to improve ML predictions of host phenotypes, compared with using either the microbiome or the metabolome alone, when a large microbiome sample is combined with a small number of metabolome samples (~50).