Background: The relationship between the composition of microbial communities and host phenotypes is an important research topic. Microbiome association research spans multiple domains, such as human disease, diet and medicine. Statistical methods for testing microbiome-phenotype associations have recently been studied for their ability to handle longitudinal microbiome data. However, existing methods fail to detect sparse association signals in longitudinal microbiome data.
Methods: We developed aGEEMiHC, a data-driven adaptive microbiome higher criticism analysis based on generalized estimating equations (GEE), to detect sparse microbial association signals in longitudinal microbiome data. aGEEMiHC adopts the GEE framework, which accounts for the correlation among repeated observations from the same cluster (individual) in longitudinal data, and it integrates multiple GEE-based microbiome higher criticism analyses, each fitted with a different working correlation structure. The proposed method is therefore robust to diverse correlation structures in longitudinal data.
Results: The proposed method performs stably across diverse association patterns that differ in sparsity level and phylogenetic relevance. Extensive simulation experiments demonstrate that it correctly controls the type I error rate and achieves superior statistical power. In the simulations, we applied aGEEMiHC to longitudinal microbiome data with various types of host phenotypes to demonstrate its stability. We also applied aGEEMiHC to real longitudinal microbiome data and found a significant association between the gut microbiome and Crohn's disease.
Conclusions: aGEEMiHC is a statistical method for testing sparse microbial association signals in longitudinal microbiome data, and it can be applied when the true underlying correlations among observations from the same cluster are unknown. The method also ranks the significant factors associated with the host phenotype, providing potential biomarkers. The R package GEEMiHC is available at https://github.com/xpjiang-ccnu/GEEMiHC.
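As an illustration of the two ingredients named in the Methods, the following Python sketch computes a higher criticism statistic from a set of per-taxon p-values and then picks the smallest p-value across candidate working correlation structures. It is a minimal sketch and not the GEEMiHC implementation: the function names, the `alpha0` truncation, the example p-values and the minimum-p combination rule are all assumptions made for illustration, since the abstract does not specify how aGEEMiHC combines the individual tests.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Higher criticism statistic (Donoho-Jin form) for a list of p-values.

    Only the smallest alpha0 fraction of the ordered p-values is scanned.
    Hypothetical helper for illustration; not part of the GEEMiHC package.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    ordered = sorted(p_values)
    limit = max(1, int(alpha0 * m))
    best = -math.inf
    for i, p in enumerate(ordered[:limit], start=1):
        if not 0.0 < p < 1.0:
            continue  # skip degenerate p-values to avoid division by zero
        # Standardized gap between the empirical and the uniform null CDF.
        z = math.sqrt(m) * (i / m - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


def min_p_combination(p_by_structure):
    """Return the (structure, p-value) pair with the smallest p-value.

    Assumed combination rule; the minimum is a test statistic, not a
    calibrated p-value.
    """
    name = min(p_by_structure, key=p_by_structure.get)
    return name, p_by_structure[name]


# Made-up p-values for 20 taxa: an evenly spread null-like set, and the
# same set with two very small p-values standing in for a sparse signal.
m = 20
null_like = [i / (m + 1) for i in range(1, m + 1)]
sparse_signal = [1e-6, 1e-5] + null_like[2:]

hc_null = higher_criticism(null_like)
hc_sparse = higher_criticism(sparse_signal)

# Made-up p-values from tests run under three working correlation structures.
chosen = min_p_combination(
    {"independence": 0.04, "exchangeable": 0.01, "AR(1)": 0.20}
)
```

In this toy input the sparse-signal set yields a much larger statistic than the null-like set, which is the behaviour that makes higher criticism suited to sparse signals. The minimum p-value returned by `min_p_combination` is not itself a valid p-value and would still need calibration, for example by permutation.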