Background: In high-throughput sequencing studies, sequencing depth, which quantifies the total number of reads, varies across samples. Unequal sequencing depth can obscure true biological signals of interest and prevent direct comparisons between samples. To remove variability due to differential sequencing depth, taxa counts are usually normalized before downstream analysis. However, most existing normalization methods scale counts using size factors that are sample-specific but not taxon-specific, which can result in over- or under-correction for some taxa.
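As an illustration of the sample-specific scaling described above, the following is a minimal sketch of total-sum scaling, one common size-factor approach. It is not taken from the TaxaNorm package; the function name `total_sum_scale` and the toy count matrix are hypothetical. Every taxon in a sample is divided by the same factor, which is the limitation the Background points to.

```python
import numpy as np

def total_sum_scale(counts):
    """Scale each sample (row) by its own sequencing depth.

    Hypothetical helper for illustration only. Assumes every sample
    has a nonzero total read count.
    """
    counts = np.asarray(counts, dtype=float)
    # One size factor per sample: the total number of reads in that row.
    depth = counts.sum(axis=1, keepdims=True)
    # Rescale to the median depth so values stay on a count-like scale.
    return counts / depth * np.median(depth)

# Hypothetical toy data: 3 samples (rows) x 2 taxa (columns),
# identical composition but different sequencing depths.
toy_counts = np.array([[10, 90], [20, 180], [5, 45]])
scaled = total_sum_scale(toy_counts)
print(scaled)
```

Because the factor depends only on the sample, any taxon whose counts respond to depth differently from the others would be over- or under-corrected by this kind of scaling.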
Results: We developed TaxaNorm, a novel normalization method based on a zero-inflated negative binomial model. The method assumes that the effects of sequencing depth on the mean and dispersion vary across taxa. Incorporating a zero-inflation component lets the model better capture the nature of microbiome data. We also propose two corresponding diagnostic tests for the varying sequencing depth effect to validate this assumption. Simulations show that TaxaNorm achieves similar or higher power than existing methods while significantly lowering false discoveries in downstream analysis. Application to a real dataset also shows improved performance in correcting technical bias.
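To make the model family concrete, the sketch below evaluates a generic zero-inflated negative binomial probability mass function and lets one taxon's mean depend on sequencing depth through its own coefficients. The log-linear link, the parameter names (`beta0`, `beta1`, `theta`, `pi`), and the numeric values are assumptions chosen for illustration; they are not the exact parameterization used by TaxaNorm.

```python
import math

def zinb_pmf(y, mu, theta, pi):
    """P(Y = y) for a zero-inflated negative binomial.

    mu: NB mean, theta: NB size (inverse dispersion),
    pi: probability of an extra (structural) zero.
    """
    log_nb = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
              + theta * math.log(theta / (theta + mu))
              + y * math.log(mu / (theta + mu)))
    nb = math.exp(log_nb)
    # Zeros come from either the point mass or the NB component.
    return (pi if y == 0 else 0.0) + (1.0 - pi) * nb

def taxon_mean(depth, beta0, beta1):
    """Assumed log-linear depth effect: log(mu) = beta0 + beta1 * log(depth)."""
    return math.exp(beta0 + beta1 * math.log(depth))

# Hypothetical coefficients for a single taxon.
mu = taxon_mean(depth=1000, beta0=-5.0, beta1=1.0)
theta, pi = 2.0, 0.3
p_zero = zinb_pmf(0, mu, theta, pi)
total = sum(zinb_pmf(y, mu, theta, pi) for y in range(500))
print(mu, p_zero, total)
```

Giving each taxon its own `beta1` is what would allow the depth effect to differ across taxa, in contrast to a single per-sample size factor.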
Conclusion: TaxaNorm corrects both sample- and taxon-specific bias by introducing an appropriate regression framework for microbiome data, which aids data interpretation and visualization. The 'TaxaNorm' R package is freely available through the CRAN repository (https://CRAN.R-project.org/package=TaxaNorm), and the source code can be downloaded at https://github.com/wangziyue57/TaxaNorm.