MicrobiotaProcess: A comprehensive R package for managing and analyzing microbiome and other ecological data within the tidy framework

The data output from microbiome research is growing at an accelerating rate, yet mining the data quickly and efficiently remains difficult. There is still a lack of effective data structure to represent and manage data, as well as flexible and composable analysis methods. In response to these two issues, we designed and developed the MicrobiotaProcess package. It provides a comprehensive data structure, MPSE , to better integrate the primary and intermediate data, which improves the integration and exploration of the downstream data. Around this data structure, the downstream analysis tasks are decomposed and a set of functions are designed under the tidy framework. These functions independently perform simple tasks and can be combined to perform complex tasks. This gives users the ability to explore data, conduct personalized analysis and develop analysis workflows. Moreover, MicrobiotaProcess can interoperate with other packages in the R community, which further expands its analytical capabilities. This article demonstrates the MicrobiotaProcess for analyzing microbiome data as well as other ecological data through several examples. It connects upstream data, provides flexible downstream analysis components, and provides visualization methods to assist in presenting and interpreting results.


Introduction
A wide array of important roles of the microbiota in diverse environments have been investigated and explored substantially 1,2 , largely due to the development of high-throughput sequencing technologies and bioinformatics. During the last decades, many bioinformatics algorithms and tools for the exploration and the analysis of microbiome data have been built in the scientific community, such as qiime2 3 , dada2 4 , usearch 5 , mothur 6 , MetaPhlAn 7,8 . These pipelines or tools conduce initial bioinformatics analysis of microbiome data, but the relevant data to a microbiome experiment became heterogeneous consisting of feature-oriented (operational taxonomic unit or exact sequence variants) data frames, sample-oriented data frames, the representative sequences, and the phylogenetic tree of the sequences after the initial analysis, which brings new challenges for the downstream data analysis and reproducibility. There are often many important links (such as the features of the feature table and the nodes of the phylogenetic tree, feature table and metadata , etc.) among the diverse data that need to be preserved throughout an analysis. If a researcher failed to integrate these data comprehensively, the information might be lost and thus generate errors. In addition, many intermediate steps that may confuse may be repeatedly performed in the downstream statistical analysis. For instance, the dissimilarity indices (such as Bray-Curtis, (Un)Weighted Unifrac and Jaccard, etc.) for the communities of microbes might need to be calculated and reused in hierarchical cluster analysis, PCoA (Principal Coordinate Analysis), and PERMANOVA (Permutational multivariate analysis of variance), etc. If the intermediate data can be effectively integrated and stored, it will improve the efficiency of analysis, enhance reproducibility and avoid errors 9 . The R programming language has become one of the most popular tools for biomedical data analysis 10 . Some efforts have been made to build common representations and infrastructures for complex, highly interdependent data sets in R 10,11 . For example, SummarizedExperiment 12 class is widely used to integrate matrix-like objects of feature abundance, the normalized feature abundance, a sample-and feature-oriented metadata data frames as a standardized data structure across many Bioconductor 13 packages. To integrate the phylogenetic tree structure and heterogeneous associated data, we defined the treedata 14,15 class, which has also been widely used in several packages, such as tidytree 14 , treeio 15 , and ggtree 16 . However, these data structures did not cover all the heterogeneous data of a microbiome experiment, and the existing tools [17][18][19][20][21][22] for the downstream statistical analysis of microbiome also did not well integrate the primary heterogeneous data and the useful intermediate data. For instance, the phyloseq 17 class is used in the phyloseq 17 , microViz 19 , and MicrobiomeAnalyst 22 packages, but the data structure can only store the primary input datasets, it cannot integrate the normalized data and the intermediate data such as the alpha diversity, dissimilarity indices, and the result of differential analysis, etc. The animalcules package was developed using the MultiAssayExperiment 23 class, and it cannot integrate intermediate data and the phylogenetic tree which is often needed in the calculation of specified dissimilarity ((Un)Weighed UniFrac). These data structures for the microbiome cannot take into account the diversified needs of downstream analysis, which makes it more convenient for some specific needs, while other needs may be troublesome. Moreover, the differences in these data structures are too large, which restricts downstream integrated analysis.
Downstream statistical, visual, and functional analyses of microbiome data are complex since the appropriate analysis workflow often needs to be explored and adjusted according to the research design. Many methods and parameters, such as the methods of dissimilarity, the clustering, and the ordination, must be considered properly during analysis. A simple, unified, and human-readable analysis workflow will improve the efficiency of data analysis, avoid errors, and enable researchers to focus more on biological problems 9 . Recently, a concise data structure (tidy data) 23 has become popular in the R 10 data analysis community with tidy data tools, such as dplyr 24 , tidyr 24 , and ggplot2 25 . This tidy concept has been applied to different disciplines, including genomic, transcriptomic, and phylogenetic data analyses. The plyranges 26 package provided a dplyr-like interface for interacting with some of the most common data structures containing genomic coordinates including Ranges and GenomicRanges 27 . The tidybulk 28 package introduced a tidy transcriptomic data structure paradigm and analysis grammar for the bulk transcriptional analyses to bridge transcriptional data analysis with the tidyverse ecosystem 24 . The clusterProfiler 29 package developed by us introduced a tidy interface to facilitate users to operate and visualize enrichment results. We also developed the ggtree 16 and ggtreeExtra 30 packages to provide a general framework for the annotation and visualization of phylogenetic trees with heterogeneous associated data based on a concise data structure, the treedata 14,15 class. Through the human-readable data structure and analysis grammar, these resources make it easier for the community to develop modular manipulation, visualization, and analysis methods, decrease the learning curve for users, and facilitate reproducibility for the related studies 9,23,24 . However, this principle of tidiness has not been implemented in the existing tools [17][18][19][20][21][22] for microbiome data analysis. This greatly hinders the flexibility and ease of use of downstream data analysis in this discipline.
To fill these gaps, we developed the MicrobiotaProcess package. We defined the MPSE class for storing the microbiome experiment or related ecological dataset and the related intermediate data of downstream analysis. The MPSE class defined in MicrobiotaProcess inherits both the SummarizedExperiment 12 class and the treedata 14,15 class, which are popular and widely used in the Bioconductor 13 ecosystem. It takes merits from Bioconductor packages that are based on these two data structures. The importance of data structure is that downstream analysis is based on it. A good data structure helps unify downstream analysis. When operating a certain data, the associated data in the data structure can be updated synchronously, which can reduce many errors caused by improper operations. In order to make downstream data analysis modular and easy to use, we introduce a user-friendly grammar (i.e., tidy interface) to process, analyze and visualize micriobiome data stored in the MPSE data structure. This framework allows users to build humanreadable and flexible analysis workflows. We believe it can remove a major obstacle for scientists to explore and analyze the microbiome and other ecological data.

A container improving the exploration of the downstream data of microbiome
MicrobiotaProcess introduces the MPSE class, which inherits the SummarizedExepriment 12 , treedata 14,15 , and XStringSet 31 classes, for storing microbiome or other related assay data, metadata, and phylogenetic tree data (Fig.1A). The primary matrices data, such as count matrices and standardized matrices, are stored in the assays, where rows represent features (such as ASVs/OTUs and genes, etc.) and columns represent samples. The samples characteristics are stored in the colData. To store the phylogenetic tree and associated data, we defined the otutree slot, which inherits a treedata 14,15 class defined in tidytree 14 and treeio 15 . The phylogenetic tree file that contained the evolutionary statistics inferences and other associated data can be parsed using treeio 15 and stored in the otutree slot. We also defined a taxatree slot which is also a treedata 14,15 class to store hierarchical relationships among the taxa and their lineages. The taxatree and otutree components can be processed and visualized directly via the in-house developed ggtree package suite (tidytree 14 , treeio 15 , ggtree 16 , and ggtreeExtra 30 ). Moreover, the results of the downstream analysis can also be stored in the MPSE class, which is also different from existing related packages. For example, we used mp_rrarefy to rarefy the primary matrices data and added the results into assays slot, then used mp_cal_alpha to calculate the alpha diversity of the samples. The results of alpha diversity were added into colData slot, which can be visualized using mp_plot_alpha further. Each component of the MPSE class can be extracted by the self-explanatory function names, prefix with 'mp_extract_' and followed by a specific explanatory term (see also Accessors to fetch internal data in the Methods session) (Fig.1A). In the following example, we used mp_extract_assays to extract the assays component and used mp_extract_sample to extract the colData component which contained sample metadata.

library(SummarizedExperiment
rare.tb <-mouse.time.mpse %>% mp_extract_assays(.abundance = RareAbunda nce) sample.da <-mouse.time.mpse %>% mp_extract_sample() Through the MPSE class, all related data and results relevant to a microbiome experiment can be stored in a single instance, which enables to improve the exploration of the downstream data, facilitate data sharing and enhance reproducibility.

Bridging the upstream analysis tools and MPSE constructor
The output files of the upstream analysis of the microbiome are various and usually not humanfriendly. How to parse the output files is often the first obstacle that researchers should face. The phyloseq 17 had provided some functions to parse the output of common upstream analysis tools for 16s rRNA dataset, but not for metagenomics or the output of commonly used tools (qiime2 3 , dada2 4 ).
To address the need for multiple types of microbiome data (16s rRNA, metagenomics, and other related ecological data) as well as output obtained from commonly used tools, MicrobiotaProcess provides several functions to parse the output of the upstream analysis of microbiome ( Fig.2A). To enhance the interoperability with other R packages, MicrobiotaProcess also provides as.MPSE to convert common S4 objects defined in the Bioconductor 13 ecosystem, such as phyloseq 17 , biom 32 , SummarizedExperiement 12 , and TreeSummarizedExperiment 18 . All of the parse functions and the convert functions output the same object type, MPSE, making them consistent and robust for downstream manipulation. In addition, An MPSE class also can be constructed by a call of the function of the same name with required parameters (see also parser functions and the MPSE constructor in the Methods session). Here, we used as.MPSE to convert a biom 32 object which is the output of read_biom of biomformat 32 to an MPSE object, and used the MPSE function to build an MPSE object from scratch.

The unified tidy framework analysis grammar
To facilitate data manipulation and exploration of the microbiome or other ecological data based on the MPSE, MicrobiotaProcess defined a tidy-like format output for the MPSE class (Fig.1B). The assays of MPSE are converted to a tidy format, in which three columns (feature identifiers, sample, and feature abundance) are fixed, each row represents the abundance of each feature of each sample. Optional columns include the taxonomy annotation and sample metadata information or newly calculated information. MicrobiotaProcess provides a unified analysis grammar to express all the steps of a typical microbiome community analysis workflow using self-explanatory function names, which are composed of 'mp_', a specific verb, and one or two explanatory terms ( Fig.2 C1 and C2). Moreover, MicrobiotaProcess extends the dplyr 24 verbs to support processing data stored in an MPSE object (Fig.2B). Following the concept of tidiness, these verbs provide robust and standardized operations for data analysis and transformation, and can be assembled into a workflow using the pipe operator (%>%). This enables users to effectively analyze ecological data such as metagenomic data and easily build reproducible and human-readable pipelines.
To illustrate the flexibility and versatility of MicrobiotaProcess, we reanalyzed three public datasets as examples. Each example addresses a different problem and has a different focus on demonstrating the features of MicrobiotaProcess. The details of the analysis and the corresponding code are presented in the Supplementary file.

Example 1: Characterization of the gut microbiota using 16s rRNA data
The 16s rRNA dataset of pediatric stool was obtained from the iHMP (Integrative Human Microbiome Project Consortium) 33 and downloaded from the MicrobiomeAnalyst 22 website, which contained 23 CD (Crohn's Disease) and 20 control samples. We covered the calculation of the alpha and beta diversity indices, the analysis of the diversity measures with the related test method (subsequent Mann−Whitney U test and Permutational Multivariate Analysis of Variance), and visualization of patterns in the microbiota community using MicrobiotaProcess. The result of alpha diversity showed that the richness of the CD's microbiota community was significantly lower than the control's ( Fig.3A and 3B), the evenness of the microbiota species between the CD and control group did not have significant difference but the median of the diversity of the CD group was higher than the control group (Fig.3B). Then the PCoA (principal coordinate decomposition) and the adonis (Permutational Multivariate Analysis of Variance) results revealed that the microbiota communities between the CD and control group had significant difference (Fig.3C). We also found the Proteobacteria might be more abundant in CD by combining the hierarchical relationships and the phyla abundance presented as the horizontal stacked bar charts (Fig.3D). To further investigate the relationship between the microbial communities and CD, we used mp_cal_abundance and mp_plot_abundance to display the abundance of class level, this result showed that the abundance of Gammaproteobacteria belonged to Proteobacteria might be enriched in the CD group ( Fig.3E and 3F). Next, we used mp_diff_analysis to identify the different OTUs, the result also revealed that the OTUs significantly enriched in the CD group mainly belong to Proteobacteria, whereas the OTUs significantly enriched in the Control group mainly belonged to Firmicutes (Fig.3G). Through the tidy-like unified grammar of MicrobiotaProcess, the alpha and beta diversity of the stool microbial communities between the Crohn's Disease patients and control can be explored rapidly. The differential microbial organisms between the Crohn's Disease patients and the control groups can also be identified and visualized intuitively.

Example 2: Differential abundance analysis and functional characterization of metagenomic data
The metagenomics taxa abundance and KEGG gene abundance datasets were obtained from another pediatric Crohn's disease study 34 . We used the mp_diff_analysis function to identify the differential species and genes, followed by KEGG pathway enrichment analysis of differential genes using clusterProfiler 29 and MicrbiomeProfiler 35 . We found the Clostridium symbiosum and Faecalibacterium prausnitzii were nominally (uncorrected p-value <=0.05, fdr>0.05) significantenriched in the CD group compared to the control group ( Fig.S20 and Fig.S21). Interestingly, the KEGG enrichment results showed that the KEGG pathways of the CD stool group were significantly enriched in the Biosynthesis of amino acids and Glycine, serine, and threonine metabolism, and Pyruvate metabolism (Fig.4). This result is consistent with a recent study, which found that Crohn's Disease microbiomes had an increased potential to synthesize amino acids 36 . However, the differential species result is inconsistent with the findings in example 1. This may be due to the differences of samples and sequencing platforms between the studies, and the relationships between the different species and the pediatric CD need to be further investigated. In this example, we use MicrobiotaProcess for differential abundance analysis, and then use the in-house developed packages, clusterProfiler 29 and MicrobiomeProfiler 35 , for functional enrichment analysis. The combination of these packages facilitates the functional interpretation of metagenomic data and the discovery of functional signatures that leads to disease progression.

Example 3: Analyzing environmental-trait interactions in ecological communities
The dataset is from a study on the landscape ecology of mosquito communities in North Carolina for determining the risk for vector-borne disease 37 . Three agricultural/mixed-use landscapes (regions) were selected. Within each region, three 200m transects that spanned a field-forest habitat gradient were identified, and five traps were set for each transect to collect mosquitoes 37 . Here, we used MicrobiotaProcess to calculate alpha diversity and visualize the patterns of the community composition to evaluate whether the mosquito community structure was influenced by spatial variability. We found the richness of the mosquito species was significantly associated with the habitat with mean richness increasing from field to forest (Fig.5A). The partial constrained correspondence analysis (pCCA) result showed that the mosquito communities were significantly associated with some environment landscape variables, such as grassland (Grassland), evergreen tree canopy (EvergreenForest), and deciduous tree canopy (DeciduousForest) (Fig.5B). The mosquito species showed a clear preference for forested or field habitat (Fig.S23). These results were consistent with the findings of the original study 37 . This example shows that MicrobiotaProcess can be used not only for analyzing microecological data but also for general ecological data, helping us to discover the impact of environmental factors on community structure and functions.

Example 4: Interact with existing tools
Since the MPSE class inherits SummarizedExperiment 12 , the methods developed for SummarizedExperiment 12 can also be applied to an MPSE object. For example, the tidybulk 28 package provides a test_differential_abundance to perform differential transcriptome analysis using edgeR 38 quasi-likelihood (QLF), edgeR likelihood-ratio (LRT), limma-voom, limma-voom-withquality-weights, or DESeq2 39 . These methods can be directly applied to an MPSE object within the tidy framework for differential abundance analysis and indeed they have been reported for analyzing microbiome data 40,41 . Here, we re-analyze example 1 by using the test_differentical_abundance function to perform the differential analysis based on the edgeR quasi-likelihood 38 . Then the result was extracted by the mp_extract_tree function and can be manipulated and visualized using the ggtree package suite. Compared with the results of example 1, the number of the differential OTUs identified by edgeR 38 is more (Fig.S16) and the main conclusions are consistent. We found the abundance of OTUs belonged to Bifidobacterium, Faecalibacterium, Roseburia, and Coprobacillus were also significantly decreased in the CD group compared to the control group and the abundance of several OTUs belonged to Escherichia, Klebsiella and Haemophilus, which belonged to Gammaproteobacteria, were significantly enriched in the CD group (Fig.6). MicrobiotaProcess is able to interoperate with existing tools, including tidybulk, dplyr, ggtree package suite, clusterProfiler and MicrobiomeProfiler that we have demonstrated earlier. This greatly expands the functionalities of MicrobiotaProcess. It can connect upstream data, unify downstream analysis, and easily integrate into existing analysis pipelines.

Conclusion and Discussion
Microbiomics technologies have become increasingly popular methods for exploring the relationship between the microbial communities and the host or the environment (e.g., intestine, skin, soil, and the ocean) 1,33,[42][43][44][45][46] . Data analysis is still one of the bottlenecks in this field, especially in downstream analysis, the integration of heterogeneous data and the need for personalized analysis have brought new challenges. MicrobiotaProcess provides a comprehensive data structure, the MPSE class, to store heterogeneous microbiome data, including feature (e.g., species, OTUs, etc.) abundance, sample data (e.g., clinical information), and feature data (e.g., taxonomic relationships, functional profiles, etc.). Moreover, simultaneously with the analysis, intermediate results can be stored in an MPSE object. For example, the normalized or rarefied data, the diverse dissimilarity metrics, and the results of the differential analysis can be integrated with the microbiome data into an MPSE object. These intermediate results can be further processed, visualized, and re-used. This can prevent repeated calculation, facilitate data sharing, and enhance analytic reproducibility.
MicrobiotaProcess implements a set of functions within the tidy principles to unveil the characteristics and biological function of microbial communities in a diverse environment. The returned values of these functions are predictable and consistent. Each function is designed to accomplish a simple task and these functions can be combined to accomplish complex tasks. In this way, it has better flexibility, and the development of workflow through function series can meet most of the personalized analysis needs. Considering the interoperation between MicriobiotaProcess and other tools, some functions developed by other software can be applied to MPSE objects, making the downstream analysis function more comprehensive and having wider application scenarios. For example, differential analysis using tidybulk 28 and functional analysis using clusterProfiler 29 can be integrated into the analysis of microbiome and other ecological data through MPSE objects.
MicrobiotaProcess provides several functions to parse outputs obtained from upstream tools and allows converting commonly used objects to MPSE objects ( Fig.2A). This enables the functionalities provided by MicrobiotaProcess to be well connected to upstream analysis. The downstream analysis results need to be interpreted through visualization. MicrobiotaProcess provides several functions for visualization. More importantly, the tidy interface extracts relevant data and connects to the visualization functions developed by the R community, allowing users to have more options for visual exploration and presentation of data. For example, differential analysis results can be extracted by mp_extract_tree, followed by a visual exploration of the result via the ggtree package suit (Fig.3G and Fig.6). Although MicrobiotaProcess was mainly designed for analyzing microbiome data, some of its methods, such as alpha, beta diversity, and (partial) constrained correspondence analysis (pCCA or CCA) and redundancy analysis (RDA), etc. 48 , are also applicable to other ecological data (Fig.5). We believe that MicrobiotaProcess will be a valuable resource to support the analysis of the microbiome and other ecological data. This package is freely available at https://www.bioconductor.org/packages/MicrobiotaProcess. The development version of MicrobiotaProcess is hosted on GitHub (https://github.com/YuLab-SMU/MicrobiotaProcess).

The MPSE class
To better store the input data (abundance data, sequence data, phylogenetic tree data) and the result of downstream analysis, MPSE class was implemented in the MicrobiotaProcess package. This class inherits the SummarizedExepriment 12 class. In which, the assays slot was designed to store the rectangular abundance matrices of features (microbiota profiling or function profiling) for microbiome experiment results. The colData slot was designed to store the meta-data of features and results about the features generated in the downstream analysis. Compared to the SummarizedExperiment 12 class, MPSE introduces the following additional slots, 1) the otutree slot is a treedata object and was designed to store the phylogenetic tree and the associated data, including the results about the features in the downstream analysis and the evolutionary statistics inferred by the software building the tree; 2) the taxatree slot is also a treedata 14,15 object and was designed to store hierarchical taxonomy relationships and the associated data, such as the relative abundances and the results from different analysis; 3) the refseq slot is a XStringSet 31 object and was designed to store the reference sequences, with names corresponding to the rows of the assays slot. In addition, an internal attribute internal_attr (a list object) was introduced to store the raw results of pca (Principal Components Analysis), pcoa (Principal Coordinate Analysis), and hierarchical cluster analysis.

Overview of the design of the MicrobiotaProcess package
The overall design of the MicrobiotaProcess package was illustrated in Fig.2. It presents multiple parser functions to read the outputs of upstream analysis tools, such as qiime2 3 or qiime 49 , dada2 4 , and MetaPhlAn 7,8 . After parsing, the abundance of microbiota (or other features), the metadata of sample (optional), and the phylogenetic tree information (optional) are extracted from the outputs and stored as an MPSE object. Other objects designed to store microbiome data, such as phyloseq 17 , TreeSummarizedExperiment 18 , and SummarizedExperiment 12 can be converted to an MPSE object.
This enables MPSE to serve as a standardized entry point for downstream analysis while being compatible with existing analysis software and pipelines. MicrobiotaProcess provides a wide variety of microbiome analysis procedures to work with MPSE objects. These procedures were designed to follow the tidy data principles and thus are human-friendly, consistent, and composable to solve complicated problems. The results of the analysis can be returned in three modes via the action argument. The (intermediate) result can be stored in the MPSE object if the action is 'add'. If the action is 'only', it returns a tidy data frame with non-redundant sample or OTU (features) information with the result. While the action is 'get' returns the analysis result only.
MicrobiotaProcess also extends the dplyr 24 package to offer dplyr-verbs for data operation.

Parser functions and the MPSE constructor
The MicrobiotaProcess package provides mp_import_qiime2 to load the output of qiime2 3 , which is a common tool for the analysis of amplicon data. The feature abundance table of the output (i.e., a qza format file) of qiime2 is required while the taxonomy information, phylogenetic tree, and representative sequences are optional. The mp_import_dada2 function was designed to parse the output of dada2 4 . The output of removeBimeraDenovo of dada2 4 is required while the taxonomy information, representation phylogenetic tree, and the representative sequences are optional. The mp_import_metaphlan function was designed to parse the output of MetaPhlAn3 50 , which is a common tool for profiling the composition of microbial communities. The microbiota abundance output of MetaPhlAn3 is required, while the phylogenetic tree and metadata are also optional. In addition, An MPSE object can be constructed from scratch by calling the MPSE function with the following key parameters: 1) assays (required): A list or SimpleList of matrix-like objects or a matrix-like object (rows represent the features and columns represent the samples) providing abundance data for all samples.
2) colData (optional): A DataFrame object storing the characteristics of the samples.
3) otutree (optional): A treedata object storing a phylogenetic tree with/without associated data. Any tree file formats as well as commonly used software outputs that can be parsed by treeio 15 are supported. 4) taxatree (optional): A treedata object storing the taxonomy information (or other hierarchical data). MicrobiotaProcess provides the convert_to_treedata function to convert taxonomy data (a data.frame object) to a treedata object. 5) refseq (optional): A XStingSet object storing the representative sequences. Both nucleic acid sequences and amino acid sequences are supported via the readDNAStringSet or readAAStringSet functions provided by the Biostrings packages.

Process MPSE object using dplyr-verbs
To facilitate data manipulation and exploration of the microbiome data, MicrobiotaProcess defined a tidy-like formatted output for the MPSE object and extended a subset of the dplyr-verbs to support the MPSE object. The extended dplyr-verbs include: 1) filter function: subset MPSE class, retaining the data that satisfy the provided conditions. 2) select function: select the data according to the provided variables.
3) group_by function: return a grouped tbl_df-like data frame according to groups defined by the provided variables, then some data operations can be done on the groups. 4) mutate function: create new columns according to the provided variables and conditions. 5) left_join function: add columns from 'y' (a data.frame object) to 'x' (an MPSE object), by matching all rows of 'x' based on the keys. The keys only should be one or all of the Sample or OTU.
6) rename function: rename the column names of an MPSE object, except the OTU, Sample, and Abundance column names which cannot be renamed.

Data preprocessing
The microbiome features (OTU or ASV) with very low abundance and rare occurrence (exits in few samples) are difficult to distinguish from the sequencing error or other experimental technical error and are usually uninformative. It is better to improve the statistical power of multiple testing in the downstream analysis by filtering these features. MicrobiotaProcess presents mp_filter_taxa to filter the features based on their abundance (default Abundance count) and sample prevalence. By default, this function will screen out the features with zeros counts in 0.05% sample prevalence and users can reset the criteria via the parameters of min.abun (minimum abundance in a sample) and min.prop (the minimum sample prevalence). To make the data more meaningful for the downstream analysis after filtering, MicrobiotaProcess provides the mp_decostand and mp_rrarefy functions for the standardization of community data by inheriting the decostand and rrarefy functions from vegan 48 . The mp_rrarefy function allows users to estimate expected diversity (e.g., taxonomic richness) for a reduced sampling size, while the mp_decostand function provides several standardization methods for community data, such as total (divide by total abundance of each sample (relative abundance)), max (divide by the max abundance of the feature in all samples), frequency (divide by total abundance of each sample and multiply the number of non-zero features), hellinger (square root for the result of total) 51 , log (logarithmic transformation log_b(x > 0) + 1) 52 . These functions are developed to follow the tidiness concept and the results will be added to the assays slot of the MPSE object automatically to enhance reproducibility and reuse in the follow-up analysis.

Alpha diversity
Alpha diversity measures the species richness and evenness within a community. MicrobiotaProcess provides the mp_cal_alpha function to calculate the community diversity. There are six commonly used methods to calculate alpha diversity, including Observe (calculates the total species per sample), Chao1, ACE (estimate species richness by considering the low abundance of species). Pielou (measures the species evenness), Shannon and Simpson (take both of the richness and evenness of species into account). By default, the mp_cal_alpha will rarefy the abundance before calculating the alpha diversity. Users can specify the force argument to TRUE to calculate the diversity directly without performing rarefaction. This will be useful for taxonomic profiling data since they are usually relative abundance and cannot be rarefied. By default, the results are added to the colData slot that stored the metadata information of samples and returns an updated MPSE object. The mp_plot_alpha function is designed to visualize alpha diversity and it allows comparing different communities that were specified via the .group parameter.

Taxonomy composition
To compare the difference of OTUs (features) composition between different communities, MicrobiotaProcess provides the mp_cal_upset and mp_cal_venn to calculate the conjunct OTUs (features) or specific OTUs (features) of different groups (specified by the .group parameter). The result can be visualized by the mp_plot_upset and mp_plot_venn functions respectively. The microbiome OTUs (features) are often annotated to different taxonomy levels in upstream analyses, and to survey the species profile of different samples, it is often necessary to calculate abundances of different taxonomy levels. MicrobiotaProcess implements the mp_cal_abundance function to calculate the abundances of all taxonomy levels. Similar to the calculation of alpha diversity, the mp_cal_abundance will rarefy the raw abundance and then calculate the relative abundance by default. Users can specify the force argument to TRUE to disable rarefaction. And the relative argument controls whether to calculate the relative abundance (total is 100% for the same taxonomy level). The results will be added to the associated data of the taxatree (treedata object) slot by default. The abundance of a selected taxonomy level can be extracted by the mp_extract_abundance function with the taxa.class parameter specified. The results of the mp_cal_abundance can be visualized by the mp_plot_abundance function.

Beta diversity
The beta diversity has been applied in a broad sense to measure variation or changes in community composition. It can assess how microbiota composition changes across spatial and temporal scales. Some distance indexes, such as Bray-Curtis index, Jaccard index, UniFrac (weighted or unweighted) index, are useful and popular to measure the degree of community differentiation. These distances can be further subjected to ordination which aims to capture essential information in a lowerdimensional representation and is commonly used to visualize sample dissimilarities. MicrobiotaProcess implements the mp_cal_dist function to compute the common distances (dissimilarity) and provides the mp_plot_dist function to visualize the result. It also provides several commonly-used ordination methods, such as Principal Components Analysis (PCA: mp_cal_pca), Principal Coordinate Analysis (PCoA: mp_cal_pcoa), Nonmetric Multidimensional Scaling (NMDS: mp_cal_nmds), Detrended Correspondence Analysis (DCA: mp_cal_dca), Redundancy Analysis (RDA: mp_cal_rda), and (Constained) Correspondence Analysis (CCA: mp_cal_cca). To fit environmental vectors or factors onto an ordination, this package provides the mp_envfit function to perform this analysis. All the ordination results can be visualized by the mp_plot_ord function. In addition, it also wraps several statistical analyses for the distance matrices, such as permutational multivariate analysis of variance (mp_adonis), analysis of similarities (mp_anosim), multi-response permutation procedure (mp_mrpp), and mantel tests for dissimilarity matrices (mp_mantel). All these functions are developed based on a tidy-like framework. These functions can be assembled into linear workflows with the pipe operator (%>%).

Differential analysis and biomarker discovery
MicrobiotaProcess implements the mp_diff_analysis function for identifying differentially abundant genera as biomarkers based on the tidy-like framework. Similar to LEfSe 53 , there are three steps to perform this analysis. First, all features are tested to determine whether values (e.g., abundance) in different groups of samples are differentially distributed via the Kruskal-Wallis ranksum test (default, other option is oneway.test, glm or glm.nb). Then, the resulting features infringing the null hypothesis are further tested by a second round of test using Wilcoxon rank-sum test (default, other option is t.test, glm or glm.nb) to keep the features that in all pairwise comparisons between the sub-groups are significantly consistent with the group level trend. Finally, the linear discriminant analysis (LDA) or random forest model was built to rank all the features based on the relative difference among different groups. Compare to LEfSe, mp_diff_analysis is more flexible. Not only the test method but also the test value (using generalized fold change 54 by default, other option is comparing the median or mean value of different groups by setting to compare_median or compare_mean) can be set by users. The result is integrated into the taxatree component (default) or rowData component depending on whether the taxonomy is provided or not. The result can be extracted via the mp_extract_tree or mp_extract_feature respectively. Then it can be processed and displayed via treeio 15 , tidytree 14 , ggtree 16 , ggtreeExtra 30 , and ggplot2 ( Fig.3G and Fig.6). To decrease the coding burden, we also developed mp_plot_diff_res to visualize the result of differential analysis (Fig.S13).

Accessors to fetch internal data
The MPSE object is composed of several objects to store different data including primary data and analysis results. To extract the components of the data, MicrobiotaProcess provides several accessors started with 'mp_extract', including: 1) mp_extract_sample function: to return the sample characteristics in a tidy data table, similar to the colData function for the SummarizedExperiment class.
2) mp_extract_assays function: to extract the assays from an MPSE object, similar to the assay function for the SummarizedExperiment class.
3) mp_extract_feature function: to extract the features characteristics (optional with taxonomy information by setting addtaxa argument to TRUE) and return tidy data, similar to the rowData function for the SummarizedExperiment class. 4) mp_extract_tree function: to extract the taxatree (by default) or otutree (by specifying the type argument to otutree) from an MPSE object. 5) mp_extract_taxonomy function: to extract the taxonomy information from an MPSE object. 6) mp_extract_refseq function: to extract the representative sequences from an MPSE object. 7) mp_extract_dist function: to extract distances in a matrix (as a dist object, by default) or a tidy data frame with comparison among the groups (by specifying the .group argument). 8) mp_extract_rarecurve function: to extract rarefaction in a rarecurve object, which can be visualized by ggrarecurve function. 9) mp_extract_internal_attr function: to extract the raw result of mp_cal_pca, mp_cal_pcoa, mp_cal_rda, mp_cal_cca, mp_cal_clust, mp_envfit, mp_adonis, mp_anosim, mp_mrpp and mp_mantel.    and ggtreeExtra ). (E) The relative abundance bar chart at the class level for each group. (F) The relative abundance heatmap at the class level of each sample (visualized by mp_plot_abundance). (G) The taxa tree of the community was visualized with associated data, including the relative abundance and the LDA (Linear discriminant analysis) value of differential OTUs. The background color in the tree represents the phyla of the clade. The external point layer represents the relative abundance of each OTU on each sample. The external bar layer represents the mean LDA of the differential OTUs. The colored points represent the differential taxa, and the sizes were scaled by the fdr (kruskal.test) (visualized by mp_plot_diff_res). The pCCA result showed that the mosquito communities were significantly associated with some environment landscape variables, such as grassland (Grassland), evergreen tree canopy (EvergreenForest), and deciduous tree canopy (DeciduousForest). Each point represents one sample and the corresponding data of the samples are encoded as visual characteristics of the points (observe species as to size, habitat as to color, and the region as to shape). The arrows represent the environmental factors and the asterisks indicate significant levels related to the Mosquito communities in the study (* 0.05, ** 0.01, ***0.001) (visualized by mp_plot_ord).    The taxa tree of the community was visualized with associated data, including the relative abundance and the LDA (Linear discriminant analysis) value of differential OTUs. The background color in the tree represents the phyla of the clade. The external point layer represents the relative abundance of each OTU on each sample. The external bar layer represents the mean LDA of the differential OTUs. The colored points represent the differential taxa, and the sizes were scaled by the fdr (kruskal.test) (visualized by mp_plot_diff_res). Characterizing mosquito community structure with the raincloud plot of the alpha diversity and the pCCA plot of the mosquito communities. (A) The alpha diversity result showed that the mosquito species richness gradually increases from eld to forest (visualized by mp_plot_alpha). (B) The pCCA result showed that the mosquito communities were signi cantly associated with some environment landscape variables, such as grassland (Grassland), evergreen tree canopy (EvergreenForest), and deciduous tree canopy (DeciduousForest). Each point represents one sample and the corresponding data of the samples are encoded as visual characteristics of the points (observe species as to size, habitat as to color, and the region as to shape). The arrows represent the environmental factors and the asterisks indicate signi cant levels related to the Mosquito communities in the study (* 0.05, ** 0.01, ***0.001) (visualized by mp_plot_ord).

Figure 6
The results of differential OTUs based on the edgeR_quasi_likelihood method using tidybulk. (A) The relative abundance heatmap of the different OTUs (visualized by the mp_plot_abundance function). (B) The hierarchical cluster of samples is based on the relative abundance of the different OTUs (visualized by ggtree). (C) The hierarchical cluster of OTUs based on their relative abundance across different samples, the differential OTUs were labeled with their names (visualized by ggtree).

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.