MAAWf: An Integrated and Visual Tool for Microbiome Data Analyses

Background Microbiomic research has grown in popularity in recent decades. The widespread use of next-generation sequencing technologies, including 16S rRNA gene-based and metagenomic shotgun-based methods, has produced a wealth of microbiome data. At present, most software and workflows for processing and analysing microbiomic data are command line-based, which requires considerable computing time and makes interaction difficult. To provide a command line-free, integrated, user-friendly microbiome analysis tool that can be deployed online or locally, we developed Microbiome Automated Analysis Workflows (MAAWf). For whole metagenomic shotgun (WMS) data, MAAWf assesses taxonomy, protein-coding genes, metabolic pathways, carbohydrate-active enzymes (CAZy) and antibiotic resistance genes (ARGs); for 16S data, it performs operational taxonomic unit clustering, alpha-/beta-diversity analysis and functional annotation. MAAWf generates results similar to those of currently popular tools, with a much shorter running time. MAAWf is freely accessible, and its source code is available at http://www.maawf.com.

Metadesign of Subways & Urban Biomes (MetaSUB), the Earth Microbiome Project (EMP), and the Extreme Microbiome Project (XMP) [1][2][3][4][5]. Two widely adopted culture-free approaches have been devised to quantify and characterise microbiome data, namely 16S rRNA gene sequencing and metagenomic sequencing. The whole metagenomic shotgun workflow (WMS) analyses the entire genomes of all organisms and all genes of microbial communities in an environment [6], while the amplicon sequencing protocol identifies marker genes such as 16S rRNA genes (16S for short in the following text) that are present in bacteria and archaea [7]. Current downstream analyses depend on a variety of command line-based tools that are specialised for taxonomic classification, community structure, diversity, co-occurrence of species, functional annotation, carbohydrate metabolism activity and antibiotic resistance [4,8,9]. These complex steps involve user-unfriendly pipelines that pose great challenges for biologists and biomedical researchers without command-line skills.
Additionally, there are metagenomic tools based on the k-mer algorithm, such as Kraken [16] and Clark [17], that enable high-precision identification and classification of species.
Microbiome analysis also focuses on gene prediction, gene annotation and functional analysis of communities to further understand the functional composition of the includes sequence pre-processing such as paired-end stitching and quality control, then OTU clustering, followed by diversity analysis, inter-group difference analysis, and finally metagenomic functional gene prediction.

WMS workflow
For WMS analyses, species classification and abundance statistical analysis are performed using MetaPhlAn2 (Figure 2a). A heatmap of gene family abundance is shown in Figure 2b.

Comparison with other visual and command line-based tools
We compared the performance of MAAWf with several available platforms in terms of overall function, consistency of commonly used indices, and the running time of taxonomic classification, which all the tools perform. Details of the overall comparison of parameters and specifications between MAAWf and other available tools are included in Table 1.
MAAWf is compatible with multiple file types including 16S/WMS raw sequences and OTU/BIOM formats. Additionally, MAAWf is fully command line-free (Table 2), and is ready for local deployment. To further demonstrate performance, two 16S and two WMS datasets were employed for comparison: gut microbiome data Pc (PRJNA302832) from 10 patients receiving ipilimumab treatment [36], and gut microbiome data Pm (PRJNA301903) from 15 premature infants [37]. Each biosample from each dataset was simultaneously sequenced using both WMS and 16S procedures. Table 3. Except for the Simpson's index, correlations between MAAWf and DIAMOND-MEGAN6 were lower than expected, and MAAWf achieved a good correlation between overall alpha-diversity using the two programs.
For the 16S workflow, we chose DADA2 [39] and QIIME2 [14] Figure 4b, right panel). In terms of taxonomy processing speed, because it employs closed-reference OTU picking, MAAWf is also faster than both DADA2 and QIIME2, as shown in Figure 4d. The high Pearson correlations for alpha-diversity indicate good consistency between MAAWf and currently available software (Table 4).
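The consistency check described above can be illustrated with a minimal sketch: compute an alpha-diversity index per sample from each tool's count table, then take the Pearson correlation of the two resulting vectors. The count tables and sample names below are hypothetical, and the form of Simpson's index used here (1 minus the sum of squared proportions) is an assumption, since the text does not state which variant was applied.

```python
import math

def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Simpson's index in its 1 - sum(p_i^2) form (an assumed variant)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical per-sample taxon counts reported by two different tools.
tool_a = {"S1": [50, 30, 20], "S2": [90, 5, 5], "S3": [34, 33, 33]}
tool_b = {"S1": [48, 32, 20], "S2": [88, 7, 5], "S3": [35, 33, 32]}

samples = sorted(tool_a)
r_shannon = pearson([shannon(tool_a[s]) for s in samples],
                    [shannon(tool_b[s]) for s in samples])
r_simpson = pearson([simpson(tool_a[s]) for s in samples],
                    [simpson(tool_b[s]) for s in samples])
print(f"Pearson r (Shannon): {r_shannon:.3f}")
print(f"Pearson r (Simpson): {r_simpson:.3f}")
```

A correlation close to 1 across samples indicates that two tools rank and scale sample diversity in a similar way, even when the absolute index values differ slightly.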

Discussion
As next-generation sequencing becomes cheaper, the microbiomic approach has become popular for biomedical research. It is therefore critical to enable biologists and biomedical researchers to easily explore datasets using efficient, user-friendly whole-pipeline tools. In

Conclusion
MAAWf offers researchers a rapid, integrated and command line-free microbiome analysis tool for local and web-based deployment that allows a good level of user interaction.

Case description
To demonstrate the MAAWf pipeline, we employed a WMS dataset and a 16S dataset from a Taizhou Longitudinal Cohort study (BioProject PRJNA493884) [40]. Briefly, stool samples A and B were taken from two healthy donors from two independent sampling sites as biological replicates (A1 and A2, B1 and B2), and each biological replicate sample was homogenised to generate two identical aliquots for technical replicates (T1 and T2). DNA from each specimen was extracted for 16S and WMS library preparation. Samples were further sequenced by a HiSeq 2500 sequencer (Illumina, USA) in 250PE mode.

WMS analysis workflow
For the WMS workflow, FASTQ, FASTA and zip formats are accepted for downstream analysis using the HUMAnN2, ARG and CAZy pipelines. Metadata are required to describe each sample, including sample ID, file prefix name and grouping. Users can customise execution by tuning the workflow's default parameters before each run.
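As an illustration of the metadata requirement, the following minimal sketch checks a tab-separated metadata table for the three fields named above. The column names (`sample_id`, `file_prefix`, `group`), the tab-separated layout and the example rows are hypothetical; the text does not specify the exact file format that MAAWf expects.

```python
import csv
import io

# Hypothetical column names for the three required fields.
REQUIRED = ("sample_id", "file_prefix", "group")

def load_metadata(text):
    """Parse tab-separated metadata and check the required fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    for row in rows:
        for column in REQUIRED:
            # Short rows yield None for absent cells, so guard before strip().
            if not (row[column] or "").strip():
                raise ValueError(f"empty value in column {column!r}")
    ids = [row["sample_id"] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate sample IDs")
    return rows

# Hypothetical metadata for two donors with technical replicates.
example = (
    "sample_id\tfile_prefix\tgroup\n"
    "A1_T1\tA1_T1_reads\tA\n"
    "A1_T2\tA1_T2_reads\tA\n"
    "B1_T1\tB1_T1_reads\tB\n"
    "B1_T2\tB1_T2_reads\tB\n"
)
metadata = load_metadata(example)
groups = sorted({row["group"] for row in metadata})
print(len(metadata), "samples in groups", groups)
```

Validating the table before a run catches missing or duplicated sample entries early, before any long-running alignment step starts.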

The HUMAnN2 pipeline
First, the KneadData tool is applied to perform quality control with Trimmomatic [41] and to remove host genome contaminants with Bowtie2 [42,43]. Species classification and abundance statistical analysis for the WMS data are then performed using MetaPhlAn2. The abundance results from MetaPhlAn2 analysis are then compared with known and unclassified species communities in the ChocoPhlAn pan-genome database, and sequences that remain unaligned are translated into protein sequences and searched against the UniRef90 database using DIAMOND. Finally, results of the above comparisons are used by the HUMAnN2 core algorithm to obtain metabolic pathway coverage and gene family abundance data.

The ARG pipeline
FASTQ and FASTA raw data are compared with the SARG and Greengenes databases to obtain potential ARG and 16S sequences. BLASTX is then used to identify and annotate the ARG sequences, and the SARG database is used to classify the identified ARGs to evaluate the abundance of each ARG type and subtype. Using principal coordinates analysis (PCoA), input samples can be compared with reference samples including drinking water, livestock, ocean, clay and sewage to explore potential relationships and identify potential routes of ARG transmission.
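PCoA operates on a matrix of pairwise dissimilarities between samples. The minimal sketch below computes such a matrix from ARG-type abundance profiles and reports which reference environment lies closest to the input sample. The choice of Bray-Curtis dissimilarity is an assumption, as the text does not name the distance measure, and all abundance values are hypothetical. The ordination step itself (eigendecomposition of the centred matrix) is omitted.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance profiles (dicts)."""
    keys = set(u) | set(v)
    numerator = sum(abs(u.get(k, 0.0) - v.get(k, 0.0)) for k in keys)
    denominator = sum(u.get(k, 0.0) + v.get(k, 0.0) for k in keys)
    return numerator / denominator if denominator else 0.0

# Hypothetical ARG-type abundances for one input sample and two
# reference environments.
profiles = {
    "input_sample": {"tetracycline": 0.30, "beta-lactam": 0.10,
                     "aminoglycoside": 0.05},
    "livestock": {"tetracycline": 0.35, "beta-lactam": 0.08,
                  "aminoglycoside": 0.07},
    "ocean": {"tetracycline": 0.01, "beta-lactam": 0.02, "multidrug": 0.10},
}

names = sorted(profiles)
# Full pairwise dissimilarity matrix, the kind of input PCoA takes.
dist = {(a, b): bray_curtis(profiles[a], profiles[b])
        for a in names for b in names}

references = [n for n in names if n != "input_sample"]
closest = min(references, key=lambda n: dist[("input_sample", n)])
print("closest reference environment:", closest)
```

In an ordination plot, samples with small pairwise dissimilarities cluster together, which is what allows an input sample to be related visually to a reference environment.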

16S analysis workflow
For the 16S workflow, FASTQ, FASTA and zip formats are accepted in the downstream QIIME pipeline. Additionally, MAAWf accepts OTU table and BIOM data generated by QIIME [13]. Metadata are also required to describe each sample, including sample ID, the barcode sequence, primer sequence and sample grouping.

OTU clustering
MAAWf employs QIIME to accomplish the entire OTU clustering process. Before clustering, the workflow stitches paired sequences using fastq-join [13], then performs quality control. Clustering is based on 97% similarity between sequences using the closed-reference OTU picking method (based on GreenGenes or SILVA databases). Sequences are clustered to obtain classification information and construct phylogenetic trees. If the input file is an OTU table or BIOM format, this step can be skipped and subsequent analysis steps performed directly.
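The principle of closed-reference OTU picking at a 97% threshold can be shown with a toy sketch: each read is assigned to its best-matching reference sequence if identity is at least 0.97, and reads with no such match are discarded. This is not QIIME's implementation, which relies on dedicated search tools; the identity function assumes pre-aligned sequences of equal length, and all sequences and identifiers below are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def closed_reference_pick(reads, references, threshold=0.97):
    """Assign each read to its best reference, or discard it."""
    otu_counts, discarded = {}, []
    for read_id, seq in reads.items():
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in references.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otu_counts[best_id] = otu_counts.get(best_id, 0) + 1
        else:
            discarded.append(read_id)  # no reference hit: read is dropped
    return otu_counts, discarded

def mutate(seq, positions):
    """Replace the bases at the given positions with 'N' (toy mismatches)."""
    chars = list(seq)
    for i in positions:
        chars[i] = "N"
    return "".join(chars)

# Hypothetical 100-nt reference sequences and reads.
references = {"ref1": "ACGT" * 25, "ref2": "TTGCA" * 20}
reads = {
    "r1": mutate(references["ref1"], [3, 57]),              # 98% identity
    "r2": mutate(references["ref1"], [1, 20, 40, 60, 80]),  # 95% identity
    "r3": references["ref2"],                               # 100% identity
}
otu_counts, discarded = closed_reference_pick(reads, references)
print(otu_counts, discarded)
```

Because reads are compared only against a fixed reference set rather than against each other, this strategy avoids de novo clustering, which is consistent with the shorter running time reported for closed-reference picking.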

Functional analysis
Since there are differences between databases, we employed both the GreenGenes database for PICRUSt functional analysis and the SILVA database for Tax4Fun analysis.
The PICRUSt analysis pipeline infers the organism's last phylogenetic common ancestor,

Availability of data and material
A WMS dataset and a 16S dataset from a Taizhou Longitudinal Cohort study (BioProject PRJNA493884) were employed in the software demonstration. All code for pipelines, website establishment and databases are available at http://www.maawf.com/get_maawf.html.

Consent for publication
The authors declared no competing interests.

Comparison of featured indices between MAAWf and other microbiomic tools.