MAAWf: A Multifunctional and Visual Tool for Microbiomic Data Analyses

Background: Microbiomic research has grown in popularity in recent decades. The widespread use of next-generation sequencing technologies, including 16S rRNA gene-based and metagenomic shotgun-based methods, has produced a wealth of microbiome data. At present, most software and analysis workflows for analysis and processing of microbiomic data are command line-based, which requires considerable computing time and makes interaction difficult. Results: To provide a command-line free, multifunctional, user interface friendly and online/local deployable microbiome analysis tool, we developed Microbiome Automated Analysis Workflows (MAAWf). MAAWf is composed of a whole metagenomic shotgun workflow (WMS) and a 16S Sequencing Workflow. The WMS analysis workflow assesses taxonomy, protein-coding genes, metabolic pathways, carbohydrate-active enzymes (CAZy) and antibiotic resistance genes (ARGs). The 16S ribosomal RNA (rRNA) analysis workflow counts and clusters operational taxonomic units (OTUs), estimates alpha- and beta-diversity and inter-group differences, and performs functional analysis. We also compared MAAWf with other commonly avaiable analysis tools using two public datasets. The MAAWf pipeline was established using the Ubuntu 16.04.6 LTS kernel with primary sequence files such as FASTQ format and taxonomic format such as OTU or BIOM formats. Following analysis of public 16S and WMS datasets, MAAWf obtained similar results to DIAMOND-MEGAN6, MG-RAST, DADA2 and QIIME2, but the running time was much shorter. Conclusions: MAAWf is a visual, integrated, rapid analysis tool that enables remote and local computing of microbiome data.


Introduction
Microbiome sequencing is a powerful approach for studying microbial communities  [1][2][3][4][5]. Two widely adopted culture-free approaches have been devised to effectively quantify and characterise microbiome data, namely 16S rRNA gene sequencing and metagenomic sequencing. The whole metagenomic shotgun workflow (WMS) analyses the entire genomes of all organisms and all genes of microbial communities in an environment [6], while the amplicon sequencing protocol identifies marker genes such as 16S rRNA genes (16S for short in the following text) that are present in bacteria and archaea [7]. Current downstream analyses depend on a variety of command line-based tools that are specialised for calculating taxonomic classification, community structure, diversity, co-occurrence of species, functional annotation, carbohydrate metabolism activity and anti-drug resistance [4,8,9]. These complex steps involve user-unfriendly pipelines that impose great challenges to biologists and biomedical researchers.
Currently, the critical steps of microbiome analysis include classification and abundance analysis [10], using marker gene-based classification software MetaPhlAn2 [11] and mOTUs2 [12] in WMS analysis, and QIIME[13], QIIME2 [14], and Mothur [15] in 16S analysis. Additionally, there are metagenomic tools based on the k-mer algorithm, such as Kraken [16] and Clark [17] , that enable high-precision identification and classification of species. Microbiome analysis also focuses on gene prediction, gene annotation and functional analysis of communities to further understand the functional composition of the microbiome. For example, based on 16S sequencing, PICRUSt [18] and Tax4Fun [19] can be used to analyse functional prediction of the microbiome. In addition, various tools have been developed to apply microbiome research in the field of antibiotic resistance [20], corresponding tools have also been developed, such as ARGs analysis [21][22][23][24] [29,30]. However, Galaxy is not specialised for local installation due to complex environment dependency, while MG-RAST is an online-only tool with Application Programming Interface (APIs) that lacks functional analyses such as ARGs and CAZy.
Local deployment of an analytical tool is more suitable for uploading large WMS files which requires high speed and bandwidth. Furthermore, most of pipelines accept only OTU, BIOM or DAA (MEGAN) formats that are generated from command lines rather than raw sequences (FASTQ or FASTA formats).
In order to solve this problem, we developed an automatic and visual microbiome analysis workflow platform (MAAWf) based on Docker [31] that can be deployed online or locally. The platform is easy to operate for researchers from a biology or medical background wishing to perform composition, diversity, function, ARG and CAZy analyses of human and environmental microbiome data in a visual way.
The main features of MAAWf are:

3.
Tools: A user-friendly interface for online analysis or local installation.

Implementation
As shown in Figure 1, MAAWf is divided into two workflow modules; WMS and 16S.
The WMS module runs the HUMAnN2 [32] workflow for sequence quality control, trimming, and paired-end merging. MetaPhlAn2 then analyses the species composition and species abundance of microbial communities. MAAWf also employs the HUMAnN2 core algorithm for comparison of gene abundance, metabolic pathway abundance, pathway coverage, and gene family abundance using heatmaps. ARG analysis and CAZy workflows are used to assess the abundance of ARGs and carbohydrate-active enzymes [33]. The 16S module includes sequence preprocessing such as paired-end stitching and quality control, then performs OTU clustering, followed by diversity analysis, inter-group difference analysis, and finally metagenomic functional gene prediction.

Case description
To demonstrate the MAAWf pipeline, we employed a WMS dataset and a 16S dataset from a Taizhou Longitudinal Cohort study (BioProject PRJNA493884) [34] . Briefly, stool samples A and B were taken from two healthy donors from two independent sampling sites as biological replicates (A1 and A2, B1 and B2), and each biological replicate sample was homogenised to generate two identical aliquots for technical replicates (T1 and T2). DNA from each specimen was extracted for 16S and WMS library preparation. Samples were further sequenced by a HiSeq 2500 sequencer (Illumina, USA) in 250PE mode.

WMS analysis workflow
For the WMS workflow, FASTQ, FASTA and zip formats are accepted during downstream analysis using HUManN2, ARG and CAZy pipelines. Metadata are required to describe each of sample, including sample ID, file prefix name and grouping. Users can customise execution parameters by tuning the workflow's default parameters before each run.

The HUMAnN2 pipeline
First, the KneadData tool is applied to remove host genome contaminants and perform quality control by Trimmomatic [35] and Bowtie2 [36,37]. Species classification and abundance statistical analysis for the WMS data are then performed using MetaPhlAn2 (Fig. 2a). The abundance results from MetaPhlAn2 analysis are then compared with known and unclassified species communities in the ChocoPhlan pan-genome database, and unpaired sequences are translated into protein sequences and compared against the DIAMOND and UniRef90 databases.
Finally, results of the above comparisons are used by the HUMAnN2 core algorithm to obtain metabolic pathway coverage and gene family abundance data. A heatmap of gene family abundance is shown in Figure 2b.

The ARG pipeline
FASTQ and FASTA raw data are compared with the SARG and Greengenes databases to obtain potential ARG and 16S sequences. BLASTX is then used to identify and annotate the ARG sequences, and the SARG database is used to classify the identified ARGs to evaluate the abundance of each ARG type and subtype. Principal co-ordinates analysis (PcoA) analysis of input samples can be compared with reference samples including drinking water, livestock, ocean, clay and sewage to explore potential relationships and identify potential routes of ARG transmission.
ARG workflow output files include ARG classification annotation, ARG abundance maps for different groups as shown in Figure 2c, and PCoA results with reference environmental samples as shown in Figure 2d.

CAZy pipeline
First, sequences preprocessed by KneadData are assembled with MegaHIT software to obtain contigs [33]. Prokka software is then used to annotate contigs and identify feature sequences [38]. Subsequently, CD-HIT is used to construct non-redundant genomes by similarity-based sequence clustering [39]. MAAWf then uses Salmon software for gene quantification [40], followed by carbohydrate function annotation against the CAZy database using DIAMOND. Similar to the heatmap of gene family abundance, CAZy terms are ranked by abundance and the top 40 differentially abundant CAZy terms are subjected to heatmap clustering (Fig. 2e). Barplot of pathway abundance analysis is shown in Fig. 2f.

16S analysis workflow
For the 16S workflow, FASTQ, FASTA and zip formats are accepted in the downstream QIIME pipeline. Additionally, MAAWf accepts OTU table and BIOM data generated by QIIME [13]. Metadata are also required to describe each sample, including sample ID, he barcode sequence, primer sequence and sample grouping.

OTU clustering
MAAWf employs QIIME to accomplish the entire OTU clustering process. Before clustering, the workflow stitches paired sequences using fastq-join [13], then performs quality control. Clustering is based on 97% similarity between sequences using the closed-reference OTU picking method (based on GreenGenes or SILVA databases). Sequences are clustered to obtain classification information and construct phylogenetic trees (Fig. 3a). If the input file is an OTU table or BIOM format, this step can be skipped and subsequent analysis steps performed directly.

Diversity analysis
Alpha-and beta-diversity analyses are based on the vegan R package [41]. Alphadiversity results comprise four commonly used indices (Shannon-weiner, Simpson, Chao1 and ACE) as shown in Figure 3b. The distance matrix between samples is calculated based on Bray-Curtis and Jaccard beta-diversity algorithms to generate a PCoA plot of the microbial community (Fig. 3c).

Differential species analysis
MAAWf can calculate differentially abundant bacteria in different taxonomic ranks from any preferred groups, as shown in the volcano plot in Fig. 3d. Larger foldchange and smaller p-values indicate greater differences between samples.
Meanwhile, MAAWf employs the LEfSe [42] pipeline to analyse differential abundant bacteria. Linear discriminant analysis (LDA) is used to generate an LDA value barplot (Fig. 3e) and a phylogenetic tree, as shown in Figure 3f.

Functional analysis
Since there are differences between databases, we employed both the GreenGenes database for PICRUSt functional analysis and the SILVA database for Tax4Fun analysis. The PICRUSt analysis pipeline infers the organism's last phylogenetic common ancestor, builds a phylogenetic tree based on OTUs using the ancestral state reconstruction algorithm, normalises the OTU

Comparison with other visual and command line-based tools
We compared the performance of MAAWf with several available platforms in terms of overall function, consistency of commonly used indices, and whole-workflow running time. Details of the overall comparison of parameters and specs from MAAWf and other available tools are included in Table 1. Compared with other tools, MAAWf is compatible with multiple file types from both 16S/WMS raw sequences and OTU/BIOM formats. Additionally, MAAWf is fully command line-free (Table 2), and is ready for local deployment. To further demonstrate performance, two 16S and two WMS datasets were employed for comparison; gut microbiome data Pc (PRJNA302832) from 10 patients receiving ipilimumab treatment [44] , and gut microbiome data Pm (PRJNA301903) from 15 premature infants [45]. Each biosample from each dataset was simultaneously sequenced using both WMS and 16S procedures.   Fig. 4b, right panel). In terms of processing speed, due to employment of Closed-reference OTU picking, MAAWf is faster than both DADA2 and QIIME2As, as shown in Figure 4d). The high Pearson correlations for alpha-diversity indicate good consistency between MAAWf and currently available software (Tables 5 and 6).

Discussion
MAAWf uses QIIME, PICRUSt and Tax4Fun for OTU clustering and function prediction in the 16S workflow. This approach has some limitations. Firstly, 97% similaritybased OTU clustering is gradually being replaced by an amplicon sequence variant (ASV)-based algorithm that can be considered to represent 100% similarity with target sequences. For example, the latest QIIME2 calls DADA2 and Deblur to denoise sequence data to generate an ASV

Conclusions
As next-generation sequencing becomes cheaper, the microbiomic approach has become popular for biomedical research. It is therefore critical to enable biologists and biomedical researchers to easily explore datasets using efficient, user-friendly whole-pipeline tools. In this study, we developed an interactive and automated analysis workflow platform MAAWf, that provides comprehensive analysis of both WMS and 16S microbiomic data by online computing or local deployment. Although still suboptimal, MAAWf offers researchers a convenient and comprehensive microbiome analysis tool while allowing a good level of user interaction. To overcome the limitations, we will continue to add functional modules to MAAWf in future releases.

Consent for publication
All the authors agree to the publication of this work.

Availability of data and material
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declared no competing interests.     Figure 1 MAAWf analysis workflow. MAAWf is divided into two workflow modules, the WMS and 16S. Th Supplementary_MAAWf_local_setup_final_20191216.docx