Microbiome sequencing is a powerful approach for studying microbial communities by non-culture methods, promoting microbiomic research such as the Human Microbiome Project (HMP), METAgenomics of the Human Intestinal Tract (MetaHIT), Metagenomics and Metadesign of Subways & Urban Biomes (MetaSUB), the Earth Microbiome Project (EMP), and the Extreme Microbiome Project (XMP) [1-5]. Two widely adopted culture-free approaches have been devised to effectively quantify and characterise microbiome data, namely 16S rRNA gene sequencing and metagenomic sequencing. The whole metagenomic shotgun workflow (WMS) analyses the entire genomes of all organisms and all genes of microbial communities in an environment[6], while the amplicon sequencing protocol identifies marker genes such as 16S rRNA genes (16S for short in the following text) that are present in bacteria and archaea [7]. Current downstream analyses depend on a variety of command line-based tools that are specialised for calculating taxonomic classification, community structure, diversity, co-occurrence of species, functional annotation, carbohydrate metabolism activity and anti-drug resistance [4, 8, 9]. These complex steps involve user-unfriendly pipelines that impose great challenges to biologists and biomedical researchers.
Currently, the critical steps of microbiome analysis include classification and abundance analysis [10], using marker gene-based classification software MetaPhlAn2 [11] and mOTUs2[12] in WMS analysis, and QIIME[13], QIIME2[14], and Mothur[15] in 16S analysis. Additionally, there are metagenomic tools based on the k-mer algorithm, such as Kraken [16] and Clark[17] , that enable high-precision identification and classification of species. Microbiome analysis also focuses on gene prediction, gene annotation and functional analysis of communities to further understand the functional composition of the microbiome. For example, based on 16S sequencing, PICRUSt [18] and Tax4Fun[19] can be used to analyse functional prediction of the microbiome. In addition, various tools have been developed to apply microbiome research in the field of antibiotic resistance [20], corresponding tools have also been developed, such as ARGs analysis [21-24]. The carbohydrate-active enzymes (CAZy) database that describes families of structurally-related catalytic and carbohydrate-binding modules (or functional domains) of enzymes that degrade, modify, or create glycosidic bonds is also widely used during functional analysis of microbiota[25].
However, these pipelines are either command line-designed, visualisation-unfriendly, or single purposed with minimal user interaction. Some attempts have been made to improve the operability of microbiome analysis, and several online automatic pipelines are available. METAGEN-assist [26] and Microbiome-Analyst [27] are specialised for 16S, while MEGAN [28] is designed for WMS. Galaxy and MG-RAST are well known online tools that enable analysis of both 16S and WMS files [29, 30]. However, Galaxy is not specialised for local installation due to complex environment dependency, while MG-RAST is an online-only tool with Application Programming Interface (APIs) that lacks functional analyses such as ARGs and CAZy. Local deployment of an analytical tool is more suitable for uploading large WMS files which requires high speed and bandwidth. Furthermore, most of pipelines accept only OTU, BIOM or DAA (MEGAN) formats that are generated from command lines rather than raw sequences (FASTQ or FASTA formats).
In order to solve this problem, we developed an automatic and visual microbiome analysis workflow platform (MAAWf) based on Docker [31] that can be deployed online or locally. The platform is easy to operate for researchers from a biology or medical background wishing to perform composition, diversity, function, ARG and CAZy analyses of human and environmental microbiome data in a visual way. MAAWf is available at http://www.maawf.com.
The main features of MAAWf are:
- WMS: Species composition, diversity, gene abundance, metabolic pathway abundance, pathway coverage, CAZy and ARG analysis of WMS data.
- 16S: OTU clustering, diversity, inter-group difference analysis, PICRUST and Tax4Fun functional prediction, and metabolic pathways.
- Tools: A user-friendly interface for online analysis or local installation.
- Files: Supports FASTQ, FASTA, BIOM (16S) and OTU (16S) formats.
Implementation
As shown in Figure 1, MAAWf is divided into two workflow modules; WMS and 16S. The WMS module runs the HUMAnN2 [32] workflow for sequence quality control, trimming, and paired-end merging. MetaPhlAn2 then analyses the species composition and species abundance of microbial communities. MAAWf also employs the HUMAnN2 core algorithm for comparison of gene abundance, metabolic pathway abundance, pathway coverage, and gene family abundance using heatmaps. ARG analysis and CAZy workflows are used to assess the abundance of ARGs and carbohydrate-active enzymes [33]. The 16S module includes sequence pre-processing such as paired-end stitching and quality control, then performs OTU clustering, followed by diversity analysis, inter-group difference analysis, and finally metagenomic functional gene prediction.
Case description
To demonstrate the MAAWf pipeline, we employed a WMS dataset and a 16S dataset from a Taizhou Longitudinal Cohort study (BioProject PRJNA493884) [34] . Briefly, stool samples A and B were taken from two healthy donors from two independent sampling sites as biological replicates (A1 and A2, B1 and B2), and each biological replicate sample was homogenised to generate two identical aliquots for technical replicates (T1 and T2). DNA from each specimen was extracted for 16S and WMS library preparation. Samples were further sequenced by a HiSeq 2500 sequencer (Illumina, USA) in 250PE mode.
WMS analysis workflow
For the WMS workflow, FASTQ, FASTA and zip formats are accepted during downstream analysis using HUManN2, ARG and CAZy pipelines. Metadata are required to describe each of sample, including sample ID, file prefix name and grouping. Users can customise execution parameters by tuning the workflow’s default parameters before each run.
The HUMAnN2 pipeline
First, the KneadData tool is applied to remove host genome contaminants and perform quality control by Trimmomatic [35] and Bowtie2[36, 37]. Species classification and abundance statistical analysis for the WMS data are then performed using MetaPhlAn2 (Fig. 2a). The abundance results from MetaPhlAn2 analysis are then compared with known and unclassified species communities in the ChocoPhlan pan-genome database, and unpaired sequences are translated into protein sequences and compared against the DIAMOND and UniRef90 databases. Finally, results of the above comparisons are used by the HUMAnN2 core algorithm to obtain metabolic pathway coverage and gene family abundance data. A heatmap of gene family abundance is shown in Figure 2b.
The ARG pipeline
FASTQ and FASTA raw data are compared with the SARG and Greengenes databases to obtain potential ARG and 16S sequences. BLASTX is then used to identify and annotate the ARG sequences, and the SARG database is used to classify the identified ARGs to evaluate the abundance of each ARG type and subtype. Principal co-ordinates analysis (PcoA) analysis of input samples can be compared with reference samples including drinking water, livestock, ocean, clay and sewage to explore potential relationships and identify potential routes of ARG transmission. ARG workflow output files include ARG classification annotation, ARG abundance maps for different groups as shown in Figure 2c, and PCoA results with reference environmental samples as shown in Figure 2d.
CAZy pipeline
First, sequences preprocessed by KneadData are assembled with MegaHIT software to obtain contigs [33]. Prokka software is then used to annotate contigs and identify feature sequences [38]. Subsequently, CD-HIT is used to construct non-redundant genomes by similarity-based sequence clustering [39]. MAAWf then uses Salmon software for gene quantification [40], followed by carbohydrate function annotation against the CAZy database using DIAMOND. Similar to the heatmap of gene family abundance, CAZy terms are ranked by abundance and the top 40 differentially abundant CAZy terms are subjected to heatmap clustering (Fig. 2e). Barplot of pathway abundance analysis is shown in Fig. 2f.
16S analysis workflow
For the 16S workflow, FASTQ, FASTA and zip formats are accepted in the downstream QIIME pipeline. Additionally, MAAWf accepts OTU table and BIOM data generated by QIIME [13]. Metadata are also required to describe each sample, including sample ID, he barcode sequence, primer sequence and sample grouping.
OTU clustering
MAAWf employs QIIME to accomplish the entire OTU clustering process. Before clustering, the workflow stitches paired sequences using fastq-join [13], then performs quality control. Clustering is based on 97% similarity between sequences using the closed-reference OTU picking method (based on GreenGenes or SILVA databases). Sequences are clustered to obtain classification information and construct phylogenetic trees (Fig. 3a). If the input file is an OTU table or BIOM format, this step can be skipped and subsequent analysis steps performed directly.
Diversity analysis
Alpha- and beta-diversity analyses are based on the vegan R package [41]. Alpha-diversity results comprise four commonly used indices (Shannon-weiner, Simpson, Chao1 and ACE) as shown in Figure 3b. The distance matrix between samples is calculated based on Bray-Curtis and Jaccard beta-diversity algorithms to generate a PCoA plot of the microbial community (Fig. 3c).
Differential species analysis
MAAWf can calculate differentially abundant bacteria in different taxonomic ranks from any preferred groups, as shown in the volcano plot in Fig. 3d. Larger fold-change and smaller p-values indicate greater differences between samples. Meanwhile, MAAWf employs the LEfSe [42] pipeline to analyse differential abundant bacteria. Linear discriminant analysis (LDA) is used to generate an LDA value barplot (Fig. 3e) and a phylogenetic tree, as shown in Figure 3f.
Functional analysis
Since there are differences between databases, we employed both the GreenGenes database for PICRUSt functional analysis and the SILVA database for Tax4Fun analysis. The PICRUSt analysis pipeline infers the organism’s last phylogenetic common ancestor, builds a phylogenetic tree based on OTUs using the ancestral state reconstruction algorithm, normalises the OTU table, and predicts metagenomic functions based on the KEGG Ortholog (Kyoto Encyclopedia of Genes and Genomes Ortholog, KO), COG (Cluster of Orthologous Groups of proteins) and RFAM (RNA family) databases. Visualisation of the clustering of KEGG pathways is shown in Figure 3g. If the KEGG pathway level parameter is set to 2 or 3, MAAWf can also compare the abundance of pathways between groups using the Wilcoxon rank sum test. Tax4Fun analysis starts with the OTU table obtained from clustering based on the SILVA database. The R package Tax4Fun is then used to functionally predict KEGG metabolic pathways (Fig. 3h).
Requirements
MAAWf can be deployed on a server with 64 GB of RAM and 32 logical 2.6 GHz CPUs under the Ubuntu 16.04.6 LTS development environment. The front-end interactive page is based on HTML5, CSS3 and JavaScript, while the server side scripts were developed using PHP, R and Python. Packages such as ggplot2 [43] are used for visualisation, while packages like vegan are adopted for distance analysis [41]. Each workflow is packed in a Docker virtual container to ensure a stable running environment. MAAWf is compatible with regular browsers such as Google Chrome, Microsoft Internet Explorer and Mozilla Firefox. For online analysis manuals and local deployment setup protocols, please refer to http://www.maawf.com.