a) Re-assembly of public metagenomic samples
A large number of metagenomic samples based on short reads sequencing were deposited to public repositories over the last decade. Their associated analyses in most cases were based on either directly using short reads for gene predictions or most recently employing the assembly (10) step but including a mix of both partial and complete genes. To obtain full-length complete genes from existing studies it is worthwhile to apply a common but computationally demanding protocol of metagenomic assembly equally to all samples. To compile a list of publicly available shotgun metagenomic datasets we searched ENA database (accessed May 2018) and retrieved ~27000 samples (out of a list of over 34,000) with a valid ftp download location. The shotgun metagenomic samples were grouped according to their microbial environments or habitats based on their corresponding taxon id and taxon names as available in ENA (complete details are provided in Supplementary Table ST2). The metagenomes were processed through a quality control protocol and validation of read pairs, followed by the assembly of individual samples in paired-end mode (see methods). For easy access to sample metadata, we include references of ENA identifiers representing sample taxon id, taxon name, study or project identifier and run accession identifiers in the header of assembly sequence files and assembly statistics html tables, see http://www.cbrc.kaust.edu.sa/aamg/KMAP_Data/eAssemblies_kgraph.html and http://www.cbrc.kaust.edu.sa/aamg/KMAP_Data/eAssemblies/. These datasets are available to the wider scientific community for further targeted analyses precluding the need to retrieved and re-assemble the original reads.
b) Creating habitat-specific gene catalogs
These metagenome assemblies available through this work (see Supplementary Materials) can be very useful for Metagenome-assembled genome (MAG) recovery. However, our focus here is to produce gene catalogs from different environments, using full-length (complete) genes based on harmonized procedures. For this purpose we processed every sample using Prodigal (17) with an option “-c” that allows predicting genes with closed ends, avoiding gene prediction near the end of contigs. For illustration, we generated gene catalogs from a few ecological metagenomes encompassing 36 out of 77 environments reported in ENA projects (called studies). Genes obtained from all samples covering the same habitat category were combined into a single set, followed by a clustering procedure to create each of the 36 habitat-specific gene catalogs. For clustering, we used cd-hit (18) software, with two global nucleotide sequence identity cutoffs of 95% and 90 % of the query gene sequence. For both identity cutoffs, the alignment coverage was capped at 80% of the query gene length to avoid clustering shorter genes with longer ones. Supplementary Table ST1 shows a list of microbial habitats, number of samples included, ENA study ids, number of unique genes, and pertinent annotation information that we explain below.
A global non-redundant microbial metagenomic reference gene catalog
A global microbial metagenomic reference gene catalog (KMAP global meta-proteome) was produced from all habitat-specific non-redundant gene catalogs at the protein level containing 275 million genes. For this purpose, we used the mmseq2 (19) clustering approach, applying a percent global sequence identity of 90% and minimum gene length difference of 80%. The resultant global non-redundant microbial gene catalog is composed of 177.4 million proteins; see Supplementary Figure S1 that shows the percentage of unique and common peptide sequences across different gene catalogs used in producing this gene catalog. The global microbial reference gene catalog contains non-redundant proteins across diverse habitats available as reference dataset for direct annotation of new metagenomic samples. It is available online, alongside other gene catalogs reported in this study, for sequence comparison through BLAST, http://www.cbrc.kaust.edu.sa/kmapBLAST/ and also as a FASTA formatted sequences file, http://www.cbrc.kaust.edu.sa/aamg/KMAPglobalRef/KMAP_Global_MetaProteome__proteins_NR.fasta.gz.
c) Annotation of gene catalogs with improved coverage
Given the continued update and improvement of relational public reference databases critical for gene annotation, we posit that previously annotated metagenomes or gene catalogs can be significantly improved and anchored with up-to-date taxonomic and functional annotation. For instance, the massive gene catalogs from the Human Integrated Gut (HIG) (14) and the Tara Ocean’s marine metagenome (15) sampling programs significantly improved when re-annotated using updated reference databases regarding the proportion of taxonomically and functionally assigned genes in comparison to results reported just a few years ago (Figure 2). These two gene catalogs were previously annotated in the year 2012 when the total number of reference sequences in Universal Protein Knowledgebase was around 25 million, this number now increased to over 175 million.
We performed extended annotation of genes for taxonomic assignment and functional annotation. Protein sequences were compared to the Universal Protein Knowledgebase (UniProKB www.uniprot.org) reference database to infer taxonomic origin of the gene sequences, and also cross references to e.g. Cluster of Orthologous Genes (COGS) and the eggNOG (eggnog.embl.de) datbase. Furthermore, genes were compared to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (www.kegg.jp) to infer KEGG Orthologies (KOs) of functional roles and enzyme information. Also, the InterPro database is used to obtain functional signature domains and Gene Ontology (GOs).
In our analysis, massive improvement in the fraction of genes with assigned taxa was observed in the HIG gene catalogue under both stringent criteria versus default criteria (80% vs. 50% BLAST query coverage) for taxonomic assignment using our approach. For instance, 77–81% of the genes were annotated relative to the original study (14), where ~21.3% genes were assigned a taxonomic label, see Figure 2 and http://meta.genomics.cn/meta/home. In the case of the Tara Ocean’s gene catalog, the re-annotation improved taxonomic coverage by ~10% with stringent coverage and ~12% with default coverage parameters (Figure 3). Minimum percent identity is kept at 30% for amino acid sequence alignments. Regarding functional coverage, re-annotation reduced unassigned genes from 28.5 % to 16.2% for TARA and 35% to 16.7% in the case of HIG (Figure 2).
These improvements prompted us to address the issue of a more comprehensive annotation of the earlier shotgun metagenomic data sets. The aggregated global microbial gene catalog reached over 275 million non-redundant gene sequences. We performed these extended annotations for all of the 36 new and four existing gene catalogs, see Figure 3 and Supplementary Table ST1 showing the proportion of annotated vs unannotated genes where new gene catalogs appear with an initial “e” denoting source of sequence data is ENA.
From the gene catalogs reported here, 16 are from Aquatic environments, 15 are from Engineered environments, 3 are from Soils and another 2 from host-associated environments. Annotation of these gene catalogs reveal (Figure 3A) that over 50% of the genes were assigned were assigned a putative function except marine sediment and TARA eukaryotic gene catalogs. The annotation of the global non-redundant gene catalog (KMAP global meta-proteome) showed that ~57% of genes (~101 million genes) had probable functions. An open question, therefore, is how much microbial diversity is captured in the funtional assignment. To estimate potential taxonomic diversity in metagenomic environments as operational taxonomic unit (OUT), one of the universal single copy gene such as ribosomal protein S30 (rpSc) can be used as a proxy (20). Figure 3B shows taxonomic diversity across different habitats through count of variants, or OTUs based on rpSc gene, as assigned with a taxonomic label (considering BLAST based percent identity of 30 and percent coverage of 80 and presence of rpSc domain PF01738). The diversity across different gene catalogs summarized for the global gene catalog showed 3,4471 different bacteria, 2,430 Eukaryota and 7,00 different Archaea in this dataset. Highest taxonomic diversity appears to be in the soil metagenomes, followed by aquatic environments. From the engineered environments, activated sludge and anaerobic digester gene catalog showed taxonomic diversity above 1000 OTUs for different microbes.
*Wu D, Jospin G, Eisen JA. Systematic identification of gene families for use as "markers" for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups. Plos one. 2013 ;8(10):e77033. DOI: 10.1371/journal.pone.0077033.
Figure 3C summarizes count of unique genes related to different enzyme classes throughout different habitat-specific gene catalogs, while Figure 3D presents an interactive heatmap of complete and incomplete KEGG pathway modules, see https://bit.ly/2WvVtLX for full interactive map. Here, the functional repertoire of individual microbial habitats, compared to all others, can be interactively tracked on the level of a single or set of critical genes required to activate portions of pathways.
A BLAST-based sequence comparison of the 275 million genes to reference databases, such as the UniProtKB, requires ~522 years of computer processing time using a single central processing unit (CPU) computer, however the same task on KAUST’s Shaheen II supercomputer completed sequence comparisons in ~13 days using ~4.8 million computer CPU hours per day.
d) Gene Information Tables (GITs)
Gene Information Table (GIT) represent a simple tab separated text-based table, similar to the one introduced by Metagenomics Reports (MetaRep) framework (21), showing unique types of annotations for a list of genes available from a sample, a gene catalog, a genome or a metagenome (see an example GIT in supplementary Figure S0). It includes annotations such as gene name, Gene Ontology (GO), Enzyme Classification (EC), InterPro domains, Taxon ID, Annotation type filters, KEGG Orthology (KO) ID, weight or an expression value, COG and eggNOG IDs. There is a source column to report the source of annotation and to later filter genes based on BLAST statistics, E-value, Percent Identity, Percent Coverage data are recorded. Interesting “Filters” can be introduced to work with subset sets of genes for example the ones available with KO, Enzyme, COG or other interesting sources.
GITs available for a genome, metagenome or a gene catalog can be easily used for further analysis and comparison of data sets using commandline tools. As the size of data grows in metagenomic analysis, it would require advance computing power and skills to sift through these data sets for deeper analyses. On the other hand these GITs can be indexed into a database, e.g. using MetaRep framework (21), for easy web-based browsing, querying and comparisons of taxonomic or functional aspects of different metagenomic datasets. In this study we performed a large number of metagenomic assemblies from different environments; these data sets are openly available for public use through their choice of annotation and analyses with a recommendation to use GIT format for data sharing and data integration. Example GITs related to gene catalogs reported here are available (see ST1, for download or building a database).
e) An example database to explore and compare GITs from shotgun metagenomic data
Considering the increasing volume of data, a normal user can not process these huge shotgun metagenomics datasets for analyses; however, using GITs data integration, further analyses become easy, either via commandline methods or through a database. Here, we present the KAUST Metagenomic Analysis Platform (KMAP) an open-access platform (http://www.cbrc.kaust.edu.sa/kmap) providing wider access for the annotation (producing GITs), exploration and comparison (by providing a database) of microbial shotgun metagenomic data.
KAUST Metagenomic Analysis Platform (KMAP)
KMAP consists of two modules: The Annotation Module, which is used for annotation of user-submitted contigs or genes, and the Compare Module, which allows for sample-to-sample or gene catalog-based comparison (see Figure 4A, methods and KMAP documentation). The annotation process and compilation of GITs is implemented in the Automatic Annotation of Microbial Genomes (AAMG) pipeline (22). AAMG was recently improved to handle metagenomic-scale data through supercomputing systems, such as Shaheen II (ranked no 7 in 2015 at https://www.top500.org/system/178515), available at King Abdullah University of Science and Technology (KAUST). GITs integrate annotations available from different sources, for every gene in a study as shown in Supplementary Figure, and here this format is recommended as a minimum standard for data integration, exploration and comparison of shotgun metagenomic samples. In order to expand the access to GITs and analytics of metagenomics data to larger scientific community, without the need of advanced computational skills or resources, we provide indexed GITs through KMAP’s online ‘Compare Module’ by extending and repurposing the standard framework of Metagenomic Reports (MetaRep (21)) software. See supplementary videos (SV1, SV2) on how to view and compare data sets in KMAP Compare Module. To contribute to the global effort of analyzing massive-scale microbial resources, we provide KMAP-based Gene Information Tables (GITs) from 40 gene catalogs. These GITs are based on ~3000 metagenomic samples capturing the comprehensive annotation of a global gene pool with over 275 million genes (see the Supplementary Table ST1 for links to GITs, POIs or set of annotations).
Comparison of KMAP features
Currently, several metagenomic data analysis pipelines offering a breadth of useful features already exists. A comparison of KMAP is performed with the four most relevant platforms such as MG-RAST (23), EBI Metagenomics(10), Meta-Pipe (24), MGX (25) and MetaRep (21). Considering presence/absence and extent of implementation of different features categorized as types of input, gene prediction, gene annotation, visualization, comparison of samples and available computational resources, shows KMAP to be the most comprehensive platform thus far available (see Figure 4).
The KMAP approach for improved analytics revolves around full-length genes from assembled genomes or metagenomes. Given a microbial genome, metagenomic contigs, or a gene catalog from any source as input, AAMG in the KMAP Annotation Module performs computationally expensive sequence comparisons against regularly updated reference databases. In the case of contigs, it provides a range of choices for gene prediction. Gene annotation is the main feature of annotation pipelines. There are a number of annotation features in KMAP not directly available in other systems, such as detection of Antibiotic Resistance Genes (ARG), Proteins of Interest with application to Industry (POIs) and Metagenomic Species (MS) binning using contigs as well as genes (see supplementary document on KMAP binning). AAMG integrates all the possible taxonomic and functional role assignments, including cross-references, to populate a GIT. In order to view, explore and compare metagenomic samples or gene catalogs, KMAP improves upon MetaRep and presents KMAP Compare Module by providing additional interactive heatmaps for more in-depth analysis of annotations, particularly automated estimation and comparative visualization of complete or incomplete KEGG pathway modules across selected datasets in one interactive figure (see supplementary video SV3). Additionally, in other systems like e.g. MG-RAST, it is not possible to look at both taxonomic and functional aspects using a single search. Here using GIT format, since we record complete information for a gene from different aspects, KMAP Compare Module allows the visualisation of results of a single query in the context of taxonomy, pathways, enzymes, KOs or GOs without re-issuing the search query every time.
Exploring environmental functional genomics with KMAP
Gene Information Tables provided in this study can be used to lookup interesting genes based on specific identifiers for taxonomy, gene family ids e.g. KOs or Enzyme Classification (EC) numbers, GO ids or protein signature domain identifiers or preset “Filters”, using either commandline methods or online version of GITs. KMAP Compare Module provides online access enabling larger audiences with minimal bioinformatics skills to mine interesting genes from datasets across different habitats. We demonstrate this utility of KMAP using two examples examining the distribution of (a) extremozymes and (b) antibiotic resistance genes (ARGs) across a range of contrasting habitats. Microbes from extreme environments are increasingly recognized as sources of novel compounds for biotechnological applications, with a potential of providing solutions for humanity’s great challenges, such as providing society with food, energy, and a clean environment (26,27). Environmental metagenomics allows the exploration of previously inaccessible, genetic material from extreme environments that are likely, because of the challenges their extreme conditions pose to life, to contain extremozymes. Extremozymes can be very useful in industry, given their optimal activity and stability under extreme conditions (e.g. pressure, temperature and salinity), and can provide a foundation for developing environmentally friendly, efficient, and sustainable industrial technologies (27). Most valuable cold and hot extremozymes include catalases, cellulases, proteases, lipases, mannases, pectinases and lacases (27). We demonstrated the use of the KMAP “compare” module to explore using KMAP published metagenomes for the presence of enzymes of interest across a range of habitats, which showed microbial communities from some habitats to be enriched in these genes (Figure 5).
Figure 5 shows the diversity of genes as percent of unique genes found in each of our range of selected habitats, revealing the general prevalence of cellulases over alpha-amylases, catalases, pectate lyases and hydrolases. For sequence search we provide a simple BLAST to create a list of related genes of interest from selected habitats, see online KMAP BLAST http://www.cbrc.kaust.edu.sa/kmapBLAST/, allowing download of hit sequences.
In another example, we look into the issue of antibiotic resistance by exploring the spread of recently reported top antibiotic resistance genes (28) (ARGs) using tetX, tetM, and tetV (linked to tetracycline) as well as blaTEM (linked to beta lactamase class A) across different microbial habitats. Results in Figure 6 show that tetracycline ARGs are prevalent across several environments. The taxonomic affiliation of unique genes for tetM (K18220), shows most of these genes are affiliated with Firmicutes in Human and several other habitats (Figure 6 B). However, different predominant affiliation for this gene were detected in specific environments: Actinobacteria in in soil and compost metagenomes, Bacteroidetes in hydrothermal vents, estuaries and some other metagenomes, and Gemmatimonadetes in cold marine and aquatic metagenomes.
A unique feature of KMAP is the capability to present ‘filters’ (as shown in column 14 of the aforementioned GIT example), to focus on sets of genes with specific annotations. As an example, ARGs predicted via deepARG (29) in KMAP are available through ‘filter:F.AntiBiotic.Resistance’. This ARG filter also includes ~30 classes of ARGs based on types of antibiotics, as provided by the deepARG reference database. Utilizing the ARG filter, (see methods query for ARGs), KMAP produces an interactive heatmap representing ARGs-related complete or incomplete KEGG Pathway modules across selected habitats (Figure 7).
Clicking on a cell of this interactive heatmap provides more details and a link to KEGG module diagrams in order to examine how a module is shown to be complete or incomplete (Supplementary Video SV5). Since KMAP results from a query can also be saved as a table, these tables can be used to produce visualizations from any other system. An example heatmap using morpheus (https://software.broadinstitute.org/morpheus/), and providing individual ARG KOs and antibiotics across different habitats is shown at http://www.cbrc.kaust.edu.sa/aamg/habitats_KO_ARGs.svg. Here KO table obtained from KMAP using ARG filter query was appended with antibiotic types.
In general, all the gene catalogs presented in this study can easily be explored or compared using the KMAP Compare Module (see project number 119 with public access). However, this module suffers limitations when comparing large number of samples (e.g., above 50) due to the volume of data read-writes and Solr Lucene optimization compared to the available computer random access memory (RAM). These limitations can be addressed in future through optimizations for large-scale data visualization from platforms like Google Genomics (30). We used servers with two terabytes of RAM. BLASTable version of our gene catalogs provide access to annotation and alignments through online BLAST based sequence comparisons, see supplementary video SV6 for an example.