Background: The exponential increase in high-throughput sequencing data and the development of computational sciences and bioinformatics pipelines have advanced our understanding of microbial community composition and distribution in complex ecosystems. Despite these advances, the identification of microbial interactions from genomic data remains a major bottleneck. To address this challenge, we present OrtSuite, a flexible workflow to predict putative microbial interactions based on genomic content.
Results: OrtSuite combines ortholog clustering strategies with genome annotation based on a user-defined set of functions, allowing for hypothesis-driven data analysis. OrtSuite allows users to install and run all workflow components and analyze the generated outputs using a simple pipeline consisting of 23 bash commands and one R command. Annotation is a two-stage process. First, only a subset of sequences from each ortholog cluster is aligned to all sequences in the Ortholog-Reaction Association database (ORAdb). Next, all sequences from clusters that meet a user-defined identity threshold are aligned to all sequence sets in ORAdb to which they had a hit. This approach reduces the time needed for functional annotation. Further, OrtSuite identifies putative interspecies interactions from the genomic content of each species, subject to constraints given by the user. Additional control is afforded to the user at several stages of the workflow: 1) ORAdb needs to be constructed only once for each specific process and can be manually curated; 2) the identity and sequence similarity thresholds used during the annotation stage can be adjusted; and 3) constraints related to pathway reaction composition and known species contributions to ecosystem processes can be defined.
Conclusions: OrtSuite is an easy-to-use workflow that allows for rapid functional annotation based on a user-curated database. Further, this novel workflow allows the identification of interspecies interactions through user-defined constraints. Owing to its low computational demands, OrtSuite can run on a personal computer for small datasets (e.g. up to 100 genomes). For larger datasets (> 100 genomes), we suggest the use of computer clusters. OrtSuite is open-source software available at https://github.com/mdsufz/OrtSuit .
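The two-stage annotation described in the Results can be illustrated with a minimal Python sketch. It is not OrtSuite's implementation: the function names, the data layout (clusters and ORAdb as plain dictionaries), the toy sequences, the KO labels and the default threshold are all assumptions made for illustration, and `identity` uses `difflib.SequenceMatcher` as a crude stand-in for a real sequence aligner. The sketch only shows why restricting the second stage to the ORAdb sequence sets hit in the first stage reduces the number of comparisons.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Toy similarity in [0, 1]; a stand-in for alignment identity, not a real aligner."""
    return SequenceMatcher(None, a, b).ratio()


def two_stage_annotate(clusters, oradb, threshold=0.8, n_representatives=1):
    """Annotate clustered sequences against a reference database in two stages.

    clusters: {cluster_id: {sequence_id: sequence}}   (hypothetical layout)
    oradb:    {ko_id: [reference sequences]}           (hypothetical layout)
    Returns ({sequence_id: set of ko_ids}, number of pairwise comparisons).
    """
    annotations = {}
    comparisons = 0
    for members in clusters.values():
        # Stage 1: compare only a subset of the cluster to every reference sequence.
        hit_kos = set()
        representatives = list(members.values())[:n_representatives]
        for seq in representatives:
            for ko, references in oradb.items():
                for ref in references:
                    comparisons += 1
                    if identity(seq, ref) >= threshold:
                        hit_kos.add(ko)
        # Stage 2: compare every cluster member, but only to the sequence sets
        # that produced a hit in stage 1. Clusters without hits are skipped.
        for seq_id, seq in members.items():
            for ko in sorted(hit_kos):
                for ref in oradb[ko]:
                    comparisons += 1
                    if identity(seq, ref) >= threshold:
                        annotations.setdefault(seq_id, set()).add(ko)
    return annotations, comparisons


# Toy data: two clusters of two sequences each, two single-sequence KO sets.
clusters = {
    "cluster_1": {"spA_g1": "MKTAYIAKQR", "spB_g7": "MKTAYIAKQK"},
    "cluster_2": {"spA_g2": "WWWWWWWWWW", "spC_g3": "WWWWWWWWWF"},
}
oradb = {"KO_x": ["MKTAYIAKQR"], "KO_y": ["GGGGSSSSPP"]}

annotations, comparisons = two_stage_annotate(clusters, oradb)
exhaustive = sum(len(m) for m in clusters.values()) * sum(len(r) for r in oradb.values())
print(annotations)
print(comparisons, "comparisons instead of", exhaustive)
```

In this toy case the second cluster has no stage-1 hit and is never revisited, so fewer comparisons are made than in an all-against-all search. The saving in a real run depends on cluster sizes and on how many ORAdb sequence sets each cluster hits.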

Figure 1

Figure 2

Figure 3
This is a list of supplementary files associated with this preprint.
Table S1. Number of reactions, enzymes, KO groups and KO-associated sequences represented in each alternative benzoate to acetyl-CoA conversion pathway used (BTA_P1, BTA_P2 and BTA_P3).
Table S2. Species names, strain and abbreviation codes used to validate OrtSuite. The genomic potential, based on the KEGG database, to completely encode all proteins involved in a BTA pathway is identified in the column “BTA pathway” (P1 – Anaerobic conversion of benzoate to acetyl-CoA 1; P2 – Anaerobic conversion of benzoate to acetyl-CoA 2; P3 – Aerobic conversion of benzoate to acetyl-CoA). The column OrtSuite_result lists the BTA pathway(s) identified as being completely performed by each species used for validation.
Example_reaction_list. Example file of a list of KEGG reaction identifiers used to generate ORAdb.
Example_ec_list. Example file of a list of enzyme commission (EC) numbers used to generate ORAdb.
Example_ko_list. Example file of a list of KEGG ortholog identifiers used to generate ORAdb.
Table S3. Example of the binary table with the mapping of KO identifiers (rows) to the species of interest (columns) (1 – indicates the presence of genes in the species; 0 – indicates the absence of genes in the species).
Table S4. Gene-Protein-Reaction (GPR) rules and metadata associated with all reactions present in the Ortholog-Reaction Association database (ORAdb).
Table S5. User-defined constraints: pathway name, complete set of reactions present in each pathway, sets of reactions required to be performed by a single species (each subset is given in parentheses) and transport reactions. The Transporter column describes the transport reaction (e.g. R00750) that is coupled to the reaction in the pathway (e.g. R02601). Thus, species that perform the latter must also contain the genes associated with the transport reaction.
Set_A_genomes. Compressed folder containing all the genomes (FASTA format) of the species used to test the workflow. Species abbreviation codes are used as file names.
Test_genome_set. Sequenced genomes of the species used in a study by Fetzer and collaborators [23].
Table S6. Species used in the Fetzer study [23] and their ability to grow as monocultures in benzoate-containing medium.
Table S7. Growth results of microbial communities composed of individuals and groups of species from the Fetzer study [23] in three different environments. Environment 1 contained a benzoate concentration of 1 g/L. Environment 2 contained a benzoate concentration of 6 g/L. Environment 3 contained a benzoate concentration of 6 g/L and 15 g/L of NaCl.
Table S8. OrtSuite workflow runtime. The total runtime of each OrtSuite step when analyzing the genomic potential of species in Set_A_genomes dataset in three pathways (P1, P2 and P3) for the conversion of benzoate to acetyl-CoA (BTA). Steps were performed with default parameters on a laptop with 4 cores and 16 GB of RAM.
Table S9. Sequence alignments of original and mutated sequences using BLAST [13].
Table S10. Statistics obtained during the clustering of protein orthologs using the Test_genome_set. Results include numbers of species, genes, clusters and genes per cluster.
Table S11. Overview of the number of clusters, sequences and KOs during the annotation of the Test_genome_set.
Table S12. Mapping of species annotated with KO identifiers from ORAdb.
Table S13. Potential of species in the Test_genome_set to perform reactions associated with benzoate degradation based on ORAdb. Species identifiers: (A) Bacillus subtilis ATCC, (B) Paenibacillus polymyxa ATCC 842, (C) Brevibacillus brevis ATCC 8246, (D) Comamonas testosteroni ATCC 11996, (E) Cupriavidus necator JMP 134, (F) Pseudomonas putida ATCC 17514, (G) Pseudomonas fluorescens DSM 6290, (H) Variovorax paradoxus ATCC 17713, (I) Rhodococcus sp. (isolate UFZ), (J) Acidovorax facilis (isolate UFZ), (K) Rhodococcus ruber BU3, (L) Sphingobium yanoikuyae DSM 6900. (1 – species with the complete genomic potential to perform the reaction; 0 – species without the complete genomic potential to perform the reaction).
Table S14. Growth of single species and of species in combination with others, measured by Fetzer and collaborators in three different media (low substrate: 1 g/L benzoate; high substrate: 6 g/L benzoate; high substrate+salt stress: 6 g/L benzoate supplemented with 15 g/L of NaCl). Species identifiers: (A) Bacillus subtilis ATCC, (B) Paenibacillus polymyxa ATCC 842, (C) Brevibacillus brevis ATCC 8246, (D) Comamonas testosteroni ATCC 11996, (E) Cupriavidus necator JMP 134, (F) Pseudomonas putida ATCC 17514, (G) Pseudomonas fluorescens DSM 6290, (H) Variovorax paradoxus ATCC 17713, (I) Rhodococcus sp. (isolate UFZ), (J) Acidovorax facilis (isolate UFZ), (K) Rhodococcus ruber BU3, (L) Sphingobium yanoikuyae DSM 6900. Growth was scored as positive when optical density (OD) was equal to or above 0.094, 0.2545 and 0.0752 in the low substrate, high substrate and high substrate+salt stress environments, respectively.
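The user-defined constraints in Table S5 and the binary reaction potential in Table S13 suggest how putative interactions could be screened. The Python sketch below is one possible reading under stated assumptions, not the OrtSuite code: the function names, the dictionary layout, the species labels and the reaction identifiers (`rxn1`, `rxn2`, `rxn3`, `t1`) are hypothetical placeholders, and GPR rules are assumed to have already been resolved into a per-species set of performable reactions.

```python
from itertools import combinations


def effective_reactions(reactions, transporters):
    """Keep a reaction only if its coupled transport reaction (if any) is present too."""
    return {
        r for r in reactions
        if transporters.get(r) is None or transporters[r] in reactions
    }


def candidate_consortia(potential, pathway, single_species_sets=(),
                        transporters=None, max_size=2):
    """List species combinations that jointly cover a pathway under the constraints.

    potential:           {species: set of reactions it can perform}  (hypothetical layout)
    pathway:             reactions that must all be covered
    single_species_sets: reaction subsets that one member must perform on its own
    transporters:        {pathway reaction: coupled transport reaction}
    """
    transporters = transporters or {}
    usable = {sp: effective_reactions(rx, transporters) for sp, rx in potential.items()}
    required = set(pathway)
    results = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(usable), size):
            covered = set().union(*(usable[sp] for sp in combo))
            if not required <= covered:
                continue
            # Each constrained subset must be fully present in at least one member.
            if all(any(set(sub) <= usable[sp] for sp in combo)
                   for sub in single_species_sets):
                results.append(combo)
    return results


# Toy input with placeholder identifiers (not taken from the supplementary tables).
potential = {
    "sp_A": {"rxn1", "rxn2"},
    "sp_B": {"rxn3", "t1"},
    "sp_C": {"rxn1", "rxn2", "rxn3"},  # lacks transporter t1, so rxn3 is discounted
}
consortia = candidate_consortia(
    potential,
    pathway=["rxn1", "rxn2", "rxn3"],
    single_species_sets=[{"rxn1", "rxn2"}],
    transporters={"rxn3": "t1"},
)
print(consortia)
```

With this input no single species satisfies all constraints, while two pairs do. Because the sketch enumerates every combination up to `max_size`, it also reports supersets of smaller complete combinations, and the number of combinations grows quickly with the number of species.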
Posted 06 Aug, 2020