Isolation and Phenotypic Characterization of strains
Each of the strains sequenced for this study were selected because (i) they were isolated from soils with measurable disease-suppressive characteristics, and (ii) they displayed strong phenotypes in competition or signaling assays.
Streptomyces sp. GS93-23 was isolated from a potato scab-suppressive plot in Grand Rapids, MN using the Anderson Air Sampler isolation method22,23. This strain performed the best of ~800 isolated strains at combating potato scab22. GS93-23 also shows antifungal activity against Phytophthora medicaginis and Phytophthora sojae, two fungal pathogens of alfalfa. This activity extended to soil studies, where GS93-23 protected alfalfa, reducing the percentage of dead plants from 50% to 0% when pathogens were seeded at low density24. Further, compared to no-treatment controls, GS93-23 increased plant growth and yield (forage weight per pot), suggesting direct or indirect plant growth promotion activity. Lastly, GS93-23 was found to be strongly antagonistic against other Streptomyces spp., but did not reduce nodule production by rhizobial bacteria24.
Streptomyces spp. S3-4 and 3211-3 were isolated from pathogen suppressive soils located in the Cedar Creek Ecosystem Science Reserve (CCESR), an NSF long-term ecological research site25. S3-4 was isolated from soil in a long-term big bluestem (Andropogon gerardii) monoculture plot and is antagonistic against sympatrically evolved soil isolates26. Strain 3211-3 was isolated from a native prairie control plot at CCESR. It has a strong signaling phenotype, defined as the ability to elicit antibiotic/antifungal production in strains with which it is cultured on close spatial proximity27.
PacBio sequencing and assembly of genomes
Initial genome sequencing and scaffold assembly was performed on a Pacific Biosciences (PacBio) RS single molecule sequencer (October 2014). Genomic DNA was size-selected using Blue-Pippen 20kb and sequenced in three SMRTcells each. The first two SMRTcells for each genome were run using P4 chemistry, and third SMRTcell was run for each genome with P6 chemistry. Initial read assembly using the PacBio HGAP2 algorithm and sequence polishing using the PacBio Resequencing algorithm produced genome sizes of and contig numbers shown in Table 1. Final coverage was > 100x for each genome.
The high GC-content of Streptomyces genomes produces many homopolymer G and C stretches, which can produce errors during base-calling and genome assembly. Low-coverage Illumina sequence data was collected for error correction. Illumina sequencing was performed on a Mi-seq instrument to collect 2 x 250 base paired end reads equating to 110-fold (3211-3), 118-fold (GS93-23), or 155-fold (S3-4) coverage for each genome. Final, error-corrected genome sequences were generated by mapping Illumina short reads to PacBio-generated reference genomes using the BreSeq algorithm28, and incorporating single nucleotide polymorphisms (SNPs) and short Indels using the Pilon algorithm29.
Comparison of Illumina-corrected and PacBio-alone genome sequences
The short-read corrected genome sequences were compared to the PacBio-only assemblies, and 70, 295, and 335 SNP/Indels were present between the two assemblies for GS93-23, S3-4, and 3211-3, respectively. In each case, the vast majority were single base insertions in homopolymer stretches. We next sought to verify that the short-read corrected sequences were indeed a better representation of the actual genome sequence, as the two sequencing platforms are known to generate different types of errors. To determine which sequence variant was correct for each SNP/indel, translated protein sequences at each of the 295 SNP/indel loci in the S3-4 genome were compared against the NCBI GenBank non-redundant database, with the assumption that a frameshift resulting from an indel will result in a worse top blast hit for a stretch of DNA. Supplementary Figure S1 shows the comparison of significance score for BLASTx results of searching a fragment of DNA +/-150 bases from the variant loci. This analysis is only expected to reveal the correct sequence variant when (i) the indel is present within a coding DNA sequence (CDS), (ii) correct protein sequences for close homologs are present in GenBank, and (iii) the 300 basepair window that is searched is sufficiently focused such that top BLAST hits align to the translated query in the region of the variant locus (i.e. at the center of the query, not the edges). We find that the Illumina-corrected sequence returns a top BLASTx hit with lower (better) E-value twice as often as the uncorrected sequence. The average E-values for the top BLASTx hit alignment are six orders of magnitude lower (better) for the short-read corrected sequences compared to the PacBio-only sequences. Because of this, we use the short-read corrected genome sequences for the remaining analyses.
General characteristics of the genome sequences
We were able to assemble the chromosome as a single large contig for strains GS93-23 (8.24 Mb) and 3211-3 (8.23 Mb), and as two large contigs for S3-4 (4.19 Mb and 3.31 Mb) (Figure 1 and Table 1). For S3-4, the two chromosome arms can be oriented relative to one other with high confidence based on GC-skew, orientation of rRNA operons, and enrichment of specialized metabolite gene clusters at chromosome arms (Figure 1, rings 8, 6, and 4, respectively). Manual attempts to close the gap by retrieving PacBio reads that mapped to each contig were unsuccessful. The gap is present in a locus that is especially repetitive, with 3 rRNA operons in close proximity. The overall G+C content (71-73%) and differences in G/C skew for the chromosome arms in each genome are similar to what has been reported for other genomes from this genus30–35. In addition to the large linear chromosomes, strains 3211-3 and S3-4 each contain two large linear plasmids (519 Kb and 240 Kb for 3211-3, 349 Kb and 203 Kb for S3-4).
Annotation of the genomes with the Prokka software tool36 identified 7188 CDSs, 7 ribosomal RNA operons, and 66 tRNAs for GS93-23. Similar numbers of annotated genes were present in the S3-4 genome (7071 CDSs, 8 rRNA operons, 73 tRNAs), and slightly more in the 3211-3 genome (8087 CDSs, 7 rRNA operons, 77 tRNAs), accounting for its larger total size. Gene products were assigned to Clusters of Orthologous Groups (COGs) using the BASys platform37. Functional categorization of proteins reported in Table 2 in comparison to the model organism, S. coelicolor A3(2) were performed with EggNOG-mapper38.
Annotation of natural product biosynthetic gene clusters.
Because natural product biosynthesis is thought to play a mechanistic role that underpins the ecology of disease suppressive soils17,39, we have analyzed the genomes for their biosynthetic potential using the antiSMASH 3.0 toolkit40. We conservatively assigned specific molecules to these BGCs only when the annotated gene clusters share 100% of the biosynthetic genes from previously characterized BGCs by manual comparison (Supplementary Information). For ribosomally produced and post-translationally modified peptides (RiPPs), we predict the production of minor structural variants when the sequence of precursor peptides is slightly different than in characterized BGCs. The 26 high-confidence BGCs identified in the GS93-23 genome include known pathways for RiPP cyclothiazomycin41, the dienoyltetramic acid streptolydigin42, and the lipoglycopeptide mannopeptimycin43. The 38 high-confidence BGCs in the 3211-3 genome include known pathways for the chlorinated non-ribosomal peptide tambromycin44, the siderophore coelichelin45, and terpenoid 2-methylisoborneol46. The 28 high-confidence BGCs in the S3-4 genome include known pathways for 2-methylisoborneol, and the aminoglycoside streptothricin47. In addition, all three genomes contain the highly conserved BGCs for the siderophore desferrioxamine b48, terpenes geosmin49 and hopene50, minor structural variants of lantibiotic SapB51, and osmoprotectant ectoine52.
The majority of BGCs identified in these genomes remain uncharacterized. Intriguing pathways include a 178 Kb polyketide cluster on a plasmid in S3-4 that putatively encodes a 60-member macrolide, and a pyrrolopyrrole-containing metabolite in 3211-3.
Comparison to closest sequenced relatives
We compared the draft genome sequences to a collection of 500 publicly available actinomycete genomes using multi-locus sequence comparison to identify the closest sequenced relative of each (Figure 2). S3-4 groups with the small Streptomyces katrae clade near type strain NRRL-ISP 555053. Strain 3211-3 is in the neighboring Streptomyces virginiae clade defined by the type strain NRRL ISP-509454. GS93-23 clusters with the Streptomyces lydicus type strain NRRL-ISP 546155.
We identified closely related genomes in the available whole-genome sequence databases for each of our DSS isolates (Figure 3). For each of our newly sequenced strains, a previously published genome was available with high sequence similarity in several common phylogenetic markers (16S rRNA, rpoB, and multi-locus sequencing (MLS) using ribosomal proteins) (Figure 3a). Our closest pair of new and previously reported genomes is GS93-23 and S. lydicus NRRL ISP-5461, which share 100% identity of 16S rRNA and 99.92 % identity using MLS comparison. Even our most divergent pair, S3-4 to Streptomyces sp. WM6372, shared >98% identity at the 16S rRNA level and >96% identity at the rpoB level, and 93.72% by four-gene MLS comparison (atpD, gyrB, rpoB, trpB).
Genome pairs were compared to determine the amount of shared sequence across the entire genome (Figure 3b). Alignments were constructed in Mauve and alignment gaps were mapped back to the new high-quality reference genomes. Alignment gaps between of GS93-23 and ISP-5461 are uniformly distributed across the chromosome. Insertions or deletion events greater than 100 bp account for only 4.5% of the genome sequence as a whole (Figure 3B), with a similar proportion being lost/gained in BGCs as in the rest of the genome (Figure 3b).
The high-level of sequence conservation between GS93-23 and ISP-5461 allowed us to examine the micro-scale evolution of these genomes. There are approximately 40,000 SNPs between the two, making the sequence identity in the aligning sequences greater than 99.5%. Interestingly, the position of SNPs relative to CDSs shows a marked de-enrichment in (i) the approximate position of the Shine-Dalgarno sequence in the 5’-UTR, and (ii) the 5’ end of the CDS (Figure 3c). This suggests a selection for maintaining relative translation rates of encoded genes, as both loci are important in determining translation initiation rates in bacteria56. Most of the ~33,000 SNPs in CDSs encode silent mutations. Of the missense mutations, the majority are conservative in terms of amino acid chemistry (Figure 3d). The ratio of synonymous to non-synonymous mutations (dS/dN) is 1.8, which is substantially lower than seen in housekeeping genes in E. coli and invasion genes from S. enterica57,58, suggesting that there has been little selective pressure against non-synonymous mutations and that these two strains belong to the same clonal complex59,60.
Despite the strong similarity between GS93-23 and ISP-5461, there are still substantial differences between the two strains. GS93-23 contains 98 genes that are missing in ISP-5461, and ISP-5461 contains 11 unique genes. 66/98 genes unique to GS93-23 are of unknown function. Of genes with functional annotations the largest categories specific to GS93-23 are transcriptional regulators (11/98) and metabolic enzymes (10/98). Of the genes unique to ISP-5461, only a single gene was of unknown function. The largest functional categories for genes unique to ISP-5461 also were transcriptional regulators (3/11) and metabolic enzymes (3/11).
The other two DSS genomes presented here are more divergent from the nearest sequenced relative. Both 3211-3 and S3-4 have two large plasmids that are absent in their closest relatives, S. virginiae NRRL B-1447 and S. katrae NRRL ISP-5550, respectively. These changes alone account for 9% and 7% of the total genome content, respectively. The plasmids in S3-4 are rich in secondary metabolism genes, with four large gene clusters totaling roughly 500 kb of sequence. Besides the plasmid differences, the chromosome of 3211-3 contains 285 large (>100 bp) insertions compared to B-1447, totaling 609 kb of new sequence, and 309 large deletions totaling 758 kb of sequence lost. In the regions that do align, there are 102,000 SNPs, corresponding to an average sequence identity of 98.7% across the genome. The S3-4 genome lacks a close homolog in the sequence databases. Despite sharing 96.3% sequence identity of the rpoB gene, 26% of the S3-4 genome does not align with the WM6372 sequence.
We next compared the natural product biosynthetic potential for these three strains by analyzing their BGC content. Our closest pair, GS93-23 and ISP-5461, share 26/26 of the high-confidence BGCs and 61/64 ‘putative’ clusters (co-localized clusters of genes that belong to COGs typically found in BGCs, but which lack canonical secondary metabolism signature sequences). The next closest pair, 3211-3 and B-1447, which share 99.7% similarity of the rpoB gene, have in common only 31/38 of the high-confidence BGC annotations, which is driven mostly by the presence of two plasmids in 3211-3 missing from B-1447. Between S3-4 and WM6372 (96.3% identity of rpoB), 12/28 of high-confidence BGCs are shared, and 27/54 ‘putative’ clusters. These relationships between genetic distance and BGC overlap follow the general trend for rpoB conservation and non-ribosomal peptide synthetase (NRPS) BGC overlap described by Doroghazi et al61.
Signaling potential analysis
One possible organization for a highly antagonistic microbial community would have a keystone species that produces a signal to induce antibiotic production in many other community members. The University of Minnesota DSS strain library was assayed for signaling potential using a plate-based phenotypic assay27 (Kinkel, unpublished data). Strain 3211-3 was selected for whole genome sequencing because it is among the best signalers of antibiosis in our library of DSS isolates. The signaling assay requires dilution of a signaling molecule through solid agar medium, so signaling through cell-cell contact can be ruled out as a mechanism. We looked for genomic features that could explain the signaling promiscuity in 3211-3.
Signaling between Streptomyces can be mediated by several well-known classes of hormone-like signaling molecules62 including γ-butyrolactones63, furans64, γ-butenolides65, SapB66-like RiPPs, diamino-bis(hydroxymethyl)-butanediol67, and diketopiperazines68. Signaling can also be mediated by sub-inhibitory concentrations of antibiotics69–71. We first looked for the presence of BGCs encoding hormone-like signaling molecules in 3211-3. There are two γ-butyrolactone BGCs in this genome and a SapB BGC, but this number is comparable to other sequenced Streptomyces. There is no evidence that 3211-3 produces an unusually diverse set of hormone-like signaling molecules.
A second possibility is that 3211-3 does not produce many diverse hormone-like signaling molecules, but the molecule they do produce can be sensed by many species of Streptomyces. There are at least fifteen unique γ-butyrolactone signals produced by the genus, and unfortunately it is not possible to predict the specific γ-butyrolactone chemical structure from sequence information alone. However, we reasoned γ-butyrolactone biosynthesis genes and receptors that produce/sense the same compound will have a higher degree of sequence similarity than those producing/sensing different compounds. We performed a CLUSTER-BLAST analysis with the γ-butyrolactone biosynthesis protein ScbA and the receptor AfsR against the set of sequenced Streptomyces genomes (not shown). Again, we did not see any evidence that 3211-3 produces a more widely-sensed hormone-like signaling molecule.
A third possibility is that 3211-3 is a prolific signaler due to production of sub-inhibitory concentrations of antibiotics (SICA). This genome encodes more ‘high-confidence’ BGCs than the two genomes from strongly antagonistic DSS isolates (Table 1). Among 125 complete Streptomyces genomes with antiSMASH 4.1 (Supplementary Table S4), the number of high-confidence BGCs in 3211-3 places it in the top 16% in terms of BGC content. Since there is no clear genomic signature that allows us to explain the signaling potential in 3211-3, teasing apart its ability to elicit antibiosis in so many diverse isolates will require future molecular genetic experiments.