Pangenome calculations based on 124 complete Vibrionaceae genomes identify 710 clusters of orthologous core genes
To categorize all genes associated with Vibrionaceae genomes into distinct classes, we downloaded all complete genomes from the NCBI RefSeq database (124 as of May 2018, see Additional file 1), and then used GET_HOMOLOGUES v3.1.0 [23] to cluster orthologous protein sequences based on the OrthoMCL algorithm. The pangenome calculations identified a total of 61,512 clusters, of which 710 were encoded by genes found in all 124 genomes (i.e., core genes). The clusters are distributed among the softcore (encoded by ≥ 117 genomes), shell (encoded by ≥ 3 and ≤ 116 genomes) and cloud (encoded by ≤ 2 genomes) categories, which contain 1,796, 14,642 and 45,074 clusters and represent 3%, 23% and 73% of the total clusters, respectively. Core gene clusters represent 1.2% of the pangenome and, in individual genomes, comprise 10–17% of the total genes. Similarly, softcore genes constitute 24–34% (1,489–1,796 genes per genome) of the total genes.
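The occupancy thresholds above can be written as a small classifier. This is only a minimal sketch of the category definitions stated in the text, not the GET_HOMOLOGUES implementation. The function name `classify_cluster` is hypothetical, and treating core as a subset of softcore is an assumption, suggested by the fact that the three reported category counts sum to the 61,512 total.

```python
def classify_cluster(n_genomes, total=124, softcore_min=117, cloud_max=2):
    """Return (category, is_core) for a cluster present in n_genomes genomes.

    Thresholds follow the text: softcore >= 117, shell 3..116, cloud <= 2.
    Assumption: core clusters (present in all genomes) also count as softcore.
    """
    if not 1 <= n_genomes <= total:
        raise ValueError("n_genomes must be between 1 and total")
    is_core = n_genomes == total
    if n_genomes >= softcore_min:
        category = "softcore"
    elif n_genomes > cloud_max:
        category = "shell"
    else:
        category = "cloud"
    return category, is_core


# Cluster counts reported in the text; their sum equals the reported total.
reported_counts = {"softcore": 1796, "shell": 14642, "cloud": 45074}
total_clusters = sum(reported_counts.values())
```

For example, `classify_cluster(124)` yields a softcore cluster flagged as core, whereas `classify_cluster(116)` falls into the shell category.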
Core and softcore genes densely populate the upper half of Chr 1
The four gene categories (core, softcore, shell and cloud) were next mapped to their chromosomal locations to investigate whether they are randomly or non-randomly distributed on each chromosome. First, genes of eleven selected Vibrionaceae representatives (see Additional file 2 for phylogeny of the 11 genomes) were classified as either upper or lower (i.e., located on the upper or lower half of the chromosome) based on their distance from the origin of replication on Chr 1 and Chr 2. As presented in Fig. 1 (a complete table of pangene distribution is available as Additional file 3, and the chi-square test is available as Additional file 4), core and softcore genes are significantly overrepresented (adjusted chi-square P-value ≤ 0.05) in the upper half of Chr 1 in all investigated genomes. Similarly, shell and cloud genes are significantly overrepresented (adjusted chi-square P-value ≤ 0.05) in the lower half of Chr 1 in 8 genomes, thus supporting a non-random distribution of genes on Chr 1. In contrast to Chr 1, genes of all categories are much more evenly distributed on Chr 2. Although softcore, shell and cloud genes show a non-random distribution on Chr 2 in some of the investigated genomes (softcore 3/11, shell 1/11, cloud 2/11), the majority of genomes show no significant bias at the adjusted chi-square P-value ≤ 0.05 threshold. Furthermore, core genes were not significantly overrepresented in either the lower or the upper half of Chr 2 in any of the genomes.
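The upper/lower classification and the test for positional bias can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for Additional file 4: "upper" is assumed to mean within a quarter of the chromosome length on either side of the origin, the test is assumed to be a goodness-of-fit test against an even 50:50 split, and no multiple-testing adjustment is applied. Both function names are hypothetical.

```python
import math


def half_of_chromosome(gene_mid, ori, chrom_len):
    """Label a gene 'upper' or 'lower' on a circular chromosome.

    Assumption: 'upper' means the gene midpoint lies within a quarter of
    the chromosome length from the origin, measured in either direction.
    """
    d = abs(gene_mid - ori) % chrom_len
    d = min(d, chrom_len - d)  # shortest distance around the circle
    return "upper" if d <= chrom_len / 4 else "lower"


def chi_square_two_halves(n_upper, n_lower):
    """Chi-square goodness-of-fit test against an even split (1 d.f.).

    Returns (statistic, unadjusted p-value).
    """
    expected = (n_upper + n_lower) / 2
    stat = ((n_upper - expected) ** 2 + (n_lower - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

With invented counts of 60 upper and 40 lower genes, `chi_square_two_halves(60, 40)` gives a statistic of 4.0. P-values obtained this way across several genomes and categories would still need to be adjusted for multiple testing.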
To provide a more fine-grained picture of the core (710–721) and shell (749–2753) gene distributions, we plotted the distribution of core and shell genes on Chr 1 and Chr 2 of eleven Vibrionaceae taxa using the genome comparison tool Circos [24] (Fig. 2). Each plot was centered on mioC (Chr 1) and rctB (Chr 2). Our results show that although the exact distribution pattern varies between species, the biased distributions of core and shell genes, as described above, are striking and readily visible to the naked eye. Interestingly, although core genes densely populate the upper half of Chr 1, the region immediately surrounding ori1 contains very few core genes. This region (denoted “i” in Fig. 2) is, in contrast, densely populated by softcore genes (at least in V. natriegens and A. salmonicida, see section below). Also, a region (denoted “ii” in Fig. 2) of approximately 500 kb surrounding ter1 is densely populated with shell genes (and hence sparsely populated with core genes). For Chr 2, the chi-square test supported no significant bias in gene distribution (Additional file 4), and Fig. 2b supports this general picture, although some local clustering of gene categories occurs. In summary, the results presented here reveal that core, softcore, shell and cloud genes are non-randomly distributed on Chr 1. Core and softcore genes are more likely to be located on the upper half of Chr 1, whereas shell and cloud genes tend to be located closer to the replication terminus. For Chr 2, the four pangene categories are in general randomly distributed, showing locational bias in only a few genomes.
Expression levels of genes located on Chr 1 of V. natriegens and A. salmonicida generally correlate with distance to ori1
Fig. 3 shows how core, softcore, shell and cloud pangenes are distributed on Chr 1 and Chr 2 of V. natriegens and A. salmonicida. The pattern is consistent with the biased gene distribution pattern described above, with core and softcore genes being overrepresented on the upper half of Chr 1, and shell and cloud genes being overrepresented on the lower half. The two species were chosen as models for comparison of gene expression data with pangene distribution patterns. Specifically, we wanted to examine whether regions that are densely populated by core/softcore pangenes are expressed at higher levels than regions more sparsely populated by core/softcore pangenes. This expectation is based on previous data from V. parahaemolyticus and V. cholerae, which showed that growth rates have large impacts on the copy number (gene dosage) of genes located on Chr 1, as well as on gene expression levels [10, 21, 25]. Fast- and slow-growing bacterial representatives were therefore chosen for this particular comparative analysis. V. natriegens is a fast-growing bacterium commonly found in estuarine mud, with doubling times below 10 minutes under favourable conditions [26]. A. salmonicida is, in contrast, a slow-growing Vibrionaceae bacterium, and the causative agent of cold-water vibriosis in, e.g., Atlantic salmon and cod [27, 28]. To correlate gene distribution with gene expression data, publicly available RNA-seq data of V. natriegens and A. salmonicida were downloaded from the Sequence Read Archive [29] at NCBI. For V. natriegens, datasets from growth in minimal and optimal (rich) medium at 37 °C to mid log phase were chosen [30]. For A. salmonicida, a dataset originating from growth in LB medium containing 1% NaCl at 8 °C to mid log phase was used [31]. EDGE-PRO 1.3.1 [32] was used to align cDNA reads to the V. natriegens ATCC 14048 (NBRC 15636, DSM 759) (assembly no. GCA_001456255.1) or A. salmonicida LFI1238 (assembly no. GCF_000196495.1) genome, and to calculate expression values as reads per kilobase per million (RPKM) for all protein coding sequences (CDS).
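For reference, the RPKM measure mentioned above follows the standard definition: reads mapped to a CDS, divided by the CDS length in kilobases and by the total number of mapped reads in millions. The sketch below shows that definition only; how EDGE-PRO counts reads internally (for instance, reads spanning overlapping genes) is not described here, and the function name `rpkm` is hypothetical.

```python
def rpkm(read_count, cds_length_bp, total_mapped_reads):
    """Reads per kilobase of CDS per million mapped reads (standard definition)."""
    if cds_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("CDS length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (cds_length_bp * total_mapped_reads)
```

As an invented example, a 1,000 bp CDS with 500 mapped reads in a library of one million mapped reads has an RPKM of 500.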
Fig. 4 shows global expression maps of V. natriegens and A. salmonicida chromosomal genes centered around the median. Data points (log2 ratio RPKM CDS:RPKM median) for each CDS are shown, as well as a trend line averaged over a sliding window of 200 data points. For Chr 1 the general picture is similar in all three datasets, i.e., RPKM values are typically above the median value on the upper half (i.e., the region closest to the origin of replication), but lower in the region surrounding the terminus, independent of growth conditions. This is somewhat surprising since the expression pattern described above was expected for fast-growing cultures (i.e., V. natriegens in rich medium), but not for slow-growing cultures (i.e., A. salmonicida in LB 1% NaCl at 8 °C and V. natriegens in minimal medium, see Additional file 5). The rationale is that gene copy numbers (also known as “gene dosage”), and thus expression levels, are expected to be correlated with growth rates/multifork replication [21].
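The two quantities plotted in Fig. 4 can be sketched as below, assuming the CDS values are already ordered by chromosomal position. Two details are assumptions, not taken from the text: the `pseudocount` parameter (needed only if a CDS has an RPKM of zero) and the handling of the window at the ends of the sequence (here the window does not wrap around the circular chromosome). Function names are hypothetical.

```python
import math
from statistics import median


def log2_ratio_to_median(rpkm_values, pseudocount=0.0):
    """log2 of each RPKM value relative to the median RPKM.

    Assumption: a small pseudocount must be supplied if any value is zero,
    because log2(0) is undefined.
    """
    med = median(rpkm_values) + pseudocount
    return [math.log2((v + pseudocount) / med) for v in rpkm_values]


def sliding_mean(values, window=200):
    """Mean over each run of `window` consecutive data points (no wrap-around)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one point
        means.append(total / window)
    return means
```

Applying `sliding_mean` to the output of `log2_ratio_to_median` gives a trend line; values above zero then correspond to expression above the median.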
A more detailed circular expression map is available in Additional file 6 and shows that region “i” (see Fig. 2), which encodes mostly softcore genes, contains a highly expressed proton-translocating ATP synthase (F0F1 class) gene cluster (atpIBEFHAGDC). The ATP synthase cluster is well described in Escherichia coli as an operon located at 84 min on the chromosome (close to the origin of replication), with gene expression levels varying according to cell growth rate [33]. The ATP synthase cluster consists of softcore genes and is present in both bacteria. Moreover, the detailed map shows that region “ii”, which is densely populated with shell genes, differs from the remaining lower half of Chr 1 by being expressed far below the median in V. natriegens under both fast- and slow-growth conditions. For A. salmonicida the main picture is the same, but less pronounced, meaning that the majority of shell genes located in “ii” are expressed below the median.
For Chr 2, the results are more ambiguous, although overall similar between growth in minimal and rich medium. For A. salmonicida, expression around the terminus is, on average, higher than that of regions adjacent to ori2. For V. natriegens, expression is generally higher than the median in regions surrounding the terminus, but varies across the remaining parts of Chr 2. As for Chr 1, little difference could be detected between the slow- and fast-growth datasets for Chr 2.
In summary, we found that global expression levels for Chr 1 consistently correlate with the distance to the origin of replication. The log2 ratio of RPKM CDS:RPKM median decreases as the distance from the origin of replication increases.
All pangene categories contribute to higher expression levels around ori1 at fast-growth conditions, but not at slow-growth conditions
The global trend described above can be explained either by generally higher expression levels of all pangene categories located close to ori1, or by higher expression of three or fewer of the four pangene categories. To discriminate between the two alternatives, we calculated the RPKM median value for each pangene category, and compared the median values for genes located on the upper or lower halves of Chr 1 (Table 1). The Wilcoxon signed-rank test strongly supports (P-adj ≤ 0.05) that, when V. natriegens is cultured under fast-growth (“optimal”) conditions, median values for all four pangene categories are significantly higher for genes located on the upper half. Notably, under slow-growth conditions, median values for softcore, shell and cloud genes located on the upper half are also significantly higher. Core genes are, in contrast, expressed at similar levels on both halves. This applies to both V. natriegens (RPKM median = 370 and 360, P-adj = 0.321) in minimal medium, and A. salmonicida (RPKM median = 301 and 309, P-adj = 0.717) at suboptimal conditions. Conversely, genes from all pangene categories located on the lower half are generally expressed at lower levels than those on the upper half (except for core genes under slow-growth conditions). To summarize, we conclude that gene expression levels correlate with distance to ori1 (Fig. 4), and that genes from all four pangene categories contribute to this trend under fast-growth conditions, whereas only softcore, shell and cloud genes contribute under slow-growth conditions.
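The per-category medians compared in Table 1 amount to grouping CDS by pangene category and chromosome half. The sketch below shows only that grouping step; it does not reproduce the significance test or the P-value adjustment, and the function name and the tuple layout of the input are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_rpkm_by_group(genes):
    """Median RPKM per (category, half) group.

    `genes` is an iterable of (category, half, rpkm) tuples, e.g.
    ("core", "upper", 412.0). This input layout is an assumption.
    """
    groups = defaultdict(list)
    for category, half, value in genes:
        groups[(category, half)].append(value)
    return {key: median(values) for key, values in groups.items()}
```

Comparing, say, the `("core", "upper")` and `("core", "lower")` entries of the returned dictionary gives the pair of medians reported per category and dataset.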