Core and softcore genes densely populate the upper half of Chr 1
The four gene categories (core, softcore, shell and cloud) were next mapped to their chromosomal locations to investigate whether they are randomly or non-randomly distributed on each chromosome. First, genes of eleven selected Vibrionaceae representatives were classified as either upper or lower (i.e., located on the upper or lower half of the chromosome) based on their distance from the origin of replication on Chr 1 and Chr 2. As presented in Fig. 1 (the complete table is available as Additional file 2: Table S2), core and softcore genes are significantly overrepresented (adjusted chi-square P-value ≤ 0.05) in the upper half of Chr 1 in all investigated genomes. Similarly, shell and cloud genes are significantly overrepresented (adjusted chi-square P-value ≤ 0.05) in the lower half of Chr 1 in 8 and 7 genomes, respectively, supporting a non-random distribution of genes on Chr 1. In contrast to Chr 1, genes of all categories are much more evenly distributed on Chr 2. Although shell, cloud and softcore genes show non-random distribution on Chr 2 in some of the investigated genomes (softcore 3/11, shell 2/11, cloud 3/11), the majority of genomes show no significant bias at the same threshold (adjusted chi-square P-value ≤ 0.05). Furthermore, core genes were not significantly overrepresented in either the lower or upper half of Chr 2 in any of the genomes.
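The upper/lower classification and the test for positional bias can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in this section: the "upper" half is taken to be the half of the circular chromosome nearest the origin (circular distance at most one quarter of the chromosome length), the chi-square test is taken to be a two-bin goodness-of-fit test against a 50:50 expectation, and the multiple-testing adjustment is not reproduced because the method is not named here. The function names and example numbers are hypothetical.

```python
import math


def half_of_chromosome(gene_pos, ori_pos, chrom_len):
    """Classify a gene as 'upper' (ori-proximal half) or 'lower' (ter-proximal half).

    Assumption: positions are 0-based coordinates on a circular chromosome.
    """
    # Shortest distance around the circle between the gene and the origin.
    d = abs(gene_pos - ori_pos) % chrom_len
    d = min(d, chrom_len - d)
    # The farthest point from the origin is chrom_len / 2 away, so the
    # ori-proximal half contains everything within chrom_len / 4.
    return "upper" if d <= chrom_len / 4 else "lower"


def chi_square_two_bins(n_upper, n_lower, expected_upper_fraction=0.5):
    """Goodness-of-fit chi-square (1 degree of freedom) for upper vs lower counts.

    Assumption: under random placement, a fixed fraction of the genes in a
    category is expected on the upper half (0.5 by default).
    """
    total = n_upper + n_lower
    exp_upper = total * expected_upper_fraction
    exp_lower = total - exp_upper
    stat = (n_upper - exp_upper) ** 2 / exp_upper + (n_lower - exp_lower) ** 2 / exp_lower
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical example: 70 core genes on the upper half, 30 on the lower half.
stat, p = chi_square_two_bins(70, 30)
print(f"chi-square = {stat:.1f}, unadjusted P = {p:.2e}")
```

With real data, the unadjusted P-values from all genomes and categories would still need to be corrected for multiple testing before applying the 0.05 threshold.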
To provide a more fine-grained picture of the core (710–721) and shell (749–2753) gene distributions, we plotted the distribution of core and shell genes on Chr 1 and Chr 2 of eleven Vibrionaceae taxa using the genome comparison tool Circos [24] (Fig. 2). Each plot was centered on mioC (Chr 1) and rctB (Chr 2). Our results show that although the exact distribution pattern varies between species, the biased distributions of core and shell genes, as described above, are striking and readily visible with the naked eye. Interestingly, although core genes densely populate the upper half of Chr 1, the region immediately surrounding ori1 contains very few core genes. This region (denoted “i” in Fig. 2) is, in contrast, densely populated by softcore genes (at least in V. natriegens and A. salmonicida, see section below). Also, a region (denoted “ii” in Fig. 2) of approximately 500 kb surrounding ter1 is more sparsely populated with core genes than the rest of the chromosome. Figure 2b shows that the shell genes are evenly distributed, without any large gaps, on both chromosomes. However, genera represented by one or few genomes in the dataset have fewer shell genes and hence more gaps (e.g. G. hollisae ATCC 33564, Photobacterium damselae KC-Na-1 and P. profundum SS9).
In summary, the results presented here reveal that core, softcore, shell and cloud genes are non-randomly distributed on Chr 1. Core and softcore genes are more likely to be located on the upper half of Chr 1, whereas shell and cloud genes tend to be located closer to the replication terminus. On Chr 2, the four pangene categories are in general randomly distributed, showing locational bias in only a few genomes.
Expression levels of genes located on Chr 1 of V. natriegens and A. salmonicida generally correlate with distance to ori1
Figure 3 shows how core, softcore, shell and cloud pangenes are distributed on Chr 1 and Chr 2 of V. natriegens and A. salmonicida. The pattern is consistent with the biased gene distribution pattern described above, with core and softcore genes being overrepresented on the upper half of Chr 1, and shell and cloud genes being overrepresented on the lower half. The two species were chosen as models for comparison of gene expression data with pangene distribution patterns. Specifically, we wanted to examine whether regions that are densely populated by core/softcore pangenes are expressed at higher levels than regions more sparsely populated by core/softcore pangenes. This expectation is based on previous data from V. parahaemolyticus and V. cholerae, which showed that growth rates of these bacteria have large impacts on the copy number (gene dosage) of genes located on Chr 1, as well as on gene expression levels [10, 21, 25]. Fast- and slow-growing bacterial representatives were therefore chosen for this particular comparative analysis. V. natriegens is a fast-growing bacterium commonly found in estuarine mud, with doubling times below 10 minutes under favourable conditions [26]. A. salmonicida is, in contrast, a slow-growing Vibrionaceae bacterium, and the causative agent of cold-water vibriosis in e.g., Atlantic salmon and cod [27, 28]. To correlate gene distribution with gene expression data, publicly available RNA-seq data of V. natriegens and A. salmonicida were downloaded from the Sequence Read Archive [29] at NCBI. For V. natriegens, datasets from growth in minimal (BioSample no. SAMN1092609, SAMN10926310 and SAMN10926313) and optimal (rich) medium (sample no. SAMN10926311, SAMN10926312 and SAMN109329) at 37 °C to OD600nm 0.3–0.5 were chosen [30]. These conditions were selected because they represent slow as well as fast growth conditions. For A. salmonicida, a dataset (sample no. SAMEA4548122, SAMEA4548133, SAMEA4548134) originating from growth in LB medium containing 1% NaCl at 8 °C to mid log phase (OD600nm ~ 0.5) was used [31]. The salt concentration is expected to be similar to the concentration the bacterium would experience inside its natural host (Atlantic salmon), where the bacterium is known to cause cold-water vibriosis at temperatures below 10 °C [27, 28]. Hence, 8 °C was used in the experiment. EDGE-PRO 1.3.1 [32] was used to align cDNA reads to the V. natriegens ATCC 14048 (NBRC 15636, DSM 759) (assembly no. GCA_001456255.1) or A. salmonicida LFI1238 (assembly no. GCF_000196495.1) genome, and to calculate expression values as reads per kilobase per million mapped reads (RPKM) for all protein coding sequences (CDS).
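For readers unfamiliar with the unit, the standard RPKM definition can be written as a short sketch. This is the textbook formula only and is not a description of how EDGE-PRO computes its values internally; the function name and the example numbers are hypothetical.

```python
def rpkm(read_count, cds_length_bp, total_mapped_reads):
    """Reads per kilobase of CDS per million mapped reads (standard definition)."""
    # Dividing by (length / 1e3) and by (total / 1e6) is the same as
    # multiplying the read count by 1e9 and dividing by length * total.
    return read_count * 1e9 / (cds_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 1,000 bp CDS in a library of 10 million mapped reads.
print(rpkm(500, 1_000, 10_000_000))
```

Normalising by CDS length and library size in this way is what makes values comparable between genes of different lengths and between samples sequenced to different depths.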
Figure 4 shows global expression maps of V. natriegens and A. salmonicida chromosomal genes centered around the median. Data points (log2 ratio RPKM CDS:RPKM median) for each CDS are shown, as well as a trend line averaged over a sliding window of 200 data points. For Chr 1 the general picture is similar in all three datasets, i.e., RPKM values are typically above the median value on the upper half (i.e., the region closest to the origin of replication), but lower in the region surrounding the terminus, independent of growth conditions. This is somewhat surprising since the observed pattern was expected for fast-growing cultures (i.e., V. natriegens in rich medium), but not for slow-growing cultures (i.e., V. natriegens in minimal medium (see Additional file 3: Fig. S1), and A. salmonicida in LB, 1% NaCl and 8 °C). The rationale is that gene copy numbers (also known as “gene dosage”), and thus expression levels, are expected to be correlated with growth rates/multifork replication [21]. For Chr 2, the results are more ambiguous, although overall similar between minimal and rich growth. For A. salmonicida, expression around the terminus is, on average, higher than that of regions adjacent to ori2. For V. natriegens, expression is generally higher than the median in regions surrounding the terminus, but varies across the remaining parts of Chr 2. As for Chr 1, little difference could be detected between the slow- and the fast-growing datasets for Chr 2.
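The two quantities plotted in the expression maps, the per-CDS log2 ratio to the median and the sliding-window trend line, can be sketched as follows. The pseudocount used to avoid taking the logarithm of zero is an assumption (this section does not say how CDS with zero RPKM were handled), as is the choice to average only over full windows without wrapping around the circular chromosome. Function names are hypothetical.

```python
import math
from statistics import median


def log2_ratio_to_median(rpkm_values, pseudocount=1.0):
    """log2(RPKM CDS / RPKM median) for every CDS.

    Assumption: a pseudocount is added to numerator and denominator so that
    CDS with RPKM = 0 do not produce log2(0).
    """
    med = median(rpkm_values)
    return [math.log2((v + pseudocount) / (med + pseudocount)) for v in rpkm_values]


def sliding_mean(values, window=200):
    """Mean of each full window of consecutive data points (no wrap-around)."""
    if window > len(values):
        return []
    running = sum(values[:window])
    means = [running / window]
    for i in range(window, len(values)):
        # Slide the window one position: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        means.append(running / window)
    return means


# Hypothetical toy example with a window of 3 instead of 200.
ratios = log2_ratio_to_median([3, 7, 15, 7, 3])
print(ratios)
print(sliding_mean(ratios, window=3))
```

CDS must be ordered by chromosomal position (for example, starting at the origin) before the sliding mean is computed; otherwise the trend line has no positional meaning.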
In summary, we found that global expression levels for Chr 1 consistently correlate with the distance to the origin of replication: the log2 ratio of RPKM CDS:RPKM median decreases as the distance from the origin of replication increases.
All pangene categories contribute to higher expression levels around ori1 at fast-growth conditions, but not at slow-growth conditions
The global trend described above can be explained either by generally higher expression levels of all pangene categories located close to ori1, or by generally higher expression of three or fewer of the four pangene categories. To discriminate between the two alternatives, we calculated the RPKM median value for each pangene category, and compared the median values for genes located on the upper or lower halves of Chr 1 (Table 1). The Wilcoxon signed-rank test strongly supports (P-adj ≤ 0.05) that median values for all four pangene categories are significantly higher for genes located on the upper half when V. natriegens is cultured under fast-growth (“optimal”) conditions. Notably, under slow-growth conditions, median values for softcore, shell and cloud genes located on the upper half are also significantly higher. Core genes are, in contrast, expressed at similar levels on both halves. This applies to both V. natriegens in minimal medium (RPKM median = 370 and 360, P-adj = 0.321) and A. salmonicida at suboptimal conditions (RPKM median = 301 and 309, P-adj = 0.717). To summarize, we conclude that gene expression levels correlate with distance to ori1 (Fig. 4), and that genes from all four pangene categories contribute to this trend under fast-growth conditions, whereas only softcore, shell and cloud genes contribute under slow-growth conditions.
Table 1
Comparison of gene expression levels for pangenes located on the upper or lower halves of Chr 1.
| | | A. salmonicida | | | | V. natriegens slow-growth | | | | V. natriegens fast-growth | | | |
| | | core | softcore | shell | cloud | core | softcore | shell | cloud | core | softcore | shell | cloud |
| Upper half (a) | | | | | | | | | | | | | |
| | Q1 | 152 | 118 | 42 | 42 | 188 | 126 | 21 | 5 | 249 | 170 | 36 | 37 |
| | Q2 | 301 | 245 | 89 | 67 | 370 | 288 | 71 | 147 | 447 | 341 | 93 | 269 |
| | Q3 | 853 | 633 | 197 | 197 | 1101 | 760 | 190 | 426 | 1059 | 719 | 241 | 581 |
| | Max | 34 254 | 34 254 | 6 473 | 13 656 | 23 238 | 23 238 | 17 161 | 5 533 | 35 274 | 35 274 | 28 737 | 4 049 |
| Lower half (a) | | | | | | | | | | | | | |
| | Q1 | 151 | 89 | 34 | 25 | 143 | 83 | 4 | 4 | 178 | 109 | 0 | 0 |
| | Q2 | 309 | 207 | 65 | 47 | 360 | 192 | 28 | 18 | 328 | 232 | 26 | 17 |
| | Q3 | 695 | 486 | 133 | 82 | 966 | 565 | 74 | 59 | 696 | 480 | 97 | 62 |
| | Max | 53 501 | 8 098 | 19 837 | 23 646 | 14 116 | 14 116 | 15 800 | 463 | 16 521 | 17 549 | 17 550 | 535 |
| P-value Q2 (b) | | 0.71 | 0.01 | 0.00 | 0.00 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
(a) Q1 is the RPKM value at the first quartile. Q1 is defined as the middle number between the smallest number and the median (i.e., the second quartile, Q2) when the data (in this case RPKM values) are ordered from smallest to largest. The third quartile (Q3) is the middle value between the median (Q2) and the maximum (Max) value.
(b) Adjusted P-values from the Wilcoxon signed-rank test, testing whether Q2 values (medians) of genes located on the upper half of Chr 1 are significantly different from Q2 values of genes located on the lower half. Values below 0.05 are considered significant.
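The quartile summary reported in Table 1 can be sketched with the Python standard library. The interpolation convention is an assumption: `statistics.quantiles` with `method="inclusive"` is one common definition of quartiles and may differ slightly from the one used to produce the table. The function name and the toy values are hypothetical, and the significance test is not reproduced here.

```python
from statistics import quantiles


def quartile_summary(rpkm_values):
    """Return Q1, Q2 (median), Q3 and Max for a list of RPKM values.

    Assumption: quartiles are computed with the 'inclusive' method, which
    treats the data as the whole population (needs at least two values).
    """
    q1, q2, q3 = quantiles(rpkm_values, n=4, method="inclusive")
    return {"Q1": q1, "Q2": q2, "Q3": q3, "Max": max(rpkm_values)}


# Hypothetical toy values for one pangene category on each half of Chr 1.
upper = [120, 300, 450, 800, 2500]
lower = [90, 150, 310, 600, 1900]
print("upper:", quartile_summary(upper))
print("lower:", quartile_summary(lower))
```

Running the summary once per pangene category and per chromosome half gives the layout of Table 1, with one column per category and one row per statistic.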