Borrelia bavariensis genome reconstruction from next-generation sequencing data
The assembly of B. burgdorferi s.l. genomes is known to be difficult due to the fragmentation of the genome and to the presence of highly similar plasmids (like the cp32 plasmid family) [18]. We used a combination of long read (PacBio) and short read (Illumina HiSeq and MiSeq) to overcome this problem.
For three isolates (the B. bavariensis type strain PBi from Germany, a second European isolate A104S from the Netherlands and the Japanese isolate NT24: highlighted in gray in Table 1), we used both sequencing techniques. For each isolate an assembly was first reconstructed using PacBio reads and then assembled contigs of Illumina short reads were mapped to the PacBio assemblies (see Methods). For most of the three genomes, the two methods gave very similar results with over 99.99% similarity between the Illumina contigs and the PacBio assemblies. Most differences were point mutations and 1bp-long indels which are known to occur due to the lower accuracy of the PacBio sequencing method [30]. In such cases, the Illumina version of the sequence was kept.
In one case, the Illumina data allowed us to correct a PacBio assembly. The PacBio assembly for isolate NT24 showed two plasmids of respective sizes of 107,820 bp and 49,218 bp that we originally named plasmids cp32-12+5+6 and cp32-7+7+11 due to the presence of the corresponding PFam32 sequences. These two plasmids seemed to be fusions of three cp32 plasmids each. Mapping the Illumina raw reads on these sequences (Suppl. Fig. 1) showed that several regions of those PacBio plasmids were not covered by Illumina reads which was not the case for other plasmids. The fusions were thus not supported and were probably an artifact of the PacBio assembly. We used contigs from other isolates as a reference for reconstructing the probable plasmids of the cp32 family in this isolate (see below).
In isolate PBi, unmapped Illumina reads contained sequences similar to lp28-8. This plasmid was not reconstructed in the PacBio assembly. Mapping of PBi Illumina contigs on A104S lp28-8 showed that five contigs mapped to this plasmid with 92-99% similarity to the A104S version. However, the original architecture of the plasmid in PBi was probably different as the five mapping contigs did not cover the full A104S sequence and were themselves not mapped over their whole length. Therefore, the final lp28-8 PBi plasmid sequence could not be reconstructed and, additionally, no PFam32 plasmid partition protein could be found for this plasmid. However, another plasmid partition protein of the family PFam50 for lp28-8 was identified in PBi showing that this plasmid is probably present.
For the remaining 30 isolates which were sequenced with Illumina only, we mapped contigs assembled with SPAdes v. 3.10.1 [31] to the final genomes of the three isolates sequenced with PacBio, as well as to plasmids identified as one full contig in the isolates sequenced with Illumina only, with NUCmer v. 3.1 from package MUMmer [32] (see Methods). Plasmid sequences were kept in the final reconstructed genomes only if they were at least 5,000 bp long and were named after the PFam32 protein types identified in their sequence using BLAST v. 2.8.1 [33, 34] or after the reference they were mapped to in case of the absence of a PFam32 sequence (see Methods and Suppl. Table 1). To ensure that the assembly method chosen was good (SPAdes v. 3.10.1 [31]), we also assembled sequence data of 25 isolates with SOAPdenovo v. 1.0 [35] and VelevetOptimizer v. 1.0 [36] (see Methods) and used QUAST v. 4.6 [37] to compare the quality of the three assemblies. As is shown in Supplementary Figure 2, N50 values were significantly higher in SPAdes assemblies compared to assemblies of the two other assemblers (Wilcoxon Rank Sum Tests with each other assembler: Bonferroni-Holm corrected P-Value < 0.01) and the number of contigs was significantly smaller (Wilcoxon Rank Sum Tests with each other assembler: Bonferroni-Holm corrected P-Value < 10-4). In addition, the total length of the final assembly was largest in SPAdes in 24 out of 25 isolates tested. We conclude that, of the three tested assemblers, SPAdes performed the best.
We also remapped the raw Illumina reads on the final reconstructed genomes to check the quality of our reconstruction (see Methods) and show the relative standard deviation (SD) of coverage as a measure of quality in Supplementary Figure 3. A well assembled genome should have a low coverage variance as reads would map evenly to the contigs. The isolates from Asia showed a significantly higher variance in coverage (Wilcoxon Rank Sum Test: P-Value < 10-16) as compared to the European isolates. This could be due to variation in the quality of the original DNA samples, (DNA samples from the Asian isolates were shipped to Germany), or to the lack of good references for certain plasmids due to the higher diversity observed in the Asian isolates. Indeed, the relative SD was higher for plasmids compared to the main chromosome in Asian isolates even if this difference was not significant (Suppl. Fig. 3b). The quality of the assembly did not depend on the method used for obtaining the final plasmid sequence (either as an own entire contig or with contigs mapped to a reference) (Suppl. Fig. 3a).
Genome composition of 33 B. bavariensis isolates
The genomes of the 33 isolates consisted of a main chromosome and a variable number of plasmids (Table 1). Chromosomes were about 900 kb in size (size of reconstructed chromosome varied between 894,779 bp in isolate PBaeII and 906,948 bp in isolate NT24) and made up on average 72.1% of the total assembled genome. Eight to 18 individual plasmid sequences of at least 5,000 bp could be assembled per isolate. Additional plasmid sequences were identified in 11 isolates due to the presence of partition genes or as some contigs mapped to plasmids identified in other isolates (Suppl. Table 1). However, these additional plasmids could not be fully assembled or the assembled sequence did not reach the 5,000 bp criterion. Several reconstructed plasmid sequences, particularly of the lp28 and cp32 plasmid families, are very short (below 10 kb). It is probable that the sequence reconstructed here for these plasmids does not recover the full plasmid length and that the missing sequences were probably erroneously assembled in other contigs due to similarity. This confirms that short read sequencing alone is not sufficient to reconstruct plasmids from these families. Using long-read sequencing was very helpful in the assembly of plasmids in isolates PBi and A104S. However, even the PacBio assembly pipeline failed to reconstruct properly the cp32 content of isolate NT24. For this isolate we used the same strategy as for the isolates with only Illumina data (see Methods) and mapped Illumina contigs to cp32 plasmids from other isolates. This allowed us to reconstruct plasmids cp32-11 and cp32-12. For plasmids cp32-5, -6 and -7 no mapping was possible; we could only use Illumina contigs that were 7.3, 7.3 and 9.9 kb long, respectively, and probably do not represent the full plasmid (Table 1).
The number of plasmids per isolate (Figure 1) was significantly higher in the Asian population (ranging from 10 to 18 reconstructed plasmids over 5 kb long) as compared to the European population (8 to 13 plasmids). As some plasmid fusions were observed and as some plasmids could not be reconstructed, we also tested for the number of PFam32 gene sequences present in each isolate. This was found to be significantly higher in Asian isolates compared to European isolates (Figure 1), again implying that fewer plasmids are present in European isolates than in Asian isolates.
We also tested for a deviation in copy number between plasmids with respect to the main chromosome by plotting the coverage of the raw read mapping to each plasmid relative to the chromosome (Suppl. Fig. 4). As the coverage of Asian B. bavariensis genomes was more variable, we did this for European isolates only. We found the coverage of plasmids lp17, lp28-7 and lp36 to be significantly higher than for the main chromosome for all European isolates. In particular, based on this coverage measure, there were, on average, about seven copies of lp17 per cell in European isolates.
As several plasmids seemed to have a higher copy number compared to the main chromosome based on the read coverage of the Illumina data, we used a qPCR protocol to directly measure the number of DNA molecules present in a strain relative to the main chromosome. We chose to use plasmids cp26 (which we hypothesized to be present in about the same number as the main chromosome, based on read coverage) and lp17 and lp36 (which seemed to have higher copy numbers). We designed a qPCR protocol following Millan et al. [38] with one PCR per plasmid (see Methods for details) on two low passage isolates of B. bavariensis isolate PBi. Each isolate was run using three biological and two technical replicates. As can be seen in Figure 2, the copy number of plasmid cp26 was estimated to be slightly below one copy per chromosome, that of lp36 was about one copy per chromosome and lp17 plasmid was found to be at a higher relative copy number varying between three and five copies per chromosome. This value is lower than the copy number estimated based on the coverage measure but is probably a more accurate estimate.
Shared versus variable genome components
All B. bavariensis isolates sequenced in this study contained, in addition to the main chromosome, plasmids cp26, lp54, lp36, lp17 and lp28-4 (Table 1). In addition, we found in each isolate between 4 and 9 types of cp32 sequences. These were either fused with other plasmids or independent plasmids and their numbers were obtained by counting cp32 PFam32 sequences (as cp32 family plasmids could not be properly assembled in several isolates). Three cases of plasmid fusions were observed in at least two isolates and were thus considered to be true (other cases were not reported as they may have been due to mis-assembly and, in such cases, the plasmids were recorded without the possible fusion). In all European isolates, we observed two cases of fusion of a linear plasmid (lp28-4 or lp25) with a cp32 plasmid (cp32-1 and cp32-3, respectively). These fusions were found to be fixed in European B. bavariensis isolates but were absent from Asian isolates. In addition, plasmid lp17 and lp28-4 were found to be fused in four Asian isolates, but not in any of the European isolates. Interestingly, these isolates were found in independent clades in the phylogeny of the species (see below).
Supplementary Figure 5 shows a schematic representation of the fusions involving plasmids lp28-4, lp17 and cp32-1 with a precise description of the different plasmid types as well as plasmid lp28-7 as we found that translocations occurred within the European population between lp28-7 and lp17. To produce Supplementary Figure 5, we first had to determine plasmid types for the four plasmids under study. Following Casjens et al. [14] we counted a new plasmid type each time a deletion or insertion of at least 400 bp was observed and for each translocation or inversion of at least 400 bp (see Methods for more details). We were able to identify nine lp28-7 types, 12 lp17 types including two fusions with lp28-4, five other lp28-4 types, six cp32-1 types and six versions of the fused plasmid lp28-4+cp32-1. There was no case of two Asian isolates sharing the same plasmid types for each one of these four plasmids (i.e. lp17 ; lp28-4; lp28-7 and cp32.-1) and in the European population we could identify only three groups of two isolates and one group of three isolates that shared the same plasmid types for lp28-7, lp17 and lp28-4+cp32-1. Even if many short indels were observed on plasmid lp28-4, we could identify an almost 20 kb-long sequence that is shared by all types with or without fusions. The fusion of plasmids lp17 and lp28-4 in four Asian isolates was found to have occurred without any other big rearrangements. However, we identified two different architectures for this fusion. In isolate J-14 (and in isolates FujiP2 and Hiratsuka that were mapped to it) we observed a fusion of the 5’ ends of plasmids lp17 and lp28-4, thus lp17 appeared to be flipped. In isolate Arh923, the two plasmids were fused by their 3’ ends. Of course, this could have been due to mis-assembly. The fusion of lp28-4 and cp32-1, that is fixed in the European population, was shown to be an insertion of cp32-1 into lp28-4. There were two very different types of cp32-1 plasmids in the Asian population, with only about 10 kb homology. The fused plasmid observed in the European isolates seems to have occurred using the cp32-1 type carried by Asian isolate Hiratsuka (or a related cp32-1 type), which does not have more than 2 kb homology with the other Asian type of cp32-1. Apart from these two fusions, we could also observe a reciprocal translocation that occurred between plasmids lp28-7 and lp17 in the European population. Five European isolates including the reference strain PBi carry at the end of plasmid lp17 a 2.5 kb-long sequence that is found at the beginning of plasmid lp28-7 in all other isolates. And reciprocally, plasmid lp28-7 of three of these five isolates (the other two do not have a lp28-7) carry at their beginning a 5 kb-long sequence that is found at the end of lp17 in all other isolates. Both regions contained genes encoding outer membrane proteins.
We used RAST [39, 40] to annotate the reconstructed B. bavariensis genomes and, following the method by Mongodin and colleagues [17], kept all detected genes of at least 50 amino-acid length. The main chromosome was found to contain on average 816.4 genes that met this criterion (range 812 - 842) and on average 94% of the chromosome sequences were coding with very low variation among isolates (standard deviation 0.41 – see Figure 3a). This was significantly higher than in plasmids (Welsh T test T= 32.0, df = 427, P-value < 0.001). Circular plasmids had a significantly higher percentage of coding sequence compared to linear plasmids (average circular: 82.4%, average linear: 66.1%, Welsh T test, T= 14.6, df = 366, P-value < 0.001; fusions between circular and linear plasmids were excluded). As shown in Figure 3b, annotated genes were also significantly longer on the chromosome compared to the plasmids (mean 981 bp, Welsh T test T= 65.0, df = 434, P-value < 0.001). Circular plasmids had significantly longer genes compared to linear plasmids (average circular: 562 bp, average linear: 514 bp, Welsh T test, T= 3.4, df = 326, P-value < 0.001; fusions between circular and linear plasmids were excluded). We used BLAST v. 2.8.1 [33, 34] at the amino-acid level (algorithm BLASTp) to compare each of the 33 isolates with the 32 others for gene content. A hit between protein sequences in two different isolates was kept if the hit had at least half the length of the original gene and if the identity between the two sequences was as least 90%. Using these criteria, we found that at least 93% of the genes located on each chromosome had a hit on every other chromosome. This confirmed that the chromosome was highly conserved within the species B. bavariensis, even between Asian and European isolates. Indeed the best hits between isolates for each chromosomal gene had on average 98.8% sequence identity when the compared genes were from isolates from within the same continent and 97.5% when the compared isolates were from different continents. Plasmid cp26 was also found to be highly conserved with on average 91.1% of its 26 to 28 genes being shared with the cp26 plasmids of each other isolate and the identity of the best hit in each isolate being on average 99.0% for isolates from the same continent and 92.7% for isolates from the other continent.
Out of the 24 different plasmids assembled from the genomic data of the 33 B. bavariensis isolates (without taking fusions into account), 19 were not found in all isolates. This estimated variable portion represented on average 19.2% of the total reconstructed genomic content of each isolate and 68.3% of the total assembled plasmid content. These size estimates of the variable genome represent only a lower bound because some plasmids found in all isolates are nevertheless not similar over their whole length and some plasmids were not successfully assembled. The greatest degree of diversity was observed on the two plasmid families lp28 and cp32 which were represented by seven and ten members, respectively, over all isolates with only lp28-4 found to be present in every isolate.
Evolution of the species
We used BEAST v1.8.0 [41] to reconstruct the phylogeny of the main chromosome (see Methods for more details) for all of our 33 isolates as well as four additional isolates for which chromosomal sequences have been published in GenBank (under accession numbers CP000013 for strain PBi from Germany, CP003151 for strain BgVir from Russia and CP003866 and CP007564 for strains NMJW1 and SZ from China). We used B. garinii strain 20047 as an outgroup to root the tree (GenBank accession number CP028861). The resulting phylogeny, presented in Figure 4, shows that the two continental populations are clearly divergent with a deep branching. The European population is characterized by a very short-time divergence and an almost clonal recent evolution as has already been noted [11]. The Asian population, even if showing greater overall divergence, does not show any geographical structure: isolates from Japan, China and Russia are found in the same terminal clades. Asian isolates also did not cluster by origin of the isolate (questing tick or patient). In Europe, only one isolate from a tick was available and this had no special position in the phylogeny. Both chromosome assemblies for the PBi type strain (ours and that published as CP000013) were both located in the same clade. We compared RAST [39, 40] annotation results for both PBi chromosome sequences and found that there was perfect synteny between the two (Suppl. Fig. 6).
In this phylogeny, we also indicated gains, losses and fusions of plasmids based on the reconstructed genomes using maximum parsimony (Table 1 and Suppl. Table 1). This showed that, in addition to five plasmids present in all isolates, four other linear plasmids and five cp32 plasmids could have been present at the root of the tree in the ancestral B. bavariensis. These plasmids would then have been subsequently lost in some derived isolates. Nine gain and ten loss events could be placed on internal branches and thus were shared by at least two isolates. In the European clade, three plasmid loss events on internal branches and ten plasmid loss event on terminal branches were found, whereas only two gain events were identified (plasmid lp28-9 shared by five isolates and plasmid cp32-12 found only in isolate PBN). This shows that the plasmid repertoire of the European population is rather stable with only plasmid losses that could have been due to isolate cultivation in the laboratory rather than to real evolutionary change. In the Asian population, according to our maximum parsimony reconstruction, plasmid gains were as frequent as plasmid losses on internal branches (eight gain events for seven loss events) but there were twice as many losses as gains on terminal branches (13 gains for 29 losses).
Genetic diversity within and between the Asian and European populations was estimated by nucleotide diversity (π [42]) and genetic distance (FST [43]) for the main chromosome and seven plasmids with orthologous regions in at least five isolates in each population (see Methods, Table 2). Diversity was found to be lower in the European population compared to the Asian population by one to two orders of magnitude depending on the genomic segment and to be lower for the main chromosome compared to plasmids. Genetic distance between Asian and European populations was lowest for lp25 (0.36) and highest on lp36 (0.69).
Table 2: Within and between population genetic diversity for the main chromosome and plasmid orthologous regions
Genomic region
|
# Asia
|
# Europe
|
Length (bp)
|
# SNP
|
π Asia
|
π Europe
|
FST
|
chromosome
|
17
|
19
|
920,528
|
42,039
|
8.79*10-3
|
1.72*10-4
|
0.56
|
cp26
|
15
|
19
|
29,623
|
1,979
|
1.54*10-2
|
1.99*10-4
|
0.50
|
lp17
|
14
|
19
|
13,732
|
1,331
|
1.98*10-2
|
4.99*10-4
|
0.49
|
lp25
|
13
|
18
|
27,833
|
3,232
|
2.97*10-2
|
7.03*10-4
|
0.36
|
lp28.3
|
11
|
19
|
11,152
|
1,572
|
6.80*10-2
|
8.23*10-4
|
0.50
|
lp28.4
|
14
|
18
|
31,849
|
4,144
|
2.05*10-2
|
2.62*10-3
|
0.52
|
lp36
|
14
|
19
|
9,819
|
1,081
|
2.34*10-2
|
3.66*10-4
|
0.69
|
lp54
|
15
|
19
|
67,261
|
8,167
|
2.06*10-2
|
3.47*10-4
|
0.59
|
Genetic diversity (π [42]) within populations and genetic distance (FST [43]) between populations were estimated on orthologous sequences aligned with MAFFT v 7.407 [44, 45]. The number of single nucleotide polymorphisms (SNP) is indicated for both populations mixed and the length is the length of the alignment
We also estimated genetic diversity along the main chromosome and for plasmids cp26 and lp54, in which alignments were possible over the whole length (Suppl. Fig. 7, 8 and 9). For all three replicons, we identified peaks of diversity either between populations from the two continents (peak only when considering all isolates) or in one or both regional populations. We found high diversity in several chromosomal genes coding for proteins located in the outer membrane of the bacteria (OppA, ABC transporter, Lmp1, PTS system). This was also true for lp54, particularly in the Asian population, with diversity peaks located in the genes encoding OspA, OspB,DbpA and in the PFam54 gene array. On cp26, the ospC gene is well known for having high diversity in several B. burgdorferi s.l. species including B. bavariensis which is confirmed here for the Asian population [11, 17, 46, 47].
As ospC showed a high diversity, and as this locus is known to be a hotspot of recombination in several B. burgdorferi s.l. species [11, 46, 48], we reconstructed a phylogeny of this gene and compared it to that of the cp26 plasmid cutting out the ospC locus. Several publicly available sequences for B. bavariensis (strain BgVir), B. garinii (strains Far04 and PBr), B. afzelii (strains ACA-1, K78 and PKo) and B. spielmanii (strain A14S) (see Methods for details) were additionally included in this analysis. As can be seen in Figure 5, the cp26 phylogeny followed the known species tree with B. bavariensis and B. garinii being sister species as are B. afzelii and B. spielmanii. The phylogeny of plasmid cp26 within B. bavariensis was very similar to the phylogeny reconstructed for the main chromosome (Figure 4), except for minor differences in clustering of Japanese isolates. However, the phylogeny reconstructed for ospC was quite different and showed two major clades. One clade contained all European B. bavariensis and all B. afzelii as well as some Asian B. bavariensis and one of the two B. garinii strains. The second clade contained B. spielmanii, the other B. garinii strain and the rest of the Asian B. bavariensis haplotypes. Apart from the European B. bavariensis clade (where we observed only two different ospC haplotypes with only one non-synonymous difference between them) and the B. afzelii clade, all other species or populations with several isolates were found not to be monophyletic.