PFGE analysis of B. miyamotoi Izh-4 strain
Pulsed-field Gel Electrophoresis (PFGE) analysis revealed a chromosome with a length of ~900 kb and nine non-chromosomal fragments (potential plasmids) (Fig. 1). The first three non-chromosomal fragments with sizes ranging from 72 kb to 64 kb were similar among all Russian B. miyamotoi isolates [41] (data not shown). The remaining bands indicated the presence of additional six plasmids with sizes ranging from approx. 40 kb to 13 kb. This is probably an underestimate, since it is well known that plasmids with similar sizes or circular plasmids (which may have different migration patterns than linear plasmids) may not be identified by PFGE.
B. miyamotoi strain, genome sequencing and assembly
In order to obtain a high quality reference genome for comparative genomics of B. miyamotoi, the genome of isolate Izh-4 was randomly chosen from available Russian clinical isolates [41] (Additional file 1: Table S1) and sequenced using different sequencing platforms including Illumina MiSeq and HiSeq, ONT MinION, and Pacific Biosciences SMRT. Assemblies of long reads were corrected using long reads (e.g. PacBio with PacBio; ONT with ONT) and subsequently using highly accurate Illumina sequence reads by means of the Pilon pipeline [42].
Using the MinION platform we obtained 129,992 raw reads of an average length of 6.6 kb. After correction and trimming in the Canu v1.7 pipeline the number of long reads decreases to 31,584 with an average length 7.3 kb. The assembly showed 16 contigs with lengths ranging from 900 kb to 10 kb. Manual validation revealed that two of them - tig00009030 and tig00000013 – were characterized by a specific coverage pattern of ONT reads in two peaks indicating that two separate plasmids were merged. Moreover, the two contigs were 46 kb and 50 kb in size, which was not in line with the PFGE analysis (Additional file 2: Figures S1-S3). Therefore, these contigs were split into two contigs and processed as separate plasmids. In addition, three of the resulting 18 contigs were characterized by low long read coverage (2-3x) and had a high similarity level (≥ 95%) to other contigs and were therefore removed from further analysis. Finally, two of the 15 remaining contigs were automatically circulated with lengths of 30 kb and 29 kb. To summarize, using this method, in the end we obtained 15 contigs corresponding to one main chromosome and 14 potential plasmids , with coverage by trimmed reads ranging from 300x to 20x (Table 1).
Using the PacBio platform we obtained 312,224 raw reads with an average length of 4 kb. Using 2,635 corrected reads with an average length of 8.8 kb 20 contigs were assembled, with a contig length varying from 6 kb to 910 kb. Three low-coverage contigs, with sequences present in other parts of the genome, were assumed to be assembly artifacts and were removed. Two contigs were manually circularized based on overlapping ends.
Mismatches between ONT and PacBio assemblies were noted and differences to hypothetical lengths of plasmids in PFGE were observed. PacBio unitig#3 was 68 kb in size and was not identified in PFGE. It was similar to three separate ONT contigs (41 kb, 27 kb and 22 kb) (Additional file 2: Figure S4). Three PacBio unitigs corresponding to an ONT contig of 70 kb were identified, so ONT contig was mistakenly split into three separate PacBio contigs (Additional file 2: Figure S5). Moreover, two of these PacBio unitigs #20 (~38 kb) and #22 (~38 kb) were not observed in PFGE. The 64 kb ONT contig was partially represented in unitig#10, which was 43 kb in size (Additional file 2: Figure S6) and also not found in PFGE. These mis-assemblies of PacBio sequences might have been due to a low amount of DNA submitted for sequencing (1.2 μg), which was lower than requested by the sequencing service (5-10 μg) and did not permit BluePippin size selection. Nonetheless, the remaining contigs were similar between PacBio and ONT assemblies. ONT contigs that were split based on coverage analysis were confirmed by PacBio unitigs as separate sequences. Overall, the extracted consensus sequences from PacBio and ONT assemblies (corrected by using highly accurate Illumina reads) resulted in a complete genome consisting of a chromosome of ~900 kb, and 14 putative plasmid contigs, of which two were circular and 12 linear, ranging in length from 6 to 73 kb.
The contigs of the above-described final assembly was also compared with the contigs obtained by direct sequencing of DNA fragments extracted from the agarose gel after separation by PFGE. These contigs were matched using Mummer and visualized by Circos. A number of contigs were produced for the different bands, but only a subset in each band represented the plasmid in question (see Figure 1 and Additional file 2: Figures S7-S15). For example, for the PFGE fragment N1, 85 contigs were assembled from Illumina short reads, but only one contig of a length of 72,707 bp completely reproduced the lp72 plasmid in the final assembly. Although we were able to identify the majority of linear plasmids by direct sequencing of PFGE fragments, among the collected contigs no sequences corresponding to circular plasmids (cp30-1 and cp30-2) were found. Two of the plasmids, namely lp70 and lp64, were highly fragmented. Many small contig with low k-mer coverage compared to major contigs were observed and were possibly the result of sample contamination during the DNA isolation process.
The final composition of genome is summarized in Table 1. This assembly was deposited in GenBank, BioSample SAMN07572561.
Genome content
Genome annotation of isolate Izh-4 revealed a total of 1,362 genes including 31 genes for transfer RNA (tRNA), one cluster of three genes of ribosomal RNA (rRNA) (5S, 16S, 23S) and three genes of non-coding RNA (ncRNA). Out of the 1,362 genes, 1,222 have been annotated as protein-coding genes. The analysis showed the presence of 103 (7.5%) pseudogenes in the Izh-4 genome (Table 2). The majority of pseudogenes were the result of a frameshift. The number of pseudogenes differed between genomic elements and ranged from 0 to 24. The highest number of pseudogenes was present in two plasmids, lp70 and lp64, and in the chromosome, with 24, 23 and 22 pseudogenes, respectively.
Functional classification of proteins by comparison with previously defined clusters of orthologous groups (COG) showed that approximately 81% of chromosomal proteins and only 16% of the plasmid proteins of Izh-4 could be assigned to 25 different COG categories (rpsblast, threshold E-value 0.01). This confirms that the chromosome is well conserved. Indeed, a comparison based on COG between the chromosomes of Russian isolates with the previously sequenced genomes of the American (CT13-2396) and Asian (FR64b) genotypes did not reveal significant differences either.
A high percentage of COG-classified proteins localized on some plasmids indicated the existence of plasmids carrying vital genes that likely encode proteins that contribute to basic metabolic processes. For example, according to our analysis plasmid lp41 (41 kb) encodes 12 COG-classified proteins, and the three plasmids lp72, lp70 and lp64 encode 15, 10 and 9 of such proteins, respectively (Table 2). It is worth mentioning that lp41 is the main virulence plasmid carrying and expressing the “main variable surface proteins” (variable major proteins, Vmps) [28].
Borrelia miyamotoi chromosome
Pairwise sequence comparison of the linear chromosome of Izh-4 with the previously sequenced genomes of FR64b (Japan), CT14D4, LB2001, and CT13-2396 (USA) of B. miyamotoi revealed that the average nucleotide identity (ANI) between chromosomes of Izh-4 and FR64b amounted to 99.97% and to 97.77% to isolates from the USA. Whole genome alignment of these chromosomes did not reveal any noticeable genomic rearrangements such as long insertions\deletions, duplications of regions, and translocations, confirming the conservative nature of the B. miyamotoi linea chromosome. However, small differences were detected in polymorphisms of tandem repeats (VNTR), single nucleotide polymorphisms (SNPs), and small indels (Additional file 3: Figures S30 – S31 and Table S2). The total number of differences detected among chromosomes was - unsurprisingly - different between isolates from different geographic regions: Izh-4 and isolates from the USA showed an average of 18,563 differences; Izh-4 and the Japanese isolate had merely 122. The majority of differences were base substitutions. We also identified five (possibly) canonical sites containing VNTRs (Additional file 3: Figure S30). Such differences may be useful for developing future subtyping schemes for B. miyamotoi clinical isolates.
Plasmid typing by analysis of paralogous gene families (PF) genes
The identified 14 plasmid contigs and the chromosome of Izh-4 were subjected to an analysis to define the type of partition proteins and to decide on potential names for particular plasmids. In order to identify genes homologous to the plasmid replication/maintenance proteins PF 32, 49, 50, 62 and 57 [43, 44], extracted nucleotide sequences of open reading frames (ORFs), including genes annotated as pseudogenes, from the Izh-4 genome as well as reference genomes of different Borrelia species were submitted to interproscan annotation and used for comparative phylogenetic analysis (See the Methods section for a more detailed description). Reference genomes included B. burgdorferi B31 (GCA_000008685.2), B. afzelii PKo (GCA_000222835.1), B. duttonii Ly (GCA_000019685.1), B. hermsii HS1 (GCA_001660005.1), B. miyamotoi CT13-2396 (GCA_001767415.1), B. miyamotoi FR64b (GCA_000568695.2), and the partially sequenced genome of Borrelia miyamotoi LB-2001 (GCA_000445425.4).
We identified that Izh-4 possessed contigs characterized by different PF genes (Fig. 2). Using a method that was previously described for B. burgdorferi (43), we defined the plasmid types in Izh-4 by investigating the phylogenetic relatedness of PF genes to reference genomes. PF genes 32, 49, 50, 57/62 found on the chromosome and several plasmids (lp72, lp41, lp23, lp6) were phylogenetically closely related and formed monophyletic clades to PF genes corresponding to plasmids of genome CT13-2396 (Additional file 4: Figures S37 – S40). Despite the fact that in Izh-4 a plasmid of 27 kb length had the same PF genes as the plasmid named lp23 in CT13-2396, we choose the same name for these plasmids which is in accordance to plasmid typing in B. burgdorferi sl [43]. Notably, PF genes of Izh-4 and FR64b clustered together in more cases than they did with CT13-2396, indicating a closer genetic/genomic relatedness of Russian and Japanese B. miyamotoi isolates than of Russian and North American isolates (including plasmid content).
We found two plasmids - lp70 and lp64 - that have not previously been described in Borrelia. Each of these plasmids carried several sets of PF genes suggesting that they were formed by fusion of different types of plasmids in the past. Plasmid lp70 of Izh-4 carried two copies of PF32, which phylogenetically clustered with plasmid fragments of FR64b. However, one of the copies showed high similarity to the PF32 of plasmid cp2 of CT13-2396 (Additional file 4: Figure S37). Plasmid lp64 carried three sets of PF 32, 49, 50, 57/62. Of these one cluster was represented only by PF50 while PF57/62 was a pseudogene and PF32 and PF49 were absent. The other two sets of genes had four PF genes, but one set was characterized by the presence of pseudogenes related to PF 32 and 49 (Fig. 2). Two copies of PF32 of lp64 clustered in different phylogenetic groups and similar copies were found in the FR64b genome. One of the copies of lp64-PF32 had a common ancestor with the PF32 located on plasmid pl42 of B. duttonii isolate Ly; the other copy (pseudogene) had a common ancestor with the PF32 located on plasmids lpF27 of B. hermsii HS1 and lp28-7 of B. afzelii PKo (Additional file 4: Figure S37).
Plasmids lp29, lp27, lp24, lp18-2, and lp13 possessed only one copy of PF57/62, but the copy in plasmid lp18-1 was a pseudogene of PF57/62. This was consistent with data from previously sequenced genomes [11]. For instance, B. miyamotoi CT13-2396 plasmids lp30, lp20-1, lp20-2 and lp19 have only the PF57/62 gene, and plasmid cp4 only carried a PF50 (Additional file 4: Figure S39, S40). Although the classification of plasmid compatibility types was mainly based on the phylogeny of the PF32 locus, in cases where this locus was absent, we used PF57/62 for plasmid typing. In the phylogeny of PF57/62, plasmids lp29, lp27, lp24, lp18-2, and lp13 of Izh-4 and other B. miyamotoi isolates formed a clade distinct from most other RF and LB species, except for B. hermsii HS1 lpG27. Near identical PF57/62 were found for two pairs of plasmids of Izh-4: plasmids lp29 - lp27 and lp18-1 - lp18-2. This could raise the question whether these are indeed different plasmids. However, these pairs of plasmids had no other extended regions of nucleotide similarity (Additional file 3: Figures S33, S34) beyond the PF57/62 locus, indicating they are two different pairs of plasmids. PF57/62 of plasmid lp13 clustered together with the PF57/62 of lp30 of CT13-2396 and a gene located on a plasmid fragment (CP004259.1) of FR64b. The PF57/62 of Izh-4 lp24 was nearly identical to a homologous gene located on a plasmid fragment (CP004252) of FR64b. It should be noted that clustering of plasmids based on PF32 genes correlates with groups of plasmids based on PF57/62 clustering, indicating a similar evolutionary patterns between PF32 and PF57/62. Since we did not identify variants of the PF57/62 genes of previously sequenced B. miyamotoi genomes that would be close enough to the PF57/62 genes of the Izh-4 genome, we decided to establish the names of plasmids based on their length.
The analysis allowed us to identify only two circular plasmids, each of which was approximately 30 kb in length. The percentage of identity between them was 79%. The set and relative position of ORFs between these plasmids was collinear, with the exception of the variation in the number of Mlp genes (cp30-1 had two genes, cp30-2 had one gene) and inversion of the gene cluster of PF 32, 49, 50, 57/62. Both plasmids are characterized by the presence of genes encoding PBSX phage terminase large subunit, site-specific integrase, indicating a relationship to prophage-related plasmids. In addition, both circular plasmids are characterized by the presence of a complete set of PF 32, 49, 50, 57/62 genes. According to the phylogeny of the PF32 genes, these two plasmids belong to different phylogenetic clusters. The PF32 gene of plasmid cp30-1 was more closely related to the PF32 gene localized on plasmids pl28 (B. duttonii Ly) and pl28-8 (B. afzelii PKo). In turn, the PF32 gene of plasmid cp30-2 was phylogenetically closest related to the PF32 gene localized on plasmid lpT28 of B. hermsii HS1.
Organization of the lp41 virulence plasmid
Plasmid lp41 appears to play a pivotal role in virulence of B. miyamotoi by expressing the Vmps, which enable the bacteria to escape the host immune system during infection [28]. We performed a comparison of lp41 plasmids using BlastN analysis between Izh-4 and earlier sequenced isolates of B. miyamotoi from USA (LB-2001 and CT13-2396) and Asia (FR64b). This analysis revealed a high degree of similarity in the relatively conserved 3 'and 5' regions flanking the variable region of the Vmp genes (Fig. 3). Izh-4 carries a gene encoding the Vlp-δ protein (locus tag: CNO09_05195) after the expression site, while genomes FR64b and CT13-2396 carry Vlp-γ (BOM_1113, AXH25_04655) (Fig. 4) and LB-2001 carry Vsp1 (I871_B20) (Fig. 5).
Some minor 800 bp insertions were detected at the left-end of lp41plasmids between pairs of isolates: FR64b - Izh-4 and CT13-2396 - LB-2001 (data not shown). At the same time, the number and order of the Vmp genes were unique for each of the isolates (partially shown in Fig. 3 and Fig. 6). In addition, single nucleotide variations as well as a 138 bp deletion in an intergenic region before the expression site were detected in both Asian genomes, Izh-4 and FR64b, in comparison to CT13-2396 and LB-2001 (Additional file 3: Figure S35). This might be a marker for differentiation of lp41 plasmids of Asian and American genotypes. Importantly, the organization of the sequence expression site did not differ between B. miyamotoi isolates, the nucleotide composition of the Ribosome Binding Site (RBS), the "-10", and "-35" sites were 100% identical (Additional file 3: Figure S35, bottom), which could be very helpful in identifying the expressed Vmp [28].
Intragenetic diversity of Variable Large Proteins and Variable Small Proteins
All Izh-4 nucleotide sequences of genes and pseudogenes were searched to assess whether they belonged to the family of lipoproteins in the InterPro database. In total, we found 39 genes encoding variable large proteins (Vlp), nine of them were pseudogenes, and 15 genes encoding variable small proteins (Vsp), including five pseudogenes. Vlp and Vsp genes were clustered in an island manner and were mostly located on plasmids lp41, lp29, lp23 and lp24. Some single Vsp genes were located on lp64, lp18-2 and lp13 plasmids (Fig.6).
Phylogenetic analysis of the extracted Vlp genes and pseudogenes of four B. miyamotoi genomes showed that Vlp genes of Izh-4 formed well supported clades: four clades of Vlp-δ (20 genes), Vlp-γ (13 genes), Vlp-α (five genes) families and one gene on lp29 plasmid corresponded to Vlp-β (Fig. 4). The closest homologues to Vlp-β at 78% amino acid identity were identified in the genomes of B. crocidurae DOU (AHH07120.1) and B. hermsii (WP_064536660.1). Notably, Vlp-β genes were not described in the genomes of B. miyamotoi LB2001 (28), however, similar genes were present in the genome of CT13-2396 (AXH25_04965) and the partially sequenced genome of FR64b genome (BOM_1386)(Fig.6, lower violet branch).
Phylogenetic analysis of the extracted Vsp genes did not show any patterns of clustering (Fig. 5). However, comparing 14 of Vlp and 4 Vsp genes showed that they are present in two identical copies located on plasmids lp41 and lp23. A BLAST analysis of nucleotide sequences of these plasmids showed that the right parts of plasmids lp41 and lp23 were identical, with the same order of Vlp and Vsp genes and its pseudogenes (Additional file 3: Figure S36). Pairwise comparison of plasmids containing clusters of these genes did not reveal any similarities like the one found between lp41 and lp23. Such right-end similarity of lp41 and lp23 was also detected in CT13-2396.
Comparison of plasmid sequences among B. miyamotoi isolates
To explore the plasmid similarity between different B. miyamotoi isolates, we compared the nucleotide sequences of the three isolates CT13-2396, FR64b and Izh-4 (Additional file 2: Figure S15 – S29). We chose these isolates since for CT13-2396 and Izh-4 almost completed genomes were available and for FR64b a draft genome with 50 contigs was accessible in GenBank. Within these three genomes, we found four common plasmids with high nucleotide similarity: lp72, lp41, lp23 and lp6 (Table 3). Plasmids lp70, lp64, lp27, and lp13 of Izh-4 were only present in the Asian FR64b genome, but absent in the North American isolate CT13-2396. Plasmids cp30-1, cp30-2, lp29, lp24, lp18-1 and lp18-2 were partly present in the F64b genome, and absent in CT13-2396.
Phylogenetic analyses
Phylogeny of Borrelia spp. based on chromosomal genes
To understand the relationships of isolate Izh-4, North American and Asian B. miyamotoi isolates as well as to other Borrelia species, we performed a phylogenetic analysis of the newly sequenced genome (Izh-4) and Borrelia genomes deposited in GenBank (Additional file 1: Table S1). To date, these genomes comprised completed chromosomes and/or several completed plasmids (lp73, lp41, lp23 and lp6). The phylogenetic tree was reconstructed using a concatenated alignment of nucleotide sequences of 245 core genes localized on the chromosome (minimum percent identity for Blastp 70%) and identified during the process of protein clustering among all Borrelia genomes. This phylogenetic analysis showed that B. miyamotoi forms a monophyletic clade inside the relapsing-fever group and was split into two lineages belonging to the Asian and American genotype. The Asian lineage includes the Izh-4 and FR64b from Japan (Fig. 7, A).
For a more detailed analysis, i.e. to determine intraspecific differences among B. miyamotoi isolates, we conducted a reciprocal Blastp search for core genes, but now only within the species B. miyamotoi. As a result, 719 orthologous genes were identified (minimum percentage identity for blastp 80%) (Fig. 7, B).
Mean SNP-distances (in concatenated alignment of core genes) between isolates from Northeast America (CT13-2396, CT14D4, LB-2001), Japan (FR64b) and Russia (Izh-4) were as follows: Northeast American - Russian – 13,767 SNPs, Northeast American – Japanese – 13,776 SNPs, and Russian – Japanese - 36 SNPs. Among the three Northeast American isolates six SNPs were found.