Ensemble pipeline for complete Borrelia genome reconstruction.
In this study, we show that complete Borrelia genomes can be reconstructed by combining different sequencing technologies and assembly strategies with several manual curation and data-merging steps. These steps are summarized in an ensemble pipeline, an overview of which is shown in Figure 1 and outlined below (see Materials and Methods for details).
Borrelia strains were cultured and DNA was extracted (grey in Figure 1). Sequencing was performed using PacBio SMRT long-read technology (blue in Figure 1) and Illumina short-read technology (orange in Figure 1). PacBio long-read sequencing yielded two datasets: PacBio subreads and PacBio HiFi reads (CCS with a minimum of 3 passes and a minimum accuracy of 0.99). The PacBio microbial assembler was used to assemble the subreads, whereas the PacBio Improved Phased Assembler (IPA) and the HiCanu assembler were used for the HiFi reads. The microbial assembly is based on low-accuracy PacBio subreads and was polished using highly accurate Illumina reads. In addition, a hybridSPAdes assembly was performed on the Illumina data and the PacBio microbial contigs. If a contig generated by the microbial assembler was of poor quality or incomplete, it was replaced by, or concatenated with, the corresponding hybridSPAdes contig. The consensus of the microbial assembly may therefore result from a combination of Illumina and PacBio subread data (purple in Figure 1). In contrast, assemblies based on the highly accurate PacBio HiFi reads were polished using HiFi reads instead of Illumina data, so the consensus of the IPA and HiCanu contigs is based only on PacBio HiFi data. Afterwards, quality control (QC) and refinement steps were conducted for the microbial, IPA and HiCanu consensus sequences (yellow in Figure 1). Finally, the assembly results were manually compared for correctness and completeness and combined to generate the final consensus representing a completed Borrelia genome (red in Figure 1).
In the following, the results of the QC and refinement steps as well as the generation of the final consensus are described in detail.
QC and refinement steps
Assembly statistics and quality
Assembly statistics and quality results (number of contigs, largest contig, total length, N50, L50 and completeness) determined using QUAST and Merqury are shown in Table S1.
We observed a trend toward the lowest contig numbers (potentially indicating high quality) for the IPA assembler and the highest contig numbers (potentially indicating low quality) for the HiCanu assembler, while no trend was apparent for the microbial assembler (Table S1). Although the IPA assembler seemed promising because of its low contig numbers, it typically produced the shortest contigs, the smallest total length, low N50, high L50 and the lowest degree of assembly completeness. Further analyses revealed that it tended to leave more genome elements incompletely assembled. This indicated limited suitability of the IPA assembler for Borrelia genome reconstruction. In contrast, although HiCanu assemblies showed the highest contig numbers, HiCanu typically generated the largest contigs, the largest total length, high N50, low L50 and the highest degree of assembly completeness, which indicated that it performed very well. Further analyses confirmed the high quality of the HiCanu assemblies, and the high number of contigs could be explained by duplicates (see section “Genome reconstruction from different assemblies”). The microbial assembler also yielded good assembly statistics and performed nearly as well as the HiCanu assembler.
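For readers less familiar with the contiguity metrics used here, the following minimal sketch shows how N50 and L50 follow from a list of contig lengths. It illustrates the standard definitions only and is not the QUAST implementation; the function name and the toy lengths are assumptions made for illustration.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp.

    N50: length of the shortest contig in the smallest set of longest
         contigs that together cover at least half of the assembly.
    L50: number of contigs in that set.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Toy lengths for illustration only (not taken from Table S1).
toy_lengths = [10, 8, 6, 4, 2]
print(n50_l50(toy_lengths))  # (8, 2)
```

A high N50 combined with a low L50 means that a few long contigs account for most of the assembly, which is the pattern described for HiCanu above.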
Contig trimming
PacBio contigs often contain reads that wrap around the hairpin ends of linear elements, generating long inverted repeats. If such a wraparound in an untrimmed contig contains the telomere consensus sequence TAGTATA, typically 14 bp from the center of the wraparound inverted repeat (to be described in more detail in a subsequent publication), it is considered to indicate the presence of a telomere on a linear replicon. Circular plasmid contigs, on the other hand, typically show terminal direct repeats due to their circular and continuous structure. The wraparounds and terminal direct repeats need to be trimmed off to generate a correct final sequence, and dot plot analyses were used to identify their presence. Figure 2 shows several examples of dot plots of contigs with wraparounds (linear genome elements) and terminal direct repeats (circular genome elements) before and after trimming.
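The two signatures described above can be illustrated with a toy exact-match sketch. This is not the dot plot procedure used in this study: real contigs contain mismatches and a hairpin loop, which dot plots tolerate and this sketch does not. The function names, the `min_len` parameter and the 40 bp example sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def terminal_direct_repeat_length(contig, min_len=8):
    """Length of the longest prefix that reappears exactly as the suffix (0 if none).

    A non-zero result is the signature expected for a circular element.
    """
    for k in range(len(contig) // 2, min_len - 1, -1):
        if contig[:k] == contig[-k:]:
            return k
    return 0


def left_wraparound_length(contig, min_len=8):
    """Length w of the longest start segment that is the exact reverse
    complement of the w bases following it (0 if none).

    A non-zero result is the inverted-repeat signature of a wraparound.
    """
    for w in range(len(contig) // 2, min_len - 1, -1):
        if contig[:w] == reverse_complement(contig[w:2 * w]):
            return w
    return 0


# Hypothetical 40 bp replicon used to build two artificial contigs.
replicon = "ACGGTCATTGCAAGCTTAGGCCATGTACCGATCGTTAAGC"
circular_contig = replicon + replicon[:10]                   # read-through of a circle
linear_contig = reverse_complement(replicon[:12]) + replicon  # wraparound at the left end

print(terminal_direct_repeat_length(circular_contig))  # 10
print(left_wraparound_length(linear_contig))           # 12
```

In this simplified picture, the position returned by `left_wraparound_length` marks the center of the inverted repeat, which is where the trimmed linear sequence would start.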
Figure 2A and 2B show the two dot plots of PBaeII lp54 (contig ctg.s2.000000F of the microbial assembly) before and after trimming, respectively. The untrimmed contig contains long wraparounds of several thousand bases at the left end (1 bp – 16,417 bp) and right end (76,814 bp – 93,363 bp) of the contig (wraparounds are indicated by black arrows in Figure 2A). After trimming both ends, the dot plot forms one continuous straight line from the beginning to the end of the contig (Figure 2B). If wraparounds or terminal direct repeats are not present, this may be an indication of incompleteness of the linear or circular genome element, respectively.
As Borrelia plasmids may contain repetitive sequences or stretches of high similarity within a contig, the dot plots may show similarity lines that should not be trimmed. An example is shown in Figure 2C and 2D, which depict the dot plots of PBaeII lp28-8 (contig ctg.s2.000004F of the microbial assembly) before and after trimming. The lp28 plasmid family may contain the vls locus, including silent cassettes (repetitive sequences) adjacent to the expression site, which can be observed in the dot plot. Figure 2C shows the untrimmed contig with wraparounds at the left end (1 bp – 10,298 bp) and right end (23,448 bp – 46,893 bp) of the contig. After trimming (Figure 2D), the dot plot still shows the similarity pattern produced by the vls silent cassettes, which should not be trimmed.
In contrast to the linear genome elements, contigs of circular plasmids may contain terminal direct repeats. In this case, the beginning of the contig corresponds to its end, which shows that the contig should be circularized and that the plasmid is complete. Figure 2E shows the dot plot of PBaeII cp26 (contig tig00000016, HiCanu assembly) with terminal direct repeats at the ends (1 bp – 5,571 bp overlapping 27,106 bp – 32,677 bp). In this case, the contig should be trimmed at one end only (e.g. 1 bp – 5,571 bp, black arrow in Figure 2E). Figure 2F shows the dot plot of PBaeII cp26 after trimming.
Dot plots for all the contigs of the microbial, IPA and HiCanu assembler for sample PBaeII are shown in additional information (Figure S1 – Figure S3).
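The coordinate arithmetic behind these trimming examples can be checked with a short sketch. The coordinates are those given in the text for PBaeII; the function name is hypothetical, and the sketch assumes 1-based, inclusive, non-overlapping trim regions.

```python
def trimmed_length(contig_length, trim_regions):
    """Contig length (bp) after removing 1-based, inclusive (start, end) regions.

    Assumes the regions do not overlap each other.
    """
    removed = sum(end - start + 1 for start, end in trim_regions)
    return contig_length - removed


# Linear elements: wraparounds are removed at both ends.
lp54 = trimmed_length(93363, [(1, 16417), (76814, 93363)])
lp28_8 = trimmed_length(46893, [(1, 10298), (23448, 46893)])
# Circular element: only one copy of the terminal direct repeat is removed.
cp26 = trimmed_length(32677, [(1, 5571)])

print(lp54, lp28_8, cp26)  # 60396 13149 27106
```

These values agree with the trimmed lengths listed for the corresponding contigs in Table 1.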
Genome reconstruction from different assemblies
To reconstruct the genome, every contig of the microbial, IPA and HiCanu assemblies was analyzed in detail, including plasmid typing and identification of duplicates, misassemblies and genome elements present in multiple contigs. The summary of these analyses for the representative isolate B. bavariensis PBaeII is shown in Table 1.
Table 1: Single contig analyses of the microbial, IPA and HiCanu assembly of PBaeII.
assembler | contig | length (bp) | length trimmed (bp) | PFam32 | comment | genome element | consensus length (bp)
microbial | ctg.s1.000000F | 885916 | 885916 | - | concatenate to chromosome | chromosome | 905911
microbial | ctg.s2.000000F | 93363 | 60396 | lp54 | - | lp54 | 60397
microbial | ctg.s2.000001F | 55952 | 41416 | cp32-3 | concatenate with hybridSPAdes | cp32-3+lp25_incomplete | 46802
microbial | ctg.s2.000002F | 39059 | 19532 | - | concatenate to chromosome | - | -
microbial | ctg.s2.000003F | 43315 | 36788 | lp28-4+cp32-1 | - | lp28-4+cp32-1_incomplete | 36786
microbial | ctg.s2.000004F | 46893 | 13149 | lp28-8 | - | lp28-8 | 13162
microbial | ctg.s2.000005F | 34879 | 19448 | - | concatenate to chromosome | - | -
microbial | ctg.s2.000006F | 41548 | 24153 | lp28-3 | - | lp28-3 | 24153
microbial | ctg.s2.000007F | 50489 | 21395 | lp36 | - | lp36 | 21397
microbial | ctg.s2.000008F | 26711 | 15157 | - | - | lp28-7_incomplete | 15163
microbial | ctg.s2.000009F | 35548 | 17911 | - | - | lp17_incomplete | 17912
microbial | ctg.s2.10arro | 27107 | 27107 | cp26 | - | cp26 | 27107
microbial | ctg.s2.12arro | 21095 | 21095 | cp32-4 | - | cp32-4 | 21095
microbial | ctg.s2.14arro | 29941 | 29941 | cp32-5 | - | cp32-5 | 29944
IPA | ctg.000000F | 930217 | 905913 | - | - | chromosome | 905913
IPA | ctg.000001F | 70596 | 56122 | lp54 | - | lp54_incomplete | 56122
IPA | ctg.000002F | 39819 | 39819 | cp32-3+lp25 | - | cp32-3+lp25_incomplete | 39819
IPA | ctg.000003F | 28849 | 28849 | cp32-5 | - | cp32-5_incomplete | 28849
IPA | ctg.000004F | 29068 | 14702 | - | - | lp28-3_incomplete | 14702
IPA | ctg.000005F | 31957 | 18069 | lp36 | - | lp36_incomplete | 18069
IPA | ctg.000006F | 28109 | 28109 | lp28-4+cp32-1 | - | lp28-4+cp32-1_incomplete | 28109
IPA | ctg.000007F | 33304 | 13160 | lp28-8 | - | lp28-8 | 13160
IPA | ctg.000008F | 13940 | 8722 | - | duplicate and misassembly | - | -
IPA | ctg.000009F | 16360 | 16360 | - | - | lp17_incomplete | 16360
IPA | ctg.000010F | 14120 | 14120 | - | - | lp28-7_incomplete | 14120
IPA | ctg.11 | 27107 | 27107 | cp26 | - | cp26 | 27107
IPA | ctg.13 | 21099 | 21099 | cp32-4 | - | cp32-4 | 21099
HiCanu | tig00000001 | 930187 | 905912 | - | - | chromosome | 905912
HiCanu | tig00000003 | 21178 | 21178 | - | duplicate of chromosome | - | -
HiCanu | tig00000004 | 19699 | 19699 | - | duplicate of chromosome | - | -
HiCanu | tig00000005 | 15112 | 15112 | - | duplicate of chromosome | - | -
HiCanu | tig00000006 | 14495 | 14495 | - | duplicate of chromosome | - | -
HiCanu | tig00000008 | 23702 | 23702 | - | duplicate of chromosome | - | -
HiCanu | tig00000009 | 84366 | 60397 | lp54 | - | lp54 | 60397
HiCanu | tig00000010 | 80586 | 54929 | cp32-3+lp25 | - | cp32-3+lp25 | 54929
HiCanu | tig00000011 | 75901 | 50735 | lp28-4+cp32-1 | - | lp28-4+cp32-1 | 50735
HiCanu | tig00000012 | 50555 | 28286 | lp28-7 | - | lp28-7 | 28286
HiCanu | tig00000014 | 29941 | 29941 | cp32-5 | - | cp32-5 | 29941
HiCanu | tig00000015 | 25964 | 21829 | cp32-4 | - | cp32-4 | 21829
HiCanu | tig00000016 | 32677 | 27106 | cp26 | - | cp26 | 27106
HiCanu | tig00000018 | 46995 | 21394 | lp36 | - | lp36 | 21394
HiCanu | tig00000019 | 26898 | 13496 | lp17 | concatenate to lp17 | lp17 | 24961
HiCanu | tig00000020 | 9677 | 9677 | - | duplicate of lp17 | - | -
HiCanu | tig00000021 | 9099 | 9099 | - | duplicate of lp17 | - | -
HiCanu | tig00000022 | 39099 | 19609 | - | concatenate to lp17 | - | -
HiCanu | tig00000023 | 28698 | 14353 | - | concatenate to lp28-3 | lp28-3 | 24137
HiCanu | tig00000024 | 30200 | 16799 | lp28-3 | concatenate to lp28-3 | - | -
HiCanu | tig00000025 | 8619 | 8619 | - | duplicate of lp28-3 | - | -
The microbial assembly of PBaeII resulted in 14 contigs, five of which lacked PFam32 (Table 1). Three of these contigs (ctg.s1.000000F, ctg.s2.000002F and ctg.s2.000005F) were parts of the chromosome; they had overlapping sequences and were concatenated. The other two contigs (ctg.s2.000008F and ctg.s2.000009F) were incompletely assembled plasmids lacking the portion where PFam32 would be located. Because PFam32 was missing, the types of these incompletely assembled plasmids could not be determined directly and were only revealed by comparison with the other assembly results (ctg.s2.000008F: lp28-7_incomplete, ctg.s2.000009F: lp17_incomplete). Contig ctg.s2.000001F contained a cp32-3 type PFam32, but further analyses and comparison with the results of the other assemblers (IPA and HiCanu) showed that the plasmid was apparently a cp32-3+lp25 fusion plasmid that was incomplete and therefore contained only one PFam32 gene.
The IPA assembler generated 13 PBaeII contigs, five of which did not contain PFam32 (Table 1). One of these contigs represented the chromosome (ctg.000000F) and three were incomplete plasmids whose type was only revealed by comparison with the other assembly results (ctg.000004F: lp28-3_incomplete, ctg.000009F: lp17_incomplete, ctg.000010F: lp28-7_incomplete). Contig ctg.000008F was a misassembled duplicate and was deleted.
The HiCanu assembler produced 21 PBaeII contigs, 11 of which did not have PFam32 (Table 1). One of these contigs represented the 906 kb chromosome (tig00000001), and five contigs were duplicates of portions of the chromosome (tig00000003 (21 kb), tig00000004 (20 kb), tig00000005 (15 kb), tig00000006 (14 kb) and tig00000008 (24 kb), with identities of 99.87%, 99.94%, 99.93%, 99.81% and 99.78%, respectively) and were deleted. The plasmid lp17 was concatenated from two contigs with overlapping sequences (tig00000019 and tig00000022), only one of which contained PFam32 (tig00000019). Contigs tig00000020 (10 kb) and tig00000021 (9 kb) did not possess PFam32; they were duplicates of portions of lp17, with identities of 99.98% and 100%, respectively, and were deleted. Similarly, lp28-3 was concatenated from the overlapping contigs tig00000023 and tig00000024, of which only the latter carried PFam32. Contig tig00000025 (9 kb) was a duplicate of a portion of the concatenated lp28-3 (24 kb), with an identity of 99.71%, and was deleted.
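The concatenation of overlapping contigs described above can be pictured as a suffix/prefix merge. The sketch below is a toy illustration under strong assumptions: exact-match overlaps and contigs already in the same orientation. In this study, concatenation was part of manual curation, and the function name, the `min_overlap` parameter and the example strings are hypothetical.

```python
def merge_overlapping(left, right, min_overlap=3):
    """Join two sequences on their longest exact suffix/prefix overlap.

    Returns (merged_sequence, overlap_length).
    """
    max_k = min(len(left), len(right))
    for k in range(max_k, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:], k
    raise ValueError("no overlap of at least min_overlap bases found")


# Toy sequences for illustration only.
merged, overlap = merge_overlapping("ACGTTGCA", "TGCAGGTC")
print(merged, overlap)  # ACGTTGCAGGTC 4
```

The merged length equals the sum of the two input lengths minus the overlap, which is why a concatenated consensus is shorter than the sum of its trimmed contigs.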
The genome elements were analyzed for the number of intact CDS and for completeness. The latter is indicated by the presence of wraparound telomere sequences at both ends of linear replicons or terminal direct repeats in circular plasmids in untrimmed contigs, and by the presence of PFam32 or related partition genes. If a circular plasmid did not show terminal direct repeats, the sequence was extended and reanalyzed by dot plot generation (for details see Materials and Methods). Figure 3 shows the dot plots of contig ctg.s2.10 (cp26) of the microbial assembly of PBaeII, which was considered complete because terminal direct repeats were found after sequence extension.
Plasmids that were reconstructed by concatenation of overlapping contigs were reanalyzed for the presence of wraparound and terminal direct repeats and were given a final trim (see section “Contig trimming”).
It must be emphasized that the analysis steps of "contig trimming" and "genome reconstruction" are partially intertwined and are dependent on one another, since there is no fixed order for the analyses.
Generation of final consensus
Comparison of assembly results
Based on the previous analyses, the assembly results after genome reconstruction of the three representative isolates (PBaeII, PBes and 89B13, Figure 4) for each assembler (microbial, IPA and HiCanu) are shown in Table 2. Further detailed information can be found in the additional information (Table S2).
Table 2: Assembly results after genome reconstruction of the three representative isolates (PBaeII, PBes and 89B13) for each assembler (microbial, IPA and HiCanu) and overview of the final combined consensus. Completely reconstructed genome elements are colored green; incomplete, missing or probably incorrectly assembled genome elements are shown in red. Genome elements used for the final consensus are shown in bold.
isolate | genome element | microbial: length (bp) | microbial: # CDS (with protein) | IPA: length (bp) | IPA: # CDS (with protein) | HiCanu: length (bp) | HiCanu: # CDS (with protein) | final combined consensus: genome element | final combined consensus: assembler | final combined consensus: length (bp)
PBaeII (B. bavariensis) | chromosome | 905911 | 804 | 905913 | 800 | 905912 | 801 | chromosome | microbial | 905911
PBaeII (B. bavariensis) | lp54 | 60397 | 63 | 56122 | 63 | 60397 | 63 | lp54 | HiCanu | 60397
PBaeII (B. bavariensis) | cp32-3+lp25 | 46802 | 41 | 39819 | 40 | 54929 | 48 | cp32-3+lp25 | HiCanu | 54929
PBaeII (B. bavariensis) | lp28-4+cp32-1 | 36786 | 27 | 28109 | 20 | 50735 | 36 | lp28-4+cp32-1 | HiCanu | 50735
PBaeII (B. bavariensis) | lp28-8 | 13162 | 11 | 13160 | 10 | - | - | lp28-8 | microbial | 13162
PBaeII (B. bavariensis) | lp28-3 | 24153 | 13 | 14702 | 6 | 24137 | 11 | lp28-3 | microbial | 24153
PBaeII (B. bavariensis) | lp36 | 21397 | 14 | 18069 | 11 | 21394 | 13 | lp36 | microbial | 21397
PBaeII (B. bavariensis) | lp28-7 | 15163 | 19 | 14120 | 16 | 28286 | 31 | lp28-7 | HiCanu | 28286
PBaeII (B. bavariensis) | lp17 | 17912 | 16 | 16360 | 15 | 24961 | 22 | lp17 | HiCanu | 24961
PBaeII (B. bavariensis) | cp26 | 27107 | 26 | 27107 | 27 | 27106 | 25 | cp26 | IPA | 27107
PBaeII (B. bavariensis) | cp32-4 | 21095 | 17 | 21099 | 19 | 21829 | 16 | cp32-4 | IPA | 21099
PBaeII (B. bavariensis) | cp32-5 | 29944 | 37 | 28849 | 37 | 29941 | 38 | cp32-5 | HiCanu | 29941
PBes (B. garinii) | chromosome | 906103 | 806 | 899173 | 800 | 906104 | 805 | chromosome | microbial | 906103
PBes (B. garinii) | lp54 | 50750 | 55 | 38734 | 45 | 50745 | 66 | lp54 | HiCanu | 50745
PBes (B. garinii) | lp25 | 32676 | 23 | 32677 | 22 | 32677 | 22 | lp25 | microbial | 32676
PBes (B. garinii) | lp32-10 | 31997 | 25 | 31997 | 18 | 25909 | 14 | lp32-10 | microbial | 31997
PBes (B. garinii) | cp32-5 | 29494 | 41 | 29494 | 41 | 29494 | 40 | cp32-5 | microbial | 29494
PBes (B. garinii) | lp28-3 | 29334 | 16 | - | - | - | - | lp28-3 | microbial | 29334
PBes (B. garinii) | cp26 | 26995 | 26 | 26987 | 26 | 26996 | 25 | cp26 | microbial | 26995
PBes (B. garinii) | lp28-7 | 25469 | 30 | 13758 | 16 | 27059 | 29 | lp28-7 | HiCanu | 27059
PBes (B. garinii) | lp17 | 20612 | 18 | 20909 | 18 | 20909 | 19 | lp17 | HiCanu | 20909
PBes (B. garinii) | lp36 | 24692 | 18 | 16312 | 10 | 24697 | 15 | lp36 | microbial | 24692
PBes (B. garinii) | cp32-9 | 16382 | 22 | - | - | 30240 | 38 | cp32-9 | HiCanu | 30240
PBes (B. garinii) | cp9 | 9364 | 10 | 9361 | 11 | 9365 | 9 | cp9 | IPA | 9361
89B13 (B. valaisiana) | chromosome | 906612 | 816 | 900409 | 807 | 912938 | 819 | chromosome | HiCanu | 912938
89B13 (B. valaisiana) | lp54 | 54109 | 66 | 44514 | 55 | 54111 | 66 | lp54 | HiCanu | 54111
89B13 (B. valaisiana) | lp28-3 | 48639 | 28 | 48637 | 29 | 48639 | 28 | lp28-3 | IPA | 48637
89B13 (B. valaisiana) | lp28-8 | 30636 | 21 | 25414 | 20 | 30637 | 21 | lp28-8 | HiCanu | 30637
89B13 (B. valaisiana) | lp17 | 18198 | 14 | 18338 | 14 | 18336 | 14 | lp17 | HiCanu | 18336
89B13 (B. valaisiana) | cp32-6 | 29501 | 39 | 29501 | 38 | 29506 | 37 | cp32-6 | microbial | 29501
89B13 (B. valaisiana) | lp32-7 | 36351 | 24 | 39355 | 22 | 32639 | 20 | lp32-7 | IPA | 39355
89B13 (B. valaisiana) | cp26 | 26683 | 27 | 26682 | 26 | 26683 | 26 | cp26 | microbial | 26683
89B13 (B. valaisiana) | lp25 | 35078 | 28 | 24904 | 25 | 25418 | 24 | lp25 | microbial_circulomics | 35078
89B13 (B. valaisiana) | lp36 | 18970 | 18 | 26297 | 19 | 26298 | 21 | lp36 | HiCanu | 26298
89B13 (B. valaisiana) | cp9 | 9474 | 9 | 9474 | 9 | 9474 | 9 | cp9 | HiCanu | 9474
We conclude that B. bavariensis PBaeII contains the following genome elements (n=12): chromosome, 8 linear plasmids (lp54, cp32-3+lp25, lp28-4+cp32-1, lp28-8, lp28-3, lp36, lp28-7, lp17) and 3 circular plasmids (cp26, cp32-4, cp32-5) (Table 2, Figure 4). The microbial assembler completely reconstructed 8 of the 12 genome elements and IPA only 4, while HiCanu reached the maximum of 9 complete genome elements (Table 3). The HiCanu assembly also contained the highest number of intact genes (1104), followed by microbial and IPA with 1088 and 1064 intact genes, respectively (Table 3). The chromosome and the plasmid cp26 were the only genome elements that were completely assembled by all three assemblers. Plasmids lp54, lp36 and cp32-5 were completely reconstructed only by the microbial and HiCanu assemblers and were incomplete in the IPA assembly. Two plasmids (lp28-8 and cp32-4) were completely reconstructed by the microbial and IPA assemblers, while HiCanu failed to generate the complete cp32-4 and did not assemble lp28-8 at all. The remaining five plasmids were each completed by only one of the assemblers (HiCanu: cp32-3+lp25, lp28-4+cp32-1, lp28-7 and lp17; microbial: lp28-3).
Table 3: Assembler performance comparison with regard to the completeness of assembled genome elements.
species | strain | # genome elements | # CDS (with proteins) | microbial: # complete | microbial: # incomplete/missing | microbial: # CDS (with proteins) | IPA: # complete | IPA: # incomplete/missing | IPA: # CDS (with proteins) | HiCanu: # complete | HiCanu: # incomplete/missing | HiCanu: # CDS (with proteins)
B. bavariensis | PBaeII | 12 | 1126 | 8 | 4 | 1088 | 4 | 8 | 1064 | 9 | 3 | 1104
B. garinii | PBes | 12 | 1118 | 9 | 3 | 1090 | 6 | 6 | 1007 | 10 | 2 | 1082
B. valaisiana | 89B13 | 11 | 1095 | 7 | 4 | 1090 | 7 | 4 | 1064 | 9 | 2 | 1085
total | | 35 | 3339 | 24 | 11 | 3268 | 17 | 18 | 3135 | 28 | 7 | 3271
Borrelia garinii PBes also has 12 genome elements: chromosome, 7 linear plasmids (lp54, lp25, lp32-10, lp28-3, lp28-7, lp17, lp36) and 4 circular plasmids (cp26, cp32-5, cp32-9, cp9) (Table 2, Figure 4). The plasmid lp32-10 carries the PFam32 gene of cp32-10 (a circular plasmid), but has wraparounds at both ends (additional information, Figure S4) that include telomere sequences. The latter indicates a linear structure, and the plasmid was therefore named lp32-10 instead of cp32-10. Plasmids that possess a cp32 type PFam32 gene but have a linear structure have previously been described, including lp32-10 (17). Out of 12 genome elements, the microbial, IPA and HiCanu assemblers completely reconstructed 9, 6 and 10 genome elements, respectively (Table 3). The microbial and HiCanu assemblies contained similar numbers of intact CDS (1090 and 1082, respectively), while the IPA consensus showed only 1007 CDS with proteins (Table 3). The 4 plasmids lp25, cp32-5, cp26 and cp9 were fully assembled by all of the assemblers. Five genome elements were completed by two assemblers: microbial and HiCanu completely assembled the chromosome, lp54 and lp36; microbial and IPA completed lp32-10; IPA and HiCanu successfully assembled lp17. The remaining 3 plasmids were completely assembled by only one of the assemblers (microbial: lp28-3; HiCanu: lp28-7 and cp32-9), and the plasmid cp32-9 was completely missing from the IPA assembly. Interestingly, lp28-3 was completely assembled by the microbial assembler but was missing from the IPA and HiCanu assemblies. Illumina read mapping confirmed the presence of plasmid lp28-3 but showed that its average coverage (9.6) was quite low in comparison to the other genome elements (average coverage ranging from 62.0 to 640.3).
Borrelia valaisiana 89B13 contains the following 11 genome elements: chromosome, 7 linear plasmids (lp54, lp28-3, lp28-8, lp17, lp32-7, lp25, lp36) and 3 circular plasmids (cp26, cp32-6, cp9) (Table 2, Figure 4). The microbial and IPA assemblers each completely assembled 7 of 11 genome elements, and HiCanu reached the maximum of 9 (Table 3). The microbial and HiCanu assemblies contained similarly high numbers of intact genes (1090 and 1085, respectively), while the IPA assembly contained 1064 CDS with proteins (Table 3). All of the assemblers successfully completed 4 plasmids (lp28-3, cp32-6, cp26 and cp9). Another 4 plasmids were completed by two assemblers: microbial and HiCanu fully assembled lp54 and lp28-8; IPA and HiCanu completed lp17 and lp36. The chromosome and the plasmids lp32-7 (dot plot of the linear plasmid containing a PFam32 of cp32-7: additional information, Figure S5) and lp25 were each completely assembled by only one assembler (HiCanu, IPA and microbial, respectively). The plasmid lp25 was not completely assembled from sequence data based on DNA extracted via the Maxwell method (see Materials and Methods for details on DNA extraction), but a combination of the Circulomics Nanobind DNA extract and the microbial assembler led to a complete reconstruction of the plasmid. This plasmid contains a very long inverted terminal repeat with a central unique sequence, which may be challenging to assemble due to sequence similarity between the two halves of the inverted repeat (dot plot: additional information, Figure S6).
In summary, none of the assemblers alone yielded complete genome sequences for isolates PBaeII, PBes and 89B13. HiCanu generated the highest number of completely assembled genome elements, with 28 complete genome elements out of 35 in total (PBaeII n=12, PBes n=12, 89B13 n=11; Table 3), followed by the microbial assembler with 24 complete genome elements. The lowest number of fully reconstructed genome elements was obtained with the IPA assembler, which fully assembled only about half of the genome elements (17 complete, 18 incomplete). Similar results were observed for the number of intact assembled genes. HiCanu assemblies contained the highest number of intact genes, 3271 out of 3339 in total (PBaeII n=1126, PBes n=1118, 89B13 n=1095; Table 3), followed by microbial and IPA (3268 and 3135 intact genes, respectively).
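The totals quoted in this summary can be reproduced from the per-isolate counts in Table 3. The sketch below is only a bookkeeping aid; the dictionary layout and function name are hypothetical, and the numbers are copied from Table 3.

```python
# (complete, incomplete_or_missing, intact_cds) per assembler, as listed in Table 3.
TABLE3 = {
    "PBaeII": {"microbial": (8, 4, 1088), "IPA": (4, 8, 1064), "HiCanu": (9, 3, 1104)},
    "PBes":   {"microbial": (9, 3, 1090), "IPA": (6, 6, 1007), "HiCanu": (10, 2, 1082)},
    "89B13":  {"microbial": (7, 4, 1090), "IPA": (7, 4, 1064), "HiCanu": (9, 2, 1085)},
}


def assembler_totals(table, assembler):
    """Sum complete, incomplete/missing and intact-CDS counts over all isolates."""
    rows = [per_isolate[assembler] for per_isolate in table.values()]
    return tuple(sum(column) for column in zip(*rows))


for name in ("microbial", "IPA", "HiCanu"):
    print(name, assembler_totals(TABLE3, name))
```

For each assembler, the first two values add up to the 35 genome elements across the three isolates.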
Generation of a final consensus by combining data
We manually compared the assemblies of each genome element for completeness and correctness and combined the data to generate a final consensus for all genome elements. If more than one assembler completely reconstructed a genome element, we used the version with the highest number of intact genes, as we considered it most likely to be correct. For example, the chromosome of PBaeII was completely assembled by the microbial, IPA and HiCanu assemblers with lengths of 905911 bp, 905913 bp and 905912 bp and with 804, 800 and 801 intact genes (CDS with proteins), respectively (Table 2). As the microbial contig showed the highest number of intact genes, we used this contig in the final combined consensus. The plasmid cp32-4 of PBaeII was completely reconstructed by the microbial and IPA assemblers, but the HiCanu contig had atypical terminal repeats (Table 2). We observed incomplete wraparounds remaining in the trimmed HiCanu contig (additional information, Figure S3 L) that were not present in the cp32-4 contigs assembled by the microbial and IPA assemblers (additional information, Figure S1 M and Figure S2 M) and therefore considered this a misassembly. The misassembled terminal repeats increased the contig length (PBaeII cp32-4 HiCanu: 21829 bp) compared to the microbial and IPA contigs (21095 bp and 21099 bp, respectively). Although the HiCanu assembler reconstructed cp32-4 with the greatest contig length, the IPA contig contained the highest number of intact genes (IPA: 19, microbial: 17, HiCanu: 16) and was used for the final consensus, as we consider it more likely to be correct. The final combined consensus of PBaeII (12 genome elements) includes 6 genome elements assembled by the HiCanu assembler (lp54, cp32-3+lp25, lp28-4+cp32-1, lp28-7, lp17, cp32-5), 4 microbial contigs (chromosome, lp28-8, lp28-3, lp36) and 2 IPA contigs (cp26, cp32-4) (Table 2).
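The selection rule stated above (among completely reconstructed candidates, prefer the one with the most intact genes) can be written down as a small sketch. It is an illustration of that rule only, not the manual curation itself; the class and function names are hypothetical. The text does not specify how ties between complete candidates were resolved, so the sketch simply returns the first candidate with the maximum count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    assembler: str
    length_bp: int
    intact_cds: int
    complete: bool  # judged complete and free of apparent misassembly


def pick_for_consensus(candidates):
    """Return the complete candidate with the most intact CDS, or None."""
    complete = [c for c in candidates if c.complete]
    if not complete:
        return None
    return max(complete, key=lambda c: c.intact_cds)


# Values for PBaeII taken from Table 2 and the text above.
chromosome = [
    Candidate("microbial", 905911, 804, True),
    Candidate("IPA", 905913, 800, True),
    Candidate("HiCanu", 905912, 801, True),
]
cp32_4 = [
    Candidate("microbial", 21095, 17, True),
    Candidate("IPA", 21099, 19, True),
    Candidate("HiCanu", 21829, 16, False),  # atypical terminal repeats
]

print(pick_for_consensus(chromosome).assembler)  # microbial
print(pick_for_consensus(cp32_4).assembler)      # IPA
```

Completeness is treated here as an input flag because, in this study, it was established beforehand from dot plots and PFam32 typing.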
By the same strategy, we generated final consensus genomes for PBes and 89B13. The final consensus of PBes (12 genome elements) is a combination of 7 microbial contigs, 4 HiCanu contigs and 1 IPA contig, and the final consensus of 89B13 (11 genome elements) consists of 6 HiCanu contigs, 3 microbial contigs (one of them based on the Circulomics Nanobind DNA extract) and 2 IPA contigs. With regard to future genome comparisons, the core genome (chromosome, lp54 and cp26) was reoriented to match the type strain B31 of B. burgdorferi s.s. To confirm that the genome elements were reconstructed and concatenated correctly, we mapped the PacBio HiFi reads to the final combined consensus and checked for even coverage throughout each genome element. The mapping graphs and mapping statistics of PBaeII are shown in the additional information (Table S3, Table S4 and Figure S7).
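One simple way to screen a per-base depth profile for uneven coverage is sketched below. This is an assumption-laden illustration rather than the procedure used in this study, where mapping graphs were inspected; the function name, the window size and the 0.5 threshold are hypothetical choices.

```python
from statistics import median


def low_coverage_windows(depth, window=5, min_fraction=0.5):
    """Return (start, end) index ranges of windows whose mean depth is
    below min_fraction of the median window mean.

    depth: per-base read depth along one genome element (non-empty list).
    """
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    reference = median(m for _, _, m in means)
    return [(s, e) for s, e, m in means if m < min_fraction * reference]


# Toy depth profile with a dip in the middle (illustrative values only).
toy_depth = [100] * 10 + [10] * 5 + [100] * 10
print(low_coverage_windows(toy_depth))  # [(10, 15)]
```

A window flagged at a junction between concatenated contigs would be a reason to re-examine that join.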