D. solani genomic assemblies
The newly sequenced genomes of 8 D. solani strains (Table 1) were assembled into 1 - 7 contigs with no N bases (Table 2) from the PacBio reads with the use of the previously described genome assembly pipeline [33]. This method profits solely from PacBio RSII raw reads that are at first filtered from adapters with the use of SMRT Analysis v. 2.3 (Pacific Biosciences, USA) and then corrected, trimmed and assembled with the use of Canu version 1.5 [103]. Getting consensus and variant calling was achieved with Quiver (SMRT Analysis v. 2.3) [104] and final functional annotation was conducted with Prokka version 1.12 [67]. The size of these genomes ranged from 4 882 124 bp to 4 934 537 bp, in the case of IFB0487 and IFB0421 D. solani strains, respectively (Table 2). The largest contig of the acquired assemblies varied in size from 2 394 283 bp to 4 934 537 bp regarding either IFB0311 or IFB0421 (Table 2). N50, which refers to the minimum length of contigs in which half of the bases of the assembly are covered, ranged from755 734 to 4 934 537 bp (for IFB0695 or IFB0421; Table 2). L50, describing the number of contigs that comprise half of the genome size, spanned from 1 to 2 (Table 2). The calculated GC content falls within the range of 56.23 to 56.25 (Table 2). None of the contigs from de novo assembled D. solani genomes has been assigned to the sequences of plasmid origin as computed with the use of PlasmidFinder [66]. According to Prokka-based [67] annotation, the newly sequenced genomes of D. solani strains contained in total from 4 304 to 4 608 genes (in the case of IFB0212 and IFB0417, respectively; Table 1). The number of protein-coding genes varied from 4 143 (IFB0212) to 4 446 (IFB0417), while the quantities of the annotated rRNA and tRNA amounted to 18 - 22 and 72 - 75, respectively (Table 1).
Genomic contents and assembly statistics for the herein reported newly-sequenced D. solani genomes have been juxtaposed to these attributed to 14 reference D. solani sequences (versions of the genomes available in the Genbank database at a time of conducting research have been included; Table 2). The numbers of scaffolds building up the included reference genomes are considerably higher (1 - 38) than the quantities of scaffolds present in de novo sequenced ones (Table 2). Also, the vast majority of reference genomic sequences contain N bases, reaching even the number of 27 548 (GBBC 2040). Other quality metrics (Table 2) of reference assemblies like the largest contig (> 570 255 bp), N50 (> 305 078 bp) or L50 (< 7) are also in favour of the genome assembly pipeline used for the newly sequenced genomes. Moreover, it is worth noting a significantly higher variation (56.13 - 56.34) in the %GC among the reference genomes than de novo sequenced ones (Table 2). Interestingly, as listed by Golanowska et al. [33], both the predicted number of genes and protein-coding genes fall within narrower ranges, being 4 273 - 4 536 and 4 138 - 4 303, respectively, in D. solani reference genomes in contrast to the herein reported sequences.
On the other hand, the stated quantities of tRNA (Table 1) were often lower in the reference genomes, even though the range from 60 to 75 was broader [33]. Regarding rRNA, solely 1 to 4 such genes were annotated for the included versions of the reference genomes of PPO 9019, RNS 05.1.2A, RNS 07.7.3B, IPO 2222, GBBC 2040, MK10, MK16, PPO 9134, IFB0158 and IFB0221 strains [33], in contrast to 18 - 22 detected in the herein reported de novo sequenced genomes (Table 1). Taking into consideration that genes coding for 5S, 16S and 23S rRNAs are typically organized into operons encountered in multiple copies, i.e. 1 - 14 [68], within the bacterial chromosome, such a low number of annotated rRNAs disagrees with the current biological knowledge. Thus, we postulate that the number of the annotated rRNA-encoding genes might be regarded as an informative marker of the achieved quality of de novo assembly of D. solani genomes in view of the fact that highly similar sequences of rRNAs were previously reported to potentially disrupt, due to the occurrence of both highly conserved and variable regions, the assembling process that is typically based on de Bruijn graphs [69]. It should be noted that the genomes possessing a low number of rRNA-coding genes have been assembled from the data generated by Illumina or 454 pyrosequencing platforms with the use of assemblers handling short length reads [33]. For example, the IPO 2222 genome available currently (13.02.20) in the GenBank database was reassembled from both PacBio and Illumina reads and harbours 22 rRNA-encoding genes in contrast to the number of three annotated for the here discussed version [33].
High potency of herein applied genome assembly pipeline is further supported by the fact that out of 8 de novo sequenced genomes, 4 have been closed to a full chromosome and the remaining ones contained just 2 - 7 scaffolds (Table 2). It is worth underlining that solely PacBio RSII reads have been used during the assembly process, by these means lowering the required financial effort associated with additional acquisition of MiSeq Illumina reads. Furthermore, all the herein utilized software is open-source, contrary to for instance CLC Genomics Workbench v5 utilized by Pedron et al. [47] for assembling the Illumina HiSeq 2000 reads of D. solani RNS 08.23.3.1A strain into 42 contigs with N50 of 299 659. Interestingly, the sequence of RNS 08.23.3.1A was later on improved by Khayi et al., (2014) [60] into a fully closed chromosome containing the N bases by application of scaffolding, home-made scripts in addition to Sanger sequencing of the PCR amplicons. The herein utilized approach is less laborious and does not require significant bioinformatic skills.
One should notice significant progress in the assembling of D. solani genomic reads that have taken place in the recent years. For instance, a draft genome of D. solani IFB0099 reported before [70] consisted of 97 contigs. This sequence was assembled with Celera from 454 pyrosequencing and PacBio (SMRT Analysis version 2.3) reads after trimming with StreamingTrim software [71]. The resulting assembly contained 5 094 121 bp (%GC 56.40), exceeding by 161 201 bp the improved closed circular genome of IFB0099 (%GC 56.24) obtained with the use of the developed genome assembly pipeline. In spite of the same annotation software utilized, the total number of protein-coding genes, i.e. 4 365 [70] vs. 4 164 [33], in addition to the number of tRNA- or rRNA-encoding sequences, i.e. 129 [70] vs. 97 [33], varied considerably between the above-mentioned versions, which points to the crucial importance of obtaining high quality genomic assemblies prior to undertaking any comparative genomic analyses.
An alternative approach to assembling of D. solani genomes was undertaken by Garlant et al. [31]. The reads for D s0432-1 strain were acquired with Roche 454 GS Flx Titanium chemistry and assembled by using Newbler that generated 98 contigs. Gaps in this assembly were filled in by sequencing PCR or linker-PCR products using an ABI 3730 capillary sequencer. Final gap closing involved the Gap4 program (Staden package). This laborious and costly approach yielded a genome consisting finally of 4 contigs, which discloses obvious benefits of the herein utilized genome assembly pipeline. Another strategy was chosen by Pritchard et al. [44] that assembled 4 D. solani genomes into 23 - 224 scaffolds by relying on 454 pyrosequencing (3 genomes - MK10, MK16, IPO 2222) or IlluminaGAIIx (1 genome - GBBC 2040) technologies. The genome of IPO 2222 was assembled de novo with the use of 454 Life Sciences Newbler v2.5.3. In the case of MK10 and MK16, meta-assembly of Newbler de novo and reference-guided assemblies to the IPO 2222 reference genome were performed. Regarding a GBBC 2040 genome, for which solely Illumina reads have been acquired, CLC bio assembly module was implemented for mapping the reads to the IPO 2222 reference genome [44]. The N50 values reported for the first released versions of the above-mentioned genomes were much lower (from 40 901 to 485 700, for GBBC 2040 and MK16, respectively [44]) than in the case of the revised versions included here as references (Table 2). The assemblies reported by Pritchard et al. [44] have been further improved since their release, reaching in the herein utilized versions 1 - 3 contigs (Table 2), though the number of the incorporated N bases (2 100 - 27 548; Table 2) is still quite large. It is worth noticing that the previous assemblies differ significantly from the following ones regarding for instance the genome length. In the work of Pritchard et al. [44], IPO 2222 was reported to possess 4 857 348 bp, the version of this genome included in the here-presented research exhibits 4 867 258 bp, while the length of the one that might be currently (13.02.20) downloaded from the Genbank database equals 4 919 833 bp. This further proves the importance of obtaining high quality assemblies before conducting any genomic comparisons.
One of the reasons behind undertaking search for plasmids in draft genomic sequences of D. solani is that the occurrence of such extrachromosomal molecules might be an explanation for the contig-based status of the assembly. However, our data confirmed that up to the present day solely one plasmid sequence has been described in the D. solani species, namely the one harboured by PPO 9019 strain isolated from hyacinth [61]. Notably, this extrachromosomal genetic sequence shared complete identity (100%) with the plasmid of Burkholderia ambifaria AMMD (CP000443.1) [32]. In spite of sharing a common plasmid, there is another argument pointing to the association between D. solani and B. ambifaria, as these two species exhibited notable similarities in the O-polysaccharides (OPS) within their lipopolysaccharide (LPS) structures [72, 73]. In more detail, 6-deoxy-D-altrose that was found in the OPS of D. solani and D. dadantii [72] was up to now reported only as a constituent of a disaccharide repeating unit →4)-α-d-Rhap-(1→3)-β-d-6dAltp-(1→ in the OPS of B. ambifaria type strain LMG 19182 [73]. Interestingly, B. ambifaria was noted to possess two diverse OPS molecules, which might be related with the adaptation of these strains to various environmental niches such as plant leaves, roots and rhizospheres, forest soil or even sputum or respiratory tract of patients suffering from cystic fibrosis [74]. Specifically, the B. ambifaria LMG 19182 strain was isolated from the rhizosphere of pea in Wisconsin (USA) in 1985 [75]. As suggested previously, sugar composition of O-antigen follows the availability of monosaccharide substrates [76], therefore the occurrence of D-altrose in the OPS of plant associated isolates of Dickeya spp. and B. ambifaria, together with the previous proofs for horizontal gene transfer (HGT) resulting from plasmid transmission between these species [32], gives a clue about their coexistence in natural environment.
Structural similarities between D. solani genomes
Large scale BLAST comparison of de novo sequenced and reference D. solani genomes, computed with the use of BLAST Ring Image Generator (BRIG) [77], revealed an exceptionally high level of homogeneity among the studied 22 genomes (Fig. 1). The de novo sequenced genomes of D. solani: IFB0158, IFB0167, IFB0212, IFB0221, IFB0223, IFB0231, IFB0311, IFB0417, IFB0421, IFB0487, IFB0695 (Table 1), in addition to RNS 08.23.3.1A and D s0432-1, possess a nearly identical genomic structure to that of IFB0099 (Fig. 1), regardless of the sequencing method used or the closed/draft status of the genome assembly. A notable absence of certain genomic regions is a repeating feature in the case of other D. solani genomes, namely IPO 2222, GBBC 2040, MK10, MK16, PPO 9019, PPO 9134 (Fig. 1). Some but not all of these sites are likewise not present in the genome of RNS 07.7.3B (Fig. 1). Undoubtedly, the genome of RNS 05.1.2A stands out from the pool of the tested sequences, not only taking into consideration the number, but also the size of the missing regions. It is also worth considering that the genomes of IFB0487 and IFB0695 lack quite large parts of DNA sequences present in the reference IFB0099 genome (Fig. 1). Putatively, it might be associated with the draft character of these genomic assemblies as the number of contigs is reflected in the number of computed synteny blocks. However, the presence of polymorphic sites in these regions cannot be excluded for sure due to the fact that in many cases incompleteness of a bacterial genomic assembly tends to result from the occurrence of repetitive sequences [78].
Whole genome comparisons have been computed for D. solani chromosomal sequences previously, but former research included a significantly lesser number of genomes and took advantage of other bioinformatics software. Pedron et al. [47] juxtaposed the genome of D. solani 3337 to the one of D. dadantii 3937 with the use of Mauve. In spite of a high level of synteny between these genomes, there were noted two insertions and a notable inversion between two rrs ribosomal RNA-encoding operons [47]. Interestingly, des Essarts et al. [59] spotted two syntenic disruptions and a notable evidence for horizontal gene transfer in the genome of D. solani 3337 in contrast to D. dianthicola RNS04.9. The scale of study has been enlarged in the work of Garlant et al. [31], in which the genomic sequence of D. solani D s0432-1 was compared with a pool of representative genomes of other Dickeya spp. i.e. D. dadantii 3937, D. zeae Ech586, D. paradisiaca Ech703 and D. chrysanthemi Ech1591. The lowest number of rearrangements was observed between D. solani D s0432-1 and D. dadantii 3937 [31]. Subsequently, Khayi et al. [32] reported that the genomes of two D. solani strains, namely 3337 and 0512, exhibit significant syntenic conservation accordingly to Mauve-based visualization. Likewise, Golanowska et al. [33] incorporated the same tool to prove the lack of significant chromosomal rearrangements in the closed genomes of 5 D. solani strains. In more detail, the presence of 3 syntenic blocks was revealed in this work with two inversions regarding IFB0099, IFB0223 and RNS 08.23.3.1A strains, contrary to GBBC 2040 and IPO 2222 [33].
Basing genome comparisons on ANI values allows to avoid the bias linked with sequence selections and errors [79]. As this way of genomic distance determination takes advantage of whole-sequence information at high resolution of single nucleotides, three methods of pairwise genome comparisons, i.e. BLAST+ calculation of ANI (ANIb), MUMmer calculation of ANI (ANIm) and computation of the correlation indexes of the tetra-nucleotide signatures (Tetra) were utilized for proving an extraordinarily high similarity level between the analysed 22 D. solani genomes.
In more detail, the vast majority of ANIb values exceeded 99.96, reaching even 100.00 for over a dozen of juxtapositions (Supplementary Table 1). Similarly, in the case of ANIm, 99.98 was often reached, though no 100.00 values were acquired. It is also worth noticing that a high percentage of all the compared D. solani genomes have been successfully aligned (91.57 - 99.79 for ANIb and 93.26 - 100.00 for ANIm; Supplementary Tables 1 - 2). In addition, 1.0 correlation of the tetra-nucleotide signatures was likewise not rarely exhibited by the studied sequences. Regarding the observed differences, the genome of RNS 05.1.2A strain diverged to the greatest extent from the other sequences studied (Supplementary Tables 1 - 3). ANIb values acquired for comparisons including this genome ranged from 98.55 (vs. PPO 9019) to 98.68 (vs. either RNS 07.7.3B or RNS 08.23.3.1A) (Supplementary Table 1), ANIm varied from 98.71 (towards PPO 9019) to 98.82 (in contrast to RNS 07.7.3B) (Supplementary Table 2), while tetra nucleotide correlation coefficients differed from 0.99976 (vs. either IFB0417 or IFB0487) to 0.99987 (in comparison to MK16) (Supplementary Table 3). ANIb (98.55 - 99.93) and ANIm (98.71 - 99.92) calculations also pointed to PPO 9019 and PPO 9134 as the genomes slightly standing out from the others tested (Supplementary Tables 1 - 2), though this deviation was not supported by the correlation coefficients-based method (Supplementary Table 3).
All the herein computed ANI values for the pairwise comparisons between D. solani genomes exceeded the 95-96 % species delineation threshold that corresponds to 70% DDH [80]. Previously, Garlant et al. [31] juxtaposed the genome of D. solani D s0432-1 to several other members of the Dickeya genus, i.e. D. dadantii 3937, D. zeae Ech586, D. paradisiaca Ech703 and D. chrysanthemi Ech1591, with the resultant ANI values of 94, 85, 79 and 86%, respectively [31]. The work of des Essarts et al. [59] further supported the closest relationship between D. solani (3337 strain) and D. dadantii (3937 strain) with ANI and DDH values of 94% and 55%. Even though the herein investigated D. solani genomes turned out to be highly homogenous basing on ANI calculations as it was suggested previously [32, 60], the computed values did not always exceed the 99.9 threshold demonstrated before [32, 58, 81]. Our outcomes are in agreement with the study of Golanowska et al. [33], in which the ANI values determined for pairwise comparisons among 14 D. solani genomes ranged from 98.60 to 99.99%. It is worth keeping in mind that often various software has been utilized for ANI calculations e.g. nucmer with script calculate_ani.py [81, 82], ChunLab's online Average Nucleotide Identity Calculator (EzBioCloud) [33, 83] or JSpecies [31, 47], which might be the cause of slight discrepancies in the reported genome-to-genome deviations between D. solani strains.
The fact that the genome of D. solani RNS 05.1.2A clearly stands out from the other analysed is putatively associated with the abundance of unique genes as further investigated in the following pangenome-related section and suggested in the former studies [32, 33]. Besides, modest dissimilarities in comparison to the included genomic pool were noted for RNS 07.7.3B, PPO 9019 and PPO 9134 sequences, which were also reported to show discrepancies in SNPs/InDels in contrast to other D. solani genomes [58]. Khayi et al. [32] postulated HGT from a closely related habitant of the same ecological niche, namely D. dianthicola, as a possible explanation for this phenomenon. The fact that both PPO 9019 and PPO 9134 strains were acquired from hyacinths and stood out solely in the ANI calculations, in contrast to the computed correlation indexes of the tetra-nucleotide signatures, might be a point in favour of the HGT-based hypothesis.
Further insight into the pangenome composition of D. solani
The first glimpse into the structure of D. solani pangenome was provided by Golanowska et al. [33]. In that study, Mauve-based calculation on 14 (5 closed and 9 draft) D. solani genomes showed that 74.8% (3 756 genes) of the gene pool grouped into the core, 11.5% (574 genes) to the accessory and 13.7% (690 genes) to the unique pangenome fraction [33]. In the current work, we significantly enlarged the number of the included D. solani genomes to 22 and applied another software named Bacterial Pan Genome Analysis (BPGA version 1.3) [84] for handling the computations. The obtained data showed that contribution of the core genome increased to 84.7 (3 726 genes) while the accessory and unique pangenome fractions shrank to either 7.2% (318 genes) or 8.1% (356 genes) of the whole D. solani pangenome (4 400 genes) as shown in Fig. 2A and Table 3. A reduction in the unique genome fraction was expected due to the higher number of genomes considered. Similarly, the higher quality of genomes used here (as complete genomes) could likely have produced a better assignment of orthologs than in the previous study. However, we cannot a priori exclude a possibility that the use of different software for computing the pangenome between the two studies could have influenced the results.
Details on the contribution of specific D. solani genomes to the pangenome of this species are depicted in Table 3. The number of accessory genes detected in specific D. solani genomes ranged from 113 (RNS 05.1.2A) to 271 (IPO 2222). Regarding unique genes, there were nine strains deprived of such features (IFB0099, IFB0167, IFB0212, IFB0221, IFB0223, IFB0231, IFB0311, MK16, RNS 07.7.3B), in contrast to RNS 05.1.2A possessing even 286 unique genes (Table 3). 13 of the D. solani strains included, i.e. IFB0099, IFB0158, IFB0167, IFB0212, IFB0221, IFB0231, IPO 2222, MK16, D s0432-1, PPO 9019, PPO 9134, RNS 07.7.3B and RNS 08.23.3.1A, did not contain any genes stated as absent, contrary to RNS 05.1.2A strain, which lacked a huge number of 107 genes present in the other genomes analysed (Table 3). Construction and extrapolation of the core- and pan-genome plots (Fig. 2B), calculated with the use of the exponential curve fit model and power-law regression model, respectively, revealed that with the b parameter equalling 0.0256574, the pangenome of D. solani has been almost closed. In other words, the unique gene pool should no longer expand by addition of newly sequenced D. solani genomes. Such a feature is regarded as characteristic of the taxa whose representatives either occupy isolated ecological niches or lack efficient mechanisms for gene exchange and recombination [85]. Therefore, D. solani joined the group of real specialized pathogens with closed pangenomes [86] including e.g. Bacillus anthracis [62], Mycobacterium tuberculosis [87], Clostridium difficile [88], Yersinia pestis [89] or Staphylococcus aureus [90].
In contrast to D. solani, another member of the Pectobacteriaceae family, namely Pectobacterium parmentieri, was reported to possess an open pangenome [91]. In that work, computation with the use of Roary on 15 P. parmentieri genomes disclosed a notably lesser contribution of core (52.8%) and higher of accessory (20.9%) and unique (26.3%) pangenome fractions in comparison to D. solani. The authors associated the overrepresentation of the dispensable pangenome part with high genomic plasticity of P. parmentieri [91], suggesting a less clonal population structure with respect to that of D. solani [19, 21, 23, 59]. Thus, the closely related P. parmentieri species adhered to the categories of non-specialized species or opportunistic pathogens that often exhibit open pangenomes [86, 92] along with, for instance, Escherichia coli [93], Streptococcus agalactiae [94], Listeria monocytogenes [95], Legionella pneumophila [96] or Salmonella typhi [97]. One should bear in mind that the closed/open pangenome status of a species might have been affected by the number and representativeness of the genomes selected for the analysis [92]. Besides, not without importance is the software utilized for performing the pangenome calculations.
Functional assignment of the D. solani pangenome fractions
The outcomes of the attribution of the Clusters of Orthologous Groups (COGs) functional categories to the core, accessory and unique gene pools of 22 D. solani strains are depicted in Fig. 3. It might be noted that the core pangenome fraction is most abundantly represented in the general function prediction only (R), followed by amino acid transport and metabolism (E), carbohydrate transport and metabolism (G), transcription (K) and inorganic ion transport and metabolism (P) functional groups (Fig. 3). Regarding the accessory pangenome section, after the genes of general function prediction only (R), the ones involved in transcription (K) were highly represented, next these of function unknown (S), engaged in energy production and conversion (C) in addition to replication, recombination and repair (L), however all these overrepresentations were not statistically significant (Fig. 3). In the case of unique genes, they have been assigned most frequently to general function prediction only (R), function unknown (S), transcription (K), replication, recombination and repair (L) and amino acid transport and metabolism (E) COG categories (Fig. 3). Among the above-mentioned functional groups, just overrepresentations of unique COGs within the function unknown (S) and amino acid transport and metabolism (E) categories were not statistically significant. The groups in which both accessory and unique pangenome fractions dominated in contrast to the core section included general function prediction only (R), function unknown (S), transcription (K), replication, recombination and repair (L) and defence mechanisms (V) classifications (Fig. 3). It needs to be considered that the number of attributed core COGs was 3300, while the number of accessory and unique COGs equalled 157 and 120, respectively. The largest number of the assigned unique COGs derived from the genome of RNS 05.1.2A (106) followed by IFB0487 (8), IFB0417 (4) IFB0421 (1) and MK10 (1) (Supplementary Table 4). Among the functional roles of the assigned unique D. solani COGs, it is worth noticing, for instance, the genes encoding numerous transcriptional regulators (e.g. AcrR, ArsR, LysR, MarR, RpiR, AlpA, DksA), chemotaxis and adhesion proteins, ABC-type transport system components, proteins engaged in the stress response system (alkylhydroperoxidase, SbcCD, LexA), non-ribosomal peptide synthetase, components of the toxin-antitoxin system (RelBD), efflux permeases (MRS), DNA modification methylases, exo and endonucleases, mobile elements (transposase InsO), in addition to abundant prophage-associated proteins (e.g. tail protein, integrase, portal protein BeeE, primase, protein D, protein U, repressor protein C, protein W, protein X, DNA circulation protein, terminase-like protein, capsid-like protein, YmfQ, head maturation protease, head-tail adaptor) (Supplementary Table 4).
Previously, Golanowska et al. [33] pointed to MK10 and RNS 05.1.2A strains as the ones most distant from the others tested, basing on the largest number of unique genes as calculated by Mauve. Likewise, in the current research, RNS 05.1.2A stood out regarding the number of strain-specific COGs, however IFB0487 and IFB0417 followed. The MK10 strain possessed solely 1 unique COG, just as IFB0421. COGs of the unique D. solani pangenome fraction (Supplementary Table 4) were mainly assigned by BPGA v. 1.3 into general function prediction only (R) and function unknown (S) categories, though they belong now to the X group i.e. mobilome: prophages, transposons. This last evidence is in agreement with the former study of Golanowska et al. [33] which underlined the importance of prophages in the evolution of D. solani genomes. Out of 35 prophage sequences detected in 14 D. solani genomes, the majority of the strains harbored 2 - 3 prophages with the exception of RNS 05.1.2A, which showed the presence of 7 such prophage-like elements [33]. Also Khayi et al. [32] reported the RNS 05.1.2A strain to possess unique phage elements and hypothetical or unknown proteins except for some genes coding for two putative ABC transporters, two hypothetical virulence factors and one methyl-accepting chemotaxis protein, similarly to the types of COGs that have been established in the unique pangenome fraction of the herein described research (Supplementary Table 4). It is also worth noticing that a protein family involved in adhesion has been spotted in the D. solani unique pangenome fraction (Supplementary Table 4), which is in accordance with the previous suggestions of Golanowska et al. [33] on the putative involvement of these proteins in the overall D. solani virulence. Furthermore, quite a big number of the observed transcription-associated unique COGs (Fig. 3; Supplementary Table 4) confirms the assumptions of Potrykus et al. [34, 35] on the correlation between the regulation of genes expression and the noted differences in the virulence of various D. solani strains.
Core-genome-based phylogeny on D. solani strains
To the best of our knowledge, this is the first report on a large-scale genome-wide evolutionary study involving 22 D. solani strains. In the first large clade of the generated neighbour-joining phylogenetic tree computed on concatenated core gene alignments (Fig. 4), two strains obtained from Portugal (IFB0417, IFB0421) grouped in proximity to the ones isolated from hyacinths in the Netherlands (PPO 9019, PPO 9134). The above-listed strains are depicted in a subclade also with D. solani strains isolated in France (RNS 05.1.2A, RNS 07.7.3B, RNS 08.23.3.1A), in addition to IFB0158 strain isolated in Poland that grouped closely to IFB0221 strain from Germany. All before mentioned strains are hypothesized to share a common ancestor with IFB0311 from Poland that is the last strain included in the first large clade. To start with the second large clade, there is a MK16 strain from Scotland assembled together with the D. solani type strain IPO 2222 from the Netherlands. It is especially intriguing taking into account that Scotland produces 99.5% of its seed potato tubers, while the Netherlands is a potent exporter of this material with strict certification policies [21]. These two D. solani strains share a most recent common ancestor with GBBC 2040 from Belgium, which nicely coincides with the fact that Belgium imports huge amounts of seed potatoes, mainly from the Netherlands, followed by France, Germany and Denmark (https://www.trademap.org/; accessed 18.03.2020). The three above-mentioned strains have the same most recent common ancestor as MK10 from Israel grouped together with IFB0487 isolated in Poland in 2013. These two subclades share a common progenitor with D s0432-1 from Finland, and the previous ancestor with IFB0695 from Poland. The two above-described large clades have most recent common ancestors with the pool of closely related strains: IFB0231 (from Finland), IFB0223 (from Germany), IFB0212 (from Poland), IFB0167 (from Poland) and IFB0099 (from Poland) (Fig. 4). It might be spotted that the trade routes of seed, industrial and table potatoes find some reflection in the computed phylogeny.
Taking into consideration that the applied BPGA v. 1.3 software extracts protein sequences (excluding paralogs) from 20 random orthologous gene clusters to generate core genome-based phylogeny (Fig. 4), the herein presented visualization might give a hint on the evolutionary relatedness between the studied D. solani strains, but putatively shall not provide conclusive results. Rather, no obvious correlation between the geographical origin of the strains and the computed genome-wide relationships profiting from the core fraction was observed. However, if we have a look at the tree branches lengths reflecting the calculated distances, the recognizable divergence of RNS 05.1.2A strain, followed by IFB0487, MK16 and IFB0417 might be spotted (Fig. 4). This outcome is in agreement with the previous study of Khayi et al. [58] which observed an outgrouping of RNS 05.1.2A strain in a gapA-based phylogenetic tree. Also MLST analysis on the concatenated sequences of rpoD, gyrB, recA, rpoS, dnaX, dnaA, gapA, fusA, rplB, purA and gyrA housekeeping genes pointed to RNS 05.1.2A as an isolate notably different from the other D. solani strains [32]. In the herein generated core pangenome-based phylogenetic tree, the RNS 05.1.2A strain also stood out from the others tested (Fig. 4), similarly to what was noted in the other conducted comparative genomic analyses (Fig. 1, Table 3, Supplementary Tables 1 - 4). One of the major drawbacks associated with searching for phylogenetic relations among D. solani strains is the fact that these isolates are highly similar and they often group together regardless of the origin and year of isolation. Such a phenomenon was reported before by van der Wolf et al. [19] basing on the phylogenies computed on PFGE profiles, IGS regions, single house-keeping genes, i.e. dnaX, recA, dnaN, fusA, gapA, purA, rpoS, rplB or the concatenated sequences of all the above listed 8 genes in addition to IGS. In that study, solely the fatty acids fingerprints showed subtle differences between D. solani strains. It seems that still phylogenetic relatedness between diverse strains is affected to some extent by the applied marker and bioinformatic method, indicating that the most appropriate approach to be used still has to be revealed.