Virulence analysis of 81 of Pseudomonas aeruginosa genomes available in public sequence databases

Pseudomonas aeruginosa is a pathogen capable of causing a wide range of severe opportunistic infections. Its genome contains numerous virulence genes encoding secretion systems of different types, structures responsible for adhesion and motility, toxins, proteases, siderophores, and others. The aim of this study is to analyse virulence, population structure, and distribution of highly divergent genes among 81 P. aeruginosa strains available in whole genome sequence databases. For this purpose, 260 virulence genes were searched in 81 different P. aeruginosa whole genomes that were available in databases. We identified most of the virulence genes as core and softcore genes. The most of the highly divergent sequences encoding pyoverdines, flagella and pilA were acknowledged as accessory, because of the differences in sequence among different alleles of those genes. Phylogenetic tree revealed the existence of three genetic groups of P. aeruginosa. Strains of the first clade were characterised as ExoS positive, whiles genomes of the second clade were ExoU positive. The member of third clade, PA7 strain was the only strain deprived of all T3SS genes. The analysis of pyoverdine locus facilitated finding a new pyoverdine type similar to pyoverdine type III. This newly described variant was present in 7 different strains. It contained a gene that was probably created by the fusion of pilD and pilI genes. In order to determine the coexistence of genes encoding exoenzymes, flagella and pyoverdines, Pearson correlation coefficients were calculated. There were significant correlations between genes encoding ExoS/ExoU-type strains and genes encoding type-A/type-B flagella. The correlation also occurred between Conclusion on Pseudomonas aeruginosa whole genome information from online databases. We conclude that most P. aeruginosa virulence genes are present in more than 95% of available genomes of the species. There are correlations of occurrence of different P. aeruginosa Correlation in distribution of accessory virulence genes in analysed genomes was additionally investigated. We focused on occurrence of non-divergent genes exoS and exoU with different variants of flagella and different sets of pyoverdine locus genes in analysed genomes.

4 different amino acid sequences [9][10][11][12]. Genes encoding pyoverdine, flagella, pilA fimbrial protein and lipopolysccharide (LPS) can be highlighted as the ones with divergent sequences. The highest known sequence diversity in P. aeruginosa is observed for pilA gene. The pairwise differences between pilA variants are up to 71.3% [13]. The long chain of polysaccharide (O-antigen) is a component of bacterial LPS. Region encoding B-band of P. aeruginosa O-antigen is also highly divergent between different strains. Eleven conservative sets of genes were recognised in this region [11]. Each group of genes is highly divergent from one another. Pyoverdine is a major P. aeruginosa siderophore. Based on the differences in sequence of fvpA, pvdE, pvdD, pvdJ and pvdI genes in the pyoverdine locus, three different pyoverdine types were identified. According to the different gene set, bacteria produce structurally distinct pyoverdines [12]. Sequence diversity is exhibited in sequences of flagella encoding genes as well. Flagella of type A or B are produced by different P. aeruginosa isolates. The production of each type of flagella is facilitated by the possession of different gene set in flagellin biosynthesis locus. Amino acid sequences of different chains building flagella are identical between types in 63-65% [14].
Over the last decade, rapid development of new sequencing methods increased effectiveness of sequencing while reducing the cost. These recent advantages help researchers to sequence hundreds of thousands of full genomes of different organisms. This development has a great impact on microbiology. Whole genome sequences have been utilised to highlight genetic variations within species or to assess the size of the pan, core genome of different microorganisms. Core genes are described as the genes that are present in all strains of one microorganism. Those sequences are usually responsible for encoding essential factors for bacteria. Term 'accessory genes' refers to the genes that can be found in a subset of strains of a species. Those genes are accountable for interspecies variability. They contribute to individual features of different strains, for example: the ability to colonise different host organisms [15]. Together with the increasing number of whole sequenced genomes, it is now possible to precisely assess the size of the core and accessory genome. In recent studies, core genome of P.
In this work, 81 P. aeruginosa whole genomes of different strains from environmental and clinical sources were used. We analysed the virulence and population structure of those strains. In this analysis, we redefined core, softcore, and accessory virulence genome of P.
aeruginosa. Distribution of different variants of highly divergent genes (HDGs) encoding flagella and pyoverdines in downloaded genomes was also investigated. We decided to examine if there is correlation of occurrence of different accessory genes encoding exoenzymes, flagella and pyoverdines.

Results
Sequences of five genes including algC, pilA, pscP, vgrg1b were not detected with gene finding software. Among all investigated genes, 109 were acknowledged as core genes, 101 as softcore genes, and 50 as accessory genes (Supplementary table S1).
In a group of genes associated with T3SS, 40 of them are softcore genes and 4 are accessory genes. Genes encoding exoenzymes exoU, exoS , exoY, exsE, and pscP were accessory. ExoU was found in 23, exoS in 57, exoY in 70 and exsE in 75 genomes out of 81. Effector proteins of T3SS, exoS and exoU are mutually exclusive in examined genomes.
The conducted phylogenetic analysis of P. aeruginosa strains showed the existence of two large genetic clads. Strains of the first clade were characterised as ExoS-positive, whiles genomes of the second clade were ExoU-positive. We have not found any genome containing both exoS and exoU genes. Dendrogram is shown in figure 1. PA7 was the only strain deprived of all T3SS genes. Strain PA7 is genetically distant from other P. aeruginosa groups and is was not included in the tree.  Among genes associated with alginate production, only mucA is identified as accessory.
The genes encoding effector toxins of the T2SS are core and softcore. ToxA and plcH sequences are found in 80 genomes, whereas lasA and lasB in all the analysed genomes.
Alkaline protease aprA -effector protein of T1SS is found in 80 strains. Rhamnolipids encoding genes rhlA and rhlB are core sequences. Genes encoding T6SS are qualified as core and softcore genes except the sequence of fha1 gene which is accessory and is found in 76 genomes. All sequences encoding phenazines except phzH are qualified as core genes. In the most strains, more than one copy of each gene encoding phenazines has been detected. Distribution of all genes in core, softcore and accessory genome is illustrated in figure 4. Different genes involved in pyoverdine biosynthesis were core, softcore, and accessory.
All accessory genes were found in pyoverdine divergent locus. In this locus, three known sets of genes can be found. Alleles of these different sets are mutually exclusive. First set, known as type I is present in 27 of 81 analysed genomes. Alleles of pyoverdine type II are the most abundant. Genes of these type are found in 37 of 81 strains. Pyoverdine type III is found in 16 strains. The analysis of pyoverdine divergent locus in analysed genomes revealed the existence of previously unknown set of genes similar to the genes of pyoverdine type III. We refer to this set of genes as type IIIb. This type contains pvdE, fpvA and siderophore-interacting protein (sip) gene alleles of pyoverdine type III. However, type IIIb is devoid of pvdI, pvdJ or pvdD genes. Instead there is a long coding sequence which is a 5' side highly similar to pvdI of type III, and at 3' side almost identical as pvdD gene of type III. We referred to this sequence as pvdID. There is no similarity between pvdID sequence and pvdJ gene. Figure 3 demonstrates the locus structure of pyoverdine type IIIa and IIIb. Type IIIb occurred in strains isolated in Brazil.
As we have analysed the distribution of HDGs between different P. aeruginosa strains, we have also decided to determine the correlation of occurrence of different accessory genes encoding exoenzymes, flagella and pyoverdines. Symmetric similarity matrix was formed after all r-values were calculated (additional files 2 and 3). There were significant correlations between genes encoding: (1)

Discussion
The purpose of this research was to characterise virulence of 81 P. aeruginosa strains isolated worldwide, based on their genomic data. We regrouped virulence genes according to whether they are a part of core, softcore or accessory genome and analyse the distribution of HDGs between different strains that were isolated worldwide. This research facilitated demonstrating possible genetic differences and similarities in virulence between various P. aeruginosa strains.
P. aeruginosa has multiple virulence factors including toxins, proteases, pyoverdines, pili, flagella, secretion systems, or quorum sensing mechanism. In a group of T3SS encoding genes, no core gene was found. Lack of T3SS core genes is caused by the fact that one of the strains PA7 is deprived of T3SS. PA7 also does not have effector exoenzymes exoS, exoU, exoY and exoT of this system. According to this research and recently published articles, PA7 strain is genetically distant to other P. aeruginosa strains. There are known cases of isolation of other exoS and exoU negative PA7-like strains related to PA7 [2,62,63]. Research published by Stewart et al. 2014, demonstrates the existence of three genetic clades within P. aeruginosa species [2]. The first group was characterised as exoS positive, second as exoU positive. The third group was devoid of both exoU and exoS genes. In this group, PA7 strain and other PA7-like strains were found.
The phylogenetic analysis conducted in the research has similar structure to that published by Stewart et. al. [2]. Similarly, exoS and exoU positive strains are separated into two clades, and PA7 is the third branch of the tree. According to Kulasekara et al. 2006, both exoS and exoU genes together with whole T3SS apparatus were acquired by horizontal gene transfer from other microorganisms [64]. It explains the existence of the third PA7 group. In this clade, there are P. aeruginosa strains that have never acquired T3SS system with effector proteins. More frequent isolation of T3SS positive strains can be explained by potential evolutionary benefits caused by this system, and in result there is a genetic advantage of T3SS positive strains over T3SS negative strains.
PilA gene cannot be determined as core gene using standard cut-off values as this gene has highly divergent sequences between different genomes [10]. Although pilA sequence is not core or softcore, its function is still probably very important as different variants of this gene were found in 76 genomes. The presence of pilA in genomes was confirmed using manual search of genomes, as sequence of this gene were not detected by gene finding software.
Large number of genes encoding pyoverdines were acknowledged as accessory genes, because their sequence similarities between alleles of those genes exhibited more than cut-off values used for separating gene families. However, we did not find any genome lacking different gene sets of pyoverdines. Different variants of fpvA, pvdE, pvdY genes were found in all investigated genomes, and all found variants were located in the pyoverdine locus. Based on these results, disregarding that pyoverdine genes are accessory, it is seen that pyoverdine production and pyoverdine receptors are crucial for P. aeruginosa and therefore those accessory genes have core functions.
Similarly to exoS and exoU genes it is suggested that different types of pyoverdine genes were acquired by horizontal gene transfer. History of horizontal gene transfer is suggested by unusual codon usage and oligonucleotide usage in pyoverdine region [12].
Nevertheless, exoS and exoU distribution in various genomes is consistent with genetic clades of phylogenetic tree representing evolutionary relationships of P. aeruginosa strains. This is not observed for pyoverdine types. It is possible to find both pyoverdine types of type I and II in exoU and ExoS clade. To explain these differences, we suggest that sets of genes encoding pyoverdines were transferred to P. aeruginosa strains of different genetic clades or they were transferred between different P. aeruginosa strains.
Another question arises there. Why genes of different pyoverdine variants are mutually exclusive. We suppose that the possible benefits for strains harbouring two pyoverdines gene sets may be insufficient in comparison to the disadvantages, e.g. additional energy and resource consumption. In our study, we did not observe genomes with mixtures of alleles of different types, but the existence of strains with type I and type II pvd genes was described previously [12]. Those isolates may stand as strains that after acquisition of two types of pvd genes, recombined them and as a result the mixture strain containing type I and type II pvd genes was created. Throughout the analysis of pyoverdine locus in 7 strains, we found new undescribed coding sequence. This gene at 5' side is nearly identical with a part of pvdD(3) gene of pyoverdine type III, and at 3' site is similar to pvdI(3). This gene is presumably a fusion gene that was created after deletion of the locus between pvdD and pvdI genes. This gene has been found in 7 investigated genomes.

Genes encoding different flagella types A and B are distributed between ExoU and ExoS
clades. Similarly to pyoverdine genes, in this case horizontal gene transfer is also presumably responsible for distribution of different sets of genes between ExoU and ExoS clades. We also noticed that there is significant correlation between ExoS/ExoU clades and type of flagella. Most of strains of ExoU clade had also of type-A flagella. This correlation is difficult to explain as there is no direct linkage of function between those genes.
However both exoenzymes and flagella are acknowledged as virulence factors, and therefore we can suppose that there is some kind of beneficial effect for exoU strains to also harbour flagella of type II gene sets. There was also a significant correlation between occurrence of genes encoding pyoverdines of type II with different flagella type. In this situation beneficial effect could also explain this correlation. Similarly to pyoverdines, most flagella encoding genes were acknowledged as accessory. However it is seen that most of those genes have core functions, because through the analysis, we did not find a genome deprived of flagellar gene set.

Conclusion
We analysed 81 whole genome sequences of P. aeruginosa strain available in NCBI Reference Sequence Database. This study facilitates describing genetic differences between various P. aeruginosa strains. We conclude that the most P. aeruginosa virulence genes are present in more than 95% of available genomes of the species. There is significant correlation in distribution of different variants of flagella and pyoverdine sets of genes with ExoU and ExoS clades. The analysis of pyoverdine locus facilitated finding a new pyoverdine type similar to pyoverdine type III. This newly described variant was present in 7 different strains. It contained a gene that was probably created by the fusion of pilD and pilI genes.

Bacterial isolates and their source of genomic data
We used 81 whole genome sequences of P. aeruginosa. The nucleotide sequenceswere downloaded from NCBI Reference Sequence Database ( https://www.ncbi.nlm.nih.gov/refseq/) [24]. A description of the isolates is shown in table 1. The subject sequences originated from clinical and environmental sources. Genomes of chosen isolates had different sizes ranging from 6.04 to 7.5 Mb.  [52,53]. In the research, we used 260 genes associated with virulence, encoding various secretion systems, type IV pili, flagella, alginate, lipopolysaccharide (LPS), proteins responsible for pigmentation, proteases, toxins, and components of quorum sensing mechanism (Additional file 1).

Analysis of core and accessory virulence genes
In the first stage of our analysis, the genes were recognised in the genomes with bioinformatical tool Prodigal [54]. We evaluated the capacity of Prodigal to correctly predict the presence of reference virulence genes in reference genomes PAO1 and UCBPA-14. Sequences of correctly predicted genes were used as a database in Diamond Blast [55]. HDG and genes that were not identified by Prodigal were searched manually in the annotated genomes available on RefSeq database and Pseudomonas Genome Database website [24,56]. To identify gene homologues, we performed reference genes against all genes similarity search using Diamond Blast [55]. The genes have been clustered with the removal of similarities <80% sequence identity and <80% query and subject coverage. Sequences were recognised as core genes when homologues of reference genes were detected in all genomes. Soft core genes were identified as genes present in 95% of genomes. Below this value, all genes were qualified as accessory genes.

Variants of highly divergent genes in different genomes
The presence of different variants of HDG encoding pilA , flagella, and pyoverdines between different genomes was verified. The variants of those HDGs were searched in annotated genomes available in RefSeq database and Pseudomonas Genome Database using inbuilt BLAST software and known reference gene sequences from UniProt database [53,56]. Coding sequences that were found exhibiting at least 30% sequence identity, 30% query coverage with the refence genes were identified as possible variants of HDG. In the next step, genomic location of each identified possible variant was determined. When the location of the sequence was the same as for reference gene, we referred to the sequence as a genetic variant of HDG. The presence and location of the genes in specific locus were determined with the use of simple bioinformatic tools Blast and NCBI's Genome Workbench graphical viewer [57,58].

Phylogenetic analysis
Sequences of several core genes aroE, carA, gyrB, rpoB, rpoD were extracted from the genomes from Pseudomonas Genome Database [56]. Those core genes were used in phylogenetic analyses of different P. aeruginosa strains [59]. Genes were concatenated with SequenceMatrix software v.1.8 and compared with each other with the use of bioinformatic tool MEGA v.7.0.26 [60,61]. The phylogenetic tree was constructed using the neighbour-joining (NJ) method with 1,000 bootstraps.