Comparative genomic analysis of gene clusters of Pseudomonas aeruginosa that define specific biofilm formation in deciphering target regions for novel treatment options

Pseudomonas aeruginosa is an opportunistic pathogen associated with numerous nosocomial infections that are difficult to treat as a result of natural resistance to various antibiotics, particularly because of biofilm formation. The purpose of this study was to determine the distribution of biofilm formation genes in sequences of this opportunistic pathogen and their association with different ecological niches. In total, 13 genes responsible for biofilm formation by P. aeruginosa were identified and used in the study. They were clustered into seven categories based on the role they play in the biofilm formation process. The study also analyzed 185 complete genome sequences of P. aeruginosa strains retrieved from the NCBI and IPCD databases. These were classified into 14 categories based on the ecological niches they occupy. strong of these faster than the other of genes between different . bfiS cupC1 fliC pclass 2400kbps, 4600kbps.I. BRIG PA2771 3150kbps, amiC at 3750kbps, PA3989 4450kbps, bfiS and bfiR at locus 4700kbps, PA4625 at locus 5200kbps and algC at locus6000kbps. K. BRIG analysis of sequences from urine samples indicated that the genes in variable regions included p class at 2400kbp and amiC at locus 3750kbps.L. BRIG analysis for sequences from wound infections indicated that the genes in the variable regions included p class at locus 2400kbps, PA2752 at locus 2850kbps and amiC at locus 3750kbps.M. BRIG analysis of sequences isolated from abscess indicated that genes in the variable regions included cupC1 at locus 1050kbps, fliC at locus 1150kbps, p class at locus 2400kbps, amiC at locus 3750kbps and algC at locus 6000kbps.N. BRIG analysis of sequences from blood isolates indicated that the genes present in the variable regions included p class at locus 2400kbps, PA2824 at locus3150 kbps, amiC at locus 3750kbps, PA3989 at locus 4450kbps, PA4108 at locus 4600kbps, bfiS and bfiR at locus 4700kbps.


Background
Pseudomonas aeruginosa accounts for 11% of nosocomial infections with immunocompromised patients being the most susceptible (1). With a mortality rate of 60%, the opportunistic pathogen ranks high among the Gram-negative bacterial killers (2).
The high mortality rate is attributed to antibiotic resistance that often is a result of mutations in chromosomal genes, horizontal gene transfer and acquisition of virulence plasmids bearing genes encoding for beta-lactamases and other drug resistance genetic determinants (3,4). The antibiotic resistance is compounded by biofilms that cannot be easily destroyed (5). Pathogens often resort to the formation of biofilms to protect themselves from innate immune functions (6). The microorganism's susceptibility to mutagenesis is also increased when it experiences an increase in biofilm formation. This is likely to result in less sensitivity to antibiotics further fueling the antibiotic resistance ability of the organism (7). It is important that the gene clusters responsible for biofilm formation in P. aeruginosa are identified so that novel therapies can be centered on controlling this important phenomenon. It may be also of importance to be able to identify virulence genes in metagenomic DNA reads to monitor the distribution of nosocomial infections.
Pseudomonas is known to continually sense the environment and adapt to different phenotypic forms in a bid to increase its chances of survival within the human host. When the immune system is compromised, the microorganism steps up its pathogenesis as is the case in immunocompromised patients (6). It has also been previously reported that biofilm cells can modulate some of their genes in a bid to quickly adapt to harsh environmental conditions (8). These cells can either opt to decrease the permeability of the cell wall, express multidrug efflux pumps at high levels or activate antibiotic modifying enzymes as they try to evade therapeutic efforts castigated against them.
Previous studies have suggested that development of novel therapies for refractory P. aeruginosa may be augmented by understanding the mechanisms of biofilm-specific adaptations exhibited by the bacteria (8). The expression of drug resistance genes in P. aeruginosa often accompanies biofilm formation. This feature provides an evolutionary advantage to the bacteria allowing it to evade the effects of drugs as well as the defense system of the host organism (8). Apart from this synergy, previous studies have also revealed that the extracellular matrix of bacterial biofilms can act as a diffusion barrier for antibiotics (9). There is a clear need for a better understanding of the interplay between the genes involved in biofilm formation by P. aeruginosa that gives more credence to this study. Such knowledge will reveal potential molecular targets for development of novel effective therapies.
Biofilm formation in P. aeruginosa relies on a natural heterogeneity of bacterial populations switching to biofilm formation when the nutrients become scarce and waste products accumulated (10). Several antimicrobial agents have previously been developed to target these population development phase (11). Pathogens have adapted to these therapeutic agents by either population behavior changes or eliminating molecular targets (12). Phase variations in bacterial populations are mediated by a number of biofilm formation genes including amrZ, which modulates the psl operon responsible for microcolony formation (10). Our study sought to decipher the roles of various biofilm formation genes and tried to relate the functional properties with the evolutionary relationships and location of these genes in the genomes of the pathogen. Although the biofilm formation is an important factor in the development of antibiotic resistance, little is known about the conservation and variation patterns of the genes responsible for this attribute in P. aeruginosa. Rather than following a systematic exploration, new findings around biofilm formation in the opportunistic pathogen have been based on chance (13).
This study sought to employ various comparative genomics tools to establish the conservation and variation patterns of biofilm formation genes among different strains of P. aeruginosa. P. aeruginosa consists of different strains that occupy different niches within their human hosts and are classified under three clades based on their evolutionary relationships. The evolutionary differences have previously been linked to biotope associations within the host and environmental niches (14). Previous studies on the opportunistic pathogen are yet to provide reliable explanations for the pathogen's population structure and exhaustive analyses of the NCBI repositories remain scanty and disjointed (15). The National Center for Biotechnology Information (NCBI) contains approximately 176 complete genomes of Pseudomonas aeruginosa strains isolated from different ecological niches. These sequences provide a valuable source for investigating the complexity and diversity of the various strains of the opportunistic pathogen (16). Given the large number of genomes present in the NCBI GenBank database, an in silico approach to characterize the genes involved in biofilm formation in P.aeruginosa looks feasible.
Sequences of genes responsible for biofilm formation in the opportunistic pathogen were retrieved from the NCBI GenBank and analyzed using different comparative genomics tools in a bid to characterize these genes in various strains of P. aeruginosa. The study identified conservation patterns of genes in the opportunistic pathogen. These were designated as regions of interest and identified as biomarkers for novel treatment options.
Comparative genomics was used to identify variations in these genes especially in flexible regions of P. aeruginosa genomes. The findings of the study revealed potential biomarkers for novel treatment options and shed more light on the survival mechanism of P.
aeruginosa within the human host.

P. aeruginosa Sequence Retrieval
The study sought to retrieve complete genome sequences of P. aeruginosa strains isolated from various ecological niches. These sequences were analyzed by different comparative genomics tools in a bid to characterize the biofilm formation genes with respect to different strains of the pathogen. A total of 185 complete genome sequences of P. aeruginosa strains from the NCBI and IPCD databases were retrieved for analysis as indicated in Table 1. Microorganisms were classified into 14 categories based on the ecological niches the different strains were isolated from. The study used metadata on each strain to complete this classification. Table 1 shows niche-specific categories of P. aeruginosa isolates which were used for downstream analyses.  Figure 1 indicates the niche associations of the strains selected from NCBI. The selection doesn't represent any real prevalence of P. aeruginosa in nature. It is biased by how strains are selected for sequencing.

Identification of Biofilm Formation Genes
Using genome annotation data and sequence alignments, 13 biofilm formation genes common in most of the strains of the P. aeruginosa were identified and grouped into corresponding COGs represented by individual FASTA files as indicated in table 2. Each file contained 44 protein sequences of the respective genes selected from every P.
aerugionosa reference genome. Table 3 indicates the names, GenBank accession number, size and number of biofilm associated genes represented by individual COGs.  Table 3 The names, GenBank accession numbers, size, number of annotated genes and ecological niches of the 44 Pseudomonas aeruginosa genomes The retrieved sequences were further classified into four categories based on the role they play in the biofilm formation process. An additional category was created to cater for genes whose functions are not clearly understood yet. Specific roles were identified using related publications and metadata provided for each gene. Based on this, the genes were assigned to different classes as shown in Table 4.

Comparative Genomic Analyses
From the comparison of the 13 COG-based ML phylogenetic trees, the study created a tree of relationships between different biofilm related genes using the treedist distance matrix ( Figure 2). The phylogenetic analyses revealed four clusters. From the 13 biofilm formation genes analyzed, 10 genes fell into a single cluster. The algD and algU genes diverged the most from the other biofilm formation genes, while fliC was not completely divergent from the other genes which seem to have co-evolved together. While the study assumed that all the biofilm formation genes co-evolved together given that they belong to a group of functionally related genes that generally was confirmed by the obvious coevolution of these genes -the divergence of the three genes may result from a horizontal gene transfer. Alternatively, it may be assumed that these genes evolved faster than other genes as they are more important in terms of a proper response of the biofilm formation to specific environmental stimuli in different habitats. It makes these genes potential targets for antibiofilm therapies. Both the algD and algU genes were classified as regulatory genes responsible for the regulatory stage of the biofilm formation process.

Genome Mapping
To further investigate the role of horizontal gene transfer in the divergence observed in algU, algD and fliC, the study conducted a BRIG analysis of different strains of P.
aeruginosa. This analysis helped to determine the variable and conserved regions within the genome as well as the distribution of the biofilm formation genes. It also pointed further light to the meaning of these mutations in different ecological niches occupied by the ubiquitous pathogen. From the BRIG analysis the study created 14 gene maps, each map representing strains of the pathogen occupying different ecological niches. ( Figure 3A to figure 3N). Figure 3A indicated the distribution of the biofilm formation genes among the sequences of the strains isolated from the 13 ecological niches that were identified by this study. Biofilm formation genes retrieved by the custom python scripts were highlighted in green. This analysis revealed that the pslE, pslG and pslJ genes were all located around the same locus (2450kbps) in a variable region of the genome of strains of P. aeruginosa. This gives credence to the coevolution between these set of genes. The three genes are part of the eleven psl genes required for Psl production and surface attachment which is necessary for biofilm formation in nonmucoid strains of Pseudomonas aeruginosa which don't depend on alignate as the main biofilm polysaccharide (17). These genes are co-transcribed and our evolutionary analysis indicated that they co-evolve with each other. The other ten genes were located in different loci, all of which were conserved regions of the genome. Figure 3B to figure 3N indicated the distribution of the genes among sequences of strains isolated from each individual ecological niche. Besides the distribution of biofilm formation genes, the BRIG analysis was also used to identify the conserved and variable regions of the genome of the pathogen. This was done to further elucidate the differences in the distribution of genes in strains from different ecological niches.

Discussions
Protein AlgU has been identified as a founding member of the ExtraCytoplasmic Function (ECF) family -a group of products responsible for transcriptional regulation and response to environmental stress (18). This sigma factor has been associated with regulation of genes having the AlgD's promoter (19). Previous studies indicate that mutations in algU can affect mucoidy in P. aeruginosa or lead to a partially active AlgU (20,21,22). A study by Scalan et al. (2015), also indicated that this gene is important for P. aeruginosa to withstand various treatments hence the need for an adjustment in its activity by specific mutations (23). It may explain a rather specific evolution pattern displayed by this protein. Divergence of this gene as seen in Fig. 2  The gene fliC codes for the B-type flagellin protein that is necessary for biofilm formation in P. aeruginosa. Previous studies have indicated that fliC mutant strains are hampered in their ability to cause diseases in model animals (28). Similar studies have shown that pilli expression does not contribute significantly to virulence in the absence of flagella. The expression of these gene is however downregulated in strains from cystic fibrosis isolates, allowing the pathogen to colonize the lungs (29). With the significant diversion from the other biofilm formation genes, we speculate that the fliC gene undergoes necessary mutations to enable the pathogenesis of different strains of P. aeruginosa and survival within different ecological niches.
This study aimed to identify biofilm formation genes, classify them based on the role they play in the biofilm formation process and analyze the distribution of these genes in the genomes of various strains of P. aeruginosa. All these objectives were aimed at identifying genes and processes that could serve as potential targets for novel antibiofilm therapies.
Proper understanding of interactions mediated by these genes could potentially result in development of therapeutic agents that impair biofilm formation. Infections that are refractory to antibiotic treatments could be kept in check.

Conclusion
The diverse distribution of genes involved in biofilm formation in P. aeruginosa highlights the diversity of pathways the bacterium can explore in different ecological niches. The conservation and variability of some of these genes in particular niches offers the scientific body a lot to think about especially on the possibility of new antibiofilm therapies. It is possible that these evolutionary differences are associated with horizontal gene transfer, a phenomenon that could be the center of future investigations on biofilm formation genes. The study identified potential target genes for novel antibiofilm therapies, including fliC, algD and algU. The study's hypothesis that gene clusters with similar functions evolved similarly was confirmed by the coevolution of these genes.
Divergence witnessed in three genes was associated with horizontal gene transfer or faster evolution of these genes. Further studies on the evolutionary relationships between these genes need to be done to identify the underlying role of the differences that we witnessed. Our study revealed that the biofilm formation process is dependent on the synergy between different classes of genes that modulate various steps in the overall process.

P. aeruginosa Sequence Retrieval
The study collated and analyzed whole genomes of P. aeruginosa to determine the conservation and variation patterns of biofilm formation genes in selected strains of P.
aeruginosa strains. Complete genomes were downloaded from NCBI and the International Pseudomonas Consortium Database (https://ipcd.ibis.ulaval.ca/) as GBK files for complete genomes and Fasta files for individual genes. The study only selected "the complete genomes" and sequences available by December 2018 from these databases. The genome sequences were then categorized based on the ecological niches that they occupy within the human host based on the metadata provided by the databases.

Identification of Biofilm Formation Genes
For a comprehensive analysis of the phylogenetic relationship between the biofilm formation genes, the study sought to collect orthologous genes from representative genomes of P. aeruginosa based on amino acid sequence similarity using a reciprocal BLASTP search. Genes from different genomes showing at least 90% protein sequence similarity or above were grouped into clusters of orthologous genes (COG). An in-house Python 2.7 script was written to facilitate the retrieval of the biofilm formation gene sequences from the annotated genomes (GenBank files) of P. aeruginosa. The script was designed to create FASTA files for every COG, containing sequences of the respective genes selected from all reference genomes.

Comparative Genomic Analyses
To infer phylogenetic relationships between sequenced P. aeruginosa, all proteins of every COG were aligned by MUSCLE algorithm (30). Individual COG alignments were edited by the program Gblocks (31) and then concatenated into a super-alignment used for a phylogenetic inference by the Neighbour-Joining (NJ) algorithm implemented in the program of PHYLIP 3.69 (Phylogeny Inference Package) (32).
For every COG of biofilm formation genes found in selected reference different strains, n=13, the study conducted multiple sequence alignments using the MUSCLE algorithm (30). The default values of the MUSCLE algorithm were used. From the aligned sequences, maximum likelihood (ML) phylogenetic trees were created for every COG. The resulted phylogenetic trees were compared by the program treedist of PHYLIP 3.69 using the Branch Score Distance algorithm (32). An in-house Python script was used to reformat the treedist text output file into a matrix of distances between COG-based ML phylogenetic trees in PHYLIP format suitable for clustering of the trees by the NJ algorithm. The resulted dendogram was used to analyze grouping of co-evolved biofilm formation genes.
Clustering of several genes together would mean their co-evolution while separating genes to different clusters would mean horizontal gene transfer exchange.

Genome Mapping
The BLAST Ring Image Generator (BRIG) was used to analyze different strains of P.
aeruginosa to visualize the core and flexible genomes against a reference genome (33).
The PAO1 strain was used as the reference genome for these analyses. The BRIG analyses were also performed to determine the distribution of the biofilm formation genes among the different strains of P. aeruginosa with the PAO1 used as the reference genome.

Statistics
Gene orthology was predicted by a reciprocal BLASTP alignment of protein sequences of all protein coding genes in one genome against all protein coding genes in another genome in a direct and reverse order. Those genes which showed the best BLASTP alignment hits reciprocally as query and subject sequences with 90% protein sequence identity or above and 75% coverage were considered as pairs of orthologous genes.
Robustness of clustering of co-evolved genes was controlled by the bootstrapping analysis implemented in the program bootstrap of the package PHYLIP 3.69 using 100 replicates of the distance matrix.

Consent for publication
Not applicable

Availability of data and material
All the data sets used in the study are available with no supplementary material. Figure 1 Distribution of selected strains of Pseudomonas aeruginosa strains in the various ecological niches. Strains isolated from the sputum niche were the most abundant while the isolates from the lungs and dental niches were the least abundant.

Figure 2
Co-evolution analyses of genes responsible for biofilm formation in strains of P.
aeruginosa. The NJ dendogram shows that the majority of gene COGs produced identical phylogenetic trees of the selected reference genomes that indicates a strong co-evolution of these genes. Exceptions were the genes fliC, algD and algU which may be exchanged by horizontal gene transfer or evolved faster than other genes of this functional group. The tree is drawn to scale with branch lengths in the same units as those of the evolutionary distances used to infer the COG phylogenetic tree.