General genome features of M. aeruginosa strains
Genomes of 25 M. aeruginosa strains were retrieved from the National Center for Biotechnology Information (NCBI) database for series analysis. The general features of the genomes are presented in Table 1. Except for strains NIES 2481[37] and NIES 2549[38], no plasmid sequences were discovered in other strains. The average size of the genomes was 4.99 Mb, and the average G+C content was 42.70%. Among them, strain KW had the largest genome (5.89 Mb), whereas strain PCC9806 had the smallest genome (4.26 Mb; Table 1).
Genome sequences of the strains CHAOHU 1326 and NaRes975 were recently released by our laboratory. The draft genome of strain CHAOHU 1326 contained 617 contigs with an N50 of 19,902 bp and the largest contig size was 84,471 bp. The M. aeruginosa CHAOHU 1326 genome had a total size of 5,271,583 bp, with a G+C content of 42.50%. In total, 5,517 genes were identified from CHAOHU 1326, including 4,590 protein-coding sequences (CDSs), 59 RNA-coding genes, and 868 pseudogenes. RNA-coding genes consist of 46 tRNA genes, four noncoding RNAs (ncRNA), and three sets of rRNA genes. Among 4,590 CDSs, 3,434 could be assigned putative functions, whereas 1,156 were predicted to encode hypothetical proteins. As for strain NaRes975, the final assembly of the genome consisted of 413 contigs with an N50 contig length of 29,122 bp and a largest contig length of 97,956 bp. The assembled genome was 5.1 million nucleotides in length with a G+C content of 42.40%. In total, there were 5,388 genes identified from the NaRes975 genome, which included 4,617 CDSs, 47 RNA-coding genes, and 724 pseudogenes. RNA sequences consisted of 40 tRNAs, four ncRNAs, and one rRNA cluster. Additionally, four strains with complete genome sequences (NIES843, NIES2481, NIES2549, and PCC7806SL) all contained two sets of rRNA clusters (5S, 16S, 23S: 2, 2, 2), as did six other strains with draft genome sequences (Additional file 2, Table S2). Additionally, 14 strains contained one rRNA cluster (5S, 16S, 23S: 1, 1, 1).
Pan-genome of M. aeruginosa
To assess genome similarity among M. aeruginosa strains, a core–pan-genome analysis was performed using all 25 M. aeruginosa genome sequences as input in the Bacterial Pan Genome Analysis (BPGA) tool[39]. The pan-genome analysis revealed a core genome of 1,993 genes with an accessory genome of 36,030 genes and 2,265 unique genes (Fig. 1a). Accessory genes are those whose orthologs are present in two or more genomes, but not in all the genomes. Additionally, unique genes are genes that are only present in one genome out of all those compared. The core–pan plot (Fig. 1b) showed that the pan-genome trend curve did not reach a plateau and seemed to extend with addition of more genomes to the analysis. Therefore, the pan-genome was considered an “open” pan-genome. In contrast, as shown in Fig. 1b, the core genome curve leveled off, which indicated that the core genome size did not significantly change with the addition of new genomes. Consequently, the core genome was considered conserved.
To determine the distribution of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Cluster of Orthologous Groups (COG) categories in M. aeruginosa, annotation of the pan-genome of the 25 genomes has been mapped (Fig. 1c and Additional file 2, Fig. S1). KEGG pathway analysis revealed that the genomes of all strains had high proportions of core, accessory, and unique genes associated with carbohydrate metabolism. More specifically, the core, accessory, and unique genes all similarly encode proteins involved in amino acid and energy metabolism. Distributions of genes in COG categories showed that numerous unique genes were responsible for cell wall/membrane/envelope biogenesis and secondary metabolite biosynthesis, transport, and catabolism, which indicates diverse strategies of environmental adaptation by secondary metabolite synthesis.
The BLAST Ring Image Generator (BRIG) alignment made it clear that most regions within the 25 M. aeruginosa genomes were conserved when compared to the reference strain NIES843 (Fig. 2). Several regions appeared to have low or even no similarity, possibly because of acquisition/deletion/rearrangement or horizontal gene transfer (HGT). These regions contained a variety of genes (shown in Additional file 1, Table S1 with annotation) that could be identified by conserved sequences flanking the variable regions. Approximately 44.6% of the genes located in these regions were annotated as hypothetical proteins.
Phylogenetic analysis of M. aeruginosa strains
The 16S rRNA gene alone is unable to provide sufficient phylogenetic resolution[40-42], especially for closely related Microcystis species; therefore, phylogenetic relationships were further examined based on a set of conserved marker genes. To assess relationships between the M. aeruginosa strains, the phylogenetic tree was constructed based on amino acid sequences of 31 highly conserved proteins that were encoded by the genes distributed in genomes as a single copy along 24 M. aeruginosa genomes (strain SPC777 was not analyzed here because of the lack of a conserved marker gene). The sequences of the same proteins from Synechocystis sp. PCC6803 were used as an outgroup. The resulting phylogenetic tree (Figure 3a) revealed a topology with generally well-defined nodes, with bootstrap support values greater than 50% over 1,000 replicates. Compared with the phylogenetic tree based on the 16S rRNA gene (Additional file 2, Figure S2), the phylogenetic tree based on the conserved marker genes (Figure 3a) produced higher resolution, by which different strains well distributed and clustered for discrimination. Phylogenomic analyses based on binary gene presence/absence (1/0) pan-genome matrix generated by BPGA pipeline resulted in a tree (Figure 3b) with a topology similar to the tree obtained using conserved marker genes (Figure 3a). Both phylogenetic trees provided more robust topologies than that based on 16S rRNA gene analysis alone. Specifically, strains NIES843, PCC9809, KW, and NIES88 were included in the same clade, and strain CHAOHU 1326 was closely related to strains PCC9807, PCC9443, and PCC9717. Pairs of stains also appeared to be phylogenetically closely related, such as NIES2549[38] and NIES2481[37], DIANCHI905 and PCC7806SL[43], and NaRes975 and PCC9808. Strain M. aeruginosa NIES1211 seemed to be the most divergent.
The majority of the M. aeruginosa strains were isolated in different locations, but no correlation was found between their geographic distribution or bloom-forming ability and phylogenetic relationships (Table 1). Among them, strains CHAOHU 1326[44], DIANCHI905, NaRes975, and TAIHU98[45] are bloom-forming strains isolated from China. Strains NIES843[46] and PCC7806SL are representatives of toxic (microcystin-producing) bloom-forming strains; in contrast, NIES98[47], NIES44[48], NIES2549[38], and NIES87[49] are non-microcystin-producing strains.
Modular signaling proteins involved in c-di-GMP metabolism and regulation in M. aeruginosa
A genome search for genes that encode enzymes involved in c-di-GMP metabolism was performed to identify the putative translated products that have DGC and PDE activities in the selected 25 M. aeruginosa genomes. The predicted proteins are listed in Table 2 and the accession numbers of the corresponding proteins are shown in Additional file 2, Table S3. This survey led to identification of three enzymatic classes of predicted proteins DGCs, PDEs, and hybrid DGC–PDEs, which contain GGDEF and EAL domains, even though they have opposing activities. As listed in Table 2, 14 of the 25 M. aeruginosa genomes had genes that encode DGC enzymes, which contain a fused N-terminal REC domain and GGDEF domain in tandem. The REC domain, as a signal receiver domain present in association with c-di-GMP metabolism domains, is supposed to modulate the enzymatic activities in response to the internal or external stresses. Interestingly, compared with the HD-GYP domain-containing PDEs, which were identified in all selected M. aeruginosa genomes and seemed to be highly conserved proteins with partial EAL domains were found less frequently (in only three genomes). Except for the NIES44 genome, each of the other 24 genomes was found to have a GG[D/E]EF-EAL hybrid protein. The GG[D/E]EF domain-containing DGCs and GG[D/E]EF-EAL hybrid proteins also belonged to accessory genes according to the core–pan-genome analyses.
Bacteria express a variety of sensory and signal transduction proteins to sense and adapt to changes in the physicochemical makeup of their environment. Sensory and signal transduction proteins encoded in the selected 25 M. aeruginosa genomes were predicted, and 12 sensory domain-containing proteins were found. The accession numbers and domain architectures of the highly conserved GAF, PAS, and REC domain-containing proteins are listed in Table 3. As an important sensor for photosensory behavior, the GAF domain was commonly associated with c-di-GMP domains in cyanobacteria[50]. As many as 11 of the 12 proteins had the GAF domain, and some even contained two. PAS-containing proteins are related to sensory input (GAF), transduction (HAMP), or output (histidine kinases). Half of the four predicted PAS-containing proteins contain a PAC motif, a conserved region of 40–45 amino acids located at the carboxy-terminal of the PAS domain, which contributes to PAS structure [27]. Interestingly, some sensory domain-containing proteins in different genomes were identical, and were therefore assigned the same accession number, such as NIES2549 and NIES2481, DIANCHI905 and PCC7806SL, and NaRes975 and PCC9808.
Structural features of GGDEF and EAL domains of DGCs and hybrid proteins of M. aeruginosa strains
To elucidate the structural features, structure predictive modeling of GGDEF and EAL domains of the DGCs and hybrids proteins was performed on the corresponding M. aeruginosa strains. The NIES843 genome is a representative genome of M. aeruginosa because of its genome has been completely sequenced and is modeled in Figure 4. Similarly, the corresponding structural models of strain CHAOHU 1326 were shown in Additional file 2, Figure S3.
Using SWISS-MODEL, the structure of the GGDEF domain of the DGC protein was modeled based on the crystal structure of the conserved GGDEF domain of WspR (Protein Data Bank (PDB) id: 3BRE), which has a crystallographic resolution of 2.4 Å[51-52]. Structural alignments were performed using the GGDEF domain of 3BRE from amino acids S173 to N329 (Fig. 4a, left). C-di-GMP binds to the catalytic site and to a second site distal to the catalytic loop. DGC proteins possess a conserved allosteric inhibition site (I site), composed of a RXXD motif (in which X represents any amino acid) five amino acids upstream of the GGDEF active site, that is important for controlling DGC activity. When levels of c-di-GMP are high, the second messenger can bind the RXXD motif, thereby repressing the DGC activity[53]. A systematic analysis and comparison of the 14 genomes that have corresponding GGDEF domains was performed to identify the amino acid motifs or signatures involved in catalysis and allosteric inhibition. As shown in Fig. 4a left, the WebLogo alignment revealed that the RXXD and GGEEF motifs of the GGEEF domain are highly conserved in the same amino acid residues: Arg-Gln-Val-Asp (RQVD) and Gly-Gly-Glu-Glu-Phe (GGEEF), respectively. Even though the amino acid sequences of the putative DGC proteins that only contained the GG[D/E]EF domain did not possess high percentages of identity (34.2–37.2%, Additional file 2, Table S4), the proteins contained all of the essential conserved amino acid residues that bind the substrate GTP, which indicates they may have catalytic activity.
Because only three genomes had partial EAL domains, the EAL domain in hybrid proteins from the M. aeruginosa NIES843 genome were chosen as paradigms to examine the crystal structure. Based on the crystal structure of the GGDEF-EAL domain of RmcA (PDB ID: 5M3C), which has a crystallographic resolution of 2.8 Å[54], the GGDEF and EAL domains in the hybrid protein of NIES843 were modeled. Compared with 5M3C, the GGDEF-EAL domains in the hybrid proteins showed sequence conservation of 35.9–37.8% (Additional file 2, Table S5). The low sequence conservation appeared to have no impact on model prediction by SWISS-MODEL. Compared with DGCs that contained only the GGDEF domain, amino acid residues of RXXD and GGDEF motifs in the GGDEF domain of the hybrid proteins were less conserved (Fig. 4a, right). The WebLogo alignment in Fig. 4b showed that amino acid residues of the EAL domain involved in the binding of c-di-GMP and catalytic activity were highly conserved in all sequences. The Glu in the EAL signature motif is an essential residue that is required to bind the c-di-GMP, whereas a change of Ala into Val (EVL) still sustains the enzymatic activity[55]. Arg in the second position downstream of the EAL signature motif was conserved in nearly all EAL domain sequences; thus, the EAL signature motif can be extended as EXLXR motif, which forms a stable platform to bind c-di-GMP[56].
Structural features of the PilZ domain of M. aeruginosa strains
All selected M. aeruginosa genomes encoded proteins that possess a PilZ domain, except for M. aeruginosa SPC777, which had two corresponding domains. Twenty-one genomes encoded cellulose synthase (CelA), which contained a C-terminal PilZ domain, and the other four genomes encoded a protein that contained only a PilZ domain. The accession numbers of the corresponding proteins are shown in Additional file 2, Table S6.
To identify the structural features, structure predictive modeling of proteins with a single PilZ domain and PilZ domain-containing CelA was performed for M. aeruginosa strains. Predictive modeling was based on the crystal structure of the BcsA (PDB id: 4P02) from Rhodobacter sphaeroides, which has a crystallographic resolution of 2.65 Å, according to SWISS-MODEL results[29]. Modeling of the PilZ domain-containing protein CelA of strain CHAOHU 1326 is shown in Figure 5a. The c-di-GMP-binding PilZ domain was located in the C-terminal region of CelA and had similar structure with protein containing a single PilZ domain in Figure 5b, which were derived from the representative M. aeruginosa strain NIES843. Figure 5c shows that the PilZ domain consists of a six-stranded β-barrel and a short α-helix that follows the last strand of the β-barrel.