3.1 Pangenome of C. flaccumfaciens
Detailed information on the published C. flaccumfaciens genomes is shown in Table 1. The genome sizes of the 12 C. flaccumfaciens strains ranged from 3.68 Mb to 3.94 Mb, and the average number of protein-coding genes per genome was 3,536. The pangenome analysis identified a total of 6,577 gene clusters. Of these, the 1,893 genes found in all 12 genomes represented the core C. flaccumfaciens genome (Fig. 1). Six housekeeping genes (atpD, dnaK, gyrB, ppK, recA, and rpoB), which had previously been used to construct a phylogenetic tree of C. flaccumfaciens (Goncalves et al. 2019), were all identified in the core genome. A total of 4,296 genes were found in more than two, but not all 12, genomes; these constituted the accessory genome (Fig. 1). In addition, 2,281 genes were each found in only one genome and represented the strain-specific genes. The number of strain-specific genes per genome varied from 7 to 596 (Fig. 2A). Strains JUb65 and WW7 had the largest numbers of strain-specific genes (596 and 534, respectively).
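The partition into core, accessory, and strain-specific genes can be illustrated with a minimal sketch. It assumes a hypothetical mapping from each gene cluster to the set of strains carrying it; the function name, the toy data, and the `min_accessory` cut-off (here, at least two genomes) are assumptions for illustration and do not reproduce the pipeline or thresholds used in this study.

```python
from collections import Counter


def partition_pangenome(presence, n_genomes, min_accessory=2):
    """Split gene clusters into core, accessory, and strain-specific sets.

    presence maps a cluster id to the set of strains that carry it.
    min_accessory is an assumed cut-off, not taken from the study.
    """
    core, accessory, unique = set(), set(), set()
    for cluster, strains in presence.items():
        count = len(strains)
        if count == n_genomes:
            core.add(cluster)          # present in every genome
        elif count == 1:
            unique.add(cluster)        # present in exactly one genome
        elif min_accessory <= count < n_genomes:
            accessory.add(cluster)     # shared by some, but not all
    return core, accessory, unique


# Hypothetical toy input: three strains, four clusters.
toy = {
    "cluster_a": {"s1", "s2", "s3"},
    "cluster_b": {"s1", "s2"},
    "cluster_c": {"s3"},
    "cluster_d": {"s1"},
}
core, accessory, unique = partition_pangenome(toy, n_genomes=3)

# Number of strain-specific clusters per strain (cf. Fig. 2A).
per_strain_unique = Counter(next(iter(toy[c])) for c in unique)
```

Counting the strain-specific clusters per strain in this way yields the kind of per-genome tally reported in Fig. 2A.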
The changes in pangenome size and core gene number are shown in Fig. 2B, with the relationship between the pangenome size (y) and the number of genomes (x) represented by the equation y = αx^β + σ. With β = 0.45, the pangenome is considered to be open. Under this condition, the pangenome tends to grow as more bacterial genomes are added. That is, the species is capable of acquiring homologous and nonhomologous genes from other gene pools, which serves an important evolutionary function. This contrasts with the plant pathogen Dickeya solani, which has a nearly closed pangenome structure (Motyka-Pomagruk et al. 2020).
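The behaviour of this model can be sketched numerically. Under a power law of this form, an exponent between 0 and 1 is commonly read as an open pangenome, because each added genome still contributes new genes, although fewer each time. In the sketch below only β = 0.45 comes from the text; the values of α and σ are hypothetical placeholders, since the fitted values are not given here.

```python
def pangenome_size(x, alpha, beta, sigma):
    """Power-law pangenome model: y = alpha * x**beta + sigma."""
    return alpha * x ** beta + sigma


BETA = 0.45                      # exponent reported in the text
ALPHA, SIGMA = 1000.0, 2500.0    # hypothetical values, for illustration only

# Modelled pangenome size for 1 to 12 genomes.
sizes = [pangenome_size(n, ALPHA, BETA, SIGMA) for n in range(1, 13)]

# New genes contributed by each additional genome.
new_genes = [later - earlier for earlier, later in zip(sizes, sizes[1:])]
```

With 0 < β < 1 every entry of `new_genes` is positive and smaller than the one before it, so the modelled pangenome keeps growing without reaching a plateau.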
3.2 Phylogenetic analysis
C. flaccumfaciens isolates from plants, residential carpet, and an ice-wedge polygon were obtained, and a phylogenetic tree was constructed based on the core C. flaccumfaciens genome. The 12 strains were divided into two clades (Fig. 3). All strains in the first clade were isolated from plants. CFBP3418 and LMG3645, both isolated from Phaseolus vulgaris, formed a single cluster and showed a consistent degree of evolution. In addition, strains JUb65 and WW7, which had the largest numbers of strain-specific genes, showed the highest degree of evolution. Among the pathogenic strains, only Cff1037 was grouped into the second clade. Hence, the three pathogenic strains did not form a monophyletic cluster in this tree, and there was no correlation between phylogeny and pathogenicity in C. flaccumfaciens, which is consistent with previous studies (Goncalves et al. 2019; Osdaghi, Taghavi, Hamzehzarghani, et al. 2018). Similarly, no correlation between pathogenicity and genetic diversity was seen in Pantoea ananatis (Stice et al. 2018).
3.3 COG functional annotation of the C. flaccumfaciens pangenome
Following eggNOG-mapper analysis (Huerta-Cepas et al. 2019), COG annotations were assigned to 96.1%, 90.9%, and 84.7% of the core, accessory, and strain-specific genes, respectively. The genes of each group were assessed according to COG categories. COG functional annotation is classified into four groups: cell physiological processes and signals (COG categories: D, M, N, O, T, U, V), information storage and processing (COG categories: J, K, L), metabolism (COG categories: C, E, F, G, H, I, P, Q), and limited characterization (COG categories: R, S). Function unknown (S) was the largest category to which annotated genes were assigned, accounting for 18.0%, 22.8%, and 23.8% of core, accessory, and strain-specific genes, respectively. Because the core genome typically encodes proteins responsible for basic housekeeping functions (Loper et al. 2012), most core genes were involved in housekeeping processes. Among the core genes annotated to the metabolism group, Amino acid transport and metabolism (E, 11.55%) and Carbohydrate transport and metabolism (G, 10.28%) were the two largest categories, while Transcription (K, 8.17%) and Translation, ribosomal structure and biogenesis (J, 7.83%) were the two most common categories in the information storage and processing group (Fig. 4). Among the accessory genes, the largest category was K (15.47%), followed by G (10.56%), E (10.13%), and Cell wall/membrane/envelope biogenesis (M, 7.03%). Among the strain-specific genes, the two most prevalent categories were K and G, which accounted for 14.68% and 9.67%, respectively. Replication, recombination and repair (L) accounted for 5.84% of the strain-specific genes and 2.26% of the accessory genes. No core genes were annotated to category L.
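The category percentages above follow from a simple tally, sketched here under stated assumptions. The four-group mapping is taken from the text; the function name and the toy list of category letters are hypothetical, and the sketch assumes exactly one COG letter per gene, which real annotation output does not guarantee.

```python
from collections import Counter

# Four functional groups and their COG category letters, as listed in the text.
COG_GROUPS = {
    "cell physiological processes and signals": "DMNOTUV",
    "information storage and processing": "JKL",
    "metabolism": "CEFGHIPQ",
    "limited characterization": "RS",
}
LETTER_TO_GROUP = {
    letter: group for group, letters in COG_GROUPS.items() for letter in letters
}


def cog_percentages(categories):
    """Return (percent per COG letter, percent per functional group).

    categories holds one COG letter per annotated gene (an assumption).
    """
    total = len(categories)
    by_letter = Counter(categories)
    by_group = Counter(LETTER_TO_GROUP[letter] for letter in categories)
    letter_pct = {k: 100.0 * v / total for k, v in by_letter.items()}
    group_pct = {k: 100.0 * v / total for k, v in by_group.items()}
    return letter_pct, group_pct


# Hypothetical toy annotation of eight genes.
toy_categories = ["K", "K", "G", "E", "S", "L", "M", "J"]
letter_pct, group_pct = cog_percentages(toy_categories)
```

Running the same tally separately on the core, accessory, and strain-specific gene sets would give three profiles that can be compared category by category, as in Fig. 4.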
In addition, no accessory genes were found exclusively in the three pathogenic Cff strains. However, 42 COG-annotated genes were present in the three Cff strains and absent from eight of the other strains. Four of these COGs were related to serine protease, while others were related to cellulase and pectate lyase (Table S1). Notably, apart from the Cff strains, these 42 COGs were present only in strain LMG3645.
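This kind of comparison reduces to set operations, as the following sketch shows. The strain labels, COG identifiers, and function name are hypothetical placeholders rather than the study's data; the sketch only shows how COGs shared by a target group can be listed together with any other strains that also carry them.

```python
def shared_by_targets(cogs_by_strain, targets):
    """Map each COG present in all target strains to the other strains carrying it."""
    shared = set.intersection(*(cogs_by_strain[s] for s in targets))
    carriers = {}
    for cog in shared:
        carriers[cog] = sorted(
            strain
            for strain, cogs in cogs_by_strain.items()
            if strain not in targets and cog in cogs
        )
    return carriers


# Hypothetical toy data: three pathogenic and two other strains.
cogs_by_strain = {
    "pathogen_1": {"cogA", "cogB", "cogC"},
    "pathogen_2": {"cogA", "cogB", "cogC"},
    "pathogen_3": {"cogA", "cogB"},
    "other_1": {"cogA", "cogB"},
    "other_2": {"cogA"},
}
targets = {"pathogen_1", "pathogen_2", "pathogen_3"}
carriers = shared_by_targets(cogs_by_strain, targets)

# COGs found only in the target strains, and COGs shared with one other strain.
exclusive = sorted(cog for cog, others in carriers.items() if not others)
single_other = sorted(cog for cog, others in carriers.items() if len(others) == 1)
```

In the toy data no COG is exclusive to the target group, while one COG is shared with a single additional strain, which mirrors the pattern described for the Cff strains and LMG3645.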
3.4 Metabolic analysis of the C. flaccumfaciens pangenome
The main metabolic pathways of the C. flaccumfaciens core, accessory, and strain-specific genomes were represented by KO numbers. In the core genome, the basal metabolic pathways, including glycolysis/gluconeogenesis and the pentose phosphate pathway, were complete. However, the gene encoding 2-oxoglutarate dehydrogenase (EC 1.2.4.2) of the citrate cycle (TCA cycle) was missing in all 12 strains. KAAS annotation of the core genome showed that C. flaccumfaciens can utilize a variety of carbon sources: the metabolic pathways for sucrose, starch, maltose, glucose, fructose, cellotriose, cellobiose, trehalose, mannose, xylose, and xylitol were complete. The sulfur metabolic pathway of C. flaccumfaciens was also complete, except that one enzyme (cysJ, EC 1.8.1.2) was missing in strain WW7 only. Among nitrogen sources, only ammonia and formamide could be metabolized. Lastly, the core C. flaccumfaciens genome was found to encode the capacity to produce ethanol.
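A pathway-completeness check of this kind can be sketched as a lookup of required enzymes against the enzymes annotated in a genome. The example below uses EC numbers rather than KO numbers, lists only a partial, illustrative subset of TCA cycle steps, and uses a hypothetical annotation set; it is not the KAAS workflow itself.

```python
# Partial, illustrative subset of TCA cycle steps keyed by enzyme name.
TCA_STEPS = {
    "citrate synthase": "2.3.3.1",
    "aconitate hydratase": "4.2.1.3",
    "isocitrate dehydrogenase (NADP+)": "1.1.1.42",
    "2-oxoglutarate dehydrogenase": "1.2.4.2",
    "fumarate hydratase": "4.2.1.2",
    "malate dehydrogenase": "1.1.1.37",
}


def missing_steps(pathway, genome_ecs):
    """Return the names of pathway steps whose EC number is not annotated."""
    return sorted(name for name, ec in pathway.items() if ec not in genome_ecs)


# Hypothetical annotation set lacking EC 1.2.4.2, as reported for all 12 strains.
genome_ecs = {"2.3.3.1", "4.2.1.3", "1.1.1.42", "4.2.1.2", "1.1.1.37"}
gaps = missing_steps(TCA_STEPS, genome_ecs)
```

An empty result would indicate that every listed step is annotated; here the only gap is the 2-oxoglutarate dehydrogenase step.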
3.5 Prediction of MGEs
Mobile genetic elements (MGEs) are DNA segments that mediate the movement of DNA within genomes or between bacterial cells and are not commonly found among chromosomal housekeeping genes. Because MGEs are known to contribute to the pathogenicity, environmental adaptation, and intraspecific diversity of C. flaccumfaciens (Frost et al. 2005), the MGEs associated with genomic islands (GIs) across the 12 C. flaccumfaciens genomes were investigated. The number of predicted GIs per genome ranged from 10 (strain CFBP3418) to 27 (strain S5.26), indicating that MGEs are widespread in C. flaccumfaciens. The total size of the GIs in each strain ranged from 157 kb (strain CFBP3418) to 414 kb (strain S5.26). The number of genes located in GIs ranged from 320 to 854 per genome, and most of these genes encode hypothetical proteins. Interestingly, the total GI sizes of the three pathogenic strains and LMG3645 were the smallest among the 12 strains, and those of all nonpathogenic strains were larger than those of the pathogenic strains, which may indicate that the insertion of foreign genes is less prevalent in the evolution of pathogenic strains.
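Totals such as the combined GI size per strain can be derived from island coordinates, as in the sketch below. It assumes GIs are given as hypothetical 1-based, inclusive (start, end) pairs on a single replicon and merges overlapping or adjacent predictions before summing; the function names and coordinates are illustrative and do not come from the GI prediction tool used in this study.

```python
def merge_islands(islands):
    """Merge overlapping or adjacent (start, end) intervals (1-based, inclusive)."""
    merged = []
    for start, end in sorted(islands):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous island instead of counting bases twice.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def total_island_size(islands):
    """Total number of bases covered by the merged islands."""
    return sum(end - start + 1 for start, end in merge_islands(islands))


def in_island(position, islands):
    """True if a genome position lies inside any island."""
    return any(start <= position <= end for start, end in merge_islands(islands))


# Hypothetical coordinates for one strain.
toy_islands = [(100, 200), (150, 300), (1000, 1100)]
merged = merge_islands(toy_islands)
total = total_island_size(toy_islands)
```

The `in_island` helper shows how a gene position could be tested for GI membership, which is the kind of check needed to decide whether a candidate virulence gene lies within a predicted island.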
The disease symptoms caused by Cff include water stress and leaf wilt. In severe infections, the whole plant wilts and dies even when environmental conditions are suitable (Harveson et al. 2008). Cff reduces grain yield by colonizing xylem vessels and thereby impeding the translocation of water and nutrients to the upper parts of the plant (Jeger et al. 2018). Because plant pathogens must break through the plant cell wall during this process, enzymes that degrade the cell wall are very important for pathogenicity. Therefore, three genes, encoding pectate lyase, serine protease, and cellulase, were inferred to be virulence-associated (Chen et al. 2020; Osdaghi et al. 2018; Goncalves et al. 2019). The presence of these three genes in the 12 C. flaccumfaciens strains is shown in Fig. 5. Pectate lyase is an important enzyme that breaks down and attacks xylem vessels. Pectate lyase (protein id: QFS80865.3) was found only in the three Cff strains and strain LMG3645. Although the role of serine protease (protein ids: QIH95653.1, QIH95654.1, and QFS80891.2) in the pathogenicity of Cff has not yet been elucidated, 4, 4, 10, and 10 serine proteases were found in the GIs of P990, CFBP3418, Cff1037, and LMG3645, respectively. Serine proteases are mainly involved in breaking peptide bonds to release amino acids required for nutrition, or in degrading proteins in the plant cell wall, which allows bacterial translocation or helps overcome plant chemical defenses (Dow et al. 1990). This indicates that serine proteases may have a significant effect on the pathogenicity of C. flaccumfaciens. In addition, cellulase, a member of the glycosyl hydrolase family, plays a key role in degrading the cell wall and infecting the xylem. Enzymes associated with the glycosyl hydrolase family (protein id: QFS80892.1) were found in the GIs of P990, CFBP3418, Cff1037, LMG3645, and S5.26.
Gram-positive plant pathogens often produce enzymes that can cause disease on host plants, including pectate lyase, serine protease, and cellulases (Gartemann et al. 2008). In the 12 C. flaccumfaciens strains analyzed here, the corresponding genes were typically found in the GIs of LMG3645 and the Cff strains. Previously, the pathogenicity of C. michiganensis was shown to be attributable to a genomic island (Jung et al. 2014). LMG3645 has not been reported to be pathogenic, but genes related to pectate lyase, serine protease, and cellulases were found in its GIs, which is consistent with the COG analysis (Table S1). As LMG3645 was also isolated from Phaseolus vulgaris, we infer that it may also be pathogenic. Further experiments on the pathogenic effect of C. flaccumfaciens LMG3645 on P. vulgaris could be carried out. In addition, the online service PHASTER was used to identify and annotate prophage sequences in the genomes of the 12 C. flaccumfaciens strains. A total of five incomplete prophage sequences were found in strains WW7, VKM Ac-1386, and S5.26.