Genomic Analysis of Curtobacterium flaccumfaciens Reveals the Differences Between Pathogenic and Nonpathogenic Strains


 Purpose

Curtobacterium flaccumfaciens is a Gram-positive bacterium that has been isolated from various plants and abiotic environments. Curtobacterium flaccumfaciens pv. flaccumfaciens (Cff) is a pathogenic bacterium that infects legumes and causes great economic losses. At the genomic level, the metabolic and phylogenetic characteristics of C. flaccumfaciens, and the differences in pathogenicity between pathogenic and nonpathogenic strains, have not been analyzed in detail.
Methods

Pangenomics and comparative genomics were used in this study to analyze 12 C. flaccumfaciens strains and to compare the genomes, phylogeny, gene functions and mobile genetic elements of pathogenic and nonpathogenic strains.
Results

The pangenome of C. flaccumfaciens is open. Phylogenetic analysis showed no correlation between the phylogeny and pathogenicity of C. flaccumfaciens. KAAS annotation of the core genome showed that the citrate cycle was incomplete. In addition, genomic island analysis showed that three pathogenicity-related genes, encoding pectate lyase, serine protease and cellulases, existed only in the Cff strains and strain LMG3645. LMG3645 might therefore be a pathogenic strain.
Conclusion

This study clearly and reliably revealed the differences between the pathogenic and nonpathogenic strains of C. flaccumfaciens at the genomic level, and paves the way for further research on its pathogenicity.


Introduction
Curtobacterium flaccumfaciens is a Gram-positive, short rod or coryneform, aerobic, and motile bacterium. It has been isolated from multiple environments, such as residential carpet, ice-wedge polygons, and different plants. C. flaccumfaciens pv. flaccumfaciens (Cff) is a pathogenic bacterium that causes bacterial wilt of common bean (Phaseolus vulgaris) as well as bacterial tan spot disease on soybean (Jeger et al. 2018); the former is one of the most important diseases threatening edible legume production around the globe (Chen et al. 2020b). In addition, Cff can infect cowpea, wheat and red nightshade plants, and may survive and overwinter on solanaceous vegetable residues. Previous studies have mainly focused on the identification of the pathogenicity of Cff in relation to legumes (Puia et al. 2021; Valdo et al. 2016), the selection of crops in rotation systems to reduce the effect of Cff (Goncalves et al. 2021; Goncalves et al. 2017), the antagonistic effect of Cff on Bacillus spp. (Leao et al. 2016), and the sensitivity of Cff to protein synthesis inhibitors (Tumbarski et al. 2018).
At present, only Cff strains have been found to be pathogenic within the species C. flaccumfaciens. However, there is no obvious morphological difference between pathogenic and nonpathogenic C. flaccumfaciens strains. A phylogenetic tree of C. flaccumfaciens constructed by multilocus sequence analysis (MLSA) based on six housekeeping genes showed high genetic diversity among the strains (Goncalves et al. 2019). Nevertheless, to further reveal the differences between the pathogenic and nonpathogenic strains, systems biology approaches such as genomics are necessary. Genome sequencing and annotation have greatly advanced the understanding of the molecular mechanisms of plant colonization, pathogenicity and survival of plant-pathogenic actinomycetes (Thapa et al. 2019). To date, the whole genomes of 12 C. flaccumfaciens strains have been sequenced and published (Osdaghi et al. 2016). Three C. flaccumfaciens strains were analyzed by Chen et al. (2020b), who studied the virulence factors of four phylogenetically closely related Curtobacterium spp. strains in relation to Cff using comparative genomics.
The accumulation of C. flaccumfaciens genomic data and the development of corresponding bioinformatics software can facilitate pangenomic analysis. Pangenomic analysis, which can help in understanding species diversity and the metabolic capabilities of a species (Golicz et al. 2020), includes studying the core genome, which relates to the basic biological functions, and the accessory and unique genomes, which demonstrate the diversity of bacterial strains (Tettelin et al. 2005). In this study, the pangenome of 12 published C. flaccumfaciens genomes was analyzed. The phylogenetic relationship of C. flaccumfaciens was studied at the genome level, and the metabolic capabilities encoded in the core and accessory C. flaccumfaciens genomes were also analyzed. Lastly, mobile genetic elements (MGEs) of C. flaccumfaciens were identified and analyzed. Collectively, this study analyzed the differences in genomes, phylogeny, gene function and mobile genetic elements between pathogenic and nonpathogenic strains.

Pangenome analysis
The 12 C. flaccumfaciens genomes used for the pangenome analysis were downloaded from the NCBI database. Genome accession numbers are shown in Table 1. The pangenome and core genome analyses of C. flaccumfaciens were conducted with the Bacterial Pangenome Analysis (BPGA) pipeline (Chaudhari et al. 2016). USEARCH was used as the clustering tool, with a cutoff of 50% sequence identity. Using the gene presence-absence binary matrix (pan-matrix) obtained from BPGA as the input data, the pangenome and core genome profiles were evaluated with PanGP (Zhao et al. 2014).
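As an illustration of how a presence-absence matrix of this kind can be partitioned, the following minimal Python sketch classifies gene clusters as core, accessory or strain-specific. It is not the BPGA implementation; the function name, the dictionary layout and the toy matrix are assumptions made for the example, and the "at least two genomes" threshold for accessory genes follows the figure legend of this study.

```python
from typing import Dict, List


def partition_pangenome(pan_matrix: Dict[str, List[int]]) -> Dict[str, List[str]]:
    """Split gene clusters into core, accessory and strain-specific groups.

    pan_matrix maps a (hypothetical) cluster ID to a 0/1 presence vector
    with one entry per genome.
    """
    groups: Dict[str, List[str]] = {"core": [], "accessory": [], "strain_specific": []}
    for cluster, presence in pan_matrix.items():
        n_present = sum(presence)
        if n_present == len(presence):
            groups["core"].append(cluster)             # present in every genome
        elif n_present >= 2:
            groups["accessory"].append(cluster)        # two or more, but not all
        elif n_present == 1:
            groups["strain_specific"].append(cluster)  # exactly one genome
    return groups


# Toy matrix with four genomes; the values are invented for illustration only.
toy_matrix = {
    "cluster_1": [1, 1, 1, 1],
    "cluster_2": [1, 0, 1, 0],
    "cluster_3": [0, 0, 0, 1],
    "cluster_4": [1, 1, 1, 0],
}
groups = partition_pangenome(toy_matrix)
print({name: len(members) for name, members in groups.items()})
```

Counting the members of each group gives the core, accessory and strain-specific totals for the toy matrix.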
To construct the phylogenetic tree of the 12 C. flaccumfaciens strains, the concatenated amino acid sequences of the core-genome proteins of each strain, obtained from the BPGA pipeline analysis, were aligned with ClustalW (Larkin et al. 2007). The alignment file was imported into MEGA X to construct a maximum-likelihood phylogenetic tree (Kumar et al. 2018).

Pangenome of C. flaccumfaciens
The detailed information of the published genomes of C. flaccumfaciens is shown in Table 1. The genome size of the 12 C. flaccumfaciens strains ranged from 3.68 Mb to 3.94 Mb. The average number of protein-coding genes per genome was 3,536. A total of 6,577 gene clusters were identified via the pangenome analysis. Of these, the 1,893 genes found in all 12 genomes represented the core C. flaccumfaciens genome (Fig. 1). The six housekeeping genes (atpD, dnaK, gyrB, ppK, recA, and rpoB) that had been used to construct a phylogenetic tree of C. flaccumfaciens were all identified in the core genome (Goncalves et al. 2019). A total of 4,296 genes were found in two or more, but not all 12, genomes; these constituted the accessory genome (Fig. 1). In addition, 2,281 genes were each found in only one genome and represented the strain-specific genes. The number of strain-specific genes per genome varied from seven to 596 (Fig. 2A). Strains JUb65 and WW7 had the largest numbers of strain-specific genes (596 and 534, respectively).
The changes in pangenome size and core gene number are shown in Fig. 2B, with the relationship between the pangenome size (y) and the genome number (x) represented by the equation $y = \alpha x^{\beta} + \sigma$. The fitted value β = 0.45 lies between 0 and 1, so the pangenome is considered to be open. Under this condition, the pangenome tends to enlarge as the number of bacterial genomes increases. That is to say, the species is capable of obtaining homologous and nonhomologous genes from other gene pools, which serves an important evolutionary function. This is in contrast to the plant pathogen Dickeya solani, which presents a nearly closed pangenome structure (Motyka-Pomagruk et al. 2020).
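To make the role of the exponent concrete, the sketch below fits the power-law form $y = \alpha x^{\beta} + \sigma$ to a pangenome accumulation curve in plain Python. It is not the fitting routine used by PanGP: it scans candidate values of β and, for each one, solves α and σ by ordinary least squares (the model is linear in both once β is fixed). The data are synthetic and generated with β = 0.45; the values of α and σ are hypothetical and are not the parameters fitted in this study.

```python
from typing import List, Tuple


def fit_power_law(x: List[float], y: List[float],
                  betas: List[float]) -> Tuple[float, float, float]:
    """Fit y = alpha * x**beta + sigma by scanning beta over a grid.

    For each candidate beta, alpha and sigma follow from simple linear
    regression of y on u = x**beta; the beta with the smallest sum of
    squared errors is returned together with its alpha and sigma.
    """
    n = len(x)
    mean_y = sum(y) / n
    best = None
    for beta in betas:
        u = [xi ** beta for xi in x]
        mean_u = sum(u) / n
        s_uu = sum((ui - mean_u) ** 2 for ui in u)
        if s_uu == 0:
            continue  # degenerate candidate (e.g. beta == 0)
        s_uy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, y))
        alpha = s_uy / s_uu
        sigma = mean_y - alpha * mean_u
        sse = sum((alpha * ui + sigma - yi) ** 2 for ui, yi in zip(u, y))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, sigma)
    if best is None:
        raise ValueError("no usable beta candidate")
    return best[1], best[2], best[3]


def is_open(beta: float) -> bool:
    """Openness criterion as stated in this study: 0 < beta < 1."""
    return 0.0 < beta < 1.0


# Synthetic curve for 12 genomes (alpha and sigma are hypothetical values).
genomes = [float(i) for i in range(1, 13)]
pan_sizes = [2000.0 * g ** 0.45 + 1500.0 for g in genomes]
alpha, beta, sigma = fit_power_law(genomes, pan_sizes, [i / 100 for i in range(1, 200)])
print(round(beta, 2), is_open(beta))
```

A grid scan is used here only to keep the example free of third-party dependencies; a nonlinear least-squares routine would be the more usual choice in practice.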

Phylogenetic analysis
The C. flaccumfaciens isolates analyzed originated from plants, residential carpet and an ice-wedge polygon, and a phylogenetic tree was constructed based on the core C. flaccumfaciens genome. The 12 strains were divided into two clades (Fig. 3). In the first clade, all strains were isolated from plants. CFBP3418 and LMG 3645, both isolated from Phaseolus vulgaris, formed a single cluster and showed similar degrees of evolution. In addition, strains JUb65 and WW7, which had the largest numbers of strain-specific genes, showed the highest degree of evolution. Among the pathogenic strains, only Cff1037 was grouped into the second clade. Hence, the three pathogenic strains did not form a monophyletic cluster in this phylogenetic tree, and there was no correlation between the phylogeny and pathogenicity of C. flaccumfaciens, which was consistent with previous studies.

... of core, accessory, and strain-specific genes, respectively. Since the core genome typically encodes proteins responsible for basic housekeeping functions (Loper et al. 2012), most core genes were involved in housekeeping processes. Among core genes annotated to the metabolism group, amino acid transport and metabolism (E, 11.55%) and carbohydrate transport and metabolism (G, 10.28%) were the two largest categories, while transcription (K, 8.17%) and translation, ribosomal structure and biogenesis (J, 7.83%) were the two most common categories in the information storage and processing group (Fig. 4). Among the accessory genes, the largest category was K (15.47%), followed by G (10.56%), E (10.13%), and cell wall/membrane/envelope biogenesis (M, 7.03%). Among the strain-specific genes, the two most prevalent categories, K and G, accounted for 14.68% and 9.67%, respectively. Replication, recombination and repair (L) accounted for 5.84% of the strain-specific genes and 2.26% of the accessory genes.
No core genes were annotated to group L.
In addition, no accessory genes were found exclusively in the three pathogenic Cff strains. However, 42 COG-annotated genes were identified in all three Cff strains but were absent from eight other strains. Four of the COGs were related to serine protease, while others were related to cellulase and pectate lyase (Table S1). It is worth noting that, apart from the Cff strains, these 42 COGs existed only in strain LMG3645.
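A screen of this kind, for genes shared by all pathogenic strains and absent from a chosen set of other strains, can be expressed with set operations. The sketch below is a hedged illustration rather than the procedure used in this study: the cluster names and their strain assignments are invented, and treating P990, CFBP3418 and Cff1037 as the three Cff strains is an assumption based on the strain list in the text.

```python
from typing import Dict, List, Set


def group_specific_clusters(presence: Dict[str, Set[str]],
                            target: Set[str],
                            excluded: Set[str]) -> List[str]:
    """Return clusters found in every target strain and in no excluded strain.

    Strains that belong to neither set (here LMG3645) are ignored, so a
    cluster may or may not occur in them.
    """
    hits = []
    for cluster, strains in presence.items():
        if target <= strains and not (excluded & strains):
            hits.append(cluster)
    return sorted(hits)


cff_strains = {"P990", "CFBP3418", "Cff1037"}  # assumed pathogenic set
other_strains = {"208", "JUb65", "MEB126", "S5.26", "UCD-AKU",
                 "VKM Ac-1386", "VKM Ac-1795", "WW7"}

# Hypothetical clusters and strain assignments, for illustration only.
toy_presence = {
    "cog_a": {"P990", "CFBP3418", "Cff1037", "LMG3645"},
    "cog_b": {"P990", "CFBP3418", "Cff1037", "WW7"},
    "cog_c": {"P990", "Cff1037"},
    "cog_d": {"P990", "CFBP3418", "Cff1037"},
}
cff_specific = group_specific_clusters(toy_presence, cff_strains, other_strains)
print(cff_specific)
```

Leaving LMG3645 out of both sets mirrors the comparison described above, in which that strain is assessed separately.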

Metabolic analysis of the C. flaccumfaciens pangenome
The main metabolic pathways of the C. flaccumfaciens core, accessory and strain-specific genomes were represented by KO numbers. In the core genome, the basal metabolic, glycolysis/gluconeogenesis and pentose phosphate pathways are complete. However, the citrate cycle (TCA cycle) was found to be missing the gene encoding 2-oxoglutarate dehydrogenase (EC:1.2.4.2) in all 12 strains. KAAS annotation of the core genome showed that C. flaccumfaciens can utilize a variety of carbon sources. The metabolic pathways of sucrose, starch, maltose, glucose, fructose, cellotriose, cellobiose, trehalose, mannose, xylose, and xylitol were complete. The sulfur metabolism pathway of C. flaccumfaciens is complete, except that one enzyme (cysJ, EC:1.8.1.2) is missing in strain WW7 only. Among nitrogen sources, only ammonia and formamide could be metabolized. Lastly, the core C. flaccumfaciens genome was found to encode the capacity to produce ethanol.
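A pathway completeness check of the kind described here amounts to comparing the enzymes required by a pathway with those annotated in a genome. The following sketch is illustrative only and does not reproduce the KAAS output format: the pathway definition is a small subset of TCA cycle enzymes chosen for the example, and the annotated set is invented so that only 2-oxoglutarate dehydrogenase (EC 1.2.4.2) is absent, as reported above.

```python
from typing import Dict, List, Set


def missing_steps(pathway: Dict[str, str], annotated_ecs: Set[str]) -> List[str]:
    """List the enzymes of a pathway whose EC numbers are not annotated."""
    return [name for ec, name in pathway.items() if ec not in annotated_ecs]


# Illustrative subset of TCA cycle enzymes (not the full KEGG module).
tca_subset = {
    "2.3.3.1": "citrate synthase",
    "4.2.1.3": "aconitate hydratase",
    "1.1.1.42": "isocitrate dehydrogenase",
    "1.2.4.2": "2-oxoglutarate dehydrogenase",
    "4.2.1.2": "fumarate hydratase",
    "1.1.1.37": "malate dehydrogenase",
}

# Hypothetical core-genome annotation lacking EC 1.2.4.2.
core_ecs = {"2.3.3.1", "4.2.1.3", "1.1.1.42", "4.2.1.2", "1.1.1.37"}

missing = missing_steps(tca_subset, core_ecs)
print(missing)
```

An empty result would indicate that every listed step is annotated; any returned name marks a gap that may reflect either a true absence or an annotation miss.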

Prediction of MGEs
Mobile genetic elements (MGEs) are DNA segments that mediate the movement of DNA within genomes or between bacterial cells and are not commonly found among chromosomal housekeeping genes. Because MGEs are known to contribute to the pathogenicity, environmental adaptation, and intraspecific diversity of C. flaccumfaciens (Frost et al. 2005), the MGEs associated with genomic islands (GIs) across the 12 C. flaccumfaciens genomes were investigated. The number of predicted GIs per genome ranged from 10 (strain CFBP3418) to 27 (strain S5.26), indicating the widespread existence of MGEs in C. flaccumfaciens. The total size of the GIs in each strain ranged from 157 kb (strain CFBP3418) to 414 kb (strain S5.26). The number of genes in the GIs of each genome ranged from 320 to 854, and most genes identified in the GIs encode hypothetical proteins. Interestingly, the total GI sizes of the three pathogenic strains and LMG3645 were the smallest among the 12 strains, and those of all nonpathogenic strains were larger, which may indicate that the insertion of foreign genes is less prevalent in the evolution of pathogenic strains.

... 2) in the pathogenicity of Cff has not been elucidated, four, four, 10, and 10 serine proteases were found in the GIs of P990, CFBP3418, Cff1037, and LMG3645, respectively. Serine proteases are mainly involved in breaking down peptide bonds to release amino acids required for nutrition, or in degrading proteins in the plant cell wall, which allows bacterial translocation or overcomes plant chemical defenses (Dow et al. 1990). This indicates that serine proteases may have a significant effect on the pathogenicity of C. flaccumfaciens. In addition, cellulase, a member of the glycosyl hydrolase family, plays a key role in degrading the cell wall and infecting the xylem. Enzymes associated with the glycosyl hydrolase family (protein id: QFS80892.1) were found in the GIs of P990, CFBP3418, Cff1037, LMG3645 and S5.26.
Gram-positive plant pathogens often produce enzymes that can cause disease on host plants, including pectate lyase, serine protease and cellulases (Gartemann et al. 2008). Among the 12 C. flaccumfaciens strains analyzed, the corresponding genes were typically found in the GIs of LMG3645 and the Cff strains. Previously, the pathogenicity of Clavibacter michiganensis was shown to be attributable to a genomic island (Jung et al. 2014). LMG3645 has not been reported to be pathogenic, but genes related to pectate lyase, serine protease and cellulases were found in its GIs, which is consistent with the COG analysis (Table S1). As LMG3645 was also isolated from Phaseolus vulgaris, we infer that LMG3645 may also be pathogenic. Further experiments on the pathogenic effect of C. flaccumfaciens LMG3645 on P. vulgaris could be carried out. In addition, the online service PHASTER was used to identify and annotate prophage sequences in the genomes of the 12 C. flaccumfaciens strains. A total of five incomplete prophage sequences were found, in strains WW7, VKM Ac-1386 and S5.26.
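The total GI size per strain reported in this section can be derived from predicted island coordinates by merging overlapping intervals and summing their lengths. The sketch below shows one way to do this; it assumes 1-based inclusive coordinates and uses invented intervals, and it does not represent the output format of any particular GI prediction tool.

```python
from typing import List, Tuple


def total_island_length(islands: List[Tuple[int, int]]) -> int:
    """Sum the lengths of genomic islands, counting overlapping bases once.

    Coordinates are assumed to be 1-based and inclusive (start <= end).
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(islands):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


# Hypothetical island coordinates for a single strain.
toy_islands = [(100, 200), (150, 300), (1000, 1100)]
gi_total = total_island_length(toy_islands)
print(gi_total)
```

Merging before summing avoids double counting when predictions from different methods overlap.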

Conclusions
Cff is an emerging plant pathogen that is spreading rapidly worldwide and occurs in many bean-producing countries. Systems biology approaches, such as genomics, have been used to further reveal its pathogenicity. The analyses of 12 C. flaccumfaciens strains in this study showed that the species has an open pangenome, meaning that the number of new genes and the pangenome size would increase with the addition of new genome sequences. Phylogenetic analysis showed that there was no correlation between the phylogeny and pathogenicity of C. flaccumfaciens. KAAS annotation of the core genome showed that the basal metabolic, glycolysis/gluconeogenesis and pentose phosphate pathways are complete, while the citrate cycle (TCA cycle) is incomplete, missing the gene encoding 2-oxoglutarate dehydrogenase (EC:1.2.4.2) in all 12 C. flaccumfaciens strains. GI analysis showed that genes encoding pectate lyase, serine protease and cellulases, which may be related to the pathogenicity of C. flaccumfaciens, were found in the GIs of the three known pathogenic Cff strains; based on the COG and GI analyses, it was concluded that LMG3645 might also be a pathogenic strain. In contrast, prophages are not abundant in C. flaccumfaciens. This study deepens our understanding of the pathogenicity of C. flaccumfaciens at the genomic level and may encourage further exploration thereof.

Declarations
Ethics approval and consent to participate: Not applicable. This article does not contain any studies with human participants or animals performed by any of the authors.

Consent for publication:
Not applicable.
Availability of data and materials: The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Competing interests: The authors declare that they have no conflicts of interest.

The presence of each gene in 12 strains of C. flaccumfaciens. The phylogenetic tree of the 12 C. flaccumfaciens strains is illustrated at the top of this figure. In the pangenome, the presence of each gene across the 12 C. flaccumfaciens strains is indicated by green, and its absence is indicated by yellow. Each column represents the genome of one strain, while each row represents one gene. Rows that are entirely green indicate core genes; rows for accessory genes are green in at least two strains; and rows for strain-specific genes are green in only one strain.

... flaccumfaciens strains varied from one to 12. The strains corresponding to 1-12 on the abscissa axis were as follows: 208, CFBP3418, Cff1037, JUb65, LMG 3645, MEB126, P990, S5.26, UCD-AKU, VKM Ac-1386, VKM Ac-1795, WW7. The cumulative curve (in blue) supports an open pangenome. Based on Heaps' law, the formula $y = \alpha x^{\beta} + \sigma$ (where y denotes the number of genes of the pangenome; x denotes the number of analyzed genomes; and α, β and σ are fitting parameters) can be used to calculate the pangenome size. When 0 < β < 1, the pangenome is open; when β > 1, the pangenome is closed. The formula $y = \alpha e^{\beta x} + \sigma$ (where y denotes the number of genes of the core genome; x denotes the number of analyzed genomes; and α, β and σ are fitting parameters) can be used to determine the size of the core genome as a function of the number of genomes.