Genomic aspects and in silico identification of ncRNAs
The mining of genomic sequences enabled the selection of 132 complete genomes, with 44 being B. cereus, 40 being B. thuringiensis,and 48 being B. anthracis. Chromosome size varied slightly among species, with B. anthracis having the lowest chromosome variation (from 5.20 to 5.25 Mb) and B. thuringiensis having the highest variation (from 5.21 to 6.00 Mb). The GC content of the chromosomal sequences was about 35%, and no differences were observed among the three species (Online Resource Table 4).
Fig. 2A graphically presents the chromosomal size (Mb) and number of chromosomal ncRNAs of each strain studied; they are represented by spots. The data from the three species enabled us to identify three distinct clusters. The first one, positioned around 5.20 Mb, was composed by three strains of B. cereus and all strains of B. anthracis, indicating that this last species showed little variation in chromosome size and number of ncRNAs identified. The second cluster, positioned between 5.20 and 5.60 Mb, was composed of most B. cereus strains grouped with some B. thuringiensis strains, which demonstrates that these species showed more variation in number of ncRNAs and chromosome size as compared to B. anthracis. The third cluster, composed exclusively of B. thuringiensis strains, positioned between 5.60 and 6.00 Mb, demonstrates that some strains of this species showed greater chromosome size than the other evaluated strains. Four strains were not included in the clusters because they presented divergent chromosome sizes. One example is strain Bt YBT-1518, which had the largest chromosome among all studied (6.00 Mb) and the highest number of ncRNAs identified (332).
Distribution analysis between chromosome size and the number of ncRNAs was calculated by the Shapiro-Wilknormality test, which showed that the data did not show a normal distribution (W=0.778, p-value <0.05, and W=0.543, p-value <0.05, respectively). Spearman'stest indicated a high positive correlation (Rho=0.811, p-value < 0.01 and Linear R2=0.524) between number of ncRNAs and size (in bp) of chromosomes. This reveals that as the size of a chromosome increases, the number of identified ncRNAs increases as well.
The number of identified chromosomal ncRNAs varied among strains (Table 1). The Kruskal-Wallis test demonstrated that the number of ncRNAs differed significantly (p<0.001) among the three species. Thus, the results of this study demonstrated that B. anthracis strainsshowed the smallest average chromosome size and the smallest average of ncRNAs. Similarly, B. thuringiensis strainspresented both the largest average chromosome size and the highest average of chromosomal ncRNAs (Table 1), which may be due to the existence of B. thuringiensis strains with chromosome enlargement (Fig. 2A).
Analysis of the plasmid content of the 132 selected genomes identified 386 plasmids. Plasmids were not found in all strains of the three species, but were greater among B. thuringiensis strains, which also showed the highest average of plasmids per strain (Table 1). The average genome increase caused by the presence of plasmids was also higher among B. thuringiensis strains(12%), reaching 24% in Bt MC28 and Bt IS5056 strains (Online Resource Table 4). Among B. cereus strains, the presence of the plasmids led to an average increase of 6% in genome size, with a maximum of 14% in Bc 03BB108 and Bc C1L; in B. anthracis, the average increase was 5%, with a maximum of 6% in Ba 14RA5914 (Online Resource Table 4). The GC content among plasmids of the three species was similar: B. cereus plasmids averaged 33.21%, B. thuringiensis 33.28%, and B. anthracis 32.77%, as expected for species with a common ancestral origin (Nishida 2012).
Among the B. anthracis strains, ncRNAs were identified in all plasmids, while in the other two species, ncRNAs were identified in 58% and 61% of the plasmids of B. cereus and B. thuringiensis, respectively (Table 1). The Kruskal-Wallis test showed no significant difference in the number of ncRNAs identified in the plasmids of the three species (p=0.107).
The Shapiro-Wilktest for all 386 plasmids evaluated in this study demonstrated that the size in bp and the number of ncRNAs did not follow a normal distribution (W=0.798, p-value <0.05, and W=0.639, p-value <0.05, respectively), while Spearman's test demonstrated a moderate and significant correlation between plasmid size and the number of ncRNAs (Rho=0.625, p-value < 0.01). Fig. 2B illustrates that many plasmids follow the trend of the straight line, while others are located distantly, indicating a moderate linear relationship between the variables.
In addition, some plasmids (black circle) showed a high number of ncRNAs compared to other plasmids of the same size (from 150 to 350 kb). Blastn alignment demonstrated that these plasmids are similar to each other because they showed identity values between 87.07 and 100.00 and coverage ranging from 58% to 99% (Online Resource Table 5), except for plasmid pBMB0232 from the B. thuringiensis strain YBT-1518, which obtained the lowest coverage among the strains (0% to 2%). Analysis of these plasmids identified only Group II Introns, indicating that the presence of this genetic element precedes the dispersal and diversification events of these plasmids among B. thuringiensis strains.
Fig. 2B highlights two other plasmids from B. thuringiensis strains, which showed the highest amounts of ncRNAs identified in a single plasmid. The plasmid pBMB0233 with 240.661 kb, from strain YBT-1518, has 24 ncRNAs, as well as plasmid pYC1 with 761.374 kb, from strain YC-10, has 33 ncRNAs. Furthermore, the bottom of Fig. 2B reveals some large B. cereus plasmids that have no ncRNAs or only one ncRNA—for example, the unique plasmid from B. cereus strainFM1 with 402.61 kb and plasmid pRML01 from Bc strainHN001 with 435.42 kb. Sequence alignment of these plasmids by Blastn showed low identity and coverage values (data not shown), indicating that these plasmids did not have the same genetic origin.
Identification and occurrence of ncRNA families
One of the objectives of this study was to verify the occurrence of ncRNA families in B. cereus s.l. We identified 65 families of ncRNAs among the three species, which were assigned as described in the Rfam database. B. thuringiensis is the species with the highest number of families, a total of 64, while B. cereus presented 55 families and B. anthracis presented 43 families (the least) of ncRNAs (Online Resource Fig. 6, Table 1). Through the Kruskal-Wallis test, we identified that the difference in the number of families between species was significant for the chromosomes (p<0.001), while there was no difference for the plasmids (p=0.051) (Table 1).
Most of the families (43 of 64) were classified as common (Online Resource Fig. 6). In addition, B. cereus and B. thuringiensis strains shared 11 families. However, B. anthracis strains did not share ncRNA families with the other two species separately (Online Resource Fig. 3 and Online Resource Table 6). In addition, 10 families exclusive to B. thuringiensis strains, and one family exclusive to B. cereus strains, were identified (Online Resource Fig. 3 and Online Resource Table 6). Together, these data indicate that ncRNA acquisition occurred both at the common ancestor of the three species (ncRNA families classified as common) and after separation from B. anthracis (ncRNA families shared between B. cereus and B. thuringiensis). Moreover, after the separation between the latter two species, B. thuringiensis strains continued to acquire ncRNAs at a higher frequency than B. cereus strains, as they had a higher number of unique families.
Among the common families, most are located on chromosomes such as the SurA, BtsR1, and RsaE families, which are related to sporulation (Silvaggi et al. 2006; Irnov et al. 2010), defense and pathogenicity (Peng et al. 2018), oxidative stress (Durand et al. 2017; Marincola et al. 2019), and virulence (Geissmann et al. 2009; Guillet et al. 2013). The chromosomal copy number per strain was highly variable among the families (Table 2 and Online Resource Table 6), but the copy number of each family was similar among the three species. In plasmids, the occurrence of the ncRNA families was always a single copy or a low copy number (Table 2). A few families located both on chromosomes and on plasmids were also identified (T-box, BsrC, crcB, cspA, Intron_gpI, RsaE, SAM, SurA, ykok, Intron_gpII, group-II-D1D4-5, yidF, and L21_leader) (Online Resource Table 6). A large proportion of these families are present on the chromosomes of most strains, whereas on plasmids they were found in few strains, except for the group-II-D1D4-5 and Intron_gpII families, which were identified in high copy number on a larger number of B. thuringiensis plasmids.
Of the 11 families shared by B. cereus and B. thuringiensis, six were found exclusively on chromosomes (5_ureB_sRNA, Histone, rli38, rsaJ, sau-5971, and group-II-D1D4-4). Three were found exclusively on plasmids (Bacillus-plasmid, CRISPR-DR43, and rliF) and two were found on both chromosomes and plasmids (epsC and rli40) (Online Resource Table 6). In addition, all families of ncRNAs classified as shared were found in a single copy or a low copy number on chromosomes and plasmids (Table 2).
Of the families identified exclusively in a single species, the one exclusive to B. cereus (tsr25) was detected only on the chromosome of a single strain, sequenced twice, Bc E33L (CP000001.1 and CP009968.1). Meanwhile, from the 10 families exclusive to B. thuringiensis (Online Resource Table 6), five were identified only on chromosomes (ncr1175, PyrG_leader, rliI, RNAI, and ykkC- ykkD), two were identified only on plasmids (group-II-D1D4-2 and mir-10), and three were present on chromosomes and plasmids (group-II-D1D4-1, group-II-D1D4-3, and group-II-D1D4-6) (Online Resource Table 6). Most of these exclusive families also showed a limited ability to spread in the group, occurring in few phylogenetically related strains and in a single copy, except for group-II-D1D4-1, group-II-D1D4-2, group-II-D1D4-3, and group-II-D1D4-6, which occurred in a higher percentage of strains and in more copies than the other exclusive families (Table 2 and Online Resource Table 6).
In this study, we separated the ncRNAs according to size, unrelated to the functional/structural categories or classes, thus, we considered as sRNAs the families with up to 500 bp and lncRNAs over 500 bp. Thus, 63 families were classified as sRNAs and their copy numbers varied from 0-48 among strains (Table 2). Two families (SSU_rRNA_bacteria and LSU_rRNA_bacteria) were classified as lncRNA and identified with 10-16 copies per strain. Therefore, most of the identified families belong to the sRNA class, as previously described in other bacterial groups (Mars et al. 2016; Wolf et al. 2018), also confirming the low occurrence of lncRNAs for the B. cereus group.
Functional/structural categories and classes of ncRNAs
Through the Rfam database and literature (Li et al. 2016; Fiannaca et al. 2017; Harris and Breaker 2018; Amin et al. 2019; Lee et al. 2019; Drecktrah et al. 2020; Geissler et al. 2021), all identified families were distributed into three functional/structural categories, nine classes, and three subclasses (Fig. 3 and Table 2). Each functional/structural category was investigated for relative abundance. Thus, the functional/structural category with the highest abundance in B. cereus s.l. was Cis-reg with 67.78%, followed by Gene with 25.28% and Intron with the lowest abundance of 6.94% (Fig. 3). Among the classes, Riboswitch, Leader, rRNA, sRNA, and Group II introns accounted for between 37.83% and 5.85% of the ncRNA molecules, while the classes with the lower abundance were Group I introns, ribozyme, miRNA, and CRISPR, reaching up to 1.09% each (Fig. 3).
The occurrence of ncRNA classes varied between chromosomes and plasmids. Thus, the Leader, Riboswitch, and sRNA were identified mainly on chromosomes, whereas the Group II introns class was identified on both chromosomes and plasmids (Table 2). In addition, the number of families identified in B. cereus s.l. for each class was investigated. The sRNA presented 23 families that can regulate the expression of target genes at the post-transcriptional level (Sajid et al. 2018; Sridhar and Gayathri 2019). The Riboswitch presented 21 families; this class contains ncRNAs capable of controlling gene expression at the transcriptional and translational levels (Hör et al. 2018) and, then, to respond to environmental changes. With seven families, the Leader class is composed of ncRNA families that regulate translation (Irnov et al. 2010) and interact with metabolism-related proteins (Xia et al. 2012). The Group II introns contained seven families; these elements are ribozymes that act as mobile genetic elements, predominantly as retroelements (Harris and Breaker 2018; Toro et al. 2018; Fayad et al. 2019).
Among the classes found to have a lower incidence, ribozyme and rRNA were composed of two families, whose function is to catalyze reactions, such as RNA splicing, RNA cleavage, and protein synthesis (Moore and Steitz 2002; Lehmann and Schmidt 2003; Skilandat and Sigel 2013). The CRISPR, miRNA, and Group I introns are represented by a single family (Fig. 3, Table 2). CRISPR can provide defense mechanisms against viruses and may be inactivated in some strains (Grissa et al. 2007; Zheng et al. 2020), miRNA has already been described as acting in insect resistance to B. thuringiensis (Xu et al. 2015), while Group I introns have the same function as elements of the Group II introns (Harris and Breaker 2018).
Group II introns
Since the class of Group II introns (Group-II-D1D4 and Intron_gpII) correspond to 77% of the ncRNAs identified in B. thuringiensis plasmids and to 72% and 62% in B. cereus and B. anthracis plasmids, respectively, it was further analyzed in both chromosomes and plasmids. The sum of these elements found on the chromosomes of all B. thuringiensis strains was 487 copies, 103 copies on all B. cereus strains, and no copies on the chromosomes of B. anthracis. The Kruskal-Wallis test for the number of Group II introns in the chromosomes showed a significant difference between B. cereus, B. anthracis and B. thuringiensis species (p<0.001). This result highlights the great number of these elements in B. thuringiensis strains, especially on the strain Bt YBT-1518 that showed about one-third (32.6%) of the Group II introns from this species. Hence, a total of 159 copies of Group II introns were identified on the chromosome of the Bt YBT-1518 (Fig. 4) and when this strain was compared to the three species, the difference was statistically significant (p<0.001).
Among the plasmids, 503 copies of Group II introns were identified in B. thuringiensis strains, while 121 copies were identified in B. cereus, and 93 copies were identified in B. anthracis (Fig. 4). The Kruskal-Wallis test for the number of Group II introns in the plasmids also showed a difference between the three species - B. cereus, B. thuringiensis, and B. anthracis (p<0.001). However, we observed that B. cereus and B. thuringiensis exhibit no significant difference regarding the copy number of Group II introns (p=0.495), while the pairwise analysis showed that B. anthracis is different from the other two species (Ba-Bc p=0.011; Ba-Bt p<0.001). Similarly, to the chromosomes, the plasmids of the strain Bt YBT-1518 showed a remarkable number of Group II introns, corresponding to 10% of the elements found in all B. thuringiensis plasmids. Therefore, Bt YBT-1518 showed 42 copies in four plasmids (three copies in plasmid pBMB0229, three copies in pBMB0230, 13 copies in pBMB0232, and 23 copies in pBMB0233). In addition, when Bt YBT-1518 was compared to B. cereus, B. thuringiensis, and B. anthracis, the Kruskal-Wallis test demonstrated a statistically significant difference (p=0.004), indicating that the distribution of these elements in this strain is significantly different from the distribution in the plasmids of the three species.
From the first analysis, concerning the correlation among the replicons and the number of copies of ncRNAs, the strain Bt YBT1518 has already shown a divergent profile of ncRNAs content, with the highest number of chromosomal ncRNAs identified among all the evaluated strains and with a higher number of ncRNAs in the plasmids than other strains. Given the results for Group II introns, we observe that these elements are responsible for the higher number of ncRNAs in this strain. Due to this atypical profile, the statistical analyses between the species were also performed excluding Bt YBT1518. However, the result of the analysis of variance did not change, so we retained Bt YBT1518 in the overall comparative analyses.
Therefore, the genomic context of these elements was analyzed by aligning the chromosomal sequence of Bt strain YBT-1518 with the chromosomal sequences of Bt strain MYBT1846 and Bt 407 (Online Resource Fig. 7 A), which are closely related phylogenetically. Thus, on the chromosome of Bt strain MYBT1846, four copies of the Group II class introns were identified, whereas in Bt strain 407, six copies were identified (Online Resource Fig. 7 A). Elements belonging to the class Group II introns were detected both in highly conserved regions and in regions with low conservation among the three genomes (Online Resource Fig. 7 A). This result indicates that the acquisition and spread of these elements in strain Bt YBT-1518 took place after the evolutionary divergence of the three strains. This statement is also true for plasmid analysis, as these elements were observed in the plasmids of Bt strains MYBT1846 and Bt 407, although in smaller quantities (Online Resource Fig. 7 B), as compared to the occurrence in the plasmids of Bt strain YBT-1518.
Correlation between ncRNAs and phylogeny
Due to the complex taxonomic and phylogenetic relationships among B. cereus, B. thuringiensis,and B. anthracis, an MLST tree containing all strains of this study was constructed and classified into clusters according to Guinebretière et al. (2008). Moreover, we include data about the occurrence of ncRNAs in chromosomes and plasmids, aiming to obtain a profile of the distribution of ncRNAs in correlation with the phylogenetic and ecological classification of these strains.
Analyzing the phylogenetic tree (Additional file 1), one can see that B. cereus and B. thuringiensis strainsare distributed all over the clusters, while B. anthracis strains form a highly clonal subcluster that remains separate from the other species, as described previously (Rasko et al. 2005; Tourasse et al. 2011; Bazinet 2017). Cluster IV contains B. thuringiensis strains, widely used as a bioinsecticide, and environmental strains of B. cereus. On the other hand, pathogenic B. cereus strainsand B. anthracis strains, in addition to some B. thuringiensis strains, are part of cluster III.
Concerning ncRNA, the families common to the three species were present in most of the strains in all clusters and represent the majority of ncRNA families. Meanwhile, most of the ncRNA families shared between B. cereus and B. thuringiensis showed a limited ability to spread, being shared mainly by a few phylogenetically related strains (Additional file 1), except for Histone3, epsC, and rli40 families, which were found in a large number of strains in both clusters (Online Resource Table 6). The exclusive families also had a limited ability to spread in the group, as they were not identified in more than one strain and presented a single copy in the genomes in which they were found, except for the families group-II-D1D4-1, group-II-D1D4-2, group-II-D1D4-3, and group-II-D1D4-6, which are found in many strains and have a higher copy number. Analyzing this information was essential to selecting ncRNAs for use in the PCR expression validation step.
The comparative analysis of ncRNA abundance on chromosomes revealed differences between MLST cluster III and cluster IV. The Mann-Whitney test showed that the distribution of the number of families (p=0.000) and the number of ncRNAs (p=0.000) was different between the two clusters. However, in the analysis of ncRNA abundance in plasmids, the Mann-Whitney test showed no difference between clusters III and IV for either the number of families (p=0.890) or the number of ncRNAs (p=0.644).
Validation
For the in vitro validation, nine families were selected based on phylogenetic distribution, the abundance of each family in the tree species, and the number of copies in the genomes. These families belong to the three functional and structural categories (Genes, Cis-Reg, and Intron). The description of the families, as well as the primers used, are presented in Online Resource Table 3. All ncRNAs were amplified from the cDNA of B. thuringiensis var. thuringiensis 407 Cry- from the end of the exponential phase with the size corresponding to the amplicon (Fig. 5, Online Resource Table 3). This demonstrates their presence in the RNA samples of the tested strain and confirms their expression.