Comparative Genomics of Bacillus Subtilis MZK05 and Its Mutant Strain Revealed Genetic Factors Responsible for Enhanced Serine Protease Expression

16 Genome sequence study of an industrially-important strain, Bacillus subtilis M9, a mutant version of wild 17 strain Bacillus subtilis MZK05 was conducted to uncover genetic factors responsible for enhanced serine 18 proteases expression in addition to its other industrial enzymes, metabolites and bacteriocins producing 19 efficacy. The wild type and the mutant genome contained a size of 4,145,727 and 4,045,950 bp, with 20 4,352 and 4,383 genes; and 477 and 478 subsystems respectively. Genomic comparison with 31 B. 21 subtilis sourced from different countries showed both wild and mutant shared same type of genome 22 structure with 20 others. Moreover, 6,000 kb pangenome showed that they share 3082, 1449, and 25757 23 core, unique and accessory genes respectively. A sum of 32,559 mutations were found with three major 24 genomic structural changes in the upstream and downstream of an extracellular alkaline serine protease, 25 AprX and a periplasmic serine protease, HtrC in M9 genome when compared to the wild type. 26 Furthermore, 11 different serine protease genes and 4 different signal peptidases were found with several 27 mutational changes in M9. In addition, mutations found in core genome of BsM9 in phage-like element, 28 major capsid protein, and phage portal protein are the likely reasons of high-level serine protease activity. 29


Introduction 42
Bacillus subtilis is a rod-shaped, Gram-positive, spore forming, nonpathogenic soil bacterium that 43 secretes numerous enzymes to degrade a variety of substrates, enabling the bacterium to survive in a 44 continuously changing environment. Proteases from Bacillus spp are one of the most commercially 45 valuable enzymes 1 having enormous applications in food, feed, leather processing, detergent, digestive 46 aid and therapeutic purposes 2 . The genes coding for eight extracellular serine proteases were identified in 47 B. subtilis: aprE 3,4 , bpr 5,6 , epr 7,8 , mpr 9,10 , nprB 11 , nprE 12 , vpr 13 , and wprA 14 . While two of them were 48 considered as major protease-coding genes, viz. aprE (alkaline serine protease subtilisin) and nprE 49 (neutral metalloprotease) accounting for 95% of the total extracellular protease 15 , the rest were minor 50 extracellular proteases. These enzymes help cells supplying amino acids for growth by degradation of 51 extracellular proteins 16 . 52 Earlier, we isolated eight Bacillus species from the effluents of tannery and poultry farms of 53 Bangladesh 17 , of which B. subtilis MZK05 (Bs MZK05) demonstrated potential extracellular enzymes, 54 identified as serine protease 17 . With a view to increasing the performance, the bacterium was subjected to 55 random mutagenesis using both chemical (ethyl methane sulfonate) and UV radiation 18 that eventually 56 generated a mutant strain, B. subtilis M9 from a selection of potential strains, able to produce more 57 efficient extracellular protease than that of the wild type. This was demonstrated in unhairing and bating 58 of skin and hides, a performance comparable to that of the commercial enzymes, thereafter was applied 59 successfully in leather industries 19 , 20 , 21 . Therefore, we attempted here to compare the genome sequencing 60 of the two strains to pinpoint the probable cause of high-level expression of serine proteases in B. subtilis 61 M9 mutant, the number of proteases, and other industrially useful enzymes and metabolites, genes 62 responsible for extracellular protease secretion, genome variation in between the strains studied here with 63 that of strains from different countries of origin, and evolutionary relationships present with other 64 bacterial species of Bacillus subtilis group. 65

Genome sequence information 69
Classification, general features, and genome sequencing information of B. subtilis MZK05 are tabulated 70 (Table 1)

Mutational changes in genome sequence 100
The analysis of next generation genome sequence data yields major insights into the genetic features of 101 the organism 24 , 25 . Here, we found that the mutations in the B. subtilis MZK05 strain produced nearly 102 about 0.1 Mb reduced genome sequence than that of the mutant strain, B. subtilis M9. The related 103 mutational variants are shown in Fig. 1c and the major genetic changes are described in pangenome 104 section. From mauve genome alignment, it was clearly revealed that there were three major changes 105 occurred in very close region of two serine protease genes in B. subtilisM9. A frameshifted region was 106 identified in the downstream of a periplasmic serine protease htrC (Fig. 2a). Besides, deletions of about 107 75 kb and 125 kb nucleotides were detected from upstream and downstream of an extracellular alkaline 108 serine protease aprX (Fig. 2b). These deletions occurred just before 17 genes and after 2 genes of aprX,   Table S3). Major structural changes in rhomboid family intramembrane serine protease 127 owing to mutation were detected in the serine protease of B. subtilis M9, particularly 3 changes in active 128 sites -S398N, E436K and N463Y; and 2 mutations at G359R and V376A that produced major and 129 noticeable 3D structural changes (Fig. 3a). There were four changes observed in extracellular serine 130 metalloprotease, one of which was Q88L that caused 3D structural change owing to loss of an alpha helix 131 ( Fig. 3b). Two different sized AprX proteins were evident in both B. subtilis MZK05 and B. subtilis M9. 132 The larger AprX harbors 52 extra amino acids upstream of the coding sequence, the rest of the amino acid 133 sequences were similar to both AprX. Two mutations were found in the small AprX in two different 134 positions: N230S and T378A (Fig. 3c). The large AprX, on the other hand, had two substitutions: N282S 135 and T430A (Fig. 3d). Remarkably, both of these large and small AprX had same type of two non-136 synonymous changes: asparagine to serine at positions 282 and 230 (Fig. 3e); and threonine to alanine at 137 positions 430 and 378 in large and small AprX respectively (Fig. 3f). Interestingly, the positions of the 138 changes between the two varied sizes of AprX become alike if the stretch of upstream 52-amino-acid 139 residues present in the large AprX is deducted. Further, 6 mutational changes were evident in Vpr, one of 140 these were T719S that brought a 3D structural change in the enzyme (Fig. 3g). Besides, 1 mutational 141 change was observed in Isp (Fig. 3h), and each in HtrA, HtrB, and HtrC, however, no significant 142 structural change was resulted due to these mutations. In addition, 7 mutational changes were found in 143 immune inhibitor A peptidase M6 region of bacillopeptidase F (Suppl. Mat. Table S3). Overall, the 144 deletion of upstream and downstream of an aprX gene (Fig. 2b) (Fig.  172 4e). Further, mutation at T64A of a second S26 family signal peptidase (Fig. 4g) altered most of the 3D 173 structure of the enzyme. This change affected catalytic sites: S47 and H87 (Fig. 4h). Overall, multiple 174 mutations in different serine type signal peptidases are thought to produce a high-level enzyme activity of 175 serine proteases in B. subtilis M9.

Identification of bacteriocin gene clusters 177
From AntiSMASH 3.0 30 and BAGEL4 31 , four different classes of bacteriocin (nisin, sublancin, 178 lanthipeptide and subtilosin) encoding genes were found both in B. subtilis MZK05 and Bs M9. The 179 structure of bacteriocin genes with operon in MZK05 genome is illustrated in supplementary information, 180 Figure S4.

Pan-genome and COGs protein function analysis 207
The NCBI-retrieved 30 B. subtilis strains and the strains of interest of this study produced a 6,000 kb long 208 pangenome (Fig. 6a) 26,27 found that reduction of mobile element and genomic DNA increased high level productivity of extracellular cellulase, alkaline protease and beneficial properties. Likewise, reduction of some accessory 228 genes and mutation in core genes may increase the production and activity of alkaline serine proteases in 229 this study. The KEGG and COGs distribution of the representative proteins present in the core, accessory, 230 and unique genome are shown in Fig. 5c and Fig. 5d respectively, and the COGs frequency heatmap of 231 the representative proteins present in all of 30 genomes are shown (Fig. 5e). Relative evolutionary 232 divergence of Neighbor Joining (NJ) core and pan phylogenomic tree of 30 strains, constructed on the 233 basis of core genome and pan gene alignments are shown in (Figs. 6b-c). It was clearly revealed that the

Identification of bacteriocin gene cluster 300
The bacteriocin synthesizing gene clusters were revealed by AntiSMASH 3.0 30

Pangenome and COGs protein function analysis 320
Pan-and core-genome analyses of 30 B. subtilis genomes were performed using GView 40 and BPGA 321 v1.3 44 . GView was used to generate pangenome atlas as well as to study both pan and core genomes. 322 Besides, BPGA v1.3 was used with default clustering algorithm for orthologous gene identification and 323 clustering with a 50% sequence identity cut-off. BPGA software uses USEARCH, CD-HIT, OrthoMCL 324 and MUSCLE software for the orthology analyses and powerlaw regression and exponential curve fit for 325 the pangenome, core genome developments, phylogeny, subset analysis, and KEGG & COGs mapping. 326