Whole Genome Sequence of biosurfactant producing Bacillus tequilensis

Background: Bioremediation is crucial for recuperate polluted water and soil. By expanding the surface area of substrates, biosurfactants play a vital role in bioremediation. Bioremediation is crucial for recuperate polluted water and soil. By expanding the surface area of substrates, biosurfactants play a vital role in bioremediation. Biosurfactant producing microbes release certain biosurfactant compounds, which are promoted for oil spill remediation. In the present investigation, a biosurfactant producing bacterium Bacillus tequilensis was isolated from Chilika Results: The whole genome sequence is 4.47 MB consisting of 4,478,749 base pairs forming a circular chromosome with 528 scaffolds, 4492 protein-encoding genes(ORFs), 81 tRNA genes, and 114 ribosomal RNA transcription units. The total number of raw reads was 4209415, and processed reads were 4058238 with predicted genes of 4492. The whole-genome obtained from the present investigation was used for genome annotation, variant calling, variant annotation and comparative genome analysis with other existing Bacillus species. In this study, a pathway was constructed which describe the biosurfactant metabolism of Bacillus tequilensis. The study identified genes such as SrfAD, SrfAC, SrfAA, SrfAB which are involved in biosurfactant synthesis. Conclusion: The sequence of the genes SrfAD, SrfAC, SrfAA, SrfAB was deposited in Genbank database with accession MUG02427.1, MUG02428.1, MUG02429.1, MUG03515.1 respectively. The whole-genome sequence was submitted to Genbank with an accession RMVO00000000 and the raw reads can be obtained from SRA, NCBI repository using accession: SRX5023292.


INTRODUCTION
Heavy metal contamination has now become serious ecological threat raising environmental concerns. Metals especially cadmium and zinc are has posed serious threat as their degradation to innocuous products is hard and takes millions of years [1][2][3]. Bioremediation systems have however been long proposed to neutralize metal contamination, however, have low bioavailability leading to incomplete bioremediation process. Further, such bioremediation process like phytoremediation with synthetic chelators have been shown be expensive and environmentally hazardous [4][5]. Various surface-active compounds (SACs) commonly the biosurfactants produced by microorganisms have emerged as safe alternative to chemical remediation that are known to be safe, and are now exploited in environmental remediation techniques including heavy metal removal [6][7][8].
Bioemulsifier surfaced as promising remediation agent that can effectively remove metals from soil and water bodies. The Whole-genome sequence represents a valuable shortcut, helping scientists find genes much more easily and quickly. It expected that being able to study the entire genome sequence will help in understanding how genes endeavour together to direct the maintenance, development, and growth of a whole organism. Besides, it can use to predict the genes involved in the synthesizing of biosurfactants in microbes [9][10]. The present study aimed to sequence the whole genome of biosurfactant producing Bacillus tequilensis using Next-Generation sequencing, De-novo assembly, genome annotation, variant calling and variant annotation.

Identification of Biosurfactant producing Bacillus tequilensis
Majority of biosurfactants are produced by the microbes such as Pseudomonas genus followed by Bacillus and Acinetobacter respectively [11]. In a previous investigation, a novel strain of biosurfactant producing Bacillus tequilensis strain ANSKLAB04 [12] was identified using 16S rRNA gene sequencing by Sanger dideoxy sequencing method followed by the phylogenetic assessment. The strain was isolated from Chilika lake, a brackish water lagoon, spread over the Puri, Khurda and Ganjam districts of Odisha state on the east coast of India [12]. By conducting several biochemical tests such as Haemolysis test, oil spreading test, CTAB agar plate test and Drop collapse test, we concluded that Bacillus tequilensis produces biosurfactants [12] and the novel isolate was deposited in Genbankwith Accession number KU529483.

2.2Bioanalyzer profile
The DNA isolation was performed using Phenol/Chloroform(PCl) genomic DNA extraction method [12].The bioanalyzer profile of the prepared WGS library showed fragments in a size range of 300-600bp. The effective insert size of the library is 180-480bp flanked by adaptors having combined size of ~120bp. Based on the fragment distribution and concentration, the library was suitable for sequencing on Illumina platform [Fig 1].

2.3Genome Representation
The complete genome of Bacillus tequilensis consists of a single circular chromosome of 4,478,749 bp with an average G+C content of 46.33% (Table 1 and Supplementary Table 1). The 4492 predicted coding ORFs covers 87% of the complete genome, and each ORF has a moderate length of 283 aa(Supplementary Table 2). Among these, 1,347, i.e. 67.4% assigned as putative functions, 258, i.e. 12.9% matched to sustain hypothetical coding sequences of an anonymous function, and the rest 394, i.e. 19.7% shows no similarities to known genes[ Table 2]. The variations in nucleotide frequencies across the whole genome sequence was investigated using a non-overlapping active platform and by framing three indices of nucleotide frequency:  Normal distribution approximation was used as the total numbers of bases were large. The strand analyzed here was the 5' to 3' strand clockwise on the genetic map. A window size of 1 kb was used.From the inside: green and red bars represent RNA sequences on positive and negative strand respectively. Circle 1, represents G + C content (window size: 10Kb) higher and lower than 45%, where red represents higher and green represents lower. Pseudomonas species required Plasmid-encoded-rhlA, B, R and I genes of rhl quorum-sensing system for production of glycolipid biosurfactants as well as also involved in the production of rhamnolipids in a heterologous host. Iturin A is an antifungal lipopeptide biosurfactant produced by certain Bacillus subtilis strains such as B. subtilis RB14 is composed of four ORF namely ituD, ituA, ituB, and ituC, whose disruption leads to specific deficiency in iturin A production.
The three genes of arthrofactin operon of Pseudomonas namely arfA, arfB and arfC encode ArfA, ArfB and ArfC containing two, four and five functional modules respectively required for condensation, adenylation and thiolation. Besides, Amphisin is produced by Pseudomonas sp.
DSS73 require gacS and amsY genes for the production of biosurfactant as these genes are mutants defective in the genes. Amphisin synthesis is regulated by the gacS gene as the gacS mutant regains the property of surface motility upon the introduction of a plasmid. Moreover, genes dnaK, dnaJ and grpE positively regulate the biosynthesis of putisolvin [14].Putisolvin biosynthesis genes such as dnaK, dnaJ and grpEfrom Bacillus tequilensis were identified and the sequence were deposited in Genbank with accession MUF99480.1 MUF99481.1, MUF99479.1 respectively.
Acinetobacter species produces high molecular weight biosurfactants -Emulsan and Alasan with the involvement of gene. AlnA, AlnB and AlnC are essential for Alasan biosynthesis whereas wza, wzb, wzc, wzx, and wzy is required for Emulsan biosynthesis. For the production of fungal biosurfactant, emt1 and cyp1 are the two genes involved in the synthesis of these glycolipids and fb1 and hfb2 genes regulating the synthesis of hydrophobin [14]. Thus, gene plays a major role in the biosynthesis of various microbial surfactants, and hence the role of molecular genetics and gene regulation mechanisms in the production of biosurfactant is essential. In this study, we have identified biosurfactant producing genes and corresponding orfs of Bacillus tequilensis such gene SrfAD, SrfAC, SrfAA and the sequence of the same was deposited in Genbank database with accession MUG02427.1, MUG02428.1, MUG02429.1,MUG03515.1 respectively.

Genome evolution of B. tequilensis
The enormous genomic data obtained from sequencing of Bacillus tequilensisANSKLAB04 was      were found which recorded as the highest and P6 found the lowest [Fig 9].   After removing duplicates with Sambamba and identifying variants with SAMTools, information of each variant were gathered and classified by chromosomes or scaffolds. Table 9 shows the summary of the variant calling of Bacillus tequilensis ANSKLAB04 against other existing genomes in the database. [ Figure 10D] whereas Bacillus tequilensis KCTC 13622 is having the maximum number of insertions and deletions i.e. 841 and 653 respectively [ Figure 10A]. Meanwhile, Bacillus subtilis were having less number of SNPs, insertions and deletions [ Figure 10B].    Fig 11A] are the graphical representation of base count change.  Figure 11. Base change count of each sample

Transition and transversion information
The number of transition (Ts) and transversion (Tv), and the Ts/Tv ratio were calculated using the base change count. Base changes (DNA substitution) are of two types. Interchanges of purines (A <-> G), or pyrimidines (C <-> T) are transitions, while interchanges of a purine for pyrimidine bases, and vice versa, are transversions. Although there are twice as many possible transversions, transitions are more common than transversions due to differences in structural characteristics. Generally, transversions are more likely to cause amino acid sequence changes.
[  T and C to G conversion or interchange and vice-versa whereas transversion is indicative of A to C or A to G or T to C or T to G or vice-versa as shown in figure [11].Bacillus subtilis is having more number of transitions on comparison with Bacillus tequilensis (Figure 12  Transitions are less likely to result in amino acid substitutions and are therefore more likely to persist as "silent substitutions" in populations as single nucleotide polymorphisms (SNPs).  Figure 12. Transition, Transversion proportion

2.11Variant Annotation
To find out the annotation information such as amino acid changes by variants, SnpEff was used.            HIGH duplication Duplication of a large chromoome segment (over 1% or 1,000,000 bases).

HIGH inversion
Inversion of a large chromoome segment (over 1% or 1,000,000 bases).

HIGH coding_sequence_variant
One or many codons are changed. LOW inframe_insertion One or many codons are inserted (e.g.: An insert multiple of three in a codon boundary).

MODERATE disruptive_inframe_insertion
One codon is changed and one or many codons are inserted (e.g.: An insert of size multiple of three, not at codon boundary).

MODERATE inframe_deletion
One or many codons are deleted (e.g.: A deletion multiple of three at codon boundary).

MODERATE disruptive_inframe_insertion
One codon is changed and one or more codons are deleted (e.g.: A deletion of size multiple of three, not at codon boundary The variant deletes an exon which is in the 5'UTR of the transcript.

Description for the table 17
Type of annotation: Sequence ontology which allows to standardize terminology used forassessing sequence changes and impact. Hom/Het: Indicates the genotype. "hom" refers to non-reference homozygote, while "het"refers to heterozygote.

DISCUSSION
In the present investigation, we have introduced a high-quality draft genome sequence of Bacillus tequilensis, the first genome sequence of biosurfactant producing Bacillus tequilensishas been determined. Biosurfactant producing microbes have potential applications in various biotechnology, biodegradation, pharmaceutical industries. The whole genome sequence of biosurfactant producing Bacillus tequilensis will provide a foremost resource to start exploring the genes and gene products involved in biosurfactant synthesis. The genome sequence of Bacillus tequilensis obtained in the present investigation will be a key resource for the development of new concept and technique in genetic engineering such as molecular marker assisted breeding and large scale production of biosurfactant microbes for bioremediation.

Sample Collection and DNA Isolation
Water samples were collected from oil-contaminated sites of Chilika lake, Odisha, India (latitude and longitude: 19.8450 N 85.4788 E), a largest brackish water lagoon in India. Various organisms were isolated and purified on culture plates and were then enriched in the mineral salt medium (MSM). MSM gives the nutrient condition for the production of biosurfactants by the organisms which were then screened for their biosurfactant production by various screening tests and the emulsification index was calculated. Identification of organisms was performed based on biochemical, macroscopic and microscopic characters. The organism with the best emulsification index was then subjected to optimization for the production of biosurfactant for the factors affecting the production. Optimization was studied with the emulsification index calculated with each affecting factor. In a previous study, this organism was then subjected to 16S rRNA sequencing for the identification of the genus and species [12]. The DNA was isolated by Phenol/Chloroform (PCl) genomic DNA extraction method [12] [19]. The bacterial cell pellet obtained after centrifugation was subjected to DNA isolation. The DNA concentration and purity were checked with nanodrop spectrophotometer and qubit fluorometer. ImageQuant software was used to analyse the gel cropped image documentation [20].The sample was found to have suboptimal concentration and gave an intact band when running on gel QC against Hind III digested lambda ladder. The gel image demonstrates the intact Bacillus tequilensis. Whole genome having a concentration of 54.9 ng/ul. L represents Hind III Digested lambda ladder. '1 -SO_4915_Bt1' represents the Bacillus tequilensis sample code. [Fig 13].

Library Preparation and Genome Sequencing
Library preparation was performed using NEXTFlex DNA library protocol outlined in "NEXTFlex" DNA sample preparation guide (Cat # 5140-02). In brief, genomic DNA was sheared to generate fragments of approximately 300-500bpin a CovarismicroTube with the E220 system (Covaris, Inc., Woburn, MA, USA). The fragment size distribution was checked using Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA) with High Sensitivity DNA Kit (Agilent Technologies) according to the manufacturer's instructions. The resulting fragmented DNA was cleaned up using HighPrep beads (MagBio Genomics, Inc, Gaithersburg, Maryland).
These fragments were subjected to end-repair, A-tailing, and ligation of the Illumina multiplexing adaptors using the NEXTFlex DNA Sequencing kit as per the manufacturer's instruction [21]. The resulting ligated DNA was cleaned up using HighPrep beads (MagBio Genomics, Inc, Gaithersburg, Maryland)and size selected (400-600bp) on 2% low melting agarose gel and cleaned using MinElute column (QIAGEN, India). These adapter-ligated fragments were subjected to 10 rounds of PCR (denaturation at 98˚C for 2 min, cycling (98˚C for the 30S, 65˚C for 30S and 72˚C for 1 min) and a final extension at 72˚C for 5 min) using primers provided in the NEXTFlex DNA Sequencing kit(Perkin Elmer). The PCR products were purified using HighPrep beads. Quantification and size distribution of the prepared library was determined using Qubitflourometer (Table 18) and the Agilent High Sensitivity DNA Kit (Agilent Technologies) respectively according to the manufacturer's instructions (Fig 14). Illumina Paired-end sequencing was performed using NextSeq 500: 150*2. The following adapters were used for sequencing (Illumina, Inc) [21].

Whole Genome De-novo Assembly and Analysis
The obtained sequence raw reads were checked for quality control using FASTQC tool [22]. The quality of the raw reads was checked through the various modules provided by the FASTQC tool. Among the modules, per base sequence quality and tile sequence quality modules were studies to validate the quality of the data for further analysis. The low-quality reads were excluded from the analysis using Trimmomatic (v0.36) [23]. The filtered De-novo assembly of Illumina paired-end data was assembled using SPAdes -v3.13.0 genome assembler -an opensource algorithm for De-novo assembly [24]. SPAdes assembler is intended for de-novo assembly after error-correction of sequenced reads. Assembled contigs were further scaffolded using SSPACE program [25]. Genome map was constructed using Circos [26].

Whole Genome annotation and GO analysis
NCBI Prokaryotic Genome Annotation Pipeline (PGAP) version 4.8 was used to annotate the whole genome sequence of [27]. Pathway Analysis was done by using KAAS Server. Bacillus subtilis subsp. Subtilis 168 was taken as a reference organism for pathway analysis using KAAS server [28]. The functions of the predicted ORFs were categorized by comparison with the COG database [29]. Simple Sequence Repeats (SSR) was identified in each transcript sequence using MISA Perl script [30].  [31]. During mapping, duplicated reads can falsely cause erroneous data to stand out. To prevent such error, Sambamba tool was used to remove the duplicate reads [31]. Duplicate reads are identified using mapping information such as start position, and CIGAR string [32]. SAMTools was used to manipulate the SAM/BAM files that come out as a result of mapping [33]. In resequencing analysis, it is especially used for finding out variant information by calculating genotype likelihood from every position within the sample of analysis. Variant annotation was performed using SnpEff (v4.3t) [34]. SnpEff annotates the possible effects (on genes) that can be caused by variants identified through mapping.The present study used SnpEff to generate the Genes and transcripts affected by the variant, Location of the variants and the information on how the variant affects the protein synthesis (e.g. generating a stop codon). genome evolution analysis, conducted the variant calling, variant annotation and drafted the manuscript.