Prediction and analysis of Metagenomic operons via MetaRon: a Pipeline for Prediction of Metagenomic OpeRons

Background : Efficient regulation of bacterial genes against the environmental stimulus results in unique operonic organizations. Lack of complete reference and functional information makes metagenomic operon prediction challenging and therefore opens new perspectives on the interpretation of the host-microbe interactions. Methods : Here we present MetaRon (pipeline for the prediction of Metagenomic operons), an open-source pipeline explicitly designed for the metagenomic shotgun sequencing data. It recreates the operonic structure without functional information. MetaRon identifies closely packed co-directional gene clusters with a promoter upstream and downstream of the first and last gene, respectively. Promoter prediction marks the transcriptional unit boundary (TUB) of closely packed co-directional gene clusters. Results: Escherichia coli ( E. coli ) K-12 MG1655 presents a gold standard for operon prediction. Therefore, MetaRon was initially implemented on two simulated illumina datasets: (1) E. coli MG1655 genome (2) a mixture of E. coli MG1655, Mycobacterium tuberculosis H37Rv and Bacillus subtilis str. 168 genomes. Operons were predicted in the single genome and mixture of genomes with a sensitivity of 97.8% and 93.7%, respectively. In the next phase, operons predicted from E. coli c20 draft genome isolated from chicken gut metagenome achieved a sensitivity of 94.1%. Lastly, the application of MetaRon on 145 paired-end gut metagenome samples identified 1,232,407 unique operons. Conclusion: MetaRon removes two notable limitations of existing methods: (1) dependency on functional information, and (2) liberates the users from enormous metagenomic data management. Current study showed the idea of using operons as subset to represent the whole-metagenome in terms of secondary metabolites and demonstrated its effectiveness in explaining the occurrence of a disease condition. This

will significantly reduce the hefty whole-metagenome data to a small more precise data set. Furthermore, metabolic pathways from the operonic sequences were identified in association with the occurrence of type 2 diabetes (T2D). Presumably, this is the first organized effort to predict metagenomic operons and perform a detailed analysis in association with a disease, in this case T2D. The application of MetaRon to metagenome data at diverse scale will be beneficial to understand the gene regulation and therapeutic metagenomics.

Background
Bacteria are one of the simplest free-living forms of life known [1][2][3][4][5][6][7][8]. Due to their presence in diverse environments, these unicellular organisms are susceptible to dynamic conditions [9,10]. Bacteria can flourish in various conditions through adaptive transcription [11]. Recently, much interest exists in the metagenomics regarding the exploration of novel environments and information such as taxonomic profiling, drug discovery, secondary metabolites and many other [12][13][14][15][16]. Survivalunder various favorable and unfavorable environmental conditions is achieved through the evolution of new proteins, enzymes, and pathways via organizing and clustering of two or more genes into a single structural unit known as an operon [17][18][19][20][21][22], as shown in Fig. 1. Operon isan organization of genes formed in support of the bacterial system to respond to new environmental stimulus. They are alsovital for the production of natural products many of which have therapeutic importance [23]. This tightly packed co-directional and coexpressed cluster of genes play a crucial role by providing bacterial information about pathways, gene regulation as well as many natural products of industrial importance [24][25][26][27].
The influx of information about the uncultured microbes via metagenomics highlighted the importance of metagenomic operons in novel environments. Recent studies have already identified many natural products helpful in treating/prevention of cancer, diabetes, and cholesterol-lowering [28]. Many of which have operonic origins [29,30]. Operonic insights help in elucidating the diversity and complexity of poorly understood environments.
Metagenomic access to novel environments also underscored many essential properties of operons regarding identification of natural products, secondary metabolites, structure and functionality of uncultured microbial communities in association with the disease and environmental conditions [30][31][32][33]. Most of the whole-genome operon prediction methods were mostly dependent on experimental or functional information. Such information is often scarce in metagenomic data. Furthermore, the methods developed for whole-genome operon prediction were mostly based on validated information from E. coli and in some instances Bacillus subtilis ( B. subtilis). Therefore, predicting operons from a mixture of millions of bacterial species is challenging, but it will also open doors to identify numerous secondary metabolites and the pathways regulating them.
Most of the past metagenomic studies ended exploring the taxonomic classification in context with a particular environment or disease condition, but very few went a step further in analyzing new aspects of metagenomic data [34][35][36][37][38]. Metagenomic operon prediction remains an understudied facet. The scientific community, despite operon's potential contribution, looks forward to a convenient solution; independent of functional and experimental information as well as provides an automated standalone process for metagenomic operon prediction. Furthermore, very few whole-metagenome studies performed a systemic study for the prediction of whole-metagenome operon and analyzing the data from the aspect of operonic secondary metabolites and differentially abundant operonic pathways in association with disease occurrence. To overcome the limitations mentioned above in the prediction of metagenomic operon, we present MetaRon, a Metagenomic operon prediction pipeline for shotgun sequencing data. It is a user-friendly pipeline that performs the necessary downstream data processing (de novo assembly, gene prediction, de novo promoter prediction and gene clustering), before identifying the operons from the metagenomic sample. However, in case of availability of pre-assembled metagenome and genes, MetaRon also predicts the operons, provided the scaffolds are long enough and the format is consistent with the requirements. MetaRon performs operon prediction based on co-directionality, intergenic distance, and promoter parameters.
Operons are clusters of closely packed co-directional genes, highly prevalent in microbes and aidin their survival in diverse conditions. Metagenomic data contains a cumulative mixture of environmental DNA from millions of uncultivable microbes. Operons in these microbes are crucial in understanding the gene regulation, identification of new pathways and discovery of novel products in diverse environmental settings. Identification of metagenomic operons in the laboratory is an intensive and challenging process therefore computational operon prediction is an efficient way to identify operons. However, to our knowledge, no computational pipeline is available to predict metagenomic operons. We have developed MetaRon -a pipeline that performs metagenomic data processing and predicts operons with high sensitivity. This method will be highly beneficial for researchers studying microbial gene regulation, pathways and secondary metabolites.

Implementation
MetaRon pipeline is developed and implemented in python 2.7. The application of MetaRon results in several tab delimited and fasta files containing detailed information about predicted genes, operons, gene and operon coordinates, clustering details as well as intermediate processing files that users might be interested in to look at. These outputs will be explained and discussed in detail in this and forthcoming sections.

Data Input
MetaRon is a flexible pipeline that accepts input in one of two forms: (1) unassembled data raw reads or (2) metagenomic scaffolds and gene prediction file (.gff file format).
MetaRon executes two type of workflows depending on the user input. The process parameter "ago" (Assembly, Gene prediction and Operon prediction) expects trimmed and quality controlled metagenomic reads and performs the required downstream data processing (Fig. 2). This includes de novo assembly via IDBA [40] and prediction of genes via Prodigal [41]. Alternatively, if the user inputs are assembled metagenomic scaffolds and gene prediction file (.gff file format), the user will specify the process parameter "op" (Operon Prediction). The downstream data processing steps of the process "ago" will be skipped in process "op", and the program starts data extraction on the provided scaffolds and gene prediction files, as shown in Fig. 2.

Feature extraction
Post de novo assembly and gene prediction, the workflow for both "ago" and "op" process is the same. The user also needs to specify the gene prediction tool. MetaRon accepts gene prediction output from Prodigal [41] and MetaGeneMark [42] as legitimate inputs for "op". However, upon selection of the process "ago", it will perform gene prediction via Prodigal only (Fig. 2). The

Proximon identification
MetaRon will use the information generated in the previous steps to calculate the intergenic distance (IGD) between co-directional gene clusters via IGD_calc(). Intergenic distance is by far the most common parameter used for the prediction of operons in whole-genomes [19,25,27,[43][44][45]. MetaRon keeps a flexible (< 601 bp) maximum threshold for Intergenic distance. The intergenic distance (IGD) between two genes is calculated as: Where,G1 and G2 are two adjacent co-directional genes, start (G2) refers to the beginning position of second gene in the pair on the genome; while end (G1) refers to the last nucleotide position of the first gene. Co-directional genes with intergenic distance of < 601 bases are clustered as proximons or candidate operons. This threshold is defined as a stretchy parameter due to extremely personalized and diversed definition of IGD in various bacterial species [24]. For the same reason, the proximons identified based on codirectionality and IGD may contain many false positives. The transcription unit boundary could not be accurately defined via the parameters as mentioned above.

Operon prediction
Neural Network Promoter Prediction 2.0 (NNPP) [46] is integrated in the module promoter_prediction(), to predict the upstream promoter for each of the genes in the codirectional closely packed gene clusters. Promoter predictions are parsed and organized via Promoter_file_parse(). Utilizing the outputs of previous modules, Prom_IGD_Clustering() clusters the co-directional genes by intergenic distance and presence of a promoter. At this point, an operon is defined as a cluster of two or more codirectional and closely packed genes with a promoter upstream of the first gene. As shown in Fig. 1, an operon starts with a promoter and ends with a terminator, however, the presence of the promoter for downstream gene could also signify the end of an operon.
Therefore, an operon is a gene cluster delimited by an upstream and downstream promoter signifying the start and end of the operon, respectively.
Unlike Prom_IGD_Clustering(), where co-directionality, IGD and presence of promoter were The implementation of MetaRon on test genomes was followed by application on wholegenome of Escherichia coli MG1655, simulated microbial genomes, Escherichia coli C20 draft genome and lastly on 145 whole-metagenomic samples from the human gut.

Data Analysis
Most of the operon prediction studies focused on efficient operon prediction and little attention to post prediction analysis. We carried out a comprehensive analysis of metagenomic operons, which mainly includes a comparative analysis of biosynthetic gene clusters (BGCs) of operonic sequences and whole-scaffolds as well as the differential pathway analysis of operonic gene clusters. All the 145 gut microbiome samples were divided into eight major group of individuals, based on occurrence of disease, gender and weight (Table 1).

Secondary metabolite identification
Secondary metabolites were identified from operonic sequences and complete scaftigs using anti SMASH (v3.0) (antibiotic and secondary metabolites analysis shell) [47]. The operonic sequences were extracted via operonic cluster coordinate file generated by MetaRon, while complete scaftig sequences were used. Only the top hit proteins per sequence were selected for further analysis. A comparative approach was devised to observe the abundance trend of secondary metabolites in operonic sequences as well as scaftigs for control and type 2 diabetic group of individuals.

Functional mapping and pathway analysis
Raw metagenomic reads were mapped to the operonic sequences using BOWTIE2 [48] and the ".sam" output was processed via SAMtools [49]. Processing includes the conversion of aligned raw read information from sam file format to bam and finally to fastq file format.
The raw metagenomic reads aligned to the operonic sequences were further analyzed for differential pathways via a standalone pipeline for functional analysis FMAP [50]. FMAP is a very convenient pipeline that integrates Metagenomic and Metatranscriptomic data and performs differential pathway analysis. Metagenomic raw reads were aligned to the UniRef100 [51] using DIAMOND [52] as the mapping mode. Mapping hits that qualified through the default FMAP settings (sequence identity = > 80%, e-value = > 1e-10) were taken through to the next step of differentially abundant pathways from controls to disease condition. Reads from the previous step were mapped to the KEGG Orthology (KO) database [53,54]. The mapped reads were normalized to the total number of paired-end reads. The protein gene ID was extracted from the UniProt database [55]. The normalized abundance for each sample was calculated as the number of reads aligned to a gene divided by total read count, followed by a summation of all the genes in the pathway. The pipeline also performs mapping of raw metagenomic reads to the UniRef100 [51] reference database using DIAMOND [52] and the estimation of gene abundance to identify the differentially abundant pathways and modules.

Results And Discussion
The dependency of previous whole-genome operon prediction methods on experimental and functional information and unavailability of such information in metagenomic data makes metagenonmic operon prediction a tricky task [44,[56][57][58][59][60][61][62]. We addressed these limitations via MetaRon, by accurately predicts metagenomic operons without the dependency on functional or experimental information.

MetaRon Implementation
Simulated Genomes We started with the implementation of MetaRon pipeline on E. coli K-12 MG1655 raw reads simulated via Next-Generation Simulator for Metagenomics (NeSSM) [63]. The microbe is considered as the gold standard for operons. This implementation serves as a litmus test for the performance. The simulation of E. coli genome produced around one million paired end reads of 100 bp length at 20X depth. MetaRon assembled the raw reads via IDBA [64] into 82 scaftigs. The scaftigs with length less than or equal to 500 bp were removed. The remaining scaftigs contains 4,227 genes that were predicted using prodigal [41]. In the first step, MetaRon identified 822 co-directional proximal gene clusters (IGD < 601 bp), containing 2,955 genes. These gene clusters were named as proximons, since they were identified based on direction and intergenic space, as defined by proximon proposition [65][66][67]. The proximon cluster length range from binary (2 genes) to 32 genes, with no proximons of length 17, 21, 23, 24, 26, 27, 28 and 29 (Fig. 3).
Out of 822 proximons, majority of the clusters (32.9%) are binary while 19.7% and 11.8% proximons are three and four genes long, respectively. The remaining 35.5% of proximons are longer than four genes (Fig. 4). Introduction of a structure defining feature clearly refined the results from proximons to operons. Many genes that that were a part of the proximon cluster are removed by adding the promoter parameter hence, the number of genes in each cluster reduced, leaving behind co-directional closely packed genes that are under the control of a single promoter. This means, an increase in the percentage of binary operons, three and four gene operons but a decrease in clusters with length more than four genes. The proportion of operons with length 2-4 increased to 78% as compared to 64.5% of proximon clusters (Fig. 4). At this point, it is imperative to highlight that no Transcription Unit Boundary (TUB) is definedin the proximal gene clusters. This means that a proximon or a candidate operon might enclosemore than one operon or nonoperonic genes. Therefore, we added upstream promoter as a more stringent and structure defining parameter to outline the transcription start and end for an operon (176) and 13.2% (110), respectively (Fig. 4).
These results corroborate with the fact that in E. coli genome, the majority of the operons have binary organization [72,73]. The percentage of binary gene clusters hold a significant role in accessing the operon predictions since, most of the operons in microbial genomes are binary [27]. An increase in the proportion of such operons in comparison with proximal gene clusters signifies the removal of false positives and improved sensitivity.
About 21.7% of operonic clusters have length ranging between five and sixteen (Fig. 4). In order to test its applicability and accuracy, MetaRon was also implemented on raw reads achieved sensitivity, specificity, and accuracy of 87%, 91%, and 88%, respectively [69,71]. Majority of operons (68%) discretely mapped to a single reference operon while 20% of predicted operons have over one hit with the reference. Twelve per cent operons demonstrated unique configuration that has less than 50% match with the reference (Fig. 5). This is expected due to the fact that similar genomes could demonstrate variable operonic settings in different conditions [74][75][76][77].
Since metagenome data does not have a complete reference on which the raw reads could be mapped, so it is assembled into multiple contigs/scaftigs, rather than in one wholegenome; hencemultiple operonic configurations were observed (Fig. 6). Unlike the proximon proposition, where the majority of the proximons were mapped to more than one operon in a subset fashion, 66% of the operons identified via MetaRon matched precisely to one reference operon as a perfect match. About 8% of the operons show a subset configuration (Fig. 7). A subset configuration refers to an exact match with one or more extra genes in the reference (Fig. 6). While 4% of the predicted operons displayedcontrariwise formation known as a superset configuration, i.e., an operonis longer than the reference operon by one or moregenes (Fig. 7). The subset formations could be due to the distribution of an operon between two scaftigs or different transcription unit boundary (Fig. 6). Furthermore, there were operons that encompasses more than one reference operon in an exact or partial match. Such operonic settings are named as bridge-1, while the other way around is named as bridge-2. About 5% of the above-mentioned bridge settings are observed in the predicted operons. Bridge configurations could be due to altered TUB or the inability of the promoter prediction tool to identify the promoter.
Metagenomic data from various conditions demonstrates new microbial functions under different levels of stress and environmental stimulus [24]. Many unique operonic organizations are likely to appear as a response to new environmental stimuli. This leads to the formation of new or modified operonic configurations such as subsets, supersets or unique operons. In the case of E. coli C20, 17% of predicted operons have less than 50% or no match with the reference (Fig. 7). Such unique organizations may well carry precious insights about the microbial activity for a particular environment regarding bacterial products and pathways [24]. Such insights at metagenomic scale could be valuable in understanding disease condition, its prevention and possibly the cure as well.
MetaRon was further implemented on whole-metagenomic raw reads from the gut of 145 Chinese individuals (74 Type 2 Diabetic (T2D), 71 controls) [39]. The two groups of individuals are further divided into four sub-groups in each category based on gender, weight and diabetic/non-diabetic (Table 1). MetaRon produced 3,868,389 operons containing 12,414,125 genes (Fig. 8). This makes up almost 50% of the total 23,280,123 genes. There could be similar operons in different samples, so for better organization and an operon catalogue was curated, which resulted in 1.23 million unique operons. The longest operon is 185 genes long. An average 61.3% of the predicted operons showed binary setting. The percentage was consistently high in all samples as shown in Fig. 8.

Prediction of secondary metabolites
Biosynthetic gene clusters (BGCs) were identified from the operons as well as wholemetagenome (Fig. 9). The idea was to explain the occurrence of the disease via secondary metabolites (SMs) and observe the extent of information operons hold in the whole metagenomic assembly. Figure 9 presents a holistic view of the secondary metabolites (SMs) predicted from the operonic sequences and the whole-metagenomes. It can also be observed that there is a notable change in the abundance of SMs from healthy to diabetic state (Fig. 10). Another important observation to highlight is the similarity in the patterns of SMs from operons and whole-metagenomic assembly (Fig. 10). The abundance of the SMs in whole-metagenome was higher than the operonic sequences, which is to be expected, however, the operonic sequences represent nearly the exact pattern as demonstrated via whole metagenomic assembly.  [78][79][80][81][82][83][84][85][86][87]. However, here we also report three pathways to have strong association with type 2 diabetes, namely, Maltose phosphorylase (K00691), 3-deoxy-D-glycero-D-galacto-nononate 9-phosphate synthase (K21279) and an uncharacterized protein (K07101). The Maltose phosphorylase catalyzes the phosphorylation process of maltose, resulting in the production of glucose 1-P and glucose. The pathway also overlaps with the glycan degradation [55]. The pathway has never been reported to have any association with T2D, however, glycogen phosphorylase pathway is consistently reported to have strong association with the disease [88, 89].
Further investigation could provide much clear insights into the role of maltose phosphorylase in the occurrence of T2D.

Conclusion
This study presents a convenient publicly available command line pipeline for the processing of Metagenomic data and operon prediction in shotgun sequencing data. A major advantage of MetaRon is that it identifies metagenomic operon without the need of any experimental or functional information, on which most of the previous whole-genome operon prediction methods were based upon. MetaRon is therefore the second pipeline that performs systemic identification of metagenomic operons and the first one to do so, without any prior functional or experimental information. Considering the complexity and incompleteness of metagenomic data, the pipeline predicts metagenomic operons with significantly high specificity. This is the first study to perform a detailed analysis of the metagenomic operons and explaining the occurrence of the disease from the operonic point of view. The analysis reveals a difference in the abundance of secondary metabolites and pathways from the operonic reads, which highlights the role of operons in the metagenomic data. This is also the foremost study to associate the abundance of operonic secondary metabolites with disease and health. Moreover, from the aspect of data management, we demonstrated that operons could also act as a subset to represent the whole-metagenomic sample. As verified through this research, with the extent of information stored in the predicted operons, MetaRon presents many opportunities.
MetaRon promises to be a useful pipeline in the identification of metagenomic operons and it is quite certain that more in-depth investigation, aided with wet-lab resources, could provide insightful findings about the diverse microbial biosphere. In this research, the analysis was performed separately on the MetaRon's predicted operons, however, in the future we plan to integrate the prediction of secondary metabolites, pathway annotation and graphical representation within the pipeline.

Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable

Availability of data and materials
Please contact author for data requests.

Competing Interests
The authors declare that they have no competing interests.   Figure 1 Conceptual structure of an operon.  The distribution of operonic and proximonic gene clusters by length.  Percentage of E. coli C20 operons mapped to one or more reference operons.

Figure 6
Various operonic configurations from a perfect match to a unique organization.

Figure 7
Percentage of operons falling in each operonic category.

Figure 8
The total number of predicted operons (light grey), the total number of genes present in the operons (dark grey) and the percentage of binary operons.  Abundant Secondary Metabolites (SMs) predicted from whole assembly as well as from operonic sequences only.

Figure 11
Secondary metabolites significantly abundant differentially.