Determination of Transcriptional Start Sites (TSS)
Understanding how a gene is regulated is one of the most difficult challenges in genome analysis; identifying the TSS is therefore key to understanding gene expression. The transcription start site (TSS) is the first nucleotide of a DNA sequence at which transcription begins, and RNA polymerase binds the promoter upstream of this site. The online Neural Network Promoter Prediction (NNPP) version 2.20 tool was used to find the TSSs of the genes extracted from Pseudomonas spp., which are widely used for mercury bioremediation. The promoter region within 1 kb upstream of the TSS was characterized on the assumption that the functional elements of the promoter lie within this region. The predicted values for each coding sequence of the mer operon genes involved in mercury bioremediation are summarized in Table 2. The mer operon genes have between one and four TSSs each. Interestingly, six genes (merA, merB, merD, merE, merF, and merP) have the same number of TSSs (two), whereas merC has only one (Table 2). The present study shows that the promoter regions of almost all sequences have multiple TSSs, consistent with a genome-wide identification of TSSs, promoters, and TF binding sites in E. coli [14].
The TSSs were located at various distances from the start codon (Table 2). This suggests that the promoter sequence plays an important role in enhancing or hindering transcription initiation and gene regulation in response to environmental changes. Among the TSSs on the positive strand, merD, merG, and merR had the largest distances, in that order, while merB and merE had the largest distances among the TSSs on the negative strand. The majority of the TSSs of the mer operon genes were found on the negative strand, and only a few were on the positive strand. Knowledge of the TSS therefore has wide application in current gene prediction, including determination of gene function and structure, prediction of the promoter region, and analysis of gene regulation.
Table 2
TSS number, its promoter predictive score values, and distance from the start codon of mer-operon genes associated with mercury bioremediation
SN | Gene ID | Gene symbol | No. of predicted promoters | No. of TSSs identified | Predictive score (cutoff 0.80) | Distance from ATG | Orientation of complementary strand
1 | 69751970 | merA | 2 | 2 | 0.97, 0.91 | -929 | -ve
2 | 66762507 | merB | 2 | 2 | 0.85, 0.82 | -1951 | -ve
3 | 66762509 | merC | 1 | 1 | 0.85 | -686 | -ve
4 | 69747981 | merD | 2 | 2 | 0.93, 0.89 | 2921 | +ve
5 | 69751968 | merE | 2 | 2 | 0.86, 0.94 | -1361 | -ve
6 | 69751971 | merF | 2 | 2 | 0.97, 0.91 | -687 | -ve
7 | 69751974 | merR | 4 | 4 | 0.97, 0.89, 0.89, 0.86 | 865 | +ve
8 | 69751972 | merP | 2 | 2 | 0.97, 0.91 | -409 | -ve
9 | 69747978 | merT | 3 | 3 | 0.92, 0.93, 0.89 | 663 | +ve
10 | 46432416 | merG | 3 | 3 | 0.89, 0.94, 0.85 | 2217 | +ve
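The counts quoted in the text can be checked directly against Table 2. The following is a minimal sketch that transcribes the table by hand and tallies the number of TSSs per gene and the strand distribution. The names `TABLE_2`, `CUTOFF`, and `summarize` are illustrative and not part of NNPP, and the merT entry is read as three separate scores, which is an assumption about the printed value.

```python
# Table 2 transcribed by hand: (gene, NNPP scores, distance from ATG, strand).
# Assumption: the merT scores are read as three values (0.92, 0.93, 0.89).
TABLE_2 = [
    ("merA", [0.97, 0.91], -929, "-"),
    ("merB", [0.85, 0.82], -1951, "-"),
    ("merC", [0.85], -686, "-"),
    ("merD", [0.93, 0.89], 2921, "+"),
    ("merE", [0.86, 0.94], -1361, "-"),
    ("merF", [0.97, 0.91], -687, "-"),
    ("merR", [0.97, 0.89, 0.89, 0.86], 865, "+"),
    ("merP", [0.97, 0.91], -409, "-"),
    ("merT", [0.92, 0.93, 0.89], 663, "+"),
    ("merG", [0.89, 0.94, 0.85], 2217, "+"),
]
CUTOFF = 0.80  # score cutoff stated in the Table 2 header


def summarize(rows, cutoff=CUTOFF):
    """Return TSS counts per gene and the genes found on each strand."""
    tss_counts = {
        gene: sum(score >= cutoff for score in scores)
        for gene, scores, _distance, _strand in rows
    }
    strands = {"+": [], "-": []}
    for gene, _scores, _distance, strand in rows:
        strands[strand].append(gene)
    return tss_counts, strands


tss_counts, strands = summarize(TABLE_2)
two_tss = [gene for gene, n in tss_counts.items() if n == 2]
print("Genes with two TSSs:", two_tss)
print("Negative strand:", len(strands["-"]), "Positive strand:", len(strands["+"]))
```

Under these assumptions the tally gives six genes with two TSSs and six of the ten genes on the negative strand, in line with the description above.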
Determination of Common motifs and TFs
Five candidate motifs were predicted with the MEME algorithm, as shown in Table 3. The algorithm generated the five most promising candidate motifs from the ten imported gene sequences, each one thousand bases long. The predicted motifs and the proportion of promoters containing each common motif for the mer operon genes were evaluated. The data show that the best common motif (motif_1), which has the lowest e-value, has binding sites in 100% of the promoters. Among the predicted candidates, motif_1 has the lowest e-value (7.3e-074) and motif_5 the highest (7.2e-033). Motif_1 is therefore the most likely candidate and has more binding sites than the other candidate motifs. As presented in Table 3, two candidate motifs (motif_2 and motif_3) have the same number of binding sites and the same motif width but differ in their e-values. It could be hypothesized that the transcription factors binding these motifs play gene regulatory roles in the bioremediation of environmental mercury pollution by mercury (II) reductase in the case of the merA gene, organomercury lyase (merB), mercury transporter genes (merC, merE, merF, and merT), the transcription regulator (merR), and finally mercury-resistance genes (merF, merP, and merG), as revealed in Table 3. The motif patterns in the promoter region, which serve as the binding sites of the transcription factors, enhance gene regulation [15].
For the TSS, we checked the motif distribution from position +1 to position −1 kb upstream (Fig. 1). In the present analysis, 75% of the motifs lie on the positive strand and 25% on the negative strand, as presented in Fig. 11. Their positions are given relative to the transcriptional start site (ATG). Additionally, the data indicate that the common candidate motifs are densely distributed around the +1 kb region, while few are distributed around the −100 kb region. The relative location and spatial distribution of these motifs in the promoter regions were constructed with MEME, and logos of the common motifs were created. The logos show the characteristics of each motif column, with the height of a letter illustrating how frequently that nucleotide is expected to be observed at that position on the two strands (Fig. 2). It has been suggested that a motif found in a large number of promoter regions could provide a significant amount of information [16].
Table 3
List of predicted motifs and the number and proportion of promoter-containing motifs
S.N | Predicted candidate motif | No. of promoters containing the motif (%) | E-value | Motif width | No. of binding sites
1 | Motif_1 | 10 (100%) | 7.3e-074 | 50 | 10
2 | Motif_2 | 7 (70%) | 1.1e-046 | 50 | 7
3 | Motif_3 | 7 (70%) | 2.0e-048 | 50 | 7
4 | Motif_4 | 9 (90%) | 1.4e-046 | 50 | 9
5 | Motif_5 | 7 (70%) | 7.2e-033 | 41 | 7
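Because the e-values in Table 3 are written in scientific notation, it is easy to misread which motif is the most significant. A small sketch, using values copied from Table 3 and the illustrative names `TABLE_3` and `rank_by_evalue`, orders the motifs numerically; a smaller e-value indicates a more significant motif.

```python
# Table 3 transcribed by hand: motif -> (E-value as printed, width, binding sites).
TABLE_3 = {
    "Motif_1": ("7.3e-074", 50, 10),
    "Motif_2": ("1.1e-046", 50, 7),
    "Motif_3": ("2.0e-048", 50, 7),
    "Motif_4": ("1.4e-046", 50, 9),
    "Motif_5": ("7.2e-033", 41, 7),
}
N_PROMOTERS = 10  # ten promoter sequences were submitted to MEME


def rank_by_evalue(table):
    """Return motif names sorted from the smallest to the largest E-value."""
    return sorted(table, key=lambda motif: float(table[motif][0]))


ranked = rank_by_evalue(TABLE_3)
for motif in ranked:
    evalue, width, sites = TABLE_3[motif]
    share = 100 * sites / N_PROMOTERS
    print(f"{motif}: E-value {evalue}, width {width}, sites in {share:.0f}% of promoters")
```

Sorted this way, Motif_1 comes first (smallest e-value) and Motif_5 last (largest e-value).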
The candidate common motif motif_5 (e-value 7.2e-033), considered statistically and functionally significant, was imported into TOMTOM version 5.4.1 (https://meme-suite.org/meme/doc/tomtom-output-format.html), a publicly available tool that compares a query motif against databases of known regulatory motifs to predict the transcription factors that may bind it. TOMTOM provides logos representing the alignment of the known motifs with the candidate motif. The TOMTOM output includes links to the parent TF databases for more information, such as the activation, repression, and dual regulatory roles of the matched motifs. The TF databases also hold conformational information, such as monomer, dimer, tetramer, and unidentified, as well as other factors. The binding types associated with the databases were also predicted. As indicated in Table 3, motif_5 (e-value 7.2e-033) was statistically significant and matched 11 TFs from the 84 collected databases at an e-value threshold of 10 or less, as screened in the TOMTOM output. The forward and reverse strands of the statistically significant motif are depicted in Fig. 1.
Motifs have proved extremely useful in identifying genetic regulatory networks and interpreting specific gene activities. Regulatory motif discovery has advanced significantly owing to current computational capabilities, and it remains at the forefront of genomic investigations of bacteria employed in environmental remediation. In the present study, the identified candidate motif was densely distributed between +1 and −400 bp, sparsely distributed between −400 and −800 bp, and least distributed beyond −800 bp, as illustrated in Fig. 2. The motifs occurred on both the positive and negative strands, with the transcription start site as the reference. Only one candidate motif was found on the positive strand in the gene with gene ID 66762507. Approximately 75% and 25% of the candidate motifs were located on the positive and negative strands, respectively, so the majority were discovered on the positive strand. This variation in motif distribution results from differences in the nucleotide sequences of the identified genes.
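The windowed description of the motif distribution can be reproduced from a list of motif site positions. The sketch below is illustrative only: the individual site positions are not reported in the text, so `HYPOTHETICAL_SITES` is invented purely to demonstrate the binning and strand tally, and the function names are assumptions rather than part of MEME.

```python
from collections import Counter


def window_of(position):
    """Assign a site position (bp relative to the reference site) to a window."""
    if position > 1:
        raise ValueError("position lies downstream of +1")
    if position >= -400:
        return "+1 to -400"
    if position >= -800:
        return "-400 to -800"
    return "beyond -800"


def summarize_sites(sites):
    """Count sites per window and compute the percentage on each strand."""
    windows = Counter(window_of(position) for position, _strand in sites)
    strands = Counter(strand for _position, strand in sites)
    total = len(sites)
    strand_pct = {strand: 100 * n / total for strand, n in strands.items()}
    return windows, strand_pct


# Hypothetical (position, strand) pairs, NOT data from this study.
HYPOTHETICAL_SITES = [
    (-15, "+"), (-35, "+"), (-120, "+"), (-390, "-"),
    (-450, "+"), (-640, "+"), (-700, "-"), (-910, "+"),
]

windows, strand_pct = summarize_sites(HYPOTHETICAL_SITES)
print(dict(windows))
print(strand_pct)
```

With real MEME site coordinates in place of the hypothetical list, the same two summaries would give the per-window counts and the strand percentages discussed above. A site at exactly −400 is assigned to the first window here, which is a design choice of this sketch.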
Transcription factors are essential regulators of gene expression, determining where and to what extent genes are expressed. As observed in Table 4, eleven transcription factors matching the candidate motif were discovered, each with different regulatory activities. Of these, four [PhhR (90%), VqsM (7%), CcpA (1%), and Lrp (1%)] have activation roles to differing degrees. This study also revealed that only one TF, CtrA (9.09%), identified from C. crescentus, has a dual regulatory function, whereas two TFs, CRP and GlxR (18.18%), identified from Y. pestis and C. glutamicum, have repression functions. For the majority of the TFs (CodY, EspR, MatP and, to some extent, VqsM, Fur, Lrp, and CtrA), a role in activating transcription for mercury bioremediation has not yet been described; additional wet-lab research might therefore be needed in the future.
Table 4
Lists of matching candidates from EXPREG transcription factor (TF)
S.N | Candidate TF | Strain showing motif sequence binding | GC (%) | Regulatory elements: Activation (%) | Regulatory elements: Repression (%) | Regulatory elements: Dual (%) | Regulatory elements: Not specified (%) | Statistical significance
1 | CRP | Y. pestis | 46.88 | 0 | 100 | 0 | 0 | 2.11e+00
2 | PhhR | P. putida | 46.67 | 90 | 10 | 0 | 0 | 2.29e+00
3 | VqsM | P. aeruginosa | 59.33 | 7 | 0 | 0 | 92 | 3.43e+00
4 | CodY | B. anthracis | 20.41 | 0 | 0 | 0 | 100 | 3.99e+00
5 | Fur | P. syringae | 40.25 | 0 | 13 | 0 | 85 | 4.88e+00
6 | EspR | M. tuberculosis | 52.83 | 0 | 0 | 0 | 100 | 5.95e+00
7 | MatP | E. coli | 47.23 | 0 | 0 | 0 | 100 | 6.75e+00
8 | CcpA | C. difficile | 26.32 | 9 | 36 | 0 | 53 | 6.87e+00
9 | GlxR | C. glutamicum | 46.55 | 0 | 100 | 0 | 0 | 7.38e+00
10 | Lrp | E. coli | 40.00 | 1 | 1 | 0 | 97 | 7.91e+00
11 | CtrA | C. crescentus | 28.95 | 0 | 0 | 20 | 80 | 9.29e+00
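The shares quoted for the regulatory roles (for example 9.09% and 18.18%) are fractions of the eleven matched TFs. A short sketch, using values copied from Table 4 and illustrative names (`TABLE_4`, `share_of_tfs`), makes the arithmetic explicit. The grouping rules used here (any non-zero activation, any non-zero dual role, 100% repression) are an assumption about how the text groups the TFs.

```python
# Table 4 transcribed by hand:
# (TF, activation %, repression %, dual %, not specified %).
TABLE_4 = [
    ("CRP", 0, 100, 0, 0),
    ("PhhR", 90, 10, 0, 0),
    ("VqsM", 7, 0, 0, 92),
    ("CodY", 0, 0, 0, 100),
    ("Fur", 0, 13, 0, 85),
    ("EspR", 0, 0, 0, 100),
    ("MatP", 0, 0, 0, 100),
    ("CcpA", 9, 36, 0, 53),
    ("GlxR", 0, 100, 0, 0),
    ("Lrp", 1, 1, 0, 97),
    ("CtrA", 0, 0, 20, 80),
]


def share_of_tfs(names, table=TABLE_4):
    """Percentage of all matched TFs represented by the given names."""
    return round(100 * len(names) / len(table), 2)


# Assumed grouping rules, see the lead-in above.
activators = [tf for tf, act, _rep, _dual, _ns in TABLE_4 if act > 0]
dual_role = [tf for tf, _act, _rep, dual, _ns in TABLE_4 if dual > 0]
repressors = [tf for tf, _act, rep, _dual, _ns in TABLE_4 if rep == 100]

print("Activation:", activators)
print("Dual:", dual_role, share_of_tfs(dual_role), "%")
print("Repression:", repressors, share_of_tfs(repressors), "%")
```

Under these rules, one of eleven TFs corresponds to 9.09% and two of eleven to 18.18%.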
Transcription factors regulate sets of genes, and conformational factors and flexibility lead to an effective and selective assembly of co-regulatory proteins that regulate the target genes. This indicates that transient interactions between TFs and site-specific DNA sequences are common and important in a variety of biological functions. Accordingly, the conformational modes of the eleven candidate TFs for the mer genes employed in mercury bioremediation were studied. According to the current results, none of the candidate TFs was assigned a monomer, tetramer, or other conformational mode, as indicated in Table 5. Four of the discovered TF candidates (PhhR, Fur, EspR, and MatP) have a 100% dimer conformation, and one (Lrp) has a 96% dimer conformation, in co-regulating genes. The current investigation revealed that the conformational mode of about 54.54% of the identified candidate TFs was not specified (Table 5). The conformational flexibility of TF binding proteins maximizes gene regulatory efficiency.
Table 5
Lists of matching candidates from EXPREG transcription conformation factor (TCF)
S.N | Candidate TF | Strain showing motif sequence binding | GC (%) | TF conformation mode: Monomer (%) | TF conformation mode: Dimer (%) | TF conformation mode: Tetramer (%) | TF conformation mode: Other (%) | Not specified (%) | Statistical significance
1 | CRP | Y. pestis | 46.88 | 0 | 0 | 0 | 0 | 100 | 2.11e+00
2 | PhhR | P. putida | 46.67 | 0 | 100 | 0 | 0 | 0 | 2.29e+00
3 | VqsM | P. aeruginosa | 59.33 | 0 | 0 | 0 | 0 | 100 | 3.43e+00
4 | CodY | B. anthracis | 20.41 | 0 | 0 | 0 | 0 | 100 | 3.99e+00
5 | Fur | P. syringae | 40.25 | 0 | 100 | 0 | 0 | 0 | 4.88e+00
6 | EspR | M. tuberculosis | 52.83 | 0 | 100 | 0 | 0 | 0 | 5.95e+00
7 | MatP | E. coli | 47.23 | 0 | 100 | 0 | 0 | 0 | 6.75e+00
8 | CcpA | C. difficile | 26.32 | 0 | 0 | 0 | 0 | 100 | 6.87e+00
9 | GlxR | C. glutamicum | 46.55 | 0 | 0 | 0 | 0 | 100 | 7.38e+00
10 | Lrp | E. coli | 40.00 | 0 | 96 | 0 |  | 3 | 7.91e+00
11 | CtrA | C. crescentus | 28.95 | 0 | 0 | 0 | 0 | 100 | 9.29e+00
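The dimer and not-specified shares discussed for Table 5 can be recomputed in the same way. This is a sketch with hand-transcribed values; `TABLE_5` and the variable names are illustrative, and the empty "Other" cell for Lrp is kept as `None` rather than guessed.

```python
# Table 5 transcribed by hand:
# (TF, monomer %, dimer %, tetramer %, other %, not specified %).
# The "Other" cell for Lrp is empty in the table and is stored as None.
TABLE_5 = [
    ("CRP", 0, 0, 0, 0, 100),
    ("PhhR", 0, 100, 0, 0, 0),
    ("VqsM", 0, 0, 0, 0, 100),
    ("CodY", 0, 0, 0, 0, 100),
    ("Fur", 0, 100, 0, 0, 0),
    ("EspR", 0, 100, 0, 0, 0),
    ("MatP", 0, 100, 0, 0, 0),
    ("CcpA", 0, 0, 0, 0, 100),
    ("GlxR", 0, 0, 0, 0, 100),
    ("Lrp", 0, 96, 0, None, 3),
    ("CtrA", 0, 0, 0, 0, 100),
]

full_dimers = [row[0] for row in TABLE_5 if row[2] == 100]
partial_dimers = [row[0] for row in TABLE_5 if 0 < row[2] < 100]
unspecified = [row[0] for row in TABLE_5 if row[5] == 100]
unspecified_pct = 100 * len(unspecified) / len(TABLE_5)

print("100% dimer:", full_dimers)
print("Partly dimer:", partial_dimers)
print(f"Not specified: {len(unspecified)} of {len(TABLE_5)} TFs ({unspecified_pct:.2f}%)")
```

Six of the eleven TFs have no specified conformation, which is the fraction (6/11) behind the figure of about 54.5% given in the text.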
CpG islands are DNA methylation sites in promoter regions that act as gene regulation tools by silencing the associated gene during transcription. For this study, two approaches were used: the offline CLC Genome Workbench version 22.0.10 and online database search tools. Two regions (promoter and gene body) were analyzed in FASTA format: the sequence upstream of the TSS and the whole gene body sequence. With the online database search tools, the analysis revealed that CpG islands exist in approximately 30% of the gene bodies and 40% of the promoter regions. The gene bodies with gene IDs 46432416, 66762507, and 69751970 each contained one CpG island, unlike the other genes. Similarly, the promoter regions of 46432416, 69747978, 69751968, and 69751974 each contained one CpG island, as shown in Table 6. Further analysis of the CpG islands was done offline with CLC Genome Workbench version 22.0.10. In this second approach, the restriction enzyme MspI was used, which revealed the presence of CpG islands in both promoter regions and gene bodies, as shown in Table 7. MspI cut fragments of between 40 and 220 bp in the promoter regions rather than in the gene bodies (Table 7). In general, the nucleotide cutting positions were higher in the promoter regions than in the gene bodies. This indicates that CpG islands were poorer in the gene bodies than in the promoter regions.
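The MspI step can be illustrated with a small in silico digest. MspI recognizes CCGG and cuts after the first C. The sketch below is not the CLC Genome Workbench procedure: the function names and the toy sequence are assumptions, and the observed/expected CpG ratio shown is the commonly used formula (CpG count × length / (C count × G count)), which the text does not state was the criterion applied here.

```python
MSPI_SITE = "CCGG"  # MspI recognition site, cut as C^CGG


def mspi_fragments(sequence):
    """Split a sequence at every MspI site, cutting after the first C."""
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(MSPI_SITE)
    while start != -1:
        cuts.append(start + 1)  # cut position lies after the first C
        start = sequence.find(MSPI_SITE, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


def size_select(fragments, min_len=40, max_len=220):
    """Keep fragments within the 40 to 220 bp range mentioned in the text."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def cpg_obs_exp(sequence):
    """Observed/expected CpG ratio; returns 0.0 if C or G is absent."""
    sequence = sequence.upper()
    c_count, g_count = sequence.count("C"), sequence.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return sequence.count("CG") * len(sequence) / (c_count * g_count)


# Toy sequence for illustration only, not a mer operon sequence.
toy = "A" * 50 + "CCGG" + "AT" * 30 + "CCGG" + "A" * 10
fragments = mspi_fragments(toy)
print([len(f) for f in fragments])
print([len(f) for f in size_select(fragments)])
```

Applied to the real promoter and gene body FASTA sequences, the number of size-selected fragments per region would give a rough, tool-independent check of the comparison described above.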
Table 6
CpG islands Identified for both promoter and gene body regions
S.N | Gene ID | Gene body start | Gene body end | Gene body length | No. of CpG islands in gene body | Gene body GC% | Promoter start | Promoter end | Promoter length | No. of CpG islands in promoter region | Promoter GC%
1 | 46432416 | 8 | 631 | 624 | 1 | 57 | 1 | 974 | 974 | 1 | 66
2 | 66762507 | 1 | 631 | 631 | 1 | 50 | – | – | – | – | –
3 | 66762509 | – | – | – | – | – | – | – | – | – | –
4 | 69747978 | – | – | – | – | – | 1 | 971 | 971 | 1 | 50
5 | 69747981 | – | – | – | – | – | – | – | – | – | –
6 | 69751968 | – | – | – | – | – | 1 | 978 | 978 | 1 | 63
7 | 69751970 | 1 | 1639 | 1639 | 1 | 53 | – | – | – | – | –
8 | 69751971 | – | – | – | – | – | – | – | – | – | –
9 | 69751972 | – | – | – | – | – | – | – | – | – | –
10 | 69751974 | – | – | – | – | – | 1 | 979 | 979 | 1 | 50
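The 30% and 40% figures follow from counting the genes in Table 6 that carry a CpG island in each region. A minimal sketch with hand-transcribed counts is given below; `TABLE_6` and `pct_with_island` are illustrative names, and a dash in the table is treated as zero islands, which is an assumption.

```python
# Table 6 transcribed by hand: gene ID -> (gene-body islands, promoter islands).
# Assumption: a dash in the table means that no CpG island was found.
TABLE_6 = {
    "46432416": (1, 1),
    "66762507": (1, 0),
    "66762509": (0, 0),
    "69747978": (0, 1),
    "69747981": (0, 0),
    "69751968": (0, 1),
    "69751970": (1, 0),
    "69751971": (0, 0),
    "69751972": (0, 0),
    "69751974": (0, 1),
}


def pct_with_island(table, region_index):
    """Percentage of genes with at least one CpG island in the given region."""
    hits = sum(1 for counts in table.values() if counts[region_index] > 0)
    return 100 * hits / len(table)


gene_body_pct = pct_with_island(TABLE_6, 0)
promoter_pct = pct_with_island(TABLE_6, 1)
print(f"Gene body: {gene_body_pct:.0f}%  Promoter: {promoter_pct:.0f}%")
```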