The process of genome reannotation combined with detailed manual reviewing encompasses the re-identification and labeling of characteristic features of a sequenced genome, and is a process that has been performed extensively for numerous organisms, including bacteria [16, 17]. Unlike the human genome, which is about 1.3% protein coding, 90% of the bacterial genome codes for proteins, with only short intergenic stretches[18]. Precise genomic annotation is thus fundamental to the further interpretation of the biochemical and physiological characteristics of organisms, to provide detailed information on protein coding sequences, pseudogenes, non-coding RNAs, repeat sequences and various other genomic data [19]. In this study, we reannotated the genome of the E.coli strain ER2566 through a reannotation pipeline, as illustrated in Fig. 1, with high speed and accuracy.
Improvement in coding sequences (CDSs)
For the systematic reannotation of the CDSs, the prediction and identification of coding genes occurred in two stages. In the first stage, Prodigal software was used to predict a total of 4,180 CDSs on the complete ER2566 genome deposited in GenBank (accession number NC_CP014268.2). Using sequence alignment to the Swiss-Prot database [20] by blastp [21], with a threshold e-value of < 10− 6, all CDSs were artificially annotated to provide accurate information regarding the sequences and functions of the enrolled proteins. A total of 4,044 (97%) of the 4,180 CDSs were annotated as protein-coding genes, with the remaining CDSs (136 CDSs; 3%) marked as hypothetical genes, with no registration in the Swiss-prot database. To improve upon this prediction, three other well-established gene finders—GLIMMER, Zcurve and GeneMark—were independently used, identifying 4,231, 4,287, and 4,213 CDSs, respectively. These putative CDS sets were subsequently filtered by blastn against the first Prodigal-predicted CDS set. Overall, an additional 428 CDSs were found: 194 CDSs were identified using GLIMMER, 123 using GeneMark, and 201 using ZCURVE. These additional 428 CDSs were then searched against the Swiss-Prot database by blastp with a stricter threshold e-value < 10− 10, coverage > 80%, and an identity > 70%. This filtered out 402 of these additional CDSs, resulting in an additional 26 genes. This led to a total of 4,087 protein-coding CDSs (4,044 + 43) included in the reannotation of the ER2566 genome, along with 136 CDSs for hypothetical genes.
By comparing with RefSeq annotation, a total of 120 protein-coding CDSs were removed, as they were identified as either hypothetical proteins or pseudogenes, with most of them having no assigned function (Additional Files S1). Meanwhile, 65 new CDSs were added (Additional Files S2). The complete reannotation list can be found in Additional File S3.
Due to trimming or splitting, genes with real function can often be incorrectly assigned as pseudogenes through protein homology alignment. In our reannotation, we manually reviewed 194 pseudogenes from the RefSeq database annotation. In total, 144 of the 194 pseudogenes were directly identified as coding genes and are now found in the reannotated list, while the remaining pseudogenes without any function were removed. These newly identified protein-coding genes included 34 mobile genetic elements, a common type of genetic change considered to be important in evolution. For instance, a genomic positive strand region (3,296,328–3,297,025 bp), previously annotated as a pseudogene without function, was identified to be two genes, insA and insB, which are homologues of the insertion element protein, IS1, and related to DNA binding and transposase activity (Fig. 2a) [22]. In addition, two annotated pseudogenes (C2566-RS05300 and C2566_RS05310) and one hypothetical protein (C2566_RS22600) in the RefSeq annotation (range 1,088,980-1,094,720) were reannotated as three new genes (lacZ1, lacZ2, and ECBD_2906, respectively), and one related pseudogene, C2566_RS05305, was removed; these changes are consistent with previous results [5] (Fig. 2b). These three new genes are flanked by the lactose permease gene lacY upstream and the lactose operon repressor gene lacI downstream, both of which are essential to the lac operon system. This reannotation may have uncovered genes related to the transcription and translation of genes encoding the lactose operon repressor in ER2566 strain.
Overall, we determined 4,223 protein-coding genes for the ER2566 genome, including 136 CDSs labeled as hypothetical proteins. There are now no pseudogenes in the annotated data. This reannotation effectively eliminated the possibility of false interpretations introduced by the original annotation and provides a more integral view of the regulatory networks in ER2566 strain (Table 1).
Table 1
Overview of the differences between the original annotation and the reannotation
| Original annotation (NZ_CP014268.2) | Re-annotation |
Genome length | 4,478,958 bp |
Plasmids | None |
G + C% | 50.81% |
Genes(total) | 4,364 | 4,577 |
Protein_coding genes | 4,054 | 4,223 |
Pseudo Genes | 194 | 0 |
tRNAs | 85 | 87 |
rRNAs | 22 | 22 |
Miscellaneous RNAsa | 9 | 245 |
Backbone genes | 4,170(4,054 protein-coding genes,85 tRNA genes,22rRNAs and 9 misc RNAs) | 4,577(4,223 protein-coding genes,87 tRNA genes,22rRNAs and 245 misc RNAs) |
a: The concept of miscellaneous RNA includes ncRNA, tmRNA and all other ncRNAs. |
Miscellaneous RNA improvement
Miscellaneous RNA, such as transfer RNA (tRNA), ribosomal RNA, and other non-coding RNAs, play pivotal biological roles in cellular activity. To date, about 119 RNA molecules in the E.coli ER2566 strain have been identified, including 85 tRNAs, 22 rRNAs, and 12 ncRNAs[6]. However, almost all of these ncRNAs are missing from the original RefSeq annotation. We used Aragorn 1.2, Barrnap 0.9, and infernal 1.1 ncRNA finders independently to predict genes coding for tRNAs, rRNAs, and non-coding RNA. Compared with the auto-annotation, 2 tRNAs and 230 ncRNAs are new, with most having functions in translation, DNA replication, and expression regulation (Additional Files S4). In addition, most of the ncRNA functions have been verified experimentally previously. For instance, about 94 of the nucleoid-associated ncRNA molecules play key functions in DNA-RNA interactions [23]. In summary, our reannotation introduced 230 new ncRNAs and 2 new tRNAs, with an overall tally of genes encoding for 87 tRNAs, 22 rRNAs and 245 ncRNAs (Table 1).
Variant calling of whole-genome re-sequencing under consecutive subculture
Genomic variations in bacterial species usually reflect an evolutionary response that occurs under various external—usually unfavorable—environmental stressors. Thus, we next performed variant calling to identify any nucleotide-level differences (i.e., single nucleotide polymorphisms [SNPs], insertions and deletions [indels], and/or structural variations) in the ER2566 strain. There are two approaches for variant calling: by mapping reads against the reference genome directly or by assembling a de novo genome to compare against a reference genome. In most cases, mapping reads produces a better resolution for SNPs and indels than genome assembly, whereas the latter is optimal for identifying structural variants and regions with high divergence. Here, we used both methods to interrogate the ER2566 genome (Fig. 3).
First, we re-sequenced the whole genome of the ER2566 strain grown in our laboratory under consecutive subculturing. Two different-sized insert libraries (500 bp, 2000 bp) were built, and a total of 10.0 million paired-end reads of 90 bp in length were generated using an Illumina HiSEq. The raw reads were mapped to the C2566 reference genome (NC_CP014268.2) with a good coverage depth (> 100-times). No SNP or indel was detected. With the whole-genome assemblies, a more accurate, longer assembly pipeline was constructed. Various de novo assembly software were used to construct a confident and long (4,469,460-bp) scaffold, and variant calling was aligned to the reference genome. One structural variation was found, and was subsequently verified by Sanger sequencing to be a genome assembly issue caused by regions of high repetition. Overall, no SNP, indel or structural variation was found, suggesting that the E. coli ER2566 strain is relatively stable under continuous subculture.
Mutation detections by RNA-sequencing under overexpression
RNA-Seq is widely used in quantitative gene expression studies for the identification of non-annotated transcripts and polymorphisms, and for RNA editing in transcribed regions. Thus, to identify any variations in the ER2566 genome due to overexpression pressure, we used RNA-seq to analyze the transcriptomes of the ER2566 strain growing at 37 °C without plasmids (B37, three replicates) or overexpressing human papillomavirus 16 type capsid protein L1 via plasmid-based inducible expression (Y37, three replicates) (Fig. 4).From a total of 81.3 million 125-bp paired-end reads, 78.6 million reads (96.74%) were mapped to the reference genome in B37 (control) samples. Yet, for the Y37 (overexpressed) samples, among the 76.0 million reads, only 24.0 million (31.50%) reads mapped to the reference genome, which was significantly lower than that for the B37 samples. We presume that the large number of recombinant proteins in the Y37 samples hindered a comparison with the reference genome.
The variant detection analysis using BactSNP revealed one mutation in the Y37 samples (1,094,824 position; Fig. 4b), located within the non-coding region 19-bp downstream from the lacI gene. No SNPs were identified in the B37 samples. We next compared the base means for all genes between the B37 and Y37 samples (Additional Files S5). Interestingly, lacI is the highest transcribed gene in the Y37 samples, and the mutation located in the 3’ non-coding region may be relevant in the regulation of the transcription and translation of lacI gene encoding the transcriptional regulator protein.
Next, Sanger sequencing was used to assess the B37 and Y37 samples. However, the same mutation was not detected at the genomic level (Fig. 4c). From this, we infer that the mutation may be a key site involved in the transcription and translation of genes encoding the lactose operon repressor and also provides new insight into the regulation of the lactose operon. The function of this site should be the focus of future work, using a combination of gene editing, RNA methylation analysis, and other relevant techniques.