Background: Bacteriophages (phages) are the most abundant and diverse biological entities on Earth, which makes it challenging to identify and annotate phage genomes efficiently on a large scale. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three proteins specific to tailed phages. Here, we develop a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three proteins in metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and extracts predictive features during model training. The cutoff loss value for each protein category was determined from the distribution of the loss values of the sequences within that category. Finally, we tested the efficacy of the framework on three real metagenomic datasets.
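As an illustration of the input representation described above, the following is a minimal sketch of one-hot encoding a protein sequence. The 20-letter alphabet, the fixed padding length, the handling of ambiguous residues, and the function name `one_hot_encode` are assumptions made for this example and are not taken from the DeephageTP implementation.

```python
# Hypothetical sketch: the alphabet, padding scheme, and function name are
# assumptions for illustration, not the DeephageTP code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix (list of lists) of 0/1 values.

    Sequences longer than max_len are truncated; shorter ones are padded
    with all-zero rows. Residues outside the alphabet (e.g. X) also map
    to all-zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence.upper()[:max_len]):
        col = INDEX.get(residue)
        if col is not None:
            matrix[pos][col] = 1
    return matrix


# Four residues (one ambiguous) padded to a fixed length of six.
encoded = one_hot_encode("MKTX", max_len=6)
```

Each row of the matrix corresponds to one sequence position, so a fixed `max_len` gives every sequence the same input shape, which a CNN with a fixed input layer would require.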
Result: The proposed multiclass CNN-based classification model was trained on the training datasets and shows relatively high prediction performance (Accuracy: Portal, 98.8%; TerL, 98.6%; TerS, 97.8%) for the three protein categories. Experiments on the independent mimic dataset demonstrate that the performance of the model can degrade as the data size increases. To address this issue, we determined and set a cutoff loss value for each of the three categories (TerL: -5.2; Portal: -4.2; TerS: -2.9). With these values, the model achieves high Precision in identifying TerL and Portal sequences (~94% and ~90%, respectively) in a mimic dataset that is 20 times larger than the training dataset. More interestingly, in the three real metagenomic datasets the framework identified many novel phage sequences that are not detectable by the two alignment-based methods (DIAMOND and HMMER).
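The per-category cutoffs reported above can be applied as a simple post-filter on the classifier output. The sketch below is illustrative only: the abstract does not state how the loss value is computed, so the example assumes that each prediction already carries a loss value and that a lower value indicates a more confident prediction. The function name `apply_cutoffs` and the example records are hypothetical.

```python
# Cutoffs as reported in the abstract. How the loss value is computed is not
# stated there, so the scores below are made-up illustrative numbers.
CUTOFFS = {"TerL": -5.2, "Portal": -4.2, "TerS": -2.9}


def apply_cutoffs(predictions, cutoffs=CUTOFFS):
    """Keep predictions whose loss value is at or below the class cutoff.

    predictions: iterable of (sequence_id, predicted_class, loss_value).
    Assumption: a lower loss value means a more confident prediction.
    Predictions for classes without a cutoff are discarded.
    """
    kept = []
    for seq_id, label, loss in predictions:
        cutoff = cutoffs.get(label)
        if cutoff is not None and loss <= cutoff:
            kept.append((seq_id, label, loss))
    return kept


example = [
    ("seq1", "TerL", -6.0),   # below the TerL cutoff: kept
    ("seq2", "TerL", -3.1),   # above the TerL cutoff: discarded
    ("seq3", "TerS", -2.9),   # exactly at the TerS cutoff: kept
    ("seq4", "Other", -9.0),  # hypothetical non-target class: discarded
]
filtered = apply_cutoffs(example)
```

Filtering in this way trades recall for Precision: fewer sequences are reported, but those that remain are the ones the model scores most confidently.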
Conclusion: Compared to conventional alignment-based methods, our proposed framework shows high performance in identifying phage-specific protein sequences, with a particular advantage in identifying novel protein sequences that have only remote homology to their known counterparts in public databases. Our method could also be applied to identify other protein sequences characterized by high complexity and low conservation. DeephageTP is available at https://github.com/chuym726/DeephageTP.