DeepEnzyPred: A Bi-Layered Deep Learning Framework for prediction of Bacteriophage Enzymes and their Sub-Hydrolases Enzymes via Novel Multi Level- Multi Thresholds Feature Selection technique

Background Bacteriophages, or phages, are viruses that replicate inside bacteria. A phage consists of genetic material surrounded by a protein structure. Bacteriophages play a vital role in phage therapy and genetic engineering. Phage and hydrolase enzyme proteins have a significant impact on the treatment of pathogenic bacterial infections and disease. Accurate identification of bacteriophage proteins is important for host subcellular localization, for further understanding of the interaction between phages and hydrolases, and for designing antibacterial drugs. Given the significance of bacteriophage proteins, several computational models have been developed alongside wet-laboratory methods. However, their performance has been limited by inefficient feature schemes, redundancy, noise, and the lack of an intelligent learning engine. We therefore developed an innovative bi-layered model named DeepEnzyPred. A hybrid feature vector was obtained via a novel Multi-Level Multi-Threshold subset feature selection (MLMT-SFS) algorithm. A two-dimensional convolutional neural network was adopted as the baseline classifier. We believe that the developed predictor will be a valuable resource for large-scale discrimination of unknown phage and hydrolase enzymes in particular, and for new antibacterial drug design in pharmaceutical companies in general.


Background
Bacteriophages are among the most widely recognized and diverse elements in the biosphere [1] [2]. They are also known as phages and are natural enemies of bacteria, while retaining specificity with respect to pathogenic bacteria and beneficial flora. Phages replicate within a bacterium by injecting their genome into its cytoplasm. It is estimated that there are more than 10^31 bacteriophages on the planet, more than every other organism on Earth, including bacteria, combined. Phage-encoded hydrolases are a key component of host cell lysis and help fight bacterial pathogens, especially those that cannot be killed by antibiotics and chemicals. Research studies have shown [3] [4] that phages have been used extensively to treat bacterial infections that do not respond to antibiotics [4]- [7].
Overconsumption of antibiotics is the most important factor leading to antibiotic resistance all over the world. Because of the abuse of antibiotics, some drug-resistant bacteria can no longer be controlled effectively. This problem can be addressed by phage hydrolase therapy, in which the host cell is broken down during the release of the phage progeny [10], [11]. Phages are also used as food safety tools to reduce bacterial contamination. As a result, there is an increasing demand in the public health domain for the rapid detection of phages and hydrolytic enzymes [8] [9].
Phage therapy has several advantages over antibiotic therapy: cost-effectiveness, easy isolation and purification from the living environment, abundance in the environment, strong specific effects on bacteria, and few side effects. As the study of phage hydrolase enzymes gained momentum, work on host cell lysis activated by hydrolases found that calcium can regulate phage-induced bacterial lysis [12]. The correct identification of hydrolytic enzymes encoded in phages has therefore become an important research topic. Although various wet-lab techniques, such as mass spectrometry, have been developed to annotate phage proteins from sequence data, these biochemical experimental techniques are expensive and time-consuming. Computational methods offer a faster and cheaper way to study and analyze phage hydrolase enzymes than biochemical methods [13], [14].
Phylogenetic analysis or similarity search can find relatively conserved motifs among related species [4], [8], [15], [16]. Phage open reading frames (ORFs), however, vary greatly, and for more than 70% of the phage sequences in GenBank no similar gene with an annotated function can be found [4], [5], [9].
With the accumulation of more post-genome data, several models have been developed to discriminate the functions of phage proteins. Reed et al. proposed a model to predict the three-dimensional structure of T-even-type phage tail fibers. Their findings were largely consistent with electron microscopy data [17]. Several computational models have also been developed for phage T7 [13], [18]- [21]. Recently, Feng et al. studied phage-encoded viral proteins using a naive Bayes algorithm combined with primary sequence information [16]. Using feature selection, they raised the overall accuracy (ACC) from 79.15% to 85.02%. More recently, Hong-Fi Li et al. [3] developed a computational model that combines multiple feature vectors. They obtained 85.1% accuracy, 88% specificity, and 83% sensitivity for the prediction of phage enzymes, and 94.3% accuracy, 93% specificity, and 96% sensitivity for the further discrimination of phage hydrolase enzymes. They used ANOVA as the feature selection strategy with an SVM as the baseline classifier.
Each successive predictor brought a significant improvement in different classification metrics, but these models still lack the classification power needed to correctly predict phage enzymes and phage-encoded hydrolases, which limits their generalization to unseen data. To fill this research gap, we developed a more robust and intelligent computational predictor in which the optimum features are obtained via a novel multi-threshold feature selection algorithm. The proposed model is implemented as a 2-dimensional convolutional neural network built with the Python Keras library. For model evaluation and generalization testing, rigorous 5-fold cross-validation was used. Empirical results show that the proposed technique outperforms existing phage enzyme prediction tools owing to (i) a new feature fusion scheme via feature selection, and (ii) a state-of-the-art deep learning algorithm. We believe that the proposed model will be a handy tool for biological research, academia, and the applied drug design industry, and will provide additional insight in the field of computational proteomics.
The main contributions presented in this article are as follows:
1) A new bi-layer computational method for identifying phage enzymes is proposed: the first layer discriminates phage enzymes from phage non-enzymes, and the second layer discriminates hydrolases from non-hydrolases among the phage enzymes. To the best of our knowledge, this is the first model devised in a bi-layered fashion. The proposed model is not only cheaper and computationally faster than wet-laboratory methods but also reliable.
2) An automatic feature extraction and selection scheme is proposed to obtain the most informative features from the base feature vector.
3) A novel Multi-Level Multi-Threshold subset feature selection (MLMT-SFS) model is proposed for selecting the best features for establishing a reliable model.
4) A two-dimensional convolutional neural network is used as the baseline classification algorithm.
The paper is organized as follows. Section 1 presents the background and the literature review. Section 2 describes the benchmark dataset, the feature extraction and selection techniques, and the evaluation metrics. Section 3 presents the results and discussion. Finally, Section 4 summarizes the paper, draws conclusions, and gives directions for future work.

Benchmark Dataset
Constructing a reliable benchmark dataset helps guarantee the reliability of the proposed computational model [22]- [27]. In this work, samples were taken from previous studies [3], [20], which rigorously screened them through the following three steps: (1) phage proteins were annotated by the standard operating procedure for UniProt manual curation (Swiss-Prot); (2) protein sequences containing illegal characters were deleted; (3) sequence identity in the dataset was required to be less than 30%, which was enforced with the CD-HIT software (Fu et al., 2012). Consequently, the final benchmark dataset contains 255 phage proteins, of which 124 are phage enzymes (positive samples of set 1) and the remaining 131 are phage non-enzymes (negative samples of set 1). Furthermore, the 124 phage enzymes are divided into 69 hydrolases (positive samples of set 2) and 55 non-hydrolases (negative samples of set 2). The following calculations are all based on these data.
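Screening step (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function name `drop_illegal` and the dictionary input format are assumptions, and "illegal characters" is taken to mean anything outside the 20 standard amino-acid letters. The redundancy reduction of step (3) is done by the external CD-HIT program and is not reproduced here.

```python
# Assumption: a legal sequence uses only the 20 standard amino-acid letters.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def drop_illegal(records):
    """Keep only sequences made entirely of standard residues.

    `records` maps a (hypothetical) sequence identifier to its sequence string.
    Sequences are upper-cased before checking; empty sequences are dropped.
    """
    cleaned = {}
    for seq_id, seq in records.items():
        seq = seq.strip().upper()
        if seq and set(seq) <= STANDARD_AA:
            cleaned[seq_id] = seq
    return cleaned


if __name__ == "__main__":
    demo = {"p1": "MKTAYIAK", "p2": "MKXBZ", "p3": "mkta"}
    print(drop_illegal(demo))
```

Sequences with ambiguity codes such as X, B, or Z are discarded here rather than repaired, which matches the deletion described in step (2).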

CTD (Composition, Transition, Distribution)
Global features hold decisive information and contribute effectively to the prediction performance of a predictor. Considering this, we use the Composition, Transition, and Distribution (CTD) algorithm [28]. In this approach, the 20 amino acids are split into three groups (such as polar, neutral, and hydrophobic) for each of seven physicochemical properties: polarity, solvent accessibility, charge, polarizability, van der Waals volume, secondary structure, and hydrophobicity. The complete list is given in Table 1. The Composition (C) defines the global percent composition of each group and is given by:

$$C_i = \frac{n_i}{L} \times 100, \quad i \in \{1, 2, 3\} \tag{1}$$

Here $n_i$ is the number of amino acids belonging to group $i$ in a protein sequence of length $L$. The Transition (T) describes the percent frequency with which amino acids of one group are followed by amino acids of another group and is calculated by Eq. 2:

$$T_{ij} = \frac{n_{ij} + n_{ji}}{L - 1} \times 100, \quad i, j \in \{1, 2, 3\},\ i < j \tag{2}$$

Here $i$ and $j$ index the groups and $n_{ij}$ is the number of dipeptides of the form (group $i$, group $j$). The Distribution (D) gives the positions of the first, 25%, 50%, 75% and 100% of the amino acids in a group, which is described as:

$$D_i = \left( \frac{P_{i,1}}{L}, \frac{P_{i,2}}{L}, \frac{P_{i,3}}{L}, \frac{P_{i,4}}{L}, \frac{P_{i,5}}{L} \right) \times 100, \quad i \in \{1, 2, 3\} \tag{3}$$

Here $P_{i,1}$, $P_{i,2}$, $P_{i,3}$, $P_{i,4}$ and $P_{i,5}$ are the chain lengths at which the first, 25%, 50%, 75% and 100% of the amino acids of group $i$ are located, respectively. According to Table 1, a 147-dimensional CTD feature vector is generated for each protein sequence (7 properties, each contributing 3 composition, 3 transition, and 15 distribution descriptors).
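The CTD descriptors for a single physicochemical property can be sketched as follows. This is an illustrative sketch, not the authors' code: the three-group split shown is the commonly used hydrophobicity grouping and is an assumption here, since Table 1 is what defines the groups actually used; the function names are hypothetical. Repeating the computation for all seven properties and concatenating the results would give 7 × 21 = 147 features.

```python
import math

# Assumption: commonly used hydrophobicity grouping; Table 1 is authoritative.
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}  # polar, neutral, hydrophobic


def encode(seq):
    """Replace every residue by the label of its group ('1', '2' or '3')."""
    lookup = {aa: label for label, residues in GROUPS.items() for aa in residues}
    return "".join(lookup[aa] for aa in seq.upper())


def ctd_one_property(seq):
    """Return 3 composition + 3 transition + 15 distribution values (percent)."""
    enc = encode(seq)
    length = len(enc)

    # Composition: percentage of residues in each group.
    comp = [100.0 * enc.count(g) / length for g in "123"]

    # Transition: percentage of adjacent pairs that switch between two groups.
    trans = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        n_pairs = sum(1 for x, y in zip(enc, enc[1:]) if {x, y} == {a, b})
        trans.append(100.0 * n_pairs / (length - 1) if length > 1 else 0.0)

    # Distribution: relative position of the first, 25%, 50%, 75%, 100% residue.
    dist = []
    for g in "123":
        positions = [i + 1 for i, c in enumerate(enc) if c == g]
        if not positions:
            dist.extend([0.0] * 5)
            continue
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, math.ceil(frac * len(positions)))
            dist.append(100.0 * positions[k - 1] / length)

    return comp + trans + dist


if __name__ == "__main__":
    features = ctd_one_property("RKGAC")
    print(len(features), features[:6])
```

The rounding rule used to locate the 25%, 50%, and 75% residues (ceiling of the fractional count) is one of several conventions in use and is also an assumption.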

Feature Selection
To remove redundant and noisy features and make the feature space better suited to predicting unseen data, a feature selection strategy is employed. Several research studies [24], [27], [32]- [34] show that feature selection plays a prominent role in building a reliable computational model. A well-chosen feature subset protects the model from the curse of dimensionality, reduces overfitting, shortens training time, and improves generalizability. A new feature selection scheme is discussed in the following section.

Multi-Level Multi Threshold Feature Subset Selection
One challenging task in computational biology is the formulation of a biological sequence as a mathematical representation, because machine learning models cannot process raw biological sequences [25]- [27], [35]- [37]. Protein sequences are therefore converted into a compact, discrete numerical representation. Studies have shown that the extracted primitive feature vector often contains redundant, vague, and irrelevant features, which not only mislead the classifier but also cause the curse of dimensionality and eventually lead to overfitting or underfitting. Various studies [38]- [40] indicate that a single feature representation fails to capture the significant information concealed in a protein sequence. To cope with this, the concept of fusion is used, in which a hybrid feature vector is obtained from multiple selected feature sets. To reduce the risk of the curse of dimensionality and of model overfitting [39], [41]- [46], we implemented a novel Multi-Level Multi-Threshold subset feature selection (MLMT-SFS) method that keeps only the most favorable features for building the model. MLMT-SFS operates over a set of three threshold values, i.e. 0.03, 0.05, and 0.07, and distills the feature space iteratively by applying these thresholds in a cascade. A score function first calculates a weight for each feature, and only those features whose score is greater than the 0.03 threshold are kept. The resulting feature space is then passed through the 0.05 and 0.07 thresholds in turn to obtain the final feature space. The schematic diagram of the proposed model is given in Figure 1.
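The cascade described above can be sketched in plain Python. This is a hedged illustration, not the published MLMT-SFS implementation: the text does not specify the score function, so the sketch assumes a hypothetical one (the absolute Pearson correlation of each feature with the class label, normalized so that the scores of the surviving features sum to 1 and re-computed at every level). The function names and the rule of stopping when a level would remove every feature are likewise assumptions.

```python
import math


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; 0.0 when either vector is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def weight_scores(X, y, cols):
    """Hypothetical score: normalized |correlation| over the surviving columns."""
    raw = {j: abs_pearson([row[j] for row in X], y) for j in cols}
    total = sum(raw.values())
    return {j: (raw[j] / total if total else 0.0) for j in cols}


def mlmt_sfs(X, y, thresholds=(0.03, 0.05, 0.07)):
    """Apply the thresholds in a cascade and return surviving column indices."""
    cols = list(range(len(X[0])))
    for t in thresholds:
        scores = weight_scores(X, y, cols)  # re-scored on the reduced space
        kept = [j for j in cols if scores[j] > t]
        if not kept:  # assumption: never return an empty feature set
            break
        cols = kept
    return cols


if __name__ == "__main__":
    y = [0, 1, 0, 1, 0, 1]
    X = [[0, 5, 0], [1, 5, 0], [0, 5, 1], [1, 5, 1], [0, 5, 0], [1, 5, 0]]
    print(mlmt_sfs(X, y))
```

Because the scores are re-normalized on the reduced feature space at each level, the three thresholds act as successive filters rather than collapsing into a single cut at 0.07.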

Model Architecture
The proposed methodology has been simulated with different classification algorithms, i.e., Multilayer Perceptron (MLP) [25], Support Vector Machine (SVM) [27], Decision Tree (DT), Random Forest (RF) [47][48] [49], and 2D-CNN. Deep learning has attracted considerable attention for its applications in computational genomics and proteomics [50]- [53]. We built the 2D-CNN model with the Keras framework (http://www.keras.io). For the selected hybrid feature space, the following tuned parameters were used to build the convolutional neural network model. The first block was instantiated with a 2-dimensional (2D) convolutional layer of 32 filters, a kernel shape of (3,3), 2D zero padding, an input shape of (1,84,84), and the 'relu' activation function. A 2D max-pooling layer with a consistent stride shape of (2,2) and dim ordering 'th', together with ZeroPadding2D of shape (1,1), was used with each following convolutional layer. The second 2D convolutional layer used 64 filters with a (3,3) kernel and 'relu' activation, and the third used 128 filters with a (3,3) kernel and 'relu' activation.
Finally, a flatten layer, a dense layer of size 128, and an output layer with nb_classes of 2 and the 'sigmoid' squashing activation were used. Binary cross-entropy was used as the loss function. Adadelta gave the best results as the optimizer. To prevent the model from overfitting and underfitting, a dropout of 0.3 was used as the regularization technique.
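To make the tensor sizes in this architecture concrete, the following sketch traces the feature-map shapes through the three convolution blocks in plain Python, without requiring Keras. It rests on assumptions that the text does not state explicitly: each block is taken to be zero padding of 1, then a stride-1 (3,3) convolution without further padding, then (2,2) max pooling; the function names are hypothetical.

```python
def conv_out(size, kernel=3, pad=1):
    """Spatial size after zero padding and a stride-1 convolution."""
    return size + 2 * pad - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling."""
    return size // pool


def trace_shapes(input_shape=(1, 84, 84), filters=(32, 64, 128)):
    """Return (layer name, channels-first shape) pairs for the assumed stack."""
    channels, height, width = input_shape
    shapes = [("input", (channels, height, width))]
    for index, n_filters in enumerate(filters, start=1):
        height, width = conv_out(height), conv_out(width)
        shapes.append((f"conv{index}", (n_filters, height, width)))
        height, width = pool_out(height), pool_out(width)
        shapes.append((f"pool{index}", (n_filters, height, width)))
        channels = n_filters
    shapes.append(("flatten", (channels * height * width,)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(f"{name:8s} {shape}")
```

Under these assumptions the 84 × 84 input shrinks to 42, 21, and 10 pixels per side after the three pooling steps, so the flatten layer feeds 128 × 10 × 10 = 12800 values into the dense layer.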

Cross validation
The model is evaluated with rigorous cross-validation (CV) in order to find the parameters that generalize best [26], [54]. Different types of CV, such as the jackknife test, subsampling (K-fold), and the independent dataset test, are commonly employed. Among these, the jackknife test always yields a unique result for a given dataset. To assess the efficiency of the novel predictor, we used 5-fold cross-validation.
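A 5-fold split of the layer-1 benchmark (124 enzymes, 131 non-enzymes) can be sketched as below. The sketch is illustrative only: stratifying the folds by class, the fixed random seed, and the function name `stratified_kfold` are assumptions, since the text specifies only that 5 folds were used.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for k class-balanced folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f in range(k) if f != held_out for i in folds[f])
        yield train, test


if __name__ == "__main__":
    labels = [1] * 124 + [0] * 131  # layer-1 class sizes from the benchmark
    for train, test in stratified_kfold(labels):
        positives = sum(labels[i] for i in test)
        print(len(train), len(test), positives)
```

Each sample appears in exactly one test fold, so averaging the per-fold metrics uses every protein once for testing.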

Model evaluation Metrics
To measure the quality of a model, two things are considered: i) quantitative metrics and ii) the cross-validation test (further discussed in Section 2.11) [55]. Different statistical evaluation metrics were used to quantify the robustness and authenticity of the model [56], [57]. Accuracy is a well-known metric of classifier correctness; sensitivity (recall) measures the true positive rate; specificity measures the true negative rate; MCC measures the stability of the model; and the ROC curve measures the overall performance of the model relative to random guessing. The F-measure is the harmonic mean of precision and sensitivity (recall) and ranges between 0 and 1. Cohen's kappa statistic measures inter-observer agreement, i.e. the reliability of the system. Average precision is a single-value metric for evaluating the model's predictions.
In Eq. 9, Sn refers to sensitivity, Sp to specificity, Acc to accuracy, and MCC to the Matthews correlation coefficient. We also incorporated a few further metrics, Log-loss (LL), Gini index (GI), and Normalized Gini index (NGI), to assess the effect of class imbalance. Log-loss values approaching zero indicate an optimum model. The Gini index is the area below the line of perfect equality minus the area below the Lorenz curve, divided by the area below the line of perfect equality. A lower Gini index indicates a more even distribution of the model's predictions across the target classes. In Eq. 4, $N^{+}$ represents the total number of positive samples investigated and $N^{-}$ the total number of negative samples investigated, whereas $N^{+}_{-}$ and $N^{-}_{+}$ denote the number of negative samples incorrectly predicted as positive and the number of positive samples incorrectly predicted as negative, respectively.
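The threshold-based metrics named in this section can be computed from the four confusion-matrix counts, as in the sketch below. It uses the textbook definitions in terms of true/false positives and negatives rather than the paper's own notation, and the function names are illustrative assumptions; ROC, average precision, and the Gini indices need ranked scores and are not covered.

```python
import math


def _ratio(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Sn, Sp, Acc, precision, F-measure, MCC and Cohen's kappa from counts."""
    total = tp + tn + fp + fn
    sn = _ratio(tp, tp + fn)            # sensitivity / recall
    sp = _ratio(tn, tn + fp)            # specificity
    acc = _ratio(tp + tn, total)
    precision = _ratio(tp, tp + fp)
    f_measure = _ratio(2 * precision * sn, precision + sn)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance = _ratio((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), total ** 2)
    kappa = _ratio(acc - chance, 1 - chance)
    return {"Sn": sn, "Sp": sp, "Acc": acc, "Precision": precision,
            "F": f_measure, "MCC": mcc, "Kappa": kappa}


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


if __name__ == "__main__":
    print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
    print(log_loss([1, 0], [0.9, 0.2]))
```

The counts in the usage example are made-up numbers for demonstration and are not results from the paper.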

Simulation analysis over primitive feature space
The proposed model was simulated over each individual feature space, with the results tabulated per feature space. In Layer 2, the proposed model obtained 91.32% accuracy, 60.65% sensitivity, and 95.82% specificity over the CTD feature space, and 93.82% accuracy, 73% sensitivity, and 96.92% specificity over the KSGPAAC feature space. The detailed Layer-2 results for the discrimination of phage hydrolase enzymes are given in Table 3, Table 4, and Table 5. The corresponding train and test loss of the model is given in Figure 2. ROC curves were calculated to assess the robustness and authenticity of the underlying model and are shown in Figure 3 and Figure 4.

Comparative Analysis
The prime objective of building any computational predictive model is generalization power: the model should perform well on unseen data and should not be susceptible to overfitting or random classification. The results of the proposed model compare favorably with the existing predictors of Ding et al. [20] and Hong et al. [3]. The proposed approach outperformed the existing approaches on all evaluation metrics. The detailed results are shown in Table 6.

Webserver and User Guide
Many studies in the field of computational biology and bioinformatics indicate the importance of developing a user-friendly, publicly accessible web server [58][22]- [25], [27], [47], [48], [59], [60]. Further, a web server stimulates insight and indicates future directions for both academics and experimental scientists by supporting various kinds of biological (medical) computational analysis and reporting. For the convenience of end users and experimental biologists, we provide a publicly accessible web server from which users can obtain the required results without going through the technical and mathematical details; it can be accessed via http://2clphageenzyme.pythonanywhere.com/. Figure 5 shows the index page of the developed web server.

Conclusion
In this research, we developed a novel sequence-based automated predictor for phage enzymes and hydrolase enzymes, called DeepEnzyPred. Simulation outcomes with a training dataset and an independent validation dataset revealed the efficacy of the proposed model. The good performance of DeepEnzyPred is due to several factors, i.e. an innovative feature selection algorithm and the careful construction of the prediction model through the tuned 2D-CNN classifier. We believe that the proposed work will provide insight into the further prediction of the characteristics and functionalities of phage enzymes. Many studies in the field of computational biology and bioinformatics indicate the importance and significance of developing a user-friendly, publicly accessible web server [58]. For the convenience of end users and academic researchers, we have established a robust and intelligent web server for our proposed method, which can be accessed via http://deepenzypred.pythonanywhere.com. For the reproduction of the proposed methodology, all the source code and the dataset can be accessed via https://github.com/zaheerkhancs/DeepPhageEnzyme.

Ethics approval and consent to participate
This study does not involve human participants or animals.

Consent for Publication
All authors are aware of and consent to this publication; no consent is required from any other party, member, or person.