Protein Function Prediction with Deep Neural Learning

Background: The function of a protein is directly related to its structure and plays a pivotal role in the entire life process. The protein interaction network controls almost all cellular processes and fulfils most biological functions. Protein function prediction can be regarded as a multi-label classification problem that aims to close the gap between the huge number of protein sequences and known functions. It is both a key issue in related research fields and a long-standing challenge. Most existing studies of protein function prediction with a Deep Neural Network (DNN) work on small-scale protein data sets annotated with Gene Ontology (GO) terms, and they usually mine relationships between protein features and function labels. Useful prediction approaches for large-scale protein data still need further study. Methods: This paper proposes a protein function prediction approach, IGP-DNN, that combines the Grasshopper Optimization Algorithm (GOA), Intuitionistic Fuzzy c-Means (IFCM), Kernel Principal Component Analysis (KPCA) and a DNN. The features of protein function modules are extracted by combining GOA and IFCM. KPCA is used to reduce the dimensionality of the protein property features. Both kinds of features are integrated to enrich the feature information and the integrated


INTRODUCTION
Protein function prediction is a multi-label classification problem that aims to close the gap between the large number of protein sequences and known functions. It is a challenging research direction in biology and plays an important role in understanding the organization and functions of biological systems [20]. Traditional biological experiments predict protein functions by extracting useful information from protein sequences, but these approaches are slow and costly. In contrast, computational methods are widely used in protein function prediction because of their low cost and ease of implementation [2].

Vector construction of protein features
The protein feature vector is built with IFCM-GOA and KPCA. IFCM-GOA extracts the function module features, and KPCA reduces the dimensionality of the protein property features. The dimension-reduced and standardized features are the input to the DNN model.
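The construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the RBF kernel, the `gamma` value, the number of retained components and the function names (`rbf_kernel_pca`, `standardize`, `build_feature_matrix`) are not given in this section and are chosen here for the example. The module features are assumed to be already available as a matrix with one row per protein.

```python
import numpy as np

def rbf_kernel_pca(X, n_components, gamma):
    """Project the rows of X onto the leading components in RBF-kernel space."""
    # Pairwise squared Euclidean distances between samples.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    K = np.exp(-gamma * np.maximum(sq_dists, 0.0))
    # Center the kernel matrix in feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    idx = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    # Training-sample projections: unit eigenvectors scaled by sqrt(eigenvalue).
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

def standardize(F):
    """Scale every column to zero mean and unit variance."""
    mean = F.mean(axis=0)
    std = F.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant columns at zero after centering
    return (F - mean) / std

def build_feature_matrix(module_features, property_features,
                         n_components, gamma=0.1):
    """Concatenate module features with KPCA-reduced property features."""
    reduced = rbf_kernel_pca(property_features, n_components, gamma)
    return standardize(np.hstack([module_features, reduced]))
```

With 6 proteins, 4 module features and 10 property features reduced to 3 components, `build_feature_matrix` returns a 6 x 7 matrix whose columns have zero mean.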

IFCM-GOA
IFCM-GOA exploits the way the grasshopper optimization algorithm (GOA) balances exploration and exploitation to search for the best cluster centers. Given the best cluster centers, intuitionistic fuzzy c-means (IFCM) clustering calculates the intuitionistic fuzzy membership matrix, and the clustering result is obtained by partitioning this matrix.
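The last two steps, computing a membership matrix from given cluster centers and partitioning it, can be illustrated as follows. This is a simplified sketch, not the paper's method: the GOA search is omitted (the centers are assumed to be already found), and the standard fuzzy c-means membership formula is used in place of the intuitionistic variant, so the hesitation degree is not modeled. The fuzzifier `m = 2` and the function names are assumptions.

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Standard FCM memberships: u_ij is proportional to d_ij^(-2/(m-1))."""
    # Distance of every sample to every center, shape (n_samples, n_clusters).
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard against a sample sitting on a center
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1

def partition(U):
    """Assign every sample to the cluster with the largest membership."""
    return np.argmax(U, axis=1)

# Toy example: two well-separated groups and their (assumed known) centers.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
U = fuzzy_memberships(X, centers)
labels = partition(U)  # -> array([0, 0, 1, 1])
```

Taking the arg-max of each row is one simple way to divide the membership matrix. A threshold on the memberships would instead allow a protein to belong to several modules.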
The prediction model transforms protein function prediction into a multi-label binary classification problem in which the samples are proteins and the sample labels are function module terms.
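In this formulation every protein receives one binary indicator per function module term. A minimal sketch of building such a label matrix is shown below. The input format (a list of module indices per protein) and the name `build_label_matrix` are assumptions made for illustration.

```python
import numpy as np

def build_label_matrix(protein_modules, n_modules):
    """Return Y with Y[i, j] = 1 if protein i is annotated with module term j."""
    Y = np.zeros((len(protein_modules), n_modules), dtype=int)
    for i, modules in enumerate(protein_modules):
        for j in modules:
            Y[i, j] = 1
    return Y

# Three proteins and three module terms; the third protein has no known term.
Y = build_label_matrix([[0, 2], [1], []], n_modules=3)
# Y == [[1, 0, 1],
#       [0, 1, 0],
#       [0, 0, 0]]
```

Each column of `Y` is then a separate binary classification target, and the model predicts all columns jointly.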

Protein function prediction
Deep learning solves regression and classification problems by extracting feature representations of the input data at different levels of abstraction. In this paper, a DNN model, IGP-DNN, is built to predict protein function. The input of IGP-DNN is the standardized feature matrix X' = (x'_ij) of the proteins with known functions in the protein-protein interaction (PPI) network, and the output is the label matrix of the corresponding protein function modules.
The performance of protein function prediction with a DNN is affected by many factors. The tanh function, shown in formula (5), is selected as the activation function of the output layer.
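Formula (5) is not reproduced in this section. For reference, the standard definition of the hyperbolic tangent, which is presumably what that formula states, is:

```latex
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \tanh(x) \in (-1, 1)
```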
The loss function is the cross-entropy loss, which is derived from maximum likelihood estimation. It is given in formula (6).
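Formula (6) is not reproduced in this section. A common form of the cross-entropy loss for multi-label problems, given here as an assumption about what the formula looks like, averages the binary cross-entropy over all proteins and labels. The sketch expects predicted probabilities in (0, 1). The clipping constant `eps` and the function name are illustrative.

```python
import numpy as np

def multilabel_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over all (protein, label) pairs."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to keep log() finite when a probability is exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_label = -(y_true * np.log(y_prob)
                  + (1.0 - y_true) * np.log(1.0 - y_prob))
    return per_label.mean()

# An uninformative prediction of 0.5 for every label costs ln(2) per label.
loss = multilabel_cross_entropy([[1, 0]], [[0.5, 0.5]])
```

Minimizing this quantity is equivalent to maximizing the Bernoulli likelihood of the observed labels, which is the connection to maximum likelihood estimation mentioned above.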
An adaptive learning algorithm is selected as the optimizer for the experiments, with β1 = 0.9, β2 = 0.999 and ε = 10^-8. IGP-DNN is trained in batches whose size is 20% of the number of proteins in the training set. The number of training epochs is 200. The regularization coefficient is 0.0005 and the dropout rate is 30%. IGP-DNN predicts the probability that an unknown protein has a function and selects the K annotation terms. The deep learning framework is TensorFlow 1.8. According to IFCM-GOA, the function modules obtained with 410, 130 and 250 clusters are selected as the function module features.
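The parameter names β1, β2 and ε match those of the Adam optimizer, so the sketch below assumes Adam is the adaptive algorithm meant here. It shows one Adam update with the stated values, the batch size derived from the training set size, and a top-K selection. The learning rate `lr` is not stated in this section and is a placeholder. Ranking terms by highest predicted probability is also an assumption, and all function names are illustrative.

```python
import numpy as np

BETA1, BETA2, EPSILON = 0.9, 0.999, 1e-8  # values stated in the text

def adam_step(param, grad, m, v, t, lr=0.001):
    """One Adam update; t is the 1-based step count, lr is an assumed value."""
    m = BETA1 * m + (1.0 - BETA1) * grad       # first-moment estimate
    v = BETA2 * v + (1.0 - BETA2) * grad ** 2  # second-moment estimate
    m_hat = m / (1.0 - BETA1 ** t)             # bias correction
    v_hat = v / (1.0 - BETA2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + EPSILON)
    return param, m, v

def batch_size_for(n_train, fraction=0.2):
    """Batch size as a fraction (20% in the text) of the training set size."""
    return max(1, int(round(n_train * fraction)))

def top_k_terms(probabilities, k):
    """Indices of the k terms with the highest predicted probability."""
    order = np.argsort(probabilities)[::-1]
    return order[:k]

# First update of a scalar parameter: the step size is close to lr.
p, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

For example, `batch_size_for(1000)` gives 200, and `top_k_terms(np.array([0.1, 0.9, 0.5]), 2)` returns the indices 1 and 2.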

Measure
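The definitions of Precision, Recall and F-measure are not reproduced in this section. One common choice for multi-label prediction, shown here as an assumption rather than as the paper's exact definition, is to count true positives, false positives and false negatives over all (protein, label) pairs. The function name is illustrative.

```python
import numpy as np

def precision_recall_f(y_true, y_pred):
    """Micro-averaged Precision, Recall and F-measure for binary label matrices."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # predicted and correct
    fp = np.sum(~y_true & y_pred)   # predicted but wrong
    fn = np.sum(y_true & ~y_pred)   # missed
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure

# Two proteins, three labels: 2 true positives, 1 false positive, 1 false negative.
scores = precision_recall_f([[1, 0, 1], [0, 1, 0]],
                            [[1, 1, 0], [0, 1, 0]])
```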
It can be seen from Figure 5 that the Recall of IGP-DNN is slightly lower than that of HPMM on Gavin, but IGP-DNN has the advantage on the other two data sets. The Precision and F-measure of IGP-DNN are higher than those of HPMM on all three data sets. The reason is that, when extracting the protein function features, IGP-DNN uses IFCM and GOA to address two problems of clustering on the PPI network: the tendency to fall into local optima and the sensitivity to noise points. In addition, KPCA handles nonlinear data better. The Precision, Recall and F-measure of IGP-DNN are better than those of IGP-SVM because a DNN has a stronger nonlinear fitting ability than an SVM when processing large-scale data sets. The performance of IGP-DNN is also clearly better and more stable than that of FFPred.

CONCLUSION
This paper proposes a protein function prediction approach based on a DNN. In the model, the protein features combine the function module features extracted with IFCM-GOA and the protein property features whose dimensionality is reduced with KPCA, which addresses noise sensitivity and other problems in protein function prediction. In addition, enumeration is used to choose the optimal hyperparameters on which the DNN model is built. IGP-DNN is then compared with IGP-SVM, HPMM and FFPred on three different data sets of the yeast PPI network. The experimental results demonstrate that the Precision, Recall, F-measure of IGP-DNN are