2.5 Machine learning algorithms used
Classification is a central task in machine learning: here, the goal is to predict which class a new peptide belongs to (Dengue-inhibiting or non-inhibiting). We considered 8 machine learning algorithms with the following parameters.
Random forest parameters:
Number of trees: 100; number of threads: 1; auto-optimization tree range from: 50; tree range to: 500; step: 50; cross-validation: 5
Random forest is a supervised machine learning algorithm that builds decision trees on different samples of the data and combines them by majority vote for classification or by averaging for regression. It can handle continuous as well as categorical variables and has shown good results in classification problems.
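The majority-vote step can be illustrated with a minimal sketch. The function name `majority_vote` and the example votes are illustrative assumptions and are not taken from the software used in this study.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for a single peptide
votes = ["inhibiting", "non-inhibiting", "inhibiting", "inhibiting", "non-inhibiting"]
predicted_class = majority_vote(votes)
```

For regression, the same aggregation would use the mean of the tree outputs instead of the most frequent label.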
Light Gradient Boosting Machine (LightGBM) parameters:
Boosting type: gbdt; number of leaves: 31; maximum depth: -1; learning rate: 0.1; number of threads: 1; auto-optimization leaves range: 20:100:2; depth range: 15:55:10; learning rate range: 0.01,0.15,0.02
LightGBM is a gradient boosting framework based on tree-based learning algorithms that performs well on large datasets. It is a fast and accurate classifier that has been employed for binary classification of biological sequences.
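To give a sense of the size of the auto-optimization search space, the sketch below enumerates the candidate settings. It rests on two assumptions that the text does not confirm: each range is read as start, stop, step with an inclusive end point, and all combinations are tried exhaustively. The helper `inclusive_range` is hypothetical.

```python
from itertools import product

def inclusive_range(start, stop, step):
    """Integers from start to stop (inclusive) in increments of step."""
    return list(range(start, stop + 1, step))

# Assumption: ranges are start:stop:step with an inclusive end point
leaves_grid = inclusive_range(20, 100, 2)
depth_grid = inclusive_range(15, 55, 10)
# Build learning rates from integer steps to avoid floating-point drift
learning_rate_grid = [round(0.01 + 0.02 * i, 2) for i in range(8)]

# Every (leaves, depth, learning rate) combination an exhaustive search would visit
candidates = list(product(leaves_grid, depth_grid, learning_rate_grid))
```

Under these assumptions the search would cover 41 leaf counts, 5 depths and 8 learning rates.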
Support Vector Machine (SVM):
Kernel: rbf; penalty: 1; gamma: auto; penalty from: 1; penalty to: 15; gamma from: -10; gamma to: 5
SVM is one of the most popular supervised learning algorithms and is used primarily for classification. It constructs a decision boundary, called a hyperplane, that separates data points in n-dimensional space into classes. The best hyperplane is the one with the largest margin, and its position is determined by the data points closest to it, which are called support vectors; hence the name Support Vector Machine. SVM is widely used in the classification of biological sequences.
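The rbf kernel listed in the parameters measures similarity between two feature vectors as exp(-gamma * squared distance). A minimal sketch follows; the feature vectors and the value gamma = 0.5 are chosen only for illustration and are not the settings used in this study.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Hypothetical 3-dimensional feature vectors
same = rbf_kernel([0.1, 0.5, 0.9], [0.1, 0.5, 0.9], gamma=0.5)
far = rbf_kernel([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], gamma=0.5)
```

Identical vectors give the maximum similarity of 1, and the value decays towards 0 as the vectors move apart; a larger gamma makes this decay faster.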
Logistic regression (LR):
Logistic regression is a supervised classification algorithm used primarily to distinguish between two classes. It predicts the probability that a sample belongs to the target class. LR is therefore also used in the classification of biological sequences.
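The probability prediction can be sketched as a weighted sum of the features passed through the logistic (sigmoid) function. The weights, bias and the 0.5 threshold below are illustrative assumptions, not fitted values from this study.

```python
import math

def predict_probability(features, weights, bias):
    """Probability of the positive class under a logistic model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0
```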
k-Nearest Neighbour (k-NN):
Top k values: 3
k-NN is also a supervised learning technique. It computes the similarity between the query sample and the samples in the available dataset and classifies the query according to its k most similar samples. The k-NN algorithm can also be used for regression and is widely used for biological image and sequence classification.
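A minimal sketch of k-NN classification with k = 3 is given below. Euclidean distance is assumed as the similarity measure, which the text does not specify, and the toy samples are hypothetical.

```python
import math
from collections import Counter

def knn_predict(query, samples, labels, k=3):
    """Classify query by majority vote among its k nearest training samples."""
    # Sort training indices by Euclidean distance to the query
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    nearest_labels = [labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature training set with two classes
samples = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
labels = [0, 0, 0, 1, 1]
```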
Naive Bayes (NB):
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ theorem, which gives the probability of an event given that another event has already occurred. Naive Bayes is also used in the classification of biological sequences, which helps in taxonomy.
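The use of Bayes’ theorem for classification can be sketched as follows. The prior and likelihood values are invented for illustration, and the function name `naive_bayes_posterior` is hypothetical.

```python
def naive_bayes_posterior(priors, likelihoods):
    """Posterior class probabilities from priors and per-feature likelihoods.

    priors: {class: P(class)}
    likelihoods: {class: [P(feature_i | class), ...]}
    """
    unnormalised = {}
    for cls, prior in priors.items():
        score = prior
        # "Naive" step: features are treated as conditionally independent
        for p in likelihoods[cls]:
            score *= p
        unnormalised[cls] = score
    # Dividing by the evidence makes the posteriors sum to 1
    evidence = sum(unnormalised.values())
    return {cls: score / evidence for cls, score in unnormalised.items()}

posterior = naive_bayes_posterior(
    priors={"inhibiting": 0.5, "non-inhibiting": 0.5},
    likelihoods={"inhibiting": [0.8, 0.6], "non-inhibiting": [0.2, 0.3]},
)
```

The predicted class is the one with the highest posterior probability.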
Adaptive Boosting (AdaBoost)
AdaBoost is an ensemble boosting method in which weak learners are trained sequentially and combined into a strong learner. After each round the instance weights are re-assigned so that incorrectly classified instances receive higher weights and the next learner focuses on them. This algorithm is commonly used in biological sequence classification.
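The weight re-assignment can be sketched for a single boosting round. The sketch follows the standard discrete AdaBoost update for labels coded as -1 and +1, which is assumed here rather than taken from the software used; the example labels are hypothetical.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}: returns (alpha, new_weights)."""
    # Weighted error of the current weak learner
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight; assumes 0 < error < 1
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified instances (t * p == -1) are scaled up, correct ones down
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # the last instance is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

In this example the single misclassified instance ends up carrying half of the total weight, so the next weak learner is pushed to classify it correctly.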
MLP (Multi-Layer Perceptron)
Hidden layer size: 32,32
A multi-layer perceptron (MLP) is a type of feedforward neural network consisting of an input layer, an arbitrary number of hidden layers and an output layer. The input layer receives the input signal, the hidden layers perform most of the computation, and the output layer produces the prediction or classification. Data are processed in the forward direction from the input layer to the output layer, and the network weights are trained with the backpropagation algorithm. MLPs can solve non-linear problems and are used in biological sequence classification.
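The forward pass through a network with two hidden layers of 32 nodes can be sketched as below. The ReLU activation, the sigmoid output node, the random initialisation and the input size of 3 are assumptions made for illustration, and training by backpropagation is omitted.

```python
import math
import random

def dense(inputs, weights, biases):
    """Affine layer: one weighted sum per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(features, layers):
    """Pass features through the hidden layers, then a sigmoid output node."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = layers[-1]
    logit = dense(activations, out_weights, out_biases)[0]
    return 1.0 / (1.0 + math.exp(-logit))

rng = random.Random(0)
n_features = 3  # assumed input size, for illustration only
sizes = [n_features, 32, 32, 1]  # two hidden layers of 32 nodes each
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probability = forward([0.2, 0.5, 0.1], layers)
```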
Bagging:
Number of CPUs: 1
Bagging (bootstrap aggregating) is an ensemble learning technique that improves the accuracy and stability of machine learning algorithms. It reduces overfitting and is used for both regression and classification models, particularly with decision tree algorithms. Bagging is rarely used in biological sequence classification.
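The two ingredients of bagging, bootstrap resampling and aggregation of the base-model predictions, can be sketched as follows. The function names, the seed and the toy row indices are illustrative assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_prediction(base_predictions):
    """Aggregate base-model class predictions by majority vote."""
    return Counter(base_predictions).most_common(1)[0][0]

rng = random.Random(42)
training_rows = list(range(10))  # hypothetical row indices
# One bootstrap sample per base model; each model would be trained on its own sample
samples = [bootstrap_sample(training_rows, rng) for _ in range(3)]
```

Because each base model sees a different resampled dataset, their errors are less correlated, which is what makes the aggregated vote more stable.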
In bagging, uncorrelated models can produce ensemble predictions that are more accurate than any of the individual predictions.
The normalized dataset (Training set: 102,3; Testing set: 14,3) was taken as input and loaded into these machine learning algorithms. Cross-validation was set to 5 in all cases. Subsequently, 24 models were developed and compared with respect to accuracy, ROC and PRC. The ROC and PRC curves were plotted and the evaluation metrics were reported.
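The 5-fold cross-validation and the accuracy metric can be sketched as below. The contiguous, unshuffled fold assignment and the example size of 12 samples are simplifying assumptions and do not reproduce the exact splitting used by the software.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_validation_splits(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each fold."""
    folds = k_fold_indices(n_samples, k)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

splits = list(train_validation_splits(12, k=5))
```

Each sample is used for validation exactly once, and the reported cross-validation score would be the average of the per-fold metrics.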