Dataset
We used the malicious domain dataset [8] for the research conducted in this paper. It had 12 independent features: 1 text-based feature, 3 multi-category features, and 8 numerical features. The goal was to use these independent features to predict the categorical dependent feature (malware / not malware). The dataset consisted of 10,000 instances and was divided in an 80-20 ratio, with 80% of the data used for training the models and 20% used for testing. The data was class balanced, with an equal number of examples in each class, and this balance was preserved in the train-test split [9].
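A minimal sketch of this setup is shown below, assuming a scikit-learn workflow. The file name, the target column name ("malware"), and the use of stratified splitting are illustrative assumptions, not details taken from [8].

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the real dataset schema may differ.
df = pd.read_csv("malicious_domains.csv")      # 10,000 instances
X = df.drop(columns=["malware"])               # 12 independent features
y = df["malware"]                              # dependent feature: malware / not malware

# 80-20 split; stratify=y keeps the equal class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```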
Models
Support Vector Classifier (SVC) - Support Vector Classifier [10] is a supervised machine learning algorithm. Support vector machines are used for both classification and regression problems; SVC is the classification variant. SVC aims to find the best decision boundary between two class labels in N-dimensional space, where N is the number of features. The data points are placed in this space according to their feature values, and SVC draws separating planes that form the boundary between the classes. The best possible separating plane is known as the hyperplane. The data points that lie nearest to the hyperplane are called support vectors. The planes passing through the support vectors and parallel to the hyperplane are known as marginal planes; the hyperplane is equidistant from both marginal planes, and the distance between them is called the marginal distance.
Many decision boundaries can be drawn to separate the same data into the same classes, but the most accurate separation is obtained by the hyperplane with the largest marginal distance.
The equation of the hyperplane is
$w^T \Phi(z) + b = 0,$
where $b$ is a constant. Now let $z_0$ be a data point vector in the space. The distance of $z_0$ from the hyperplane is
$d(\Phi(z_0)) = \dfrac{|w^T \Phi(z_0) + b|}{\|w\|},$
where $\|w\|$ is the Euclidean norm of $w$.
Since the support vectors are the data point vectors nearest to the hyperplane, their distance is obtained by minimizing this expression over all data points:
$d_{\min} = \min_m \dfrac{|w^T \Phi(z_m) + b|}{\|w\|}$
Here $d_{\min}$ is the distance from the hyperplane to the marginal plane, i.e., the marginal distance. For optimal classification, the marginal distance should be maximum:
$w^* = \arg\max_{w} \left( \min_m \dfrac{|w^T \Phi(z_m) + b|}{\|w\|} \right)$
$w^* = \arg\max_{w} \, d_{\min}$
Mathematically, therefore, maximizing the marginal distance yields the most accurate separating boundary.
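As a minimal illustration, this maximum-margin classifier corresponds to scikit-learn's SVC. The linear kernel and regularization value below are illustrative assumptions rather than the settings used in this work, and X_train/y_train refer to the split sketched in the Dataset subsection.

```python
from sklearn.svm import SVC

# Linear kernel for a single separating hyperplane; C controls the margin softness.
svc = SVC(kernel="linear", C=1.0)
svc.fit(X_train, y_train)
print("SVC test accuracy:", svc.score(X_test, y_test))
```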
Random Forest (RF) - Random forest [11] is an ensemble bagging technique. It is used for both regression and classification tasks and uses decision trees as base learners. In a random forest, multiple decision trees each perform classification on a random sample of the input data. Samples may be repeated across trees, but because repetition can limit performance, the inputs for each tree are selected randomly. Each decision tree predicts a result independently, and the final result is the majority vote of the predictions given by the individual trees. If a single decision tree were used for the final prediction, changing a small part of the dataset could change its prediction, whereas in a random forest altering a small part of the dataset has no major impact on the final prediction. The accuracy of the model also depends on the number of decision trees; a larger number of trees generally leads to better accuracy. Random forests work well even when individual features are weak predictors; they are easy to apply and perform well on noisy data and when part of the data is missing.
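A brief sketch of this majority-vote ensemble with scikit-learn follows; the number of trees and the random seed are illustrative choices, and X_train/y_train again refer to the earlier split sketch.

```python
from sklearn.ensemble import RandomForestClassifier

# Each of the n_estimators trees is trained on a random bootstrap sample;
# the final class is the majority vote of the individual trees.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random forest test accuracy:", rf.score(X_test, y_test))
```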
For optimal splitting of the data at a node of a decision tree, Gini impurity and entropy are calculated. Gini impurity gives the cost of a split and can be read as the probability of misclassification at the node; a lower value indicates a better split. Entropy is likewise calculated to ensure an optimal split of the sample data at a node; for a two-class problem its value lies between 0 and 1, and the split that most reduces it determines the classification boundary.
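For a node in which a fraction $p_i$ of the samples belongs to class $i$, these two splitting criteria take their standard forms:
$\text{Gini} = 1 - \sum_i p_i^2$
$\text{Entropy} = -\sum_i p_i \log_2 p_i$
For a two-class node, the entropy ranges from 0 (a pure node) to 1 (an evenly mixed node), which is the bound stated above.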
Gradient Boosting Classifier (GBC) - Gradient boosting is a supervised machine learning algorithm [12] that uses an ensemble boosting technique to perform classification and regression tasks. Prediction is done by combining multiple decision trees in an additive manner: the trees are added in a series, with the target values changed for each tree. In gradient boosting the predictions are made by weak learners, which are fitted in the direction of the negative gradient of the loss so that the prediction error is reduced quickly. Gradient boosting combines the results of these weak learners to achieve better accuracy. Dependent and independent features are used as input to the model; the independent features remain unchanged for every tree, while the dependent (target) feature changes for each tree. The depth of the individual decision trees also influences prediction accuracy.
In the first step, a base model predicts a result for every row in the data sample. After obtaining the predicted value, the residual (error) is calculated as
$\text{error} = x - x',$
where $x'$ is the predicted value and $x$ is the actual value. In the next step, models are trained in an additive manner, with the residual used as the target feature of the next tree. To reduce the error, a loss function is defined; for regression problems the least-squares loss is generally used:
$L(x, x') = (x - x')^2.$
To obtain the optimal classifier, this loss function is minimized.
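The following sketch illustrates this additive residual-fitting scheme for the squared loss, assuming shallow regression trees as the weak learners; the tree depth, learning rate, and number of stages are illustrative choices, not the configuration used in this paper. In practice, scikit-learn's GradientBoostingClassifier applies the same idea with a classification loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=50, learning_rate=0.1, max_depth=3):
    """Fit F(x) = f0 + lr * sum_m h_m(x) for the squared loss (x - x')^2."""
    f0 = float(np.mean(y))                    # base model: predict the mean
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        residual = y - prediction             # error = (x - x') from the text
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                 # the next weak learner targets the residual
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    prediction = np.full(X.shape[0], f0)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction
```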
Data Analysis
In this subsection we discuss the data analysis steps undertaken while conducting the research. The research workflow is shown in Figure 2.
After loading our dataset, we first checked it for null values and found that it contained none. We then extracted the top-level domain (e.g., "com") and the domain (e.g., "netflix") from the existing "domain name" feature (e.g., "www.netflix.com"). These two extracted features were treated as categorical features, and after the extraction the "domain name" feature was dropped from the dataset. In the next step we removed the "." characters from the "IP" feature and treated the result as a numerical feature.
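A hedged pandas sketch of these steps is given below; the column names "domain name" and "IP" are assumptions about the dataset schema.

```python
# Split "www.netflix.com" into the top-level domain ("com") and the domain ("netflix").
parts = df["domain name"].str.split(".")
df["top level domain"] = parts.str[-1]
df["domain"] = parts.str[-2]
df = df.drop(columns=["domain name"])

# Remove the dots from the IP string and treat the result as a number,
# e.g. "192.168.0.1" -> 19216801.
df["IP"] = df["IP"].str.replace(".", "", regex=False).astype("int64")
```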
As a final step we extracted the top 20 occurrences of the top-level domain, domain, country code and owner features and added them to the dataset as new categorical (indicator) features; the initial "top level domain", "domain", "country code" and "owner" features were then dropped. These preprocessing steps left our dataset with 128 distinct features and also introduced some noise into the data. The features were also on varying scales, which made it important to use feature selection and feature scaling techniques to help the machine learning algorithms converge faster and make accurate predictions.
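A sketch of this encoding step, under the same schema assumptions, might look as follows; whether the original work encoded the top 20 values in exactly this way is an assumption.

```python
# Keep one binary indicator per frequent category, then drop the raw column.
for col in ["top level domain", "domain", "country code", "owner"]:
    top20 = df[col].value_counts().nlargest(20).index
    for value in top20:
        df[f"{col}_{value}"] = (df[col] == value).astype(int)
    df = df.drop(columns=[col])
```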
We used the Support Vector Machine, Random Forest and Gradient Boosting algorithms to create our models.
Then we combined these models with two different feature scaling techniques, listed below (a brief usage sketch follows the list):
- Min-Max Scaler: Rescales each feature to a custom range. We scaled our features to the range [0, 1]. The Scikit-Learn library was used to implement the Min-Max Scaler [13].
- Robust Scaler: A shortcoming of Min-Max scaling is its sensitivity to outliers. To mitigate this, we also used the Robust Scaler, which scales data using the interquartile range and is less sensitive to outliers [14].
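A minimal usage sketch of the two scalers is shown below; fitting on the training split only (to avoid leaking test-set statistics) is our assumption about the workflow.

```python
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Min-Max scaling to [0, 1].
minmax = MinMaxScaler(feature_range=(0, 1))
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)

# Robust scaling: centers on the median and scales by the interquartile range.
robust = RobustScaler()
X_train_rb = robust.fit_transform(X_train)
X_test_rb = robust.transform(X_test)
```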
For selecting the best subset of features from our dataset, we used the Chi-Square filter-based feature selection technique [15], which selected 48 features in total. Using 3 different machine learning algorithms, 2 different scaling techniques and 1 feature selection technique gave a total of 6 models for the problem.
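A sketch of this selection step with scikit-learn follows; k=48 matches the number of selected features reported above, and the Min-Max-scaled matrices are used here because the chi-square test requires non-negative inputs.

```python
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=48)
X_train_sel = selector.fit_transform(X_train_mm, y_train)
X_test_sel = selector.transform(X_test_mm)
```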
The comprehensive results are discussed in the next section, Result and Analysis. The related data analysis code can be found in our GitHub repository.