Two different ML model types were trained and validated on the collected dataset: an artificial neural network (ANN) and an ensemble classifier consisting of decision trees. Both classifiers were trained using 5-fold cross-validation to better utilize the collected data, reduce overfitting, and improve the generalization of the model predictions [57].
Given the effectiveness of deep learning models, the suggested approach was tested on an ANN [58] with multiple hidden layers. Neural networks with different hidden layer architectures were trained using 5-fold cross-validation. All networks achieved high accuracy (above 97.8%) and low FPRs (0.2%) but correctly identified only a small fraction of the malicious websites (2.5–4%). In addition, an analysis of the receiver operating characteristic (ROC) curves yielded by the different network architectures indicates that the trained models were affected by overfitting.
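For concreteness, the cross-validation procedure can be sketched as follows; this is a minimal illustration assuming a scikit-learn-style feature matrix `X` and binary label vector `y` (1 = malicious), and the hidden-layer sizes shown are placeholders rather than the architectures actually evaluated.

```python
# Minimal sketch of stratified 5-fold cross-validation for an ANN classifier.
# X, y, and the hidden-layer sizes are illustrative assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, recall_score

def cross_validate_ann(X, y, hidden_layers=(64, 32), folds=5, seed=42):
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    accuracies, tprs = [], []
    for train_idx, test_idx in skf.split(X, y):
        clf = MLPClassifier(hidden_layer_sizes=hidden_layers,
                            max_iter=500, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        predictions = clf.predict(X[test_idx])
        accuracies.append(accuracy_score(y[test_idx], predictions))
        tprs.append(recall_score(y[test_idx], predictions))  # TPR on the malicious class
    return float(np.mean(accuracies)), float(np.mean(tprs))
```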
This is not surprising, as the effect of class imbalance on neural network classification performance has previously been shown to be detrimental [59, 60]. Two main approaches have previously been suggested for handling imbalanced classification problems with ANN models, both of which significantly increase the weight of the minority class: supervised oversampling [61] and synthesized data augmentation [62–64]. These approaches share one main disadvantage: active manipulation of the original dataset. Such manipulation contradicts the intention of accurately representing a commercial, real-life scenario. In addition, it has previously been suggested that data augmentation techniques do not learn the target distribution [65].
As a proof of concept regarding ANN effectiveness in this problem space, and to neutralize the effect of class imbalance without adding synthesized data, a more balanced dataset was examined.
This dataset was a subset of the original dataset consisting of 2,788 samples: all 697 original malicious samples and 2,091 randomly selected legitimate samples. The prior probability of a website being malicious in this dataset was approximately 13 times greater than that in the previous experiment (25% vs. 1.95%).
To adapt the ANN model to a smaller dataset and prevent overfitting, the feature space and the network architecture needed to be reduced. Accordingly, the trained ANN consisted of 2 hidden layers, and principal component analysis (PCA) was used for dimensionality reduction, resulting in a feature space consisting of 50 components. As expected, the ANN model performed significantly better on the more balanced dataset and yielded better classification results (recall: 0.786; precision: 0.734; accuracy: 0.876). The balanced model's accuracy was lower than that of the imbalanced model, a difference that is explained by the significant gap between their minority class prior probabilities.
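This balanced-subset setup can be illustrated with the following sketch; the majority-class sampling ratio mirrors the 697:2,091 proportion reported above, while the scaler and hidden-layer widths are assumptions rather than the exact configuration used.

```python
# Sketch of the balanced-subset experiment: undersample the legitimate class,
# reduce the feature space to 50 principal components, and train a 2-hidden-layer ANN.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def build_balanced_subset(X, y, legit_per_malicious=3, seed=42):
    """Keep all malicious samples (y == 1) and a random sample of legitimate ones."""
    rng = np.random.default_rng(seed)
    malicious_idx = np.where(y == 1)[0]
    legitimate_idx = np.where(y == 0)[0]
    kept_legit = rng.choice(legitimate_idx,
                            size=legit_per_malicious * len(malicious_idx),
                            replace=False)
    subset_idx = np.concatenate([malicious_idx, kept_legit])
    return X[subset_idx], y[subset_idx]

balanced_ann = make_pipeline(
    StandardScaler(),                 # scaling assumed; not stated in the text
    PCA(n_components=50),             # 50 components, as described above
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42),
)
```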
Another effort was made to address the crucial effect of class imbalance: a sequential NN was trained with oversampling, and the weight of the minority class (malicious websites) was increased. As presented in Table 2, this model detected 75% of the malicious websites and achieved higher accuracy (94%) and a lower FPR (0.06) than the balanced ANN model (FPR of 0.09). However, this FPR level is similar to those of previously reported methods and is not sufficient for a real-life scenario.
Table 2
Performance comparison among the neural network models.
| Model | Dataset characteristics | Minority class prior probability | Accuracy | TPR | FPR |
| ANN | Imbalanced | 1.95% | 0.98 | 0.04 | 0.002 |
| ANN | Balanced | 25% | 0.87 | 0.77 | 0.09 |
| RNN | Imbalanced with oversampling | 1.95% | 0.94 | 0.75 | 0.06 |
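The oversampling and class-weighting strategy behind the last row of Table 2 can be sketched with a Keras Sequential network as follows; the oversampling factor, layer sizes, and minority-class weight are illustrative assumptions, not the values used in the study.

```python
# Sketch of training a Sequential network on an oversampled dataset with a
# heavier loss weight on the minority (malicious) class.
import numpy as np
import tensorflow as tf

def oversample_minority(X, y, factor=5, seed=42):
    """Replicate minority-class rows so they appear `factor` times as often."""
    rng = np.random.default_rng(seed)
    malicious_idx = np.where(y == 1)[0]
    extra = rng.choice(malicious_idx, size=(factor - 1) * len(malicious_idx),
                       replace=True)
    order = rng.permutation(np.concatenate([np.arange(len(y)), extra]))
    return X[order], y[order]

def train_weighted_sequential(X, y, minority_weight=25.0):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X.shape[1],)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.Recall()])
    # A higher weight for class 1 penalizes missed malicious websites more strongly.
    model.fit(X, y, epochs=20, batch_size=128,
              class_weight={0: 1.0, 1: minority_weight}, verbose=0)
    return model
```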
Ensemble models achieve high accuracy by combining a number of base estimators and can increase the reliability of machine learning relative to a single estimator [66].
Bagging (bootstrap aggregation) is a commonly used ensemble classification method that reduces the variance of a decision tree and addresses classification noise [67]. In situations with substantial classification noise, bagging has been found to be superior to boosting and randomization [68]. The algorithm randomly creates several subsets of the given training set by sampling with replacement. Each subset is used to train a different decision tree, and the individual predictions are aggregated into an averaged final prediction. Unlike the random forest classifier, the bagging classifier does not restrict each tree to a random subset of the features and, as a result, can leverage the most significant features in all of its weak classifiers.
The chosen bagging classifier consisted of 10 base estimators. Each estimator was a classification and regression tree (CART) with a maximal depth of 4, adapted to the extremely imbalanced dataset by enforcing a high penalty for classification mistakes on the minority class. The CART algorithm was selected because it identifies splitting variables by searching through all possibilities among the input variables and because its results can be leveraged for explainability purposes. The model successfully detected 75% of all malicious websites while maintaining a low FPR of 2% and achieving an overall accuracy of 97.5%, as described in Table 3. Allowing a higher FPR resulted in a higher TPR while maintaining a high accuracy level, as shown in Fig. 3.
Table 3
Performance metrics for the bagging classifier and the convergence experiment.
| | Recall | Specificity | Precision | NPV | FPR | FNR | Accuracy |
| Bagging Classifier | 0.75 | 0.98 | 0.41 | 0.99 | 0.02 | 0.25 | 0.97 |
| Convergence Experiment | 0.83 | 0.97 | 0.44 | 0.99 | 0.02 | 0.17 | 0.97 |
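The bagging configuration described above maps directly onto scikit-learn, as in the following sketch; the 10 depth-4 CART base estimators follow the text, whereas the specific class-weight penalty is an assumption.

```python
# Sketch of the bagging classifier: 10 CART base estimators of maximal depth 4,
# each with a heavy (assumed) penalty on minority-class mistakes.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base_tree = DecisionTreeClassifier(max_depth=4,
                                   class_weight={0: 1, 1: 50},  # assumed penalty
                                   random_state=42)

bagging_model = BaggingClassifier(
    estimator=base_tree,      # named `base_estimator` in scikit-learn < 1.2
    n_estimators=10,          # 10 base estimators, as described above
    bootstrap=True,           # sampling with replacement over the training set
    random_state=42,
)
# bagging_model.fit(X_train, y_train)
# scores = bagging_model.predict_proba(X_test)[:, 1]  # threshold to trade FPR for TPR
```

Because the bagging classifier draws bootstrap samples over the training instances but keeps the full feature set for every tree, each weak classifier can use the most significant features, in line with the distinction from random forests noted above.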
To provide explainability for the results of both ML models and to understand the impact of each feature, an implementation of Shapley values [69] for explainable AI was employed. Shapley values interpret the impact of a feature taking a certain value in comparison to the prediction that would have been made if that feature took some baseline value [70]. This implementation also calculates the aggregated contributions in a way that provides insights at the model level.
When analyzing a specific prediction, one can learn which features contributed most and what their actual values were, as shown in Fig. 4. In this specific example, the model prediction was ‘1’ (‘1’ represents a malicious website). The biggest impact came from the 3rd-party usage statistics, the high number of images, and the relatively high number of HTML elements compared to the baseline values.
[Fig. 4 about here.]
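The Shapley-value analysis can be reproduced along the following lines with the `shap` package; the variable names (`bagging_model`, `X_train`, `X_test`, `feature_names`) are assumptions carried over from the sketches above, and the model-agnostic KernelExplainer is used here as a stand-in for whichever explainer implementation was actually employed.

```python
# Sketch of local and model-level Shapley-value explanations using shap.
import shap

# A small background sample keeps the model-agnostic KernelExplainer tractable.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(
    lambda data: bagging_model.predict_proba(data)[:, 1], background)

# Local explanation of a single prediction (the kind of view shown in Fig. 4).
local_values = explainer.shap_values(X_test[:1])

# Aggregating contributions over many predictions yields the model-level view.
global_values = explainer.shap_values(X_test[:200])
shap.summary_plot(global_values, X_test[:200], feature_names=feature_names)
```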
A deeper dive into the predictions classified as false positives (FPs) revealed that, in many cases, malicious websites had not been identified by GSB. Of the predictions considered FPs, 100 websites were randomly selected and manually reviewed; 18% of these websites were identified as malicious, while an additional 21% were identified as suspicious. These websites offered a variety of content that might raise suspicion, ranging from software downloads, through prescription medicine, to engagement with human beings, as demonstrated in Fig. 5.
Accordingly, we conducted a convergence experiment in which the model predictions were reviewed and relabeled. Each prediction classified as an FP was reviewed by examining the explainable AI results and by manually accessing the source code and a screenshot of the corresponding website. The allegedly malicious websites were then relabeled, and the model was retrained. The observation that roughly 18% of the FP predictions were actually malicious websites persisted across the iterations. However, every iteration discovered additional malicious websites that had been classified by GSB as legitimate, and as a result, the experiment did not converge after 5 iterations.
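Schematically, the relabel-and-retrain loop can be expressed as follows; the `manual_review` callback is a placeholder for the human inspection of the explainable AI outputs, source code, and screenshots described above.

```python
# Sketch of the convergence experiment: retrain after relabeling FPs that
# manual review confirms to be malicious, and stop when no FPs are confirmed.
from sklearn.base import clone

def convergence_experiment(model, X, y, manual_review, iterations=5):
    labels = y.copy()
    fitted = None
    for _ in range(iterations):
        fitted = clone(model).fit(X, labels)
        predictions = fitted.predict(X)
        fp_idx = [i for i, (p, t) in enumerate(zip(predictions, labels))
                  if p == 1 and t == 0]
        confirmed = [i for i in fp_idx if manual_review(i)]  # human-in-the-loop check
        if not confirmed:            # converged: no FP was actually malicious
            break
        for i in confirmed:
            labels[i] = 1            # relabel websites that GSB missed as malicious
    return fitted, labels
```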
Taking this into consideration, it is reasonable to claim that the model's performance was actually higher than indicated by the above performance metrics. The results of the first iteration of the convergence experiment, presented in Table 3, imply that the model's performance would be higher in a real-life scenario and that 83% of the malicious websites can be detected by the proposed approach at the same FPR and accuracy levels.