Malware is a major threat that has drawn the attention of cybersecurity professionals because of its varied nature and the extent of its effects. Machine learning algorithms have therefore been used for malware detection, as they can achieve high detection accuracy [7]. Machine learning and deep learning algorithms learn to differentiate between malware and benign software through effective training on quality datasets [8]. Malware analysis can be carried out in many ways. The most commonly used approach is static analysis, which detects malware with the help of opcode features [9]. Another static analysis approach uses the frequency of occurrence of each opcode extracted from malicious portable executable files [10].
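As a minimal sketch of the opcode-frequency idea described above, the following Python snippet turns an opcode sequence into a vector of relative frequencies. The opcode trace, the vocabulary, and the function name `opcode_frequencies` are hypothetical illustrations and are not taken from the cited works [9,10].

```python
from collections import Counter


def opcode_frequencies(opcodes, vocabulary):
    """Return the relative frequency of each vocabulary opcode in a sequence."""
    counts = Counter(opcodes)
    total = len(opcodes)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[op] / total for op in vocabulary]


# Hypothetical opcode trace, e.g. obtained by disassembling an executable.
sample = ["mov", "push", "mov", "call", "xor", "mov", "pop", "ret"]
# Hypothetical fixed vocabulary; opcodes absent from the trace get frequency 0.
vocabulary = ["mov", "push", "pop", "call", "xor", "ret", "jmp"]

features = opcode_frequencies(sample, vocabulary)
print(dict(zip(vocabulary, features)))
```

A fixed vocabulary keeps the feature vector the same length for every file, which is what a downstream classifier would typically require.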
Nowadays, behavior-based detection techniques are widely used in malware analysis. The main idea is to observe the behavior of a program through monitoring and to conclude whether the program is benign or malicious. The behaviors can be obtained by any of the following techniques:
- Using a sandbox environment.
- Continuous monitoring of system calls.
- Observing any file modifications.
- By monitoring processes.
One such detection algorithm used an ensemble learning method, namely a voting classifier, to detect malware samples [11]. Traditional security mechanisms have many limitations, and machine learning techniques can be used to overcome them [12]. The main challenges identified are intrusion detection, spam detection, and malware detection [13]. Many machine learning approaches used support vector machines (SVM) [14,15], which were trained on Android malware until the model correctly predicted all the Android malware samples.
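The voting idea behind such an ensemble can be illustrated with a small sketch of hard (majority) voting. The three prediction lists are hypothetical outputs of unspecified base classifiers, and the helper `hard_vote` is an illustrative name; the cited work [11] may use a different voting scheme.

```python
from collections import Counter


def hard_vote(predictions):
    """Combine per-classifier label lists by majority vote for each sample."""
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(p[i] for p in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted


# Hypothetical outputs of three base classifiers (1 = malware, 0 = benign).
clf_a = [1, 0, 1, 1, 0]
clf_b = [1, 1, 0, 1, 0]
clf_c = [0, 0, 1, 1, 0]

ensemble = hard_vote([clf_a, clf_b, clf_c])
print(ensemble)  # [1, 0, 1, 1, 0]
```

Using an odd number of voters avoids ties in the binary case; with an even number, a tie-breaking rule would have to be chosen.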
Another popular classification algorithm is K-Nearest Neighbors (KNN). A KNN model trained on mobile malware achieved high accuracy [16]. Another model used KNN specifically for signature-based detection [17]. A hybrid model built using Naive Bayes and Support Vector Machines was used to classify Android malware [18]; combining the two produced high accuracy in classifying Android malware samples. A Random Forest classifier applied to the “Malimg” dataset, which contains 9,342 malware samples from 25 different malware families, gave an accuracy of around 94.64%, demonstrating the efficiency of the ensemble model (Random Forest) [19]. Some machine learning algorithms were previously trained on the same dataset that is used in building the proposed model. Many classification models were trained on this dataset, and the results showed that the Extra Trees classifier was the top performer, with an accuracy of 99.82% and a detection rate of 99.86%, closely followed by Random Forest, which was slightly behind in accuracy at 99.78% and had a detection rate of 99.89% [20].
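The accuracy and detection rate figures quoted above can be related to the counts of correct and incorrect predictions. The sketch below assumes that "detection rate" means the true positive rate (recall on the malware class); the cited works may define it differently. The label lists are hypothetical.

```python
def accuracy_and_detection_rate(y_true, y_pred, positive=1):
    """Return (accuracy, detection rate), with detection rate = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, detection_rate


# Hypothetical ground truth and predictions (1 = malware, 0 = benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

acc, dr = accuracy_and_detection_rate(y_true, y_pred)
print(acc, dr)  # 0.8 0.75
```

The two metrics can rank models differently, because accuracy also counts correctly classified benign samples, while the detection rate looks only at the malware class.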
Deep learning has been used extensively in modern tasks; it can handle large amounts of data and often gives more accurate results than traditional machine learning models. Traditional machine learning models can take a long time to train and can still produce incorrect results. A Deep Neural Network trained on the CIC dataset from the Canadian Institute for Cybersecurity yielded an accuracy of 85.04% and a detection rate of 85.17% [20]. Not all malware is available in hash code form; some malware appears in unstructured forms such as images or videos, which need to be handled using different techniques. One popular technique for image processing is the convolutional neural network (CNN). An algorithm using a CNN integrated with recurrent neural networks yielded an accuracy of 98.92% and a detection rate of 98% [21]. Based on this past work and an analysis of the results of the individual machine learning and deep learning classifiers, a new and robust model is proposed.
Table 2 gives an overview of the existing models and the type of dataset used for classification, alongside the proposed model and its dataset. It also summarizes the preprocessing techniques used to achieve better accuracy and the type of classification performed (binary or multi-class).
Table 2
Overview of existing models against proposed model.
Used Approach | Dataset | Pre-processing | Classification |
Hybrid RNN [22] | NSL-KDD | Principal Component Analysis | Binary |
Attention based LSTM [23] | MSCAD | PSO | Multi-Class |
CatBoost [24] | CIC Dataset | - | Binary |
Gradient Boosting [25] | CIC Dataset | Pearson Correlation | Binary |
Deep Belief Networks [26] | Microsoft Malware Dataset | Auto-Encoder | Binary |
K-Nearest Neighbors [27] | Drebin | Feature Extraction | Binary |
Multi-layer Perceptron [28] | Android Malware | - | Binary |
Stacking Classifiers (Proposed Model) | CIC Dataset | Autoencoder-Decoders | Binary |
Stacking Classifiers: Stacking, an ensemble technique, combines several machine learning methods to increase a model's prediction performance. Recent research has shown stacking classifiers to be useful in malware detection. For example, stacking classifiers performed effectively when classifying malware datasets by analyzing hash codes [29]. That study found that stacking classifiers outperformed the individual techniques on its efficiency metrics, implying that they have the potential to improve malware detection systems.
Another example is a stacking-based malware detection framework, presented in a separate research paper, which combines various base classifiers to detect Android malware [30]. The study found that stacking improves malware detection accuracy significantly compared to a single classifier. It thus highlights the importance of stacking classifiers in building strong malware detection frameworks that can dynamically adjust to malware.
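To make the stacking idea concrete, the following is a minimal sketch using scikit-learn's `StackingClassifier`. The choice of base learners (Random Forest and KNN), the logistic regression meta-learner, and the synthetic data are assumptions made for illustration; they do not reproduce the configurations of the cited studies [29,30] or of the proposed model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a malware feature matrix (not real malware data).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners produce out-of-fold predictions (cv=5), which become the
# input features of the meta-learner (final_estimator).
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
print(f"stacked test accuracy: {test_accuracy:.3f}")
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for data the base learners have already seen, is what limits overfitting in this design.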
Building efficient machine learning and deep learning models requires some important preprocessing techniques, such as feature scaling and dimensionality reduction. Dimensionality reduction is needed because a raw dataset contains many features, and it would be difficult for a model to train on all of them; training with that many features can take a significant amount of time and resources. Dimensionality reduction comprises two main families of techniques: feature selection and feature extraction.
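Feature scaling, mentioned above, can be sketched as z-score standardization, one common choice among several (min-max scaling is another). The feature values and the function name `standardize` are hypothetical and chosen only for illustration.

```python
import statistics


def standardize(column):
    """Rescale a feature column to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Hypothetical raw feature with a wide numeric range (file size in KB).
file_sizes = [120.0, 4500.0, 980.0, 15000.0]

scaled_sizes = standardize(file_sizes)
print(scaled_sizes)
```

Scaling matters most for distance-based and gradient-based models, where a feature with a large numeric range would otherwise dominate the others.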
Feature selection includes filter methods, wrapper methods, and embedded methods. A model built using deep learning describes a correlation-based feature selection method, which selects the best features by computing the correlation between the target column and the other columns [31]. A novel feature selection method based on a modified whale optimization algorithm was used for detecting malware [32]. Because malware attacks are especially common on Android mobile devices, a new technique based on frequency differential enhancement was used to detect Android malware [33]. As malware can change its appearance or behavior to avoid detection, an efficient method combining structural and behavioral features was used to reduce dimensionality and improve the quality of features [34]. Quadratic programming has been used as a feature selection methodology in a study that compares several feature selection methods and selects relevant features [35]. A feature selection method using weighted voting based on R2 scores was applied to the CCCS-CIC-AndMal-2020 dataset; it helped in selecting the optimum features and achieved high accuracy even after excluding 60.2% of the features [36].
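A correlation-based filter of the kind described for [31] can be sketched as follows: each feature is scored by the absolute Pearson correlation with the target, and the top-k features are kept. The feature names, values, and the helper `select_top_k` are hypothetical; the cited work may rank or threshold features differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def select_top_k(columns, target, k):
    """Keep the k feature names with the largest |correlation| to the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical feature columns and binary labels (1 = malware, 0 = benign).
columns = {
    "api_calls": [2, 3, 9, 8, 1, 10],
    "file_size": [5, 5, 5, 5, 5, 5],  # constant, so uninformative
    "entropy": [3.1, 6.8, 4.0, 5.9, 5.2, 4.4],
}
target = [0, 0, 1, 1, 0, 1]

selected = select_top_k(columns, target, k=1)
print(selected)  # ['api_calls']
```

This is a filter method in the sense used above: features are scored independently of any classifier, which makes it cheap but blind to interactions between features.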
Among dimensionality reduction methods, Principal Component Analysis (PCA) is one of the most widely used techniques. It reduces the dimensionality of the data by projecting it onto a small number of components that capture the maximum variance of the data. A model built using an ANN used PCA to reduce the feature set for malware detection, achieving high accuracy and a low false positive rate [37]. Another study on unsupervised feature selection compared four methods, namely PCA, Rough PCA, Unsupervised Quick Reduct (USQR), and Empirical Distribution Ranking (EDR), for malware detection, highlighting that PCA performs well on most of the datasets [38]. For detecting and classifying ransomware, a two-stage feature selection method based on a wrapper technique was used [39].
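The variance-maximizing projection performed by PCA can be sketched with a singular value decomposition of the centered data. The synthetic data below (10 observed features driven by 2 hidden factors) and the function name `pca_project` are assumptions for illustration only.

```python
import numpy as np


def pca_project(X, n_components):
    """Project X onto its top principal components via SVD.

    Returns the projected data and the explained-variance ratio of the
    retained components.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained[:n_components]


rng = np.random.default_rng(0)
# Synthetic data: 10 observed features generated from 2 hidden factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Z, ratio = pca_project(X, n_components=2)
print(Z.shape, ratio.sum())
```

Note that the resulting components are linear combinations of all original features, so PCA is a feature extraction technique rather than a selection of individual columns.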
Modern dimensionality reduction methods also include autoencoder-decoders for feature generation. One efficient method combined a single ranker model with autoencoders to perform greedy backward elimination of features [40]. An unsupervised feature ranking method based on autoencoders was used to explore the original feature space [41]. Autoencoder-decoders are increasingly used nowadays because of their feature representation capabilities.
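The encoder-decoder principle can be illustrated with a deliberately simplified linear autoencoder trained by gradient descent in NumPy. This is an assumption-laden sketch: practical autoencoders, including those in the cited works [40,41], typically use nonlinear layers and a deep learning framework, and the data, code size, learning rate, and iteration count here are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 8 observed features generated from 3 hidden factors.
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_in, n_code = X.shape[1], 3
W_enc = rng.normal(scale=0.1, size=(n_in, n_code))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_code, n_in))  # decoder weights


def reconstruction_error(X, W_enc, W_dec):
    """Mean squared error between the input and its reconstruction."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))


initial_error = reconstruction_error(X, W_enc, W_dec)

learning_rate = 0.01
for _ in range(1000):
    code = X @ W_enc                 # encode: compress to n_code features
    recon = code @ W_dec             # decode: reconstruct the input
    err = (recon - X) / len(X)       # gradient of the squared error w.r.t. recon
    grad_dec = code.T @ err
    grad_enc = X.T @ (err @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_error = reconstruction_error(X, W_enc, W_dec)
encoded = X @ W_enc                  # learned low-dimensional features
print(initial_error, final_error, encoded.shape)
```

After training, only the encoder output is kept as the reduced feature set; the decoder exists to provide the reconstruction objective.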
Borrowed from evolutionary biology and applied to cybersecurity, dendrogram tree analysis (PTA) has been used to explore the relationships among malware and their development. Dendrogram trees provide a graphical representation of the ‘family tree’ of malware, giving insight into the similarities and differences between malware samples. In one study, researchers developed a scalable approach for building dendrogram trees to group large-scale malware samples [42]. The aim of that study was to improve the accuracy of clustering and reduce the burden on malware analysts by categorizing malware using dendrogram trees. Similarly, another study presented an efficient technique for constructing dendrogram trees for malware clustering [43]; with a high correctness rate, the suggested method completed clustering 22 times faster than conventional techniques. This demonstrates how dendrogram tree analysis can be used to handle and classify thousands of malware specimens effectively.
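One common way to build such a tree is agglomerative hierarchical clustering, sketched below with SciPy. The two synthetic "families", the average-linkage method, and the Euclidean metric are illustrative assumptions; the cited studies [42,43] use their own, more scalable constructions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two hypothetical malware families as well-separated groups of feature vectors.
family_a = rng.normal(loc=0.0, scale=0.3, size=(5, 4))
family_b = rng.normal(loc=5.0, scale=0.3, size=(5, 4))
X = np.vstack([family_a, family_b])

# Each row of Z records one merge: the two clusters joined, their distance,
# and the size of the new cluster. Z fully describes the dendrogram.
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree so that at most two flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree, and cutting it at different heights yields coarser or finer family groupings.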
Dendrogram tree analysis and stacking classifiers together offer a comprehensive method for classifying and analyzing malware. Their separate advantages suggest that their combined use could lead to notable progress in malware detection. Stacking classifiers combine several methods to yield a strong prediction model, while dendrogram tree analysis provides an in-depth understanding of malware evolution and relationships. Combining these techniques could produce malware detection systems that recognize malware accurately and also anticipate and adapt to new variants. This would enable threat forecasting and the creation of efficient countermeasures, which would be a major advance in proactive cybersecurity.
In brief, dendrogram tree analysis and stacking classifiers are effective methods for classifying and analyzing malware. Their combination could substantially advance the field and offer effective methods for defending against the constantly changing threat landscape of cyberattacks. Given the ongoing evolution of malware, such sophisticated techniques are highly significant for digital security. To fully realize the promise of these approaches in combating malware, future research should concentrate on examining how they complement one another.