Detection and defense of network virus using data mining technology

The spread of network viruses poses a serious threat to network security; therefore, it is necessary to detect and defend against them effectively. This paper used debugging application programming interface (API) technology to obtain API calls as virus features, filtered the API calls according to information entropy, and finally used a support vector machine (SVM) model for virus detection. The experimental results showed that the algorithm had the best virus detection performance when the number of APIs was 1200, with an average true positive rate (TPR) of 95.2%, a false positive rate (FPR) of 3.31%, and an overall accuracy of 95.42%; compared with the C4.5, K-means, and Naive Bayes algorithms, the SVM algorithm had the best performance. The results show that the proposed method is effective in virus detection and defense and can be further promoted and applied in practice.


INTRODUCTION
The rapid development of the network has brought great convenience to people's study, life, work, and entertainment, 1 but at the same time, network security problems have become increasingly prominent 2 ; therefore, technologies such as data encryption, 3 identity authentication, 4 intrusion detection, 5 and virus defense 6 have been extensively studied to improve network security. A network virus is a piece of code that can destroy the functions or data of computers and is capable of self-replication, that is, it can spread quickly across networks and affect the normal use of computers. Once the virus code is executed on a computer, it inserts itself into various programs to multiply. If it is not handled in time, the virus spreads to other uninfected computers through all platforms on the network, leaving more computers unable to work normally and even paralyzing the network system. In addition, viruses are well concealed and sometimes cannot be detected by antivirus software. Before being triggered, a virus lurks in a program; once triggered, it destroys the network wantonly. The increasingly rampant network virus severely challenges computer security. 7 The high-speed spread of information together with rampant viruses poses a huge threat to people's information security. Therefore, how to detect and defend against network viruses and reduce the damage they cause to computers, so as to ensure the normal operation of the network, has received wide attention from researchers. to realize the mathematical model of virus protection system and made statistical analysis. In this paper, the detection and defense of network viruses were studied based on the support vector machine (SVM). The API calls of portable executable (PE) files were obtained using debugging API technology, and useful features were screened through information entropy as the input of the SVM algorithm. Virus samples and normal samples were distinguished through the classification function of SVM. The performance of the SVM algorithm was tested on the data set to verify the feasibility of the method in practical application. This work makes some contributions to the further development of network security.

Network virus feature screening
In the Windows system, the PE file is the standard format of executable files. 13 Therefore, viruses are written in the PE format in order to spread better. During operation, Windows calls various service functions to carry out a task, and these functions are called API functions. For example, to read a file on a disk, the ReadFile function in Kernel32.dll is called first; if the permission meets the requirement, the NtReadFile function in Ntdll.dll is called; then, the CPU is switched to the operating system kernel mode through the KiSystemService function; finally, the NtReadFile kernel function in ntoskrnl.exe is called. After the file is read, the NtReadFile function in ntoskrnl.exe returns along the same path. This whole sequence is an API call. All programs on the Windows platform, including viruses, need API calls to implement their functions. Therefore, network viruses can be detected by capturing API calls, for example with Windows debugging API 14 or APIHOOK technology. 15 However, APIHOOK technology can only be implemented when every API prototype is known, and not every function can be captured effectively. Therefore, this study used Windows debugging API technology to obtain API calls.
Debugging API technology can load a program or attach itself to a running program to facilitate debugging. When a debugging-related event occurs, the debugger is triggered, that is, the monitoring process is activated. The steps of debugging API technology can be described as follows: (a) the monitor program starts running after the samples are input; (b) the export table of each library file called by the debugged program is analyzed; (c) a breakpoint is set at the entry address of each function; (d) when an interrupt occurs at a breakpoint, the information of the debugged process is obtained; (e) the debugging loop continues until the process ends, and a behavior report is obtained.
A large part of the collected API calls contributes little to virus detection but causes information redundancy and increases the computation of subsequent virus detection. Therefore, it is necessary to screen the collected features and select those with high discriminative power and high importance. This paper used the method of information entropy. Whether a program calls a given API was regarded as information, and whether a program is a virus or not was regarded as an event; the contribution of the information to the event was judged by calculating the information entropy to determine the importance of the feature.
The entropy of a random variable X is set as:

$$H(X) = -\sum_{i=1}^{k} p_i \log p_i, \tag{1}$$

where X has k values and $p_i$ represents the probability that the value of X is $v_i$. It is assumed that under the condition that a random variable Y is known, the conditional entropy of X is:

$$H(X \mid Y) = \sum_{j} p(y_j)\, H(X \mid Y = y_j), \tag{2}$$

where $H(X \mid Y)$ refers to the uncertainty of X that remains once Y is known. Then, under the condition that Y is known, the information added value of X, that is, the information gain, can be written as:

$$IG(X, Y) = H(X) - H(X \mid Y). \tag{3}$$

The greater the value is, the larger the role of the API call is. Features with large information gains are retained to establish a virus feature set. Every program is represented by a boolean feature vector: if a feature appears in the program, it is assigned as 1; otherwise, it is assigned as 0. A program is set as $B_j$, and a feature attribute is set as $A_i$. The feature set can be represented by the matrix $M_{A \times B}$:

$$M_{A \times B} = (m_{ij}), \tag{4}$$

where

$$m_{ij} = \begin{cases} 1, & \text{if feature } A_i \text{ appears in program } B_j, \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$
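The screening step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`entropy`, `information_gain`, `build_matrix`, `top_features`), the base-2 logarithm, and the toy behavior reports are all hypothetical, and the matrix is stored with one row per program for convenience.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_column, labels):
    """H(labels) minus the conditional entropy of labels given a 0/1 feature."""
    total = len(labels)
    conditional = 0.0
    for value in (0, 1):
        subset = [y for f, y in zip(feature_column, labels) if f == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def build_matrix(reports, api_names):
    """One boolean row per program: 1 if the API appears in its report, else 0."""
    rows = []
    for report in reports:
        called = set(report)
        rows.append([1 if api in called else 0 for api in api_names])
    return rows


def top_features(matrix, labels, api_names, k):
    """Names of the k APIs with the largest information gain."""
    gains = []
    for j, name in enumerate(api_names):
        column = [row[j] for row in matrix]
        gains.append((information_gain(column, labels), name))
    gains.sort(key=lambda g: g[0], reverse=True)
    return [name for _, name in gains[:k]]


# Toy data (hypothetical): four behavior reports, label 1 = virus, 0 = normal.
api_names = ["CreateFileW", "ReadFile", "WriteProcessMemory"]
reports = [
    ["CreateFileW", "WriteProcessMemory"],
    ["CreateFileW", "ReadFile"],
    ["CreateFileW", "WriteProcessMemory", "ReadFile"],
    ["CreateFileW"],
]
labels = [1, 0, 1, 0]

matrix = build_matrix(reports, api_names)
print(top_features(matrix, labels, api_names, k=1))  # ['WriteProcessMemory']
```

In this toy set, `CreateFileW` is called by every program and therefore carries no information about the class, while `WriteProcessMemory` separates the two classes perfectly and receives the largest gain.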

The SVM-based detection model
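SVM is a method based on structural risk minimization. 16 It finds the optimal solution through quadratic programming, which avoids the local minimum problem of neural networks, and handles high-dimensional data through the kernel function. Therefore, it has been widely used in many fields. Virus detection can be regarded as a classification problem; therefore, this paper uses SVM to detect viruses. It is assumed that the sample set is $(x_i, y_i)$, where $i = 1, 2, \cdots, n$, $y_i \in \{+1, -1\}$, $x_i$ represents the features of a program, and $y_i$ represents the class that the program belongs to. The corresponding classification surface can be written as

$$w \cdot x + b = 0, \tag{6}$$

that is, satisfying

$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \cdots, n, \tag{7}$$

where w stands for a weight and b stands for a bias. The class interval is $2/\|w\|$, so the optimal classification surface can be written as:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2, \tag{8}$$

and the constraint condition is:

$$y_i (w \cdot x_i + b) - 1 \ge 0. \tag{9}$$

Then, the decision function is obtained:

$$f(x) = \operatorname{sgn}(w \cdot x + b). \tag{10}$$

In order to solve the optimization problem in Equations (8) and (9), the Lagrange multiplier $a_i$ is introduced to transform it into a dual problem:

$$\max_{a} \ \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j (x_i \cdot x_j). \tag{11}$$

The constraint condition is:

$$\sum_{i=1}^{n} a_i y_i = 0, \quad a_i \ge 0. \tag{12}$$

Finally, the SVM model for virus detection can be written as:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} a_i y_i\, k(x_i, x) + b \right), \tag{13}$$

where k is a kernel function. The common kernel functions are the linear, polynomial, radial basis function (RBF), and sigmoid kernels:

$$k(x_i, x_j) = x_i \cdot x_j, \tag{14}$$
$$k(x_i, x_j) = (\gamma\, x_i \cdot x_j + r)^d, \tag{15}$$
$$k(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right), \tag{16}$$
$$k(x_i, x_j) = \tanh(\gamma\, x_i \cdot x_j + r), \tag{17}$$

where $\gamma$, r, and d are all kernel parameters. In selecting the kernel function, because of the large number of samples and features required in virus detection, the RBF kernel function can better avoid the curse of dimensionality. Therefore, this paper chose the RBF kernel function to build the SVM model.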
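To make the final SVM model concrete, the following sketch evaluates the kernelized decision function with an RBF kernel in plain Python. It only illustrates how a trained model classifies a boolean feature vector; it does not perform training. The support vectors, multipliers, bias, and `gamma` value are hypothetical placeholders chosen for the example, and in practice they would come from a solver such as LIBSVM or scikit-learn's `SVC` with `kernel="rbf"`.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svm_decide(x, support_vectors, labels, alphas, bias, gamma):
    """Sign of sum_i a_i * y_i * k(x_i, x) + b; returns +1 or -1."""
    score = bias
    for sv, y, a in zip(support_vectors, labels, alphas):
        score += a * y * rbf_kernel(sv, x, gamma)
    return 1 if score >= 0 else -1


# Hypothetical "trained" parameters, for illustration only.
support_vectors = [[1, 1, 0], [0, 0, 1]]  # boolean API-call feature vectors
sv_labels = [+1, -1]                      # class of each support vector
alphas = [1.0, 1.0]                       # Lagrange multipliers
bias = 0.0
gamma = 0.5

print(svm_decide([1, 1, 0], support_vectors, sv_labels, alphas, bias, gamma))
print(svm_decide([0, 0, 1], support_vectors, sv_labels, alphas, bias, gamma))
```

Each query vector is assigned the class of the support vector it is closest to, because the RBF kernel decays with the squared distance between the two vectors.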

EXPERIMENTAL ANALYSIS
Virus samples were collected from some well-known forums and laboratories, and normal samples were also selected from the Windows system. Finally, 1256 samples were obtained, including 692 virus samples and 564 normal samples. Then, features were extracted by debugging API technology, and the obtained features were stored in the MySQL database. Virus detection and defense were realized by the SVM model designed above. The features extracted by debugging API technology were filtered by information entropy and input into the SVM model to train the model. Then, the trained model distinguished virus samples from normal samples.
The purpose of the experiment was to analyze the virus detection and defense ability of the model designed in this paper. The judgment result was "yes" or "no," as shown in Table 1.

             Normal   Virus
Normal file  TP       FN
Virus file   FP       TN
The evaluation indexes of the model included the true-positive rate (TPR), the false-positive rate (FPR), and the total accuracy:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \tag{18}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN}, \tag{19}$$

$$\mathrm{Total\_Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}. \tag{20}$$

Firstly, the influence of the number of features on the performance of virus detection was analyzed. After extracting API calls by using debugging API technology, a total of 2167 API calls were obtained. Through feature screening, different numbers of APIs were selected for virus detection. The overall accuracy of the model is shown in Table 2.
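The three evaluation indexes in Equations (18) to (20) translate directly into code. The sketch below is a minimal helper; the function name `evaluate` and the confusion-matrix counts are hypothetical and are not the results reported in this paper.

```python
def evaluate(tp, fn, fp, tn):
    """Return (TPR, FPR, total accuracy) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    total_accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, total_accuracy


# Hypothetical counts, for illustration only.
tpr, fpr, accuracy = evaluate(tp=90, fn=10, fp=5, tn=95)
print(f"TPR={tpr:.2%}  FPR={fpr:.2%}  accuracy={accuracy:.2%}")
```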
It was seen from Table 2 that as the number of APIs increased, the overall accuracy of the model rose rapidly, from about 80% to about 90%. When the number of APIs reached 1200, the overall accuracy was the highest, 95.27%. Then, as the number of APIs continued to grow, the accuracy of the algorithm began to decline. When the number of APIs reached 2100, the overall accuracy of the algorithm was 90.27%, 5 percentage points below the peak. Therefore, the number of selected APIs was set as 1200.
The method of 10-fold cross-validation was adopted to determine the performance of the SVM model, as shown in Table 3.
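For readers unfamiliar with the procedure, 10-fold cross-validation splits the samples into ten disjoint folds and uses each fold once as the test set while training on the other nine. The generator below is a minimal sketch of the splitting step only; the name `k_fold_indices`, the shuffling seed, and the round-robin fold assignment are assumptions made for illustration, not details taken from this paper.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1256 is the number of samples used in the experiment.
splits = list(k_fold_indices(1256, k=10))
print([len(test) for _, test in splits])
```

The model would be trained and evaluated once per split, and the TPR, FPR, and accuracy averaged over the ten runs.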
It was seen from Table 3 that the model designed in this paper had a high accuracy rate in the detection and defense of network viruses. The SVM algorithm was compared with the decision tree algorithm (C4.5), the K-means algorithm, and the Naive Bayes algorithm to further verify its effectiveness.
The experimental results are shown in Figure 1. It was seen from Figure 1 that the TPR values of the four algorithms were 92.17%, 93.22%, 94.26%, and 95.2%, respectively, that is, the TPR of the SVM model was 3.03, 1.98, and 0.94 percentage points higher than that of the former three algorithms; the FPR values of the four algorithms were 6.86%, 6.32%, 4.67%, and 3.31%, respectively, that is, the FPR of the SVM model was 3.55, 3.01, and 1.36 percentage points lower than that of the former three algorithms. The overall accuracy of the four algorithms was 85.49%, 88.67%, 92.16%, and 95.42%, respectively. The SVM model had the highest accuracy, which was 9.93, 6.75, and 3.26 percentage points higher than that of the former three algorithms, verifying the reliability of the SVM model. The SVM model had a higher TPR, a lower FPR, and a significantly higher accuracy, showing the best performance in detecting viruses.

DISCUSSION
The emergence of more and more new and unknown viruses will not only have a huge impact on society but may also lead to irreparable economic losses 17 ; therefore, virus detection and defense has become a key and difficult problem in the field of network security, 18 and more new and effective methods are urgently needed. At present, the commonly used methods include the behavior detection method, 19 the heuristic scanning method, 20 the intelligent detection method, etc. The intelligent detection method applies intelligent algorithms such as data mining and machine learning to the detection and defense of viruses. Many data mining algorithms, such as decision trees and neural networks, have been successfully applied in virus detection, which shows that data mining has great potential and prospects in virus detection and defense.
In this study, API calls were extracted by debugging API technology as virus features and screened based on information entropy. The SVM model was selected as the classification algorithm to study network virus detection. First of all, the number of extracted features had an impact on the virus detection performance. When the number of features was too large, the redundant information they contained was not conducive to virus detection. According to Table 2, the algorithm had the best performance when the number of APIs used was 1200. In the 10-fold cross-validation, the SVM model showed a TPR of 95.2%, an FPR of 3.31%, and an overall accuracy of 95.42%, indicating that the SVM model had an excellent performance in virus detection and could effectively distinguish normal programs from virus programs. Finally, compared with the other algorithms, the SVM model had good reliability in virus detection. In Figure 1, comparisons with the C4.5, K-means, and Naive Bayes algorithms demonstrated that the SVM model had a higher TPR, a lower FPR, and a significantly higher overall accuracy (95.42%), indicating that the choice of classification model affects the performance of virus detection when the same features are used as the input. The SVM model showed a performance superior to that of the other algorithms when classifying virus files and normal files, verifying that the SVM model was effective in detecting viruses.