Detecting vulnerability in source code using CNN and LSTM network

Automated vulnerability detection has become an active research area because it improves software quality and security. Code metrics (CMs) are one important class of vulnerability representations for source code, but traditional CM-based vulnerability detection has not sufficiently considered the implicit relationships among different metric attributes. In this paper, exploiting the local perception capability of the convolutional neural network (CNN) and the time-series modeling capability of long short-term memory (LSTM), we propose VULNEXPLORE, a compound neural network model for vulnerability detection that consists of a CNN for feature extraction and an LSTM network for deep representation. Moreover, to better expose vulnerability features in source code, we reconstruct a CM dataset that includes two additional important attributes: the maintainability index and the average number of vulnerabilities committed per line. Our proposed method keeps both the false-negative rate (FNR) and false-positive rate (FPR) under 20% while achieving recall and precision above 80%.


Introduction
In the context of information technology, software is deployed at scale across industries. Unfortunately, the problem of vulnerabilities cannot be completely avoided, owing to defects introduced by programmers, whether intentionally or unintentionally, and to the high complexity of software.
Since vulnerabilities cannot be prevented at their source, the main defense is to detect and patch them as early as possible, giving rise to the field of software vulnerability detection [1]. Given the importance of the problem, many researchers have studied vulnerability detection extensively.
The current solutions for vulnerability detection are mainly static detection, dynamic detection, machine learning, and deep learning approaches [2][3][4][5][6][7][8]. Many mature methods have been developed for static and dynamic detection, but these methods require a large amount of a priori expertise. Machine learning and deep learning-based methods have emerged in recent years and are rapidly evolving with the support of cloud computing and high computing power, making automated vulnerability detection a reality.
Recent research has used deep learning on code features to detect vulnerabilities in code. Unlike traditional detection methods, deep learning does not require researchers to manually perform extensive processing of code features and can automatically determine the relationship of features in the sample data from the training samples [9][10][11][12][13][14][15][16].
To explore the correlation between vulnerable-code features and vulnerabilities, we propose VULNEXPLORE, a code metric-based vulnerability detection method. VULNEXPLORE applies a long short-term memory (LSTM) network to a code metric dataset, with the aim of exploring the implicit relationships among different code metrics, while empirically verifying that code metrics are positively correlated with the occurrence of vulnerabilities. Previous researchers have applied CNNs or DNNs to code metrics for vulnerability detection, but those schemes process and learn each feature in isolation and do not explore the relationships between the metrics.
As mentioned previously, code metrics can quantify the characteristics of vulnerabilities quite effectively, so in this study, we hope to explore the implicit relationship between different code metrics by building a deep learning model for the purpose of vulnerability detection.
The contribution of our study is twofold. First, as an important step in the study, we constructed a code metric dataset and make this function-granularity dataset publicly available.
Second, for this code metric dataset, we constructed a deep learning detection model to explore the relationship between code metrics and vulnerabilities, and improved on previous work so that our model performs better in terms of the accuracy, recall, and F-measure of vulnerability detection.
Paper Organization. The rest of this paper is organized as follows.
Section II presents the relevant background and the most relevant studies, Section III describes the problem studied and the methodology of the study, Section IV describes the design of the experiment, Section V presents the results and a discussion of the experiments, Section VI discusses the limitations of the study methodology, and Section VII presents future work on the subject.

Background and related works
In this section, the background and related work are briefly described. We first present the background of the research, then introduce the concepts related to our work (neural networks, vulnerability detection, and code metrics), and finally cite other related articles.

Deep Learning Neural Networks
Deep learning is a sub-branch of machine learning. There are well-established deep learning methods in image processing, medicine, natural language processing, and several other fields. Most deep learning models are now based on artificial neural networks (ANN), which are complex neural networks composed of artificial neurons (ANs) inspired by neurons in the human brain cortex [17]. The information processing of individual artificial neurons is quite simple, but the global behavior caused by the interaction of individual neurons in a neural network allows solving complex problems.
While neural networks have been very successful in those areas, the same is not yet true of vulnerability detection. Source code differs from images and natural language, and many deep learning architectures do not transfer directly to vulnerability detection, so an appropriate network must be selected for source code. In this paper, since we use code metrics to characterize source code, we use a convolutional neural network (CNN) for feature extraction over the code metrics and a long short-term memory (LSTM) network for vulnerability detection, to explore the relationship between different code metric attributes and whether contextual semantic information is reflected between attributes.
Convolutional neural networks have excelled in the field of image processing. Hubel and Wiesel's research from the 1950s to the 1960s found that the visual cortex of monkeys and cats contains neurons that each respond to a small region of the visual field [18]: when the eyes fixate on a region, a single neuron is excited, and adjacent neurons have similar receptive fields. To cover the whole visual field, the size and location of these receptive fields vary systematically across the visual cortex. Convolutional neural networks were proposed based on this idea. In image processing, they can extract features of images such as edges, lines, and corners. For one-dimensional features such as code metrics, we use asymmetric convolutional kernels (e.g., 1×2 or 1×3 kernel sizes) to extract features from the input data and then perform feature sampling through the pooling layer [19,20].
The long short-term memory (LSTM) network is a variant of the recurrent neural network (RNN) designed to overcome the vanishing and exploding gradient problems that arise when training RNNs on long sequences, so LSTMs perform better than plain RNNs on such sequences. Because the LSTM is explicitly designed to avoid the long-term dependency problem, remembering information over long spans is its default behavior rather than something that must be learned during training; gating controls the transmitted state so that information is remembered selectively.

Vulnerability Detection
In this paper, we choose C/C++ as the target languages for vulnerability detection. Unlike previous studies, our CNN+LSTM better captures the relationships between individual code metrics and significantly improves the detection accuracy of vulnerabilities. We describe the vulnerability detection model in Section 3.

Code Metrics
A code metric is a set of software metrics that characterize the nature and specification of the source code with the goal of obtaining objective, quantitative metrics. Some basic code metrics are the number of lines of code, the number of blank lines, the number of comment lines, the number of words, etc. [21].
The McCabe metric is a complexity metric based on program control flow proposed by Thomas McCabe, also known as cyclomatic complexity [22]. He argues that the complexity of a program depends heavily on the complexity of its control flow graph: a single sequential structure is the simplest, and the more independent paths formed by loops and branches, the more complex the program is, and the higher the likelihood of vulnerabilities. Once the program is depicted as a control flow graph, the cyclomatic complexity can be calculated in any of the following three ways.
(1) V(G) = R, where R is the number of regions in the control flow graph.
(2) V(G) = E − N + 2, where E is the number of edges in the graph and N is the number of nodes.
(3) V(G) = P + 1, where P is the number of predicate (decision) nodes.
The code metrics used in this study are summarized in Table 1 [23,24].
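As a concrete illustration of formula (2), the cyclomatic complexity of a small control flow graph can be computed directly from its edge list; the graph here is a hypothetical if/else fragment, not taken from the paper's dataset:

```python
def cyclomatic_complexity(num_edges, num_nodes):
    """McCabe cyclomatic complexity V(G) = E - N + 2 for a
    connected control flow graph with E edges and N nodes."""
    return num_edges - num_nodes + 2

# Control flow graph of a simple if/else statement.
edges = [("entry", "cond"), ("cond", "then"), ("cond", "else"),
         ("then", "exit"), ("else", "exit")]
nodes = {n for e in edges for n in e}
print(cyclomatic_complexity(len(edges), len(nodes)))  # prints 2
```

With 5 edges and 5 nodes, V(G) = 5 − 5 + 2 = 2, matching the two independent paths through an if/else.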

Methodology
Our goal is to design a vulnerability detection system, built on the proposed detection model, that can automatically determine whether software contains vulnerabilities from its source code, simply by computing code metrics on the given code. In this section, we first describe the design of the model, then present the main research questions derived from our goal of vulnerability detection, and finally elaborate on the modules of the model in terms of these questions.

Overview of the Model
This subsection gives an overview of the model, which has two phases: a training phase and a testing phase. In the training phase, we construct the code metric dataset by extracting code metrics from a large number of source code files, some vulnerable and some clean. Our CNN+LSTM network is then trained on the labeled vulnerable and clean data, and cross-validation is used to evaluate the model.

RQ1: Can code metrics be used as input features to deep learning models for vulnerability detection?
To answer this question, we first describe the preparation of the dataset and then the network that learns from it. We use a CNN+LSTM neural network structure, which includes an input layer, a convolutional layer, a pooling layer, an LSTM layer, and a dense layer. CNNs perform well in image processing and natural language processing; their local perception and weight sharing greatly reduce the number of parameters and thus improve the learning efficiency of the model.
Since each instance in our code metric dataset is a sequence of individual code metrics, the CNN's ability to abstract features allows it to extract the information implicit in the metrics and pass higher-quality, more concentrated features to the LSTM for vulnerability detection.
Our CNN consists of two main components: a convolutional layer and a pooling layer. Figure 1 shows the details of the CNN. Since each instance in the code metric dataset is a one-dimensional sequence, we choose an asymmetric convolutional kernel to perform feature extraction on the sequence. Although the convolutional layer extracts the features of the data, the dimensionality of the extracted features is quite high, so to reduce it and lower the cost of training the network, a pooling layer is added after convolution to select the features. The convolution is computed as

c = tanh(W * x + b),

where c is the output value after convolution, tanh is the activation function, x is the input sequence, W is the weight of the convolution kernel, and b is the bias of the convolution kernel. The pooling layer uses max pooling, so the output vector after convolution and pooling is

p = maxpool(c),

where p is the output of the CNN layer and maxpool(·) is the max-pooling operation.
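The convolution-plus-pooling stage can be sketched in NumPy as follows; the metric vector and kernel weights below are illustrative values, not the trained parameters of VULNEXPLORE:

```python
import numpy as np

def conv1d_tanh(x, w, b):
    """Valid 1-D convolution with a 1xk kernel followed by tanh,
    mirroring c = tanh(W * x + b)."""
    k = len(w)
    return np.tanh(np.array([np.dot(w, x[i:i + k]) + b
                             for i in range(len(x) - k + 1)]))

def max_pool(c, size=2):
    """Non-overlapping max pooling over the convolved sequence,
    mirroring p = maxpool(c)."""
    return np.array([c[i:i + size].max()
                     for i in range(0, len(c) - size + 1, size)])

# Toy code-metric vector and an asymmetric 1x3 kernel.
x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
w = np.array([0.2, -0.1, 0.3])
features = max_pool(conv1d_tanh(x, w, b=0.1))
print(features.shape)  # (2,)
```

A length-6 input convolved with a 1×3 kernel yields 4 activations, and pooling with window 2 halves that to 2 condensed features, illustrating how pooling reduces dimensionality.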
The output of the CNN is then passed to the LSTM layer. The LSTM unit consists of a forget gate, an input gate, and an output gate, as shown in Figure 2.
(a) The output of the previous time step h_{t−1} and the input at the current time x_t are fed to the forget gate, whose output is computed as

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),

where f_t takes values in (0, 1), W_f is the weight of the forget gate, b_f is its bias, h_{t−1} is the output value at the previous time step, and x_t is the input value at the current time.
(b) The same pair is fed to the input gate to obtain its output i_t and the candidate cell state C̃_t:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i),
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C),

where i_t takes values in (0, 1), W_i is the weight of the input gate, b_i is its bias, W_C is the weight of the candidate state, and b_C is its bias.
(c) The state of the current unit is updated as

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t.

(d) h_{t−1} and x_t also serve as the input to the output gate at time t, whose output is

o_t = σ(W_o · [h_{t−1}, x_t] + b_o),

where o_t takes values in (0, 1), W_o is the weight of the output gate, and b_o is its bias.
(e) The final output value is obtained from the output gate and the cell state:

h_t = o_t ⊙ tanh(C_t).
The output of the LSTM is obtained through the above computation, and finally the output of the whole neural network is obtained through a dense layer. The experiments and results are discussed in Section 4.
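One LSTM time step, following the gate equations above, can be sketched in NumPy; the weights here are randomly initialized for illustration, not the trained parameters of the model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step implementing the gate equations (a)-(e).
    Each W[g] maps the concatenated [h_prev, x_t] to one gate."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])          # (a) forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # (b) input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # (b) candidate cell state
    c_t = f * c_prev + i * c_tilde            # (c) cell state update
    o = sigmoid(W["o"] @ z + b["o"])          # (d) output gate
    h_t = o * np.tanh(c_t)                    # (e) final output
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, inputs = 4, 3
W = {g: rng.standard_normal((hidden, hidden + inputs)) for g in "fico"}
b = {g: np.zeros(hidden) for g in "fico"}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inputs), h, c, W, b)
print(h.shape)  # (4,)
```

Because o_t lies in (0, 1) and tanh is bounded, each hidden output stays inside (−1, 1), which is what makes the gated state update numerically stable over long sequences.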

Experiments and Results
In this section, we describe the steps and results of the experiment in detail. To verify the validity of the model, we give the evaluation metrics. Then, we describe the process of data pre-processing and how to train the model. Finally, we use the model for vulnerability detection and compare the results with other methods and tools.

Evaluative indicators
A good vulnerability detection model should make as many correct detections as possible within the detection range and miss as few vulnerabilities as possible. Given the above purpose of vulnerability detection, we use five general and well-known evaluation metrics: accuracy, recall, false-negative rate, false-positive rate, and F-measure.
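For clarity, the five metrics can be computed from the confusion matrix of a detector; the counts below are illustrative, not the paper's experimental results (precision appears only as an intermediate for the F-measure):

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, recall, FNR, FPR, and F-measure from confusion
    matrix counts (true/false positives and negatives)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)           # true-positive rate
    fnr = fn / (fn + tp)              # missed vulnerabilities
    fpr = fp / (fp + tn)              # false alarms
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "fnr": fnr, "fpr": fpr, "f_measure": f_measure}

# Hypothetical counts: 100 vulnerable and 100 clean test functions.
print(detection_metrics(tp=80, fp=15, tn=85, fn=20))
```

Note that recall and FNR are complementary (recall + FNR = 1), so a model with recall above 80% necessarily has an FNR below 20%.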

Data Preparation
As described in Subsection 3.2, the original dataset comes from the public dataset proposed by Li et al. The code slices in this dataset are from NVD [26] and SARD [27], where NVD contains flaws in real-world software applications and may also contain diff files for the vulnerable code slices before and after patching. SARD contains vulnerability samples from real-world software applications as well as artificially constructed vulnerabilities, classified as positive, negative, and mixed, i.e., patched [28]. Here we give a code slice to demonstrate the exact process of computing the code metrics:

    if (strncmp(lastcomm, me->comm, sizeof(lastcomm))) {
        printk(KERN_INFO "IA32 syscall %d from %s not implemented\n",
               call, me->comm);
        strncpy(lastcomm, me->comm, sizeof(lastcomm));
    }
    return -ENOSYS;

A snippet of sliced code
This code slice is a fragment of code that has been processed to retain only the statements related to the vulnerability. As mentioned in RQ1 of Subsection 3.2, in this study we propose 20 code metrics to characterize a piece of code, among which 8 can be extracted directly from the code fragment, namely: empty lines, lines of comments, lines of program, physical lines, number of distinct operators (η1), number of distinct operands (η2), total number of operators (N1), and total number of operands (N2). The above code snippet gives η1 = 21, η2 = 8, N1 = 45, and N2 = 20.
The rest of the code metrics can be calculated from these counts. The formulae are as follows: program vocabulary η = η1 + η2; program length N = N1 + N2. Through the above code metric formulas, we can obtain a complete code representation with which to build the code metric dataset.
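The derived Halstead-style metrics follow directly from the four counts extracted above; a minimal sketch, using the counts from the code slice:

```python
def halstead_vocab_length(n1, n2, N1, N2):
    """Program vocabulary and length from distinct (n1, n2) and
    total (N1, N2) operator/operand counts."""
    vocabulary = n1 + n2   # eta = eta1 + eta2
    length = N1 + N2       # N = N1 + N2
    return vocabulary, length

# Counts taken from the code slice above.
print(halstead_vocab_length(21, 8, 45, 20))  # (29, 65)
```

For the example slice, the program vocabulary is 29 and the program length is 65.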

Experiments
We use the code metric dataset constructed in Subsection 4.2 to train the neural network and find the best network model parameters. To evaluate the effectiveness of the model, we use k-fold cross-validation on the same benchmark: the dataset is randomly divided into k equal-sized subsets, so that in each round one subset serves as the test set and the remaining subsets form the training set. For our dataset, 3-fold cross-validation works best.
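The splitting scheme can be sketched in plain Python (the actual pipeline may use a library implementation; this illustrates only the partitioning logic):

```python
import random

def k_fold_splits(samples, k=3, seed=42):
    """Randomly partition samples into k equal-sized folds; in each
    round one fold is the test set and the rest form the training set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        yield train, test

# 12 sample indices split 3-fold: each round trains on 8, tests on 4.
for train, test in k_fold_splits(list(range(12)), k=3):
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every instance is evaluated once across the k rounds.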

Results and Discussion

Limitations
The current VULNEXPLORE has several limitations in its design, experimentation, and evaluation, which also suggest new ideas and directions for future work. First, our code metrics dataset is based on a publicly available code slicing dataset, and because code slicing makes it impossible to compute class-granularity code metrics (e.g., class coupling, inheritance depth), the experimental dataset inevitably misses some information. In the future, we need to collect a large amount of vulnerable source code in order to investigate class-granularity code metrics.
Second, we have so far targeted only four types of vulnerabilities in C/C++: library/API calls, incorrect use of arrays, improper use of pointers, and incorrect use of arithmetic expressions, and we were unable to verify the generality of the model. In addition, the VULNEXPLORE implementation is written in Python; future work should consider other languages to improve the runtime efficiency of the model.

Conclusion
In this study, we propose VULNEXPLORE, an improved composite neural network model for vulnerability detection. It uses a convolutional neural network for feature extraction over code metrics, followed by a long short-term memory network that learns from the extracted features, and is able to detect vulnerabilities from code metrics. Experiments show that our model is effective, but there is room for further research.
We conclude that while it is possible to characterize vulnerable code using a single granularity or one code metric, this characterization is incomplete. Characterizing vulnerabilities by multi-scale (e.g., different granularity, control flow, code slicing, or even word-by-word comparison using sliding windows) code metrics may provide better detection performance.
We also conclude that the CNN+LSTM neural network model processes features more effectively than single-network models [29][30]. Combined with the previous conclusion, an interesting future topic is to introduce scale pyramids, borrowing the concept of image pyramids from object recognition, combined with attention mechanisms or classical ML methods.

Future Works
For future work, we will first collect a large amount of vulnerable source code and, on this basis, study multi-scale code metrics that characterize vulnerable source code at both shallow and deep levels, for example by slicing the source code along control flow to obtain semantic information about the vulnerability and using it as a flow-scale code metric. We will also continue to investigate deep learning-based vulnerability detection. Deep learning has been very effective in image processing and natural language processing, and vulnerabilities share many similarities with images and natural language, so we will further investigate neural network models to improve the effectiveness of vulnerability detection.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Declarations
The authors confirm that all of the research meets the ethical guidelines, including adherence to the legal requirements of the country in which the study was conducted.