Semantic-based vulnerability detection by functional connectivity of gated graph sequence neural networks

In computer security, semantic learning is helpful in understanding vulnerability requirements, realizing source code semantics, and constructing vulnerability knowledge. Nevertheless, learning how to extract and select the most valuable features for software vulnerability detection remains difficult. In this paper, we first derive a subset of vulnerability knowledge representations from the Functional Connectivity (FC) of Gated Graph Sequence Neural Networks (GGNNs). Gated Graph Sequence Neural Networks can be utilized to capture long-term dependencies and learn a high-level representation of potential vulnerabilities in order to detect vulnerabilities in a target project. Studying functional connectivity-based graph neural networks ensures a deep understanding of the operation of sequence graph networks as highly complex interconnected systems. This ensures that the model focuses on vulnerability-related code, which makes it more appropriate for vulnerability mining tasks. The model constructs a composite semantic code property graph for code representation based on the causes of vulnerabilities. The experimental findings indicate that the suggested model can select relevant discriminative features and achieves superior performance compared with benchmark methods.


Introduction
It is vital to detect software vulnerabilities in applications early to implement cost-effective attack mitigation solutions. Concerning code execution, code analysis techniques for identifying vulnerabilities can be divided into the following categories: static, dynamic, and hybrid approaches. Static techniques, including rule/template-based analysis, code similarity detection (in other words, code clone detection), and symbolic execution, primarily rely on analyzing the source code. However, they usually suffer from drawbacks such as high false-positive rates. Dynamic analysis involves fuzz testing and taint analysis, which usually suffer from low code coverage. Although hybrid approaches combine static and dynamic analysis techniques to overcome the drawbacks mentioned above, they also inherit their deficiencies, and their operation in practice is ineffective.
Machine learning (ML)-based solutions have been presented to close the gap by introducing different feature sets for the detection systems to learn from, trying to approximate a practitioner's search for vulnerabilities. ML-based detection systems are only capable of trying to approximate the process of a practitioner judging whether there is a potential vulnerability in a code snippet as a result of learning from the feature sets that are extracted manually (research based on traditional ML algorithms) or automatically (research based on deep learning) from the code (both the source code and binary). Unlike practitioners, they cannot wholly comprehend the semantic meanings that underlie vulnerable code patterns.
Features are extracted automatically by deep learning (DL)-based methods, achieving low false positives and false negatives. DL-based vulnerability detection methods are divided into three types according to the feature extraction method. Sequence-based vulnerability detection methods represent the first type. This type of research uses deep neural networks (DNNs) to extract feature representations from sequence code entities. The sequence usually contains text sequences and function call sequences. The text sequence primarily comprises the source code text, assembly instructions, and the source code processed by the code lexer. Static and dynamic calls are included in the function call sequence. Furthermore, these methods ensure that neural networks capture flow-based patterns and advanced features.
Recurrent networks on general graphs were first proposed under Graph Neural Networks (Gori, Monfardini and Scarselli 2005; Scarselli et al. 2009). Graph neural networks (GNNs) constitute neural models capturing the dependence of graphs via message passing between the nodes of graphs. Important information about the topological architecture of human brain networks is revealed by graph-based network analysis. The connectome is the entity of neural connections, represented as a mathematical graph whose edges encode connection strengths.
Such graphical models act as a network consisting of nodes and edges as connections between nodes. At a methodological level, graph theory techniques have been frequently utilized to analyze brain networks, providing new insight into atypical changes in brain connectivity. Nodes of Graph Neural Networks represent connectivity, and edges represent the similarity between software vulnerability features using their connectomes in the network. In Graph Neural Networks (GNNs) (Cho et al. 2014; Choi et al. 2017), every node updates its hidden state information by aggregating the hidden states of its neighbors and its own state information at the previous time step. Thus, the node attributes or the attributes of the whole graph are predicted. Using the most connectional aspects of Graph Neural Networks, the most reliable features are derived from sequence graphs with a connectomic feature selection strategy, and the candidate feature representation groups are obtained. A semantic vulnerability network is modeled as a directed graph over the most reliable features, derived from the most connected nodes and learned through the feature selection strategy. Rather than describing the connectome as a collection of weights in a connectivity matrix, it utilizes a semantic network that describes every functional connectivity-based graph neural network as an individual semantic group and the connections as semantic relationships among them.
We use a Gated Graph Sequence Neural Networks model in the present study, which learns the connectivity structure to learn vulnerable semantic patterns. The suggested approach represents the graph connectome by a semantic network, a formalism often utilized in knowledge management to describe semantic relations between connected nodes. Functional connectivity is presented in terms of correlation values between the nodes of neural graph networks, measured as the similarity between software vulnerability features over time. We employed this approach to create a unified feature space for graph neural network connectivity, in which we created functional connectivity and node connectome networks. Finally, the semantic network accumulates the knowledge of node connectivity, becoming transcendent over the gated graph neural network measurements. The main focus is the theoretical graph analysis of connectivity patterns in the complex brain network and its applications in software vulnerability. Applying gated graph sequence neural networks allows using graph theory metrics to investigate significant features of the brain connectome. An investigation of the human brain's connectivity ensures an in-depth understanding of the brain's operation as a highly complex interconnected system. Similarly, code semantics can be directly learned and processed to identify potentially vulnerable code patterns using neural graph networks. The rationale behind the suggested model is two-fold. On the one hand, the existing research that applies different network types can acquire various sets of semantic information from the software code. On the other hand, with more complex network structures applied, there is less need for code analysis for feature extraction, because complex network models are more expressive and capable of learning code semantics.
The principal contributions of the current study are given below.
• We suggest a novel framework to extract useful features for detecting software vulnerabilities from Graph Neural Networks, learning unified representations of patterns.
• We develop a challenging framework to investigate whether connectome-based features identified in high-level representations can be transferred to a vulnerable sequence for a classification task.
• In the current paper, we specifically emphasize applying a Gated Graph Sequence Neural Networks model to detect software vulnerabilities, concerning how the evolution of the network structure facilitates the reduction of the semantic gap.
• In the current study, we first suggest a novel approach that creates a semantic network model of vulnerable patterns and models connectivity as semantic relationships among them. The critical idea of turning the graph of a connectome network into a semantic network is treating functional connectivity as a semantic relationship between the two connected, measured features.
The remaining part of the current work is structured in the following way. In the ''Related Works'' part, some related noteworthy works are introduced. In the ''Background'' part, the details of feature selection strategies are presented. The ''Methods'' part includes the details of the used techniques. The ''Methodology'' part contains the details of our suggested Model. The ''Experimental Results and Performance Analysis'' part describes the experimental setting and evaluation measures. Finally, the ''Conclusion'' part summarizes the paper.

Related works
Many studies have been proposed to apply Graph Network models for detecting software vulnerabilities to date. In this part, we will focus on reviewing the state-of-the-art studies which adopted deep neural networks and Graph Neural Networks for vulnerability discovery from a new perspective, pointing out that vulnerability detection based on code semantic learning can be a new research trend that has brought promising results. In ), a supervised framework was proposed to capture deep contextual representations to learn long-range code dependency. In this way, the textual code sequences were converted into meaningful dense vectors and fed to the Bi-LSTM layer to learn long-range dependency further. The experimental study demonstrated that the suggested framework provided a useful feature set for vulnerability detection. In ), a framework for vulnerability detection with six advanced mainstream network models built in ensured one-click execution for model training and testing on the suggested dataset. Empirical findings demonstrate that the variants of recurrent neural networks and convolutional neural networks exhibit good performance on the proposed dataset. Concerning the ability to generalize, the fully connected network performs better than the remaining network architectures. In (Ye et al. 2020), a neural code semantics similarity system was proposed. The proposed system learned the framework of the deep neural network (DNN) semantics similarity scoring. It showed its efficacy across bag-of-features, a recurrent neural network (RNN), and a graph neural network (GNN). The experimental evaluation showed that there might not be one universally optimal context-aware semantics structure configuration. In , multiple semantic graphs were combined to create a more comprehensive graph. Afterward, the graph neural network was adopted instead of the sequence-based model for the automatic analysis of the comprehensive graph.
The experimental findings show that the proposed Model obtained more promising results than state-of-the-art methods. (Nguyen et al. 2021) presented a new graph neural network-based model for vulnerability detection in the source code. The suggested algorithm ensures the novel usage of residual connection among GNN layers and a valuable mixture of the sum and max poolings to learn code graph representation better.
According to the experimental findings, the suggested model performs considerably better compared with the baseline models and achieves a maximum accuracy of 63.69% on the benchmark dataset. ) suggested a new model of deep learning-based vulnerability detection that defines features by employing the clustering theory of the clonal selection algorithm. Deep-learned, long-lived team-hacker features are proposed to process memories of sequential features and map from the history of previous inputs to the target vectors in theory. (Singh and Chaturvedi 2020) concentrated on all the modifications needed to effectively accommodate the problem of detecting software vulnerabilities with deep learning approaches. Moreover, it examined different vulnerability databases/resources and a number of the newly developed successful applications of deep learning to predict vulnerabilities in software. (Lin et al. 2019) suggested a benchmarking framework to build and test Deep Learning (DL)-based vulnerability detectors, ensuring six built-in mainstream neural network models with the possibility of selecting three embedding solutions. In (Sahin, Dinler and Abualigah 2021), a Deep Neural Networks SYMbiotic-based Genetic Algorithm model (DNN-SYMbiotic GAs) was proposed to learn the phenotyping of dominant features for predicting software vulnerabilities. The suggested method aimed to increase the detection capability of vulnerability patterns with vulnerable components in the software. ) introduced an automated and intelligent method for detecting vulnerabilities in the source code according to minimum intermediate representation learning. ) presented the SySeVR framework to utilize deep learning for vulnerability detection. According to a comprehensive dataset of collected vulnerabilities, some insights explaining the efficiency of deep learning in vulnerability detection were provided. (Zhou et al. 2019) suggested a general graph neural network-based model for graph-level classification by learning on a rich set of code semantic representations. It involves a new Conv module to efficiently extract valuable features from the learned rich node representations for graph-level classification. In (Sahin 2021a), the first Clock-Work RNN-based Dendritic Cell Algorithm (DCA) was proposed to identify complex dependencies between vulnerable object-oriented software metrics. The findings indicate that the proposed algorithm performs excellently concerning the detection rate and obtains encouraging results in reducing the number of false-positive errors that similar systems present. (Guo et al. 2022) presented a graph neural network vulnerability mining system, HyVulDect, on the basis of the suggested hybrid semantics, constructing a composite semantic code property graph for representing code according to the vulnerability causes. (Cao et al. 2021) studied the restrictions of the current deep learning-based vulnerability detection approaches and suggested a vulnerability detection approach, named BGNN4VD (Bidirectional Graph Neural Network for Vulnerability Detection), by establishing a Bidirectional Graph Neural Network (BGNN). In the study by (Hin et al. 2022), statement-level vulnerability detection was formulated as a node classification task. LineVD leverages control and data dependencies between statements by utilizing graph neural networks and a transformer-based model to encode the raw source code tokens. (Cao et al. 2022) suggested a novel graph neural network-based approach, MVD (Memory-Related Vulnerability Detection), leveraging a flow-sensitive graph neural network (FS-GNN) to jointly embed unstructured and structured information in order to preserve high-level program semantics for learning implicit vulnerability patterns.

Software vulnerability
The short definition of software security vulnerabilities, which are also called security defects (Sestili, Sanvley and VanHoudnos 2018), security bugs (Votipka et al. 2018), and software weaknesses (Lee et al. 2017), is ''software bugs with security implications'' (Sabottke, Suciu and Dumitras 2015). It is possible to categorize vulnerability analysis approaches into three main groups: (1) software metrics-based, (2) vulnerable code pattern-based, and (3) anomaly-based. Software metrics, including McCabe (McCabe 1976) and Code Churn (Nagappan and Ball 2005) metrics, indicate the quality of software products numerically from various perspectives. The McCabe metrics represent software complexity metrics. The Code Churn metrics, utilized as an indicator of vulnerability detection, reflect the tendency of code that undergoes frequent modifications to contain errors. Thus, there is a high possibility of such code having defects and probably vulnerabilities, considering that vulnerabilities constitute a subset of software defects. Despite the long history of the mentioned metrics in measuring software quality and predicting defects, they do not indicate vulnerabilities directly.
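As an illustration of the metrics-based view, McCabe's cyclomatic complexity can be computed directly from a control-flow graph with the standard formula M = E - N + 2P. The sketch below uses a hypothetical helper name of our own:

```python
def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """McCabe cyclomatic complexity M = E - N + 2P for a control-flow graph."""
    return edges - nodes + 2 * components

# A straight-line function (3 CFG nodes, 2 edges, one component) has complexity 1;
# each added decision point raises the count by one.
print(cyclomatic_complexity(edges=2, nodes=3))  # 1
print(cyclomatic_complexity(edges=9, nodes=7))  # 4
```

As the surrounding text notes, such a score measures complexity, not vulnerability, so it can only serve as an indirect indicator.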

Feature selection strategies
As a strategy for data preprocessing, the effectiveness and efficiency of feature selection have been confirmed in the preparation of high-dimensional data for data mining and machine learning tasks. Generally, it is possible to divide feature selection approaches into two groups: supervised and unsupervised. Supervised approaches utilize label information to guide the selection process. Unsupervised approaches aim to describe the data structure in some feature space when label information is absent. Another rough classification of feature selection methods is filters, wrappers, and embedded methods. Wrapper approaches, e.g., sequential forward selection and sequential backward selection, assess a subset of features according to the accuracy of a particular classifier on a particular data set. A model is trained and tested for every subset of candidate features, and the model's performance is utilized to guide the feature selection process. Wrappers are computationally very intensive, particularly when the selected model is complex. In embedded methods, feature selection is embedded in classification. Filter methods assess a subset of features according to its information content rather than optimizing the performance of a particular classifier.
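A minimal filter-method sketch (a hypothetical helper of our own, NumPy only): features are scored by their absolute Pearson correlation with the label, independently of any classifier, which is what distinguishes filters from wrappers:

```python
import numpy as np

def filter_select(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Rank features by |Pearson correlation| with the label; keep the top k."""
    scores = []
    for j in range(X.shape[1]):
        col = X[:, j]
        # Constant columns carry no information; score them zero.
        score = 0.0 if col.std() == 0 else abs(np.corrcoef(col, y)[0, 1])
        scores.append(score)
    return sorted(range(X.shape[1]), key=lambda j: scores[j], reverse=True)[:k]

y = np.array([0, 0, 1, 1], dtype=float)
X = np.array([[0.0, 5.0, 0.3],
              [0.1, 5.0, 0.9],
              [0.9, 5.0, 0.1],
              [1.0, 5.0, 0.6]])
print(filter_select(X, y, k=1))  # feature 0 tracks the label almost perfectly
```

A wrapper would instead retrain a classifier for every candidate subset, which is why filters are the cheaper option for high-dimensional data.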

Feature representation learning
Feature representation learning utilizes the current data sources from the target software project and feeds them to the trained deep networks to acquire representation groups. Representation learning, or feature learning, aims to find a suitable data representation to perform a target task. In a neural network, every hidden layer maps its input data to an internal representation, tending to capture a higher abstraction level. The learned features carry increasingly more task-relevant information through the layers. First, samples from one of the data sources are used to feed a network. The representation produced by this sub-graph network is a vector v_1 = [r_1, r_2, r_3]. Afterward, the same sample is fed to the other sub-graph network, trained on another data source, and the representation referred to as the vector v_2 = [r_4, r_5, r_6] is acquired. The combined feature vector is obtained by concatenating v_1 and v_2; in other words, v_concat = [r_1, r_2, r_3, r_4, r_5, r_6]. Finally, the remaining data from the target code project are used as the test set and fed to every trained sub-graph network. The acquired representations of the data sources are utilized as inputs to train a classifier to obtain the performance results.
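The concatenation step above can be sketched as follows, with toy values standing in for r_1 through r_6:

```python
import numpy as np

def combine_representations(v1, v2):
    """Concatenate the outputs of two trained sub-graph networks into one feature vector."""
    return np.concatenate([np.asarray(v1, dtype=float), np.asarray(v2, dtype=float)])

v1 = [0.1, 0.2, 0.3]   # representation from the first sub-graph network
v2 = [0.4, 0.5, 0.6]   # representation from the second sub-graph network
v_concat = combine_representations(v1, v2)
print(v_concat)        # [0.1 0.2 0.3 0.4 0.5 0.6]
```

The resulting six-dimensional vector is what the downstream classifier is trained on.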

Source code representation learning
Source code must be transformed into a proper form to employ data-driven approaches. This form should represent the source code's semantic and syntactic information and be suitable for analysis by deep learning algorithms or neural networks.
Feature extraction and selection are essential when exploring the connectivity patterns implicated in the vulnerability-related data source. Feature sets are extracted from various forms of program representation produced by static analysis, such as Control Flow Graphs (CFGs), Abstract Syntax Trees (ASTs), Program Dependency Graphs (PDGs), Data-Flow Graphs (DFGs), etc. Compared with software metrics and frequency-based code features, feature sets obtained from the results of code analysis tools and parsers provide more information about the code since every form of the program representation provides a perspective on the source code from various angles.
Tree-based approaches learn the representation from an abstract syntax tree (AST). An AST represents a syntactical structure of a source code (e.g., a function), which describes the correlations among the code's components in a hierarchical tree view and represents the function level control flow reliably. Furthermore, the AST is a tree to abstract a source code with an explicit syntactic structure. The AST serves as an intermediate representation between a source code and sequence to facilitate the representation using syntactic knowledge.
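For illustration only (the paper targets C functions, which require an external parser such as CodeSensor), Python's built-in `ast` module demonstrates the same idea of turning source code into a hierarchical syntax tree:

```python
import ast

source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)

def node_types(node):
    """Pre-order traversal yielding the AST node type names."""
    yield type(node).__name__
    for child in ast.iter_child_nodes(node):
        yield from node_types(child)

# The flat pre-order listing exposes the syntactic structure the tree encodes,
# e.g. Module -> FunctionDef -> ... -> Return -> BinOp.
print(list(node_types(tree)))
```

A sequence of node types like this is one simple way to serialize an AST for consumption by a sequence or graph model.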

Gated graph neural networks
Graphs represent a data structure modeling a set of objects (nodes) and their correlations (edges). To show how the node annotations of Gated Graph Neural Networks (GG-NNs) are utilized, consider an example task of training a graph neural network to predict whether node t can be reached from node s on a particular graph. The task has two special nodes related to the problem, s and t. An initial annotation is given to these nodes to mark them as special: the annotation x_s = [1, 0]^T is given to node s, while the annotation x_t = [0, 1]^T is given to node t. The initial annotation of every other node v is set to x_v = [0, 0]^T. Naturally, the first input argument is marked as s, while the second input argument is marked as t. Afterward, the node state vectors h_v^(1) are initialized from these label vectors by copying x_v into the first dimensions and padding with extra zeros, which allows hidden states larger than the annotation size. In the reachability case, the propagation model can quickly learn to propagate the node annotation of s to every node reachable from s. For instance, if the propagation matrix for forward edges has a 1 in the (0,0) position, the first dimension of the node representation is copied along forward edges. With this parameter setting, the propagation step sets the first bit of the node representation of every node reachable from s to 1. The output step classifier can then quickly tell whether node t is reachable from s by examining whether the first two dimensions of its representation vector contain nonzero entries. The data can be processed and fed to the trained networks with real-world vulnerability data sources, and the outputs are acquired from one of the hidden layers as the learned representations.
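The reachability intuition above can be mimicked with a tiny hand-written propagation loop (a didactic sketch, not a trained GG-NN): the first annotation bit of s is copied along forward edges, and t is judged reachable exactly when its state ends up with both bits set:

```python
def reachable(num_nodes, edges, s, t):
    """Propagate the annotation bit of s along forward edges for |V| rounds."""
    h = [[0.0, 0.0] for _ in range(num_nodes)]
    h[s][0] = 1.0          # x_s = [1, 0]^T marks the source
    h[t][1] = 1.0          # x_t = [0, 1]^T marks the target
    for _ in range(num_nodes):
        for u, v in edges:
            if h[u][0] == 1.0:
                h[v][0] = 1.0   # copy the first dimension along a forward edge
    # The "output classifier": t is reachable iff both of its first two bits are set.
    return h[t][0] == 1.0 and h[t][1] == 1.0

print(reachable(3, [(0, 1), (1, 2)], s=0, t=2))   # True
print(reachable(3, [(2, 1), (1, 0)], s=0, t=2))   # False
```

A trained GG-NN would learn this copy-along-forward-edges behavior from data rather than having it hard-coded.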
We establish a hypothesis that the transfer of the vulnerable patterns learned from the data sources can produce beneficial representations to better distinguish between vulnerable and non-vulnerable samples. It is possible to formulate the process mentioned above as follows. Equation 1 represents the initialization step, copying node annotations into the hidden state's first components and padding the remaining with zeros. Equation 2 represents the step passing information between different graph nodes through incoming and outgoing edges, with parameters dependent on the direction and edge type; the aggregated activation a_v^(t) in R^(2D) includes activations from edges in both directions. The remaining equations represent GRU-like updates that combine information from the other nodes and from the preceding time step to update the hidden state of every node. Here z and r indicate the update and reset gates, sigma(x) = 1/(1 + e^(-x)) refers to the logistic sigmoid function, and ⊙ denotes element-wise multiplication. Finally, after the convergence of the inference process, a final readout function is applied to acquire the output feature vector. Two networks, F_o^(k) and F_x^(k), are utilized: F_o^(k) predicts o^(k) from X^(k), and F_x^(k) predicts X^(k+1) from X^(k). It is possible to consider X^(k+1) as the states carried over from step k to step k+1. Both F_o^(k) and F_x^(k) include propagation and output models. In propagation models, the node vector matrix at propagation step t of output step k is denoted as H^(k,t) = [h_1^(k,t); ...; h_|V|^(k,t)]^T in R^(|V|×D). As previously mentioned, in step k, H^(k,1) is set by 0-extending X^(k) per node. Alternatively, F_o^(k) and F_x^(k) may share one propagation model with different output models. The training and evaluation of this simpler variant are quicker, and it can reach a level similar to that of the full model in many situations. However, when different propagation behaviors are targeted for F_o^(k) and F_x^(k), this variant may not perform well. A node annotation output model is introduced to predict X^(k+1) from H^(k,T). The prediction is made for every node independently, utilizing a neural network j(h_v^(k,T), x_v^(k)) that takes the concatenation of h_v^(k,T) and x_v^(k) as input and outputs a real-valued score vector. There are two settings for training GGS-NNs. The first one specifies all intermediate annotations X^(k), while the second one trains the full model end to end when only X^(1), graphs, and target sequences are provided. More generally, if there are no intermediate node annotations X^(k) in training, they are treated as hidden units in the network, and the entire model is trained jointly by backpropagation through the entire sequence. The symbol notations of the equations are shown in Table 1.
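The GRU-style update described above can be written out for a single scalar node state (toy scalar weights and parameter names of our own; a real GG-NN uses vectors and matrices):

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid, sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def gru_node_update(h_prev: float, a: float, p: dict) -> float:
    """One GRU-like node update; a is the aggregated message from neighboring nodes."""
    z = sigmoid(p["wz"] * a + p["uz"] * h_prev)                # update gate
    r = sigmoid(p["wr"] * a + p["ur"] * h_prev)                # reset gate
    h_tilde = math.tanh(p["wh"] * a + p["uh"] * (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                    # interpolated new state

# With all-zero weights, z = sigma(0) = 0.5 and the candidate is 0,
# so the update simply halves the previous state.
params = {k: 0.0 for k in ("wz", "uz", "wr", "ur", "wh", "uh")}
print(gru_node_update(1.0, 0.5, params))  # 0.5
```

The gates z and r play exactly the roles described in the text: z interpolates between the old state and the candidate, while r controls how much of the old state feeds the candidate.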

Problem formulation
The suggested method takes a function list from a program as input and outputs a function ranking list based on the probability that the input functions are vulnerable.
Let us consider that F = {f_1, f_2, ..., f_n} is the set of all C source code functions in the given software project. Vulnerability detection is essentially a binary classification problem. The aim here is to find a function-level vulnerability detector D: F → [0, 1], in which ''1'' means vulnerable and ''0'' means non-vulnerable. Thus, the probability that function f_i contains vulnerable code is measured by D(f_i). Usually, treating D(f_i) as a vulnerability score is sufficient for investigating a small number of top-risk functions.
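The detector's output can then be used as a ranking, as sketched below with made-up function names and scores standing in for D(f_i):

```python
def rank_functions(scores: dict, top_k: int) -> list:
    """Return the top_k function names ordered by descending vulnerability score D(f_i)."""
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical scores produced by a trained detector D.
scores = {"parse_header": 0.91, "init_logger": 0.05, "copy_buf": 0.77}
print(rank_functions(scores, top_k=2))  # ['parse_header', 'copy_buf']
```

Inspecting only the top of this list is what makes the score formulation practical: auditors review a handful of high-risk functions instead of the whole project.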
Our goal is to design a detector that can classify a program into the correct label based on the knowledge learned from labeled datasets. It can be described as follows: let us define the dataset as {(c_i, y_i) | i = 1, ..., n}, in which C represents the set of programs and Y = {0, 1}^n denotes the labels of the corresponding programs, with 0 for clean and 1 for vulnerable. Our method aims to learn the mapping f: R → Y to predict whether a program has vulnerabilities.

The proposed model
Gated Graph Sequence Neural Networks (GGS-NNs) represent deep learning models comprising several gated graph neural networks that operate in sequence to solve a target learning task. Feature learning on graphs has two settings: the first is learning a representation of the input graph, whereas the second is learning representations of the internal state in the course of generating an output sequence. The primary limitation of existing deep graph embedding architectures is their inability to capture high-order relationships between samples. The difficulty of feature learning on the graph lies in encoding the partial output sequence, part of which has already been generated and part of which still needs to be generated. The human brain's connectivity was our inspiration for proposing the functional connectivity of Gated Graph Sequence Neural Networks (FC-GGNNs). Studying functional connectivity-based graph neural networks ensures a deep understanding of the operation of sequence graph networks as highly complex interconnected systems. To this end, we try to learn latent vulnerable programming patterns by investigating a unified representation of the patterns of vulnerable source code using connectomes, given their capacity to model the one-to-one relationship between connectomes and to circumvent the curse of dimensionality in learning tasks. This part details the steps of the suggested FC-GGNN architecture starting from a source graph. The first component converts the program's source code into a program graph. Many previous studies have utilized ASTs extracted from the source code to represent the code syntax rather than utilizing the source code itself (Lin et al. 2017; Yamaguchi, Lottmann and Rieck 2012), because ASTs preserve more meaningful syntactic information at the function level than the remaining program representations, e.g., the control flow graph.
In contrast, the second component learns the distributed representation of the program graph and uses the learned representation for program classification. These representations, trained with historical software vulnerability data, become advanced features reflecting the intrinsic patterns indicative of a software vulnerability. The learned patterns or representations are obtained from the trained networks in the feature representation learning phase. This phase utilizes the existing data sources from the target software, which are fed to the trained graph neural networks to acquire representation groups. Afterward, the representations are concatenated to create an aggregated feature set. In the last phase, the rest of the data from the target code project are utilized as the test set and fed to the trained network. The acquired representations of the data sources are employed as inputs for training a random forest classifier, and the trained classifier is fed with the test-set representations to acquire the performance outcomes. Figures 1 and 2 show the architectures of our proposed model. The purpose of the suggested model is to learn how to project node features semantically while preserving structural relationships. According to the properties of the Functional Connectivity Gated Graph Sequence Neural Network (FC-GGNN) model, every node's feature representation is computed by learning its functional connectivity structure, in other words, its connectome nodes and the edges among them. Therefore, nodes with the same connectome structure must have an equal representation. The functional connectivity of the GGNN is crucial in understanding how different parts of the network interact with each other.
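This last phase can be sketched end to end with synthetic stand-ins for the learned representations (scikit-learn's RandomForestClassifier; all data below is fabricated for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 40
# Synthetic 3-dimensional representations from two trained sub-graph networks;
# the vulnerable half of the samples is shifted to make the classes separable.
v1 = rng.normal(0.0, 0.1, (n, 3)); v1[n // 2:] += 1.0
v2 = rng.normal(0.0, 0.1, (n, 3)); v2[n // 2:] += 1.0
X = np.concatenate([v1, v2], axis=1)              # aggregated feature set
y = np.array([0] * (n // 2) + [1] * (n // 2))     # 0 = clean, 1 = vulnerable

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(np.zeros((1, 6)))[0])  # clean-like sample
print(clf.predict(np.ones((1, 6)))[0])   # vulnerable-like sample
```

In the paper's pipeline, the only difference is that X comes from the trained FC-GGNN sub-networks rather than from a random generator.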
We implemented the functional connectivity strategy on the basis of the ''correlation of connectomes,'' constructed by the trained sub-GNNs that are computed with the objective of extracting higher-level interactions between the separated GG-NNs. FC represents a deep network architecture for jointly learning a feature extraction network capturing features from a learnable similarity network, which computes the semantic relation between the pairs. The network in question uses a neural network to learn FC between pairs of extracted features from two connectome regions. The network calculates their FC, which is associated with the similarity between the two regions (sub-GNNs). To train FC-GGNN, similar (i.e., functionally connected) and dissimilar (i.e., not functionally connected) regions having corresponding labels (one and zero, respectively) are required. Pairs are made for regions in the same cluster, and the label (i.e., functionally connected) is assigned to them. Regions not belonging to the same cluster are picked randomly for unconnected pairs (not functionally connected regions), and the pair is labeled with zero. Particularly, the correlation between one GG-NN and all other GG-NNs is considered a sequence, and the correlation between the said sequences is computed to create separate GG-NNs, allowing the detection of a typical connectivity alteration in order to capture their high-order relationships. A set of time-dependent sub-GGNN graphs is cascaded by our FC-GGNN architecture, where every sub-GGNN graph conveys its predicted graph connectomes at a specific time point for the purpose of training the learned representation of output mode in the cascade at the followup time point. The semantic network turns the connectome graph into an explicit knowledge base of potential vulnerabilities to learn context-dependence and high-level representation. 
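The ''correlation of connectomes'' and the pair-labeling scheme described above can be sketched as follows (toy activation sequences and a hypothetical threshold of our own):

```python
import numpy as np

def functional_connectivity(series_a, series_b) -> float:
    """FC as the Pearson correlation between two sub-GGNN activation sequences."""
    return float(np.corrcoef(series_a, series_b)[0, 1])

def label_pair(series_a, series_b, threshold: float = 0.5) -> int:
    """1 = functionally connected (high |correlation|), 0 = not connected."""
    return 1 if abs(functional_connectivity(series_a, series_b)) >= threshold else 0

a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]      # perfectly correlated with a -> connected
c = [1.0, -1.0, 1.0, -1.0]    # weakly correlated with a -> not connected
print(label_pair(a, b), label_pair(a, c))  # 1 0
```

These one/zero pair labels are exactly what the text describes as the supervision signal for training the FC network.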
Semantic networks form a unified feature space for the explicit facts contained in our software source code datasets. These facts are placed into a continuously growing semantic context for vulnerability discovery in an easily interpretable way.
The suggested approach has several limitations. Graph-based approaches relieve the long-term dependency problem and enhance detector performance, but they require intensive computation and therefore substantial hardware resources. Furthermore, most graph-based methods rely on the compiled intermediate representation, which limits their application scenarios. Graph-based approaches parse programs in source-code form into graph representations carrying structural knowledge; tree-based methods, however, considerably increase the complexity of code fragments. Another limitation is the learning capability of a single GNN layer, which inherently propagates information only from immediately neighboring nodes. Multiple graph neural network layers must be stacked to propagate longer-range information.
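The last point, that one layer only reaches immediate neighbors so L layers are needed to reach nodes L hops away, can be seen in a minimal message-passing sketch. The averaging update rule, the toy path graph, and the starting features are assumptions chosen purely for illustration.

```python
# Minimal sketch: a single propagation step mixes a node only with its
# immediate neighbors, so L stacked steps are needed to reach L-hop nodes.

def propagate(adj, feats, layers):
    """Average-with-neighbors update, repeated `layers` times."""
    for _ in range(layers):
        feats = [
            (feats[i] + sum(feats[j] for j in adj[i])) / (1 + len(adj[i]))
            for i in range(len(feats))
        ]
    return feats

# Path graph 0 - 1 - 2: a signal at node 0 reaches node 2 only after
# two propagation steps.
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = [1.0, 0.0, 0.0]
one_hop = propagate(adj, feats, layers=1)  # node 2 still sees nothing
two_hop = propagate(adj, feats, layers=2)  # node 2 now receives the signal
```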

Experimental results and performance analysis
In this section, experiments are carried out to assess the proposed methods and compare them with other standard methods. The pre-training phase trains the FC-GGNN using historical vulnerability data from several different software projects. AST-based features from vulnerable and non-vulnerable functions constitute the training inputs, which ensures that the hidden nodes capture the sequential interactions discriminative of vulnerable programs. The input data are separated into training and validation sets to build and assess the model and to guide the model-tuning process toward maximal performance. Once the model has been trained and satisfactory performance is achieved, the trained networks are fed the processed AST-based features of a target project with limited labels, and the learned representations are taken from the networks' third layer. Before features can be extracted from ASTs, the ASTs must be obtained from the source code. In general, a compiler produces ASTs at the code-parsing stage; however, it is non-trivial to acquire ASTs from C/C++ code without a working build environment. Using ''CodeSensor'', a robust parser implemented based on the concept of island grammars, ASTs can be extracted from individual source files or even from function code fragments in the absence of dependent libraries. Feeding CodeSensor with the source code files generates the parsed ASTs in a serialized format, ready for subsequent processing. The network takes a sequence of ''tokens'' as input. The first layer is a Word2vec embedding layer that maps every sequence element to a vector in a semantic space, in which similar elements lie close to each other.
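The role of the embedding layer can be illustrated with a toy lookup table. The vocabulary, vector values, and the `<unk>` fallback below are all invented for illustration; the paper's actual layer is a trained Word2vec embedding over serialized AST tokens, not a hand-written table.

```python
# Toy stand-in for the first (embedding) layer: each token of the serialized
# AST sequence is mapped to a vector. In the real Word2vec space, similar
# tokens lie close together; these vectors are illustrative only.

EMBED = {
    "func":  [0.9, 0.1],
    "call":  [0.8, 0.2],   # deliberately close to "func" in this toy space
    "int":   [0.1, 0.9],
    "<unk>": [0.0, 0.0],   # out-of-vocabulary fallback (an assumption)
}

def embed(tokens):
    """Map a token sequence to a sequence of vectors."""
    return [EMBED.get(t, EMBED["<unk>"]) for t in tokens]

seq = embed(["func", "int", "mystery_token"])  # unknown token -> <unk> vector
```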
The learned representations are high-level abstract features that can train a connectivity model. The number of steps is a hyper-parameter, set to two based on our experiments. The graph neural network lets nodes create new embeddings from the embeddings of other connectomes in the graph, weighted by the edge weights between them (Fig. 2 shows the general framework of the proposed model). To obtain a single feature vector for the entire graph, we concatenate the outputs of the different top-k connectome sub-groups. Each connectome sub-group yields one feature vector; we then concatenate the k sub-group vectors to obtain one final embedding for the entire graph, representing a subject. We report experimental results for different top-k connectome sub-groups.
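The readout just described can be sketched as follows. Pooling each sub-group by the mean of its node embeddings is an assumption made for illustration; the paper states only that each sub-group yields one feature vector and that the k vectors are concatenated.

```python
# Hedged sketch of the graph readout: each top-k connectome sub-group is
# pooled to one feature vector (mean pooling assumed here), and the k
# vectors are concatenated into the final graph embedding.

def readout(subgroups):
    final = []
    for nodes in subgroups:            # one sub-group of node embeddings
        dim = len(nodes[0])
        pooled = [sum(v[d] for v in nodes) / len(nodes) for d in range(dim)]
        final.extend(pooled)           # concatenate the sub-group vectors
    return final

# Two toy sub-groups of 2-dimensional node embeddings.
subgroups = [
    [[1.0, 3.0], [3.0, 5.0]],   # sub-group 1 -> pooled [2.0, 4.0]
    [[0.0, 2.0]],               # sub-group 2 -> pooled [0.0, 2.0]
]
embedding = readout(subgroups)  # one embedding for the whole graph
```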
Sensitivity analysis can be applied to a wide class of architectures consisting of the produced subgraphs of hidden layers, generalized functional connections, and gated graph sequence neural network computation modules. We demonstrate that sensitivity predicts very well how the capability of different architectures to solve modeling tasks changes as network depth increases.
The suggested functional connectivity strategy presents a framework that jointly learns a feature extraction network capturing features from a learnable similarity network, and we carry out extensive ablations of our method. Empirically, the FC mechanism robustly enhances the connectome regions of the standard GGNN model. For the ablation study, we test dynamically produced graphs using FC features and build subgraph representation groups over N time periods for every stored graph connectome, using the Euclidean distance to ensure that the adjacency matrices are learned. As to implementation details, the network takes a mini-batch of vectors (sequences) as input and outputs the probability of the corresponding input being vulnerable or not. To ensure better generalization, we use a comparatively small batch size of 32 for training, with a learning rate of 0.004. The first layer is an input layer of shape (sample number, 1000), where ''sample number'' denotes the batch size and 1000 is the padding length, i.e., a sample's dimension, also known as the time-series length in an RNN.
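The input preparation implied by these settings can be sketched directly from the stated hyper-parameters (padding length 1000, batch size 32). Zero-padding and truncation to the fixed length are assumptions; the paper does not specify the padding value.

```python
# Sketch of input preparation from the stated hyper-parameters: sequences
# are padded (or truncated) to a fixed length of 1000 and grouped into
# mini-batches of 32. Zero-padding is an assumption.

PAD_LEN, BATCH = 1000, 32

def pad(seq, length=PAD_LEN, pad_value=0):
    """Right-pad with pad_value, then truncate to the fixed length."""
    return (seq + [pad_value] * length)[:length]

def batches(samples, size=BATCH):
    """Split samples into consecutive mini-batches of at most `size`."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]

data = [pad([1, 2, 3]) for _ in range(70)]  # 70 toy token sequences
mini = batches(data)                        # 32 + 32 + 6 samples
```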
N_v denotes the total number of nodes and N_e the total number of edges; L is the number of layers. For simplicity, the dimension of the node hidden features is kept constant and denoted d. The resulting time complexity is O(L N_e d + L N_v d^2). We predict at differing time intervals, from 6 h to 22 h.
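The two terms of this cost bound have a direct reading: message passing touches every edge (the N_e d term), and the gated update multiplies every node state by a d-by-d matrix (the N_v d^2 term), repeated over L layers. A back-of-envelope estimate with illustrative sizes:

```python
# Back-of-envelope evaluation of O(L * N_e * d + L * N_v * d^2): the edge
# term covers message passing, the node term covers the d x d gated update.
# The graph sizes below are illustrative, not from the paper.

def ggnn_cost(L, n_v, n_e, d):
    """Operation count implied by the complexity bound (constants dropped)."""
    return L * n_e * d + L * n_v * d * d

cost = ggnn_cost(L=2, n_v=100, n_e=300, d=16)
```

With these numbers the node term (2 * 100 * 256) dominates the edge term (2 * 300 * 16), which is why keeping d small matters more than sparsifying edges for this architecture.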

Experiment datasets
The vulnerability data were acquired from the National Vulnerability Database (NVD) and Common Vulnerabilities and Exposures (CVE). The NVD and CVE repositories use CVE IDs as unique identifiers for vulnerabilities. Vulnerabilities are assigned CVE IDs, which gives security professionals prompt access to the technical information of known vulnerabilities across multiple CVE-compatible sources. The NVD is a favorable source for searching for the known vulnerabilities of a software project. Using the NVD description, the project's corresponding version can be downloaded, and every vulnerable function can be located in the software project's source code and labeled accordingly. We conduct experiments on benchmark datasets from six open-source projects: LibPNG, LibTIFF, Pidgin, FFmpeg, VLC Media Player, and Asterisk. Our suggested approach uses source code functions as inputs, making it more straightforward to filter possibly vulnerable functions during development.

Evaluation measures
Classification accuracy indicates the ratio of correctly classified samples (true positives and true negatives) to all samples and is computed by Eq. (9), where TN refers to the number of true negatives, TP to true positives, FP to false positives, and FN to false negatives. Precision, the ratio of true positives to the sum of true positives and false positives, is presented in Eq. (10). Recall, the ratio of true positives to the sum of true positives and false negatives, is shown in Eq. (11). The F-measure, whose value varies in the range of 0 to 1, is the harmonic mean of precision and recall and is given in Eq. (12).
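For reference, the four measures of Eqs. (9)-(12) written out from the confusion-matrix counts (the count values in the example are illustrative):

```python
# Standard classification measures from confusion-matrix counts,
# corresponding to Eqs. (9)-(12).

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (9)
    precision = tp / (tp + fp)                          # Eq. (10)
    recall = tp / (tp + fn)                             # Eq. (11)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (12)
    return accuracy, precision, recall, f_measure

# Illustrative counts only.
acc, prec, rec, f1 = metrics(tp=80, tn=90, fp=20, fn=10)
```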

Experimental results
This section presents and summarizes the experimental findings to evaluate the effectiveness of our FC-GGNN model. In all tables, the best accuracy results for the examined methods are bolded, and the comparison focuses on the statistical performance of the algorithms for each dataset. Tables 3-9 report the statistical performance of the proposed FC-GGNN model. In Table 3, we compare the performance of the suggested FC-GGNN model in detecting vulnerabilities based on the selection of the top 25 vulnerable functions. The results demonstrate that the Asterisk dataset performed best, with an accuracy of 0.723, compared to the other vulnerability datasets, whereas the lowest classification accuracy, 0.523, was found on the Pidgin dataset. For the remaining datasets, the proposed algorithm achieved its next-highest performance results on the LibPNG, VLC Media Player, LibTIFF, and FFmpeg datasets, in that order.
Table 4 reports the statistical performance of the proposed FC-GGNN model for the selection of the top 50 vulnerable functions. According to these results, the Asterisk and VLC Media Player datasets outperformed the other datasets (FFmpeg, LibTIFF, LibPNG, and Pidgin), achieving classification accuracies of 0.724 and 0.686 with F-measure values of 0.702 and 0.63, respectively. The proposed FC-GGNN model achieved its worst comparative results on the Pidgin dataset; among the remaining datasets, the weakest statistical performances were obtained on FFmpeg, LibTIFF, and LibPNG, respectively.
Tables 5 and 6 compare the performance of the proposed FC-GGNN model based on the selection of the top 75 and top 100 vulnerable functions, respectively. The results in Table 6 indicate that the proposed FC-GGNN model achieved its highest performance on the VLC Media Player dataset with an accuracy of 0.766; according to Table 5, the VLC Media Player dataset again yields the best classification accuracy, also 0.766. While the Asterisk dataset gives generally close results in both tables except for the Precision and AUC metrics, no dramatic change is observed for the VLC Media Player dataset with respect to the number of vulnerable functions.
The results for the selection of the top 150 and top 200 vulnerable functions are shown in Tables 7 and 8. In Table 7, the best accuracies are 0.878 and 0.876 for the VLC Media Player and Asterisk datasets, respectively. The results in Table 8 demonstrate that the best classification accuracy increased by approximately 0.098 for the VLC Media Player dataset and 0.107 for the Asterisk dataset. In line with the findings in the tables, the comparison of classification accuracies of the proposed FC-GGNN model shows that, generally, the Asterisk dataset can significantly improve classifier performance compared with the other real-world datasets. Tables 3, 4, 5, 6, 7, 8 and 9 show that the best F-measure value achieved by the proposed FC-GGNN model, 0.997, was obtained on the VLC Media Player dataset with the top 250 vulnerable functions; the VLC Media Player dataset also exhibited the highest precision performance.
The recall values of the proposed FC-GGNN model on the VLC Media Player dataset were 0.999 and 0.995, respectively, while the FC-GGNN algorithm reached an accuracy of 93.77% and a true positive rate of 0.938. As shown in Fig. 3, the selection of the top 250 vulnerable functions exhibited the highest statistical performance among the selection options. We can therefore conclude that the proposed FC-GGNN model is more effective at detecting security vulnerabilities when the selected number of vulnerable functions and hidden layers is higher, owing to the improved statistical performance. Comparing performances across the selected numbers of vulnerable functions demonstrates that the proposed FC-GGNN model can significantly improve learned feature representations.
The results in Fig. 3 suggest that the built-in representation-derived patterns are inflexible and fail on the Pidgin and FFmpeg datasets, respectively. The results also show that selecting the top 250 vulnerable functions is the best choice for long-sequence pattern learning. Table 10 summarizes recent research on vulnerability detection employing deep learning and graph neural networks. We emphasize that the proposed FC-GGNN model can identify more vulnerabilities than the other methodologies in Table 10, exhibiting better performance than the models in six recent studies on vulnerability detection (Li et al. 2016; Lin et al. 2017; Guo et al. 2022; Cao et al. 2021; Hin et al. 2022). Thus, the suggested method is promising given the results obtained in comparison with other similar published methods.

Implementation details
Our model was implemented in a Python environment. The tests were performed on a computer with an NVIDIA GeForce RTX 2060 Super GPU (Turing architecture).

Conclusion
Vulnerability detection is a crucial phase in ensuring software security and quality. Nevertheless, current vulnerability detection methods have many disadvantages, including out-of-vocabulary issues, long-term dependency, coarse detection granularity, and bias toward global or local features. Deep neural networks capable of automated feature learning can significantly reduce the effort of feature engineering. By utilizing neural memory networks, code semantics can be learned and processed directly to identify potentially vulnerable code patterns. The present study suggests a novel framework for vulnerability detection in source code based on the functional connectivity of graph neural networks. The representation learning capability of gated graph sequence neural networks based on functional network connectivity, together with their customizable structure, has great potential for the automated learning of complex vulnerable code patterns. The empirical studies showed that our method achieves better precision than traditional code metrics in cross-project scenarios. The method suggested in the current work can still be enhanced: at the moment, it detects vulnerabilities in source code but cannot locate their particular position. We plan to address these problems in future work.
Funding The authors have not disclosed any funding.
Data availability Enquiries about data availability should be directed to the authors.

Declarations
Conflict of interest There is no conflict of interest for authorship.
Ethical approval This manuscript does not contain any studies with human participants carried out by any of the authors.