ActGraph: prioritization of test cases based on deep neural network activation graph

Widespread applications of deep neural networks (DNNs) benefit from DNN testing to guarantee their quality. In DNN testing, numerous test cases are fed into the model to explore potential vulnerabilities, but checking their labels requires expensive manual effort. Test case prioritization has therefore been proposed to reduce labeling cost, e.g., surprise adequacy-based, uncertainty quantifier-based and mutation-based prioritization methods. However, most of them suffer from limited application scenarios (i.e., high-confidence adversarial or false positive cases) and high time complexity. To address these challenges, we propose the concept of the activation graph from the perspective of the spatial relationship of neurons. We observe that the activation graph of cases that trigger the model's misbehavior differs significantly from that of normal cases. Motivated by this, we design ActGraph, a test case prioritization method that extracts high-order node features of the activation graph for prioritization. ActGraph explains the differences between test cases, solving the problem of scenario limitation. Without mutation operations, ActGraph is easy to implement and has lower time complexity. Extensive experiments on three datasets and four models demonstrate that ActGraph has the following key characteristics. (i) Effectiveness and generalizability: ActGraph shows competitive performance in all of the natural, adversarial and mixed scenarios, especially in RAUC-100 improvement (∼×1.40).
(ii) Efficiency: ActGraph runs at a lower time cost (∼×1/50) than the state-of-the-art method. The code of ActGraph is open-sourced at https://github.com/Embed-Debuger/ActGraph.

For example, a car in autopilot mode recognized a white truck as a cloud in the sky and caused a serious traffic accident [5]. Therefore, it is crucial to test DNNs to find vulnerabilities before deployment.
In the testing phase, numerous diverse test cases are fed into the DNN to evaluate the reliability of the model. These test cases require expensive manual labeling and verification. To reduce labeling cost, a feasible solution is to prioritize test cases: the cases most likely to expose DNN vulnerabilities are labeled first, improving the efficiency of DNN testing. Several test case prioritization methods have been proposed to address the labeling cost problem. They can be divided into three categories: Neuron Coverage (NC)-based [6,7,8,9,10], model activation-based [11,12,13,14] and mutation-based [15] methods. NC-based methods draw on the concept of test coverage [16,17,18] from traditional software testing to measure the adequacy of test cases. However, several studies [19,20,21] have pointed out that NC is not strongly correlated with misclassified inputs, so NC cannot be effectively applied to prioritization.
Model activation-based prioritization methods can be further divided into confidence-based and embedding-based methods. Confidence-based methods [11,12] extract the probability distribution of test cases at the DNN's confidence layer. They assume that correctly classified cases output one high probability, while misclassified cases output multiple similar probabilities. This assumption limits their application scenarios. For example, Carlini-Wagner (C&W) [22] adversarial cases can cause the DNN to output high confidence in the wrong class, so the effectiveness of confidence-based methods is significantly reduced. On the other hand, embedding-based prioritization methods [13,14] generally extract the hidden-layer outputs of test cases. They prioritize inconsistent test cases based on the differences between adversarial and normal cases. However, they cannot effectively prioritize false positive (FP) cases, because the confusion of hidden-layer features is exactly what leads to the model's misclassification.
Mutation-based test case prioritization, i.e., PRioritizing test inputs via Intelligent Mutation Analysis (PRIMA), is the state-of-the-art (SOTA) method. It uses mutation operations to make test cases as different as possible so as to produce different prediction results, arguing that such cases more easily reveal DNN vulnerabilities. The time complexity of PRIMA is O(nm1 + nm2Nθ), where n is the total number of test cases, m1 is the number of sample mutations, m2 is the number of model mutations and Nθ is the number of parameters in the model mutation. Since m1, m2 and Nθ are usually very large, PRIMA has a much higher time complexity than activation-based methods (O(n log n)).
Based on the above analysis, the existing test case prioritization methods show limitations in two aspects. (i) Limited scenarios. Model activation-based prioritization methods have limited application scenarios. Specifically, confidence-based methods assume that misclassified cases have multiple similar probabilities, so they are less effective for adversarial cases with high confidence. On the other hand, embedding-based methods are less effective for FP cases, because the embedding features of FP cases are confused with the wrong classes. (ii) High complexity. Mutation-based prioritization has a high time complexity since it requires numerous mutation, memory read/write and mutation query operations. Therefore, its time complexity is higher than that of activation-based methods.
To address these prioritization challenges, we explore the in-depth relationship between test cases and the model's dynamic features. Although the existing model activation-based prioritization methods extract features at the hidden layer or the confidence layer, they overlook finer-grained features, such as neuron activations. We therefore propose graph-level neuron activation features for test cases, extracting the activation graph between DNN layers. The activation graph is defined as the connection relationship of neurons. We study the differences in the activation graphs of test cases; an illustration is shown in Fig. 1, which shows the activation graphs of normal, FP and C&W adversarial test cases on a LeNet-5 [23] trained on MNIST [23]. Fig. 1 (a) and (b) are normal test cases with class labels 9 and 7, respectively. Fig. 1 (c) and (d) are misclassified as 9; they are FP and adversarial cases, respectively.
To show the activation graph clearly, we only plot weighted edges with values greater than 0.4; the size of a node is determined by its node feature value. We observe clear distribution differences among the activation graphs of the test cases. Specifically, in the last layer of the activation graph, the edges of normal cases connect only to the correct node, as shown in Fig. 1 (a) and (b). On the contrary, for misclassified cases the last layer connects not only the correct node but also the wrong node, as shown in Fig. 1 (c) and (d). Comparing the activation graphs of the normal case (Fig. 1 (b)) and the adversarial case (Fig. 1 (d)), the edge distributions are similar in the shallow layers (L1, L2), but the differences gradually increase in the deeper layers (L3, L4 and L5). Therefore, we propose a model activation graph-based test case prioritization method, namely ActGraph.
ActGraph regards the neurons of the DNN as nodes and the adjacency matrix as the connection relationship between them. The node features and the adjacency matrix are aggregated by message passing [24]; the aggregated node features contain the features of neighboring nodes and the structural information between nodes, which can be effectively used for test case prioritization. ActGraph has the following two key characteristics. (ii) Efficiency. ActGraph uses the graph to extract high-level node features instead of mutation operations, which is much more efficient than mutation-based approaches.
In addition, ActGraph builds a ranking model using Learning-to-Rank (L2R) [25], which effectively prioritizes test cases by learning their node features. According to ActGraph's priority results, the test cases that trigger vulnerabilities can be labeled earlier, greatly improving the efficiency of DNN testing and saving development time.
The main contributions are summarized as follows.
• By identifying the limitations of existing prioritization methods, we propose the activation graph, and find that cases that trigger model vulnerabilities differ significantly from normal cases in the activation graph.
• Motivated by this observation, we propose a novel test case prioritization method, namely ActGraph. It extracts the spatial relationship between neuron nodes, calculates the center node feature in the activation graph, and adopts an L2R model to intelligently combine center node features for efficient test case prioritization.
• Comprehensive experiments have been conducted on three datasets and four models to verify the effectiveness and efficiency of ActGraph. It outperforms the SOTA baselines in both natural and adversarial scenarios, especially in RAUC-100 (∼×1.40).
The remainder of this paper is organized as follows. Related work is described in Section 2. Section 3 describes the designed ActGraph method in detail. Experimental results are provided in Section 4. Threats to validity are discussed in Section 5, and conclusions are drawn in Section 6.

Overview
Existing DAG studies have converted DNNs into graphs to discuss the dynamic properties of DNN models [32,33,34,35]. They use an undirected weighted graph, treating neurons as nodes and the model weights as weighted edges between nodes, but they do not express the activation information of a test case on the graph. Different from them, we use a directed weighted graph, take the model weights as the skeleton of the graph, and map the activation values onto the graph, so that the features of a test case are expressed on the graph. For convenience, the definitions of symbols used in this paper are listed in Table 1.
Table 1: The definitions of symbols.
x: a test case
ϕ: the averaged and normalized value of F_i^l
θ: the model weights of the trained DNN
W: the averaged and normalized weights of θ
D = (V, E): the directed weighted graph with node set V and edge set E
A: the adjacency matrix of D
nf: the node feature
cnf: the center node feature
AGG(·): the aggregation function of the message passing
y: the flag of whether x triggers a model vulnerability (0 or 1)
Ω(·): the regularization function
T: the number of trees of XGBoost

Test Case Activation
ActGraph is a model activation-based test case prioritization method that runs during the test phase. Test cases are input to the DNN, and each layer outputs activation values. To facilitate the construction of the graph, the weights and activations of each neuron are averaged, and the weights and activations of each layer are normalized.
Consider a trained DNN with L layers and a test case x, where n_i^l is the i-th neuron of the l-th layer. The case x is input to the DNN, and the output of each layer of neurons is obtained. The activation value ϕ_i^l of n_i^l is calculated as:

ϕ_i^l(x) = mean(F_i^l(x)),  (1)

where F_i^l(x) ∈ R^{Height_l × Width_l × Channel_l} is the feature map output by n_i^l when x is input. For a convolutional layer the output dimension is Height_l × Width_l, and for a fully connected layer it is 1 × 1. The neuron activation values ϕ^l(x) of each layer are max-min normalized to the [0, 1] range:

ϕ̂^l(x) = (ϕ^l(x) − min ϕ^l(x)) / (max ϕ^l(x) − min ϕ^l(x)).  (2)

For the neuron n_i^l, its neuron weight is calculated as:

w_{j,i}^{l−1,l} = mean(θ_{j,i}^{l−1,l}),  (3)

where θ_{j,i}^{l−1,l} ∈ R^{Height_{θl} × Width_{θl}} represents the weight parameter between the neuron n_j^{l−1} and the neuron n_i^l. The dimension of a convolutional layer's neuron weight is Height_{θl} × Width_{θl}, and the dimension of a fully connected layer's weight is 1 × 1.
The neuron weights w^{l−1,l} of each layer are likewise normalized to the range [0, 1]:

ŵ^{l−1,l} = (w^{l−1,l} − min w^{l−1,l}) / (max w^{l−1,l} − min w^{l−1,l}).  (4)

To reduce the computational cost, ActGraph only obtains the neuron activations and weights of the last K layers of the DNN.
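The averaging and min-max normalization steps of Eqs. (1)-(4) can be sketched in a few lines of numpy. This is a minimal illustration, not the ActGraph implementation; the function names and the per-layer array layout are assumptions made for the example.

```python
import numpy as np

def layer_activations(feature_maps):
    """Average each neuron's feature map to a scalar activation (Eq. 1),
    then min-max normalize the layer to [0, 1] (Eq. 2)."""
    # feature_maps: list of (H, W) arrays, one per neuron/filter in the layer
    phi = np.array([fm.mean() for fm in feature_maps])
    lo, hi = phi.min(), phi.max()
    return (phi - lo) / (hi - lo) if hi > lo else np.zeros_like(phi)

def neuron_weights(theta):
    """Average each kernel to a scalar edge weight (Eq. 3),
    then min-max normalize the layer's weights to [0, 1] (Eq. 4)."""
    # theta: (prev_neurons, cur_neurons, kH, kW) convolution kernels
    w = theta.mean(axis=(2, 3))
    lo, hi = w.min(), w.max()
    return (w - lo) / (hi - lo) if hi > lo else np.zeros_like(w)
```

For a fully connected layer, the "feature map" and kernel degenerate to 1 × 1 values, so the same two functions apply with scalar entries.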

Feature Extraction
ActGraph adopts the L2R framework and builds a ranking model for each DNN model. Since L2R requires a set of features, like other supervised machine learning methods [25], ActGraph extracts a set of features from the activation values of the test cases and the structural information of the model. In this section, we present the steps of feature extraction in ActGraph.
As shown in Fig. 1, the weighted edges can clearly represent the distribution differences between test cases, but they cannot clearly express the characteristics of individual neurons. Therefore, we extract more effective node features from the activation graph for prioritization. The message passing of a Graph Neural Network (GNN) aggregates the features of the current node and its neighbors, which resembles the data flow of a DNN, where the activation values of one layer are passed to the next.
Specifically, we use a directed activation graph, and extract the weighted in-degrees of nodes as node features.
The weighted in-degrees are low-order node features that represent the importance of the nodes. Further, we aggregate node features and the adjacency matrix by message passing to obtain a higher-order node feature, namely the center node feature. We explain its effectiveness in detail in Section 4.4.
Because the activation values of a DNN flow in one direction, we construct the DNN as a directed weighted graph. Let D = (V, E) be a directed weighted graph with node set V and edge set E, where V is the neuron set of the DNN and E is the set of directed weighted edges:

E = {(v_j, v_i) | w_{j,i} ≠ 0},  A_{j,i} = ϕ_j · w_{j,i},  (5)

where A is the adjacency matrix of D, v_i is the i-th node of D, and w_{j,i} is the weight between v_j and v_i. We use the weighted in-degree as the node feature (nf). Degree is the simplest and most effective feature for nodes, capturing their connectivity. The node feature of v_i is its activation value multiplied by the sum of the weights of its incoming edges:

nf_i = ϕ_i · Σ_j w_{j,i},  (6)

where nf_i is the node feature value of v_i. The weighted in-degree is a low-order node feature. Therefore, we use the message passing of GNNs to aggregate the adjacency matrix and node features of the activation graph, obtaining the center node feature (cnf):

cnf_i = AGG_j(A_{j,i} · nf_j),  (7)

where the aggregation function AGG() can be Sum(), Max() or Average(). We use the Sum() function.
After calculating the cnf of all nodes, ActGraph only takes the cnf of the last two layers, because the deeper activations of the model more fully express the high-dimensional characteristics of test cases. Computing two layers of cnf needs at least four layers of weights and activations, so we set K = 4. The reason is described in Section 3.5.
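The feature extraction above can be sketched as follows. The sketch treats the selected layers as a chain of fully averaged neurons (one activation vector per layer, one weight matrix between successive layers), which is an assumption made for illustration; the function name is likewise illustrative.

```python
import numpy as np

def activation_graph_features(phi_layers, w_layers):
    """Compute node features nf (Eq. 6) and center node features cnf
    (Eq. 7, Sum aggregation) for a chain of DNN layers.

    phi_layers: list of 1-D normalized activation arrays, one per layer
    w_layers:   list of (prev, cur) normalized weight matrices
    """
    nf = [np.zeros_like(phi_layers[0])]      # input layer has no in-edges
    for l, W in enumerate(w_layers, start=1):
        # nf_i = phi_i * sum_j w_{j,i}: weighted in-degree times activation
        nf.append(phi_layers[l] * W.sum(axis=0))
    cnf = [np.zeros_like(phi_layers[0])]
    for l, W in enumerate(w_layers, start=1):
        # directed edge weight A_{j,i} = phi_j * w_{j,i}; Sum() aggregation:
        # cnf_i = sum_j phi_j * w_{j,i} * nf_j
        A = phi_layers[l - 1][:, None] * W
        cnf.append(A.T @ nf[l - 1])
    return nf, cnf
```

Note that the cnf of a layer aggregates the nf of the previous layer, and nf of the input layer is zero, which is exactly why the last-layer cnf only becomes valid once at least three layers are involved (Section 3.5).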

Ranking Model Building
ActGraph adopts the XGBoost algorithm [36], an optimized distributed gradient boosting algorithm, to build an L2R-based ranking model.
The cnf of the validation set, obtained by Eq. (7), is used as the training set of the ranking model; according to the DNN's prediction of each sample, it is labeled 0 (prediction correct) or 1 (prediction wrong). The loss function for training the ranking model is:

L = Σ_i l(ŷ_i, y_i) + Σ_t Ω(f_t),  ŷ_i = Σ_{t=1}^{T} f_t(x_i),

where ŷ_i is the predicted value accumulated over the trees, f_t is the t-th tree, y is 0 or 1, T is the number of trees, and Ω is the regularization term. In summary, the process of training the ranking model is shown in Algorithm 1.
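The ranking stage can be sketched as follows. The paper trains an XGBoost-based L2R model; as a self-contained stand-in, this sketch fits a plain logistic scorer with the same 0/1 labeling and then ranks test cases by their predicted probability of being misclassified. All names here are illustrative, not taken from the ActGraph code.

```python
import numpy as np

def _sigmoid(z):
    # clip to keep np.exp well-behaved for strongly separated scores
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def train_ranker(cnf_train, y_train, epochs=500, lr=0.5):
    """Fit a pointwise scorer on validation-set cnf features.
    Labels follow the paper: 0 = DNN prediction correct, 1 = wrong."""
    X = np.hstack([cnf_train, np.ones((len(cnf_train), 1))])  # bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (_sigmoid(X @ w) - y_train) / len(y_train)
        w -= lr * grad                     # gradient descent on logistic loss
    return w

def prioritize(cnf_test, w):
    """Rank test cases by predicted probability of triggering misbehavior."""
    X = np.hstack([cnf_test, np.ones((len(cnf_test), 1))])
    return np.argsort(-_sigmoid(X @ w))    # highest-risk cases first
```

Swapping the logistic scorer for XGBoost's tree ensemble changes the model class but not the workflow: label validation cases 0/1, fit on their cnf features, score and sort the test set.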

Utility Analysis of Center Node Feature
In this section, we analyze the utility of cnf and explain how the K value of ActGraph is determined. Consider a DNN with N neurons, where each layer's output is activated by a case x. The activation values ϕ and the weights W are layer-normalized by Eq. (2) and Eq. (4), and the adjacency matrix A is then calculated by Eq. (5). The node feature nf, the weighted in-degree of the activation graph, is calculated by Eq. (6): a node feature nf_i = ϕ_i Σ_j w_{j,i}, which indicates that nf_i aggregates the activation value and the input edges of v_i. Finally, the center node feature cnf is calculated from A and nf by Eq. (7), where cnf_i = Σ_z ϕ_z w_{z,i} nf_z. Intuitively, cnf_i aggregates the activation values and nf values of the neurons in the layer above v_i.

Algorithm 1: ActGraph
Input: A DNN f1 to be tested; the last K layers selected; validation dataset x_c ∈ X = {x1, x2, ...}
Output: A ranking model f2
1: for each case x_c in X do
2:   for l in the last K layers do
3:     Obtain the neuron activations ϕ^l(x_c) by Eq. (1) and normalize them by Eq. (2)
4:   Compute the neuron weights by Eqs. (3)-(4) and the adjacency matrix A by Eq. (5)
5:   Compute nf by Eq. (6) and cnf by Eq. (7); keep the cnf of the last two layers
6: Label each x_c as 0 or 1 according to f1's prediction
7: Train the ranking model f2 on the cnf features and labels (Section 3.4)
8: return f2
Therefore, ActGraph requires at least three layers of the network to aggregate so that the cnf of the last layer is valid (non-zero). In the experiments, we use the last four layers of the DNN, i.e., K = 4, to obtain valid cnf for the last two layers.
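The three-layer requirement can be checked with a degenerate example: a chain with one neuron per layer and all activations and weights fixed to 1, so that Eqs. (6)-(7) reduce to scalars. The helper below is purely illustrative.

```python
def last_layer_cnf(num_layers):
    """cnf of the final neuron in a 1-neuron-per-layer chain with all
    activations and weights equal to 1 (Eqs. 6-7 reduce to scalars)."""
    # nf_0 = 0 because the input layer has no incoming edges;
    # nf_l = phi_l * w = 1 for every later layer
    nf = [0.0] + [1.0] * (num_layers - 1)
    # cnf_l = phi_{l-1} * w * nf_{l-1} = nf_{l-1} in this degenerate chain
    return nf[num_layers - 2]
```

With two layers the output neuron's cnf is still zero (it aggregates the input layer's nf, which is zero); from three layers onward it becomes non-zero, matching the claim that K = 4 yields two valid layers of cnf.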

EXPERIMENT
We evaluate ActGraph through answering the following five research questions (RQs).
• RQ1: Does ActGraph show the SOTA prioritization performance in both natural and adversarial scenarios?
• RQ2: Does ActGraph show the competitive generalizability in mixed scenarios?
• RQ3: Why can ActGraph be used effectively for prioritization?
• RQ4: How stable is ActGraph under different hyperparameters (i.e., trainset size and training parameters of ActGraph)?
• RQ5: How does the time cost of ActGraph compare with the baselines?
Datasets. CIFAR-100 includes 60,000 32×32 three-channel RGB color images, divided equally into one hundred classes. For each dataset, 40,000 images are used for training, 10,000 for validation and 10,000 for testing.

Models.
For MNIST, we use LeNet-5 [23]. On CIFAR-10, VGG16 [38] and ResNet18 [39] are adopted. On the larger CIFAR-100, we adopt VGG19 [38]. The dataset and model configurations are shown in Table 2.
Data Preparation. To verify that ActGraph is able to prioritize various test cases that trigger model bugs, we apply a variety of data operations to generate different types of datasets that make the DNN misclassify, including adversarial and natural noise cases. The natural operations include image rotation, translation and flipping, collectively referred to as Rotate. The adversarial operations include C&W [22] and the Jacobian-based Saliency Map Attack (JSMA) [40]. The C&W attack leads the model to output a false label with high confidence. JSMA changes only a few pixels to implement the attack, so the power of the disturbance is small.
We construct the Testset to prioritize and the Trainset for training the ranking model. For Original, the Testset comes from the test set of the original dataset, and the Trainset comes from its validation set. However, the DNN has high accuracy and the Trainset contains only a few misclassified samples, so these samples are repeatedly sampled until the positive and negative samples are balanced at 5,000 to 5,000, while the Testset remains unchanged. For the other types of data (Rotate, JSMA, C&W and Mix), the ratio of the Trainset is 5,000 to 5,000, i.e., 5,000 normally classified samples and 5,000 manipulated misclassified samples, and the ratio of the Testset is 8,000 to 2,000. Mix is randomly sampled from the four types of sets.
Baselines. We adopt model activation-based and mutation-based prioritization algorithms as baselines, including DeepGini [11], MCP [12], DSA [13] and PRIMA [15]. The parameters of these algorithms follow the settings reported in the respective papers. In addition, to explore the impact of ActGraph's graph-level features on prioritization, we extract and concatenate the confidence output and the last hidden output as an additional baseline, namely Act.
Metrics. We use RAUC [15] as the evaluation metric for prioritization. RAUC is the ratio of the area under the fault-detection curve of the test input prioritization approach to the area under the curve of the ideal prioritization:

RAUC = AUC_prioritization / AUC_ideal,

and RAUC-n is the RAUC computed over the first n test cases. The value range is [0, 1], and 1 indicates the best ranking result. RAUC-100, RAUC-500, RAUC-1000 and RAUC-ALL are used in the experiments.
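The metric can be sketched as follows, assuming the usual discretization: the curve is the cumulative number of faults found after inspecting k cases, and the ideal ordering places all fault-triggering cases first. The exact discretization in PRIMA may differ slightly; the function name is illustrative.

```python
def rauc_n(is_fault, n=None):
    """RAUC-n for a prioritized ordering.

    is_fault: 0/1 flags in prioritized order (1 = case is misclassified).
    n:        number of leading cases to evaluate (default: all)."""
    if n is None:
        n = len(is_fault)
    total = sum(is_fault)               # faults available in the whole list
    # area under the cumulative fault-detection curve of this ordering
    area, found = 0, 0
    for flag in is_fault[:n]:
        found += flag
        area += found
    # ideal curve: one new fault per inspected case until all are found
    ideal = sum(min(k + 1, total) for k in range(n))
    return area / ideal if ideal else 0.0
```

For example, the perfect ordering of two faulty and two correct cases gives RAUC 1.0, while the worst ordering of the same list gives 3/7.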
Implementation Details. Our experiments use the following settings: (1) for the XGBoost ranking algorithm in ActGraph, we set learning_rate to 0.1, colsample_bytree to 0.3 and max_depth to 5; (2) to reduce the computational cost, we set K = 4 and take the cnf of the last two layers; (3) for all image data, we normalize each pixel to [0, 1].

Effectiveness of ActGraph
In this section, we answer RQ1 by comparing ActGraph with the five baseline algorithms to verify its effectiveness in natural and adversarial scenarios.

Implementation details.
(1) Each model and dataset is set with two natural scenarios (Original and Rotate) and two adversarial scenarios (C&W and JSMA). (2) The training type is the same as the test type. (3) The trainset of the ranking model contains 2,000 samples: 1,000 positive and 1,000 negative.
Because the time and cost of prioritization are limited, the number of test cases that can be labeled is often small. This means that RAUC-100 is more important than RAUC-ALL for test case prioritization approaches. In RAUC-100, the average result of ActGraph is 0.871: 13.19% higher than DeepGini, 5.96% higher than PRIMA, 20.20% higher than MCP, 24.36% higher than DSA, and 13.50% higher than Act. The results also show that DeepGini outperforms PRIMA in natural scenarios, PRIMA outperforms DeepGini in adversarial scenarios, and on average PRIMA is better than DeepGini. ActGraph shows the SOTA effect in both adversarial and natural scenarios, especially in RAUC-100.

Generalizability of ActGraph
In this section, we answer RQ2 by validating the performance of ActGraph for prioritizing multiple types of test cases. Results and analysis. The results are shown in Table 4. Since DeepGini and MCP are unsupervised methods, their prioritization results are not affected by the training type. Of the total 40 results, ActGraph achieves 29 best results (72.5%), followed by PRIMA with 8 (20%), MCP with 2 and DeepGini with none. Averaging the 4 metrics, ActGraph performs the best, followed by PRIMA. The average results of ActGraph are 0.878∼0.898, which are 3.58%∼4.65% higher than PRIMA, 9.02%∼11.08% higher than DeepGini, 6.46%∼6.49% higher than Act, and 22.70%∼23.54% higher than the remaining baselines.

Interpretability of ActGraph
In this section, we answer RQ3 by explaining why ActGraph can be used effectively for prioritization. We show visualizations of test cases with high priority and carry out qualitative and quantitative analyses. Test case visualization. The visualization of test cases with high priority is shown in Fig. 3.
Intuitively, the FP and Rotate test cases are also difficult for humans to recognize. For example, the images "7" and "5" are incomplete; the image "dog" is too bright, and the long hair blocks the basic features of the dog; the colors of the images "cat", "flatfish", "seal" and "horse" are similar to the environment. In particular, the "tractor" and "camel" are rotated 180 degrees, putting the blue sky at the bottom of the image, so they are identified as "lobster" and "shark". For JSMA and C&W, some images are also broken or blurred, such as "9" and "2" in the first row and "pine tree" in the third row. Most of the images are clear, but adding imperceptible adversarial perturbations causes DNN output errors. This shows that ActGraph can pick up weak adversarial perturbations. Qualitative analysis. The t-SNE visualization in the first row of Fig. 4 is used for qualitative analysis.
Answer to RQ4: For the trainset size, we set four scenarios ranging from 500 to 2,000. ActGraph performs stably, and its performance increases with the trainset size. For the training hyperparameters, ActGraph is also stable. These results indicate that our previous parameter settings are appropriate.

Time Complexity
In this section, we answer RQ5, concerning the prioritization time cost.
Implementation details. We measure the average running time for ranking 10,000 test cases with ActGraph and the baselines. Each method is run 5 times, and the average is taken as the final result.
Results and analysis. First, we theoretically analyze the complexity of ActGraph by steps. The time complexity of ActGraph covers acquiring the activation values of the multi-layer neurons, calculating the adjacency matrix A, calculating nf and calculating cnf, so its time complexity is O(t(V + E)), where t is the number of samples, V is the number of neurons, and E is the number of edges between neurons.
Further, we analyze the efficiency of ActGraph in terms of real running time. According to Table 5, the running time of ActGraph is acceptable; its time cost increases with the total number of neurons and edges. Overall, ActGraph is much faster than PRIMA but slower than the other methods, because it searches more layers and neurons, calculates the weighted edges between neurons and computes the center node features on the activation graph. PRIMA runs on average 50 times longer than ActGraph. Answer to RQ5: the average running time for ActGraph to prioritize 10,000 cases is about four minutes, which is acceptable.

THREATS TO VALIDITY
The depth of the graph. We explain why K = 4 and why only two layers of cnf are taken in Section 3.5. Too large a K (more than 4) or too many layers of cnf (more than 2) would not only increase the time cost but also introduce noise from the shallow layers, while too small a K cannot yield valid cnf values.
Therefore, setting K = 4 is appropriate, which is also confirmed by the comprehensive experimental results. The experiments show that ActGraph performs significantly better in terms of effectiveness, generalizability and efficiency.
In the future, we will improve our ActGraph approach and apply it to more popular tasks and models, such as the transformer model for natural language processing and the long short-term memory model for time series forecasting.

Figure 1 :
Figure 1: The activation graphs of three types of cases. (a) Activation graph of the normal digit 9. (b) Activation graph of the normal digit 7. (c) Activation graph of the false positive digit 7, misclassified as 9. (d) Activation graph of the C&W digit 7, misclassified as 9.
(i) Effectiveness and generalizability. ActGraph extracts finer-grained activation features of the test cases on the model and converts the model activations into the spatial relationship of neurons, which solves the scenario-limitation problem of model activation-based methods. ActGraph can prioritize multiple types of test cases after learning from only one type, making it more general than model activation-based methods.

The authors of [34] proposed a "Rolled" graph representation of convolutional layers to solve the DNN performance prediction problem by capturing the early DNN dynamics during the training phase. To maintain the semantic meaning of the convolutional layers, they represent each filter as a node and link the filters in successive layers by weighted edges. Zhao et al. [35] proposed feature entropy to quantitatively identify individual neuron states of a specific class. In general, most existing DAG studies focus on the relationship between the structural parameters of the DNN and its overall behavior. ActGraph mainly differs from the existing DAG studies as follows: i) the methods of constructing the graph are different, since the graph of ActGraph uses the DNN structure and multi-layer activations, unlike past works that only consider the DNN structure; ii) the application scenarios are different, since ActGraph is used for test case prioritization in DNN testing.
We propose a novel test case prioritization method based on the model activation graph, namely ActGraph. It consists of three stages: i) Test Case Activation: the test cases are fed into the trained DNN, and each layer of the DNN outputs activation values (Section 3.2); ii) Feature Extraction: activation graphs are constructed based on the activations of each DNN layer, the adjacency matrix and node features are extracted from the activation graphs, and the center node features are obtained by message-passing aggregation (Section 3.3); iii) Ranking Model Building: ActGraph adopts the L2R framework to build a ranking model, which utilizes the center node features for prioritizing test cases (Section 3.4). The framework of ActGraph is shown in Fig. 2.

Figure 2 :
Figure 2: The framework of ActGraph, which is divided into three stages: test case activation, activation graph feature extraction and ranking model building.
ActGraph is validated on multiple types of test cases, especially with limited training knowledge. Limited training knowledge means that the trainset of the ranking model contains only one type of vulnerability-triggering cases, while multiple types of test cases need to be prioritized in the testing phase. Implementation details. (1) Five training types are set up to explore the generalizability of ActGraph on the Mix testset. (2) The trainset of the ranking model contains 2,000 samples: 1,000 positive and 1,000 negative. (3) RAUC-100 and RAUC-500 are evaluated, since the time and cost of prioritization are limited.

Implementation details. (1) The t-SNE visualization is used for qualitative analysis and the heat map for quantitative analysis. (2) We use mixed cases to analyze the intra-class and inter-class distances of the features of ActGraph and the baseline algorithms.

Figure 3 :
Figure 3: The visualization of test cases with high priority (i.e., FP, Rotate, JSMA and C&W). The first row is from MNIST, the second from CIFAR-10 and the third from CIFAR-100.

Figure 4 :
Figure 4: The t-SNE and heatmap visualizations. (a) Confidence: the output of the DNN confidence layer. (b) Embedding: the output of the last hidden layer of the DNN. (c) PRIMA: the mutation features from the PRIMA method. (d) ActGraph: the center node features. Each type contains 200 cases, using a VGG16 trained on CIFAR-10.

Figure 5 :
Figure 5: Influence of different trainset sizes.

Table 2 :
Dataset and model configurations.

The results are shown in Table 3, where bold indicates the optimal result among the methods for the same scenario type and metric. Of the total 64 results, ActGraph performs the best. Moreover, ActGraph outperforms Act, which illustrates that cnf is more effective than the raw model activation feature, because cnf carries not only the neuron activation information but also the node connection relationships between neurons. In particular, we show that ActGraph also outperforms Act in generalizability in Section 4.3.
Specifically, confidence-based methods perform better on FP cases, and embedding-based methods perform better on adversarial cases. In Original, DeepGini performs best with 11 best results, but in the C&W scenario ActGraph has 13 best results; in RAUC-100 especially, ActGraph is 9.41%∼52.51% higher than DeepGini. On the contrary, DSA, an embedding-based method, performs better in adversarial scenarios than in natural ones. These results confirm the previous hypothesis that confidence-based methods work well for natural scenarios and embedding-based methods work well for adversarial scenarios.

Table 3 :
The results of prioritization in adversarial and natural scenarios.

Table 4 :
The results of prioritization in mixed test case scenarios.
The variances of the supervised baselines range from 0.0005 to 0.0184, while the variances of ActGraph are 0.0041, 0.0028 and 0.0027, respectively. As a result, ActGraph performs more consistently than the other supervised learning baselines. This shows the stable effectiveness of ActGraph on different types of test cases under limited training knowledge. It also means that, for a DNN model under test, the ranking model of ActGraph does not need to be retrained frequently, because it performs stably on different types of test cases, indicating the generalizability of ActGraph.
Intuitively, the Confidence and Embedding features are confused, and the distances between different types are relatively close. In PRIMA, Clean and FP are close, JSMA and C&W are close, and Rotate lies between the natural and adversarial cases. This shows that although PRIMA can distinguish FP from adversarial cases, it struggles to distinguish Clean from FP. In ActGraph, Rotate and Clean are distinguished, most FP and Rotate cases overlap, only a few FP and Clean cases intersect, and the distance between JSMA and C&W is larger than in PRIMA. This shows that the center node feature of ActGraph not only prioritizes adversarial cases better, but also handles natural and FP cases better than the existing methods.
Quantitative analysis. The heat map is shown in the second row of Fig. 4. For intra-class distance, ActGraph scores 6.46∼12.78; except for JSMA, the intra-class distance of ActGraph is smaller than that of the other methods. For inter-class distance, intuitively, the distance between the natural cases (Clean, FP and Rotate) and the adversarial cases (JSMA and C&W) is relatively large.

Table 5 :
Time (seconds) taken to prioritize 10,000 test cases.
Time cost. The time cost of ActGraph may be affected by the number of neurons in the DNN. We demonstrate that ActGraph's runtime is acceptable by applying it to three popular DNN models with significantly different neuron counts.
Access to DNNs. ActGraph is a white-box method for prioritizing test cases: it needs access to the DNN's weight parameters and the deep activations of the test cases. It is widely accepted in software engineering that DNN testing can have full knowledge of the target model.
Aiming at the problems of limited application scenarios and high time cost in existing test case prioritization methods, we propose ActGraph, a test case prioritization method based on the DNN activation graph. We observe that the activation graphs of cases that trigger model vulnerabilities differ significantly from those of normal cases. Motivated by this, ActGraph extracts the node features and adjacency matrix of test cases by building an activation graph, and uses the message-passing mechanism to aggregate them into more effective center node features for prioritization. Extensive experiments verify the effectiveness of ActGraph, which outperforms the SOTA method in both natural and adversarial scenarios, especially in RAUC-100 (∼×1.40). Moreover, for 10,000 test cases, the actual running time of the SOTA method is 50 times that of ActGraph.