In this section, we evaluate our research questions through a case study on 20 datasets from open-source projects.
4.1 Experimental Setup
In this section, we present our experimental setup. Specifically, we describe the experimental parameters and the tools used for implementation, and we introduce the benchmark datasets and the preprocessing steps used in our experiments.
4.1.1 Experimental Parameters and Comparable Tool
In this study, we employed the default parameters for all algorithms except the PH change detection algorithm. With its default parameters, PH reported CD at almost the same points on all datasets. To see why this is problematic, suppose a standard CDD method detects CD at five points in a stream of 500 observations. We want to determine whether these are false positives or genuine CD points. If one CD is reported every 100 observations, the detected points are likely false positives; if the detector identifies CD points at irregular intervals, we can conclude with some confidence that they are likely genuine [48]. Our approach therefore also controls the false positive rate and improves CDD performance by adjusting parameters. The default PH parameters are a warning threshold (corresponding to the allowable false alarm rate) β = 0.01, sensitivity (acceptable error rate) δ = 0.001, and fading factor (the update weight for historical values in the PH statistic) γ = 0.999 [12]. With these default values, our proposed method detected CD points at equidistant intervals, indicating that these points are false positives rather than genuine CD points. To address this issue, we increased the false alarm value of the PH algorithm to β = 0.1 and recommend checking the CD results under other values of this parameter as well. This prompted us to explore larger parameter values for optimal results. For other domains with higher data arrival rates, different parameter values may be more suitable.
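To make the role of these parameters concrete, the following is a minimal sketch of a fading Page-Hinkley detector in R. The sensitivity `delta` and fading factor `gamma` correspond to δ and γ above, while the alarm threshold `lambda` is a hypothetical stand-in for the threshold that [12] derives from the false alarm rate β; it is not a value prescribed by the paper.

```r
# Minimal fading Page-Hinkley (PH) sketch: monitors a numeric stream `x` and
# returns the indices at which a significant upward change is detected.
page_hinkley <- function(x, delta = 0.001, gamma = 0.999, lambda = 50) {
  mean_t <- 0; m_t <- 0; m_min <- 0; n <- 0
  alarms <- integer(0)
  for (t in seq_along(x)) {
    n      <- n + 1
    mean_t <- mean_t + (x[t] - mean_t) / n        # incremental mean of the stream
    m_t    <- gamma * m_t + (x[t] - mean_t - delta)  # faded cumulative deviation
    m_min  <- min(m_min, m_t)
    if (m_t - m_min > lambda) {                   # change detected
      alarms <- c(alarms, t)
      mean_t <- 0; m_t <- 0; m_min <- 0; n <- 0   # restart after an alarm
    }
  }
  alarms
}

# Example: a synthetic stream whose mean shifts upward after observation 300.
set.seed(1)
stream <- c(rnorm(300, mean = 0), rnorm(200, mean = 3))
page_hinkley(stream, lambda = 25)
```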
Ready-made R implementations of the studied model-agnostic techniques (IME and BreakDown) are available. We applied the SMOTE technique using its R package implementation [49].
4.1.2 Dataset
In this study, we propose a CDD method for the evolving JIT-SDP problem that relies on unlabeled data by examining the consistency of instance interpretations over time. We address the RQs using 20 GitHub repositories previously studied by Lin et al. [4], who employed this dataset to derive the most important features for the JIT-SDP problem and investigated the consistency and stability of prediction performance across different datasets and models. The dataset has two key characteristics, public availability and long-term development, which make it an ideal resource for our study. Table 2 provides a summary of the dataset. It contains commit-level metrics divided into three categories (Table 3).
The diffusion category captures how distributed a commit is; a heavily distributed commit is more defect-prone, as shown in [50, 51]. Diffusion is measured by the number of modified subsystems (NS), the number of modified directories (ND), the number of modified files (NF), and the distribution of modified code across files (Entropy). Because the number of files differs across changes, we normalized the entropy by the maximum entropy log2(n), following Hassan [51].
The size category captures the size of a commit using lines added (LA), lines deleted (LD), and the lines of code in the file before the commit (LT).
The purpose category captures whether a commit fixes a defect; research has shown that defect-fixing commits are more likely to introduce new defects [52, 53].
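Returning to the Entropy metric of the diffusion category, the sketch below shows how the normalized change entropy of a commit can be computed from the number of modified lines per file; the line counts in the example are hypothetical.

```r
# Normalized change entropy of a commit: the Shannon entropy of the
# distribution of modified lines across the n touched files, divided by the
# maximum possible entropy log2(n), as in Hassan [51].
normalized_entropy <- function(lines_per_file) {
  n <- length(lines_per_file)
  if (n < 2) return(0)                       # a single-file change has zero entropy
  p <- lines_per_file / sum(lines_per_file)  # proportion of the change in each file
  h <- -sum(p * log2(p))
  h / log2(n)
}

# Hypothetical commit touching three files with 10, 5, and 1 modified lines.
normalized_entropy(c(10, 5, 1))
```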
4.1.3 Data Preprocessing
The datasets used in this study contain commit-level metrics that are skewed and vary in scale, with some metrics showing high correlation. To address this problem, we followed an approach similar to that of Lin et al. [4]. Specifically, we centered and scaled the values of the commit-level metrics using the scale function in R, except for the "FIX" metric, which is Boolean. Moreover, because correlated and inconsistent features cause interpretation inconsistencies, it is necessary to remove correlated metrics in the preprocessing stage [54]. Jiarpakdee et al. have shown that the Spearman correlation method removes correlated metrics significantly better than feature selection techniques [55]. Therefore, we used the same method to remove correlated features with a correlation coefficient greater than 0.7. Additionally, we normalized the entropy by the maximum entropy log2(n) to account for differences in the number of files across changes, as suggested by Hassan [51]. We used the Synthetic Minority Oversampling Technique (SMOTE) proposed by Chawla et al. [56] for class rebalancing to overcome the limitations of traditional oversampling and undersampling techniques. This technique generates artificial data in feature space rather than data space, and it combines synthetic oversampling of the minority defective modules with undersampling of the majority clean modules. SMOTE has an adjustable parameter that needs to be specified; however, Tantithamthavorn et al. demonstrated that changing this parameter does not affect their findings [1]. Prior studies employed SMOTE to create a balanced class distribution and found that it substantially improved SDP performance, with more consistent results than SDP without rebalancing [11, 57]. Therefore, we applied the SMOTE technique for resampling and leave the review of other resampling methods to future work.
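The following sketch outlines these preprocessing steps in R. The data frame `commits`, its column names, and the use of the caret and smotefamily packages are assumptions made for illustration; the paper itself uses the Spearman-based removal of [55] and the SMOTE implementation of [49], whose exact packages may differ.

```r
library(caret)        # findCorrelation() for removing correlated metrics
library(smotefamily)  # SMOTE() for class rebalancing (assumed implementation)

# `commits` is assumed to hold the commit-level metrics of Table 3 plus a
# binary label column `bug`; `FIX` is Boolean (coded 0/1) and is not scaled.
metric_cols <- c("NS", "ND", "NF", "Entropy", "LA", "LD", "LT")

# 1. Center and scale the numeric commit-level metrics.
commits[metric_cols] <- scale(commits[metric_cols])

# 2. Remove metrics whose Spearman correlation exceeds 0.7.
rho      <- cor(commits[metric_cols], method = "spearman")
drop_idx <- findCorrelation(rho, cutoff = 0.7)
kept     <- c(setdiff(metric_cols, metric_cols[drop_idx]), "FIX")

# 3. Rebalance the class distribution with SMOTE (K nearest neighbours = 5).
balanced <- SMOTE(X = commits[kept], target = commits$bug, K = 5)
train    <- balanced$data   # resampled data (smotefamily stores the label in a "class" column)
```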
Table 2
Summary of software projects from the studied dataset (The percentage of defect commits, compared to clean ones, is shown in parentheses) [4]
Project name | Date of first commit | Lines of code | # of changes |
accumulo | Oct 4, 2011 | 600,191 | 9,175 (21%) |
angular | Jan 5, 2010 | 249,520 | 8,720 (25%) |
brackets | Dec 7, 2011 | 379,446 | 17,624 (24%) |
bugzilla | Aug 26, 1998 | 78,448 | 9,795 (37%) |
camel | Mar 19, 2007 | 1,310,869 | 31,369 (21%) |
cinder | May 3, 2012 | 434,324 | 14,855 (23%) |
django | Jul 13, 2005 | 468,100 | 25,453 (42%) |
fastjson | Jul 31, 2011 | 169,410 | 2,684 (26%) |
gephi | Mar 2, 2009 | 129,259 | 4,599 (37%) |
hibernate-orm | Jun 29, 2007 | 711,086 | 8,429 (32%) |
hibernate-search | Aug 15, 2007 | 174,475 | 6,022 (35%) |
imglib2 | Nov 2, 2009 | 45,935 | 4,891 (29%) |
jetty | Mar 16, 2009 | 519,265 | 15,197 (29%) |
kylin | May 13, 2014 | 214,983 | 7,112 (25%) |
log4j | Nov 16, 2000 | 37,419 | 3,275 (46%) |
nova | May 27, 2010 | 430,404 | 49,913 (26%) |
osquery | Jul 30, 2014 | 91,133 | 4,190 (23%) |
postgres | Jul 9, 1996 | 1,277,645 | 44,276 (33%) |
tomcat | Mar 27, 2006 | 400,869 | 19,213 (28%) |
wordpress | Apr 1, 2003 | 390,034 | 37,937 (47%) |
Table 3
Summary of commit level metrics [4]
Category | Name | Description |
Diffusion | NS | Number of modified subsystems |
 | ND | Number of modified directories |
 | NF | Number of modified files |
 | Entropy | Distribution of modified code across each file |
Size | LA | Lines of code added |
 | LD | Lines of code deleted |
 | LT | Lines of code in a file before the commit |
Purpose | FIX | Whether or not the commit is a defect fix |
4.2 Experimental Results
In this section, we discuss our findings for each RQ.
4.2.1 RQ1 How does the proposed method based on different interpretation algorithms perform compared to baseline methods for detecting CD?
In this paper, we aim to discover CD by examining the inconsistency of instance interpretations over time. For this purpose, we obtained the instance interpretation vector for each newly generated sample. The instance interpretation vector comprises the contribution of every individual feature value to the prediction of that sample. Zheng [6] derived the most important features for more accurate software defect detection through model interpretation and demonstrated that, by extracting these features, 96% of the original effectiveness can be achieved with 45% of the previous effort. This observation led us to infer that changes in these important features over time will result in CD and a subsequent decline in model performance in the near future, a matter that has been disregarded by SDP studies. Hence, by examining significant changes in the distance between consecutive model interpretation vectors over time, it becomes feasible to approximate the occurrence of CD. However, the challenge of model interpretation lies in its dependence on data labels; as a result, it cannot predict CD for new incoming data that lack labels. As previously mentioned, instance interpretation does not require a data label. We provide the instance interpretation of two samples in Fig. 4. The subfigure on the left shows that the sample is defective with a probability of 95% (red box). The effect size of each feature with a specific value on the sample's class prediction is depicted by the blue and orange bars. For example, the effect size of the nld feature with value 2.6 on the prediction (which assigns the sample to the defect class with 95% probability) is close to 0.2, larger than that of the other features. In general, positive effect sizes are represented by the blue bars, while negative effect sizes are indicated by the orange bars.
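The study obtains these interpretation vectors with the IME and BreakDown R packages. As a simplified, package-free stand-in that only illustrates the idea, the sketch below approximates the contribution of each feature by replacing it with its training-set mean and measuring the change in the predicted defect probability of a random forest model; the randomForest package, the data frames `train` and `test`, and the 0/1 coding of the label are assumptions.

```r
library(randomForest)

# Train a defect prediction model on historical commits (assumed data frame
# `train` with metric columns and a factor label `bug` with levels "0"/"1").
rf <- randomForest(bug ~ ., data = train)

# Simplified interpretation vector for a single new commit `inst`:
# the drop in predicted defect probability when one feature at a time is
# replaced by its training-set mean. This is a rough stand-in for the
# contribution values produced by IME / BreakDown, not their algorithms.
interpretation_vector <- function(model, inst, train_data) {
  feats <- setdiff(names(train_data), "bug")
  base  <- predict(model, inst, type = "prob")[, "1"]    # P(defect) for the instance
  sapply(feats, function(f) {
    perturbed      <- inst
    perturbed[[f]] <- mean(train_data[[f]])              # neutralize the feature
    base - predict(model, perturbed, type = "prob")[, "1"]
  })
}

# Positive entries push the prediction towards the defect class,
# negative entries push it towards the clean class (cf. Fig. 4).
iv <- interpretation_vector(rf, test[1, ], train)
```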
We partitioned the data vectors into chunks in chronological order and derived the instance interpretation vector of each sample. When the distribution of a data chunk exhibits a statistically significant difference from the preceding data chunks, it is flagged as an inconsistency. This inconsistency is detected through the MANOVA algorithm. Then, to ascertain its validity and eliminate the possibility of randomness, the PH algorithm is employed. Finally, inconsistencies that are deemed valid (as determined by the output of the PH algorithm) are recognized as CD. In JIT-SDP, there is no dataset with labeled CD points; therefore, for evaluation, we compared the output of our work with the output of known baseline methods, which were fully described in the previous section. Finally, to address the research questions, we evaluated the outcomes of our proposed methods using the performance measures of CDD methods. According to Bifet, appropriate evaluation measures for detecting CD are outlined below (a computational sketch of these measures follows the list) [58, 59]:
- CDD_Accuracy: The accuracy of the proposed method in detecting CD points.
- Missed Detection Rate (MDR): The probability of not receiving a warning when a CD occurs, i.e., the ratio of undetected CDs to all CDs. A suitable detector has a small or zero MDR. It is obtained using Eq. (1):

MDR = 1 − CDD_Accuracy (1)

- Mean Time to Detection (MTD): The average delay between the point where a CD actually occurs and the point where it is detected. A suitable detector has a small MTD.
- Mean Time between False Alarms (MTFA): The average time between false alarms when no CD occurs. A suitable detector has a high MTFA.
- Mean Time Ratio (MTR): A measure of the trade-off between sensitivity and robustness. A suitable detector has a high MTR, which is obtained through Eq. (2):

MTR = \(\frac{MTFA}{MTD}\times (1 - MDR)\) (2)
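Under the assumption that a detection falling within a fixed tolerance window after a reference CD point counts as a hit, the measures above can be computed as in the following sketch. The tolerance, the matching rule, and the simple MTFA approximation are illustrative choices of ours, not values prescribed by [58, 59].

```r
# Evaluate detected CD points against reference CD points.
# `detected`, `reference`: indices (e.g. test-group numbers); `tolerance`: how
# far after a reference point a detection may fall and still count as a hit;
# `stream_length`: total number of monitored groups (used for MTFA).
cdd_measures <- function(detected, reference, tolerance = 10, stream_length) {
  delays <- sapply(reference, function(r) {
    hit <- detected[detected >= r & detected <= r + tolerance]
    if (length(hit) == 0) NA else min(hit) - r
  })
  hits         <- !is.na(delays)
  cdd_accuracy <- mean(hits)                          # fraction of reference CDs detected
  mdr          <- 1 - cdd_accuracy                    # Eq. (1)
  mtd          <- if (any(hits)) mean(delays[hits]) else NA  # mean detection delay
  # Detections not close to any reference point are counted as false alarms,
  # and MTFA is approximated as the monitored length divided by their number.
  windows <- unlist(lapply(reference, function(r) r:(r + tolerance)))
  n_false <- sum(!detected %in% windows)
  mtfa    <- if (n_false > 0) stream_length / n_false else stream_length
  mtr     <- (mtfa / mtd) * (1 - mdr)                 # Eq. (2)
  c(CDD_Accuracy = cdd_accuracy, MDR = mdr, MTD = mtd, MTFA = mtfa, MTR = mtr)
}
```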
We calculated the CDD performance measures to evaluate the proposed method; the evaluation procedure is illustrated in Fig. 5. Depending on which interpretation algorithm the proposed method uses, or whether it is applied to uninterpreted raw data, we use three designations: IME_base, BD_base, and raw_base. When the method is executed on resampled data, the designations rIME_base and rBD_base are used. These methods are described extensively in Section 3.2. Finally, we compared the results of the methods using the Friedman test [60], a statistical analysis employed to assess and contrast multiple sets of results; note that the minimum rank serves as the benchmark for the best performance in this test. In summary, our method involves the following steps:
1. Dividing the test data into groups of 100 for each of the 20 datasets, as illustrated in Fig. 6 (as stated in Section 3.1). For example, if a dataset contains 9,000 commits and the test group size is 100, we obtain 90 data groups.
2. Preprocessing the data, including the removal of correlated features and rebalancing of the class distribution.
3. Deriving the instance interpretation vector for each sample of the test data. To this end, the contribution of each metric to the prediction of the instance's class is calculated.
4. Identifying the test data groups that exhibit statistically significant differences from their previous groups using the MANOVA method. MANOVA analyzes the relationship between multiple dependent variables and one independent variable; here, the groups serve as the independent variable and the software metrics (data features) as the dependent variables. A p-value below 0.05 indicates a significant difference in distribution between the data within a group and the preceding groups. This identification process is shown in Fig. 7, and the resulting differences are presented in the table within Fig. 7: each column represents a feature, each row corresponds to a test data group, and the value in each cell indicates how many of the preceding groups that feature differs from significantly in its statistical distribution. For example, the distribution of the "FIX" feature in the 29th group differs significantly from that of two of the previous groups. We calculated the sum of each row of this table and included it as the last column ("sum" column).
5. Detecting CD points using the PH test. In the "sum" column, a row whose values are significantly larger than those of the previous rows indicates the occurrence of CD. Wherever consecutive rows in the "sum" column hold larger values than before, the difference is non-random and significant; we located these change points using the PH change detection algorithm (a simplified sketch of steps 4 and 5 follows this list). If only one row shows a considerable difference, it may be a coincidence. For example, the values of the "sum" column around group 29 are increasing, indicating a concept change.
Instead of applying the PH algorithm to significant changes in the "sum" column, it could be applied to each individual feature, making it possible to investigate which feature's change has the greater impact on the occurrence of CD. We leave this idea to future research.
6. Identifying CD points with the proposed method for all 20 datasets, on both uninterpreted raw data and interpreted data.
7. Collecting the values of the performance measures of the trained RF model over time for each dataset, namely ER, AUC, Gmean, Precision, F-measure, Recall, and MCC. These measures are commonly used in research to assess the stability of the SDP model.
8. Identifying CD points for each of the 20 datasets by monitoring each of these criteria using the PH statistical test (baseline methods).
9. Calculating the evaluation measures of CDD methods, including CDD_Accuracy, MDR, MTD, MTFA, and MTR, to compare the results of our method (step 6) with the results of the baseline methods (step 8).
10. Evaluating the proposed method by determining its degree of agreement with the output of the baseline methods using the Friedman test.
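A minimal sketch of steps 4 and 5 is given below. It simplifies the procedure of Fig. 7 by using the overall MANOVA (Pillai) p-value for each pair of groups rather than per-feature counts, and it reuses the `page_hinkley()` function sketched in Section 4.1.1. The object `interpretations` (the chronologically ordered matrix of instance interpretation vectors for the test commits) and the threshold `lambda` are assumptions for illustration; the group size and significance level follow the values stated above.

```r
# Steps 4-5 (simplified): count, for every test group, how many of the
# preceding groups differ significantly in the distribution of the
# interpretation vectors (MANOVA), then run PH on that "sum" series.
chunk_size <- 100
groups <- split(as.data.frame(interpretations),
                ceiling(seq_len(nrow(interpretations)) / chunk_size))

sum_column <- sapply(seq_along(groups), function(i) {
  if (i == 1) return(0)
  sum(sapply(seq_len(i - 1), function(j) {
    pair <- rbind(groups[[i]], groups[[j]])
    grp  <- factor(rep(c("current", "previous"),
                       c(nrow(groups[[i]]), nrow(groups[[j]]))))
    fit  <- manova(as.matrix(pair) ~ grp)
    pval <- summary(fit, test = "Pillai")$stats[1, "Pr(>F)"]
    pval < 0.05                        # TRUE if group i differs from group j
  }))
})

# Rows where the "sum" series rises sharply and persistently mark CD points.
cd_points <- page_hinkley(sum_column, lambda = 25)   # lambda is illustrative
```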
This concludes the overview of the design process; in the following, we provide an illustrative example for better understanding. The ER PH baseline method identifies the CD points presented in Table 4. Figure 8 displays the complete output of the "sum" column (the table within Fig. 7) on the WordPress dataset, and Table 5 lists the CD points obtained by applying the PH algorithm to that "sum" column. Figure 9 shows the segmented version of Fig. 8, highlighting the concept changes in the regions that correspond to the CDs obtained from the baseline method (Table 4).
Table 4
Discovered CD points of WordPress dataset using ER PH method
24 | 53 | 82 | 111 | 142 | 174 | 203 | 232 | 261 | 290 | 325 | 357 |
Table 5
Discovered CD points of WordPress dataset using proposed method
27 | 57 | 86 | 115 | 144 | 173 | 202 | 231 | 260 | 289 | 318 | 347 |
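As a quick check of the detection delay, the points in Table 5 can be matched to the nearest points in Table 4; the pairing below is our own illustrative reading of the two tables.

```r
baseline <- c(24, 53, 82, 111, 142, 174, 203, 232, 261, 290, 325, 357)  # Table 4
proposed <- c(27, 57, 86, 115, 144, 173, 202, 231, 260, 289, 318, 347)  # Table 5

# Absolute offset between each proposed detection and its nearest baseline point.
offsets <- sapply(proposed, function(p) min(abs(p - baseline)))
offsets        # 3 4 4 4 2 1 1 1 1 1 7 10
mean(offsets)  # average offset of 3.25 groups between the two methods
```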
We evaluated the results of our proposed method based on the output of baseline methods using CDD evaluation measures such as cdAccuracy, MTFA, MTD, and MTR. As previously mentioned, baseline methods are acquired through the monitoring of performance measures, both threshold-dependent and threshold-independent, which are calculated over time. These criteria are obtained by training the RF machine learning algorithm. In this section, we examined the effect of training data resampling on these criteria.
Table 6 The performance of the RF classifier before and after resampling
Measure | Before | After |
Accuracy | 12 | 3 |
AUC | 13 | 3 |
Precision | 15 | 1 |
Recall | 4 | 14 |
F Measure | 4 | 10 |
Gmean | 5 | 11 |
First, we obtained the values of each performance measure from the RF incremental machine learning model over time on the 20 datasets. Then, we conducted the Wilcoxon test [61] to determine whether each measure performed significantly better on a given dataset before or after resampling. Table 6 presents the number of datasets on which the classifier performed better before or after training data resampling. For instance, the first row shows that, out of 20 datasets, accuracy was significantly better over time on 12 datasets before resampling, while it improved on 3 datasets after resampling. These results indicate that resampling reduces the values of the threshold-independent performance criteria (the first two rows of Table 6) while enhancing the values of the threshold-dependent criteria (the last three rows of Table 6). Consequently, balancing has a positive effect on threshold-dependent criteria in our problem. Because our data are imbalanced, it is therefore necessary to also investigate the performance of the method on resampled data. Furthermore, we present an analysis of the values of these criteria over time on the 20 datasets in box-plot format (Fig. 10). The results indicate that all criteria, with the exception of precision, exhibit commendable stability. Previous studies have also examined the stability of the AUC criterion [1].
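The per-measure comparison behind Table 6 can be sketched as follows, where `before` and `after` are assumed to hold the time series of a given performance measure on one dataset before and after resampling.

```r
# For one dataset and one performance measure (e.g. AUC), test whether the
# measure is significantly better before or after resampling; repeating this
# over the 20 datasets yields the counts reported in Table 6.
compare_resampling <- function(before, after, alpha = 0.05) {
  p <- wilcox.test(before, after, paired = TRUE)$p.value  # paired over the same time points
  if (p >= alpha) return("no significant difference")
  if (median(before) > median(after)) "better before resampling" else "better after resampling"
}
```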
We then evaluated the proposed CDD method using the cdAccuracy, MTFA, MTD, and MTR criteria. As an example, Table 7 presents the evaluation results of the MTR criterion for the output of the proposed CDD method in its different modes, compared to the baseline method based on ER monitoring. In other words, the CD points were obtained for every dataset through the proposed method in its different modes and through the baseline method based on ER monitoring; we then computed the MTR value from Eq. (2) for each dataset. For example, the cell at the intersection of the accumulo row and the raw_base column in Table 7 shows an MTR of 1500 when comparing the output of the raw_base method on that dataset with the output of the baseline method based on ER monitoring. We computed the MTR for all five methods on each of the 20 datasets. Finally, we analyzed the MTR values with the Friedman test to determine which method exhibited significantly superior (higher) performance. The last row of Table 7 provides the results of the Friedman test; a lower rank indicates better performance. Since only the results of the Friedman test (the last row of Table 7) matter for the evaluation, we do not include the equivalent of Table 7 for the other baseline methods. In summary, the final results obtained from the Friedman test are compiled in Table 8 through Table 11. Each of these tables presents the evaluation results for one of the performance measures, namely MTR, cdAccuracy, MTD, and MTFA, comparing the proposed methods with different baseline methods. According to the data presented in Table 8, for all baseline methods listed in the first column, the values in the IME_base column are consistently lower than those in the BD_base and raw_base columns. We can therefore deduce that the method based on IME interpretation outperforms BD_base and raw_base (Table 8). Even on resampled data, IME_base (the rIME_base column) performed better than BD_base (the rBD_base column). The results for the other three criteria (cdAccuracy, MTD, and MTFA) in Tables 9 through 11 show the same pattern as MTR.
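The last row of Table 7 can be approached with R's built-in friedman.test and a per-dataset ranking of the methods, as in the sketch below; `mtr` is assumed to be the matrix of per-dataset MTR values from Table 7, and since a higher MTR is better we assume the best method on each dataset receives rank 1. The paper's exact ranking procedure (e.g. tie handling or additional methods entering the comparison) is not fully specified here, so these mean ranks are illustrative rather than an exact reproduction.

```r
# `mtr`: matrix of MTR values (rows = datasets, columns = raw_base, IME_base,
# BD_base, rIME_base, rBD_base), as in Table 7.
ranks <- t(apply(mtr, 1, function(row) rank(-row)))  # rank 1 = highest MTR on that dataset
colMeans(ranks)                                      # mean ranks, cf. the last row of Table 7
friedman.test(as.matrix(mtr))                        # overall significance test across methods
```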
Table 7 The evaluation results of the MTR criterion, which compares CDD methods with the method based on ER monitoring using the Friedman test
methodology/dataset | raw_base | IME_base | BD_base | rIME_base | rBD_base |
accumulo | 1500 | 1500 | 1500 | 6000 | 6000 |
angular | 221 | 239 | 221 | 239 | 221 |
brackets | 279 | 139 | 279 | 11700 | 11700 |
bugzilla | 4800 | 4800 | 4800 | 3200 | 3200 |
camel | 2193 | 1919 | 2193 | 10233 | 10233 |
cinder | 370 | 370 | 370 | 7400 | 7400 |
django | 6300 | 4200 | 1232 | 4200 | 1938 |
gephi | 643 | 643 | 643 | 643 | 643 |
hibernate-orm | 4050 | 2700 | 4050 | 579 | 225 |
hibernate-search | 750 | 750 | 750 | 857 | 857 |
imglib2 | 575 | 575 | 575 | 177 | 177 |
jetty | 620 | 620 | 620 | 12400 | 1771 |
kylin | 6000 | 6000 | 6000 | 6000 | 500 |
log4j | 1000 | 1000 | 1000 | 1000 | 1000 |
nova | 2690 | 6725 | 6725 | 13450 | 13450 |
postgres | 732 | 2090 | 1819 | 4390 | 2665 |
tomcat | 2350 | 2089 | 2089 | 895 | 1567 |
wordpress | 4688 | 9375 | 4312 | 6250 | 4688 |
meanAvg | 4.22 | 4.11 | 4.5 | 3.17 | 3.94 |
Table 8
The evaluation results of the MTR criterion, which compares CDD methods with methods based on various criteria monitoring using Friedman test
| raw_base | IME_base | BD_base | rIME_base | rBD_base |
ER PH-Er PH (the last line of Table 7) | 4.22 | 4.11 | 4.5 | 3.17 | 3.94 |
AUC-Er PH | 4.33 | 3.75 | 4.67 | 3.42 | 3.61 |
Gmean-Er PH | 4.06 | 3.17 | 4.06 | 3.83 | 4.75 |
Precision-Er PH | 4.08 | 3.56 | 4.81 | 3.61 | 4.08 |
Recall-Er PH | 4.28 | 3.28 | 3.89 | 3.89 | 4.61 |
MCC-Er PH | 4.31 | 3.31 | 4.25 | 3.14 | 4.03 |
F Measure-Er PH | 4.22 | 3.03 | 4.31 | 3.81 | 4.33 |
Table 9
The evaluation results of the cdAccuracy criterion, which compares CDD methods with methods based on various criteria monitoring using Friedman test (no significant difference compared to MTR)
| raw_base | IME_base | BD_base | rIME_base | rBD_base |
ER PH-Er PH | 3.69 | 3.89 | 4.33 | 3.89 | 4.22 |
AUC-Er PH | 3.75 | 3.94 | 4.42 | 3.86 | 3.94 |
Gmean-Er PH | 3.75 | 3.75 | 4.44 | 3.86 | 4.17 |
Precision-Er PH | 3.75 | 3.86 | 4.44 | 3.75 | 4.14 |
Recall-Er PH | 3.69 | 3.69 | 4.36 | 3.89 | 4.31 |
MCC-Er PH | 3.72 | 3.92 | 4.42 | 3.64 | 3.94 |
F Measure-Er PH | 3.75 | 3.75 | 4.44 | 3.86 | 4.19 |
Table 10
The evaluation results of the MTD criterion, which compares CDD methods with methods based on various criteria monitoring using the Friedman test (no significant difference compared to MTR)
| raw_base | IME_base | BD_base | rIME_base | rBD_base |
ER PH | 4.03 | 4.42 | 4.31 | 3.28 | 3.72 |
AUC-Er PH | 4.19 | 3.81 | 4.58 | 3.31 | 3.94 |
Gmean-Er PH | 3.89 | 3.5 | 3.94 | 4.03 | 4.56 |
Precision-Er PH | 3.79 | 3.71 | 4.44 | 3.76 | 4.5 |
Recall-Er PH | 4.03 | 3.64 | 3.89 | 3.83 | 4.5 |
MCC-Er PH | 4.17 | 3.69 | 4.17 | 3.19 | 3.92 |
F Measure-Er PH | 3.91 | 3.29 | 4.18 | 3.79 | 4.59 |
Table 11
The evaluation results of the MTFA criterion, which compares CDD methods with methods based on various criteria monitoring using the Friedman test (BD_base performed better than IME_base in the rebalanced variants)
| raw_base | IME_base | BD_base | rIME_base | rBD_base |
ER PH | 4.64 | 4.08 | 4.36 | 3.33 | 3.28 |
AUC-Er PH | 4.53 | 3.94 | 4.33 | 3.75 | 3.36 |
Gmean-Er PH | 4.36 | 3.58 | 4.17 | 4.17 | 3.97 |
Precision-Er PH | 4.35 | 3.74 | 4.56 | 3.74 | 3.74 |
Recall-Er PH | 4.33 | 3.56 | 4.14 | 3.94 | 4.14 |
MCC-Er PH | 4.33 | 3.75 | 4.14 | 3.94 | 3.75 |
F Measure-Er PH | 4.5 | 3.72 | 4.31 | 3.72 | 3.72 |
To determine which interpretation algorithm and which type of data (interpreted or raw, simplified or resampled) yields superior results for the proposed method, we conducted an experimental study on 20 datasets with different baseline methods, using the performance measures cdAccuracy, MTFA, MTD, and MTR. For easier comparison, we produced radar diagrams for each measure to illustrate which method performed better. A radar diagram compares the values of three or more variables with respect to a central point, and such diagrams are highly effective when comparing the performance of an algorithm with that of a baseline method. Figure 11 displays the radar diagrams corresponding to Table 8 through Table 11. This type of diagram is used to compare the consistency of the output of the proposed interpretation-based methods with that of the baseline methods, measured by the four well-known performance criteria of CDD methods: MTR, cdAccuracy, MTD, and MTFA (four subfigures). Since the diagrams are based on the Friedman test output, the closer a method's graph is to the center, the better that method performs. As depicted in Fig. 11, the orange graph (the IME-based method) consistently falls within the gray graph (the BD-based method) in all four subfigures. Consequently, the IME-based method consistently outperforms the BD-based method, even in resampled mode (as indicated by the yellow graph, the rIME-based method, always lying inside the bold blue graph, the rBD-based method).
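Radar diagrams such as those in Fig. 11 can be drawn, for example, with the fmsb package, whose radarchart() expects the axis maxima and minima as the first two rows of the data frame. The values below are the mean ranks from Table 8; the package choice and plotting details are our own illustrative assumptions, not necessarily those used to produce Fig. 11.

```r
library(fmsb)  # radarchart()

# Mean Friedman ranks from Table 8 (rows = baseline methods, columns = CDD methods).
mean_ranks <- data.frame(
  raw_base  = c(4.22, 4.33, 4.06, 4.08, 4.28, 4.31, 4.22),
  IME_base  = c(4.11, 3.75, 3.17, 3.56, 3.28, 3.31, 3.03),
  BD_base   = c(4.50, 4.67, 4.06, 4.81, 3.89, 4.25, 4.31),
  rIME_base = c(3.17, 3.42, 3.83, 3.61, 3.89, 3.14, 3.81),
  rBD_base  = c(3.94, 3.61, 4.75, 4.08, 4.61, 4.03, 4.33),
  row.names = c("ER", "AUC", "Gmean", "Precision", "Recall", "MCC", "F-measure")
)

# radarchart() draws one polygon per row; to compare methods we transpose so
# that each row is a method, then prepend the required max/min rows.
methods    <- as.data.frame(t(mean_ranks))
chart_data <- rbind(rep(5, ncol(methods)), rep(1, ncol(methods)), methods)
radarchart(chart_data, axistype = 1)  # closer to the center = lower rank = better
```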
4.2.2 RQ2 How does the proposed method based on resampled data perform compared to the baseline methods that rely on threshold-dependent and threshold-independent criteria?
In this section, we compare the performance of the proposed method implemented on resampled data with the output of the baseline methods, which are based on monitoring both threshold-dependent and threshold-independent criteria. The Friedman test revealed that our methods did not exhibit significant differences in findings, but the ranking of the method outputs on the 20 datasets yielded promising results. As previously mentioned, detecting CD points by monitoring ER is a common method; in this work, however, we also located CD points by monitoring measures other than ER. Then, to examine the performance of the proposed interpretation-based methods more extensively, we evaluated their consistency with the baseline methods based on monitoring different criteria, using the four well-known performance criteria of CDD methods: MTR, cdAccuracy, MTD, and MTFA. The criteria used in the baseline methods are either threshold-dependent or threshold-independent. In the previous section, we noted that studies have shown that these two categories behave differently with respect to the stability of incremental models, and related studies therefore avoid relying on them. On the other hand, checking model stability on resampled data is also a challenge in this field. To address these two challenges, we posed RQ2. To answer it, we investigated the consistency of the CD points obtained from the baseline methods with those obtained from the proposed methods implemented on simplified and on resampled data. To provide a clearer response, we categorized the results of Table 8 to Table 11 into two groups, methods based on threshold-dependent criteria and methods based on threshold-independent criteria, using radar diagrams. These diagrams are displayed in Fig. 12 (methods based on threshold-independent criteria) and Fig. 13 (methods based on threshold-dependent criteria). Since the diagrams are based on the Friedman test output, the closer a method's graph is to the center, the better that method performs. In view of Fig. 12, when threshold-independent criteria are used for the baseline methods, the output of the proposed methods implemented on resampled data is more consistent with the results of these baseline methods according to all four criteria of CDD methods: MTR, cdAccuracy, MTD, and MTFA. As shown in Fig. 12, the yellow chart (rIME-base), which is based on resampled data, falls within the orange chart (IME-base), which is based on simplified data, and likewise the bold blue chart (rBD-base) falls within the gray chart (BD-base).
According to Fig. 13, when threshold-dependent criteria are used for the baseline methods, the output of the proposed methods implemented on simplified data is more consistent with the results of these baseline methods than the output on resampled data. The orange graph (IME-base), which is based on simplified data, always falls within the yellow graph (rIME-base), which is based on resampled data, as shown in Fig. 13. However, the accuracy of the CDD method based on BD interpretation (BD_base) is lower than that of rBD_base, which affects the false alarm criterion, i.e., MTFA.
RQ2: The threshold-independent performance measures of the classifier decrease after resampling, while its threshold-dependent criteria improve. Resampling also reduces the agreement of the CD points discovered by the proposed CDD methods with the output of baseline methods based on monitoring threshold-dependent criteria, but improves it relative to baseline methods based on monitoring threshold-independent measures. Accordingly, if scholars wish to predict the performance instability of threshold-dependent measures, they should employ the method based on simplified samples; conversely, to predict the performance instability of threshold-independent measures, they should use the method based on resampled samples. For example, when improving threshold-dependent measures such as Recall [1] and increasing the completeness of defect detection are important, methods based on simplified, imbalanced data perform better, and the output of the proposed method is closer to that of the baseline method.