The proposed framework for the defect prediction system is shown in Figure 1. It is divided into three levels: datasets, data pre-processing, and model training and testing. At the first level (datasets), eight datasets from the PROMISE Software Engineering repository are used for experimentation. At the second level (data pre-processing), feature selection techniques are applied; these fall into filter, wrapper, and hybrid methods. After applying these techniques to the datasets, we found that the hybrid method, a combination of sequential forward selection (SFS) and sequential backward selection (SBS), outperforms the filter and wrapper methods in terms of accuracy and execution time; we published these results at the international conference AICES 2022, organized at NIT Raipur on 12 February 2022. In this work we refine the dataset further to improve accuracy by wrapping the hybrid feature selection with outlier removal techniques.
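For readers unfamiliar with the hybrid step, the sketch below shows one way an SFS + SBS hybrid can be realised with scikit-learn. It is a minimal illustration only: the combination rule (here, intersecting the forward- and backward-selected subsets), the k-nearest-neighbours estimator, and the subset size are assumptions, not the exact configuration of our earlier work.

```python
# Illustrative hybrid feature selection: run sequential forward selection
# (SFS) and sequential backward selection (SBS), then keep the features
# chosen by both passes. The combination rule and estimator are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data with 21 features, as in several PROMISE datasets.
X, y = make_classification(n_samples=500, n_features=21, random_state=0)

estimator = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(estimator, n_features_to_select=8,
                                direction="forward").fit(X, y)
sbs = SequentialFeatureSelector(estimator, n_features_to_select=8,
                                direction="backward").fit(X, y)

hybrid_mask = sfs.get_support() & sbs.get_support()  # features kept by both
X_hybrid = X[:, hybrid_mask]
print("features kept:", np.flatnonzero(hybrid_mask))
```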
3.1 Data preprocessing
In today's machine-learning practice the dataset matters at least as much as the code, so effort spent improving the dataset translates directly into higher accuracy. The key purpose of this research is therefore to add an outlier removal step to the hybrid feature selection, which further trims redundant data and can yield higher accuracy and lower execution time when predicting defects in module-based software. Outliers are abnormal values in the dataset that can distort statistical analysis. Four outlier removal techniques are used: IQR (interquartile range), Z-score, Isolation Forest, and DBSCAN (density-based spatial clustering of applications with noise). The IQR is the difference between the median of the lower half of the data and the median of the upper half; it measures spread, i.e., how far apart the data points are. The Z-score measures how far a data point lies from the mean, in units of standard deviation. Isolation Forest looks for short paths in randomly built trees, since points isolated by short paths are outliers. DBSCAN is a clustering algorithm that finds clusters of varying size; points that fall into no cluster are treated as outliers. In the next section we apply the IQR technique to the dataset produced by hybrid feature selection.
3.1.1 Hybrid + IQR
We now apply the IQR technique to the dataset obtained after hybrid feature selection. IQR divides the dataset into quartiles, each covering 25% of the data points. We find the first quartile, Q1, and the third quartile, Q3, and compute the IQR as Q3 − Q1. The data range is then bounded by an upper limit of Q3 + 1.5 × IQR and a lower limit of Q1 − 1.5 × IQR. Data points inside this range are considered inliers; points outside it are outliers. Table 1 lists, for each dataset, the total number of rows, the accuracy of hybrid feature selection alone, the number of rows left after applying hybrid feature selection plus IQR outlier removal, the accuracy of hybrid + IQR, the percentage reduction in data points, and the execution time (ET). Using hybrid + IQR increases accuracy and reduces execution time compared with hybrid feature selection alone. For example, on the jm1 dataset the accuracy is 94.2% after hybrid feature selection and 95.9% after hybrid + IQR, while the execution time falls from 15.6 s before hybrid + IQR to 11.9 s after.
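A minimal sketch of this rule in Python, using pandas, is shown below. Applying the 1.5 × IQR fences to every feature column at once, and dropping a row if any of its values falls outside them, is an assumption about row-level handling that the text above leaves open.

```python
# IQR outlier removal: drop rows with any feature outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    q1 = df.quantile(0.25)   # first quartile of each feature
    q3 = df.quantile(0.75)   # third quartile of each feature
    iqr = q3 - q1            # interquartile range per feature
    lower = q1 - k * iqr     # lower fence
    upper = q3 + k * iqr     # upper fence
    # Keep a row only if all of its values lie inside the fences.
    mask = ((df >= lower) & (df <= upper)).all(axis=1)
    return df[mask]
```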
Table 1 Hybrid+IQR.
Dataset | Total Data Points (rows) | Accuracy (Hybrid) (%) | Rows after Hybrid+IQR | Accuracy (Hybrid+IQR) (%) | Reduction in Data Points (%) | ET (sec)
cm1 | 498 | 92.3 | 312 | 94.3 | 37.3 | 5.1
jm1 | 10885 | 94.2 | 5022 | 95.9 | 53.8 | 11.9
mw1 | 253 | 91.2 | 108 | 93.2 | 57.3 | 3.5
kc1 | 2109 | 92.8 | 1223 | 93.5 | 42.0 | 7.2
kc2 | 522 | 95.3 | 324 | 96.5 | 37.9 | 5.7
pc1 | 745 | 92.9 | 409 | 93.5 | 45.1 | 6.1
pc3 | 1077 | 91.6 | 654 | 93.9 | 39.2 | 6.4
pc4 | 1287 | 94.3 | 753 | 95.7 | 41.4 | 6.6
Table 2 Hybrid+Z-score.
Dataset | Total Data Points (rows) | Accuracy (Hybrid) (%) | Rows after Hybrid+Z-score | Accuracy (Hybrid+Z-score) (%) | Reduction in Data Points (%) | ET (sec)
cm1 | 498 | 92.3 | 382 | 93.4 | 23.2 | 4.5
jm1 | 10885 | 94.2 | 6087 | 94.3 | 44.0 | 10.2
mw1 | 253 | 91.2 | 146 | 92.1 | 42.2 | 2.9
kc1 | 2109 | 92.8 | 1464 | 93.3 | 30.5 | 6.6
kc2 | 522 | 95.3 | 382 | 96.5 | 26.8 | 5.7
pc1 | 745 | 92.9 | 513 | 93.4 | 31.1 | 5.8
pc3 | 1077 | 91.6 | 710 | 92.5 | 34.0 | 6.0
pc4 | 1287 | 94.3 | 822 | 94.6 | 36.1 | 6.1
3.1.2 Hybrid + Z-score
Similarly, after IQR we apply another outlier removal technique, the Z-score, to the output of feature selection; the results are shown in Table 2. The Z-score of a data point is obtained by subtracting the mean of the dataset from the point and dividing by the standard deviation. About 68% of the samples lie within one standard deviation of the mean, and a data point whose Z-score exceeds 3 in absolute value is considered an outlier. Table 2 shows that the accuracy of every dataset increases. For example, the accuracy of kc1 is 92.8% after hybrid feature selection and 93.3% after hybrid + Z-score. Execution time is likewise reduced: for kc1 it falls from 15.5 s to 6.6 s with hybrid + Z-score.
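A minimal sketch of this rule, assuming scipy and a policy of dropping a row if any of its features crosses the cutoff (the text above does not fix the row-level handling):

```python
# Z-score outlier removal: drop rows with any feature whose
# standardized value exceeds the cutoff (|z| > 3 by default).
import numpy as np
from scipy import stats

def remove_zscore_outliers(X: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    z = np.abs(stats.zscore(X, axis=0))  # (x - mean) / std, per feature
    mask = (z < threshold).all(axis=1)   # inlier: every feature within cutoff
    return X[mask]
```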
3.1.3 Hybrid + Isolation Forest
After the Z-score we applied another outlier removal technique, Isolation Forest, to the output of feature selection; the results are shown in Table 3. Like random forest, Isolation Forest is built from decision trees. It selects a feature from the feature set and picks a random split value between the minimum and maximum of that feature. Anomalous samples are easier to isolate, so they end up on shorter paths in the trees, and samples found on shorter paths are flagged as outliers. Table 3 shows that with hybrid + Isolation Forest the execution time of every dataset decreases and the accuracy of every dataset increases. For example, the accuracy of kc2 and pc1 is 95.3% and 92.9% before applying hybrid + Isolation Forest and 96.1% and 93.4% respectively after. The execution time is also reduced from 15.7 s to 5.7 s on the pc1 dataset.
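A minimal sketch using scikit-learn's IsolationForest is given below; the contamination level (the expected share of outliers) is an assumption that would need tuning per dataset.

```python
# Isolation-forest outlier removal: fit_predict labels inliers +1
# and outliers -1; keep only the inliers.
from sklearn.ensemble import IsolationForest

def remove_iforest_outliers(X, contamination=0.1, random_state=0):
    iso = IsolationForest(contamination=contamination,
                          random_state=random_state)
    labels = iso.fit_predict(X)  # +1 for inliers, -1 for outliers
    return X[labels == 1]
```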
Table 3 Hybrid+Isolation Forest.
Dataset | Total Data Points (rows) | Accuracy (Hybrid) (%) | Rows after Hybrid+Isolation Forest | Accuracy (Hybrid+Isolation Forest) (%) | Reduction in Data Points (%) | ET (sec)
cm1 | 498 | 92.3 | 404 | 93.5 | 18.8 | 4.2
jm1 | 10885 | 94.2 | 6523 | 94.7 | 40.0 | 10.5
mw1 | 253 | 91.2 | 189 | 92.9 | 25.2 | 2.6
kc1 | 2109 | 92.8 | 1534 | 93.9 | 27.2 | 6.2
kc2 | 522 | 95.3 | 412 | 96.1 | 21.0 | 5.4
pc1 | 745 | 92.9 | 551 | 93.4 | 26.0 | 5.7
pc3 | 1077 | 91.6 | 745 | 92.3 | 30.8 | 5.8
pc4 | 1287 | 94.3 | 852 | 94.7 | 33.7 | 6.0
Table 4 Hybrid+DBSCAN.
Dataset | Total Data Points (rows) | Accuracy (Hybrid) (%) | Rows after Hybrid+DBSCAN | Accuracy (Hybrid+DBSCAN) (%) | Reduction in Data Points (%) | ET (sec)
cm1 | 498 | 92.3 | 352 | 91.5 | 29.3 | 4.3
jm1 | 10885 | 94.2 | 6024 | 95.3 | 44.6 | 10.3
mw1 | 253 | 91.2 | 200 | 93.8 | 20.9 | 3.2
kc1 | 2109 | 92.8 | 1765 | 93.1 | 16.3 | 7.4
kc2 | 522 | 95.3 | 413 | 94.2 | 20.8 | 5.6
pc1 | 745 | 92.9 | 553 | 93.2 | 25.7 | 6.5
pc3 | 1077 | 91.6 | 790 | 93.6 | 26.6 | 6.1
pc4 | 1287 | 94.3 | 912 | 94.8 | 29.1 | 6.5
|
3.1.4 Hybrid + DBSCAN
DBSCAN clusters the data points based on two parameters: epsilon and the minimum number of samples. Epsilon is the radius of the neighbourhood drawn around each point to check its density, and the minimum-samples parameter is the number of points that must fall inside that radius for a point to count as a core point. Points that end up in no cluster on this basis are treated as outliers. The results for all datasets are shown in Table 4. The accuracy of jm1, mw1, kc1, pc1, pc3, and pc4 increases; for example, jm1, mw1, and pc1 rise from 94.2%, 91.2%, and 92.9% to 95.3%, 93.8%, and 93.2% respectively. The execution time of every dataset is reduced; for example, the execution time of pc1 was 14.5 s before applying hybrid + DBSCAN and 6.5 s after.
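A minimal sketch using scikit-learn's DBSCAN is shown below; points labelled -1 (noise, i.e. assigned to no cluster) are dropped. The eps and min_samples values and the standardisation step are assumptions that would need tuning per dataset.

```python
# DBSCAN outlier removal: cluster the (standardized) data and drop
# noise points, which DBSCAN labels -1.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def remove_dbscan_outliers(X, eps=0.5, min_samples=5):
    X_scaled = StandardScaler().fit_transform(X)  # DBSCAN is scale-sensitive
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    return X[labels != -1]  # keep only points assigned to a cluster
```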
3.1.5 Comparison chart
Having applied all four outlier removal techniques on top of the hybrid feature selection, we now compare the results in terms of accuracy; the comparison is shown in Table 5. Pairing hybrid feature selection with an outlier removal technique such as DBSCAN, IQR, Z-score, or Isolation Forest improves accuracy by approximately 2% relative to feature selection alone. For example, accuracy on the cm1 dataset is 92.3% with the hybrid technique only, and 93.4%, 93.5%, and 94.3% with hybrid + Z-score, hybrid + Isolation Forest, and hybrid + IQR respectively. With hybrid + DBSCAN, accuracy improves on the jm1, mw1, kc1, pc1, pc3, and pc4 datasets, while execution time is reduced on all datasets. This shows that a pre-processing approach that combines feature selection with outlier removal increases accuracy and reduces execution time when predicting software defects; it can also reduce bias in the dataset and improve fairness.
Table 5 Comparison of accuracy of all datasets using outlier removal techniques.
Dataset | Accuracy (Hybrid Feature Selection) (%) | Accuracy (Hybrid+DBSCAN) (%) | Accuracy (Hybrid+Z-score) (%) | Accuracy (Hybrid+Isolation Forest) (%) | Accuracy (Hybrid+IQR) (%)
cm1 | 92.3 | 91.5 | 93.4 | 93.5 | 94.3
jm1 | 94.2 | 95.3 | 94.3 | 94.7 | 95.9
mw1 | 91.2 | 93.8 | 92.1 | 92.9 | 93.2
kc1 | 92.8 | 93.1 | 93.3 | 93.9 | 93.5
kc2 | 95.3 | 94.2 | 96.5 | 96.1 | 96.5
pc1 | 92.9 | 93.2 | 93.4 | 93.4 | 93.5
pc3 | 91.6 | 93.6 | 92.5 | 92.3 | 93.9
pc4 | 94.3 | 94.8 | 94.6 | 94.7 | 95.7