The proposed framework for the defect prediction system is shown in Figure 1. It is divided into three levels: datasets, data pre-processing, and model training and testing. At the first level (datasets), eight datasets from the PROMISE Software Engineering repository are used for experimentation. At the second level (data pre-processing), feature selection techniques are applied; these fall into filter, wrapper, and hybrid methods. After applying these techniques to the datasets, we found that the hybrid method, a combination of sequential forward selection (SFS) and sequential backward selection (SBS), outperforms the filter and wrapper methods in terms of accuracy and execution time; we published these results at the international conference AICES 2022, organized at NIT Raipur on 12 February 2022. In this work we refine the dataset further to improve accuracy by wrapping the hybrid feature selection with outlier removal techniques.
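For readers unfamiliar with the hybrid step, the sketch below shows one way an SFS + SBS hybrid can be realised with scikit-learn. It is a minimal illustration only: the combination rule (here, intersecting the forward- and backward-selected subsets), the k-nearest-neighbours estimator, and the subset size are assumptions, not the exact configuration of our earlier work.

```python
# Illustrative hybrid feature selection: run sequential forward selection
# (SFS) and sequential backward selection (SBS), then keep the features
# chosen by both passes. The combination rule and estimator are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data with 21 features, as in several PROMISE datasets.
X, y = make_classification(n_samples=500, n_features=21, random_state=0)

estimator = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(estimator, n_features_to_select=8,
                                direction="forward").fit(X, y)
sbs = SequentialFeatureSelector(estimator, n_features_to_select=8,
                                direction="backward").fit(X, y)

hybrid_mask = sfs.get_support() & sbs.get_support()  # features kept by both
X_hybrid = X[:, hybrid_mask]
print("features kept:", np.flatnonzero(hybrid_mask))
```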
3.1 Data preprocessing
In today's machine-learning practice the dataset matters at least as much as the code, so effort spent improving the dataset translates directly into higher accuracy. The key purpose of this research is therefore to add an outlier removal step to the hybrid feature selection, which further trims redundant data and can yield higher accuracy and lower execution time when predicting defects in module-based software. Outliers are abnormal values in the dataset that can distort statistical analysis. Four outlier removal techniques are used: IQR (interquartile range), Z-score, Isolation Forest, and DBSCAN (density-based spatial clustering of applications with noise). The IQR is the difference between the median of the lower half of the data and the median of the upper half; it measures spread, i.e., how far apart the data points are. The Z-score measures how far a data point lies from the mean, in units of standard deviation. Isolation Forest looks for short paths in randomly built trees, since points isolated by short paths are outliers. DBSCAN is a clustering algorithm that finds clusters of varying size; points that fall into no cluster are treated as outliers. In the next section we apply the IQR technique to the dataset produced by hybrid feature selection.
3.1.1 Hybrid + IQR
We now apply the IQR technique to the dataset obtained after hybrid feature selection. IQR divides the dataset into quartiles, each covering 25% of the data points. We find the first quartile, Q1, and the third quartile, Q3, and compute the IQR as Q3 − Q1. The data range is then bounded by an upper limit of Q3 + 1.5 × IQR and a lower limit of Q1 − 1.5 × IQR. Data points inside this range are considered inliers; points outside it are outliers. Table 1 lists, for each dataset, the total number of rows, the accuracy of hybrid feature selection alone, the number of rows left after applying hybrid feature selection plus IQR outlier removal, the accuracy of hybrid + IQR, the percentage reduction in data points, and the execution time (ET). Using hybrid + IQR increases accuracy and reduces execution time compared with hybrid feature selection alone. For example, on the jm1 dataset the accuracy is 94.2% after hybrid feature selection and 95.9% after hybrid + IQR, while the execution time falls from 15.6 s before hybrid + IQR to 11.9 s after.
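A minimal sketch of this rule in Python, using pandas, is shown below. Applying the 1.5 × IQR fences to every feature column at once, and dropping a row if any of its values falls outside them, is an assumption about row-level handling that the text above leaves open.

```python
# IQR outlier removal: drop rows with any feature outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    q1 = df.quantile(0.25)   # first quartile of each feature
    q3 = df.quantile(0.75)   # third quartile of each feature
    iqr = q3 - q1            # interquartile range per feature
    lower = q1 - k * iqr     # lower fence
    upper = q3 + k * iqr     # upper fence
    # Keep a row only if all of its values lie inside the fences.
    mask = ((df >= lower) & (df <= upper)).all(axis=1)
    return df[mask]
```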
Table 1 Hybrid+IQR.
Dataset | Total Data Points (rows) | Accuracy (Hybrid) (%) | Rows after Hybrid+IQR | Accuracy (Hybrid+IQR) (%) | Reduction in Data Points (%) | ET (sec)
cm1 | 498 | 92.3 | 312 | 94.3 | 37.3 | 5.1
jm1 | 10885 | 94.2 | 5022 | 95.9 | 53.8 | 11.9
mw1 | 253 | 91.2 | 108 | 93.2 | 57.3 | 3.5
kc1 | 2109 | 92.8 | 1223 | 93.5 | 42.0 | 7.2
kc2 | 522 | 95.3 | 324 | 96.5 | 37.9 | 5.7
pc1 | 745 | 92.9 | 409 | 93.5 | 45.1 | 6.1
pc3 | 1077 | 91.6 | 654 | 93.9 | 39.2 | 6.4
pc4 | 1287 | 94.3 | 753 | 95.7 | 41.4 | 6.6
Table 2 Hybrid+Z-score.
Dataset | Total Data Points (rows) | Accuracy (Hybrid) (%) | Rows after Hybrid+Z-score | Accuracy (Hybrid+Z-score) (%) | Reduction in Data Points (%) | ET (sec)
cm1 | 498 | 92.3 | 382 | 93.4 | 23.2 | 4.5
jm1 | 10885 | 94.2 | 6087 | 94.3 | 44.0 | 10.2
mw1 | 253 | 91.2 | 146 | 92.1 | 42.2 | 2.9
kc1 | 2109 | 92.8 | 1464 | 93.3 | 30.5 | 6.6
kc2 | 522 | 95.3 | 382 | 96.5 | 26.8 | 5.7
pc1 | 745 | 92.9 | 513 | 93.4 | 31.1 | 5.8
pc3 | 1077 | 91.6 | 710 | 92.5 | 34.0 | 6.0
pc4 | 1287 | 94.3 | 822 | 94.6 | 36.1 | 6.1
3.1.2 Hybrid + Z-score
Similarly, after IQR we apply another outlier removal technique, the Z-score, to the output of feature selection; the results are shown in Table 2. The Z-score of a data point is obtained by subtracting the mean of the dataset from the point and dividing by the standard deviation. About 68% of the samples lie within one standard deviation of the mean, and a data point whose Z-score exceeds 3 in absolute value is considered an outlier. Table 2 shows that the accuracy of every dataset increases. For example, the accuracy of kc1 is 92.8% after hybrid feature selection and 93.3% after hybrid + Z-score. Execution time is likewise reduced: for kc1 it falls from 15.5 s to 6.6 s with hybrid + Z-score.
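A minimal sketch of this rule, assuming scipy and a policy of dropping a row if any of its features crosses the cutoff (the text above does not fix the row-level handling):

```python
# Z-score outlier removal: drop rows with any feature whose
# standardized value exceeds the cutoff (|z| > 3 by default).
import numpy as np
from scipy import stats

def remove_zscore_outliers(X: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    z = np.abs(stats.zscore(X, axis=0))  # (x - mean) / std, per feature
    mask = (z < threshold).all(axis=1)   # inlier: every feature within cutoff
    return X[mask]
```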
3.1.3 Hybrid + Isolation Forest
After the Z-score we applied another outlier removal technique, Isolation Forest, to the output of feature selection; the results are shown in Table 3. Like random forest, Isolation Forest is built from decision trees. It selects a feature from the feature set and picks a random split value between the minimum and maximum of that feature. Anomalous samples are easier to isolate, so they end up on shorter paths in the trees, and samples found on shorter paths are flagged as outliers. Table 3 shows that with hybrid + Isolation Forest the execution time of every dataset decreases and the accuracy of every dataset increases. For example, the accuracy of kc2 and pc1 is 95.3% and 92.9% before applying hybrid + Isolation Forest and 96.1% and 93.4% respectively after. The execution time is also reduced from 15.7 s to 5.7 s on the pc1 dataset.
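A minimal sketch using scikit-learn's IsolationForest is given below; the contamination level (the expected share of outliers) is an assumption that would need tuning per dataset.

```python
# Isolation-forest outlier removal: fit_predict labels inliers +1
# and outliers -1; keep only the inliers.
from sklearn.ensemble import IsolationForest

def remove_iforest_outliers(X, contamination=0.1, random_state=0):
    iso = IsolationForest(contamination=contamination,
                          random_state=random_state)
    labels = iso.fit_predict(X)  # +1 for inliers, -1 for outliers
    return X[labels == 1]
```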
Table 3 Hybrid+Isolation Forest.
Dataset | Total Data Points (rows) | Accuracy (Hybrid) (%) | Rows after Hybrid+Isolation Forest | Accuracy (Hybrid+Isolation Forest) (%) | Reduction in Data Points (%) | ET (sec)
cm1 | 498 | 92.3 | 404 | 93.5 | 18.8 | 4.2
jm1 | 10885 | 94.2 | 6523 | 94.7 | 40.0 | 10.5
mw1 | 253 | 91.2 | 189 | 92.9 | 25.2 | 2.6
kc1 | 2109 | 92.8 | 1534 | 93.9 | 27.2 | 6.2
kc2 | 522 | 95.3 | 412 | 96.1 | 21.0 | 5.4
pc1 | 745 | 92.9 | 551 | 93.4 | 26.0 | 5.7
pc3 | 1077 | 91.6 | 745 | 92.3 | 30.8 | 5.8
pc4 | 1287 | 94.3 | 852 | 94.7 | 33.7 | 6.0
Table 4 Hybrid+DBSCAN.
Dataset | Total Data Points (rows) | Accuracy (Hybrid) (%) | Rows after Hybrid+DBSCAN | Accuracy (Hybrid+DBSCAN) (%) | Reduction in Data Points (%) | ET (sec)
cm1 | 498 | 92.3 | 352 | 91.5 | 29.3 | 4.3
jm1 | 10885 | 94.2 | 6024 | 95.3 | 44.6 | 10.3
mw1 | 253 | 91.2 | 200 | 93.8 | 20.9 | 3.2
kc1 | 2109 | 92.8 | 1765 | 93.1 | 16.3 | 7.4
kc2 | 522 | 95.3 | 413 | 94.2 | 20.8 | 5.6
pc1 | 745 | 92.9 | 553 | 93.2 | 25.7 | 6.5
pc3 | 1077 | 91.6 | 790 | 93.6 | 26.6 | 6.1
pc4 | 1287 | 94.3 | 912 | 94.8 | 29.1 | 6.5
|
3.1.4 Hybrid + DBSCAN
DBSCAN clusters the data points based on two parameters: epsilon and the minimum number of samples. Epsilon is the radius of the neighbourhood drawn around each point to check its density, and the minimum-samples parameter is the number of points that must fall inside that radius for a point to count as a core point. Points that end up in no cluster on this basis are treated as outliers. The results for all datasets are shown in Table 4. The accuracy of jm1, mw1, kc1, pc1, pc3, and pc4 increases; for example, jm1, mw1, and pc1 rise from 94.2%, 91.2%, and 92.9% to 95.3%, 93.8%, and 93.2% respectively. The execution time of every dataset is reduced; for example, the execution time of pc1 was 14.5 s before applying hybrid + DBSCAN and 6.5 s after.
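A minimal sketch using scikit-learn's DBSCAN is shown below; points labelled -1 (noise, i.e. assigned to no cluster) are dropped. The eps and min_samples values and the standardisation step are assumptions that would need tuning per dataset.

```python
# DBSCAN outlier removal: cluster the (standardized) data and drop
# noise points, which DBSCAN labels -1.
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def remove_dbscan_outliers(X, eps=0.5, min_samples=5):
    X_scaled = StandardScaler().fit_transform(X)  # DBSCAN is scale-sensitive
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    return X[labels != -1]  # keep only points assigned to a cluster
```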
3.1.5 Comparison chart
Having applied all four outlier removal techniques on top of the hybrid feature selection, we now compare the results in terms of accuracy; the comparison is shown in Table 5. Pairing hybrid feature selection with an outlier removal technique such as DBSCAN, IQR, Z-score, or Isolation Forest improves accuracy by approximately 2% relative to feature selection alone. For example, accuracy on the cm1 dataset is 92.3% with the hybrid technique only, and 93.4%, 93.5%, and 94.3% with hybrid + Z-score, hybrid + Isolation Forest, and hybrid + IQR respectively. With hybrid + DBSCAN, accuracy improves on the jm1, mw1, kc1, pc1, pc3, and pc4 datasets, while execution time is reduced on all datasets. This shows that a pre-processing approach that combines feature selection with outlier removal increases accuracy and reduces execution time when predicting software defects; it can also reduce bias in the dataset and improve fairness.
Table 5 Comparison of accuracy of all datasets using outlier removal techniques.
Dataset | Accuracy (Hybrid Feature Selection) (%) | Accuracy (Hybrid+DBSCAN) (%) | Accuracy (Hybrid+Z-score) (%) | Accuracy (Hybrid+Isolation Forest) (%) | Accuracy (Hybrid+IQR) (%)
cm1 | 92.3 | 91.5 | 93.4 | 93.5 | 94.3
jm1 | 94.2 | 95.3 | 94.3 | 94.7 | 95.9
mw1 | 91.2 | 93.8 | 92.1 | 92.9 | 93.2
kc1 | 92.8 | 93.1 | 93.3 | 93.9 | 93.5
kc2 | 95.3 | 94.2 | 96.5 | 96.1 | 96.5
pc1 | 92.9 | 93.2 | 93.4 | 93.4 | 93.5
pc3 | 91.6 | 93.6 | 92.5 | 92.3 | 93.9
pc4 | 94.3 | 94.8 | 94.6 | 94.7 | 95.7