An empirical study toward dealing with noise and class imbalance issues in software defect prediction

The quality of defect datasets is a critical issue in the domain of software defect prediction (SDP). These datasets are obtained by mining software repositories. Recent studies have raised concerns about the quality of defect datasets, owing to inconsistencies between bug/clean fix keywords in fault reports and the corresponding links in change-management logs. The class imbalance (CI) problem is another major challenge for SDP models. A defect prediction method trained on noisy and imbalanced data produces inconsistent and unsatisfactory results, so a combined analysis of noisy instances and the CI problem is needed. To the best of our knowledge, insufficient studies have examined these aspects together. In this paper, we study the impact of noise and the CI problem on five baseline SDP models: we manually inject noise at various levels (0-80%) and measure its impact on the performance of those models. We further provide guidelines on the range of noise each baseline model can tolerate, and we suggest the SDP model with the highest noise tolerance, which outperforms the other classical methods. The True Positive Rate (TPR) and False Positive Rate (FPR) values of the baseline models fall by 20-30% after adding 10-40% noisy instances; similarly, the ROC (Receiver Operating Characteristic) values of the SDP models fall by 40-50%. The suggested model tolerates 40-60% more noise than the other traditional models.


Introduction
Software defect prediction (SDP) (Hall et al. 2011; Pandey et al. 2020a, 2021; Malhotra 2015) attempts to identify the modules of a software project that are most likely to be fault-prone by utilizing software metrics (Radjenović et al. 2013; Shihab 2012; Schneider et al. 1992). It is advisable to test fault-prone modules carefully and deliberately rather than treating all modules in the same manner. SDP models make use of bug reports from earlier versions of the software that indicate faulty and non-faulty modules, and the modules' metric data (Offutt 1992) is used to train the prediction models. SDP models may also use the change logs of software configuration management documentation, since a change log records the modules that were changed when detected faults were corrected. SDP models that use well-known traditional classifiers to predict buggy or clean modules are called classical (or traditional) SDP models; all the classifiers used in our experiments fall into this category. These models and their variants are widely applied in the SDP domain and therefore also serve as baseline methods in defect prediction; we use five such baseline methods, discussed in Sect. 4.3.2. The links between logs and bug reports may be inconsistent for several reasons (Linberg 1999) and may result in mislabeled data. It is therefore quite likely that an SDP model is trained on noisy data and produces erroneous results. When the cardinality of one class is much smaller than that of the other, the dataset is said to be imbalanced. Seiffert et al. (2014) reported a combined study of the noise and class imbalance (CI) problems in software quality; they conducted experiments with eleven classification techniques and seven sampling methods over public datasets.
They concluded that a few classifier-sampling combinations are the most robust on noisy and imbalanced datasets. To the best of our knowledge, the combined interaction between CI and noisy instances has received limited research attention. The limitations of studying CI and noisy instances in isolation are as follows: (a) the concurrent impact of both challenges on defect prediction models cannot easily be determined, even though both degrade the performance of SDP techniques; (b) dealing with noisy instances alone only helps in suggesting the percentage of noise a model can tolerate, while studying the CI problem alone only helps in recommending the rate of imbalance a predictive model can digest; (c) a common approach that conquers both challenges together cannot be proposed; and (d) the trade-off between the percentage of noisy instances and the CI rate cannot be explored.
Based on our empirical study, we list a few compelling motivational questions about the need for a combined analysis of noise and the CI problem in software defect prediction; they are shown below.
(i) Many software practitioners and researchers have questioned the quality of defect prediction datasets; in particular, Shepperd et al. (2013) showed that many instances in the NASA data repository are noisy. A joint analysis of the CI problem and noise can help to study the relative impact of classifier and sampling-method combinations on noisy instances, as only limited studies have examined this interaction. (ii) How do classifiers interact with sampling methods? Do certain sampling methods outperform others when used with specific classification algorithms on noisy datasets? (iii) A combined study of sampling methods and classifiers, and an analysis of their performance across various SDP models at different noise levels, is still unexplored.
Researchers have suggested SDP models that deal with either the CI problem or noisy instances, but not both; we propose an SDP model that addresses both issues. In addition, we analyze the noise tolerance of existing SDP models, i.e., the amount of noise that can be added before the performance of a model changes. To structure the study, we frame five research queries (RQs) based on the evaluation metrics; these RQs guide the development of the model and justify the observations of our empirical study. The list of RQs is as follows.

RQ-1 What are the effects of noise on the True Positive Rate (TPR) and False Positive Rate (FPR) of classical SDP models?
RQ-2 To what extent is the suggested model resistant to various levels of noise compared with the other classical SDP models?
RQ-3 What is the range of tolerable noise in the baseline defect prediction models?
RQ-4 How does the class imbalance problem affect the performance of various SDP models at different noise levels?
RQ-5 How does the proposed approach perform compared with the other classical SDP models when no sampling method is applied?
These five RQs explore the circumstances under which classical SDP models work on noisy data. Noise tolerance in defect prediction still offers scope for research. We conducted noise-handling experiments similar to those of Kim et al. (2011); in addition, we also deal with CI issues. Tantithamthavorn et al. (2015) also explored the challenges of mislabeled data, which lead to inconsistent results. Cabral et al. (2019) suggested JIT-SDP, an approach that makes defect predictions at the software-change level under the assumption that the characteristics of the problem remain constant over time. This article makes the following contributions.
(i) A combined empirical study of the noise and CI problems, evaluating the impact of these two problems on the performance of baseline methods. (ii) An analysis of the tolerance of baseline methods to various levels of noise and class imbalance. (iii) A suggested SDP model that tolerates the highest noise level and circumvents the class imbalance issue; the suggested approach is essentially a change buggy prediction model. These are the two most prominent challenges for any SDP technique. We tested the significance of the suggested approach using TPR, FPR, F-measure, Precision, and ROC, compared with the other traditional SDP models. (iv) 864 experiments over 3 public datasets using 5 classifiers and 1 sampling method. We also provide guidelines on noise tolerance levels and the CI issue in baseline SDP models, which will assist in building better SDP models in the future.
The rest of the paper is organized as follows. Sect. 2 discusses related work, followed by background details in Sect. 3. Sect. 4 illustrates the experimental procedure and the suggested approach. Sect. 5 analyzes and discusses the results of our experiments, and Sect. 6 addresses threats to validity. Finally, Sect. 7 presents the conclusions drawn from the article.

Related work
A given software module consists of source code and other software metrics, and SDP classifies the module as either clean or buggy. A few existing SDP methods are based on SVM (Elish and Elish 2008; Hou and Li 2009), Naive Bayes (Ji et al. 2019; Pandey et al. 2018), Random Forest (Zhou et al. 2019; Khanh et al. 2018), AdaBoost (Hu et al. 2008; Peng et al. 2011), and J48 (Queiroz et al. 2016). Recently, several other ensemble learning (Peng et al. 2011; Yang et al. 2017; Laradji et al. 2015) and deep learning-based (Wang et al. 2016; Li et al. 2017; Tripathi 2020) defect prediction architectures have been reported. Wei et al. (2019) suggested an interesting approach that reduces the dimensionality of the software metrics and uses a tangent-based SVM. Several unsupervised and semi-supervised machine learning methods have also been applied in the SDP domain. Abaei et al. (2015) proposed a semi-supervised approach using a hybrid self-organizing map (HySOM); the model can predict defect-prone modules in an unsupervised manner. They performed experiments using the NASA dataset and found improvements over existing methods. Catal (2014) performed a comparative analysis of the performance of various semi-supervised methods. A semi-supervised method was proposed by Lu et al. (2012), who found it significantly better than random forest. A metric-driven software quality prediction model was proposed by Catal et al. (2010); their method can be used when defect labels are absent and does not require the number of clusters to be known before the clustering phase. They found that it significantly outperforms existing methods. Li et al. (2012) proposed an SDP model called ACoForest, which addresses the inadequate availability of historical datasets.
They used the PROMISE dataset for experiments and found optimal results compared with other state-of-the-art methods. Similar work was performed by Seliya and Khoshgoftaar (2007), who investigated the Expectation-Maximization (EM) algorithm for software quality prediction; using the NASA dataset, they found that the EM-based prediction model improves generalization performance. SDP techniques suffer from two main challenges, and we have categorized the related work accordingly: the first category concerns the quality of the data, and the second concerns the class imbalance issue.

Quality of defect dataset
In a few studies (Malhotra 2015; Ramler and Himmelbauer 2013), researchers have claimed that errors exist in large datasets; for example, the rate of field errors is around 5% (Maletic and Marcus 2000; Wu 1995). Zhu and Wu (2004) tried to handle noisy data and to overcome the over-fitting problem (Liu and Khoshgoftaar 2004). Shepperd et al. (2013) also raised questions about the quality of the data in the NASA repository, yet several powerful machine learning-based defect prediction models remain available (Pandey et al. 2018; Zheng 2010; Li et al. 2019). There are two different types of noise in a defect dataset, class noise and feature noise, and both (Shanthini et al. 2019) affect the performance of machine learning algorithms; however, we consider only class noise in this article. Class noise is an interchange of the class label, from clean to buggy, from buggy to clean, or both, due to any cause, and it leads to inconsistent results. Bachmann et al. (2010) concluded that some defects are not recorded in the commit logs of a dataset and hence are also invisible to automated linking tools. Bird et al. (2009) found that more experienced developers are more likely to create direct links between issue reports and code changes. Rahman and Devanbu (2013) investigated the influence on SDP models of artificially generated defect datasets. Catal et al. (2011a) conducted a study on class noise detection and proposed a detection algorithm based on software feature threshold values. Riaz et al. (2018) proposed a two-stage data preprocessing method incorporating feature selection and noise filtering; they employed K-nearest neighbor and ensemble learning in their approach.
Alan and Catal (2011) proposed an outlier detection approach using metric thresholds and class labels; they employed NASA datasets to identify class outliers and found that the proposed model outperforms the baseline methods.

Class imbalance problem
The class imbalance issue may bias results toward the majority (negative) class (Ali et al. 2015). A comprehensive study of class imbalance in SDP was recently conducted by Song et al. (2018). A few studies (Bennin et al. 2019; Chen et al. 2018) compared results produced from imbalanced and balanced class labels, and some researchers have proposed solutions that improve the accuracy of SDP. Researchers (Sharma et al. 2018; Limsettho et al. 2018; Tong et al. 2018) have proposed random subsampling (Clarkson and Shor 1989), SMOTE (Chawla et al. 2002), class balancer (Laurikkala 2001), and spread subsampling (Tan et al. 2015) techniques, which help avoid the class imbalance issue and provide unbiased results. Joon et al. (2020) performed a combined study of class imbalance, feature selection, and a simple noise-removal strategy on public datasets, using precision, recall, F-score, ROC, and accuracy as performance measures.

Background
The general framework of a software defect prediction model is shown in Fig. 1. A software repository consists of two segments (Fan et al. 2020): a version control system (VCS) and an issue tracking system (ITS), as shown in Fig. 1. Software practitioners usually use both, because version control systems (source code repositories) cannot store bugs. As the figure shows, instances are generated from the software repository. These instances are made up of software metrics; data cleaning and other preprocessing are required to build the training set. The training set is then fed into the model, which learns to classify modules as buggy or clean. We provide a detailed discussion of preprocessing and model training in later sections. Before designing any prediction model, we need to define the prediction target, i.e., the class label. Software modules consist of software entities such as files (Moser et al. 2008), components (Nagappan et al. 2006), or changes (Kim et al. 2006); an SDP model is intended to predict a software module as either buggy/defective or clean/non-defective. There are two types of defect prediction models, buggy file prediction and change buggy prediction, described below.

Buggy files prediction
Identifying buggy files in advance helps the development team leader allocate resources properly and optimally, and it minimizes the testing effort. Internal properties of a software system, such as software metrics, are associated with external properties such as the fault-proneness of a module. This kind of SDP model mainly uses software features expressed in a defect dataset: the classification model learns from historical data and predicts the fault-prone modules in test data. Many software features are used in such SDP models, including resource metrics (Ram et al. 2010), process metrics (Kan 2002), and cyclomatic complexity metrics (Zhang et al. 2007).

Change buggy prediction
When new changes are introduced into a software module, change buggy prediction predicts whether the changed module is buggy or not, learning from change classification (CC). Suppose a module consists of n files and a new file is added, so that n + 1 files are present; this (n + 1)-th file may make the software faulty. Change classification mainly involves two source code revisions, an old revision and a new revision. Each change to the files is associated with metadata, which includes the author, change log, date of commit, etc. Mining the change history yields the co-change count, which indicates for how many file changes the system remains clean or becomes buggy. Kim et al. (2008) illustrated this process in their article.
Building either type of SDP model defined above requires class labels (buggy/clean) and various features. Fitting a model with mislabeled data may produce incorrect results. In this direction, we propose a change buggy prediction model using public data. In the next section, we discuss its experimental details and performance measures, and build a useful SDP model that can tolerate noisy instances up to a certain extent.

Experimental procedure and suggested approach
In this section, we illustrate the experimental details: the dataset description, the noise-addition procedure, the performance metrics, the preprocessing and classification techniques, and the suggested approach.

Experimental setup
This section discusses every aspect of the experimental setup. We performed 864 experiments over three public datasets: Scarab, Columba, and Eclipse. The experiments were run on a machine with 8 GB RAM and a 1 TB hard drive under the Windows 10 operating system. The Python libraries used were NumPy, SciPy, scikit-learn, Keras, Pandas, and Matplotlib. All 864 experiments were re-verified using the WEKA (Waikato Environment for Knowledge Analysis) tool (Witten and Frank 2000).

Dataset description
The three public datasets used in our experiments for the buggy change prediction model are Columba, Scarab, and Eclipse, detailed in Table 1. These are classical datasets with substantial numbers of training instances compared with other open-source datasets, which leads to satisfactory results. Kim et al. (2011) used these datasets and also conducted similar experiments by adding noise manually, so we are extending their experiments.

Noise added in dataset
We assume that the datasets we use are pure, i.e., contain no noisy instances; Kim et al. (2011) made a similar assumption for these three datasets. We therefore injected some percentage of noisy labels and then trained the various SDP models. Interchanging class labels from buggy to clean and from clean to buggy introduces class noise into the defect dataset. We then evaluated the performance of the various SDP models at different noise levels, adding noise to the defect data from 0 to 80%: 0% means no noise is added, whereas 10% means that 10% of the instances are selected and their target labels interchanged.
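The label-interchange procedure can be sketched as follows; this is a minimal illustration (the function name and the 0 = clean / 1 = buggy encoding are our own, not from the original experiments):

```python
import numpy as np

def inject_class_noise(labels, noise_rate, seed=0):
    """Flip the class label (buggy <-> clean) for a random
    noise_rate fraction of the instances."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n_noisy = int(round(noise_rate * len(labels)))
    idx = rng.choice(len(labels), size=n_noisy, replace=False)
    labels[idx] = 1 - labels[idx]  # 0 = clean, 1 = buggy
    return labels

clean = np.zeros(100, dtype=int)          # a "pure" label vector
noisy = inject_class_noise(clean, 0.10)   # 10% class noise
print(noisy.sum())                        # 10 labels flipped
```

At 0% noise the function returns the labels unchanged, matching the paper's "pure set" baseline.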

Performance measure
To evaluate the performance of the five SDP models, we used five performance metrics: True Positive Rate (TPR), False Positive Rate (FPR), Precision, F-measure, and the area under the ROC (receiver operating characteristic) curve. Brief details of these evaluation metrics are given below.
F-measure: It is the harmonic mean of precision and recall (TPR). Its value lies between 0 and 1, where 0 implies the worst result and 1 the best. It is also known as F-score or F1 score.
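For concreteness, the four threshold-based measures can be computed from a confusion matrix as follows (a minimal sketch with an illustrative toy prediction vector; 1 denotes buggy):

```python
def binary_metrics(y_true, y_pred):
    """TPR, FPR, precision and F-measure from binary labels (1 = buggy)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn)                  # recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)  # harmonic mean
    return tpr, fpr, precision, f_measure

tpr, fpr, precision, f1 = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                         [1, 1, 0, 0, 0, 1, 0, 1])
print(tpr, fpr, precision, f1)  # 0.75 0.25 0.75 0.75
```

The area under the ROC curve is computed from the (FPR, TPR) pairs obtained by sweeping the classifier's decision threshold.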

SDP models for experiments
We added noise at levels from 0 to 80% and then trained the various models, analyzing the performance of the defect prediction models at each noise level. The other major challenge in defect datasets is that a skewed distribution of one class causes the class imbalance problem; to deal with it, researchers have applied sampling methods that establish a balance between the positive and negative classes.
In the next sections, we explain the preprocessing methods and then various baseline models.

Preprocessing techniques
In the preprocessing stage, after data cleaning, feature selection and sampling are the major steps; they are described below. As Table 1 reports, the datasets have skewed distributions and suffer from the class imbalance challenge. To avoid the CI problem, we applied random undersampling (Kotsiantis and Pintelas 2003).

Sampling technique

We also tried other sampling methods, e.g., class balancer (Chawla 2009), the synthetic minority oversampling technique (Chawla et al. 2002), and the spread subsample technique (Davies 1995), but obtained the best results with the random undersampling technique. We adopted the algorithm of Charte et al. (2015), which deletes random samples of the majority class label (SetLabels); its full description is given in Charte et al. (2015). The main objective of this algorithm is to achieve a uniform distribution of buggy and non-buggy class labels.

Feature selection

As Table 1 reports, each dataset has a large number of features, so we need to select relevant features for better analysis. We used information gain (Lee and Lee 2006) as the feature selection method and the Ranker method as the search technique (Chandrashekar and Sahin 2014; Roobaert et al. 2006) for feature ranking. Information gain is an entropy-based measure of how much information a selected attribute provides for categorization, i.e., how important an attribute is for the classification problem. The Ranker method is used in conjunction with an attribute evaluator (entropy, gain ratio, etc.) and has three parameters, P, T, and N: P (start state) specifies the starting set of attributes, which are ignored during ranking; T (threshold) specifies the threshold below which features are ignored; and N (number of selections) specifies the number of attributes selected.
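The two preprocessing steps can be sketched as follows; this is a minimal illustration of random undersampling and information gain (function names and toy data are our own; the actual experiments used the algorithm of Charte et al. (2015) and WEKA's Ranker search):

```python
import random
from math import log2
from collections import Counter

def random_undersample(X, y, seed=1):
    """Randomly delete majority-class samples until buggy/clean counts match."""
    random.seed(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    major, minor = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    kept = sorted(random.sample(major, len(minor)) + minor)
    return [X[i] for i in kept], [y[i] for i in kept]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain of one (discretised) feature w.r.t. the class:
    IG(f) = H(y) - sum_v p(f = v) * H(y | f = v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

y = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # 2 buggy, 8 clean: imbalanced
X = [[i % 2] for i in range(10)]
Xb, yb = random_undersample(X, y)
print(sum(yb), len(yb) - sum(yb))    # balanced: 2 buggy, 2 clean
```

A perfectly predictive feature has an information gain equal to the class entropy, which is why the Ranker method places such features at the top.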

Classification techniques
We used five different classification techniques, applying the same preprocessing methods (discussed in Sect. 4.3.1) to each. These classifiers build five different SDP models, listed below.
(i) Naive Bayes (NB) A probabilistic classifier (Rish et al. 2001; Huang and Li 2011) derived from Bayes' theorem. It is a family of algorithms that share a common principle: every pair of features is treated as independent, and each feature makes an equal and independent contribution to the outcome (Catal et al. 2011b). We used batch size = 100 and set "doNotcheckCapabilities" and "kernelEstimator" to "False". (ii) Least Squares Support Vector Machine (LSSVM) A supervised learning algorithm (Suykens and Vandewalle 1999) that can be used for both classification and regression. It is a binary classifier that creates an n-dimensional hyperplane to separate the instances (Elish and Elish 2008). We used a radial basis function kernel in our experiments, with batch size = 100, cache size = 40, cost = 1, degree = 3, loss = 0.1, nu = 0.5, and seed = 1. (iii) J48 A variant (Bhargava et al. 2013) of the C4.5 algorithm, a decision tree-based classifier used to create univariate decision trees (UDTs). The leaf nodes decide which category an instance belongs to; the algorithm calculates the information gain of each attribute and selects the attribute with maximum information gain. We used batch size = 100, set "binarySplits" to "False", "collapseTree" to "True", number of folds = 3, seed = 1, "unpruned" = "False", and "useLaplace" = "False". (iv) AdaBoost Short for Adaptive Boosting (Zheng 2010; Hu et al. 2008), an ensemble learning technique that combines different weak learners into one model and aggregates their results, making the classifier more powerful; as an ensemble technique, it mitigates the over-fitting problem. We used "Decision stump" as the weak classifier, "weightThreshold" = 100, seed = 1, and "doNotcheckCapabilities" = "False". (v) Random Forest (RF) The classifier applied in our proposed model, also an ensemble learning technique (Liaw and Wiener 2002). Algorithm 1 shows the pseudocode of the RF learning algorithm, taken from Liaw and Wiener (2002), where a complete discussion of RF can be found. The function "RandomizedTreeLearn" in the RF algorithm returns the learned tree. RF is a decision tree-based learning algorithm and one of the most robust SDP models (Kaur and Malhotra 2008; Catal and Diri 2009). We used batch size = 100, "breakTiesRandomly" = "False", "computeAttributeImportance" = "False", number of slots = 1, and seed = 1.
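Under the assumption that scikit-learn analogues of the WEKA classifiers are acceptable, the five baseline models can be sketched as follows (the synthetic dataset and default hyperparameters are illustrative, not the paper's exact WEKA settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# illustrative synthetic defect data (1 = buggy, 0 = clean)
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "NB": GaussianNB(),
    "SVM": SVC(kernel="rbf", C=1.0),             # RBF kernel, as in the paper
    "J48": DecisionTreeClassifier(random_state=1),  # C4.5-style decision tree
    "AdaBoost": AdaBoostClassifier(random_state=1), # decision-stump weak learners
    "RF": RandomForestClassifier(random_state=1),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

The same fit/score loop can be repeated at each injected noise level to reproduce the shape of the experiments.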
Algorithm 1: Pseudocode of Random Forest.

Initial condition: training set T = (x_1, y_1), (x_2, y_2), ..., (x_n, y_n); let the number of features be F and the number of trees be B.

for b = 1 to B do
    function RandomizedTreeLearn(T, F)
        at every node: f ← a small random subset of F
        split on the best feature in f
        return Learned Tree
    end function
end for

Before applying the learning techniques, we split each dataset into a training set (70%) and a testing set (30%); we also tried other split ratios but obtained the best results with the 70-30% ratio. We then applied tenfold cross-validation (Dor and Zhou 2007) on the training set for each classifier, which reduces the possibility of over-fitting (Moore 2001) in the classification model.
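The split-then-cross-validate protocol described above can be sketched with scikit-learn (a sketch assuming scikit-learn in place of WEKA; the dataset and seeds are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=200, random_state=1)
# 70% training / 30% testing, as in the experiments
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
# tenfold cross-validation on the training set to guard against over-fitting
cv_scores = cross_val_score(RandomForestClassifier(random_state=1),
                            X_tr, y_tr, cv=10)
print(len(cv_scores))  # one score per fold
```

The held-out 30% is touched only once, for the final evaluation, which is what keeps the cross-validated estimate honest.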

Suggested approach
We give the sequence of experimental procedures in Algorithm 2. The underlying architecture of the suggested approach is shown in Fig. 2, which reflects each phase of the suggested model. Noise is added by mislabeling class labels, as shown in Fig. 2; we injected various noise levels into each dataset and tested the noise level that the change buggy prediction model can endure. We applied information gain as the attribute selection method and the ranking method as the search method to rank the attributes and select the most relevant ones (see Sect. 4.3.1).
We utilized random undersampling as the sampling technique to address the class imbalance problem. We also tried a few other well-known sampling techniques (SMOTE, class balancer, spread subsampling, etc.), but the best results came from random undersampling.
After preprocessing, we split the dataset into a training set (70%) and a testing set (30%) (see Fig. 2) and then applied tenfold cross-validation on the training data; cross-validation (Dor and Zhou 2007) also reduces over-fitting (Moore 2001) and yields a better prediction model. To avoid random bias, each experiment was performed ten times and the mean value of each performance measure taken. The basic architectural view of the suggested approach is shown in Fig. 2, with Random Forest applied as the classifier. We performed the same preprocessing steps for the pure set, i.e., 0% noise, and compared the performances P1 and P2, as shown in Algorithm 2, where P1 is the performance at 10-80% noise levels and P2 the performance at the 0% noise level.
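The P1-versus-P2 comparison can be sketched as follows; this is a minimal illustration assuming scikit-learn and a synthetic dataset, and it omits the undersampling, feature selection, and ten-repetition averaging steps described above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=400, random_state=1)

def tpr_at_noise(noise_rate):
    """Train RF on labels with noise_rate of them flipped; return TPR."""
    y_noisy = y.copy()
    n = int(noise_rate * len(y))
    idx = rng.choice(len(y), size=n, replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]                      # flip class labels
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y_noisy, test_size=0.3, random_state=1)
    model = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
    return recall_score(y_te, model.predict(X_te))       # TPR

p2 = tpr_at_noise(0.0)                                    # baseline on pure set
p1 = {r: tpr_at_noise(r) for r in (0.1, 0.2, 0.4, 0.8)}   # noisy runs
print(p2, p1)
```

The gap between each P1 value and P2 indicates how much noise the model tolerates before its performance changes.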
In the next section, we discuss the performance of the various SDP models after adding noise at levels from 0 to 80%. In addition, we test the tolerable noise of the suggested architecture and of all five baseline methods without applying the sampling technique.

Results and analysis
We experimented with and without noisy instances in the datasets, adding noise at 10 to 80% in all three datasets. We also conducted experiments with and without the sampling method at various noise levels and evaluated the performances. In this section, we address all five research queries (see Sect. 1) and justify the conclusions corresponding to the experimental results. The justification of each RQ is given below.

5.1 What are the effects of noise on True Positive Rate (TPR) and False Positive Rate (FPR) over classical SDP models?

Table 2 and Fig. 3 present the TPR of all baseline models at the various noise levels, whereas Table 3 and Fig. 4 report their FPR. On Eclipse data, the RF-based model exceeds the performance of the other defect prediction models: on pure data its TPR value is 0.975, whereas the lowest TPR value produced by the proposed model is 0.844 at 60 and 80% noise. The worst performance on pure Eclipse data is produced by the NB-based model, at 0.854. AdaBoost was the most variable model as the noise level increased from 0 to 80%, its TPR values decreasing from 0.9 to 0.489.
The FPR values of all five defect prediction models are shown in Table 3. The lowest FPR value for pure Columba data (see Fig. 4b) is 0.209, produced by the RF-based model; the SVM-based technique has the highest FPR value on pure data, 0.583. The FPR value of the NB-based model increases up to 0.377 at 40% noise and then starts decreasing; similarly, the FPR of the J48 defect prediction model increases to 0.212 at the 30% noise level and then starts dropping. The suggested model is the most resistant, as its FPR values decrease as the noise level increases; AdaBoost has the worst resistance, because its FPR decreases up to 10% noise and then increases.
The SDP models for the Scarab dataset are more robust, as shown in Table 3. The RF-based defect prediction model has the lowest FPR value (see Fig. 4c), i.e., 0.11 at 0% noise; the worst performance is produced by the NB-based technique at the same noise level, with an FPR value of 0.273. The suggested model is the most tolerant, as its value stays close to 0.11 up to 40% noise; even the J48-based model has high noise tolerance, with an FPR value of 0.153 at 40% noise, close to its 0.132 at 0%. Figure 3b shows that the RF-based SDP model deviates least: its TPR value at 0% noise is 0.893 and at 80% is 0.877, which are close to each other, whereas the TPR values of the other models fluctuate. In Fig. 3a, the RF-based model has a TPR range from 0.975 (0% noise) to 0.844 (80% noise), although up to a 20% noise level its value is 0.923, which is close to 0.975. The proposed model has the highest tolerance on Scarab data, as shown in Fig. 3c and Table 2: at 0 and 70% noise its TPR values are 0.890 and 0.878, respectively, approximately equal, whereas no other method can tolerate this amount of noise, showing inconsistent results instead. The FPR and precision are productive metrics for measuring the efficiency of SDP models, as shown in Fig. 4. The FPR value for Columba data at 30% noise is 0.180, close to 0.209 at 0% noise, implying the model can tolerate noise up to 30% (Fig. 4b). The FPR values for Eclipse and Scarab data produced by the proposed model at 0% noise are 0.458 and 0.111, respectively, as shown in Table 3. The FPR curve for Eclipse data produced by the RF-based model has the least deviation (Fig. 4a); for Scarab data (Fig. 4c), the RF-based model again has the least deviation, whereas the NB-based model has the most deviated curve.
The FPR values for Scarab data produced by the proposed model at 0% and 40% noise are 0.11 and 0.121, respectively, which are close to each other, implying tolerable deviation up to 40% noise.
The precision values of the classical SDP models over various noise levels for all three datasets are shown in Table 5. As Fig. 5b shows, the precision of the RF-based model at 0% and 40% noise is 0.894 and 0.875, respectively; these two values are close to each other, which implies the model tolerates noise up to 40%. The precision value of Eclipse at 20% noise produced by the RF-based model is 0.923, which is close to 0.970 at 0%, indicating the model remains effectively unchanged up to 20% noise. The precision value of Scarab data at 80% noise is 0.883, which is approximately equal to 0.890. The deviation of precision values for the Eclipse and Scarab datasets can be better viewed in the curves of Fig. 5a, c, respectively. The J48-based model also has a high noise tolerance for Scarab data: its precision is 0.854 at the 50% noise level, which is close to 0.870 at 0% noise.
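The TPR, FPR, precision, and F-score values compared above are all derived from confusion-matrix counts. As a minimal sketch, the following shows how these metrics are computed; the counts are illustrative and not taken from the paper's tables.

```python
def metrics(tp, fp, fn, tn):
    """Return TPR (recall), FPR, precision, and F-score from raw counts."""
    tpr = tp / (tp + fn)        # true positive rate (recall)
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)
    f_score = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, precision, f_score

# Illustrative counts: 90 buggy modules found, 10 missed,
# 20 false alarms, 80 correctly predicted clean modules.
tpr, fpr, prec, f1 = metrics(tp=90, fp=20, fn=10, tn=80)
print(round(tpr, 3), round(fpr, 3), round(prec, 3), round(f1, 3))
# → 0.9 0.2 0.818 0.857
```

A model is "noise tolerant" in the sense used here when these values stay approximately constant as the injected noise level grows.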

What is the range of tolerable noise in baseline defect prediction models?
After performing the experiments, we can conclude that the range of tolerable noise differs for each of the five SDP models. Figures 3, 4, 5, 6 and 7 show the TPR, FPR, precision, F-score, and ROC, respectively, of the various defect prediction models under different noise conditions. We have analyzed each classifier. The tolerable noise range of the NB-based model is 20-30%, because its TPR values (see Table 2 and Fig. 3) and FPR values (Table 3 and Fig. 4) remain unchanged at the 30% noise level, indicating the model is stable and tolerant up to 30% noise. The precision (Fig. 5) and ROC (Fig. 7) of the NB-based SDP model fall for every dataset once more than 30% noise is added. The F-measure (Fig. 6) and TPR (Fig. 3) values fall continuously, but up to the 30% noise level the TPR and F-score values are approximately equal, indicating a performance breakdown point at 30% noise. The FPR values (Fig. 4) increase as the noise level rises; still, from 0 to 30% the FPR values are close to each other, indicating the models perform uniformly up to 30% noise, after which they begin misclassifying the actual class, degrading performance. SVM is also an effective SDP model; Figs. 5 and 7 show its precision and ROC values, respectively, from which the effectiveness of SVM over every dataset and its deviation under different noise conditions can easily be seen. Precision falls over the Eclipse and Columba datasets after adding noise at various levels, but for the Scarab data the precision first increases gradually and then starts decreasing, whereas ROC rises in the early phase for both the Eclipse and Scarab datasets. The ROC values produced by the SVM defect prediction model over the Eclipse dataset increase up to 70% noise and then start decreasing, indicating the SVM-based model is highly noise tolerant over Eclipse data.
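The noise levels referred to throughout this analysis correspond to randomly flipping the class labels of a given fraction of instances. A minimal sketch of such label-noise injection follows; the function name and counts are illustrative, not the authors' exact procedure.

```python
import random

def inject_label_noise(labels, rate, seed=0):
    """Flip the binary class label (buggy=1, clean=0) of a random
    `rate` fraction of instances."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(len(noisy) * rate)
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy

# Illustrative dataset: 80 clean and 20 buggy modules, 30% noise.
clean = [0] * 80 + [1] * 20
noisy = inject_label_noise(clean, rate=0.30)
flipped = sum(a != b for a, b in zip(clean, noisy))
print(flipped)  # → 30 of 100 labels flipped
```

Re-training each classifier on labels produced at rates from 0.0 to 0.8 yields the per-level performance curves summarized in Figs. 3 through 7.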
For the Scarab data, the SVM-based model's ROC values degrade after the 10% noise level, as shown in Fig. 7, which indicates the SVM model is not stable over the Scarab dataset. For Columba data, noise up to 20% has no significant effect, meaning there are no hard changes in the ROC values. The SVM-based method can tolerate noise up to 40% for Eclipse data. The TPR values fall continuously for all three datasets, as shown in Fig. 3, whereas the F-score values change stochastically for the Columba and Scarab datasets and decrease at every noise level for the Eclipse data, as reported in Fig. 6. The FPR values (Fig. 4) gradually decrease for the Eclipse and Columba datasets as noise increases, but for Scarab data they increase up to 70% and then drop suddenly. In terms of FPR, this suggests the SVM is more stable and tolerant over Scarab data.
The J48 algorithm performs uniformly when the noise level is between 30 and 40%, because the TPR, F-score, ROC, and precision values are approximately equal for all three datasets. The FPR suddenly starts decreasing when more noise is added to the Eclipse data, as seen in Fig. 4c, indicating the J48-based model is inefficient at tolerating noise in the Eclipse data beyond 30% noise. For the other datasets, the FPR values increase as the noise level increases, which makes the results unpredictable.
The AdaBoost-based SDP model performs least effectively over each dataset as noise increases, as shown in Figs. 6 and 7. Once the noise level reaches 10%, the TPR and precision start decreasing for all three datasets. The FPR values decrease for the Eclipse data, as shown in Fig. 4, but for the Columba and Scarab datasets the FPR values increase with the noise level, implying AdaBoost is inefficient at tolerating noise over these datasets; AdaBoost can therefore tolerate at most 30% noise.
In our suggested approach, i.e., the RF-based SDP model, the ROC and precision remain almost unchanged up to 60-70% noise, as shown in Figs. 7 and 5. The F-score and TPR values start decreasing as the noise level increases, but between 40 and 60% noise the TPR and F-score are unaffected. The FPR in Fig. 4a starts decreasing as noise increases, whereas in Fig. 4b, c the FPR increases with the noise. For all three datasets, the noise tolerance capacity of the proposed approach is 30-40%. Table 4 reports the F-score of all methodologies at various noise degrees. The maximum F-scores produced by the proposed model for the pure Columba, Eclipse, and Scarab datasets are 0.889, 0.968, and 0.889, respectively. Figure 6 shows the deviation curves of the F-score for the various SDP models over different noise degrees.
In Fig. 6b, the RF- and SVM-based SDP models have the least and the most deviated curves, respectively. In all noise scenarios, the RF-based SDP model has approximately equal F-scores; its F-score values at the 10, 20, 30, 40, 50, 60, 70, and 80% noise levels are 0.871, 0.872, 0.849, 0.873, 0.847, 0.855, 0.871, and 0.874, respectively. The F-score is still above 0.85 after 40% noise; this may be because the sample spaces of buggy and clean instances are approximately equal.
The F-score values of Eclipse data at various noise states are shown in Fig. 6a. As Table 4 and Fig. 6a show, the highest F-score for pure Eclipse data, 0.974, is produced by the NB-based SDP model, followed by RF with an F-score of 0.968. The lowest F-score for Eclipse data is 0.870, produced by the AdaBoost-based SDP model. The least deviated curves belong to the proposed model and the J48-based SDP model, whereas the most deviated curve belongs to the NB-based SDP technique, as shown in Fig. 6a. The F-score values at the 10, 20, 30, 40, 50, 60, 70, and 80% noise levels are 0.940, 0.919, 0.869, 0.861, 0.841, 0.844, 0.870, and 0.844, respectively. The F-score is still above 0.8 after 50% noise. As the noise increases, the number of correctly labeled buggy/clean instances degrades; hence the sample spaces of both classes become approximately equal, so the model remains well fitted and performs well even at high noise levels.
Table 4 shows the F-scores of Scarab data for the various methodologies under different noise conditions. The maximum F-score for the Scarab data is 0.889, produced by the RF-based SDP model, followed by the J48-based model with an F-score of 0.867. Figure 6c shows the deviation curves; the F-score values at the 10, 20, 30, 40, 50, 60, 70, and 80% noise levels are 0.889, 0.882, 0.895, 0.885, 0.884, 0.870, 0.878, and 0.883, respectively. The SVM-, NB-, and AdaBoost-based defect prediction models are not effective at high noise levels. Table 6 and Fig. 7 report the ROC of all five models under various noise levels. The maximum ROC value for pure Columba data is 0.951, produced by the RF-based model, as shown in Fig. 7b, followed by the J48-based SDP model with a ROC value of 0.843. The lowest ROC, 0.58, is produced by the SVM-based method. The NB- and AdaBoost-based SDP models show moderate performance, with ROC values of 0.717 and 0.748, respectively.
The ROC of Scarab data at various noise degrees is shown in Fig. 7c. The maximum ROC value at 0% noise is 0.960, produced by the proposed model, followed by the J48-based model with a ROC value of 0.889. The ROC curve generated by the RF-based SDP model is almost uniform, with the least deviation compared with the other methodologies, indicating RF outperforms every other SDP model at various noise levels.
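The ROC values above summarize how well each model ranks buggy modules above clean ones. ROC AUC can be computed directly through its rank (Mann-Whitney) formulation; the following sketch uses illustrative scores, not values from the paper's experiments.

```python
def roc_auc(scores, labels):
    """ROC AUC as the probability that a buggy module (label 1) receives a
    higher score than a clean one (label 0); ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1 for p in pos for n in neg if p > n) \
         + 0.5 * sum(1 for p in pos for n in neg if p == n)
    return wins / (len(pos) * len(neg))

# Illustrative classifier scores for six modules.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(round(roc_auc(scores, labels), 3))  # → 0.889
```

An AUC that stays flat as injected noise grows (as for RF in Fig. 7) is exactly the noise-tolerance behavior discussed above.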

How does the class imbalance problem affect the performance of various SDP models over different noise levels?
We have also conducted experiments over the datasets at various noise levels without applying sampling techniques. The second column of each of Tables 2, 3, 4, 5 and 6 reports the performance metrics without the sampling technique. Table 2 shows that, for each non-sampling-based SDP model over all three datasets, the TPR value at various noise levels is lower than for the sampling-based SDP models. Similarly, the FPR values produced by SDP methods in which no sampling technique is utilized are higher than those of sampling-based methods, as shown in Table 3. The F-score values of every model without sampling are lower than those of the sampling-based SDP techniques over every dataset at various noise levels, as shown in Table 4. Similarly, the precision and ROC values produced by the imbalanced SDP models at different noise levels are lower than those of the sampling-based SDP models, as shown in Tables 5 and 6, respectively. At high noise levels (60-80%), however, the sampling-based SDP model misclassifies the actual class in some rare cases, which causes the worst performance. In a very few cases, the non-sampling-based SDP models outperform because the cardinality of buggy instances is greater than that of clean instances, implying the model is overfitted toward buggy instances; the sampling-based SDP models do not overfit at any noise level.
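The sampling step compared here balances the two classes before training. A minimal sketch of random sub-sampling (discarding randomly chosen majority-class instances until the classes are even) follows; the data and function name are illustrative, not the authors' exact implementation.

```python
import random

def random_undersample(X, y, seed=0):
    """Balance a binary dataset by randomly discarding majority-class
    instances until both classes have equal cardinality."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == 0]
    minority = [i for i, label in enumerate(y) if label == 1]
    if len(minority) > len(majority):
        majority, minority = minority, majority
    kept = rng.sample(majority, len(minority)) + minority
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]

# Illustrative imbalanced data: 90 clean vs. 10 buggy modules.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
Xb, yb = random_undersample(X, y)
print(len(yb), sum(yb))  # → 20 10 (20 instances, 10 per class)
```

Training on the balanced sample removes the bias toward the majority class that produces the inflated FPR and deflated TPR seen in the non-sampling columns of Tables 2 through 6.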

5.5
Comparing the performance of the proposed approach with other classical SDP models without applying the sampling method.
As we discussed earlier, all three datasets are imbalanced.
In this section, we compare the performance of the imbalanced baseline methods with the imbalanced suggested approach over every dataset. The TPR value of RF without applying the sampling method is higher than that of all non-sampling classical methods, as shown in Table 2.
The maximum F-measure values reported by RF without the sampling method over the pure Columba, Eclipse, and Scarab datasets are 0.718, 0.921, and 0.783, respectively, as shown in Table 4, whereas the maximum F-measure values reported by the other traditional models without the sampling technique are 0.707 (J48 at 0%), 0.965 (SVM at 0%), and 0.755 (AdaBoost at 0%), respectively. We also report the precision values of the classical models and the suggested approach in Table 5. The highest precision values reported by RF without sampling over the pure Columba, Eclipse, and Scarab datasets are 0.730, 0.95, and 0.783, respectively, whereas the maximum precision values reported by the other traditional models over the pure datasets without sampling are 0.705 (J48), 0.971 (SVM), and 0.756 (AdaBoost), respectively. The highest ROC values produced by RF without the sampling technique over the pure Columba, Eclipse, and Scarab datasets are 0.793, 0.964, and 0.744, respectively, as reported in Table 6, whereas the maximum ROC values produced by the other traditional models without sampling are 0.748 (J48 at 20%), 0.806 (AdaBoost at 10%), and 0.830 (AdaBoost at 0%), respectively. As reported above, in most cases the classical classifiers perform best at the 0% noise level. Since all three datasets are imbalanced, this leads to overfitted models and biased results, but RF avoids overfitting (Ali et al. 2012) to some extent.

Insightful discussion
As the noise level in the datasets increases, the learning techniques begin to misclassify the actual class, and the performance of the classical SDP models degrades: the number of correctly labeled instances shrinks, and the models start predicting the wrong class as the actual class. When the sampling method is not applied to the traditional baseline models, the classifiers begin overfitting due to the imbalanced datasets, which leads to apparently strong but unreliable results. AdaBoost is an ensemble learning (EL) method; EL methods mainly split the dataset and combine the results, and EL also helps avoid the overfitting problem (Rätsch et al. 1998). In a few cases, AdaBoost outperformed the RF-based model when the sampling technique was not applied; such results are unbiased.
When we applied the sampling technique, the proposed model outperformed every other SDP model at approximately every noise level; in very few cases, AdaBoost and SVM surpassed its performance. As the noise level increases, the classical SDP models degrade in performance, because the proportion of correctly labeled instances shrinks and the models begin predicting notional classes. RF and J48 are tree-based models in which a leaf node represents a class. RF improves over other tree models by way of a small tweak that decorrelates the trees: at every split, the algorithm is not allowed to consider the majority of the available predictors, typically only the square root of the full set. Using the square root of the total number of predictors yields better results as the noise level increases. RF also offers efficient estimates of the test error without incurring the cost of the repeated model training associated with cross-validation, so it is well suited to avoiding the notional class and predicting the actual class.
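The two RF properties just mentioned, considering only about the square root of the predictors at each split and estimating test error from out-of-bag samples, can be sketched with scikit-learn; this is an illustrative setup on synthetic data, not the authors' experimental configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic defect-style data: 20 software metrics, binary buggy/clean label.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# max_features="sqrt" lets each split consider only ~sqrt(20) ≈ 4 metrics,
# decorrelating the trees; oob_score=True estimates test accuracy from
# out-of-bag samples with no separate cross-validation runs.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
print(round(rf.oob_score_, 3))  # out-of-bag accuracy estimate
```

The `oob_score_` attribute plays the role of the "efficient test-error estimate" described above: each tree is evaluated on the bootstrap samples it never saw during training.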

General discussion and threats to validity
We conducted a significance test using the Wilcoxon rank-sum test (Lam and Longnecker 1983) on the noisy versus clean performance of the proposed model and the other SDP models at different noise levels for all three datasets. In Table 7, we list the ROC values of the proposed model and the other optimal SDP models at the corresponding noise levels. Obuchowski (1997) suggested that nonparametric testing using ROC is effective compared with other evaluation metrics. We took two samples: in the first sample (S1), we listed the ROC values of the proposed model in increasing order of noise level from 0 to 80%, whereas in the second sample (S2), we listed the ROC of the most optimal SDP model at each noise level, in the same order. The hypothesis H0 is that the median of the differences between the two samples is 0, and the hypothesis H1 is that the median of the differences is > 0. The sample sizes are n1 = n2 = 27. The significance level is α = 0.005, and the critical value for a right-tailed test is zc = 2.58, so the rejection region is R = {z : z > 2.58}. The rank sum of sample S1 is 1082, which gives z = 5.873. Since z = 5.873 > zc, the null hypothesis is rejected. Therefore, there is enough evidence to claim that the population median of the differences is greater than 0 at the 0.005 significance level.
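The z statistic above follows from the normal approximation to the null distribution of the Wilcoxon rank sum; plugging in the values from the text (rank sum 1082, n1 = n2 = 27) reproduces z = 5.873. The sketch below ignores tie corrections.

```python
def rank_sum_z(rank_sum, n1, n2):
    """Normal-approximation z statistic for the Wilcoxon rank-sum test,
    given the rank sum of the first sample (no tie correction)."""
    mu = n1 * (n1 + n2 + 1) / 2                    # expected rank sum under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5  # standard deviation under H0
    return (rank_sum - mu) / sigma

z = rank_sum_z(rank_sum=1082, n1=27, n2=27)
print(round(z, 3))  # → 5.873, exceeding z_c = 2.58, so H0 is rejected
```

Here mu = 742.5 and sigma ≈ 57.80, so z = (1082 − 742.5) / 57.80 ≈ 5.873, matching the reported value.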
A few threats to the validity of these experiments are as follows.
- We collected open-source datasets for our experiments; the types of noise present in open-source datasets and in software maintained by large organizations may differ, because the data are acquired by differently trained employees. It would be better if private industries revealed their datasets so that they could be tested for noise resistance and the class imbalance problem.
- We used public datasets as pure datasets, but some instances may not be correctly linked, and some defect items may not be adequately linked by the SCM. It is also possible that a few defects were not recorded by the bug tracking system.
- We have not considered feature noise in our study, and this noise also impacts the performance of an SDP model.
- We randomly added noise to the public datasets by changing class labels, but it is possible that real noise follows a specific pattern, for example one caused by poorly managed data during development.
- It is challenging to perform a significance analysis across all five performance measures; that requires a multivariate nonparametric significance test.
- We used the TPR, FPR, F-measure, precision, and ROC performance measures, which are widely used in SDP (Pandey et al. 2020a; Nam et al. 2013; Peters et al. 2013); the choice of measures is another threat to the validity of our conclusions.
- We performed the Wilcoxon rank-sum test to investigate the performance of the various approaches; it is a classical method for validating significant improvements.
- In the future, we plan to reduce these threats by performing experiments over other, more diverse datasets.

Conclusion and future work
Noise and class imbalance are two significant challenges in SDP. We performed 864 experiments over three public datasets and analyzed the noise endurance of well-known SDP models. We manually added noise at levels from 0 to 80%, used four baseline SDP methods, and trained them on these noisy datasets, applying random sampling to avoid the class imbalance problem. We also suggested an approach that can tolerate the maximum noise while still outperforming the baseline methods. We further compared performance without applying sampling methods and found that the proposed approach surpasses the baseline techniques both with noisy instances and with imbalanced data. We have also provided a few guidelines. Additionally, we draw the conclusions listed below.
(i) We applied random sub-sampling as the sampling technique, which provides the most effective results compared with other sampling techniques. (ii) Random forest outperforms the other state-of-the-art techniques; RF has a high noise tolerance rate (30-40%) compared with the other methodologies. (iii) AdaBoost is the least capable, with a very low noise-handling capacity of only 10-20%. (iv) J48 is approximately as effective as random forest and has a high noise-handling capacity, in the range of 30-40%. (v) The TPR and FPR of RF deviate the least, whereas SVM and AdaBoost show high variation under noise; J48 and NB show average deviation after noise is added. (vi) The F-score and ROC of RF are consistently similar in every noise scenario for all three datasets; SVM and NB deviate highly when noise is added, and J48 and AdaBoost deviate moderately. (vii) Naive Bayes and SVM are moderately effective, with an intermediate level of noise tolerance: Naive Bayes up to 30% and SVM up to 40%.
We have used public datasets; software industries should reveal their project data so that better data sources become available for research purposes. Noise-handling algorithms also need to be developed, because no such algorithm currently exists for dealing with noise in defect data items.
There is scope for ensemble learning in software bug tracking systems; it can outperform state-of-the-art techniques. No deep learning-based model is available yet because of the small number of instances in the datasets; by applying data augmentation, we can enlarge the training set so that deep learning-based architectures can be applied easily. Deep learning architectures can also be used for feature selection. Cross-project defect prediction over bug tracking systems can also be helpful for different types of software systems, but we must be careful while combining other metrics and their datasets, because this can create redundancy, which affects the performance of the learning model.