Software fault prediction using deep learning techniques

Software fault prediction (SFP) techniques identify faults at the early stages of the software development life cycle (SDLC). Machine learning techniques are commonly used for SFP, whereas deep learning methods, which can produce more accurate results, remain less explored. Deep learning offers exceptional results in various domains, such as computer vision, natural language processing, and speech recognition. In this study, we use three deep learning methods, namely, long short-term memory (LSTM), bidirectional LSTM (BILSTM), and radial basis function network (RBFN), to predict software faults and compare our results with existing models to show that our results are more accurate. Our study uses Chidamber and Kemerer (CK) metrics-based datasets to conduct experiments and test our proposed algorithms. We conclude that LSTM and BILSTM perform better, whereas RBFN is faster in producing the required results. We use k-fold cross-validation for model evaluation. Our proposed models provide software developers with a more accurate and efficient SFP mechanism.


Introduction
Software development follows a set of steps to produce reliable and high-quality software. Faults can occur at any time during the software development process. Software testing and quality assurance involve static and dynamic analysis of the software under test (SUT) with the aim of finding and fixing faults. However, locating faults, fixing them, and retesting to ensure they are fixed without side effects such as regression in preexisting software quality has financial implications (Jones & Bonsignour, 2011). Another critical aspect of software testing and quality assurance within software development is that faults must be discovered as early as possible, since the cost of fixing them is higher when they are found in later stages (Singh, 2011).
Software fault prediction (SFP) helps predict faults early in the development phase and improves the quality of the final product in a fast and cost-effective manner. SFP allows the development team to focus testing on modules or files with a high likelihood of failure. As a consequence, defective modules receive more attention, the possibility of resolving the remaining faults increases, and software products are delivered to end users more efficiently. Fault prediction also cuts down on project maintenance and support costs; SFP decreases costs, time, and effort (Erturk & Sezer, 2016). It requires further investigation so that it may be established as part of integrated development environments (IDEs) to assist developers in developing high-quality code in the first place.
Data mining and machine learning techniques are used for software fault prediction, including clustering, support vector machines, genetic algorithms (Rosli et al., 2011), artificial neural networks (ANN) (Jin et al., 2012; Jin & Jin, 2015), and decision trees (Rathore & Kumar, 2016). A comprehensive account of which data mining, machine learning, and deep learning techniques are already used for SFP, along with details of available datasets and performance analysis, is given in Batool and Khan (2022). Deep learning encompasses a combination of techniques used in neural network models to learn from several layers; it is a sub-field of machine learning that incorporates supervised, unsupervised, hybrid, and reinforcement learning methods. Deep learning handles large volumes of data and provides a range of models that use historical data to find meaningful patterns. Deep neural network-based representations may be reused between tasks (Zhang et al., 2019). We investigate whether deep learning can enhance software fault prediction performance and how hyper-parameter tuning improves results.
Software faults have an economic impact as well as costing time and causing other consequences. Since software reliability has become a primary concern of industry and the community, software must be as fault-free as possible. Software fault prediction has been studied by many researchers using different data mining, machine learning, and deep learning approaches, each providing different accuracy. We have thoroughly studied applications of machine learning and deep learning techniques for SFP and presented a detailed discussion of the available datasets in Batool and Khan (2022) and Aziz et al. (2021). Deep learning-based models achieve tremendous success in various domains (Peng et al., 2017). Deep learning enables multi-layer computational models to learn data representations with numerous levels of abstraction (LeCun et al., 2015). Deep learning can automatically extract features from raw data, which makes it robust, and it can also handle large amounts of data (Zhang et al., 2019; Al Qasem et al., 2020). We contribute by assessing the significance and efficacy of the proposed deep learning methods for software fault prediction. We also determine the optimum algorithm quality and address how design modifications affect the effectiveness of deep learning algorithms. The method allows developers and testers to concentrate on fault-prone code (all types of errors), reducing testing and maintenance costs while contributing to software advancement and overall product stability. We make use of three deep learning techniques: long short-term memory (LSTM), bidirectional LSTM (BILSTM), and radial basis function network (RBFN). The factors used in this study are the number of layers, epochs, batch size, dropout rate, and optimizer. Several comparisons are performed and discussed in the results section.

Related work
This section discusses significant research on software fault prediction using machine learning, neural networks, and deep learning approaches. We further highlight our reasons for selecting particular deep learning techniques for our proposed SFP algorithms. Zhang et al. (2023) proposed a blockchain-based decentralized federated transfer learning technique for diagnosing machinery faults. The study aims to build and enhance security for private data and communications among different clients. Two decentralized fault diagnosis datasets are used for the experiment. The results indicate that the proposed technique is effective for privacy-preserving collaborative fault diagnosis for multiple users.

We present the most commonly used machine learning techniques and their findings in Table 1. This subsection shows that machine learning models have been investigated in detail for SFP, and that deep learning is needed to further improve performance and accuracy.

Deep learning techniques
Defect prediction via convolutional neural network (Li et al., 2017) is a method that leverages deep learning for effective feature generation based on the abstract syntax tree (AST). In this study, the researchers first extract token vectors, which are further encoded as numerical vectors via mapping and word embedding. They input the numerical vectors into a convolutional neural network to automatically learn the semantic and structural characteristics of programs.
A prediction model using a tree-structured long short-term memory (LSTM) network is designed in Dam et al. (2018) to automatically learn characteristics describing source code for defect prediction. Their model directly matches the AST representation of the source code. Their methodology is divided into four phases: they first parse a source code file into an abstract syntax tree, then embed the AST by transforming each AST node's label name into a vector. Next, they input the AST embedding into a tree-based LSTM network to obtain a vector representation of the whole AST. Finally, a classification algorithm is employed to forecast defects. For the experiment, Samsung open-source datasets and PROMISE datasets are used.
A model for defect prediction that combines word embedding and LSTM methods is proposed in Liang et al. (2019). In this model, the process is broken down into three parts. The first is the extraction of tokens from the program's abstract syntax tree. Next, each token is turned into a vector. In the third step, an LSTM network is trained using the vectors and their labels. The LSTM identifies program errors by automatically learning the semantic meaning of the code in question. The experiment uses eight open-source projects, and the proposed model outperforms existing defect prediction methodologies. Previous research has focused chiefly on feature extraction via tree representations to examine the semantic nature of a program. DeepLineDP (Pornprasit & Tantithamthavorn, 2022) is a deep learning-based method that automatically learns the semantic features of surrounding tokens and lines to identify defective lines and files. DeepLineDP is more accurate than other file-level prediction approaches and more cost-effective than other line-level defect prediction approaches.
A model that automatically learns characteristics is proposed in Phan and Le Nguyen (2017) using a CNN that first constructs control flow graphs (CFGs) from assembly code. The CFG describes the execution flows of the assembly instructions, which disclose the behavior of a program. In the next stage, a graphical model for CFG datasets that includes a multi-view, multi-layer CNN for directed labeled graphs is used, resulting in models based on CFG data. Recently, researchers have attempted to apply LSTM to the problem of software defect prediction. The remaining useful life (RUL) method of Li et al. (2022) addresses the sensor malfunction problem. An automatic feature extraction approach is adopted to exploit the information from different sensors. The experimental results indicate that the proposed approach is well suited for industrial applications.
LSTM and BILSTM are used in various studies for software fault prediction, where they show good performance. However, it is pertinent to note that investigations involving LSTM and BiLSTM use small datasets, and more importantly, they do not address metrics-based features in the creation of datasets (Liang et al., 2019; Wang et al., 2021; Uddin et al., 2022). We consider this a gap and therefore consider metrics-based features in relatively large datasets. We also consider the radial basis function network (RBFN), which has not yet been used for software fault prediction. It has already been highlighted that RBFN is successfully employed to solve similar classification problems (Wu et al., 2012) with greater accuracy and better performance. For this reason, we select RBFN and use LSTM and BiLSTM with bigger datasets for investigation, since the latter two have already been used for SFP but with smaller datasets.

Our approach
In the preceding section, we showed that deep learning-based fault prediction techniques can be more beneficial and improve overall results. Figure 1 illustrates the steps of our approach. First, we apply LSTM and modify its parameters to measure performance. We then use the same procedures for BILSTM and RBFN and compare the findings to determine the best possible outcomes. LSTM, BILSTM, and RBFN are implemented using Python 3.8.8, and we conduct experiments using Google Colab.

Datasets details
We use two different datasets, which we call "Dataset-1" and "Dataset-2," to compare LSTM, BiLSTM, and RBFN for fault prediction. Dataset-1 contains 88,672 instances, and Dataset-2 contains 6052 instances. It is pertinent to highlight that both of these datasets are based on static analysis of code, since they are code metrics-based datasets. We describe their construction in the subsequent subsections. Since our proposal is useful for early prediction of software faults to assist developers, we predict the likelihood of faults of a mainly functional nature. We use two datasets to establish the generalizability of our proposal.

Dataset-1
We use the metrics proposed by Chidamber and Kemerer (CK) (Padhy et al., 2018) to arrive at Dataset-1. We construct Dataset-1 by concatenating CK metrics-based datasets with smaller numbers of examples (instances or rows), arriving at more examples than any of the individual components, which is why we call it larger. A complete picture of how we performed the concatenation, with details of the normalization, encoding, splitting, cleaning, filtration, and statistical analysis such as skewness evaluation, is given in Aziz et al. (2019, 2020). We also note that the use of larger datasets with more examples helps to implement deep learning techniques more effectively (Aziz et al., 2019). Several deep learning algorithms have already been used for software fault prediction, but in those studies, the authors used datasets with a relatively small number of examples (Al Qasem & Akour, 2019; Nevendra & Singh, 2021). CK metrics are used by researchers (Shaik et al., 2012; Radjenović et al., 2013), and there are studies that demonstrate the usefulness of CK metrics for software fault prediction (Suresh et al., 2012; Sharma & Dubey, 2012; Malhotra & Bansal, 2015; Suri & Singhal, 2015; Singh et al., 2011). We share details of the sources of the smaller datasets in Table 2; the complete working is available in Aziz et al. (2019, 2020). It is important to note that Dataset-1 contains 82% non-faulty instances and the rest faulty.

Dataset-2
We select as Dataset-2 the GHPR dataset used for defect prediction in Xu et al. (2020). The selected dataset consists of 6052 instances and 21 static metrics, as shown in Table 3. This static metrics-based dataset, Dataset-2, is a balanced dataset with a fault ratio of 0.5.

Statistical analysis of datasets
Features in datasets may be correlated when they capture similar aspects of programming. A high correlation value of |r| >= 0.75 indicates redundancy, and keeping such duplicate measures needs addressing before using them for model training (Han et al., 2022). We computed the Pearson correlation coefficient (r) and the Spearman correlation coefficient (ρ) for the feature pairs in our chosen datasets, as shown in Fig. 2. We found all relationships positively correlated, yet none of them significant.
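As an illustration, the following sketch (with an illustrative file path, assuming a purely numeric metrics table) shows how such pairwise coefficients and redundancy candidates can be computed with pandas:

```python
import pandas as pd

# Load a metrics-based dataset (path is illustrative; columns assumed numeric).
df = pd.read_csv("dataset1.csv")

# Pairwise Pearson and Spearman correlation matrices over all features.
pearson_r = df.corr(method="pearson")
spearman_rho = df.corr(method="spearman")

# Flag highly correlated feature pairs (|r| >= 0.75) as redundancy candidates.
redundant = [
    (a, b, pearson_r.loc[a, b])
    for i, a in enumerate(pearson_r.columns)
    for b in pearson_r.columns[i + 1:]
    if abs(pearson_r.loc[a, b]) >= 0.75
]
print(redundant)
```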

Pre-processing phase
Our pre-processing effort includes the following steps:

Normalization
Normalization is applied to numerical features during the pre-processing stage to map them to a new range based on an equation. In this study, we also applied the standardization method. Dataset-1 includes various component datasets such as MFA and CA. Since we concatenated small datasets such that the TRUE label indicates a faulty instance and the FALSE label indicates a non-faulty instance, we performed a cleaning operation on the combined dataset to remove anomalies. Figure 3 illustrates the pre-processing steps.
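As a minimal sketch (assuming scikit-learn and a toy numeric feature matrix), normalization and standardization can be applied as follows:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy feature matrix

# Normalization: rescale each feature to a new range, here [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: rescale each feature to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
```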

Label encoding
Label encoding turns labels into a numeric form so that machines can read them. Machine learning algorithms can then better decide how those labels should be used. In supervised learning, it is a crucial pre-processing step for structured datasets. In our study, Dataset-1 consists of TRUE and FALSE labels; we use label encoding to convert these categorical values into numerical values. Dataset-2 already contains numeric label values, so there is no need for label encoding.
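For example, the TRUE/FALSE labels of Dataset-1 can be encoded with scikit-learn (a minimal sketch with toy labels):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["TRUE", "FALSE", "FALSE", "TRUE"]  # faulty / non-faulty markers
encoder = LabelEncoder()
y = encoder.fit_transform(labels)  # classes sorted alphabetically: FALSE -> 0, TRUE -> 1
```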

Applying deep learning algorithms
At this stage, we apply three deep learning algorithms (LSTM, BiLSTM, and RBFN) and measure their performance. Parameter choices significantly impact the performance of deep learning algorithms (Al Qasem et al., 2020). We choose different parameters for the experiments, such as the number of layers, different activation functions, and hyper-parameters. After applying these parameters, we repeat the steps until we get satisfactory results.

Long short-term memory (LSTM)
Long short-term memory (LSTM) is a sub-type of artificial neural network that detects patterns in data sequences. The LSTM architecture is composed of memory blocks that are linked through recurrent sub-networks. LSTM consists of four components (Yu et al., 2019b):
1. Memory cell: remembers and forgets information based on the input context.
2. Forget gate (f): This gate determines which information should be eliminated from the LSTM memory by applying a sigmoid function to the information in the LSTM memory. This decision is primarily based on the values of $h_{t-1}$ and $x_t$. The output of this gate is $f_t$, a value between 0 and 1, where 0 indicates that the learned value should be fully discarded and 1 indicates that it should be fully retained. This result is calculated using Eq. 1:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$  (1)

where $b_f$ is a constant value called the bias.
3. Input gate (i): It controls how much information is written to the internal cell state. This gate has two layers: a sigmoid layer, which selects the values to be updated, and a tanh layer, which provides a vector of new candidate values for storage in the LSTM memory. The outputs of these two layers are computed using Eqs. 2 and 3:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$  (2)

$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$  (3)

Here, $i_t$ denotes whether a value should be updated, and $\tilde{c}_t$ represents the vector of new candidate values that will be added to the LSTM memory cell.
4. Output gate (o): This gate first employs a sigmoid layer to determine which part of the LSTM memory contributes to the output. Then, it uses a non-linear tanh function to map values between -1 and 1. Finally, the output of the sigmoid layer is multiplied by this result. The output is computed using Eq. 4:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$  (4)

These gates improve efficiency by determining which information should be discarded and which new information should be added to the cell state. The architecture of LSTM is shown in Fig. 4.
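As an illustrative sketch (not the paper's exact configuration; the layer size, input reshaping, and commented-out training settings are assumptions based on the parameters reported later), a Keras LSTM fault classifier over metric vectors could look like this:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

n_features = 21  # e.g., the 21 static metrics of Dataset-2

# Each metrics row is treated as a length-1 sequence of n_features values.
model = Sequential([
    LSTM(100, input_shape=(1, n_features)),
    Dropout(0.5),                    # 0.5 was the best dropout rate in our runs
    Dense(1, activation="sigmoid"),  # binary faulty / non-faulty output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=1000, batch_size=64)
```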

Bidirectional LSTM (BILSTM)
Bidirectional LSTM (Siami-Namini et al., 2019) is an extension of the LSTM algorithm described above that employs two LSTMs on the input data. The input sequence is processed by an LSTM in the first round (i.e., the forward layer). In the second round, the reversed form of the input sequence is fed to an LSTM in the backward layer. Applying the LSTM twice improves the learning of long-term dependencies and, as a result, improves the model's accuracy (Schmidhuber, 2015). BILSTM is fast and takes less time in prediction. Figure 5 illustrates the architecture of BILSTM.
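In Keras, this amounts to wrapping the LSTM layer from the previous sketch in a Bidirectional wrapper; a minimal sketch:

```python
from tensorflow.keras.layers import Bidirectional, LSTM

# Runs one LSTM forwards and one backwards over the input sequence and
# concatenates their final outputs, doubling the feature dimension.
bilstm = Bidirectional(LSTM(100), merge_mode="concat")
```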

Radial basis function network (RBFN)
RBFN is a type of deep learning network consisting of input, hidden, and output layers. The number and locations of the radial basis functions, their design, and the method used to calculate the associative weight matrix all influence the functioning of an RBFN network. Gaussian distribution functions are primarily used as the radial basis function, so the Gaussian radial function can be defined as in Eq. 5:

$\phi(x) = \exp\left(-\frac{\lVert x - \mu \rVert^2}{2\sigma^2}\right)$  (5)

where $\mu$ is the center (receptor) of the basis function and $\sigma$ controls its width.
In an RBFN, the hidden layer is trained using back-propagation. We have to compute the receptors (centers) and the variance for each node in the hidden layer. In the training phase, weights are updated to reduce the error. After training, we select the prototypes.
For RBFN, the prototypes are best chosen by k-means clustering; after choosing the prototypes, the beta coefficient is set so that sigma equals the average distance between all points in a cluster and the cluster center. After training, the output weights are computed using gradient descent (least squares error), run separately for every node. We illustrate the architecture of RBFN in Fig. 6.
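A minimal sketch of the hidden-layer computation, assuming scikit-learn for the k-means prototype selection and a toy feature matrix (the sigma value shown is a placeholder, not the cluster-averaged value described above):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_activations(X, centers, sigma):
    # Gaussian radial basis function (Eq. 5) for every sample/center pair:
    # phi = exp(-||x - mu||^2 / (2 * sigma^2)).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

X = np.random.rand(200, 21)                      # toy metrics matrix
# Prototypes (centers) chosen by k-means clustering, as described above.
centers = KMeans(n_clusters=10, n_init=10).fit(X).cluster_centers_
phi = rbf_activations(X, centers, sigma=1.0)     # hidden-layer outputs (200 x 10)
```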

Modifying model parameters
The network configuration we define includes hyper-parameter tuning, the number of layers, and activation functions.

Hyper-parameter tuning
The selection of appropriate parameters is a critical and complicated aspect of network training; due to constraints such as memory limitations, trade-offs are inherently present in parameter selection (Ali & Gravino, 2021; Snuverink, 2017). Throughout the examination, several hyper-parameters were varied to measure their influence on accuracy. Tuning hyper-parameters is crucial, since it ensures that the proper parameter is used for the best outcome. In our study, we performed a range of experiments before recording the findings; for each algorithm, we examined and assessed the values of its hyper-parameters before evaluating the outcomes. We choose different hyper-parameters that help improve the deep learning algorithms' performance (Verma et al., 2020), and our strategy comprises the following (a minimal sweep over these parameters is sketched after this list):

1. Epochs: An epoch is one complete pass of the dataset through the network. Our experiments change the number of epochs depending on the requirement; we start from 10 and go up to 1000 to find the best results.
2. Batch size: Batch size indicates the number of training samples processed at once. We use different batch sizes, starting from 10 up to 64. We settle on a batch size of 64, which we consider the best for a large dataset; however, a large batch size takes more memory space.
3. Dropout: We use dropout to avoid over-fitting; units are dropped randomly (Srivastava et al., 2014). We try dropout rates from 0.2 to 0.7 in increments of 0.1.
4. Optimizer function: In back-propagation, the optimizer function adjusts the weights to minimize the error rate. In our study, we use the Adam optimizer. Adam is a stochastic gradient descent approach based on adaptive estimates of first- and second-order moments. According to Kingma and Ba (2014), the approach is "computationally efficient, requires minimal memory, is insensitive to diagonal rescaling of gradients, and is well suited for issues with huge data/parameter sets."
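The following sketch shows one way such a sweep could be organized; build_model is a hypothetical helper returning a compiled Keras model for a given dropout rate, and X_train, y_train, X_val, y_val are assumed pre-processed splits:

```python
# Manual grid sweep over the hyper-parameters listed above.
best = None
for epochs in (10, 100, 1000):
    for batch_size in (10, 32, 64):
        for dropout in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
            model = build_model(dropout)  # hypothetical model factory
            history = model.fit(X_train, y_train,
                                epochs=epochs, batch_size=batch_size,
                                validation_data=(X_val, y_val), verbose=0)
            val_acc = max(history.history["val_accuracy"])
            if best is None or val_acc > best[0]:
                best = (val_acc, epochs, batch_size, dropout)
print("best (val_acc, epochs, batch_size, dropout):", best)
```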

Number of layers
Numerous parameters significantly impact network design, including the number of hidden layers and the number of neurons in each hidden layer; the number of layers is the most important. As we increase the number of layers, the model becomes more complex. In our study, we work with both single layers and multiple layers: we compare single and multiple hidden layers and check the performance of the models based on the hidden layers.

Activation function
An artificial neuron's activation function specifies the neuron's output in response to an input or collection of inputs. Each activation function accepts a single value x as input and applies a specified mathematical operation. In deep neural networks, the sigmoid, rectified linear unit, and hyperbolic tangent are the most often employed activation functions.
Sigmoid: In back-propagation, the sigmoid is the most often used activation function. Its range is (0, 1). It is a suitable non-linearity with a suitable degree of smoothness that overcame restrictions of earlier neural network (NN) applications (Mercioni et al., 2019). Equation 6 defines the sigmoid function:

$\sigma(x) = \frac{1}{1 + e^{-x}}$  (6)
Hyperbolic tangent (tanh): Tanh is the ratio of the hyperbolic sine to the hyperbolic cosine function (Karlik & Olgac, 2011). The tanh function is defined by Eq. 7:

$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$  (7)
Rectified linear unit (ReLU): When utilized as the activation function for a deep neural network's hidden layers, ReLU enables the training of much deeper networks than the sigmoid or tanh activation functions. ReLU accelerates and optimizes the learning of deep neural networks (DNN) in complex, high-dimensional input contexts. Its primary benefit is that no costly computations are required; only comparison and multiplication are needed. Because ReLU avoids exploding or vanishing gradients in back-propagation, it is an excellent choice for DNNs (Farhadi, 2017).
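For reference, a short numpy sketch of the three activation functions discussed above:

```python
import numpy as np

def sigmoid(x):
    # Eq. 6: squashes inputs into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Eq. 7: ratio of hyperbolic sine to cosine, range (-1, 1).
    return np.tanh(x)

def relu(x):
    # max(0, x): only a comparison, no costly computation.
    return np.maximum(0.0, x)
```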

Performance measure metrics
Our study uses the following performance metrics: accuracy, recall, precision, and F1-measure, all of which can be derived from the confusion matrix. We also perform ROC-AUC analysis.
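A minimal sketch of computing these measures with scikit-learn, using toy labels and predicted probabilities for illustration:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = np.array([1, 0, 1, 1, 0])             # ground truth: faulty / non-faulty
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1])   # model's sigmoid outputs
y_pred = (y_prob >= 0.5).astype(int)           # thresholded class predictions

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))           # area under the ROC curve
print(confusion_matrix(y_true, y_pred))        # [[TN, FP], [FN, TP]] layout
```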

Experimental setup
We use local computing resources to conduct the first experiment and Google Colab to run the second, since the second uses our larger dataset and therefore requires more computational power and resources.

LSTM results
LSTM, BILSTM, and RBFN are implemented using the Keras framework. The NumPy, pandas, scikit-learn, and Keras Tuner libraries are also used. The Matplotlib package is used for visualization, while Jupyter notebook is used as the programming environment. We conducted over 200 tests with various hyper-parameters, activation functions, and numbers of layers. To enhance the findings, we conducted experiments in which the model's parameters were changed. The findings are outlined in the following sections. The algorithmic steps of LSTM and BiLSTM are shown in Algorithm 1.
To evaluate the impact of the number of epochs, batch size, dropout rate, optimizer, layer count, and activation function, using the LSTM algorithm, we obtained different results using Dataset-1, as shown in Table 4. Results of LSTM with Dataset-2 are shown in Table 5.
It is pertinent to note that in Tables 4-12 we highlight in bold the epochs, batch sizes, dropout rates, single and multiple layers, and activation functions that yield the optimal results.

Number of epochs:
In order to investigate the impact of the number of epochs, we experimented with LSTM on Dataset-1 and Dataset-2, as shown in Tables 4 and 5.
Batch size: In order to investigate the impact of batch size, we experimented using different batch sizes as shown in Tables 6 and 7.
Dropout: We conducted various experiments using different dropout rates, with results shown in Tables 8 and 9, to investigate the impact of the dropout rate using the LSTM algorithm.

Number of layers:
We conducted experiments using LSTM with single and multiple layers to determine the influence of layer count, as shown in Tables 10 and 11.
Activation function: We ran experiments using different activation functions, with results shown in Table 12, to determine the effect of the activation function.

BILSTM results
In order to evaluate the impact of the number of epochs, batch size, dropout, optimizer, layer count, and activation function, we obtained different results using the BILSTM algorithm with Dataset-1, as shown in Table 4; the results with Dataset-2 are shown in Table 5. The algorithmic steps of BILSTM are described in Algorithm 1.

Number of epochs:
In order to analyze the impact of the number of epochs on Dataset-1 and Dataset-2, the BILSTM algorithm results are shown in Tables 4 and 5, respectively.
Batch size: Tables 6 and 7 indicate the impact of batch size in BILSTM.

Number of layers:
In order to show the impact of the number of layers, we perform different experiments as shown in Tables 10 and 11.
Activation function: We use 'sigmoid' and 'relu' activation functions and we present our results in Table 12.

RBFN results
We evaluate the impact of the number of epochs, batch size, dropout rate, optimizer, layer count, and activation function using the RBFN algorithm with Dataset-1, as shown in Table 4; the results with Dataset-2 are shown in Table 5. The algorithmic steps of RBFN are shown in Algorithm 2.

Number of epochs:
In order to analyze the impact of the number of epochs, we experimented with the RBFN algorithm; the results are shown in Tables 4 and 5.
Batch size: In order to observe the impact of batch size, various experiments are performed; the results are shown in Tables 6 and 7.

Number of layers:
In order to study the impact of layer count, we conduct different experiments, and the results are described in Tables 10 and 11, respectively.

Cross-validation
Tables 13 and 14 illustrate the results after applying the k-fold cross-validation.We applied 10-fold cross-validation on both datasets, Dataset-1 and Dataset-2.

Confusion matrix
A detailed analysis of the confusion matrices, shown in Figs. 7 and 8, demonstrates that our proposed algorithms correctly classify faults while exhibiting better performance.

AUC-ROC analysis
The ROC curve plots show the TPR and FPR of our constructed algorithms, which achieve state-of-the-art performance. Figure 9 depicts that LSTM and BiLSTM perform better. Figures 9 and 10 show the AUC-ROC analysis for LSTM, BILSTM, and RBFN, respectively.

Discussion
We examine the parameters of the LSTM, BILSTM, and RBFN algorithms that provide meaningful predictions to explain and clarify the outcomes (Tables 15 and 16).

LSTM
According to experimental findings, the proposed LSTM algorithm achieves effective outcomes through the modification of parameters.
The effect of hyper-parameters: The number of epochs had a noticeable impact, particularly on precision. When we raise the number of epochs, accuracy, precision, and F1-score increase. When we applied 1000 epochs on Dataset-1, precision increased from 92.12 to 93.23, accuracy from 93.21 to 93.53, and F1-score from 92.12 to 93.33, with a slight decrease in recall from 92.22 to 92.21. Similarly, at 1000 epochs, the recall and accuracy on Dataset-2 increased from 82.02 to 82.94 and from 83.13 to 83.71, respectively; however, precision decreased slightly from 82.12 to 82.11, while the F1-score increased from 83 to 83.14. Once the optimum number of epochs is passed, precision and recall drop, reducing accuracy: when we tried 3000 epochs, precision and recall decreased, and accuracy fell from 93.50 to 93.23. The batch size has a similar effect to the epochs: accuracy improves as the batch size grows until it exceeds a threshold size, after which increasing the batch size decreases accuracy. In our study, a batch size of 64 gives the best results on Dataset-1, namely 93.53, 93.23, 93.28, and 93.44; a batch size of more than 64 degrades these results. We use dropout rates from 0.2 to 0.7 and get the best results at a dropout rate of 0.5, which are 93.53, 93.25, 93.28, and 93.44 on Dataset-1; Dataset-2's best results at 0.5 are 83.11, 82.01, 82.94, and 83, respectively. Thus, in our study, we get the best results at a 0.5 dropout rate. We report that the sigmoid activation function works better than ReLU because we use a binary target class; hence, sigmoid gives better accuracy, precision, recall, and F1-measure. Using the sigmoid activation function, we get an accuracy of 93.53; when we use the ReLU activation function, accuracy drops to 82.01%.
The effect of the number of layers: In our study, we perform experiments using two types of configurations: single layers and multiple layers. In the single-layer experiments, we increase the number of neurons, starting with 10 and going up to 100, across up to five layers. In the multi-layer experiments, we use ten neurons in the first configuration; in the second, two layers with 10 and 15 neurons; and in the third, three layers with 15, 20, and 22 neurons, respectively. Following this pattern, we add configurations 4 and 5 with increasing numbers of layers.
Dataset-1: In the single-layer case, we achieve a high accuracy of 93.66% at layer 4, and in the multi-layer case, we achieve the best accuracy of 93.66 at layer 3.
Dataset-2: In the case of Dataset-2, we get the best accuracy at layers 3 and 4, with a slight difference between 83.56 and 83.58, using 50 and 75 neurons. In the multi-layer case, we achieve an accuracy of 83.56 at layer 3 with 10, 15, and 20 neurons, respectively. We use the Adam, SGD, and AdaGrad optimizers and get the best accuracy with the Adam optimizer.

BILSTM
Based on our testing and the findings we gained from these experiments, we concluded that changing the network design impacted the network's behavior.
The effect of hyper-parameters: An increase in the number of epochs significantly impacts Dataset-1 and Dataset-2 when BILSTM is applied. When we applied 1000 epochs on Dataset-1, recall increased from 92.02 to 92.94 and accuracy increased from 93.13 to 93.75; precision increased from 92.14 to 93.11, and the F1-score increased from 93.0 to 93.14. Similarly, on Dataset-2, precision increased from 83.16 to 84.94 and accuracy increased from 83.12 to 84.75; recall and F1-score also increased, from 83.02 to 84.94 and from 83.92 to 84.05, respectively.
The effect of batch size: The impact of batch size is similar to that of epochs: an increase in batch size increases accuracy, precision, recall, and F1-score until the optimal batch size is reached, after which they decrease. We get state-of-the-art outcomes at batch size 64: an accuracy of 93.75, a precision and F1-score of 93.14, and a recall of 93.28. Similarly, on Dataset-2, we get the best values at batch size 64, achieving the highest accuracy of 83.56.
The effect of layers (Dataset-1): Using single layers with increasing numbers of neurons, we achieve the best result at layer 3 with 100 neurons: an accuracy of 93.75, precision of 93.35, recall of 93.45, and F1-score of 93.34. When we use a multi-layer architecture, we get the best results at layer 3, using three layers with 15, 20, and 22 neurons.
The effect of layers (Dataset-2): Using a single layer with increasing numbers of neurons, we get the best result of 83.75 with 4 layers, and in the multi-layer case, we achieve a maximum accuracy of 83.75 at layer 4 with 10, 15, 20, and 22 neurons, respectively.

RBFN
Our experiments and findings indicate that modifying network design has a great impact on the behavior of the network.
The effect of hyper-parameters: When we increase the hyper-parameter values, there is a positive impact on the accuracy, precision, recall, and F1-score for both Dataset-1 and Dataset-2. When we increase the number of epochs up to 1500, accuracy increases from 82.15 to 82.75, with slight increases in precision, recall, and F1-measure from 82.11 to 82.13, 82.11 to 82.19, and 82.11 to 82.15, respectively. In the case of Dataset-2, when we increase the number of epochs, accuracy increases from 78.21 to 79.13. Precision increases from 78.18. Recall goes from 77.28 to 78.06; the F1-score increases from 78.13 to 78.46. The effect of batch size is the same as for epochs: when we increase the batch size, accuracy, precision, recall, and F1-score also increase. We get the best accuracy (82.72) at batch size 64 using Dataset-1; on Dataset-2, we get an accuracy of 78.27.
The effect of layers (Dataset-1): In the single-layer case, we achieve a high accuracy (82.45) at layer 4 with 75 neurons. When we use a multi-layer architecture, we achieve a high accuracy (82.58) at layer 4, with four layers of 12, 15, 20, and 22 neurons.
Dataset-2: Using Dataset-2, we achieve optimal results (78.27) at layer 4 in the single-layer setting, and with multiple layers we also get the best result at layer 4, with 10, 15, 20, and 22 neurons.

Cross-validation
We use k-fold cross-validation to obtain unbiased results. K-fold cross-validation splits the data into k roughly equally sized folds. (Nested k-fold cross-validation is another way to tune the parameters of algorithms.) The data is divided into k folds, and one fold is reserved for testing (Ali & Gravino, 2021), while the remaining k-1 folds are used for training. This splitting process is repeated k times, yielding k test sets, and performance is obtained by aggregating the k results achieved for the different folds. Selecting the value of k is a challenging task; in our study, we select k = 10.
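A minimal sketch of this procedure with scikit-learn, where build_model is again a hypothetical helper returning a compiled Keras model and X, y are the pre-processed features and labels:

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)
accuracies = []
for train_idx, test_idx in kf.split(X):
    model = build_model()  # hypothetical compiled Keras model (metrics=["accuracy"])
    model.fit(X[train_idx], y[train_idx], epochs=100, batch_size=64, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    accuracies.append(acc)
print("10-fold mean accuracy:", np.mean(accuracies))  # aggregated performance
```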
Dataset-1: To achieve the optimal results, we use k-fold cross-validation. LSTM achieves the highest accuracy of 93.53, with a precision of 93.06%, a recall of 93.11%, and an F1-score of 93.01%. Using the BILSTM algorithm, we achieve an optimum accuracy of 93.75, a precision of 93.56, a recall of 93.41, and an F1-score of 93.35. Using the RBFN algorithm, we achieve an accuracy of 82.51%, a precision of 83.44%, a recall of 83.21%, and an F1-score of 83.6%.

Comparison
In this study, we choose the best model between three deep learning algorithms, LSTM, BILSTM, and RBFN, using two different datasets.As discussed earlier, Dataset-1 consists of 88,672 instances, and Dataset-2 consists of 6052 instances.
Our proposed deep learning algorithms LSTM and BILSTM achieve state-of-the-art performance compared to RBFN. It is pertinent to note that LSTM is much more time-consuming, whereas BILSTM is faster than LSTM; we observe that RBFN runs much faster than both LSTM and BILSTM. When we compare hyper-parameters, we observe that all hyper-parameters significantly affect the deep learning algorithms, but there is a limit: once a hyper-parameter passes its optimal value, accuracy, precision, recall, and F1-score decrease. Datasets are also quite important when implementing deep learning algorithms; we observe that deep learning algorithms perform better on a large dataset. Table 18 describes the comparison between different benchmark datasets. We also compare these three deep learning techniques with machine learning techniques such as k-nearest neighbors (KNN) and support vector machines (SVM), using the same datasets. We observe that machine learning does not perform well on large datasets: deep learning techniques perform better than the existing ML techniques used for SFP when large datasets are used, as shown in Table 17.
Comparing the metrics, CK metrics achieve state-of-the-art results in various studies (Suresh et al., 2014; Aziz et al., 2019); in our study, CK metrics also perform better than the static metrics.

Conclusion
Machine learning and neural networks are standard techniques used for software fault prediction. In this study, we aimed to implement deep learning algorithms for software fault prediction to answer two main questions: how do deep learning algorithms help to improve performance, and how do various model architecture considerations lead us to an acceptably accurate model? Three deep learning algorithms (LSTM, BILSTM, and RBFN) are used for the experiments. We used two different datasets to evaluate the performance of the proposed deep learning algorithms. Dataset-1 is an open-source dataset comprising 70 publicly available datasets with CK metrics. We accessed Dataset-2 from a Git repository; it comprises 21 static metrics. We used accuracy, precision, recall, and F1-score as performance measures. In the comparison of algorithms, BILSTM and LSTM outperform RBFN. To achieve the optimal result, we perform cross-validation.
The LSTM and BILSTM algorithms achieve better performance, with accuracies of 93.53 and 93.75, whereas RBFN achieves 82.58% accuracy. However, when we compare the speed of LSTM, BILSTM, and RBFN, the radial basis function network runs much faster than the other two algorithms. We also observe that hyper-parameters significantly impact the performance of deep learning algorithms; each parameter has a positive effect on the deep learning architecture. However, the dataset plays a vital role.
In the future, we plan to explore a hybrid deep learning approach for software fault prediction. Other deep learning algorithms can be used to check the performance, and other benchmark datasets can be used for evaluation, as deep learning algorithms depend on the dataset. Because of the automatic feature extraction nature of deep learning, there is a need to use data with many features. We performed a statistical test (p-value) to assess the performance; in our case, we retain the null hypothesis, as we obtained a value greater than 0.05.
