Breast Cancer Detection and Classification using Deeper Convolutional Neural Networks based on Wavelet Packet Decomposition Techniques


Breast cancer is one of the most commonly diagnosed diseases in women. Early detection, a personalized treatment approach, and better understanding are necessary for cancer patients to survive. In this work, both a deep learning network and a traditional convolutional network have been employed on the Digital Database for Screening Mammography (DDSM) dataset. The breast cancer images are first subjected to background removal, followed by Wiener filtering and a Contrast Limited Adaptive Histogram Equalization (CLAHE) filter for image restoration. Wavelet Packet Decomposition (WPD) using the level-3 Daubechies wavelet (db3) is employed to improve the smoothness of the images. In the first part of the breast cancer recognition work, these preprocessed images are fed to deep convolutional neural networks, namely GoogleNet and AlexNet, trained with the ADAM, RMSPROP, and SGDM optimizers at learning rates of 0.01, 0.001, and 0.0001. As medical images necessitate discriminative features for classification, the pre-trained GoogleNet architecture extracts complicated features from the image and increases the recognition rate. In the latter part of this paper, Particle Swarm Optimization based Multi-Layer Perceptron (PSO-MLP) and Ant Colony Optimization based Multi-Layer Perceptron (ACO-MLP) are employed for breast cancer recognition using statistical features such as skewness, kurtosis, variance, entropy, contrast, correlation, energy, homogeneity, and mean, which are extracted from the preprocessed images. The performance of GoogleNet has been compared with AlexNet, PSO-MLP, and ACO-MLP in terms of accuracy, loss rate, and runtime; it achieves an accuracy of 99%, with a lower loss rate of 0.1547 and the lowest runtime of 4.14 minutes.


Introduction
Breast cancer is the most commonly occurring cancer in women, ahead of other cancer types such as lung & bronchus, colon & rectum, and uterine corpus [1,2]. Even with considerable advances in diagnosis and treatment methods, a large number of annual mortalities occur [2]. Long-term mortality mainly results from the size of the tumor [3], and the detection of breast cancers smaller than two centimeters is essential [4]. Identifying and treating breast cancer at an early stage is needed to reduce the death rate [5]. Mammography, which uses low-energy electromagnetic waves to image the breast, is ranked as the most efficient method for detection at an incipient stage. Hence, an accurate computer-aided classifier is needed to aid medical professionals in the effective recognition of normal, benign, and malignant breast tissue and to reduce wrong verdicts due to asthenopia, lethargy, or lack of skill.
In past years, many breast cancer recognition systems were developed, and the performance of computer-aided recognition has been verified on datasets taken from the UCI machine learning repository. The recognition rates of breast cancer using the optimized learning vector quantization (optimized-LVQ) method, the big-LVQ method, and an artificial immune system were 96.7%, 96.8%, and 97.2%, respectively [6]. The classification rate reached 94.74% accuracy for the C4.5 decision tree method with 10-fold cross-validation [7]. A fuzzy clustering method using supervisory control yielded 95.57%, and a neuro-fuzzy based classifier achieved a 95.06% accuracy rate [8]. Breast cancer classification accuracy was compared on the Wisconsin Breast Cancer (WBC) dataset using three classifiers, namely Multi-Layer Perceptron (MLP), Naïve Bayes (NB), and decision tree; a fusion classifier made of the MLP and J48 classifiers was found to be superior [9]. A 96% recognition rate was achieved for breast cancer classification with RIAC [10], and around 98.53% was achieved with Least Squares Support Vector Machines (LSSVM) [11]. The performance of LSSVM was compared in terms of accuracy, sensitivity, and specificity, and its high accuracy rate is due to the robust nature of LSSVM. A feature-reduction-based breast cancer decision system was explored using independent component analysis, and the resulting single-dimensional feature was sufficient to increase the radial basis function neural network accuracy up to 90.49% [12]. A BPA-based rough set method incorporated two phases: the first phase handled missing values to smooth the dataset, and the second phase correctly selected the appropriate attributes from the obscure clinical dataset. A Particle Swarm Optimization (PSO) based wavelet network was employed to detect abnormalities in breast cancer using texture energy measured from mammograms and achieved 93.671%, 92.105%, and 94.167% for accuracy, specificity, and sensitivity, respectively [13]. Detection of breast cancer using a deep belief network unsupervised path with Levenberg-Marquardt learning produced an accuracy of 99.68% on the WBC dataset [14].
A knowledge-based system using fuzzy logic was used for breast cancer recognition by employing expectation maximization for clustering the data, and the problem of multicollinearity was avoided using principal component analysis [15]. An Extreme Learning Machine classifier distinguished benign and malignant breast masses using fused deep features, morphological features, texture features, and density features, and achieved an 86.50% recognition rate.
In the first part of the proposed work, Digital Database for Screening Mammography (DDSM) dataset images are used for research, and these images are subjected to noise removal and image restoration using the Wiener filter, the CLAHE filter, and wavelet tree decomposition before feature segmentation. The reconstructed images are then fed to a Deep Convolutional Neural Network (DCNN) based GoogleNet model, which uses an inception approach for feature extraction in the training and testing phases. The performance of GoogleNet has been compared with AlexNet. In the second part, statistical features of the images, such as skewness, kurtosis, variance, entropy, contrast, correlation, energy, homogeneity, and mean, are extracted from the preprocessed images and fed as input to PSO-MLP and ACO-MLP for breast cancer recognition. Section II explains the preprocessing of breast cancer images using the Wiener filter, CLAHE filter, and Wavelet Packet Decomposition. Section III covers the proposed methodology, classifiers, and initialization parameters. Section IV discusses the results and the comparison of the different classifiers, and Section V concludes the impact of this work.

Preprocessing of Breast cancer Images
Digital Database for Screening Mammography (DDSM) dataset images are taken for the breast cancer recognition work. This database contains normal, malignant, and benign cancer images, with 650 pictures in each category. The DDSM images are preprocessed to remove the background and noise and to augment the contrast between the cancer cells and neighboring areas, which aids in localizing the Region of Interest (ROI). After feature extraction, the training and testing images are fed to the DCNN-based GoogleNet classifier.

Preprocessing of DDSM Images
Many preprocessing methods are reported in [16][17][18]. The preprocessing approach adopted in this work is shown in Fig. 1. The background is deleted from the image, as shown in Fig. 3(a), by removing the rows and columns of zero-intensity pixels. Gray thresholding based on Otsu's method is applied to the processed image, the intensity of the image is adjusted to mid-way between the minimum and maximum of the original intensity, and the resulting muscle-removed image is shown in Fig. 3(b).
The breast image is then fed to the Wiener filter to eliminate the noise present in the image. Once the spectral power is estimated, a mask is applied to all pixels of the original image with a signal-to-noise ratio (SNR) of 0.2. Using the mean and variance of the authentic images, a new mean, variance, and power are calculated for all pixels of the transformed images.

Wiener and CLAHE Filtering
The Wiener filter is characterized by a trade-off between inverse filtering and noise smoothing, and hence it can simultaneously remove additive noise and invert blurring.
The stochastic nature of the Wiener filter estimates the normalized image in a linear manner based on the orthogonality principle, as expressed in Equation (1).

W(f_1, f_2) = H*(f_1, f_2) S_xx(f_1, f_2) / ( |H(f_1, f_2)|^2 S_xx(f_1, f_2) + S_nn(f_1, f_2) )   (1)

where H(f_1, f_2) is the blurring filter, and S_xx(f_1, f_2) and S_nn(f_1, f_2) are the spectral powers of the normalized image and the noise, respectively. It is evident that the Wiener filter performs deconvolution (a high-pass operation) together with compression and noise removal (a low-pass operation). For contrast enhancement, CLAHE filtering is employed, as it has the flexibility of choosing a local mapping function for the histogram. CLAHE computes the histogram of a local window around each pixel and boosts the contrast. Fig. 3(c) and (d) show the breast cancer images after the Wiener filter and the CLAHE filter, respectively.
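To make the adaptive Wiener stage concrete, the sketch below (an editor's illustration, not the authors' code; the window size of 3x3 and the noise-variance estimate are assumptions) implements a local-statistics Wiener filter in the style of MATLAB's wiener2, shrinking each pixel toward its local mean in proportion to the local signal-to-noise ratio:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def adaptive_wiener(img, win=3, noise_var=None):
    """Local adaptive Wiener filter (wiener2-style sketch).

    win: assumed local window size; noise_var: noise power, estimated as
    the mean local variance when not supplied."""
    img = img.astype(float)
    local_mean = uniform_filter(img, win)
    local_sqr_mean = uniform_filter(img ** 2, win)
    local_var = local_sqr_mean - local_mean ** 2
    if noise_var is None:
        noise_var = local_var.mean()  # crude global noise estimate
    # Gain in [0, 1]: flat regions are smoothed, detailed regions preserved
    gain = np.maximum(local_var - noise_var, 0.0) / np.maximum(local_var, 1e-12)
    return local_mean + gain * (img - local_mean)
```

On a smooth image corrupted with additive Gaussian noise, the filtered output has a visibly lower mean squared error than the noisy input while edges with high local variance are left largely untouched.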
where φ_i(x − j) and φ_i(y − k) represent the wavelet decomposition in the m and n directions, respectively, i represents the scaling parameter, which is always greater than zero, and j and k represent the translation parameters in the x and y directions, respectively. Equations (3), (4), (5), and (6) represent the approximation coefficient, the detailed coefficient in the horizontal orientation, the detailed coefficient in the perpendicular orientation, and the detailed coefficient in the transverse orientation.
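As a minimal illustration of how one decomposition level splits an image into the four sub-bands described above, the following sketch uses the Haar wavelet (an editor's simplification; the paper uses Daubechies db3 with Wavelet Packet Decomposition, for which a library such as PyWavelets would normally be used):

```python
import numpy as np

def haar_decompose_2d(img):
    """One level of 2D wavelet decomposition using the Haar basis.

    Returns the approximation (LL) and the three detail sub-bands
    (LH, HL, HH), each at half the input resolution."""
    img = img.astype(float)
    # Pair adjacent rows: low-pass = average, high-pass = difference
    lo_r = (img[0::2] + img[1::2]) / 2
    hi_r = (img[0::2] - img[1::2]) / 2
    # Pair adjacent columns of each row band
    LL = (lo_r[:, 0::2] + lo_r[:, 1::2]) / 2   # approximation
    LH = (lo_r[:, 0::2] - lo_r[:, 1::2]) / 2   # horizontal detail
    HL = (hi_r[:, 0::2] + hi_r[:, 1::2]) / 2   # perpendicular detail
    HH = (hi_r[:, 0::2] - hi_r[:, 1::2]) / 2   # transverse (diagonal) detail
    return LL, LH, HL, HH
```

For a perfectly smooth region, all three detail sub-bands are zero and the approximation reproduces the region at half resolution, which is why thresholding the detail coefficients smooths the image.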
Training and testing are done on a 70:30 proportion of the preprocessed images in both networks.

GoogleNet
The GoogleNet model proposed in [19] is complex and goes deeper than all other CNN architectures. It is a 22-layer DCNN with an inception module, which concatenates filters of various dimensions and sizes into a single new filter, as shown in Fig. 7. Each layer acts as a filter: common features are detected in the starting layers, followed by discriminative feature identification in the final layers. The inception layer covers a large area while retaining fine resolution even for small details in an image. This property is maintained by convolving in parallel with filter sizes ranging from 1x1 to 5x5. Inception layers are employed with Gabor filters of various sizes in series, with learning capability.
The convolution and pooling layers reduce image features that are not necessary for training, thereby reducing the feature dimension and storage space. To avoid overfitting, nine inception modules are used in GoogleNet. The top layer has its own output layer with both joint and parallel training, and this subtlety helps achieve faster convergence in most cases.
In this work, GoogleNet, with a total of 144 layers, was retrained to classify breast cancer images by inserting four layers into its structure, namely a dropout layer with a 50% dropout probability, a fully connected layer, a softmax layer, and a classification-output layer with three outputs.

AlexNet
AlexNet is also a pre-trained network [20]; it has 11×11, 5×5, and 3×3 convolution layers, max pooling, dropout, and fully connected layers, with ReLU activation functions after every convolutional and fully connected layer. The dropout layer has a 50% dropout probability.
The fully connected layers are modified to three outputs, as there are three output classes. The SGDM optimizer swings towards the optimum along the steepest-descent path, and the magnitude of the swing can be limited by momentum [21]. Equation (7) gives the weight update in the SGDM optimization method.

V_{l+1} = V_l − α∇L(V_l) + β(V_l − V_{l−1})   (7)

where l is the iteration, α is the learning rate (greater than zero), V is the parameter vector, and L(V) is the loss function; ∇L(V) is calculated from the training data, and β indicates the contribution of the previous gradient step to the current iteration. Table 2 shows the parameters used for the SGDM optimizer for breast cancer recognition.
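The SGDM update of Equation (7) can be sketched as follows (an editor's illustration on a toy quadratic loss; the step size and momentum values are assumptions, not the paper's settings):

```python
import numpy as np

def sgdm_step(V, V_prev, grad, alpha=0.01, beta=0.9):
    """One SGDM (heavy-ball) update as in Equation (7):
    V_new = V - alpha * grad + beta * (V - V_prev)."""
    return V - alpha * grad + beta * (V - V_prev)

# Toy check: minimize L(V) = ||V||^2, whose gradient is 2V
V_prev = V = np.array([4.0, -3.0])
for _ in range(200):
    V, V_prev = sgdm_step(V, V_prev, 2 * V, alpha=0.05, beta=0.9), V
```

The momentum term β(V − V_prev) keeps the iterate moving along the previous direction, which damps the oscillation that plain gradient descent shows on elongated loss surfaces.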

Root Mean Square Propagation (Rmsprop) optimization
RMSProp overcomes a drawback of the SGDM optimizer, which uses the same learning rate for all parameters; RMSProp instead implements individual learning rates for the different parameters.
The weight update is done using Equation (9), which is normalized by the moving average of Equation (8), based on loss function optimization [22].

m_l = γ m_{l−1} + (1 − γ)[∇L(V_l)]²   (8)

where γ is the decay rate.

V_{l+1} = V_l − α∇L(V_l) / (√m_l + μ)   (9)

where μ is a small minimum value acting as a constant to prevent Equation (9) from becoming indefinite. This optimizer maintains an effective learning rate even for an extensive input database.
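A minimal sketch of the per-parameter normalization in Equations (8) and (9), again on a toy quadratic loss with assumed hyperparameter values (an editor's illustration):

```python
import numpy as np

def rmsprop_step(V, m, grad, alpha=0.01, gamma=0.9, mu=1e-8):
    """RMSProp update following Equations (8)-(9): m accumulates a decayed
    average of squared gradients (Eq. 8), and the step for each parameter is
    normalized by sqrt(m) + mu (Eq. 9), giving per-parameter learning rates."""
    m = gamma * m + (1 - gamma) * grad ** 2
    V = V - alpha * grad / (np.sqrt(m) + mu)
    return V, m

# Toy check: minimize L(V) = ||V||^2, gradient 2V
V, m = np.array([4.0, -3.0]), np.zeros(2)
for _ in range(500):
    V, m = rmsprop_step(V, m, 2 * V, alpha=0.05)
```

Because the denominator tracks the recent gradient magnitude, parameters with large gradients take proportionally smaller steps, which is the "various learning rates for the various parameters" behavior described above.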

Adaptive Momentum (Adam) optimization
Adam also follows RMSProp in using momentum for the weight update. Equations (10) and (11) are used to find the gradient parameter and the squared-gradient parameter.

m_l = σ_1 m_{l−1} + (1 − σ_1)∇L(V_l)   (10)

v_l = σ_2 v_{l−1} + (1 − σ_2)[∇L(V_l)]²   (11)

where σ_1 and σ_2 are the decay rates from the SGDM and RMSProp optimizer training functions.
Equation (12) is used for the weight update in the Adam optimizer, combining the averages of Equations (10) and (11).

V_{l+1} = V_l − α m_l / (√v_l + μ)   (12)

If the gradient remains constant over many iterations, the gradient's moving average allows the update to pick up momentum.
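The Adam update of Equations (10)-(12) can be sketched as below (an editor's illustration with assumed hyperparameters; the bias-correction terms are an addition from the original Adam formulation, not stated in this paper):

```python
import numpy as np

def adam_step(V, m, v, grad, t, alpha=0.01, s1=0.9, s2=0.999, mu=1e-8):
    """Adam update: momentum term m (Eq. 10), squared-gradient term v
    (Eq. 11), then the normalized weight step (Eq. 12). Bias correction
    (m_hat, v_hat) follows the original Adam paper."""
    m = s1 * m + (1 - s1) * grad
    v = s2 * v + (1 - s2) * grad ** 2
    m_hat = m / (1 - s1 ** t)
    v_hat = v / (1 - s2 ** t)
    V = V - alpha * m_hat / (np.sqrt(v_hat) + mu)
    return V, m, v

# Toy check: minimize L(V) = ||V||^2, gradient 2V
V, m, v = np.array([4.0, -3.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    V, m, v = adam_step(V, m, v, 2 * V, t, alpha=0.05)
```

Adam thus couples the SGDM-style momentum of Equation (10) with the RMSProp-style normalization of Equation (11), which is why the text describes it as using the averages of both.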

Training of GoogleNet and AlexNet

Multilayer Perceptron
A multilayer perceptron (MLP) is one of the feed-forward artificial neural networks (ANNs). The MLP uses nonlinear activations for all nodes except the input nodes and implements backpropagation learning during training, with the capability to distinguish nonlinearly separable data. The optimization process involved in MLP training is finding the weights that minimize the mean square error. Backpropagation algorithms based on the gradient descent method get locked in local minima, as their efficiency depends on the initial weights and bias values, and numerous iterations are required to tune the learning rate. To obtain the optimal weights of MLP networks, evolutionary approaches such as Particle Swarm Optimization (PSO) [23] and the Ant Colony algorithm [24] are implemented, as fewer parameters need to be adjusted for good convergence.

Particle Swarm Optimization based Multi Layer Perceptron
The Particle Swarm Optimization (PSO) algorithm, proposed by Kennedy and Eberhart in 1995, is modeled on the flying patterns of flocking birds. Each particle of PSO is characterized by a position and a velocity. If the swarm size is taken as m and the current instant as t, then each particle 1 ≤ i ≤ m has a position x_i(t) ∈ ℝ^n and a velocity v_i(t) ∈ ℝ^n. Together, position and velocity control the location and direction of the swarm in PSO.
Every particle, as well as the whole swarm, holds its best position in memory as p_i(t) ∈ ℝ^n and ĝ(t) ∈ ℝ^n, respectively. Equations (13) and (14) describe the update of the velocity and the new position of each particle.

v_i(t+1) = ω v_i(t) + c_1 r_1 (p_i(t) − x_i(t)) + c_2 r_2 (ĝ(t) − x_i(t))   (13)

x_i(t+1) = x_i(t) + v_i(t+1)   (14)

The inertia weight ω decreases linearly from ω_max to ω_min over the generations, where ω_max and ω_min are the maximum and minimum inertia momentum, t is the current generation, and t_max is the maximum generation.
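The velocity and position updates of Equations (13) and (14), with a linearly decreasing inertia weight as suggested by the ω_max/ω_min parameters above, can be sketched on a toy objective (an editor's illustration; swarm size, iteration count, and acceleration constants are assumptions, not the Table 1 settings):

```python
import numpy as np

def pso_minimize(f, dim=2, m=20, t_max=100, c1=2.0, c2=2.0,
                 w_max=0.9, w_min=0.4, seed=0):
    """Minimal global-best PSO sketch using Equations (13)-(14)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (m, dim))              # particle positions
    v = np.zeros((m, dim))                        # particle velocities
    p = x.copy()                                  # personal best positions
    p_val = np.array([f(xi) for xi in x])
    g = p[p_val.argmin()].copy()                  # global best position
    for t in range(t_max):
        w = w_max - (w_max - w_min) * t / t_max   # decreasing inertia weight
        r1, r2 = rng.random((m, dim)), rng.random((m, dim))
        v = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)   # Eq. (13)
        x = x + v                                           # Eq. (14)
        vals = np.array([f(xi) for xi in x])
        better = vals < p_val
        p[better], p_val[better] = x[better], vals[better]
        g = p[p_val.argmin()].copy()
    return g, p_val.min()
```

In PSO-MLP, f would be the MLP's mean square error as a function of its flattened weight vector; here a sphere function stands in for it.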

Ant Colony Optimization based Multi-Layer Perceptron
The ACO algorithm is modeled on the foraging behavior of ants finding the optimum path between a food source and their nest. ACO can resolve complex optimization problems when combined with other methods. The pheromone trajectory is updated based on the transition probability and the pheromone quantity in each region. A fitness value is calculated for the global ants produced in each iteration. On updating the pheromone, if the fitness improves, the local ant is moved to the better region; otherwise, the search is directed in a new direction. ACO continues its local and global search by updating pheromone and evaporating it. Equation (16) expresses the transition probability of region r, which is a measure of a local ant's capability to move into a concealed region.

P_r(t) = τ_r(t) / Σ_{s=1}^{n} τ_s(t)   (16)

where τ_r(t) is the total pheromone in region r, and n is the number of global ants.
Equation (17) is used to update the pheromone.

τ_r(t+1) = τ_r(t) + Q / L_k   (17)

where L_k is the cost function of the k-th path, and Q is a pheromone update constant; the deposited pheromone is also subject to evaporation, as noted above.
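The two core ACO operations described by Equations (16) and (17) can be sketched as follows (an editor's illustration; the evaporation rate rho is an assumed value, since the text mentions evaporation without giving a symbol):

```python
import numpy as np

def transition_probabilities(tau):
    """Equation (16): the probability of moving into region r is its
    pheromone share of the total over all regions."""
    tau = np.asarray(tau, dtype=float)
    return tau / tau.sum()

def update_pheromone(tau, r, L, Q=1.0, rho=0.1):
    """Equation (17) with evaporation: all pheromone decays at rate rho
    (assumed), and the visited region r is reinforced by Q / L, where L is
    the path cost, so cheaper paths receive more pheromone."""
    tau = np.asarray(tau, dtype=float) * (1 - rho)
    tau[r] += Q / L
    return tau
```

Regions with more pheromone attract proportionally more ants, while evaporation prevents early, possibly poor, paths from dominating the search forever.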

3.2.3 Feature Extraction and Training of PSO-MLP and ACO-MLP
Statistical features such as mean, skewness, kurtosis, variance, entropy, contrast, correlation, energy, and homogeneity are extracted from the preprocessed image and fed as input to PSO-MLP and ACO-MLP for breast cancer recognition.
Mean indicates the brightness of the image, calculated using Equation (19), where ΣΣ p(i,j) is the summation of the pixel values and (m×n) is the image size.
Skewness reflects the disproportionate pixel-value distribution, used to measure the dark lustrous portion, calculated using Equation (20).
Kurtosis measures the peakedness relative to a normal distribution, calculated using Equation (21). Variance is a measure of the contrast of the image, calculated using Equation (22). Entropy characterizes the randomness and texture of the image, calculated using Equation (23).
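The first-order features above can be computed directly from the pixel values, as in the sketch below (an editor's illustration; the 256-bin histogram for the entropy is an assumption, and GLCM-based features such as contrast, correlation, energy, and homogeneity would be computed separately from a co-occurrence matrix):

```python
import numpy as np

def first_order_features(img):
    """Mean, variance, skewness, kurtosis, and entropy of an image,
    matching the statistical features fed to PSO-MLP and ACO-MLP."""
    p = img.astype(float).ravel()
    mean = p.mean()
    var = p.var()
    std = np.sqrt(var) + 1e-12        # guard against division by zero
    skew = np.mean((p - mean) ** 3) / std ** 3
    kurt = np.mean((p - mean) ** 4) / std ** 4
    hist, _ = np.histogram(p, bins=256)
    q = hist / hist.sum()             # normalized histogram
    q = q[q > 0]
    entropy = -np.sum(q * np.log2(q))
    return {"mean": mean, "variance": var, "skewness": skew,
            "kurtosis": kurt, "entropy": entropy}
```

A perfectly uniform image has zero variance and zero entropy, while a textured region scores high on both, which is what makes these features discriminative for the classifier.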

contrast = (maximum intensity − minimum intensity) / (maximum intensity + minimum intensity)   (24)

Correlation: a filter mask is moved over the image, and the sum of products is evaluated in each region; it is used to detect an object in the image irrespective of its position and is calculated using Equation (25). The structure used for PSO-MLP and ACO-MLP for breast cancer images is shown in Fig. 8, and the parameters used for PSO-MLP and ACO-MLP are listed in Table 1.

IV. Experimental Results and Discussion
During the training process, the GoogleNet, AlexNet, PSO-MLP, and ACO-MLP classifiers are trained to categorize the DDSM breast cancer images into three categories, namely normal, benign, and malignant, using their extracted features and corresponding labels. The test images, which are breast cancer images under diagnosis, are classified using these trained networks. The performance of these networks has been compared with the highest recognition rate of the pretrained networks, produced by AlexNet with the ADAM optimizer at a learning rate of 0.001. Fig. 11 shows the recognition rates of PSO-MLP and ACO-MLP.

Receiver Operating Characteristic curves of GoogleNet and AlexNet
The Receiver Operating Characteristic (ROC) curves estimate the performance of GoogleNet and AlexNet in breast cancer recognition by establishing a relationship between true positive rate and false positive rate.


Fig. 3 Breast cancer image before and after preprocessing: (a) background-removed image, (b) muscle removed with breast alone present, (c) Wiener-filtered image, (d) image after applying the CLAHE filter

Equations (3) to (6) generate the coefficients for the sub-bands: one low-frequency coefficient in both the horizontal and perpendicular directions (L-H, L-P), and three high-frequency coefficients, namely low frequency in the horizontal and high frequency in the perpendicular direction (L-H, H-P), high frequency in the horizontal and low frequency in the perpendicular direction (H-H, L-P), and high frequency in both the horizontal and perpendicular directions, called the transverse orientation (H-H, H-P). Fig. 4 represents the wavelet packet decomposition of the image and its hierarchy for level 2.

3.1.3 Training Optimization

The DCNN networks are trained with the stochastic gradient descent with momentum (SGDM) optimization algorithm, the Root Mean Square Propagation (RMSProp) optimizer, and the Adaptive Momentum (Adam) optimizer, along with transfer learning, as this does not necessitate a very large database for achieving high accuracy. The learning rate is used to control the weight update based on the error value. A low learning rate may result in a long training time or in training getting stuck, whereas a high value may yield a poor weight value, resulting in instability. In training optimization, SGDM, ADAM, and RMSProp update their descent weights based on the loss function value of each iteration.

The correlation of Equation (25) is computed as c(x, y) = Σ_{i=−a}^{a} Σ_{j=−b}^{b} w(i, j) p(x+i, y+j), where w(i, j) is the filter mask. Energy measures the gray-level distribution based on the normalized image histogram and is calculated using Equation (26). Homogeneity describes how pixel values vary in the image and is calculated using Equation (27).

Figs. 9 and 10 show the performance, i.e., the accuracy and loss function value at a learning rate of 0.001, of GoogleNet and AlexNet, respectively. The graphs are shown for a single epoch only, for better visualization. The performance of both the GoogleNet and AlexNet classifiers varies with the learning rate, and the highest accuracy is achieved at a low learning rate. It is evident from Fig. 9 that GoogleNet achieves 98.23% accuracy in the first epoch for the ADAM optimizer and then becomes oscillatory. It remains stable until the maximum iteration for a low learning rate of 0.001, and the loss function is greatly reduced and remains constant after the 50th iteration. From Fig. 10, 96.19% accuracy is achieved by AlexNet only for the ADAM optimizer. Both GoogleNet and AlexNet have a low loss function value around the 50th iteration and remain stable for the ADAM optimizer. Neither GoogleNet nor AlexNet shows a stable learning process at a high learning rate.

Figure 12 shows the ROC curves of GoogleNet and AlexNet at the 0.0001 learning rate for the ADAM, RMSPROP, and SGDM optimizers. Both GoogleNet and AlexNet show a stable learning process at the 0.0001 learning rate, along with slow convergence. Evaluation metrics, including the sensitivity, reached around 98% to 100% for both DCNNs. There is a complete trade-off between sensitivity and specificity for both networks. Both GoogleNet and AlexNet act as perfect classifiers with the ADAM optimizer, with a true positive rate of one and a false positive rate of zero.

Table 1 .
Parameter Setting for PSO-MLP and ACO-MLP

Table 2
GoogleNet Performances for Breast Cancer Image for Training and Testing

Table 3 AlexNet Performances for Breast Cancer Image for Training and Testing

During training, the maximum accuracy is recorded at 98.23% for GoogleNet; in contrast, it is 97.97% for the testing data, these being the recognition accuracies of GoogleNet and AlexNet for training with ADAM as the optimizer and a learning rate of 0.001. During testing, GoogleNet yields an accuracy of 99% for the 0.001 learning rate with the ADAM optimizer. For AlexNet, the accuracy achieved is 98.91% for the 0.001 learning rate with the RMSPROP optimizer, which is slightly higher than the 98.80% for ADAM at the same learning rate. The performance is also measured in terms of runtime: GoogleNet records the lowest runtime of 4.14 minutes for the 0.001 learning rate with ADAM and the highest runtime of 5.56 minutes for SGDM with the 0.01 learning rate, whereas AlexNet records its lowest time of 4.71 minutes with the ADAM optimizer.

Comparison of Breast Cancer Recognition of GoogleNet with PSO/ACO-MLP

The PSO-MLP and ACO-MLP classifiers are trained and tested on a 70:30 proportion of the preprocessed images. The recognition rates of PSO-MLP and ACO-MLP are listed in Table 4.

Table 4
Comparison of GoogleNet Performances over PSO-MLP and ACO-MLP