Multi-path Convolutional Neural Network to Identify Tumorous Sub-classes for Breast Tissue from Histopathological Images

Malignancy is one of the leading causes of death. It is on the rise in both developed and low-income countries, with survival rates of less than 40%. However, early diagnosis may increase survival chances. Histopathology images acquired from biopsies are a popular basis for cancer diagnosis. In this work, we propose a deep convolutional neural network-based method that helps classify breast cancer tumor subtypes from histopathology images. The model is trained on the BreakHis dataset but is also tested on images from other datasets. The model is trained to recognize eight different tumor subtypes, and also to perform binary classification (malignant/non-malignant). The CNN model combines an encoder–decoder architecture and a parallel feed-forward network with an attention mechanism. The proposed model provides state-of-the-art scores. Compared with other models, the accuracy of the proposed model is higher at different magnification and patient levels. The implementation is available at github.com/rangan2510/Residual_Unet


Introduction
Cancer is one of the leading causes of death globally, particularly in developed countries [1,2]. In the United States, breast cancer incidence increased by 0.3% per year between 2012 and 2016 [1]. The situation in low- and middle-income countries is also alarming. For instance, in the Indian subcontinent, according to the National Institute of Cancer Prevention and Research (NICPR) under the Indian Council of Medical Research (ICMR) [3], around 2.25 million cases have been under treatment since 2018. Every year, around 750 thousand deaths are reported, and 11.6 million new cases are registered. The statistics on individual cancer types are also disturbing. Breast cancer cases make up 14% of the overall number of malignant cases, with 160 thousand new cases registered in 2018. In rural areas, 1 in 60 women develop breast cancer, whereas in urban areas, 1 in 22 do so. Early detection and quick diagnosis of breast cancer play an important role in increasing the survival rate. It has been observed that the survival rate of breast cancer is around 50% of the diagnosed cases. Internationally, non-invasive mammography is highly popular for the early detection of breast cancer, and it has also gained interest in computational automation. Furthermore, histopathological images [4] from invasive sampling through biopsy are extensively utilized to diagnose malignant cases, including breast cancer. There are two major drawbacks. Firstly, identifying the affected zones in a diagnosed malignancy is highly subjective and depends on the expertise of the pathologist. Secondly, it is extremely time-consuming. Therefore, machine learning techniques can be used to make the diagnosis procedure much more efficient. Moreover, designing comprehensive machine learning techniques can reduce human error as well as the time taken for manual diagnosis. Recently, deep learning-based methods have been used extensively for image classification tasks.
Therefore, such models can help pathologists make precise diagnoses much faster. In this article, we propose a method based on deep learning [5] for diagnosing the tumorous sub-types of breast tissue. The proposed model implements a multi-path convolutional neural network that uses a U-Net-like encoder-decoder architecture as well as a simple feed-forward CNN running in parallel. The encoder part makes use of residual blocks, which have been extensively used in deep learning [6]. U-Net, on the other hand, is used for biomedical image segmentation; it makes use of an encoder-decoder-like architecture to extract relevant features from the images. The proposed network is trained on the BreakHis data-set [7], which contains histopathological images of benign and malignant tumors along with four sub-classes of each tumor type. Hence, the provided tool can classify eight different types of tumors irrespective of the magnification levels of the microscopic images. The model has also been trained to do binary classification. Furthermore, a blind test has been done to ensure that the model can classify images from outside the training dataset.

Contributions
The notable contributions of this work are as follows:
• The proposed classifier is developed by combining multiple CNN architectures. The model has a feed-forward network, constructed using a convolution block attention module (CBAM), that runs parallel to an encoder-decoder architecture. The encoder-decoder architecture is based on the U-Net, and its encoder is designed using residual connections. The model outperforms multiple state-of-the-art CNN models in the task of classifying the different sub-types of breast tumors.
• The model makes use of skip connections and parallel paths that facilitate gradient flow, enabling easier optimization.
• 1 × 1 convolutions are used to reduce the feature maps before the more expensive operations in the skip connections of the U-Net. This keeps the network much faster and reduces the number of trainable parameters.
• Transfer learning has been used to train the residual blocks of the encoder part of the U-Net. This significantly improved the training performance of the network.
• The generalization ability of the model has been tested extensively on various breast cancer histopathology datasets: the model is trained on one dataset and then evaluated on different datasets to simulate real-life performance. Here, the proposed method outperforms other state-of-the-art models by a significant margin.
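The 1 × 1 reduction mentioned above can be sketched as follows. This is a minimal illustration (layer sizes are our assumptions, not the paper's exact configuration) of why reducing channels with a 1 × 1 convolution before a 3 × 3 convolution saves parameters:

```python
import torch
import torch.nn as nn

# Hypothetical channel counts for illustration: reduce a 256-channel skip
# feature map to 64 channels before the more expensive 3x3 convolution.
reduce = nn.Conv2d(256, 64, kernel_size=1, bias=False)   # 256*64 = 16,384 weights
conv3x3 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)  # 64*64*9 = 36,864

# A direct 3x3 convolution from 256 to 64 channels would instead need
# 256*64*9 = 147,456 weights, roughly 2.8x more than the two layers combined.
x = torch.randn(1, 256, 56, 56)   # feature map from an encoder stage
y = conv3x3(reduce(x))
print(y.shape)                    # spatial size is preserved
```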

Related Works
For more than forty years, automatic image processing for cancer diagnosis has been an active research topic [8].
Before deep learning, multiple attempts had been made at the classification and grading of breast cancer using methods such as support vector machines (SVMs) [9,10], random forests (RF) [11,12], principal component analysis (PCA) [13], and so on. Feature extraction was done using the scale-invariant feature transform (SIFT) [14], local phase quantization (LPQ), local binary patterns (LBP), and other similar methods [7]. Most such methods require laborious feature engineering. Kowal et al. [15] compared and tested different algorithms for nuclei segmentation and further classified the images as either benign or malignant. This binary classification was done on a dataset of 500 images, and accuracy levels of 96.5% were reported. However, the feature selection was done manually and only 699 images were used. Filipczuk et al. [16] proposed a method that makes use of four different classifiers and a 25-dimensional feature vector to perform binary classification on cytological images of fine needle biopsies. In this case, around 98.5% accuracy was achieved on a dataset containing only 737 images. Similarly, George et al. [17] proposed a method that involves tedious manual feature extraction followed by classification using SVMs, probabilistic neural networks, multi-layer perceptrons, and learning vector quantization on a dataset of only 92 images. Doyle et al. [18] make use of textural image features to develop an SVM for grading breast cancer. However, most of these methods were limited by the availability of large, open, annotated datasets.
Over the last few decades, significant effort has been made in recognizing breast cancer from histopathological images using deep learning. Convolutional neural networks (CNNs) have been primarily used for the detection and classification of breast cancer [4]. Spanhol et al. [7] introduced a data-set called BreakHis that provides 7909 images, acquired from 82 patients. Not only is this dataset larger than the ones previously available, but it also comes with 8 different tumor subtypes: 4 for benign tumors and 4 for malignant tumors. Multi-class classification also has more clinical value. This dataset has enabled the development of multiple deep learning approaches. Teresa et al. [19] combined a CNN with an SVM to achieve an accuracy of 77.8% for four classes and 83.3% for binary classification. Bayramoglu [20] proposed a CNN architecture with an average accuracy rate of 82.13%. Spanhol et al. [21] used the popular CNN architecture LeNet and achieved around 72% accuracy. Han et al. [22] proposed a class structure-based deep convolutional neural network (CSDCNN) that achieved around 93.2% accuracy on the BreakHis dataset. Using the same dataset, [23] proposed a deep manifold preserving autoencoder that performs binary classification and achieved an average of 89.51% on image-level classification, whereas a more recent dependency-based lightweight convolutional neural network (DBLCNN) [24] achieves an image-level accuracy of 96.0%. However, none of these works verify the classification performance of the model on breast cancer histopathology images beyond the BreakHis dataset.
Previously, residual blocks have been used in U-Nets for segmentation tasks in other domains [25][26][27]. Newer models that use residual connections include the ResNeXt [28] model and the Wide Residual Network [29,30]. Variants of ResNeXt have also been used in the classification of breast cancer histopathology images. Newer classification models include RegNet [31], EfficientNet [32] and Vision Transformers [33], all of which are used for comparison in this paper.

Data Description
The Breast Cancer Histopathological database, or BreakHis [34], comprises images of benign and malignant breast tumors that were collected by surgical open biopsy (SOB). A key feature of the BreakHis dataset is the diverse classes of both malignant and benign tumors that it contains. The histopathology study and labeling were done by pathologists of the P&D Lab, preserving the anonymity of all the images. The digitized images of the breast tissue biopsy slides were obtained with an Olympus BX-50 system microscope having a relay lens with a magnification of 3.3x, coupled with a Samsung SCC-131AN digital color camera. The dataset consists of 7909 images of breast cancer tissue, collected from 82 patients at 4 different magnification levels (40×, 100×, 200×, 400×). Table 1 contains details of how the images are distributed among the 4 magnification levels. It contains 2480 benign and 5429 malignant images in PNG format with 700 × 460 pixels, 3-channel RGB, and 8-bit depth in each channel. The 2 types are further subdivided into 4 categories each, i.e., four benign and four malignant tumor sub-types, which are listed in Table 2.

Methodology
The proposed method makes use of a deep learning [5] model that is built upon two popular CNN architectures: ResNet [35] and U-Net [36]. For the network, we have modified the U-net architecture and incorporated residual blocks for classification of breast cancer histopathological images.
The following subsections explain the methodology.

Network Architecture
The proposed network architecture is made up of two primary components: a feed-forward network that uses an attention mechanism and an encoder-decoder architecture that is based on the U-Net model. These two feature extractors run in parallel and the final classification of the images is done from the aggregated feature maps of these two models. The model is illustrated in Fig. 4.

Attention-Based Feed-Forward Module
The feed-forward network uses sequential convolutions along with a convolution block attention module (CBAM) [37] with dilated convolutions. The CBAM module uses both spatial and channel attention mechanisms, which boost the model performance. In the spatial attention module (SAM), the convolution operation is performed using a 3 × 3 kernel with a dilation rate of 2. The dilated kernel helps reduce the computational cost of the spatial attention mechanism. In the channel attention module (CAM), we apply log-sum-exponent pooling and global average pooling to the input features and then pass each 1-D vector to a multi-layer perceptron (MLP). The outputs of the MLP for the two vectors are summed element-wise and then passed through a ReLU function. We use log-sum-exponent pooling instead of max pooling as it helps preserve edge details and also helps deal with noise. The output of the CBAM is then stacked with the output of the encoder-decoder architecture that runs in parallel and is passed through two more layers of convolution.
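The attention module described above can be sketched in PyTorch as follows. This is a hedged illustration, not the paper's exact implementation: the reduction ratio, the sigmoid gate in the spatial branch, and the mean/max pooling feeding the dilated convolution are common CBAM choices that we assume here; the log-sum-exp pooling, the element-wise MLP-output sum with ReLU, and the dilated 3 × 3 spatial convolution follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention with log-sum-exp and global average pooling,
    MLP outputs summed element-wise and passed through ReLU (per the text)."""
    def __init__(self, channels, reduction=16):  # reduction ratio is assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))                      # global average pooling -> (b, c)
        lse = torch.logsumexp(x.flatten(2), dim=2)    # log-sum-exp pooling  -> (b, c)
        attn = F.relu(self.mlp(avg) + self.mlp(lse))  # element-wise sum, then ReLU
        return x * attn.view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention via a 3x3 convolution with dilation rate 2."""
    def __init__(self):
        super().__init__()
        # 2 input channels: channel-wise mean and max maps (assumed, as in CBAM)
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=2, dilation=2)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

cbam = nn.Sequential(ChannelAttention(64), SpatialAttention())
out = cbam(torch.randn(2, 64, 56, 56))
print(out.shape)  # attention reweights features; shape is unchanged
```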

Encoder-Decoder Module
The encoder-decoder module is a modification of the standard U-Net architecture. In the standard U-Net architecture, the encoder part uses successive convolutions, followed by 2 × 2 pooling for the down-sampling operation. In the proposed model, the successive convolution and pooling layers in each step of the encoder are replaced with a custom residual block. Every residual block, as illustrated in Fig. 3, has two convolution layers. These residual blocks have the same components as the ResNet18 architecture, allowing us to perform transfer learning. However, the forward pass through the layers is performed differently. This allows us to have improved performance while using the same weights as the original ResNet18 model. In the residual block, we use two skip connections. In the first convolution layer of every residual block, the convolution operation is implemented with a stride value of 2. This reduces the size of a K × K feature map to K/2 × K/2. Hence, we need to perform identity downsampling. This is done differently for each of the two skip connections. The first skip connection uses max pooling, which halves the image dimensions. The second skip connection uses a strided convolution operation with batch normalization. This dense connectivity aids information propagation through the network and removes the dependence on pooling operations alone for downsampling, as pooling operations can be lossy [38]. Using these residual blocks enables us to simultaneously extract features and compress the image representation. The encoder uses 4 residual blocks, and the input image is down-sampled 5 times by a factor of 2. The number of stages was determined experimentally. At the end of the contractive part, the spatial dimension of the feature map is 7 × 7.
The residual blocks that are used in the residual U-Net follow the same architecture as those implemented in ResNet18. Each residual block contains two convolution operations each followed by batch normalization. So, for the first down-sampling step, convolution is performed using a 7 × 7 kernel with stride 2. For the second step of down-sampling, a simple max-pooling operation is done. From the next step onward, the residual blocks are used for down-sampling. The first residual block is slightly different from the rest as it does not reduce the spatial dimension of the feature maps. The convolution operation is performed with 3 × 3 kernels with stride 1 and padding 1 in the first residual block. In the following residual blocks, the stride is increased to 2 for the first convolution operation in the block. This is where the down-sampling takes place.
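The dual-skip residual block described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: two 3 × 3 conv+BN layers (the first with stride 2), one skip through max pooling and one through a strided convolution with batch normalization. The 1 × 1 projections on the skip paths and the channel counts are our assumptions.

```python
import torch
import torch.nn as nn

class DualSkipResidualBlock(nn.Module):
    """Residual block with two skip connections for identity downsampling:
    one via max pooling, one via a strided convolution with batch norm."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # main path: two 3x3 convolutions, each followed by batch norm;
        # the first convolution uses stride 2 (K x K -> K/2 x K/2)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # skip 1: max pooling halves the spatial dimensions
        # (the 1x1 projection to match channels is an assumption)
        self.skip_pool = nn.Sequential(
            nn.MaxPool2d(2), nn.Conv2d(in_ch, out_ch, 1, bias=False))
        # skip 2: strided convolution with batch normalization
        self.skip_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # both skip paths are added back before the final activation
        return self.relu(out + self.skip_pool(x) + self.skip_conv(x))

block = DualSkipResidualBlock(64, 128)
y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # spatial dims halved, channels doubled
```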
In the decoder part, up-sampling is done 5 times. After the feature maps of the decoding stage are concatenated with those of the encoding stage, a 3 × 3 convolution is performed. This convolution operation is also used to reduce the number of feature maps. After passing the feature maps through a ReLU activation layer, they are upsampled using bilinear interpolation. After the upsampling operation, another convolution operation is performed before the feature maps are again concatenated.
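One decoding step, as described above, might look like the following sketch (the channel counts are illustrative assumptions): concatenate, reduce with a 3 × 3 convolution, apply ReLU, upsample bilinearly, then convolve again.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    """One U-Net decoding step per the text: concatenate decoder features with
    the matching encoder features, reduce channels with a 3x3 convolution,
    apply ReLU, upsample bilinearly, then convolve again."""
    def __init__(self, dec_ch, enc_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(dec_ch + enc_ch, out_ch, 3, padding=1)
        self.post = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, dec, enc):
        x = torch.cat([dec, enc], dim=1)         # skip connection from encoder
        x = F.relu(self.reduce(x))               # 3x3 conv reduces feature maps
        x = F.interpolate(x, scale_factor=2,     # bilinear up-sampling
                          mode="bilinear", align_corners=False)
        return self.post(x)                      # conv before next concatenation

step = DecoderStep(dec_ch=256, enc_ch=128, out_ch=128)
out = step(torch.randn(1, 256, 14, 14), torch.randn(1, 128, 14, 14))
print(out.shape)  # spatial dims doubled, channels reduced
```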
For the final classifier part, the output of the decoder part is passed through two layers of convolution, followed by average pooling and a fully connected layer. The number of nodes in the final layer is set based on the required number of classes.

Ablation Study
In an encoder-decoder-based architecture, the performance of the model depends on the number of encoding and decoding steps used. In this case, the depth of the U-Net structure needed to be assessed to determine the number of steps that provided the highest accuracy. An ablation study was done to find the required number of encoding and decoding steps. Initially, 3 steps of encoding and 3 steps of decoding were used, which provided an accuracy of 89.41% on 8 classes. Using 4 steps gave the maximum accuracy of 98.65%, while using 5 steps reduced the accuracy again to 89.67%. Furthermore, tests were performed to assess the effectiveness of the 1 × 1 convolution operations, which were used before the concatenation operations to reduce the number of feature maps. The results of the study are summarized in Table 3.

Training and Testing
Training and testing were done separately in two phases. In the first phase, the standard tests that accompany the BreakHis dataset were performed according to the 5-fold validation structure described in [7]. For each of these folds, the set of 7909 images is split into two halves: a training set containing 5211 images and a testing set consisting of 2698 images. In the second phase, the model was trained again from scratch and was compared to other methods for the 8-class classification. Furthermore, the performance for 2-class classification was also explored.
During training, the images were resized and augmented. The images were resized to 224 × 224 pixels directly, without cropping. Augmentation techniques used include random rotation and flipping. Since the encoder part of the network makes use of residual blocks that are similar in architecture to those of ResNet18, they were initialized with the ImageNet weights, as shown in Fig. 4. For each of these folds, the model is trained for 200 epochs with a decaying learning rate and varying up-sampling.
Fig. 4 The architecture of the proposed model. The feed-forward network is a sequential network with a convolution block attention module. A copy of the input is also sent to the U-Net. The encoder part of the U-Net is initialized with the ImageNet weights of ResNet-18. The output of the decoder is appended to the output of the second convolution operation of the feed-forward network.
To address the minor class imbalance, the loss function used during training is the focal loss. For binary labels this is given as FL(p_t) = −(1 − p_t)^γ log(p_t), where p_t = p if y = 1 and p_t = 1 − p otherwise, y is the binary class label indicator (0 or 1), and p is the predicted probability.
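A minimal focal-loss implementation consistent with the definition above might look like this. The γ and α values here are the common defaults from the focal-loss literature, not necessarily those used by the authors:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t),
    where p_t = p for y = 1 and p_t = 1 - p for y = 0.
    gamma/alpha defaults are assumptions, not the paper's settings."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)
    # BCE-with-logits gives -log(p_t) in a numerically stable way
    ce = F.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none")
    return (alpha * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([2.0, -1.5, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])
loss = focal_loss(logits, targets)
print(loss)
```

Note how the (1 − p_t)^γ factor down-weights well-classified examples, which is what makes the loss useful under class imbalance.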
The mean loss for every batch of the epoch was monitored, and the weights of the model were updated after an epoch only if the mean loss improved after training. The best model saved during training was used for testing. The entire process of acquiring the model weights during training is described in Algorithm 1.
The results of the 5-fold validation are provided in Table 4. We have used four different performance metrics: accuracy, precision, recall, and F1-score.
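The four metrics can be computed with scikit-learn as below. The labels here are made up for illustration; macro averaging across the eight classes is our assumption, as the averaging scheme is not stated in the text:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy 8-class predictions for illustration only (not real results)
y_true = [0, 1, 2, 2, 3, 3, 4, 5, 6, 7]
y_pred = [0, 1, 2, 1, 3, 3, 4, 5, 6, 6]

acc = accuracy_score(y_true, y_pred)
print("accuracy :", acc)
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("f1-score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
```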
A similar training strategy was adopted for the second phase. In this case, the dataset was divided into a training set and a validation set. From Tables 1 and 2, it can be seen that the dataset suffers from a massive class imbalance problem. Here, the Ductal Carcinoma (DC) class has the largest number of images, at 3451. Therefore, an augmented dataset was created where each class has 3451 images, totaling 27,608 images. This augmentation was done by simply creating copies of images in the other classes. The copies were also randomly rotated and flipped. During training, 80% of these image files were used as the training set, while the remaining 20% was used for validation of the trained model.
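The balancing step above amounts to oversampling every class up to the size of the largest one. A sketch of the index-level plan (the per-class counts shown are the malignant classes, which sum to the dataset's 5429 malignant images; each duplicate would then be randomly rotated/flipped at load time):

```python
import random

def oversample_indices(class_counts, seed=0):
    """For each class, duplicate randomly chosen image indices until the
    class matches the largest one (3451 for DC, per the text)."""
    rng = random.Random(seed)
    target = max(class_counts.values())
    plan = {}
    for cls, n in class_counts.items():
        extra = [rng.randrange(n) for _ in range(target - n)]  # duplicated images
        plan[cls] = list(range(n)) + extra
    return plan

plan = oversample_indices({"DC": 3451, "LC": 626, "MC": 792, "PC": 560})
print({c: len(v) for c, v in plan.items()})  # every class now has 3451 entries
```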
On this set of 27,608 images, the model was initially trained for 8-class classification. After training for 250 epochs, the accuracy on the validation set reached its maximum value and plateaued. The varying up-sampling technique, the optimizer settings, and the learning rate value and decay were kept the same. Furthermore, other models were separately trained on this data-set with the same hyper-parameters. This allowed us to perform a direct comparison of the proposed model with the state-of-the-art models.

Performance on BreakHis
We performed multiple evaluations to show the performance of the proposed model on the BreakHis dataset and the generalization ability of each of these models. Since the BreakHis dataset contains eight classes of images, which is the highest in any publicly available dataset, it allows us to train models that can be used on other breast cancer histopathology datasets as well. We evaluate the performance of the method on the 5-fold validation sets that are defined in [7]. The performance values are shown in Table 4. Figure 5 shows how the CBAM module helps in extracting relevant features from challenging images. The activation maps acquired by adding the attention module include more edge details that are otherwise absent from the corresponding feature maps.
Following [4], we perform image-level classification for each magnification level, and we extend the evaluation to three more datasets. For image-level classification, we checked the classification performance of images belonging to different magnification levels. BreaKHis contains four levels of magnification: 40×, 100×, 200× and 400×. For each magnification level, the classification accuracy is reported. Table 5 contains the accuracy of the eight-class classification problem for different datasets. Here, we can see that the proposed model either matches or surpasses the state-of-the-art.

Blind Testing
To ensure the generalization capabilities of the proposed model, a blind test was performed. In blind testing, we trained the model on the BreakHis dataset and evaluated the model on other datasets. We created multiple binary classification datasets to ensure the model performs consistently in real-life scenarios. The data selection process is described in Table 6. Since the BreakHis dataset contains a diverse annotated classification of breast tumors (both benign and malignant), a model trained on the BreakHis dataset should perform well with other datasets too. This is demonstrated by three separate blind tests. We used three separate datasets to test the model in real-life conditions. The datasets used are BreAst Cancer Histology images (BACH) [40], an Invasive Ductal Carcinoma (IDC) Grading dataset [41] and a dataset with IDC patches that is aggregated from [42] and [43]. We devised three tests. The first test uses 200 samples from the BACH dataset. The second test uses 400 malignant samples from the IDC grading dataset and 100 benign samples from the BACH dataset. The third test uses 500 images from each class of the IDC patches dataset. The performance of each of these datasets is shown in Table 7.
Our model shows the best performance on the BACH dataset, which is the first blind test. On a total of 200 images, the proposed model gave an accuracy of 98.5%. In contrast, the other state-of-the-art models scored much lower on the blind test, despite giving better results on the BreakHis dataset itself. This clearly shows that our model has better generalization power than the others. On the BACH dataset, the ResNet18 model scored 76.0%, the ResNeXt model used earlier scored 88.0%, and DenseNet scored 83.5%. ShuffleNet got 72.5%, while MNasNet got a much lower score of 52.5%. The Vision Transformer model, the RegNet, and the Wide ResNet provided comparatively better performance. This shows that the proposed model is much more effective at the task of binary classification of breast cancer histopathological images in real life. Interestingly, these tests contain four sub-types of benign breast-tissue tumors and three sub-types of Invasive Ductal Carcinoma (two of which are rare cases of tumor genesis). Despite that, the model still performed strongly in binary classification.
The tested models show a similar trend across the other two tests. In test 2, the model achieves a score of 86.0%, whereas in test 3, the model scores 75.3%. The lower score in the last test is possibly because of the low-resolution image patches (50 × 50 pixels) as well as the large size of the dataset. A higher score across all the blind tests corroborates the generalization ability of the model. Figure 6 shows the average accuracy of the proposed method across all the evaluations.

Conclusion
In this article, an improved deep learning model has been designed to successfully classify breast cancer histopathology images. The model has been trained on the BreaKHis dataset, which allows it to classify images from eight types of tumors. The proposed model has been shown to have better performance while classifying the eight different classes of histopathology images. Furthermore, the model has been shown to successfully perform multi-class classification irrespective of the magnification levels of the images. Moreover, to simulate real-life scenarios, we also tested the trained model on other datasets [42,43] to perform binary classification tasks. Multiple blind tests show that the model can work with images beyond the training dataset. Therefore, it is expected that the type, as well as the subtypes, of a tumor can be separated unambiguously over a wide range of histopathology images. In the blind tests, the proposed model outperforms other state-of-the-art methods available in the literature. This ensures that the model can have real-life applications. The developed model will reduce the time and human error involved in breast cancer histopathological image analysis. We can infer that this model can be utilized to build an understanding of the sub-stages of malignancy in the future, even though this paper demonstrates the model for the classification of breast histopathological images only.
Fig. 6 Comparison of generalization performance of commonly used and state-of-the-art models with the proposed model