Spatial neighborhood intensity constraint (SNIC) clustering framework for tumor region in breast histopathology images

Precise segmentation of tumor regions plays a prominent role in the grading of breast carcinoma using the Nottingham Histological Grading (NHG) system. A robust segmentation framework is expected to produce cost-effective, repeatable, and reproducible quantitative outputs. In this study, a spatial neighborhood intensity constraint (SNIC) clustering framework for tumor region segmentation in breast histopathology images is presented. The proposed framework consists of five main stages: (1) color normalization, (2) segmentation and removal of nucleus cells, (3) SNIC, (4) Fuzzy C-Mean (FCM) clustering with knowledge-based initial centroids selection, and (5) post-processing. The novelty of the proposed framework lies in its simple yet powerful approach to clustering tumor regions precisely in a heterogeneous environment. The SNIC is implemented to remove the nucleus cells and replace their intensity based on spatial constraints. In addition, a knowledge-based initial centroids selection method is implemented to ease the FCM clustering algorithm. Both methods are posited to facilitate the clustering stage and produce complementary results. To validate this hypothesis, the roles of the SNIC and the knowledge-based initial centroids selection are carefully evaluated. These methods are found plausible, achieving accuracy (Acc), F1-score (F1), Area Overlap Measure (AOM), and Combined Equal Importance (CEI) of 91.2%, 92.1%, 85.7%, and 90.1%, respectively. To further demonstrate the applicability of the proposed framework, four recent works are included for benchmarking purposes. The proposed framework was found to outperform these methods, with the lowest percentages of over-segmentation and under-segmentation: 8.7% and 6.6%, respectively.


Introduction
The degree of cell differentiation is one of the critical prognostic markers stated in the Nottingham Histological Grading (NHG) system for breast carcinoma grading [1,2]. The score for this prognostic marker is based on a stringent assessment of (1) the percentage of glandular formation and (2) the area of the tumor regions. In the pathology laboratory, the tumor region is routinely estimated via manual visual inspection of the histopathology slides. These slides are very complex: they contain mitotic cells, nucleus cells, tumor regions, non-tumor regions, and underlying tissue architecture such as glandular structures, fatty regions, artifacts, and cell residues. Long hours of manual visual inspection of heterogeneous histopathology slides are prone to human error, which could affect the output grading. Several reports provide evidence that manual inspection is susceptible to inter- and intra-observer variability [3,4]. The output grading has also been found to be unrepeatable and irreproducible [4,5,6]. With the emergence of whole slide imaging (WSI) scanners, analogue histopathology slides are routinely converted to digital slides. The resulting digital slides enable various image processing techniques to be implemented for quantitative measurement. Many recent works have focused on the detection of mitotic cells [6][7][8][9] and breast cancer diagnosis [10][11][12]. However, work on quantitative measurement specifically for tumor regions is limited. The objective of this study is to propose an automated framework that can precisely quantify the tumor regions in breast histopathology images (as a pixel-based measurement that is convertible to microns using the calibration value). To quantify the tumor regions, an accurate segmentation framework is essential. The proposed framework is benchmarked against the ground truth prepared by an expert histopathologist based on the NHG system.
Assessment of tumor regions in digital slides usually involves several processing steps. Proper parameterization and combination of the image processing steps allow high throughput and accuracy in the output of the prognostic factor. An earlier study employed a feature-based classification approach termed Random Projections with Ensemble Clustering [13]. That framework employed textural features that represent each pixel in an image as a point in a high-dimensional feature space. Its main advantage is that the feature selection step involves unsupervised training, which is superior to the minimum redundancy maximum relevance (mRMR) method. Some studies proposed a pixel-wise approach to segment the tumor regions from the histopathology image [14][15][16]. Qu et al. [14] proposed a pixel-wise Support Vector Machine (SVM) with four morphological features to distinguish tumor regions from the background. Khan et al. [15] used a hybrid magnitude-phase approach to segment the tumor regions from the background. In this approach, the breast histopathology image was divided into four regions: tumor, hypo-cellular stroma, hyper-cellular stroma, and the background. The hypo-cellular and hyper-cellular stroma were segmented by calculating features using the magnitude and phase spectra, respectively, in the frequency domain.
Majeed et al. [17] proposed a texton-based approach to segment breast tissue into different regions: epithelium, stroma, and lumen. The proposed method used a Leung-Malik (LM) filter bank to compute the gradient at different orientations at different spatial scales. A Random Forest Classifier was used to perform classification. Fouad et al. [18] proposed an unsupervised superpixel-based segmentation by using the adaptive consensus clustering method. In [18], a multi-stage segmentation processing with the Simple Linear Iterative Clustering (SLIC) superpixel framework was used to segment the histopathology image into different regions.
In this study, an automated tumor region segmentation framework that combines the spatial neighborhood intensity constraint (SNIC) with a clustering approach (i.e., Fuzzy C-Mean (FCM)) is proposed. Unlike most of the existing methods, the proposed framework is simple yet powerful in segmenting tumor regions from heterogeneous histopathology images. In the clustering stage, the initial centroids of FCM are not generated randomly but are selected based on domain knowledge. This method is termed knowledge-based initial centroids selection in the subsequent sections. It can effectively reduce the search space (reflected in a lower number of iterations) and eliminate limitations of the conventional FCM with random initial centroids generation, such as dead centers, center redundancy, and the possibility of an initial centroid becoming trapped in a local minimum [19,20]. SNIC is a new method that aims to eliminate nucleus cells while preserving image information and reducing the fuzziness of the input image. The combination of the SNIC and FCM with knowledge-based initial centroids selection is found to be effective and robust in tumor region segmentation.
Similar to how histopathology slides are typically reviewed in standard procedure, the proposed framework operates at low resolution (i.e., 10x magnification).
The remainder of this study is organized as follows: Section 2 details the methodology of the proposed framework; Section 3 presents the experimental results and discussion. The conclusion of this study is given in Section 4.

Methodology
The proposed framework consists of five main stages: (1) color normalization, (2) segmentation and removal of nucleus cells, (3) SNIC, (4) FCM with knowledge-based initial centroids selection, and (5) post-processing. The framework starts with color normalization, which ensures that the image colors from different slides are normalized to a similar Red, Green, and Blue (RGB) color range. This stage is important for stable fuzzy clustering performance across different slides in the later stage. Next, the nucleus cells in the input images are segmented using hard K-Mean in the Cyan channel. The SNIC is then implemented to remove and fill in the pixels of the nucleus cells while preserving the image information. FCM with knowledge-based initial centroids selection is applied to partition the input image into different clusters. Finally, post-processing is applied to remove hemorrhage and blood cells, remove artifacts, and perform hole filling. Fig. 1 summarizes the block diagram of the proposed framework.
The detailed methodology of each stage is as follows:
Stage 1 (color normalization): [...] and is proven capable of preserving image information better than some of the existing methods [22,23]. Fig. 2 shows a sample output of the histogram matching algorithm.
Stage 2 (segmentation and removal of nucleus cells): [...] [20], where the detection of nucleus cells in histopathology images was shown to be effective using K-Mean in the Cyan channel. Next, the segmented nucleus cells obtained from the K-Mean were used as a mask to remove the pixels of the nucleus cells in the RGB input images. The R, G, and B intensity values of each pixel in the segmented nucleus were eliminated by setting them to 0 (i.e., RT=0, GT=0, and BT=0), where RT, GT, and BT denote the temporary intensities in the R, G, and B channels, respectively. The pixel values of the background remain unchanged (i.e., RGB color model). Fig. 3 shows an example of an image before and after the removal of nucleus cells.
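The nucleus removal step can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the Cyan channel is assumed to be 255 - R (the CMY complement of red), the number of K-Mean clusters is assumed to be two, and the cluster with the higher Cyan centroid is assumed to be the nuclei. The function names `cyan_channel`, `kmeans_1d`, and `remove_nuclei` are hypothetical, and nested lists stand in for image arrays.

```python
def cyan_channel(rgb):
    """Cyan of the CMY model, taken here as 255 - R (an assumption)."""
    return [[255 - px[0] for px in row] for row in rgb]


def kmeans_1d(values, k=2, max_iter=50):
    """Hard K-Mean on scalar values; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centroids spread evenly over the value range.
    cents = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - cents[j])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new.append(sum(members) / len(members) if members else cents[j])
        if new == cents:
            break
        cents = new
    return cents, labels


def remove_nuclei(rgb):
    """Return (mask, image) with nucleus pixels set to RT=GT=BT=0."""
    height, width = len(rgb), len(rgb[0])
    flat = [v for row in cyan_channel(rgb) for v in row]
    cents, labels = kmeans_1d(flat, k=2)
    # Assumption: nuclei form the cluster with the higher Cyan centroid.
    nucleus = max(range(2), key=lambda j: cents[j])
    mask = [[labels[y * width + x] == nucleus for x in range(width)]
            for y in range(height)]
    out = [[(0, 0, 0) if mask[y][x] else tuple(rgb[y][x])
            for x in range(width)] for y in range(height)]
    return mask, out


# Toy 2x2 image: three pale stroma-like pixels and one dark nucleus-like pixel.
toy = [[(230, 180, 200), (225, 175, 205)],
       [(60, 40, 120), (228, 178, 198)]]
toy_mask, toy_out = remove_nuclei(toy)
```

The mask returned here is the same object that a later stage would use to decide which pixels need new intensity values.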

Fig. 2. Sample output of the histogram matching algorithm (panels: reference image, input image, matching).
Stage 3 (SNIC): The FCM clustering algorithm is sensitive to outliers [24]. An outlier is defined as a data point that does not belong to any of the clusters [24]. In this study, the nucleus cells act as outliers and could hamper the performance of the FCM clustering algorithm. To address this limitation, the SNIC is proposed. The SNIC serves two purposes. First, it eliminates the outliers (i.e., the nucleus cells) by assigning new intensity values to the respective nucleus based on a spatial constraint. Second, it reduces the randomness and complexity of the input image, which is important for enhancing the fuzzy clustering result in the later stage. By reducing the number of image components, the SNIC could lower the image complexity. Entropy is a statistical metric that is commonly used to measure image randomness [25]. The proposed SNIC is posited to reduce the randomness and complexity of the input image, reflected in a lower entropy value after the SNIC, and to improve the overall performance of the FCM clustering algorithm. The SNIC starts by replacing the black pixels of the segmented nucleus cells (i.e., RT=0, GT=0, and BT=0) with new intensity values. The newly assigned intensity values in R, G, and B are not created randomly (to avoid unwanted noise or false information) but are inherited from the spatial information of the neighboring pixels of each segmented nucleus. In an image, a neighboring pixel that is spatially closer is more likely to carry similar information than a pixel that is spatially distant. Neighborhood pixels that belong to the same cluster should share similar information, such as color features [26,27]. If a spatially close neighboring pixel carries significantly different information, it may be affected by noise or belong to a different cluster. The implementation of the SNIC is as follows (refer to Fig. 4).
1. Determine the centroid and the major axis length of the segmented nucleus (measured in pixels). The major axis length is defined as the length (in pixels) of the major axis of the ellipse that has the same normalized second central moments as the region. The nucleus centroid can be calculated by averaging the x-coordinates and y-coordinates of the boundary pixels of the segmented nucleus.
[...] where [...] is a constant value.
4. Place the [...] on the centroid of the segmented nucleus.
[...]
Stage 4 (FCM with knowledge-based initial centroids selection): To partition the input image into three clusters (i.e., background, non-tumor region, and tumor region), three initial centroids were used as inputs to the clustering algorithm, namely initial centroid 1 ($c_1$), initial centroid 2 ($c_2$), and initial centroid 3 ($c_3$). $c_1$ and $c_3$ were selected from the histogram, whereas $c_2$ can be calculated using an equation. A hill climbing optimization technique [28,29] was implemented to obtain the first and second intensity peaks, which were labelled as $c_1$ and $c_3$, respectively.
The search for the first peak ($c_1$) started at the first value of the histogram (i.e., intensity = 0). The first local maximum (i.e., the first peak) was selected as $c_1$. The search for the second peak started at intensity = $c_1$ + 1. The search stopped when the next local maximum (i.e., the second peak) was obtained, and this value was selected as $c_3$. The selection of $c_1$ and $c_3$ for the background and the tumor regions, respectively, depended on the difference in basicity level [30]. Basicity is defined as the quality of being a base (not an acid). In terms of chemical bonding, a basic substance tends to bind with an [...] should be selected from the values between $c_1$ and $c_3$. $c_2$ can be computed using Eq (3).
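The peak search and the seeded clustering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the histogram is a toy example, a local maximum is taken to be a bin that is strictly higher than its left neighbour and not lower than its right neighbour, Eq (3) is not available here so the midpoint of the two peaks stands in for the middle centroid, and FCM is run on scalar intensities with fuzzifier m = 2. The names `next_peak` and `fcm_1d` are hypothetical.

```python
def next_peak(hist, start):
    """Index of the first local maximum at or after `start`, or None."""
    for i in range(max(start, 1), len(hist) - 1):
        if hist[i - 1] < hist[i] >= hist[i + 1]:
            return i
    return None


def fcm_1d(values, centroids, m=2.0, tol=1e-4, max_iter=100):
    """Fuzzy C-Mean on scalars, seeded with the given initial centroids.

    Returns (centroids, hard labels, number of iterations used).
    """
    cents = list(centroids)
    k = len(cents)
    for n_iter in range(1, max_iter + 1):
        # Membership update: u_ij proportional to d_ij^(-2/(m-1)).
        u = []
        for v in values:
            d = [abs(v - c) for c in cents]
            if min(d) == 0.0:
                # A point sitting on a centroid belongs fully to it.
                row = [1.0 if dj == 0.0 else 0.0 for dj in d]
            else:
                row = [dj ** (-2.0 / (m - 1.0)) for dj in d]
            total = sum(row)
            u.append([r / total for r in row])
        # Centroid update: weighted mean with weights u_ij^m.
        new = []
        for j in range(k):
            w = [row[j] ** m for row in u]
            new.append(sum(wi * v for wi, v in zip(w, values)) / sum(w))
        shift = max(abs(a - b) for a, b in zip(new, cents))
        cents = new
        if shift < tol:
            break
    # Hard labels taken from the last membership matrix.
    labels = [max(range(k), key=lambda j: row[j]) for row in u]
    return cents, labels, n_iter


# Toy 16-bin histogram with two peaks.
hist = [0, 2, 5, 9, 4, 1, 0, 1, 3, 8, 12, 7, 2, 0, 0, 0]
c1 = next_peak(hist, 0)          # first peak, searched from intensity 0
c3 = next_peak(hist, c1 + 1)     # second peak, searched from c1 + 1
c2 = (c1 + c3) / 2               # assumption: midpoint stands in for Eq (3)
pixels = [i for i, n in enumerate(hist) for _ in range(n)]
centroids, labels, n_iter = fcm_1d(pixels, [c1, c2, c3])
```

Calling `fcm_1d` with randomly drawn centroids instead of `[c1, c2, c3]` and comparing the returned iteration counts is one way to examine the search-space argument made for the knowledge-based selection.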

Stage 5 (post-processing): Hemorrhage and blood cells are removed by using the transformed matrix R, which can be calculated using Eq (4) [31], where I is an m x n x 3 (i.e., RGB channels) intensity matrix and [...] denotes the pth matrix in the RGB channels; [...] denotes the binary image containing all hemorrhage and blood cells; and R1 denotes the threshold value determined using the Otsu thresholding method [32]. Fig. 6 shows a sample output of hemorrhage and blood cells extraction (in RGB channels).
Fig. 6. Sample of hemorrhage and blood cells extraction.
In breast histopathology images, regions smaller than 0.04% of a histopathology image (a threshold selected heuristically) are too small to represent a tumor region. These regions are probably caused by unknown noise and were eliminated. A simple morphological "closing" operation with a "disk" structuring element (i.e., radius of 1 pixel) is then applied to fill holes in the tumor regions.
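The small-region removal and the closing step can be sketched without external libraries. The sketch below makes several assumptions that the text does not state: regions are 4-connected components, a region is dropped when its pixel count is below the area fraction, the radius-1 disk is approximated by a 3x3 cross, and pixels outside the image are ignored during dilation and erosion. The function names are hypothetical.

```python
from collections import deque

# Radius-1 "disk", approximated here by a 3x3 cross (an assumption).
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def remove_small_regions(mask, min_fraction=0.0004):
    """Drop 4-connected regions smaller than min_fraction of the image."""
    h, w = len(mask), len(mask[0])
    min_size = min_fraction * h * w
    out = [[False] * w for _ in range(h)]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first search collects one connected region.
            region, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for dy, dx in CROSS[1:]:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_size:
                for cy, cx in region:
                    out[cy][cx] = True
    return out


def _morph(mask, dilate):
    """Binary dilation (any) or erosion (all) with the cross element."""
    h, w = len(mask), len(mask[0])
    combine = any if dilate else all
    return [[combine(mask[y + dy][x + dx] for dy, dx in CROSS
                     if 0 <= y + dy < h and 0 <= x + dx < w)
             for x in range(w)] for y in range(h)]


def closing(mask):
    """Morphological closing: dilation followed by erosion."""
    return _morph(_morph(mask, True), False)


# Toy 7x7 mask: a 3x3 ring with a one-pixel hole, plus one stray pixel.
toy = [[False] * 7 for _ in range(7)]
for yy in range(2, 5):
    for xx in range(2, 5):
        toy[yy][xx] = (yy, xx) != (3, 3)
toy[0][6] = True
cleaned = closing(remove_small_regions(toy, min_fraction=0.1))
```

For a 614 x 1240 image, the default fraction of 0.0004 corresponds to roughly 304.5 pixels.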

Evaluation metric
In this study, the evaluation is performed using statistical metrics based on the confusion matrix. In addition, the Area Overlap Measure (AOM), over-segmentation, under-segmentation, and Combined Equal Importance (CEI) are used to express the performance of the proposed framework. AOM is used to evaluate the performance of the object region segmentation algorithm. AOM is defined as the ratio of the intersection to the union of the two areas to be compared, as given in Eq (5):
$$\mathrm{AOM} = \frac{|S \cap G|}{|S \cup G|} \quad (5)$$
CEI is used to measure the over-segmentation and under-segmentation of the output results obtained from the proposed framework with respect to the ground truth images. CEI is a combined measure in which AOM, over-segmentation, and under-segmentation are included with equal importance.
The equations of over-segmentation and under-segmentation are given in Eqs (6) and (7), respectively, where $S$ denotes the result obtained from the proposed framework and $G$ denotes the ground truth. The equation of CEI is given in Eq (8).
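A short sketch of how such metrics can be computed from a predicted mask and a ground-truth mask is given below. Only the AOM definition (intersection over union) comes from the text. The normalizations used for over-segmentation (false positives over the predicted area) and under-segmentation (false negatives over the ground-truth area), and the equal-weight form of CEI, are assumptions, since Eqs (6) to (8) are not reproduced here. All function names are hypothetical, and the masks are flat lists of 0/1 values.

```python
def confusion(pred, truth):
    """Pixel-wise confusion counts (tp, fp, fn, tn) for two flat masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return tp, fp, fn, tn


def cei(aom, over, under):
    """Assumed equal-weight combination of AOM and the two error rates."""
    return (aom + (1.0 - over) + (1.0 - under)) / 3.0


def segmentation_metrics(pred, truth):
    """Confusion-matrix and overlap metrics as fractions in [0, 1]."""
    tp, fp, fn, tn = confusion(pred, truth)
    ppv = tp / (tp + fp)                   # positive predictive value
    sen = tp / (tp + fn)                   # sensitivity
    aom = tp / (tp + fp + fn)              # |S n G| / |S u G|
    over = fp / (tp + fp)                  # assumption: |S minus G| / |S|
    under = fn / (tp + fn)                 # assumption: |G minus S| / |G|
    return {
        "Acc": (tp + tn) / (tp + fp + fn + tn),
        "PPV": ppv,
        "Sen": sen,
        "F1": 2.0 * ppv * sen / (ppv + sen),
        "AOM": aom,
        "over": over,
        "under": under,
        "CEI": cei(aom, over, under),
    }


# Toy masks with 3 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
scores = segmentation_metrics(pred, truth)
```

As a consistency check on the assumed CEI form: taking 85.7% as the AOM together with the reported over-segmentation of 8.7% and under-segmentation of 6.6% gives (0.857 + 0.913 + 0.934) / 3, which is about 0.901 and agrees with the reported 90.1%.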

Experimental results and discussion
To justify and validate the applicability of the proposed framework, a dataset was collected for evaluation purposes. The dataset was collected locally in Kangar, Perlis, Malaysia under a stringent protocol performed by an expert histopathologist. Section 3.1 presents the dataset and ground truth annotation in detail.

Dataset
The breast histopathology slides used in this study were provided by the Pathology Department, Hospital Tuanku Fauziah, Kangar, Perlis, Malaysia. These slides were prepared under a standard procedure from mastectomy specimens resected for breast carcinoma. Hematoxylin and Eosin (H&E) were used as standard dyes in the staining process. The analogue histopathology slides were converted to digital slides using an Aperio CS2 WSI scanner. For evaluation purposes, 10 breast histopathology slides were used in this study: three slides from Grade 1, three from Grade 2, and four from Grade 3. The manual annotation of ground truth for each breast histopathology slide was performed by the expert histopathologist under the stringent standard procedure stated in the NHG system [1,2]. A total of 200 images at 10x magnification were captured and used for evaluation, with 20 images captured from each slide corresponding to different dominant areas on the slide. The input images were prepared in the 8-bit RGB color model with a dimension of 614 x 1240 pixels (calibration value: 0.2521 microns/pixel). The captured images were stored in bitmap format (i.e., BMP). Table 1 summarizes the dataset for this study.
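Because the outputs are pixel-based, the calibration value is what links them to physical units. The following sketch shows the arithmetic, assuming square pixels so that one pixel covers 0.2521 x 0.2521 square microns. The helper names are hypothetical.

```python
MICRONS_PER_PIXEL = 0.2521   # calibration value quoted for the dataset
IMAGE_SHAPE = (614, 1240)    # image dimension quoted for the dataset, in pixels


def length_um(n_pixels):
    """Convert a length in pixels to microns."""
    return n_pixels * MICRONS_PER_PIXEL


def area_um2(n_pixels):
    """Convert a pixel count to square microns (assumes square pixels)."""
    return n_pixels * MICRONS_PER_PIXEL ** 2


def tumor_percentage(tumor_pixels, shape=IMAGE_SHAPE):
    """Percentage of the image area covered by tumor pixels."""
    return 100.0 * tumor_pixels / (shape[0] * shape[1])


total_pixels = IMAGE_SHAPE[0] * IMAGE_SHAPE[1]   # 761360 pixels per image
image_area = area_um2(total_pixels)              # whole-image area in square microns
```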

Justification on the proposed SNIC
The SNIC was used to reduce the influence of outliers (nucleus cells in this case) and the complexity of the input image. For validation purposes, the entropy value after the implementation of the SNIC is compared to the entropy value of the original input image (before implementing the SNIC). A low entropy value denotes low image disorder (i.e., randomness), which is preferable in this study. The entropy value obtained after the SNIC is posited to be lower than that of the original input image because the nucleus cells are eliminated from the input image. Fig. 7 shows a line graph comparing the entropy values obtained before and after implementing the SNIC. Based on Fig. 7, the proposed SNIC was able to reduce the randomness and complexity of the input image. It is important to emphasize that the percentage of reduction is small (i.e., 6.5% (±2.5)). The SNIC targets only the nucleus cells, and the other image components remain unaffected, which ensures minimal loss of image information. The small reduction rate is closely related to the low number of nucleus cell pixels in the input image.
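The entropy comparison can be reproduced with the standard Shannon entropy of the intensity histogram. The sketch below assumes the entropy is computed over a flat list of single-channel intensity values, since the text does not state whether the authors used grayscale or per-channel values. The function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (bits) of the intensity distribution of `pixels`."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def percent_reduction(before, after):
    """Relative entropy reduction, in percent of the original value."""
    return 100.0 * (before - after) / before


# Toy example: merging two intensity levels into one lowers the entropy.
original = [10, 10, 200, 200, 90, 90, 30, 30]
simplified = [10, 10, 200, 200, 90, 90, 90, 90]
h_before = shannon_entropy(original)
h_after = shannon_entropy(simplified)
reduction = percent_reduction(h_before, h_after)
```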
In addition, the output of fuzzy clustering with the SNIC is compared to that of fuzzy clustering without the SNIC. The main purpose is to evaluate the impact of the SNIC on the clustering stage. The results obtained from both clustering runs were compared by plotting a Positive Predictive Value (PPV)-Sensitivity (Sen) scatter plot. Fig. 8 [...] the FCM has difficulty in handling outlier data points [24,33]. The proposed SNIC eliminates the original intensity values of the nucleus cells and assigns each pixel a new intensity value based on the spatial constraint. From Fig. 8, the implementation of the proposed SNIC was found to be robust in addressing the aforementioned limitation and in alleviating the presence of outliers, producing plausible clustering results.
Fig. 8. Plot of PPV-Sen for the proposed segmentation procedure (comparing FCM with guided initialization with and without SNIC).
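The idea of assigning nucleus pixels new values from their spatial neighborhood can be illustrated with a simplified stand-in. This is explicitly not the authors' SNIC procedure: the window here is a circle of a caller-supplied `radius` centred on the mean of all nucleus pixel coordinates, and every nucleus pixel receives the mean color of the non-nucleus pixels inside that circle. The function name `fill_nucleus` and all of its parameters are hypothetical.

```python
def fill_nucleus(rgb, nucleus_pixels, nuclei_mask, radius):
    """Fill one nucleus with the mean color of nearby non-nucleus pixels.

    rgb            nested list of (R, G, B) tuples
    nucleus_pixels list of (y, x) coordinates of one segmented nucleus
    nuclei_mask    nested list of booleans, True for any nucleus pixel
    radius         radius in pixels of the circular window (assumed input)
    """
    h, w = len(rgb), len(rgb[0])
    # Window centre: mean coordinate of the nucleus pixels (an assumption).
    cy = sum(y for y, _ in nucleus_pixels) / len(nucleus_pixels)
    cx = sum(x for _, x in nucleus_pixels) / len(nucleus_pixels)
    # Collect non-nucleus pixels that fall inside the circular window.
    neighbours = [rgb[y][x] for y in range(h) for x in range(w)
                  if not nuclei_mask[y][x]
                  and (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2]
    out = [list(row) for row in rgb]
    if not neighbours:
        return out  # nothing to inherit from, leave the nucleus unchanged
    mean = tuple(round(sum(p[c] for p in neighbours) / len(neighbours))
                 for c in range(3))
    for y, x in nucleus_pixels:
        out[y][x] = mean
    return out


# Toy 3x3 image: uniform tissue color with one blacked-out nucleus pixel.
tissue = (200, 100, 150)
image = [[tissue] * 3 for _ in range(3)]
image[1][1] = (0, 0, 0)
mask = [[(y, x) == (1, 1) for x in range(3)] for y in range(3)]
filled = fill_nucleus(image, [(1, 1)], mask, radius=1.5)
```

Using only non-nucleus pixels as donors keeps the zeroed nucleus values from leaking into the fill color.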

Justification on the FCM with knowledge-based initial centroids selection
To evaluate the impact of the knowledge-based initial centroids selection, the FCM with knowledge-based initial centroids selection is compared to the conventional K-Mean and FCM (both with random initial centroids generation) and to K-Mean with knowledge-based initial centroids selection. The purpose of this comparison is to: (1) determine whether the knowledge-based initial centroids selection method is effective in enhancing the overall clustering results by minimizing the possibility of dead centers, center redundancy, and initial centroids becoming trapped in local minima; and (2) determine whether the knowledge-based initial centroids selection method reduces the number of iterations of the fuzzy clustering algorithm. Fig. 9 shows a combined scatter plot for the conventional K-Mean (i.e., green markers), conventional FCM (i.e., blue markers), and K-Mean with knowledge-based initial centroids selection (i.e., gray markers). The clustering results were compared with the FCM with knowledge-based initial centroids selection (i.e., red markers).
Fig. 9. PPV-Sen scatter plots of the proposed segmentation procedure (with SNIC) using the conventional K-Mean, K-Mean with guided initialization, conventional FCM, and the proposed FCM with guided initialization.
In Fig. 9, the PPV-Sen plot for the K-Mean clustering algorithm with random initial centroids generation is found to be scattered. The low percentage of [...] shows that the K-Mean clustering algorithm with random initial centroids generation was unable to accurately segment the tumor regions when compared to the ground truth images. The obtained results are consistent with a few K-Mean studies, which showed low performance when clustering overlapped and fuzzy datasets [24,33]. A better PPV-Sen plot was obtained for K-Mean with knowledge-based initial centroids selection (i.e., gray markers) than for the K-Mean clustering algorithm with random initial centroids generation (i.e., green markers). The obtained result verifies the limitation of the K-Mean clustering algorithm reported in previous studies (i.e., it may not be successful in clustering noisy, fuzzy, and non-linear datasets). The clustering results obtained from the FCM with random initial centroids generation are encouraging and comparable to those of the FCM with the proposed knowledge-based initial centroids selection. From the same figure, the clustering results from the FCM with knowledge-based initial centroids selection were found to be higher in F1 (i.e., 92.1%) and Acc (i.e., 91.2%) than those of the FCM with random initial centroids generation (i.e., F1 = 85.5% and Acc = 85.5%).
This reflects that the proposed knowledge-based initial centroids selection method is capable of improving the overall clustering results. The poor clustering outputs (e.g., images with [...] between 30.0% and 80.0%) obtained from the FCM with random initial centroids generation could be a result of dead centers, center redundancy, and centroids trapped in local minima. Also, the FCM with knowledge-based initial centroids selection is found to be effective in reducing the search space (to obtain the final centroids). This is reflected by a lower number of iterations when compared to the FCM with random initial centroids generation (see Fig. 10).

Overall fuzzy clustering results
To illustrate the performance of the FCM with knowledge-based initial centroids selection, the clustering results of the FCM with knowledge-based initial centroids selection, conventional K-Mean and FCM (both with random initial centroids generation), and K-Mean with knowledge-based initial centroids selection are depicted in Fig. 11. For visual comparison, a total of nine breast histopathology images were selected from the dataset, with three images selected from each breast cancer grade (i.e., Grades 1 to 3). The images from Grade 1 were labeled as Img 1, Img 2, and Img 3; the images from Grade 2 were labeled as Img 4, Img 5, and Img 6; and the images from Grade 3 were labeled as Img 7, Img 8, and Img 9.
To further demonstrate the superiority of the proposed framework, it is compared to several recent works (see Fig. 15). From this figure, the Acc, F1, AOM, and CEI of the proposed framework are found to be 5.9%, 5.8%, 8.5%, and 7.0% higher, respectively, than the best output results among the recent works [1]. This could be explained by the implementation of the proposed SNIC and knowledge-based initial centroids selection, which eliminate the limitations of the FCM clustering algorithm and produce complementary results. Also, the over-segmentation and under-segmentation of the proposed framework are the lowest among the recent works. This shows that the proposed framework has better accuracy in tumor region segmentation when benchmarked against the ground truth.
In this study, the proposed framework is evaluated in terms of performance and applicability in tumor region segmentation. Although the FCM with knowledge-based initial centroids selection was found to reduce the number of iterations, no evaluation of computation time has been done. Therefore, future work will investigate the accountability of the proposed framework in terms of practical development. This may include the computation time, the graphical user interface, and the quantitative output (i.e., the final percentage of tumor regions, based on pixel count) in one histopathology slide.

Conclusion
In this study, a SNIC clustering framework for tumor region segmentation in breast histopathology images is proposed. In addition, a knowledge-based initial centroids selection is implemented to systematically select the initial centroids for the FCM clustering algorithm. Both methods were found capable of enhancing the overall clustering output, producing complementary results. The novelty of the proposed framework lies in its simple yet powerful approach to tumor region segmentation, which has been shown to outperform some of the recent works. The proposed framework is believed to be applicable to multiple applications in the pathology laboratory, typically those involving tumor region segmentation using H&E histopathology images (e.g., prostate carcinoma and colorectal carcinoma). The quantitative output (i.e., pixel-based measurement) is posited

Conflicts of interest/Competing interests
The authors declare that they have no conflict of interest.

Ethical Approval
The protocol of this study was approved by the Medical Research and Ethics Committee of the National Medical Research Register (NMRR) Malaysia, under protocol number NMRR-17-281-34236.

Consent for publication
Informed consent was obtained from all individual participants included in the study.

Figure 1
Block diagram of the proposed framework.

Figure 3
Removal of nucleus cells.

Figure 4
Implementation of SNIC.

Figure 6
Sample of hemorrhage and blood cells extraction.

Figure 7
Entropy values before and after implementing the proposed SNIC.

Figure 8
Plot of PPV-Sen for the proposed segmentation procedure (comparing between FCM with guided initialization using SNIC and without SNIC).

Figure 9
PPV-Sen scatter plots of the proposed segmentation procedure (with SNIC) using the conventional K-Mean, K-Mean with guided initialization, conventional FCM, and the proposed FCM with guided initialization.

Figure 10
Comparison in iteration numbers for random and knowledge-based initial centroids selection for FCM.

Figure 11
Evaluation metrics for FCM with guided initialization, conventional K-Mean, K-Mean with guided initialization, and conventional FCM.

Figure 15
Results comparison for proposed and other methods.