Rough Set Based Feature Selection for Prediction of Breast Cancer

Breast cancer is the deadliest cancer in women and has the highest mortality rate among women worldwide. Early prediction of breast cancer can improve the patient's survival rate. Consequently, high accuracy in cancer prediction is important to avoid misdiagnosis. Machine learning algorithms can contribute to the early prediction and diagnosis of breast cancer. In this study, we use a rough set based feature selector to extract relevant features from the breast cancer feature set and classify them using machine learning algorithms: Decision Tree, Naive Bayes, Support Vector Machine, K-Nearest Neighbor, Logistic Regression, Random Forest, and AdaBoost. The main aim is to predict cancerous breast nodules using rough set driven feature selection and machine learning classification algorithms. The results were evaluated in terms of accuracy, sensitivity, specificity, and positive predictive value. Random Forest outperformed all other classifiers and achieved the highest accuracy (95.23%) using the proposed approach.


Introduction
Cancer is a group of diseases whose incidence is growing at an unprecedented rate. It causes cells to divide and grow uncontrollably, invading nearby organs. These growths appear as a lump in an X-ray image [1]. When the growth spreads to surrounding tissues and distant parts of the body, it becomes malignant. Microcalcification clusters in the breast can be a sign of cancer. Most breast lumps are benign rather than malignant. Mammogram screening is the most effective method of detecting breast cancer [2]. According to the WHO, cancers such as breast, cervical, ovarian, lung, and prostate cancer will account for more than 10 million deaths by 2022. Breast cancer is the most common cancer, accounting for 2.26 million cases, and the leading cause of premature mortality among women worldwide, accounting for 685,000 deaths [3]. Breast cancer (BC) is one of the most common cancers in women worldwide, with fewer cases in men [4].
Breast cancer is the leading cause of death among women [5]. It is seen in 1.5 million women yearly. Breast cancer occurs when one type of cell multiplies abnormally. The mass of tissue thus formed is called a tumour. Tumours can be benign or malignant. Malignant tumours are cancerous and can spread quickly to other organs, whereas benign tumours do not. Tumours are differentiated using several diagnostic methods such as ultrasound, CT scans, mammography, and biopsy. However, differentiating tumours is not easy, even for a specialist. According to recent research, tumours can be classified based on statistical features.
Breast tumours are distinguished by the presence of mass lesions and microcalcification. As breast cancer awareness grows, thousands of mammograms are captured in diagnostic centres. Even in advanced countries, there is a gap between the number of experts available and the number required for mammogram analysis [6,7]. Manual classification is also time-consuming and prone to error, which can have a negative impact on critical outcomes. Because of these shortcomings, computer-assisted detection systems are required for diagnosis. These issues prompted the investigation and development of a high-accuracy computer-assisted detection system for the diagnosis of breast cancer.
Doctors require a trustworthy diagnostic tool to classify the types of tumours. Even for specialists, however, distinguishing tumours is often challenging. In the biopsy approach, sample tissues are extracted and examined. The breast samples are collected through a procedure called fine needle aspiration and analysed under a microscope. A few statistical features are collected under the microscope and analysed to predict the probability that a person has a cancerous nodule. The relevant statistical features can be extracted and analysed using various machine learning algorithms. Several organisations have amassed vast repositories of data collected from various sources in various formats over the last few decades [8,9]. These data could be used in a variety of applications, including medicine, agriculture, and weather forecasting. The ever-increasing volume of data outstrips the ability of traditional methods to analyse it, find the patterns and information hidden in it, and support decisions [10,11]. Machine learning techniques such as classification, clustering, and regression could be used to analyse data obtained from medical data repositories. Machine learning algorithms have proven to be valuable tools for discovering knowledge in medical data repositories and for successful disease prediction [12]. Several studies have reported the use of machine learning algorithms to develop predictive models for breast cancer [13].
Such models can significantly reduce patient mortality by detecting cancer at an early stage. In recent years, many machine learning approaches such as Linear Regression, Decision Tree, K-Nearest Neighbour, Random Forest, and Support Vector Machine have been used to categorize breast nodules. Feature selection is an important phase in classification in which the best features are selected before being passed to the classification layer. It not only reduces the dimensionality of the features but also improves the performance of the model.
This study uses the Wisconsin Breast Cancer Dataset. We extract the most relevant features using a rough set based feature selection algorithm and then classify them using different classification algorithms. Rough set theory groups objects that are characterized by the same information and treats them as indiscernible. The set of all objects that certainly belong to a target set is called its lower approximation, and the set of objects that possibly belong to it is called its upper approximation. The boundary region is the difference between the upper and lower approximations. We use the notion of a reduct in feature selection. A reduct is a minimal subset of features that holds the same information as the original feature set, thus eliminating redundant and irrelevant features from the dataset. We use the QuickReduct algorithm to compute the reduct. The features selected using rough set theory are then classified using classification algorithms such as Support Vector Machine, K-Nearest Neighbor, Naive Bayes, Decision Tree, Logistic Regression, Random Forest, and AdaBoost.
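The lower approximation, upper approximation, and boundary region described above can be illustrated with a minimal sketch. The toy table, the feature names `clump` and `adhesion`, and the helper names are hypothetical and are not taken from the study's data or code.

```python
from collections import defaultdict

def equivalence_classes(rows, attrs):
    """Group object indices that share the same values on `attrs` (indiscernible objects)."""
    groups = defaultdict(set)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].add(i)
    return list(groups.values())

def approximations(rows, attrs, target):
    """Return the (lower, upper) approximations of the index set `target`."""
    lower, upper = set(), set()
    for block in equivalence_classes(rows, attrs):
        if block <= target:   # every object in the block belongs to the target
            lower |= block
        if block & target:    # at least one object in the block belongs to the target
            upper |= block
    return lower, upper

# Hypothetical toy table with two discretised features per sample.
rows = [
    {"clump": "high", "adhesion": "low"},   # 0: malignant
    {"clump": "high", "adhesion": "low"},   # 1: benign, indiscernible from sample 0
    {"clump": "low", "adhesion": "high"},   # 2: benign
    {"clump": "high", "adhesion": "high"},  # 3: malignant
]
malignant = {0, 3}

lower, upper = approximations(rows, ["clump", "adhesion"], malignant)
boundary = upper - lower
print(sorted(lower), sorted(upper), sorted(boundary))
```

Samples 0 and 1 have identical feature values but different labels, so neither can be placed in the lower approximation with certainty; both end up in the boundary region.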

Related Work
In 2020, the authors of [14] used an artificial neural network (ANN) and SVM for the prognosis of breast cancer recurrence and of patient death within 32 months of surgery. SVM had the best performance, with an accuracy of 96.86%.
Sakri et al. [15] concentrated on improving accuracy by combining a feature selection algorithm, particle swarm optimization (PSO), with the machine learning algorithms K-NN, Naive Bayes (NB), and reduced error pruning (REP) tree. Their work was motivated by breast cancer being one of the major health problems among women in Saudi Arabia. According to their findings, women over the age of 46 are the primary victims of this disease. With this in mind, the authors applied four phase-based data analysis techniques to the WBCD dataset, using the Weka tool for data analysis. They presented a comparison of classification with and without a feature selection method. Without feature selection, they obtained accuracies of 70%, 76.3%, and 66.3% for NB, RepTree, and K-NN, respectively. After applying PSO, they identified four features that are well suited to this classification task and obtained accuracies of 81.3%, 80%, and 75% for NB, RepTree, and K-NN, respectively.
Ni et al. [16] wanted to know if the enrichment of miRNAs in exosomes reflects the pathogenesis of BC and ductal carcinoma in situ (DCIS). Exosomal miR-16 levels in plasma were higher in BC (P = 0.034) and DCIS (P = 0.047) patients than in healthy women and were associated with oestrogen (P = 0.004) and progesterone (P = 0.008) receptor status. Furthermore, lower levels of exosomal miR-30b were linked to recurrence (P = 0.034), while exosomal miR-93 was found to be upregulated in DCIS patients (P = 0.001). Their findings demonstrated that different signatures of miR-16, miR-30b, and miR-93 in exosomes from BC and DCIS patients are associated with a specific biology of breast tumours. As a result of their high diagnostic potential, exosomes have become a research hotspot in recent years.
Ricciardi et al. [17] combined principal component analysis (PCA) and linear discriminant analysis (LDA) for the classification of coronary artery disease. PCA was used to create new features and LDA to perform the classification, which improved the diagnosis of patients.
Machine learning is being used in a variety of fields in the twenty-first century as a result of advances in data analysis and classification techniques [18]. This approach is more accurate and less expensive for detecting breast cancer. As a result, biomarkers are increasingly being used as attributes in machine learning models to classify breast cancer. In 2008, a logistic regression model was developed that used two inputs: specific antigen 15-3 and insulin-like growth factor-binding protein-3. It achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.86, with 85% sensitivity and 62% specificity.
Several studies have explored various improvements to breast cancer classification using appropriate machine learning algorithms, with superior outcomes in terms of classification accuracy and sensitivity. Feature selection algorithms are crucial for extracting the relevant features needed to increase a classification system's performance. Dhanya et al. proposed a method that uses feature selection algorithms such as sequential forward feature selection, recursive feature elimination, f-test, and correlation, and classified breast nodules as benign or malignant using logistic regression, naive Bayes, and random forest [19]. Milon et al. used support vector machine and k-nearest neighbour for classification of breast nodules [20]. Ram et al. used principal component analysis to reduce the dimensionality of the data and then categorised the nodules using logistic regression, k-nearest neighbours, and ensemble learning [21]. Dana et al. used support vector machine, random forest, and Bayesian networks to classify breast nodules and found that support vector machine provided the best accuracy [22]. Qaung et al. employed feature scaling and principal component analysis to reduce the number of features and then categorised the nodules using an ensemble-voting classifier, logistic regression, SVM, and the AdaBoost algorithm [23]. Ahmed et al. employed logistic regression for feature reduction and classification of breast nodules [24]. Hasan et al. employed principal component analysis to reduce the number of features and then used an artificial neural network to classify the nodules [25]. Smita et al. employed principal component analysis for feature reduction and classified the nodules using decision tree algorithms such as CART and C4.5 [26]. Sujan et al. employed the random forest algorithm after reducing the feature dimension using principal component analysis and found that random forest provided better accuracy than decision tree [27]. Ahmet et al. classified the nodules using a support vector machine with a quadratic kernel after reducing the features using independent component analysis [28]. Phonethep et al. used principal component analysis to reduce features and classified the data using a J48 decision tree [29]. Liu selected highly relevant features using data correlation and an independence test and classified the data using a decision tree model [30]. Yang reduced the dimensionality of the features using Isomap and classified them using a support vector machine [31]. Jain used a genetic algorithm for feature selection and classified the features using a k-nearest neighbour classifier [32]. Emina used a genetic algorithm to extract relevant features and classified them using a Rotation Forest model. Ed-daoudy reduced the feature set using association rules and classified the features using a support vector machine. Rahman extracted relevant features using a genetic algorithm and classified them using random forest. Kamel used gray wolf optimisation for feature selection and classified the features using a support vector machine. Sharma extracted relevant features using correlation-based selection, information gain-based selection, and sequential feature selection, and then classified them using a max voting classifier [20]. The use of an appropriate feature selection algorithm has significantly improved classification accuracy by selecting the relevant features for classification. Figure 1 depicts the system model for identifying breast cancer as well as the difficulties. As a result, alternative models that are inexpensive, innocuous, easier to execute, and able to work with disparate data sets must be developed in order to give a reliable prediction.
This study introduced a novel deep learning technique to address prediction difficulties. The proposed approach used MRI breast image datasets with great performance for diagnosing breast cancer.

Image Acquisition
Breast MRI scans were used to classify the disease categories with the proposed machine learning and rough set approach. In this study, 1000 breast MRI images were acquired from the internet and processed with the proposed technique to classify cancer types as benign or malignant. In total, 735 malignant images and 265 benign images with 256 × 256 resolution and less than 2 mm thickness were gathered. These images form the dataset, which is then used for training and testing. The collected MRI images contain unwanted artifacts and noise, which are removed in the preprocessing phase.
Fig. 1 Proposed architecture

Data Preprocessing
The collected MRI breast images contain random, unrelated noise and artifacts that reduce diagnostic quality and alter image contrast resolution. The preprocessing step therefore applies transformations that prepare the MRI images for the subsequent processing stages and improve their performance. In this work, a Wienmed filter is used to remove unwanted noise and parts of the image background, constraining the high or low frequencies in order to enhance or detect image borders. The Wienmed filter combines two filters, the Wiener filter [7] and the median filter [35]. The two filters are coupled to effectively reduce noise and errors in MRI breast images. The primary goal of such a filter is to replace noisy pixels using their neighbouring pixels, which are first ordered by intensity. This preprocessing phase suppresses unwanted distortions in the MRI breast image and enhances image properties that are important for further processing.

Feature Selection
In the proposed model, we extract the relevant features using a rough set feature selection algorithm and then classify them using machine learning algorithms, as shown in Fig. 1. In the training phase, the relevant statistical features are gathered from the training set using the rough set feature selection algorithm. The dataset has 32 features. The QuickReduct feature selection algorithm computes a minimal reduct starting from an empty set and adds one feature in each iteration. The iteration continues until adding a feature no longer increases the degree of dependency. The QuickReduct algorithm is described in detail below.

Quick Reduct Algorithm
Rough set notation is described in this section. The decision table is denoted by DT and consists of U, a set of objects, C, a set of condition attributes, and D, a set of decision attributes. For a concept X ⊆ U and a set of attributes B ⊆ C, $\underline{B}X$ denotes the lower approximation and $\overline{B}X$ denotes the upper approximation of X with respect to B. The positive region of B is denoted by $POS_B(D)$. For a set R ⊆ C, $\kappa_R(D)$ denotes the kappa measure, which indicates the dependence of D on R.
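For reference, the quantities named above can be written out as follows. This is a sketch of the usual textbook rough set definitions, not a formula reproduced from this study. The equivalence class $[x]_B$ and the partition $U/D$ are notation introduced here for illustration, and the dependency degree is written $\kappa$ to match the kappa measure named above (it is often written $\gamma$ elsewhere).

```latex
\begin{align*}
[x]_B &= \{\, y \in U : a(y) = a(x) \ \text{for all } a \in B \,\} \\
\underline{B}X &= \{\, x \in U : [x]_B \subseteq X \,\} \\
\overline{B}X &= \{\, x \in U : [x]_B \cap X \neq \emptyset \,\} \\
POS_B(D) &= \bigcup_{X \in U/D} \underline{B}X \\
\kappa_R(D) &= \frac{|POS_R(D)|}{|U|}
\end{align*}
```

Under these definitions $0 \le \kappa_R(D) \le 1$, and $\kappa_R(D) = 1$ means that the attributes in R determine the decision for every object in U.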
For a given dataset, several reducts may exist. An important application of rough sets to machine learning is dimensionality reduction, in which the decision system is built using only the attributes of a reduct. In such applications, finding one reduct is sufficient. One of the popular algorithms for finding a reduct is the Quick Reduct algorithm proposed by A. The QuickReduct algorithm begins with an empty set and adds one attribute in each iteration so as to maximise the kappa measure. Because QuickReduct takes a greedy approach, it has been demonstrated in [10] that it may not always produce a reduct but may occasionally produce a super reduct. A super reduct is a set of attributes that includes a reduct as a subset. QuickReduct is still commonly used because of the speed with which it arrives at a set close to a reduct.
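The greedy procedure described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the study. The function names, the toy rows, and the attribute names `size`, `shape`, and `mitoses` are hypothetical, and the attributes are assumed to be already discretised. The loop stops either when the dependency of the full attribute set is reached or when no remaining attribute increases the dependency.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Split object indices into blocks that agree on every attribute in `attrs`."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, attrs, decision):
    """Degree of dependency: share of objects lying in blocks with a single decision value."""
    positive = sum(
        len(block)
        for block in partition(rows, attrs)
        if len({rows[i][decision] for i in block}) == 1
    )
    return positive / len(rows)

def quick_reduct(rows, conditions, decision):
    """Greedily add the attribute that raises the dependency degree the most."""
    target = dependency(rows, conditions, decision)
    reduct, best = [], dependency(rows, [], decision)
    while best < target:
        candidates = [a for a in conditions if a not in reduct]
        scores = {a: dependency(rows, reduct + [a], decision) for a in candidates}
        attr = max(scores, key=scores.get)  # ties go to the first attribute listed
        if scores[attr] <= best:            # no attribute improves the dependency
            break
        reduct.append(attr)
        best = scores[attr]
    return reduct

# Hypothetical discretised samples.
rows = [
    {"size": 1, "shape": 1, "mitoses": 1, "label": "benign"},
    {"size": 1, "shape": 2, "mitoses": 1, "label": "malignant"},
    {"size": 2, "shape": 1, "mitoses": 1, "label": "malignant"},
    {"size": 2, "shape": 2, "mitoses": 2, "label": "malignant"},
    {"size": 1, "shape": 1, "mitoses": 2, "label": "benign"},
]
selected = quick_reduct(rows, ["size", "shape", "mitoses"], "label")
print(selected)
```

In this toy table `size` and `shape` together determine the label, so `mitoses` is never added. Because the search is greedy, the result on real data may be a super reduct rather than a minimal one, as noted above.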

Classification of the Dataset
After applying the rough set feature selection algorithm, we obtain around nine relevant features, which also reduces the dimensionality of the feature set. In the training phase, the reduced feature set is used to train classification algorithms such as Decision Tree (DT), AdaBoost, Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (KNN), and Logistic Regression (LR), and the trained model produced by each algorithm is saved. The trained models are then used in the testing phase to predict the malignancy of the samples in the test set.

Decision Tree
A decision tree can serve as the basis for both classification and regression models. The primary dataset is partitioned into a small number of subsets. Even with these smaller subsets, it is still possible to make accurate predictions. Decision tree techniques include CART, C4.5, and C5.0, as well as conditional trees.

Naïve Bayes
This model assumes a substantial training dataset. The algorithm calculates class probabilities using the Bayesian approach. Even when the probabilities are calculated from noisy input data, it can offer a high level of accuracy. It is a type of analogy classifier, and its purpose is to compare a given tuple with the training tuples.

Support Vector Machine
It is a supervised learning technique that can be applied to both classification and regression problems. It is made up of both theoretical and numerical functions, and its purpose is to solve the regression problem. When making predictions on massive datasets, it can achieve a very high accuracy rate. It is a powerful machine learning approach that makes use of both three-dimensional and two-dimensional modelling.

K Nearest Neighbor
This algorithm is used to recognise patterns, and it is an effective method for predicting breast cancer. Each class is given equal weight when recognising the pattern. K-Nearest Neighbor finds data points with similar features in a large dataset, so a large dataset can be classified based on feature similarity.
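The idea of classifying by feature similarity can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and an unweighted majority vote. The function name `knn_predict`, the value `k=3`, and the toy points are hypothetical and not taken from the study.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples, e.g. clump thickness and cell size on a 1-10 scale.
train_X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (10, 10)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(train_X, train_y, (2, 2)))
print(knn_predict(train_X, train_y, (9, 9)))
```

The first query lies among the low-valued samples and the second among the high-valued ones, so the votes are unanimous in both cases.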

Logistic Regression
It is a supervised learning technique for a categorical dependent variable. This method generates a binary response. Logistic regression produces a continuous output, a probability, for a given data point, which is thresholded to obtain the binary response. The algorithm is built around a statistical model with binary variables.
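The relationship between the continuous output and the binary response can be shown with a small sketch. The weights, the bias, and the 0.5 threshold below are hypothetical values chosen for illustration; they are not fitted to the dataset used in this study.

```python
import math

def predict_proba(weights, bias, features):
    """Logistic (sigmoid) function applied to a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, features, threshold=0.5):
    """Turn the continuous probability into a binary response (1 = malignant)."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical, hand-picked parameters for two features.
weights, bias = [0.8, 0.6], -7.0

print(predict_proba(weights, bias, (9, 8)), predict_label(weights, bias, (9, 8)))
print(predict_proba(weights, bias, (1, 1)), predict_label(weights, bias, (1, 1)))
```

In practice the weights and bias would be learned from the training set, and the threshold could be tuned to trade sensitivity against specificity.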

Random Forest
The Random Forest algorithm is a supervised learning method that can be used for both classification and regression problems. It enables the prediction of future data based on the analysis of earlier data, which is known as historical data.

Adaboost
Boosting is an ensemble technique that combines numerous homogeneous weak classifiers into a single strong classifier. The model is built sequentially, step by step, with the first step being the collection of some training data.
In the testing phase, the relevant statistical features are gathered from the test set using the rough set feature selection algorithm. The selected features are classified using the trained machine learning models created in the training phase, as shown in Fig. 2. The predicted results are then validated against the ground truth labels, and performance measures are used to assess the findings.
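The way boosting combines weak classifiers into a strong one can be sketched with a small AdaBoost example. This is an illustrative sketch that assumes one-feature threshold rules (decision stumps) as the weak learners and labels encoded as +1 and -1. The function names and the toy data are hypothetical and do not reproduce the study's implementation.

```python
import math

def best_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best threshold rule."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] > thr else -polarity for row in X]
                err = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] > thr else -polarity

def adaboost_fit(X, y, rounds=3):
    """Fit `rounds` stumps sequentially, reweighting the samples after each round."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = best_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)     # vote weight of this stump
        model.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Hypothetical one-feature data that no single threshold rule can separate.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
print(predictions)
```

No single stump classifies this toy data correctly, but the weighted vote of three stumps does, which is the sense in which weak classifiers are combined into a stronger one.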

Experimental Setup
The training and test samples were run on a Windows 10 machine with an i5 processor and an NVIDIA RTX 2070. The model was implemented using Python libraries.

Dataset Description and Data Preprocessing
The data set was obtained from the freely accessible Breast Cancer Wisconsin (Original) Data Set in the UCI machine learning repository [21]. It contains 699 instances with 10 attributes, comprising 458 benign and 241 malignant cases. The features are computed from a digitized image of a fine needle aspirate of a breast mass and describe the characteristics of the cell nuclei. The features include clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. Data preprocessing is a step in the data analysis process that applies data normalisation and separates out inconsistent data, incomplete data, and outliers. We perform data preprocessing before classification and fill in missing values using the most frequent value of each category. Rough set feature selection transforms the data dimensions and generates a new feature set by picking a subset of relevant features and discarding features of limited significance. The most meaningful subset found is then used in the next step of the experiment. Clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, and single epithelial cell size are some of the features selected by the rough set feature selector. With respect to clump thickness, benign cells tend to be grouped in monolayers, whereas malignant cells are commonly grouped in multilayers. With respect to uniformity of cell size and shape, cancer cells tend to vary in size and shape. Marginal adhesion allows normal cells to stick together, whereas cancerous cells lose this ability, so loss of adhesion is an indication of cancer. A considerably enlarged epithelial cell could be a malignant cell. As a result, all of these selected features are important in establishing whether or not the cells are cancerous.
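The missing-value step described above can be sketched as a most-frequent-value imputation. This sketch assumes that the most frequent value is taken per feature column and that missing entries are marked with `?`. Both are assumptions, since the text does not specify them, and the function name and toy rows are hypothetical.

```python
from collections import Counter

MISSING = "?"  # assumed marker for a missing entry

def impute_most_frequent(rows):
    """Replace each missing entry with the most frequent observed value of its column."""
    modes = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] != MISSING]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [
        [modes[j] if value == MISSING else value for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical rows with three features each.
rows = [
    ["5", "1", "1"],
    ["3", "?", "2"],
    ["5", "1", "?"],
    ["?", "4", "2"],
]
filled = impute_most_frequent(rows)
print(filled)
```

If the most frequent value is instead meant to be computed separately for benign and malignant samples, the same function could be applied to each class subset in turn.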

Simulation
There are 342 training and 227 testing samples in the database. As indicated in Fig. 3 and Table 1, precision, recall, and F-measure were used to assess the model's efficacy. Precision is the percentage of nodules predicted as positive that were correctly diagnosed, $\text{Precision} = TP / (TP + FP)$. Recall is the percentage of malignant nodules that were correctly identified, $\text{Recall} = TP / (TP + FN)$. The F-measure is the harmonic mean of the two, $F = 2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall})$. Here TP, FP, and FN denote the numbers of true positives, false positives, and false negatives. We tested our rough set-based feature selection model using a variety of machine learning classifiers. We also evaluated the model using measures such as accuracy, specificity, and Negative Predictive Value (NPV), as shown in Fig. 4. After evaluation, it was observed that the proposed method with Random Forest achieved a better accuracy of 95.23% than the other classification algorithms. K-Nearest Neighbor showed a better specificity of 93.15%, whereas the Naive Bayes and SVM classifiers showed a better negative predictive value of 96%.
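The evaluation measures named above can all be derived from the four confusion matrix counts. The sketch below uses the standard definitions. The function names and the toy labels are hypothetical, and the numbers it prints are unrelated to the results reported in this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives, with `positive` as the malignant class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, tn, fn

def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def evaluate(tp, fp, tn, fn):
    precision = ratio(tp, tp + fp)  # also called positive predictive value
    recall = ratio(tp, tp + fn)     # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "npv": ratio(tn, tn + fn),
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical ground truth and predictions (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

scores = evaluate(*confusion_counts(y_true, y_pred))
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and recall are the same quantity, as are positive predictive value and precision, so one function covers every measure reported in this section.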

Comparative Study
Our proposed approach was also compared with various machine learning algorithms. Fig. 5 and Table 3 compare the proposed strategy with previous work in the field of breast nodule categorization. We examine the model's accuracy, specificity, sensitivity, and positive predictive value to determine its efficacy in classifying breast nodules.
With the random forest classifier, the proposed model achieved a better accuracy of 95.23%, specificity of 97.7%, sensitivity of 91.13%, and positive predictive value of 94.81% than the other models. It is also evident from Fig. 6 and Table 3 that rough set based feature selection significantly improved the accuracy of the classification algorithms.

Conclusion
In this paper, we extracted the relevant features using a rough set-based feature selection algorithm and used machine learning algorithms, namely decision tree, K-nearest neighbour, naive Bayes, support vector machine, logistic regression, random forest, and AdaBoost, to classify breast nodules as benign or malignant. The model was evaluated on the publicly available Breast Cancer Wisconsin (Original) Data Set in the UCI machine learning repository. With the random forest classifier, it showed a better accuracy of 95.23%, specificity of 97.7%, sensitivity of 91.13%, and positive predictive value of 94.81%.

Data Availability
The datasets generated during the current study are available from the corresponding author on request.