This study describes a recommender model for breast cancer research that uses a blended approach. In designing the model, data sources from several studies were considered and arranged in a layered architecture, with rules governing each layer. Each layer uses a different machine learning model and takes a different kind of input, depending on the condition being assessed. The architecture blends these conditions with the data and provides layer-by-layer recommendations.
2.1 Methodology
This research follows a four-stage methodology: problem definition, data collection, model selection for system design, and result reporting.
- Problem Definition and Objectives:
Current systems are built on clinical verification using computational methods. Their main limitation is that they are highly specialized and accept only one specific kind of data. A more accurate cancer diagnosis must take all factors into account (such as symptoms, lifestyle, allergens, and genetics). Consequently, a blended approach to recommendation is required. The objective of this study was to gather secondary data from various levels of cancer testing and combine it with machine learning models to develop a recommendation model. The present system incorporates symptoms and reports; in the future, we plan to use secondary datasets to integrate lifestyle and genetics.
(i) Data Collection & Preprocessing:
A review of the literature revealed a wide range of breast cancer data, combining clinical test results, pathology reports, genetics, and lifestyle information. Table 1 lists the datasets used in this study and their sources.
The dataset from [16] was preprocessed using one-hot encoding for categorical features, while real-valued features were scaled and normalized to obtain the final dataset. The dataset from [8] contains only real values for all features, so scaling and normalization were performed to obtain the transformed dataset.
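The preprocessing described above can be sketched in plain Python. This is only an illustration under stated assumptions: the source does not say which scaling was applied, so min-max scaling is assumed here, and the function names and toy field names (`age`, `glucose`, `group`) are hypothetical rather than taken from [16] or [8].

```python
from typing import Dict, List, Sequence


def one_hot(values: Sequence[str]) -> List[List[int]]:
    """Encode a categorical column as one indicator column per category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    return [[1 if v == c else 0 for c in categories] for v in values]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale a real-valued column to [0, 1] (assumed scaling method)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(rows: List[Dict[str, object]],
               categorical: Sequence[str],
               numeric: Sequence[str]) -> List[List[float]]:
    """Build one feature vector per row: scaled numeric columns, then one-hot columns."""
    columns: List[List[List[float]]] = []
    for name in numeric:
        scaled = min_max_scale([float(r[name]) for r in rows])
        columns.append([[x] for x in scaled])
    for name in categorical:
        encoded = one_hot([str(r[name]) for r in rows])
        columns.append([[float(x) for x in vec] for vec in encoded])
    # Concatenate the per-column pieces row by row.
    return [[x for col in columns for x in col[i]] for i in range(len(rows))]


# Hypothetical toy records; field names are illustrative only.
records = [
    {"age": 40, "glucose": 90.0, "group": "a"},
    {"age": 60, "glucose": 110.0, "group": "b"},
    {"age": 50, "glucose": 100.0, "group": "a"},
]
features = preprocess(records, categorical=["group"], numeric=["age", "glucose"])
```

For a dataset with only real-valued features, such as [8], the same function would be called with an empty `categorical` list.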
Table 1: Datasets utilized for study
| Sources | Sample collection | Nature of data & Size | Model |
|---|---|---|---|
| [16] | Clinical report (biomarker of breast cancer) | Numeric and categorical; size 166 (both patients with and without cancer) | Ensemble. The dataset is a combination of numeric and categorical data; in such cases, the ensemble has high performance and is quite robust. |
| [8] | Fine needle aspirate (FNA) | Real values; size 569 | Ensemble |
| [3] | Histopathology images | Images; 9,109 microscopic images of breast tumor; magnifying factors: 40X, 100X, 200X, and 400X; contains 2,480 benign and 5,429 malignant samples | CNN. Image datasets are processed using a CNN, which performs well with medical images. |
| [9] | Histopathology images | 162 breast cancer histopathology images | Mask R-CNN. This dataset is masked with specific parts of infection. The problem here is image recognition, and in such cases Mask R-CNN is suitable. |
(ii) Model Selection & Design
Model Selection: After data selection, the best-performing classification models were identified. Table 1 lists each dataset, the selected model, and the reason for its selection.
Layered architecture: This study proposes a layered architecture with blending of data, because breast cancer screening is an efficient mechanism for timely identification and improves the chance of a successful course of treatment. Table 2 lists, for each layer of the proposed model, its purpose and prerequisite, dataset and model, recommendation, and feedback.
Table 2: Layer Structure of Blended Recommendation System - BCRecomender
| Layer No. | Name | Purpose & Prerequisite | Dataset & model | Recommendation | Feedback |
|---|---|---|---|---|---|
| 1 | Clinical Assessor | Examines the presence of cancer; blood sample | Dataset: 166; Train/Test: 70:30; Model: Ensemble | Possibility of cancer; if possibility is true, recommended for physical examination followed by FNA test | Outcomes with samples stored in the database after verification from experts |
| 2 | FNA Test | Examine the type of tumor; digitized image sample | Dataset: 569; Train/Test: 70:30; Model: Ensemble | Possibility of lump or node to be cancerous; if possibility is true, recommended for examination of type of cancer and treatment | Sample values with recommendations stored in the database after verification from experts |
| 3 | Tumor Categorization | Examine the type of tumor; digitized image sample | Actual dataset: 2,013; Selected images: 1,390 of malignant types (200X images only); Train/Test: 70:30; Model: CNN | Possibility of particular type of cancer; annotate sample for better recognition | Sample images with recommendation stored in database after verification from expert |
| 4 | Strain Recognition | Recognize infected part among: mitosis, apoptosis, tumor nuclei, non-tumor nuclei, tubule, and non-tubule | Dataset: 162; Train/Test: 162 with number of annotations: 23,549; Model: Mask R-CNN | Strain recognition in different parts | Sample annotated images with recommendation stored in database after verification from expert |
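The 70:30 train/test ratio listed in Table 2 could be realized along the following lines. This is a minimal sketch: the source does not state whether the split was shuffled or stratified, so a seeded random shuffle is assumed, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split_70_30(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples with a fixed seed and cut them at 70% (assumed procedure)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Sample indices standing in for the 166 records of the layer 1 dataset.
train, test = train_test_split_70_30(range(166))
```

With 166 samples this gives 116 training and 50 test samples; with 569 samples it gives 398 and 171.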
In the first layer, the dataset contains data from routine blood analyses, notably glucose, insulin, HOMA, leptin, adiponectin, resistin, MCP-1, age, and body mass index (BMI), all of which can be used to predict the presence of breast cancer. This is the entry layer, which screens for cancer based on clinical data. The model used is an ensemble trained on existing data. During recommendation, the model receives the initial patient data and produces a binary outcome: presence (1) or absence (0) of breast cancer. If the outcome is presence, the system advances to the second layer, records the case in the training dataset of layer 1, and advises pathology testing, because the model predicts a high risk of malignancy. If the outcome is absence, no additional test results are taken into account.
In layer 2, a dataset from the FNA test was used to design the model. Features describing the cell nuclei visible in the digitized FNA image are extracted and used to train the ensemble model of this layer. In the testing phase, FNA test data obtained from a new target (patient) is transformed into numerical values and passed to the model for prediction. The outcome again confirms the presence or absence of breast cancer. If the outcome indicates presence, the target is advised to undergo further testing and is moved to layer 3.
CNN models are trained using histopathology images in layer 3. The dataset is taken from BreakHis, which contains breast tumor tissue from four types of benign and four types of malignant tumors. We used the malignant tumor subset at 200X magnification (1,390 images in total) to train our model. The four malignant types are ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC). This restriction was made to maintain system speed and to focus on cancer type detection. The model's output classifies the type of cancer from images: the suggested type is one of DC, LC, MC, or PC. If the input image does not belong to one of these classes, the system will not produce accurate results.
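How a four-way classifier output could be turned into a suggested type with a confidence value is sketched below. The sketch assumes that the CNN ends in a softmax over the four malignant classes, which the source does not state; the function names and the example scores are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Class order is an assumption made for this sketch.
CLASSES = ("DC", "LC", "MC", "PC")


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def suggest_type(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the most probable class and its probability as the confidence."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


# Hypothetical raw scores for one 200X image.
label, confidence = suggest_type([2.0, 0.1, -1.0, 0.3])
```

Because such a classifier always returns one of its four classes, an image of any other tissue type is still assigned to DC, LC, MC, or PC, which is consistent with the limitation noted above.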
In layer 4, an image recognition model is built using BreCaHAD, a dataset of breast cancer histopathology images (developed by the biomedical imaging community). A pathologist annotated the hematoxylin and eosin (H&E) stained histological images as mitosis, apoptosis, tumor nuclei, non-tumor nuclei, tubule, or non-tubule. For our system, we adjusted the image annotations. Since BreCaHAD does not contain annotated data from papillary carcinoma (PC) cases, this model has not been trained for PC. The model highlights the recognized image regions with rectangular boxes and provides a mask for each highlighted part.
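The layer-by-layer flow described in the preceding paragraphs can be summarized as a small routing function. This is a sketch under stated assumptions, not the system's implementation: the four predictors are hypothetical callables standing in for the trained models, and skipping layer 4 for PC cases is an assumption derived from the statement that the layer 4 model was not trained for PC.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-ins for the trained models of each layer.
BinaryModel = Callable[[Sequence[float]], int]  # returns 1 (presence) or 0 (absence)
TypeModel = Callable[[object], str]             # returns "DC", "LC", "MC" or "PC"
RegionModel = Callable[[object], List[str]]     # returns labels of recognized regions


@dataclass
class Recommendation:
    layers_run: List[int] = field(default_factory=list)
    messages: List[str] = field(default_factory=list)


def run_cascade(clinical: Sequence[float],
                layer1: BinaryModel, layer2: BinaryModel,
                layer3: TypeModel, layer4: RegionModel,
                fna: Optional[Sequence[float]] = None,
                image: Optional[object] = None) -> Recommendation:
    """Run the layers in order and stop as soon as a layer gives no reason to continue."""
    rec = Recommendation()

    rec.layers_run.append(1)
    if layer1(clinical) == 0:
        rec.messages.append("Layer 1: absence; no further test results considered")
        return rec
    rec.messages.append("Layer 1: presence; physical examination and FNA test advised")

    if fna is None:  # FNA values not available yet
        return rec
    rec.layers_run.append(2)
    if layer2(fna) == 0:
        rec.messages.append("Layer 2: absence")
        return rec
    rec.messages.append("Layer 2: presence; examination of cancer type advised")

    if image is None:  # histopathology image not available yet
        return rec
    rec.layers_run.append(3)
    cancer_type = layer3(image)
    rec.messages.append(f"Layer 3: suggested type {cancer_type}")

    if cancer_type == "PC":  # assumption: layer 4 is skipped because it was not trained for PC
        rec.messages.append("Layer 4: skipped; model not trained for PC")
        return rec
    rec.layers_run.append(4)
    rec.messages.append("Layer 4: recognized " + ", ".join(layer4(image)))
    return rec


# Usage with trivial stub predictors in place of the trained models.
result = run_cascade([0.4, 0.7], lambda x: 1, lambda x: 1,
                     lambda img: "DC", lambda img: ["mitosis", "tubule"],
                     fna=[0.2, 0.9], image="sample.png")
```

Passing the models as callables keeps the routing rules separate from the individual models, so a layer could be retrained or replaced without changing the flow.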
Although the system is at an early stage, we plan to use a MySQL database to import the data from the datasets and to store the generated recommendations. These recommendations can serve as input to the feedback module, in which experts verify the results.
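The storage and expert-verification step could look roughly as follows. To keep the sketch self-contained, Python's built-in `sqlite3` module is used as a stand-in for the MySQL database named above; the table layout, column names, and function names are hypothetical.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema; the source does not describe the actual tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS recommendations (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    layer INTEGER NOT NULL,
    outcome TEXT NOT NULL,
    expert_verified INTEGER NOT NULL DEFAULT 0
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database (SQLite here, standing in for MySQL) and create the table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_recommendation(conn: sqlite3.Connection, layer: int, outcome: str) -> int:
    """Store one generated recommendation, initially unverified, and return its id."""
    cur = conn.execute(
        "INSERT INTO recommendations (layer, outcome) VALUES (?, ?)", (layer, outcome)
    )
    conn.commit()
    return cur.lastrowid


def mark_verified(conn: sqlite3.Connection, rec_id: int) -> None:
    """Record that an expert has verified the recommendation."""
    conn.execute("UPDATE recommendations SET expert_verified = 1 WHERE id = ?", (rec_id,))
    conn.commit()


def verified_rows(conn: sqlite3.Connection) -> List[Tuple[int, int, str]]:
    """Return only expert-verified recommendations, e.g. for reuse as feedback."""
    return conn.execute(
        "SELECT id, layer, outcome FROM recommendations "
        "WHERE expert_verified = 1 ORDER BY id"
    ).fetchall()


conn = open_store()
first = store_recommendation(conn, 1, "presence")
second = store_recommendation(conn, 2, "presence")
mark_verified(conn, first)
```

Keeping an explicit verification flag means that only recommendations confirmed by experts would be fed back into the system.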
(iv) Results & Analysis: In layer 1, the bagging classifier achieves the highest accuracy, 61.06%, and in layer 2, bagging again has the highest accuracy, 97.52%. Layer 3 reaches a training accuracy of 97.39% after augmentation, while its validation accuracy is 65.11% (Table 3). During the test phase, its confidence ranges from 60% to 100%. In layer 4, the confidence varies from 50% to 100%.
Table 3: Performance of Each Layer in Recommendation System
| Layers | Model performance | | | | |
|---|---|---|---|---|---|
| 1 | Ensemble classifier | Accuracy (mean) | Recall | Precision | F1-score |
| | Bagging (n_estimators=100, criterion="entropy", max_features="auto") | 61.06 | 73.0 | 73.0 | 74.0 |
| | Adaboost (n_estimators=200, learning_rate=0.1) | 56.72 | 73.0 | 75.0 | 74.0 |
| | Voting classifier | 53.49 | 74.0 | 74.0 | 74.0 |
| 2 | Ensemble classifier (n_estimators=100, criterion="entropy", max_features="auto") | Accuracy (mean) | Recall | Precision | F1-score |
| | Bagging | 97.52 | 94.0 | 94.0 | 94.0 |
| | Adaboost (n_estimators=200, learning_rate=0.1) | 95.60 | 95.0 | 95.0 | 95.0 |
| | Voting classifier | 55.56 | 96.0 | 96.0 | 96.0 |
| 3 | CNN | Training Loss | Training Accuracy | Validation Loss | Validation Accuracy |
| | Model before augmentation | 0.4311 | 83.18 | 0.9860 | 66.91 |
| | Model after augmentation | 0.0631 | 97.39 | 1.9344 | 65.11 |
| 4 | Mask RCNN (resnet) | Training Loss | Evaluation Loss | Confidence | |
| | | 0.045 | 5.541 | 50-100% | |
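For reference, the accuracy, recall, precision, and F1-score columns of Table 3 correspond to the standard binary classification metrics, which can be computed as in the following sketch. The labels used here are hypothetical toy values, not results from this study, and the source does not state whether the reported scores are per-class or averaged.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for labels 1 (presence) and 0 (absence)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels for eight test samples.
scores = binary_metrics([1, 1, 1, 0, 0, 1, 0, 1], [1, 0, 1, 0, 1, 1, 0, 1])
```

In this toy case there are 4 true positives, 2 true negatives, 1 false positive, and 1 false negative, giving an accuracy of 0.75 and precision, recall, and F1 of 0.8 each.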