In this part, 42 ensemble learning methods used for cancer detection are classified into three distinct fusion categories and discussed: data and feature integration methods, decision integration methods, and, on a smaller scale, model integration methods. For each method, the input data, the number of samples, and the statistical tools used to evaluate performance are introduced; for decision integration methods, the decision-making strategies are also specified. In this systematic review, the ensemble systems described in the 45 most relevant articles on cancer prognosis and diagnosis were studied. Studies that used a valid statistical tool to evaluate their performance were included, whereas studies that did not compare the accuracy of their findings with other methods, or did not clearly assess the different aspects of their performance, were excluded.
2.1 Data and Feature Integration
When two or more heterogeneous biological input data types, such as clinical data, mutation data, expression data, proteomics data, or Gene Ontology (GO) database data, are combined, the resulting fusion system is called data integration. The higher level of this category is called multi-omics data integration, in which data from various levels and scales, including genomics, epigenomics, transcriptomics, proteomics, and metagenomics, are integrated. It has been reported that combining multi-omics data can improve predictive performance [11]. In some studies, various features are selected or extracted from homogeneous or heterogeneous data; when the combination of these features is used to implement an algorithm, another type of integration, called feature fusion, is created. In most cases, data integration and feature integration are used simultaneously within an ensemble system.
2.1.1 Bayesian Network Classifiers (1,2)
Bayesian network classifiers are methods that can integrate heterogeneous data from multiple sources to reveal the mechanisms of complex diseases such as cancers. In a study on hepatocellular carcinoma (HCC), also called malignant hepatoma, microarray and clinical data were integrated from biological databases and the literature. Some liver cancer protein biomarkers were predicted, functional modules that reflect the progression mechanism of liver cancer were identified, and performance was evaluated with 10-fold cross-validation: the data were split into ten approximately equal-sized sets, one set was held out for testing while the rest were used for training, and the process was repeated ten times so that each set served once as the test set. Results showed that, compared to Bayesian network (BN), Naïve Bayes (NB), full Bayesian network (FBN), and Support Vector Machine (SVM) classifiers, the proposed method gives the maximum area under the curve (AUC) [12].
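The 10-fold cross-validation procedure described above can be sketched as follows; this is a minimal illustration in Python (not the authors' code), working only at the level of sample indices:

```python
import random

def ten_fold_splits(n_samples, k=10, seed=0):
    """Partition sample indices into k roughly equal folds; each fold
    serves once as the test set while the rest form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for test in folds:
        train = [i for i in idx if i not in set(test)]
        splits.append((train, test))
    return splits

# 100 samples -> 10 disjoint test folds of 10 samples each
splits = ten_fold_splits(100)
```

Each of the ten (train, test) pairs would then be used to fit and score the classifier once, and the ten scores averaged.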
In other studies, a Bayesian network was used to integrate four types of data, consisting of GO database data, MicroArray (MA) co-expression data, orthologous human Protein-Protein Interaction (PPI) data, and True Positive (TP) data, as two networks: a (GO+MA+PPI) network and a (GO+MA+PPI+TP) network. This approach uses dissimilar data sets and can deal with missing data. The Bayesian network was applied to prioritize candidate genes related to breast cancer, such as PIK3CA, CHEK2, BARD1, and TP53, which were predicted with the (GO+MA+PPI) network. This data integration method was called Prioritizer. The performance of the various gene networks was evaluated by cross-validating on all data sets ten times. Prioritizer performs well when genes are ranked on the basis of their functional interactions, and the approach can aid the diagnosis of disorders by introducing driver genes. The accuracy of the (GO+MA+PPI) network is significant (AUC = 90%). The results also showed that the proposed method performs far better for prioritizing genes in Mendelian diseases than in complex disorders [13].
2.1.2 stSVM by Data Integration
In an investigation carried out in 2013, a new fusion method called the smoothed t-statistic SVM (stSVM) was introduced. It integrates features obtained from experimental data, such as mRNA and miRNA expression data, into one SVM classifier. It has been applied for the prognosis and diagnosis of cancers, including breast, prostate, and ovarian cancers, and for gene prioritization. Four datasets from various data repositories were used [14]: one breast cancer dataset (GSE4922) [15], two prostate cancer datasets (GSE25136 [16] and GSE21032 [17]), and one ovarian cancer dataset (TCGA) [18]. The stSVM was evaluated via ten times repeated 10-fold cross-validation [14]. Finally, stSVM was compared with the saliency-guided SVM (sgSVM) as a meta-classifier, i.e., an SVM trained with significantly differentially expressed genes (FDR cutoff 5%) selected by Significance Analysis of Microarrays (SAM) [19]. It has been shown that the stSVM approach has high predictive power for introducing novel gene lists for the mentioned cancers [14].
2.1.3 FSCOX-SVM
Feature selection with the Cox proportional hazards regression model (FSCOX) is a novel method that integrates data from different datasets into one SVM classifier; the resulting fusion model is called FSCOX-SVM. This model carries out data integration between miRNA and mRNA features. It has been applied to improve the prediction of survival time in various cancers, especially ovarian cancer and Glioblastoma Multiforme (GBM). In this study, two computational methods were used for the prediction of target genes: TargetScan and miRanda. TargetScan identifies miRNA targets after computing optimal sequence complementarity between a mature miRNA and an mRNA, whereas miRanda computes a weighted sum of match and mismatch scores for base pairs and gap penalties. The proposed approach predicts the class of each test sample via the Leave-One-Out Cross-Validation (LOOCV) procedure. Finally, the approach was compared with three classifiers: RF, SVM, and FSCOX-median. The findings demonstrated that the FSCOX-SVM approach showed the highest performance and accuracy among them; in fact, data integration between miRNA and mRNA features led to better results [20].
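The LOOCV evaluation used above can be illustrated with a minimal sketch; here a simple nearest-centroid classifier stands in for the FSCOX-SVM pipeline, and the toy data are hypothetical:

```python
def nearest_centroid(train_X, train_y, x):
    """Predict the class whose mean feature profile is closest to x
    (a stand-in for the SVM used in FSCOX-SVM)."""
    dist = {}
    for label in set(train_y):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist[label] = sum((a - b) ** 2 for a, b in zip(centroid, x))
    return min(dist, key=dist.get)

def loocv_accuracy(X, y):
    """Leave each sample out once, train on the rest, and score it."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid(train_X, train_y, X[i]) == y[i]
    return hits / len(X)

# toy expression-like data: class 0 near (0, 0), class 1 near (5, 5)
X = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
     [5.1, 4.9], [4.8, 5.2], [5.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]
```

With n samples, LOOCV fits the model n times, each time predicting the single held-out sample.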
2.1.4 Multiple RFE Selection Methods (1,2)
Multiple Recursive Feature Elimination (RFE) is an ensemble feature selection method. This strategy has been applied to identify core-module biomarkers of metastatic breast cancer. First, 100 candidate features, including gene expression features based on DNA microarray technology and activity-vector features, were divided into 500 random splits (with possible overlap). Then, 500 classifiers were constructed, and their AUCs as well as their weight vectors were recorded. Third, the features were ranked by the average squared weight of each feature among the 500 different splits. The lowest-ranked feature was eliminated recursively until the maximum average AUC was obtained, and this procedure was repeated 100 times to select a final marker gene set [21]. The consistency of the proposed approach was evaluated with a multi-level reproducibility validation framework [22], a kind of level-by-level validation method [23]. The associated algorithm identifies highly reproducible markers, i.e., it generates highly reproducible results across multiple experiments. The results show that this method improved accuracy and biomarker reproducibility by as much as 15% and 30%, respectively. The method computes an average weight over the 500 classifiers and uses algebraic combiners for decision making. Multiple RFE was applied in the feature selection step of a classification tool called COre Module Biomarker Identification with Network ExploRation (COMBINER) [21]. COMBINER was run on three independent breast cancer datasets from the Netherlands [24], the USA [25], and Belgium [26], identified 13 driver genes as reproducible discriminative biomarkers, and constructed a robust regulatory network [21].
In other studies, the RFE selection method has also been applied for biomarker discovery in colon cancer, leukemia, lymphoma, and prostate cancer. The outputs of all selectors are aggregated, and the ensemble result is computed; in general, this method generates a diverse set of feature selections [27]. The approach was assessed on four microarray datasets: a leukemia dataset [28], a colon dataset [29], a lymphoma dataset [30], and a prostate dataset [31]. Training sets were selected by subsampling, and each time 10% of the data was held out as an independent validation set to evaluate classifier performance. The results showed that the robustness of the selected driver genes increased by up to almost 30%, and classification performance improved by up to ∼15% [27].
2.1.5 Feature Subsets Method
In a study aimed at predicting survival in breast cancer patients, the researchers designed an ensemble method that learns models using feature subsets and then combines their predictions [32]. Two breast cancer datasets were used: Dataset 1 [24, 33] and Dataset 2 [25]. The data were obtained through microarray experiments, and the results were compared with clinical criteria. Feature subsets were obtained from three different methods: the splitting feature selection method, the sliding-window feature selection method, and the random-subsets feature selection method. These three feature-subset-selection methods were integrated to construct the proposed ensemble model. The performance of the proposed method was evaluated on 100 different training/test splits of Dataset 2, and the method showed high performance with a tight confidence interval. Compared to the Amsterdam signature and clinical criteria, the proposed method yielded high sensitivity and Negative Predictive Value (NPV). When splitting feature subsets were used, sensitivity and accuracy improved further [32].
2.1.6 Multimodal Data Fusion of Separate Datasets
Multimodal data fusion is a fusion model that integrates clinical and bio-molecular data, such as image and microarray data; in this study, the data were therefore heterogeneous. This approach has been applied to the diagnosis of melanoma. There are two different approaches to multimodal fusion of separate datasets, and both were exploited for feature integration: combination of data (COD) and combination of interpretations (COI). COD is applied before classification and aggregates the features from each source into a single feature vector, whereas in COI independent classifications based on the individual feature subsets are combined with a proper voting mechanism [34]; since COI aggregates outputs, it uses algebraic combiners as its decision-making strategy. Another study, related to prostate cancer, reported that COD methods are more optimal [35]. In the feature selection step of this method, Sequential Backward Elimination (SBE) with the Random Forest (RF) algorithm, which utilizes an integration of decision trees, was used. This kind of feature selection technique leads to dimensionality reduction, and feature selection and dimensionality reduction methods help extract better biomarkers. The performance of this classification method was evaluated using a 10-fold cross-validation procedure with 50 repetitions on different datasets. The results demonstrated that a random forest approach used to classify bootstrapped samples attained a high AUC score, whereas the performance obtained with linear methods such as principal component analysis (PCA) and linear discriminant analysis (LDA) was not high [34].
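The difference between COD and COI can be sketched with a minimal example; the two threshold "classifiers" and the feature values are purely illustrative:

```python
from collections import Counter

def classify(features, threshold=0.5):
    """Toy stand-in classifier: positive if the mean feature exceeds
    the threshold."""
    return 1 if sum(features) / len(features) > threshold else 0

def cod_predict(image_feats, microarray_feats):
    """Combination of data (COD): concatenate the features from each
    source into one vector BEFORE running a single classifier."""
    return classify(image_feats + microarray_feats)

def coi_predict(image_feats, microarray_feats):
    """Combination of interpretations (COI): classify each modality
    independently, then combine the class labels with a vote."""
    votes = [classify(image_feats), classify(microarray_feats)]
    return Counter(votes).most_common(1)[0][0]

# hypothetical per-modality feature vectors for one lesion
img, omics = [0.9, 0.8], [0.7, 0.9]
```

COD therefore needs a classifier that can handle the joint feature space, while COI only needs a combiner over the per-modality outputs.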
2.1.7 Meta-classifier Ensemble Learning based on Genetic Programming Technique
In other studies, an ensemble meta-classifier combining five classifiers was used for feature integration. The features were produced by the Genetic Programming (GP) technique, which uses a kind of evolutionary algorithm to generate thousands of classifiers as features; the input data for the GP system are gene expression data. The top five classifiers were then chosen as the individual classifiers. GP classifiers often include five or fewer genes as biomarkers and predict the cancer class successfully. The proposed ensemble method integrates these features to achieve better results. This method has been applied for the diagnosis of some cancer types, such as prostate and lung cancers, and it can also classify their subtypes, such as Metastatic Prostate Cancer (MPC) and Primary Prostate Cancer (PPC). The results demonstrated that GP is a robust method for feature selection and accurately suggests genes as prognostic and diagnostic targets; the misclassification error rate of GP is also very low. The performance of the GP system was evaluated using five-fold cross-validation on the training set. The maximal accuracy was obtained with the proposed method, and the average prediction rate was very high when the GP-based meta-classifier ensemble was compared with other classification methods such as 3-nearest neighbors, nearest centroid, covariate predictor, SVM, and Diagonal Linear Discriminant Analysis (DLDA) [36].
2.1.8 Ensembles of BioHEL Rule Sets
Bioinformatics-Oriented Hierarchical Learning (BioHEL) is an evolutionary machine learning approach that integrates microarray data from different datasets. In the feature selection phase, it uses Random Forest based Feature Selection (RFS), Correlation-Based Feature Selection (CFS), and Partial-Least-Squares Based Feature Selection (PLSS). This method has been applied for gene prioritization to find diagnostic markers related to prostate, lymphoma, and breast cancers. The proposed method uses algebraic combiners for feature ranking. For each training set, BioHEL was run 100 times separately. The main evaluation procedure is a cross-validation scheme called two-level external cross-validation. The results were compared with some other machine learning tools, among them the Genetic Algorithms based classifier system (GAssist), SVM, RF, and Prediction Analysis of Microarrays (PAM). The accuracy obtained with BioHEL was the highest, and BioHEL showed better performance on large datasets than its nearest rival, GAssist [37].
2.1.9 Kernel-based Data Fusion Method for Gene Prioritization
This fusion method combines kernel matrices defined on human genes. A kernel matrix is used in kernel machines such as the SVM: all instances are represented by a kernel matrix, an n × n positive semidefinite matrix whose element (a_i,j) gives the similarity between the ith and jth instances through a pairwise kernel function k(x_i, x_j). With this approach, learning methods can be enhanced without explicitly constructing a feature space. In fact, the kernel matrix implicitly represents the inner products between all pairs of instances in an embedded feature space yielded by an explicit feature mapping. Since the resulting feature space may be high-dimensional or even infinite-dimensional, the kernel matrix allows tractable and efficient computation in the original space without explicit mapping [38]. The kernels are integrated through the Log-Euclidean Mean (LogE), the Arithmetic Mean (AM), and a weighted version of LogE (W-LogE). The input data were 12,000 human genes. With this approach, 24 novel driver genes were proposed as candidates for 13 diseases, including breast and ovarian cancers. The method uses GO, Swiss-Prot (SW) annotation, a PPI network based on the STRING database, and the literature as annotated data sources. For these cancers, kernel performance was evaluated using leave-one-out cross-validation on the training genes. The average True-Positive Rate (TPR) results obtained with the proposed kernel-based data fusion tools were compared with ENDEAVOUR. The results showed that the various kernel-based fusion approaches, LogE, W-LogE, and AM, performed better than ENDEAVOUR, with W-LogE being the best [39].
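The arithmetic-mean and log-Euclidean fusions of kernel matrices can be sketched as follows. This is a minimal NumPy illustration on small positive-definite kernels, not the authors' implementation; the W-LogE variant simply adds per-kernel weights inside the mean.

```python
import numpy as np

def _logm(K):
    """Matrix logarithm of a symmetric positive-definite kernel,
    via its eigendecomposition."""
    w, V = np.linalg.eigh(K)
    return V @ np.diag(np.log(w)) @ V.T

def _expm(S):
    """Matrix exponential of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(w)) @ V.T

def fuse_am(kernels):
    """Arithmetic Mean (AM) of kernel matrices."""
    return sum(kernels) / len(kernels)

def fuse_loge(kernels, weights=None):
    """Log-Euclidean Mean (LogE); non-uniform weights give W-LogE."""
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    return _expm(sum(w * _logm(K) for w, K in zip(weights, kernels)))

# two hypothetical 2x2 positive-definite gene kernels
K1 = np.array([[2.0, 0.5], [0.5, 1.0]])
K2 = np.array([[1.5, 0.2], [0.2, 2.0]])
```

Both fusions return a symmetric matrix that can be fed to a kernel machine such as an SVM in place of any single source kernel.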
2.2 Decision Integration
These methods are constructed from several base classifiers. The critical component of any ensemble system is the strategy employed to combine the classifiers, and the output-combination module behind the decision-making method is another major issue in this kind of ensemble. There is no unique naming convention for the same decision-making strategies across different articles and books. The terminology we use for output combination and final decision-making follows the pattern below (Fig. 1):
The pattern is divided into two main categories, combining class labels (CCL) and combining continuous outputs (CCO). CCL is divided into four sub-types: majority voting (MV), weighted majority voting (WMV), behavior knowledge space (BKS), and Borda count. CCO is divided into three sub-types: algebraic combiners, decision templates, and Dempster-Shafer based combination [3].
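Three of the combiners in this taxonomy can be illustrated with minimal examples: majority voting and weighted majority voting operate on class labels (CCL), while a mean algebraic combiner operates on continuous per-class outputs (CCO). All inputs below are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """CCL / MV: the most frequent class label wins."""
    return Counter(labels).most_common(1)[0][0]

def weighted_majority_vote(labels, weights):
    """CCL / WMV: each classifier's vote counts with its weight."""
    score = Counter()
    for label, w in zip(labels, weights):
        score[label] += w
    return score.most_common(1)[0][0]

def mean_combiner(probabilities):
    """CCO / algebraic combiner: average the per-class posterior
    probabilities across classifiers, then pick the highest mean."""
    n_classes = len(probabilities[0])
    means = [sum(p[c] for p in probabilities) / len(probabilities)
             for c in range(n_classes)]
    return means.index(max(means))
```

BKS, Borda count, decision templates, and Dempster-Shafer combination follow the same pattern but use richer statistics of the base outputs.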
2.2.1 Homogeneous Ensemble Methods
Homogeneous ensemble methods are those in which all base classifiers are of the same single type, but differ in the data used in the training phase, in the model parameters (e.g., a linear-combination fusion-function model), or in a combination of the two [40].
2.2.1.1 SVM Classifiers Fusion (three SVM)
One kind of homogeneous ensemble method is SVM classifier fusion. It is a multi-classifier system (MCS) that combines three SVM classifiers and has been used for breast cancer detection. Combining the three classifiers minimized the classification error in the training phase. For every base SVM, the training and testing data were obtained from the Digital Database for Screening Mammography (DDSM) mammographic image database [41, 42]; in this study, 300 images were used in the training phase and 100 images in the testing phase. The system uses simple majority voting for decision making, and the cross-validation technique was used to evaluate the multiple classifier system. The results showed that the fusion of SVM classifiers improves the performance of the system and is better than placing all features in one feature vector. Moreover, compared with each single SVM classifier, the accuracy of the MCS with voting increased because the quality of the decision was improved [43].
2.2.1.2 enSVM (200 SVM)
In one study, a fusion approach called ensemble SVM (enSVM), which comprises three steps, was used. Step 1 is sub-sampling of genes, which generates gene subsets and then constructs 200 diverse classifiers; the input data for this step are gene microarray data related to 97 patient samples. In step 2, SVMs come into operation and generate 25 candidate classifiers; SVM is suitable in this phase because it handles the variability and high dimensionality of the training data. In step 3, the final decision is made with a majority voting strategy. The proposed method has been applied for microarray data classification and accurate diagnosis of breast cancer, cancers of the central nervous system, colon tumor, leukemia, and prostate cancer. In this study, LOOCV was used to evaluate the performance of the SVM as the base classifier. The results showed that the proposed gene-sub-sampling-based ensemble learning method, enSVM, outperforms the single SVM and re-sampling ensemble learning methods such as bagging and boosting, and achieved relatively the best classification accuracy [44, 45].
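The three enSVM steps, gene sub-sampling, one base classifier per gene subset, and majority voting, can be sketched as follows; a nearest-centroid rule stands in for the base SVMs and the counts are scaled down (all parameters are illustrative):

```python
import random
from collections import Counter

def centroid_classifier(X, y, genes):
    """Base learner restricted to one gene subset (a stand-in for a
    per-subset SVM)."""
    centroids = {}
    for label in set(y):
        rows = [[r[g] for g in genes] for r, t in zip(X, y) if t == label]
        centroids[label] = [sum(c) / len(rows) for c in zip(*rows)]
    def predict(x):
        v = [x[g] for g in genes]
        return min(centroids,
                   key=lambda l: sum((a - b) ** 2
                                     for a, b in zip(centroids[l], v)))
    return predict

def ensvm_predict(X, y, x, n_classifiers=20, subset_size=3, seed=0):
    """Step 1: sample random gene subsets; step 2: fit one base
    classifier per subset; step 3: majority-vote their predictions."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_classifiers):
        genes = rng.sample(range(len(X[0])), subset_size)
        votes.append(centroid_classifier(X, y, genes)(x))
    return Counter(votes).most_common(1)[0][0]

# toy 6-gene expression profiles for two well-separated classes
X = [[0.0] * 6, [0.2] * 6, [0.1] * 6,
     [5.0] * 6, [4.9] * 6, [5.1] * 6]
y = [0, 0, 0, 1, 1, 1]
```

Sub-sampling genes (rather than samples, as in bagging) is what gives the base classifiers their diversity here.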
2.2.1.3 Three neural networks fusion
In previous studies, the researchers constructed a combinational feature selection method based on an ensemble of three neural networks (NNs). This method operates at both levels (the feature selection integration level and the decision integration level). It has been applied to the diagnosis and treatment of several cancers by discovering marker genes of the diseases from gene expression data: adult acute lymphoblastic leukemia (ALL) versus acute myeloid leukemia (AML), malignant pleural mesothelioma (MPM) versus adenocarcinoma (ADCA) of the lung, and prostate cancer. In the first step, bagging generates 100 individual classifiers by resampling the microarray data 100 times; each resample is then given as input to three neural networks. The ensemble of three neural networks uses algebraic combiners for decision making, but since there are 100 ensemble networks with 100 different outputs, majority voting is used to combine their results, and the final decision is also made by majority voting. Compared with other methods, the proposed method effectively improved the results; it can extract more information from microarray data to increase accuracy and to introduce the driver genes of the diseases for diagnosis and treatment. The accuracy of this method for ALL/AML, lung cancer, and prostate cancer was 100%, 100%, and 97.06%, respectively. Classification performance and accuracy were evaluated through 10-fold cross-validation and LOOCV. The accuracy of bagged decision trees for ALL/AML, lung cancer, and prostate cancer was 91.18%, 93.29%, and 73.53%, respectively, and the accuracy of the best competing methods was 97.06%, 97.99%, and 73.53%, respectively [46].
2.2.1.4 NED method (five artificial neural networks fusion)
An ensemble method called Neural Ensemble based Detection (NED) is a learning method that combines five Artificial Neural Networks (ANNs) [48]. The neural algorithm of each network is the Fast Adaptive Neural Network Classifier (FANNC) [47], which combines high performance with high speed and, being automatic, requires no manual setup. The proposed ensemble method has been applied to lung cancer diagnosis. In this study, images of needle-biopsy specimens were used as input data, comprising 552 cell images from biopsies of subjects. The ensemble system has a two-level structure. At the first level, the output of each individual neural network falls into two classes, normal cell or cancer cell, and NED uses full voting for decision making: a cell is considered normal only when all of the individual networks vote that it is normal. At the second level, each network has five outputs (adenocarcinoma, squamous cell carcinoma, small cell carcinoma, large cell carcinoma, and normal), and plurality voting is used for decision making. In this way, the identification rate of NED is high and its false-negative rate is low, so fewer cancer patients are missed. The accuracy of the method was evaluated by 5-fold cross-validation on the data set, and the confidence of the first-level ensemble, especially, was demonstrated to be high. The proposed approach, consisting of five FANNC networks, was compared with a single artificial neural network; the results showed that NED outperforms the single FANNC [48].
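NED's two decision rules, full voting at the first level and plurality voting at the second, can be sketched as:

```python
from collections import Counter

def full_vote_normal(votes):
    """First-level rule: a cell is labelled normal only if EVERY
    network votes 'normal'; any dissent flags the cell as cancerous."""
    return "normal" if all(v == "normal" for v in votes) else "cancer"

def plurality_vote(votes):
    """Second-level rule: the subtype with the most votes wins, even
    without an absolute majority."""
    return Counter(votes).most_common(1)[0][0]
```

The first-level rule trades a few extra false positives for a very low false-negative rate, which is why NED misses fewer cancer patients.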
2.2.1.5 Clinical decision support system
The Clinical Decision Support System (CDSS) is an ensemble method that comprises four different Weighted Random Forests (WRFs), each constructed with 80 trees. This ensemble method combines the results of clinical techniques, both classic and ancillary. The crucial clinical data included visit dates, patient age, Human Papilloma Virus (HPV) genetic examinations, cytological diagnoses, and histological examination of biopsies; under the project, 740 cases were studied. With this method, more accurate results were produced. It has been applied to Cervical Cancer (CxCa) diagnosis and uses majority voting as its decision-making strategy. The performance of the proposed system was estimated using 10-fold cross-validation. The results showed that the proposed method (a CDSS consisting of four different WRFs) performs better than single-classifier approaches, including k-Nearest Neighbors (KNN), NB, Classification and Regression Tree (CART), Multi-Layer Perceptron (MLP), Radial Basis Function (RBF) network, and Probabilistic Neural Network (PNN), but slightly worse than an integrated CDSS consisting of two ANNs [49].
2.2.1.6 Bagging subgroup identification trees
Bagging subgroup identification trees is a tree-based ensemble method that combines binary trees. Bootstrap samples are generated by resampling the training data with replacement, and several trees are constructed as diverse classifiers. In the next step, each tree is converted into a binary classifier; finally, the "binary trees" are constructed and the final prediction is made with a simple majority vote strategy. In this study, clinical variables such as gender, age, and surg were used. For colon cancer, 929 cases participated [9]; the dataset associated with this cancer can be downloaded from the R package survival [50]. In addition, the GSE14814 dataset [51] related to lung cancer, including 133 patients, was used. These two datasets were thus extracted from the R package survival and the GEO database for colon cancer and lung cancer, respectively. As mentioned, the proposed method uses majority voting as its decision-making strategy. The selection of biomarkers and the building of classifiers were done with leave-one-out cross-validation. The sensitivity, specificity, and accuracy rate of the proposed method were compared with a multivariate Cox model. The results showed that the proposed ensemble bagging (novel tree-based) method is better, especially when the data are imbalanced; for balanced data, the Cox model was slightly better [9].
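The bagging loop above, bootstrap resampling, one tree per bootstrap sample, and a majority vote, can be sketched with one-split decision stumps standing in for the subgroup identification trees (the data and parameters are illustrative only):

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit the best single-feature threshold rule (a depth-1 'tree')."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for sign in (1, -1):
                pred = [1 if sign * (row[j] - t) > 0 else 0 for row in X]
                acc = sum(p == label for p, label in zip(pred, y))
                if best is None or acc > best[0]:
                    best = (acc, j, t, sign)
    _, j, t, sign = best
    return lambda row: 1 if sign * (row[j] - t) > 0 else 0

def bagged_predict(X, y, x, n_trees=25, seed=0):
    """Bootstrap the training data, fit one stump per sample, and
    combine the predictions with a simple majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        votes.append(stump(x))
    return Counter(votes).most_common(1)[0][0]

# toy clinical-style data: one feature separates the two outcome groups
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
```

Converting each fitted tree into a binary classifier before voting, as in the original method, corresponds here to each stump emitting a 0/1 label.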
2.2.1.7 CAD system
The Computer-Aided Diagnostic (CAD) system is an ensemble method that combines Bayes classifiers. It was applied to tissue classification and the diagnosis of focal liver lesions. In this study, the input data were contrast-enhanced Computed Tomography (CT) images of 20 cases of liver cancer. The classification process has two phases: in the first phase, CT images were classified using the Bayes classifier; in the second phase, the classifier outputs were combined and the decision was made using a majority voting strategy. Classification success rates were evaluated by the leave-one-out technique. This approach to classifier combination generated better performance; the findings showed that the best performance of this method was obtained by majority voting, and the CAD-based Bayes approach achieved a relatively high accuracy value [52].
2.2.2 Heterogeneous Ensemble Methods
Heterogeneous ensemble methods incorporate different base classifiers, although they usually use the same training dataset and input data for running the different learning algorithms [53].
2.2.2.1 BNCE method
BNCE is an ensemble approach for training neural network fusions that combines boosting with negative correlation (NC) learning. It has been used for breast cancer detection, to classify tumors as either benign or malignant [54]. The data were well-known breast cancer benchmarks downloaded from the UCI machine learning repository [55, 56]. The proposed approach uses majority voting for decision making. For performance evaluation, the percentage classification error of BNCE was estimated, and the results were compared with those of other methods, including the Evolutionary Programming Network (EPNet), a single NN, a simple NN ensemble, bagging, AdaBoost, and arc boosting. A comparison in terms of the classification error rate on the benchmark datasets showed that the proposed method has the best performance [54].
2.2.2.2 The meta-learning method
In another study, a meta-classification tool was used for prostate cancer detection. The data for this ensemble strategy are mass spectrometry (MS) data, and the method combines the results of several machine learning approaches [57]. The individual classifiers are ANN, KNN, SVM, logistic regression, and CART [58], and weighted majority voting is used for decision making. The proposed method combines multiple error-independent base classifiers into a meta-classifier. Validation of the meta-classifier was done with k-fold (leave-one-out) cross-validation experiments on the training set. This ensemble method improves prediction accuracy over the individual classifiers: compared with ANN, KNN, SVM, logistic regression, and CART, the proposed method's accuracy was better, and sensitivity and specificity were high (91.30% and 98.81%, respectively). In addition, 11 biomarkers associated with prostate cancer were identified [57].
2.2.2.3 Heterogeneous ensemble (KNN-SVM- DT-LDA)
Generally, heterogeneous ensemble methods combine the outputs of several base classifiers: they train several learner machines with different learning strategies on a single common training dataset, in contrast with methods that use different datasets to train a single learner machine. In one heterogeneous ensemble method proposed in previous studies, five base classification algorithms were used: KNN (K=3), KNN (K=5), SVM, DT (Decision Trees), and LDA (Linear Discriminant Analysis). This method was designed to increase the chance of early prostate cancer diagnosis [59]. The data were proteomic prostate cancer data obtained from protein mass spectrometry, available in JNCI Data 7-3-02 [60]. The statistical population consisted of 322 patients whose sera were analyzed: 63 people with a normal prostate, 190 patients with benign prostate tumors, 26 patients with prostate cancer and a Prostate-Specific Antigen (PSA) level in the range 4–10, and 43 patients with prostate cancer and PSA levels above 10. The proposed approach uses simple majority voting for decision making, and 10-fold cross-validation was used for performance validation. The results showed that accuracy and sensitivity increased, while specificity slightly decreased, after using the ensemble method. Diagnosis using protein mass spectrometry is a relatively new solution, and many learning algorithms use it to increase the chance of detecting cancer in the early stages; however, proteomic cancer datasets suffer from the High Dimensionality Small Sample (HDSS) problem, which requires more sophisticated solutions to improve classification accuracy. This simple fusion strategy of five classification algorithms with majority voting boosted overall performance and yields a more promising approach for the use of mass spectroscopy data related to prostate cancer [59].
2.2.2.4 MRS method
The mixture of rough set and SVM (MRS) is a hybrid classification model based on
clinical markers, built by combining rough set and SVM classification tools in serial
form. This model is a serial multi-sensor system that integrates several methods with
different sources and characteristics for breast cancer prognosis. In this fusion
method, the rough set classifier acts as the first layer, identifying some singular
samples in the data, and the SVM classifier operates as the second layer, classifying
the remaining samples. The upper layer, also called the shrinking classifier, uses a
voting strategy for decision making. For each sample, the rough set tries to assign
a class type; if the class type remains unknown at this step, the second layer comes
into operation to assign a class type to that sample. This two-layer construction
without voting is a suitable way to obtain a better clinical prognosis. MRS used two
open breast cancer datasets for prediction [61]. One dataset, hereafter called BRC-1,
includes both clinical data and gene expression data from 97 breast cancer tumors of
lymph-node-negative patients [62]. The other, hereafter called BRC-2, uses baseline
human primary breast tumor clinical data from the Lawrence Berkeley Laboratory (LBL)
breast cancer cell collection, containing 174 samples [63]. This approach gives higher
accuracy, specificity, sensitivity, and Matthews correlation coefficient (MCC) than
previous prognostic methods such as NB, SVM, J48, random forest, and the
attribute-selected classifier. The higher accuracy of the method was also validated
by 5-fold cross-validation [61].
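A minimal sketch of this two-layer serial construction follows; the rough-set and SVM stand-ins are hypothetical threshold rules, not the actual classifiers of [61]:

```python
def cascade_classify(sample, first_layer, second_layer):
    """Serial two-layer fusion: the first (shrinking) classifier labels the
    samples it can decide on; anything it leaves as None (unknown) falls
    through to the second classifier."""
    label = first_layer(sample)
    return label if label is not None else second_layer(sample)

# Toy stand-ins for the rough-set and SVM layers (hypothetical thresholds):
rough_set = lambda x: "relapse" if x > 0.9 else None  # decides only extreme cases
svm = lambda x: "relapse" if x > 0.5 else "no-relapse"

print(cascade_classify(0.95, rough_set, svm))  # handled by the first layer
print(cascade_classify(0.60, rough_set, svm))  # falls through to the SVM layer
```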
2.2.2.5 Bagging (bootstrap aggregating) method
In a study conducted in 2016, two different meta-learning algorithms were used. The
authors applied Bagging-RF, Bagging-NB, and Bagging-K* instance (K*) and compared
their results with the individual classifiers RF, NB, and K*, as well as with a vote
ensemble classifier (RF-NB-K*). All of these methods were used for melanoma skin
cancer detection; the data were clinical images of skin lesions. These ensemble methods
use simple majority voting for decision making. Because Bagging reduces variance and
helps avoid overfitting, Bagging aggregation improves the accuracy and stability of
the selected base tools. In this study, a 10-fold cross-validation test was used to
estimate accuracy. In comparison with the other methods, the results show that when
the number of positive cases is insufficient, using Bagging with random forest is
suitable; with this approach, sensitivity and AUC were meaningfully improved [64].
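The bootstrap-aggregating idea can be sketched as follows; the one-dimensional threshold learner and the toy training set are illustrative stand-ins, not the RF/NB/K* learners of [64]:

```python
import random

def bagging_predict(train, sample, n_estimators=25, seed=0):
    """Bootstrap aggregating with a trivial base learner: a 1-D threshold
    placed at the midpoint between the two class means of each bootstrap
    resample. The final label is the majority vote of the estimators."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        boot = [rng.choice(train) for _ in train]          # sample with replacement
        pos = [x for x, y in boot if y == 1]
        neg = [x for x, y in boot if y == 0]
        if not pos or not neg:
            continue                                       # degenerate resample
        thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        votes.append(1 if sample > thr else 0)
    return 1 if votes.count(1) > len(votes) / 2 else 0     # majority vote

train = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
print(bagging_predict(train, 0.85))  # -> 1
```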
2.2.2.6 Artificial intelligence based hybrid ensemble technique
Researchers have also designed a novel artificial-intelligence-based hybrid ensemble
technique for cervical cancer screening. They used smear image data as the clinical
data for diagnosis. The hybrid ensemble system combines fifteen different classifiers
[65]: Bagging, Decorate, decision table, Ensemble of Nested Dichotomies (END)
[66], filtered classifier, J48 graft [67], Projective Adaptive Resonance Theory (PART)
[68], multiple backpropagation ANN, multiclass classifier, NB, random subset space,
radial basis function network [69], rotation forest, random forest, and random committee.
The method uses voting for decision making. Validation is done on multiple training
and testing datasets, and 10-fold cross-validation is applied for evaluation of the
algorithm. This approach provides high performance for the classification of complex
datasets. The hybrid ensemble technique is a promising method for the classification
of pap-smear images and can be used to detect cervical cancer. The receiver operating
characteristic (ROC) area of the proposed hybrid ensemble system increased, and the
overall performance of the ensemble approach improved. In comparison with individual
classifiers, results were better for both multi-class and two-class problems [65].
2.2.2.7 Boosting-TWSVM method
In another study, researchers used boosting together with SVM for detecting
MicroCalcifications (MCs) clusters in digital mammograms. MCs clusters are an important
sign in breast cancer diagnosis. This ensemble method uses algebraic combiners for
decision making, because the aggregation is computed by weighted averaging [70]. In
comparison, the authors showed that their proposed method outperformed other methods
such as twin SVM (TWSVM) [71]. In this study, there were 650 positive and 3567 negative
samples. The samples were split into two subsets: the first part was used for training
and validation, while the second part was used for testing. The TWSVM classifier was
trained using the 10-fold cross-validation technique to evaluate the method's
performance. Since TWSVM is sensitive to the training samples, it is inconsistent,
but when Bagging is integrated into TWSVM, this inconsistency problem in the training
set is resolved. Results for Boosting-TWSVM showed that sensitivity and specificity
increased, and the authors demonstrated that Boosting-TWSVM is a promising approach
for MCs detection [70].
2.2.2.8 Bagging and boosting-based TWSVM
Bagging- and boosting-based twin support vector machine (BBTWSVM) is yet another
ensemble method. Its algorithm consists of three modules: image preprocessing, feature
extraction, and the BBTWSVM module itself. The BBTWSVM module is composed of two
algorithms, Bagging-TWSVM and Boosting-TWSVM; combining them results in a more
efficient solution composed of several classifiers. This method has been applied to
clustered MCs detection, and so it is also called the MCs detection approach; breast
cancer can be diagnosed through it. This fusion method uses algebraic combiners for
decision making because it either takes the maximum score over all the base classifiers
or computes a weighted scoring scheme among the base learners. Data for validation
were chosen from the training set, and as in the Boosting-TWSVM method, the 10-fold
cross-validation technique was used in the training phase to evaluate performance.
BB-TWSVM outperforms TWSVM: the sensitivity of the BB-TWSVM classifier increased,
and ROC curves showed that, compared with TWSVM, the performance of the proposed
approach is improved [72,73].
2.2.2.9 Ensemble multi-class learning
The ensemble multi-class learning algorithm is an ensemble approach that combines
the error-correcting output coding (ECOC) scheme and the one-against-one pairwise
coupling (PWC) scheme. This method has been used for finding biomarkers in liver
cancer. It uses an algorithm called extended Markov blanket (EMB), and a liver cancer
matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass
spectrometry (MS) dataset was used as the training set. In this method, redundancy
and relevance were the two aspects of biomarkers considered for feature selection.
With this ensemble method, identification of proteomic biomarkers for liver cancer
became possible. It uses voting for decision making [74]. The liver cancer data
comprised 201 MALDI-TOF MS spectra belonging to HCC patients, cirrhosis patients,
and healthy participants [75]. All samples were then randomly divided into 10 exclusive
folds. In this study, the error rate was estimated, and 10-fold cross-validation was
selected to evaluate experimental results such as the accuracy value. The proposed
method was compared with the random forest, NB, classical ECOC, and J48 approaches,
and the results showed that accuracy increased to 88.71% [74].
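The ECOC half of this scheme can be sketched by its decoding step: each class owns a binary codeword, and the bits predicted by the binary learners are decoded to the class whose codeword is nearest in Hamming distance. The three-class code matrix below is hypothetical:

```python
def ecoc_predict(predicted_bits, code_matrix):
    """Error-correcting output coding decoder: return the class whose
    codeword has the smallest Hamming distance to the predicted bits."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(code_matrix, key=lambda cls: hamming(code_matrix[cls], predicted_bits))

# Hypothetical 3-class code matrix over 5 binary learners:
codes = {
    "HCC":       [1, 1, 0, 0, 1],
    "cirrhosis": [0, 1, 1, 0, 0],
    "healthy":   [0, 0, 1, 1, 1],
}
print(ecoc_predict([1, 1, 0, 1, 1], codes))  # one bit flipped, still decodes to HCC
```

Because the codewords are spread apart, a single erroneous binary learner does not change the decoded class, which is the error-correcting property the scheme relies on.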
2.2.2.10 RSS-SCS method
This approach combines the Random Subspace (RSS) and Static Classifiers Selection
(SCS) paradigms. The proposed ensemble method has been used for breast cancer diagnosis
by CAD. The approach used a real database of 300 mammograms as clinical data, collected
from the DDSM. This research showed that CAD is an effective approach for breast
cancer detection in the initial stages. First, RSS constructs diverse classifiers
using different subsets of features in the training phase. Second, SCS selects among
these diverse classifiers. Then the outputs of the classifiers are combined; the
diverse classifiers use majority voting for decision making. Cross-validation was
used to estimate the final accuracy of the feature subsets. Results demonstrated
that, in comparison with three of the best ensemble methods (Bagging, AdaBoost, and
Random Subspace), the proposed approach achieved higher rates on three metrics:
sensitivity, specificity, and accuracy [76].
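The Random Subspace step can be sketched as drawing a random feature-index subset for each base classifier; the sizes below are illustrative, not those of the DDSM experiment:

```python
import random

def random_subspaces(n_features, subspace_size, n_classifiers, seed=0):
    """Random Subspace step: each base classifier gets its own random subset
    of feature indices, which promotes diversity among the classifiers."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subspace_size))
            for _ in range(n_classifiers)]

# Three hypothetical base classifiers, each trained on 4 of 10 features:
for subset in random_subspaces(n_features=10, subspace_size=4, n_classifiers=3):
    print(subset)
```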
2.2.2.11 REIS-based ensemble method
The Resonance-frequency Electrical Impedance Spectroscopy (REIS)-based ensemble
method fuses five classifiers and is a kind of heterogeneous ensemble method. These
classifiers are ANN, SVM, Gaussian mixture models (GMM), CART, and LDA. The fusion
method has been applied to the detection of suspicious breast lesions, which signal
the risk of having or developing breast cancer. In this investigation, 174 cases
were examined. Imaging-based examinations such as mammography, additional views,
ultrasound, and magnetic resonance imaging were used as clinical data. For the feature
selection stage, a genetic algorithm was applied. The REIS-based method uses algebraic
combiners for decision making; in fact, it combines the results of the classifiers
via three rules: the sum rule, the Weighted Sum Fusion Rule (WSFR), and the Weighted
Median Fusion Rule (WMFR). Performance was evaluated using a leave-one-case-out
cross-validation technique. In this study, ROC curves were compared among the ANN,
SVM, GMM, CART, and LDA individual classifiers; without fusion, ANN had the highest
rate. Comparison of ROC curves for the single best classifier (ANN) and the proposed
fusion model with its three rules showed that WSFR and WMFR are better than ANN and
the sum rule, so the weighted median fusion rule is the best fusion approach in this
study [77].
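The two weighted fusion rules can be sketched as follows; the suspicion scores and per-classifier weights below are hypothetical, not values learned in [77]:

```python
def weighted_sum_fusion(scores, weights):
    """WSFR: weighted average of the base classifiers' scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def weighted_median_fusion(scores, weights):
    """WMFR: sort scores and return the one at which the cumulative weight
    first reaches half of the total weight (a weighted median)."""
    half, acc = sum(weights) / 2, 0.0
    for s, w in sorted(zip(scores, weights)):
        acc += w
        if acc >= half:
            return s

scores  = [0.9, 0.2, 0.6, 0.7, 0.4]    # e.g. ANN, SVM, GMM, CART, LDA outputs
weights = [0.3, 0.1, 0.2, 0.25, 0.15]  # hypothetical per-classifier weights
print(weighted_sum_fusion(scores, weights))
print(weighted_median_fusion(scores, weights))
```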
2.2.2.12 MV-ACE method
The multi-view-based AdaBoost classifier ensemble (MV-ACE) framework is an ensemble
method that integrates multiple views in a straightforward manner, such as a linear
combination of different views, with the AdaBoost algorithm. AdaBoost produces the
base classifiers and optimizes them. In this study, gene expression datasets were
used, and the ensemble method was applied for class prediction from the gene expression
profiles of several cancers, including blood, bladder, liver, prostate, brain,
endometrium, and bone marrow. MV-ACE works well for cancer classification by gene
expression profiles. This ensemble method uses algebraic combiners for decision
making. The algorithm was run 20 times separately, and the average value was
calculated; prediction accuracies were evaluated using 3-fold cross-validation. In
this investigation, the accuracy of the proposed method (MV-ACE) was compared with
other leading classifier ensemble methods, such as Bagging, MultiBoosting (MB), RF,
RSS, and AdaBoost. The results showed that this approach achieved relatively better
performance for most of the datasets [78].
2.2.2.13 DECORATE method
Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples
(DECORATE) is another ensemble classifier method that can be categorized in the
decision integration class. DECORATE combines four base classifiers: NB, the Sequential
Minimal Optimization (SMO) algorithm for training a support vector classifier, C4.5
DT, and a forest of random trees. It has been applied to introduce prognostic
biomarkers related to breast cancer and ovarian cancer. In the feature selection
step, 10 different methods were used as individual classifiers, and 10 feature vectors
were then constructed, respectively [79]. The input data for these classifiers are
somatic mutation data available in the TCGA dataset [18, 80]. Five of the individual
classifiers (OncodriveFM, OncodriveCLUST, MutSig, ActiveDriver, and Simon) rank
candidate genes based on p-values; the five remaining classifiers (FLN, NetBox, MEMo,
Dendrix, and FLNP) choose driver genes based on linkage weights. This method therefore
also has a layer of feature integration. The input test data are 20,624 genes annotated
as protein-coding, downloaded from the NCBI database. Finally, supervised
classification is done, and DECORATE uses the average posterior probabilities of the
four base classifiers mentioned above. It is worth noting that this ensemble method
is effective when data are limited, because DECORATE creates diverse artificial data.
It has been used to identify driver genes for breast cancer and ovarian cancer.
Although DECORATE is grouped among the decision integration methods with an algebraic
combination rule, in a distinct layer it employs a kind of feature fusion mechanism.
During training, the method was run 50 times, and performance was estimated using
10-fold cross-validation. Results showed that when the training set is small, DECORATE
achieved higher accuracy than other leading ensemble approaches such as Bagging or
boosting [79].
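The averaging of posterior probabilities can be sketched as follows; the class names and posterior values below are hypothetical:

```python
def average_posteriors(posteriors):
    """Algebraic combination of the kind DECORATE uses: average the base
    classifiers' posterior probability vectors and predict the class with
    the highest mean posterior."""
    classes = posteriors[0].keys()
    avg = {c: sum(p[c] for p in posteriors) / len(posteriors) for c in classes}
    return max(avg, key=avg.get)

# Hypothetical posteriors from four base classifiers for one candidate gene:
posteriors = [
    {"driver": 0.7, "passenger": 0.3},
    {"driver": 0.4, "passenger": 0.6},
    {"driver": 0.8, "passenger": 0.2},
    {"driver": 0.6, "passenger": 0.4},
]
print(average_posteriors(posteriors))  # -> driver
```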
2.2.2.14 HyDRA method
Hybrid Distance-score Rank Aggregation (HyDRA) is an ensemble approach that combines
the advantages of score-based and distance-based methods [81]. The score and distance
methods [82] are aggregation techniques, and the predictive potency of this aggregation
approach is rated very highly. HyDRA aggregates genomic data based on mutation data
and has been applied to several gene sets related to diseases such as autism, breast
cancer, colorectal cancer, endometriosis, glioblastoma, meningioma, ischaemic stroke,
leukemia, lymphoma, and osteoarthritis. With this method, driver genes for these
diseases were prioritized. The proposed approach uses decision templates as its
decision-making strategy, because it ranks driver genes based on different similarity
criteria that are combined with statistical tools. For each disease, the performance
of the HyDRA method in disease gene discovery was evaluated by cross-validation.
Results showed that the performance obtained with HyDRA was higher than that of other
methods such as Endeavor [83] and ToppGene [84] for a majority of quality criteria.
The analyses also show that each method has specialized advantages in prioritization
for some diseases [81].
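As a rough illustration of rank aggregation, a plain Borda count can re-rank candidate genes by summed positional points; HyDRA's hybrid score-and-distance scheme is more elaborate than this baseline, and the gene lists below are hypothetical:

```python
def borda_aggregate(rankings):
    """Borda-count rank aggregation: each ranked list of length n awards
    n-1, n-2, ..., 0 points from top to bottom; genes are re-ranked by
    their total points across all lists."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, gene in enumerate(ranking):
            scores[gene] = scores.get(gene, 0) + (n - 1 - pos)
    return sorted(scores, key=scores.get, reverse=True)

# Three hypothetical prioritized gene lists from different criteria:
lists = [["TP53", "BRCA1", "PTEN"],
         ["BRCA1", "TP53", "PTEN"],
         ["TP53", "PTEN", "BRCA1"]]
print(borda_aggregate(lists))  # -> ['TP53', 'BRCA1', 'PTEN']
```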
2.2.2.15 Stacking IB3-NBS-RF-SVM method
This ensemble approach combines four well-known individual classifier types:
Instance-Based 3 (IB3), Naïve Bayes Simple (NBS), RF, and SVM. The method is grouped
as a decision-integration-level tool. It classifies DNA microarray data using
biological gene sets such as the KEGG gene sets and has been applied for breast
cancer and leukemia diagnosis. It uses weighted majority voting as its decision-making
strategy. In this study, the kappa value is calculated instead of accuracy because
it is a better criterion for the classification of unbalanced data; the individual
classifiers were then evaluated using a 10-fold cross-validation scheme. The proposed
approach gives better performance in comparison with various integration methods,
including AdaBoost hybrid, Bagging hybrid, Stacking-IB3, Stacking-NBS, Stacking-RF,
and Stacking-SVM. It is able to generate a ranked list of genes that can be effective
for cancer diagnosis, and it shows meaningful improvement in cancer classification
results such as accuracy and kappa values [85].
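The weighted-majority-voting combiner can be sketched as follows; the base-classifier votes and reliability weights below are hypothetical:

```python
def weighted_majority_vote(labels, weights):
    """Each base classifier's vote counts in proportion to its weight
    (e.g. its validation accuracy or kappa); the label with the largest
    weighted tally wins."""
    tally = {}
    for lbl, w in zip(labels, weights):
        tally[lbl] = tally.get(lbl, 0.0) + w
    return max(tally, key=tally.get)

# IB3, NBS, RF, SVM voting on one sample, with hypothetical weights:
print(weighted_majority_vote(["ALL", "AML", "ALL", "AML"], [0.6, 0.9, 0.5, 0.8]))
# -> AML (weighted tally 1.7 vs 1.1, despite the 2-2 tie in raw votes)
```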
2.2.2.16 GenEnsemble method (NBS-IB3-SVM-C4.5 DT)
The GenEnsemble method is similar to the previous one. This ensemble method
incorporates biological knowledge, in the form of gene sets, into the microarray
data classification process. Its four base classifiers are NBS, IB3, SVM, and C4.5
DT. Clinically, the GenEnsemble model has been applied to cancer diagnosis: breast
cancer as a bi-class classification problem and leukemia as a multi-class
classification problem. In the training phase of this study, each gene set was used
as an informed feature-selection subset to train the base classifiers and then
determine their accuracy. As in the previous case (Section 2.2.2.15), this approach
uses weighted majority voting as its decision-making strategy. An internal k-fold
cross-validation strategy was used for each dataset, and GenEnsemble was evaluated
over the training data. Although the Naïve Bayes algorithm, as the base classifier
of Bagging or AdaBoost ensembles, gave the best results for the three breast cancer
datasets, other evidence showed that the proposed approach achieved better performance
compared with other popular ensemble algorithms, such as Bagging, Boosting, IB3,
SVM, J48, AdaBoost-IB3, AdaBoost-J48, Stacking-IB3, and Stacking-SVM [86].
2.2.2.17 ADASVM method
ADASVM is an ensemble method that incorporates AdaBoost with a linear SVM classifier.
It classifies cancers based on microarray gene expression data using a support vector
machine ensemble [87]. In this study, the leukemia dataset [28] was selected as the
benchmark cancer dataset; leukemia has two sub-types, AML and ALL. ADASVM is a suitable
algorithm for two-class problems. The algorithm resolves defects and dilemmas of
AdaBoost and SVM, and the fusion method benefits from the diversity of the AdaBoost
algorithm; moreover, the boosting mechanism reduces the misclassification rate and
thereby improves accuracy. It uses weighted majority voting as its decision-making
strategy. The main measure for evaluating AdaBoost is the weighted error of each
component; if it exceeds 0.5, the process stops. In this study, the researchers showed
that the proposed method outperforms the SVM and KNN classifiers: ADASVM accuracy
was 100%, while SVM and KNN accuracies were lower, respectively [87].
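One AdaBoost round, including the stop-at-0.5 check on the weighted error mentioned above, can be sketched as follows; the sample weights and correctness flags are illustrative:

```python
import math

def adaboost_round(sample_weights, correct):
    """One AdaBoost round: compute the weighted error of the current weak
    learner, stop (return alpha=None) if it reaches 0.5, otherwise derive
    the learner weight alpha and re-weight the samples, up-weighting the
    misclassified ones and renormalizing."""
    err = sum(w for w, ok in zip(sample_weights, correct) if not ok)
    if err >= 0.5:                       # the stopping criterion noted above
        return None, sample_weights
    alpha = 0.5 * math.log((1 - err) / err)
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(sample_weights, correct)]
    z = sum(new)
    return alpha, [w / z for w in new]

weights = [0.25, 0.25, 0.25, 0.25]       # uniform initial sample weights
alpha, weights = adaboost_round(weights, [True, True, True, False])
print(round(alpha, 3), [round(w, 3) for w in weights])
```

After the round, the single misclassified sample carries half of the total weight, which is exactly what forces the next weak learner to focus on it.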
2.2.2.18 NB (Naïve Bayes) combiner method
Previous investigations showed that the NB combiner can be introduced as a fusion
strategy at the decision integration level. It combines 100 decision tree classifiers
[88]. This integration model has been used with 73 benchmark datasets, such as breast
cancer [55, 56], Arrhythmia [89, 90], and Hypothyroid [91, 92], belonging to the UCI
Machine Learning Repository, an open-access database of machine learning problems.
In this study, the NB combiner was compared with three other combination methods:
the Majority Vote combiner, the Weighted Majority Vote (WMV) combiner, and the Recall
combiner (REC). 10-fold cross-validation was applied; for each cross-validation fold,
the training set was divided into two equal parts, a "proper" training part and a
validation part. All base classifiers were validated on the proper training part,
while the combiners were evaluated on the validation part; the prior probabilities,
however, were estimated on the whole training data. Results showed that, among the
mentioned methods, the NB combiner was the best, with a high accuracy value. This
approach solves problems with a large number of fairly balanced classes, while the
WMV combiner is successful for problems with a small number of imbalanced classes.
Previous simulation studies had suggested that NB combiner estimates are inaccurate,
but these results did not show such anomalies when the data were real and sufficient
[88].
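The Naive Bayes combiner can be sketched as choosing the class that maximizes the prior times the product of each base classifier's conditional output probabilities, read off its validation confusion matrix; the priors and matrices below are hypothetical:

```python
def nb_combine(votes, priors, confusion):
    """Naive Bayes combiner: pick the class c maximizing
    P(c) * prod_j P(classifier j outputs votes[j] | true class c).
    confusion[j][c] maps classifier j's output label to its estimated
    probability given that the true class is c."""
    best, best_p = None, -1.0
    for c in priors:
        p = priors[c]
        for j, v in enumerate(votes):
            p *= confusion[j][c].get(v, 1e-9)  # small floor avoids zeroing out
        if p > best_p:
            best, best_p = c, p
    return best

# Two hypothetical base classifiers with known per-class output distributions:
priors = {"malignant": 0.4, "benign": 0.6}
confusion = [
    {"malignant": {"malignant": 0.8, "benign": 0.2},
     "benign":    {"malignant": 0.3, "benign": 0.7}},
    {"malignant": {"malignant": 0.9, "benign": 0.1},
     "benign":    {"malignant": 0.4, "benign": 0.6}},
]
print(nb_combine(["malignant", "malignant"], priors, confusion))  # -> malignant
```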
2.2.2.19 Collective approach (correlation, color palette, color proportion, and SVM)
The collective ensemble method operates at the decision integration level. It combines
several methods: a correlation method, a color palette approach, a color proportion
method, and an SVM classifier. This fusion method determines the status of the CEN17
and HER2 biomarkers, which are important in breast cancer detection, from Fluorescence
in Situ Hybridization (FISH) images used as clinical data. The method uses weighted
majority voting as its decision-making strategy. Performance of the ensemble approach
was confirmed by statistical evaluation of the mentioned spot recognition system.
It was demonstrated that the main advantage of this method is the absolute
repeatability of its scores over several independent runs; this property contrasts
with human expert decisions, which depend on mental and physical condition. In this
study, the average sensitivity, the specificity, and the mean of the summed sensitivity
and specificity of the proposed fusion approach were compared with those of different
individual methods. Results showed that the fusion method gives better efficiency
than the other methods [93].
2.2.2.20 Rankboost_W
The Rankboost weighting function (Rankboost_W) is a Rankboost algorithm to which a
heuristic weighting function has been added [94]. It is an ensemble method that uses
boosting learning techniques to combine different computational approaches, as a set
of weak features, to improve overall performance [95]. Similar to the DECORATE method,
this approach also has a layer of feature integration. It has been applied to gene
prioritization related to prostate cancer. In this study, carried out in 2013, the
training and test data were genomic data based on mutations for prostate cancer
detection. Driver gene and protein-coding gene data were downloaded from the Online
Mendelian Inheritance in Man (OMIM) and HUGO Gene Nomenclature Committee (HGNC)
databases, respectively. The method uses algebraic combiners, with a novel weighting
function, for its final decision-making strategy. The authors used the LOOCV method
to determine confidence interval estimates. In comparison with other approaches,
including ToppGene and ToppNet [87], the performance of the proposed model
(Rankboost_W) was better; AUC and mean average precision (MAP), as two performance
indicators, showed better results for Rankboost_W than for the ToppGene method [94].
2.2.2.21 RVM-based ensemble learning
The Relevance Vector Machine (RVM)-based ensemble is an approach that combines
AdaBoost with a reduced-feature model. It has been applied to classify and diagnose
different cancers through the construction of a human genetic network. The input
data were heterogeneous genomics data such as microarray data. Three major problems
arise with heterogeneous genomics data in the construction of a human genetic network:
the lack of a gold-standard negative set, large-scale learning, and massive missing
data values. This ensemble method addresses these problems using kernel-based
techniques: AdaBoost helps solve the large-scale learning problem, and the
reduced-feature model resolves the problem of massive missing data values, both of
which led to meaningful improvements in performance. 10-fold cross-validation testing
was used to evaluate the performance of the models. Generally, the RVM-based ensemble
is an effective approach for handling high-dimensional feature spaces and massive
missing values. The proposed ensemble method uses algebraic combiners as its
decision-making strategy. In comparison with a robust ensemble approach such as the
Naïve Bayes baseline, the proposed model is preferred; its performance remains high
even with massive missing data values, and the method can be used for classification
tasks in biological datasets [96].
2.2.2.22 PSO–ANN ensemble
The particle swarm optimization (PSO)–ANN ensemble is an ensemble method used for
microarray data classification. The critical point of microarray data analysis is
that only a few of the thousands of genes affect the classification results. This
fusion approach has been applied to cancer diagnosis, including leukemia, colon
cancer, ovarian cancer, and lung cancer, by microarray data classification. The
PSO–ANN approach has four steps. In step 1, the Fisher ratio is used for gene
selection, and correlation analysis is employed for feature selection and dimension
reduction. In step 2, feature subsets are re-sampled with the PSO algorithm, and
several base classifiers are trained. In step 3, appropriate base classifiers are
chosen. In step 4, the selected base classifiers are combined using an Estimation
of Distribution Algorithm (EDA). In this study, ANNs were used as the base classifiers
and were trained with the PSO algorithm. This intelligent ensemble method uses
algebraic combiners for decision making. For each dataset, leave-one-out
cross-validation was used to evaluate classification. In this investigation, the
proposed method was compared with single PSO–ANN, SVM, C4.5, Neuro-fuzzy, and KNN;
on the basis of this comparison, classification accuracy was improved, and the results
showed that the PSO–ANN ensemble model offers the best overall classification accuracy
[97].
2.2.2.23 MF-GE system
The multi-filter enhanced genetic ensemble (MF-GE) is a hybrid ensemble model with
two sequential phases. The first phase consists of a filtering process, and the second
of a wrapper process. In phase 1, genes in the microarray dataset are scored using
the multiple filtering (MF) algorithm, and the obtained scores are integrated. In
the wrapper process, genes are selected with the genetic ensemble (GE) algorithm
[98]. This approach has been applied to four benchmark microarray datasets for gene
selection related to leukemia [31], colon cancer [32], liver cancer [99], and
mixed-lineage leukemia (MLL) [100]. This fusion method can be effective for both
binary-class and multi-class classification problems; the hybrid system also overcame
the overfitting problem of the GE algorithm. Both majority voting and algebraic
combiners were used for decision making, but majority voting generated better
classification results. In this study, the MF-GE system was compared with the original
GE system and the GA/KNN hybrid. A double cross-validation process was applied,
comprising internal and external cross-validation: the internal cross-validation was
done in the gene selection phase, while the external cross-validation was used to
evaluate the selection results. Results showed that the proposed approach (the MF-GE
system) achieved a higher classification accuracy, generated a more compact gene
subset, and reached the selection results more quickly [98].
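The multi-filter score-integration step can be sketched as min-max normalising each filter's gene scores and summing them, so that genes favoured by several filters rise to the top; the filters and gene scores below are hypothetical:

```python
def fuse_filter_scores(per_filter_scores):
    """Integrate scores from multiple filter algorithms: min-max normalise
    each filter's scores to [0, 1], sum across filters, and return the
    genes ranked by the integrated score (highest first)."""
    total = {}
    for scores in per_filter_scores:
        lo, hi = min(scores.values()), max(scores.values())
        for gene, s in scores.items():
            norm = (s - lo) / (hi - lo) if hi > lo else 0.0
            total[gene] = total.get(gene, 0.0) + norm
    return sorted(total, key=total.get, reverse=True)

# Two hypothetical filters scoring the same three genes on different scales:
filters = [
    {"g1": 0.9, "g2": 0.2, "g3": 0.5},    # e.g. a chi-square-style filter
    {"g1": 12.0, "g2": 3.0, "g3": 10.0},  # e.g. an F-ratio-style filter
]
print(fuse_filter_scores(filters))  # -> ['g1', 'g3', 'g2']
```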
2.2.2.24 Evolutionary Ensemble Model
In one study, an ensemble model was designed that integrated the results of three
evolutionary Multilayer Perceptron Neural Network (MLPNN) modules. This approach is
a parallel ensemble method. Four techniques, namely polling, maximum, minimum, and
weighted average, were used for integration, each separately. The evolutionary
ensemble model is suitable for the correct diagnosis of breast cancer [101]. Data
were taken from the Wisconsin Diagnostic Breast Cancer dataset [102, 103] of the UCI
Machine Learning Repository, which contains data vectors from 569 patients. About
70% of the total dataset was used as training data with a genetic algorithm, and 30%
was used as testing data. Each module used an algebraic combiner for its
decision-making strategy; voting then came into operation between the modules. The
results demonstrated that the maximum fusion operator generates the best performance
when compared with the other fusion techniques. The authors proposed that, considering
the type of their training method, validation data are not necessary. The accuracy
value obtained using the maximum integration operator showed the best performance;
sensitivity, specificity, False Positive Rate (FPR), and False Negative Rate (FNR)
values were also reported [101].
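The four integration operators can be sketched directly; the module outputs and weights below are hypothetical class probabilities, not values from [101]:

```python
def fuse(outputs, rule, weights=None):
    """Fuse the modules' class-probability outputs with one of the four
    operators compared in the study: polling (plain average), maximum,
    minimum, or weighted average."""
    if rule == "maximum":
        return max(outputs)
    if rule == "minimum":
        return min(outputs)
    if rule == "polling":
        return sum(outputs) / len(outputs)
    if rule == "weighted":
        return sum(o * w for o, w in zip(outputs, weights)) / sum(weights)

probs = [0.7, 0.9, 0.6]  # hypothetical outputs of the three MLPNN modules
for rule in ("maximum", "minimum", "polling"):
    print(rule, fuse(probs, rule))
print("weighted", fuse(probs, "weighted", weights=[1, 2, 1]))
```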
2.2.2.25 Optimized naïve-Bayes model
This classification fusion system is a heuristic algorithm that improves the
performance of the naïve Bayes classifier. It integrates different heterogeneous
data, including clinical, laboratory, and flow cytometry data. The dataset consisted
of 112 cases of B-Cell Chronic Lymphocytic Leukemia (B-CLL) patients and was obtained
from clinical and general laboratory (hematological) examinations and flow cytometry
analysis. The proposed method uses an algebraic combiner as its decision-making
strategy. In this study, data classification was done using naïve Bayes, and
performance was evaluated by 10-fold cross-validation. The proposed optimized naïve
Bayes model showed high classification accuracy values, and the results demonstrated
that including the flow cytometry parameters can improve performance [104].
2.3 Model Integration
Model integration means that, when we construct the model, the integration is done
at the model level. In this approach, each model transforms the input data into its
required format, and the models are then combined. By linking the models, a single
model is obtained on which decisions are based. This method can be developed using
different tools [105]; one of them, based on Bayesian networks, is mentioned in the
literature.
2.3.1 Bayesian networks-based model integration (1,2)
In a study conducted in 2006, Bayesian networks were also used in connection with
breast cancer. The researchers used three models for integration. In the first model,
named full integration, two data sources (the clinical and microarray data) were
combined, and a Bayesian network was then built on the integrated data; at this step,
only data integration was done. In the second model, named the decision integration
model, an independent model was built for each data source, and the outcome decisions
of these models were then combined based on a weighting policy. In the last model,
named partial integration, similarly to the second one, an independent model was
developed for each data source, and the models were then linked and integrated to
build a single combined model for final decision making; this variant used model
integration for decision making. The models mentioned above were used for predicting
metastatic state in breast cancer. The training set was selected randomly 100 times,
and the proposed methods' performance was evaluated using Receiver Operating
Characteristic (ROC) curve analysis. The obtained results revealed that partial
integration achieved higher performance and proved to be the best method for data
integration [106].
In another study, a novel Bayesian hierarchical model-based method has been proposed.
This approach uses single-nucleotide variants (SNVs) and insertions and deletions
(InDels) in whole-genome sequence data as mutation data [107], obtained from sequencing
of the breast cancer cell line dataset available in TCGA [108]; the data can be
downloaded from
https://gdc.cancer.gov/files/public/file/TCGA_mutation_calling_benchmark_files.zip.
The method first generates two models, a tumor model and an error model, by setting
partition rules on the paired-end reads and datasets; the framework then integrates
these models for mutation calling associated with breast cancer through input data
partitioning. It was confirmed that the proposed method can improve performance by
incorporating heterozygous single-nucleotide polymorphism (SNP) and strand bias
information, in comparison with other Bayesian network classifiers [107].