A Post-Method Condition Analysis of Using Ensemble Machine Learning for Cancer Prognosis and Diagnosis: A Systematic Review


Abstract

Background

Ensemble methods are supervised learning approaches that integrate different types of data or multiple individual classifiers. It has been shown that these methods can improve overall predictive performance.

Methods

This study provides an in-depth review of the 45 most relevant articles and introduces 42 ensemble classifier (EC) machine learning methods used for the detection of 18 different types of cancer. Compared with other cancer types, breast cancer, together with the 22 ensemble methods introduced for its identification, is investigated most extensively. The purpose of this study was to identify, map, and analyze the current academic discourse on EC machine learning methods in order to: 1. identify overarching themes emerging from empirical studies regarding EC methods, 2. determine their input data and decision-making strategies, and 3. evaluate relevant statistical procedures.

Results

By comparing various approaches, we introduce a Relevance Vector Machine (RVM)-based ensemble learning method that can provide optimal solutions for problems such as the curse of dimensionality and the high dimensionality of feature spaces in datasets without missing values.

Conclusions

To obtain robust performance and achieve better results, we suggest using multi-omics data integration, which has been demonstrated to identify cancers and their subtypes more efficiently.



Background
The prognosis and diagnosis of complex diseases such as cancers are two of the most crucial issues in precision oncology. Large volumes of bio-molecular, biological, and biomedical data in multiple databases, including the National Human Genome Research Institute (NHGRI), the National Cancer Institute, the National Institutes of Health (NIH), and The Cancer Genome Atlas (TCGA) project, have advanced our ability to screen for cancer and have created many opportunities for studies on cancer susceptibility [1].
Researchers are currently investigating the possibility of using ensemble machine learning tools to help detect cancer. Ensemble methods are supervised learning models that computationally integrate different types of data, such as bio-molecular and clinical data. Committee approaches can clearly outperform any of the single models [2]. The idea is to build a predictive function that integrates two or more models and can enhance otherwise poor predictions [3]. Accuracy and diversity are two important concepts in the construction of any classifier ensemble. To design an accurate ensemble, the base classifiers should be as diverse as possible; typically, ensembles produce better results when there is greater diversity among the models [4]. Due to recent developments in high-throughput technologies (HTTs), a large amount of cancer-associated data has been generated.
Previous studies showed that the most recent predictive machines integrate various data types and datasets, including genomic, clinical, histological, imaging, demographic, epidemiological, and proteomic data [5].
Ensemble machine learning has a long history [3,6,7]. The most significant progress in the field was made during the 1990s, when Bagging and Boosting were created; they remain the most popular ensemble methods [8].
In 1996, advancing the research on bagging methods, Breiman noted that this approach generates better accuracy when the individual base classifiers are unstable [9]. In the same year, bagged nearest neighbor classifiers were applied to a breast cancer dataset [10]. Since then, the use of various ensemble methods for analyzing bio-molecular, biological, and biomedical data has increased.
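As an illustrative aside, the bagging recipe Breiman describes can be sketched in a few lines of scikit-learn. The bundled breast cancer dataset and all parameter choices below are stand-ins for illustration, not the setup of the cited studies.

```python
# A minimal, hypothetical sketch of Breiman-style bagging: an unstable
# base learner (a decision tree, the BaggingClassifier default) is fit
# on bootstrap resamples of the training data, and predictions are
# combined by voting across the resampled trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(n_estimators=50, random_state=0)  # 50 bootstrap trees

acc_single = cross_val_score(single, X, y, cv=10).mean()
acc_bagged = cross_val_score(bagged, X, y, cv=10).mean()
print(f"single tree: {acc_single:.3f}  bagged trees: {acc_bagged:.3f}")
```

On this dataset the bagged ensemble typically beats the single tree, consistent with Breiman's observation about unstable base classifiers.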
Generally, the fusion issue can be investigated at three levels: 1. data and feature integration, 2. decision integration, and 3. model integration (Fig. 1).

Fig. 1) Integration patterns for ensemble learning
The main purpose of this study, therefore, is to identify the most pre-eminent ensemble methods for cancer detection and for prioritizing driver genes in cancer prognosis and diagnosis. The study attempts to answer the following questions:

1. To what extent does the empirical research surrounding the most pre-eminent ensemble method with superior performance (specifically on large or small datasets) relate to existing typologies?
2. Among the existing studies, which fusion systems have been used to identify cancers?
3. What patterns emerge from empirical studies of ensemble learning that address diversity between different decision makers/datasets to improve performance?
4. Which ensemble model is a more effective approach for processing large-scale, high-dimensional problems with massive missing data values?
5. Which ensemble is successful for problems with a small number of imbalanced classes?

The authors will first introduce the most prominent ensemble methods and determine their base classifiers. The decision-making methods used for classification will be elaborated, the challenges of each method will be addressed, and solutions to many of these challenges will be worked out.

Methods
In this part, 42 ensemble learning methods used for cancer detection are classified into three distinct fusion categories and discussed: data and feature integration methods, decision integration methods, and, on a smaller scale, model integration methods. In each case, the input data, the number of samples, and the statistical tools used for performance evaluation are introduced, and for decision integration methods the decision-making strategies are specifically determined. In this systematic review, ensemble systems from the 45 most relevant articles in the category of cancer prognosis and diagnosis were studied. Studies that used a valid statistical tool to evaluate their performance were included. In contrast, studies that did not compare their accuracy with that of other methods, or that did not clearly assess the different aspects of their performance, were excluded.

Data and Feature Integration
Genes such as PIK3CA, CHEK2, BARD1, and TP53 were predicted with the (GO+MA+PPI) network. This data integration method was called Prioritizer. The performance of the various gene networks was evaluated by cross-validating on all datasets ten times. Prioritizer performs well when genes are ranked on the basis of their functional interactions, and it can aid the diagnosis of disorders by introducing driver genes. The accuracy of the (GO+MA+PPI) network is significant (AUC = 90%). The results also showed that the proposed method performs far better for prioritizing genes in Mendelian diseases than in complex disorders [13].

stSVM by Data Integration
In an investigation carried out in 2013, a new fusion method called smoothed t-statistic SVM (stSVM) was introduced. It integrates features obtained from experimental data, such as mRNA and miRNA expression data, into one SVM classifier. It has been applied to the prognosis and diagnosis of cancers, including breast, prostate, and ovarian cancers, and to gene prioritization. In this study, four datasets from various data repositories were used [14]: one breast cancer dataset (GSE4922) [15], two prostate cancer datasets (GSE25136 [16] and GSE21032 [17]), and one ovarian cancer dataset (TCGA) [18]. The stSVM was evaluated via 10 times repeated 10-fold cross-validation [14]. Finally, stSVM was compared with saliency-guided SVM (sgSVM) as a meta-classifier, an SVM trained with significantly differentially expressed genes (FDR cutoff 5%) selected by Significance Analysis of Microarrays (SAM) [19]. The stSVM approach was shown to have high predictive power for introducing novel gene lists for the mentioned cancers [14]. Third, the features were ranked by the average squared weight of each feature across 500 different splits. The lowest-ranked feature was eliminated recursively until the maximum average AUC was obtained, and this procedure was repeated 100 times to select a final marker gene set [21]. The consistency of the proposed approach was evaluated with a multi-level reproducibility validation framework [22], a kind of level-by-level validation method [23]. The algorithm related to this strategy identifies highly reproducible markers, meaning it generates highly reproducible results across multiple experiments. The results show that this method improved accuracy and biomarker reproducibility by as much as 15% and 30%, respectively. This method computes an average weight from 500 classifiers and uses algebraic combiners for decision making.
Multiple RFE was applied for a classification tool called COre Module Biomarker Identification with Network ExploRation (COMBINER) in the feature selection step [21]. COMBINER software was implemented on three independent breast cancer datasets including the Netherlands [24], the USA [25], and Belgium [26] and identified 13 driver genes as reproducible discriminative biomarkers. Also, a robust regulatory network was constructed [21].
In other studies, the RFE selection method has also been applied for biomarker discovery in colon cancer, leukemia, lymphoma, and prostate cancer. The outputs of all selectors are aggregated, and the ensemble result is computed; in general, this method generates a diverse set of feature selections [27]. The proposed approach was assessed on four microarray datasets: a leukemia dataset [28], a colon dataset [29], a lymphoma dataset [30], and a prostate dataset [31]. Training sets were drawn by subsampling, and each time 10% of the data was held out as an independent validation set to evaluate classifier performance. The results showed that the robustness of the selected driver genes increased by up to almost 30%, and classification performance improved by up to ∼15% [27].
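The ensemble-RFE loop described above (rank features by average squared SVM weight over many resampled splits, drop the weakest, repeat) can be sketched as follows. The synthetic data, the 20 splits, and the stopping size are illustrative assumptions, not the cited studies' settings (which used 500 splits on microarray data).

```python
# Sketch of ensemble recursive feature elimination: a linear SVM is
# refit on many random subsamples, features are ranked by their average
# squared weight, and the lowest-ranked feature is dropped each round.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=120, n_features=30,
                           n_informative=5, random_state=0)
rng = np.random.default_rng(0)
active = list(range(X.shape[1]))        # indices of surviving features

while len(active) > 5:                  # stop at a small marker set
    w2 = np.zeros(len(active))
    for _ in range(20):                 # 20 random subsamples per round
        idx = rng.choice(len(y), size=len(y), replace=False)[:90]
        clf = LinearSVC(dual=False).fit(X[idx][:, active], y[idx])
        w2 += clf.coef_.ravel() ** 2    # accumulate squared weights
    active.pop(int(np.argmin(w2 / 20))) # drop lowest average weight

print("selected features:", sorted(active))
```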

Feature Subsets Method
In a study aimed at predicting survival in breast cancer patients, the researchers designed an ensemble method that learns models using feature subsets and then combines their predictions [32]. Two breast cancer datasets were used, including Dataset 1 [24,33]

Multimodal Data Fusion of Separate Datasets
Multimodal data fusion is a fusion model that integrates clinical and bio-molecular data, such as image and microarray data; the data in this study were therefore heterogeneous. This approach has been applied to the diagnosis of melanoma. There are two types of multimodal data fusion for fusing separate datasets, both exploited for feature integration: combination of data (COD) and combination of interpretations (COI). COD is applied before classification and aggregates features from each source to produce a single feature vector, whereas in COI, independent classifications are made on the individual feature subsets and aggregated with a proper voting mechanism [34], so it uses algebraic combiners as its decision-making strategy. Another study, related to prostate cancer, reported that COD methods are more optimal [35]. It should be noted that in the feature selection step of this method, sequential backward elimination (SBE) with the random forest (RF) algorithm was used, which utilizes
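A minimal sketch of the COD/COI contrast follows, under the assumption of two column blocks of one synthetic dataset standing in for the image and microarray modalities; the data and classifiers are illustrative, not those of the cited melanoma study.

```python
# COD concatenates the feature vectors of both modalities before one
# classifier; COI trains one classifier per modality and fuses their
# outputs by (soft) voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

X, y = make_classification(n_samples=200, n_features=18,
                           n_informative=8, random_state=0)

cod = LogisticRegression(max_iter=1000)   # COD: classify the full vector

coi = VotingClassifier([                  # COI: per-modality models + voting
    ("m1", make_pipeline(FunctionTransformer(lambda Z: Z[:, :10]),
                         LogisticRegression(max_iter=1000))),
    ("m2", make_pipeline(FunctionTransformer(lambda Z: Z[:, 10:]),
                         LogisticRegression(max_iter=1000))),
], voting="soft")

acc_cod = cross_val_score(cod, X, y, cv=5).mean()
acc_coi = cross_val_score(coi, X, y, cv=5).mean()
print(f"COD: {acc_cod:.3f}  COI: {acc_coi:.3f}")
```

Which scheme wins depends on how informative each modality is on its own, which matches the review's observation that COD was reported as more optimal only in some studies.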

Kernel-based Data Fusion Method for Gene Prioritization
This fusion method combines all kernel matrices on human genes. A kernel matrix is used in kernel machines such as SVM, so all instances are represented by a kernel matrix, an n × n positive semidefinite matrix. Each element (ai,j) gives the similarity between the ith and jth instances via a pairwise kernel function k(x_i, x_j). This approach enhances learning methods without explicitly constructing a feature space: the kernel matrix implicitly represents the inner product between all pairs of instances in an embedded feature space induced by a feature mapping. Since the resulting feature space may be high-dimensional or even infinite-dimensional, the kernel matrix allows tractable and efficient computation in the original space without explicit mapping [38]. The kernels are integrated through the Log-Euclidean Mean (LogE), the Arithmetic Mean (AM), and a weighted version of LogE (W-LogE). The input data were 12,000 human genes. By this approach, 24 novel driver genes were identified as candidates for 13 diseases, including breast and ovarian cancers. The method uses GO, Swiss-Prot (SW) annotation, a PPI network based on the STRING database, and the literature as annotated data sources. For these cancers, kernel performance was evaluated using leave-one-out cross-validation.
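The kernel-fusion step can be sketched as follows. Random features stand in for the GO/SW/PPI/literature sources, and only the arithmetic-mean (AM) combiner is shown; the data, kernel choice, and parameters are illustrative assumptions.

```python
# Sketch of kernel-level data fusion: each data source yields an n x n
# PSD kernel matrix on the same instances, the matrices are combined by
# the arithmetic mean (AM), and a kernel machine is trained on the
# precomputed fused kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 100
y = rng.integers(0, 2, n)
# three "sources", each carrying some signal about y
sources = [rng.normal(size=(n, 5)) + 0.8 * y[:, None] for _ in range(3)]

kernels = [rbf_kernel(S) for S in sources]   # one PSD matrix per source
K_am = np.mean(kernels, axis=0)              # arithmetic-mean fusion

clf = SVC(kernel="precomputed").fit(K_am, y)
print("training accuracy:", clf.score(K_am, y))
```

A mean of PSD matrices is itself PSD, which is why AM fusion keeps the combined matrix a valid kernel.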

Decision Integration
These methods are constructed from several base classifiers. The critical component of any ensemble system is the strategy employed in combining classifiers, and the decision-making module that combines the classifier outputs is another major design issue in this kind of ensemble method.
There is no unique name for the same decision-making strategy across different articles and books.
The terminology we use for output combination and final decision integration follows this pattern (Fig. 1): it is divided into two main categories, combining class labels (CCL) and combining continuous outputs (CCO).
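The CCL/CCO split can be made concrete with a toy example; all classifier outputs below are hypothetical.

```python
# CCL: combine hard class labels (here, majority vote).
# CCO: combine continuous outputs (here, average posterior
# probabilities, then threshold). The two families can disagree.
import numpy as np

# per-classifier predicted labels for 4 samples (CCL input)
labels = np.array([[0, 1, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 1, 0]])
ccl = (labels.sum(axis=0) > labels.shape[0] / 2).astype(int)  # majority vote

# per-classifier P(class=1) for the same samples (CCO input)
probs = np.array([[0.40, 0.90, 0.60, 0.10],
                  [0.45, 0.80, 0.40, 0.20],
                  [0.70, 0.95, 0.55, 0.05]])
cco = (probs.mean(axis=0) > 0.5).astype(int)   # average, then threshold

print("CCL:", ccl.tolist())
print("CCO:", cco.tolist())
```

Note that the first sample is classified 0 by CCL (two of three hard votes) but 1 by CCO (the one dissenting classifier is very confident), which is exactly why the two families are distinguished.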

NED method (fusion of five artificial neural networks)
An ensemble method called Neural Ensemble based Detection (NED) is a learning method that combines five Artificial Neural Networks (ANNs) [48]. The neural algorithm of each network is the Fast Adaptive Neural Network Classifier (FANNC) [47], which offers both high performance and speed; FANNC is also automatic and requires no manual set-up. The proposed ensemble method has been applied to lung cancer diagnosis. In this study, images of needle-biopsy specimens were used as input data, comprising 552 cell images from biopsies of subjects. The ensemble has a two-level structure. At the first level, the output of each individual neural network falls into one of two classes: normal cell or cancer cell. NED then uses full voting for decision making: a cell is considered normal only when all of the individual networks vote that it is normal. At the second level, each network has five outputs: adenocarcinoma, squamous cell carcinoma, small cell carcinoma, large cell carcinoma, and normal; here, plurality voting is used for decision making. In this way, the identification rate of NED is high and its false negative rate is low, so fewer cancer-positive patients are missed.
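NED's two voting rules can be sketched directly; the votes below are hypothetical.

```python
# Level 1: full voting, i.e. "normal" only if every network says so,
# which biases the ensemble away from false negatives.
# Level 2: plurality voting over the five subtype outputs.
from collections import Counter

def full_voting_normal(votes):
    """Normal only when every network votes 'normal'."""
    return all(v == "normal" for v in votes)

def plurality(votes):
    """The label receiving the most votes wins."""
    return Counter(votes).most_common(1)[0][0]

level1 = ["normal", "normal", "cancer", "normal", "normal"]
print(full_voting_normal(level1))   # one dissenting vote -> not normal

level2 = ["adeno", "squamous", "adeno", "large", "adeno"]
print(plurality(level2))            # 'adeno' has 3 of 5 votes
```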

Bagging subgroup identification trees
Bagging subgroup identification trees is a tree-based ensemble method which combines binary trees.
Bootstrap samples are generated by resampling the training data with replacement, and several trees are then constructed as diverse classifiers. In the next step, each tree is converted to a binary classifier. Finally, "binary trees" are constructed, and the final prediction is made with a simple majority vote strategy. In this study, clinical data such as gender, age, surg, etc., were used. For colon cancer, 929 cases were included [9]; the dataset can be downloaded from the R package survival [50]. Also, the GSE14814 dataset [51], related to lung cancer and including 133 patients, was applied. These two datasets were extracted from the R package survival and the GEO database for colon cancer and lung cancer, respectively. As mentioned, the proposed method uses majority voting as its decision-making strategy. Biomarker selection and classifier building were done with leave-one-out cross-validation. The sensitivity, specificity, and accuracy of the proposed method were compared with a multivariate Cox model. Results showed that the proposed tree-based bagging ensemble is better, especially when the data is imbalanced; for balanced data, the Cox model was slightly better [9].

CAD system
The Computer-Aided Diagnostic (CAD) system is an ensemble method that combines Bayes classifiers. It was applied to tissue classification and the diagnosis of focal liver lesions. In this study, the input data were contrast-enhanced Computed Tomography (CT) images of 20 cases of liver cancer. The classification process has two phases: in the first, CT images are classified using the Bayes classifier; in the second, the classifier outputs are combined and the decision is made using a majority voting strategy. Classification success rates were evaluated by the leave-one-out technique. This approach to classifier combination generated better performance, and findings showed that the best performance of this method was obtained by majority voting [70]. In comparison, the authors showed that their proposed method outperformed other methods such as twin SVM (TWSVM) [71]. In this study, there were 650 positive and 3567 negative samples, split into two subsets: the first part was used for training and validation, while the second was used for testing. The TWSVM classifier was trained using the 10-fold cross-validation technique for the evaluation of method performance. Since TWSVM is sensitive to the training samples, it is inconsistent; but when Bagging is integrated into TWSVM, the inconsistency problem on the training set is resolved.
Results related to Boosting-TWSVM showed that sensitivity and specificity increased, demonstrating that Boosted-TWSVM is a promising approach for the detection of MCs [70]. The sensitivity of the BB-TWSVM classifier also increased, and ROC curves showed that, compared with TWSVM, the performance of the proposed approach is improved [72,73].

Ensemble multi-class learning
The ensemble multi-class learning algorithm is an ensemble approach that combines the error-correcting output coding (ECOC) scheme with the one-against-one pairwise coupling (PWC) scheme. It has been used to find biomarkers in liver cancer. The method uses an algorithm called extended Markov blanket (EMB), and a liver cancer matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry (MS) dataset was used as the training dataset. Redundancy and relevance were the two aspects of biomarkers considered for feature selection.
Through this ensemble method, the identification of proteomic biomarkers for liver cancer was possible; it uses voting for decision making [74]. The liver cancer data comprised 201 MALDI-TOF MS spectra from HCC patients, cirrhosis patients, and healthy participants [75]. All samples were then randomly divided into 10 exclusive folds, the error rate was estimated, and 10-fold cross-validation was applied. Another ensemble uses the Sequential Minimal Optimization (SMO) algorithm for training a support vector classifier, C4.5 DT, and a forest of random trees. It has been applied for introducing prognostic biomarkers related to breast cancer and ovarian cancer. In the feature selection step, 10 different methods were used as individual classifiers, and 10 feature vectors were constructed accordingly [79]. The input data for these classifiers is somatic mutation data available in the TCGA dataset [18,80]. Five of the individual classifiers rank candidate genes based on p-values: OncodriveFM, OncodriveCLUST, MutSig, ActiveDriver, and Simon. The five remaining individual classifiers, FLN, NetBox, MEMo, Dendrix, and FLNP, choose driver genes based on linkage weights. This method also has a layer of feature integration. The input test data comprises 20,624 genes annotated as protein-coding, downloaded from the NCBI database. Finally, supervised classification is done, and DECORATE uses the average posterior probabilities of the four above-mentioned base classifiers. Notably, this ensemble method is effective when data is limited, because DECORATE creates diverse artificial data. The method has been applied for introducing cancer driver genes, including for breast cancer and ovarian cancer. Although DECORATE is grouped with decision integration methods using an algebraic combination rule, in a distinct layer it employs a kind of feature fusion mechanism. During training, the method was run 50 times, and performance was estimated using 10-fold cross-validation.
Results showed that when the training set is small, DECORATE achieves higher accuracy than other leading ensemble approaches such as Bagging or Boosting [79].

HyDRA method
Hybrid Distance-score Rank Aggregation (HyDRA) is an ensemble approach that combines the advantages of score-based and distance-based methods [81], which are aggregation techniques [82]. The predictive potency of this aggregation approach has been evaluated as very high.
HyDRA aggregates genomic data based on mutation data and has been applied to several gene sets related to diseases such as autism, breast cancer, colorectal cancer, endometriosis, glioblastoma, meningioma, ischaemic stroke, leukemia, lymphoma, and osteoarthritis; driver genes for these diseases were prioritized. The proposed approach uses decision templates as its decision-making strategy, because it ranks driver genes based on different similarity criteria combined with statistical tools. For each disease, and for disease gene discovery, the performance of the HyDRA method was evaluated by cross-validation. Results showed that HyDRA's performance was higher than that of other methods, such as Endeavour [83] and ToppGene [84], for a majority of quality criteria. The analyses also show that each method has specialized advantages in prioritization for some diseases.

ADASVM Ensemble
In this study [87], the cancer dataset selected as a benchmark was the leukemia dataset [28]. Leukemia has two sub-types, AML and ALL, and ADASVM is a suitable algorithm for two-class problems. This algorithm resolves defects and dilemmas of AdaBoost and SVM; the fusion method deals with the diversity of the AdaBoost algorithm, and the boosting mechanism reduces the misclassification rate, improving accuracy. It uses weighted majority voting as its decision-making strategy. The main measure for evaluating AdaBoost is the weighted error of each component; if it exceeds 0.5, the process stops. In this study, the researchers showed that the proposed method outperforms SVM and KNN classifiers: ADASVM's accuracy was 100%, while the SVM and KNN accuracies were lower, respectively [87].
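The weighted-majority-voting rule with the 0.5 error cutoff can be sketched as follows; the component predictions and errors are hypothetical.

```python
# AdaBoost-style weighted majority voting: each component classifier
# h_t gets weight alpha_t = 0.5 * ln((1 - e_t) / e_t) from its weighted
# error e_t (boosting stops if e_t > 0.5), and the ensemble decision is
# the sign of the weighted vote.
import math

def component_weight(error):
    """alpha_t from weighted error; requires e_t < 0.5."""
    assert error < 0.5, "boosting stops when weighted error exceeds 0.5"
    return 0.5 * math.log((1 - error) / error)

def weighted_majority(predictions, errors):
    """predictions: list of +/-1 votes; errors: each component's weighted error."""
    score = sum(component_weight(e) * p for p, e in zip(predictions, errors))
    return 1 if score > 0 else -1

# two accurate components outvote one near-random component
print(weighted_majority([+1, +1, -1], [0.10, 0.20, 0.45]))
```

Components with error near 0.5 receive weight near zero, so nearly random classifiers barely influence the vote.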

NB (Naïve Bayes) combiner method
Previous investigations showed that the NB combiner can be introduced as a fusion strategy at the decision integration level. It combines 100 decision tree classifiers [88]. This integration model has been used with 73 benchmark datasets, such as breast cancer [55,56]
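The NB-combiner rule itself, estimating each base classifier's reliability from a confusion matrix and multiplying the resulting likelihoods with the class prior, can be sketched with hypothetical numbers.

```python
# Naive Bayes combiner: each base classifier's confusion matrix,
# estimated on validation data, gives P(classifier says j | true class k);
# the combiner multiplies these likelihoods with the prior and picks
# the most probable true class.
import numpy as np

# rows = true class, cols = predicted class, one matrix per classifier;
# classifier 0 is more reliable than classifier 1
confusions = [np.array([[0.9, 0.1], [0.2, 0.8]]),
              np.array([[0.7, 0.3], [0.3, 0.7]])]
prior = np.array([0.5, 0.5])

def nb_combine(votes):
    post = prior.copy()
    for cm, v in zip(confusions, votes):
        post = post * cm[:, v]     # likelihood of this vote per true class
    return int(np.argmax(post))

print(nb_combine([0, 0]))   # agreement on class 0
print(nb_combine([0, 1]))   # disagreement: the reliable classifier wins
```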

Rankboost_W
Rankboost with a weighting function (Rankboost_W) is a Rankboost algorithm to which a heuristic weighting function has been added [94]. It is an ensemble method that uses boosting to combine different computational approaches, as a set of weak features, to improve overall performance [95]. Similar to the DECORATE method, this approach also has a layer of features

MF-GE system
The multi-filter enhanced genetic ensemble (MF-GE), a hybrid ensemble model, includes two sequential phases: a filtering process followed by a wrapper process. In phase 1, genes in the microarray dataset are scored using the multiple filtering (MF) algorithm, and the obtained scores are integrated. In the wrapper process, genes are selected with the genetic ensemble (GE) algorithm [98]. This approach has been applied to four benchmark microarray datasets for gene selection, related to leukemia [31], colon cancer [32], liver cancer [99], and mixed-lineage leukemia (MLL) [100]. The fusion method is effective for both binary-class and multi-class classification problems, and the hybrid system overcomes the overfitting problem of the GE algorithm. Both majority voting and algebraic combiners were used for decision making, but majority voting generated better classification results. The MF-GE system was compared with the original GE system and the GA/KNN hybrid. A double cross-validation process was applied: internal cross-validation during the gene selection phase and external cross-validation for evaluating the selection results. Results showed that the proposed MF-GE system achieved higher classification accuracy, generated a more compact gene subset, and produced selection results more quickly [98].
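Phase 1's multi-filter scoring can be sketched as a rank-sum over two stand-in filters (an ANOVA F-test and absolute correlation); these filters and the synthetic data are illustrative assumptions, not MF-GE's actual filter set.

```python
# Multi-filter (MF) scoring sketch: each filter scores every gene,
# per-filter scores are converted to ranks, and the ranks are summed so
# genes favoured by several filters rise to the top.
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

X, y = make_classification(n_samples=80, n_features=20,
                           n_informative=4, random_state=0)

f_scores, _ = f_classif(X, y)                    # filter 1: ANOVA F
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1]     # filter 2: |correlation|
               for j in range(X.shape[1])])

combined = rankdata(f_scores) + rankdata(corr)   # higher rank = better
top4 = np.argsort(combined)[-4:]                 # candidate gene subset
print("top genes:", sorted(top4.tolist()))
```

In MF-GE this integrated score only pre-filters candidates; the GE wrapper then searches over them.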

Evolutionary Ensemble Model
In a study, an ensemble model was designed that integrated results of three modules of evolutionary

Model Integration
Model integration means that, when constructing the model, the integration is done at the model level. In this approach, each model transforms the input data into its required format, and then the models are combined. By linking the models, a single model is obtained on which decisions are based.
This method can be developed using different tools [105]. One of the tools is based on Bayesian networks that are mentioned in the literature.

Bayesian networks-based model integration
In a study conducted in 2006, Bayesian networks were also used in connection with breast cancer.
The researchers used three models for integration. In the first model, named full integration, the two data sources, clinical and microarray data, were combined, and a Bayesian network was then built on the integrated data; at this step, only data integration was performed. In the second model, named decision integration, an independent model was built for each data source, and the decisions from these models were then combined based on a weighting policy. In the last model, named partial integration, similar to the second, an independent model was developed for each data source, and the models were then linked and integrated to build a single combined model for final decision making; this variant uses model integration for decision making. The models above were used for predicting metastatic state in breast cancer. The training set was selected randomly 100 times, and the performance of the proposed methods was evaluated using ROC (Receiver Operating Characteristic) curve analysis. The results revealed that partial integration achieved the highest performance and proved to be the best method for data integration [106].
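The decision-integration variant reduces to a weighted combination of per-source model outputs; the sketch below uses a hypothetical weight and hypothetical per-model posteriors, not values from the cited study.

```python
# Decision integration: one model per data source (clinical, microarray),
# with the final posterior a weighted combination of the two models'
# outputs. The weighting policy here is a simple convex combination.
def decision_integration(p_clinical, p_microarray, w=0.4):
    """Weighted fusion of two models' P(metastasis); w weights the clinical model."""
    return w * p_clinical + (1 - w) * p_microarray

# clinical model is unsure, microarray model is confident
p = decision_integration(0.55, 0.90, w=0.4)
print(round(p, 2))
```

Full integration would instead merge the raw features before building one model, and partial integration would link the two fitted models into a single combined network.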
In another study, a novel Bayesian hierarchical model-based method was proposed. This approach uses single-nucleotide variants (SNVs) and insertions and deletions (InDels) in whole-genome sequence data as mutation data [107], obtained from sequencing of the breast cancer cell line datasets available in TCGA [108]; the data can be downloaded from https://gdc.cancer.gov/files/public/file/TCGA_mutation_calling_benchmark_files.zip. The method first generates two models, a tumor model and an error model, by setting partition rules on paired-end reads and datasets, and the framework then integrates these models for mutation calling in breast cancer through input data partitioning. The proposed method was confirmed to improve performance, compared with other Bayesian network classifiers, by incorporating heterozygous single-nucleotide polymorphisms (SNPs) and strand bias information [107].

Results
Numerous ensemble methods yield various research results. In this review, 45 works of literature covering 42 ensemble machine learning methods (Table 1) associated with various cancers were investigated. These 42 approaches can be used for the diagnosis, prognosis, or prediction of 18 cancer types. We have presented these relations in the following bipartite graph (Fig. 2).

Fig. 2) The identified cancers by ensemble learning methods. Note that items marked with an asterisk are complex diseases, not types of cancer.

Among these 18 cancer types, breast cancer has been studied more than the others, with 22 methods introduced for its detection. These approaches are presented in the following pattern (Fig. 3):

Fig. 3) The introduced methods for breast cancer detection
In the previous section, we classified the learning methods into three fusion categories: data and feature integration methods, decision integration methods, and, on a smaller scale, model integration methods. For most of these methods, we introduced the data type, the number of samples, and the decision-making strategies used, and we determined the statistical tools and methods used for accuracy validation.
Within the data and feature integration level of fusion, nine methods (21.43% of the total) were introduced; these methods are presented in the following pattern (Fig. 4). The following diagram shows the percentage of studies at each level of fusion (Fig. 6). We also introduced the input data for each machine, as shown in the following pattern (Fig. 7). Various data were applied as input in these studies, including genomic data such as mutation data, gene expression data based on microarray technology, clinical data, GO database data, PPI data, literature data, proteomic data based on protein mass spectrometry technology, and multivariate data from the UCI Machine Learning Repository.

Fig.7) Various input data that have been used in different studies
Clinical data are varied, including different images, flow cytometry, age, gender, surgery, and cytological diagnoses and histological examination of biopsies (Fig. 8). In the third rank, five methods, related to seven studies, used several types of data as multiple input data; these methods are shown in the following diagram (Fig. 11). One of these methods, kernel-based data fusion, used GO, SW, the PPI network based on the STRING database, and literature data as input sources. Multivariate data came from the UCI Machine Learning Repository (Fig. 13). The following diagram shows the kinds of input data used in these 45 studies by percentage (Fig. 15). The number of methods (%) that used any kind of statistical method for validation is shown in the following diagram (Fig. 22). One key question remains to be answered: which proposed fusion method is the best?
Results demonstrate that there is no single superior approach for all classification problems; the optimal solution depends on the kind of problem, the structure of the available data, and prior knowledge about the relevant algorithm. This review can therefore help researchers choose the most suitable ensemble method in the field of molecular biology and the diagnosis of complex diseases. Nevertheless, this study found that ensembles of BioHEL rule sets perform particularly well on large datasets, while the fusion of KNNs-SVMs-DT-LDA as a heterogeneous ensemble can provide superior performance specifically on small datasets. RVM-based ensemble learning is a more effective approach for high-dimensional feature spaces without missing data values. Compared to other methods, ADASVM, which is based on the AdaBoost algorithm, appears to be the most appropriate method for improving classification performance by introducing diversity. Finally, the NB combiner approach is successful for problems with a large number of fairly balanced classes, whereas the WMV combiner method can solve problems with a small number of imbalanced classes.
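To make the heterogeneous KNNs-SVMs-DT-LDA fusion idea concrete, the sketch below assembles the four classifier families with scikit-learn's `VotingClassifier` under a hard majority vote. The dataset and hyperparameters are illustrative choices for this sketch, not the settings used in the reviewed studies.

```python
# A heterogeneous ensemble of KNN, SVM, decision tree, and LDA base
# learners, fused at the decision level by majority voting.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                ("svm", SVC(kernel="rbf")),
                ("dt", DecisionTreeClassifier(max_depth=4, random_state=0)),
                ("lda", LinearDiscriminantAnalysis())],
    voting="hard",  # plain majority vote over the four base learners
)
ensemble.fit(X_tr, y_tr)
print(round(ensemble.score(X_te, y_te), 3))
```

Switching `voting="hard"` to `voting="soft"` (with probability-capable base learners) or supplying `weights=` gives the probability-averaging and weighted-majority-voting variants discussed in this review.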
Meanwhile, a careful look at Figures 9 and 10 shows that clinical data and microarray data have been used consistently over time. However, due to advances in equipment and imaging methods, the nature of the clinical data used has changed; for example, digital image data served as clinical data in recent research. Likewise, with the development of bio-molecular techniques such as Next Generation Sequencing (NGS), massive genomic data, including somatic and germline mutation data, have been generated. The use of these data has revolutionized the diagnosis of complex diseases such as cancers. These aspects of genomic diagnostics contribute to the growth of modern fields including personalized medicine and precision oncology [109].
However, one limitation in classifying and introducing cancer-related ensemble methods was the small number of approaches based on model integration, so this category of methods was studied on a much smaller scale.

Conclusions
The authors hope to further introduce the applications of multi-omics data integration in precision oncology in subsequent publications. In this paper, due to the complexity and challenges involved in integrating omics data, these types of fusion systems have not been thoroughly investigated. In summary, multi-omics approaches have been used to identify cancers and their subtypes [110]; these methods integrate multi-omics data to diagnose cancers effectively.
Keyword searches such as "multi-omics data" and "cancer" in Scopus show that the use of multi-omics data as input in cancer research increased significantly from 2006 to 2018, especially in 2017 and 2018.
The first multi-omics database, the LinkedOmics database, includes more than a billion data points. It contains clinical data and multi-omics (genomics, epigenomics, metagenomics, transcriptomics, and proteomics) data covering 32 cancer types and 11,158 patients from the TCGA project [111], and it can assist researchers who want to use these data to solve medical problems.
Thus, for robust performance and better results, we particularly suggest using multi-omics approaches to integrate different omics data types, such as genomics, epigenomics, transcriptomics, proteomics, metabolomics, and microbiomics data, at different layer levels [112]. This kind of data integration is our next plan for developing a novel and robust predictive fusion method for identifying driver genes as biomarkers, especially for breast invasive carcinoma (BRCA).
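One common way to realize layer-wise multi-omics integration is late fusion: fit one model per omics layer and average the predicted class probabilities across layers. The sketch below uses random arrays as placeholders for the genomics, transcriptomics, and proteomics matrices; it illustrates the fusion scheme, not our planned method.

```python
# Late (decision-level) multi-omics fusion: one classifier per omics
# layer, with class probabilities averaged across layers per patient.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 120  # patients
layers = {"genomics": rng.normal(size=(n, 30)),
          "transcriptomics": rng.normal(size=(n, 40)),
          "proteomics": rng.normal(size=(n, 20))}
# Synthetic binary label driven by two of the three layers.
y = (layers["genomics"][:, 0] + layers["transcriptomics"][:, 0] > 0).astype(int)

# Fit one classifier per layer, then fuse their probability estimates.
probas = [LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)
          for X in layers.values()]
fused = np.mean(probas, axis=0)   # average over the omics layers
pred = fused.argmax(axis=1)       # one fused prediction per patient
print(pred.shape)
```

Early fusion (concatenating the layer matrices before fitting) is the main alternative; late fusion has the practical advantage that each layer can use a model suited to its own dimensionality and noise profile.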
Finally, using machine learning methods is less expensive than bio-molecular testing and can help reduce the magnitude of the search space. We hope that this study provides a roadmap to help researchers use appropriate ensemble methods in the fields of molecular biology and complex disease diagnosis.

Figure legends
Kernel-based data fusion for gene prioritization
Figure 1: Integration patterns for ensemble learning
The identified cancers by ensemble learning methods (items marked by an asterisk are complex diseases, not types of cancer)
The introduced methods for breast cancer detection
The number of ensemble learning methods (%) used at each level of fusion
Various input data that have been used in different studies
Kinds of clinical data in these studies
The methods (6.67% of the 45 studies) that have used proteomic data as input data: meta-learning, KNNs-SVMs-DT-LDA, and ensemble multi-class learning
The number of ensemble learning methods (%) that have used different kinds of input data
The methods that have been decided based on the weighted majority voting strategy
Figure 19: The number of methods (%) that have decided with different strategies
The number of ensemble learning methods (%) validated with kinds of statistical methods