In this study, an ensemble method as a synergistic combination of computational tools has been designed and proposed to find the putative cancer drivers. This fusion system can help to analyze the Whole-Exome Sequencing (WES) data. It consists of four steps: selection of dataset, feature extraction, feature integration, and decision integration (Figure 2). We have also shared Python source code and other requirements for the implementation of the proposed ensemble machine learning algorithm as the protocol via GitHub (https://github.com/lmirsadeghi/EARN/).
2.1 Selection of dataset
In this study, to identify candidate driver genes based on mutations that occur in genes, breast cancer primary and metastasis data have been analyzed.
For primary breast cancer, an open-access mutation annotation format (.maf) file was downloaded from TCGA data set regarding BRCA (51). This file includes 90969 masked somatic mutations identified in 17990 genes from 983 tumor samples of BRCA patients that their whole exome had been sequenced by Illumina Genome Analyzer II [see this mutation file in Additional file 1: Table S1]. Also, for processing of sequences, a bioinformatics pipeline framework called "MuSE Variant Aggregation and Masking" in TCGA has been used. For MBCA, two files (.txt) were downloaded from the cBio Cancer Genomics Portal (52,53). The first mutation file includes WES of 213 tumor samples from 213 MBCA patients by Illumina HiSeq. It is associated with 22949 somatic mutation counts that occurred among 10791 genes (54). The second file consists of WES of 237 metastasis tumor samples by Illumina GAIIx from 180 patients regarding 24027 somatic mutations identified in 10273 genes (55). Clinical data shows that 86 samples were taken when patients were in the metastatic disease stage, and other samples had been taken less than 4 months prior to the metastatic disease is diagnosed (56). After selecting two initial datasets concerning MBCA [see these mutation files in Additional file 2: Table S2 and S3], they augmented to build a comprehensive mutation data file, including 46928 somatic mutations identified among 14293 genes from 450 MBCA tumor samples (393 patients).
2.2 Selection of software tools for feature extraction
After preparing mutation files, four software tools including MutSigCV v.1.4, OncodriveCLUST 0.4.1, OncodriveFM, and NetBox 1.0 were used to extract the convenient numerical features. The selection of tools for feature extraction was a crucial step to achieve better performance on the final algorithm of the proposed ensemble learning model. We select the four software tools based on evidences of a paper in 2015 on identification and ranking of plausible drivers for BRCA and ovarian (OV) cancer (19). It had been demonstrated that among ten tools for extracting features, OncodriveFM and NetBox generate high sensitivity, especially about BRCA. Also, the sensitivity of OncodriveCLUST tool is high concerning the OV cancer. On the other hand, in both cancers, it had been shown that the positive predictive value (PPV) for NetBox and OncodriveFM is high. MutSigCV was able to propose a large number of drivers in the top 50 genes for BRCA, where at least five other methods had also predicted them as top genes. These advantages led us to use these tools. Practically, these four tools evaluate original mutation files from different aspects and assign a score (p-value) to genes to show their relevance to disease according to that software’s logic. MutSigCV gets data concerning point mutations and small insertions and deletions (INDELs) from the WES file. After analyzing and estimating mutation frequency, it can identify and introduce a significant list of mutated genes for cancers (43). OncodriveClust software tool is able to identify mutations that generate oncogenes and leads to changes in the function of the proteins. For this purpose, it analyzes synonymous mutations and protein-affecting mutations, including non-synonymous, stop, and splice-site mutations (44). Also, this tool uses data from the Cancer Gene Census (CGC) database (47). for selecting known drivers associated with cancers. OncodriveFM is our next tool which can detect driver genes across tumor samples, identify pathways in cancers, and discover gene modules by using information that is available in the WES file. This data is provided by three methods, including SIFT, PolyPhen2, and MutationAssessor (45). The fourth tool is NetBox, and it can detect driver mutations based on a network. First, a global human interaction network is constructed by this tool. Then, it finds the linker genes between mutated genes for module discovery and identification of candidate drivers (46). Indeed. the concept and criteria of selecting these four software tools are based on the study in 2015 where the performance metrics of ten methods for prediction of plausible driver genes of BRCA and ovarian OV were compared (19).
2.2.1 Feature extraction and feature vector construction
In this step, the four software tools explained above are used for the extraction of features from primary and metastasis mutation data files. After running the tools, all genes are ranked based on p-value as output data, and each method assigns a number (0≤p-value≤1) to genes as numerical features. Therefore, a four-dimensional feature vector is constructed for each gene (Figure 3a). Since the genes with lower p-value play a more critical role in the development of cancer, we decided to use "1- p-value" as the final numerical feature for each gene. With this plan, the genes that are more important in the occurrence of BRCA and MBCA will also get higher feature values. Different and independent logics behind the ranking mechanisms in the exploited tools guarantee enough diversity between inputs of the ensemble system which is an essential property for efficient fusion methods.
2.3 Classifier model selection
Three supervised machine learning methods, including non-linear SVM to learn non-linear functions to separate the classes, ANN, and RF are used as individual classifiers. For selecting these methods, the literature and previous studies were surveyed. We did a comprehensive review regarding fusion systems and the results showed that SVM has been used as a base classifier in many studies or applied as a baseline for comparison between the performance of the different machine learning methods (57–60). Since in this case, the positive and negative training gene set for the implementation of learning algorithms are highly imbalanced, 40 positive genes versus 2151 negative genes (refer to 2.4), a solution must be found. It has been demonstrated the SVM classifier can be a robust method for generating optimal results with imbalanced positive and negative datasets (61), especially when an Instance-weighted SVM algorithm is used (62). So, we weighed this algorithm to get better results. On the other hand, RF is an ensemble machine learning method used as one of the individual classifiers. This method can partially solve the problem of the unbalanced positive and negative training set by bootstrap sampling and can also improve performance, i.e., predictive accuracy reached 88.89% using RF for breast cancer risk prediction (63). The ANN classifier is another machine learning method with long-lasting profound literature. In 1990, Hansen and Salamon integrated multiple neural networks and improved results (64). This method is also widely used in biology studies and has achieved high performance. In 2017, it was shown that ANN could be used for the diagnosis of lung cancer (65). Meanwhile, in this study, the positive training set is small, and recent researches have revealed that ANN may improve performance for problems with small training set sizes and give better performance, especially for problems with time-series data category (66). All of these reasons and criteria led to the selection of these three machine learning methods as base classifiers of the final ensemble system.
2.4 Training and testing
In this study, we train separate models for BRCA and MBCA. Some criteria for the selection of training data sets are described below and visualized in Figure 4a. For testing the performance of models in terms of evaluation metrics (e.g. recall, precision, etc.), we average over 100 trials. In each trial, 3-fold cross-validation with random shuffles is used to calculate the metrics on all data. Finally, the mean and standard deviation of metrics over 100 trials are obtained. Average outputs for cross-validation of the estimator of each model on testing data based on some metrics, including precision, f1 score, recall, accuracy, and Receiver Operating Characteristic-Area under Curve (ROC-AUC) are presented in section 3.
2.4.1 Training data set selection
The positive training set of genes for BRCA and MBCA were obtained from searching known genes and mentioned drivers concerning these cancers in several databases, including the OMIM, CGC, NCG, HCMDB, and the Human Protein Atlas (HPA) (https://www.proteinatlas.org/). Also, about selecting negative training gene set, we reviewed a comprehensive list of prior works. Since there is no gold and standard database for a negative set selection, most researchers have used the bootstrap method for resampling, and the negative training genes have been mostly selected randomly. In this study, negative data was selected by counting the occurrence of mutations across all samples in the initial mutation data file, and the genes with the lowest mutation count were used as negative training set (19). It is crucial to note that in both positive and negative training data, we only accepted protein-coding genes. [see further details for training genes in Additional file 3: Supplementary Methods and Additional file 4: Table S4-S7].
2.5 Genome-wide screening
For the genome-wide screening, 20208 homo sapiens genes annotated as protein-coding were downloaded from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/ on February 2019. The proposed ensemble model is applied to 18017 genes for BRCA and 16698 genes for MBCA, after excluding positive and negative training sets (Figure 4b) [see Additional file 4: Table S8-S10].
2.6 Implementation of three machine learning algorithms based on feature integration
After adding features to the system, and training and testing of learning methods including non-linear SVM, ANN, and RF, they are applied to the protein-coding genes as the unseen data. Each of these methods integrates the features extracted from the initial mutation file (refer to section 2.2.1). We use scikit-learn package to implement our algorithms in python (67). Since this problem is a binary classification of genes based on drivers and passengers, they could label genes based on two indexes -1 and +1 (-1 means passenger genes and +1 means drivers), and also compute a score for each gene, independently (Figure 3b).
2.7 Implementation of proposed ensemble machine based on decision integration
Finally, the decision-making strategy for ensemble machine is based on aggregation of the predicted scores obtained from other machines. We call the proposed EC machine learning method EARN (ensemble of ANN, RF, and non-linear SVM). EARN uses the average of the scores of the outputs of the three basic classifiers to assign a new score (ranging from 0 to 1) to each gene. The genes with higher prediction scores (scores ≥0.5) are labeled as drivers (+1) while the other genes will be passengers (-1). This process has been illustrated in Figure 3c.
2.8 Biological inferences
At this step, all the driver genes introduced for BRCA and MBCA, as well as top genes predicted by learning machines, are searched in the public databases to determine which genes have been already known related to cancer and which ones are new. Pathway enrichment analysis is also performed using ReactomeFIVIz tool (FDR <0.03) (68–70)to identify the biochemical pathways associated with the candidate genes and examine the biological role of them. It is applied to find biological pathways and patterns related to cancer and other complex diseases.