Ensemble Machine Learning Approaches for Proteogenomic Cancer Studies

Background : The identification of important proteins is critical for medical diagnosis and prognosis in common diseases. Diverse sets of computational tools were developed for omics data reductions and protein selections. However, standard statistical models with single feature selection involve the multi-testing burden of low power with the available limited samples. Furthermore, high correlations among proteins with high redundancy and moderate effects often lead to unstable selections and cause reproducibility issues. Ensemble feature selection in machine learning may identify a stable set of disease biomarkers that could improve the prediction performance of subsequent classification models, and thereby simplify their interpretability. In this study, we developed a three-stage homogeneous ensemble feature selection approach for both identifying proteins and improving prediction accuracy. This approach was implemented and applied to ovarian cancer proteogenomics data sets: 1) binary putative homologous recombination deficiency positive or negative; and 2) multiple mRNA classes (differentiated, proliferative, immunoreactive, mesenchymal, and unknown). We conducted and compared various machine learning approaches with homogeneous ensemble feature selection including random forest, support vector machine, and neural network for predicting both binary and multiple class outcomes. Various performance criteria including sensitivity, specificity, kappa statistics were used to assess the prediction consistency and accuracy. Results : With the proposed three-stage homogeneous ensemble feature selection approaches, prediction accuracy can be improved with the limited sample through continuously reducing errors and redundancy, i.e. Treebag provided 83% prediction accuracy (85% sensitivity and 81%


I. INTRODUCTION
Ovarian cancer is the deadliest gynecologic malignancy, with most patients diagnosed in late stages.Early detection and Antineoplastic therapeutics are vital to treating ovarian cancer patients who may have heterogeneous responses [1][2][3].Proteogenomics is an emerging approach integrating proteomics with genomics and transcriptomics, gaining new insights for a more complete understanding of complex diseases and treatments, to advance basic, translational and clinical research [4][5][6].Mass spectrometry (MS) based proteomic technologies have enabled the profiling of thousands of global proteins and have made proteogenomic data available in order to examine the linkages among DNA, mRNA, proteins, and disease status, and to determine which proteins are associated with gene mutation and disease status (such as cancer subtypes, diseases stages, and patient treatment heterogeneity) [7][8][9].
Identifications of protein and gene signatures from thousands of omics data generated from high throughput technologies have been challenging from both computational and biomedical perspectives [10][11].Standard statistical marker selection methods with association analysis such as forward, backward, stepwise selection methods are based on the p-values of statistical models; however, these single feature selection methods are sample dependent and may have specific biases.These approaches also face the multi-testing burden of low power associated with a limited patient sample size.
Moreover, high correlations among the features such as protein markers, along with small to moderate effects on disease status (often accumulated over time), may lead to unstable selections that may cause reproducibility issues [12][13][14].Group selection techniques such as lasso or ridge regressions or Bayesian shrinkage models may overcome the above drawbacks [15].One inherited challenge for the proteogenomic data is that some recent studies have indicated modest/moderate correlations between genes, mRNAs and proteins across different organisms (i.e., correlation coefficients 0.09 to 0.46 in multi-cellular organisms) [16].
Machine learning (ML) with wrapper methods and ensemble feature selection has the advantage to alleviate and compensate for some of the demerits of statistical approaches [17][18].Ensemble learning models integrate predictions from multiple independent predictors to generate the strongest signals across predictors that rise to the top.Ensemble predictors consistently performed among the best across challenges and tended to be the most robust to noise in the datasets [19][20].
But different sets of ranked features from various ML methods could provide the same classification performance.Therefore, one key question is whether we can ensemble the ML feature selection approach that could result in consistent and reproducible biomarkers.This is important to reduce the burden of clinical validation.From the biomedical perspective, one important question is how well the protein markers are predictable for gene mutation or mRNA or disease status, and which biomarkers are associated with those status [21].Last but not least, despite the power and flexibility of ML ensemble modeling, tuning various algorithms' parameters could vary the prediction accuracy and selection results.How to best measure a predictor's relative importance in the model and enhance interpretability of ML is a challenge [22].

A. Ensemble Machine Learning
Ensemble machine learning algorithms include bagging, boosting, stacking, and errorcorrecting output coding [19][20][21][22].These approaches combine multiple independent ML and statistical approaches to construct a set of classifiers into a single predictive model, and then classify new data points by taking a (weighted) vote of their predictions.The reasoning behind such approaches is based on the fact that all machine-learning approaches are biased to identify method-specific patterns and features.Thus, combining multiple learners can produce better and more robust predictions for boosting in accuracy compared to an individual learner or model.
In this study, prior to ensemble the following individual feature selection approaches were compared and tested [19,[24][25][26][27]: 1) Median: utilize the non-parametric Mann-Whitney-U Test (p values); 2) Spearman and Pearson r: select features that are highly correlated with the outcome variable, but the low correlation with other features to avoid multicollinearity; 3) LogReg: standardized β-coefficients of the logistic regression (LR) to represent the importance and comparability between the different ranges of protein features; 4) Naïve Bayes; 5) Neural network (NN); 5) Random forests (RFs): ensembles of multiple decision trees based on the classification and regression tree (CART) algorithm, cforest is one type of RF that uses conditional trees for classification and regression.Moreover, since protein features may highly correlate in their expressions, principal component analysis (PCA) was applied to all proteins, and then the constructed PCAs were used in LR and RFs to see if there are classification accuracy differences either with PCA or without PCA.
Receiver Operating Characteristics (ROC) curves and Area Under Curve (AUC) values were used as performance evaluation criteria [21].In LR models, the selected features with a leave-oneout cross-validation (LOOCV) scheme was applied, followed by training an LR model with all available feature in order to compare the two LR models based on their ROC curves and AUC values.In RFs, error-rate based, AUC based feature selection, and Gini-index were used as performance evaluation criteria for feature selection and importance rank.
The error-rate-measures the difference before and after permuting the class variable depending on the underlying trees, AUC is then computed for each tree before and after permuting a feature for the importance measure.The Gini-index measures the node impurity in the trees of RF [20][21].Overall, Error-rate RF, Gini RF, Error-rate cforest, AUC cforest could be considered as errorcorrecting output coding for correcting bias and variances to improve the learning.
To reduce the redundancy and generate the feature importance, the following ensemble steps were conducted to build the engine of the caret from each model: 1-train each method/model, save all resampling results.2-choose a set of methods/models using ROC criteria, get the variable ranking from them by using the fscaret from R for the scaling process.
3-estimate/find the correlation between the methods/models for redundancy 4-Use the resampling results to remove highly correlated models.Pearson's Correlation Coefficient is calculated for all possible pairs of trained models and then we removed the highly correlated models using 0.8 as a utilized threshold.
5-make the ensemble of remained models using a linear combination, boosting models, and complex models like RF (VSURF) or NNs for variable/feature selection.
6-generate the variable importance using individual model variable importance and their associated weight in the final model.The results of each individual method are normalized to a common scale to quantitative ensemble importance.

B. Three stage homogeneous ensemble feature selection
To further stabilize the selected features for the reproducibility and to improve prediction accuracy, we continue to refine the above ensemble process, and generalize three-stage homogeneous ensemble feature selection (HEFS) for both identifying biomarkers and reducing errors as follows: In stage one: a homogeneous ensemble biomarker selection based on random forest approach is utilized to identify important biomarkers even with some redundancy.
In stage two: a smaller number of variables, with very low redundancy and sufficient for a better prediction, are identified.
In stage three: utilizing the selection results of stage one and stage two, expand and compare various more advanced ML methods including the following 1) gaussprLinear (Gaussian Processes For Regression And Classification with Linear kernel function); 2) gaussprRadial (Gaussian Processes For Regression And Classification with Radial Basis kernel function); 3) LogitBoost (Boosted Logistic Regression); 4) MLP, MLPML (Multi-Layer Perceptron, multiple layers); 5) RF, parRF (Parallel Random Forest), wsrf (Weighted Subspace Random Forest; 6) SVM (Support Vector Machine); 7) treebag (Bagged Classification And Regression Tree).
The following specific steps are employed: 1. train RF models for feature selection and save variable importance results 2. Choose a set of important variables based on the out-of-bag error (interpretation set) 3. Utilize the stepwise approach for interpretation set to build the prediction set 4. Utilize the interpretation set and the prediction set, and compare various ML methods for predicting binary and multiple classes.
The key questions: 1) Does this homogeneous ensemble feature selection approach find consistent and reproducible biomarkers? 2) Are the protein markers predictable for HRD/gene mutation status or mRNA status?
To validate the stability of the HEFS for reliable results, and to provide the unbiased prediction accuracy, resampling with k-fold cross-validation, bootstrapping and permutations are tested and employed [19,[25][26][27][28].We used a 10-fold cross-validation method for a stratified random sample of the data into the training set and hold out a test set for the very end, and only use the training and parameter tuning.We also evaluated the training data for the effect of model tuning parameters on performance in order to guide which tuning parameter values should be chosen.The training performance of all methods with automatic parameter tuning is also considered and tested since the process produces a profile of performance measures and the tuning parameters associated with the best measure value, then the "optimal" model is chosen across these parameters.Permutation tests are further conducted for the robustness of the resulting model.For instance, RF methods are conducted 100 times and averaged over the number of runs.An evaluation of the stability of feature importance is conducted by a bootstrapping algorithm.Sensitivity and specificity are reported based on the selected feature/proteins, which has been included in the caret package in Comprehensive R Archive Network (CRAN) https://topepo.github.io/caret)[22].Additional analyses were conducted in R for ML with ensemble approaches and SAS for data pre-processing.

A. Proteogenomic Datasets
MS proteomic ovarian cancer data was obtained from the Clinical Proteomic Tumor Analysis Consortium (NIH/NCI) and The Cancer Genome Atlas (TCGA), which include the genomic and transcriptomic characterizations of ovarian high-grade serous carcinoma (HGSC) and 9606 global proteins measured from MS [5,8,33].Two proteogenomics datasets were studied that were generated under similar experimental settings: both binary disease status defined based on gene mutation and multiple mRNA classes.Primary outcomes: 1) binary outcomes are putative homologous recombination deficiency (HRD): positive or negative.HRD positive is defined by the presence of germline or somatic BRCA1 or BRCA2 mutations, BRCA1 promoter methylation, or homozygous deletion of PTEN.One hundred twentytwo serious ovarian carcinoma patient samples (67 HRD positive, 55 non-HRD/negative) from 9606 proteins were included.

B. Data pre-processing
Statistical process control for data quality examination (i.e., correcting technical variation, examining heterogeneity, high percentage missing, strong positive skewness, large proportions of zero) through Measurement system analysis, process screening were conducted prior to our proposed approaches (See Figure 2).Variations and measure shifts are compared for HRD positive (67) versus HRD negative (55); Glycosite versus nonglycosite.i.e., for non-glyco the largest upshift was found for HRD positive sample for TCGA-29-1698-01A-01.Protein distributions and variations (CV: Coefficient of variations) are also examined to examine outliers and irregularly distributed variables.Data (transpose) transformation was conducted due to high dimensionality with the few samples (9606 proteins, 122 patients with HRD: positives versus negative; 396 samples with 5 mRNA known classes).
Missing data evaluation and imputations: missing value patterns are examined.There are 15% to 33% missing in the protein-HRD status data.Several multiple imputations algorithms to impute missing (either biological or technical) were tested and compared [19,22]

C. HEFS for protein rankings and prediction accuracy
Tables 1-3 provide the prediction accuracies according to the discrimination (kappa statistics for the agreement, sensitivity, specificity) resulting from different ML methods and HEFS approaches.Results revealed marked differences for the prediction accuracies with different individual statistical or ML models as discussed in section 2. PCA with composite protein predictors didn't provide better prediction accuracies, either with PCA or without PCA ranged from 45% (LR) to 58% (RFs with ctree).
Sensitivities of presented ML approaches are higher than specificities for binary HRD class predictions.RF and NN provided better prediction accuracies than simple Naïve Bayes.The overall prediction accuracies for multiple classes' classification are comparable to predict binary HRD classes (see Table 3).
The key questions: 1) Are the protein markers selected through multiple refining steps proposed in section 2 improved the prediction accuracy for both binary HRD/gene mutation status or multiple mRNA classes? 2) Does the refined homogeneous ensemble feature selection approach find consistent and stable biomarkers?
With proposed three-stage HEF's approaches, prediction accuracy can be improved with the limited sample through continuously reducing errors and redundancy, i.e.Treebag provided 83% prediction accuracy (85% sensitivity and 81% specificity); RFs provided 71% prediction accuracy.

IV. DISCUSSIONS AND CONCLUSIONS
This paper seeks to evaluate ML with ensemble feature selection for the protein biomarker identifications' stability and reproducibility in addition to prediction accuracy.We proposed and conducted three-stage HEFS for biomarker identification and applied the ovarian proteogenomics data.Our HEFS approach incorporates stability considerations into the algorithm design stage and has the advantage to alleviate and compensate the reproducibility issues.Results showed that the ensemble approaches provided the stable selection of the important biomarkers linked to ovarian cancer stages.Overall, ensemble approaches may hold promise with better prediction power for reproducibility issues.
One potential drawback of the utilized HEFS is that it is computationally expensive (i.e. in the ensemble of RF with homogeneous feature selection algorithm using 10000 trees), therefore requiring constant tuning of the model parameters for improved performance.More sophisticated ensemble models are much more susceptible to over-fitting than simple linear weights, which generally require large sets of samples to train.
From biological and medical perspectives, the potential correlations between protein expression and genes are moderate, which may be one reason why individual ML model did not result in high classification accuracies for HRD in addition to the sample size limitations.Additional subtypes of proteomic profiles with functional differences representing distinct subpopulations or hidden HRD stages within HRD positive or negative may explain how ensemble ML approaches perform better for multiple mRNA classes than binary HRD status.The predictable proteins for HRD status identified from our proposed approaches did not include some well-known drug target proteins, such as RAS (i.e., HRAS, KRAS, NRAS), which may indicate some unique values of proteomic data beyond HRD positive vs negative, given HRD status is defined by the presence of germline or somatic BRCA1 or BRCA2 mutations, BRCA1 promoter methylation, or homozygous deletion of PTEN [41][42][43].The identified high ranked important protein markers/features from MS could be further examined to better understand the important biological processes and biomarkers influencing the disease progression, heterogeneity, and efficacy of platinum therapeutics [44][45][46].

FUNDING
There is no funding support for the current study.
Top selected important proteins using NN (three layers, 10 hidden nodes) for mRNA classes prediction Top selected important proteins from HEFs for HRD classes' prediction compared with discriminant analysis: Red and green dots represent OV HRD Positive versus negative and normal samples, respectively.

Figure 3
Figure3shows the top selected proteins (list on the top right) from mlpML (three layers, 10

Fig. 5
Fig.5shows the top selected important proteins from HEFs for HRD classes' prediction with 46. Tewari D, Java JJ, Salani R, et al(2015).Long-term survival advantage and prognostic factors associated with intraperitoneal chemotherapy treatment in advanced ovarian cancer: a gynecologic oncology group study.J Clin Oncol;33:1460-66.

Figure 1 .
Figure 1.Machine learning Solutions with Ensemble Feature Selections for Proteogenomics Data

Figure 2 .Figure 3 .
Figure 2. Left: Statistical process control for quality examination; Middle: Missing patterns