Data
A total of 851 fecal metagenomic whole genome sequencing (WGS) samples (see Table 2) were analyzed. Sequence data were downloaded from the European Nucleotide Archive. The condition of each sample was obtained from the supplementary tables of the corresponding publications (see Table 2) and complemented, where possible, using the R package curatedMetagenomicData [64], available in Bioconductor [65].
Bacterial Whole Genome Sequence data processing
Whole genome sequencing data were managed using the NGLess-Profiler [66] package. Raw sequencing data preprocessing and quality control were carried out using a version of the human-gut.ngl pipeline. The substrim built-in function was used to discard reads that did not meet the basic quality filter of being longer than 45 bases and having all bases with a Phred score over 25. To prevent potential contamination with human genome sequences, the reads were mapped against the human genome (hg19) and only those that did not map were retained. SAMtools [67] and BWA [68] were used to handle and map reads, respectively.
Functional profiles
Strain functional profiles were generated by assessing the gene coverage of KEGG [69] functional orthologs, taken from the KEGG Orthology database of molecular functions, which is represented in terms of orthology groups of genes [70]. Ortholog genes are the basic features used here, and the representation of each feature of the profile in a given sample is estimated from the number of reads mapping to it. These counts were obtained by mapping the reads that passed the quality filters against the integrated gene catalog of the human gut [71]. We used the NGLess built-in function count with the default values, applying the scaled normalization, in which the raw count is divided by the size of the feature and then scaled up so that the total number of counts is similar to the total raw count.
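As an illustration, the following minimal sketch reproduces this scaled normalization with made-up counts and gene lengths (the variable names are ours, not NGLess internals):

```python
import numpy as np

# Hypothetical raw read counts per KEGG ortholog and the gene (feature) lengths, in bases
raw_counts = np.array([120.0, 40.0, 300.0, 15.0])
gene_lengths = np.array([1500.0, 800.0, 2500.0, 600.0])

# Scaled normalization: divide each raw count by the size of its feature...
per_base = raw_counts / gene_lengths

# ...then scale up so that the total of the normalized counts matches the total raw count
scaled = per_base * raw_counts.sum() / per_base.sum()

print(scaled.sum(), raw_counts.sum())  # the two totals coincide
```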
Taxonomic profiles
Strain taxonomic profiles were obtained using the Centrifuge application [72]. The centrifuge-download command was used to download the reference genomes of archaea, bacteria, viruses and vertebrate mammalian (human and mouse) taxa. The reads of each sample were mapped against the reference genomes. The taxonomic profile consists of the relative representation of each genome in the sample, obtained by normalizing the number of reads mapping to each genome by its length.
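A minimal sketch of this length normalization, with placeholder read counts and genome sizes:

```python
import numpy as np

# Hypothetical numbers of reads assigned by Centrifuge to each reference genome
reads_per_genome = np.array([5000.0, 1200.0, 300.0])
genome_lengths = np.array([4.6e6, 2.1e6, 5.0e6])  # genome sizes in bases

# Normalize the read counts by genome length and express each genome
# as its relative representation in the sample
per_base = reads_per_genome / genome_lengths
relative_abundance = per_base / per_base.sum()

print(relative_abundance)  # sums to 1
```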
Tumor status prediction
For tumor status prediction, a combination of different machine learning methods was used with the aim of attaining a white-box interpretable model [34-36] that improves on previously reported results. The white-box model is constructed sequentially by combining robust feature selection techniques with an interpretable yet powerful classification method, the Generalized Additive Model (GAM) [73]. GAMs have been used for predictive modeling in many fields, including healthcare applications [74, 75], because they are more flexible than a generalized linear model without losing interpretability.
In order to classify a sample as tumor or healthy, we used a new interpretability-based algorithm: the Explainable Boosting Machine (EBM) [76], a GAM that uses state-of-the-art machine learning methods, such as bagging and parallel gradient boosting, to learn a function for each feature. Note that GAMs are non-parametric learning algorithms with high flexibility with respect to the data distribution, highly interpretable modeling due to the additive effects, and the ability to learn complex non-linear relationships between the variables and the outcome. EBM applies a boosting methodology in which each feature is learned in a circular order with a learning rate low enough to avoid ordering-based issues. These learning cycles allow the algorithm to learn a precise function for each feature of the profile in such a way that it is easy to extract how any of the features contributes to a given prediction. By summarizing these contributions, a global relevance score (i.e. a way to rank each profile feature) is obtained. The learning rate was optimized via ten-fold cross-validation.
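A minimal sketch of this training step, assuming a feature matrix X (samples by KEGG orthologs) and binary labels y; the placeholder data and the learning-rate grid are illustrative, not the exact settings used:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from interpret.glassbox import ExplainableBoostingClassifier

# Placeholder data: X would be the samples x KEGG-ortholog matrix,
# y the binary tumor status (0 = healthy, 1 = tumor)
rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(100, 20)).astype(float)
y = rng.integers(0, 2, size=100)

# EBM follows the scikit-learn estimator API, so the learning rate can be
# optimized with a standard ten-fold cross-validation grid search
search = GridSearchCV(
    ExplainableBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.001, 0.01, 0.05, 0.1]},  # illustrative grid
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(X, y)
ebm = search.best_estimator_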
Adenoma risk
In addition, from a healthcare point of view, one of the most interesting aspects of the predictions lies in the classifier's ability to differentiate between adenomas and normal or CRC samples. In order to check the discriminative power of the classifier we proceeded as follows: (1) split the binary dataset, placing 30% of the samples in the test set, (2) add the adenoma samples to the test set, (3) train the classifier on the training set and predict the new test set, and finally, (4) compute the risk-at-k score, defined as the sum of the (signed) contributions of the top k features (see Equation 1 in the Supplemental Files).
This risk function can be constructed for any k, but here k corresponds to the number of features selected by the feature selection method (see below).
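As a sketch of this computation (the helper name is ours; it assumes a fitted EBM, a matrix X of test samples, and the list of terms returned by the feature selection step), the signed per-sample contributions can be taken from the EBM local explanations and summed over the selected features:

```python
import numpy as np

def risk_at_k(ebm, X, selected_terms):
    """Risk-at-k sketch: sum of the signed EBM contributions of the
    selected (top-k) terms for each sample in X."""
    local = ebm.explain_local(X)  # per-sample additive contributions
    risks = []
    for i in range(len(X)):
        data = local.data(i)  # dict holding the term 'names' and their 'scores'
        contrib = dict(zip(data["names"], data["scores"]))
        risks.append(sum(contrib.get(term, 0.0) for term in selected_terms))
    return np.array(risks)
```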
Feature Selection
The feature selection pipeline is based on the biological hypothesis that there are complex relationships between the functional profiles and the tumor status. Since KEGG [77] orthologs, which correspond to biochemical reactions, are taken as features, there is an imbalance between the number of features (in the range of thousands) and the number of samples (in the range of hundreds).
In order to overcome these difficulties, a multivariate feature selection technique based on the EBM feature ranking method (discussed below) was developed. The method is similar to the stability selection technique [78], in which a model is fitted across several bootstrapped versions of the data, thus producing several rankings (each model produces a feature ranking score). The idea is to select those features that are ranked at the top across the majority of the partitions. Instead of selecting the top-ranked features directly, the distribution of the EBM ranking scores is computed across all data partitions, and a cut point based on the elbow point of the medians is selected. This approach allows using the same multivariate model used for prediction, which has proven to perform well in high-dimensional spaces with collinearities by better distributing the contributions of similar variables among them. The method is applied to the global dataset (sample-wise union of all projects), using the bootstrap to alleviate the problems associated with inter-project discrepancies.
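The following is a condensed sketch of this procedure under our own assumptions; the number of bootstrap replicates, the disabled interactions, and the chord-distance elbow heuristic are illustrative choices, not the exact settings of the pipeline:

```python
import numpy as np
from sklearn.utils import resample
from interpret.glassbox import ExplainableBoostingClassifier

def bootstrap_rankings(X, y, n_bootstraps=30, random_state=0):
    """Fit an EBM on bootstrapped versions of (X, y) and collect, for every
    feature, the distribution of its importance score across partitions."""
    rng = np.random.default_rng(random_state)
    scores = np.zeros((n_bootstraps, X.shape[1]))
    for b in range(n_bootstraps):
        Xb, yb = resample(X, y, random_state=int(rng.integers(1_000_000)))
        ebm = ExplainableBoostingClassifier(interactions=0, random_state=0).fit(Xb, yb)
        # overall importance (mean absolute contribution) of each feature
        scores[b] = ebm.explain_global().data()["scores"]
    return scores

def select_by_median_elbow(scores):
    """Rank features by the median of their bootstrap scores and cut at the
    elbow: the point of the sorted curve furthest from its first-last chord."""
    medians = np.median(scores, axis=0)
    order = np.argsort(medians)[::-1]
    curve = medians[order]
    chord = np.linspace(curve[0], curve[-1], len(curve))
    cut = int(np.argmax(np.abs(curve - chord))) + 1
    return order[:cut]  # indices of the selected features
```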
Feature relevance ranking
To rank each feature, EBM computes the feature-wise contribution for each sample in the training set, i.e. how each sample would be scored by a given feature. Then, for each feature, the mean of the absolute value of the contributions is taken. Since EBM is an additive model, the top-ranked features correspond to those with the most predictive power in the training set. Note that, given the additive properties of the model, the impact of a given feature on any sample can be directly observed. In fact, an interpretability curve can be built, in which the variation of a given feature across samples and its associated model contribution can be plotted and directly inspected. The possibility of extracting feature-wise contributions from a complex multivariate classifier is a distinctive property of this approach that allows making more meaningful interpretations.
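A minimal sketch of this ranking, assuming a fitted EBM and the training matrix X_train (the function name is ours):

```python
import numpy as np

def rank_by_mean_abs_contribution(ebm, X_train):
    """Rank EBM terms by the mean absolute per-sample contribution,
    computed directly from the local explanations of the training set."""
    local = ebm.explain_local(X_train)
    names = local.data(0)["names"]
    contributions = np.array([local.data(i)["scores"] for i in range(len(X_train))])
    mean_abs = np.abs(contributions).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]
    return [(names[i], float(mean_abs[i])) for i in order]
```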
Feature relevance in the adenoma experiment
In order to infer the relevance of the features for the adenoma experiment we used the SHapley Additive exPlanations (SHAP) [79] method: a novel additive feature attribution method that has been used with great success in many domains, such as hypoxaemia prevention during surgery [80] or NO2 forecasting in ecology [81].
The SHAP method combines local explanation methods, such as Local Interpretable Model-agnostic Explanations (LIME) [82], with game theory approaches (Shapley values). In broad terms, the SHAP values estimate the contribution of a given feature by approximating the change between the expected and the actual prediction when that feature is perturbed. To build the expectation, the method needs a set of background samples. In this case, all the training samples can safely be provided to the method, since all the adenoma cases belong to the blind test set. Note that the SHAP method provides local explanations and, by combining them, a global relevance score can be built.
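A sketch of how this can be done with the shap package; the fitted estimator, the arrays and the nsamples value are placeholders, and KernelExplainer is one model-agnostic way of obtaining SHAP values for an EBM:

```python
import numpy as np
import shap

# ebm: fitted ExplainableBoostingClassifier; X_train and X_adenoma: numpy arrays (assumed)
# All training samples are used as the background set from which the expectation is built
explainer = shap.KernelExplainer(lambda X: ebm.predict_proba(X)[:, 1], X_train)

# Local explanations: one signed contribution per feature and adenoma sample
shap_values = explainer.shap_values(X_adenoma, nsamples=200)  # nsamples is illustrative

# Global relevance score per feature: mean absolute SHAP value across samples
global_relevance = np.abs(shap_values).mean(axis=0)
```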
Batch effect removal
Batch effect is a known problem in gut metagenomic studies [28, 29]. Here, experimental effects were removed, as far as possible, by filtering out those features that contribute to a project-wise separation of the healthy samples. This project-specific effect removal is done in three steps: first, the samples labeled as healthy across all the studies are selected; second, the feature selection pipeline is applied to this healthy subset using a multiclass approach, where the study is the target variable of the underlying classifier; finally, those features labeled as relevant by the pipeline in all the studies are removed.
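A schematic sketch of this step, reusing the bootstrap_rankings and select_by_median_elbow helpers sketched above; reading the multiclass approach as one run per study (one-vs-rest) is our assumption:

```python
import numpy as np

def batch_features_to_remove(X_healthy, study_labels):
    """Return the features flagged as relevant for separating every study
    among the healthy samples (candidates for removal as batch effects)."""
    relevant_per_study = []
    for s in np.unique(study_labels):
        y = (study_labels == s).astype(int)  # this study versus the rest
        scores = bootstrap_rankings(X_healthy, y)
        relevant_per_study.append(set(select_by_median_elbow(scores)))
    # keep only the features selected as relevant in all the studies
    return set.intersection(*relevant_per_study)
```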
Model interpretability
As EBM belongs to the additive family of machine learning algorithms, all features contribute to the predictions in an additive, intelligible way. Thus, we can visualize how a given feature contributes to the prediction of any sample. We can summarize this information by visualizing the feature distribution together with its contribution to the EBM prediction.
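For example, with the interpret package the learned shape functions and per-sample decompositions can be displayed directly (ebm, X_test and y_test are assumed to exist):

```python
from interpret import show

# Global view: for each feature, its value range against the additive
# contribution that the EBM assigns to the prediction
show(ebm.explain_global())

# Local view: the per-feature contributions for individual (hypothetical) test samples
show(ebm.explain_local(X_test, y_test))
```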
Software
The machine learning methodology has been implemented in Python 3.6 using the scikit-learn library [83] as the basic building block for the pipelines. To train EBM models we used the InterpretML package [76].