Interpretable machine learning analysis of functional metagenomic profiles improves colorectal cancer prediction and reveals basic molecular mechanisms.

Background Gut microbiome is gaining interest because its links with several diseases, including colorectal cancer (CRC). Results Here we performed a meta-analysis of samples from five publicly available studies. We used an interpretable machine learning approach based on functional profiles, instead of the conventional taxonomic profiles, to produce a highly accurate predictor of CRC with better precision than those of previous proposals. Moreover, this approach is also able to discriminate samples with adenoma, which makes this approach very promising for CRC prevention by detecting early stages in which intervention is easier and more effective. In addition, interpretable machine learning methods allows extracting features relevant for the classification, which reveals basic molecular mechanisms accounting for the changes underwent by the microbiome functional landscape in the transition from healthy gut to adenoma and CRC conditions. Conclusion Functional profiles provide superior accuracy in predicting CCR and adenoma conditions than taxonomic profiles and additionally, in a context of explainable machine learning, provide useful hints on the molecular mechanisms operating in the microbiota behind these conditions.


Background
In the last years the study of the microbiome has progressively gained interest, especially in the context of human health [1][2][3][4]. Microbial abundance profiles based on 16S rRNA genes have been used to study microbiomes, although whole genome sequencing (WGS) is becoming increasingly popular in recent years due to the decreasing sequencing costs [5,6]. Contrary to 16S rRNA data, WGS microbiome data provides the real gene composition in the bacterial pool of each sample, which allows identifying strain-specific genomic traits [7,8]. During the last years, microbiome WGS has been used to explore microbiome-host interactions within a disease context by means of metagenome-wide association studies, that allow studying gut microbiome alterations characteristic of different pathologic conditions [3,[9][10][11][12][13][14][15][16][17]. In particular, recent evidence suggests that the human gut microbiome could be a relevant factor in human diseases [18,19]. In fact, the existence of carcinogenic mechanisms mediated by bacterial organisms has recently been proposed [20][21][22]. And, more specifically, it has been suggested that the gut microbiome could play a relevant role in the development of colorectal cancer (CRC) [15,16,[23][24][25]. Due to this, the gut microbiome has been proposed as a potential diagnostic tool for CRC [16,17,26,27]. Nevertheless, its reproducibility and the predictive accuracy of the microbial gene signatures across different cohorts have been questioned [28,29]. The increasing availability of whole metagenome shotgun datasets of CRC cohorts [15-17, 26, 27] facilitates large-scale multi-population exploratory studies of the CRCassociated microbiome at the resolution level of strain [30,31]. In two recent studies, a combined analysis of heterogeneous CRC cohorts was able to identify reproducible microbiome biomarkers and accurate disease predictive models that open the door to future clinical prognostic tests [28,29]. The subsequent meta-analysis of the functional potential in the strains of the signature found gluconeogenesis and putrefaction and fermentation pathways associated with CRC, in coherence with the current knowledge on microbial metabolites implicated in carcinogenesis [32].
It is important to note that the current approaches used to obtain biomarkers with predictive power use microbial strain or gene signatures as features to train a predictive model. Since genes or strains have not a clear interpretability by themselves, the interpretation of the results of the classification produced by the model relies on the analysis of the potential functionalities encoded by these features. In other words, the predictive model is built using features that need to be interpreted a posteriori [33]. In fact, this is a relatively common problem with many current machine learning techniques, which have evolved in recent years to enable robust association of biological signals with measured phenotypes but, in many cases, such approaches are unable to identify causal relationships [34,35]. However, the interpretability of models, especially in a clinical context, is becoming an increasingly important issue [34][35][36]. The use of features with a direct functional interpretation has been suggested as crucial for the interpretability of the models [37]. In a recent study, gene profiles derived from WGS of samples of the MetaSub project [38] were initially transformed into functional profiles, which account for bacterial metabolism and other cell functionalities, and have subsequently been used as features to build a city classification machine learning algorithm [39]. Since the features are informative by themselves, their relevance in the classification provides an immediate 4 interpretability to the prediction model built.
Here, we propose an interpretable machine learning approach in which functional profiles of microbiota samples, with a direct interpretation, are first obtained from shotgun sequencing and subsequently used as features for predicting CRC in the patient donor of the sample. Moreover, in the prediction schema proposed, a feature relevance method allows extracting the most important functional features that account for the classification. Thus, any sample is described as a collection of functional modules contributed by the different bacterial species present in it, which account for the potential functional activities that the bacterial population in the sample, as a whole, can perform.

Functional profiles discriminate between CRC, adenoma and normal samples across different cohorts
The application of the machine learning pipeline described in methods using functional profiles across the cohorts used after removing the batch effect produced a clear separation between healthy samples and CRC samples. Moreover, contrarily to the case of previous classification models based on taxonomic profiles [28,29], the two adenoma types, although with less accuracy than the CRC samples, were also well differentiated (see Figure 1). The figure also showcases the predictive power of the selected features As it has been previously noted [28,29] sample prediction accuracy is not uniform across projects and some of them present a poorer prediction results when the general predictor model trained with the rest of projects is used (see Supplementary Figure 1). However, many of these misclassified samples are special cases either with comorbilities or with advanced metastases, which probably changes the functional landscape of the bacterial functionality.

The most relevant features for the classification provides a biological interpretation of the model and account for disease mechanisms
The feature selection process used to build the classifier renders a number of features that can be ranked by relevance (see Figure 2A) Table 1 shows the ten most relevant modules. Figure 3 shows the distribution of samples along the value of the feature, which allows understanding the contribution of each feature to the disease condition. As described in the Discussion section all these features are related to bacterial activities that have directly or indirectly been linked to CRC.

The most relevant features that differentiate adenomas from CRC and normal samples
In order to estimate the contribution of features to the classification of the adenoma samples, we used the SHAP method. Supplementary Table 5 lists the most relevant features for the classification of large and small adenomas according to the SHAP scores. Supplementary Figure 5 shows the Plots of feature interpretability for the most relevant features used by the model for the discrimination of small and large adenoma. Most of them are specific features of CRC that start to appear probably in the first adenoma stages previous to the cancer. Some of them are known CRC prognostic biomarkers, such as the archaeal large subunit ribosomal protein L19e [40], while other are characteristic of the adenoma stage, like the coenzyme F420 hydrogenase subunit beta, related to methane metabolism, enhanced in adenoma patients [41] or the chemotaxis protein methyltransferase CheR, which has been described as a distinctive feature of patients with adenomas [42].

Comparison of functional and taxonomic profiles
In order to compare the performance obtained in the classification that uses functional profiles as features with the conventional approach in which the features used are taxonomic profiles we obtained per sample taxonomic profiles as described in Methods and repeated the training procedure to obtain another predictor model. Once ranked by relevance ( Figure 2B), a total of 132 taxonomic features were over the threshold (listed in Supplementary Table 4 and represented in a phylogenetic tree in Supplementary Figure 4). Figure 4 shows the relative performance of the prediction using both functional and taxonomic profiles. The predictor based on taxonomic profiles results to be slightly better only in a few cases for the easiest problem of distinguishing between CRC and healthy samples. However, it is interesting to 6 note that in the more difficult problem of distinguishing between adenomas, normal and CRC samples, taxonomic profiles clearly show a poorer performance than functional profiles (compare Figure 5 and Figure 1).
It is interesting to note that when the correlation between the most relevant functional features and the most relevant taxonomic features across samples is studied it becomes apparent that it varies among the different experiments (Supplementary Figure 6). This observation suggests that the particular CRC microbiota profile defined by the taxonomic features can be reached by means of different combination of bacterial species.

Comparison with other approaches
The availability of whole metagenome shotgun datasets of CRC cohorts is relatively recent [15-17, 26, 27] and for this reason there are no much studies on CRC prediction from fecal microbiome data. An early study with 156 samples obtained a classification accuracy, as area under the curve (AUC), of 0.84 [16], although the samples belonged to a unique cohort. A more recent study of 526 samples found seven enriched bacterial markers that classified cases from controls with an accuracy of 0.80 [43] although it has been suggested that could be affected of over fitting issues [29]. The last study published, that outperforms the previous ones, obtained an average AUC 0.84 [29]. Here a more detailed comparison of the results obtained in this last study with the results obtained here is shown. Figure 4 depicts a comparison of the AUC scores obtained in this study and in the last study published [29]. Interestingly, the internal validation of our method, a 20 times repeated tenfold cross-validation over the whole dataset shows a systematic advantage of the approach presented here over the mentioned previous study. It is also interesting to note that both results present a similar trend across all the projects. Moreover, as commented below in the Discussion section, our results are, from a biological point of view, congruent with the findings in the literature.

Discussion
This study uses a comprehensive collection of the cohorts of CRC (listed in Table 2). Only one available dataset was discarded, PRJEB12449, frozen for more than 25 years before it was sequenced [44], which most probably compromised the quality of the results [45], and, actually, was described 7 as technically flawed by previous studies [28,29].
The interpretable machine learning approach used here has demonstrated to outperform other class predictors previously reported. One of the most interesting properties of this approach is its immediate interpretability. Thus, the features chosen by the model that optimize the discrimination between the conditions compared account for the functionalities that operate differentially among both conditions. The most relevant feature is the Heptose II phosphotransferase (K02850). This enzyme, located in the Lipopolysaccharide biosynthesis (ko00540) pathway, is associated with CRC in high values (see Figure 3A). Actually, the presence of lipopolysaccharides produced in the surface of Gram-bacteria has been reported to induce an inflammatory response as well as to stimulate the proliferation of colon carcinoma [46,47]. The second most relevant feature is Manganese/zinc/iron transport system permease protein (K11708). According to the model the presence of this feature increases the probability of CRC (see Figure 3B). This transporter increases its number in excess iron conditions that are known to promote colorectal carcinogenesis [48]. Tagatose 6-phosphate kinase (KO00917), the third most relevant feature, belongs to the Galactose metabolism (ko00052) metabolic pathway. According to the model ( Figure 3C), high values seem to be protective (predictor of healthy status). Actually, the metabolism of galactose is related to diets rich in fiber (fruits and vegetables) that have a protective effect against CRC [49]. Methyltransferase (K16168), related to polyketide synthesis, is the next most relevant feature. In this case, the model predicts that low levels of Methyltransferase activity are associated to low probability of CRC ( Figure 3D). It has recently been described that a class of molecules, colibactins, are produced from the gene cluster called the polyketide synthase island that occurs in certain strains of Escherichia coli prevalent in the microbiota of CRC patients [50]. Another relevant feature is Methylaspartate mutase sigma subunit (K01846), whose high activity is related with high probability of CRC according to the model ( Figure 3E). It has been described that cancer cells undergo modifications that include increased glutamine catabolism  An extensive description of the features selected by the model is beyond the scope of this paper, however it is worth noting that fully agree with the findings of functional analysis done in previous reports [28,29].
Interpretability of the predictive models is becoming a major issue, especially in biomedicine [33, 34,36]. The idea of using features with full biological meaning to gain interpretability in the machine learning methodology used has recently been proposed as a "white box" strategy [37] and has successfully been used for the first time in the analysis of urban microbiota [39] in the context of the METASub project [38]. Actually, when the relative performance of the predictors based either of functional or on taxonomic features is compared (Figure 4) functional profiles perform better in most of the projects distinguishing between CRC and healthy samples. However, functional profiles clearly overcome the performance of taxonomic profiles in distinguishing between adenoma, normal and CRC samples, which is a more difficult problem (compare Figure 5 and

Conclusions
The interpretable machine learning approach proposed here has demonstrated a superior performance to other approaches previously proposed. Moreover, it demonstrated a better resolution not only with respect to the separation between healthy and CRC samples, but it is also able to discriminate samples with adenoma, being a promising tool for CRC prevention by detecting early stages in which intervention is easier and more effective. And finally, the model has a biological interpretation that provides important clues to better understand the mechanistic implications of the gut microbiota in CRC as well as in the previous stages of adenoma, which can have an interesting potential in preventive medicine and, specifically, in cancer interception [63].

Data
A total of 851 fecal metagenomic whole genome sequencing (WGS) samples (See Table 2) were analyzed. Sequence data were downloaded from European Nucleotide Archive. The conditions of the samples were obtained from the different supplementary tables of different publications (see Table 2) and complemented in the possible using the R package curatedMetagenomicData [64] available in Bioconductor [65].

Bacterial Whole Genome Sequence data processing
Whole genome sequencing data was managed using the NGLess-Profiler [66] package. Raw sequencing data preprocessing and quality control was carried out using a version of the humangut.ngl pipeline. The substrim built-in function was used to discard reads that do not meet the basic quality filter of being longer than 45 bases and having all bases with a Phred score over 25. To prevent potential contaminations with human genome sequences the reads were mapped against the human genome hg19, and those that do not map were conserved. SAMtools [67] and BWA [68] were used to handle and map reads respectively.

Functional profiles
Strain functional profiles are generated by assessing the gene coverage for KEGG [69] functional orthologs, taken from the KEGG Orthology database of molecular functions, represented in terms of orthology groups of genes [70]. Ortholog genes are the basic features used here, in each sample.
Then, the representation of each feature of the profile in any sample is estimated from the number of reads mapping on it. These counts were obtained by mapping the reads that passed the filters against the integrated gene catalog of the human gut [71]. We used the NGLess built-in function count with the default values, applying the scaled normalization that consists of dividing the raw count by the size of the feature and scaled up so that the total number of counts is similar to the total raw count.

Taxonomic profiles
Strain taxonomic profiles were obtained using the Centrifuge application [72]. The centrifugedownload command was used to download the reference genomes of archea, bacteria, virus and vertebrate mammalian (human and mouse) taxons. The reads of each sample were mapped over the reference genomes. The taxonomic profile consists of the relative representation of each genome in the sample, which is obtained by normalizing the number of reads mapping on them by the respective lengths of the genomes.

Tumor status prediction
For tumor status prediction a combination of different machine learning methods was used with the aim of attaining a white-box interpretable model [34][35][36] that improve results previously reported. The white-box model is sequentially constructed by combining robust feature selection techniques with an interpretable yet powerful classification method in the form of Generalized Additive Models (GAM) [73]. GAMs have been used in many fields, including healthcare applications [74,75], for predictive modeling due to being more flexible than a general linear model without losing the interpretability.
In order to classify a sample as tumor or healthy a new interpretability-based algorithm we have used: the Explainable Boosting Machine (EBM) [76], a GAM model that uses state-of-the-art machine learning methods, such as bagging and parallel gradient boosting, in order to learn a function for each feature. Note that GAMs are non-parametric learning algorithms with a high flexibility regarding to the data distribution, a very interpretable modeling, due to the additive effects, and the ability to learn complex non-linear relationships between the variables and the outcome. The EBM builds a boosting methodology where each feature is learned in a circular order with a learning rate low enough to avoid ordering-based issues. These learning cycles allow the algorithm to learn a precise functional for each feature in a profile in such a way that is easy to extract how any of the features contribute towards a given prediction. By summarizing these contributions a global relevance score (i.e. a way to rank each profile feature) is obtained. The learning rate was optimized via ten-fold cross-validation.

Adenoma risk
In addition, from a healthcare point of view, one of the most interesting aspects of the predictions lies in the classifier ability to differentiate between adenomas and normal or CRC samples. In order to check the discriminative power of the classifier we proceeded as follows: (1) perform a split of the binary dataset with 30% of the samples in the test set, (2) aggregate the adenoma samples to the test set, (3) train the classifier on the train set and predict the new test set and finally, (4) compute the risk at k score defined as the sum the (signed) contributions of the top k features: (see Equation 1 in the Supplemental Files) This risk function can be constructed for any k, but here k corresponds to the features selected by the feature selection method (see below).

Feature Selection
The feature selection pipeline is based on the biological hypothesis that there are complex relationships between the functional profiles and the tumor status. Since KEGG [77] orthologs, which correspond to biochemical reactions, are taken as features, there is an unbalance between the number of features (in the range of thousands) and the number of samples (in the range of hundreds) In order to overcome such difficulties a multivariate feature selection technique based on the EBM feature ranking method (discussed below) was developed. The method is similar to the stability selection technique [78] in which a model is fitted across several bootstrapped versions of the data, thus having several rankings (each model produces a feature ranking score). The idea is to select those features that are ranked as top across the majority of the partitions. Instead of selecting the top ranked features, the distribution of the ranking scores of EBM is computed across all data partitions.
Then a cut point based on the elbow point of the medians is selected. This method allows using the same multivariate model used for prediction, which has proven to perform well on high dimensional spaces with colinearties by better distributing the feature contributions of similar variables between them. The method is used on the global dataset (sample-wise union of all projects) using bootstrap to try to leverage the problems associated to the inter-project discrepancies.

Feature relevance ranking
To rank each feature EBM computes the feature-wise contribution for each sample in the training set, i.e. how each sample would be scored by a given feature. Then, for each feature the mean of the absolute value of the contributions is taken. Since EBM is an additive model, the ranked features correspond with those that have the most predictive power in the training set. Note that, given the 13 additive properties of the model, the impact of a given feature on any sample can be directly observed. Actually, an interpretability curve can be built, where the variation across samples of a given feature and its associated model contribution can be directly plotted and observed. The possibility of extracting feature-wise contributions from a complex multivariate classifier is a distinctive property of this approach that allows making more meaningful interpretations.

Feature relevance in the adenoma
In order to infer the relevance of the features for the Adenoma experiment we have used the SHapley Additive exPlanation (SHAP) [79] method: a novel additive feature attribution method that has been used with great success in many domains, such as hypoxaemia prevention during surgery [80] or NO 2 forecasting in ecology [81].
The SHAP method combines local explanation methods, as Local Interpretable Model-agnostic Explanations (LIME) [82], with game theory approaches (Shapley values). In broad terms the SHAP values estimate a given feature contribution by approximating the change between the expected and actual prediction when perturbing that feature. To build the expectation the method needs a set of background samples. In this case all the training samples we can safely be provided to the method since all the adenoma cases belong to the blind test set. Note that the SHAP method provides local explanations and, by combining them, a global relevance score can be built.

Batch effect removal
Batch effect is a known problem in gut metagenomic studies [28,29]. Here, experimental effects were removed in the possible by filtering those features that contribute to perform a project-wise separation of the healthy samples. Then, project-specific effect removal is done in three steps: first, samples labeled as healthy across all the studies are selected, second, the feature selection pipeline is applied to the healthy subset using a multiclass approach, where the study is the target variable of the underlying classifier. Finally, those features labeled as relevant by the pipeline in all the studies are removed.

Model interpretability
As EBM belong to the additive family of machine learning algorithms, all features contribute to predictions in an adaptive, intelligible way. Thus, we can visualize how a given feature contributes to the prediction of any sample. We can summarize this information by visualizing the feature distribution along its contribution to the EBM prediction.

Software
The machine learning methodology has been implemented in Python 3.6 using the scikit-learn library [83] as the basic building block for the pipelines. To train EBM models we have used the interpretml package [76].

Availability of data and material
The WGS of fecal metagenomic samples data that support the findings of this study are available in

Competing interests
The authors declare that they have no competing interests Funding This work is supported by grants SAF2017-88908-R from the Spanish Ministry of Economy and Competitiveness and "Plataforma de RecursosBiomoleculares y Bioinformáticos" PT17/0009/0006 from the ISCIII, both co-funded with European Regional Development Funds (ERDF) as well as H2020 Programme of the European Union grants Marie Curie Innovative Training Network "Machine Learning Frontiers in Precision Medicine" (MLFPM) (GA 813533) and "ELIXIR-EXCELERATE fast-track ELIXIR implementation and drive early user exploitation across the life sciences" (GA 676559).

Authors' contributions
CCS has carried out the bioinformatic analysis of the samples and contributed to the interpretation of 15 the results; CL has carried out the machine learning analysis of the data; MPC has made the functional interpretation of the results; JD has conceived the work and wrote the paper.
invasive biomarkers for colorectal cancer. Gut 2017, 66(1):70-78.         Comparison between the ML approach presented in [29] and the approach proposed here using both functional and taxonomic profiles.
33 Figure 5 Distribution of risk score across the healthy, CRC and adenoma samples using taxonomic features.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.