DeepBiome: a phylogenetic tree informed deep neural network for microbiome data analysis

Abstract


Emerging high-throughput sequencing technologies have vastly improved our understanding of the role of the human microbiome in many diseases (Sartor, 2008). A natural question is whether to analyze microbiome data at a finer level (e.g., genus), at a deeper level (e.g., phylum), or even at a mixture of taxonomic ranks. Multiple approaches have been proposed to incorporate the phylogenetic structure into analysis.

Suppose we have p OTUs from a total of n microbiome samples and a phylogenetic tree that depicts the evolutionary relationship among microbes. Each OTU is a tip node on the phylogenetic tree and each internal node is a taxonomic unit representing a common ancestor of its descendent taxa. In this article, we aggregate the p OTUs to m genus-level taxa as the basic analyzing units; however, the basic analyzing unit can start at finer levels. Figure 1 illustrates the DeepBiome architecture. DeepBiome is a neural network architecture that associates input vectors x (representing microbiome abundance) with a clinical outcome y. A major challenge in constructing a neural network is to make decisions about the optimal number of layers and neurons in the network.

The conventional wisdom of going deep (many layers) and wide (many neurons per layer) finds great success in many artificial intelligence tasks such as image pattern recognition and natural language processing, but requires a huge amount of training data (Bergstra et al., 2011; Snoek et al., 2012), which is rarely affordable in biomedical studies due to resource constraints.

The information is propagated through multiple layers of the DeepBiome network to the outcome of interest y. The input vector x, e.g., the abundances of m^(0) genera, is propagated to the first hidden layer vector z^(1) with a total of m^(1) neurons, e.g., the number of family-level taxa, using an m^(1) × m^(0) weight matrix w^(1) and an m^(1) bias vector b^(1):

z^(1) = v(w^(1) x + b^(1)),

where v(·) is the activation function. Each weight parameter w^(1)_jk quantifies the association between the kth input taxon and the jth neuron of the first hidden layer. We use ReLU as the default activation, but it can be easily changed to other activation functions in our software. In the same manner, z^(l) = v(w^(l) z^(l-1) + b^(l)) for l = 2, ..., L, where L is the total number of hidden layers in the neural network. The last hidden layer z^(L) is linked to the outcome using either an identity link or a softmax link. Specifically, we use the identity link to predict a continuous outcome, y = w^(L+1) z^(L) + b^(L+1). For a categorical outcome with K categories, the softmax function is adopted to predict the probability that the ith subject belongs to the cth category:

Pr(y_i = c) = (e^(w^(L+1) z_i^(L) + b^(L+1)))_c / Σ_{q=1}^{K} (e^(w^(L+1) z_i^(L) + b^(L+1)))_q.

Finally, we use f_θ(x) with parameters θ = {w, b} to represent the whole neural network that maps an input x to an output y, where w = (w^(1), ..., w^(L), w^(L+1)) and b = (b^(1), ..., b^(L), b^(L+1)).
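As a concrete illustration, the layer-by-layer propagation and the two output links can be sketched as follows. The dimensions (6 genera aggregating to 2 families), the random weights, and the use of ReLU here are hypothetical placeholders, not fitted DeepBiome parameters:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(6)                       # genus-level abundances, m(0) = 6
w1 = rng.normal(size=(2, 6))            # m(1) x m(0) weight matrix
b1 = np.zeros(2)                        # m(1) bias vector

z1 = relu(w1 @ x + b1)                  # first hidden layer (family level)

# Identity link for a continuous outcome
w_out = rng.normal(size=(1, 2)); b_out = np.zeros(1)
y_cont = (w_out @ z1 + b_out)[0]

# Softmax link for a categorical outcome with K = 2
w_cat = rng.normal(size=(2, 2)); b_cat = np.zeros(2)
p = softmax(w_cat @ z1 + b_cat)         # class probabilities, sum to 1
```

Deeper trees simply chain more `relu(w @ z + b)` steps, one per phylogenetic level.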

2.3 Phylogeny regularization via weight decay

We introduce phylogeny regularization through weight decay. We assume that if taxa j and k have an ancestor-descendent relationship, the associations between the corresponding neurons are stronger, i.e., a larger weight value w_jk. When taxa j and k do not have this ancestral relationship, we assume w_jk to be a small value, i.e., weight decay. Thus, we construct a weight decay matrix ω to regularize the weights in the neural network using the evolutionary relationship carried by the phylogenetic tree. If nodes j and k are ancestor-descendent related, ω_jk = 1; if not, ω_jk is a small value, e.g., 0.01. See Figure 1(c) for an illustration.
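A minimal sketch of constructing ω and applying it elementwise. The tree here (6 genera under 2 parent taxa) is a hypothetical example mirroring the illustration in Figure 1(c), and 0.01 is the small decay value mentioned above:

```python
import numpy as np

# Hypothetical tree: genera 0-2 descend from parent taxon 0, genera 3-5 from parent taxon 1
parent = [0, 0, 0, 1, 1, 1]                  # parent index for each genus

m_child, m_parent = 6, 2
omega = np.full((m_parent, m_child), 0.01)   # small decay for unrelated pairs
for j, k in enumerate(parent):
    omega[k, j] = 1.0                        # ancestor-descendent pairs keep full weight

w = np.ones((m_parent, m_child))             # weight matrix to be regularized
w_decayed = omega * w                        # elementwise product applies the decay
```

Weights lacking phylogenetic support are shrunk toward zero by the factor 0.01, while supported weights are untouched.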
where y_{i,k} is a binary indicator (0 or 1) of whether observation i belongs to class k. This phylogeny regularization effectively uses biologically meaningful prior knowledge to limit the number of free parameters in the model, thereby avoiding overfitting.

Algorithm 1: Phylogeny regularized weight decay in Adam. β1, β2 refer to the exponential decay rates for the moment estimates in Adam. ε = 10^−8 is used to prevent division-by-zero errors (Kingma and Ba, 2014).
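The spirit of Algorithm 1 can be sketched as a standard Adam update followed by the elementwise ω decay; the exact placement of the decay within the update is our assumption here, not a transcription of the published algorithm:

```python
import numpy as np

def adam_step_with_phylo_decay(w, grad, m, v, t, omega,
                               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update followed by phylogeny-regularized weight decay.

    omega is 1 for ancestor-descendent weight entries and a small value
    (e.g., 0.01) otherwise; multiplying elementwise after the Adam step
    shrinks weights that lack phylogenetic support.
    """
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # standard Adam update
    return omega * w, m, v                        # phylogeny weight decay

# Toy one-step example with a 2x2 weight matrix
w0 = np.ones((2, 2))
grad = np.ones((2, 2))
omega = np.array([[1.0, 0.01], [0.01, 1.0]])
w1, m1, v1 = adam_step_with_phylo_decay(w0, grad,
                                        np.zeros((2, 2)), np.zeros((2, 2)),
                                        t=1, omega=omega)
```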

We employ several statistical metrics to evaluate the prediction performance of DeepBiome, where TP is the number of true positives, TN true negatives, FP false positives, and FN false negatives. The F1 score is the harmonic mean of precision and sensitivity (recall). An F1 score reaches its best value at one, when the prediction has perfect precision and recall, and its worst at zero. Note that the F1 score does not take true negatives into account. We use the g-measure, which is the geometric mean of sensitivity and specificity, to assess the performance of a binary classifier.

Like the F1 score, the g-measure reaches its best value at one when sensitivity and specificity are both perfect (one), and its worst at zero if either sensitivity or specificity is zero. We also report the AUC (area under the receiver operating characteristic curve), which reflects the capability of a model to distinguish between classes. Sensitivity, specificity, g-measure, and accuracy (ACC) across all hidden layers (see Table 1) are used to report the selection accuracies.
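These metrics follow directly from the confusion-matrix counts; the helper below is an illustrative implementation, not the evaluation code used in the paper:

```python
def binary_metrics(tp, tn, fp, fn):
    """Classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall / true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g = (sensitivity * specificity) ** 0.5  # geometric mean
    acc = (tp + tn) / (tp + tn + fp + fn)
    return dict(sensitivity=sensitivity, specificity=specificity,
                f1=f1, g_measure=g, accuracy=acc)
```

For example, `binary_metrics(40, 30, 20, 10)` gives sensitivity 0.8 and specificity 0.6, so the g-measure is √0.48 ≈ 0.693 while accuracy is 0.7, illustrating how the g-measure penalizes imbalance between the two rates.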

The association network between the microbiome x and the outcome y can be extremely complex. In this section, we use a forward propagation approach described below to generate y. We start with 2964 OTUs and aggregate them into 48 genera. The following steps were used to generate the outcome y.

x_family = z^(1) = v(w^(1) x_genus + b^(1)).
5. Repeat steps 2-4 to compute x_order, x_class, and x_phylum.
Pr(y_i = c) = (e^(w^(4) x_phylum + b^(4)))_c / Σ_{q=1}^{K} (e^(w^(4) x_phylum + b^(4)))_q,

where K = 2 for binary classification and K ≥ 3 for multicategorical classification. We compare DeepBiome to linear regression as well as penalized regression with ℓ1 norm (Lasso), ℓ2 norm (Ridge), and ℓ1 + ℓ2 norm (Elastic-Net) penalties. We also compare DeepBiome to a conventional DNN and an ℓ1-regularized DNN. The DNN and ℓ1-DNN use the same number of hidden layers and neurons on each layer as DeepBiome, without phylogenetic tree regularization. We use five-fold cross-validation to choose the tuning parameters for the regularized linear regressions. DeepBiome consistently achieves the best performance on the test set.
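The generative steps above can be sketched as follows; the layer sizes (48 genera down to 3 phyla), weight scales, and ReLU activation are hypothetical, chosen only to illustrate the forward-propagation mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical taxonomy sizes: genus -> family -> order -> class -> phylum
sizes = [48, 20, 12, 6, 3]
weights = [rng.normal(scale=0.5, size=(sizes[i + 1], sizes[i]))
           for i in range(len(sizes) - 1)]
biases = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]

x = rng.random(48)                      # genus-level abundances for one sample
for w, b in zip(weights, biases):       # steps 2-4: propagate level by level
    x = relu(w @ x + b)                 # family, order, class, then phylum

K = 2                                   # binary outcome
w_out = rng.normal(size=(K, sizes[-1])); b_out = np.zeros(K)
logits = w_out @ x + b_out
p = np.exp(logits - logits.max()); p /= p.sum()   # softmax probabilities
y = rng.choice(K, p=p)                  # sample the class label
```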

Identifying associated taxa at precise levels is critical for downstream biological validation (Supplementary Tables S1 and S2). Interestingly, despite being the second best method regarding prediction (see Table 3), ℓ1-DNN fails to identify the true microbiome taxa across all phylogenetic levels.

Scenario 2: Binary classification

We consider the case where the outcome-associated taxa are clustered at a mixture of phylogenetic levels. For a binary outcome, we suppose that

(1) the higher the abundance of the blue node taxa, the higher the probability that y belongs to the disease group;

(2) the higher the abundance of the red node taxa, the higher the probability that y belongs to the healthy control group.

We compare DeepBiome to logistic regression, three penalized logistic regression models, and two conventional deep learning networks. The same learning rate, stopping criteria, and mini-batch size (100) are used for DeepBiome, DNN, and ℓ1-DNN. In Table 4, we present the metrics for evaluating the classification performance for the binary outcome, including sensitivity and specificity (see also Supplementary Table S3).

Scenario 3: Multiclass classification

We simulated multi-categorical outcomes, e.g., the severity of illness, which may be categorized as "mild", "moderate", or "severe". Consistent with the previous simulations, we assume that the blue node taxa contribute to the "severe" group, the red node taxa contribute to the "mild" group, and part of the gray node taxa contribute to the neutral "moderate" group.

To examine the robustness of DeepBiome, we consider two sources of model mis-specification:

(1) Abundances contain measurement errors at the genus level. We assume that 10% of the associated genus reads are mis-classified to one randomly selected genus from the same phylum.

The microbiome abundance data with measurement errors are then used for training the models.
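This perturbation can be sketched as follows, with hypothetical counts and phylum labels; for each associated genus, 10% of its reads are moved to one randomly chosen genus in the same phylum:

```python
import numpy as np

rng = np.random.default_rng(2)

counts = rng.integers(0, 200, size=10).astype(float)   # reads for 10 genera
total = counts.sum()                                   # total reads, for checking
phylum = np.array([0] * 5 + [1] * 5)                   # hypothetical phylum labels
associated = [1, 6]                                    # outcome-associated genera

for g in associated:
    moved = 0.10 * counts[g]                           # 10% of reads mis-classified
    same_phylum = [j for j in range(10)
                   if phylum[j] == phylum[g] and j != g]
    target = rng.choice(same_phylum)                   # one random genus, same phylum
    counts[g] -= moved
    counts[target] += moved
```

Note the total read count is conserved; only the within-phylum allocation changes.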

• At the class level, the genera that belong to Clostridia and Flavobacteria are mis-classified to Bacilli and Bacteroidia.

• At the order level, the genera that belong to Coriobacteriales and Flavobacteriales are mis-classified to Actinomycetales and Bacteroidales.

The same learning rate, stopping criteria, and mini-batch size (100) are used for DeepBiome, DNN, and ℓ1-DNN. Table 7 and Supplementary Table S6 show the results when using a mis-specified phylogenetic tree. As in Scenario 1, we compared DeepBiome to linear regression, penalized regression, a conventional DNN, and an ℓ1-regularized DNN. When the model is trained using data with measurement errors (case 1), the performance of DeepBiome and ℓ1-DNN drops, i.e., higher MSE and lower Pearson's ρ, compared to Scenario 1 using data without errors (Table 6; see also Table 3). DeepBiome has the best prediction performance among all methods.

Microbiome and phenotype data were downloaded from the American Gut Project (AGP) at https://github.com/biocore/American-Gut/tree/master/data. The phylogenetic tree was extracted from the .biom file that contains the OTU tables and the taxonomic information. For this analysis, we further excluded microbiome samples with fewer than 10,000 sequence reads and genera with abundance below 2% in all available samples. Samples without metadata or with missing demographic information, i.e., age, gender, and ethnicity, were also excluded. One hundred and fifty-five self-reported T2D cases were extracted from the survey, and 154 randomly selected non-T2D participants served as controls. We used the genus-level taxa (Supplementary Table S7) and the phylogenetic tree to classify T2D as well as to select the associated microbiome taxa. Table 8 shows the performance of classifying T2D based on 5-fold cross-validation. The continuous BMI is either available in the AGP metadata or calculated from self-reported height and weight (Table 9); BMI categories were classified as well (Table 10).
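The two abundance filters can be sketched with a toy count matrix (the matrix itself is hypothetical; the 10,000-read and 2% thresholds are those stated in the text):

```python
import numpy as np

# Hypothetical OTU-count matrix: rows = samples, columns = genera
counts = np.array([[12000, 300,  0],
                   [  800,  50, 10],    # fewer than 10,000 reads -> dropped
                   [15000,   0, 20]], dtype=float)

# Exclude samples with fewer than 10,000 total sequence reads
keep_samples = counts.sum(axis=1) >= 10_000
counts = counts[keep_samples]

# Exclude genera whose relative abundance is below 2% in every remaining sample
rel = counts / counts.sum(axis=1, keepdims=True)
keep_genera = (rel >= 0.02).any(axis=0)
filtered = counts[:, keep_genera]
```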
Figures

Figure 1
DeepBiome architecture. (a) A phylogenetic tree with 48 genera as tip nodes. Color represents phylum types. (b) Network layout of the DeepBiome architecture. The input layer is the genus-level microbiome abundance. Each hidden layer represents one phylogenetic level, e.g., family, order, class, and phylum. The dark lines represent relationships defined by the phylogenetic tree. The gray lines represent associations between layers. (c) Phylogenetic tree regularized weight decay. Suppose we have a simple tree as shown in the left panel, which has 6 genera (taxa 1-6) and 2 classes (taxa 7-8). Genera 1-3 belong to class 7 and genera 4-6 belong to class 8. The ancestor-descendent information is embedded into a 6 × 2 matrix. Without loss of generality, we use ω to indicate a regularization factor with a small value (e.g., 0.01). For tree-regularized weight decay, the weight estimation matrix w_{6×2} is multiplied elementwise with this phylogeny-embedded matrix Ω_{6×2}, denoted by Ω_{6×2} ⊙ w_{6×2}.

Figure 2
Simulation specifications. The outcome-associated taxa (blue and red) are specified at (a) the phylum level and (b) a mixture of phylum & order levels. The blue nodes represent "bad" taxa which result in a disease status or are negatively associated with a continuous phenotype, e.g., FEV1. The red nodes represent "good" taxa which result in a healthy status. In simulation scenario 4, we evaluate the impact of a mis-specified phylogenetic tree: (c) indicates the true phylogenetic tree used in simulation scenario 4 and (d) indicates the phylogenetic tree used in model learning (same as the tree shown in (b)).

Figure 3
Taxa selection performance under 4 simulation schemes at each phylogenetic level. Sensitivity, specificity, g-measure, and accuracy (ACC) were used to evaluate taxa selection performance. Vertical bars represent standard deviations over 1000 simulation replicates.

Figure 4
DeepBiome selected T2D-associated taxa from the phylum Proteobacteria in the American Gut Project (AGP). Estimated weights were overlaid on the Proteobacteria branch of the phylogenetic tree. One hundred and fifty-five self-reported T2D cases were extracted from the survey and 154 non-T2D participants were randomly sampled as controls. The red and blue nodes indicate taxa that have positive and negative weights, respectively. The size of the colored nodes represents the magnitudes of the weights. Black nodes represent non-selected taxa.

Figure 5
DeepBiome selected T2D-associated taxa from all phyla, overlaid on the phylogenetic tree. Microbiome and phenotype data were downloaded from the American Gut Project (AGP). One hundred and fifty-five self-reported T2D cases were extracted from the survey. A randomly selected 154 non-T2D participants were then sampled to serve as controls. The blue and red nodes indicate taxa that have negative and positive weights, respectively. The size of the colored nodes represents the magnitudes of the weights. Black nodes represent non-selected taxa.

Figure 6
Selection of FEV1% predicted-associated microbiome taxa using Lasso, Elastic-Net, ℓ1-DNN, and DeepBiome. The red nodes indicate taxa that have a positive association with FEV1% predicted and the blue nodes indicate taxa that have a negative association with FEV1% predicted.
