Subgroup Identification Using Virtual Twins for Human Microbiome Studies

Even when the same treatment is employed, some patients are cured, while others are not. The patients that are cured may have beneficial microbes in their body that can boost treatment effects, but it is vice versa for the patients that are not cured. That is, treatment effects can vary depending on the patient's microbiome. If the effects of candidate treatments are well-predicted based on the patient's microbiome, we can select a treatment that is suited to the patient's microbiome or alter the patient's microbiome to improve treatment effects. Here, I introduce a streamlined analytic method, microbiome virtual twins (MiVT), to probe for the interplay between microbiome and treatment. MiVT employs a new prediction method, distance-based machine learning (dML), to improve prediction accuracy in microbiome studies and a new significance test, bootstrap-based test for regression tree (BoRT), to test if each subgroup's treatment effect is the same with the overall treatment effect. MiVT will serve as a useful guideline in microbiome-based personalized medicine to select the therapy that is most suited to the patient's microbiome or to tune the patient's microbiome to be suited to the treatment.


Subgroup Identification Using Virtual Twins for Human Microbiome Studies Hyunwook Koh
Abstract-Even when the same treatment is employed, some patients are cured, while others are not.The patients that are cured may have beneficial microbes in their body that can boost treatment effects, but it is vice versa for the patients that are not cured.That is, treatment effects can vary depending on the patient's microbiome.If the effects of candidate treatments are well-predicted based on the patient's microbiome, we can select a treatment that is suited to the patient's microbiome or alter the patient's microbiome to improve treatment effects.Here, I introduce a streamlined analytic method, microbiome virtual twins (MiVT), to probe for the interplay between microbiome and treatment.MiVT employs a new prediction method, distance-based machine learning (dML), to improve prediction accuracy in microbiome studies and a new significance test, bootstrap-based test for regression tree (BoRT), to test if each subgroup's treatment effect is the same with the overall treatment effect.MiVT will serve as a useful guideline in microbiome-based personalized medicine to select the therapy that is most suited to the patient's microbiome or to tune the patient's microbiome to be suited to the treatment.Index Terms-Human microbiome, subgroup identification, virtual twins, cancer immunotherapy, personalized medicine, precision medicine.

I INTRODUCTION
T HE human microbiome is the entire ecosystem of all microbes that inhabit different organs (e.g., gut, mouth, nose, skin, etc) of the human body.The roles of the microbiome on human health or disease have been increasingly studied due to the recent advances in massively parallel sequencing.The key underlying channels through which the microbes can influence human health or disease have been found as immunologic or metabolic regulations and digestive processes [1], [2], [3].The microbiome industry has been rapidly growing, and microbiome-based dietary supplements (e.g., prebiotics, probiotics, dietary fiber), therapeutics (e.g., antibiotics, pharmabiotics, fecal microbiota transplant, phage therapy) and diagnostics are currently flooded.
The two major sequencing platforms for microbiome profiling are 16S rRNA-based amplicon sequencing [4], [5] and shotgun metagenomics [6].Either of these sequencing platforms can produce various types of metagenomic information, yet the type of the microbiome data on which I focus here is the typical microbiome data that are on microbial abundance and phylogenetic tree information.The data are high-dimensional including numerous microbial features, such as operational taxonomic units (OTUs) or amplicon sequence variants (ASVs), that are characterized by their relative abundances, taxonomic annotations, and phylogenetic tree relationships.The data are also sparse with excessive zeros, and highly skewed with few microbial features that occupy most of the total abundance; hence, most of the other microbial features are rare variants.The underlying etiological mechanisms can be multifactorial.That is, many microbial features can jointly influence human health or disease, especially the complex disease like cancer, diabetes, obesity, asthma, atopy, brain disorder and so forth.However, it is also likely that only few microbial features solely influence human health or disease [7].The high complexity of the microbiome data and underlying etiological mechanism makes the downstream data analysis challenging.Hence, more delicate analytic methods and protocols are needed.
Here, I especially pay attention to the research question on if the microbiome can improve (or lower) treatment effects.Even when the same medical treatment is employed, some patients are cured, while others are not.For example, the melanoma of the former U.S. president, Jimmy Carter, has been cured by the cancer immunotherapy, called Pembrolizumab, but the same treatment effect does not apply to all the patients for all different types of cancer [8], [9], [10].There are also various cancer immunotherapies, and their treatment effects can all vary.We can hypothesize here that the patients that are cured may have beneficial microbes in their body that can boost treatment effects, but it is vice versa for the patients that are not cured.
Matson et al. showed that the microbiome can improve treatment effects through randomized control trials using genetically similar mice, germ-free mice and gut microbiota transplant [11].The researchers report that the anti-carcinogenicity of the cancer immunotherapy can be doubled (or halve) depending on the microbiome the mouse had [11].This indicates that the treatment effect can be improved by altering the microbiome, and it is also the reason why the coadministration of microbiome-based dietary supplements and therapeutics along with a primary treatment like the cancer immunotherapy has been intensely studied.
However, the limitation of Matson et al. is that it was an animal (not human) microbiome study through mouse trials [11].We have different genetic traits and surrounding environments from mice or any other animals.Hence, the human microbiome should be different from any other animals' microbiome, and it is hard to apply the results from animal trials to the personalized medicine or precision medicine for the humans.The human microbiome studies are therefore deemed to be observational studies that are on the humans, to which we need analytic methods and protocols that are suited.
In this paper, I introduce a streamlined analytic method, named microbiome virtual twins (MiVT), that can predict treatment effects, and then identify subgroups by treatment effects based on microbiome composition to probe for the interplay between microbiome and treatment.MiVT is based on the subgroup identification framework, called virtual twins [12], that involves a two-step algorithm, (1) treatment effect prediction through machine learning and (2) subgroup identification using a decision tree.MiVT, however, employs a new prediction method, named distance-based machine learning (dML).dML is based on a dimension reduction technique using ecological distance measures and a series of machine learning methods, elastic net (EN) [13], random forest (RF) [14] and deep feedforward network (DFN), to improve prediction accuracy in microbiome studies.MiVT also employs a new significance test, named bootstrap-based test for regression tree (BoRT), to test if each subgroup's treatment effect is the same with or different from the overall treatment effect.Thereby, we can, for example, interpret the final results from MiVT as the patients that have Mogibacterium < 0.00511% and Akkermansia ≥ 0.0126% in their gut have significantly a higher treatment effect, 10%, compared with the overall treatment effect, 5.3% (P-value (BoRT): <.001).Such results can serve as a useful guideline in microbiome-based personalized medicine to select the best therapy among multiple candidates depending on the patient's microbiome composition or to use dietary supplements or therapeutics to tune the patient's microbiome composition to improve treatment effect.
The rest of this paper is organized as follows.In the Methods section, the methodological details of dML and BoRT are dissected.In the Results section, dML and BoRT are evaluated in silico (see Simulations), and then the use of MiVT is demonstrated in praxis through the gut microbiome study on the effects of cancer immunotherapies on melanoma patients (see Real data applications).In the Discussion section, I summarize the methodology and results, and then describe the potential extensions, implementations, limitations and promises of MiVT.

A. General Settings
Suppose that there are N patients.Let Y i be a binary cure outcome (0: uncured, 1: cured), T i be a binary treatment status (0: control, 1: treatment), X i be a vector of p microbial features (e.g., OTUs, ASVs), where p >> N, and Z i be a treatment effect for i = 1, … …, N.While I can describe the binary treatment status as 0 (control) vs. 1 (treatment), it can be more generally 0 (placebo) vs. 1 (treatment), 0 (old treatment) vs. 1 (new treatment), 0 (treatment level 1) vs. 1 (treatment level 2), and so forth.The microbial features can be correlated with each other, and they tend to be phylogenetically related in microbiome studies.The treatment and microbial features can have both main and interaction effects on the cure outcome.The stable unit treatment value assumption, known as SUTVA, underlies.That is, the cure outcome for any patient do not vary with the treatment status of other patients, and for each patient, there are no different forms of each treatment level, which result in different cure outcomes [15].
It is also assumed that the treatment has no effect on microbial features [12], which is simply satisfied if the microbiome is profiled before the treatment.That is, again, I consider the situation where the patient's microbiome is profiled first to select the therapy that is most suited to the patient's microbiome among multiple candidate therapies.

B. Step 1 (Treatment Effect Prediction)
To see which microbiome composition improves treatments effects, we first need to predict treatment effects.The treatment effect can be measured by comparing the cure rate of the patient who received the treatment (say, treatment) and the cure rate of the same patient who did not receive the treatment (say, control) as in (1).
However, the same patient cannot be assigned to both of the treatment and control groups at the same time.Therefore, one of the cure rates is always potentially missing, and thus patientlevel treatment effects Z i s are not measurable.This dilemma is called the "fundamental problem of causal inference" [15].
An alternative approach can be to employ genetically identical twins while assigning one of the twins to the treatment group and the other to the control group, and then we can compare their cure rates.For example, we can first transfer cancer cells into the twins making both of them cancer patients, and then give a medication to one of the twins and a placebo to the other, and then compare their cure rates.However, this approach can be limitedly permitted in an animal trial like the mouse trials of [11].It is, of course, unethical to conduct such trials on the humans.Again, it is hard to apply the results from animal trials to the personalized medicine or precision medicine for the humans.
Therefore, to predict treatment effects in an observational study where the human trials on twins are not permitted, I employ the method, called virtual twins [12], based on the potential outcome framework [15].The virtual twins mimic the animal trials on twins, and its key ideas are as follows.We can first predict the cure rate of the patient in the treatment group, and then predict the cure rate of "virtually" the same patient (say, virtual twin) in the control group.Then, we can finally predict the treatment effect by subtracting the former cure rate of the real patient from the latter cure rate of the virtual twin.Here, "virtual" means to switch only the data of the variable on the treatment status from 1 to 0, while "twin" means to fix all the other data on the other variables.That is, more formally, the virtual twins calculate the cure rates for both treatment and control groups Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
by flipping the treatment status on the data while fixing all the other data in a prediction model, and then the treatment effects are calculated as in (2).
is the predicted cure rate of the control.Again, the patient-level treatment effects (Z i s) in ( 1) are unknown and cannot be measured in reality because of the fundamental problem of causal inference [15].The method of virtual twins is an alternative approach to estimate the patient-level treatment effects based on the potential (not real) outcomes framework [15].
We can notice here that the success of virtual twins primarily hinges on the accuracy of the fitted prediction model f (.) in (2).The original virtual twins paper [12] suggests to use RF [14], but here I introduce dML for higher accuracy in microbiome studies.dML tailors machine learning methods, such as EN [13], RF [14] and DFN, to account for the unique features of the microbiome data, such as the high-dimensionality, sparsity and phylogenetic relationships.For this, dML first extracts the lower-dimensional representations of the microbial features (say, coordinates) through multidimensional scaling [16] based on an ecological distance measure, such as Euclidean distance (Euclidean), Jaccard dissimilarity (Jaccard) [17], Bray-Curtis dissimilarity (BC) [18], unweighted UniFrac distance (UU-niFrac) [19], generalized UniFrac distance (GUniFrac) [20] or weighted UniFrac distance (WUniFrac) [21].The coordinate matrix can be derived by the eigen-decomposition of the kernel matrix (denoted as, K (h) ) in (3).
where h is an index for a distance measure in a set of candidate measures (e.g., h ࢠ{Euclidean, Jaccard, BC, UUniFrac, GUniFrac, WUniFrac}), D (h) is the N × N pairwise distance matrix and D 2 (h) is its element-wise square matrix, I N is the N × N identity matrix, and 1 N is the N × 1 vector of 1's.Let λ (h)1 , …, λ (h)M be positive eigenvalues and q (h)1 , …, q (h)M be their corresponding eigenvectors obtained by the eigen-decomposition of the kernel matrix K (h) in (3).Then, the N × M coordinate matrix (denoted as, V (h) ) can be derived as in (4). where (h)M , and M ≤ N.Then, the coordinates (i.e., the lower-dimensional representations of the microbial features) are used as inputs in a machine learning method, such as EN [13], RF [14] or DFN, which is to relax the high-dimensionality and sparsity of the microbiome data and modulate phylogenetic tree information using phylogenetic and non-phylogenetic distance measures.
Furthermore, the performance of machine learning methods also varies depending on underlying prediction patterns.The EN fine-tunes the extent of variable selection and shrinkage in a linear model through the regularization that linearly combines the L 1 and L 2 penalties [13].Thereby, the EN can suit the prediction patterns that are linear with varying sparsity levels.The RF [14] is a bootstrap aggregation method that averages the predictions resulting from the collection of bagged decision trees built with randomly selected inputs.Thereby, the RF can suit the prediction patterns that are non-linear with varying sparsity levels.The DFN (a.k.a.multi-layer perceptron) extracts various linear combinations of inputs, and then nonlinearly maps them to the outputs through a large number of artificial neurons and hidden layers.Thereby, the DFN can suit various prediction patterns that are multifactorial or not, linear or nonlinear, and so forth.However, the DFN may require a huge sample size because of its high model complexity.Similarly, in practice, we do not know which machine learning method makes the best prediction accuracy in advance due to the varying and unknown nature of the underlying prediction patterns.
Therefore, dML employs the k-fold cross-validation (CV) to select the optimal combination of distance measure and machine learning method that results in the smallest cross-entropy of (5).
where Φ is the validation set of patients, N Φ is the number of patients in the validation set, h is an index for a distance measure in a set of candidate measures (e.g., h ࢠ{Euclidean, Jaccard, BC, UUniFrac, GUniFrac, WUniFrac}), and l is an index for a machine learning method (e.g., l ࢠ{EN, RF, DFN}.As a result, dML can robustly adapt to various prediction patterns (e.g., linear or nonlinear, sparse or dense, rare or common, phylogenetically related or independent, and so forth) through the extensive search in distance measure and machine learning method.
Let f dM L (.) denote the prediction model with the optimal combination of distance measure and machine learning method that results in the smallest cross-entropy.Then, the treatment effects are predicted as in (6).
where f dM L (Y i = 1|T i = 1, X i ) is the predicted cure rate of the treatment, and f dM L (Y i = 1|T i = 0, X i ) is the predicted cure rate of the control.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
Of course, dML is not an interpretable method.That is, the outcomes from dML cannot be interpreted and extended to other microbiome data.However, for MiVT, we do not need any interpretation in Step 1 (treatment effect prediction), but only need a high prediction accuracy.All the interpretations will be made in Step 2 (subgroup identification) in the following section.

C. Step 2 (Subgroup Identification)
To see which microbiome composition improves treatments effects, after the first step of treatment effect prediction, we need to classify patients into subgroups by treatment effects based on patients' microbiome composition.For this, the virtual twins paper [12] suggests to use a regression tree [29] that involves stratifying or segmenting the predictor space (i.e., the microbial feature space or upper-level taxonomic space in microbiome studies) into a number of simple regions by treatment effects.Thereby, we can make simple and useful interpretations using a nice graphical representation of the top-down tree structure [29].
Let R j s be distinct and non-overlapping regions and Ẑ R j s be their corresponding treatment effects estimated by the mean treatment effects for the patients in each region for j = 1, …, J.Then, if we let A be the group of all N patients, ẐA becomes the overall mean treatment effect.R j s.Ẑ R j s can be efficiently found through recursive binary partitioning [29].As a result, we can identify subgroups (R j s) and estimate their treatment effects ( Ẑ R j s) that can be compared with the overall treatment effect ( ẐA ).R j is, for example, the subgroup of patients that have Mogibacterium < 0.00511% and Akkermansia ≥ 0.0126% in their gut, and ẐR j is their treatment effect estimated as, 10%, that can be compared with the overall treatment effect ẐA of 5.3%.
However, it has all been so far about parameter estimation with no facility for hypothesis testing.The problem is that we do not know if the estimated difference between each subgroup's treatment effect and the overall treatment effect is statistically significant, which is on the null and alternative hypotheses in (7).
Thus, I introduce a significance test, BoRT, that is based on the test statistic (denoted as, U ) in (8).
The test statistic U is simply the difference in mean between the overall and subgroup treatment effects.If U is positive, the overall treatment effect is greater than the subgroup treatment effect, but it is vice versa if U is negative.A large absolute value of U tends to lend credence to H 1 .
The distribution of ẐA can be approximated using the bootstrap method [30] by random sampling with replacement of the patient-level treatment effects Ẑ i s.Let Ẑb i s be a bootstrap resample of the patient-level treatment effects Ẑ i s.Then, the bootstrap overall treatment effect (denoted as, Ẑb A ) can be calculated as in (9).
Under H 0 , all N patients in A are equally likely to belong to R j , which indicates a random relocation of the selected region (denoted as, R r j ).Hence, the null test statistic value can be calculated as in (10).
where Ẑb Ẑb i and N R j is the number of patients that belong to R j .If we repeat it many times (say, B times), B null test statistic values (U b Null for b = 1, … B) are generated.Then, a P-value is calculated as the proportion of the null test statistic values that are equal to or greater than the observed test statistic value as in (11).
where I(.) is an indicator function and U Obs is the observed test statistic value that is calculated using the original data.
To save space, I described all the details on data normalizations, model specifications and implementation procedures for the distance measures, machine learning methods and regression tree in Supplemental Material.

D. Data Normalization, Model Specification and Implementation Procedures
Here, I described all the details on data normalizations, model specifications and implementation procedures for the distance measures, machine learning methods and regression tree.
Distance measures require different data forms to be used.For example, Jaccard [17] and BC [18] are based on absolute abundances (i.e., counts).However, in microbiome data, total read count per subject dramatically varies due to technological limitations (e.g., varying sequencing depths).Hence, MiVT rarefies count data before calculating Jaccard and BC.On the contrary, MiVT calculates Euclidean based on centered logratios (CLRs) to address compositional issues [34], and UniFrac distances based on relative abundances (i.e., proportions) with no rarefaction.
There are also various model specifications and implementation procedures for the machine learning methods.The EN has two tuning parameters, λ (that controls the extent of L 1 and L 2 penalties) and α (that modulates the amount of L 1 and L 2 penalties to their linear combination).MiVT surveys a hundred equally spaced values with the largest value that makes all the coefficients to be zero and the ratio of the smallest value to the largest value as 0.0001 for λ, and 0.05, 0.10, …, 0.90, 0.95 for α [13].The RF with 1000 bagged decision trees was used, and it has one tuning parameter that is the number of randomly selected predictors for each tree..] is the rounding function).Following Foster et al. [12], I used the coordinates and the first interaction terms between the coordinates and treatment status as inputs.I repeated 10-fold CV with non-overlapping folds twice.The subgroups of patients (R j s) and their treatment effects ( Ẑ R j s) were found through recursive binary partitioning with the settings of the minimum number of observations in a node for a split to be attempted as 10, the minimum number of observation in any terminal node as 5, and the complexity parameter as 0.01 [14].

A. Simulations
Here, I demonstrate the performance of dML and BoRT through simulation experiments because their performance can vary by sample sizes and parameter values that are unknown in reality.To reflect real microbiome composition, I first estimated parameters (i.e., proportions and dispersion) of the Dirichletmultinomial distribution [31] based on 755 microbial features (that have the mean proportion > 10 −5 ) of the gut microbiome for 39 melanoma patients prior to immunotherapy in [32].Then, I randomly generated counts for 200 patients (say, small sample size) and 400 patients (say, large sample size), respectively, from the Dirichlet-multinomial distribution using the estimated parameters and total counts randomly generated from the uniform distribution from 10000 to 100000 to reflect varying total read counts.A half of the patients' outcomes (i.e., 100 patients' outcomes among 200 patients' outcomes for the small sample size, and 200 patients' outcomes among 400 patients' outcomes for the large sample size) was assigned to the test set, while the other half of the patients' outcomes was assigned to training set.I generated binary cure outcomes (Y i s) based on the logistic regression model (12).(12) where i is the patient (i = 1, … …, 200 or i = 1, … …, 400), j is the microbial feature (j = 1, … …, 755), Y i is the binary cure outcome, T i = 1 (treatment) for a half of patients, T i = 0 (placebo) for the other half of patients, X ij is the proportion, β 0 = 0.1 (main placebo effect), β 1 = 0.5 (main treatment effect), Ω is the set of microbial features that influence treatment effects, β 2 is the main effect of the microbial features and β 3 is the interaction effect (treatment effect influenced by microbial features).
I surveyed two sets of the main (β 2 ) and interaction (β 3 ) effects of the microbial features in (12) as β 2 = 0.0005 and β 3 = 0.001 for relatively small effects and β 2 = 0.001 and β 3 = 0.0015 for relatively large effects, respectively.I also surveyed Ω in (12) (i.e., the set of microbial features that influence treatment effects) using two different scenarios, respectively.First, I randomly selected 10% of the microbial features (denoted as, Ω = {randomly selected features}).Second, I partitioned microbial features into 10 clusters using the partitioning-around-medoids (PAM) algorithm [33] based on phylogenetic distances, and then randomly selected one cluster (i.e., Ω = {phylogenetically related features}).This mimics a situation when phylogenetically related microbial features jointly influence treatment effects.I repeated each scenario 300 times (that resulted in 2021 leaves (i.e., regions, R j s) in total for randomly selected features with the small sample (n = 100); 1938 leaves in total for phylogenetically related features with the small sample (n = 100); 3432 leaves in total for randomly selected features with the large sample (n = 200); 3510 leaves in total for phylogenetically related features with the large sample (n = 200); and report average estimates.
I evaluated the proposed method, dML, compared with other existing methods, EN [13], RF [14] and DFN addressing compositional issues using the centered log-ratio transformation [34], with respect to test classification error and test area under the curve (AUC).I first observed that as the sample size increases, the prediction accuracy increases for all surveyed methods but their relative ranks are equally retained [Figs. 1 and 2 for the large sample size; Fig. S1 and Fig. S2 for the small sample size]; hence, to save space, I moved the results for the small sample size [Figs.S1 and S2] to Supplemental Material.We can observe that as the main (β 2 ) and interaction (β 3 ) effects of the microbial features in (12) increase, the prediction accuracy increases for all I also evaluated BoRT with respect to type I error rate and power.The empirical type I error was calculated as the proportion of the P-values for the randomly relocated regions (R r j s) that are smaller than 0.05, and the empirical power was calculated as the proportion of the P-values for the selected regions (R j s), as they are, that are smaller than 0.05.Here, the randomly relocated region (R r j ) is the region that has the same number of patients as the originally selected region (R j ), but the patients that belong to the region were selected randomly from all the patients to mimic the null hypothesis.We can observe that the empirical type I error rates are close to 5% [Table I].Hence, BoRT is a valid significance test with the correct control of type I error rate.We can also observe that the empirical powers for the relatively large effects of β 2 = 0.001 and β 3 = 0.0015 in (12) are  greater than the empirical powers for the relatively small effects of β 2 = 0.0005 and β 3 = 0.001 in (12) for both small and large sample sizes [Table I].

B. Real Data Applications
Here, I demonstrate the use of MiVT through the gut microbiome study on the effects of cancer immunotherapies on melanoma patients in [32].The researchers collected fecal samples from metastatic melanoma patients prior to immunotherapy, and processed them via shotgun metagenomics [32].Then, the researchers processed raw sequence data using NGS-QC and NCBI BMTagger Human Contamination Screening Tool for quality controls, and then constructed feature tables, taxonomic annotations and phylogenetic tree using MetaPhlAn [35].More details on metagenomic sequencing and profiling procedures can be found in [32].
The microbiome data contain 39 metastatic melanoma patients treated by immune checkpoint inhibitors targeting the programmed cell death 1 protein (PD-1), cytotoxic T lymphocyteassociated antigen 4 (CTLA-4) or both PD-1 and CTLA-4, and 755 microbial features that have the mean proportion > 10 −5 .I dropped one patient treated by Anti-CTLA-4 only, and compared 14 patients treated by Anti-PD-1 only with 24 patients treated by both Anti-PD-1 and Anti-CTLA-4 [Table II].Of the 14 patients treated by Anti-PD-1 only (T i = 0), 10 (71.4%) were non-responders (T i = 0) and 4 (28.6%) were responders (Y i = 1), while of the 24 patients treated by both Anti-PD-1 and Anti-CTLA-4 (T i = 1), 10 (41.7%) were non-responders (Y i = 0) and 14 (58.3%) were responders (Y i = 1) [Table II].Hence, it seems more likely to be a responder if the patient is treated by both Anti-PD-1 and Anti-CTLA-4 than by Anti-PD-1 only.However, the Fisher's exact test gives a non-significant result on the association between treatment and outcome status (P-value: 0.101).I surveyed if the gut microbiome improves (or lowers) the effect of Anti-CTLA-4 over Anti-PD-1 using MiVT.I performed the treatment prediction on the feature-level, but the subgroup identification on the lower-dimensional genus and species levels, respectively, while removing unknown and unclassified genera and species in taxonomic annotation.For reference, genera and species are also better perceived by microbiome researchers than  III] and species [Fig.4

and Table IV] as follows.
On genera [Fig.3 and Table III].The melanoma patients that have the genus Mogibacterium < 0.00511% and the genus Akkermansia ≥ 0.0126% in their gut have significantly a higher effect of Anti-CTLA-4, 10%, compared with the overall effect of Anti-CTLA-4, 5.3% (P-value (BoRT): <.001) [Fig.3 and Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
For additional reference, the RF with UUniFrac was the optimal combination that resulted in the smallest CV cross-entropy of 0.656 [Table V].

IV DISCUSSION
In this paper, I introduced a streamlined analytic MiVT, that predicts treatment effects and identifies subgroups by treatment effects based on the patient's microbiome composition to survey the interplay between microbiome and treatment.As parts of MiVT, I introduced a new prediction method, dML, to improve prediction accuracy in microbiome studies and a new significance test, BoRT, to test if each subgroup's treatment effect is the same with or different from the overall treatment effect.I demonstrated in silico that dML robustly reaches a high prediction accuracy and BoRT is a valid significance test correctly controlling type I error rates.I also demonstrated the use of MiVT in praxis through the gut microbiome study on the effects of cancer immunotherapies on melanoma patients [32].This example study was equipped with the binary cure outcome, binary treatment status, microbial features and phylogenetic tree that are required to use MiVT.Moreover, the assumption that the treatment has no effect on microbial features was satisfied because the fecal samples were collected prior to immunotherapy.I performed the subgroup identification on the lower-dimensional genus and species levels for better interpretability, but any taxonomic levels can also be surveyed.
The results from MiVT can be a useful guideline in microbiome-based personalized medicine or precision medicine to select the therapy that is most suited to the patient's microbiome or to use dietary supplements or therapeutics to tune the patient's microbiome to be suited to the treatment.
However, I described MiVT only for a binary cure outcome, though in practice, there are many different types of cure outcomes, such as continuous, survival and repeated measures outcomes.Hence, further extensions of MiVT are needed to make it more practical.The candidate distance measures, machine learning methods and implementation procedures that I described were sufficient to reach the robust performance in my simulations and real data applications.However, researchers may believe that they are less sufficient, and thus, for example, they want to consider some more candidate parameter values, repeat 10-fold CV more times, and so forth.Hence, I added various user options in the R package, MiVT, for different model specifications, implementation procedures, and so forth.It would be better to make it overly sufficient than less sufficient.If it is less sufficient, MiVT may not have enough flexibility to make it robust.
One major limitation of this research is that the real microbiome data I used are of only a small sample size of 38 patients (24 patients treated by both Anti-PD-1 and Anti-CTLA-4; 14 patients treated by Anti-PD-1 only).This small sample size may affect the reliability of the reported estimates and inferences.Although the cancer immunotherapy has recently been spotlighted as a new paradigm of cancer treatment, the cost for cancer immunotherapy is currently very expensive ranging from hundred thousand to million dollars; hence, it is hard to obtain the data for a large human population cohort.This research is focused more on methodology, and MiVT can be a promising tool for future applications as we get larger and larger databases.

Fig. 1 .
Fig. 1.Empirical test classification errors (see Error (%)) and test AUC (see AUC (%)) for the relatively small effects of β 2 = 0.0005 and β 3 = 0.001 (12) for the large sample size (400 patients in total; 200 patients in the train set and 200 patients in the test set).(a) and (c) For a situation when randomly selected features influence treatment effects (i.e., Ω = {randomly selected features}.(b) and (d) For a situation when phylogenetically related microbial features influence treatment effects Ω = {i.e., phylogenetically related features}.

Fig. 2 .
Fig. 2. Empirical test classification errors (see Error (%)) and test AUC (see AUC (%)) for the relatively large effects of β 2 = 0.001 and β 3 = 0.0015 (12) for the large sample size (400 patients in total; 200 patients in the train set and 200 patients in the test set).(a) and (c) For a situation when randomly selected features influence treatment effects (i.e., Ω = {randomly selected features}.(b) and (d) For a situation when phylogenetically related microbial features influence treatment effects Ω = {i.e., phylogenetically related features}.

TABLE I EMPIRICAL
TYPE 1 ERROR RATES AND POWERS FOR BORT (UNIT: %)

TABLE II CONTINGENCY
TABLE ON THE TREATMENT AND OUTCOME STATUS

TABLE III RESULTS
OF BORT FROM MIVT ON THE MICROBIAL GENERA IN THE GUT OF MELANOMA PATIENTS THAT IMPROVE (OR LOWER) THE EFFECT OF ANTI-CTLA-4 OVER ANTI-PD-1

TABLE IV RESULTS
OF BORT FROM MIVT MICROBIAL SPECIES IN THE GUT OF MELANOMA PATIENTS THAT IMPROVE (OR LOWER) THE EFFECT OF ANTI-CTLA-4 OVER ANTI-PD-1

TABLE V CV
CROSS-ENTROPY VALUES FOR EACH COMBINATION OF DISTANCE MEASURE AND MACHINE LEARNING METHOD FROM THE REAL DATA APPLICATION OF THE GUT MICROBIOME STUDY ON THE EFFECTS OF CANCER IMMUNOTHERAPIES ON MELANOMA PATIENTS