VTwins: Identifying Robust Associations from High-dimensional Data of Limited Samples

Robust associations are strong indicators for causalities but challenging for identication from high-dimensional datasets. In examples of metagenomic research where microbiota is highly complex and variable, low concordance between studies in identifying disease-causative microbes has become the main obstacle in the eld. Here, we develop a simple method—Virtual Twins (VTwins)—for inferring robust associations, imitating the twins in genetic research. From the original groups, paired samples of distinct phenotypes but matched taxonomical proles are selected to reconstruct a “Twin” cohort, where statistical signicance is often achieved. In direct comparison to current methods by revisiting the largest meta-analysis metagenomic dataset, VTwins can 10-fold reduce the sample-size for recalling disease-associated microbes robustly across-datasets and constructing machine-learning models of the same accuracy level as pooled samples in predicting disease status. In practice, VTwins is straightforward, powerful, and versatile in handling highly variable and high-dimensional datasets, suggesting potentials in mining causalities in the Big-data Era.


Introduction
Identi cation of robust associations in observation studies is essential for causal insight, the primary goal of science. Observational data collected nowadays in the Big-data Era are often high-dimensional, highly variable, and highly correlated, from which inferring causality is still a big challenge due to the lack of effective algorithms in dealing with the random perturbations of features. For instance, in metagenomic studies, a microbiome often comprises hundreds of microbial species, a typical kind of high-dimensional data where the abundance of constituents is highly variable and highly cross-related.
Fast-growing evidence has highlighted the association of human microbiota to a broad range of disease conditions 1, 2 , which makes microbiota-targeted therapies one of those most promising ways to control them effectively 3,4 . However, identifying the target microbes contributing to disease pathogenesis is still a di cult task. The di culties are attributed mainly to the high interindividual-diversities, which often overwhelms the disease-associated differences in contemporary cross-sectional studies 5,6 . Associations of false-positives caused by confounding cofactors and false-negatives due to insu cient sample size are commonly introduced, making low concordance of disease-associated microbial features identi ed among studies 7,8 .
Several path-forward ideas have been proposed to limit confounding factors in metagenome studies, but they are either not applicable for human research 9 or cross-sectional study 10 , or hard-to-use requiring a large control cohort and as more as possible information of confounding factors to be collected 8 . Here, by virtually generating paired samples of similar taxonomic pro les but in distinct phenotypes, we have developed an easily applicable tool-Virtual Twins (VTwins), to control microbial features caused by nonrelevant cofactors and evaluate its effects and applicability to the cross-sectional metagenomic datasets.

Page 3/14
The principle of VTwins The idea of VTwins has been inspired by the success of twin studies in identifying disease-associated genetic variations 11,12 . In terms of data characteristics, monozygotic twin samples, which are almost identical genetically, provide perfect control against genetic heterogeneity for the identi cation of diseasecausal variation loci. Similarly, in the highly diverse metagenomic constituents, samples can be regarded as twins if bearing highly similar taxonomical pro les. Toward this end, based on the taxonomic pro les of samples from the original dataset, we attempt to construct a VTwins cohort composed of "twins" samples of distinct phenotypes. Phenotype-associated microbe features can then be identi ed by applying statistical tests for paired samples (such as the Wilcoxon signed-rank test). Speci cally, we rst construct a high-dimension space where the abundance of each taxon (here, we used the species level abundance) represents a dimension. All samples are then plotted into the high-dimension space according to each species' abundance, and pairwise Euclidean distances for all samples are calculated.
Finally, samples with the nearest distance but in the opposite phenotypic groups are selected to constitute a new VTwins cohort (see Methods and Fig. 1). Upon this VTwins cohort, microbe features consistently enriched in samples of one phenotype over the other and passing statistical tests are regarded as phenotype-associated or candidate causal microbes.

The accuracy of VTwins
Here, we revisited the up-to-date largest meta-analysis dataset of colorectal cancer (CRC) that includes eight study cohorts and a total of 771 samples (case: control = 387:384; the sample size of single cohort ranges 53 ~ 128, Data Table S1), for which two research groups have separately performed metaanalyses and published their works at the same time 13,14 . Using the pooled samples, a total of 13 species were both identi ed as CRC-associated microbes by the two studies, i.e., within the top 20 species of largest effect size in one study 13 and with FDR < 10 − 5 in the Wilcoxon rank-sum test in the other 14 (Fig. 2).
To test the power of VTwins, we rst reconstructed the VTwins cohorts for each of the eight datasets and then compared the signi cance of each species from the VTwins cohorts (tested by the Wilcoxon signedrank test for paired samples) with those from the original cohorts (tested by the Wilcoxon rank-sum test for group cohort). VTwins identi es 285 species on average that are signi cantly enriched either in the CRC group or the control group (FDR < 0.05, Data Table S2-S9 for each of the eight cohorts), among which ~ 140 species are highly signi cant (FDR < 0.001). All the 13 CRC-associated species exhibit signi cant differences in all the eight VTwins cohorts (except P. uenonis in the ITA1 cohort) and achieve high signi cance of FDR < 10 − 5 in most cases (Fig. 2).
In contrast, when using the original cohorts, although the13 CRC-associated species exhibited high signi cance in the U test of pooled samples, every single cohort can only recall 4.9 (range 0 ~ 11) of the 13 species on average (Fig. 2). We further applied the Boruta algorithm, the best-performing variable selection methods based on ensemble machine learning random forest for high-dimensional metagenomic datasets. Although the Boruta algorithm is much more sensitive than the U test by calculating each species' permutation importance and identi ed much more CRC-associated species, neither a single cohort could recall all the 13 species, nor a CRC-associated species was consistently recalled from all cohorts (Fig. 2). Therefore, VTwins saliently increases the statistical power and sensitivity for identifying disease-associated microbes compared to analysis with original cohorts even with machine learning.
We then ranked the 13 CRC-associated species according to their FDR values in the test of each VTwins cohort and compared them to their ranks of permutation importance inferred by Boruta algorithm from corresponding original cohorts. The results indicated the ranks of species are more stable across VTwins cohorts than their importance ranks across original cohorts, especially for the top species. A sevenspecies core panel that was proposed to represent the characteristic deviations in CRC microbiota in the previous meta-analysis 13 , including P. stomatis, F. nucleatum, Parvimonas spp., P. asaccharolytica, G. morbillorum, C. symbiosum, and P. micra, were just the top seven species with the highest signi cance in VTwins analysis ( Figure S1). The result suggests that species with higher rank signi cance in VTwins analysis are more probable disease-associated.
From the species that showed signi cant differences in VTwins analysis using the pooled sample, we select the top-ranking 30 species (FDR < 10 -75 , Data Table S10) as our selection of CRC-associated microbes. Our list covers all the 13 CRC-associated species identi ed by the previous two studies, except A.obesiensis (FDR < 10 -62 ), and the other species in our selection include ten enriched in CRC and eight enriched in controls ( Figure S2). Among the ten CRC-enriched species, Clostridium hathewayi 15 , Bacteroides fragilis 16 , and Peptostreptococcus anaerobius 17 are well-known to produce tumorigenic toxins; Clostridium bolteae 18 , Granulicatella adiacens 19 , and Ruminococcus gnavus 20, 21 are able to drive the mucosal in ammation that preludes cancer; Eggerthella lenta and Flavonifractor plautii 22 are potential in degrading various anticarcinogenic compounds such as avonoids. Meanwhile, all the eight species enriched in controls are butyrogenic species that produce the critical anti-in ammatory butyrate in colon 23 , and species Faecalibacterium prausnitzii and Eubacterium hallii also produce betaglucuronidase (B-GUS) and glycerol/diol dehydratase (GDH), respectively, to biotransform many tumorigenic dietary heterocyclic amine 24 . Therefore, most of the species in our selection based on VTwins analysis are able to mechanically contribute to promote or protect against CRC, i.e., causative microbes of the disease.
The robustness of VTwins-directed predictive model Machine learning has been widely used in evaluating the importance of each microbiota feature in diagnosis. The two previous meta-analyses of CRC have used random forest 13 and lasso 14 , respectively, to construct CRC-diagnosis classi ers with all the species abundance as input, and evaluated their performances with intra-and cross-cohort prediction validation. As random forest model exhibits a better accuracy between the two studies, we apply it to construct classi ers for direct comparison. Here, we use the abundance of the top species selected by VTwins instead of a full count of the species and compared their diagnostic values and transferability across cohorts to see if these top species are more relevant to disease than to the speci c cohorts.
In intra-cohort cross-validation, we gradually increase the number of features from 10 to 50 top-ranking species in constructing the classi ers and observe that the top 30 species deliver the best performance, i.e., an average of 0.89 ± 0.09 in the area under the receiver operating characteristic curve (AUROC) for all the eight cohorts (Fig. 3A). The performance of our models appears better than that of constructed based on all species, which is only 0.81 on average (refer to Fig. 2a in Ref 13 ). We then performed the crosscohort (study-to-study) validation for the model of each cohort, where the classi er is trained by a given dataset and test against each of the rest seven cohorts. Here we also used the 30 top features selected by VTwins instead of all species features, and the accuracy is obviously improved from the AUCs of 0.69 ± 0.10 (refer to Fig. 2a in Ref 13 ) to an average AUC of 0.78 ± 0.06 across all our study-to-study validations (Fig. 3B). The better performance in intra-cohort tests and transferability in cross-cohort tests of the VTwins-directed classi ers implies that the species features selected by VTwins are more relevant to the pathogenesis of CRC other than dataset-dependent features.
To further con rmed the predictive value of VTwins selected species, we perform the leave-one-datasetout (LODO) validations where classi ers are trained on seven datasets combined and validated on the left-out dataset, for each dataset in turn 25 . This approach, although a seven-fold increase in the training dataset, the accuracy was only marginally improved, producing AUC = 0.81 ± 0.06 for all eight datasets when 30 top-ranking species are used compared to AUCs of 0.77 ± 0.06 in cross-study validations (Fig. 3B). We also noticed that altering the number of features used had almost no effects on the AUCs in LODO tests ( Figure S3), and nearly the same as the performance using all species features, which yields 0.81 ± 0.08 AUCs (refer to the LODO analysis in Fig. 2a in Ref 13 ). We hypothesize that the classi er in the LODO tests reaches its supreme accuracy; and the probable existence of intermediate samples precludes a clear boundary between case and control samples no matter the number of features used in prediction.
However, the VTwins-directed models in study-to-study transfer, which produces AUCs of 0.78 ± 0.06, approaches the supreme accuracy level, i.e., 0.81 AUC, with only an averagely one-seventh training sets, indicating the robust associations between VTwins-selected features in every single cohort and microberelated CRC pathogenesis.
Additionally, 22 species are consistently recalled in each LODO validation. When sorted according to the rank of their importance in contributing to the model, the ranks of each species keep stable across validations (Fig. 3C). Consistent with the previous meta-analyses, very few control-associated species achieve top signi cance in VTwins analysis and rank high in the learning models (Fig. 3C), further highlighting that successful discrimination is achieved by CRC-associated rather than control-associated microbial features. These results show that, while great sample diversity embedded in the meta-cohorts, cross-validated AUCs can be high and stable for the prediction of CRC in both intra-dataset and crossstudy, indicating the robustness and transportability of species features selected by VTwins in association to CRC and the power of VTwins in diminishing batch effects that often confound metagenomic studies.

The sample size required for VTwins
As VTwins is much more sensitive in identifying disease-associated species than conventional analysis, we next tried to gure out the minimal sample size required for VTwins. First, we performed a random selection of a given number of samples (equal in the number of case and control samples) from the pooled cohorts and repeated ten times. For each selection, VTwins analysis was conducted using the selected samples, and the top 30 species with the highest signi cance were regarded as candidate disease-associated features. When we raised the sampling size gradually from 30 to 400, we observed that more of the seven species in the core panel were recalled and reached a plateau when the sample size was more than 120 (CRC : control = 60 : 60). If we counted the top 50 species, the required sample size was reduced to > 80 for the recall of the core panel (Fig. 4A). Even for the 13 CRC-associated species identi ed by the two meta-analyses, most could be recalled with randomly selected > 80 samples when the top 50 features were selected ( Figure S4).
Moreover, for each sampling, we constructed random forest models using both 30 and 50 top species selected by VTwins and trained the model with the selected samples ve times, then tested the model with all the rest samples. When the sampling size reaches 80, no matter the number of features used, an AUC ≈ 0.8 can be obtained in most validations, similar to the accuracy in our study-to-study transfer validation. Further increasing sampling size to 400, the accuracy is only marginally improved (Fig. 4B). As such, 80 samples (case : control = 40 : 40) is able to recall the core panel of CRC-associated microbes and to accurately identify disease-associated features by selecting those with top signi cance, thus met the minimal requirement in sample size for an effective VTwins analysis, i.e., about 10-fold decrease than that required for original group cohort and conventional analysis.

VTwins in functional analysis
Finally, we tested whether VTwins was valid in identifying the CRC-associated functional deviations in the microbiome, which is characterized as increased amino acid putrefaction 26 and reduced carbohydrate degradation 23 . To identify the pathogenesis-related metabolic functions instead of the hitcher pathways embedded in the genomes of CRC-associated microbes, we reconstructed another "twin" cohort based on the function-pro le distance by means of pathway abundance and identi ed 214 differential pathways with high signi cance (Wilcoxon signed-rank test, FDR < 10 − 5, Data Table S11). Thirteen of the top 30 pathways of highest signi cance (FDR < 10 -22 ) overlapped with the 20 pathways of the largest effect size (> 0. 35) in conventional LDA analysis in the previous meta-analysis 13 , and the function of these 13 common pathways are all consistent with the previous understanding of the CRC-associated metabolic deviations (Fig. 5A). The 13 pathways often exhibit signi cant differences in every single cohort in VTwins analysis but show no differences in the corresponding original cohorts (Figure S5), con rming the high sensitivity and statistical power of VTwins in functional analysis as well.
The other pathways in our selection besides the 13 common ones keep the concordance with the CRCassociated metabolism deviations, which include another pathway of L-glutamate-degradation (V) and the urea-cycle enriched in CRC samples and sucrose-degradation (III) enriched in controls (Fig. 5A,5B). Additionally, the VTwins also identi es signi cantly reduced pathway abundance in the biosynthesis of IMP (I, II, & III) (Fig. 5C) and CDP-DAG (I &II) (Fig. 5D). IMP is the precursor of the de novo biosynthesis of all kinds of purines, which process utilizes amino acids instead of catabolizing them into tumorigenic metabolites in CRC conditions, and CDP-DAG is a liponucleotide that has been found of antiproliferative activity in vitro along with its analogs 27 . The conspicuous de ciency of their biosynthesis in CRC samples provides further understanding of the deviation in CRC metabolism, which deserves further investigations in their roles in protecting against CRC; and the discovery of these pathways highlights the value of VTwins in exploring novel molecular mechanisms in the research of disease pathogenesis.

Conclusion
In conclusion, VTwins provides a straightforward approach to control the highly variable microbiota pro les in cross-sectional studies. By direct comparison to the conventional cohort-based analyses, we demonstrate the power and sensitivity of VTwins in identifying robust disease-associated microbes, where statistical power is saliently increased by transforming the original cohort into the paired VTwins cohort. By a similar twin strategy based on the functional pro les of samples, we have successfully identi ed the impairment of detoxi cation in autistic children in a former study 28 . The success of VTwins may rely on the virtual twins it provides as mutual controls for the complexity of all the dimensions, as an adequate number of controls is the premise for causal-relationship mining and is hard to be fully satis ed in the case of hundreds of dimensions. Therefore, the more dimensions the data have, the more effective VTwins becomes. Moreover, the strategy of virtual twins cohort can be applied to other high-diversity and high-dimension datasets given that pairwise distances among samples can be calculated, which will facilitate its broader application in other elds than metagenomics, including genomics, transcriptomics, epigenomics, and even other scienti c elds.

Study design and data availability
The preprocessed metagenomic data of six cross-sectional studies of CRC were directly downloaded from the curatedMetagenomicData 29 R package supplemental to the meta-analysis research by Thomas, A. M., et al 13 . The data includes information of metadata, taxonomic and functional composition, and relative abundance of each species and metabolic pathway, in which taxonomy and pathway pro ling were annotated using MetaPhlAn2 30 and HUMANn2 31 , respectively. Since the preprocessed metagenomic data for the other two validation studies used in the meta-analysis were not available, we downloaded their raw fastq les from the DNA Data Bank of Japan database (project No. DRA006684 32 ) and European Nucleotide Archive (project No. PRJEB27928 14 ), respectively. To keep the consistency with the other six cohorts, we performed raw data preprocessing strictly as described in the meta-analysis 13 , including quality and reads ltered, MetaPhlAn2 30 annotated taxonomic pro ling, and HUMANn2 31 annotated pathway pro ling.

Virtual Twins
The purpose of "Virtual Twins" is to construct a "twin" cohort from the original group cohort. To nd the "twin" samples of opposite phenotypes but have a very similar taxonomic pro le, we rst constructed a high-dimension space, in which each species represented a dimension. Then, each sample was plotted into the space according to the relative abundance of the species within it. Pairwise Euclidean distances were used to indicate the similarity between the two samples, and we calculated the total distance of a sample to its K-Nearest Neighbors of the same group (intragroup KNN), of the opposite group (intergroup KNN), and of the total cohort (KNN), where K was the square root of the number of samples in the corresponding group.
We noticed that some samples are even closer to neighbors of opposite phenotypes than those of their own, implying that samples from the opposite groups are extensively entangled in the space. To maximally represent the ensemble population, three steps were taken in the selection of samples for the "twin" cohort: 1) outlier samples that were too far from other samples (KNN > mean KNN + SD) were excluded rst to avoid including them in the "twin" cohort; 2) redundancies that were too close to neighbors of the same phenotypes (KNN < mean intragroup KNN -SD) were identi ed and removed to avoid the impact of redundant samples on statistical tests; 3) samples closer to neighbors in the opposite group than neighbors of own side group (intergroup KNN < intragroup KNN) were selected as representative samples that represent the most entangling part of the ensemble cohort, and the number of the representative samples of the two sides were often comparable. Finally, the "Twin" cohort was reconstructed with all the representative samples and their neighbors of the opposite group: one representative sample and one of its K neighbors constituted a "twin". Then redundant "twins" composed of the same two samples were eliminated. Although representative samples may be utilized for more than one time, i.e., at most K times, the representation is equal for all representative samples and comparative for the opposite groups, which will not deviate the statistics result. Therefore, the "twin" cohort was reconstructed by an equal number of samples from both sides that maximally matched with each other.
Random forest classi er and permutation importance All machine learning experiments used random forest, as this algorithm has been shown to outperform, on average, other learning tools for microbiome data. The species with top signi cance tested in the VTwins analysis were explored for the prediction value of their relative abundances.
Receiver operating characteristic (ROC) curve was used to evaluated the performance of the classi er through R packages ROCR (https://cran.r-project.org/web/packages/ROCR/index.html) and indicated by the area under the curve (AUROC). The importance of the mean decrease in the impurity of each species which indicates its contributions to the classi er was directly output from the caret R package.
To conduct a direct comparison, the intra-dataset prediction capability was measured through 20-repeat tenfold cross-validation, resulting in 200 validation folds for each cohort. Here, the createFolds function in the caret R package was used to stratify each fold to contain a balanced proportion of controls and disease samples. In the cross-study validation, datasets were considered two by two: one is used for training the model, the other to validate. The Leave-one-dataset-out (LODO) approach consists of verepeat training of the model on the pooled samples from all cohorts except the one used for model testing. We iterated along with all the cohorts, performing a LODO validation on each set of samples.
To calculate the reliable permutation feature importance, we applied the Boruta approach, the bestperforming variable selection method based on ensemble machine learning random forest for highdimensional metagenomic datasets. Boruta package in R was used to perform feature selection with 99 iterations and a con dence level cut-off of 0.01 for the p values.
Determination of the minimal required sample size of VTwins A random sampling test was conducted to determine the minimal sample size requirement of VTwins. In the test, all the samples from the eight cohorts were pooled, and for each random sampling, an equal number of cases and controls were randomly selected to constitute a random set. Using sampling R package (https://cran.r-project.org/web/packages/sampling/index.html), we conducted a raising series of sampling size as 30, 40, 60, 80, 100, 120, 140, 160, 180, 200, and 400 samples in each set. For each sampling size, 10-repeat random sampling was conducted which includes a half-half number of cases and controls.
Then we treat the random seta as a group cohort and perform VTwins analysis for it, i.e., construct a corresponding "twin" cohort for a random set and select the top-ranking species features of highest signi cance. These top features were then used to construct a random forest model and tested with all the rest samples of the entire sample pool except the testing random set.