3MCor: An Integrative Web Server for Metabolome-Microbiome-Metadata Correlation Analysis

Background: Mounting evidences have shown that microbiome and metabolome are closely linked to human health and dual-omics studies expanded our knowledge and understanding of health and life. Here, we designed and developed a full-function and easy-to-use platform, 3MCor (http://3mcor.cn/), for metabolome and microbiome correlation analysis under the instruction of phenotype and with the consideration of confounders. Results: Many traditional and newly reported correlation analysis methods were integrated for intra-and inter-correlation analysis. Three inter-correlation pipelines are provided for global, hierarchical, and pairwise analysis. Especially, the incorporated network analysis function is conducive to a rapid identi�cation of network clusters and key nodes from a complicated correlation network. Complete numerical results (csv �les) and rich �gures (pdf �les) will be generated in minutes. To our knowledge, 3MCor is the �rst platform developed speci�cally for the correlation analysis of metabolome and microbiome. Its functions were compared with corresponding modules of existing omics data analysis platforms. Results from 2 real-world data sets, one from a public library with a continuous phenotype and one from our lab with a categorical phenotype, were used to demonstrate its simple and �exible operation, comprehensive outputs, and distinctive contribution to dual-omics studies. Conclusions: 3MCor is powerful with complementary pipelines and comprehensive considerations of phenotypes, confounders, and the interactions among omics features. In addition to the web server, the backend R script is available at https://github.com/chentianlu/3MCorServer.


Background
Over the past 20 years, mounting studies have shown that microbiome and metabolome are closely linked to human health and the compositional and functional abnormalities of their alterations can lead to a variety of diseases, such as obesity, type 2 diabetes mellitus (T2DM), non-alcoholic liver disease, cardio-metabolic disease etc. [1][2][3][4][5].To better understand the underlying molecular mechanisms of hostmicrobiota co-metabolism and their impacts on phenotypes, many experiments in humans, animals, and cells were conducted and various correlation analysis methods have been applied on the association analysis of dual-omics data sets.
Currently, the most widely used correlation methods are Pearson, Spearman, Kendall, and generalized linear regression model (GLM).Sparse correlations for compositional data (SparCC) and correlation inference for compositional data through Lasso (CCLasso) were newly developed for the intra-correlation analysis between compositional microbial variables [6][7][8].Maximum information coe cient (MIC) was reported for linear/nonlinear correlation pair detection in big omics data sets [9].More recently, Pedersen et al. introduced a computational strategy for microbiome and metabolome correlation analysis with the consideration of speci ed phenotype, by using the weighted gene co-expression network analysis (WGCNA) for dimension reduction and using the leave one out method for key variable identi cation [10,11].The core idea of this strategy, dimension reduction, was in line with our previous pipeline used in a brain-gut axis study [6].Besides, there was an in-depth comparison of 6 typical correlation methods in the inter-correlation analysis of microbiome and metabolome [12].An advanced method, generalized correlation analysis for metabolome and microbiome (GRaMM), was designed [13].We have also explored the co-changes and associations of serum metabolome and gut microbiome in rats across lifespan [6,14] and in mice under antibiotic intervention [15], high-fat intervention [16], and caloric restriction [6], using traditional and newly developed methods.
The powerful analytical tools and active studies generated tremendous correlation pairs or clusters.
However, every method has its own prerequisites and the application of some methods requires programming skills.It is also challenging to identify key metabolites and microbes from huge amounts of results.These enlightened us to develop a multi-function and easy-to-use platform for researchers who are unfamiliar to cross-omics correlation analysis and computational tools.The metabolomemicrobiome-metadata correlation analysis platform (3MCor, http://3mcor.cn) is a web server for integrative correlation analysis of metabolome and microbiome under the instruction of phenotype and with the consideration of confounders.3MCor is independent to measurement platforms.Microbial data sets from amplicon sequencing (16S, 18S, ITS) or shotgun metagenomic sequencing and metabolomic data sets from mass spectrometry or NMR are all acceptable.Intra-and inter-correlation analysis are both supported.Various methods for data pretreatment, module ltering, univariate and multivariate correlation analysis, confounder effect correction, and network and topology analysis were integrated and 4 pipelines for intra-correlation and global, hierarchical, and pairwise inter-correlation correlation analysis were constructed.Comprehensive results will be generated in minutes, including the overall correlation of two omics matrixes, the correlation type, strength and signi cance between metabolites, microbes, pathways, and/or microbial functions, the topological indices and hub nodes of correlation network and rich visual gures.The resulting data les can be inputted directly into many existing omics data analysis tools.

Data importation and preprocessing
Omics and metadata importation A total of 4 data les (in .txt,.csv. or .xlsxformat), including 2 omics data les and 2 metadata les, are acceptable.At least 1 omics data le is required and the others are optional.Only intra-correlation analysis will be conducted in case there is only 1 omics data le imported.Both intra-and intercorrelation analysis can be selected when two omics data sets are imported.For the metabolome data set, both pro ling and targeted data are supported in which the rows are samples and the columns are annotated metabolites, unknown peaks, or pathway activity score.For the microbiome data set, both amplicon sequencing (16S, 18S, ITS) and metagenome data sets are supported in which the rows are samples and the columns are raw or scaled counts/functions.Both OTUs and ASVs are supported.Two metadata les (optional), one for phenotype and one for confounding factors, are acceptable.The rows are samples and the columns are confounding factors (race, age, BMI, etc.) or the phenotype (grouping variable) of the study.

Data preprocessing
Here, missing value imputation, scaling and transformation are provided for omics data preprocessing.
The KNN (k-nearest neighbors, default), mean, minimum, and QRILC (quantile regression imputation of left-censored data) [17] are provided for missing value imputation.QRILC was speci cally designed for missing data caused by lower than limit of quanti cation and is suitable for quantitative data preprocessing.The scaling is to remove unwanted variations caused by samples and/or sequencing.Besides total intensity normalization, rarefaction is also provided for microbial counts data [18,19].For transformation, log transformation is provided for metabolome data set.The centered log-ratio transformation (CLR) is the default method for microbiome data, where the data in each sample is transformed by taking the logarithm of the ratio between the count value for each part and the geometric mean count [20,21].Users can choose some or all of these steps according to the data characteristics and their study aims.

Metabolomics pathway deregulation matrix generation
Besides metabolite levels, the levels of pathway activity or deregulation may provide more direct insights to function study.Recently, the pathway-based features have been used in some personalized health studies and speci c method has been developed for pathway deregulation matrix generation [22][23][24][25].A tool which converts raw metabolite-based data matrix to the pathway-based matrix is implemented in 3MCor.The resulting matrix has the same number of samples as the metabolite-based matrix.The pathway-based features are derived from KEGG pathway database and their levels indicate the activity or deregulation degrees of the pathway features in corresponding samples.It can be inputted into 3MCor directly as a metabolome data set.

Intra-correlation analysis (1 pipeline)
Intra-correlation is the pairwise correlation analysis among variables within single omics data set.For metabolome data set, Pearson, Spearman, Kendall, Partial Spearman, Partial Pearson, Partial Kendall, MIC, GLM and GRaMM are provided.For microbiome data set, MIC, SparCC and CCLasso are provided.Among them, MIC is designed for large data sets.It can identify both linear and nonlinear pairs but only has correlation strength without positive or negative information.GLM, all the partial correlation methods, and GRaMM can handle confounders.SparCC and CCLasso were designed speci cally for two microbes of a compositional microbiome data set [26,27].It was reported that Pearson and Spearman may achieve unreasonable and even wrong results for intra-correlation analysis within microbiome data set [26,27].

Level 1 -Global correlation pipeline
This is the highest level with the aim to evaluate the overall correlation between two omics data sets.Five methods, Canonical correlation analysis (CCA) [28,29], Coinertia analysis (CIA) [30], Procrustes analysis (PA) [31], two-way orthogonal partial least squares (O2PLS) [32], sparse partial least squares (sPLS) [33] and Mantel Test [34] are provided.CCA projects multidimensional data and into variables and using linear transformation and then take the correlation coe cient of and as the correlation of and .CCA requires that variables in the dataset are linearly independent and the number of samples should not be less than the number of variables.CIA is a two-table ordination method used to measure the consistency between two datasets and can be considered as a PCA of the joint covariances ofand .PA compares the consistency of two data sets by analyzing the shape distributions of scatter plots derived from other dimension reduction and projection methods [35].O2PLS decomposes the data sets into ve parts: the joint variance of X and Y, the part of X Unique, the part of Y Unique, the residual part of X, and the residual part of Y.The Pearson correlation between the joint variance of X and Y is taken as the global correlation of the two data sets.The sPLS is similar to classical PLS while with an extra Lasso penalization for the computation of loading vectors and the Singular Value Decomposition.Thus, it is more suitable for sparse data sets.The Pearson correlation of the rst principal components derived from the Predictor matrixes is taken as the global correlation of the two data sets.The Mantel Test calculates the correlation of two unfolded variables derived from the distance matrix between data sets and .

Level 2-Hierarchical correlation pipeline
For high throughput omics data, it is time consuming to calculate correlations between every two variables.Also, the core results may be ooded in massive results.In addition, analysis without the consideration of phenotype may deviate from study aim.Hence, some hierarchical strategies with the aid of different dimension reduction methods were proposed by bioinformatician [11] and our group [6].These strategies were summarized and simpli ed in 3 steps and incorporated in 3MCor.First, users need to select a dimension reduction method to get the metabolic (Meta) and microbial (Mic) modules.Three methods are provided: WGCNA, principal component analysis (PCA), and principal co-ordinates analysis (PCoA).WGCNA is designed for the genomic data processing and is popular for cooperative affect clustering of omics data sets.The number of modules and the a liations of variables are automatically determined.PCA, a traditional method with wide applications, is to nd the most important coordinates/components by linear combinations of original variables [36].The PCoA is similar to PCA and is based on the distance matrix but not raw data set.It is widely used in microbiome data analysis.Second, phenotype (if exist) will be used for modules ltering.For a continuous phenotype, modules that are correlated to the phenotype (Spearman p<0.05) will be retained.For a categorical phenotype, Mann-Whitney or Kruskal-Wallis test will be used to screen out differential (p<0.05)modules between/among groups.Then, the correlation between the selected metabolic and microbial modules will be calculated taking the eigengene equivalents/effective abundances of modules (for WGCNA) or the rst principal component (for PCA and PCOA) as the representative variable of each module.Many univariate correlation methods mentioned above can be selected in this step.Furthermore, the importance of variables in each module are ranked by the within-module connectivity that is determined by summing its connectivity to all other variables in the given module (for WGCNA) or the loadings of each principal components (for PCA/PCoA).

Level 3-Pairwise correlation pipeline
This is the lowest level and pairwise correlation between individual variables in metabolome and microbiome will be evaluated by using univariate correlation methods mentioned above.Suggestions for method selection are same as that of intra-correlation section.Besides basic pairwise correlation analysis, network and topological analysis were incorporated in this pipeline hoping to present the overall correlation pattern by a network and dozens of characteristic indices and to highlight core nodes or links among the massive correlated pairs.Users may control the number of pairs involved by adjusting the threshold of r (default |r|>0.3)and/or p (default p<0.05) values.The node size is determined by its centrality of degree (default), betweenness or closeness.Network analysis contains 3 main parts.1) The entire network can be divided into some clusters (cooccurrence communities) via optimizing the modularity score.2) a total of 13 topological indices (Table S5), such as average nearest neighbor degree, average path length, degree centrality, degree assortativity, betweenness centrality, closeness centrality, density, transitivity, number of vertices, number of edges, modularity, diameter, and cluster counts will be calculated.3) A Zi-Pi plot will be generated to show the potential roles of nodes in the network (network hubs, module hubs, connectors, and peripherals) according to their Zi and Pi values.The Zi and Pi values are calculated from the within-cluster and among-cluster connectivities (eq 1 and eq 2 of SI) [37].
Two types of phenotypes are acceptable.For a continuous phenotype (e.g.age, BMI, glucose level), all the samples will be analyzed once.For a categorical phenotype (e.g.case vs. control), the selected analysis will conducted multiple times, using samples of each group and all groups respectively.Further results comparison contains the number of all (p<0.05) and strong (|r|>0.5 and p<0.05) correlated pairs, the distribution of r/p values, the consistency of correlated pairs (a Venn diagram), the network and topological indices, and so on.

Development environment and con guration
Its back-end is primarily based on PHP and shell and font-end interactive page is based on HTML, CSS and Javascript.All of the statistical analysis and visualizations were implemented using R in a Docker container.Major R packages are: igraph package for network analysis, gramm4R for GRaMM, ggplot2 for gures, WGCNA package for WGCNA algorithm, Lilikoi and Pathi er for pathway deregulation matrix generation [22,24].The entire system is deployed on a Google Cloud server with 32GB of RAM and eight virtual CPUs with 2.6 GHz each.The web platform has been tested with Mozilla Firefox, Chrome, Opera and Safari browsers in Windows 7/10, Linux and MacOS systems.To date, 3MCor has been running for over 6 months and was tested by dozens of human and animal data sets derived from different types of samples and detection platforms.

Tutorial documentation
To ensure that the users are able to understand all the available parameters for corresponding pipelines and results interpretation, interactive help information, a detailed user tutorial page and three demo data sets are provided on line.The backend R script is available here .

Results And Discussion
To verify the feasibility and validity of 3MCor, we compared it with existing tools, which cover the correlation analysis in omics data.Furthermore, 2 real-world data sets were used to demonstrate its simple and exible operation, comprehensive outputs, and distinctive contribution to dual-omics studies.

Function comparisons with existing tools
Here, we compared it with the correlation analysis modules of several widely used omics data analysis platforms (Table 1).3MCor is more powerful with comprehensive methods and pipelines compared with other tools.As known, MetagenoNets focus on network analysis but supports only microbiome data set, so that it cannot carry out inter-correlation analysis.MaAsLin2 has many data preprocessing and univariate correlation methods, but it does not contain network analysis function and no graphical interface is provided.Especially, these two platforms cannot conduct global and hierarchical correlation analysis and cannot identify nonlinear correlation pairs.3MCor is comparable to M2IA while with more options in most correlation analysis relevant items.Besides the tools included in Table 1, there are other integrative omics data analysis platforms such as MetaboAnalyst [38,39], MicrobiomeAnalyst [20,40], W4M [41,42], IP4M [43], Metabox [44], and so on.Although correlation analysis is only a small part of their overall framework, some basic correlation analysis (Spearman, Pearson, Kendall, SparCC, partial correlation) methods and/or network drawing functions are provided.

Case study 1-colonization and succession of gut microbiome and metabolome in newborns
Background and data Initial microbial colonization and later succession in the gut of human infants are linked to health and disease later in life.The gut microbiome and metabolome in 88 African-American newborns were investigated using fecal samples collected in the rst few days (from 2 min to 176 h after birth) of life [45].The metagenomic sequences (derived from Illumina HiSeq) were downloaded from NCBI (SRP217052).The inhouse pipeline Microbiome Automated Analysis Work ows (MAAWf, http://www.maawf.com/)[46] was used for taxonomic and functional pro les generation.The metabolome data (derived from LC-MS) were provided by the authors.The 276 microbial functions and 86 metabolites of 50 infants, were imported into 3MCor.The sampling time (hour) after birth was taken as a continuous phenotype.The mother's BMI was taken as a confounding factor.

Interpretation of output
We rst evaluated the global correlation (Level 1) of the two data sets.Consistently, moderate similarities were revealed by CCA, PA, Mantel test, PA and O2PLS.The joint PC1 scatter plot of O2PLS with highest correlation strength (r=0.519,p<0.05) was shown as Figure 2a.Then, for the hierarchical pipeline (Level 2), WGCNA was selected for dimension reduction and 11 metabolic and 15 microbial modules were generated.The correlation strengths and signi cances of modules that are signi cantly related to phenotype were shown as Figure 2b.The metabolic module Meta 5 (in red) was signi cantly and positively correlated with most microbial modules (Figures 2b-2c).The importance (loadings) of all the metabolites a liated to Meta5 were calculated (Figure 2d), and succinate and methylmalonic acid were the top 2 metabolites with highest contributions to Meta 5.The Sankey plot (Figure 2e) was a good summary of hierarchical pipeline with the selected correlation modules, the important metabolites and functions (only top 3 were shown), and their a liations.
The pairwise pipeline (Level 3) was also conducted.A series networks of different scales were generated by setting different r and p thresholds.Two typical ones with the thresholds of |r|>0.2& p<0.1 and |r|>0.55& p<0.05 were shown as Figures 2f and 2g respectively.For network 2g, 3 clusters were identi ed based on its topological structure.The phenotype had high connectivity in cluster 2 and the outstanding metabolites with high connectivity in cluster 3 were succinate and methylmalonic acid.Many topological indices were calculated and as expected, the Zi-Pi analysis (Figure 2h) indicated that these two metabolites were potential keystone nodes.There are many microbial functions closely correlated with them, such as branched amino acid biosynthesis (r=0.39,p<0.01 with succinate and r=0.38, p<0.05 with methylmalonic acid), aromatic amino acid biosynthesis (r=0.44,p<0.05 with succinate and r=0.44, p<0.01 with methylmalonic acid), fatty acid biosynthesis initiation (r=0.51,p<0.01 with succinate and r=0.51, p<0.05 with methylmalonic acid) etc.We further conducted pairwise correlation on microbial species and these two metabolites using the GRaMM method of 3MCor.The most prominent microbe correlated with them is E. coli (r=0.45,p<0.05 with succinate and r=0.48, p<0.05 with methylmalonic acid).The associations of E. coli, succinate, and aforementioned functions were highly consistent with the results of original report and many other studies [45,47,48].The methylmalonic acid was also identi ed as key node by 3MCor, implying its important roles in very early life.It is well recognized that methylmalonic acid is the upstream metabolite of succinate and they are involved in vitamin homeostasis and TCA cycle [49].
Case study 2-distinct microbiota-bile acid association pattern in type 2 diabetics Background and data Bile acid (BA) pro les are closely associated with T2DM and the metabolism of BA are regulated by gut microbiota.Data sets of this case study were from our reports on BA and T2DM [50].The study was approved by the Ethics Committee of Shanghai Jiao Tong University A liated Sixth People's Hospital.Written informed consent was obtained from all participants before recruitment.Please see the original reports for detailed information on diagnose and in/exclusion criteria, sample collection and detection, and data preprocessing.A total of 23 bile acids (quanti ed metabolome) and 50 high abundance genera (16S rRNA microbiome) derived from the fecal samples of 18 healthy controls (HC) and 29 diabetics (DM) were used here.A binary categorical variable (HC vs. DM) was taken as the phenotype and BMI was taken as a confounding factor.The entire correlation analysis was run three times automatically by 3MCor, using samples of HC, DM, and ALL (HC+DM) respectively.Some of the results were provided here and please see SI for complete parameter setting and results.

Interpretation of output
First, high overall correlations between metabolome and microbiome were observed and the correlation strength of HC (Figure 3a, r=0.68, p<0.05) is weaker than that of ALL (Figure 3b, r=0.89, p<0.05) and DM (Figure 3c, r=0.86, p<0.05) groups in PA analysis.Second, in the hierarchical pipeline, Mic 5 of HC group, Mic 6 of ALL group, and Mic 9 of DM group were signi cantly correlated with most Meta modules.Meta 2 of HC group, Meta 1 of ALL group, and Meta 3 of DM group were signi cantly correlated with most Mic modules (Figures 3d-3f).The top 2 metabolites with great contributions to each selected Meta modules included 7_ketoDCA, 7_ketoLCA, LCA, isoLCA, TBA, and CDCA (Figures 3g-3i).For microbes, g_Bacillus, g_Ruminococcus, g_Oscillospira, f_Ruminococcaceae, and f_Rikenellaceae were ranked as the tops (Figures 3j-3l).Using the pairwise inter-correlation pipeline, 3 networks and 3 Zi-Pi plots (Figures 3m-3r) were constructed using the same thresholds (p<0.1).Consistent to the results of global and hierarchical pipelines, the metabolome-microbiome correlation patterns and key nodes of HC, ALL, and DM groups were different and higher correlation strength of were observed in ALL and DM groups, compared with HC, as the numbers of nodes involved in the networks and the numbers of identi ed key nodes are both higher in ALL and DM groups.We counted the numbers of all (p<0.05) and strongly (|r|>0.5 and p<0.05) correlated pairs derived from HC, DM, and ALL groups and found that there were more strong correlation pairs in DM group although the number of total correlation pairs of DM was lower than that of ALL groups (Figure 3s).Notably, HCA (Hyocholic acid), 12-ketoCDCA, g_Streptococcus (SI Figure 24P), f_Mogibacteriaceae (SI Figure 24Q) and g_Ruminococcus were identi ed as high-contribution variables and hub nodes for DM group, in the hierarchical and pairwise pipeline respectively.Their levels in HC and DM groups were shown as Figures 3t.nor_DCA, f_Ruminococcaceae, f_Mogibacteriaceae, g_Ruminococcus, g_Anaerostipes were highlighted in both pipelines for ALL group.There was no overlapped node for HC group.Therefore, diabetes has a discernible impact on human metabolomemicrobiome correlation pattern and HCA, 12-ketoCDCA, g_Streptococcus, f_Mogibacteriaceae and g_Ruminococcus are identi ed to be key nodes associated with diabetes.Our human, animal, and cell experiments have elucidated that HCA, one of the non-12α-hydroxylated secondary BAs, can upregulated glucagon-like peptide-1 (GLP-1) production and secretion in enteroendocrine cells to effectively regulate blood glucose homeostasis, via simultaneously activating Takeda G-protein-coupled receptor 5 (TGR5) and inhibiting farnesoid X receptor (FXR) [51].The 12-ketoCDCA is also a non-12α-hydroxylated BA which is proved to be involved in the occurrence and development of DM [52].For g_Streptococcus, it has been reported that diabetes is the most frequently identi ed risk factors in invasive Streptococcus agalactiae (GBS) infections [53].Recent machine learning studies have found that f_Mogibacteriaceae is one of the markers that can effectively predict the risk of DM [54].Ruminococcus.sppwas positively associated with DM in many studies [55][56][57][58][59].

Conclusions
To our knowledge, 3MCor is the rst web server speci cally for metabolome-microbiome correlation analysis with comprehensive considerations of phenotype, confounders, and the complicated relationships of omics features.It integrates more than 20 methods relevant to metabolome/microbiome correlation analysis, including 1 pipeline for intra-correlation and 3 pipelines for inter-correlation analysis.To assess the advantage of 3MCor, function comparison has been carried out for 3MCor with other typical tools.Both linear and non-linear correlations can be detected.Continuous and categorial phenotypes are acceptable.It is independent to sample type and detection platform.In addition to basic analysis on multiple levels and/or groups, network analysis may greatly facilitate the quick identi cation of correlation patterns, sub-nets, and hub nodes from tremendous and/or inconsistent results.The jointly usage of multiple pipelines is also a good way to get more reliable results.In the two real-world applications, 3MCor captured many metabolites and microbes associated to very early life and type 2 diabetes using the hierarchical and pairwise pipelines.Some of them are highly consistent to previous reports and some of them are brand new ones with extra insights to human health and disease.These illustrated the practical values of 3MCor.Furthermore, as a user-friendly and extensible tool, the interface and operation of 3MCor is simple, and su cient guidance for method and parameter setting and results interpretation are provided, where the inputs and outputs of 3MCor are compatible to existing omics data analysis tools.
There are lots of work remains to be done.First, the current methods and pipelines are all data-driven.Some biological and chemical associations between two features, such as the minimum reaction steps, the structure similarity, the a liations to the same or related pathway/class/phylum, should also be considered.We are devoting to integrate diverse biological information from public databases (e.g.KEGG, SMPDB, HMDB, Biocyc) and literatures to the pipelines which may further improve the reliability of integrative (correlation) analysis and relieve the burden of subsequent experiment validation.This is a challenging but promising task.Second, omics data sets are of similarity.3MCor is of potential to be used for other omics (e.g.genomics, transcriptomics, and proteomics) or more than 3 omics data sets correlation analysis.Substantial function extension and performance evaluation are needed.

Declarations 1. Ethics approval and consent to participate
The data of case study from public database and this study was approved by the Committee for the Protection of Human Subjects (Internal Review Board) of the Children's Hospital of Philadelphia.The case study 2 was approved by the Ethics Committee of Shanghai Jiao Tong University A liated Sixth People's Hospital.Written informed consent was obtained from all participants before recruitment.

Consent for publication
Not applicable.

Figure 3 Visual
Figure 3