Framework of 3MCor
The 3MCor is composed of four main steps (Figure 1): (1) omics and metadata importation, (2) data preprocessing, (3) inter-correlation and intra-correlation (including global, hierarchical, and pairwise correlation pipelines) analysis, and (4) results visualization.
Data importation and preprocessing
Omics and metadata importation
A total of 4 data files (in .txt, .csv. or .xlsx format), including 2 omics data files and 2 metadata files, are acceptable. At least 1 omics data file is required and the others are optional. Only intra-correlation analysis will be conducted in case there is only 1 omics data file imported. Both intra- and inter- correlation analysis can be selected when two omics data sets are imported. For the metabolome data set, both profiling and targeted data are supported in which the rows are samples and the columns are annotated metabolites, unknown peaks, or pathway activity score. For the microbiome data set, both amplicon sequencing (16S, 18S, ITS) and metagenome data sets are supported in which the rows are samples and the columns are raw or scaled counts/functions. Both OTUs and ASVs are supported. Two metadata files (optional), one for phenotype and one for confounding factors, are acceptable. The rows are samples and the columns are confounding factors (race, age, BMI, etc.) or the phenotype (grouping variable) of the study.
Data preprocessing
Here, missing value imputation, scaling and transformation are provided for omics data preprocessing. The KNN (k-nearest neighbors, default), mean, minimum, and QRILC (quantile regression imputation of left-censored data) [17] are provided for missing value imputation. QRILC was specifically designed for missing data caused by lower than limit of quantification and is suitable for quantitative data preprocessing. The scaling is to remove unwanted variations caused by samples and/or sequencing. Besides total intensity normalization, rarefaction is also provided for microbial counts data [18, 19]. For transformation, log transformation is provided for metabolome data set. The centered log-ratio transformation (CLR) is the default method for microbiome data, where the data in each sample is transformed by taking the logarithm of the ratio between the count value for each part and the geometric mean count [20, 21]. Users can choose some or all of these steps according to the data characteristics and their study aims.
Metabolomics pathway deregulation matrix generation
Besides metabolite levels, the levels of pathway activity or deregulation may provide more direct insights to function study. Recently, the pathway-based features have been used in some personalized health studies and specific method has been developed for pathway deregulation matrix generation [22-25]. A tool which converts raw metabolite-based data matrix to the pathway-based matrix is implemented in 3MCor. The resulting matrix has the same number of samples as the metabolite-based matrix. The pathway-based features are derived from KEGG pathway database and their levels indicate the activity or deregulation degrees of the pathway features in corresponding samples. It can be inputted into 3MCor directly as a metabolome data set.
Correlation analysis
Intra-correlation analysis (1 pipeline)
Intra-correlation is the pairwise correlation analysis among variables within single omics data set. For metabolome data set, Pearson, Spearman, Kendall, Partial Spearman, Partial Pearson, Partial Kendall, MIC, GLM and GRaMM are provided. For microbiome data set, MIC, SparCC and CCLasso are provided. Among them, MIC is designed for large data sets. It can identify both linear and nonlinear pairs but only has correlation strength without positive or negative information. GLM, all the partial correlation methods, and GRaMM can handle confounders. SparCC and CCLasso were designed specifically for two microbes of a compositional microbiome data set [26, 27]. It was reported that Pearson and Spearman may achieve unreasonable and even wrong results for intra-correlation analysis within microbiome data set [26, 27].
Inter-correlation analysis (3 pipelines)
There are 3 complementary pipelines for inter-correlation analysis (Figure 1): (1) Global correlation, (2) Hierarchical correlation and (3) Pairwise correlation. Multiple pipelines can be selected to obtain comprehensive results.
- Level 1 - Global correlation pipeline
This is the highest level with the aim to evaluate the overall correlation between two omics data sets. Five methods, Canonical correlation analysis (CCA) [28, 29], Coinertia analysis (CIA) [30], Procrustes analysis (PA) [31], two-way orthogonal partial least squares (O2PLS) [32], sparse partial least squares (sPLS) [33] and Mantel Test [34] are provided. CCA projects multidimensional data and into variables and using linear transformation and then take the correlation coefficient of and as the correlation of and . CCA requires that variables in the dataset are linearly independent and the number of samples should not be less than the number of variables. CIA is a two-table ordination method used to measure the consistency between two datasets and can be considered as a PCA of the joint covariances ofand . PA compares the consistency of two data sets by analyzing the shape distributions of scatter plots derived from other dimension reduction and projection methods [35]. O2PLS decomposes the data sets into five parts: the joint variance of X and Y, the part of X Unique, the part of Y Unique, the residual part of X, and the residual part of Y. The Pearson correlation between the joint variance of X and Y is taken as the global correlation of the two data sets. The sPLS is similar to classical PLS while with an extra Lasso penalization for the computation of loading vectors and the Singular Value Decomposition. Thus, it is more suitable for sparse data sets. The Pearson correlation of the first principal components derived from the Predictor matrixes is taken as the global correlation of the two data sets. The Mantel Test calculates the correlation of two unfolded variables derived from the distance matrix between data sets and .
- Level 2- Hierarchical correlation pipeline
For high throughput omics data, it is time consuming to calculate correlations between every two variables. Also, the core results may be flooded in massive results. In addition, analysis without the consideration of phenotype may deviate from study aim. Hence, some hierarchical strategies with the aid of different dimension reduction methods were proposed by bioinformatician [11] and our group [6]. These strategies were summarized and simplified in 3 steps and incorporated in 3MCor. First, users need to select a dimension reduction method to get the metabolic (Meta) and microbial (Mic) modules. Three methods are provided: WGCNA, principal component analysis (PCA), and principal co-ordinates analysis (PCoA). WGCNA is designed for the genomic data processing and is popular for cooperative affect clustering of omics data sets. The number of modules and the affiliations of variables are automatically determined. PCA, a traditional method with wide applications, is to find the most important coordinates/components by linear combinations of original variables [36]. The PCoA is similar to PCA and is based on the distance matrix but not raw data set. It is widely used in microbiome data analysis. Second, phenotype (if exist) will be used for modules filtering. For a continuous phenotype, modules that are correlated to the phenotype (Spearman p<0.05) will be retained. For a categorical phenotype, Mann-Whitney or Kruskal-Wallis test will be used to screen out differential (p<0.05) modules between/among groups. Then, the correlation between the selected metabolic and microbial modules will be calculated taking the eigengene equivalents/effective abundances of modules (for WGCNA) or the first principal component (for PCA and PCOA) as the representative variable of each module. Many univariate correlation methods mentioned above can be selected in this step. Furthermore, the importance of variables in each module are ranked by the within-module connectivity that is determined by summing its connectivity to all other variables in the given module (for WGCNA) or the loadings of each principal components (for PCA/PCoA).
- Level 3- Pairwise correlation pipeline
This is the lowest level and pairwise correlation between individual variables in metabolome and microbiome will be evaluated by using univariate correlation methods mentioned above. Suggestions for method selection are same as that of intra-correlation section. Besides basic pairwise correlation analysis, network and topological analysis were incorporated in this pipeline hoping to present the overall correlation pattern by a network and dozens of characteristic indices and to highlight core nodes or links among the massive correlated pairs. Users may control the number of pairs involved by adjusting the threshold of r (default |r|>0.3) and/or p (default p<0.05) values. The node size is determined by its centrality of degree (default), betweenness or closeness. Network analysis contains 3 main parts. 1) The entire network can be divided into some clusters (cooccurrence communities) via optimizing the modularity score. 2) a total of 13 topological indices (Table S5), such as average nearest neighbor degree, average path length, degree centrality, degree assortativity, betweenness centrality, closeness centrality, density, transitivity, number of vertices, number of edges, modularity, diameter, and cluster counts will be calculated. 3) A Zi-Pi plot will be generated to show the potential roles of nodes in the network (network hubs, module hubs, connectors, and peripherals) according to their Zi and Pi values. The Zi and Pi values are calculated from the within-cluster and among-cluster connectivities (eq 1 and eq 2 of SI) [37].
Two types of phenotypes are acceptable. For a continuous phenotype (e.g. age, BMI, glucose level), all the samples will be analyzed once. For a categorical phenotype (e.g. case vs. control), the selected analysis will conducted multiple times, using samples of each group and all groups respectively. Further results comparison contains the number of all (p<0.05) and strong (|r|>0.5 and p<0.05) correlated pairs, the distribution of r/p values, the consistency of correlated pairs (a Venn diagram), the network and topological indices, and so on.
Development environment and configuration
Its back-end is primarily based on PHP and shell and font-end interactive page is based on HTML, CSS and Javascript. All of the statistical analysis and visualizations were implemented using R in a Docker container. Major R packages are: igraph package for network analysis, gramm4R for GRaMM, ggplot2 for figures, WGCNA package for WGCNA algorithm, Lilikoi and Pathifier for pathway deregulation matrix generation [22, 24]. The entire system is deployed on a Google Cloud server with 32GB of RAM and eight virtual CPUs with 2.6 GHz each. The web platform has been tested with Mozilla Firefox, Chrome, Opera and Safari browsers in Windows 7/10, Linux and MacOS systems. To date, 3MCor has been running for over 6 months and was tested by dozens of human and animal data sets derived from different types of samples and detection platforms.
Tutorial documentation
To ensure that the users are able to understand all the available parameters for corresponding pipelines and results interpretation, interactive help information, a detailed user tutorial page and three demo data sets are provided on line. The backend R script is available here .