DiCoExpress is a script-based tool implemented in R with a set of directories where data, scripts and results are organised. The potential users of DicoExpress should know what a linear model and a contrast to be tested in such a context are. Regarding programming skills, the users should know how to use a function in R, identifying required and optional arguments.
Selection of the statistical methods
DiCoExpress offers a validated set of methods , based on three neutral comparison studies [17,30,31]. The idea of such studies is to design and implement a framework to generate realistic benchmark datasets with known truth to make an objective and reproducible performance assessment. Comparing normalization methods, Dillies et al. [30] showed that the RLE method implemented in the package DESeq2 [11] and the TMM method implemented in the package edgeR [10] demonstrate satisfactory behaviour in the presence of highly expressed genes. Both these methods maintain a reasonable false-positive rate without loss of power. The choice of both methods was confirmed by Reddy et al. [32] and Evans et al. [33] even in experiments with slightly asymmetric differential expression or different amounts of mRNA/cell per condition. Based on these detailed evaluations, both RLE and TMM are suitable, but we decided to choose the TMM normalization as the default method and proposed RLE as an alternative for normalization due to the choice made for the differential analysis described below.
Rigaill et al. [31] made a neutral comparison study among differential gene expression methods, including negative binomial-based, generalized linear models, and linear models on transformed data. Performance analyzes based on the p-value distributions, ROC curves, and proportion of true and false-positive rates show a clear difference of behaviour between negative binomial-based methods and the others. Linear models on transformed data or generalized linear models are consequently the most adapted for the differential analysis. Among these models, as also observed in Schurch et al. [34], when the proportion of differentially expressed genes is low, the results obtained with the method implemented in the edgeR package are more satisfying. We thus chose the statistical model implemented in the edgeR package as a method of choice for differential expression data analysis. Moreover, we propose automatic writing of a large number of contrasts in order to facilitate the comparisons between the biological conditions considered in the experimental design. This automatic writing is a real advantage because, in the available R-packages, most contrasts in GLM with interactions between two factors must be handwritten and require thus an excellent understanding of the statistical modelling.
For the co-expression analysis, we preferred mixture models to correlation-based approaches. Mixture models aim at identifying an underlying structure in modelling the unknown distribution by a weighted sum of parametric distributions, each one representing a group of co-expressed genes. Gaussian mixture models were relevant for microarray data and were applied with success on several datasets [35,36]. For RNAseq data, which are discrete, Rau et al. [16] first concluded that normalized expression profiles modelled with a Poisson mixture are relevant for co-expression analysis. However, in the Poisson mixture, the dependence structure between samples is not considered and can mislead the results. To tackle this problem, they proposed then a Gaussian mixture after a transformation of the normalized expression profiles [17]. This model seems to be more suitable for RNAseq co-expression analysis by providing a proper identification of the groups of co-expressed genes because it accounts for per-cluster correlation structures among samples. For these reasons, we chose this Gaussian mixture implemented in the coseq R-package [17].
Architecture of DiCoExpress
Using these neutral comparison studies, we combined the most suitable methods for each step of a standard RNAseq analysis. The tool DiCoExpress is a directory composed of a set of subdirectories that is to be installed on a computer for analyzing RNAseq datasets (see Fig.1). The directory Data stores all the projects, and the directory Results contains a subdirectory per project with all the results of the different steps. The directory Sources contains the R functions used by DiCoExpress. Finally, the directory Template_scripts contains an R script file for each project, allowing a semi-automated data analysis where the user is guided through all the steps from normalization to co-expression analysis. With DiCoExpress, our objective is to automate the statistical analysis of gene expression according to the experimental design to ease the interpretation of the results in biological terms.
To create DiCoExpress, we use the R programming language and several R-packages from CRAN and Bioconductor [37,38]. Each step of the analysis has a dedicated function available in the directory Sources. Seven functions compose the core of DiCoExpress (Fig. 2), and they are combined in a script, stored in the directory Template_scripts for each project to specify the steps of the analysis and the parameters to use. A full description of these seven functions is available in the Reference Manual (Additional File 1).
Input files and data quality controls
To use DiCoExpress on a project, the user has to provide only two input files: one containing a count table summarizing the mapped reads for each gene, named Project_Name_COUNTS.csv, and a second one with a description of the project design according to the experimental factors, named Project_Name_TARGET.csv. If functional gene annotations are available in a file, the user has the option to upload it. This information is integrated into the result tables and can also be used to perform enrichment tests.
The (1) Load_Data_Files function allows the user to upload the Project_Name_COUNTS.csv and Project_Name_TARGET.csv files. A check is done to be sure that both files are adequately built: the samples in Project_Name_COUNTS.csv file must be organized in the same order as the rows of the Project_Name_TARGET.csv. If it is inconsistent, then the columns of the Project_Name_COUNTS.csv are reorganized according to the column of the target file. DiCoExpress performs analysis for a complete experimental design. If this condition is not verified, then an error message appears, and the script stops running. A filter option in Load_Data_Files is proposed to extract or remove a subset of samples, thus avoiding manual modifications of the expression file. The filtering rules are described according to the Project_Name_TARGET.csv (see section Results for an example).
The (2) Quality_Control function produces a PDF file containing graphical outputs before and after normalisation: histograms of the library sizes, boxplots of the counts for each sample, heatmap, and principal component analysis (PCA), as shown in Supplementary Fig. 1. This step is optional, but we advise the users to perform it to evaluate the quality of the RNAseq data before further analyses.
Differential expression analysis
The differential analysis is based on a negative binomial GLM, where the log of the gene expression is modelled by all the factors describing the experiment. When the number of observations is two times greater than the number of parameters of the model, we advise to include interaction terms between the biological factors. Such terms in the gene expression definition might reveal meaningful interactions such as genotype-environment interaction and answer in a direct way to some biological questions [39–41]. The (3) GLM_Contrasts function automatically writes a list of contrasts based on the model specified by the user. We focused on contrasts involving the biological factors, and their names are sufficiently explicit to understand the associated biological question addressed. For example, we proposed automatic writing of the difference between two modalities of a biological factor averaged on the second factor or for a given modality of the second factor. . Running this function is a prerequisite to run the differential expression analysis. The (4) DiffAnalysis_edgeR function uses edgeR R-package to estimate the parameters of the GLM and then test every contrast chosen by the user. As proposed by Rigaill et al. [31], the distribution of raw p-values of each contrast is inspected to assess the quality of the statistical modelling of the gene expression. Since the distribution of raw p-values is theoretically dominated by a uniform distribution, the fit between the statistical model and the data can be observed on these raw p-value histograms. If the raw p-value distribution is not satisfactory (see example in the section Results and Fig. 3), we advise repeating the analysis using a more stringent cut-off for the filtering step or another rule of filtering. If the raw p-value distribution remains unsatisfactory, the problem might come from a large number of parameters compared to the number of observations available to estimate them. In this case, we advise modifying the modeling of the gene expression removing, for example, the interaction term. For each contrast, a subdirectory is created to store information about all the tested genes and also about the differentially expressed genes (DEGs). The function also generates graphical outputs about the behaviour of the top list of the DEGs (Supplementary Fig. 2) . All these files are described in details in the Reference Manual (Additional File 1).
Co-expression analysis
The (5) Venn_Intersection_Union function helps the user in the interpretation of the results by comparing different DEG lists. This function also generates the union and/or the intersections of these DEG lists to perform a co-expression analysis with the (6) Coexpression_coseq function. This latter function uses coseq R-package [17] to transform the raw data into normalized expression profiles. We kept the filter function of coseq removing the genes with low mean normalized counts. Those discarded genes are assigned in Cluster 0. A co-expression analysis is performed on the remaining genes using a Gaussian mixture after an arcsin transformation of the normalized expression profiles. Practically, multidimensional Gaussian mixtures of 5-30 subpopulations with unequal proportions and general covariance matrix are estimated. The EM algorithm used to estimate the model parameters is known to be sensitive to the initialization point. Coseq uses a small-EM strategy, and in DiCoExpress, we go further to get robust results. First, mixture models with 5, 10, 15, 20, 25, and 30 subpopulations are estimated 5 times each to identify an interval for the final number of co-expressed gene clusters. A second collection of models on this interval of a subpopulation is then estimated 40 times each (per default). The best mixture model is the one minimizing the Integrated Completed Likelihood (ICL). The ICL curve is expected to be a convex function of the number of subpopulations, and we use this criterion to assess that the chosen model fits well the data. When a different behaviour of the ICL curve is observed, we advise the user to modify the dataset removing some genes that show a too flat normalized profile. For the co-expression analyzes, we recommend using a powerful calculation server. The RData object of the second loop is saved for each mixture model so, if the function is stopped, the analysis can be resumed. The RData of the selected model is also saved. Moreover, several tables and graphics are saved to check the analysis quality and to explore the co-expression results. An example is discussed in the section Results, and all the files are described in details in the Reference Manual (Additional File 1).
Enrichment analysis
Once an RNAseq statistical data analysis is complete, researchers often evaluate the coherence of the results by comparing them with biological knowledge. To this end, the (7) Enrichment function performs hypergeometric tests to find annotation terms that are specifically over or under-represented in a given list of genes with respect to a reference specified by an annotation file. The enrichment analysis can be performed both on the lists of differentially expressed genes, and the co-expressed gene clusters