RNAseq is a technique used since the pioneer studies of R Lister, RC O'Malley, J Tonti-Filippini, BD Gregory, CC Berry, AH Millar and JR Ecker [1] (Arabidopsis thaliana), U Nagalakshmi, Z Wang, K Waern, C Shou, D Raha, M Gerstein and M Snyder [2] (Saccharomyces cerevisiae), BT Wilhelm, S Marguerat, S Watt, F Schubert, V Wood, I Goodhead, CJ Penkett, J Rogers and J Bähler [3] (Schizosaccharomyces pombe), and A Mortazavi, BA Williams, K McCue, L Schaeffer and B Wold [4] (Mus musculus). This technique allows the combination of transcripts discovery and expression levels quantification in a single assay and has an unlimited dynamic range of detection compared to microarrays or RT-qPCR [5, 6].
For differential expression studies, the gene expression values must be comparable between samples, which means that count data should be normalized for sequencing depth and other biases such as transcript length, GC content and transcript coverage. Reads/Fragments per Kilobase per Million (RPKM or FPKM) and Transcripts per Million (TPM) both normalize count data by transcript length and sequencing depth [7], but they may give biased results in the presence of highly expressed genes or when a lot of the genes are expressed in only one sample [8]. This is because one differentially expressed gene shifts the sequencing effort distributed to the others and all genes appear to be differentially expressed [9-11]. Other methods such as relative log expression (DESeq2) and trimmed mean of M-values (edgeR) can work with the carry-over effect of highly expressed genes [10].
The comparison of different softwares for RNAseq analysis is a recurrent subject in the literature [12-14] and many authors argue over the benefits of using housekeeping genes or spike-in controls to scale the count data, yet the evaluation of the reference genes used for RNAseq data analysis is not as common. When using internal or external control genes, the normalization is first performed on the controls and the result is used to normalize the other genes. The use of external spike-ins is advocated for introducing little error into the read counts, allowing identification of global shifts in gene expression [15-17]. However, reports have shown mixed performances with different normalization methods [18], resulting in high false discovery rates and false positive rates [19]. These may show differences in amplification depending on the type of tissue studied or the protocol for mRNA enrichment [20].
One alternative for external spike-ins is the use of internal control genes, as it is done in qPCR studies. Typical control genes are actin, tubulin, elongation factor 1, polyubiquitin and ribosomal RNAs, though the stability of expression of several of those is dependent on the conditions studied [21]. To solve this issue, different algorithms were proposed to find genes having the most stable expression, mostly for qPCR applications, but they need a set of predefined genes of interest (RefGenes, T Hruz, O Laule , G Szabo, F Wessendorp, S Bleuler, L Oertle, P Widmayer, W Gruissem and P Zimmermann [22]) or a set of pre-selected candidate reference genes (geNorm, J Vandesompele, K De Preter, F Pattyn, B Poppe, N Van Roy, A De Paepe and F Speleman [23]; NormFinder, CL Andersen, J Ledet-Jensen and T Ørntoft [24]; BestKeeper, MW Pfaffl, A Tichopad, C Prgomet and TP Neuvians [25]). The most frequent approach is to take previously identified stably expressed genes, as done by B Zhuo, S Emerson, JH Chang and Y Di [11] this however does not ensure that the selected genes will show stable expression in the studied organism and conditions.
Here we propose a simple and fast method to identify the genes having the most stable expression for each experimental condition. Our method is aimed at differential expression studies and represents a simple way to select custom reference genes for any species or any type of experiments, so they can be used in the normalization step of differential expression analysis algorithms, and does not necessitate spike-ins. It alleviates the problem inherent to predefined reference genes, which may not be stably expressed across experimental set-ups and are applicable to a single species.