In this study, we describe a new method, q-mer analysis, that incorporates alignment information into RNA-Seq data analysis. q-mer analysis focuses not on the expression of whole genes but on oligomers, and it summarizes alignment information in vectors of higher dimensionality than those produced by count-based methods. This “dimensionality increment” was critical for characterizing samples: unsupervised subgrouping succeeded when the RNA-Seq datasets were analyzed with large q values (Fig. 3(m)–(o), Fig. 4(o)), whereas the count-based method failed to distinguish between subgroups (Fig. 3(a), Fig. 4(a); see Figure S1, Additional File 1).
The ability of q-mer analysis to accurately characterize transcriptomics samples could be helpful when identifying the biological mechanisms underlying disorders. Currently, the diagnosis of most psychiatric or neurological diseases, such as major depressive disorder, bipolar disorder, autism spectrum disorder, and attention-deficit hyperactivity disorder, depends on behavioral and symptomatic characterization. However, q-mer analysis could help to define these disorders using q-mer vectors, potentially revealing unique characteristics that could be used for diagnosis.
q-mer analysis cannot provide a clear description of the detailed biological mechanisms that explain the differences between samples and controls; however, it can offer hypotheses for further investigation by capturing differences in post-transcriptional regulation or expressed mutations. Through q-mer analysis, we detected differentially expressed candidate genes in the study of cocaine addiction (Fig. 5(c)). These genes were not highlighted in the original study because of their high FDR. Since q-mer analysis indicated that the post-transcriptional regulation of these genes was different between cases and controls, subsequent studies should quantify the protein levels of the genes in Fig. 5(c). Interestingly, most of the genes in Fig. 5(c) were specifically expressed in the brain [29] and were functionally related to neurogenesis [30–36]. In the COS study, we identified mutations as discriminators between samples from COS cases and controls. Surprisingly, all of these mutations are novel [37–40]. Although many single-nucleotide polymorphisms related to COS have been reported [37–40], the mutations we identified are worth additional follow-up studies because the genes carrying these mutations were expressed and are related to COS [37, 38, 40–42].
Importantly, since q-mer analysis does not require any experiments beyond the RNA-Seq dataset itself, scientists can re-mine existing RNA-Seq data and may find new results. For example, q-mer analysis could be applied immediately to publicly available RNA-Seq data [43–48]. However, the choice of aligner, such as bowtie [49], bowtie2 [50], BWA [51, 52], STAR [53], or HISAT2 [54, 55], must be considered: different aligners may produce different alignment data, and if the alignment data differ, then the q-mer analysis results may differ as well.
Furthermore, RNA-Seq data derived from libraries constructed using the poly-A method should be avoided because this method captures only the 3′-end of each mRNA transcript. It therefore provides no alignment information from the 5′-end and is not appropriate for q-mer analysis. In addition, we do not currently have a valid statistical method to quantify the impact of q-mer analysis. In this study, we selected the top 10 oligomers that showed the highest correlation coefficients and observed linear separation of case and control samples using PCA. Ideally, however, significant oligomers should be identified statistically. Investigation of the probability distribution of q-mer results remains to be addressed in future studies.
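The selection step described above can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes that the correlation is the Pearson coefficient between each oligomer's values across samples and a 0/1 case-control label, and that oligomers are ranked by the absolute value of that coefficient. The names `pearson`, `top_oligomers`, and the toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant feature carries no information; treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_oligomers(values_by_oligomer, labels, k=10):
    """Return the k oligomers most correlated (in absolute value) with the labels.

    values_by_oligomer: dict mapping an oligomer to its per-sample values.
    labels: per-sample labels, assumed here to be 0 (control) or 1 (case).
    """
    scores = {
        oligomer: abs(pearson(values, labels))
        for oligomer, values in values_by_oligomer.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (hypothetical): three oligomers measured in four samples.
toy_values = {
    "AAAC": [1, 2, 9, 10],
    "ACGT": [5, 5, 5, 5],
    "GGTA": [3, 1, 2, 3],
}
toy_labels = [0, 0, 1, 1]
selected = top_oligomers(toy_values, toy_labels, k=2)
```

The selected columns would then be passed to PCA. A ranking of this kind does not yield a significance level, which is why a statistical treatment is still needed.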
The dimension required to express the alignment information of RNA-Seq data was estimated to be at least 4^9 (262,144; Table 1). Furthermore, to identify differences between case samples and controls accurately, the required dimension may be 4^14 (268,435,456) or larger. Thus, the dimensionality of RNA-Seq data is potentially about 10,000 times the number of genes in H. sapiens. However, some reports state that the effective dimension of the transcriptome should be far smaller than the total number of genes it contains [56–58]. This contradiction may arise because those studies investigated only ideal situations. In actual cases or conditions that have not yet been examined, such as neurological or psychiatric disorders in H. sapiens, the transcriptome could have a large number of dimensions owing to the complexity of the brain and the etiology of these diseases.
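The dimensions quoted above follow from counting the oligomers of length q over the four nucleotides, which gives 4^q coordinates. The sketch below reproduces that arithmetic and shows one possible way to index and count q-mers in a sequence. The base-4 indexing, the function names, and the gene count of 20,000 are illustrative assumptions, not the procedure or figures used in this study.

```python
from collections import Counter

BASES = "ACGT"

# Assumed order-of-magnitude gene count for H. sapiens, for illustration only.
HYPOTHETICAL_GENE_COUNT = 20_000


def qmer_dimension(q):
    """Number of distinct oligomers of length q over A, C, G, T."""
    return len(BASES) ** q


def qmer_index(oligomer):
    """Map an oligomer to a coordinate by reading it as a base-4 number."""
    index = 0
    for base in oligomer:
        index = index * len(BASES) + BASES.index(base)
    return index


def count_qmers(sequence, q):
    """Sparse q-mer count vector of one sequence.

    Windows containing symbols other than A, C, G, T are skipped.
    """
    counts = Counter()
    for start in range(len(sequence) - q + 1):
        window = sequence[start:start + q]
        if all(base in BASES for base in window):
            counts[qmer_index(window)] += 1
    return counts


ratio_to_genes = qmer_dimension(14) / HYPOTHETICAL_GENE_COUNT
```

Storing counts sparsely matters in practice, because a dense vector with 4^14 coordinates per sample is large and most coordinates are likely to be zero.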
The high-dimensional nature of RNA-Seq data may be its advantage. Recently, scientists have attempted to describe sample conditions by combining multi-omics data to obtain additional explanatory variables. However, this approach is costly, and its results are hard to interpret. Combining RNA-Seq experiments with q-mer analysis may be sufficient to describe samples because the dimensionality of RNA-Seq data is much higher than that derived from multi-omics approaches. In the future, q-mer analysis may become the new standard in place of multi-omics methods.
We suggest that there might be a limit for gene-level characterization in understanding complicated samples such as those derived from neurological and psychiatric disorders because networks and pathways often share similarities despite different disorders in each study [27]. The number of parameters (i.e., the number of genes) is probably not sufficient to explain these disorders. In contrast, q-mer analysis provides many parameters by focusing on oligomers and can separate samples if the number of samples and the q value are large enough. Further, q-mer analysis can identify candidate genes and biological mechanisms underlying differences between samples. Therefore, we propose using q-mer analysis to increase dimensionality, to identify novel mechanistic hypotheses based on differences in oligomers between different conditions, and to study underlying biology. In conclusion, differential transcriptomics based on q-mer analysis can provide novel data for clinical studies, diagnosis, prognosis, and identification of new genetic markers for diseases.