Despite the growing number of published RNA-seq datasets, and the associated number of conclusions involving biological systems, at times depending on one or few transcripts identified as being differentially expressed, the quantification of variation present between replicates at an individual transcript level is sometimes overlooked. Given the importance of the results postulated around such individual transcripts and the growing ability to base highly informative studies around archived transcriptomics datasets, it is imperative that care is taken in understanding such variation to a greater degree than the black-box approach that many of today’s user-friendly software tools provide. In aid of this, we developed easy-to-use software to quantify intra-condition variation at an individual transcript level and, if desired, to use the output to remove reference-based transcripts associated with highly variable read counts. By applying our method, we demonstrated the effects of reducing the level of noise that standard differential expression tools are required to accommodate when determining differential expression patterns, in relation to inter and intra-study datasets derived from brain samples of dogs, wolves, and foxes. We observed an improvement in the distribution of the gene-wise dispersion estimates used by DESeq2 to determine differentially expressed genes, where the correlation between the mean of normalized counts and dispersion estimates per gene improved when removing transcripts displaying the highest levels of intra-condition variation (Fig. 3 and Supplementary Table 7 online). By removing such transcripts from the reference, prior to usage within differential expression software, gene-wise estimates of dispersion became more accurate 2. Such noise at an inter, and to a lesser extent intra, study level arises from a range of characterized and uncharacterized sources including: i) biological differences between samples such as age, sex, diet, and health; ii) in silica error involving assembly tools producing poorly understood chimeras within the reference transcriptome 30; iii) ambiguities in read mapping to such references 31; iv) normalization of count data derived from such mapped reads 32; and v) in vitro error during library preparation protocols 33,34. Each of these can affect read counts across samples that are considered within the same condition of interest. Thus, when individual transcripts are hypothesized to be part of an informative result, the variation within the data that surrounds them ought to be thoroughly explored.
Samples from fox intra-study datasets contained a higher number of transcripts associated with high amounts of intra-condition variation. This high discordance between read counts from samples allocated to the same condition has resulted in a more spread distribution of dispersion estimates. When the detection of differentially expressed genes is dependent on how far a certain point is from a curve of best fit, having points that are highly spread will affect the detection of outliers and increase the number of false positives, likely explaining the low number of outliers detected among fox data (Table 1). Following the removal of the 3% and 5% of transcripts associated with the highest levels of variation between wolves vs. dogs and aggressive vs. tame foxes, respectively, the number of differentially expressed genes had the largest increase (Fig. 3). Importantly, several genes that were over expressed in dogs and tame foxes in the non-filtered datasets, some at the top of the list, were removed after filtering (Supplementary Tables 3 to 5 online), demonstrating how dependent the final list of differentially expressed genes is on the accuracy of gene-wise dispersion estimates. Prior to filtering, when results of the differential expression analysis in our intra-study case example were compared with those of the original publication 17, 92% of the genes previously identified as being differentially expressed were in agreement. Post filtering however 7% of these were no longer differentially expressed, highlighting the need for careful understanding of variance within transcriptomic datasets. Our method has been effective in improving the distribution of dispersion estimates prior to differential expression analysis and although we have used DESeq2, its utility covers other methods for differential expression analysis, since most rely on shared information across genes for dispersion estimation e.g. edgeR 35, BBSeq 36, DSS 37, baySeq 38 and ShrinkBayes 39.
At the 3% and 5% cutoffs, amongst the 50 over expressed genes identified, across the 21 shared gene families, seven genes were shared between dogs and tame foxes (Table 2). Of these seven genes, three main functions related to brain development, neurotransmission, and immune response were identified. These functions have been repeatedly associated with behaviour selection during domestication by different approaches, such as QTL analysis 16,40,41, whole-genome sequencing 42–44, and RNA data both using microarrays and RNA-seq 12,17,45−47. Up until recently, almost no gene overlap had been observed between gene expression profiles involving pairs of domesticated and wild animals 14. However, a newly published paper performing population genomic and brain transcriptional comparisons in seven bird and mammal domesticated species has revealed a strong convergent pattern in genes implicated in neurotransmission and neuroplasticity 48. These functions are compatible with those found in our analysis. The shared gene ITGA7 belongs to a gene family that is known to play an essential role in the control of neuronal connectivity 49 and the inflammatory response 50. Other genes from this family, for example, ITGA8, have been previously observed to be over expressed in tame foxes 46, and here we also observed its over expression in dogs providing further evidence of the family’s role in tameness. Similar functions are associated with the shared genes CHRNA5 51,52 and TRIB2 53 from the cholinergic and tribbles family, respectively. Additionally, we found a shared gene involved in sensing local environmental stimuli, the MYO7A, whose mutation results in loss of hearing and vision 54. Amongst the three gene families identified as under expressed (Table 3), we found the shared gene STMND1, which deficiency in the amygdala of mice was connected to a deficiency in innate and learned fear 55, a behaviour that speculatively could also have an important role in domestication.
In summary, we have presented a software to aid in the quantification, and if desired the removal, of intra-condition variation from RNA-seq count data and demonstrated its usage in improving the distribution of gene-wise dispersion estimates in an attempt to reduce the number of false positives in differential gene expression analysis. We would like our approach to highlight that studies mixing RNA-seq data from different sources should firstly, ensure that such datasets are suitable for performing differential expression on in relation to intra-condition variation at an individual transcript level, and secondly, characterize and discuss the variation present within the data prior to drawing conclusion on individual genes observed to be differentially expressed. Finally, in our case study of wolves, dogs and foxes, we have provided further support for candidate genes involved with selection for tameness, and that appear to play a crucial role in the canine domestication.