TVScript: A Tool for Exploring the Effects of Intra-Condition Variation on the Detection of Differentially Expressed Genes

The evolution of RNA-Seq technologies yielded datasets that are of immense scientic value. Commonly, such data is generated within differential expression studies, where datasets derived from individual samples are grouped into conditions, and gene expression patterns quantied. The number of archived datasets is increasing and revisiting many at an inter-study level provides an in-depth view into transcriptome evolution. The biggest hurdle is in dealing with variation of read counts at an individual transcript level between common conditions. We present a tool, TVScript, that quanties intra-condition variation, and subsequently, removes reference-based transcripts that are associated with high levels of this. TVScript is demonstrated at inter and intra-study levels, using data from brain samples of dogs, wolves and foxes (aggressive and tame), where a marked improvement in the distribution of the gene-wise dispersion estimates, the metric utilized by the majority of differential expression tools, lowered the number of outliers detected. We provide support for seven candidate genes with potential for being involved with selection for tameness, and that appear to play a crucial role in canine domestication. We also identify several genes previously identied as being differentially expressed, but that possessed high intra-condition variation, weakening their relevance. TVScript is available at: https://sourceforge.net/projects/tvscript/.


Introduction
Developments in RNA-seq technology 1 have revolutionized transcriptomics studies by allowing for a rapid hi-resolution view of gene expression under varying conditions, compartments, and/or timepoints.
In a typical RNA-seq experiment, gene expression pro les are estimated for each sample using a metric based upon the number of reads associated with each transcript within a reference set. Expression pro les are then compared between conditions to identify differentially expressed genes 2 . A challenge arises due to sources of variation within gene expression pro les that are independent of, or partially overlapping with, the condition of interest 3 . At an intra-study level, the inclusion of biological and technical replicates can be applied to reduce the effects of such noise 4 . At times, replicates may not always be possible due to cost or di culty in obtaining samples. As an alternative, the incorporation of RNA-Seq data from the rapidly growing repertoire of published works can complement the number of effective biological replicates associated with a given condition 5 . A hurdle is in accounting for the inherent variability of the data 6,7 , which is ampli ed at an inter-study level, as there is little control over the sample environments or experimental setups. Differential expression tools generally compute a p-value for each gene that is based on the overall distribution of normalized read counts and that re ects the possibility of that gene being differentially expressed. As intra-condition variation increases, the ability to decipher differential expression patterns decreases. Several methods have been proposed for data normalization and bias removal within RNA-seq data including, EDASeq 8 , RUV2 9 , sva 3 , and PEER 10 . However, when these methods are compared, highly variable results are observed 11 and no consensus exists on the best approach to apply. In this study, we present TVScript, a tool for the identi cation and removal of transcripts associated with high levels of intra-condition read count variation. Our assumption is that, within a single condition, if a given transcript possesses large amounts of variation across normalized read count values, then an accurate expression pattern for that transcript cannot be determined relative to the condition itself, regardless of what statistical correction is applied. For such transcripts, this variation suggests that the condition-associated datasets are discordant with each other and thus, comparisons to other conditions in order to identify differentially expressed genes can only yield spurious results and, at a minimum, they should be highlighted in downstream analysis, if not completely removed. The metric TVScript uses to identify a given transcript displaying intra-condition variation, is the comparison of the variation present within the pairwise differences of normalized read counts for that transcript across datasets associated with the condition, to that obtained for each transcript of the reference within each of the two conditions being compared. Additionally, within a single condition, the range of these values is an intuitive indication of the compatibility of the data for usage within a differential expression analysis.
We demonstrate the usefulness of TVScript at an inter-study level using multiple published RNA-seq datasets from brain samples of dogs and wolves, and at an intra-study level to brain samples of foxes from domestication experiments. Domestic dogs present marked behaviour differences from wolves, their wild ancestors, due to the evolution of unique social cognitive capabilities [12][13][14] , whilst a lineage of tame red foxes has been recently discovered to have originated in fur farms in Canada 15 , which resulted from deliberated selection against fear and aggression over several generations of cross-breeding 16-18 . We used TVScript, in conjunction with DESeq2, to explore patterns of gene expression between i) wolves and dogs and ii) aggressive and tame foxes and highlight genes associated with behavioural traits involved in both domestication events. DESeq2 has been previously demonstrated to be consistent in identifying differentially expressed genes by estimating gene-wise dispersions and shrinking these estimates to generate more accurate estimations of dispersions to model the counts 19 . The results of our analysis provides support for seven candidate genes with potential for being involved with selection for tameness, and that appear to play a crucial role in canine domestication. Additionally, we identify several genes that were previously identi ed as being differentially expressed, but that possessed high intra-condition variation, thus highlighting the need to discuss such variation when presenting the results of differential expression experiments. The software, along with usage details and sample data, is available at: https://sourceforge.net/projects/tvscript/.

RNA-seq datasets, mapping and conditions
Available RNA-seq datasets from the brain of dogs, wolves, and foxes were downloaded from the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EMBL-EBI) (see Supplementary Table S1 23 . The percentage of reads mapped for each sample was calculated and used as an indicator of mapping success. For each sample, read counts for each transcript were obtained using BBMap 24 . These count les were then grouped across two contrasts, each with two conditions, wolves vs. dogs (wolves n = 6 and dogs n = 10) and aggressive vs. tame foxes (n = 12 for each condition). Read counts from technical replicates of "Dog_8" and "Dog_9" were averaged and merged into one le, while read counts from the two biological replicates of "Dog_7" were treated separately.

Software
The steps that TVScript implements to identify transcripts associated high levels of intra-condition variation are: (i) Each input dataset, containing count values representing a particular sample, is allocated to either condition A or B, as indicated within the con guration le; (ii) Counts are normalized by dividing them by the length of the reference transcript they are associated with and by the sum of all counts within the le for a given sample; (iii) For each reference transcript (t), the absolute pairwise differences between normalized read counts across all samples within condition A are calculated; (iv) The corresponding variance (σ At ) within these pairwise distances is then also calculated; (v) Steps (iii) and (iv) are repeated for condition B to obtain values for each σ Bt ; (vi) Variance scores from each condition are placed in ascending order to associate them with corresponding percentiles; (vii) Reference transcripts are removed based on the variance associated with these percentiles, provided as user input within the con guration le. Note: The user can obtain these variance values by pre-running software on the input datasets with the − 1 parameter described in the readme le. The software will then output the full intracondition pairwise distance variance distribution and the associated percentile for each value; (viii) Raw read counts associated with the remaining transcripts are then outputted into separate les that correspond to each input dataset. The names of output les are speci ed in the con guration le. These modi ed count les are the subsequent input les for the differential expression tool DESeq2.

Filtering transcripts and differential gene expression
For each contrast, TVScript was run using variance thresholds ranging from the 70th to the 90th percentiles (in steps of ve), and from the 91st to the 99th (in steps of one). Steps of one were used in the latter in order to explore this range containing the minority transcripts associated with the highest levels of intra-condition variation in more detail. For each level, only read counts associated with transcripts that passed the lter level speci ed on the con guration le were maintained. These were outputted into individual modi ed read count les, one corresponding to each of the original inputs. DESeq2 was then used to perform differential gene expression analysis using these count les as input 2 . The gene-wise dispersions estimations calculated by DESeq2 are inversely related to the mean since lower mean counts are affected by variation to a higher degree. For each contrast involving the non-ltered and each level of ltered data, estimates of dispersion were calculated and used in linear regression analysis in relation to the mean of normalized count values using the R statistical programming package 25 . The number of differentially expressed transcripts between conditions of each contrast (wolves vs. dogs and aggressive vs. tame foxes) was assessed prior to and post-ltering increments using a p-adj < 0.05 threshold for signi cance. A principal component analysis (PCA) was performed, using the plotPCA function from DESeq2 with non-ltered normalized count values, to visualize the overall effects of experimental covariates.

Gene annotation and gene family analysis
Independently for each contrast, differentially expressed transcripts obtained using the non-ltered and ltered datasets were annotated to the correspondent gene ID, using the R package BioMart 26 against the Ensembl Gene database (version 94). Over expressed genes in dogs and tame foxes were classi ed into gene families. Families containing genes from both dogs and tame foxes were selected for further analysis given their potential for being involved in the evolution of tame behaviour. Genes within gene families were grouped according to whether they were unique to either dogs or tame foxes or shared between both. Under expressed genes were treated in the same manner. For a given contrast, a gene family was only maintained if all the associated genes agreed in relation to their direction of differential expression.

Mapping
Mapping of the 44 samples against the dog reference transcriptome revealed a success of 60% and 58% for dogs and wolves, respectively (see Supplementary Figure S1 online), as expected due to their recent divergence (~ 27 kya) 27 . Similar portions of reads failing to map (~ 40%) have been previously reported for dog brain samples 20 and are most likely due to (i) novel genes; (ii) regions that are not translated despite being transcribed; (iii) contamination with genomic DNA; and (iv) uncharacterized chimeras within reference resulting from assembly errors 28 . For the fox datasets, an average of 50% of reads mapped to the dog reference transcriptome (see Supplementary Figure S1 online). The lower percentage of mapping for foxes was expected due to an increased genetic divergence to dogs (~ 10 mya) 29 together with the other aforementioned factors.

Software
TVScript has been written in the Java programming language and runs on all operating systems with installed Java Runtime Environment 8.0 or higher. The input for the software is a set of count les that are each associated with a speci c sample. Within each count le, counts represent the number of reads that map to each transcript of the reference set. Along with these count les, a con guration le, describing how samples should be allocated into one of two conditions, is required. The output is a corresponding set of count les, but with the counts associated with the most variable transcripts removed across all, and that can be directly used by differential expression analysis tools such as DESeq2 2 with no further modi cation. The software, along with source code usage details and sample data, is available at: https://sourceforge.net/projects/tvscript/.

Intra-condition variation
Across all transcripts prior to ltering with TVScript, the mean intra-condition variation observed between wolves and dogs was not signi cantly different (Wilcoxon-test, p-value < 0.198, Fig. 1a and b), while between aggressive and tame foxes a signi cant difference was observed (Wilcoxon-test, p-value < 2.2e − 16 , Fig. 1c). In the latter, tame fox samples exhibited a higher number of transcripts associated with increased variability, likely due to ve samples seen to differentiate from the remaining in the PCA plot of normalized count values (axis PC1 explained 80% of the variance; Fig. 1d). Using TVScript and based on the combined variance distribution for wolves and dogs, and separately, for aggressive and tame fox samples, transcripts were removed from the reference according to a series of threshold values ( Fig. 2a and b, and Supplementary Table S2 online). Initially, for wolves and dogs, 184 transcripts associated with the 99th percentile of variance and above were removed, while for the aggressive vs. tame foxes, 235 transcripts were removed. Overall, the number of transcripts removed was higher among intra-study samples compared to samples combined from different studies, suggesting higher discordance between fox samples.

Differential gene expression analysis
Prior to ltering, differential expression analysis yielded 430 differentially expressed genes between wolves and dogs. Of those, 259 were over expressed in dogs while 171 were under expressed (Table 1). Between aggressive and tame foxes, 651 differentially expressed genes were observed, of which, 532 and 119 were over and under expressed, respectively. Post ltering, within the rst ten steps of size one from the 99th to the 90th percentiles, the number of differentially expressed genes identi ed, peaks at the 97th (n = 430; over = 255, under = 175) and the 95th percentiles (n = 730; over = 607, under = 123) in dogs and tame foxes (Fig. 3), respectively. These peaks suggest that, in the case of our datasets, the removal of the 3% (n = 854) and 5% (n = 1940) of transcripts associated with the highest levels of intra-condition variation optimizes the detection of differentially expressed genes. These ltered datasets were selected as inputs for the gene annotation step. Following annotation, 49 over expressed genes in dogs (6 genes) and tame foxes (43 genes) within the non-ltered datasets (Supplementary Tables S3 and S4 online) were absent from the ltered datasets at 3% and 5% thresholds (Supplementary Tables S5 and S6  online). Similarly, seven under expressed genes present within the non-ltered data were absent from each of the under expressed sets in dogs and tame foxes within the ltered data. Table 1 The number of the total differentially expressed transcripts (DETs) and the correspondent over (OE) and under (UE) expressed transcripts obtained for each contrast, wolves vs. dogs and aggressive vs. tame foxes, prior to (NF) and post-ltering steps, using DESeq2 (p < 0.05). The regression analysis between the dispersion estimates over the mean of normalized counts, from the non-ltered data, revealed a better tting for the contrast wolves vs. dogs (Fig. 4a), that displayed a high correlation (r 2 > 0.7) and a low deviation of the residuals (root mean square error -RMSE) around the line of best t. By removing only 1% of the transcripts with high intra-condition variation, the correlation between both variables improved, the RMSE decreased and the number of outliers, recognize by DESeq2 as the points with extremely high dispersion values that cannot be shrunk towards the t curve, has decreased (Supplementary Figure S2 and Table S7 online). Removal of the top 10% of variable transcripts led to an increase of the r 2 to 0.82, to less 109 outliers, and to a decrease in the number of transcripts with over-dispersion (variance > mean) (Fig. 4c). For the fox contrast, the linear regression did not t well (r 2 = 0.49), due to an elevated number of transcripts being dispersed (Fig. 4b). In this case, the shrinkage was more extensive (Supplementary Figure S2 online). Nevertheless, a similar increase in the r 2 , and decrease in the RMSE and the number of outliers, was observed after removing the 10% of transcripts associated with intra-condition variation (Fig. 4d, also Supplementary Figure S2 and Table S7 online).

Genes and gene families
Between the ltered data at 3% and 5% thresholds, 21 gene families, containing 50 genes, were observed to be simultaneously over expressed within dogs and tame foxes ( Table 2). Of these 50 genes, 19 were exclusive to dogs while 24 were exclusive to tame foxes. The remaining seven genes (RGR, CHRNA5, SQLE, ARHGAP25, ITGA7, MYO7A and TRIB2), belonging to seven different families, were common to both dogs and tame foxes. Additionally, three gene families, containing four genes, were found to be simultaneously under expressed ( Table 3). Two of these genes (STMND1 and OASL) were shared between dogs and tame foxes while the other two were unique to each condition. The same analysis performed on the non-ltered datasets revealed similar results (supplementary Table S8 online), however, the RGR gene family, which included a shared gene between dogs and tame foxes, was lost. Table 2 List of the gene families that were simultaneously over expressed in dogs (DG) and tame foxes (TF) between the ltered datasets. The number and the name of the genes that composed each family and the species they are present (shared or exclusively to dogs/tame foxes) are presented with the corresponding value of log2Fold change in brackets. When more than one variant for a speci c gene was present, all the log2FC values were reported.  Table 3 List of the gene families that were simultaneously under expressed in dogs (DG) and in tame foxes (TF) between the ltered datasets. The number and the name of the genes that composed each family and the species they are present (shared or exclusively to dogs/tame foxes) are presented with the corresponding value of log2Fold change in brackets. When more than one variant for a speci c gene was present, all the log2FC values were reported.

Discussion
Despite the growing number of published RNA-seq datasets, and the associated number of conclusions involving biological systems, at times depending on one or few transcripts identi ed as being differentially expressed, the quanti cation of variation present between replicates at an individual transcript level is sometimes overlooked. Given the importance of the results postulated around such individual transcripts and the growing ability to base highly informative studies around archived transcriptomics datasets, it is imperative that care is taken in understanding such variation to a greater degree than the black-box approach that many of today's user-friendly software tools provide. In aid of this, we developed easy-to-use software to quantify intra-condition variation at an individual transcript level and, if desired, to use the output to remove reference-based transcripts associated with highly variable read counts. By applying our method, we demonstrated the effects of reducing the level of noise that standard differential expression tools are required to accommodate when determining differential expression patterns, in relation to inter and intra-study datasets derived from brain samples of dogs, wolves, and foxes. We observed an improvement in the distribution of the gene-wise dispersion estimates used by DESeq2 to determine differentially expressed genes, where the correlation between the mean of normalized counts and dispersion estimates per gene improved when removing transcripts displaying the highest levels of intra-condition variation ( Fig. 3 and Supplementary Table 7 online). By removing such transcripts from the reference, prior to usage within differential expression software, gene-wise estimates of dispersion became more accurate 2 . Such noise at an inter, and to a lesser extent intra, study level arises from a range of characterized and uncharacterized sources including: i) biological differences between samples such as age, sex, diet, and health; ii) in silica error involving assembly tools producing poorly understood chimeras within the reference transcriptome 30 ; iii) ambiguities in read mapping to such references 31 ; iv) normalization of count data derived from such mapped reads 32 ; and v) in vitro error during library preparation protocols 33,34 . Each of these can affect read counts across samples that are considered within the same condition of interest. Thus, when individual transcripts are hypothesized to be part of an informative result, the variation within the data that surrounds them ought to be thoroughly explored.
Samples from fox intra-study datasets contained a higher number of transcripts associated with high amounts of intra-condition variation. This high discordance between read counts from samples allocated to the same condition has resulted in a more spread distribution of dispersion estimates. When the detection of differentially expressed genes is dependent on how far a certain point is from a curve of best t, having points that are highly spread will affect the detection of outliers and increase the number of false positives, likely explaining the low number of outliers detected among fox data (Table 1). Following the removal of the 3% and 5% of transcripts associated with the highest levels of variation between wolves vs. dogs and aggressive vs. tame foxes, respectively, the number of differentially expressed genes had the largest increase (Fig. 3). Importantly, several genes that were over expressed in dogs and tame foxes in the non-ltered datasets, some at the top of the list, were removed after ltering (Supplementary  Tables 3 to 5 online), demonstrating how dependent the nal list of differentially expressed genes is on the accuracy of gene-wise dispersion estimates. Prior to ltering, when results of the differential expression analysis in our intra-study case example were compared with those of the original publication 17 , 92% of the genes previously identi ed as being differentially expressed were in agreement. Post ltering however 7% of these were no longer differentially expressed, highlighting the need for careful understanding of variance within transcriptomic datasets. Our method has been effective in improving the distribution of dispersion estimates prior to differential expression analysis and although we have used DESeq2, its utility covers other methods for differential expression analysis, since most rely on shared information across genes for dispersion estimation e.g. edgeR 35 , BBSeq 36 , DSS 37 , baySeq 38 and ShrinkBayes 39 .
At the 3% and 5% cutoffs, amongst the 50 over expressed genes identi ed, across the 21 shared gene families, seven genes were shared between dogs and tame foxes ( Table 2). Of these seven genes, three main functions related to brain development, neurotransmission, and immune response were identi ed.
These functions have been repeatedly associated with behaviour selection during domestication by different approaches, such as QTL analysis 16,40,41 , whole-genome sequencing [42][43][44] , and RNA data both using microarrays and RNA-seq 12,17,45−47 . Up until recently, almost no gene overlap had been observed between gene expression pro les involving pairs of domesticated and wild animals 14 . However, a newly published paper performing population genomic and brain transcriptional comparisons in seven bird and mammal domesticated species has revealed a strong convergent pattern in genes implicated in neurotransmission and neuroplasticity 48 . These functions are compatible with those found in our analysis. The shared gene ITGA7 belongs to a gene family that is known to play an essential role in the control of neuronal connectivity 49 and the in ammatory response 50 . Other genes from this family, for example, ITGA8, have been previously observed to be over expressed in tame foxes 46 , and here we also observed its over expression in dogs providing further evidence of the family's role in tameness. Similar functions are associated with the shared genes CHRNA5 51,52 and TRIB2 53 from the cholinergic and tribbles family, respectively. Additionally, we found a shared gene involved in sensing local environmental stimuli, the MYO7A, whose mutation results in loss of hearing and vision 54 . Amongst the three gene families identi ed as under expressed (Table 3), we found the shared gene STMND1, which de ciency in the amygdala of mice was connected to a de ciency in innate and learned fear 55 , a behaviour that speculatively could also have an important role in domestication.
In summary, we have presented a software to aid in the quanti cation, and if desired the removal, of intracondition variation from RNA-seq count data and demonstrated its usage in improving the distribution of gene-wise dispersion estimates in an attempt to reduce the number of false positives in differential gene expression analysis. We would like our approach to highlight that studies mixing RNA-seq data from different sources should rstly, ensure that such datasets are suitable for performing differential expression on in relation to intra-condition variation at an individual transcript level, and secondly, characterize and discuss the variation present within the data prior to drawing conclusion on individual genes observed to be differentially expressed. Finally, in our case study of wolves, dogs and foxes, we have provided further support for candidate genes involved with selection for tameness, and that appear to play a crucial role in the canine domestication. Figure 1 Percentile range of intra-condition variation scores (x-axis) observed prior to ltering for wolves and dogs   The number of differentially expressed transcripts (y-axis) identi ed using non-ltered and ltered datasets based on the rst 10 percentiles of the variance distribution in dogs (a) and in tame foxes (b). Over and under expressed genes are represented by red and green dots, respectively, and gray arrows represent the selected threshold for each contrast. Plots of nal dispersion estimates for wolves vs. dogs (a and c) and aggressive vs. tame foxes (b and d) calculated using DESeq2 for the non-ltered (NF, a and b) and 10% ltered (90th, c and d) datasets. Each black dot represents one transcript and red dots represent outliers. Values for the number of outliers and