Taming of the wild: a new method for cross study RNA-seq analysis

In the last decades, the evolution of RNA-Seq has yielded archived datasets that possess the potential for providing unprecedented inter-study insight into transcriptome evolution, once background noise has been reduced. Here we present a method to quantify intra-condition variation and to remove reference-based transcripts associated with highly variable read counts, prior to differential expression analysis. The method utilizes variation within pairwise distances between normalized read counts for each transcript across all included samples of a given condition. As a case study, we demonstrate our approach at an inter and intra-study level using RNA-seq data from brain samples of dogs, wolves, and two strains of fox (aggressive and tame) prior to performing differential expression analysis to identify common genes associated with tame behaviour.


Background
The advance of RNA-seq technology [1] has revolutionized gene expression analysis by allowing a rapid hi-resolution view of gene expression under varying conditions, compartments or timepoints. In a typical RNA-seq experiment, gene expression pro les are estimated for each sample using a metric based upon the number of reads associated with each transcript within a reference transcriptome. Expression pro les are compared to identify differentially expressed genes (DEGs) [2]. A challenge arises due to sources of variation within gene expression pro les that are independent of, or partially overlapping with, the condition of interest [3]. At an intra-study level, the inclusion of biological and technical replicates can be applied to reduce the effects of such noise [4].
At times, replicates may not always be possible due to cost or di culty in obtaining samples. As an alternative, the incorporation of RNA-seq data from the rapidly growing repertoire of published works could complement the number of effective biological replicates associated with a given condition [5]. A hurdle is in accounting for the inherent variability of the data [6], which is ampli ed at an inter-study level as there is little control over the sample environments or experimental setups. For a given experiment, between two conditions, differential expression tools generally compute a p-value for each gene, based on the overall distribution of normalized read counts, that re ect the likelihood of that gene being differentially expressed. As intra-condition variation increases, the ability to decipher differential expression patterns decreases. Several methods have been proposed for data normalization and bias removal on RNA-seq data including, EDASeq [7], RUV2 [8], sva [3] and PEER [9]. However, when these methods are compared, highly variable results are observed, with the number of false positives of DEGs being increased [10]. No consensus exists on the best approach to apply.
Here, we provide a simple method for removing transcripts associated with high levels of intra-condition read count variation and thus results in a subset of the data that contains reduced noise. The method is to be applied to data prior to differential gene expression analysis, following the grouping of datasets into conditions of interest. We assume that, within a single condition, if a given transcript possesses large amounts of variation across normalized read count values, then an accurate expression pattern cannot be determined relative to the condition itself, independently of the source of variation. When this variation is large, it suggests that the condition-associated datasets are discordant with each other and thus comparisons to other conditions to identify DEGs will not yield meaningful results. The metric our method uses is the variation present within the pairwise differences of normalized read counts between datasets associated with a speci c condition across each transcript of the reference set. Within a condition, the range of these values across all transcripts indicates the compatibility of the data for usage within a differential expression analysis.
As a case study, we demonstrate our approach at an inter and intra-study level using multiple published RNA-seq datasets from brain samples of dogs and wolves derived from several studies, and to brain samples of foxes from domestication experiments derived from the same study. Animal domestication has resulted in a shared set of common behavioural traits, including reduced fear of humans, diminished aggression and altered tendencies of exploration [11]. Despite the genetic basis underlying such behaviours being unclear, it has been shown that selection for tameness leads to common phenotypic changes in domesticated species, for example in rats [12], red foxes [13,14] and red junglefowl [15]. Among canids, two illustrative examples of human-de ned behavioural shifts have been studied through RNA-sEq. First, domestic dogs, that present marked behaviour differences from wolves, their wild ancestors, have evolved unique social cognitive capabilities [16][17][18]. Second, a lineage of tame red foxes recently discovered to have originated in fur farms in Canada [19], which resulted from deliberated selection against fear and aggression over several generations of cross-breeding [20][21][22]. An inter-study comparison of expression patterns between RNA-seq data for wolves and dogs as well as for aggressive and tame foxes would provide insights on the gene dynamics involved in the evolution of tameness.
Within this study we initially explore the effects of removing transcripts associated with several levels of variation relative to inter and intra-study datasets and compare patterns of gene expression between i) wolves and dogs and ii) aggressive and tame foxes to uncover the genes associated with behavioural traits involved in both domestication events.

Results
Alignment of RNA-seq data Available RNA-seq data from the brain of dogs, wolves, aggressive and tame foxes (Supplementary  Fig. 1), as expected due to their recent divergence (~ 27 kya) [23]. A similar portion of reads failing to map against the transcriptome (~ 40%) has been previously reported for dog brain samples [24] and is most likely due to (i) novel genes (ii) regions that are not translated despite being transcribed, (iii) contamination with genomic DNA and (iv) uncharacterized chimeras within reference resulting from assembly errors. For the fox samples, an average of 50% of reads mapped to the dog reference transcriptome (Supplementary Fig. 1). The lower percentage of mapping for foxes was expected due to an increased genetic divergence to dogs (~ 10 mya) [25] together with the other aforementioned factors.

Intra-condition Variation
To remove individual transcripts associated with high amounts of variation prior to differential gene expression analysis, we developed a method that utilizes the intra-condition variation within pairwise differences of normalized read counts between samples for each transcript ( Fig. 1; Methods section). The method handles raw count data which is then normalized considering the length of each transcript and the sum of all counts for the corresponding sample. We grouped samples across two contrasts, wolves (n = 6) vs dogs (n = 10) and aggressive vs tame foxes (n = 12 for each condition), according to the groups that were posteriorly used for differential gene expression analysis. Across all transcripts, the mean intracondition variation observed between wolves and dogs was not signi cantly different (Wilcoxon-test, pvalue < 0.198, Supplementary Fig. 2a), while between aggressive and tame foxes a signi cant difference was found (Wilcoxon-test, p-value < 2.2e − 16 , Supplementary Fig. 2a). In the latter, tame fox samples exhibited a higher number of transcripts associated with increased variability, mostly due to ve samples that differentiate from the remaining ones in the PCA plot of normalized count values (axis PC1 explained 80% of the variance, Supplementary Fig. 2b).
Based on the combined variance distribution for each of the contrasts, transcripts were removed from the reference according to a series of threshold values de ned by the percentiles associated with the distribution of variation (e.g. variance score at the 95th percentile correspond to a removal of the 5% of most variable transcripts) (Fig. 2 Table 2; Methods section). For each level, only the transcripts that pass the lter were maintained in the newly created output les of each sample, for downstream analysis. Initially, for wolves and dogs, 184 transcripts associated with the 99th percentile of variance and above were removed, while for the aggressive vs tame foxes, 235 transcripts were removed. Overall, the number of transcripts removed was higher among intra-study samples compared to samples combined from different studies, suggesting higher discordance between fox samples.

Differential Gene Expression Analysis
Levels of gene expression between conditions of each contrast (wolves vs dogs and aggressive vs tame foxes) were assessed using raw read count data from non-ltered and ltered data on DESeq2 [2]. To evaluate the effect of removing variation on dispersion estimates calculated by DESeq2, we used nonltered and all the different ltered-level datasets, independently, to calculate the estimates of gene-wise dispersions and to perform a regression analysis over the mean of normalized counts. The regression analysis revealed a better tting for the contrast wolves vs dogs (Figure 3), that displayed a high correlation (r 2 > 0.7) and a low deviation of the residuals (root mean square error -RMSE) around the line of best t. By removing only 1% of the transcripts with high intra-condition variation, the correlation between both variables improved, the RMSE decreased and the number of outliers, recognize by DESeq2 as the points with extremely high dispersion values that cannot be shrunken towards the t curve, has decreased (Supplementary Figure Table 3).
Regarding the differential gene expression analysis prior to ltering, 430 differentially expressed transcripts (DETs) were identi ed between wolves and dogs. Of those, 259 were over expressed in dogs while 171 were under expressed (Supplementary Table 4). Between aggressive and tame foxes, 651 DETs were observed, of which, 532 and 119 were over and under expressed, respectively (Supplementary Table   4). Post ltering, within the rst ten steps of size one from the 99 th to the 90 th percentiles, the number of DETs identi ed, peaks at the 97 th (n=430; over=255, under=175) and the 95 th percentiles (n=730; over=607, under=123) in dogs and tame foxes (Figure 4), respectively. These peaks suggest that the removal of the 3% (n= 854) and 5% (n=1940) of transcripts associated with the highest levels of intracondition variation optimizes the detection of DETs. These ltered datasets were selected as inputs for the gene annotation.

Individual Gene Identi cation And Gene Families
Following the annotation of DETs within each contrast, using the non-ltered (Supplementary Table 5 and  Supplementary Table 6) and ltered datasets at 3% and 5% thresholds (Supplementary Table 7 and   Supplementary Table 8), six and 43 annotated genes originally over expressed in the non-ltered datasets of dogs and tame foxes, respectively, were removed in the ltered datasets due to high intra-condition variation. Similarly, seven different under expressed annotated genes were removed in dogs and tame foxes. Of those, in dogs, all seven genes were transcripts removed whereas in tame foxes, two were removed, while the other ve, although kept in the reference, were no longer signi cantly differentially expressed (p > 0.05).
Between the ltered data, 21 gene families, containing 50 genes, were observed to be simultaneously over expressed within dogs and tame foxes (Table 1). Of these 50 genes, 19 were exclusive to dogs while 24 were exclusive to tame foxes. The remaining seven genes (RGR, CHRNA5, SQLE, ARHGAP25, ITGA7, MYO7A and TRIB2), belonging to seven different families, were common to both dogs and tame foxes.
Additionally, three gene families, containing four genes, were found to be simultaneously under expressed ( Table 2). Two of these genes (STMND1 and OASL) were shared between dogs and tame foxes while the other two were unique to each condition. The same analysis performed on the non-ltered datasets revealed similar results (Supplementary Table 9), however, the RGR gene family, which included a shared gene between dogs and tame foxes, was lost. Table 1 List of the gene families that were simultaneously over expressed in dogs (DG) and tame foxes (TF) between the ltered datasets. The number and the name of the genes that composed each family and the species they are present (shared or exclusively to dogs/tame foxes) are presented with the corresponding value of log2Fold change in brackets. When more than one variant for a speci c gene was present, all the log2FC values were reported.  Table 2 List of the gene families that were simultaneously under expressed in dogs (DG) and in tame foxes (TF) between the ltered datasets. The number and the name of the genes that composed each family and the species they are present (shared or exclusively to dogs/tame foxes) are presented with the corresponding value of log2Fold change in brackets. When more than one variant for a speci c gene was present, all the log2FC values were reported.

Discussion
Despite the existence of several published RNA-seq datasets, the quanti cation and control of variation present between replicates of a given condition are often overlooked. Here we developed a method to quantify intra-condition variation and to remove reference-based transcripts associated with highly variable read counts, prior to differential expression analysis. By applying our method, we were able to reduce the level of noise that standard differential expression tools are required to accommodate when determining differential expression patterns.
Following the application of our method to intra and inter-study datasets derived from brain samples of dogs, wolves, and foxes, we observed an improvement in the distribution of the gene-wise dispersion estimates used by DESeq2 to determine DEGs, by removing 1% of the transcripts that displayed high intra-condition variation. The correlation between the mean of normalized counts and dispersion estimates per gene improved when intra-condition variation was removed. This is a consequence of reducing the number of transcripts that displayed large differences in variance for the respective mean of normalized counts, or, in other words, have high overdispersion. By removing those from the reference, prior to usage within the differential expression software, the gene-wise estimates of dispersion will be more accurate since they are calculated based on information across all genes by assuming that genes with similar expression levels have similar dispersion [2]. Surprisingly, samples from the fox intra-study case contained a higher number of transcripts associated with high amounts of intra-condition variation.
This high discordance between counts of samples from the same condition has resulted in a more spread distribution of dispersion estimates around the t curve which affected the strength of the shrinkage method. When the detection of outliers is dependent on how far a certain point is from the t curve, having points that are highly spread will affect the detection of outliers and increase the number of false positives. This may explain the low number of outliers detected in the fox data, even having the highest variation. Our method proves effective to improve the distribution of dispersion estimates prior to differential expression analysis and although we have used DESeq2, its utility covers other methods for differential expression analysis, since they also rely on share information across genes for dispersion estimation (e.g. edgeR [26], BBSeq [27], DSS [28], baySeq [29] and ShrinkBayes [30]).
It was observed that removing the 3% and 5% of the transcripts associated with the highest levels of variation maximized the number of DETs between wolves vs dogs and aggressive vs tame foxes, respectively. More importantly, several genes that were over expressed in dogs and tame foxes in the nonltered datasets, some at the top of the list, were removed after ltering. This is concerning because it reveals how dependent the nal list of DEGs is on the accuracy of gene-wise dispersion estimates. When we compared the results of differential expression analysis obtained in the intra-study case with the original publication [21] we found most of the genes (92%) that they have identi ed as differentially expressed and veri ed that some of those genes disappeared after ltering.
Amongst the 50 over expressed genes identi ed across the 21 shared gene families, seven genes were shared between dogs and tame foxes. Up until now, almost no gene overlap has been observed in gene expression analyses involving domesticated or tame animals, across three different mammal families (Canidae, Suidae, and Leporidae) [18]. Of the seven shared genes, three main functions related to brain development, neurotransmission, and immune response were identi ed. These functions have been repeatedly associated with behaviour selection during domestication by different approaches, such as QTL analysis [12,31,32], whole-genome sequencing [33][34][35] and RNA data both using microarrays and RNA-seq [16,17,20,21,36,37]. Of these genes, the shared gene ITGA7 belongs to a gene family that is known to play an essential role in the control of neuronal connectivity [38] and the in ammatory response [39]. Other genes from this family, for example, ITGA8, have been previously observed to be over expressed in tame foxes [20], and here we also observed its over expression in dogs. Thus, our novel analysis provides further evidence of the family role in tameness. Similar functions are associated with the shared genes CHRNA5 [40,41] and TRIB2 [42] from the cholinergic and tribbles family, respectively. Also, we found a shared gene involved in sensing local environmental stimuli, the MYO7A, whose mutation results in loss of hearing and vision [43]. Amongst the three gene families identi ed as under expressed, we found the shared gene STMND1, which de ciency in the amygdala of mice was connected to a de ciency in innate and learned fear [44], a behaviour that might also have played an important role in domestication.

Conclusions
In the present study, we have presented a method to quantify and remove intra-condition variation from RNA-seq count data and demonstrate its usage in improving the distribution of gene-wise dispersion estimates and ultimately, reduce the number of false positives in differential gene expression analysis. Studies using RNA-seq data should characterize and discuss the variation present in their data prior to performing differential expression analysis, independently of the software, to reduce the number of genes that appear, erroneously, as differentially expressed. Additionally, we demonstrated the use of our method in the identi cation of candidate genes involved with selection for tameness, which seems to have played a crucial role in the canine domestication.

RNA-seq data
Available RNA-seq data from the brain of dogs, wolves, and foxes (Supplementary Table 1 [18,21,24,45,46] were used. We accounted for the available details of all samples including the relative location of the tissue, age, and sex of animals, replicate information and sequencing details (Supplementary Table 1). The sample codes will be referred to throughout the rest of this manuscript.

Read Mapping And Normalization
Reads from each of the 44 samples were mapped to the dog reference transcriptome [46], containing 26,107 annotated contigs (Ensembl CanFam3.1, release 92) using Bowtie2.3.4.1 [47]. The percentage of reads mapped for each sample was calculated and used as an indicator of mapping success. For each sample, read counts for each transcript were obtained using BBMap [48] and the output les were used as input in the differential expression software. For our ltering method, normalization was performed by dividing the count values by the length of the transcript they were associated with and by the sum of all counts within the le for a given sample.

Removing Intra-condition Variation
The method we developed removes individual transcripts associated with high amounts of intracondition variation within the normalized read counts across samples prior to differential expression analysis (Fig. 1). The method utilizes the intra-condition variation within pairwise differences of normalized read counts between samples for each transcript. Samples were grouped across two contrasts, each with two conditions: wolves vs dogs (wolves n = 6 and dogs n = 10) and aggressive vs tame foxes (n = 12 for each condition). Read counts from technical replicates of "Dog_8" and "Dog_9" were averaged and merged into one le, while read counts from the two biological replicates of "Dog_7" were treated separately.
The input for the method is a set of les, each containing the raw counts that are associated with each transcript of the reference. For each contrast the method implements the following steps: A. Preprocessing: Allocate each input dataset of read counts to either condition A or condition B and normalize read counts.
B. Calculating intra-condition variation: (i) For each transcript, calculate the absolute pairwise differences between normalized read counts across all samples within condition A. From these values calculate the corresponding variance (σ An ) where n represents the transcript number. Following this step, the number of variance values are equal to the number of transcripts within the reference.
(ii) Repeat for condition B to obtain values for σ Bn .
C. Filtering: (i) All the variance scores are placed in ascending order into a single le, regardless of condition.
(ii) Calculate the corresponding percentile values across the full distribution of variance values.
(iii) Use percentiles as thresholds to identify the amount of variation to be removed. For example, use the variance score at the 95 th percentile to remove the 5% of transcripts associated with the largest amount of intra-condition variation.
(iv) For each transcript, if either σ An or σ Bn is greater than the threshold, remove that transcript and all associated counts from the input data.
(v) Output the raw read counts associated with the remaining transcripts to a separate le for each dataset.
For each contrast, variance thresholds ranging from the 70th to the 90th percentiles (in steps of ve), and from the 91th to the 99th (in steps of one) were explored. Steps of one were used in the latter in order to explore this range containing the transcripts associated with the highest levels of intra-condition variation in more detail.
The method was written in Java and is freely available on the web at: https://sourceforge.net/projects/tvscript/.

Differential gene expression analysis
To perform the differential gene expression analysis we used DESeq2 [2] since it was shown to be one of the most consistent methods to identify DEGs [49]. DESeq2 estimates the gene-wise dispersions and shrink these estimates to generate more accurate estimations of dispersions to model the counts. The dispersions are inversely related to the mean since lower mean counts are more affected by within replicates variation. For each contrast, using as input the non-ltered and all the different ltered-level datasets, independently, estimates of dispersion were calculated and used in a linear regression analysis in relation to the mean of normalized counts using R [50]. The number of DETs between conditions of each contrast (wolves vs dogs and aggressive vs tame foxes) was assessed prior to and post-ltering stages. A PCA was performed, using the plotPCA function from DESeq2 with non-ltered normalized count values, to visualize the overall effects of experimental covariates in each contrast.

Gene Annotation And Gene Family Analysis
Independently for each contrast, DETs obtained using the non-ltered and ltered datasets, were annotated to the correspondent gene ID using the R package BioMart [51] against the Ensembl Gene database (version 94). Over expressed genes in dogs and tame foxes were classi ed into gene families. Families containing genes from both dogs and tame foxes were selected for further analysis given their potential for being involved in the evolution of tame behaviour. Genes within gene families were grouped according to whether they were unique to either dogs or tame foxes or shared between the two. Under expressed genes were treated in the same manner. For a given contrast, a family was only maintained if all the associated genes agreed in relation to their direction of differential expression (i.e. all genes were either over or under expressed). The method developed to lter individual transcripts associated with high amounts of intra-condition variation within expression pro les across samples of both conditions for each contrast.