Alignment of RNA-seq data
Available RNA-seq data from the brain of dogs, wolves, aggressive and tame foxes (Supplementary Table 1) was used. The alignment of the 44 samples against the dog reference transcriptome (26,107 transcripts) revealed a mapping success of 60% and 58% for dogs and wolves, respectively (Supplementary Fig. 1), as expected due to their recent divergence (~ 27 kya) [23]. A similar portion of reads failing to map against the transcriptome (~ 40%) has been previously reported for dog brain samples [24] and is most likely due to (i) novel genes (ii) regions that are not translated despite being transcribed, (iii) contamination with genomic DNA and (iv) uncharacterized chimeras within reference resulting from assembly errors. For the fox samples, an average of 50% of reads mapped to the dog reference transcriptome (Supplementary Fig. 1). The lower percentage of mapping for foxes was expected due to an increased genetic divergence to dogs (~ 10 mya) [25] together with the other aforementioned factors.
Intra-condition Variation
To remove individual transcripts associated with high amounts of variation prior to differential gene expression analysis, we developed a method that utilizes the intra-condition variation within pairwise differences of normalized read counts between samples for each transcript (Fig. 1; Methods section). The method handles raw count data which is then normalized considering the length of each transcript and the sum of all counts for the corresponding sample. We grouped samples across two contrasts, wolves (n = 6) vs dogs (n = 10) and aggressive vs tame foxes (n = 12 for each condition), according to the groups that were posteriorly used for differential gene expression analysis. Across all transcripts, the mean intra-condition variation observed between wolves and dogs was not significantly different (Wilcoxon-test, p-value < 0.198, Supplementary Fig. 2a), while between aggressive and tame foxes a significant difference was found (Wilcoxon-test, p-value < 2.2e− 16, Supplementary Fig. 2a). In the latter, tame fox samples exhibited a higher number of transcripts associated with increased variability, mostly due to five samples that differentiate from the remaining ones in the PCA plot of normalized count values (axis PC1 explained 80% of the variance, Supplementary Fig. 2b).
Based on the combined variance distribution for each of the contrasts, transcripts were removed from the reference according to a series of threshold values defined by the percentiles associated with the distribution of variation (e.g. variance score at the 95th percentile correspond to a removal of the 5% of most variable transcripts) (Fig. 2, Supplementary Table 2; Methods section). For each level, only the transcripts that pass the filter were maintained in the newly created output files of each sample, for downstream analysis. Initially, for wolves and dogs, 184 transcripts associated with the 99th percentile of variance and above were removed, while for the aggressive vs tame foxes, 235 transcripts were removed. Overall, the number of transcripts removed was higher among intra-study samples compared to samples combined from different studies, suggesting higher discordance between fox samples.
Differential Gene Expression Analysis
Levels of gene expression between conditions of each contrast (wolves vs dogs and aggressive vs tame foxes) were assessed using raw read count data from non-filtered and filtered data on DESeq2 [2]. To evaluate the effect of removing variation on dispersion estimates calculated by DESeq2, we used non-filtered and all the different filtered-level datasets, independently, to calculate the estimates of gene-wise dispersions and to perform a regression analysis over the mean of normalized counts. The regression analysis revealed a better fitting for the contrast wolves vs dogs (Figure 3), that displayed a high correlation (r2 > 0.7) and a low deviation of the residuals (root mean square error – RMSE) around the line of best fit. By removing only 1% of the transcripts with high intra-condition variation, the correlation between both variables improved, the RMSE decreased and the number of outliers, recognize by DESeq2 as the points with extremely high dispersion values that cannot be shrunken towards the fit curve, has decreased (Supplementary Figure 3, Supplementary Table 3). The removal of the top 10% variable transcripts led to an increase of the r2 to 0.82, to less 109 outliers and to a decrease in the number of transcripts with over-dispersion (variance > mean) (Figure 3). For the fox contrast, the linear regression did not fit so well the correlation between both variables (r2 = 0.49), where an elevated number of transcripts presented over-dispersion (Figure 3). In this case, the shrinkage was more extensive (Supplementary Figure 3) to include more points scattered around the fit curve. Nevertheless, the same improvement in the r2, RMSE, and the number of outliers, was observed after removing intra-condition variation (Figure 3, Supplementary Figure 3, Supplementary Table 3).
Regarding the differential gene expression analysis prior to filtering, 430 differentially expressed transcripts (DETs) were identified between wolves and dogs. Of those, 259 were over expressed in dogs while 171 were under expressed (Supplementary Table 4). Between aggressive and tame foxes, 651 DETs were observed, of which, 532 and 119 were over and under expressed, respectively (Supplementary Table 4). Post filtering, within the first ten steps of size one from the 99th to the 90th percentiles, the number of DETs identified, peaks at the 97th (n=430; over=255, under=175) and the 95th percentiles (n=730; over=607, under=123) in dogs and tame foxes (Figure 4), respectively. These peaks suggest that the removal of the 3% (n= 854) and 5% (n=1940) of transcripts associated with the highest levels of intra-condition variation optimizes the detection of DETs. These filtered datasets were selected as inputs for the gene annotation.
Individual Gene Identification And Gene Families
Following the annotation of DETs within each contrast, using the non-filtered (Supplementary Table 5 and Supplementary Table 6) and filtered datasets at 3% and 5% thresholds (Supplementary Table 7 and Supplementary Table 8), six and 43 annotated genes originally over expressed in the non-filtered datasets of dogs and tame foxes, respectively, were removed in the filtered datasets due to high intra-condition variation. Similarly, seven different under expressed annotated genes were removed in dogs and tame foxes. Of those, in dogs, all seven genes were transcripts removed whereas in tame foxes, two were removed, while the other five, although kept in the reference, were no longer significantly differentially expressed (p > 0.05).
Between the filtered data, 21 gene families, containing 50 genes, were observed to be simultaneously over expressed within dogs and tame foxes (Table 1). Of these 50 genes, 19 were exclusive to dogs while 24 were exclusive to tame foxes. The remaining seven genes (RGR, CHRNA5, SQLE, ARHGAP25, ITGA7, MYO7A and TRIB2), belonging to seven different families, were common to both dogs and tame foxes. Additionally, three gene families, containing four genes, were found to be simultaneously under expressed (Table 2). Two of these genes (STMND1 and OASL) were shared between dogs and tame foxes while the other two were unique to each condition. The same analysis performed on the non-filtered datasets revealed similar results (Supplementary Table 9), however, the RGR gene family, which included a shared gene between dogs and tame foxes, was lost.
Table 1
List of the gene families that were simultaneously over expressed in dogs (DG) and tame foxes (TF) between the filtered datasets. The number and the name of the genes that composed each family and the species they are present (shared or exclusively to dogs/tame foxes) are presented with the corresponding value of log2Fold change in brackets. When more than one variant for a specific gene was present, all the log2FC values were reported.
Gene Family | Group | Number of OE | Gene name and log2FC value |
Retinal G protein-coupled receptor | Shared | 1 | RGR (2.10 in DG, 0.78 in TF) |
Cholinergic receptor nicotinic alpha | Shared | 1 | CHRNA5 (1.1 in DG, 0.4 in TF) |
Squalene epoxidase | Shared | 1 | SQLE (0.54 in DG, 0.31 in TF) |
Rho GTPase activating protein | Shared | 1 | ARHGAP25 (0.86 in DG, 0.72 in TF) |
TF | 2 | ARHGAP4 (0.64); ARHGAP30 (0.57) |
Integrin alpha subunits | DG | 3 | ITGA6 (1.25, 1.24); ITGA8 (1.14, 0.90); ITGAX (0.97) |
TF | 1 | ITGAL (0.73) |
Shared | 1 | ITGA7 (0.76 in DG, 0.46 and 0.49 in TF) |
Myosin | DG | 1 | MYO3A (1.12) |
TF | 3 | MYOZ1 (1.53); MYO1F (0.93); MYO1C (0.47) |
Shared | 1 | MYO7A (0.82 in DG; 0.41 in TF) |
Tribbles pseudokinase | TF | 2 | TRIB1 (0.94); TRIB3 (0.78) |
Shared | 1 | TRIB2 (0.61 in DG; 0.2 in TF) |
EF hand calcium binding | DG | 1 | EFCAB1 (2.59) |
TF | 1 | EFCAB2 (0.46) |
Transcription factor | DG | 1 | TCF23 (2.04) |
TF | 1 | TCF19 (0.63) |
Adhesion G protein-coupled receptors | DG | 1 | ADGRG6 (1.45) |
TF | 1 | ADGRG1 (0.57) |
Patatin Like Phospholipase Domain | DG | 1 | PNPLA4 (1.41) |
TF | 1 | PNPLA7 (0.59) |
SRY-box | DG | 1 | SOX6 (1.26) |
TF | 2 | SOX17(0.84); SOX10 (0.66) |
Hyaluronan and proteoglycan link protein | DG | 1 | HAPLN1 (1.15) |
TF | 1 | HAPLN3 (0.70) |
Serine/threonine kinase | DG | 2 | STK17A (1.15, 1.14); STK32A (1.10) |
TF | 1 | STK40 (0.57) |
Potassium channels | DG | 1 | KCTD16 (0.98) |
TF | 1 | KCTD15 (0.72) |
Podocalyxin like | DG | 1 | PODXL (0.95, 0.84) |
TF | 1 | PODXL2 (0.70, 0.69, 0.67) |
ATP binding cassette subfamily B | DG | 1 | ABCB1 (0.93) |
TF | 1 | ABCB9 (0.52) |
Zinc finger DHHC-type | DG | 1 | ZDHHC15 (0.75) |
TF | 1 | ZDHHC1 (0.70) |
Sushi domain | DG | 1 | SUSD1 (0.68) |
TF | 2 | SUSD3 (0.79); SUSD6 (0.47) |
TBC1 domain family | DG | 1 | TBC1D5 (0.54) |
TF | 1 | TBC1D7 (0.27) |
Mitogen-activated protein kinase kinase kinases | DG | 1 | MAP3K5 (0.51) |
TF | 1 | MAP3K11 (0.76) |
Table 2
List of the gene families that were simultaneously under expressed in dogs (DG) and in tame foxes (TF) between the filtered datasets. The number and the name of the genes that composed each family and the species they are present (shared or exclusively to dogs/tame foxes) are presented with the corresponding value of log2Fold change in brackets. When more than one variant for a specific gene was present, all the log2FC values were reported.
Gene Family | Group | Number of UE | Gene name and log2FC value |
Stathmin domain | Shared | 1 | STMND1 (-1.18 in DG, -0.53 in TF) |
Oligoadenylate synthetase like | Shared | 1 | OASL (-0.41 in DG, -0.52 in TF) |
Heat shock protein family B | DG | 1 | HSPB8 (-0.70) |
TF | 1 | HSPB11 (-0.32) |