Prior to blood sampling, all owners confirmed that the dogs were healthy and did not show any signs of illness at least in the past two weeks prior to the sampling date. Furthermore, a veterinarian also examined the dogs prior to blood sampling and all donors were classified as healthy without any infections. A routine laboratory blood test was also carried out, with negative results, indicating a good general health (Supplementary Data 1).
A power analysis was carried out as implemented in the RnaSeqSampleSize R package [29]. The FDR level was set to 10%, its default parameter value in the DESeq2 R package used for the differential gene expression analysis here and the minimum fold change (rho) was set to 3. The estimated power was 0.98, indicating that the group sizes of five samples, combined with the high sequencing depth was sufficient for the study.
3.1 Raw data quality
Table 3 shows the raw data statistics of the analysed samples. Although the young animals had a higher average sequencing depth (142 million vs. 128 million reads), this was primarily due to an outlier sample (CL_y3), which was sequenced to ~ 211 million reads. Without this individual, there was no significant difference between the two groups with regard to sequencing depth. After removing the outlier, the only significant difference between the age groups could be observed in the number of secondary alignments, including the hemoglobin genes as well (11 million reads – or 8% – more in the old cohort).
Table 3
Sequencing and alignment statistics of the 10 samples.
Sample ID
|
Number of sequenced reads
|
Number of reads after adapter trimming
|
Number of aligned reads
|
Number and proportion of hemoglobin reads
|
Secondary alignments with hemoglobin genes
|
Secondary alignments without hemoglobin genes
|
N
|
%
|
N
|
%
|
N
|
%
|
N
|
%
|
CL_y1
|
139574092
|
99781996
|
87552180
|
87.74
|
64061880
|
73.17
|
100056226
|
100.27
|
2175945
|
2.18
|
CL_y2
|
145101716
|
114231400
|
98754640
|
86.45
|
24426835
|
24.73
|
33712226
|
29.51
|
5736365
|
5.02
|
CL_y3
|
211298362
|
137241346
|
120949359
|
88.13
|
91142725
|
75.36
|
174121970
|
126.87
|
2585114
|
1.88
|
CL_y4
|
115495360
|
102506414
|
87090986
|
84.96
|
20523206
|
23.57
|
36225285
|
35.34
|
4687901
|
4.57
|
CL_y5
|
100769724
|
76546858
|
66152258
|
86.42
|
35704823
|
53.97
|
51500272
|
67.28
|
2617499
|
3.42
|
CL_o1
|
132309048
|
100814174
|
86252499
|
85.56
|
38902610
|
45.10
|
75151359
|
74.54
|
4017804
|
3.99
|
CL_o2
|
115616142
|
93060894
|
79023663
|
84.92
|
37103408
|
46.95
|
57135456
|
61.40
|
3463513
|
3.72
|
CL_o3
|
120024014
|
91619710
|
80186573
|
87.52
|
29403124
|
36.67
|
31520157
|
34.40
|
3990157
|
4.36
|
CL_o4
|
124942900
|
103191914
|
88892129
|
86.14
|
30694317
|
34.53
|
54958699
|
53.26
|
3720187
|
3.61
|
CL_o5
|
148054394
|
103245778
|
90434886
|
87.59
|
58026387
|
64.16
|
111519462
|
108.01
|
2956990
|
2.86
|
young average
|
142447851
|
106061603
|
92099885
|
86.74
|
47171894
|
50.16
|
79123196
|
71.85
|
3560565
|
3.41
|
old average
|
128189300
|
98386494
|
84957950
|
86.35
|
38825969
|
45.48
|
66057027
|
66.32
|
3629730
|
3.71
|
Although the raw data quality checks (including: adapter trimming, removal of the high quality poly-G sequences from the 3’ end of the reads, hard trimming of the ends of the sequences, removal of reads shorter than 50 bp) removed 24% of the reads, still approximately 100 million reads were retained for the analysis per sample (ranging from 76 to 132 million reads in the young cohort and 91–103 million reads in the old cohort). This number of reads corresponds to ~ 50 million fragments.
The alignment rate was high (86%; on average 92 and 85 million reads aligned in the young and old cohorts, respectively). Therefore our dataset was appropriate for the planned differential gene expression analysis.
We also investigated the hemoglobin genes and the number of reads aligning to these genes. On average, 43 million reads (or 48% of the reads kept after raw data filtering) aligned to the hemoglobin genes. A large variation could be observed in the data with respect to the number of hemoglobin-related reads, ranging from 24 to 75% of the filtered reads in the samples.
The hemoglobin-related reads’ filtering led to a large and significant reduction in the total read number, which was reduced to 23–74 million for the different samples. This affected the samples differently, with the largest changes in the young cohort: three samples had 23–30 million reads, while 2 samples had 66 and 74 million reads after removing the Hg-related reads. The range of the read numbers in the old cohort remained more similar, but a considerable variation existed in that group as well (32–58 million reads per sample). The varying amount of hemoglobin-related mRNA introduced an unwanted bias to our experiment.
The number of secondary alignments was also affected by the hemoglobin genes. The average proportion of secondary alignments compared to the primary alignments was 69%, but it went up to as high as 127%, i.e. more secondary alignments were present in some samples than primary alignments. However, when the hemoglobin reads were removed, the proportion of multi-mapped reads dropped to a normal level, and the secondary alignments were at an average level of 3.6% across the samples, with a negligible difference between the age cohorts (3.4% and 3.7% in the young and old cohorts).
Consequently, the hemoglobin reads and the associated genes – primarily due to the large within-cohort variation – represented a large, random bias in our dataset. As a result, as well as following the recommendations of Harrington et al. [11], both the Hg genes and the associated reads were excluded from the downstream analysis. A reduction in statistical power is expected due to the large reduction in the read counts.
3.2 Descriptive statistical analysis
A total of 12966 genes were expressed in the canine blood. The age clusters could not be differentiated in a multidimensional scaling analysis, which was applied on the rlog-transformed read counts of the ~ 13000 expressed genes (Fig. 1). This suggests that the chronological age of the dogs was not the primary source of the observed read count variance in our data.
Both a principal component analysis (Figure S3) and the Euclidean distances calculated from the rlog-transformed read counts of the samples (Figure S4) led to very similar observations, and neither of these two additional analyses could successfully differentiate the age groups. Thus, these analyses support the multidimensional scaling, that age was not the primary source of variation observed in the per gene read counts.
3.3 Differential gene expression analysis
Figure 2 shows an MA-plot of all expressed genes in the companion dog’s blood tissue. The overwhelming majority of the genes had a fold change around \({log}_{2}0\): the fold change was between − 1 and 1 for 12541 genes (or 97% of the expressed genes). This implies that gene expression changes in the blood transcriptome of the dogs as a function of age are exceptionally rare. This is true in spite of the significant differences between the number of aligned reads without the hemoglobin reads in the two age cohorts (Table 3).
Indeed, we identified as few as 61 differentially expressed genes, which was 0.5% of all expressed genes. 31 of these were downregulated in old dogs compared to young dogs and 30 were upregulated in the same direction (Fig. 3; the fold change of the significant genes ranges from 0.5 to 5.6). Clustering of the sequenced animals based on the gene expression profiles of the differentially expressed genes unsurprisingly separated the two examined clusters – in contrast with the previous analyses (e.g. MDS; Fig. 1), which was based on the normalised read counts of all expressed genes.
We also tested the effect of sex and neutering as covariates in the fitted model, with neither of the two parameters with a significant effect on the results. Applying the same thresholds, out of the tested 12,966 genes only three were significantly differently expressed between males and females and zero between neutered and non-neutered animals.
Furthermore, an independent, parallel differential gene expression analysis with the edgeR R package, using its default parameter values, did not result in differentially expressed genes between the two groups (data not shown).
3.4 Functional analysis of the differentially expressed genes
Next, we compared the differentially expressed genes in a gene ontology (GO) overrepresentation test to the background of all expressed genes detected in the blood (n ~ 12966) in dogs. This analysis was implemented using the pantherDB on-line tool ([30]; [31]). There was only one gene ontology that was significantly enriched in the overrepresentation test. The fold enrichment of response to bacterium (GO:0009617) was 8.62, and the corresponding false discovery rate adjusted p-value 0.014.