3’ bias in read coverage of public datasets is associated with poly(A)-enrichment and freeze-thaw
We first examined public data to establish initial evidence of an incompatibility between poly(A)-enrichment and frozen samples. Specifically, we analyzed the gene-coverage distribution in libraries prepared from frozen samples by either poly(A)-enrichment or ribosomal depletion. Since evidence exists that freeze-thaw enhances transcript degradation, and since poly(A)-enrichment selects mRNA by hybridization to the poly(A) tail, we expect increased read coverage at the 3’ end of transcripts--3’ bias--when these two factors are combined. To test this expectation, we compared gene body coverage from the 5’ to the 3’ end between poly(A)-enriched and ribosomal RNA depleted samples with and without freezing. Specifically, we examined the median coverage percentile: the percentile-normalized position along the gene body at which a sample reaches half of its cumulative coverage (Fig. 1a).
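As an illustration, the median coverage percentile could be computed from a binned gene-body coverage profile (e.g. as produced by tools such as RSeQC's geneBody_coverage.py) roughly as in the sketch below; the 101-bin profile, function name, and example values are hypothetical and only restate the definition above.

```python
import numpy as np

def median_coverage_percentile(gene_body_coverage):
    """Gene-body percentile (0 = 5' end, 100 = 3' end) at which cumulative
    coverage first reaches half of the sample's total coverage."""
    coverage = np.asarray(gene_body_coverage, dtype=float)
    cumulative = np.cumsum(coverage) / coverage.sum()
    median_bin = int(np.searchsorted(cumulative, 0.5))  # first bin at >= 50%
    return 100.0 * median_bin / (len(coverage) - 1)

# A profile skewed toward the 3' end yields a percentile above 50.
biased_profile = np.linspace(1, 10, 101)            # hypothetical 101-bin profile
print(median_coverage_percentile(biased_profile))   # ~68, i.e. 3'-shifted
```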
We compared the median coverage percentile across 237 blood-tissue samples spanning 10 publicly available datasets (Supplementary Table 1, Fig. 1b). We found that, for either library preparation method, freezing increases 3’ bias (independent t-test, Benjamini-Hochberg correction, FDR ≤ 0.003), and that this increase is much more pronounced for poly(A)-enriched samples. Additionally, poly(A)-enrichment shows a consistently larger 3’ bias than ribosomal depletion (FDR ≤ 0.041). We found that both library preparation and freezing independently and significantly contribute to 3’ bias (two-way ANOVA, p ≤ 3.99e-9). In an additional study examining the impact of RNA extraction in frozen tissue from the UNC and TCGA tumor tissue repositories36, we found a significant (two-sample t-test, p < 1e-16) decrease in the 5’-to-3’ coverage ratio of poly(A)-enriched samples compared to ribosomal depletion (Fig. 1c). This indicates an increase in 3’ bias of frozen tissues that is consistent across both repositories.
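A hedged sketch of the group comparisons above: per-preparation t-tests with Benjamini-Hochberg correction, and a two-way ANOVA with main effects for library preparation and freezing. The input file and column names ('median_cov_pct', 'library_prep', 'frozen') are placeholders, not the actual data layout.

```python
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests
import statsmodels.api as sm
import statsmodels.formula.api as smf

samples = pd.read_csv("public_blood_samples.csv")  # hypothetical sample table

# Fresh vs frozen within each library preparation method, BH-corrected
pvals = []
for prep, grp in samples.groupby("library_prep"):
    frozen = grp.loc[grp["frozen"] == "frozen", "median_cov_pct"]
    fresh = grp.loc[grp["frozen"] == "fresh", "median_cov_pct"]
    pvals.append(ttest_ind(frozen, fresh).pvalue)
reject, fdr, _, _ = multipletests(pvals, method="fdr_bh")

# Two-way ANOVA: independent contributions of library prep and freezing
fit = smf.ols("median_cov_pct ~ C(library_prep) + C(frozen)", data=samples).fit()
print(sm.stats.anova_lm(fit, typ=2))
```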
To determine the breadth of this potential sample processing issue, we explored the prevalence of poly(A)-enrichment from frozen tissue by examining metadata in the Gene Expression Omnibus (GEO). Using GEOmetadb37, we queried all human RNA samples from 2008 to 2018 prepared by either poly(A)-enrichment or ribosomal depletion. There are thousands of samples annotated as “frozen” prepared using either total RNA or poly(A)-enrichment methods (Fig. 1d). Among samples annotated as “frozen”, the frequency of poly(A) extraction increases from less than 10% to over 25% (Fig. 1e), suggesting that this potentially problematic combination is prevalent and possibly preferred. Finally, stratifying this trend over time, we see that poly(A)-enrichment, as well as the relative proportion of poly(A)-enriched frozen samples, is increasing in popularity relative to total RNA extraction, whose usage has remained fairly consistent (Fig. 1f).

Taken together, these results indicate a potential, widespread distortion in RNA-seq associated with a deleterious interaction between poly(A)-enrichment and freeze-thaw. Because these results span several studies, each may introduce unaccounted sources of technical variation. To explore this possibility more formally, the remainder of our analyses focus on a single controlled experiment. Specifically, we subjected whole-blood extracted leukocyte samples--with technical replicates--from autistic (ASD) or typically developing (TD) toddlers to a varying number of freeze-thaw cycles, which we recorded along with other sample quality metrics such as RIN.
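As a sketch of the GEO survey above, a metadata query of this kind could be run directly against the GEOmetadb SQLite file (downloaded, for example, with GEOmetadb::getSQLiteFile() in R). The column names and the text-matching rules below are assumptions about the schema and annotation wording, not the exact query we used.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("GEOmetadb.sqlite")
query = """
SELECT gsm, submission_date, characteristics_ch1, extract_protocol_ch1
FROM gsm
WHERE organism_ch1 = 'Homo sapiens'
  AND molecule_ch1 LIKE '%RNA%'
  AND submission_date BETWEEN '2008-01-01' AND '2018-12-31'
"""
rna = pd.read_sql_query(query, con)

# Crude text-based labels for preparation method and frozen status
text = (rna["characteristics_ch1"].fillna("") + " " +
        rna["extract_protocol_ch1"].fillna("")).str.lower()
rna["polyA"] = text.str.contains("poly(a)", regex=False) | text.str.contains("polya")
rna["frozen"] = text.str.contains("frozen")

# Fraction of poly(A)-enriched samples among frozen vs non-frozen annotations
print(pd.crosstab(rna["frozen"], rna["polyA"], normalize="index"))
```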
An Additional Freeze-Thaw Cycle Increases Random Read Counts 1.4-Fold
To address the scarcity of analyses on the effect of freeze-thaw on RNA-seq measurements, we used our technical replicates to compare changes in sample quality across freeze-thaw cycles. We first note that neither RIN nor TIN captures a significant (one-sided Wilcoxon test) decrease in sample quality with increased freeze-thaw (Fig. S1). Given previous indications that these metrics may not sufficiently capture transcript degradation17,26–28, we instead measured the introduction of noise into samples (Fig. S2-3). We define noise as the fraction of reads in a sample that are counted at random rather than mapping to a sample-specific gene. To estimate noise, we simulated the randomness in read counts between technical replicates (Supplementary Methods). By comparing technical replicates that have undergone the same number of freeze-thaw cycles, we can calculate the expected noise in a sample at a given number of freeze-thaw cycles. Since noise does not rely on RIN, we can assess the effects of freeze-thaw and RIN independently.
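The simulation itself is described in the Supplementary Methods; the following is only a simplified sketch of the idea, in which a candidate noise fraction mixes a replicate's gene proportions with uniformly random reads and is tuned to reproduce the observed replicate-to-replicate distance. The function names, grid, and distance measure are illustrative assumptions, not the published procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def replicate_distance(counts_a, counts_b):
    """Total-variation distance between the gene-proportion vectors of two replicates."""
    pa = counts_a / counts_a.sum()
    pb = counts_b / counts_b.sum()
    return 0.5 * np.abs(pa - pb).sum()

def estimate_noise(rep1, rep2, grid=np.linspace(0.0, 0.5, 51), n_sim=50):
    """Noise fraction whose simulated replicate disagreement best matches the observed one."""
    observed = replicate_distance(rep1, rep2)
    p1 = rep1 / rep1.sum()
    depth, n_genes = int(rep2.sum()), len(rep1)
    best_f, best_gap = 0.0, np.inf
    for f in grid:
        mix = (1 - f) * p1 + f / n_genes  # (1 - f) signal + f uniformly random reads
        sims = [replicate_distance(rep1, rng.multinomial(depth, mix))
                for _ in range(n_sim)]
        gap = abs(np.mean(sims) - observed)
        if gap < best_gap:
            best_f, best_gap = f, gap
    return best_f
```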
Median noise increased 1.4-fold from one to two freeze-thaw cycles (one-sided Mann-Whitney U test, p ≤ 0.007), on average across all measures (Fig. 2a). By definition, technical replicates reveal variation due to technical measurement error. We estimated noise between technical replicates that had not undergone freeze-thaw to range from 9.11% to 10.15% (Wald test, p ≤ 5.77e-7). The expected increase in noise per additional freeze-thaw cycle was estimated at 3.6-4.1 percentage points (Wald test, p ≤ 8.12e-3) (Fig. 2b). The introduction of random reads by freeze-thaw cycles may have substantial effects on count quantification (see Discussion) and, consequently, on downstream analyses such as differential expression.
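The baseline noise and per-cycle increase reported above come from regression fits; a minimal sketch of such a fit, assuming a per-replicate-pair table with hypothetical columns 'noise' and 'freeze_thaw':

```python
import pandas as pd
import statsmodels.formula.api as smf

noise_df = pd.read_csv("noise_estimates.csv")        # hypothetical input table
fit = smf.ols("noise ~ freeze_thaw", data=noise_df).fit()

print(fit.params)   # intercept ~ baseline noise; slope ~ increase per freeze-thaw cycle
print(fit.pvalues)  # Wald test p-values for each coefficient
```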
RIN Does Not Predict Additional Noise After One Freeze-Thaw Cycle
Next, we asked whether our observation that RIN does not sufficiently capture changes in sample quality due to freeze-thaw (Fig. S1) extends to noise. Specifically, we tested whether RIN reflects the differences in sample quality as measured by noise.
When considering only samples that underwent one freeze-thaw cycle, each unit increase in RIN decreases noise by 3.24-3.38 percentage points across all metrics (Wald test, p ≤ 6.3e-3) (Fig. 3a-b). Yet, when considering only samples that underwent two freeze-thaw cycles, noise does not change significantly as RIN increases. Taken together, these results indicate that while RIN can be a good measure of noise for samples that underwent one freeze-thaw cycle, it does not capture the loss in sample quality induced by two freeze-thaw cycles.
Differential Expression Similarity Increases 10.3% in High Quality Samples
Next, we investigated how the introduction of noise impacts differential expression (DE) analysis. We assessed DE reproducibility by generating thousands of sample combinations, i.e. subsets, of varying sample quality (Fig. S6). We define sample quality by the aggregate number of freeze-thaw cycles or by RIN. We ran DE between the ASD and TD groups and compared results between subsets of various sizes (4-14 samples). We measure reproducibility using similarity or discordance, based on correlation and dispersion, respectively; higher similarity and lower discordance each represent higher reproducibility. We use these measures to assess differences between subsets consisting of high quality (low freeze-thaw or high RIN) and low quality (high freeze-thaw or low RIN) samples.
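A sketch of the subsetting scheme, assuming a sample table with hypothetical columns 'group', 'freeze_thaw', and 'RIN'; run_differential_expression() and the counts matrix stand in for the actual DE pipeline and data and are not defined here.

```python
import random

def draw_subsets(sample_table, subset_size, n_subsets=1000, seed=0):
    """Draw balanced ASD/TD subsets and record their aggregate sample quality."""
    rng = random.Random(seed)
    asd = sample_table.index[sample_table["group"] == "ASD"].tolist()
    td = sample_table.index[sample_table["group"] == "TD"].tolist()
    half = subset_size // 2
    subsets = []
    for _ in range(n_subsets):
        members = rng.sample(asd, half) + rng.sample(td, half)
        quality = {
            "total_freeze_thaw": sample_table.loc[members, "freeze_thaw"].sum(),
            "mean_rin": sample_table.loc[members, "RIN"].mean(),
        }
        subsets.append((members, quality))
    return subsets

# for members, quality in draw_subsets(sample_table, subset_size=6):
#     results = run_differential_expression(counts[members], sample_table.loc[members])
```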
We held two expectations regarding the effect of sample quality on DE reproducibility in the context of similarity: 1) at any given subset size, reproducibility should be higher for subsets of high quality samples than for subsets of low quality samples, and 2) subset size and sample quality should interact to increase the reproducibility of DE analysis, reflected by a higher rate of increase in reproducibility with respect to subset size for higher quality subsets.
As expected, similarity increases with subset size (Fig. S10), as reflected by the estimated 0.02 (Wald test, p = 2.2e-5) increase in similarity per additional sample (Fig. 4a); thus, expected similarity would increase by 0.20 in a subset with 14 samples relative to a subset with 4 samples. Regression results for each model predicting similarity are reported in Supplementary Table 5.
To measure similarity, we took the pairwise Spearman correlation of log-fold change values between subsets. We tested our first expectation by placing subset pairs into high and low sample quality bins--defined by either RIN or freeze-thaw--for each subset size and comparing their similarity values. Regardless of sample quality, DE similarity increases with subset size. Yet, for nearly all subset sizes, high quality bins have significantly (one-sided Mann-Whitney U test, p ≤ 2.8e-17) higher similarity than low quality bins (Fig. 4d-e). Across subset sizes, we observed an average 1.13-fold and 1.06-fold increase in similarity from low to high quality samples for freeze-thaw and RIN, respectively.
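A minimal sketch of the similarity measure and the bin comparison, assuming matched per-gene log-fold-change vectors for each subset and lists of similarity values per quality bin (all names hypothetical):

```python
from scipy.stats import spearmanr, mannwhitneyu

def similarity(logfc_a, logfc_b):
    """Spearman correlation of matched per-gene log-fold-change vectors."""
    rho, _ = spearmanr(logfc_a, logfc_b)
    return rho

def compare_quality_bins(high_quality_sims, low_quality_sims):
    """One-sided Mann-Whitney U test: are high quality subset pairs more similar?"""
    return mannwhitneyu(high_quality_sims, low_quality_sims, alternative="greater")
```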
Similarity significantly (Wald test, p ≤ 9.2e-3) decreases with the number of freeze-thaw cycles and increases with RIN when accounting for the effect of subset size (Fig. 4a-c), validating our second expectation. Similarity decreases by 0.077 per additional freeze-thaw cycle (Wald test, p = 8.77e-4). Given the estimated similarity of 0.23 for samples that have not undergone freeze-thaw, this implies that DE reproducibility approaches zero after approximately three freeze-thaw cycles (Fig. 4b). Even when accounting for subset size and the effect of RIN, the estimated decrease in similarity per freeze-thaw cycle is nearly the same--0.078 (Wald test, p = 8.77e-4); this further corroborates that RIN alone cannot capture the changes in sample quality due to freeze-thaw. Taken together, these results indicate that higher sample quality increases DE reproducibility as measured by similarity.
Discordance Decreases Nearly 5-Fold in High Quality Samples
We further investigated the relationship between DE reproducibility and sample quality using an effect-size-sensitive measure of discordance (Fig. S9). Specifically, we explored how sample quality affects the relationship between discordance and DE effect size--measured by the mean-variance standardized effect--at each subset size. In this context, we expected 1) discordance at any given effect size to be lower in high quality subsets and 2) the rate of increase in discordance with effect size to be lower in high quality subsets than in low quality subsets.
Corresponding to the regression models used for this analysis, we denote the expected change in discordance per unit increase in fold-change effect size as ΔD. We observed a significant (Wald test, p ≤ 9.45e-141) decreasing trend in ΔD with increasing subset size (Fig. S11, Supplementary Table 6).
We estimated discordance with respect to effect size at each subset size and for subsets of either high or low quality. As expected, independent of sample quality, ΔD shows an overall decreasing trend with respect to subset size for both RIN and freeze-thaw. With respect to freeze-thaw, at a subset size of 6, there is a 1.1-fold decrease in ΔD from low quality to high quality subsets. The disparity in ΔD between high and low sample quality (Δm = ΔD_low quality / ΔD_high quality) increases nearly monotonically up to a subset size of 14, at which point there is a 3.2-fold decrease (Fig. 5a). This monotonicity indicates that the observed relationship between discordance and sample quality is consistent. Furthermore, this disparity produces notable differences in discordance values, even at low effect sizes.
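A sketch of how ΔD and the quality disparity Δm could be obtained under this framing, assuming a table of subset-pair results with hypothetical columns 'discordance', 'effect_size', 'subset_size', and 'quality':

```python
import statsmodels.formula.api as smf

def delta_d(bin_df):
    """ΔD: slope of discordance with respect to effect size within one bin."""
    fit = smf.ols("discordance ~ effect_size", data=bin_df).fit()
    return fit.params["effect_size"]

def delta_m(pairs, subset_size):
    """Δm: ratio of low- to high-quality ΔD at a given subset size."""
    sub = pairs[pairs["subset_size"] == subset_size]
    return (delta_d(sub[sub["quality"] == "low"]) /
            delta_d(sub[sub["quality"] == "high"]))
```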
Consistent with our expectations, ΔD is lower for high quality subsets than for low quality subsets for both freeze-thaw and RIN across all subset sizes (Fig. 5b-c). Nearly all estimates are significant after multiple test correction (Wald test, Benjamini-Hochberg FDR correction, q ≤ 0.07), with the exception of those at the smallest subset size for freeze-thaw.
Taken together, these results indicate that higher sample quality increases DE reproducibility as measured by discordance.
Additional freeze-thaw cycles increase 3’ bias in poly(A)-enriched but not ribosomal RNA depleted samples
Finally, we asked whether repeated freeze-thaw cycles induce a 3’ bias, consistent with the introduction of random reads, the loss of DE reproducibility, and our initial observations in the public datasets.
Using the median coverage percentile, we found a shift in mRNA coverage towards the 3’ end in poly(A)-enriched samples relative to ribosomal depletion (Fig. S13a). Specifically, the median coverage percentile for poly(A)-enriched samples is significantly (one-sided Wilcoxon test, p < 2.2e-16) larger than that of ribosomal RNA depleted samples (Fig. S13b). Samples prepared with poly(A)-enrichment show more 3’ bias than ribosomal RNA depleted samples after both one (one-sided Wilcoxon test, p = 7e-15) and two (one-sided Wilcoxon test, p = 5.9e-5) freeze-thaw cycles (Fig. S13d). Altogether, this indicates an overall 3’ bias in poly(A)-enriched samples, independent of freeze-thaw (Fig. S13b).
Crucially, this 3’ bias is accentuated when samples are stratified by the number of freeze-thaw cycles (Fig. 6a). We observe a significant increase (Wald test, p = 0.007) in normalized median coverage percentile with the number of freeze-thaw cycles in poly(A)-enriched samples. This increase is not observed in ribosomal RNA depleted samples (Wald test, p = 0.07) (Fig. 6b, Supplementary Table 8). For poly(A)-enriched samples, normalized median coverage percentile increases by 1.12 percentage points per log freeze-thaw cycle; freeze-thaw cycles were log-transformed to stabilize variance. We further demonstrate the dependency of 3’ bias on freeze-thaw cycles by showing that median coverage percentile significantly increases with freeze-thaw in poly(A)-enriched samples (Kruskal-Wallis test, p = 0.041). This 3’ bias is particularly apparent after five freeze-thaw cycles (one-sided Wilcoxon test, p = 0.008). In contrast, ribosomal depletion, while showing significant differences in median coverage percentile between freeze-thaw cycle counts (Kruskal-Wallis test, p = 0.012), does not follow a trend with increasing freeze-thaw cycles. This is highlighted by the fact that the difference in median coverage percentile between one and two freeze-thaw cycles is significant (one-sided Wilcoxon test, p = 0.001), whereas the remaining comparisons are not (Fig. S13c).
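A hedged sketch of these coverage-bias tests, assuming a per-sample table with hypothetical columns 'median_cov_pct', 'freeze_thaw', and 'library_prep'; the log transform mirrors the variance-stabilizing transform described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal
import statsmodels.formula.api as smf

coverage = pd.read_csv("coverage_metrics.csv")   # hypothetical per-sample table

for prep, grp in coverage.groupby("library_prep"):
    # OLS of normalized median coverage percentile on log freeze-thaw cycles
    fit = smf.ols("median_cov_pct ~ np.log(freeze_thaw)", data=grp).fit()
    print(prep, fit.params["np.log(freeze_thaw)"], fit.pvalues["np.log(freeze_thaw)"])

    # Kruskal-Wallis test of coverage percentile across freeze-thaw cycle counts
    by_cycle = [g["median_cov_pct"].values for _, g in grp.groupby("freeze_thaw")]
    print(prep, kruskal(*by_cycle))
```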
Taken together, these analyses indicate that poly(A)-enrichment inherently introduces a 3’ bias in coverage relative to ribosomal depletion, and that freeze-thaw cycles exacerbate this bias in poly(A)-enriched samples but not in ribosomal RNA depleted samples. Thus, 3’ bias may indicate the severity of freeze-thaw induced signal degradation in poly(A)-enriched samples. If this 3’ bias is the root cause of the freeze-thaw induced instability in absolute and differential RNA-seq quantification, such instability may be avoided by replacing poly(A)-enrichment with ribosomal depletion during library preparation.