Quality control recommendations for RNASeq using FFPE samples based on pre-sequencing lab metrics and post-sequencing bioinformatics metrics

doi:10.21203/rs.3.rs-1428294/v1

Download PDF

Research Article

Quality control recommendations for RNASeq using FFPE samples based on pre-sequencing lab metrics and post-sequencing bioinformatics metrics

https://doi.org/10.21203/rs.3.rs-1428294/v1

This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

Background

Formalin-fixed, paraffin-embedded (FFPE) tissues have many advantages for identification of risk biomarkers, including wide availability and potential for extended follow-up endpoints. However, RNA derived from archival FFPE samples has limited quality. Here we identified parameters that determine which FFPE samples have the potential for successful RNA extraction, library preparation, and generation of usable RNAseq data.

Methods

We optimized library preparation protocols designed for use with FFPE samples using seven FFPE and Fresh Frozen replicate pairs, and tested optimized protocols using a study set of 130 FFPE biopsies from women with benign breast disease. Metrics from RNA extraction and preparation procedures were collected and compared with bioinformatics sequencing summary statistics. Finally, a decision tree model was built to learn the relationship between pre-sequencing lab metrics and qc pass/fail status as determined by bioinformatics metrics.

Results

Samples that failed bioinformatics qc tended to have low median sample-wise correlation within the cohort (Spearman correlation < 0.75), low number of reads mapped to gene regions (< 25 million), or low number of detectable genes (11,400 # of detected genes with CPM > 2). The median RNA concentration and pre-capture library Qubit values for qc failed samples were 18.7 ng/ul and 1.68 ng/ul respectively, which were significantly lower than those of qc pass samples (40.8 ng/ul and 5.58 ng/ul). We built a decision tree model based on input RNA concentration, input library qubit values, and achieved an F score of 0.803 in predicting QC status (pass/fail) of FFPE samples.

Conclusions

We provide a bioinformatics quality control recommendation for FFPE samples from breast tissue by evaluating bioinformatic and sample metrics. Our results suggest a minimum concentration of 25 ng/ul FFPE-extracted RNA for library preparation and 1.8 ng/ul pre-capture library output to achieve adequate RNA-seq data for downstream bioinformatics analysis.

FFPE

RNA-seq

quality control

breast tissue

RNA concentration

library concentration

decision tree

For decades, clinical biospecimens have been typically fixed in formalin then embedded in paraffin wax to make formalin-fixed paraffin-embedded (FFPE) tissue blocks for diagnosis and long-term storage. FFPE tissue archiving has many advantages, including room temperature stability, long-term storage, and suitability for subsequent immunohistochemical (IHC) analyses, which had led to use of FFPE in IHC-based biomarker investigations[1]. However, FFPE processing and tissue storage are known to result in highly degraded RNAs which limits gene expression-based biomarker discovery using RNA sequencing[2]. Transcriptional profiling by RNA sequencing (RNA-seq) is a powerful tool for genome wide quantification of RNA expression with high sensitivity that has been routinely used in breast cancer research and clinical diagnosis[3]. RNA-seq involves an enrichment step to remove the abundant ribosomal RNAs by either ribosomal depletion or Poly(A) selection. However, Poly(A) selection protocol is less suitable for low quality RNA derived from FFPE samples. During recent years, RNA library protocols tailored for FFPE samples have been developed, including the NEBNext rRNA Depletion and the TruSeq RNA Exome panel, although the relative performance of these methods with FFPE-derived RNA has not been published, and there are limited studies that provide insight for selection of FFPE samples of adequate quality.

Our study aim is to compare two commonly used RNA library preparation protocols for FFPE samples, and to provide a recommendation on RNA input metrics, including RNA concentration and library concentration, to achieve adequate RNA-seq data for downstream bioinformatics analysis

For the first part of the study, we evaluated two commonly used RNA library protocols for FFPE samples using seven paired FFPE and fresh frozen (FFzn) samples. All samples were prepared through both protocols and compared based on bioinformatics metrics, including alignment, SNP concordance, junction coverage and sample-wise correlation. For the second part of the study, we sequenced 130 benign breast disease (BBD) samples along with technical replicates in ten sequencing batches. Thorough bioinformatics quality control was performed to identify QC-failed samples. Finally, a decision tree model was constructed to correlate pre-sequencing metrics with QC status defined by bioinformatics metrics.

Study design

Institutional Review Board approval was obtained for research use of human samples in this project (#IRB 75–87). A pilot study was performed using FFPE and fresh frozen pairs for seven women diagnosed with benign breast disease to evaluate the performance of two library preparation protocol, Illumina’s TruSeq RNA Exome and NEBNext rRNA Depletion (Fig. 1a). To evaluate the precision of SNPs identified by the two protocols, we also performed whole exome sequencing (WES) for the three selected fresh frozen samples. The TruSeq Exome protocol exhibited better performance in bioinformatics metrics and was selected to process all study samples and technical controls in the main study. A total of 158 samples including study samples and technical controls were submitted for RNA extraction (Fig. 1b). Forty samples failed library preparation due to low RNA input quantity. The remaining samples were submitted for RNA sequencing in ten sequencing batches. Batches were designed so that samples with similar RNA quality were included in the same batch. This helped to minimize the potential sequencing bias toward high quality samples in the same batch. To examine the potential sequencing batching effect, the same two technical controls (FFPE and FFzn pair for the same subject) were included in each sequencing batch 1–7. For sequencing batch 8–10 where samples are of low RNA quality, we only included the FFPE technical control as the FFzn technical control would potentially attract more sequencing reads and bias the quantification of other low-quality study samples. Besides the two technical controls, we also included 11 study replicate samples in different sequencing batches. Thorough bioinformatics evaluations were performed to identify samples passing the qc metrics.

RNA quantitation and quality

Total RNA concentration was determined using the Qubit 2.0 Fluorometer and RNA HS Assay (Life Technologies Corp., Carlsbad, CA). RNA integrity was assessed and recorded with DV50, DV100, and DV200 values using the RNA 6000 Nano Kit on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA), but was not used for sample exclusion in the library preparation. DV values are a commonly used metric that represents the proportion of RNA fragments in a sample with greater length than the numeric value (i.e., DV200 equals the percentage of RNA fragments > 200 nucleotides)

RNA Exome Library Preparation and Sequencing

40–100ng of experimental FFPE RNA, FFPE control RNA, or fresh frozen control RNA was used for library preparation using the TruSeq RNA Library Prep for Enrichment and the Illumina Exome Panel- Enrichment Oligos kit (llumina, Inc., San Diego, CA). Following adaptor ligation and enrichment, the libraries were quantitated by Qubit and pooled for subsequent exome capture based on available yield. Up to a 4-plex pooling strategy was used for the exome capture, with capture groups consisting of 200ng, 100ng, 50ng, 40ng, and 30ng of input library for each sample in the pre-capture pool. Up to 12 samples (3 pools) were batched for sequencing and included a paired FFPE control RNA and fresh frozen control RNA in each batch. Following two rounds of hybridization to the capture probes, the pools were PCR amplified and purified using AMPure XP beads. The amplified and enriched libraries were quality assessed using a combination of the Qubit dsDNA HS Assay (Invitrogen), the Bioanalyzer DNA 7500 Assay (Agilent Technologies), and KAPA Library Quantification Kit for Illumina (KAPA Biosystems, Boston, MA). The three capture pools for each batch were combined in equal molar amounts and sequenced across 3 lanes of an Illumina NextSeq 500 High Output flowcell using paired 75bp reads. Each flowcell generated a minimum of 700 million reads passing filter.

rRNA Depletion Library Preparation and Sequencing

20-100ng of FFPE RNA and paired fresh frozen RNA was used for library preparation using the NEBNext rRNA Depletion Kit (Human/Mouse/Rat) and Ultra II Directional RNA Library Prep Kit for Illumina (New England Biolabs Inc., Ipswich, MA), following the manufacturers protocol for highly degraded (RIN ≤ 2) or intact (RIN > 7) samples respectively. Experimental FFPE RNA and paired fresh frozen RNA from the same patient was used if available using similar input amounts for each sample type. A total of 13 libraries were prepared, including six patient pairs. Libraries were quality assessed using a combination of the Qubit dsDNA HS Assay (Invitrogen), the Bioanalyzer DNA 7500 Assay (Agilent Technologies), and KAPA Library Quantification Kit for Illumina (KAPA Biosystems, Boston, MA). Libraries were combined in equal molar amounts and sequenced across three lanes of an Illumina NextSeq 500 High Output flowcell using paired 75bp reads. Each flowcell generated a minimum of 800 million reads passing filter.

Whole exome sequencing of fresh frozen samples

Three fresh frozen samples were submitted for whole exome sequencing at Mayo Clinic molecular genomic facility. In brief, paired-end libraries were prepared with 1.0 µg of genomic DNA in accordance with the manufacturer's protocol (Agilent Technologies, Inc, Santa Clara, Calif). Whole-exon capture was performed with 750 ng of the prepped library following the protocol for the SureSelect Human All Exon v5 + UTRs 75 Mb kit (Agilent Technologies, Inc). The purified capture products were then amplified with use of SureSelect Post-Capture Indexing forward and Index polymerase chain reaction reverse primers (Agilent Technologies, Inc) for 12 cycles. Concentration and size distribution of the completed captured libraries were assessed on Qubit (Invitrogen, Waltham, Mass) and Bioanalyzer DNA 1000 chip (Agilent Technologies, Inc). Libraries were sequenced at an average coverage of about 80× in accordance with standard protocol of the cBot and HiSeq 3000/4000 PE Cluster Kit (Illumina, San Diego, Calif). The flow cells were sequenced as 150 × 2 paired end reads on the HiSeq 4000 with the HiSeq 3000/4000 sequencing kit and collection software (HCS version 3.3.52; Illumina). Base calling was performed with Real-Time Analysis version 2.7.3 (Illumina). All procedures were performed in accordance with the manufacturer's instructions.

RNA-seq alignment and gene quantification

After sequencing procedure, raw FASTQ files were processed through Mayo’s internal MAP-RSeq pipeline[4] (Version 3.0). MAP-RSeq uses a variety of publicly available bioinformatics tools tailored by in-house developed methods. Briefly, the aligning and mapping of reads are performed using Star aligner[5] against hg38 genome build. The gene and exon counts are generated by FeatureCounts[6] using the gene definitions files from Ensembl v78. Quality control was carried out using RSeqQC[7]. Gene expression data was normalized to counts per million (CPM) using TMM method as implemented in edgeR[8] followed by log2 transformation.

Estimation of SNP confirmation rate and false positive rate

SNPs were identified using GATK haplotype caller[9] and further filtered by RVBoost[10]. For the pilot study, SNP confirmation rate (precision) was calculated for each mutation type (C > T, C > G, etc.) as:

\(SNP confirmation rate={SNP}_{rna}\bigcap {SNP}_{dna}/{SNP}_{rna}\)

(1)

, where SNP_RNA represents the SNPs identified in RNA-seq data and SNP_DNA represents the SNPs identified in WES data for the same set of samples. For both pilot and main study, false positive rate (FPR) between either technical or study replicate samples was calculated as:

\(FPR={N}_{MAD>lfc}/{N}_{total}\)

(2)

, where N_total denotes the total number of genes and N_MAD>lfc denotes the number of genes with maximum absolute difference (MAD) above a certain logarithm fold change cutoff.

Bioinformatics QC and model building

After thorough bioinformatics evaluation, samples meeting any of the below criteria were flagged as QC-fail:

Sample-wise median correlation smaller than 0.75
Gene mapped reads smaller than 25 million gene mapped reads,
Less than 11,400 # of detected genes with CPM > 2

A decision tree model was built based on CART Modeling via rpart R package to learn the relationship between pre-sequencing QC metrics, such as RNA qubit or pre-capture library qubit, and QC pass/fail status predicted by post sequencing bioinformatics metrics. Samples were split into training and testing set with a ratio of 7:3. Repeated cross validation were performed to optimal parameters (complexity parameter) during model training. Similarly, an alternative model based on logistic regression was also constructed:

\(\text{ln}\left(\frac{p\left(X\right)}{1-p\left(X\right)}\right)={\beta }_{0}+{\beta }_{1}X\)

(2)

All models were built based on Caret R package[11] and all statistical analysis was carried out in R under R version 4.0.3.

Evaluating FFPE library preparation kits using FFPE and fresh frozen replicates

We evaluated two RNA-seq library preparation protocols optimized for low quality, highly degraded samples (such as FFPE): Illumina’s TruSeq RNA Exome protocol and NEBNext rRNA Depletion protocol (Fig. 1a). FFPE and Fresh Frozen (FFzn) replicates for seven BBD patients were prepared using the two selected protocols and submitted for RNA sequencing as described in the Methods section. While the Depletion protocol generated more sequenced reads compared to RNA Exome, a significantly lower proportion of reads were mappable to genic or exon-exon junction regions for both FFzn and FFPE samples (Additional file 1), and captured a smaller number of canonical exon-exon junctions (Additional file 1). We also examined the two protocols in terms of their ability to accurately capture SNP genotypes. SNP confirmation rate (precision) was calculated for three FFzn samples by using SNPs identified from their corresponding DNA whole exome sequencing (WES) data (Additional file 1). The calculation was performed separately for six mutation categories (C > T, C > G, C > A, T > A, T > C, and T > G.). The Depletion protocol generated many false positive calls with consistently low SNP confirmation rate across all mutation categories. For the RNA Exome protocol, the SNP confirmation rate was significantly higher across different mutation categories with C > T being highest (p value < 2.2E-16). Finally, we compared the two protocols in terms of their correlation with data from the TruSeq protocol with PolyA selection using five FFzn samples (Additional file 1). For the Depletion protocol, only two of five samples successfully clustered by subject ID instead of library protocol. The RNA Exome protocol showed good concordance with the TruSeq PolyA data where all samples were clustered by subject ID regardless of their library protocol. Overall, the RNA Exome protocol showed superior performance compared to the Depletion protocol in terms of these bioinformatics metrics and was selected as the library protocol to process all samples in the main study (Fig. 1b).

Sample QC based on bioinformatics metrics

All study samples and technical controls were submitted for library preparation using RNA Exome protocol. Library concentration was gathered on the individual samples prior to hybridization capture and is hereafter referred to as the “pre-capture” library. Forty samples failed this step due to low pre-capture library output. The remaining samples were submitted for RNA sequencing in ten batches as detailed in the methods section (Fig. 1b). Bioinformatics quantification of gene expression was then performed, and qc metrics were collected, including sample-wise median correlation (median_cor), number of gene mapped reads (gene_reads), number of detectable genes with counts per million (CPM) larger than 2 (gene_cpm2). Median_cor was calculated for each sample as its median correlation with all other samples in the cohort, and was between 0.8 and 0.9 for most samples (Fig. 2a). Based on the 11 study sample replicates, we first evaluated the relationship between false positive rate (FPR) and median_cor (Fig. 2b). Replicates with lower median_cor tend to have higher FPR. For samples with extremely low median_cor, FPR decreased and plateaued around 20% likely due to a reduced number of detectable genes. Similar trends were observed between FPR and median_cor when applying an expression cutoff before calculating FPR. Due to the limited number of study replicates within median_cor range of 0.7 and 0.8, a median_cor value around the inflection point (0.75) was selected as a cutoff to identify qc failed samples. We next investigated the relationship between median_cor and gene_cpm2 (Fig. 2c). Samples with higher gene_cpm2 had better median_cor; a median_cor of 0.75 corresponding to a value of 11,400 for gene_cpm2 was selected as the threshold. Finally, we examined the relationship between gene_reads and gene_cpm2 (Fig. 2d). Samples with increased gene_reads had higher gene_cpm2. The trend saturated with gene_cpm2 around 13,000, with a gene_cpm2 of 11,400 (roughly 85% of saturation) corresponding to 25 million gene_reads. Samples were therefore identified as QC failed when meeting any of the following criteria: 1) low median sample-wise correlation within the cohort (median_cor < 0.75); 2) low number of detectable genes (gene_cpm2 < 11,400); 3) low number of reads mapped to gene regions (gene_reads < 25 million).

Relationship between pre-sequencing lab metrics and post sequencing bioinformatics metrics

We next examined the relationship between pre-sequencing RNA metrics and post sequencing bioinformatics metrics. Among those pre-sequencing lab metrics, pre-capture library qubit values showed high correlation with bioinformatics metrics, including median_cor, gene_cpm2 & gene_reads (Additional file 2). Samples failing bioinformatics qc had significantly lower library qubit values compared to qc-passed samples (p value = 2.8E-6). Figure 3 shows the detailed relationship of library qubit values with bioinformatics metrics, including median_cor, gene_cpm2 and gene_reads, which were all positively correlated with library qubit (Fig. 3a-c). Local failure rate was calculated based on these bioinformatics metrics under different library qubit values. As shown in Fig. 3d, local failure rate decreased with increasing library qubit values and saturated at 20% with library qubit value around 2–4 ng/ul. We observed a similar trend with input RNA qubit (Additional file 3). Local failure rate decreased with increasing RNA qubit values and saturated at 25% with an RNA qubit value ~ 20–30 ng/ul. The recommended quantities of starting FFPE material according to the vendor corresponds to a range of DV200 values, with the lowest recommended quality at DV200 of 30–50%, and lowest corresponding input of 4.7 ng/ul. Recommendations for input RNA using DV50 or DV100 values has not been evaluated by the vendor.

Prediction of QC failed samples based on pre-sequencing metrics

We next built a decision tree model to learn the relationship between pre-sequencing lab metrics and qc pass/fail status as determined by bioinformatics metrics. Samples were split into training and testing sets with a ratio of 7:3. Repeated cross validation was used to determine the optimal ‘complexity’ parameter used to build the training model (Fig. 4a). We found that pre-capture library concentration had higher feature importance compared to RNA concentration (Fig. 4b). Finally, we evaluated the performance of the training model by applying it to the testing set and were able to achieve an F score of 0.803. As shown in Fig. 4c, we grouped the samples into three categories based on RNA and pre-capture library concentrations: 1. Low/marginal quality (RNA qubit < 25 ng/ul);2. Intermediate quality (RNA qubit > = 25 ng/ul & library qubit < 1.8 ng/ul); 3. Good quality (RNA qubit > = 25 ng/ul & library qubit > = 1.8 ng/ul). The decision tree-based model was chosen due to its high interpretability. Logistic regression analysis on the model achieved a similar performance in terms of F score (0.818).

False positives evaluation based on FFPE and fresh frozen replicates

We also evaluated the reproducibility across sequencing batches using FFPE and FFzn replicates (Additional file 4). As expected, FFPE replicates had higher overall FPR compared to FFzn replicates (10.4% vs 4.4%). By applying an expression cutoff of cpm > 2, FPR for both FFPE and FFzn replicates decreased to below 1% (1.8e-3 vs 6.8e-4). We further investigated the relationship between MAD (Max Absolute Difference) and gene-level features including gene length and GC content for FFPE replicates. As shown in Additional file 4, shorter genes were more variable and had larger MAD compared to longer genes. GC content had a moderate positive correlation with MAD indicating that genes with high GC content were more likely to be influenced by FFPE procedure. In summary, FFPE replicates showed similar reproducibility as FFzn replicates across sequencing batches. Genes with short length or high GC content are more likely to be influenced by FFPE procedure.

In this study, we evaluated two commonly used RNA library protocols for FFPE samples: RNA exome capture and rRNA-depletion, using seven paired FFPE-FFzn samples. Samples processed using the RNA exome capture protocol showed a higher percentage of gene mapped reads, captured a higher number of canonical junctions, generated better SNP concordance rate and demonstrated better concordance with TruSeq PolyA data. Next, we sought to identify pre-sequencing metrics that could be used to predict sample pass/fail status based on post-sequencing bioinformatics metrics. All study samples along with replicate samples were processed using the RNA exome protocol. Three bioinformatics metrics were determined to identify qc-failed samples, including sample-wise median correlation (median_cor), number of gene mapped reads (gene_reads), number of detectable genes with counts per million (CPM) larger than 2 (gene_cpm2). Finally, a decision tree-based model was built to examine the relationship between pre-sequencing lab metrics and qc-status as defined by post-sequencing bioinformatics metrics. Based on the model, we recommend a minimum of 25ng/ul for RNA concentration and 1.8 ng/ul for pre-capture library concentration for FFPE samples to generate good quality RNA-seq data for bioinformatics analysis. We also demonstrated that FFPE replicates have similar reproducibility compared to FFzn replicates across sequencing batches. However, genes with short length or high GC content are more likely to be influenced by the FFPE procedure.

Clinical biospecimens are typically stored as FFPE blocks, representing an invaluable source of material for biomedical research. FFPE blocks enable prolonged storage of clinical samples, preserving both tissue morphology and nucleic acids information. However, FFPE processing and tissue storage have been shown to affect RNA quality, thus limiting gene expression quantification by technologies like RNA sequencing. Our study provides a guideline for future research that utilizes FFPE samples for RNA-seq. By following these recommendations, sequencing samples with RNA and library input higher than our recommended values will not only help yield a better success rate for RNA sequencing, but also help to prevent unnecessary cost for sequencing.

There are several limitations for our study. Firstly, we benchmarked two commonly used library preparation protocols for FFPE samples using bioinformatics metrics, including SNP confirmation rate. SNP confirmation rate (precision) was calculated as the percentage of true SNPs (called by WES data) within the SNPs identified by RNA-seq for the same sample. This does not consider RNA specific mutations introduced by events like RNA editing. However, RNA editing events are considered very rare and the expected SNP confirmation rate should be very close to our calculation in Additional file 1 [12]. Secondly, when performing bioinformatics QC using replicate samples, due to the limited number of replicate samples with median_cor around 0.7 and 0.8 range, we arbitrarily selected a cutoff value (0.75) around the inflection point of the loess-fitted curve between median_cor and FPR. This criterion will potentially affect our definition of qc pass/fail as determined by those bioinformatics metrics. To provide the user with more flexibility in selecting cutoffs for those bioinformatics metrics, we have provided a documentation that enables the end-user to define customized cutoffs based on their preference of stringency: https://github.com/Liuy12/FFPEinput. Thirdly, the concentration of RNA in the original samples is highly dependent on the amount of input tissue, original handling and storage of the sample, the extraction method used, and perhaps most importantly, the elution volume used following extraction and purification. It is difficult to compare these amounts across samples or studies unless all these factors are controlled. The library concentrations are more comparable since they are based on a consistent total RNA amount going into the library prep. Other than RNA and library input metrics, we also investigated other pre-sequencing lab metrics including DV50, DV100, DV200 values. The recommended quantities of starting FFPE material according to the vendor corresponds to a range of DV200 values, with the lowest recommended quality at DV200 of 30–50%. Recommendations for input using DV50 or DV100 values has not been evaluated by the vendor. Due to the RNA input limit, we were only able to quantify around 70% of all study samples for DV metrics. Based on those limited data, we observed that DV50 is highly correlated with DV100 values. Both DV50 and DV100 have moderate correlation with DV200, a conventional metric for measuring RNA quality. DV50 value is identified as the top predictive feature for sample failure using a recursive feature elimination algorithm. However, including DV50 in building the decision tree model does not lead to improved performance compared to using RNA/library input metrics alone. We suspect that this could be due to the decreased sample size with available DV values.

We benchmarked two commonly used library preparation protocols for FFPE samples. The TruSeq RNA exome capture protocol showed a superior performance. We also provide a common bioinformatics quality control recommendation for FFPE samples. Based on our defined bioinformatics criteria, we recommend a minimum of 25ng/ul for RNA concentration and 1.8 ng/ul for pre-capture library concentration for FFPE samples to achieve adequate RNA-seq data for downstream bioinformatics analysis.

FFPE

Formalin-fixed paraffin-embedded

IHC

Immunohistochemical

RNA-seq

RNA sequencing

FFzn

Fresh frozen

BBD

Benign breast disease

WES

Whole exome sequencing

median_cor

Sample-wise median correlation

gene_reads

Number of gene mapped reads

CPM

Counts per million

gene_cpm2

Number of detectable genes with CPM larger than 2

FPR

False positive rate

DV200

The percentage of RNA fragments > 200 nt

DV100

The percentage of RNA fragments > 100 nt

DV50

The percentage of RNA fragments > 50 nt

MAD

Max absolute difference

Ethics approval and consent to participate

This work was approved by the Mayo Clinic Institutional Review Board (IRB #75–87). As required by Mayo Clinic Institutional Review Board policies, all subject provided written research consent or authorization for use of their tissues and data. All methods were performed in accordance with the relevant guidelines and regulations.

Consent for publication

Not applicable

Availability of data and materials

Documentation is provided to enable the end-user to define customized cutoffs based on their preference of stringency: https://github.com/Liuy12/FFPEinput. Raw sequencing data are available from the corresponding author on reasonable request.

Competing interests

The authors declare that they have no competing interests

Funding

This work was supported by a grant from the National Institutes of Health (R01 CA187112) and by the Mayo Clinic Breast SPORE Biospecimen Resource for Breast Cancer Research (CA116201). The funding bodies had no role in the design of the study, collection, analysis, or interpretation of data, or in writing the manuscript.

Authors' contributions

YL performed bioinformatics analysis; participated in interpretation of results and manuscript drafting. AB participated in interpretation of results and manuscript drafting. SJW supervised statistical analyses and interpretation; participated in study design and drafting the manuscript. BWH performed RNA sequencing and generation of lab metrics. SJM, MEP & LL supervised RNA sequencing; participated in interpretation of results and manuscript drafting; SJM and MLS performed laboratory analysis of samples; EPH, RAV and TLH participated in data retrieval and statistical interpretation. MF and JMC1 participated in study design, review of tissue samples, and interpretation of results. DCR and JMC2 supervised laboratory analysis of samples, and participated in study design and interpretation of results. ACD supervised and coordinated project efforts; participated in study design, interpretation of results, and manuscript drafting. CW supervised bioinformatics analyses and interpretation; participated in study design and manuscript drafting. All authors read and approved the final manuscript

Acknowledgements

Not applicable

Ludyga N, Grunwald B, Azimzadeh O, Englert S, Hofler H, Tapio S, Aubele M: Nucleic acids from long-term preserved FFPE tissues are suitable for downstream analyses. Virchows Arch 2012, 460(2):131–140.
Jovanovic B, Sheng Q, Seitz RS, Lawrence KD, Morris SW, Thomas LR, Hout DR, Schweitzer BL, Guo Y, Pietenpol JA et al: Comparison of triple-negative breast cancer molecular subtyping using RNA from matched fresh-frozen versus formalin-fixed paraffin-embedded tissue. BMC Cancer 2017, 17(1):241.
Byron SA, Van Keuren-Jensen KR, Engelthaler DM, Carpten JD, Craig DW: Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat Rev Genet 2016, 17(5):257–271.
Kalari KR, Nair AA, Bhavsar JD, O'Brien DR, Davila JI, Bockol MA, Nie J, Tang X, Baheti S, Doughty JB et al: MAP-RSeq: Mayo Analysis Pipeline for RNA sequencing. BMC Bioinformatics 2014, 15:224.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15–21.
Liao Y, Smyth GK, Shi W: featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014, 30(7):923–930.
Wang L, Wang S, Li W: RSeQC: quality control of RNA-seq experiments. Bioinformatics 2012, 28(16):2184–2185.
Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26(1):139–140.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M et al: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011, 43(5):491–498.
Wang C, Davila JI, Baheti S, Bhagwate AV, Wang X, Kocher JP, Slager SL, Feldman AL, Novak AJ, Cerhan JR et al: RVboost: RNA-seq variants prioritization using a boosting method. Bioinformatics 2014, 30(23):3414–3416.
Kuhn M: Building Predictive Models in R Using the caret Package. Journal of Statistical Software 2008, 28(5):1–26.
Li S, Mason CE: The pivotal regulatory landscape of RNA modifications. Annu Rev Genomics Hum Genet 2014, 15:127–150.

No competing interests reported.

Additionalfile1.pdf
Additional file 1 – Bioinformatics evaluation of library preparation protocols using FFPE and FFzn pairs (.PDF)
Additionalfile2.pdf
Additional file 2 – Pair-wise scatter plot among pre-sequencing lab metrics and post sequencing bioinformatics metrics (.PDF)
Additionalfile3.pdf
Additional file 3 – Correlation of pre-capture library concentration with bioinformatics metrics for all samples excluding FFzn controls (.PDF)
Additionalfile4.pdf
Additional file 4 – Evaluation of false positives based on FFPE and FFzn replicates (.PDF)

Download PDF

Editorial decision: Major revision
15 Apr, 2022
Reviews received at journal
08 Apr, 2022
Reviewers agreed at journal
27 Mar, 2022
Reviewers agreed at journal
25 Mar, 2022
Reviewers invited by journal
23 Mar, 2022
Editor assigned by journal
18 Mar, 2022
Editor invited by journal
17 Mar, 2022
Submission checks completed at journal
17 Mar, 2022
First submitted to journal
07 Mar, 2022

You are reading this latest preprint version

Quality control recommendations for RNASeq using FFPE samples based on pre-sequencing lab metrics and post-sequencing bioinformatics metrics

Status:

Version 1

Abstract

Figures

Background

Methods

Study design

RNA quantitation and quality

RNA Exome Library Preparation and Sequencing

rRNA Depletion Library Preparation and Sequencing

Whole exome sequencing of fresh frozen samples

RNA-seq alignment and gene quantification

Estimation of SNP confirmation rate and false positive rate

Bioinformatics QC and model building

Results

Evaluating FFPE library preparation kits using FFPE and fresh frozen replicates

Sample QC based on bioinformatics metrics

Relationship between pre-sequencing lab metrics and post sequencing bioinformatics metrics

Prediction of QC failed samples based on pre-sequencing metrics

False positives evaluation based on FFPE and fresh frozen replicates

Discussion

Conclusions

Abbreviations

Declarations

References

Additional Declarations

Supplementary Files

Status:

Version 1