Sequence-matching adapter trimmers generate consistent quality and assembly metrics for Illumina sequencing of RNA viruses

doi:10.21203/rs.3.rs-4248995/v1


Short Report


This work is licensed under a CC BY 4.0 License


Trimming adapters and low-quality bases from next-generation sequencing (NGS) data is crucial for optimal analysis. We evaluated six trimming programs, implementing five different algorithms, for their effectiveness in trimming adapters and improving quality, contig assembly, and single-nucleotide polymorphism (SNP) quality and concordance for poliovirus, severe acute respiratory syndrome coronavirus 2 (SC2) and norovirus paired data sequenced on Illumina iSeq and MiSeq platforms. Trimmomatic and BBDuk effectively removed adapters from all datasets, unlike FastP, AdapterRemoval, SeqPurge, and Skewer. All trimmers improved read quality (Q≥30, 87.8−96.1%) compared to raw reads (83.6−93.2%). Trimmers implementing the traditional sequence-matching algorithm (Trimmomatic and AdapterRemoval) and the overlapping algorithm (FastP) retained the highest-quality reads. While all trimmers improved the maximum contig length and genome coverage for iSeq and MiSeq viral assemblies, BBDuk-trimmed reads assembled into the shortest contigs. SNP concordance was consistently high (97.7−100%) across trimmers. However, BBDuk-trimmed reads had the lowest-quality SNPs. Overall, the two adapter trimmers that implemented the traditional sequence-matching algorithm performed consistently across the viral datasets analyzed. Our findings guide software selection and inform future versatile trimmer development for viral genome analysis.

Keywords: Next-generation Sequencing, Illumina, Quality Control, Adapter Trimming, RNA Viruses, De Novo Assembly

Next-generation sequencing (NGS) has revolutionized infectious disease research and public health, enabling faster pathogen discovery, surveillance, and response (1–4), at a lower cost and higher throughput than traditional Sanger sequencing (5). NGS sample preparation involves attaching adapters and unique barcodes to target genomic DNA or cDNA. These sequences are vital on the Illumina NGS platform for flow cell binding, cluster generation, and demultiplexing of target genome reads (6,7). When target DNA fragments are shorter than the read length set by the number of sequencing cycles, sequencing may extend into the adapters, resulting in adapter-contaminated reads (8). Effective adapter trimming is essential for accurate reference mapping, de novo assembly, and SNP calling.

This study used poliovirus, severe acute respiratory syndrome coronavirus 2 (SC2), and norovirus paired reads sequenced on Illumina iSeq and MiSeq platforms (8) to evaluate the performance of six adapter and quality trimming programs, implementing five algorithms: i) sequence-matching using global alignment with no gaps, ii) sequence overlapping with mismatches, iii) probabilistic overlapping, iv) k-mer based sequence matching, and v) bit-masked k-difference algorithm with mismatches, gaps, and indels (Supplementary Material section 1.1).
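To make the first algorithm class concrete, the following is a minimal sketch of ungapped adapter matching with a mismatch tolerance. It is not the implementation used by Trimmomatic or AdapterRemoval; the function names, the 10% mismatch rate, and the 3-base minimum overlap are illustrative assumptions, and the adapter string is the commonly used Illumina adapter prefix.

```python
def find_adapter_start(read, adapter, max_mismatch_rate=0.1, min_overlap=3):
    """Return the leftmost read position where the adapter aligns without gaps.

    The adapter may run off the 3' end of the read (partial adapter).
    Thresholds are illustrative assumptions, not any tool's defaults.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        mismatches = sum(
            1 for i in range(overlap) if read[start + i] != adapter[i]
        )
        if mismatches <= max_mismatch_rate * overlap:
            return start
    return None


def trim_adapter(read, adapter, **kwargs):
    """Remove the adapter and everything after it; leave clean reads unchanged."""
    start = find_adapter_start(read, adapter, **kwargs)
    return read if start is None else read[:start]


ADAPTER = "AGATCGGAAGAGC"  # common Illumina adapter prefix
contaminated = "ACGTACGTAC" + ADAPTER[:10]  # 10-base insert followed by adapter
print(trim_adapter(contaminated, ADAPTER))
```

Because no gaps are allowed, a single inserted or deleted base inside the adapter shifts every downstream position and the match is lost, which is the situation the k-difference approach is designed to handle.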

Published adapter trimming software programs were selected based on their unique algorithms, sensitivity, specificity, positive and negative predictive values, and speed. These trimmers included Trimmomatic v0.39 (9) and AdapterRemoval v2.2.2 (10) for sequence-matching, FastP v0.20.1 (11) for sequence-overlapping, SeqPurge v2022_07 (12) for probabilistic overlapping, BBDuk v38.90, a tool included in the BBMap package (https://sourceforge.net/projects/bbmap/), for k-mer-based matching, and Skewer v0.2.2 (13) for the k-difference matching algorithm (Fig. S1). Parameter thresholds for adapter identification, quality trimming based on quality scores (Q), and allowed mismatches for read alignments were standardized across trimmers (Supplementary Methods section 2.1 and Table S2).
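The overlap-based approach can be illustrated with a short sketch. When the insert is shorter than the read length, the first bases of read 1 and the reverse complement of the first bases of read 2 cover the same insert, so anything beyond that shared length is adapter. The code below is a simplified illustration under that assumption, not the FastP or SeqPurge implementation; the function names and the thresholds (8-base minimum overlap, one mismatch) are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def infer_insert_length(r1, r2, min_overlap=8, max_mismatches=1):
    """Return the shortest insert length at which both reads agree, else None.

    For an insert of length L shorter than the reads, r1[:L] equals the
    reverse complement of r2[:L]. Thresholds are illustrative assumptions.
    """
    for length in range(min_overlap, min(len(r1), len(r2)) + 1):
        candidate = reverse_complement(r2[:length])
        mismatches = sum(a != b for a, b in zip(r1[:length], candidate))
        if mismatches <= max_mismatches:
            return length
    return None


def trim_pair(r1, r2, **kwargs):
    """Cut both reads back to the inferred insert; leave other pairs unchanged."""
    length = infer_insert_length(r1, r2, **kwargs)
    if length is None:
        return r1, r2
    return r1[:length], r2[:length]


insert = "ACGTTGCAAGGCTA"  # hypothetical 14-base insert
read1 = insert + "AGATCG"  # read-through into the adapter
read2 = reverse_complement(insert) + "AGATCG"
print(trim_pair(read1, read2))
```

This approach needs no adapter sequence at all, but it only works on paired reads whose mates both survive, which is consistent with overlap-based tools behaving differently on single reads.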

Libraries prepared from random cDNA of 13 poliovirus clinical isolates and amplicons generated from eight SC2-positive nasopharyngeal swabs and seven norovirus-positive stool samples were sequenced using Illumina 300-cycle (2 × 150 bp, paired-end) MiSeq v2 Micro and iSeq i1 kits following standardized protocols (Supplementary Materials section 2.2). Raw MiSeq and iSeq data were demultiplexed onboard the instrument with adapter trimming disabled. The sequenced viral reads were then processed through the selected trimmers.

Trimmer performance was evaluated by comparing read statistics for raw versus trimmed datasets, including percent residual adapters, read count, length, and base quality (Q≥30). The assembly statistics compared were N50, defined here as the length of the shortest contig such that contigs of equal or greater length cover at least 50% of the complete (reference) genome analyzed; maximum contig length (maxContig); “genome coverage”, defined as the percentage of bases in the reference covered by the maxContig, as adapted from Illumina (14); and single nucleotide polymorphism (SNP) quality and SNP concordance. The “genome coverage” was calculated as: genome coverage (%) = maximum contig length (bp) / viral reference genome length (bp) × 100.
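The two assembly metrics defined above can be written out as a short sketch. Note that N50 here is measured against the reference genome length, as the text states, rather than against the total assembly length. The function names and the example contig and reference lengths are hypothetical.

```python
def n50_vs_reference(contig_lengths, reference_length):
    """Length of the shortest contig such that it and all longer contigs
    together cover at least 50% of the reference genome length.

    Returns None when all contigs together cover less than half the reference.
    """
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= 0.5 * reference_length:
            return length
    return None


def genome_coverage(contig_lengths, reference_length):
    """maxContig length as a percentage of the reference genome length."""
    return 100 * max(contig_lengths) / reference_length


contigs = [4000, 2000, 1000, 500]  # hypothetical contig lengths (bp)
reference = 10000                  # hypothetical reference length (bp)
print(n50_vs_reference(contigs, reference), genome_coverage(contigs, reference))
```

With these definitions, an assembly fragmented into many short contigs can have a low genome coverage even when the contigs jointly span most of the genome, because only the single longest contig counts.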

Sequence read statistics were calculated using SeqKit v0.10.1 (15), and quality was assessed using FastQC v0.11.5 (16) and MultiQC v1.9 (17). Raw and trimmed reads per trimmer program were separately assembled de novo using SPAdes v3.15.3 (18). SNP calling was performed using BCFtools v1.10.2 (19) and appropriate references per virus as described in Supplementary Methods sections 2.3–2.5.
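For readers unfamiliar with the read statistics reported here, the sketch below shows how read count, mean read length, and the percentage of bases with Q≥30 can be derived from FASTQ records. It is an illustration only, not a substitute for SeqKit or FastQC; it assumes four-line FASTQ records with Phred+33 quality encoding, and the function name and example records are hypothetical.

```python
import io


def read_stats(handle, q_threshold=30, phred_offset=33):
    """Summarize a four-line-per-record FASTQ stream (Phred+33 assumed)."""
    n_reads = n_bases = n_high_quality = 0
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file (or trailing blank line)
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        qualities = handle.readline().strip()
        n_reads += 1
        n_bases += len(sequence)
        n_high_quality += sum(
            1 for char in qualities if ord(char) - phred_offset >= q_threshold
        )
    return {
        "reads": n_reads,
        "bases": n_bases,
        "mean_length": n_bases / n_reads if n_reads else 0.0,
        "pct_q30": 100 * n_high_quality / n_bases if n_bases else 0.0,
    }


# Two hypothetical records: 'I' encodes Q40 and '#' encodes Q2.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nII##II\n")
print(read_stats(example))
```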

Results between trimmers and Illumina sequencing platforms were statistically compared using the Wilcoxon signed-rank test with Bonferroni correction, and data were visualized using ggplot2 in R v4.0.2 (https://www.r-project.org/).
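The statistical comparison can be sketched as follows. The analysis itself was run in R; this Python version is only an illustration of the technique and is not the authors' code. It computes an exact two-sided Wilcoxon signed-rank p-value by enumerating the distribution of the signed-rank sum (practical for small paired samples such as these), then applies a Bonferroni correction. The function names and the example values are hypothetical.

```python
def wilcoxon_signed_rank_p(x, y):
    """Exact two-sided p-value for paired samples; zero differences are dropped.

    Tied absolute differences receive average ranks. Ranks are doubled
    internally so that they stay integers for the enumeration.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled_ranks[order[k]] = i + j + 2  # 2 x average of ranks i+1..j+1
        i = j + 1
    w_plus = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    total = sum(doubled_ranks)
    # counts[s] = number of sign assignments whose positive-rank sum equals s
    counts = [0] * (total + 1)
    counts[0] = 1
    for rank in doubled_ranks:
        for s in range(total, rank - 1, -1):
            counts[s] += counts[s - rank]
    w_min = min(w_plus, total - w_plus)
    tail = sum(counts[: w_min + 1])
    return min(1.0, 2 * tail / 2 ** n)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


raw = [83.6, 85.1, 90.2, 88.7, 93.2, 86.4]      # hypothetical %Q30, raw reads
trimmed = [87.8, 89.0, 94.5, 92.1, 96.1, 90.3]  # hypothetical %Q30, trimmed
p = wilcoxon_signed_rank_p(raw, trimmed)
print(p, bonferroni([p, 0.04, 0.5]))
```

With six pairs that all change in the same direction, the smallest attainable two-sided exact p-value is 2/64, so small sample sizes limit how significant any single comparison can be after correction.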

Residual adapters

Compared to MiSeq, iSeq raw reads had significantly more adapters for all three viral datasets analyzed (p≤1.35×10^-3) (Fig. 1 and Table S3). After trimming, residual adapters were still detected in AdapterRemoval-, FastP-, and SeqPurge-trimmed single reads and in Skewer-trimmed paired reads, with FastP retaining the most adapters for poliovirus (0.038−12.54%), SC2 (0.043−13.06%), and norovirus trimmed reads (0.32−3.51%) (Fig. 1). AdapterRemoval left more adapters in MiSeq than iSeq poliovirus and SC2 trimmed reads (p<0.015). SeqPurge only left detectable adapters in SC2 single reads.

Differences in raw versus trimmed read statistics

Overall, iSeq and MiSeq raw reads showed similar mean total read (paired and single) counts, paired read counts, base counts, and read lengths, except MiSeq generated more SC2 raw reads and bases (p=0.035, Table S4). The iSeq generated more high-quality raw reads for poliovirus and SC2 than MiSeq (p≤1.09×10^-3), while no differences were observed for noroviruses.

After trimming, all trimmers output similar counts of total reads, read pairs, and bases for poliovirus, SC2, and norovirus trimmed reads (Tables S5−S7), except BBDuk, which had significantly fewer bases for SC2 (p<0.028, Table S6). BBDuk also retained the shortest trimmed reads for all viruses compared to other trimmers (p≤3.12×10^-5, Fig. 2, Tables S5−S7). SeqPurge and Skewer consistently output longer trimmed reads than Trimmomatic, AdapterRemoval, and FastP across viruses and sequencers (Fig. S8−S10, panels D and J).

The iSeq poliovirus and SC2 trimmed datasets had significantly fewer paired reads compared to the raw datasets (p<0.012, Tables S4−S6, Fig. S8B and S9B), with Trimmomatic, AdapterRemoval, FastP, and BBDuk consistently retaining fewer trimmed read pairs than raw reads (p<0.027) for both poliovirus and SC2. Also, poliovirus and SC2 trimmed datasets had significantly fewer bases compared to raw datasets (p<5.44×10^-4). Overall, trimmed reads were shorter but had more high-quality bases (82.41−96.2% with Q≥30) than raw reads (77.74−93.61%) for poliovirus, SC2, and noroviruses (p<3.75×10^-3, Tables S5−S7, Fig. S8−S10, panels E, F, K and L). Additionally, trimmers preserved longer MiSeq poliovirus and SC2 trimmed reads than iSeq (p≤5.59×10^-3, Fig. 2, Table S4), and more high-quality iSeq than MiSeq trimmed reads for all three viruses (p≤0.035) (Table S4).

Differences in trimmed read quality

Overall, AdapterRemoval, Trimmomatic, and FastP consistently produced trimmed reads with a higher percentage of quality bases (Q≥30, 93.15−96.7%) than SeqPurge, BBDuk, and Skewer (87.73−95.72%) (Tables S5-S7 and S11, Fig. S8-S10, panels E, F, K and L). Specifically, BBDuk, SeqPurge, and Skewer retained significantly fewer quality trimmed iSeq reads across all viruses (p<7.9×10^-3) and MiSeq norovirus reads (p<0.024) compared to other trimmers. Only AdapterRemoval retained significantly more quality MiSeq SC2 trimmed reads than BBDuk and SeqPurge (p<0.016), and no quality differences were observed for MiSeq poliovirus trimmed reads (p>0.088).

Overall, trimmers output more high-quality (Q≥30) iSeq than MiSeq SC2 and norovirus trimmed reads (p<0.035), with no platform differences for poliovirus trimmed reads (Table S4).

De novo assembly statistics

All trimmers except BBDuk improved N50 and maxContig for assemblies across viral datasets compared to raw reads. After trimming, the most pronounced differences in assembly statistics were observed for poliovirus and SC2 datasets. Notably, assemblies of BBDuk-trimmed poliovirus and SC2 reads had the lowest N50 (p<0.037, Table S12) and maxContig (p<7.83×10^-3, Table S13), achieving only 8−39.9% genome coverage compared to raw reads (8.8−87.5%) and other trimmers (54.8−98.9%) (Table 1). Trimmed poliovirus reads assembled into long contigs, significantly improving genome coverage compared to raw read assemblies, from 35.7% to 98.9% for iSeq FastP-trimmed reads and from 87.5% to 95.6% for MiSeq AdapterRemoval-trimmed reads (Table 1). Assemblies from norovirus trimmed reads showed no significant differences.

MiSeq and iSeq showed comparable mean N50 and maxContig for SC2 and norovirus trimmed reads. However, FastP-trimmed iSeq poliovirus reads assembled longer contigs than MiSeq reads (p=0.014, Table S14).

Single nucleotide polymorphism (SNP) quality and concordance

There were no differences in SNP quality for SC2 and norovirus datasets across the trimmers. However, for poliovirus datasets, BBDuk-trimmed read assemblies had lower mean SNP quality compared to other trimmers (Table S15).

Illumina iSeq and MiSeq read assemblies identified SNPs with similar quality, ranging from 3 to 228 for all viruses (Table S14). SNP concordance across trimmers was high (97.7−100%) for both iSeq and MiSeq viral datasets; however, BBDuk-trimmed read assemblies had 2−8 unique SNPs relative to other trimmers (Fig. S16).
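One way to think about concordance and unique SNPs is as set operations on the SNP calls from each trimmer. The sketch below assumes concordance between two call sets is the shared calls divided by all calls made by either (the exact definition used in this study is given in the Supplementary Methods and may differ), and that a "unique" SNP is one called from a single trimmer's reads only. The trimmer labels, SNP tuples, and function names are hypothetical.

```python
def pairwise_concordance(calls_a, calls_b):
    """Percentage of SNP calls shared by two call sets (assumed definition)."""
    union = calls_a | calls_b
    if not union:
        return 100.0
    return 100 * len(calls_a & calls_b) / len(union)


def unique_snps(calls_by_trimmer):
    """For each trimmer, the SNPs that no other trimmer's reads produced."""
    result = {}
    for name, snps in calls_by_trimmer.items():
        others = set().union(
            *(s for other, s in calls_by_trimmer.items() if other != name)
        )
        result[name] = snps - others
    return result


# Hypothetical calls as (position, reference base, alternate base) tuples.
shared = {(100, "A", "G"), (250, "C", "T")}
calls = {
    "trimmerA": set(shared),
    "trimmerB": set(shared),
    "trimmerC": shared | {(300, "G", "A")},
}
print(pairwise_concordance(calls["trimmerA"], calls["trimmerB"]))
print(unique_snps(calls))
```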

We tested six trimming software programs on viral sequencing data generated using Illumina iSeq and MiSeq platforms. Trimmomatic and BBDuk produced the cleanest trimmed reads with the least residual adapters for poliovirus, SARS-CoV-2 (SC2) and norovirus datasets. Viral reads trimmed using FastP, AdapterRemoval, SeqPurge (SC2 single-reads only), and Skewer exhibited varying levels of residual adapters, with FastP-trimmed reads retaining the highest percentage (0.038%–13.06%) across viral datasets. Our results align with a previous study reporting low levels of residual adapters in human cancer data trimmed using AdapterRemoval (0.4%), Skewer (0.1%), and Trimmomatic (<1.0×10^-5%) (12). In contrast to our study, high numbers of residual adapters were reported in ChIP-seq human H3K4me1 data trimmed using BBDuk (37.2%) and Trimmomatic (57.7%) (20). These differences in adapter trimming performance likely depend on specimen type, adapter contamination levels, and trimming settings. For instance, AdapterRemoval v2.2.2 was reported to be less specific when trimming single reads with multiple or short (<12 bp) adapters (21). FastP-trimmed data showed increased residual adapters when allowed mismatches exceeded four (22), possibly due to overlooking multiple or interweaved adapters, as FastP assumes only one adapter sequence exists at read tails (11).

Overall, all trimmers retained a similar number of total reads, paired reads, and bases for poliovirus, SC2 (except that BBDuk retained fewer SC2 bases), and norovirus datasets. This aligns with a previous study analyzing human cancer genes, where SeqPurge, AdapterRemoval, Trimmomatic, and Skewer retained a similarly high percentage (99.9%) of raw read pairs (12). However, another study analyzing RNA-Seq reads from Drosophila simulans gonads and carcasses showed that Skewer retained more usable RNA-Seq read pairs (20% of raw reads) than Trimmomatic (14%) and AdapterRemoval (13%) (13). All trimmers significantly improved data quality (Q≥30 = 87.73−96.07%) compared to raw reads (83.55−93.17%), with AdapterRemoval and Trimmomatic (traditional sequence-matching algorithm) and FastP (overlapping algorithm) producing reads with the highest quality. These tools’ better performance could be attributed to their ability to simultaneously compare read-to-read and adapter-to-read alignments (9–11), effectively removing poor-quality bases. Taken together, the variation in adapter trimming outcomes observed across studies is likely due to differences in the type of data sequenced (human versus virus) and trimming parameters used.

For raw reads, the Illumina iSeq had more detectable adapters than MiSeq (p≤0.001), likely due to differences in their chemistry, workflow, and flow cell mechanisms, which may bias the average insert length (21). Despite no platform-based differences in the number of trimmed reads and bases, trimmers retained longer MiSeq reads, and more high-quality iSeq reads, possibly because iSeq raw datasets required more trimming to remove adapters.

Differences in assembly metrics, N50 and maxContig, were observed between sequencing platforms only for poliovirus, where raw and FastP-trimmed iSeq read assemblies had higher N50 and maxContig values than MiSeq reads. The most pronounced differences in assemblies were observed between trimmers, with shorter BBDuk-trimmed read assemblies resulting in the lowest N50, maxContig, and genome coverage relative to other trimmers. Trimming poliovirus reads with Trimmomatic improved genome coverage breadth by up to 71.8%, aligning with the Trimmomatic developers’ report that post-trimming N50 and maxContig values for Escherichia coli genome assemblies increased by 58–77% and 28–55%, respectively (9). In our study, poliovirus assemblies, sequenced from isolates (non-targeted sequencing), exhibited higher genome coverage (35.7–98.9%) compared to SC2 (8.7–67.9%) and noroviruses (29.3–75.6%), which were amplified from clinical samples before sequencing. The non-targeted poliovirus data assembled into longer contigs, likely due to the ability of enteroviruses to inhibit host cell RNA synthesis, leading to a higher viral RNA proportion for sequencing (22).

Identification of high-quality SNPs is crucial for comprehensive genome analysis. Our study found 97.7−100% concordant SNPs per virus across all six trimmers. High SNP concordance was also reported by Sturm et al. when benchmarking SeqPurge performance using targeted breast and ovarian cancer exon sequencing (12). Notably, BBDuk-trimmed read assemblies had 2−8 additional unique SNPs, possibly due to low read coverage or false-positive SNP calls (12). Poliovirus assemblies using BBDuk-trimmed reads had the lowest SNP quality compared to other trimmers.

Limitations

When choosing a trimmer, researchers should consider factors like throughput, speed, and memory usage (11–13). This study did not compare these factors due to the limited sample size. However, previous studies suggest Trimmomatic and AdapterRemoval offer high throughput (10), FastP provides rapid processing (11), and AdapterRemoval, SeqPurge, and Skewer require less memory (12,13).

This study found sequence-matching trimmers, Trimmomatic and AdapterRemoval, consistently performed well for viral iSeq and MiSeq data. Our findings emphasize the importance of quality and adapter trimming in advancing infectious disease research using next-generation sequencing.

The authors declare no conflicting interests.

Availability of data and materials

Viral datasets analyzed in this study were processed to mask human reads and submitted to the NCBI Sequence Read Archive under BioProject ID PRJNA1041361.

Acknowledgement

We thank Anna Kelleher and Yan Li from the Coronavirus and Other Respiratory Viruses Division for the SC2 samples, Anna Montmayeur from the Viral Gastroenteritis Branch, Division of Viral Diseases (DVD), for the norovirus samples, and Ann Frolov and James Bullows from the Polio and Picornavirus Branch in DVD for the poliovirus samples analyzed in this study.

Funding

No funding.

1. Gargis AS, Kalman L, Lubin IM. Assuring the Quality of Next-Generation Sequencing in Clinical Microbiology and Public Health Laboratories. J Clin Microbiol. 2016 Dec;54(12):2857–65.
2. Kanzi AM, San JE, Chimukangara B, Wilkinson E, Fish M, Ramsuran V, et al. Next Generation Sequencing and Bioinformatics Analysis of Family Genetic Inheritance. Front Genet [Internet]. 2020 [cited 2023 Jul 17];11. Available from: https://www.frontiersin.org/articles/10.3389/fgene.2020.544162
3. Maljkovic Berry I, Melendrez MC, Bishop-Lilly KA, Rutvisuttinunt W, Pollett S, Talundzic E, et al. Next Generation Sequencing and Bioinformatics Methodologies for Infectious Disease Research and Public Health: Approaches, Applications, and Considerations for Development of Laboratory Capacity. J Infect Dis. 2020 Mar 28;221(Supplement_3):S292–307.
4. Nabakooza G, Owuor DC, de Laurent ZR, Galiwango R, Owor N, Kayiwa JT, et al. Phylogenomic analysis uncovers a 9-year variation of Uganda influenza type-A strains from the WHO-recommended vaccines and other Africa strains. Sci Rep. 2023 Apr 4;13(1):5516.
5. Liu L, Li Y, Li S, Hu N, He Y, Pong R, et al. Comparison of Next-Generation Sequencing Systems. BioMed Res Int. 2012 Jul 5;2012:e251364.
6. Buermans HPJ, den Dunnen JT. Next generation sequencing technology: Advances and applications. Biochim Biophys Acta BBA - Mol Basis Dis. 2014 Oct 1;1842(10):1932–41.
7. Head SR, Komori HK, LaMere SA, Whisenant T, Van Nieuwerburgh F, Salomon DR, et al. Library construction for next-generation sequencing: Overviews and challenges. BioTechniques. 2014 Feb;56(2):61–77.
8. Illumina. How short inserts affect sequencing performance [Internet]. 2023 [cited 2023 Jul 3]. Available from: https://knowledge.illumina.com/library-preparation/general/library-preparation-general-reference_material-list/000003874
9. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014 Aug;30(15):2114–20.
10. Schubert M, Lindgreen S, Orlando L. AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Res Notes [Internet]. 2016 [cited 2022 Dec 21];9. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4751634/
11. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884–90.
12. Sturm M, Schroeder C, Bauer P. SeqPurge: highly-sensitive adapter trimming for paired-end NGS data. BMC Bioinformatics. 2016 May 10;17:208.
13. Jiang H, Lei R, Ding SW, Zhu S. Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics. 2014 Jun 12;15(1):182.
14. Illumina. De Novo Assembly Using Illumina Reads. [cited 2024 Feb 20]; Available from: https://www.illumina.com/Documents/products/technotes/technote_denovo_assembly_ecoli.pdf
15. Shen W, Le S, Li Y, Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE. 2016 Oct 5;11(10):e0163962.
16. Andrews S. FastQC: A Quality Control tool for High Throughput Sequence Data [Internet]. 2010 [cited 2020 Mar 21]. Available from: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
17. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct;32(19):3047–8.
18. Prjibelski A, Antipov D, Meleshko D, Lapidus A, Korobeynikov A. Using SPAdes De Novo Assembler. Curr Protoc Bioinforma. 2020;70(1):e102.
19. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021 Feb 16;10(2):giab008.
20. Guzman C, D’Orso I. CIPHER: a flexible and extensive workflow platform for integrative next-generation sequencing data analysis and genomic regulatory element prediction. BMC Bioinformatics. 2017 Aug 8;18(1):363.
21. Illumina. Calculating Percent Passing Filter for Patterned and Non-Patterned Flow Cells. 2017.
22. Lloyd RE. Enterovirus Control of Translation and RNA Granule Stress Responses. Viruses. 2016 Mar 30;8(4):93.

Table 1 is available in the Supplementary Files section.


Table1AssemblyMetrics.docx
Table 1: Mean N50 and mean maxContig before and after trimming, grouped by virus and sequencing platform. The value in parentheses is the percent genome coverage represented by the N50 or maxContig. Data in blue highlights the highest mean N50 and mean maxContig values for a given virus/platform and data in red indicates the lowest values.
SupplementaryMaterial10April2024.docx


Editorial decision: Revision requested
22 Apr, 2024
Editor assigned by journal
19 Apr, 2024
Submission checks completed at journal
18 Apr, 2024
First submitted to journal
10 Apr, 2024
