Initial evaluation of enzymes for Illumina library amplification
Each enzyme was assessed for its ability to amplify adapter-ligated genomic Illumina library fragments with an expected average insert size of approximately 500 bp, from a set of four microbial genomes with differing GC content: Bordetella pertussis, 67.7% GC; Escherichia coli, 50.8% GC; Clostridioides difficile, 29.1% GC; and Plasmodium falciparum, 19.3% GC. For each enzyme tested, 1 ng of pre-PCR library fragments from each genome was amplified for 14 cycles using the manufacturer's recommended denaturation and extension times, with annealing at 60°C for 15 seconds. Unique dual indexed (UDI) P7 and P5 amplification primers were used to avoid index hopping [5, 6].
For UDI oligonucleotides used, see Supplementary Table 1.
Details of enzymes used and cycling conditions are listed in Supplementary Table 2.
After a 0.7x Ampure bead cleanup, the yield of each library was assessed by fluorimetric measurement (Supplementary Fig. 1). There were surprising differences in the yields obtained from the enzymes tested, with some giving relatively little library. These amplifications were repeated with fresh enzyme on a different PCR block, always with the same outcome. Quantabio RepliQa and sparQ, Kapa HiFi, Invitrogen Platinum SuperFi II, Thermo Collibri and Phusion U multiplex PCR mastermix, Tools Ultra, Biotool Univerase and Agilent Herculase gave good yields.
Barcoded libraries were pooled in a pseudo-equimolar manner according to genome size and run on an Illumina NovaSeq 6000 SP or S4 flowcell lane to give > 30x coverage of each genome. To compare results fairly, datasets were randomly downsampled to contain reads representing 30x coverage. We tabulated the depth of coverage at each position of the genome and calculated the fraction of each genome covered to a depth of less than 15x, i.e. half the mean coverage (referred to as the Low Coverage Index, LCI). Most datasets had < 5% low coverage with the GC-neutral E. coli genome, but higher degrees of low coverage were observed for less base-balanced genomes with many of the enzymes, with the extremely AT-rich genome of P. falciparum posing the biggest challenge (Fig. 1). Quantabio RepliQa, Kapa HiFi and Collibri had a low coverage index of < 5% with all four genomes.
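The low coverage index described above can be illustrated with a minimal sketch. It assumes per-position depths are already available as a list of integers (for example, exported from a depth-of-coverage tool); the function name `low_coverage_index` and the toy depths are hypothetical and are not taken from the study's pipeline.

```python
from typing import Sequence


def low_coverage_index(depths: Sequence[int], fraction_of_mean: float = 0.5) -> float:
    """Return the fraction of positions with depth below fraction_of_mean * mean depth."""
    if len(depths) == 0:
        raise ValueError("depths must contain at least one position")
    mean_depth = sum(depths) / len(depths)
    cutoff = fraction_of_mean * mean_depth  # 15x when the mean depth is 30x
    n_low = sum(1 for d in depths if d < cutoff)
    return n_low / len(depths)


# Toy per-position depths (illustrative only, not data from this study).
toy_depths = [30, 30, 10, 50, 30, 30, 40, 20, 30, 30]
toy_lci = low_coverage_index(toy_depths)  # mean depth 30, cutoff 15, one position below
```

With datasets downsampled to 30x, the cutoff of half the mean depth corresponds to the 15x threshold used in the text.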
Further evaluation of enzymes for Illumina library amplification
To test reproducibility further, replicate libraries were made using the better performing enzymes, with the addition of Watchmaker Genomics Equinox library amplification mastermix and Takara Ex Premier (new formulations that had previously been unavailable for testing), under a variety of cycling conditions (Supplementary Table 2). With this selected group, yields were high with all templates (Supplementary Figs. 2 and 3).
Again, the low coverage index (LCI) was calculated for each dataset, and enzymes/conditions were ranked from low to high LCI for each genome (Fig. 2).
Whilst some enzymes perform better in certain genomic contexts, RepliQa, Watchmaker Equinox and Takara Ex Premier give good coverage uniformity with all genomes. To assess overall low coverage across all genomes, we calculated the sum of the low coverage index values for each enzyme and compared it with that obtained from PCR free libraries (Fig. 3). This illustrates that these three enzymes have minimal bias, each giving coverage uniformity similar to that seen with PCR free libraries.
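The summed comparison can be sketched as follows, assuming the per-genome LCI values are held in a nested dictionary. The enzyme labels "Enzyme A" and "Enzyme B", the function name `rank_by_summed_lci` and all numeric values are placeholders for illustration and are not results from this study.

```python
from typing import Dict, List, Tuple


def rank_by_summed_lci(
    table: Dict[str, Dict[str, float]], baseline: str
) -> List[Tuple[str, float, float]]:
    """Return (name, summed LCI, difference from baseline) rows, lowest sum first."""
    if baseline not in table:
        raise KeyError(f"baseline {baseline!r} not found in table")
    baseline_sum = sum(table[baseline].values())
    rows = []
    for name, lci_by_genome in table.items():
        total = sum(lci_by_genome.values())
        rows.append((name, total, total - baseline_sum))
    return sorted(rows, key=lambda row: row[1])


# PLACEHOLDER values only; these are not measurements from the study.
toy_table = {
    "Enzyme B": {"B. pertussis": 0.10, "E. coli": 0.02, "C. difficile": 0.05, "P. falciparum": 0.30},
    "PCR free": {"B. pertussis": 0.01, "E. coli": 0.01, "C. difficile": 0.01, "P. falciparum": 0.03},
    "Enzyme A": {"B. pertussis": 0.02, "E. coli": 0.01, "C. difficile": 0.02, "P. falciparum": 0.05},
}
ranking = rank_by_summed_lci(toy_table, baseline="PCR free")
```

An enzyme whose summed LCI sits close to the PCR free baseline shows little additional coverage bias across the four genomes.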
The more even coverage obtained with RepliQa and Equinox relative to other enzymes could be clearly seen in the more challenging GC-rich or AT-rich regions: RepliQa, Equinox and PCR free libraries had good coverage in GC-rich (locally 100% GC) regions of B. pertussis and also in AT-rich (locally < 4% GC) regions of P. falciparum (Fig. 4).
Sequencing data were also obtained for amplified human genome template. Again, RepliQa, Watchmaker Equinox and Takara Ex Premier showed the most even genome coverage, reflected in the lowest LCI values (Supplementary Fig. 4).
It has been observed that some enzymes are inhibited by the magnetic beads commonly used in NGS workflows, e.g. for SPRI magnetic bead cleanup and size selection, and for streptavidin bead capture of biotin labelled DNA fragments. With SPRI cleanup, carryover of beads after elution is commonplace, and some protocols employ "with bead" approaches in which increased yield is obtained when the beads are not removed after the final elution step [8]. Streptavidin conjugated magnetic beads are also used in NGS protocols for selection of biotin labelled library fragments, most commonly in hybrid capture target enrichment procedures [9]. Owing to the extremely strong affinity of streptavidin for biotin, such methods require PCR amplification of bead-bound library fragments. To test whether amplification by the enzymes used in the second phase of this study is inhibited by such beads, amplification was carried out without beads, with a volume of washed beads in water equal to the sample volume, or with the template bound to 50 µl of washed streptavidin beads (Dynabeads MyOne Streptavidin T1, Thermo, cat no. 65602). All of the enzymes tested, apart from Q5, were found to be unaffected by the presence of Ampure or streptavidin magnetic beads (Supplementary Fig. 5).
Accuracy and utility of the human genome sequences obtained with each enzyme were assessed by comparing each dataset with the NA12878 reference genome and variant list (see Methods). The numbers of both indels and SNPs detected were slightly greater with 500 bp mean inserts than with 200 bp. The three enzymes with the highest sensitivity for SNP and indel detection in the microbial reference genomes were Quantabio RepliQa, Watchmaker Equinox and Takara Ex Premier (Table 1). These enzymes were found to call more SNPs and indels, with greater precision, than Kapa HiFi, at rates comparable to those seen in the Precision FDA Truth challenge when using 50x PCR free datasets [10].
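For readers unfamiliar with the metrics, sensitivity and precision against a truth variant list can be sketched as below. This is a simplified illustration that assumes variants are compared by exact match on (chromosome, position, reference allele, alternate allele); dedicated benchmarking tools additionally normalise variant representation and restrict comparison to confident regions. The function name and the toy variants are hypothetical.

```python
from typing import Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref allele, alt allele)


def precision_and_sensitivity(called: Set[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Return (precision, sensitivity) using exact-match comparison of variant tuples."""
    true_pos = len(called & truth)   # called and present in the truth set
    false_pos = len(called - truth)  # called but absent from the truth set
    false_neg = len(truth - called)  # in the truth set but not called
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    sensitivity = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, sensitivity


# Toy variants (illustrative only, not taken from NA12878).
toy_truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 50, "G", "GA"), ("chr2", 75, "T", "C")}
toy_called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
              ("chr2", 50, "G", "GA"), ("chr3", 10, "A", "T"),
              ("chr3", 20, "C", "G")}
toy_precision, toy_sensitivity = precision_and_sensitivity(toy_called, toy_truth)
```

In this toy case three of five calls are correct (precision 0.6) and three of four truth variants are recovered (sensitivity 0.75).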
There may be times in a high-throughput or clinical sequencing lab when fast turnaround is required and rapid PCR is desirable. Quantabio promote RepliQa for its short extension times. When we tested the phase 2 enzymes with increasing extension times (5, 15, 30 or 60 seconds), Collibri, Q5, RepliQa and Watchmaker Equinox gave near-maximal yield after just 5 seconds of extension, whereas Herculase and Kapa HiFi yields increased with extension time (Supplementary Fig. 6).
The genome of the malaria parasite Plasmodium falciparum has an extremely low GC content of 19.3% [11] and has been shown to be one of the most challenging genomes to amplify and sequence [3, 4, 12]. Several papers have been published for this genome detailing methods to minimise the biases introduced by PCR and sequencing, including PCR free library approaches [3, 13] and optimised PCR protocols [4, 11, 14, 15]. In this study, using these approaches, we found that the fraction of the genome at less than 50% of mean coverage could be decreased even further (Fig. 5), though the most successful reduction was achieved by using a different approach for different enzymes. The lowest LCI was achieved using RepliQa with denaturation at 94°C and extension at 60°C as described by Lopez-Barragan, though similar LCI values were also obtained under these conditions using Collibri and Watchmaker Equinox enzymes. Following on from this, we tested RepliQa under a range of denaturation and extension temperature combinations and found that these conditions could not be improved upon (results not shown).
Long range amplification for Long Read sequencing
Long range PCR is a common approach for generating material for long read sequencing. Many users have found this to be even more challenging, with low yield and a bias towards smaller fragments during amplification. To test the suitability of PCR enzymes for this application, we prepared size-fractionated yeast genome fragments ligated to Illumina adapters, enabling amplification with the same primers as used in the rest of this study.
Sheared S. cerevisiae DNA was size fractionated using Sage Science ELF or BluePippin instruments, yielding modal fragment sizes of 21.6 kb and 13.3 kb respectively (Supplementary Fig. 7). After adapter ligation, 1 ng of each was used as template for long range PCR with a range of enzymes using the manufacturers' recommended cycling conditions (Supplementary Table 2).
Initially, 12 cycles of PCR were used, but with most enzymes this generated little or no product (data not shown), so PCR was repeated for 15 cycles, after which amplicons of the expected size were observed with most enzymes, though yields varied widely (Supplementary Fig. 8). The long range PCR products were then prepared for Pacific Biosciences HiFi sequencing using the manufacturer's recommended amplicon library prep protocol and barcoded adapters. Sequencing yields and coverage obtained are summarised in Supplementary Fig. 9 and Table 2. Owing to extremely low yields after PCR, products from some enzymes were insufficient to obtain significant coverage.
For those amplification product libraries that gave > 30x genome coverage, the low coverage index was calculated. The lowest LCI (indicating more even genome coverage) was obtained with RepliQa, followed by Terra polymerase (Supplementary Fig. 10).
Long range PCR can preferentially amplify smaller templates, such that after multiple cycles the amplification reaction is dominated by shorter amplicons. BluePippin size-selected templates amplified by RepliQa and Terra polymerase gave the longest average subread lengths (library insert size), approximately 12 kb (Supplementary Figs. 11 and 12). With the larger 21 kb ELF fractionated template, the majority of reads were obtained from shorter amplification products. RepliQa gave the largest fraction of 20 kb subreads (Supplementary Fig. 13).
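The read length comparison above can be illustrated with a short sketch that summarises a list of subread lengths. It assumes "20 kb subreads" means subreads of at least 20,000 bases, which is an interpretation rather than a definition given in the text; the function name and toy lengths are hypothetical.

```python
from typing import Sequence, Tuple


def read_length_summary(lengths: Sequence[int], min_length: int = 20_000) -> Tuple[float, float]:
    """Return (mean length, fraction of reads with length >= min_length)."""
    if len(lengths) == 0:
        raise ValueError("lengths must contain at least one read")
    mean_length = sum(lengths) / len(lengths)
    n_long = sum(1 for length in lengths if length >= min_length)
    return mean_length, n_long / len(lengths)


# Toy subread lengths in bases (illustrative only, not data from this study).
toy_lengths = [5_000, 12_000, 21_000, 8_000, 22_000]
toy_mean, toy_long_fraction = read_length_summary(toy_lengths)
```

A shift towards shorter amplicons during PCR would show up here as a lower mean length and a smaller fraction of reads above the length threshold.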
By comparing the PacBio HiFi data with the S. cerevisiae S288C reference genome sequence, the error profile of the library generated after amplification with each enzyme could be determined. Terra, LongAmp and Promega Go Long are Taq based polymerase formulations and, as a result, were observed to give higher error rates, particularly mismatch errors, than the other enzymes, which possess proofreading activity. NEB Q5 gave the lowest error rates (Supplementary Fig. 14).
As might be expected, the enzymes that gave the most even genome coverage also gave the best assembly statistics when 30x normalised coverage reads were assembled in the SMRT Link portal (Table 2). Here, RepliQa followed by Terra polymerase gave the most contiguous assemblies. The S. cerevisiae genome is known to have 16 chromosomes [16], and additional circular chromosomal elements have been reported [17]. Assemblies from material amplified using these enzymes therefore achieved near complete contiguity, with the sum of contig lengths matching that expected for the yeast genome, and with ELF fractionated fragments amplified for 15 cycles with RepliQa assembling into just 20 contigs.