DOI: https://doi.org/10.21203/rs.3.rs-69257/v2
Whole genome sequence (WGS) data provide a complete picture of a pathogen genome and are transforming molecular epidemiology [1]. In the past few years, the availability of WGS data from field collected malaria samples has grown along with the development of more sensitive, faster and lower-cost sequencing technologies. Until recently, generating these sequence data required collection of large volumes of venous blood, leukocyte depletion at the time of sample collection, and storage in the field - tasks often challenging to perform in resource-limited settings [2,3]. In contrast, collection of dried blood spots (DBS) requires small sample volumes and such samples are easily stored and transported. However, the low volumes of blood present in DBS often result in low quality and quantity of parasite DNA, particularly relative to the overwhelming proportion of human DNA in the sample. In order to address these challenges, several studies have utilized different enrichment methods such as non-selective whole genome amplification (WGA) [4,5]; hybrid selection [6–9]; enzymatic digestion of host DNA using the MspJI family of restriction enzymes [10,11] and selective whole genome amplification (sWGA) [11–14]. Recently, collection of leukocyte-depleted dried erythrocyte spots has also showed promise to provide parasite enriched samples for WGS [15].
Unlike other enrichment methods, sWGA requires relatively small amounts of starting material and provides relatively high quality data for a low-cost - enabling WGS of field samples that would otherwise contain insufficient genetic material for P. falciparum [13], P. vivax [11] and P. Knowlesi [14]. However, it is possible that current sWGA protocols can be improved to increase coverage, sensitivity, and reproducibility. For example, two recent studies that performed sWGA on clinical samples using two different sets of sWGA primers, reported an average 55% of reads mapped to the P. falciparum genome with 48% of the genome covered at ≥5X in 24 samples with an average parasite density of 73,601 parasites per µL [13,16]. However, Oyola et al., established a parasitaemia threshold of 0.03% (>1200 parasites/µL) to obtain at least 50% of the genome at 5X in clinical samples. Similar level of coverage was reported in a P. vivax specific sWGA [11], highlighting the need to improve these protocols for DBS samples.
The efficiency of sWGA could be potentially be improved by standardized comparison and optimization of different DNA extraction methods from DBS; identifying optimal sWGA primer sets that provide higher yield with reduced bias; and performing enzymatic cleavage of human DNA prior to sWGA [10]. In this study, we compared four DNA extraction methods followed by either enzymatic cleavage of human DNA or no cleavage, and amplification with WGA or sWGA using different primer sets. The combination of conditions were compared in standardized samples across a range of low parasite densities extracted from DBS, and the resulting WGS data were compared in terms of coverage of the parasite genome, dropout rate, and the concordance of called variants.
Experimental design
Dried blood spots (DBS) with densities from 10 to 10,000 parasites/μL blood were extracted using four different DNA extraction methods and then amplified by two different sets of sWGA primers or random hexamer primers with or without a prior treatment of input DNA with McrBC, an endonuclease which cleaves human DNA containing methyl cytosine [17] (Figure 1). The recovery and quality of WGS were compared between the different combinations of methods. All analyses were performed considering the core genome of P. falciparum - excluding telomeres, centromeres and sub-telomeres.
Tween-Chelex extraction yielded the highest coverage in low density samples
The recovery and coverage of WGS for DNA extraction methods across parasite densities were investigated. Four methods were evaluated, two of which were spin column based and two Chelex based. DBS extracted with the Saponin-Chelex method had the lowest performance, with only 13% of reads mapping to the P. falciparum core genome and poor genome coverage (1X coverage ranging from 2% to 50%) across all parasite densities. This extraction method was not investigated further. QiaAMP Mini and QiaAMP Investigator spin column kits performed similarly in terms of the percentage of reads that mapped to P. falciparum regardless of parasitemia. After assessing the proportion of the genome left uncovered after sequencing the QiaAMP Investigator kit had a 3% higher dropout rate than QiaAMP Mini at 1000 parasites/μL (34% vs 37%) and 8% higher at 100 parasites/μL (36% vs 44%), and was also not investigated further (Supplemental Figure 1). The performance of Tween-Chelex extraction and QiaAMP Mini kits were evaluated in further detail.
xtraction with QiaAMP Mini resulted in a consistently higher proportion of reads mapping to P. falciparum (Figure 2A). However, the higher mapping rate did not translate into better sequence coverage across the P. falciparum core genome (Figure 2B). Samples extracted using the Tween-Chelex method consistently resulted in a greater proportion of the genome covered at 5X depth than samples extracted via spin columns (e.g., 89% vs 72% at 100 p/μL). Similarly, DBS extracted via spin columns had higher dropout rates than those extracted with Tween-Chelex (e.g., 12% vs. 2% respectively at 100 parasites/μL) (Figure 3). In order to evaluate the accuracy of the mapped reads, the overall SNP concordance was compared for known SNP positions and de novo SNP calls to the reference 3D7 genome. Both extraction methods achieved high concordance, with Tween-Chelex outperforming QiaAMP for lower density samples (Figure 4). Overall, the Tween-Chelex extraction method had a lower dropout rate, higher genome-wide coverage and higher rate of SNP concordance than the QiaAMP mini kit on low density DBS samples for a given number of total reads, despite having a lower fraction of these reads mapping to the P. falciparum core genome.
McrBC improves sensitivity and genome coverage in low density samples
Whole genome amplification has been shown to be a successful strategy for increasing the amount of material available for WGS of P. falciparum from DBS samples. This study compared performance for two sets of P. falciparum specific sWGA primer sets, 10A and 6A10AD, and a random hexamer-based WGA. Overall, the two sWGA primer sets performed similarly with respect to the percentage of reads that mapped to P. falciparum (Figure 2A) and genotype concordance (Figure 4). However, the 6A10AD primer set resulted in moderately higher breadth and depth of coverage across extraction methods and parasite densities (Figure 2B).
To test an alternative enrichment strategy to sWGA, WGA with random hexamers preceded by McrBC treatment was performed, resulting in a significantly lower genome coverage than sWGA (Supplemental Figure 2). However, performing sWGA on McrBC treated samples yielded a substantial increase in the depth of genome coverage that is most marked at the lowest parasite densities extracted with Tween-Chelex. After McrBC treatment, the proportion of reads that mapped to the human genome dropped considerably at 100 parasites/μL (Tween-Chelex: 43% to 5%), and 10 parasites/μL (Tween-Chelex: 87% to 47%), dramatically improving usability and cost efficiency of the assay (Figure 2A). Conversely, McrBC improved P. falciparum genome coverage, with a more pronounced increase in Tween-Chelex than QiaAMP mini extracted samples. Samples at 10 parasites/μL had a substantial increase in the percentage of the genome covered at 5X from 23% to 60% (Figure 2B).
The ability to recover high quality sequencing data from low-density dried blood spots has useful and immediate implications for public health - allowing sample collection to be performed in a cost effective and scalable manner. In this study, a comprehensive evaluation was performed on the effects of DNA extraction method, McrBC based digestion of human DNA, and sWGA conditions across a range of parasite densities. Tween-Chelex extraction, digestion with McrBC, and sWGA using a new combination of 20 previously published primers [12,13] and an AT-biased dNTP mixture provided the highest yield and most reliable coverage, allowing whole genome sequencing data to be obtained from dried blood spots containing as few as 10 parasites per µL of blood. These findings extend the results of previously reported methods, making whole genome sequencing accessible to a larger number of low density samples that are common in cross-sectional surveys.
The data show that recovery of WGS from DBS samples with 10 parasites/µL of blood is possible, but the quality and accuracy was higher for samples containing at least 100 parasites/µL (e.g. 93% of the genome at 5X coverage). McrBC treatment prior to sWGA allows for even greater recovery across all methods evaluated, and improved performance in all metrics tested. Other studies [10, 23] have found varying results with methylation digestion methods which may be due to a difference in both enzyme and experimental process. Interestingly, QIAamp extraction resulted in a higher proportion of total reads mapping to the P. falciparum genome than Tween-Chelex extraction, but nonetheless resulted in lower coverage, higher drop-out, and lower SNP concordance. This apparent discrepancy could stem from preferential recovery of specific regions of the genome by spin columns but more even recovery by Tween-Chelex and can be seen as a compensating increase in a small part of the genome being covered at higher depth. This would result in the increased variability in genomic depth seen in QIAamp which in combination with the already large variability in sWGA results in a more random dropout pattern as opposed to Tween-Chelex showing only the depth variability from sWGA.
The factor evaluated with the least effect on sequence quality and recovery was the sWGA primer set, but there were still modest benefits to using a more expanded primer set with AT-biased dNTPs (6A10AD). These effects were present independent of extraction and digestion method, and showed consistent improvements over the previously published 10A primer set.
This study showed that Tween-Chelex extracted samples treated with McrBC digestion and amplified using 6A10AD sWGA conditions had higher performance with respect to minimal genome dropout, higher percentages of coverage at higher depth, and more accurate SNP concordance with respect to reference strains. Tween-Chelex extraction has the added benefit of being cheaper than spin columns, allowing researchers and elimination programs to scale analyses with respect to dollar per base coverage. McrBC digestion prior to sWGA provided a significant improvement over non-digested samples extracted with Tween-Chelex, most dramatic in the lowest parasite density samples. There are numerous logistical advantages of DBS collection over venous blood, including a lower barrier of training for sample collection, less blood to collect per sample, and no specialized transportation requirements, allowing researchers and control programs to incorporate whole genome sequencing, including samples as low as 10 parasites per µL, into research and surveillance activities where such data will be useful.
Mock DBS samples
Mock DBS samples were made by mixing synchronized, ring stage cultured P. falciparum parasites with uninfected human whole blood to obtain a range of parasite densities (10, 100, 1000 and 10000 parasites per μL of blood). DBS samples were stored at -20°C until processing.
DNA extraction and quantitative PCR
DNA was extracted from a single 6mm hole-punch using four different extraction methods. i) Saponin-Chelex as described previously [18]; ii) a modified Tween-Chelex described below; iii) QIAamp DNA Mini Kit (Qiagen, California, United States) following the manufacturer’s recommendations and iv) QIAamp DNA Investigator Kit (Qiagen, California, United States) as described elsewhere [13].
Tween-Chelex extraction was conducted by modifying the Saponin-Chelex extraction method. DBS were punched using a 6mm hole-puncher in to 1.5ml microcentrifuge tubes. 1mL of 0.5% Tween 20 (catalogue # P1379, Sigma Aldrich) in 1X PBS was added into the tube containing DBS punches and incubated overnight at 4°C. The samples were briefly centrifuged, Tween-PBS was removed and the punches were washed with 1mL of 1X PBS and incubated for 30 minutes at 4°C. The samples were briefly centrifuged, PBS removed and 150μL of 10% Chelex 100 resin (catalogue # 1422822, Bio-Rad Laboratories) in water were added to each sample, ensuring the DBS punches were covered with the Chelex solution and incubated for 10 minutes at 95°C. The tubes were centrifuged at 15,000rpm for 10 minutes and the supernatant was transferred to 0.6mL microcentrifuge tubes and centrifuged at 15,000rpm for 5 minutes. The extracted DNA was then transferred to a 96-well plate and stored at -20°C until processing. The parasite densities were confirmed using var-ATS ultra-sensitive qPCR as described previously [19].
McrBC digestion of human DNA
For a subset of samples, extracted DNA was digested with McrBC (catalogue # M0272S, New England Biolabs) in a 30 μl reaction containing 20μL of extracted DNA, 10 units of McrBC, 1X NEBuffer 2, 0.5μl 100X BSA and 0.5μl 100X GTP. The samples were incubated at 37°C for 2 hrs followed by inactivation at 65°C for 20 minutes. The digested products were used as templates for whole-genome amplification.
Selective Whole genome amplification (sWGA)
The sWGA reactions were performed using two sets of primers: 10A [13] and 6A10AD, a combination of 6A [12] and 10A [13] with an adjusted ratio of dNTPs. The sWGA reaction using the 6A10AD primer set was performed as described previously [13] except for the adjusted proportion of nucleotides in the dNTP mix (i.e. 70% AT and 30% GC), similar to the composition of the malaria genome. The master mix and reaction condition for the modified sWGA protocol is shown in Table S1. A template DNA volume of 20uL with an estimated 16-16000 parasite genomes (Table S2) per sWGA reaction was used. For the non-selective whole genome amplification, random hexamer primers were used following the manufacturer's instructions for the GenomiPhi V2 DNA Amplification Kit (catalogue # 45-001-221, GE Healthcare Life Sciences).
Whole-genome sequencing
The whole genome amplification product was purified with SPRI magnetic beads (catalogue #65152105050250) to remove unbound primers, primer dimers, and other impurities. Sequencing libraries were prepared using the NEBNext® Ultra™ II DNA Library Prep Kit (catalogue #E7103) following manufacturer’s instructions. Samples were barcoded, purified, pooled and sequenced on the Illumina NextSeq550 or NovaSeq 6000 System using 150bp paired-end sequencing chemistry. Only preliminary analyses were sequenced using the Illumina NextSeq550.
Sequence analyses
Reads were demultiplexed, filtered by base quality and poly-g tails were clipped. The reads were then aligned to the P. falciparum 3D7 reference genome (version 3) with BWA-MEM with secondary alignments marked [20]. Sample base quality was recalibrated using Genome Analysis Toolkit (GATK) BaseRecalibrator and GATK ApplyBQSR with SNP locations from the Pf3k project (Data Release 5) used as a prior. The number of reads per sample were then downsampled to minimum total reads for equivalent comparisons. Percentiles of genome coverage were calculated using GATK CollectWgsMetrics across the core genome. Dropouts were evaluated by taking mean coverage over 200bp sliding windows across the core genome. Percentage of reads mapping to the core P. falciparum genome was calculated with samtools flagstat [21].
Variant calling and variant-filtering was conducted on the genome sequences following the Genome Analysis Toolkit (GATK) Best Practices [22] with minor modifications. Variants were called by sample across the core genome using gatk4 HaplotypeCaller (-ERC GVCF) then genotyped across all samples using gatk4 CombineGVCFs and gatk4 GenotypeGVCFs. The variants were recalibrated using gatk4 VariantRecalibrator and ApplyVQSR using data from the Pf3k project (Data Release 5) as a training set (annotations: QD, MQ, MQRankSum, ReadPosRankSum, FS, SOR). Variants were then filtered for passing VQSLOD and for biallelic SNPs. SNP concordance was measured with bcftools gtcheck, and gatk GenotypeConcordance. Concordance was measured using all known SNPs as well as with only high-quality non-homozygous SNPs.
WGS - Whole Genome Sequence
DBS - Dried Blood Spots
WGA - Whole Genome Amplification
rWGA - Random Whole Genome Amplification
sWGA - Selective Whole Genome Amplification
Not applicable
Consent for publication
Not applicable
Availability of data and materials
The datasets generated and/or analysed during the current study will be available in the repository github.com/greenhouse-lab/wgs_optim
Competing Interests
The authors declare that they have no competing interests.
Funding
This work was funded by the Bill and Melinda Gates Foundation (OPP1132226) and by the National Institute of Allergy and Infectious Diseases as part of the International Centers of Excellence in Malaria Research program (U19AI089674). BG is a Chan Zuckerberg Biohub investigator.
Authors’ Contributions
NT, AC, EMD, BG and ST designed experiments. NT analyzed and interpreted the sequencing results. AC and EMD performed the experiment. RS provided guidance on library preparation and sequencing. BG and ST provided guidance in experimental design and analyses. All authors read and approved the final manuscript.