Next-generation short-read WGS data has the capability to provide detailed genotype information, but often necessitates a good understanding and use of bioinformatics tools and packages. GC3 was developed to be a user-friendly computational tool to (1) extract coverage profiles of target genome regions, 2) provide interpretable results regarding deletion prevalence, 3) classify samples according to the type of gene deletions, and (4) inform NMCPs concerning coverage of important genes. In this study, we demonstrated that GC3 can be used for these purposes by applying it to Plasmodium falciparum genome segments, specifically the regions containing hrp2 and hrp3 genes. Most NMCPs in malaria-endemic settings rely on HRP2-based RDTs for day-to-day diagnosis in both clinical and field settings. There is evidence of recent expansions of Pf strains lacking HRP2 expression, a cause for concern as stated by the WHO [1, 19, 31, 32]. Computational toolkits, such as GC3, need to be able to efficiently analyze Pf genomic data and assess for the presence of hrp2- or hrp3-deletion strains, which provides a valuable monitoring resource to researchers and public health professionals concerned with malaria RDT effectiveness.
This work demonstrated the validity and utility of the GC3 tool to measure hrp2 and hrp3 deletion frequencies in Pf individual samples and in sample sets. The sliding window capability of GC3 provided a wider view of large deletions in reference lab strains (i.e. Dd2 and HB3) while adjusting for fluctuations in subtelomeric read coverage and allowed for visually-friendly figures. The option of extracting and plotting every chromosome position within an interval allowed for a magnified view of target loci, and better illustrated details in coverage within the target gene. GC3’s genotype results were validated against previously reported genotypes of Pf laboratory reference and representative strains, and similar, publicly available sample sets from Kenya and Peru, where hrp2/3 deletions have been observed [11, 18, 33]. Analysis of samples from Cambodia and Brazil demonstrate GC3’s capability to process novel WGS data from regions other than West Africa, the location of origin of PfNF54, the parental isolate from which the reference 3D7 was cloned [34, 35]. Results are consistent with previous estimates of hrp2/3 deletions among each respective country [19, 29, 36]. Of note is the high prevalence of all deletions, especially hrp3 deletions among Brazilian samples. Deletions in hrp2/3, and reports of high deletion prevalence, were first observed in the American continent [7, 8, 16, 18, 36–38], especially hrp3 deletions [7, 39]. Overall, GC3 can appropriately process and analyze publicly available WGS datasets from a variety of genomic studies.
Additional comparison against a hrp2/3-specific qPCR assay demonstrated very good reliability of GC3’s capability. Although there was one discordant result between tools, this may be a reflection of GC3’s sensitivity and the qPCR assay’s difficulty to detect partial deletions, as these are only detected by the qPCR-based assay if they overlap the primer-binding or amplicon sites. A potential challenge for GC3 are the samples with a very low amount of parasite DNA resulting on overall low depth of coverage and uneven representation of the locus in the genomic library and/or among the WGS data, leading to significant regions of the hrp2/3 loci with zero coverage (and then a ‘deletion’ assessment by GC3), despite the loci being present in the genome. However, this situation was not observed in our study. In the specific case of discordant results in a sample from Cambodia, the sample had very high coverage at the core and subtelomeric regions. In general, partial deletions present a challenge since some cases have shown a qPCR assay can amplify part of hrp2/3, but corresponds to false negative RDT diagnoses [40, 41]. Ultimately, there were very few such samples, so their impact on overall results is considered minimal. Although beyond the scope of this study, further examination of hrp2/3 partial deletions, their specific location within the locus and their respective RDT diagnosis may provide valuable information regarding the most appropriate criteria and thresholds to accurately identify gene deletions with a functional phenotype, i.e., those deletions that abrogate protein expression.
Among Cambodian samples, read coverage in flanking genes was further analyzed and demonstrated that hrp2/3 deletions can be restricted to just the locus proper, or extend to flanking genes, but without a discernible pattern. These results are consistent to previous reports [18, 19, 39, 42, 43]. Overall, GC3 reported similar results in deletion prevalence and classification trends among global malaria-endemic regions.
Utilizing the function of GC3 to extract all positions (i.e. interval = 1, step size = 1) and plotting hrp2/3 coding positions only, coverage is clearer and subsets of samples can be compared, if desired. When using this option to compare countrywide samples collected at different time points, interestingly 2002 Malian samples had lower standardized coverage than older samples (2002 relative to 2010) for both hrp2 and hrp3. This result was unexpected, since the frequency of deletions is expected to increase over time, due to the selection imposed by parasite detection by RDT. Given the size of the sample subgroups, consistent results across both hrp2 and hrp3, and generation of WGS data directly from total DNA extracted from venous whole blood in both 2002 [30] and 2010 [21], it is improbable that differences in read mapping success would cause such a difference in coverage. Interestingly, hrp2 deletions had already been observed in Mali in the late 1990s prior to significant RDT use in the country, indicative of spontaneous deletions that were maintained despite the absence pressure from RDT usage [44]. In fact, random polymorphisms can occur naturally in the subtelomeric region, including large deletions, without evolutionary pressure [45], Further, the fitness cost associated with hrp2 loss is not significant, although some small cost is associated with hrp3 deletions [3, 18]. Considering previous observations and minimal fitness cost associated with hrp2/3, GC3’s Mali results are possibly due to the expansion of a circulating genotype with naturally occurring hrp2/3 deletions in 2002 that were less prevalent or non-existent by 2010. Ultimately, this observation, together with the high frequency of hrp2 and, especially, hrp3 deletions in Brazilian isolates, suggest that there may be factors that impact deletion rate at these loci or forces, other than RDT usage, that favor strains with hrp2/3 deletions.
Closer examination of the proportional coverage at each hrp2/3 position among Cambodian samples, revealed in each gene positions with zero read coverage resulting from mapping artifacts, and which may confound results. On hrp2 positions, zero coverage positions increase and then spike around the intron region which, as mentioned before, is likely due to the intron’s high AT content [46] that can cause challenges for read mapping. Interestingly, there were two spikes in zero coverage positions on the hrp3 locus. Examination of a previous whole genome hierarchical cluster analysis of the same Cambodian samples [47] revealed that the majority of samples contributing to one or both peaks belong to the same Cambodian subpopulation and hence share a similar genetic background. Ultimately, the figure offers a useful preliminary view of the hrp2/3 genes and their characteristics.
Some limitations exist when analyzing the GC3’s results. Firstly, the quality of computational results is influenced by the depth of the P. falciparum sequencing data, as measured by the total number of reads mapped to the reference genome. In the case of P. falciparum, we have observed that the percentage of genome with coverage asymptotes at ~ 12 million 100-bp reads mapped to the parasite genome, averaging ~ 52X coverage genome-wide. It is also recommended the user apply GC3 to calculate mean coverage over the region/chromosome where the target gene is located to estimate expected coverage at the general area of the target gene. Further, high-quality WGS data results obtained with well-described DNA extraction methods and sequencing methods, either by direct sequencing or sWGA [21, 23], and established quality control and filtering protocols should be used when comparing samples from different studies. While comparing standardized coverage between direct and sWGA sequence data, direct sequencing achieves more uniform coverage due to the inefficient amplification on the subtelomeric region by sWGA primers [23], but sWGA still provides good coverage at hrp2/3 gene positions (Supplementary Figure S6). Even with high-quality data, and pertinent to hrp2/3 deletion detection, polyclonal infections add another layer of complexity since they can mask the lack of coverage at a target gene, especially in high transmission regions [17]. Particularly if GC3 is the only method being implemented to assess for deletions, this factor would need to be considered. The deletion criteria can be easily adjusted to be more stringent by the user depending on their purposes, much like the deletion criteria used in the comparator study [18]. Despite the limitations, GC3 appropriately processed hrp2/3 coverage data and classified deletions. Its utility can be extended to analyze and visualize coverage data of any target gene on any pathogen.
Summary statement
Overall, validation of the GC3 tool to extract and process WGS data was successful when comparing with expected results using reference strains, well-described samples and a hrp2/3-specific qPCR assay. Following our criteria for identifying deletions, the results agreed with previous estimations of hrp2/3 deletion frequency in each respective country. Apparent in the results is the level of detail that can be extrapolated from short-read WGS data and viewed using a comprehensive computational tool. Although challenges persist in ensuring high-quality WGS data and achieving similar coverage among low parasitemia samples using sWGA, we expect GC3’s results to be fairly robust. Further investigation of the partial deletions threshold that results in a false negative RDT diagnosis is needed to validate the deletion criteria. Ultimately, groups investigating a target gene’s coverage can use GC3 to efficiently generate translatable results and figures to understand and interpret broad patterns using hundreds to thousands of previously generated genomic datasets.