DOI: https://doi.org/10.21203/rs.3.rs-1862557/v1
The ability of malaria rapid diagnostic tests (RDTs) to effectively detect active infections is being compromised by the presence of malaria strains with genomic deletions at the hrp2 and hrp3 loci, encoding the antigens most commonly targeted in diagnostics for Plasmodium falciparum (Pf) detection. The frequency of such deletions can be determined in publically available Pf whole genome sequencing (WGS) datasets. A computational approach was developed and validated, termed Gene Coverage Count and Classification (GC3), to analyze genome-wide sequence coverage data and provide informative outputs to assess presence and coverage profile of a target locus in WGS data. GC3 was applied to detect deletions at hrp2 and hrp3 (hrp2/3) and flanking genes in different geographic regions and across time points.
GC3 uses Python and R scripts to extract locus read coverage metrics from mapped WGS data according to user-defined parameters and generates relevant tables and figures. GC3 was tested using WGS data for laboratory reference strains with known hrp2/3 genotype, and its results compared to those of a hrp2/3-specific qPCR assay. Samples with at least 25% of coding region positions with zero coverage were classified as having a deletion. Publicly available sequence data was analyzed and compared with published deletion frequency estimates.
GC3 results matched the expected coverage of known laboratory reference strains. Agreement between GC3 and a hrp2/3-specific qPCR assay reported 19/19 (100%) hrp2 deletions and 18/19 (94.7%) hrp3 deletions. Among Cambodian (n = 127) and Brazilian (n = 20) WGS datasets, which had not been previously analyzed for hrp2/3 deletions, GC3 identified hrp2 deletions in three and four samples, and hrp3 deletions in 10 and 15 samples, respectively. Plots of hrp2/3 coding regions, grouped by year of sample collection, showed a decrease in median standardized coverage among Malawian samples (n = 150) suggesting an increase in frequency of deletions between 2007-08 and 2014-15. Among Malian (n = 90) samples, median standardized coverage was lower in 2002 than 2010, indicating widespread deletions present at the gene locus in 2002.
The GC3 tool accurately classifies hrp2/3 deletions and provided informative tables and figures to analyze targeted gene coverage. GC3 is an appropriate tool when performing preliminary and exploratory assessment of locus coverage data.
From 2010 to 2020, national malaria control programs (NMCPs) distributed 2.2 billion rapid diagnostic tests (RDTs) for malaria and 3.1 billion RDTs were sold by manufacturers, the majority of these going to malaria-endemic countries in sub-Saharan Africa [1]. RDTs are an integral part of nearly all NMCP’s clinical and field interventions since they provide quick and effective malaria diagnosis. These RDTs include a small cassette detecting Plasmodium-specific antigens in the blood of an infected individual and are user-friendly and affordable [2]. Predominantly, RDTs detect the Plasmodium falciparum-specific antigen histidine-rich protein 2 (HRP2), which is released into the bloodstream in large quantities when infected red blood cells lyse [3]. P. falciparum accounts for vast majority of the 241 million reported human malaria cases in 2020 and is the primary parasite causing malaria-related mortality and morbidity [1]. Due to homology and considerable sequence similarity between the two proteins, HRP2-based RDTs can cross-react with histidine-rich protein 3 (HRP3) in high-density infections [4]. As evidence of their effectiveness, 94% of WHO-qualified RDTs are either HRP2-based or based on a combination of HRP2 and a partner antigen such as parasite lactate dehydrogenase or aldolase [5–7]. HRP2-based RDTs are an essential diagnostic tool for NMCPs to scale surveillance operations and adequately assess infection, leading to proper treatment administration, measure intervention progress and identify malaria reservoirs.
Recently, however, the effectiveness of HRP2-based RDTs is becoming compromised due to the emergence of deletions in the hrp2 and hrp3 (hrp2/3) loci that prevent the expression of a functional protein [5, 7–15]. In particular, full deletions, as well as some partial deletions, in one or both of these genes eliminate detectable HRP2 and/or HRP3, preventing accurate malaria diagnosis using RDTs. Previous estimates of hrp2/3 deletion prevalence report higher frequencies in South and Central America, followed by Africa, then Asia and Oceania [16]. Low-transmission areas with high treatment rates, characteristics often found in elimination settings, are especially at risk for the spread of strains with hrp2/3 gene deletions, as models show that, under those conditions, strains with hrp2/3 deletions have a strong fitness advantage over those with intact genes [17]. Therefore, as NMCPs continue to control and move toward elimination, it is critical to monitor the presence and spread of hrp2/3 deletions. Without fully understanding the dynamics of hrp2/3 deletions, and spread of those deletions in particular, undiagnosed infections may lead to an increase in malaria prevalence and mortality, and hinder global progress towards control and elimination.
A computational tool that facilitates detection and classification of deletions in hrp2/3 (e.g. partial vs. complete deletions) in published whole genome sequencing (WGS) datasets will enable rapid and detailed analysis and estimation of deletion prevalence within and between datasets. The development of baseline values as well as the comparison of deletion prevalences across current samples sets as well as temporal comparison between these and previously published datasets may be particularly useful. Previous studies have performed analyses using WGS data [18–20]; however, implementation of the methods used in these studies requires a strong understanding of bioinformatics tools and packages. The development of a more user-friendly computational tool would expand the ability to implement surveillance of locus deletion based on WGS coverage data to a wider audience investigating and monitoring hrp2/3 deletions, as well as deletion genotypes in other target genes. This work aimed to fill this gap, by developing a computational tool, which we call “Gene Coverage Count and Classification”, or GC3, to provide translatable results of hrp2/3 deletion prevalence and classification, based on short-read WGS data, among global Pf samples for which WGS data is available.
The WGS data used were generated either by direct sequencing of total DNA extracted from each isolate or by sequencing post selective whole genome amplification (sWGA) of extracted DNA, and were reported previously [21]. Some of the WGS datasets were generated as part of the MalariaGEN project [22] and downloaded from the Sequence Read Archive (SRA). A selection of field samples representing 19 different countries from Africa (n = 9), South America (n = 5), Asia (n = 4) and Oceania (n = 1), for a total of 1120 datasets (1114 global samples + 6 reference strains), were evaluated for general results (Supplementary Figure S1). The following laboratory reference strains with known hrp2 and hpr3 genotype were used for developing and testing GC3: NF54 (West Africa) – hrp2 and hrp3 present, 7G8 (Brazil) – hrp2 and hrp3 present, NF135.C10 (Cambodia) – hrp2 and hrp3 present, NF166 (Guinea) – hrp2 and hrp3 present, Dd2 (Laos) – hrp2 absent/hrp3 present and HB3 (Honduras) – hrp2 present/hrp3 absent.
Read coverage files were generated by aligning raw reads in fastq format to the Pf3D7 reference genome assembly (PlasmoDB release v24) using bowtie (v2.2.9 and above). Alignment files in BAM (Binary sequence Alignment/Map) format were processed according to GATK’s (Genome Analysis Toolkit) Best Practices documentation. Genome-wide coverage per site was recovered using bedtools’ genomecov function [21, 23]. The resulting BED (Browser Extensible Data) file (a tab-delimited text file) is used as the initial input to GC3. However, any delimited file with columns for molecule identifier (e.g. Pf3D7_08_v3), chromosomal position and coverage value is acceptable. When comparing sample datasets, coverage values per base pair (bp) were standardized by dividing ‘locus coverage’ by ‘subtelomeric mean coverage’ to account for differences in sequencing depth among samples.
For GC3 to function properly, Python v3.0 and R v4.1.1 (or later versions) with the following libraries must be installed: readxl, writexl, dplyr, reshape2, and ggplot2. GC3 uses a Python-based script to extract read coverage information for genomic coordinates set by the user and processes these output files using an R script (Fig. 1). Following the framework, the user is required to provide input parameters at two junctions.
1. Extracting target coverage data – Python-based script
Within the Python script, the user is required to provide parameter inputs depending on the desired output. To use the “sliding window” option, the user must provide: 1) name of input file, 2) start coordinate, 3) end coordinate, 4) molecule identifier containing target locus (e.g. Pf3D7_08_v3), 5) interval size (i.e. window length, in base pairs), and (6) step size (i.e. shift between windows, in base pairs). In the initial window, defined by start coordinate and interval size, the average coverage is obtained by adding read coverage across all positions and dividing by interval size. The start position is then updated by adding step size to the previous start coordinate and the process is repeated until the end coordinate is reached. The output file will report an interval’s start and end coordinates separated by a colon and flanked by apostrophes and the interval’s average read coverage separated from the coordinates by a colon (e.g. '1290240:1290740': 294.228).
If individual coverage of all positions in the interval of interest is desired, then the interval size should be set to 1, and GC3 will extract values for each coordinate between start and end coordinates, inclusively (step size is automatically set to 1). Output file will report the position and respective coverage (e.g. '1372236': 387). The intermediate output is a text file with position(s) and corresponding read coverage values.
Additionally, the user can calculate mean coverage between start and end coordinates using a separate GC3 function. User parameters needed are (1) name of input file, (2) start coordinate, (3) end coordinate, and (4) molecule identifier. This function is needed if the user desires to know, for example, the mean coverage over a wider region or to standardize coverage between different sets of samples (i.e., sample subgroups).
2. Coverage data processing – R-based script
The user will need to input intermediate output files into the separate GC3’s R script to clean, and generate sample metrics and descriptive plots. At the start of the R script, the user will define (1) path to the intermediate files, (2) name of the intermediate file(s), (3) target locus’ coordinates in reference genome, (4) gene’s intron coordinates (if necessary), in reference genome, (5) position coordinates of interest (e.g. flanking gene positions), (6) list of subgroup sample identifiers and subgroup name (optional). If read coverage is to be standardized relative to coverage in a reference chromosome or chromosomal segment, then a file of mean read coverage per chromosome or segment (obtained as described above) per sample should also be defined. The GC3 R script will output several files, namely, (i) Excel version of intermediate text files, (ii) summary metrics: sample identifier, overall mean coverage -- if mean coverage file included, mean target gene coverage, proportion of gene positions with coverage, proportion of smaller regions of interest (e.g. coding regions, exons, etc.), deletion classification, and count of positions with 0X coverage, (iii) read coverage information for target gene (number of positions with zero coverage, and mean and median coverage per position over all samples), and (iv) descriptive plots: sliding window coverage over region of interest (i.e. subtelomeric region), all coordinates coverage over target gene positions (i.e. hrp2 and hrp3), and proportion of positions with zero coverage.
Python and R scripts can be found at the Silva group’s GitHub (https://github.com/igs-jcsilva-lab) as well as a README file with detailed instructions and input examples.
The gene structure of hrp2 and hrp3 in the reference 3D7 strain was obtained from PlasmoDB (www.plasmodb.org). Both hrp2 and hrp3 consist of two coding exons. Exon1 is 69 bp in length for both genes, and exon2 is 848 bp long in hrp2 and 758 bp in hrp3 (Fig. 2). The analysis of WGS data focused on subtelomeric regions of chromosome 8 (Pf 3D7 reference strain coordinates 1,290,240–1,443,449, for a total of 153,209 bp), containing the hrp2 coding DNA sequence (CDS) and intervening intron, and of chromosome 13 (Pf 3D7 reference strain coordinates 2,731,041–2,892,340, for 161,299 bp), containing the hrp3 CDS and intron. Subtelomeric coordinates were chosen to include the closest “essential” gene [24] downstream of hrp2 or hrp3 and farthest upstream functional gene (i.e. PfEMP1-encoding var gene).
Metrics were generated to classify samples by presence or absence of full or partial deletions in each locus of interest. If ≤25% of the CDS was missing (i.e. at most 25% of the reference CDS had zero coverage) the locus was considered present with a “small deletion of uncertain functional impact” (SDUFI). If > 25% (but not 100%) of the reference CDS positions had zero coverage the sample was classified as having a partial deletion (25% < %-positions-with-zero-coverage < 100%); it was classified as having a complete deletion if all positions have zero coverage. This classification is partly informed by Sepúlveda and colleagues [18], who implemented an algorithm to perform deletion calling without having to analyze the coverage profile of the entire genome. They classified deletions based on a 75% threshold of positions with ≤ 2X coverage, but we considered this to be too stringent and decreased our thresholds as explained above to account for “SDUFIs” or partial deletions less than the 75% threshold that might impact protein detection. Deletions of flanking genes were assigned to samples if > 25% of the flanking gene’s positions reported zero coverage. Intergenic regions were excluded to reduce the effect of variable read coverages in non-coding regions.
A previously described hrp2/3-specific qPCR assay capable of detecting locus deletions in mono- and poly-clonal infections [25] was utilized to compare with the hrp2/3 deletion genotype inferred by GC3. In summary, primer sequences were adapted from conventional PCR [26] to bind to conserved regions of hrp2, hrp3 and an apicomplexan-specific single copy gene used as positive control, rnr2e2 (ribonucleotide reductase R2_e2, [27]). The computational approach used by GC3 for the detection of hrp2/3 deletions was compared to this hrp2/3-specific qPCR assay [25], using the following samples:
NF54 – Positive control
7G8 – Positive control
Dd2 – hrp2 absent control
HB3 – hrp3 absent control
17 global samples (see Supplementary Table S1 for details)
The presence and classification of hrp2/3 deletions is reported for the four laboratory reference strains mentioned above and for 17 global samples from Brazil (n = 3), Cambodia (n = 6), Mali (n = 3), Malawi (n = 4) and Thailand (n = 1). Global samples with accessible DNA material were randomly selected to represent the following GC3-inferred genotype subgroups: samples with no deletions, hrp2 deletion (complete), hrp3 deletion (complete), double hrp2/3 deletion, low overall sample mean read coverage (< 20X), possible discordant pairs (partial deletion, with non-zero coverage in qPCR primer binding sites), and PCR primer site deletions (samples with zero coverage in qPCR primer binding site – either in hrp2 or hrp3). Accession ID and subgroup stratification of global samples can be found on Supplementary Table S1.
When measuring correlation between mean coverage in hrp2/3 positions and subtelomeric or upstream/downstream gene, Spearman’s rank correlation method was used (Supplementary materials). Spearman’s method accounts for non-parametric distribution and therefore mean coverages were not standardized [28]. R v4.1.1 program was used to conduct statistical analysis.
Sample read coverages by sliding windows of 1000 bp intervals and 500 bp step size were generated over the subtelomeric regions of chromosome 8 and chromosome 13 (sum of coverage across all positions in interval / interval length). Additionally, coverage at every position (interval = 1) was generated at every position between coordinates 1,372,236 to 1,377,299 on chromosome 8 and 2,835,756 to 2,847,557 on chromosome 13. These positions corresponded to hrp2 and hrp3 coordinates plus 2000 bp on either end of their respective coding regions.
WGS data from reference laboratory strains were analyzed to estimate hrp2/3 coverage per bp and the proportion of positions with coverage by at least one read (≥ 1X coverage) at hrp2 and hrp3 coordinates, and ultimately evaluate the validity of results from GC3. Expected coverage was estimated using each respective subtelomeric region as reference. Overall, for each lab strain reference, we found excellent concordance between coverage values in each locus and the respective subtelomeric chromosomal regions (Table 1). Mean subtelomeric coverage of the Dd2 strain (with hrp2 deletion genotype) was high (chromosome 8: 29X; chromosome 13: 48X), and, as expected, mean coverage at the hrp2 positions was 0X, while hrp3 mean coverage was 45X (with 100% of CDS coordinates with coverage > 0). The HB3 strain (hrp3 deletion genotype) was sequenced to ~ 145X coverage (chromosome 08 and chromosome 13 subtelomeric regions with 159X and 131X coverage, respectively). Mean coverage at the hrp3 positions was 0X (50% proportional coverage) and while hrp2 was similar to genome-wide coverage (142X, with 100% proportional coverage of gene positions). Residual coverage may have occurred at the hrp3 gene of HB3 despite its known deletion due to mapping of some reads originating from hrp2 and mapping to similar but non-orthologous locations. GC3 correctly identified HB3 as having a hrp3 gene deletion (Table 1). These results are similar to previously described coverage profiles of Dd2 and HB3 [18].
Mean subtelomeric coverage | Mean gene coverage | Proportion of coding positions with ≥ 1X coverage (%) | ||||
---|---|---|---|---|---|---|
Reference Strain | Chromosome 08 | Chromosome 13 | hrp2 | hrp3 | hrp2 | hrp3 |
NF54 | 156.5 | 173.7 | 125.2 | 134.8 | 100% | 100% |
7G8 | 201.9 | 236.3 | 247.0 | 269.9 | 100% | 100% |
NF135.C10 | 35.3 | 42.7 | 36.9 | 41.7 | 100% | 97% |
NF166 | 280.0 | 328.4 | 295.7 | 344.6 | 100% | 100% |
Dd2 | 29.3 | 48.3 | 0.0 | 45.0 | 0% | 100% |
HB3 | 159.3 | 130.5 | 142.2 | 0.35 | 100% | 50% |
GC3 can create plots of the sliding window findings in order to provide a visual perspective of the target region. To illustrate the coverage data provided in Table 1, the subtelomeric regions containing hrp2 and hrp3 of reference strains Dd2, HB3 and NF54 were plotted (Fig. 3). Results were normalized on a log scale to better visualize large fluctuations in coverage generated by whole genome shotgun sequencing. Sliding window plots confirm validation results of Dd2, HB3 and NF54, and clearly illustrate coverage for each respective strain. On chromosome 08, coverage of the Dd2 strain decreases to zero for several thousand base pairs that include the hrp2 locus, whereas NF54 and HB3 have high coverage in the same region. On chromosome 13, it is the HB3 strain that has several thousand base pairs with little or no coverage, including the hrp3 locus, whereas NF54 and Dd2 coverage remains high. GC3 visuals showed no deletions in NF54, a complete hrp2 deletion in Dd2 and a large section of little to no coverage at the hrp3 locus in HB3, respectively.
A subset of global samples (n = 17) and reference strains (n = 4) underwent qPCR specific for hrp2 and hrp3 to compare with GC3 results. Two among the selected global samples were excluded due to low parasitemia resulting in very low or no detection of the positive control gene by qPCR (Cq threshold cutoff = 37.5). There was very good agreement between GC3 (computational) and qPCR assay results. Out of four references strains and remaining 15 global samples, GC3 matched qPCR results 19/19 (100%) for hrp2 and 18/19 (94.7%) for hrp3 (Table 2). Only one sample (IGS-CBD-099) had a discordant result between methods. In particular, for this sample, GC3 classified it as having a partial deletion at the hrp3 locus, and coverage assessment with base-pair granularity suggested partial lack of read coverage, including the exon 2 primer binding region, between 2841390–2841412 (Fig. 4). It is noteworthy that the average coverage in this region is very low (~ 1X), however coverage is high for the corresponding chromosomal subtelomeric and core regions (~ 124X and ~ 143X, respectively). On the other hand, the qPCR assay was positive for hrp3 (Cq = 25.4). Taken together, the results suggest the sample has a partial deletion at the hrp3 locus, which does not encompass the qPCR primer binding regions, but that is possibly close enough to the binding site of the primer in exon 2 to interfere with read mapping in that region.
hrp2 b | hrp3 b | rnr2e2 | |||
---|---|---|---|---|---|
(control gene) | |||||
Sample name | Country | Phenotype subgroup | GC3/PCR | GC3/PCR | PCR |
7G8 | Reference (Brazil) | Control | present/present | present/present | present |
NF54 | Reference | Control | present/present | present/present | present |
Dd2 | Reference (Laos) | Control - hrp2 deletion | absent/absent | present/present | present |
HB3 | Reference (Honduras) | Control - hrp3 deletion | present/present | absent/absent | present |
IGS-BRA-017sA | Brazil | No deletions | present/present | present/present | present |
IGS-THL-017 | Thailand | No deletions | present/present | present/present | present |
IGS-BRA-021 | Brazil | No deletions | present/present | present/present | present |
IGS-CBD-026 | Cambodia | No deletions | present/present | present/present | present |
IGS-CBD-031 | Cambodia | hrp2 deletion (complete) | absent/absent | present/present | present |
IGS-MLI-036 | Mali | hrp3 deletion (complete) | present/present | absent/absent | present |
IGS-BRA-001sA | Brazil | double hrp2/3 deletion | absent/absent | absent/absent | present |
IGS-CBD-008 | Cambodia | Low coverage sample | present/present | present/present | present |
IGS-MWI-254sA | Malawi | Low coverage sample | present/present | present/present | present |
IGS-MWI-251sA | Malawi | Low coverage sample | present/present | present/present | present |
IGS-MLI-039 | Mali | hrp2 discordant pair | present/present | present/present | present |
IGS-MLI-031 | Mali | hrp3 discordant pair | present/present | present/present | present |
IGS-CBD-034 | Cambodia | hrp2 PCR primer deletion | present/present | present/present | present |
IGS-CBD-094 | Cambodia | hrp3 PCR primer deletion | present/present | present/present | present |
IGS-CBD-099 | Cambodia | hrp3 PCR primer deletion | present/present | absent/present | present |
a Cells in purple and blue denote agreement between GC3 and qPCR results (i.e. "PCR"); red denotes disagreement between methods. | |||||
b For GC3 assignments, if read coverage is present in > 75% of the CDS of a target gene, locus is designated as "present". When read coverage is absent for at least 25% of the CDS, a deletion is inferred, and locus is designated as "absent". |
Previously, a subset of Kenyan samples (n = 27) was genotyped for hrp2/3 deletions, with two and one deletions identified in hrp2 and hrp3, respectively [18] (Table 3). In addition, Sepúlveda and colleagues also identified no hrp2 deletions among twelve Peruvian samples and two samples with deletions in hrp3 [18]. The criterion previously used to call deletions was > 75% of gene positions with ≤ 2X coverage [18]. We wanted to determine how GC3 performed on a similar set of samples. Accordingly, we downloaded from the SRA samples from the same time point from Kenya (n = 57, including 24 from [18]) and Peru (n = 11, including 7 from [18]). For hrp3, we identified the same number of deletions reported previously [18]. However, the hrp2 were discordant. We identified one complete deletion reported among Kenyan samples, and we observed one partial hrp2 deletion among the Peruvian samples, which differs from previous reporting. Among the Kenyan samples, a previous study reported two hrp2 deletions [18], one of which was also identified by GC3. Whereas the discordant Peruvian sample was only identified to have a deletion by GC3. The difference in assessment is likely due to differences in the criteria used between GC3 and that used by Sepúlveda and colleagues to call deletions.
Country | n | hrp2 | |||||
---|---|---|---|---|---|---|---|
No Deletion | GC3 Deletion Classification (Partial/Complete)a | ex1-/ex2 + b | ex1+/ex2-b | ex1-/ex2-b | Previous Genotype (Deletion/Total)c | ||
Kenya | 59 | 58 | 0 / 1 | 0 | 0 | 1 | 2 / 27 |
Peru | 11 | 10 | 1 / 0 | 1 | 0 | 0 | 0 / 12 |
Country | n | hrp3 | |||||
No Deletion | GC3Deletion Classification (Partial/Complete)a | ex1-/ex2 + b | ex1+/ex2-b | ex1-/ex2-b | Previous Genotype (Deletion/Total)c | ||
Kenya | 59 | 58 | 1 / 0 | 0 | 0 | 1 | 1 / 27 |
Peru | 11 | 9 | 2 / 0 | 0 | 1 | 1 | 2 / 12 |
ahrp2/3 deletions assigned to isolates with > 25% CDS positions with zero coverage. bExon absence (-) assigned to isolates if > 25% exon positions have zero coverage. (+) signifies exon is present. | |||||||
cPrevious hrp2/3 deletion genotype results from Sepúlveda and colleagues. Deletions called for samples with > 75% of coding region with ≤ 2X coverage [18]. |
All global samples used in the study (n = 1114) were examined for hrp2/3 deletions (Supplementary Table S2). Cambodian (n = 127) and Brazilian (n = 20) samples were further visualized in more detail at the subtelomeric regions of interest (Supplementary Figure S2) and examined for hrp2/3 exon presence/absence since they have not previously been described (Table 4). Although these are undescribed samples, computational results are comparable to previous estimates in each respective region, where deletions have been previously observed [18, 29]. Among Cambodian samples collected in 2009–2011, there were one hrp2 deletion, eight hrp3 deletions, and two hrp2/3 double deletions (both hrp2 and hrp3), with prevalence of 0.8%, 6.3% and 1.6%, respectively. All hrp2 deletions corresponded to absent exons (> 25% zero coverage positions on both exons), but were classified as two partial and one complete hrp2 deletion as two samples still had coverage in a low proportion on hrp2 positions. Deletions of hrp3 among Cambodian samples were classified as seven partial and three complete deletions. Among Brazilian samples collected in 2016, four had hrp2/3 double deletions, and eleven with hrp3 deletions, corresponding to 20% and 55% prevalences respectively. Of the four hrp2 deletions, two were partial deletions, and two were complete hrp2 deletions. Of the 15 hrp3 deletions, one was a partial deletion on exon 2, and 14 were complete hrp3 deletions.
Country | n | hrp2 | ||||
---|---|---|---|---|---|---|
No Deletion | Deletion Classification (Partial/Complete)a | ex1-/ex2 + b | ex1+/ex2-b | ex1-/ex2-b | ||
Cambodia | 127 | 124 | 2 / 1 | 0 | 0 | 3 |
Brazil | 20 | 16 | 2 / 2 | 0 | 1 | 3 |
Country | n | hrp3 | ||||
No Deletion | Deletion Classification (Partial/Complete)a | ex1-/ex2 + b | ex1+/ex2-b | ex1-/ex2-b | ||
Cambodia | 127 | 117 | 7 / 3 | 0 | 2 | 8 |
Brazil | 20 | 5 | 1 / 14 | 0 | 1 | 14 |
ahrp2/3 deletions assigned to isolates with > 25% CDS positions with zero coverage. bExon absence (-) assigned to isolates if > 25% exon positions have zero coverage. (+) signifies exon is present. |
Quantifying flanking region deletions among Cambodian samples
Deletions at the hrp2/3 positions may extend to flanking genes, possibly with additional impact on overall parasite fitness. Therefore, an option in the GC3 R-script was built in to assign deletions of flanking genes. To determine whether deletions extend into these flanking coding regions, a table was generated for Cambodian samples, where samples with > 25% gene positions with zero coverage in upstream and downstream flanking regions were classified as having a locus deletion (Table 5). Of note are the observations that the presence of a hrp2/3 deletion is not always associated with deletions in flanking genes and, conversely, deletions in flanking genes are not always associated with hrp2/3 deletions. A subset of Cambodian samples has been plotted to illustrate flanking gene coverage as it relates to hrp2 or hrp3 (Supplementary Figure S3). Results suggest that deletions can occur independently in hrp2 (or hrp3) and their respective flanking genes.
hrp2a | hrp3a | |||
---|---|---|---|---|
Flanking geneb (Upstream / Downstream)c | Present | Absent | Present | Absent |
Present / Present | 124 | 2 | 113 | 3 |
Absent / Present | 0 | 1 | 4 | 0 |
Present / Absent | 0 | 0 | 0 | 1 |
Absent / Absent | 0 | 0 | 0 | 6 |
Total | 124 | 3 | 117 | 10 |
aDeletion assigned to samples if > 25% of coding region positions had zero coverage | ||||
bDeletion of flanking genes assigned to samples with > 25% gene positions with zero coverage | ||||
cFor hrp2, an upstream gene was a STEVOR family gene, and a downstream gene encoded heat shock protein 70. For hrp3, upstream and downstream genes were PHIST-encoding genes of unknown function. |
Coverage plots of hrp2/3 coordinates were generated for Cambodian, Malawian, and Malian samples to demonstrate: 1) magnified plots of only hrp2/3 positions and 2) differences in relative depth of coverage between samples collected at different time points (Fig. 5). To account for differences in sequencing depths between samples, coverage values were standardized. For each sample, site or locus coverage were divided by the expected coverage, obtained from mean coverage in subtelomeric region in which each locus is located (see Methods for coordinates). Standardized coverage of ~ 1 shows locus coverage similar to subtelomeric mean coverage. A strong, positive correlation between subtelomeric and hrp2/3 gene coverage justifies the use of this standardization approach (Supplementary Figures S4 and S5). The expectation of results is that standardized coverage over time will decrease if hrp2/3 deletions increase in frequency. To avoid undue impact of outlier standardized values, median standardized coverage was plotted per group.
Standardized read coverage for Cambodian samples were plotted alongside standardized coverage for the geographically representative strain NF135.C10 (Fig. 5A), and showed that the uneven standardized coverage of NF135.C10 is mirrored in the clinical samples. This suggests there are sequence-inherent properties that impact sequencing or mapping success. (See Supplementary Table S3 for descriptive coverage of each sample). In contrast, Malawian samples collected in 2014–2016 had lower median standardized coverage than the sample set collected in 2007–2008 (Fig. 5B), suggesting an increase in hrp2 and hrp3 deletions between the two time points. Interestingly, median standardized coverage is low (< 1) in both time points, showing that read coverage in the target genes is half of that in subtelomeric regions. The majority of Malawian samples underwent sWGA (n = 139) prior to sequencing (Supplementary Table S4) which may explain the lower standardized coverage as compared to standardized coverage of directly sequenced samples (Supplementary Figure S6). Of further interest, Mali samples from 2002 had lower standardized coverage than 2010 samples on both hrp2 and hrp3 positions (Fig. 5C). All Malian WGS data was obtained by direct sequencing of total DNA from venous blood, using a similar protocol [21, 30] with good coverage in the core genomes (Supplementary Table S5), implying the quality of WGS data did not contribute to this result. Overall, figures offer a visual perspective between different time points as monitoring of hrp2/3-deletions become crucial in the event of their expansion, but prior to any interpretation, the factors that influence WGS coverage should be considered.
To provide a clearer illustration of the proportional frequency of gene coordinates with zero coverage and the location of those positions along the locus, a view of hrp2 and hrp3 gene positions by proportional counts of no coverage (0X coverage) vs. coverage (≥ 1X coverage) was generated for Cambodian samples (Fig. 6). Supplementary Figure S7 provides the same proportional counts of zero coverage per position for Malawian and Mali samples. Among Cambodian samples, there is a clear increase in 0X coverage positions at the intron regions (hrp2 intron: 1375232–1375083; hrp3 intron: 2841636–2841484) relative to exon coverage. This is to be expected, as the length (145–148 bp) and the nucleotide composition of these Pf introns (hrp2 AT%: 91%; hrp3 AT%: 91.2%) prevent unambiguous mapping of 101 bp-long reads centered in the middle of the intron. On hrp3 positions, there are also two spikes in zero coverage positions on either side of coordinate 2,841,250, likely due to differences in Cambodia samples compared to Pf3D7 reference, such as indels or rapidly evolving sequence motifs among genetically similar Cambodian strains, which prevent read mapping in a subset of samples. That read mapping pattern is also observed in the troughs in hrp3 coverage plot in Fig. 5A, for samples collected in 2010 and 2011 (but curiously not in Pf NF135).
Next-generation short-read WGS data has the capability to provide detailed genotype information, but often necessitates a good understanding and use of bioinformatics tools and packages. GC3 was developed to be a user-friendly computational tool to (1) extract coverage profiles of target genome regions, 2) provide interpretable results regarding deletion prevalence, 3) classify samples according to the type of gene deletions, and (4) inform NMCPs concerning coverage of important genes. In this study, we demonstrated that GC3 can be used for these purposes by applying it to Plasmodium falciparum genome segments, specifically the regions containing hrp2 and hrp3 genes. Most NMCPs in malaria-endemic settings rely on HRP2-based RDTs for day-to-day diagnosis in both clinical and field settings. There is evidence of recent expansions of Pf strains lacking HRP2 expression, a cause for concern as stated by the WHO [1, 19, 31, 32]. Computational toolkits, such as GC3, need to be able to efficiently analyze Pf genomic data and assess for the presence of hrp2- or hrp3-deletion strains, which provides a valuable monitoring resource to researchers and public health professionals concerned with malaria RDT effectiveness.
This work demonstrated the validity and utility of the GC3 tool to measure hrp2 and hrp3 deletion frequencies in Pf individual samples and in sample sets. The sliding window capability of GC3 provided a wider view of large deletions in reference lab strains (i.e. Dd2 and HB3) while adjusting for fluctuations in subtelomeric read coverage and allowed for visually-friendly figures. The option of extracting and plotting every chromosome position within an interval allowed for a magnified view of target loci, and better illustrated details in coverage within the target gene. GC3’s genotype results were validated against previously reported genotypes of Pf laboratory reference and representative strains, and similar, publicly available sample sets from Kenya and Peru, where hrp2/3 deletions have been observed [11, 18, 33]. Analysis of samples from Cambodia and Brazil demonstrate GC3’s capability to process novel WGS data from regions other than West Africa, the location of origin of PfNF54, the parental isolate from which the reference 3D7 was cloned [34, 35]. Results are consistent with previous estimates of hrp2/3 deletions among each respective country [19, 29, 36]. Of note is the high prevalence of all deletions, especially hrp3 deletions among Brazilian samples. Deletions in hrp2/3, and reports of high deletion prevalence, were first observed in the American continent [7, 8, 16, 18, 36–38], especially hrp3 deletions [7, 39]. Overall, GC3 can appropriately process and analyze publicly available WGS datasets from a variety of genomic studies.
Additional comparison against a hrp2/3-specific qPCR assay demonstrated very good reliability of GC3’s capability. Although there was one discordant result between tools, this may be a reflection of GC3’s sensitivity and the qPCR assay’s difficulty to detect partial deletions, as these are only detected by the qPCR-based assay if they overlap the primer-binding or amplicon sites. A potential challenge for GC3 are the samples with a very low amount of parasite DNA resulting on overall low depth of coverage and uneven representation of the locus in the genomic library and/or among the WGS data, leading to significant regions of the hrp2/3 loci with zero coverage (and then a ‘deletion’ assessment by GC3), despite the loci being present in the genome. However, this situation was not observed in our study. In the specific case of discordant results in a sample from Cambodia, the sample had very high coverage at the core and subtelomeric regions. In general, partial deletions present a challenge since some cases have shown a qPCR assay can amplify part of hrp2/3, but corresponds to false negative RDT diagnoses [40, 41]. Ultimately, there were very few such samples, so their impact on overall results is considered minimal. Although beyond the scope of this study, further examination of hrp2/3 partial deletions, their specific location within the locus and their respective RDT diagnosis may provide valuable information regarding the most appropriate criteria and thresholds to accurately identify gene deletions with a functional phenotype, i.e., those deletions that abrogate protein expression.
Among Cambodian samples, read coverage in flanking genes was further analyzed and demonstrated that hrp2/3 deletions can be restricted to just the locus proper, or extend to flanking genes, but without a discernible pattern. These results are consistent to previous reports [18, 19, 39, 42, 43]. Overall, GC3 reported similar results in deletion prevalence and classification trends among global malaria-endemic regions.
Utilizing the function of GC3 to extract all positions (i.e. interval = 1, step size = 1) and plotting hrp2/3 coding positions only, coverage is clearer and subsets of samples can be compared, if desired. When using this option to compare countrywide samples collected at different time points, interestingly 2002 Malian samples had lower standardized coverage than older samples (2002 relative to 2010) for both hrp2 and hrp3. This result was unexpected, since the frequency of deletions is expected to increase over time, due to the selection imposed by parasite detection by RDT. Given the size of the sample subgroups, consistent results across both hrp2 and hrp3, and generation of WGS data directly from total DNA extracted from venous whole blood in both 2002 [30] and 2010 [21], it is improbable that differences in read mapping success would cause such a difference in coverage. Interestingly, hrp2 deletions had already been observed in Mali in the late 1990s prior to significant RDT use in the country, indicative of spontaneous deletions that were maintained despite the absence pressure from RDT usage [44]. In fact, random polymorphisms can occur naturally in the subtelomeric region, including large deletions, without evolutionary pressure [45], Further, the fitness cost associated with hrp2 loss is not significant, although some small cost is associated with hrp3 deletions [3, 18]. Considering previous observations and minimal fitness cost associated with hrp2/3, GC3’s Mali results are possibly due to the expansion of a circulating genotype with naturally occurring hrp2/3 deletions in 2002 that were less prevalent or non-existent by 2010. Ultimately, this observation, together with the high frequency of hrp2 and, especially, hrp3 deletions in Brazilian isolates, suggest that there may be factors that impact deletion rate at these loci or forces, other than RDT usage, that favor strains with hrp2/3 deletions.
Closer examination of the proportional coverage at each hrp2/3 position among Cambodian samples, revealed in each gene positions with zero read coverage resulting from mapping artifacts, and which may confound results. On hrp2 positions, zero coverage positions increase and then spike around the intron region which, as mentioned before, is likely due to the intron’s high AT content [46] that can cause challenges for read mapping. Interestingly, there were two spikes in zero coverage positions on the hrp3 locus. Examination of a previous whole genome hierarchical cluster analysis of the same Cambodian samples [47] revealed that the majority of samples contributing to one or both peaks belong to the same Cambodian subpopulation and hence share a similar genetic background. Ultimately, the figure offers a useful preliminary view of the hrp2/3 genes and their characteristics.
Some limitations exist when analyzing the GC3’s results. Firstly, the quality of computational results is influenced by the depth of the P. falciparum sequencing data, as measured by the total number of reads mapped to the reference genome. In the case of P. falciparum, we have observed that the percentage of genome with coverage asymptotes at ~ 12 million 100-bp reads mapped to the parasite genome, averaging ~ 52X coverage genome-wide. It is also recommended the user apply GC3 to calculate mean coverage over the region/chromosome where the target gene is located to estimate expected coverage at the general area of the target gene. Further, high-quality WGS data results obtained with well-described DNA extraction methods and sequencing methods, either by direct sequencing or sWGA [21, 23], and established quality control and filtering protocols should be used when comparing samples from different studies. While comparing standardized coverage between direct and sWGA sequence data, direct sequencing achieves more uniform coverage due to the inefficient amplification on the subtelomeric region by sWGA primers [23], but sWGA still provides good coverage at hrp2/3 gene positions (Supplementary Figure S6). Even with high-quality data, and pertinent to hrp2/3 deletion detection, polyclonal infections add another layer of complexity since they can mask the lack of coverage at a target gene, especially in high transmission regions [17]. Particularly if GC3 is the only method being implemented to assess for deletions, this factor would need to be considered. The deletion criteria can be easily adjusted to be more stringent by the user depending on their purposes, much like the deletion criteria used in the comparator study [18]. Despite the limitations, GC3 appropriately processed hrp2/3 coverage data and classified deletions. Its utility can be extended to analyze and visualize coverage data of any target gene on any pathogen.
Overall, validation of the GC3 tool to extract and process WGS data was successful when comparing with expected results using reference strains, well-described samples and a hrp2/3-specific qPCR assay. Following our criteria for identifying deletions, the results agreed with previous estimations of hrp2/3 deletion frequency in each respective country. Apparent in the results is the level of detail that can be extrapolated from short-read WGS data and viewed using a comprehensive computational tool. Although challenges persist in ensuring high-quality WGS data and achieving similar coverage among low parasitemia samples using sWGA, we expect GC3’s results to be fairly robust. Further investigation of the partial deletions threshold that results in a false negative RDT diagnosis is needed to validate the deletion criteria. Ultimately, groups investigating a target gene’s coverage can use GC3 to efficiently generate translatable results and figures to understand and interpret broad patterns using hundreds to thousands of previously generated genomic datasets.
National Malaria Control Program
Rapid Diagnostic Tests
loci encoding histidine-rich proteins 2 and 3
Plasmodium falciparum
Whole genome sequencing
selective whole genome amplification
Sequence Read Archive
Binary sequence Alignment/Map
Genome Analysis Toolkit
Browser Extensible Data
base pair
coding DNA sequence
Ethics approval and consent to participate
All WGS datasets used are publicly available through GenBank. P. falciparum samples for which DNA material was utilized to perform qPCR were collected during field studies in Brazil, Malawi, Mali, and Thailand. All samples were collected after recruitment/enrollment and with written informed consent from the subject or a parent/guardian, and assent obtained from children under 18 years old. Informed consent included authorization to generate parasite genomic data. The respective field studies were approved by the following Institutional review boards (IRBs): Institute of Biomedical Sciences, University of São Paulo (1368/17/CEPSH), for collection in Acre, Brazil; Malawi National Health Sciences Research Committee, and the IRB at the University of Maryland, Baltimore (FWA#00005976), for collection in Malawi; Comité D’Éthique de la FMPOS (Faculté de Médecine de Pharmacie et d’Odonto-stomatologie), Université des Sciences, des Techniques, et des Technologies de Bamako, Mali, and the IRB at the University of Maryland, Baltimore (HP-00041382), for collection on Mali; Ethics Review Committee of the Department of Medical Research, Human Subject Protection Branch (HSPB) of the Walter Reed Army Institute of Research, Silver Spring, MD (FWA# 00000015, IRB# 00000794), and Institute for the Development of Human Research Protection (IHRP), Ministry of Public Health, Thailand (FWA# 00017503, IRB#00006539), for collections in Thailand. This study conforms to the principles established in the Helsinki Declaration.
Consent for publication
Not applicable
Availability of data and materials
The datasets from field studies are available from public repositories, with access information reported previously [21].
Competing interests
The authors declare that they have no competing interests.
Funding
This work was funded in part by the National Institutes of Health (NIH) awards U19 AI110820 and R01 AI141900 to JCS. AO received funding from the National Heart, Lung and Blood Institute (1K01HL140285-01A1).
Author’s contributions
TCS conceived the study, designed the bioinformatics toolkit, performed the analyses, interpreted the results and wrote the paper. AD supported the development of GC3, contributed to the interpretation of results, and edited the manuscript. JCS conceived and designed the study and bioinformatics toolkit, interpreted the results, contributed funding and resources, and wrote the paper. BS and SJ were responsible for all laboratory procedures and edited the paper. TS designed the hrp2/3-specific qPCR protocol and edited the paper. AO contributed to the interpretation of results and edited the paper. GG and CD contributed fundamental resources to the study and reviewed the paper. All authors read and approved the final version of the manuscript.
Acknowledgements
We would like to thank the study teams that supported the collection of samples in the Brazil, Malawi, Mali, and Myanmar and Thailand studies. Specifically, we would like to acknowledge the generosity of the UMB PIs from each respective study, namely Dr. Miriam Laufer, Dr. Mark Travassos, and Dr. Shannon Takala-Harrison as well as the University of Sao Paulo PI, Dr. Marcelo Urbano Ferreira. Parasite isolate DNA from Cambodia and Thailand was derived from clinical samples collected by the Department of Bacterial and Parasitic Diseases at the Armed Forces Research Institute of Medical Sciences in collaboration with Dr. Shannon Takala-Harrison and had undergone whole genome sequencing as part of a previous study [48]. Malawian isolates were generated from samples by the Mfera Health Center in the Chikhwawa district from a previous investigation [49]. Malian sequences comes from previous studies conducted in collaboration with the Bandiagara Malaria Project [21, 30]. Brazilian samples were collected by the University of Sao Paulo Department of Public Health Practices [21]. Accession numbers of samples are provided in the supplementary materials Table S1.