Study area
The study was conducted in two sites, the Cape Coast Metropolitan area with Cape Coast as the main township and Kassena-Nankana districts (KNDs) with Navrongo as the main township (Fig 1). Cape Coast is located in the Southern coastal savannah region with low to moderate perennial malaria transmission but with marked seasonal effect during the rainy season (May-August and October-November). The estimated annual entomological inoculation rate (EIR) is fewer than 50 infective bites per person per year [21]. The KNDs are located in the Upper East Region of Ghana with a Guinea savannah vegetation. Malaria is perennial in the KNDs with high seasonal malaria transmission during the rainy season (June to October), and minimal transmission during the rest of the year, which are relatively dry months. The estimated annual EIR for the KNDs is up to 157 infective bites per person per year [22].
In Cape Coast, P. falciparum parasites were isolated from 101 children (aged 6-59 months) living within the municipality and presented with clinical malaria at the Cape Coast District hospital. Samples were collected during the major rainy season (May-August) in 2013. In Navrongo, P. falciparum parasites were isolated from 131 children aged between 12-59 months who lived in the KNDs and also presented with P. falciparum clinical malaria at health facilities in the KNDs in the years 2010 (January to October), 2011 (January to February) and 2013 (August to October) during both dry and wet seasons. For both study sites, children presenting with fever i.e. axillary temperature ≥ 37.5° or history of fever during the previous 24 hours were screened with malaria Rapid Diagnostic Test (RDT). Blood smears were prepared for RDT positive individuals and P. falciparum asexual parasites were determined by microscopy. Venous blood (2-5mL) was collected and archived from P. falciparum infected patients who gave consent.
Fig 1. Location of the Navrongo within the Kassena-Nankana Districts and Cape Coast within the Cape Coast Metropolis. The distance between Cape Coast and Navrongo is approximately 784.8 km (generated by QGIS software).
Genomic DNA extraction and sequencing
Genomic DNA was extracted using the QiaAmp DNA prep kit (Qiagen, Valencia, CA) following the manufacturer’s protocol and confirmation of P. falciparum positive samples was done by amplification using nested PCR with specific primers [23] (see Additional file 1). The Genomic DNA was submitted to the Wellcome Trust Sanger Institute Hinxton, UK. for whole genome sequencing using the Illumina HiSeq platform as part of the MalariaGEN community project. Illumina sequencing libraries (200bp insert) were aligned to the reference P. falciparum 3D7 genome after which variant calling was done following the customized GATK pipeline. Each sample was genotyped for 797,000 polymorphic bi-allelic coding SNPs across the genome ensuring a minimum of 5x paired-end coverage across each variant per sample. The dominant allele was retained in the genotype file at loci with mixed reads (reference/non-reference). The genotypes were assigned denoting the reference and non-reference nucleotides as 0 and 1 respectively. Polymorphic sites with low call rates and those in hypervariable, telomeric and repetitive sequence regions were excluded.
Sequence acquisition and pre-processing
Genome sequences from Navrongo and Cape Coast were mined from the MalariaGEN Plasmodium falciparum Community (Pf3k) Project release 5.1 Database [24] in variant call format (VCF). Genetic variants on chromosome 3 were retrieved for both Navrongo and Cape Coast. For the VCF file of Cape Coast, all genotypes at each SNP position were mono-allelic (monoclonal); we modeled biallelic genotypes using a custom-made Python script. This was based on the approach by the MalariaGEN Pf3k Project, where loci with mixed allele calls were modeled using the read and allelic depth [25]. Briefly, to account for PCR errors, genotypes of SNPs with read depth <5 were not determined. At SNP positions with read depth ≥ 5, the sample was genotyped as heterozygous if the allelic depth of both alleles were ≥ 2. The remaining alleles were either genotyped as the homozygote reference allele or homozygote alternative/derived allele.
Data for both populations were filtered to obtain only biallelic SNPs using Bcftools v1.9 [26] and quality checked as follows: Only SNPs that passed all VCF filters were retained. Isolates with > 10% missing SNPs were excluded followed by the removal of SNPs with > 5% of missing data using PLINK v1.9 [27]. Further, heterozygosity was calculated and 8 isolates with outlier heterozygosity within the Cape Coast population were excluded. No outlier heterozygosity was observed in Navrongo. SNPs with a minor allele frequency (MAF) < 1% were removed. The remaining missing SNPs were imputed and phased using Beagle v5.1 [28]. After quality control, the Cape Coast dataset remained with 2504 SNPs out of 26156 within Chromosome 3 and 92 out of 101 samples. On the other hand, the Navrongo dataset retained 1954 out of 43199 SNPs within Chromosome 3 and 128 out of 131 samples. The Pfcsp was then extracted from chromosome 3 (position: 221323 – 222516) and retained SNPs at the CSP loci used were 13 and 22 for Cape Coast and Navrongo respectively.
Population genetics analysis
Minor allele frequency distribution
Prior to the removal of rare alleles (MAF ≤0.01), the minor allele frequency distribution for all putative SNPs (n = 90) within the Pfcsp for both Cape Coast (n = 35 SNPs) and Navrongo (n = 55 SNPs) P. falciparum isolates was determined using Plink1.9. MAF is the frequency at which the second most common allele occurs at a given SNP position in a population.
Within host parasite diversity estimation and statistical analysis
The genetic diversity within the individuals was assessed by estimating Wright’s inbreeding co-efficient (FWS). For this analysis, we were primarily interested in the within host diversity of Pfcsp which refers to the number of different Pfcsp strains contained within an individual infection. The retained variants (13 and 22 SNPs) from the 92 Cape Coast isolates and 128 Navrongo isolates were used for this analysis.
The Fws metric estimates the heterozygosity of parasites (HW) within the individual relative to the heterozygosity within the parasite population (HS) using the read count of alleles. The Fws metric calculation for each sample was done using the equation:
where HW refers to the allele frequency of each unique allele found at specific loci of the parasite sequences within the individual and HS refers to the corresponding allele frequencies of those unique alleles within the population [29,30]. Fws ranges from 0 to 1; a low Fws value indicates low inbreeding rates within the parasite population thus high within host diversity relative to the population. An Fws threshold ≥0.95 implies samples with clonal (single strain) infections while samples with Fws <0.95 are considered highly to have mixed strain infections signifying within host diversity. Fws was calculated using an R package, moimix [31]. Samples with clonal infections were used for selection analysis. The Pearson Chi square test was used to measure the statistical significance of any differences observed in the within host diversity estimates between the population pair. The test was done using R software with P values of <0.05 considered statistically significant.
Genetic diversity within parasite populations
We examined the haplotype diversity (the number of two random strains within the population having different haplotypes) of the Pfcsp in each population by exploring the variants in the C-terminal region of the gene (909 – 1140bp). We re-constructed 184 Pfcsp fasta DNA sequences with the retained variants (13 SNPs) from the 92 Cape Coast isolates and 256 DNA sequences (22 SNPs) from the 128 Navrongo isolates using an in-house Python script.
The following metrics were then used to assess the diversity of Pfcsp C-terminal within each parasite population using the DnaSP software (version 6.10.01) [32]: number of sequences (n), number of haplotypes (h), segregating sites (S), the average number of pairwise nucleotide differences (K), nucleotide diversity (π) and haplotype diversity (Hd) [33,34].
To assess the genealogical relationships between the Pfcsp C-terminal haplotypes found in Navrongo and Cape Coast, we constructed a network based on the method described by Templeton, Crandall, and Sing (TCS) [35,36] using PopArt [37]. The haplotypes were denoted as 3D7, Hap 2 up to Hap 66 in the network.
We further explored the amino acid haplotypes within each population by translating all the 440 Pfcsp DNA sequences (Cape Coast (184) and Navrongo (256)) into amino acid sequences and comparing them to the 3D7 reference strain (0304600.1, PlasmoDB [38]) using in-house Python scripts. The frequency of TH2R 311-327 amino acid (PSDKHIKEYLNKIQNSL) and TH3R 352-363 amino acid (NKPKDELDYAND) haplotypes in each parasite population were determined also using a customized Python script and plotted.
Population differentiation and structure analysis
Wright Fixation index (Fst) and principal component analysis (PCA) were used for population differentiation and structure analyses. To reduce bias in Fst and PCA analysis, we pruned out SNPs (from the 2504 Cape Coast chromosome 3 retained SNPs and the 1954 Navrongo retained SNPs) with pairwise linkage disequilibrium (LD) value, r2 >0.5 within a window of 100bp in the entire chromosome 3 dataset using a step size of 10. The remaining SNPs set at Chromosome 3 shared between the populations after pruning was 516 of which 10 were Pfcsp SNPs.
The Pfcsp SNPs were then used to estimate Fst and population structure. Weir and Cockerham’s Fst per SNP between Cape Coast and Navrongo parasite isolates was calculated using Vcftools v0.1.5 [39] and population structure by PCA was done using smartpca (Cambridge, MA, USA) in EIGENSOFT package v6.1.3 [40]. Principal components were computed with the number of outlier removal iterations set at 10 while maintaining other parameters. In all, 10 PCs were computed with 5 and 9 outlier samples removed from the 92 and 128 isolates from Cape Coast and Navrongo respectively. Thus, there remained 83 samples in Cape Coast and 123 samples in Navrongo population after outlier samples were removed.
Signatures of selection
To test for SNP neutrality, the Tajima’s D statistical test [41], was done in sliding windows of size 100bp and step size of 10 with Pfcsp monoclonal samples from each population using Vcftools v0.1.5. Tajima’s D test compares the average pairwise differences (pi) and the total number of segregating sites (S). Negative values indicate directional or purifying selection while positive values indicate balancing selection.
To detect loci likely to be under recent positive selection in the Cape Coast and Navrongo monoclonal chromosome 3 isolates, we calculated the standardized integrated haplotype score (|iHS|) for each SNP with a MAF >0.05 in chromosome 3 (358 out of the 2504 and 608 out of the 1954 remaining SNPs from Cape Coast and Navrongo respectively) [42]. Again, for the purpose of this analysis, the Fws metric was used to estimate these monoclonal chromosome 3 isolates in the retained variants within the Chromosome 3 region (2504 in Cape Coast and 1954 SNPs in Navrongo).
|iHS| measures the amount of extended haplotype homozygosity (EHH) at a given SNP along the ancestral allele relative to the derived allele [42]. The reference and alternate alleles were characterized as ancestral and derived alleles respectively. This was done in R using the rehh package v2.0.4 [43]. Genomic regions under positive selection were identified as those with multiple SNPs having |iHS| values >3 and formed the focal SNPs for extended haplotype homozygosity (EHH) analysis. EHH for both the reference and alternate alleles were calculated and bifurcation plots generated to visualize the decay of EHH at increasing distances from the focal SNP loci [44] using rehh package v2.0.4 in R.