Clinical samples. WGS was attempted on 106 Plasmodium falciparum isolates collected from subjects with uncomplicated malaria or asymptomatic infection from 2015 to 2017. Forty three of these were leukodepleted blood collected as part of an in vivo efficacy study of artemether-lumefantrine (AL) in pediatric uncomplicated malaria patients collected from 2015-2017 in Yombo, Bagamoyo District. A remaining 63 isolates were from dried blood spots (DBS) collected in Zanzibar in 2017. These came from cross-sectional surveys of asymptomatic individuals (n = 34) and an in vivo efficacy study of artesunate-amodiaquine (ASAQ) with single low dose primaquine (SLDP) in pediatric uncomplicated malaria patients (n = 29). These isolates essentially represent a convenience sample. We note that isolates were not selected for sequencing on the basis of specific clinical or epidemiologic characteristics; however, sequencing was more likely to be successful on isolates from subjects with high parasitemia. Study participants from Zanzibar were asked to report any overnight travel away from home in the past 4 months. Responses were coded as yes (overnight travel to mainland Tanzania or Kenya) or no (no overnight travel off islands of Zanzibar). Clinical characteristics of the attempted and sequenced samples from each cohort from Zanzibar is provided in Supplemental Table 1.
Generation and sequencing of libraries. Leukodepleted blood samples and DBS were extracted using QIAmp 96 DNA blood kits per the manufacturer protocol (Qiagen, Hilden, Germany). DNA from leukodepleted blood was acoustically sheared using a Covaris E220 instrument, prepared for sequencing without enrichment using Kappa Hyper library preps, and individually barcoded per manufacturer's protocol (Kappa Biosystems, Columbus, OH). DNA extracted from DBS was enriched for P. falciparum DNA before library prep using two separate selective whole genome amplification (sWGA) reactions. The sWGA approach was adapted from previously published methods and employed two distinct sets of primers designed for P. falciparum, including the Probe_10 primer set described previously by Oyola et al. and another set of custom primers (JP9) we designed using ‘swga’ [19–21]. We included phosphorothioate bonds between the two most 3’ nucleotides for all primers in both sets to prevent primer degradation. Design and evaluation of these custom primers and the sWGA approach are described in the Supplemental Materials and Supplementary Table 2. The two sWGA reactions were carried out under the same conditions. The products of the two sWGA reactions were pooled in equal volumes and acoustically sheared using a Covaris E220 instrument before library preparation using Kappa Hyper library preps. The indexed libraries were pooled and sequenced on a HiSeq4000 using 2x150 chemistry at the University of North Carolina High Throughput Sequencing Facility. Sequencing reads were deposited into the NCBI SRA (Accession numbers: pending).
Public sequencing data. Illumina short read WGS data for Plasmodium falciparum isolates was downloaded from public databases. This included 68 isolates from other regions of Tanzania, collected between 2010 and 2013, as well as 179 isolates from other regions, including Southeast Asia, South Asia, East and West Africa (Supplemental Table 3).
Read alignment and quality control. Raw paired-end reads were trimmed for adapter sequences with `cutadapt` v1.18 and aligned to the P. falciparum 3D7 reference genome (assembly version 3, PlasmoDB version 38: https://plasmodb.org/common/downloads/release-38/Pfalciparum3D7/fasta/data/PlasmoDB-38_Pfalciparum3D7_Genome.fasta) with `bwa mem` v0.7.17-r1188. Duplicates were marked with `samblaster` v0.1.24. We defined a position as “callable” if it was covered by >= 5 high-quality reads (MQ >= 25, BQ >= 25), and computed the proportion of callable sites in each isolate was calculated with the Genome Analysis Toolkit (GATK) `CallableLoci` tool v3.8-0. Only isolates with >= 70% of the genome callable were used for further analysis.
Variant discovery and filtering. Short sequence variants (including SNVs, indels and complex multi-nucleotide variants) were ascertained in parallel in each isolate using GATK `HaplotypeCaller` v.4.0.3.0, then genotyped jointly across the entire cohort with GATK `GenotypeGVCFs` according to GATK best practices. Variant discovery was limited to the core (non-hypervariable) nuclear genome as defined by Miles et al. [22]. Putative SNVs only were filtered using the GATK Variant Quality Score Recalibration (VQSR) method. For training sets, we used: QC-passing sites from the P. falciparum Genetic Crosses Project release 1.0 (ftp://ngs.sanger.ac.uk/production/malaria/pf-crosses/1.0/; [22]) (true positives, prior score Q30); QC-passing sites from the Pf3K release v5.1 (ftp://ngs.sanger.ac.uk/production/pf3k/release_5/5.1/) (true positives + false positives, prior score Q15). We used site annotations QD, MQ, MQRankSum, ReadPosRankSum, FS, SOR and trained the model with 4 Gaussian components. A VQSLOD threshold -0.0350 achieved 90% sensitivity for re-discovering known sites in the training sets. All biallelic SNVs with VQSLOD at or above this threshold were retained.
Isolates may contain multiple strains that are haploid resulting in mixed infections with arbitrary effective ploidy. To account for this complexity of infection (COI) in our analyses, we followed previous authors [23] and calculated the following quantities at each variant site: for each isolate, the within-sample allele frequency (WSAF), the proportion of mapped reads carrying the non-reference allele; the population-level allele frequency (PLAF), the mean of within-sample allele frequencies; and the population-level minor allele frequency (PLMAF), the minimum of PLAF or 1-PLAF. These calculations were performed with `vcfdo wsaf` (https://github.com/IDEELResearch/vcfdo).
Analyses of mutational spectrum. Ancestral versus derived alleles at sites polymorphic in P. falciparum were assigned by comparison to the outgroup species P. reichenowi. Briefly, an approximation to the genome of the P. reichenowi - P. falciparum common ancestor (hereafter, “ancestral genome”) was created by aligning the P. falciparum 3D7 assembly to the P. reichenowi CDC strain assembly (version 3, PlasmoDB version 38: https://plasmodb.org/common/downloads/release-38/PreichenowiCDC/fasta/data/PlasmoDB-38_PreichenowiCDC_Genome.fasta) with `nucmer` v3.1 using parameters “-g 500 -c 500 -l 10” as in [24]. Only segments with one-to-one alignments were retained; ancestral state at sites outside these segments was deemed ambiguous. The one-to-one segments were projected back into the 3D7 coordinate system. Under the assumption of no recurrent mutation, any site polymorphic in P. falciparum is not expected to also be mutated on the branch of the phylogeny leading to P. reichenowi. Thus, the allele observed in P. reichenowi is the ancestral state conditional on the site being polymorphic. Transitions-transversion (Ti:Tv) ratios and mutational spectra were tallied with `bcftools stats` v1.19.
Analyses of ancestry and population structure. VQSR-passing sites were filtered more stringently for PCA to reduce artifacts due to rare alleles and missing data. Genotype calls with GQ < 20 or DP < 5 were masked; sites with < 10% missing data and PLMAF >5% after sample-level filters were retained for PCA, which was performed with `akt pca` v3905c48 [25]. For calculation of f3 statistics, genotype calls with GQ < 10 or DP < 5 were masked; sites with <10% missing data and PLMAF >1% after sample-level filters were retained. Then f3 statistics were calculated from WSAFs rather than nominal diploid genotype calls, using `vcfdo f3stat`.
Estimation of sequence diversity. Estimates of sequence diversity and differentiation were obtained from the site-frequency spectrum (SFS), which in turn was estimated directly from genotype likelihoods with `ANGSD` 0.921-11-g20b0655 [26] using parameters “-doCounts 1 -doSaf 1 -GL 2 -minDepthInd 3 -maxDepthInd 2000 -minMapQ 20 -baq 1 -c 50.” Unfolded SFS were obtained with the `ANGSD` tool `realSFS` using the previously-described ancestral sequence from P. reichenowi. All isolates were treated as nominally diploid for purposes of estimating the SFS because we noted systematic bias against mixed isolates when using `ANGSD` in haploid mode. Four-fold degenerate and zero-fold degenerate sites were defined for protein-coding genes in the usual fashion using transcript models from PlasmoDB v38. SFS for all sites, 4-fold and 0-fold degenerate sites were estimated separately in mainland Tanzania and Zanzibar isolates in non-overlapping 100 kb bins across the core genome. Values of sequence diversity (theta_pi) and Tajima’s D were estimated for these bin-wise SFS using `sfspy summarize` (https://github.com/IDEELResearch/sfspy), and confidence intervals obtained by nonparametric bootstrap. Fst was calculated from the joint SFS between mainland Tanzania and Zanzibar. The distribution of local Fst values was calculated in 5 kb bins for purposes of visualization only.
Strain deconvolution and inheritance-by-descent analyses. Complexity of infection (COI) and strain deconvolution (phasing) were performed jointly using `dEploid` v0.6-beta [14]. For these analyses we limited our attention to 125 isolates from mainland Tanzania and Zanzibar (57 new in this paper and 68 previously published). On the basis of the analyses shown in Figures 1 and 2, these isolates appeared to constitute a reasonably homogeneous population, so we used the set of 125 for determination of PLAFs to be used as priors for the phasing algorithm. Phasing was performed using population allele frequencies as priors in the absence of an external reference panel known to be well-matched for ancestry. We further limited the analysis to very high-confidence sites: VQSLOD > 8, 75% of isolates having GQ ≥ 10 and DP ≥ 5, ≥ 10 bp from the nearest indel (in the raw callset), ≥ 10 total reads supporting the non-reference allele, and PLMAF ≥ 1%. The `dEploid` algorithm was run in “-noPanel” mode with isolate-specific dispersion parameters (“-c”) set to the median coverage in the core genome, and default parameters otherwise. Within-isolate IBD segments were extracted from the `dEploid` HMM decodings by identifying runs of sites with probability ≥ 0.90 assigned to hidden states where at least two of the deconvoluted haplotypes were IBD. The total proportion of strain genomes shared IBD (within-isolate FIBD) for isolates with COI > 1 was obtained directly from `dEploid` log files, and agreed closely with the sum of within-isolate IBD segment lengths.
Between-isolate IBD segments were identified by applying `refinedIBD` v12Jul18 [27] to the phased haplotypes produced by `dEploid`. For a genetic map, we assumed constant recombination rate of 6.44 x 10-5 cM/bp (equal to the total genetic length of the P. falciparum map divided by the physical size of the autosomes in the 3D7 assembly.) Segments >2 cM were retaiend for analysis. The proportion of the genome shared IBD between phased haplotypes (between-isolate FIBD) was estimated by maximum likelihood described in [28] using `vcfdo ibd`.
Demographic inference. Curves of recent historical effective population size were estimated from between-isolate IBD segments with `IBDNe` v07May18-6a4 [29] using length threshold > 3 cM, 20 bootstrap replicates and default parameters otherwise. Local age-adjusted parasite prevalence point estimates (PfPR2-10) and credible intervals were obtained from the Malaria Atlas Project [30] via the R package `malariaAtlas` [31].
More remote population-size histories were estimated with `smc++` v1.15.2 [32]. Phased haplotypes from `dEploid` were randomly combined into diploids and parameters estimated separately for mainland Tanzania and Zanzibar populations using 5-fold cross-validation via command `smc++ cv`, with mutation rate set to 10-9 bp-1 gen-1. Marginal histories from each population were then used to estimate split times using `smc++ split`.
Analyses of natural selection. The distribution of fitness effects (DFE) was estimated within mainland Tanzania and Zanzibar populations with `polyDFE` v2.0 using 4-fold degenerate sites as putatively-neutral and 0-fold degenerate sites as putatively-selected [33]. “Model C” in `polyDFE` parlance -- a mixture of a gamma distribution on selection coefficients of deleterious mutations and an exponential distribution for beneficial mutations -- was chosen because it does not require a priori definition of discrete bins for selection coefficients, and the gamma distribution can accommodate a broad range of shapes for the DFE of deleterious mutations (expected to represent the bulk of polymorphic sites.) Confidence intervals for model parameters were obtained by non-parametric bootstrap via 20 rounds of resampling over the 100 kb blocks of the input SFS. Because `polyDFE` fits nuisance parameters for each bin in the SFS, we found that computation time increased and numerical stability decreased for SFS with larger sample sizes. We therefore smoothed and rescaled input SFS to fixed sample size of 10 chromosomes each using an empirical-Bayes-like method (https://github.com/CartwrightLab/SoFoS/) re-implemented in `sfspy smooth`. Smoothing of input SFS had very modest qualitative effect on the resulting DFE.
The cross-population extended haplotype homozygosity (XP-EHH) statistic was used to identify candidate loci for local adaptation in mainland Tanzania or Zanzibar. Because the statistic requires phased haplotypes and is potentially sensitive to phase-switch errors, only isolates with COI = 1 were used (n = 18 mainland Tanzania, n = 12 Zanzibar.) XP-EHH was calculated from haploid genotypes at a subset of 103,982 biallelic SNVs polymorphic among monoclonal isolates with the `xpehhbin` utility of `hapbin` v1.3.0-12-gdb383ad [34]. Raw values were standardized to have zero mean and unit variance; the resulting z-scores are known to have an approximately normal distribution [35] so nominal p-values were assigned from the standard normal distribution. The Benjamini-Hochberg method was used to adjust nominal p-values for multiple testing.
Pipelines used for WGS read alignment, variant calling, variant filtering, haplotype deconvolution and SFS estimation are available on Github: https://github.com/IDEELResearch/NGS_Align_QC_Pipelines.