POPULATION GENOMICS OF PLASMODIUM FALCIPARUM AND MALARIA CONTROL IMPLICATIONS IN ABIDJAN (COTE D’IVOIRE)

Background: The sudden onset of P. falciparum resistance to antimalarial drugs requires careful surveillance of African parasite population. Genomics tools are implemented to detect evolutionary changes that could impact on malaria control and elimination strategies. Here, we evaluated the genome-wide pattern of selection and sequence variation of P. falciparum populations from Côte d’Ivoire. Methods: The study was carried out in three localities of Abidjan from 2013 to 2014. We collected 70 blood samples following a written consent from patients above two years of age. After extracting P. falciparum DNA from isolates, we performed Whole Genome Sequencing and used population genomics approaches to investigate genetic diversity, complexity of infection and identify loci under positive directional selection. Results: We observed an excess of rare variants in the population showing a clear mutation process in the isolates. Moderate Fst estimates (0.3) was detected for surfin genes expressed at the surface of infected erythrocytes and released merozoites. Seven iHS regions that had at least two SNPs with a score > 3.2 were identified. These regions code for genes that have been under strong directional selection. Two of these genes were the chloroquine resistance transporter ( crt) on chromosome 7 and the dihydropteroate reductase (dhps) on chromosome 8 despite their official proscription for malaria treatment since 2007. There was also a recent selective sweep in the erythrocyte membrane protein ( Pfemp1) . Conclusion: Population genomics revealed selective drug pressure and balancing selection on protective immunity target genes, demonstrating his effectiveness to follow up adaptive evolution of parasite populations and adopt appropriate strategies to control malaria in Côte d’Ivoire.

score > 3.2 were identified. These regions code for genes that have been under strong directional selection. Two of these genes were the chloroquine resistance transporter (crt) on chromosome 7 [1] . In Côte d'Ivoire, malaria is endemic and the entire population is at risk of contracting the disease especially pregnant women and children under the age of five [2] . Malaria incidence has plateaued during the past few years. In addition, the emergence of malaria parasite resistance to artemisinin derivatives highlighted the need to monitor parasite population [3,4] . Investigating the genetic profile of P. falciparum may detect evolutionary changes that can have an impact on malaria control and elimination efforts.
During the past decade, genomics and genetics studies have been conducted to identify P. falciparum markers associated with disease severity, resistance to drugs and escape to the human immune system.
These studies provide valuable informations for malaria control [5][6][7] . Hence, a population genomics study of P. falciparum in a single endemic population of Gambia identified new genes under balancing selection, such as the apical membrane antigen 1 gene (ama1) which encodes a prime vaccine candidate [8] . In addition, a study in Nigeria detected regions that were actively maintained in the gene presumably a signature of adaptation to drug pressure and host immunity. [9] . No evidence or very little subpopulation of the P. falciparum population has been found across West Africa [5] . These findings suggest gene flow between regions, despite differences in transmission seasonality and local vector species abundances [10][11][12] . However, studies have also shown that malaria parasite population structure is increasing in low transmission areas [13,14] . The emergence of multigenic drug resistance has been favored by high rates of inbreeding that can be estimated by within-host diversity [14][15][16] . Thus, malaria transmission intensity and parasite genetic diversity vary greatly across West Africa regions due to variation in rainfall seasonality indicating more highly mixed genotype infections in Guinea than in The Gambia [17] . In Côte d'Ivoire, transmission of malaria in Côte d'Ivoire occurs all year round, with seasonal peaks during the rainy seasons. It is imperative to assess the genetic diversity of individual infections relative to the genetic diversity of the parasite population as a whole. Recent genome-wide scan of P. falciparum revealed loci under selection in known drug targets as crt (chloroquine resistance transporter), dhps (dihydropteroate synthase), and dhfr (dihydrofolate reductase) in two localities of Ghana [12] . Other studies performed in Senegal, The Gambia and Guinea [7,10,18,19] reported evidence in signatures of selection surrounding genes involved with chloroquine (mdr1 and crt) and antifolate (dhfr and dhps) resistance. This suggests that Côte d'Ivoire could report similar patterns since it reflects the same historical drug use with previous described countries. Although population genomics investigated across West Africa have shown strong positive selection around known drug resistance genes, inferences of local mechanisms require looking more distinctively at individual populations such as Côte d'Ivoire.
This study assess the genetic profile of circulating P. falciparum strains and aims to understand the genome-wide patterns of selection in Côte d'Ivoire. Using samples from different localities (Abobo, Koumassi and Yopougon), we assessed whether there is a significant population substructure in Côte d'Ivoire and measure the complexity of infection between three localities of the country. Finally, we identified chromosomes and genes that could be under recent positive selection and balancing selection.

Ethics Statement
An approval to collect and analyze clinical isolates was granted by the National Ethics and Research Committee (CNER-CI) according to protocols and standards operating procedures of Good Clinical Practices of the ICH harmonized Triplicate Guide Lines for Good Clinical Practice made in 1996 and the Helsinki Declaration on human being research. Samples were collected following a written informed consent from patients or their legal guardians for study participants under 18 years old. falciparum DNA was later extracted using the QI Amp blood mini kit (Qiagen, UK) followed by the whole genome sequencing of P. falciparum. Genome Sequencing at the Wellcome Trust Sanger Institute using an Illumina HiSeq platform. Standard laboratory protocols were used to determine DNA quantity and proportion of human DNA in each sample. All the 70 samples were above the "good quality" thresholds and were put forward for whole genome Illumina paired-end sequencing [14,20] . Sequencing reads were mapped to the P. falciparum 3D7 reference genome using bwa mem version 0.7.15 with -M parameter to mark shorter split hits as secondary [21] . Standard alignment metrics were generated for each sample using the stats utility from samtools version 1.2 [22] . We also used GATK's CallableLoci (version 3.5) to determine the proportion of genomic positions callable in each sample [23] . Potential SNPs (Single Nucleotide Polymorphisms) and indels were discovered by running GATK's HaplotypeCaller (version 3.6) independently across each of the 70 sample-level bam files. This resulted in genotype calls for both SNPs and short indels. SNPs and indels were filtered separately. Each variant was assigned a quality score using GATK's Variant Quality Score Recalibration (VQSR) version 3.6. The tools VariantRecalibrator and ApplyRecalibration were used for this purpose. Regions of the genome which we previously identified as being enriched for errors were masked out. VariantsRecalibrator was run using the PASS variants from the P. falciparum crosses 1.0 release as a training set. For SNPs we used 15.0 as a prior for the training set variants.

Study sites and sampling of Plasmodium falciparum from clinical isolates
ApplyRecalibration was then run to assign each variant a quality score named VQSLOD (Variant quality score). Higher values of VQSLOD indicate higher quality. Variants (both SNPs and indels) with a VQSLOD score ≤ 0 were filtered out. Variants were excluded from the analysis if they were positioned within subtelomeric regions, located within the hypervariable Var, Rifin, and Stevor gene families, or were positioned within repetitive sequences as identified by Tandem Repeat Finder. Data were then filtered out to extract 89578 bi-allelic SNPs in the core genome with a VQSLOD score > 6. In addition, we filtered out those SNPs to exclude isolates with missing calls at > 10% of all positions and SNPs with calls missing in >10% of isolates. A total of 89211 SNPs remained after filtering.

Determination of allele frequencies and balancing selection
Analysis of allele frequency distributions, including within population Tajima's D test was performed using Vcftools and custom R scripts to identify genes under balancing selection [24] . For Tajima's D test, we extracted bi-allelic SNPs that were segregating within our population. Missing data were excluded by removal of individual isolates on a gene by gene basis due to the observation that most of missing data clustered within a small number of isolates. The allele frequency spectrum for each gene was assessed with at least 3 SNPs using custom R scripts. Typable SNPs in this study were classified as synonymous or non-synonymous based on amino-acids changes when compared to the 3D7 reference genome sequence. The ancestral state of SNPs has been determined by comparing each sequence to homologous sequences in Plasmodium reichenowi (Pr), a parasite with recent common ancestry.

Identification of intra-population signatures of recent selection sweep
To detect loci under recent positive selection, we computed the standardized integrated Haplotype Score (| iHS |) for each SNP with no missing data and a minimum minor allele frequency of 0.05 using the REHH R software package. iHS has been calculated for each SNP with no missing data and a minor allele frequency of >0.05 [25] . The genetic map distance between markers inferred with LDhat 2.2 [26] was measured using a block penalty of 10 million rjMCMC iterations, and a burn-in of 100 000 iterations. Selection windows were defined by calculating the distance required for the extended haplotype homozygosity of each SNP to decay to a level of 0.05 in each direction. Overlapping EHH windows from individual high-scoring SNPs (|iHS| >3.29, suggestive line) were combined into continuous windows, and windows supported by only a single SNP position were subsequently discarded. Bonferoni correction was applied for genome wide significance. The reference and nonreference allele was described as ancestral and derived alleles, respectively. For significance the REHH package generated a two-sided p-value as -log 10(2Φ (-|iHS|)), Where Ф (iHS) represents the Gaussian cumulative distribution function.

Complexity of Infection with Fws
Infection complexity was determined by Fws fixation index [14,27] using the moimix package in R and computed with R scripts. For all bi-allelic coding SNPs, Fws was calculated using the formula Fws = 1 -(Hw /Hs) where Hw is the within-individual heterozygosity and Hs is the within-population heterozygosity. At each bi-allelic SNPs, heterozygosity was estimated using the formula H= 1-(p2 -q2), where p and q are the frequencies of the two alleles (p=1 -q). At each SNP, p and q were estimated for each individual as the proportions of sequencing reads that carried each allele in the individual sample.
At the population level, the allele frequencies at the SNPs level were estimated as the mean of the allele frequencies in the individual composing the population sample. The minor allele frequency (MAF) was reported as the frequency of the least common allele at that SNP. Within-individual and within-population heterozygosity were computed and assigned to ten equal sized MAF bins [0.0-0.05] ... [0.45-0.5] and for each bin the mean within sample and population heterozygosity was computed.

Population structure, Principal Component Analysis (PCA) and Fst-metric
We conducted a PCA and Fst as implemented in PLINK1.9 and Vcftools, respectively, to assess the structure of study populations. We used Weir and Cockerham's population genetic differentiation estimator Fst between study sites [28] . For PCA, we applied the Linkage Disequilibrium correction to remove the correlated pairs of SNPs and identity by descent (IBD) to identify and remove any closelyrelated samples before computing principal components (PCs). We further calculated the top 10 eigenvectors from the population genotype. R statistical package was used to analyze our data. For Fst analysis missing data for some isolates were excluded on a per SNP basis.

Results
Following a quality control of all the 70 clinical samples, we extracted 89578 bi-allelic SNPs in the core genome with a VQSLOD score greater than 6 SNPs. Missing calls greater than 10% were excluded (367 SNPs) from the analysis leaving 89211 SNPs for all isolates.

Allele frequency distribution and balancing selection
There was an excess of rare variants in the population, with the majority (80%) being single isolates ( Fig.   2A). Coding sequences had higher coverage than intergenic regions, probably due to A + T allelic richness. There was an excess of non-synonymous SNPs compared to the synonymous SNPs, showing a clear mutation process in the isolates (Fig. 2B). indicating an excess of low frequency compared with that expected on a mutation-drift equilibrium population (Fig. 3 and Fig. 4)

Evidence of signature of positive directional selection in Côte d'Ivoire
We conducted a scan of the P. falciparum whole genome by using REHH package in R. An assessment of the standardized integrated Haplotype score (| iHS |) identified a recent positive selection in seven loci that had four or more SNPs with a standardized | iHS | value greater than 3.2 (suggestive line) (Fig. 5A) and four loci that had at least one SNP with an | iHS | greater than 5 (Fig. 5B and Table 2

Assessing the genome wide complexity of infection (Fws)
Within-infection Fws fixation index describes the relationship between the diversity observed within a patient to that of the population using estimates of heterozygosity. It provides a measure of the risk of outcrossing between parasites within an individual to generate new genotypes during recombination in the mosquito host [27] . In our analysis, Fws scores ranged from 0.50 to 0.99 (mean 0.89, median 0.98) with 64% samples presenting high Fws estimates (ie > 0.95; Fig. 6). These high Fws values were an evidence of infections dominated by single predominant genotypes at the time of sampling across our population.

Population Structure and Differentiation
We conducted a principal component analysis using the 89578 SNPs with no missing data in all 70 clinical samples. The first three principal components (10.8% of the total variation) were plotted and showed no evidence of population structure in most samples from the three populations. Only few isolates appeared as outliers that were not very divergent (Fig. 7). We measured Fst to assess the genetic differentiation and evaluate the overall effect of population substructure on parasite populations. There was a minimal differentiation between population highlighted by a very low value for mean Fst estimate (mean 0.001) (Fig. 8). Only 12 SNPs had Fst values equal or greater to 0.2 and the highest Fst value was 0.7 (Table 3).  Table 3). The genome-wide Fst estimate is 0.001. Three of the differentiated SNPs were located on chromosome 7. The SNP at position 435368 span the region coding for P. falciparum esterase gene Pfpare (P. falciparum prodrug activation and resistance esterase) that is responsible for resistance to pepstatin esters [29] . Furthermore, a SNP located on chromosome 8 encodes an amino-acid within the SURFIN8.2 gene (PF3D7_0830800), a polymorphic antigen that is expressed to the surface of P. falciparum infected erythrocyte (IE) and released merozoites [30] . PF3D7_1035200 gene containing two SNPs with highest genome wide Fst values were located within a single region on chromosome 10, encoding a conserved protein with an unknown function.

Discussion
P. falciparum genomic studies are the ideal tools to assess selection processes as well as evolution patterns of parasite populations [7,31] . In this study, we sequenced the whole genome of 70 clinical isolates from Côte d'Ivoire to identify signature of selection and migration.
Our findings suggest a strong and recent positive selection on Pfcrt and Pfdhps, two malaria drug resistance genes. These findings have been corroborated by previous studies conducted in Ghana, Guinea, The Gambia and Senegal [7,12,17] . In contrast to our findings, Konaté et al observed a local decay the chloroquine mutation rate, rejecting the hypothesis of a recent positive pressure [32] . The difference in our findings may be explained by our smaller sample size from a single sentinel site for antimalarial drug efficacy in Côte d'Ivoire which doesn't really reflect the drug pressure on Pfcrt in the country. While other studies [10,19] conducted in West Africa, have found a drug-induced selective pressure on Pfdhfr, Pfmdr1, our study did not corroborate these findings. We hypothesize that the timing of samples collection and malaria incidence may have been responsible for these differences. In fact our study samples were collected from 2013 to 2014 in a urban setting while samples from Senegal were collected from 2002 to 2009 in a peri-urban area [31] . P. falciparum virulence is attributed to the parasite's ability to modify the erythrocyte surface to adhere and invade the host immune system [33] . Therefore, PfEMP1 encoding gene have been very polymorphic. We detected high extended haplotype score around PfEMP1 highlighting the selective pressure on this gene. Evidence of positive directional selection in msp3, glurp, ama1, msp7 and pftrap on chromosomes 10, 11 and 13 respectively, known as non-drug related drivers of directional selection, as well as antigenic loci that leads drug resistance [34,35] . Some of these antigens are malaria vaccine candidates that are usually expressed in merozoites and are thought to be targets of protective immunity and under balancing selection as described in Asian and African population [10,36] .
In malaria-endemic areas, people are usually infected with multiple variants of P. falciparum [13] .
Inbreeding levels determine the rates of an effective recombination and play a central role in understanding the population genomics of the parasite [37] . Within host diversity fixation index (Fws) measure the risk of inbreeding for parasites within an isolate compared to that in the whole population [14,27] . Fws values in our study ranged from 0.5 to 0.99 (mean 0.89) with most of the isolates having Fws estimates greater than 0.95. These results parallel findings from several studies conducted in West Africa [7,12] . The results showed high inbreeding levels of parasite within the host with no threat to malaria control efforts. However, a follow-up study conducted in areas bordering Côte d'Ivoire (Burkina and Mali) with a bigger sample size shown low Fws values [27] . Further exploration of the complexity of infection is required through different epidemiological settings to enable more effective interpretation of the relative out-crossing risks associated with different Fws scores.
To assess gene flow between the three study sites, we conducted a Principal Component Analysis of 70 clinical samples did not show any clustering between populations, as reported recently [5] . We noticed only few outliers that were not very divergent. Indeed, the three areas are located in the same district.
Computation of Weir and Cockerham's Fst to measure allele frequency differentiation between the three sites with a stringent genome wide cut-off (Fst ≥ 0.2) identified 12 highly differentiated SNPs between populations. The mean Fst was very low in our population. Similar findings were made in studies conducted in West Africa [7,11,12] , suggesting high levels of gene. Located in chromosome 7, one of these SNPs is highly differentiated around regions coding for the parasite esterase gene Pfpare, conferring resistance to pepstatin ester that is a potent peptydil inhibitor of various malarial aspartic proteases and also has parasiticidal activity [29] . Although this compound is not used in our region, this finding could be physiologically relevant because many children in malaria endemic regions are malnourished and contain low, even undetectable, levels of plasma amino acids [38] . The family of surfin genes also ranked high Fst estimate. This protein, which is expressed at the surface of the IE and the released merozoite throughout blood stage development may be crucial for parasite survival [30] .
Further investigation on this protein might be needed since it is a threat to malaria intervention efforts.
To find genes under balancing selection, Tajima's D statistic test was computed. We identified genes with at least three SNPs under balancing selection. Most genes had negative Tajima's D values (mean -0,15), in agreement with observation in Senegal, Nigeria, Guinea and the Gambia [7,9,31] . This may be explained by the historical bottleneck discovered before in Africa [39] . Moreover, balancing selection over genes with Tajima's D values greater than 1 were evident in a subset of genes encoding known antigens (Pfemp1, glurp) and targets of immune selection (msp7-like, surfin) consistent with evident decrease of population size shown in Kenya and other geographical locations in Africa [40,41] . However, it is notable to consider that not all genes with positive values are effectively under balancing selection due to the evolution of our population after a recent bottleneck.

Conclusion
This study reports the first WGS of P falciparum strains in Côte d'Ivoire to assess signature of selection and gene flow between study areas. We identified regions of the genome under selective pressure in drugs and vaccine encoding genes. We have also shown the lack of structure of the parasite populations. A detailed understanding of P. falciparum genomics could facilitate malaria control. Identification of balancing selection also reinforced the theory of population expansion in Africa. It is important to notify that further investigation should be made in a larger sample size of the country, taking account temporal and spatial factors to monitor the evolution of gene flow and genetic diversity of the parasite in Côte d'Ivoire.