Transcript levels are the integrated outcome of multiple distinct processes including transcription, splicing, nuclear export and stability, each of which is influenced by a variety of regulatory mechanisms. Our aim is to assess the relative contribution of RNAPII recruitment and activity to mRNA output during RBC infection. To accomplish this goal, we evaluated the occupancy of RNAPII and three CTD phosphoisoforms across the P. falciparum genome at multiple developmental stages of the IDC. In a previous study, we used ChIP and DNA microarray technology (ChIP on chip) to show that there are two temporal domains of RNAPII occupancy, with about half of protein-coding genes (~1800) associating with RNAPII early in the IDC and remaining genes (~1700) associating late [21]. To assess RNAPII occupancy across the entire transcription unit, we now use the same custom-generated RNAPII antibodies to perform ChIP at three time points across the IDC followed by next-generation sequencing (ChIP-seq). These mouse monoclonal antibodies are specific to the unphosphorylated CTD (anti-CTD) associated with the pre-initiating polymerase; the heptad repeat mono-phosphorylated at serine 5 (anti-Ser5-P) associated with the initiating polymerase; and the heptad repeat double phosphorylated at serine positions 2 and 5 (anti-Ser2/5-P) associated with the processive, elongating form of the polymerase [21]. In addition, we employed the commercially available N20 antibody against a universally conserved epitope at the N-terminus of RNAPII which is not known to undergo post-translational modification; N20 should therefore recognize all forms of the polymerase regardless of the phosphorylation state of the CTD [22, 23].
These antibodies were used in ChIP-seq to characterize in detail the genome-wide occupancy of RNAPII isoforms at three stages of the IDC: the ring stage at 12-16 hours post invasion (hpi), the trophozoite stage (24-28 hpi) and schizont stage (36-40 hpi). Occupancy by RNAPII as revealed by the four antibodies was visualized in two ways: as enrichment relative to input along the length of the transcription unit (determined using the bamCompare algorithm and shown as profile plots and heatmaps), and as discrete peaks of RNAPII occupancy called by the MACS2 algorithm [24, 25]. To enable correlation of RNAPII occupancy with transcript abundance, we also performed RNA-seq on the identical parasite culture.
The distribution of RNAPII across the transcription unit is stage-specific
The genome-wide read densities for RNA-seq, ChIP-seq, and input chromatin were plotted and visualized in a genome browser format [26]. Such a representation allows ready visualization of read density and is used here to illustrate some key points regarding the results elaborated later in this report. The genome browser view of three contiguous genes in a ~27 kb region of chromosome 14 is shown in figure 1. The far leftward gene, PF3D7_1410200, encodes CTP synthetase and is known to be expressed early in the IDC, consistent with the need to boost CTP pools for the biosynthetic needs of later stages. The far rightward gene, PF3D7_1410400, encodes rhoptry associated protein 1, a so-called “invasion” gene required for the infectivity of newly released merozoites. Not surprisingly, transcripts for this gene are known to accumulate late in the IDC. The middle gene, PF3D7_1410300, encodes a putative WD-repeat-containing protein, and has been observed to maintain low and relatively consistent transcript levels across the IDC. The expected stage-associated transcript levels for each of these genes are borne out by our own RNA-seq results as plotted immediately below the positions of the gene bodies (Fig 1).
For simplicity, only the ChIP-seq results and input controls for two informative antibodies have been shown – those corresponding to the N20 antibody which recognizes all forms of RNAPII (total RNAPII) and the antibody recognizing the doubly phosphorylated heptad repeat (RNAPII-Ser2/5-P) associated with the actively elongating form of the polymerase. First, RNAPII, regardless of the isoform detected, is concentrated within the gene body, i.e. between the start and stop codons of the open reading frame. Second, we find that the distribution of total RNAPII (N20, upper set of green tracks) over the three genes is highly similar, regardless of the amount of transcript produced for a given gene or the stage at which it is assayed. By contrast, the far leftward gene (PF3D7_1410200; CTP synthetase) is associated with increased transcript levels at the ring and trophozoite stages and increased reads corresponding to the elongating form of RNAPII (RNAPII-Ser2/5-P; upper set of purple tracks). Conversely, the far rightward gene (PF3D7_1410400; RAP1) is associated with increased transcripts at the schizont stage and increased reads for RNAPII-Ser2/5-P at that same point in the IDC. (The plot for RNAPII-Ser2/5-P is truncated along the y-axis whose maximum was set to allow clear visualization of the weaker signals present over the other two genes.) Last, the middle gene (PF3D7_1410300; putative WD-repeat protein) yields almost no transcripts at any stage by RNA-seq, and displays no stage-specific change in the level of ChIP-seq reads obtained with either antibody. Thus, for the far leftward and rightward genes, transcript abundance is stage-specifically correlated with variation in the level of RNAPII-Ser2/5-P signal, but not that of total RNAPII. Despite significant occupancy by the elongating form of RNAPII, little to no transcript is detected for the middle gene, PF3D7_1410300, suggestive of post-transcriptional control.
Initial observations such as those presented in figure 1 would be consistent with a role for RNAPII activity (Ser2/5-P), as opposed to RNAPII recruitment (N20), in regulating the levels of transcript output. We therefore undertook an assessment of the correlation between the levels of transcript output and occupancy by each of the RNAPII isoforms (CTD, Ser5-P, Ser2/5-P) along with total RNAPII (N20).
ChIP-seq data was used to plot the occupancy of the four RNAPII isoforms across the transcription units of protein-coding nuclear genes at all three stages of the IDC. For each RNAPII isoform at each stage of the IDC, RNAPII enrichment was determined at each of 3,000 bins spanning the 5’ flanking region, coding region and 3’ flanking region. Genes showing RNAPII enrichment of at least 2-fold (before log2 transformation) in at least one bin defined gene sets specific for an RNAPII isoform and stage of the IDC. The averaged plots of RNAPII occupancy for each gene set were then determined (Fig 2A). Further details regarding filters and gene selection are provided in the figure legends, Materials and Methods and Additional File 1.
The distribution of total RNAPII as revealed by the N20 antibody provides the level of RNAPII recruitment at a given locus against which changes in RNAPII phosphorylation state can be compared. Total RNAPII shows two distinct profiles (Fig 2A). The first profile, exhibited in rings and trophozoites, is characterized by elevated levels of RNAPII over the gene body (GB) spanning the start codon (ATG) to the stop codon (STOP). The second profile, found in schizonts, is distinct and shows a relatively uniform distribution along the entire length of the transcription unit from 1 kb upstream of the start codon to 1kb downstream of the stop codon, with a moderate decrease in overall abundance from 5’ to 3’. The observation of these distinctly different profiles for total RNAPII is consistent with the conclusions of our earlier study suggesting that different transcriptional mechanisms govern gene expression at early vs late times of the IDC [21].
Deviations from the total RNAPII (N20) pattern by the three RNAPII isoforms under study reveal the distribution of CTD phosphorylation (or lack thereof) and thereby the activity of RNAPII along the transcription unit. In rings, the unphosphorylated (CTD) and monophosphorylated (Ser5-P) isoforms display rather flat profiles, indicating that these isoforms are more abundant in the 5’ and 3’ flanking regions than in the gene body. This would be reasonable if a significant fraction of heptad repeats were doubly phosphorylated while the polymerase is traversing the gene body, consistent with the association of the Ser2/5-P modification with the processive elongating form of the polymerase. Supporting this interpretation, the distribution of the Ser2/5-P modification is exaggerated by comparison to that of total RNAPII, with highly elevated levels over the gene body and depleted levels in the 5’ flank and, to a lesser extent, the 3’ flank.
The distribution of the CTD and Ser5-P isoforms in schizonts is distinct from those observed in rings and trophozoites, as was observed for total RNAPII (N20). Levels of the unphosphorylated (CTD) and monophosphorylated (Ser5-P) forms are noticeably depleted in the gene body relative to the 5’ and 3’ flanks. By contrast, the Ser2/5-P isoform is unique at this stage in its persistent accumulation over the gene body relative to the flanks. The simplest interpretation of this data is that the depressed level of N20 in the gene body is reflected in lower levels of RNAPII phosphoisoforms along the gene body as well, and that transcriptional output is driven by RNAPII activity rather than recruitment.
In conclusion, a global view of expressed protein-coding genes reveals exaggerated levels of the Ser2/5-P modification over the gene body relative to total RNAPII (N20) in rings and schizonts, though not in trophozoites. Fluctuations in the levels of Ser2/5-P modification without concomitant changes in the level of total RNAPII would be consistent with a model in which transcriptional activity is modulated via the control of RNAPII enzymatic activity rather than the levels of RNAPII recruitment.
Transcript output at a given gene correlates with the abundance of the elongating form of RNAPII but not total polymerase
To refine the association between RNAPII distribution and transcript output, we defined a set of 4,990 nuclear, non-antigenic protein-coding genes and subdivided them into three classes based on their relative transcript levels. For each stage of the IDC, genes were ranked in descending order of transcript abundance and partitioned into three classes: High, Medium, and Low, with the number of genes in each class being 1,663 or 1,664. We then plotted the enrichment of individual RNAPII isoforms and total RNAPII for each of the three gene classes in each of the three developmental stages under study (Fig. 2B).
The overall distribution of all forms of RNAPII is highly similar to that described above, but now reveals a striking association with gene expression levels at the ring stage. Those genes with the highest transcript levels exhibit up to 2-fold enrichment of RNAPII occupancy in the ORF compared to genes with medium and low levels of expression. Strikingly, the positive association between Ser2/5-P and transcription is exclusive to the gene body, while occupancy in the 5’ and 3’ flanking regions follows a reverse trend. We also note the gradually increasing Ser2/5-P signal toward the 3’ end of the ORF, possibly associated with the terminal stages of elongation and the recruitment of 3’ end processing factors [2] (Fig 2B).
In schizonts, there is only a modest increase in Ser2/5-P signal associated with more highly expressed genes; however, the Ser2/5-P signal in both rings and schizonts remains elevated relative to that of total RNAPII (N20). Once again, these observations suggest that P falciparum does not achieve higher transcript output by enhanced recruitment of RNAPII to the promoter, but rather via control of RNAPII activity. They are also consistent with distinct transcriptional mechanisms operating at early vs late times of the IDC.
The above results reveal a positive correlation between Ser2/5-P occupancy and transcript abundance. To more rigorously assess the association between RNAPII occupancy and transcript output, we used the MACS2 algorithm [24, 25] to identify localized peaks of RNAPII enrichment. MACS2 derives statistically significant peaks of ChIP-seq enrichment occurring over defined windows measured in base pairs, and therefore provides a complementary method for the identification of genes displaying enriched RNAPII occupancy.
We used MACS2 to classify genes according to their association with one or more peaks of RNAPII binding. Transcript abundance for each of these classes was then compared for each of the three stages of the IDC (Fig 3A). Strikingly, enhanced transcript abundance is consistently associated with the presence of peaks of Ser2/5-P occupancy, but not unphosphorylated (CTD) or Ser5-P isoforms. (The statistical significance between the expression levels of different groups are provided in Table S1.) These results strongly support a role for increased transcription, necessarily mediated by the Ser2/5-P isoform, in achieving elevated mRNA levels at all stages of the IDC.
While these results reveal that higher levels of Ser2/5-P are associated with increased mRNA output, they do not implicate increased recruitment of RNAPII in achieving this outcome. Indeed, the findings of figures 1 and 2 suggest that overall recruitment of RNAPII, as revealed by the N20 antibody, does not vary markedly despite large changes in transcriptional output. We therefore used peak calling to compare directly the correlation between mRNA levels and total RNAPII vs. Ser2/5-P abundance across the IDC (Fig 3B).
For each stage, we defined four gene sets: those genes associated only with peaks of the Ser2/5-P isoform; those associated only with peaks of total RNAPII (N20); those associated with both; and those associated with neither. For each of these four classes, transcript abundance was compared and tested for statistical significance (Fig 3B; Table S2). The results clearly show that increased transcript output is strongly correlated with enrichment of the elongating Ser2/5-P isoform but not with enrichment of total RNAPII (N20). We conclude that P falciparum does not achieve higher mRNA output by increased recruitment of RNAPII, but rather through the conversion of RNAPII to its active elongating form. RNAPII distribution was independently confirmed in rings and schizonts by ChIP-qPCR using the same immunoprecipitated chromatin samples used for ChIP-seq (Fig S1).
A recent study reported on genome-wide measurements on transcription and mRNA decay rates across the IDC in the 3D7 parasite strain [27]. We compared our current data identifying genes having stage-specific peaks of Ser2/5-P occupancy in the T996 parasite strain with the data set from this published study (Table S3). Interestingly, genes having ring-stage-specific peaks of the Ser2/5-P isoform show highly statistically significant overlap with those genes that are maximally transcribed in early and late rings. Reciprocally, genes associated with schizont-stage peaks of Ser2/5-P display highly statistically significant overlap with those genes that are maximally transcribed in this same stage. By contrast, genes having peaks of total RNAPII (N20, but not Ser2/5-P) or no peaks of either form do not display a stage-specific positive correlation with transcription rate. The overall congruity of these results strongly supports the association of the Ser2/5-P isoform with elevated transcriptional output.
The accumulation of the elongating form RNAPII in the gene body and 3’ flanking region correlates with transcript output
The results above demonstrate a positive correlation between the Ser2/5-P signal and transcript output. The Ser2/5-P isoform is routinely detected at lower levels in the flanking regions and is highest toward the 3’ end of the gene body. To test for an association between the location of Ser2/5-P signal along the transcription unit and the level of transcript output, we correlated the position of peaks of Ser2/5 abundance called by the MACS2 algorithm with our RNA-seq data at each stage of the IDC (Fig 4). Genes first filtered by peak calling as for Fig 3 were then sorted based on Ser2/5-P peak position within the transcription unit followed by plotting of mRNA expression levels (RNA-seq). A large fraction of genes does not display RNAPII peaks as defined by the MACS2 algorithm but are shown at the far right of each panel for comparison. In addition, those genes having peaks exclusively in the 5’ plus 3’ flanks, or in all three regions at once, are poorly represented and likely to suffer from the effects of small sample size but are included for completeness.
At all stages, genes having Ser2/5-P peaks in the GB in conjunction with the absence of peaks in the 5' flank show the highest transcript output. In schizonts, concurrent peaks in the GB and 3’ flank are also correlated with high mRNA levels. By contrast, the presence of peaks of Ser2/5-P signal in the 5’ flank is consistently associated with lower transcript output. These results are highly statistically significant (Table S4). We conclude that peaks of Ser2/5-P signal in the gene body and 3’ flank region are correlated with elevated mRNA levels.
ChIP-seq reads obtained with the antibody to the Ser2/5-P isoform of RNAPII are most abundant toward the 3’ end of the coding region, and peaks of the Ser2/5-P signal in the gene body and 3’ flank are associated with higher transcript output. By contrast, ChIP-seq reads from the N20 isoform are only modestly elevated at the 3’ end, revealing that the increased reads attributable to the Ser2/5-P isoform are not simply due to increased RNAPII abundance. To investigate the relevance of this observation, we assessed the correlation between the strength of the Ser2/5-P signal at the gene 3’ end vs. transcript output.
We defined a 3’ region extending from 1 kb upstream to 1 kb downstream of the translational stop codon (STOP) and used K-means clustering to identify four patterns in the accumulation of the Ser2/5-P isoform at each of the three stages of the IDC (Fig 5A). Cluster 1 genes show the broadest and generally highest levels of Ser2/5-P across the 3’ region spanning the stop codon. Further, for both rings (481 genes) and schizonts (712 genes), genes in cluster 1 showed 3’ enrichment of Ser2/5-P at two distinct locations – one just before the STOP codon and another within 1 kb downstream. This second site of Ser2/5-P accumulation seems to overlap with the average distribution of the polyadenylation site (PAS) in P. falciparum, since most PAS fall within 1 kb downstream of the stop codon yielding a 3’ UTR with an average length of 523 nt [7]. Increased ChIP-seq signal at the 3’ end of Cluster 1 genes was confirmed independently by ChIP-qPCR using the same immunoprecipitated chromatin as used for ChIP-seq (Fig S2).
To assess the correlation between the distribution of the Ser2/5-P isoform and transcript output, we plotted the expression level of the mRNA derived from genes in each of the 4 clusters. At all three stages and especially in rings, genes in Cluster 1 show transcript levels that are significantly higher than for the other clusters (Fig 5B, Table S5). In rings, Cluster 2 genes are also expressed at significantly higher levels than genes in Clusters 3 and 4. These results extend and support the positive correlation between Ser2/5-P occupancy and transcriptional output revealed in the analyses described above.
Cluster 1 genes at each stage are defined by the broad accumulation of Ser2/5-P across the gene 3’ end. We used expression heat maps to address how the expression of Cluster 1 genes defined for a given stage varied across the IDC. The results show that for Cluster 1 genes showing maximal Ser2/5-P signal in rings, a large majority shows maximum expression in that same stage (Fig S3). Likewise, a majority of Cluster 1 genes having maximal ChIP-seq reads in schizonts have their most abundant expression at this same stage. By contrast, trophozoite-stage Cluster 1 genes are maximally expressed either in rings or schizonts, with a smaller fraction showing highest expression in trophozoites. Functional annotation of Cluster 1 genes shows that most of them are associated with stage-specific pathways or processes.
While the Ser2/5-P isoform accumulates preferentially over the gene 3’ end in schizonts, a different pattern is observed for two other isoforms. In schizonts, the CTD and Ser5-P isoforms of RNAPII are more abundant at the 5’ ends of genes relative to total RNAPII (Fig 2). To assess whether this profile is associated with transcript output, we used k-means clustering to define four distinct distributions of CTD and Ser5-P signals in a region spanning 1 kb up- and downstream of the ATG in schizonts (Fig S4A). Cluster 1 genes had the highest accumulation of each isoform in the 1 kb region upstream of the ATG that would be expected to include the presumptive promoter for most genes [6]. The genes of other clusters had lower levels of RNAPII occupancy and in more restricted regions located either upstream (Clusters 3 and 4) or downstream (Cluster 2) of the ATG. There is significant overlap between the four gene groups showing CTD (orange circle) and Ser5-P (blue circle) occupancy around the ATG (Fig. S4B).
Clusters 1, 3 and 4 show higher CTD and Ser5-P signal in the presumptive promoter region upstream of the ATG in schizonts. We assessed the mRNA levels for these three clusters in rings and schizonts. Interestingly, transcript output for the three gene sets is significantly higher than for genes of cluster 2 and is most evident in rings (Fig S4C, Table S6), suggesting that CTD and Ser5-P signal in schizonts anticipates higher expression in rings of the subsequent cycle.
A heat map generated to view the expression profile through the IDC of the genes in Cluster 1 shows that these genes have higher expression levels in rings (Fig S4D). Functional enrichment analysis of genes in Cluster 1 based on the Gene Ontology (GO) terms showed that the genes in this cluster are associated with transcriptional, translational and metabolic processes (Fig S4D). Taken together these data show that at the schizont stage the CTD and Ser5-P RNAPII phosphoisoforms are over-represented at the presumptive promoter regions of specific gene groups, possibly in preparation for their expression in the next round of infection.