The presence of DNA methylation in the T. solium genome
The methylation status of DNA is related to three types of enzymes, including DNA methyltransferases, which affect maintenance methylation and de novo methylation. To understand whether T. solium possesses the ability to methylate DNA, we first conducted a reciprocal Blast alignment to identify genes that might be homologous to known DNA (cytosine-5)-methyltransferases. As a result, two genes (Scaffold00200.gene8095 and LongOrf.asmbl_16366) were identified that are homologous to DNMT3B and DNMT2, respectively, with high sequence similarity (e-value < 1e-10). Scaffold00067.gene4890 was aligned (e-value < 1e-5) with either DNMT3A or DNMT3B from multiple species. In addition, more than one gene was matched with DNMT1, among which Scaffold00068.gene4920 had the best hit (e-value < 1e-10) (Table S1). Phylogenetic analyses by MEGA7 also supported these results (Figure S1). Moreover, we searched for genes homologous to methyl-CpG binding domain protein (MBD). Two candidate genes (LongOrf.asmbl_5021 and LongOrf.asmbl_14047) were homologous to MBDs in multiple species, including Echinococcus granulosus, which is closely related to T. solium evolutionarily (Table S1 and Figure S2). A full repertoire of functionally conserved amino acid residues was identified for both the potential DNMT2 and DNMT3 and the MBDs of T. solium, indicating that these proteins are functionally active (Table S2). However, a high level of divergence between T. solium and other species was observed for DNMT1 homologs (Table S2), which was in agreement with previous studies.
Given these results, we assessed the genome-wide DNA methylation profiles in T. solium using MethylC-Seq. There were 54.21 million raw reads generated (Table S3). BSMAP (Xi & Li, 2009) was used to align the sequenced reads to the T. solium reference sequence, reaching an approximately 76.41% mapping rate. The average read depth was 11.32 per strand, while on average, over 50 Mb (90.52%) of each strand of the T. solium reference sequence was covered. Because of the potential occurrence of non-conversion and thymidine-cytosine sequencing errors, the false-positive rate was estimated by calculating the methylation level of lambda DNA, which is normally unmethylated (Materials and methods). We then applied the error rate (0.0041) to correct methylated cytosine sites (mC) identification according to the method described by Lister et al. (Lister & Ecker, 2009), which is based on a binomial test and false discovery rate constraints. As a result, approximately 76.6 thousand mCs were estimated in the T. solium genome (accounting for 0.20% of the total cytosines sequenced with depth ≥5X). Both symmetrical CpG methylation and asymmetrical non-CpG methylation were revealed.
Characterization of overall methylation patterns
We further characterized the global patterns of DNA methylation in the genomes of T. solium. First, we showed the percentage of methylated cytosine of each sequence context. Among the 76.6 thousand mCs across the entire genome, a majority (69.5%) were in the context of CHH. In contrast, only 15.38% and 15.12% of the mCs were located in the contexts of CHG and CpG, respectively (Figure S3 A). Furthermore, most of the CpG and non-CpGs displayed a low methylation fraction (<30%) (Figure S3B and C). These patterns are highly different from mammalian methylomes, in which most 5mCs are located in CpG contexts and the majority of the CpGs are highly methylated (>50%) (Feng et al., 2010). Since most (69.5%) of the mCs in the T. solium genome were in the CHH context, we further analyzed the sequence context of mCHHs across the entire genome to further examine whether there is any sequence bias in the enrichment of cytosine methylation in the CHH context. As a result, mCpA was shown to be preferentially enriched within the methylated CHH dinucleotide (Figure 1A). There were more than 21,000 methylated CpAs in each strand, meaning that 55.73% of total CpAs were methylated in the entire genome (Figure 1B, D). This result was consistent with reports that mCpA was predominantly found in another tapeworm, S. mansoni (Weber et al., 2007; Zhang, 2008). With regard to methylation levels, we did not observe significant differences among different sequence contexts for mCs (Figure 1C).
We next examined whether there was any preference for the distance between adjacent sites of DNA methylation in the T. solium genome. The relative distance between mCs in each context within 50 nucleotides in introns was then analyzed because of the steady methylation without any selective pressure by protein coding genes in intron regions. Similar to the periodicity of 8–10 bases revealed in previous studies on the Arabidopsis and human genomes (Cokus et al., 2008), we also observed a strong tendency of peaked enrichment of mCpA sites, which might be explained by a single turn of the DNA helix (Figure 1E). Moreover, we found that mCpT revealed a similar periodicity of 8-12 bases (Figure 1F), though the numbers of cytosines in the context of CpG and CpC were too few to yield reliable results (Figure S3D and E). In summary, our results indicated that the molecular mechanisms governing de novo methylation at CpA sites may be similar among the cysticercus and the plant and animal kingdoms.
We then examined the distribution of methylation levels for the four categories of methylated cytosines across the entire genome. In general, similar mosaic distribution patterns were observed for methylation levels of all types of mCs, that is, relatively highly methylated domains were interspersed within regions with low methylation (Figure S4A). Furthermore, the distribution of mCs across the genome was also uneven; dense mCs of specific categories were occasionally enriched in specific scaffolds (Figure S4B). Such a pattern has been observed in previous studies on other invertebrates. We also examined the patterns of methylation in annotated elements, including genes, tandem repeats, and transposable elements. The methylation percentage of each cytosine context in exons was higher than that in other annotated elements, especially CpAs, which accounted for a more than 2-fold greater percentage than the other contexts in exons (Figure 2A).
We then examined the average methylation level in each element, which showed that average CpG methylation levels were higher than those other types of methylated cytosines, similar to mammalian genomes. However, the genome-wide pattern was again divergent from mammalian genomes, as higher average methylation in exons and lower methylation in introns of CpG sites were observed (Figure 2B, C and Figure S5). The trend for the average methylation of CpC and CpT was similar to that of CpG. However, a uniform distribution of CpA methylation levels in each annotated element was displayed (Figure 2C and Figure S5). We also analyzed the methylation level of each cytosine context in repeat regions (Figure 2D, E). Previous studies have indicated that transposable elements are usually unmethylated in the honey bee Apis mellifera and silkworm Bombyx mori (Lyko et al., 2010; Xiang et al., 2010). In cysticercus, we observed a similar phenomenon as the above species except that relatively highly methylated rRNAs were observed in T. solium. Notably, CpAs were methylated at a higher level or frequency than other types in LINE/L1 (Figure 2D, E).
The relationship between methylation and gene expression
It was reported that DNA methylation plays an important role in regulating gene expression. We evaluated gene expression in T. solium using Illumina high-throughput RNA-seq technology. Most of the raw reads could be uniquely mapped to previously annotated genes (88.17%). A total of 9,718 annotated genes out of 11,903 could be aligned with at least one unique read. To characterize the relationship between DNA methylation and gene expression, we divided the expressed genes with at least one read into quartiles of expression levels. We then examined the distribution of methylation levels for different quartiles of expressed genes and genes exhibiting no expression. High CpG and CpC methylation levels were observed in upstream and exon regions of genes with the lowest expression. Moreover, a negative correlation could also be observed between CpA and CpT methylation levels of upstream and exons and expression levels of these expressed genes. However, for silent genes, mainly high CpG and CpC methylation levels of downstream regions were observed (Figure 3). Taken together, methylation levels of mCs from both CpG or non-CpG sequence contexts were correlated with gene expression levels, though different regulation mechanisms might be involved.
Next, to infer whether methylated genes were enriched for specific molecular functions, we filtered out a total of 1,647 of the genes with the lowest expression and 1,354 of the most highly expressed genes, based on the criteria that at least one mC was present within their genic regions. Then, we applied the WEGO (Web Gene Ontology Annotation Plotting) tool (Ye et al., 2006) to functionally categorize the gene ontology (GO) terms of these genes. We found that these two sets of genes displayed similar patterns of GO enrichment, specifically, “cell” and “cell part” in Cellular Component, “binding” and “catalytic” Molecular Functions, and “cellular process” and “metabolic process” in Biological Process were relatively enriched. This result suggested that the genes heavily regulated by DNA methylation were more prone to signaling regulation or interaction with environmental factors, e.g., diet or metabolism (Figure S6). In summary, these results suggested the potential for the regulation of T. solium genes by DNA methylation, especially those that function as regulators of cell-cell or cell-environmental communication. Furthermore, different molecular mechanisms might be involved depending on different mC contexts and genes.
Regulation of DNA methylation on key parasitism genes of T. solium
To obtain further insight into the epigenetic regulation of parasite development, survival and parasite-host interactions of T. solium, we next studied conserved genes across tapeworm-species and genes encoding excretion–secretion proteins (ESPs) in T. solium. For conserved genes, we applied a gene set that was reported previously in a study by Bjorn Victor et al., in which 261 genes conserved between Taenia and Echinococcus tapeworms were obtained by comparing the transcriptomes of five important intestinal parasites, including T. multiceps, T. solium, E. granulosus, E. multilocularis and T. pisiformis (Robert McMaster et al., 2016). Based on their results, we further retrieved 216 genes with the best blastx hit for each contig (e<1e-10) and studied their DNA methylation status. A total of 190 of these genes contained at least one mC across their genic regions. As indicated in Figure 4C, CpG and CpC methylation levels in upstream and exon regions were higher than other types of methylation and in other genic regions. A further examination of the 190 genes revealed that 71 genes contained CpG or CpC methylation within their upstream or exon regions. Therefore, we searched for extensively methylated genes based on the criterion that the CpG and CpC methylation levels of the examined gene were significantly higher than the average value of the 71 genes. As a result, we revealed 14 conserved genes that were extensively methylated on CpG/CpC sites within their upstream regions and exons (p<0.05). Compared with those 26 genes without mC, we found these 14 genes were expressed at a significantly lower level (Figure 4A), suggesting DNA methylation is a key mechanism for the transcriptional regulation of these conserved genes. For ESPs, we also applied a dataset containing 76 ESPs for T. solium, which was identified by Bjorn Victor et al. using a proteomics strategy (Victor et al., 2012). We applied the BlastP algorithm to align these ESPs back to the genome and revealed 111 gene sequences that might encode these ESPs (Table S4). Using the same criterion for conserved genes, we found 13 extensively methylated genes. Similarly, gene expression comparisons again revealed that these 13 genes were expressed at significantly lower levels than the 26 non-methylated genes (Figure 4A). Using a similar strategy, we also looked into genes containing methylated CpAs and CpTs within their upstream or exon regions. However, no clear difference in gene expression levels was observed (Figure 4B). These results indicated that CpG/CpC methylation in upstream regions and exons played a major role in T. solium gene repression. Furthermore, we found a different distribution pattern between mCpG and mCpC for these repressed genes, in which mCpG were mostly distributed in upstream regions, while mCpC were more often in exons (Figure S5). Based on the above analyses, we revealed 27 key genes that might be repressed by DNA methylation mechanisms. Interestingly, a protein-protein interaction analysis using the STRING online tool (Fonseca et al., 2006) indicated strong mutual interactions among these conserved proteins and ESPs (Figure 5) based on annotation of the model organism Caenorhabditis elegans.