In silico analysis of FspEI sites in genome
Restriction-site Associated DNA Sequence (RAD-Seq) is a cost-efficient genomic profiling method in genetics. MethylRAD used methylation sensitive FspEI enzyme to perform reduced methylome sequencing at whole genome level. FspEI can recognize both 5-methylcytosine (5-mC) and 5-hydroxymethylcytosine (5-hmC) in the CmC and mCDS sites[22]. And it cuts DNA double-strand on the 3’ side of the methylated cytosine at a fixed distance (N12/N16), which generate 32-base-long fragment if the target site is symmetrically methylated[21].
5'…Cm C (N)12▼…3'
3'…G G (N)16▲…5'
There are 81,341,472 potential FspEI sites in the genome with a density of about 13 bp, of which 1,980,934 (2.44%) were two primary target sites (CCGG and CCWGG, W = T or A) that can produce 32-bp fragments if symmetrically methylated. There were 834,392 CCGG sites and 1,146,542 CCWGG sites with genome-wide densities of 1,263 and 919 bp, respectively.
Sequencing of MethylRAD libraries
In this study, MethylRAD libraries were constructed for 5 females and 5 males Ancherythroculter nigrocauda samples with DNA extracted from liver tissues. Each library was produced over 32 million reads (Table 1), of which about 97% were kept as high quality (HQ) reads for further analyses. In HQ reads, about 73% of them were generated by CmC digestion site and the rest were generated by mCDS site. Reads come from symmetric methylated sites accounted about 36% of all HQ reads, while the rest reads only have one digestion site near their end (Fig. 1). For all HQ reads, about 91% of them can be mapped to Ancherythroculter nigrocauda genome with ~ 35% of them were uniquely mapped, and this ratio is comparable to that of WGBS study in Arabidopsis thaliana [28]. Totally, 6,839,406 methylation sites were identified here.
Table 1
Statistics for different types of FspEI digestion sites in all samples.
Samples | CCWGG | CDS_sym | Front_CDS | End_CDS | Front_CGG | End_CCG | Waste | Total Filtered | Raw reads | Filter ratio |
s2 | 13,670,493 | 5,170,216 | 2,658,436 | 1,821,659 | 14,518,613 | 10,007,711 | 3,865,462 | 51,712,590 | 53,243,673 | 97.12 |
s5 | 14,263,323 | 4,991,183 | 2,689,945 | 1,821,420 | 14,835,323 | 9,986,276 | 3,911,991 | 52,499,461 | 53,720,008 | 97.73 |
s6 | 8,174,498 | 2,735,554 | 1,601,958 | 1,085,417 | 9,173,532 | 6,033,627 | 2,324,600 | 31,129,186 | 32,010,326 | 97.25 |
s7 | 8,951,192 | 3,103,361 | 1,504,826 | 1,047,662 | 8,708,427 | 5,829,977 | 2,254,991 | 31,400,436 | 32,184,017 | 97.57 |
s9 | 14,028,859 | 5,137,018 | 2,354,919 | 1,636,726 | 12,759,356 | 8,898,154 | 3,636,798 | 48,451,830 | 49,355,861 | 98.17 |
s18 | 11,213,116 | 3,143,776 | 1,770,509 | 1,222,380 | 9,865,538 | 6,346,251 | 2,907,113 | 36,468,683 | 37,482,840 | 97.29 |
s19 | 19,344,254 | 5,041,845 | 2,352,257 | 1,660,252 | 14,295,732 | 9,649,459 | 3,971,159 | 56,314,958 | 57,373,824 | 98.15 |
s20 | 14,389,383 | 2,483,183 | 1,655,491 | 1,167,962 | 10,297,493 | 6,498,218 | 2,898,821 | 39,390,551 | 40,232,687 | 97.91 |
s23 | 8,535,808 | 3,406,281 | 1,777,701 | 1,231,132 | 8,486,696 | 5,908,651 | 2,577,899 | 31,924,168 | 32,556,203 | 98.06 |
s25 | 9,324,820 | 3,700,893 | 2,188,323 | 1,298,372 | 9,082,312 | 6,396,720 | 2,873,231 | 34,864,671 | 35,737,834 | 97.55 |
W = A/T. D = G/A/T. S = G/C. Waste: reads not belong to all previous types. |
In this study, reliable methylation sites were kept by a cutoff of coverage no less than 3 in each sample and at least 5 samples meet the criteria. We found 4,125,537 methylation sites in this study, among which CCWGG and CCGG only account 14.37% (593,253, Table 2). Therefore, most of these sites were out of the initial scope of MethylRAD[21]. On the contrary, reads with CmCG in either end made up of almost half (49.04%) of all methylation sites (Table 2). This ratio is much lower than that for CmCG reads in total reads. Besides, the coverage for the methylation sites were about 20× and they didn’t show difference between female and male samples. These results proved that reads for all samples meet the need of following analyses. At last, whole genome methylation level were calculated for all female and males, and the average methylation level for female (0.54) was significantly higher than that of male samples (0.49, Wilcoxon test, p = 0.03175, Fig. 2).
Table 2
Coverage in different types of FspEI digestion sites in s2 and s19.
Types | Num | s2 average coverage | s19 average coverage |
CCWGG | 593,253 | 24.52 | 33.88 |
CDS_sym | 503,939 | 17.82 | 19.74 |
Front_CDS | 583,382 | 22.35 | 24.01 |
Front_CGG | 1,143,537 | 20.53 | 24.06 |
End_CCG | 880,950 | 19.86 | 22.7 |
End_CDS | 420,476 | 22.05 | 24.11 |
Differentially methylation genes’ identification between male and female groups
After the whole genome methylation pattern analysis of each sample, we would like to find differentially methylation genes (DMGs) and differentially methylated sites (DMSs) between male and female groups. At first, samples MDS clustering were carried out to check the samples’ correlation, and the result show that female samples differed clearly with male samples (Fig. 3). With a stringent criterion on DNA methylation site, 161,895 DMSs were found between female and male groups. They were distributed almost evenly on each chromosome (Fig. 4). According to their position, 91,239 out of them were found located within 21,864 gene’s body and their 2kb flanking region. At the same time, 1,685 genes have more than 10 DMSs in their gene body and flanking region.
As methylation of cytosines responsible for gene’s regulation were always clustered into methylation island in promoter region which located 2kb upstream to 1kb downstream of transcript start site (TSS). In this study, 12,985 DMSs were identified in 7,545 gene’s regulation regions. Among these genes, only 1,137 genes have more than 3 DMSs with in the regulation region, and they were DMGs in this study. For the top 10 genes with most DMSs in promoter region, they are psip1b, znf839, CR848040.3, anpepb, katnb1, CABZ01027248.1 and mtdhb. In these genes, 561 were found hyper-methylated in promoter region in female when compared with male samples. These hyper-methylated genes were enriched in several GO catalogs, such as fatty acid synthase activity (GO:0004312), insulin-like growth factor I binding (GO:0031994), and swimming behavior (GO:0036269) (Fig. 5.). Besides, the rest 549 DMGs were found hypo-methylated in female samples when compared with male ones. The KEGG enrichment analysis show that they were enriched in different pathways which were associated with growth rate, such as GnRH signaling pathway (dre04912) and Insulin signaling pathway (dre04910).
Association between DNA methylation and gene expression in liver tissue
DNA methylation has been found to be involved in many processes regarding gene transcription and regulation[29, 30]. As gene expression data of the same samples had been published in our previous article, the correlation between the two datasets can be further studied in liver tissue[8]. To do this association analysis, 580 genes, both presented in 1,137 DMGs and 2,395 differentially expressed genes (DEGs), were used to determine how DNA methylation level changes affects their gene expression. First, matrix for gene expression and DNA methylation were examined. A negative correlation between the two datasets was evident (Person correlation test, r=-0.0413, Fig. 6). Besides, variance in methylation level is much smaller than that in RNA-SEQ data (chi2-test, p < 0.001). KEGG analysis for these genes found that they were enriched in pathways associated with GnRH signaling pathway (dre04912) and MAPK signaling pathway (dre04010, Supplementary Table 1).