RNA extraction and RNA-seq quality assessment
At the beginning of the in vitro culture, PSCs were invaginated, and the structure of the calcareous corpuscles and the apical hook could be observed under the inverted microscope (Fig. 1A). After 10 days of incubation, the morphology of the PSCs changed greatly, the apical hook and the sucker were evaginated (Fig. 1B). On day 20, the micro-cysts appeared with a thin laminated layer, and the small hook of scolex moved toward the central position of these micro-cysts, moreover, the calcareous corpuscles of the micro-cysts were observed clearly (Fig. 1C). After 40 days the micro-cysts and laminated layer were observed clearly, the apical hooks and sucker were not completely degraded (Fig. 1D). On 80 days, the apical hook and sucker disappeared or degraded, the calcareous corpuscles decreased, and a considerable increment in the size of the cysts could be observed (Fig. 1E). Samples from three batches of PSCs specimens for five different culture periods (EgPSC_0d, EgPSC_10d, EgPSC_20d, EgPSC_40d, and EgPSC_80d) were collected. Total RNA extraction results showed that the RNA bands was slightly degraded, with no pigment contamination, obvious protein content, sugar, or other impurities.
Sequence assembly and annotations
Through the second-generation high throughput sequencing Illumina platform, the cDNA library forms each sample was constructed and sequenced, which average generated 51,300,044 clean reads representing a total of 7,597,846,677 (7.5 Gb) nucleotides of each sample (Table 2). The GC content is one of the important characteristics of the genome base sequence, which can reflect the structure, function and evolutionary information of the gene [25]. Interestingly, the average GC content of the E. granulosus transcriptional level (46.44%) was similar to both its genome (42.1%) and coding regions (49.3%) [26]. Early studies estimated the S. mekongi, B. malayi and S. mansoni had a GC content of 34.1%, 30.5% and 35.3%, respectively [27, 28], indicated that compared with other parasite species, the GC content of E.granulosus was higher.
After alignment and assembled clean reads, a total of 14,903 genes and 32,401 transcripts were obtained, including 3584 new genes and 21,082 new transcripts, which had not been reported in the previously published transcriptome of the parasite. Among 32,401 transcripts, 20,511 were annotated by GO (63.3%), 13,730 by KEGG (42.38%), 17,152 by COG (52.94%), 28,188 by NR (87%), 16,511 by Swiss-Prot (51.08%), 17,874 by Pfam (55.16%). In addition, Veen diagram analyzed of function annotation showed that 4,458 reference transcripts and 53 new transcripts were simultaneously annotated into all gene databases. Furthermore, some genes or transcripts could not be annotated and were defined as hypothetical protein, which suggested that the group of identified transcriptome dataset is sufficiently reliable for further characterization [29]. The length of the transcript was widely distributed, and the majority of the transcripts longer than 1,800 bp (34.65%) in length. In contrast, the transcripts length between 0 and 200 bp were the least (1.34%) (Fig. 2). The obtained transcriptome raw reads dataset has been submitted to the NCBI Short Read Archive (SRA) database under the accession number: SRP172517.
Veen diagram analysis and differential expression analysis between samples
Based on the expression matrix, inter-sample Venn diagram analysis was performed to obtain the co-expression and specific expression genes (Fig. 3A) or transcripts (Fig. 3B) between groups. Intriguingly, although a total of 14,903 genes and 32,401 transcripts were obtained by transcriptome sequencing at different developmental stages of PSCs in the encystation process, these genes and transcripts did not always present in every period of the cysts formation process. Conversely, the genes and transcripts were specifically expressed in different stages utilizing inter-sample Veen diagram analysis, especially at EgPSC_0d, the number of stage-specific genes and transcripts was the largest, 1079 and 1990, respectively. Secondly, the stage-specific genes and transcripts at the 20d stage were 277 and 704, respectively. While EgPSC_0d were the period when the original PSCs was just removed from the fresh sheep liver, and EgPSC_20d was the initial period of cyst formation, both of which were key periods of the encystation process. In the previous transcriptome sequencing of the E. granulosus indicated that the presence of stage-specific genes among the cyst wall, PSCs and pepsin/H+-activated PSCs [12], which indicated that the stage-specific genes exist not only in different stages of the E. granulosus development but also in different periods of development.
For the DEGs analysis of each sample, EgPSC_0d was used as the control group compared with other groups (EgPSC_0d vs 10d, EgPSC_0d vs 20d, EgPSC_0d vs 40d and EgPSC_0d vs 80d). The parameter set as P-value cutoff of < 0.05 and an FC cutoff of > 2 to identify the significant DEGs, including 1,991 up-regulated and 2,517 down-regulated DEGs, respectively. A set of gene which composed of 1,991 up-regulated DEGs was established to provide the functional classification compared with the COG database. The result showed that most of the genes (25%) were classified into the poorly characterized category, which described as function unknown. Secondly, 144 genes (7.2%) were classified as cellular processes and signaling, which described as intracellular trafficking, secretion, and vesicular transport (Additional file 1: Table S1). Given that high levels of expression often indicate a fundamental role, these represent interesting goals for further research. The finding can open the way to the further studies.
Based on Veen diagram analysis, the total of 1,991 gene were up-regulated and 2,517 were down-regulated DEGs, we found that 52 genes were consistently on the rise during the process of encystation (Fig. 4A), while 152 genes have continued to decline (Fig. 4B). Through cluster heat map analysis of 52 consistently up-regulated DEGs (Fig. 4C), showing that hypothetical protein (EGR_10197) and (EGR_05435) increased significantly at the late stage of encystation, and both of them were annotated as a hypothetical protein. Besides, gene1117 described as an integral component of the membrane and classified into cellular component term of the GO database. Moreover, 13 genes annotated to COG database among 52 DEGs and served many important biological functions, such as amino acid transport and metabolism, intracellular trafficking, secretion, and vesicular transport (Table 3). After analyzed the correlation coefficients between the 52 DEGs with FDR< 0.05 (Fig. 4D), a visualization network was obtained, which showed that cornifelin (EGR_03608), platelet glycoprotein (EGR_09887), potassium voltage-gated channel subfamily C member 3 (EGR_05864) and others are related to the expression of many other genes.
Another interesting found was that none of the genes remained up-regulated during the encystation process when we analyzed by Veen diagram analysis of the gene sets among EgPSC_0d vs 10d, EgPSC_10d vs 20d, EgPSC_20d vs 40d and EgPSC_40d vs 80d (Additional file2: Figure S1). Only ten genes exist at most of four gene sets simultaneously, and two genes of them continued up-regulated from EgPSC_10d to 80d, EGR_05443, and EGR_03870, respectively, and the other 8 genes were continuously up-regulated from EgPSC_0d to 10d and EgPSC_20d to 80d. In addition, of the 10 genes, except EGR_09887 and EGR_09469, which described as platelet glycoprotein and dynein light chain, respectively, the remain genes were described as hypothetical proteins (Table 4).
GO and KEGG enrichment analysis of DEGs
GO was a comprehensive database that aggregates all gene-related research results from around the world and it could standardize biological terms for genes and gene products from different databases, providing a uniform definition and description of gene and protein functions [30]. To obtain a better understanding of the enriched functions of 1,991 DEGs in the process of encystation of E. granulosus, GO term enrichment analyzed was performed. According to the GO classification of DEGs, there were 1094, 148 and 263 genes assigned to cellular component (CC), biological process (BP) and molecular function (MF) categories, respectively, of which 709, 0 and 122 genes were significantly DE (FDR value < 0.05) (Additional file 3: Table S2). The top three predominant terms for BP were endoplasmic reticulum unfolded protein response, O-glycan processing, and ion transport; for CC they were integral component of membrane, intrinsic component of membrane, and membrane part; and for MF they were phosphoric ester hydrolase activity, hydrolase activity, acting on ester bonds and beta-1,3-galactosyltransferase activity (Fig. 5).
KEGG is a large-scale database for systematic analysis of gene function, genomic information, and functional annotation. By analyzing the results using the information in this database, genes or transcripts can be classified according to the metabolic pathways or functions involved, including organismal systems (OS), metabolism (M), environmental information processing (EIP), human diseases (HD), genetic information processing (GIP) and cellular processes (CP) [25]. KEGG pathway enrichment analysis revealed that the 1,991 DEGs classified into 291 pathways (Additional file 4: Table S3) (Fig. 6). Analysis of the top 20 most significantly enriched pathways showed that the map 04962, which described as vasopressin-regulated water reabsorption metabolic pathway was the most significantly enriched pathways (FDR value < 0.05) (Fig. 7), followed by pantothenate and CoA biosynthesis (map00770) and Hippo signaling pathway (map04391).
Since the morphological changed of the PSCs were greatest in the first 10 days, we performed GO and KEGG enrichment chord analysis the top ten of GO term and KEGG pathway with FDR value < 0.05 on the DEGs between EgPSC_0d and EgPSC_10d. After analyzed, 43 and 61 specific genes were obtained, respectively (Additional file 5: Figure S2, Additional file 6: Figure S3).
Real-time quantitative PCR analysis
To validate the RNA-seq data and the analyses derived from them, 12 DEGs were selected to measure their relative expression level by QPCR, including six up- and six down-regulated DEGs. The inter-sample statistical analysis was performed using T-test. 0.01<P<0.05 was considered significant (*), P<0.01 (**) and P<0.001 (***) were considered extremely significant. The QPCR data were compared with the relative expression level of the gene in the RNA-seq TPM value to confirm the reliability of the transcriptome sequencing results (Fig. 8).