Summary of genome assembly and annotation for W. pigra
Using a whole-genome shotgun strategy with the Illumina HiSeq™ 2000 platform, we sequenced the genome of W. pigra from Wuhan, the provincial capital Hubei, China. The de novo assembly of a 146 Gbp high-quality sequences from 2 paired-end and 3 mate-pair libraries provided 100-fold coverage with a total assembly length of 177 Mbp (Table 1), which approximates the genome size estimated by 23 K-mer distribution (Figure S1). The scaffold N50 is 728 Kbp. 3495 scaffolds are with length >2 Kbp. Repeat content comprised 23% of the W. pigra genome, which is 10% lower than that of the H. robusta [7]. The W. pigra shares a similar profile of GC content (35%) with H. robusta (33%). A total of 26,743 protein-coding genes were predicted in W. pigra. W. pigra and H. Robusta showed similar gene model features in a whole. However, W. pigra has shorter intron length and longer protein length compared with H. robusta (Table 1). A total of 17123 protein-coding genes were annotated in all three common databases Uniprot, TrEMBL and interPro (Figure 1B). We identified 12346 and 10295 orthologous genes between W. pigra and H. medicinalis, and between W. pigra and H. robusta, respectively, using reciprocal best blast hits (RBHs) method (Figure 1C). There are a large proportion of genes (14398 and 16449) in W. pigra not assigned as orthologous genes.
Table 1. Summary for genome sequencing, assembly and annotation
|
H. robusta
(Ref [7])
|
W. pigra
(this study)
|
H. medicinalis
(Ref [8])
|
Size of genome assembly
|
228 Mbp
|
177 Mbp
|
187M
|
Num. of Scaffolds
|
1,993
|
10,050
|
14,042
|
Num. of Scaffolds (> 2Kbp)
|
1124
|
3495
|
105
|
Scaffold N50
|
3,060 Kbp
|
728 Kbp
|
97 Kbp
|
Total reads
|
3,176,156
|
118,388,619
|
62,184,084
|
Reads mapping to genome (%)
|
2,839,951 (89%)
|
112,480,685 (95%)
|
NA
|
Sequencing coverage depth
|
7.92X
|
100X
|
73X
|
Repetitive content (%)
|
33
|
23
|
NA
|
GC (%)
|
33
|
35
|
41
|
Num. of predicted genes
|
23,400
|
26,743
|
14,596
|
Protein length (aa)
|
376
|
438
|
464
|
Mean exon length
|
203 bp
|
205 bp
|
224
|
Mean intron length
|
526 bp
|
391 bp
|
716
|
Mean number of exons per gene
|
6.1
|
6.4
|
8
|
GC, fraction of guanine plus cytosine nucleobases; Scaffold N50, the length such that half of the assembled sequence is in scaffolds longer than this length.
Gene synteny among the genomes of W. pigra, H. medicinalis and H. robusta
The above result showed that W. pigra only has 46.2% (12346) orthologous genes in H. medicinalis, and 38.4% (10295) orthologous genes in H. robusta. To further compare the genome similarity among the three leeches, we performed a careful analysis of syntenic blocks between W. pigra and H. medicinalis, and between W. pigra and H. robusta using MCScanX[15]. As small scaffolds are not useful for gene synteny analysis, we only considered the scaffold with more than 30 genes. Finally, we identified 21 scaffolds in H. medicinalis had syntenic blocks matched to the 13 scaffolds in W. pigra. In contrast, there are 33 scaffolds in H. robusta matched to the 21 scaffolds in W. pigra (Figure 2). Overall, the genome of W. pigra has a good collinearity relationship with other two genomes. We further examined the synthetic blocks in the larger scaffolds Wh8, wh9, wh17, and wh22. We found that H. medicinalis tends to have larger synthetic blocks matched to the scaffolds of W. pigra than H. robusta. It suggests that compared to H. robusta, W. pigra has a more similar genome structure with H. medicinalis.
The expansion and contraction of gene family in the W. pigra genome
After analysis of gene synteny, we further analyzed the expansion and contraction of gene family among the seven species: H. robusta, Lottia gigantea, Capitella teleta, Schmidtea mediterranea, Schistosoma mansoni, W. pigra, H. medicinalis. We compared the predicted proteomes of seven species, yielding a total of 13563 orthologous gene families that comprised 108245 genes. We found 1488, 832 and 1266 gene families expanded in W. pigra, H. medicinalis and H. robusta, respectively (Figure 1D). Of these families, there are 63, 1 and 59 families that are evolving rapidly (P<0.05) in W. pigra, H. medicinalis and H. robusta, respectively (Figure 1E). These rapidly evolving families are species-specific and little overlap between the two species (Figure 1E). To reveal the molecular function and structural domain of these rapidly evolving families, we performed enrichment analyses by gene ontology terms and interPro domains. The enrichment results showed a clear difference among the three leeches. For W. pigra, the expanded families are enriched in the following functions: protein histidine kinase activity, O−acyltransferase activity, thiamine pyrophosphate binding, carbohydrate binding, proteolysis and so on. For H. robusta, the expanded families are mainly enriched in the functions such as sodium channel activity, sodium ion transport, zinc ion binding, and RNA−DNA hybrid ribonuclease activity. For H. medicinalis, only two functions endopeptidase inhibitor activity and extracellular region are enriched (Figure 3A). In contrast, for the contracted families, there are little go terms enriched in W. pigra and H. robusta, but more go terms enriched in H. medicinalis. For example, iron ion binding, heme binding, proteolysis, and sodium channel activity functions are enriched by the contracted family in H. medicinalis (Figure 3C). Corresponding to these functions, specific protein domains are enriched in different leeches. These results imply the three species may take different adaptive strategies. And the different functions and domains are potentially related to environmental adaptation and bioactive peptides properties of the three leeches.
Phylogenetic analysis and sequence alignment of the hirudin gene family
As the most well-studied natural anticoagulant from leeches, hirudin has served as a standard for designing natural coagulation inhibitors[16]. Hirudin may be useful in the therapy of thrombosis because of its specific antithrombin effects[17]. We identified two hirudin genes g14352 and g17108 (Figure 4A) in W. pigra in this study. We named g14352 and g17108 as hirudin_1, hirudin_2, respectively (Figure 5). For comparison, we also identified three hirudin genes g9136, g9138, and g9139 in H. medicinalis. These five hirudin genes and 38 hirudin-like sequences from protein database UniProt were used to clarify the phylogenetic relationships of these hirudin genes (Figure 4A). They are clustered into three clades (named Groups 1, 2 and 3) (Figure 4A). Three groups are highly supported with bootstrap value >95. The sequences (Group 3) from W. pigra do not cluster with the other hirudin genes. Groups 1, 2 and 3 follow different cysteine patterns CX(7)CX(1)CX(5)CX(5)CX(10)C, CX(7)CX(1)CX(5)CX(5)CX(8)C and CX(8)CX(1)CX(5)CX(5)CX(10)C, respectively (Figure 4B). The pattern of group 1 is the typical cysteine pattern of the hirudin. In contrast, gene g17108 (hirudin_2) of W. pigra shows the third cysteine pattern, which inserts an extra amino acid between the first and second cysteines. The gene g17108 (hirudin_2) is a new kind of hirudin, which has not been reported before.
Genome-wide distribution, gene structure and transcript levels of hirudin genes
Although there are a lot of studies about hirudin, the genome-wide distribution and gene structure of hirudin have not been reported. By sequence searching, we found that g14352 (hirudin_1) and g17108 (hirudin_2) are located at different scaffolds 5072 and 278, respectively (Figure 5A). The left and right sides of g17108 (hirudin_2) are surrounded by multiple genes (24 and 69 genes, respectively). We can infer that the two genes are separated by great distances (>210 Kbp). It suggested a lot of genome rearrangement events happened after gene duplication of hirudin genes. Furthermore, gene structures of the two hirudin genes are also different. g14352 (hirudin_1) only has three exons. In contrast, g17108 (hirudin_2) has four exons, which encode a signal peptide and a longer tail. Therefore, protein hirudin_2 has a longer sequence than hirudin_1 (Figure 4B and 5B).
Exploration of bioactive ingredients in the W. pigra genome
There are more than 20 bioactive substances identified from leeches, such as Antistasin, hirustasin, ghilantens, hirudin [2, 6, 18, 19]. These molecules have analgesic, anti-inflammatory, anticoagulant, platelet inhibitory, thrombin regulatory functions, and so on. W. pigra, H. medicinalis and H. robusta are belong to the family hirudinidae and glossiphoniidae, respectively. The detailed distribution of these bioactive substances in different leech species is still unknown. It is essential to identify and compare these active molecules in different leeches. Using the genome data, we systematic explored and compared five classes of active substances in W. pigra, H. medicinalis and H. robusta (Table 2). All 9 common active molecules were found in W. pigra. It is noteworthy that hirustasin, hirudin and destabilase I genes are absent in the H. robusta. There are far more gene copies for the active molecules in W. pigra than in H. robusta (57 vs 24). W. pigra exceeds H. robusta in both kinds and gene number of active molecules. The gene copy of bioactive ingredients of W. pigra also exceeds that of H. medicinalis.
Table 2. The exploration of five class of active substances in W. pigra, H. medicinalis and H. robusta
Modes of action
|
Bioactive molecules
|
Gene copy
(H. robusta)
|
Gene copy
(W. pigra)
|
Gene copy
(H. medicinalis)
|
Analgesic and anti-inflammatory effect
|
Antistasin
|
6
|
9
|
5
|
Hirustasin
|
0
|
18
|
7
|
Ghilanten
|
9
|
10
|
7
|
Extracellular matrix degradation
|
Hyaluronidase
|
1
|
4
|
5
|
Inhibition of platelet function
|
Saratin
|
1
|
4
|
1
|
Anticoagulant effect
|
hirudin
|
0
|
2
|
3
|
Factor Xa inhibitor
|
2
|
1
|
1
|
Therostasin
|
5
|
4
|
2
|
Destabilase I a
|
0
|
5
|
2
|
Antimicrobial effect
|
Destabilase I a
|
0
|
5
|
2
|
Total
|
|
24
|
57
|
33
|
a Destabilase I, involved in anticoagulant effect and antimicrobial effect.
Gene expression analysis of W. pigra
We make full use of the available RNA-seq data to analyze the gene expression in W. pigra. We divided all genes of W. pigra into four parts (No expression, Low, Medium, High). The expression of these genes is shown in Figure 5C. These genes are involved in different molecular functions (Figure 5D). No expressed genes are enriched in the functions of ATPase activity, oxidoreductase activity, DNA−binding transcription factor activity, transposase activity and so on. Low expressed genes are enriched in the functions of ion transport, G protein−coupled receptor activity, carbohydrate binding, microtubule motor activity and so on. Medium expressed genes are enriched in the functions of nucleus, zinc ion binding, Rho guanyl−nucleotide exchange factor activity, protein dephosphorylation and so on. High expressed genes are enriched in the functions of translation, ribosome, serine−type endopeptidase inhibitor activity, enzyme inhibitor activity and so on. Finally, we examined the gene expression of bioactive peptide (Figure 5E). All kinds of bioactive peptides were expressed. Of these peptides, antistasin, therostasin, and hirudin have higher expression, while factor Xa inhibitor and ghilanten have lower expression. The result implies that these bioactive peptides may play different roles in the survival of W. pigra.