General genome features
The genomic DNA isolated from T. tridentatus was sequenced to 124× coverage and assembled into a 2.06-Gb genome. The k-mer analysis yielded an estimated genome size of 2.22 Gb with a depth peak of 78×. The final draft assembly consists of 143,932 scaffolds with an N50 scaffold size of 165 kb, among which the longest scaffold size is 5.28 Mb and the shortest is 1 kb. The GC content of the genome is 32.03% (Table 1). A total of 24,222 protein-coding genes were conservatively predicted in the T. tridentatus genome in this study. The average exon and intron lengths predicted for the assembly are 333 bp and 3,792 bp, respectively. A total of 88.25% of the predicted genes were assigned and annotated by comparing to the NCBI non-redundant database, KEGG database (16) and InterPro database (17).
Repeat annotation
The screening of repeat contents from the RepeatMasker (18) analysis based on similarity alignments identified 20.29 Mb in T. tridentatus, representing 0.99% of the genome size. Most of the identified repeat sequences were simple repeats (0.77%). To estimate of repeat sequences which are more difficult to detect in the draft assembly, RepeatModeler (19) was used to predict potential existing but unidentified repeats. Based on this analysis, repeat elements totalled 34.83% in T. tridentatus, including a 13.26% proportion of transposable elements. Meanwhile, long interspersed elements (LINEs) composed the largest portion at 6.21%. LTR elements (1.72%) and DNA elements (5.33%) were also detected in the T. tridentatus genome. To determine the reliability of the repeat contents screening by RepeatMasker and RepeatModeler, we also performed repeat analysis of the L. polyphemus genome for reference. Similar results were obtained with the identification of repeat sequences representing 1.11% and 34.24% in L. polyphemus, respectively. Given that RepeatMasker use similarity of known repeat sequences in the Repbase database to identify repeats in the input sequence, this suggests that the repeat sequences from horseshoe crabs have a great difference compared with existing homologous repeats.
Assembly assessment
The completeness of the T. tridentatus genome assembly was assessed using the transcriptome data of the embryonic sample at Stage 21 (the hatch-out stage) of T. tridentatus (20). It was found that 99.04% of the transcriptome contigs were aligned to the assembly scaffolds, with an e-value cut-off of 10-30. To further confirm the completeness of the predicted genes, the commonly used genome assembly validation pipeline BUSCO (21) gene mapping method with 1,066 BUSCO Arthropoda gene sets were utilized. The predicted genes of T. tridentatus reveals 98.7% conserved proteins of homologous species with 1,052 BUSCOs (76.6% complete single-copy BUSCOs, 10.8% complete duplicated BUSCOs and 11.3% fragmented BUSCOs). Only 1.3% of the benchmarked universal single-copy orthologous groups of arthropod genes were missing in the assembly. This demonstrated that most of the evolutionarily conserved core genes were found in T. tridentatus genome, suggesting a remarkable completeness of genome assembly and predicted gene repertoire of T. tridentatus.
Phylogeny analysis and divergence time dating
Two L. polyphemus assemblies have been previously documented (15, 22), one of which was selected to perform comparative genomics according to a relatively higher assembly level. The OrthMCL (23) calculation resulted in a total of 12,116 orthologous groups in the genomes of T. tridentatus and L. polyphemus. Of these, 10,968 orthologues contained genes found in both horseshoe crab genomes, with 15,905 T. tridentatus and 20,390 L. polyphemus genes included; moreover, approximately 6,880 of the shared genes were single-copy. Functional enrichment analysis showed that these shared genes were involved in several important pathways (p-value < 0.05), such as metabolic pathways (pyruvate, glycerolipid, amino sugar, nucleotide sugar and so on), ribosome biogenesis and DNA replication. The analysis also identified 1,418 protein-coding genes that were only present in T. tridentatus. In total, 1,956 genes were only specific to L. polyphemus. To place T. tridentatus with the most current understanding of the evolution of Chelicerata species, phylogenetic and comparative genomic analyses of T. tridentatus and 11 other Chelicerata as well as one Myriapoda outgroup were conducted. The phylogenetic tree was rooted using the centipede S. maritima as the outgroup (Figure 1a). Strong bootstrap support was obtained for spider, mite and tick clades, forming a monophyletic group. T. tridentatus and L. polyphemus were grouped together, forming the Xiphosura clade. The comparative genomic analysis of the 14 species revealed 14,479 orthologous groups containing genes in at least two different species, among which 1,993 shared groups were commonly distributed in all sampled species, with 111 single-copy orthologues (Figure 1b). The single-copy genes enriched for KEGG pathways such as ribosome, oxidative phosphorylation, proteasome, metabolic pathways, and carbon metabolism. Additionally, T. tridentatus and L. polyphemus had the most orthologues shared among these two species (2,720 (22.2%) and 2,648 (21.5%)). Pathway enrichment of these genes showed significant enrichment (p-value < 0.01) for neuroactive ligand-receptor interaction, FoxO signalling pathway and AGE-RAGE signalling pathway in diabetic complications. The latter two KEGG pathways include the important JAK-STAT signalling pathway genes related to innate immunity in arthropods. With respect to species-specific genes, 1,124 genes were unique to T. tridentatus. C. sculpturatus had the most (7,328) expanded species-unique genes, followed by 6,247 N. clavipes-specific gene families. In contrast, only 161 genes were unique to T. mercedesae. The numbers of species-specific genes in T. tridentatus and L. polyphemus were in between, with 1,124 and 857, respectively. Nevertheless, considering the fragmentation of the draft genome, there may exist more coding genes in the analysed genomes. The species-specific genes described here only refer to the results based on the draft genomes.
The divergence time estimate results for the 7 Chelicerata species showed that the last common ancestor of Asian horseshoe crabs (including T. tridentatus) and L. polyphemus was dated to 130 Mya (121-141) and that the split of the Asian horseshoe crabs T. tridentatus and C. rotundicauda was dated to 63 Mya (57-69) (Figure 2), while the internal split of T. tridentatus from southern coastal China to the Korean Peninsula was dated to 12 Mya (11–14). Both the species tree and time tree suggested that horseshoe crabs are closely related to scorpions and that the split of scorpions from horseshoe crabs was dated to 440 Mya (412-468).
Two Hox gene clusters
Hox genes, which are a highly conserved subclass of homeobox super-class genes that have been extensively investigated, are usually distributed in clusters (24-27). Analysis of the Hox gene family showed that the T. tridentatus assembly contained 46 Hox genes, while 43 Hox genes were identified in L. polyphemus (Table S1). This is the most complete set of Hox genes we obtained based on homeobox domains from these two horseshoe crab assemblies. We found that most Hox genes had at least two representatives in both genomes, which was consistent with a previous whole-genome duplication study in horseshoe crabs (28).
We further examined the positions of the identified Hox genes in the two genomes and found two clusters of adjacently distributed Hox1 and Hox4 in the T. tridentatus assembly. In L. polyphemus, there was one Hox cluster of adjacent Hox1 and Hox4 genes and one additional Hox1, Hox2 and Hox3 cluster. Other clusters, such as adjacent Hox2 and Hox3 clusters and longer clusters of Hox4, Hox7, Ubx, AbdA and AbdB genes found in the two assemblies, could probably be connected to the two clusters mentioned above. Based on the Hox gene positions in the assemblies, our analysis is consistent with a previous study and suggests that there are possibly two Hox gene clusters present in horseshoe crabs if Hox genes are linearly arranged in clusters along the anterior-posterior axis similar to the ancestral arthropod Drosophila (29).
Expansion of crucial gene families of the innate immune signalling pathways in T. tridentatus and L. polyphemus
Immune-related genes can be broadly classified into pattern recognition receptors (PRRs), signaling transduction pathways and effectors. We manually searched the T. tridentatus and L. polyphemus genomes and T. tridentatus transcriptome for homologues of essential immune-related genes. PRRs in T. tridentatus and L. polyphemus show large amounts of expansion, and key genes in the signal transduction pathways also exhibit a certain degree of expansion (Figure 3). We examined six PRR families in T. tridentatus and L. polyphemus, which included the peptidoglycan recognition proteins (PGRPs), thioester-containing proteins (TEPs), fibrinogen-related proteins (FREPs), down syndrome cell adhesion molecules (Dscams), galectins and C-type lectins (CTLs). The results revealed 42 FREPs and 117 Dscams in T. tridentatus that were extensively present in both horseshoe crab genomes with functional domains.
Recognition of PAMPs by PRRs triggers signal transduction pathways through transcriptional activation. All known gene family components that play important roles in innate immune signal transduction in arthropods (such as the Toll, IMD, JAK-STAT, and JNK pathways) (30-32) are present in the genomes of T. tridentatus and L. polyphemus. From our analyses, we show that IMD and JAK-STAT pathway genes in T. tridentatus and L. polyphemus exhibit a certain degree of expansion. The orthologue analysis for shared genes in horseshoe crabs with their close evolutionary related species found that horseshoe crabs have the most unique gene orthologues shared in only two species, including the abovementioned expanded gene families.
Regarding the IMD signalling pathway, imd and IKK exit as a single gene, and we discovered multiple copies of genes encoding death-related ced-3/Nedd2-like proteins (Dredds), MAPKKK transforming growth factor - b (TGFb) - activated kinase 1 (Tak1) and Relish proteins within T. tridentatus and L. polyphemus. For Dredds, the phylogeny tree shows one branch including 7 corresponding genes identified in the two horseshoe crabs and 1 gene in C. sculpturatus. Another branch encompasses 2 genes in P. tepidariorum (Figure 4a). The Dredds are required for Tak1 activation. For Tak1 homologue analysis, one branch consisting of two gene copies in T. tridentatus and L. polyphemus showed gene expansion (Figure 4b). Moreover, main components of the JAK-STAT signalling pathway, including the receptor Domeless and the Janus Kinase and STAT transcription factor, were identified in both T. tridentatus and L. polyphemus, indicating that the JAK-STAT pathway has remained intact in horseshoe crabs. Two STAT homologue candidates were identified in the T. tridentatus genome with the typical functional domains, including a DNA binding domain and an SH2 domain which are conserved compared to those reported in insects and shrimps (33). Plausible homologs of major components of the JNK signalling were also identified in both T. tridentatus and L. polyphemus. Phylogenetic analysis of JNKs showed that there were three branches consisting of a pair of corresponding genes identified in the T. tridentatus and L. polyphemus genomes and one branch formed by a pair of genes in C. sculpturatus and S. mimosarum (Figure 4c).
Antimicrobial peptide diversity in T. tridentatus
A hallmark of the T. tridentatus host defence system is the production of antimicrobial peptides, which act as innate immune effectors (34). We searched the T. tridentatus genome for antimicrobial peptide genes and identified most of the antibacterial peptides that have been reported, including one anti-LPS, two tachyplesin and two big defensin peptides (Figure 3).
The anti-LPS gene found in the T. tridentatus genome contains an antimicrobial peptide (AMP) region between G23 to R83 with two conserved cysteine residues as well as a hydrophobic NH2-terminal and cationic residues clustered in its disulphide loop, which are supposed to act as an affinity site in combination with LPS (35, 36). The tachyplesin family includes constitutively expressed cationic peptides comprised of 17-18 amino acids that strongly inhibit the growth of both Gram-negative and -positive bacteria, including pathogenic microorganisms from marine bivalves such as Bonamia ostreae, Perkinsus marinus and Vibrio P1, and can also have strong inhibitory effects on the growth of fungi (37, 38). In this study, we identified two tachyplesin precursors in T. tridentatus, each of which consists of 77 amino acids encompassing a putative signal peptide sequence, a mature tachyplesin peptide sequence, a C-terminal arginine followed by the amidation signal residues Gly-Lys-Arg and a 22-aa peptide in the C-terminal portion (38). In addition to this, two big defensin protein precursors are also present in the T. tridentatus genome, one of which is 118 amino acids in length and contains a hydrophobic N-terminal half and a cationic C-terminal half, which may be closely related to its biological activity for broad antimicrobial properties (39).
Intact coagulation cascades in T. tridentatus
Serine protease-dependent rapid coagulation in horseshoe crabs has been shown to play a key role upon the activation of immune pathways in response to pathogen detection (40). We found that T. tridentatus and L. polyphemus have all the coagulation-related genes while other related species lack a part of the coagulation pathway (Table 2), indicating a wider diversity of coagulation factors and a relatively intact coagulation cascade present in horseshoe crabs. Factor G, a heterodimer that is specifically activated by the fungal cell wall component 1,3-β-D-glucan, is a special serine protease precursor that provides another starting point for the clotting reaction (41, 42). We identified 4 factor G sequences in our T. tridentatus genome and transcriptome assembly, including genes encoding the alpha and beta subunits, respectively. However, we failed to identify any clotting factor G homologues in other Chelicerata species.