A total of 312.2 Gb PacBio reads, 215.7 Gb Illumina short reads, and 253.7 Gb Hi-C reads were generated in a female G. eckloni (Supplementary Fig. 1, Additional file 1; Supplementary Table 1, Additional file 2). After filtering, 215.2 Gb (231.2× coverage) of clean Illumina data were retained to perform a genome survey (Supplementary Table 1, Additional file 2). Based on the k-mer (k = 17) depth frequency distribution analysis, the genome size of G. eckloni was estimated to be 927.13 Mb (Supplementary Fig. 2, Additional file 1; Supplementary Table 2, Additional file 2). For genome assembly, a total of 239 Gb clean PacBio long reads (334.6× coverage) with an average length of 23,706 bp were generated (Supplementary Table 1, Additional file 2). The data were assembled using wtdbg2 followed by Quiver and Pilon polishing using the 215.2 Gb (231.2× coverage) of Illumina HiSeq clean reads, which produced a 918.45 Mb genome assembly with a contig N50 size of 4.19 Mb (Supplementary Table 3, Additional file 2). The assembled genome size was slightly larger than most other teleosts (~700 Mb) [14, 15], but obviously smaller than the genomes of Cyprinus carpio (1.83 Gb) and Carassius auratus (1.85 Gb) [16, 17]. The assembled sequences were further anchored and orientated onto 23 pseudo-chromosomes using Hi-C data. The 23 pseudo-chromosomes ranged in size from 15.91 to 89.39 Mb (Fig. 1a, Table 1), covering ~98.52% of the whole genome. The heatmap of the pseudo-chromosome crosstalk confirmed the high quality of our genome assembly (Fig. 1a, b). Finally, the G. eckloni genome was obtained with 711 scaffolds and a total length of 918,681,488 bp, a contig N50 of 4.19 Mb, and scaffold N50 of 43.54 Mb (Supplementary Table 3, Additional file 2).
We conducted a sequence consistency assessment using BWA software. The results showed that 93.40% of the reads could be mapped, covering 96.34% of the assembled genome when the Illumina short reads were mapped to the assembled genome (Supplementary Table 4, Additional file 2). The completeness of the final assembled genome was assessed using CEGMA and BUSCO analyses. The CEGMA analysis revealed that 221 conserved genes (89.11% of the core eukaryotic genes) supported the completeness of the assembled genome (Supplementary Table 5, Additional file 2). The BUSCO analysis showed that 88.4% (single-copy genes: 82.3%, duplicated genes: 6.1%) of the 2,586 single-copy genes were identified as complete, 2.4% were fragmented, and 9.2% were missing from the assembled genome (Supplementary Table 6, Additional file 2). Overall, these analyses suggested that we assembled a high quality and chromosome-level genome of G. eckloni.
Genome annotation
The repeat sequences prediction analysis revealed that 47.63% of the G. eckloni genome was annotated as repetitive elements (Supplementary Table 7, Additional file 2), of which LTRs were the most abundant with a total length of 356.79 Mb, accounting for 38.84% of the whole genome. SINEs were the rarest with a total length of 2.37 Mb and represented 0.26% of the whole genome (Supplementary Table 8, Additional file 2).
Using a comprehensive strategy based on homologous sequence searches, ab initio gene predictions, and RNA-seq-derived evidence from nine tissue and blood samples, a total of 24,430 protein-coding genes were predicted in the genome of G. eckloni. The average transcript length was 1,530.87 bp with an average coding sequence (CDS) length of 1,536.71 bp. The average exon number per gene was 8.88 with an average exon length of 173.00 bp and average intron length of 1,862.69 bp (Supplementary Fig. 3, Additional file 1; Supplementary Table 9, Additional file 2). The comparison of gene numbers, average coding sequence lengths, and average gene lengths with eight other fish species indicated that our annotations were comprehensive (Supplementary Fig. 3, Additional file 1; Supplementary Table 10, Additional file 2).
A total of 23,157 genes were annotated using at least one public database and represented 94.80% of the total predicted protein-coding genes (Table 2). The functional annotations found that 20,432 (83.60%), 23,110 (94.60%), and 21,539 (88.20%) genes had significant hits with proteins catalogued in Swissprot, NR, and InterPro, respectively. A total of 15,281 (62.60%), 20,593 (84.30%), and 19,290 (79.00%) genes were annotated to GO, KEGG, and Pfam, respectively (Supplementary Fig. 4, Additional file 1; Table 2). Additionally, 1,717 miRNAs, 12,157 tRNAs, 1,780 rRNAs, and 1,152 snRNAs were identified, which had average lengths of 116.58, 75.00, 178.65, and 133.32 bp, respectively (Supplementary Table 11, Additional file 2). Taken together, these analyses suggested a satisfying level of completeness and accuracy of genome annotation.
Table 2
The number of genes with functional classifications in G. eckloni.
Database
|
Number
|
Percent (%)
|
SwissProt
|
20,432
|
83.60
|
NR
|
23,110
|
94.60
|
KEGG
|
20,593
|
84.30
|
InterPro
|
21,539
|
88.20
|
GO
|
15,281
|
62.60
|
Pfam
|
19,290
|
79.00
|
Annotated
|
23,157
|
94.80
|
Unannotated
|
1,273
|
5.20
|
Total
|
24,430
|
94.80
|
Gene family analysis and phylogenetics of G. eckloni
To examine G. eckloni evolution, we conducted a comparative genomics analysis using the G. eckloni genome obtained from this study and 13 other vertebrate genomes downloaded from the Ensembl database (Astyanax mexicanus, Ictalurus punctatus, Danio rerio, C. carpio, Ctenopharyngodon idella, Oreochromis niloticus, Oryzias latipes, Takifugu rubripes, Gallus gallus, Homo sapiens, Mus musculus, Xenopus tropicalis, and Petromyzon marinus). A total of 24,619 gene families were identified among the 14 species (Supplementary Fig. 5, Additional file 1; Supplementary Table 12, Additional file 2), of which 2,739 core gene families were shared by all 14 species (Supplementary Fig. 6, Additional file 1). Compared to other species, we found 1,488 G. eckloni-specific genes, classified into 856 gene families. Further analysis revealed that 210 genes of the species-specific genes were significantly enriched five KEGG pathways, including Dorso-ventral axis formation, Peroxisome, ABC transporters, Endocytosis and Herpes simplex virus 1 infection.
A phylogenetic tree was constructed using 597 single-copy orthologs with one-to-one correspondence in the different genomes, indicating that G. eckloni clustered together with C. carpio (Supplementary Fig. 7, Additional file 1). According to the time-calibrated phylogeny, the age of the most recent common ancestor (MRCA) of the teleost fish was estimated to be 211.8–254.1 Ma. The G. eckloni with the closest relationship to C. carpio shared an MRCA at 34.8 Ma (26.0–41.4, Fig. 2a). Fluctuations in the ecogeographical environment and major hydrographic formation occurred during uplifts of the QTP and are generally accepted to have significantly affected speciation events of schizothoracine fish [2, 18]. Our estimated divergence time for G. eckloni and C. carpio is similar to those obtained for other speciation events of congeners in family Cyprininae [19, 20], which is consistent with the first and second tectonic uplifts of the QTP [21].
Analysis of the expansion and contraction of the gene families revealed that there were 464 (1650 genes) expanded and 743 (192 genes) contracted gene families in G. eckloni when compared to its MRCA (Fig. 2a). Further functional enrichment analysis of the expanded gene families highlighted 62 significantly enriched GO terms (p < 0.05) and 25 KEGG pathways (p < 0.05) (Supplementary Table 13, Additional file 3), while the contracted gene families highlighted 43 significant GO terms (p < 0.05) and 23 KEGG pathways (p < 0.05) (Supplementary Table 13, Additional file 3). The 25 KEGG pathways, including 625 genes from these expanded gene families, revealed that they were mainly classified as ABC transporters, Peroxisome, Herpes simplex virus 1 infection, Staphylococcus aureus infection, Axon guidance, Dorso-ventral axis formation, Pertussis, Legionellosis, Rap1 signaling pathway and so on (Supplementary Table 13, Additional file 3). Intriguingly, some of these pathways overlapped with pathways that species-specific G. eckloni genes were involved in, such as the Dorso-ventral axis formation, ABC transporters, Peroxisome and Herpes simplex virus 1 infection. The 23 KEGG pathways, including 189 genes from these contracted gene families, revealed that they were mainly classified as Tight junction, Systemic lupus erythematosus, Pathogenic Escherichia coli infection, Gap junction, Alcoholism, Pertussis, Ascorbate and aldarate metabolism, NOD-like receptor signaling pathway and so on (Supplementary Table 14, Additional file 3). Thus, it is likely that the expanded and contracted gene families played important roles in the adaptation of G. eckloni to the plateau water environment.
WGD events and chromosome evolution in G. eckloni
To determine the date of the G. eckloni WGD event (4R), we used inter- and intra-genomic colinear genes and calculated their synonymous substitution rates (Ks values) (Fig. 2b). The Ks distribution of homologous gene pairs in syntenic blocks indicated that the peak Ks for the G. eckloni genome was 0.10. Based on a Ks rate of 3.51 × 10−9 substitutions per synonymous site per year [22], we estimated that the latest WGD (4R) happened at ~14.2 Ma, which was consistent with the most recent genome duplication time of goldfish and later than the divergence time between G. eckloni and C. carpio. Similarly, an analysis of the fourfold synonymous third codon transversion (4dTv) provided additional evidence for an extra WGD event (Supplementary Fig. 8, Additional file 1).
Previous studies showed that there are several typical karyotypes in schizothoracine fish, including 2n = 90, 92, 94, 98, 148, and ± 446 [7]. Almost all species from Schizothoracinae that have been karyologically investigated are polyploid, including tetraploid and hexaploid [6, 7, 9]. The karyotype of G. eckloni was reported as 2n = 94 [7]. Along with the chromosome number, which is roughly twice as large as diploid cyprinid, including D. rerio (2n = 50), C. idellus (2n = 48), Hypophthalmichthys molitrix (2n = 48), and Hypophthalmichthys nobilis (2n = 48) [23, 24], G. eckloni is likely a tetraploid species. The D. rerio genome allowed us to identify potential fusion and fission events that shaped the G. eckloni genome since the ancestral karyotype. The chromosome synteny comparison of G. eckloni with D. rerio as a reference confirmed that four chromosome fusion events occurred in G. eckloni (Fig. 1c). Notably, chromosome 22 in D. rerio involved both fusion and fission events. The synteny analysis revealed a high level of collinearity between the chromosome-level genomes of G. eckloni and D. rerio, and although some were small, inter-chromosomal translocations indicated that the overall gene order in the G. eckloni genome remained very stable after its divergence from zebrafish (Fig. 1c). Previous studies corroborated that the C. carpio and C. auratus genome have tetraploidized due to an additional round of genome duplication (4R) and, thus, have 50 pairs of chromosomes (2n = 4x = 100) [16, 17, 22, 25], which is twice as many chromosomes as the ancestral cyprinid, D. rerio (2n = 50). All things considered, we propose that the latest WGD (4R), chromosome fusions, fissions, and deletions were a result of the formation of karyotypes with 94 chromosomes in G. eckloni.
Evolution of the globin gene superfamily in G. eckloni
The globin repertoire of extant vertebrates is the product of successive genome and gene duplication events followed by differential gene retention among lineages [26–28]. Fish endemic to the QTP is comparatively well adapted to aquatic environments with low oxygen partial pressure (hypoxia), in which the globin superfamily has played an important role [5, 29, 30]. Analyses of the G. eckloni genome revealed 25 globin genes distributed on five chromosomes (Supplementary Table 12, Additional file 2). An NJ phylogenetic tree was constructed based on the alignment of all the globins from G. eckloni and D. rerio, which showed well-supported monophyletic clades of Ngb, Cygb, Mb, XGb, and α- and β-Hb (Fig. 3a). Sequence comparisons and phylogenetic analyses identified one Ngb, one Mb, one XGb, and two paralogous Cygb genes (Cygb1 and Cygb2) in G. eckloni, which was consistent with a previous study on D. rerio [26] (Supplementary Fig. 9, Additional file 1). Moreover, the Hb gene repertoires of G. eckloni differed from D. rerio, based on the phylogenetic relationships and previous studies [31–33]. Hb genes expressed at the embryonic/larval and adult stages were denoted by “e” and “a,” respectively. In the G. eckloni genome, we identified 11 a-Hb genes, including seven intact embryonic a-Hb (Hbae1.1, Hbae1.2, Hbae1.4, Hbae1.5, Hbae1.6, Hbae4, and Hbae5), two intact adult a-Hb (Hbaa1 and Hbaa2), six intact embryonic β-Hb (Hbbe1.1, Hbbe1.2, Hbbe1.4, Hbbe1.5, Hbbe1.6, and Hbbe3), two intact adult β-Hb (Hbba1and Hbba2), and three a- and β-Hb incomplete or pseudogenes (Hbae1.3, Hbbe1.3, and Hbaa-like). Like zebrafish, all Hb genes of G. eckloni were distributed on two distinct chromosomes (Fig. 3b). Chromosome 20 contained the major globin locus with 17 α- and β-Hb genes and chromosome 05 housed the minor globin locus with three α- and β-Hb genes. There were six duplicate copies in the Hbae1 and Hbbe1 genes, which was three more genes than in the zebrafish genome. Compared to the zebrafish genome, Hbae3, Hbbe3, and one copy of Hbba1-Hbbe1 gene pairs were lost in the G. eckloni genome. Phylogenetic analyses revealed that all the α- and β-Hb genes, except the Hbaa-like gene, in the G. eckloni genome appeared to have 1:1 orthologs in zebrafish (Hbae1, Hbae4, Hbae5, Hbaa1, Hbaa2, Hbbe1, Hbbe3, Hbba1, and Hbba2). These findings suggested that the latest WGD (4R) and a recent small-scale gene duplication event after WGD contributed to gene duplication and deletion in the α- and β-Hb gene families in G. eckloni, which facilitated the adaptation of G. eckloni to the plateau water environment.