3.1 Transcriptome sequencing and de novo assembly
Three biological replicate libraries were generated and sequenced from the tailfins of Grass-goldfish and Egg-goldfish, respectively (Table 1), and all the raw reads were deposited in the SRA with accession number PRJNA666023. A total number of 54,123,840, 52,650,612 and 54,797,604 raw reads were respectively obtained from the three Grass-goldfish specimens. Three Egg-goldfish specimens separately produced 45,551,450, 49,700,170 and 50,184,718 raw reads. The adaptors, the reads with more than 5% N content, or the bases with lower quality (defined as reads containing > 50% bases with Q value ≤ 10) were removed. After quality control, approximately 18.40 and 15.08 Gb clean nucleotides in total were obtained for Grass-goldfish and Egg-goldfish, respectively.
Table 1
Summary statistics of original and assembled data from transcriptome of three Grass-goldfish and Egg-goldfish
| Grass-1 | Grass-2 | Grass-3 | Egg-1 | Egg-2 | Egg-3 |
Raw sequencing reads | | | | | | |
Total reads | 54,123,840 | 52,650,612 | 54,797,604 | 45,551,450 | 49,700,170 | 50,184,718 |
Total bases (bp) | 8,118,576,000 | 7,897,591,800 | 8,219,640,600 | 6,832,717,500 | 7,455,025,500 | 7,527,707,700 |
Clean sequencing reads | | | | | | |
Total reads | 48,160,910 | 47,175,288 | 29,472,277 | 41,018,144 | 31,511,232 | 29,436,563 |
Total bases (bp) | 7,105,892,795 | 6,963,276,072 | 4,329,394,413 | 6,055,892,155 | 4,666,822,327 | 4,361,478,274 |
Percentage of clean reads (%) | 88.98 | 89.60 | 53.78 | 90.04 | 63.40 | 58.65 |
Percentage of clean bases (%) | 87.53 | 88.17 | 52.67 | 88.63 | 62.60 | 57.94 |
GC % of clean reads | 42.41 | 43.36 | 42.57 | 41.96 | 43.28 | 45.16 |
Q30 (%) | 95.54 | 95.43 | 94.97 | 95.30 | 95.64 | 95.85 |
Alignment information | | | | | | |
Aligned reads | 37,065,166 | 36,090,133 | 22,552,540 | 32,203,590 | 25,137,076 | 23,154,977 |
Aligned ratio (%) | 76.96 | 76.50 | 76.52 | 78.51 | 79.77 | 78.66 |
Unigene information | | | | | | |
Total number of unigenes | 114,796 |
Number of bases | 89,545,010 |
Mean length of unigene (bp) | 780.04 |
N50 length of unigene (bp) | 861 |
Largest unigene (bp) | 18,054 |
Subsequently, 114,796 unigene sequences, with the total length of 89,545,010 bp, were de novo assembled from the clean reads of the combined six mRNA-Seq Illumina libraries using Trinity software (version: trinityrnaseq-r2013-02-25) (Conesa et al. 2005). The proportions of reads that could map back to the assembled transcript were 76.68% for Grass-goldfish, and 78.94% for Egg-goldfish. The overall GC contents of the transcriptomes were 42.78% and 43.47% in Grass-goldfish and Egg-goldfish, respectively (Table 1). Furthermore, the mean (780 bp), N50 (861 bp) and largest (18, 054 bp) length of unigenes were also calculated.
3.2 Functional annotation of unigenes
The assembled 114,796 unigenes were further annotated in the NR (ftp://ftp.ncbi.nih.gov/blast/db; Pruitt et al. 2012), GO (http://geneontology.org; Ashburner et al. 2000), COG (http:// https://www.ncbi.nlm.nih.gov/research/cog-project; Galperin et al. 2021), KEGG (http://www.genome.jp/kegg; Kanehisa et al. 2004) and SWSS databases (http://www.uniprot.org; Bairoch and Boeckmann 1991) (Table 2; Fig. 2). Overall, 49.45%, 1.33%, 32.41%, 25.02%, and 34.39% unigenes were identified and categorized in the NR, GO, COG, KEGG, and SWSS databases, respectively, and 1,096 unigenes annotated to all databases are shown at the intersections of all circles in Fig. 2.
Table 2
Summary of unigene annotation against public databases
Category | Number of annotated unigenes | Percentage of annotated unigenes (%) |
Annotated in NR | 56,761 | 49.45 |
Annotated in GO | 1,530 | 1.33 |
Annotated in COG | 37,204 | 32.41 |
Annotated in KEGG | 28,723 | 25.02 |
Annotated in SWSS | 39,476 | 34.39 |
Annotated in at least one database | 57,584 | 50.16 |
Annotated in all database | 1,096 | 0.95 |
Note. NR, NCBI non-redundant protein dataset (ftp://ftp.ncbi.nih.gov/blast/db; Pruitt et al., 2012); GO, Gene Ontology (http://geneontology.org; Ashburner et al., 2000); COG, Clusters of Orthologous Groups (https://www.ncbi.nlm.nih.gov/research/cog-project; Galperin et al., 2021); KEGG, Kyoto Encyclopedia of Genes and Genomes Pathway (http://www.genome.jp/kegg; Kanehisa et al., 2004); SWSS, SwissProt (http://www.uniprot.org; Bairoch and Boeckmann, 1991). |
In the GO category, 1,530 unigenes were classified into three primary GO terms, “biological processes”, “cellular components”, and “molecular function”, and further subdivided into 56 secondary-class terms (Fig. 3). In the GO term “biological processes”, “cellular process (1045 genes)”, “single-organism process (829 genes)”, and “metabolic process (803 genes)”, were the top three, secondary-class terms with the most unigenes. The secondary categories “cell part (883 genes)”, “cell (883 genes)”, and “organelle (640 genes)”, were the top three secondary categories in the GO term “cellular components”. In the “molecular function” GO term, the secondary-class terms “binding (820 genes)”, and “catalytic activity (586 genes)” occupied the top two positions with the most unigenes.
In the COG annotation analyses, 37,204 unigenes were mapped into 26 categories, and the “U: intracellular trafficking, secretion, and vesicular transport (3784 genes)”, “O: Posttranslational modification, protein turnover, chaperones (2482 genes)”, “T: Signal transduction mechanisms (2099 genes)”, “K: Transcription (1950 genes)”, and “Z: Cytoskeleton (770 genes)” were substantially enriched (Fig. 4).
In the KEGG pathway annotation, 28,723 unigenes were classified into five categories mapping to 340 pathways (Fig. 5A), with 8237 genes located in “A: Cellular Process”, 3441 genes in “B: Environmental Information Processing”, 7200 genes in “C: Genetic Information Processing”, 5837 genes in “D: Metabolism”, and 10,180 genes in “E: Organismal Systems”. The top 20 enriched KEGG pathways are displayed by a histogram (Fig. 5B), and the top three pathways with the most unigenes enrichment were “metabolic pathways (2824 unigenes)”, “pathways in cancer (1245)”, and “PI3K − Akt signalling pathway (1147 unigenes)”.
3.3 Gene quantification in Grass-goldfish and Egg-goldfish
The number of mapped reads for each gene was calculated and standardized to RPKM (reads per kilobase of exon model per million mapped reads) to quantify the transcriptome expression level in the tailfins from Grass-goldfish and Egg-goldfish. In total, ~ 76.68% (Grass-goldfish), and 78.93% (Egg-goldfish) clean reads from each library were mapped to the prior-assembled 114,796 unigenes. According to the calculated RPKM value for each unigene, the expression levels of unigenes were subsequently defined as no transcription (RPKM < 1), low (1 ≤ RPKM < 5), medium (5 ≤ RPKM < 10), and high (RPKM ≥ 10). The number of genes with medium (14,651 vs. 14,464), and high expression (7398 vs. 7331) in Grass-goldfish and Egg-goldfish tailfins was similar, while the number of unexpressed (99,537 vs. 85,489), and low-expressed (60,474 vs. 45,147) genes in the tailfins of Grass-goldfish was much greater than in Egg-goldfish, suggesting that more genes function in the development of the double tail fin.
3.4 Analysis of the DEGs (differentially expressed genes)
To identify the DEGs between the Grass-goldfish and Egg-goldfish, a paired comparison of the expression quantity for each unigene in the tailfin of the different breeds was conducted and the unigenes with fold change ≥ 2 and FDR < 0.01 were filtered (Table 3). A total of 7516 DEGs were detected, out of which 2946 unigenes demonstrated substantially differential expression in the tailfins between Grass-goldfish and Egg-goldfish. In these differentially expressed genes, compared with Egg-goldfish, 1922 unigenes were up-regulated and 1024 unigenes were down-regulated in Grass-goldfish.
Table 3
Summary of DEG (differentially expressed gene) number between different pairs of Grass-goldfish and Egg-goldfish.
Comparison pairs | DEG number | Up-regulation | Down-regulation |
Grass vs. Egg | 2946 | 1922 | 1024 |
Grass1 vs. Gass2 | 1065 | 502 | 563 |
Grass1 vs. Grass3 | 1067 | 496 | 571 |
Grass2 vs. Grass3 | 1126 | 531 | 595 |
Egg1 vs. Egg2 | 422 | 232 | 190 |
Egg1 vs. Egg3 | 457 | 219 | 238 |
Egg2 vs. Egg3 | 431 | 190 | 241 |
The differentially expressed unigenes from tailfins of Grass-goldfish vs. Egg-goldfish were further mapped and classified in GO categories and KEGG pathways. The significantly up-regulated DEGs (240 unigenes) annotated in GO categories were mainly involved in the single organism process (GO: 0044699, 23 unigenes), and cellular process (GO: 0009987, 23 unigenes) of biological processes (121 unigenes), the cell part (GO: 0044464, 14 unigenes) and membrane part (GO: 0044425, 13 unigenes) of the cellular component (86 unigenes), and the binding (GO: 0005488, 16 unigenes) and catalytic activity (11 unigenes) of molecular function (33 unigenes). The significantly down-regulated DEGs (198 unigenes) were generally involved in the cellular process (GO: 0009987, 22 unigenes) and metabolic process (GO: 0008152, 20 unigenes) of the biological process (78 unigenes), the cell part (GO: 0044464, 20 unigenes) and organelle (GO: 0043226, 18 unigenes) of the cellular component (97 unigenes), and the binding (GO: 0005488, 11 unigenes) and catalytic activity (9 unigenes) of molecular function (23 unigenes). Furthermore, in the metabolic process category, the DEG analyses revealed that the HOXA2b (TRINITY_DN71748_c0_g1), HOXB13a (TRINITY_DN74023_c0_g1), paired mesoderm homeobox protein 1-like isoform X2 (PRRX2, TRINITY_DN89108_c0_g1), zinc finger E-box-binding homeobox 1-like isoform X3 (ZEB1, TRINITY_DN93814_c0_g1), and homeobox protein, Meis1 (TRINITY_DN92032_c0_g1) were up-regulated, while the H2.0-like homeobox protein (HLX, TRINITY_DN86311_c0_g1) was down-regulated in the tailfin from the Grass-goldfish.
Large numbers of DEGs were also mapped to various primary and secondary metabolite pathways (e.g., ko01100, ko04151, ko04010, and ko04014) in the enrichment analyses of DEGs between the Grass- and Egg-goldfish KEGG pathways (Table S1 and S2), including the PI3K-Akt signalling pathway (36 unigenes), MAPK signalling pathway (17 unigenes), and Jak-STAT signalling pathway (13 unigenes). The osteoclast differentiation (ko04380) and dorso-ventral axis formation (ko04320), probably related to the formation of the goldfish tailfin, were also identified in the KEGG pathway enrichment analyses of DEGs. In the osteoclast differentiation (ko04380) pathway, our analyses demonstrated that the GRB2-associated-binding protein 2-like isoform X2 (TRINITY_DN94490_c0_g1), serine/threonine-protein phosphatase 2B catalytic subunit alpha isoform X2 (TRINITY_DN90198_c0_g2), autophagy-related protein 9A-like (TRINITY_DN92800_c0_g2), and calcium/calmodulin-dependent protein kinase type IV-like (TRINITY_DN98117_c0_g1) were up-regulated (Table S1), while the interferon gamma receptor 1–2 (TRINITY_DN94125_c1_g2), fos-related antigen 2-like (TRINITY_DN85588_c0_g1), interleukin-1 beta-2 (TRINITY_DN104072_c0_g1), fos-related antigen 1-like isoform X2 (TRINITY_DN76364_c0_g1), retinoic acid receptor responder protein 3-like (TRINITY_DN16362_c0_g1), and suppressor of cytokine signalling 1-like (TRINITY_DN83317_c0_g1) were down-regulated (Table S2) in the tailfin from Grass-goldfish. In addition, the GTPase, HRas (TRINITY_DN90348_c2_g2), was down-regulated in the Grass-goldfish’s tailfin (Table S2), within the dorso-ventral axis formation (ko04320) pathway.
3.5 Polymorphism analysis of simple sequence repeats (SSR) markers
A total of 15931 SSRs (simple sequence repeats) were identified including di- (12,511, 78.53%), tri- (2919, 18.32%), and tetra-nucleotide repeats (438, 2.75%), (Table 4, Fig. 6). Among the di-nucleotide repeats, AC/GT (8669), AG/CT (2909), and AT/AT (910) were the dominant motifs. Within the tri- nucleotide repeats, AAT/ATT (537), followed by AGG/CCT (470), and AAC/GTT (444), were the most repeated motifs (Table 4).
Table 4
The SSRs (simple sequence repeats) identified from the unigene sequences of Goldfish.
SSR parameters | Number |
Total number of sequences examined | 114,796 |
Total size of examined sequences | 88,134,359 |
Total number of identified SSRs | 15,931 |
Di-nucleotide | 12511 |
Tri-nucleotide | 2919 |
Tetra-nucleotide | 438 |
Penta-nucleotide | 49 |
Hexa-nucleotide | 14 |
Number of SSRs containing sequences | 24,594 |
Number of sequences containing more than 1 SSR | 5,307 |
Number of SSRs present in compound formation | 2,785 |