A chromosomal-level genome assembly and the diet habit-specic amino acid mutation identication of the Cyprinidae sh Ancherythroculter nigrocauda

Ancherythroculter nigrocauda is an endemic Cyprinidae sh in China, it has many desirable traits for genetic breeding, including strong disease resistance, unusual stress tolerance and high eciency in nutrition update, which have made it an emerging commercial aquaculture sh. With the publication of its close-related species’ genome sequence, we can study the diet-specic genomic mutations within Cyprinidae. Results Here we report whole genome assembly of a female A. nigrocauda individual constructed using the single molecule DNA sequencing platform PacBio Sequel. With the help of Hi-C anchoring, we successfully placed contigs to chromosome level (2n = 48), yielding a genome size of 1054.05 Mb with contig N50 of 3.40 Mb and scaffold N50 of 42.68 Mb. This genome assembly, which has reached a high base-level accuracy of 99.999%, harboring 33,606 annotated protein-coding genes. We also found 582 genes hold diet-specic amino acid mutation between herbivorous and carnivorous shes and 26 of them showed signicant different expression patterns in liver tissue of these two types of shes.


Introduction
Ancherythroculter nigrocauda (NCBI Taxonomy ID: 263419) is a Cyprinidae sh endemic to the upper Yangtze River in China [1]. This species frequently inhabits in main channels and tributaries of the upper Yangtze River in Sichuan province and Chongqing city in southwest China. In recent decades, the natural resource of this species has suffered severely reduction both in population size and in genetic diversity due to habitat disruption caused by pollution from agriculture, chemical industry, over shing, and damming [2]. As a result, there is a loss of genetic diversity of it [3]. Nevertheless, A. nigrocauda is a popular delicacy, and it can be easily kept alive and fresh because of its strong tolerance to hypoxia and stressful conditions, which make it a promising aquaculture sh in China. It has also been used as new stock in genetic breeding that has bred two new aquaculture varieties, such as the hybrid "Culter Xianfeng NO. 1" and "Bofang Xianfeng NO. 2". The hybrid "Culter Xianfeng NO. 1" is the F1 offspring of Culter alburnus x A. nigrocauda and "Bofang Xianfeng NO. 2"is the selected variety from Megalobrama amblycephala x A. nigrocauda through fourth-generation hybrid and selection. The excellent traits of A. nigrocauda are presented in both two new varieties.
Despite its high economic value and popularity as a resource in aquaculture breeding, genetic and genomic resources for this sh are still limited. Only a few of genomic nucleotide sequences [4,5] and single nucleotide polymorphism (SNP) markers are publicly available [6]. Such a lack of resources seriously limited the progress of its genetic breeding and application. And the lack of a reference genome also caused some problems in studying gene expression using RNA-Seq approaches [7].
With the development of sequencing technology, the genome of more than 17 shes in Cyprinidae have been deposited in NCBI database. However, this number is still faraway from enough compared with the 3,000 species in Cyprinidae. Nowadays, single-molecule real-time sequencing has been used more and more frequently in sh whole genome sequencing [8,9]. Super long reads from it provided a much better assembly when compared with short reads from Illumina sequencing [10]. Besides, high-throughput chromatin conformation capture (Hi-C) also greatly foster the assembly quality of whole genome by translating genome-wide 3D proximity maps to contigs' clustering, ordering and pseudomolecule construction [8,11]. Therefore, combination of long-reads sequencing and Hi-C has been used in many genome sequencing projects and proved its power in high-quality genome assembly [8,12].
Here, we presented a highly contiguous reference genome for A. nigrocauda that used a series of technologies including the single-molecule sequencing platform PacBio Sequel and Hi-C analysis-based chromosome construction. The reference genome assembly of A. nigrocauda will be valuable for both comparative genomics research and genetic breeding research. With the help of this high-quality genome, we also inspected the amino acid mutation associated with diet for A. nigrocauda and its closely related species.

Karyotype analyses
The karyotype analysis was conducted using the method as described in a previous study [13].
Phytohemagglutinin (PHA) was rst injected near the base of the pectoral n, followed by injecting colchicine in the kidney. Kidney tissues were taken and placed in a beaker, cut into small pieces, followed by mixing with saline. After settling for 15 min, cells in the supernatant were collected via centrifuge. These cells were resuspended using 0.075 M KCl solution for 35 min and then collected via centrifuge. The cells were xed using Carnoy's Fluid and treated with Giemsa stain before observing using a microscope.

Samples and Illumina sequencing
As higher heterozygosity always causes di culties in genome assembly, we want to choose one sample with the lowest heterozygosity in genome sequencing. Therefore, three female A. nigrocauda individuals were chosen as candidates for genome sequencing because female samples always have lower heterozygosity than male ones for XY/XX sex determination system in many Cyprinidae shes.
To obtain enough high-quality genomic DNA molecules, the adult female sh individual was dissected and fresh muscle tissues were used for DNA extraction using the traditional phenol/chloroform extraction method [14] and the extracted DNA was quality checked using agarose gel electrophoresis. A single band corresponding to high molecular weight was observed, indicating high integrity of DNA molecules for library construction for the Illumina HiSeq X Ten (Illumina Inc., San Diego, CA, USA) and the PacBio Sequel (Paci c Biosciences of California, Menlo Park, CA, USA) sequencing platforms.
A library with the insert size of 350 bp was constructed for Illumina HiSeq X Ten sequencing platform according to the manufacturer's protocol. To get high-quality Illumina reads for following analysis, ambiguous bases and low-quality reads were rst trimmed and ltered using the HTQC package [15].
First, reads with adaptors or generated via duplication were discarded. Reads whose size of Ns or lowquality ( < = 5) bases reached 10% of the total read length were discarded. Second, reads were removed from further analysis if the average quality were smaller than 20 or the read length was shorter than 75 bp. Third, the mate reads were also removed if the corresponding reads had been removed. The processed reads were used for genome assessment analysis. With the K-mer analysis approach GCE software [16], we calculated the number of each 17-mer from the sequencing data to estimate the genome size and heterozygosity. The sample with the lowest heterozygosity was chosen to further PACBIO sequencing.

PacBio long reads sequencing and genome assembly
Eight libraries with 20Kb insert size were constructed for PacBio Sequel sequencing. PacBio data was assembled using the FALCON v2.0 assembler, which implements a hierarchical assembly approach [17].
The initial step is to error-correct long reads by aligning all reads to a subset of the longest reads. For the full raw data set, only reads longer than 5 kb (length_cut_off = 5 kb) were corrected for generating errorcorrected reads. To identify overlaps between raw sequences, we used "p-read" (pre-assembled reads from error corrected reads) longer than 5 kb (length_cut_off_pr = 5 kb) for String Graph assembly. Then FALCON-unzip [17] operates from a completed FALCON job directory, it uses the phasing information of heterozygous positions within individual reads to group the reads into different phasing blocks and haplotypes within each block. In this study, we used primary contigs for further analysis.

2.4
In situ Hi-C library construction and chromosome assembly using Hi-C data Liver sample taken from the same female A. nigrocauda individual with lowest heterozygosity was used for library construction for Hi-C analysis as described previously [18,19]. The library was sequenced with 150 bp paired-end mode on the Illumina HiSeq X Ten platform (San Diego, CA, USA). The reads were mapped to assembled genome with Bowtie [20] with two ends of paired reads being mapped to the genome separately. To increase the interactive Hi-C reads ratio, an iterative mapping strategy was performed as done previously [18,19]. Only read pairs with both ends uniquely mapped were used for further analysis. From the alignment results of read pairs, self-ligation, non-ligation and other sorts of invalid reads, including Start Near Rsite, PCR ampli cation, random break, Large Small Fragments and Extreme Fragments, were ltered out by Hi-C lib as described previously [18]. Through tracking restriction sites, contact counts among contigs were calculated and normalized. By clustering the contigs using contig contact frequency matrix, we also corrected some minor errors in the FALCON-unzip assembly. Finally, uniquely mapped read pairs were used for clustering, ordering and orientating contigs to construct chromosomes using LACHESIS version c23474f [21].

PACBIO Iso-Seq Sequencing of mRNAs
We analyzed the full-length transcripts of A. nigrocauda using the Iso-Seq protocol [22]. Total RNA was extracted using the TRiZol reagent (Thermo Fisher Scienti c, Waltham, MA, Cat. 15596018) from combined tissues including heart, liver, brain, spleen, kidney, gonads, gill, intestine, eyes, muscle and blood from the same individual used for PACBIO sequencing. Two SMRT cell libraries were constructed with size selection of 1-3 Kb and > 3 Kb using the Blue Pippin (Sage Science, MA, USA) and sequenced with PacBio Sequel platform.
Identi cation of protein-coding genes involved homolog-based prediction, de novo predictions, and the use of Iso-Seq data as follows.
(2) De novo prediction was performed on the transposons-masked genome. Augustus [30] using model training based on coding sequences from Ctenopharyngodon idellus, Cyprinus carpio, Danio rerio, Gasterosteus aculeatus and Takifugu rubripes were used to predict coding genes.
Finally, gene models predicted from all above methods were combined by MAKER [33] .

Gene clustering by OrthoMCL
The sequences of protein-coding genes from A. nigrocauda and the closely related shes, including Oryzias latipes, Ictalurus punctatus, Hippocampus comes, Dicentrarchus labrax, Ctenopharyngodon idellus, Cyprinus carpio, Danio rerio, Takifugu rubripes, Tetraodon nigroviridis, Paralichthys olivaceus, Cynoglossus semilaevis, Lepisosteus oculatus, M. amblycephala and C. alburnus were used for a gene family clustering analysis. For simplicity, only proteins encoded by transcripts with the longest alternative splicing sites were used in this analysis. First, we generated pairwise protein sequence alignments for all proteins by running blastp [37] with an e-value threshold of 1e − 5 . Second, OrthoMCL [38] was used to cluster similar proteins by setting the main in ation value at 1.5 and using the default settings for other parameters.

Phylogenetic analysis
Using single copy orthologs, we probed the phylogenetic relationships between A. nigrocauda and other shes with sequenced genomes. To this end, protein sequences of single-copy genes were aligned using MUSCLE [39]. Guided by the protein multi-sequence alignment, the alignment of the coding DNA sequences (CDS) for those genes were generated and concatenated for following analyses. The phylogenetic relationships were constructed using PhyML [40] using the concatenated nucleotide alignment with the JTT + G + F model.
We further estimated divergent times for all pair using the phylogenetic tree using r8s [41], which were used as input, together with molecular clock data from the divergence time from the TimeTree database [42], to estimate species divergence time for all pairs of species in the phylogenetic tree using MCMCtree program (from PAML) [43].

Expansion and contraction of gene families
Based on the identi ed gene families and the constructed phylogenetic tree with predicted divergence time of the 14 species, we used CAFÉ [44] to analyze gene family expansion and contraction. In CAFÉ, a random birth and death model is proposed to study gene gain or loss in gene families across a speci ed phylogenetic tree. Then, conditional p-value was calculated for each gene family, and family with conditional p-value less than 0.05 was considered to have an accelerated rate for gene gain or loss. These expansion and contraction gene families in A. nigrocauda (p-value ≤ 0.05) were mapped to KEGG pathways for functional enrichment analysis, which was conducted using the enrichment methods. This method implemented hypergeometric test algorithms and the Q-value (FDR, False Discovery Rate) was calculated to adjust the p-value using R package (https://github.com/StoreyLab/qvalue).
2.10 Identi cation of diet-speci c amino acid mutation and their expression pattern in different diet shes.
To get M. amblycephala (blunt snout bream, BSB) and C. idellus (grass carp, GC) speci c amino acid (AA) mutation, we rst blasted proteins from BSB, GC, C. alburnus (topmouth culter, TC) and A. nigrocauda to zebra sh proteome and get the corresponding relationship among them [45][46][47]. Then proteins mapped to the same zebra sh protein were considered as orthologs and aligned together with PRANK v.170427 (http://wasabiapp.org/software/prank/). Then in-house Python script was used to nd AA presented in BSB and GC but absented in TC and A. nigrocauda.
To nd the expression pattern of genes identi ed previously, we used liver transcriptome data from SRR8225322 (BSB), SRR8225320 (TC), SRR8225318 (BSB & TC hybrid F1) and SRR8225330 (BSB & TC hybrid F1) [46]. Then reads from SRR8225322 were mapped to BSB genome, SRR8225320 were mapped to TC genome with Hisat2 v2.1.0 [48]. Reads from SRR8225318 and SRR8225330 were mapped to both BSB and TC genome and get the expressed value for each gene with HTSEQ v 0.12.4 [49]. Only genes with average FPKM larger than 0.2 were left for following analysis. Then statistics analysis was carried out in R v3.6.1 (https://www.R-project.org/).

Sample preparation
As A. nigrocauda had not been backcrossed to reduce heterozygosity, we chose three individuals to do genome survey with aim of nding one with the lowest heterozygosity, which will greatly bene t the assembly of genome. We sequenced 50.59, 56.64, and 47.06 Gb of Illumina reads for these three samples and found their heterozygosity levels varied remarkably (0.54%, 1.04% and 0.42%). However, they don't have much difference on genome size or repeat contents (Table 1). Based on these results, the sample with the lowest heterozygosity (0.42%) was chosen for the following PACBIO genome sequencing (Fig. 1a.). Its k-mer distribution is shown in Supplementary Figure S1.

PACBIO sequencing, genome assembly, and quality evaluation
Genomic DNA extracted from muscle tissues was used in library construction of PACBIO sequencing. In total, 101.50 Gb of long reads were sequenced on PACBIO Sequel platform (  Figure S2). With the help of these long reads, we assembled the A. nigrocauda genome into 538 contigs with 1,053.82 Mb sequences, which is almost identical to that in k-mer analysis (Table 2. Supplementary  Table S1). The contig N50 of this assembly is 3.40 Mb and the longest contig lasts for 14.79 Mb (Table 3). To correct the sequencing error in assembled genome, ltered Illumina reads were mapped to it with Bowtie2 and three rounds of consensus correction with Pilon were performed. The GC content of assembled genome is 37.6% and the average mapping depth is about 120X for the whole genome (Table 3, Supplementary Figure S3).
As no karyotype information about A. nigrocauda has been reported, we carried out Giemsa stain of dividing kidney cells under microscope and found each kidney cell has 24 pairs of chromosomes (Fig. 1b). To anchor assembled contigs to the 24 chromosomes, 102.34 Gb of Hi-C reads were generated. With the help from LACHESIS software, we successfully organized assembled contigs into 24 groups, with a combined size of 1,054.05 Mb (scaffold N50 was 42.68 Mb) (Fig. 1c). The 24 chromosomes ranged in size from 32.61 Mb to 66.38 Mb and they covered about 99.52% of the whole genome (Supplementary Table S2).
We evaluated the completeness and quality of the newly assembled A. nigrocauda genome with Illumina high-quality reads. We were able to align 98.21% of the NGS reads to the genome assembly, with insertion length distribution of read pairs peaking at around 300 bp, which was consistent with library construction designed for the Illumina sequencing. And we found 99.97% of assembled genome was covered by highquality reads, among which 99.69% was covered by over 20X of reads, suggesting high quality level of the assembled genome (Supplementary Table S3). Above results indicate that our assembly is of high quality and continuity.

Genome annotation
To identify repeat elements in assembled genome, combined de novo prediction and homolog search against RepBase database was used in this study. In total, we identi ed 51.86% of the A. nigrocauda assembly as repetitive sequences, among which DNA transposons represented the most abundant fraction of transposable elements (35.02% of the whole genome), and LTR retrotransposons comprised 11.20% of the whole genome (Supplementary Table S4, Supplementary Table S5). Besides, we found the diversity rate for DNA transposon and LTR found in de novo prediction is much lower than that in RepBase (Supplementary Figure S4).
To identify protein-coding genes in our assembly, we combined results from de novo and homolog searching from 5 close-related sh genomes. Totally, we annotated 33,606 protein-coding genes in the A. nigrocauda genome assembly ( Table 4). The distributions of gene density, repeat density, non-coding genes and GC density across the 24 chromosomes of A. nigrocauda genome were further illustrated in Fig. 2. We also found that the regions with low gene density typically had high repeat content, while the regions with high repeat content usually had high GC content. The gene number, gene length, CDS length, exon length and intron length distribution were all comparable with the related shes (Supplementary Figure S5).
We evaluated the genome annotation of the assembled A. nigrocauda genome using BUSCO v.3.0. 2 [50] with actinopterygii_odb9 data set to assess the completeness of the genome. The A. nigrocauda genome assembly contained 94.50% of the BUSCO genes, indicating a high level of completeness of the genome assembly and high quality of annotation (Supplementary Table S6). Besides, most of the protein-coding genes (29,674,88.30%) are homologous to genes with known functions (Supplementary Table S7). We also identi ed 6,712 non-coding RNA (ncRNA) including 975 microRNAs (miRNAs), 5,017 transfer RNAs (tRNAs), 356 ribosomal RNAs (rRNAs) and 364 small nuclear RNAs (Supplementary Table S8).

Evolutionary analysis of A. nigrocauda with other sh species
A phylogenetic tree was reconstructed with 945 single-copy orthologous genes identi ed among A. nigrocauda and 14 other vertebrate species (Fig. 3). The result showed that A. nigrocauda was most closely related to C. alburnus, and the two species diverged from their common ancestor around 7.3-9.7 million years ago (MYA). And M. amblycephala diverged from their three Cultrinae species common ancestors around 11.6-14.2 MYA (Fig. 3).
According to divergence times and phylogenetic relationships, 1,385 gene families out of 28,833 gene families were signi cantly expanded in the A. nigrocauda (P < 0.05, Supplementary Figure S6). Those expanded gene families included 148 signi cantly enriched (q-value < 0.05) KEGG pathways (Supplementary Table S9). In organismal systems, the digestive system is one of the top pathways in expanded gene families, and many KEGG pathways, including protein digestion and absorption (Fig. 4), which may contribute to the highly e cient protein conversion ability of A. nigrocauda (Lu et Table S10). Among these pathways, many genes related to protein digestion and absorption, amino acid metabolism and other metabolism were subjected to positive selection, which may contribute to the highly e cient protein conversion ability of A. nigrocauda.

Comparison between different diet habit within closerelated species
A. nigrocauda and C. alburnus (TC) are both belong to Cyprinidae and are carnivorous shes, while their close-related species in Cyprinidae, M. amblycephala (BSB) and C. idellus (GC) are pure herbivorous.
Therefore, we want to know whether there is some diet-speci c amino acid (AA) mutation between herbivorous and carnivorous shes. More than 2 diet-speci c AA mutations were found in 582 genes out of 15,748 in total. GO enrichment analysis showed that 10 out of top 20 pathways were related to steroid biosynthesis and lipid transport (Supplementary Table S11). This pattern further veri ed how herbivorous shes absorb nutrients from plants in GC [47].
Besides, inter-species hybridization between BSB and TC produces herbivorous offspring [46]. We used transcriptome data form the liver tissue of the parents and offspring to nd diet-speci c expression patterns for these 582 genes. After removing genes with low expression level, 297 genes were left for dietassociated analysis. Finally, we found 26 genes were signi cantly associated with diet (p < 0.05) (Supplementary Table S12). Among these genes, we also found some were associated with apolipoprotein's metabolism, such as ApoBb. These genes might be key functional elements for adapting to different diet in different shes.

Conclusion
In this study, we constructed a high-quality chromosome-level assembly of the sh A. nigrocauda using combined methods including PacBio DNA sequencing, Illumina HiSeq X Ten DNA sequencing and Hi-C interaction analysis. Using the long reads from PacBio Sequel platform and short reads from the Illumina HiSeq X Ten platform, we successfully constructed contig assembly for A. nigrocauda. Leveraging contact information among contigs from Hi-C technology, we further improved the assembly to the chromosome-level quality. We annotated 33,606 protein-coding genes in the A. nigrocauda genome, 29,674 of which were functionally annotated, suggesting high conservation of genes in this economically important sh species. With 945 single-copy orthologs from A. nigrocauda and other related sh species, we constructed the phylogenetic relationship of these shes, and found that A. nigrocauda might have diverged from its common ancestor of C. alburnus around 7.3-9.7 MYA.
Given the increasing interests in sh genome evolution and the economic importance of A. nigrocauda for aquaculture, our genomic data provide valuable genetic resources for functional genomics investigations for the research community.

Declarations
Ethics approval and consent to participate All the sh used in this study were owned by authors of this study and reared in Fisheries Research Institute, Wuhan Academy of Agricultural Sciences, Wuhan, China (30°20'N, 114°14'E). All experimental steps in this study were approved by the Committee for Animal Experiments of the Wuhan Academy of Agricultural Sciences and were carried out in accordance with the Laboratory Animal Management Principles of China.

Consent for publication
All the authors have read the manuscript and consent for its publication.

Availability of data and materials
The raw data from our genome project was deposited in the NCBI Sequence database with Bioproject IDs PRJNA533477. The Illumina, PacBio and Hi-C sequencing data are available from NCBI via the accession number of SRR8929633, SRR8929629 and SRR8929634, respectively. The PacBio transcriptome sequencing data were deposited to NCBI via the accession number of SRR8929630 and SRR8929631. The nal chromosome assembly was submitted to NCBI Assembly with the accession number of VTFQ00000000 in NCBI.

Competing interests
The authors declare no con ict of interest. The sponsors had no role in the design, execution, interpretation, or writing of the study.    Distribution of protein-coding genes, repeats, non-coding genes, and GC ratio in the A. nigrocuada genome.

Figure 3
Phylogenetic relationship between A. nigrocauda and other sh species. The estimated species divergence time (million years ago) and the 95% con dential intervals were labeled at each branch site.
The divergence used for time recalibration was illuminated as red dots in the tree.