First draft genome of loach (Orenectus shuilongensis; Cypriniformes: Nemacheilidae) provide insights into the evolution of cavesh

Loaches of the superfamily Cobitoidea (Cypriniformes, Nemacheilidae) are small elongated bottom-dwelling freshwater shes with several barbels near the mouth. The genus Oreonectes with 18 currently recognized species contains representatives for all three key stages of the evolutionary process (a surface-dwelling lifestyle, facultative cave persistence, and permanent cave dwelling). Some Oreonectes species show typical cave dwelling-related traits, such as partial or complete leucism and regression of the eyes, rendering them as suitable study objects of micro-evolution. Genome information of Oreonectes species is therefore an indispensable resource for research into the evolution of caveshes. were from Each individual was over-exposed and executed by anesthesia, and quickly frozen in liquid nitrogen for one hour before storing at − 80 °C. Genomic DNA was extracted from a muscle sample using DNeasy Blood &Tissue Kit (Qiagen). The present study was approved by the Animal Ethics Committee of Guizhou Normal University. The procedure of sample collection was in strict accordance with the Animal Ethics Procedures and Guidelines of the People's Republic of China. paired-end The were sequenced on For the sequencing adaptors were removed. reads (such as chloroplast, mitochondrial, bacterial and sequences, etc.) were screened by alignment to the NCBI-NR database using BWA [10] with default parameters. read pairs were removed by FastUniq [11] , and low-quality reads were ltered under following (1) reads nucleotides (N), reads with > nucleotides aligned the adapter, allowing mismatches, 50% Phred (NJ) and the divergence times of the taxa were estimated with mcmctree [15] . The outgroup sequences were chosen from the zebrash genome assembly GRCz11, with the genome alignment to the obtained Oreonectes genome assembly using LASTZ [49] . We employed calibration points from the dated Cyprinidae-Nemacheilidae divergence to place a lognormally distributed prior on the age of the root of a tree containing all samples of Oreonectes species and outgroup. To explore the genomic mechanism underlying the degeneration of eyes and body color, the SNPs and Indels obtained through GATK pipeline were annotated using the software SnpEff [6] .


Background
Cave shes are successful vertebrate colonizers in subterranean habitats and usually process some regressive features, such as the rudimentary eyes and loss of pigmentation. Meanwhile, some compensative traits, such as elongated appendages and reinforced non-visual sensory systems, have evolved in cave shes. Since uncovering the genetic basis of phenotypic adaptations of animals to a speci c environment is a key goal in evolutionary study, cave shes have attracted interests from biologist and certain cave shes (Astyanax mexicanus and Sinocyclocheilus spp.) have been well studied. However, there is few data available to unravel the genomic mechanism under the evolution and adaptation to subterranean life among other groups of shes.
Loaches (Cypriniformes: Cobitoidea, Nemacheilidae) are small elongated bottom-dwelling freshwater shes with several barbels near the mouth, distributed in Eurasia and Africa. Within this group, Oreonectes shes are distributed only in southwestern China and northern Vietnam, most of which dwell in underground rivers in the karst environment [1] . Oreonectes shes contain 18 species representing the three key stages of the evolutionary process including a surface-dwelling lifestyle, facultative cave persistence, and permanent cave dwelling. Almost all Oreonectes species show some cave-related traits, such as part or complete eye degeneration and leucism, which makes this genus a good study system of micro-evolution [2] . O. shuilongensis is a surface-dwelling species which was newly discovered in the Shuilong Township in Guizhou Province of China [3] . A genome assembly of O. shuilongensis would facilitate research into key aspects of the evolutionary history of cave versus surface dwelling in Oreonectes, including the role of environmental changes in the seeminly rapid diversi cation and speciation in underground caves [2] .
Here we present a de novo genome assembly for O. shuilongensis, the rst genome constructed for the family Nemacheilidae and the superfamily Cobitidea. The completeness and continuity of the genome provided valuable genomic resources for studies on the evolutionary history of the rapid speciation processes of family Nemacheilidae.

Whole Genome and RNA Sequencing
After removal of < 500 bp PacBio subreads, 5 million subreads (total 50.9 Gb) remained, with an average length of 10.2 kb (Table 1 and Suppl. Tables S2 and S3). Additionally, a total of 11.7 Gb transcriptome data were obtained from RNA-sequencing (Table 1).

Estimation of the Genome Size and Sequencing Coverage
The genome size of O. shuilongensis was estimated at approximately 515.66 Mb based on k-mer analysis (Suppl. Figure S1), and our O. shuilongensis genome assembly spans 521.68 Mb (803 contigs, contig N50 of 5.58 Mb; Table 2 and Suppl. Table S4). The completeness of the O. shuilongensis genome assembly was evaluated using CEGMA v2.5 [4] and BUSCO v2 [5] . CEGMA analysis suggested that 99.78% of conserved Core Eukaryotic Genes (CGEs) proteins are present in our assembled genome, and BUSCO analysis showed that 97.58% of vertebrate Benchmarking Universal Single-Copy Orthologs have been assembled, implying a high completeness of our O. shuilongensis genome assembly (Suppl . Tables S5 and S6). We found around 99.52% of the reads properly mapped to the genome assembly (Suppl. Table S7). All these results indicate that the assembly of the O. shuilongensis genome is characterized by a high level of accuracy.  Genomic mechanism underlying the degeneration of eyes and body color Genome re-sequencing were performed for facultative cave-dwelling O. jiarongensis (three individuals) and cavedwelling O. daqikongensis (two individuals) and O. dongliangensis (one individual) at a high average depth of 28.06 ± 5.08×, with an overall average genome coverage of 93.77% of the O. shuilongensis genome assembly (Table 4). A total of 12,534,348 SNPs was identi ed in these three species, and the number of SNPs per individual ranged from 6.0 to 6.2 M ( Table 4). The reconstructed phylogeny indicates that the ancestor of cave-dwelling O. daqikongensis, O. dongliangensis and facultative cave-dwelling O. jiarongensis rst diverged from the surfacedwelling O. shuilongensis about 9.31 million years ago (95% HPD, 12.54-7.12) (Fig. 4). Annotation of all sequence variants in SnpEff [6] (Fig. 5) suggested that 1,541 SNPs and 438 indels were located in 401 genes, likely resulting in pseudogenization in semi cave-dwelling and cave-dwelling species. Twenty-nine pseudogenes annotated using DAVID [7] showed signi cant enrichment for the GO terms of "eye development (0001654)" and "retina development in camera-type eye (0060041)" (Table 5, Fig. 6 and Suppl. Table S12). For example, the expression of six7, six6a and six6b is required for optic primordium development and the speci cation and proliferation of the eye eld in vertebrate embryos [8] . The function-lost mutation of aldh1a3 and tfap2a caused eye and retinal defects in zebra sh [9] . It is presumed that these pseudogenes might lead to eye degeneration of semi/complete cave-dwelling Oreonectes species. Furthermore, Mc1r (melanocortin-1 receptor), a key gene regulating the biosynthesis of melanin in most vertebrates, is a pseudogenization caused by a deletion in O. daqikongensis (Fig. 7), likely blocking biosynthesis of melanin and leading to the albino phenotype (Fig. 1).
Remaining pseudogenes are enriched for the GO terms of "potassium channel activity", "regulation of axon extension involved in axon guidance", "G-protein coupled receptor activity" and KEGG pathway of "neuroactive ligand-receptor interaction" (Table 5).

Discussion And Conclusions
Organisms that have colonized underground caves encounter vastly different selective pressures than their relatives in above-ground habitats. In the present study, we report the rst whole genome sequencing, assembly, and annotation of the O. shuilongensis, encompassing a total of predicted 25,247 protein-coding genes and 7,041 noncoding RNAs. We anticipate that this genome assembly will serve as a basis for in-depth biological studies of evolution and adaptation of cave shes. With the availability of these genomic data, genomic/transcriptomics differences between surface-dwelling and cave-dwelling loaches can be studied at the genomic scale. More broadly, our assembly will facilitate evolutionary and genomics research of Cobitoidea shes and beyond.

Materials And Methods
Animals captured and DNA extracted O. shuilongensis, O. jiarongensis (three individuals), O. dongliangensis (two individuals) and O. daqikongensis (one individuals) (Fig. 1) were captured from Guizhou Province, China. Each individual was over-exposed and executed by anesthesia, and quickly frozen in liquid nitrogen for one hour before storing at − 80 °C. Genomic DNA was extracted from a muscle sample using DNeasy Blood &Tissue Kit (Qiagen sequences, etc.) were screened by alignment to the NCBI-NR database using BWA [10] with default parameters.

RNA Sequencing
Tissues of skin, muscle, intestine, liver and kidney of the same loach individual were collected and RNAs were extracted with TRIZOL Reagent (Invitrogen, USA). RNAs were then balanced mixed for the sequencing. The absorbance of 1.90 at 260 nm/280 nm and the RIN of 9.1 were obtained for the puri ed RNA sample by Nanodrop ND-1000 spectrophotometer (LabTech, USA) and 2100 Bioanalyzer (Agilent Technologies, USA), respectively. One microgram of RNA was reverse transcribed using Clontech SMARTer cDNA synthesis kit, and was further fragmented using divalent cations for the sequencing. The paired-end library was prepared following the manual of the Paired-End Sample Preparation Kit (Illumina Inc., San Diego, CA, USA). Then the library with an insert length of 270 bp was sequenced by Illumina HiSeq X Ten in 150 bp paired-end mode (Illumina Inc., San Diego, CA, USA).

Estimation of the Genome size and Sequencing Coverage
The 19-mer frequency distribution analysis was performed on the remaining clean reads to estimate the genome size of the O. shuilongensis using the formula: Genome size = kmer_Number/Peak_Depth. Corrected Illumina reads were selected to perform genome size estimation.
De novo Genome Assembly and Quality Assessment of O. shuilongensis Genome 520M' and 'corOutCoverage = 90', then to detect raw reads overlapping through a highly sensitive over lapper MHAP (mhap-2.1.2, option 'corMhapSensitivity = normal'), and nally to perform error correction through the falcon sense method (option 'corrected Error Rate = 0.045'). Next, using default parameter settings, error-corrected reads were trimmed of unsupported bases and hairpin adapters to obtain their longest supported range. In the last step, Canu generated the draft assembly using trimmed reads. The draft assembly was polished to obtain the nal assembly, adopting Pilon [13] (https://github.com/broadinstitute/pilon) using Illumina data with the parameters '-mindepth 10 --changes --threads 4 --x bases'. To further evaluate the accuracy of the O. shuilongensis genome, we aligned the high-quality short reads from short insert size pair-ended libraries against the genome assembly using BWA [14] .

Annotation of Repeat Sequences
The repeat composition of the assemblies was estimated by building a repeat library employing the de novo prediction programs LTR FINDER v1.05 [15] , MITE-Hunter [16] , RepeatScout v1.0.5 [17] and PILER-DF v2.4 [18] . The database was classi ed using PASTEClassi er [19] and was combined with the Repbase [20] database to create the nal repeat library. Repeat sequences in the O. shuilongensis genome were identi ed and classi ed using the RepeatMasker v4.0.6 [21] program.

Gene Functional Annotation
Annotation of the predicted genes was performed by local BLAST [27] programs blasting their sequences against a number of nucleotide and protein sequence databases, including NCBI-NR [20] , KOG [34] , KEGG [35] , and TrEMBL [36] with an E-value cutoff of 1e-5. We then searched the Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway databases using the software Blast2GO [37] .

Noncoding RNA Annotation
Non-coding RNAs play important roles in a great variety of processes, such as the rRNAs and tRNAs involved in mRNA translation. The rRNA fragments were identi ed by aligning the rRNA template sequences using BLAST with E-value at 1e-10 and identity cutoff at 95% or more. The tRNAScan-SE v1.3.1 [38] algorithms with default parameters were applied to the prediction of tRNA genes. The rRNA and miRNA genes were predicted by Infenal v1.1 [39] against the Rfam [40] database and miRbase [41] with cutoff score at 30 or more. The minimum a cutoff score was based on the settings which yield a false positive rate of 30 bits.

Global Gene Family Classi cation
In order to identify gene families among sh species in this work, proteins of the longest transcripts of each individual gene from O. shuilongensis and other sequenced species, including Salmo salar, Ictalurus punctatus, A. mexicanus, C. carpio, S. rhinocerous, D. rerio and Larimichthys crocea were analyzed. All data was downloaded from NCBI [42] . Gene family analysis based on the homolog of gene sequences in related species was initially implemented by the alignment of an "all against all" BLASTP [43] with a cutoff of 1e-5 and subsequently followed by alignments with high-scoring segment pairs conjoined for each gene pair by Solar. To identify homologous gene pairs, we required more than 30% coverage of the aligned regions in both homologous genes. Finally, homologous genes were clustered into gene families by OrthoMCL [44] with the in ation parameter set at 1.5.

Phylogenetic Relationship and Genomic Comparison
Evolutionary analysis was performed using the single-copy protein-coding genes among all species. Amino acid and nucleotide sequences of the ortholog genes were aligned using the multiple alignment software MUSCLE [18] with default parameters. A total number of 724 single-copy ortholog alignments were concatenated into a super alignment matrix of 1,692,085 nucleotides. A maximum likelihood method deduced tree was inferred based on the matrix of nucleotide sequences using PhyML package with the JTT + G + F model. Clade support was assessed using bootstrapping algorithm in the PhyML package with 100 alignment replicates. The constructed phylogenetic tree indicated that O. shuilongensis were clustered closely to Cyprinidae species, which is inconsistent with their putative evolutionary relationships (Fig. 3).
We determined the expansion and contraction of the orthologous gene families by comparing the cluster size differences between the ancestor and each of the O. shuilongensis and seven other sh species using the CAFÉ [45] program. A random birth and death model were used to study changes of gene families along each lineage of the phylogenetic tree. A probabilistic graphical model (PGM) was introduced to calculate the probability of transitions in gene family size from parent to child nodes in the phylogeny. Using conditional likelihoods as the test statistics, we calculated the corresponding P-values in each lineage. A P-value of 0.05 was used to identify families that were signi cantly expanded in O. shuilongensis genome.

Dating the Divergence among Oreonectes Fishes and Assessing the Genomic Mechanism Underlying the Degeneration of Eyes and Body Color
O. shuilongensis is a surface-dwelling species with intact eyes and a unique color pattern consisting of ne black marks on the body except on the abdomen [3] . To explore the genomic changes resulting in the degeneration of eyes and body color of semi cave-dwelling and cave-dwelling Oreonectes shes, the whole genome re-sequencing was performed for facultative cave-dwelling O. jiarongensis (three individuals) and cave-dwelling O. daqikongensis (two individuals) and O. dongliangensis (one individual). For each individual, ~ 3 µg of DNA was sheared into fragments of 350 bp with the Covaris v1.8 system. DNA fragments were then processed and sequenced using the Illumina HiSeq 4000 platform. Filtered sequence reads were mapped to the O. shuilongensis reference genome using BWA-MEM with default parameters (0.7.10-r789) [46] . Alignment bam les were imported to SAMtools (v0.1.19) [10] for sorting and Picard (http://broadinstitute.github.io/picard/, version 1.92) was used to remove duplicated reads. Following mapping, we performed variant calling using the GATK [47] package with default parameters. To explore the phylogenetic relationships among Oreonectes shes, phylogenetic tree was inferred using the Neighbor-Joining (NJ) algorithm as implemented in RAxML software with 1000 bootstraps [48] and the divergence times of the taxa analyzed were estimated with mcmctree [15] . The outgroup sequences were chosen from the zebra sh genome assembly GRCz11, with the genome alignment to the obtained Oreonectes genome assembly using LASTZ [49] . We employed calibration points from the dated Cyprinidae-Nemacheilidae divergence to place a lognormally distributed prior on the age of the root of a tree containing all samples of Oreonectes species and outgroup. To explore the genomic mechanism underlying the degeneration of eyes and body color, the SNPs and Indels obtained through GATK pipeline were annotated using the software SnpEff [6] .
Abbreviations BLAST: Basic Local Alignment Search Tool; KEGG: Kyoto Encyclopedia of Genes and Genomes; NCBI: National Center for Biotechnology Information.

Declaration of Interest Statement
We declare that we have no nancial and personal relationships with other people or organizations that can inappropriately in uence our work, there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as in uencing the position presented in, or the review of, the manuscript entitled.
Ethics approval and consent to participate Not applicable.

Consent for publication
Written informed consent for publication was obtained from all participants.

Availability of supporting data
Supporting data is available in the NCBI database. Raw data has been deposited in NCBI with the project accession PRJNA505902. BioSample accessions: SAMN10438768, SAMN10438757-60, SAMN10438763 and SAMN10438766.

Competing interests
The authors declare that they have no competing interests. Author contributions ZL, XQ, HW, and LH, performed the experiments, XQ, and ZW, analyzed the data, ZL and HW, wrote the paper, and participated in the design of the study. XQ, HW, and Z W, participated in data statistics and the draft manuscript. JZ, and ZL, conceived the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the nal manuscript.