Genome survey of Misgurnus anguillicaudatus to identify genomic information, simple sequence repeat (SSR) markers, and mitochondrial genome

The dojo loach Misgurnus anguillicaudatus is an important economic species in Asia because of its nutritional value and broad environmental adaptability. Despite its economic importance, genomic data for M. anguillicaudatus is currently unavailable. In the present study, we conducted a genome survey of M. anguillicaudatus using next-generation sequencing technology. Its genome size was estimated to be 1105.97 Mb by using K-mer analysis, and its heterozygosity ratio, repeat sequence content, GC content were 1.45%, 58.98%, and 38.03%, respectively. A total of 376,357 microsatellite motifs were identified, and mononucleotides, with a frequency of 42.57%, were the most frequently repeated motifs, followed by 40.83% dinucleotide, 7.49% trinucleotide, 8.09% tetranucleotide, and 0.91% pentanucleotide motifs. The AC/GT, AAT/ATT, and ACAG/CTGT repeats were the most abundant motifs among dinucleotide, trinucleotide, and tetranucleotide motifs, respectively. Besides, the complete mitochondrial genome was sequenced. Based on the Maximum Likelihood and Bayesian inference analyses, M. anguillicaudatus yingde in this study was the “introgressed” mitochondrial type. Seventy microsatellite loci were randomly selected from detected SSR loci to test polymorphic, of which, 20 microsatellite loci were assessed in 30 individuals from a wild population. The number of alleles (Na), observed heterozygosity (Ho), and expected heterozygosity (He) per locus ranged from 7 to 19, 0.400 to 0.933, and 0.752 to 0.938, respectively. All 20 loci were highly informative (PIC > 0.700). Eight loci deviated from Hardy–Weinberg equilibrium after Bonferroni correction (P < 0.05). This is the first report of genome survey sequencing in M. anguillicaudatus, genome information, mitochondrial genome, and microsatellite markers will be valuable for further studies on population genetic analysis, natural resource conservation, and molecular marker-assisted selective breeding.


Introduction
Next-generation, high-throughput sequencing (NGS) is an economical and powerful strategy for investigating genomic resources. Genome survey sequencing through NGS offers fundamental genomic information relating to genome size, heterozygosity ratio, and repeat contents. In recent years, genome survey sequencing has been extensively applied to accurately predict the whole genome characteristics of aquaculture species and provided favorable support for evolutionary biology and adaptation survey [1][2][3][4][5]. In addition to genome information, genome survey sequencing can also characterize a considerable amount of sequence data for molecular markers development, such as microsatellites and single nucleotide polymorphisms. Owing to the advantages, such as co-dominance, high polymorphism, easy automatic analysis, and random distribution within the genome, microsatellites (SSRs) are costeffective and useful molecular tools in studies on genetic population and conservation, genetic breeding, and genetic map construction [6][7][8]. Recently, based on genome survey sequencing, increasing genome-wide SSR markers have been identified in aquaculture species, such as Charybdis feriatus [9], Sillago sihama [10], Marsupenaeus japonicus [2], and Hemibagrus wyckioides [11].
The dojo loach Misgurnus anguillicaudatus, which belongs to the Cobitidae family, is an important endemic freshwater species and is widely distributed in China, Japan, Korea, and other Southeast Asian countries. M. anguillicaudatus is an important scientific research object due to its unique physiological characteristics. Due to its natural variability in ploidy level, M. anguillicaudatus is regarded as an ideal animal in which to explore the biological origin, the evolutionary history of polyploidy, and unisexual reproduction [12,13]. In addition, M. anguillicaudatus can also be recognized as a potential model species to explore the mechanism of accessory air-breathing function [14]. In addition to scientific value, it has become an important economic species in Asia because of its nutritional value and wide environmental adaptability. Despite its economic importance, the supply of dojo loaches for human consumption mostly depended on the exploitation of wild populations. With the increasing demand, over exploitation and habitat destruction have resulted in the decrease of natural populations in recent years. The lack of genetic and genomic resources in M. anguillicaudatus has limited aquaculture production. In this study, we aimed to (1) obtain genome information using genome survey sequencing, (2) describe microsatellite distribution patterns, (3) characterize the complete mitochondrial genome and construct phylogenetic analysis, (4) develop genomewide SSR markers from genome sequencing data. These results will provide effective support for natural resource conservation, genetic diversity detection, and molecular marker-assisted selective breeding.  Table S1) have been previously applied to identify the ploidy of M. anguillicaudatus. The M. anguillicaudatus used for genome survey sequencing was diploid (Fig. S1).

Whole-genome survey sequencing
Genomic DNA samples were randomly sheared into 350 bp fragments using an ultrasonicator (Covaris Inc.). The TruSeq DNA LT Sample Prep Kit (Illumina, USA) was used to construct libraries. DNA fragments of required lengths were recovered using electrophoresis before end repair, following which poly A-tails and sequencing adapters were added. The obtained fragments were purified and then performed PCR amplification for library preparation. Following the standard protocol, the Illumina Hiseq Xten platform was used to sequence the DNA library. The production of genomic data was carried out by the Qingdao OE Biotech Co., Ltd. (Qingdao, China).

K-mer analysis and genome assembly
After filtering out the adapters, poly (N) sequences, and lowquality reads, the high-quality clean data were used to perform K-mer analysis using GCE v1.0.0 software. Based on the 17-mer distribution, information on the peak depth and the number of 17-mers was obtained and used to estimate genome size, repetitive sequences, and heterozygosity ratio. Then the genome size can be estimated by using the following formula: Genome size = K-mer num/Peak depth [16]. Additionally, based on K-mer analysis, the heterozygosity rate and proportion of repeat sequence were calculated according to the methods described by Liu et al. [16].
SOAPdenovo (v2.04) software [17] was used to assemble clean reads with K-mer = 41 by applying the de Bruijn graph structure. The clean reads were used to assemble contigs, and contigs were then used to assemble scaffolds depending on its paired-end information. The SOAPaligner v2.21 software [18] was applied to estimate the GC depth distribution.

Mitochondrial genome assembly and phylogenetic analysis
The mitogenome sequence (NC_011209.1) downloaded from the NCBI dataset was as the reference to isolate the mitogenome sequence from the filtered clean data using NOVOPlasty v3.7.2. Based on genome sequencing data, only 845 bp mitochondrial genome sequence was identified. Finally, the complete mitochondrial genome of M. anguillicaudatus yingde was obtained by next-generation sequencing.
The mitogenome sequence of 23 Cobitidae species and 2 outgroup species (Danio rerio and Cyprinus carpio) were downloaded from the NCBI GenBank database and used for phylogenetic analyses. On the basis of 13 concatenated mitochondrial protein-coding genes (PCGs), we performed the phylogenetic analysis by the Maximum Likelihood (ML) and Bayesian inference (BI) methods using IQ-TREE and MrBayes 3.1.2 [19], respectively. PartitionFinder v1.1.1 [20] was used to select the best partitioning strategy and the optimal nucleotide substitution model for the data sets using the Bayesian information criterion (BIC). The result of the Bayesian information criterion revealed that GTR + I + G model are the best-fit model for the data sets. The ML phylogenetic tree was carried out using IQ-TREE software with 1000 replicates of bootstrap. The BI analysis was performed with MrBayes 3.1.2 software based on 10,000,000 generations.

Characterization of polymorphic microsatellite loci
Seventy SSR loci containing dinucleotide to pentanucleotide repeats were randomly selected from our genome survey data and primers were designed using Primer 5 software to amplify microsatellites. Firstly, 70 SSR loci were tested using ten individuals for polymorphism detection. Then primers of verified polymorphic loci were estimated in a wild population with 30 individuals. The microsatellite polymorphism analysis strategy was performed following the description in Schuelke's article [21]. Briefly, four different universal adapter sequences labeled with the fluorochromes (FAM, VIC, NED, or PET dyes) were added to the 5′ end of each forward primer. Universal adapter sequences were shown in Table S2.
PCR was carried out in a 10 µL volume that contained 1.0 µL of buffer, 0.8 µL of dNTP mixture (2.5 mM), 0.6 µL of Mg 2+ , 0.05 µL of Ex-Taq DNA polymerase, 1 µL of DNA (100 ng/µL), 4.55 µL of ddH 2 O, 1 μL of forward/ reverse primer mixture (F: R = 2:50, 1.5 μM), and 1 µL of fluorescent-labeled primer (1.5 μM). PCR amplification was performed in a thermal lab cycler (SensoQuest, Germany) under the following conditions: initial denaturation at 94 °C for 5 min; 25 cycles at 94 °C for 30 s, appropriate annealing temperature for 1 min, and extension at 72 °C for 1 min; an additional 12 cycles at 94 °C for 30 s, annealing at 53 °C for 1 min, and extension at 72 °C for 1 min; and a final extension at 72 °C for 10 min. PCR products were subjected to capillary electrophoresis in a 3730 Genetic Analyzer (Applied Biosystems Inc., Foster City, CA, USA) using GeneScan™ 600 Liz® (ABI) as the size standard. The genotypes were determined using the software of GeneMarker V2.2.0.

Data analysis
Cervus 3.0.7 software was applied to calculate the number of alleles (Na), observed heterozygosity (Ho), expected heterozygosity (HE), and the polymorphism information content (PIC) of polymorphic loci. Genepop 4.0 software was used to perform tests for deviations from Hardy-Weinberg equilibrium (HWE). Micro-Checker v.2.2.3 software was used for calculating the frequency of null alleles (Fua).

Genome size estimation and sequence assembly
A total of 145.46 Gb of raw data were generated by sequencing the genome survey library (350 bp). After filtering and elimination, 135.22 Gb of high-quality clean sequences were generated with Q30 scores assigned to 93.17%, which was approximately 122.3 × coverage. The effective rate and the error rate were 92.96% and 7%, respectively (Table S3).
A total of 135.22 Gb were used for K-mer analysis in M. anguillicaudatus. The 17-mer frequency distribution derived from the sequencing reads was plotted in Fig. 1A. The K-mer analyses indicated that the peak of the 17-mer distribution was 90, and the total K-mer count was 99,537,205,022, therefore, the genome size of M. anguillicaudatus was estimated to be 1105.97 Mb. The average GC content of the M. anguillicaudatus genome was 38.03% (Fig. 1B). The 17-mer analysis showed that the heterozygosity and repeat content of M. anguillicaudatus genome was 1.45% and 58.98%, respectively (Table S4).
A total of 4,120,965 contigs were assembled from the genome survey sequencing data with an N50 of 378 bp. Based on the contigs, the genome assembly contained 3,488,543 scaffolds, with an N50 of 551 bp (Table 1).

Fig. 1
Results of genome survey using NGS. A K-mer (K = 17) analysis for estimate genome size and heterozygote frequency of M. anguillicaudatus. The x-axis is depth (X); the y-axis is the proportion that represents the frequency at that depth divided by the total frequency of all depths. B Guanine plus cytosine (GC) content and depth correlation analysis of M. anguillicaudatus. The x-axis represents the GC content and the y-axis is the sequencing depth. The distribution of sequence depth is on the right side, while the distribution of GC content is at the top  Table 2). Short repetitive motifs were the dominant motifs, especially mono-and di-repeats, which accounted for 83.40%. The number of penta-and hexa-nucleotide motif types was greater than 83, while that of tri-and tetra-nucleotides was less than 32 ( Fig. 2A). Although the penta-and hexanucleotide SSR loci included more motif types than other motif types, the proportion of every motif type in pentaand hexa-nucleotide SSR loci was very few. The frequency distribution range of microsatellite repeats ranged from 5 to 10 repeats for trinucleotides, mostly from 5 to 9 repeats for tetranucleotides, from 5 to 8 repeats for pentanucleotides, and from 5 to 7 repeats for hexanucleotides (Fig. 2B). Of tri-, tetra-, penta-, and hexa-nucleotide repeats, the most common repeat numbers (5-6) accounted for more than 50% of the total repeats, and the repeat numbers 5-7 accounted for more than 78% of the total repeats. The motif types increased with increasing repeat length, and the frequency of repeats decreased because mutation rates were higher in longer repeats.

Mitochondrial genome assembly and phylogenetic analysis
We obtained a 16,464 bp long circular mitochondrial genome of M. anguillicaudatus, which was composed of 13 protein-coding genes (PCGs), 22 tRNA genes, 2 rRNA genes, and a non-coding hypervariable control region showing the typical teleosts mitogenomic arrangement (Fig. 4). The mitochondrial genome was AT-biased (58.05%) with the lowest content of G among the four bases and a positive A + T skew, which was identical to the characteristics of the mitochondrial genomes of vertebrates. Apart from protein-coding gene ND6, the other eight tRNAs (tRNA-Ala, tRNA-Asn, tRNA-Cys, tRNA-Tyr, tRNA-Ser, tRNA-Glu, tRNA-Pro, tRNA-Gln) were also located on the light strand, whereas all other genes were encoded on the heavy strand. In the complete mitogenome, eleven intergenic spacers were observed: of these, nine intergenic spacers were small intergenic spacers ranging from 1 to 13 bp with 32 bp total length. The largest two intergenic spacers were found in tRNA-Asn/tRNA-Cys (30 bp) occurring in the light strand and COX2/tRNA-Lys (27 bp) occurring in the heavy strand. Six overlapping regions were identified in a total of 26 bp in length. Gene overlaps were observed at 11 gene junctions (tRNA-Ile/tRNA-Gln, ATP8/ATP6, ATP6/COX3, Nd4L/ ND4, ND5/ND6, and tRNA-Thr/tRNA-Pro) with overlaps ranging from 1 to 10 bp. Regarding start codons, COX1 started with GTG, whereas the other 12 protein-coding genes shared an identical initiation codon ATG. Concerning stop codons, COX3 and CYTB were inferred to terminate with an incomplete stop codon, with four the proteincoding genes (ND2, ND3, ND4, and ND5) using TAG as a stop codon, and the other seven genes (ND1, COX1, COX2, ATP8, ATP6, ND4L, and ND6) sharing TAA, respectively (Table S6).
To better understand the taxonomic status of M. anguillicaudatus yingde, we performed the phylogenetic relationship of M. anguillicaudatus yingde with other Cobitidae species and two outgroup species as inferred by the entire mitogenome. The phylogenetic analyses trees constructed by ML and BI methods provided identical phylogenetic topologies (Fig. 5)

Polymorphism of microsatellite loci
Among 70 microsatellite loci, 20 loci were detected to show polymorphism in a wild M. anguillicaudatus population of 30 individuals (Table 3)

Discussion
With the great advance of NGS technology, genome survey sequencing has developed rapidly in recent years. Especially, the K-mer analysis combined with genome survey sequencing has been successfully applied for the prediction of genome size and genome structural information for many non-model species without prior knowledge. Ploidy is an important factor that directly determines the genome size. As the increase in ploidy and chromosome numbers, genome size generally increases. Therefore, for polyploid organisms, the ploidy must be detected before the whole genome sequencing. M. anguillicaudatus has complex ploidy composition in the natural environment [23][24][25]. A previous study in natural populations of M. anguillicaudatus indicated that diploid (2 N = 50) individuals were most common among Chinese M. anguillicaudatus populations [12]. In this study, M. anguillicaudatus used for 2b-RAD sequencing was diploid, which was identified by microsatellite markers. According to the reported genome data, except for few fish species [26][27][28], the genome size of most fish was generally less than 1 Gb. The estimated genome size of diploid M. anguillicaudatus (1105.97 Mb) was larger than that in most fishes, such as Oryzias latipes (700.4 Mb) [29], Acanthogobius ommaturus (928.01 Mb) [30], and Hemibagrus wyckioides (728 Mb) [11], which was related to a higher number of repetitive sequences.
In addition to ploidy, many other factors including heterozygosity ratio, repeat content, whole genome duplication also affect the quality of genome assembly. For genome  Table S7. The phylogenetic analyses were conducted based on the concatenated 13 mitochondrial protein-coding genes with Bayesian inference (BI) methods and Maximum Likelihood (ML). Numbers on the nodes represent support values inferred from BI (left) bootstrap and ML (right) probability analyses, respectively assembly, if the heterozygosity rate exceeds 0.5%, it is difficult to assemble, and if it exceeds 1%, it is even more difficult [31]. The heterozygosity rate of M. anguillicaudatus was approximately 1.45%, indicating a complex genome with the higher heterozygosity rate. Accordingly, we supposed that the low quality of assembled genome sequences Table 3 Characterization of 20 polymorphic microsatellite loci in M. anguillicaudatus Primers for each locus including a forward primer (F), a reverse primer (R), Na number of alleles, Ho observed heterozygosity, He expected heterozygosity, PIC polymorphism information content, P HWE P values for Hardy-Weinberg equilibrium corrected for multiple comparisons using the false discovery rate, F (Null) frequency of null alleles *Signifcantly deviated from Hardy-Weinberg equilibrium (P < 0.05), **Signifcantly deviated from Hardy-Weinberg equilibrium (P < 0.01), ***Signifcantly deviated from Hardy-Weinberg equilibrium (P < 0.001)

Primers
Sequences with short scaffolds might be caused by the high heterozygosity rate of M. anguillicaudatus. Previous studies concluded sex determination type according to the differences in heterozygosity ratio between females and males [32,33]. In the male heterogamety system, the heterozygosity ratio in males was larger than that in females, therefore, female individuals used for 2b-rad sequencing may be more reasonable to ensure high-quality genome sequences. Besides heterozygosity rate, the GC content is also one of the factors that directly affect sequence bias [34]. GC content that exceeds the normal range (> 65% or < 25%) may result in sequence bias on the Illumina sequencing platform, and eventually influence the quality of genome assembly seriously [35].
The average GC content of the M. anguillicaudatus genome was approximately 38.03%, which was in the acceptable range and did not have an effect on genome sequence quality. This is the first report of genome survey in M. anguillicaudatus and these genome survey data give a preliminary understanding of the genomic characteristics before largescale genome sequencing. However, further genome study combined with PacBio and Hi-C techniques is needed to acquire more accurate high-quality genome information of M. anguillicaudatus.
Identifying the microsatellite markers from genome survey data has the advantages of time-and cost-effective compared with traditional methods of developing microsatellites. It affords a fast method of isolating large numbers of microsatellite markers from genome-wide data. In the present study, except for mononucleotide repeats, the dinucleotide repeat motifs were the most frequent motifs and CG/CG dinucleotide microsatellites were the least abundant dinucleotide SSR loci, this result was congruent with other microsatellite repeats studies as described previously in Danio rerio, Oreochromis niloticus, and Oreochromis latipes [36]. This may be due to the methylation of cytosine into thymidine [37].
Apart from nuclear genome sequences, genome survey sequencing data also contain mitochondrial genome sequences [38]. In aquatic animals, Xu et al. [33] have assembled the mitochondrial genome of Platycephalus sp.1. on the basis of genome survey sequencing data. Although we failed to isolate the complete mitochondrial genome from genetic data because of the complexity of M. anguillicaudatus genome. We still suggested that such a cost-effective method to generate mitochondrial genome should be extensively applied in future exploration of genome survey data. Previous studies on the mitochondrial genome have found that the M. anguillicaudatus was divided into two clades and was considered as a polyphyletic group. One clade and other genus Misgurnus species got together into a monophyletic group [39], while the other clade was clustered with the genus Cobitis species [40][41][42]. With respect that M. anguillicaudatus have two mtDNA lineages, Kitagawa et al. [43] stated mtDNA introgression hypothesis: the mitogenome of M. anguillicaudatus individual in close association with the genus Cobitis species was recognized as "introgressed" mtDNA, while the mitogenome of other individual in close association with genus Misgurnus species was be considered as "non-introgressed" mtDNA. In terms of the hypothesis, we supposed that the M. anguillicaudatus yingde is the "introgressed" mitochondrial type.

Conclusion
In conclusion, we firstly assessed M. anguillicaudatus genomic information including genome size, GC content, heterozygosity, and repeat sequence ratio, using genome survey sequencing and K-mer analysis. Our findings demonstrated that M. anguillicaudatus genome was complex with high heterozygosity rate and repeated sequences. Genome-wide microsatellite motifs and 20 polymorphic microsatellite loci in the wild population were identified. Moreover, the mitochondrial genome was described and phylogenetic analysis was presented to explore its taxonomic status. All of this genetic information will be conducive for whole genome sequencing, genome size evolution, population genetic diversity, and genome-assisted breeding in M. anguillicaudatus.

Supplementary Information
The online version contains supplementary material available at https:// doi. org/ 10. 1007/ s11033-021-07037-x. tion elsewhere; that its publication has been approved by all co-authors if any; that its publication has been approved explicitly by the responsible authorities at the institution where the work is carried out. The authors agree to publication in the journal Molecular Biology Reports.