Polyploidization, hybridization, and maternal and paternal lineages in Cyprinids (Teleostei: Cypriniformes)

Abstract


Background
The Cyprinidae has about 400 closely related polyploid species whose evolution involves auto-or allopolyploidization [1].Many species have ploidies of 4n, 6n, 8n, and one species even has over 400 chromosomes.Within Cyprinidae, great contradiction exists between traditional and molecular classi cations.Traditionally, cyprinids were divided into 10 subfamilies (www.shbase.org),but 12 subfamilies have been recognized in China [2].Past molecular phylogenies scatter species in one subfamily across several subfamilies.The topologies did not resolve monophyletic relationships for ten traditional subfamilies whose species have hybrid origins, and, thus, reject the traditional taxonomy [3][4][5][6][7][8][9].Yang et al. [9] examined the phylogenetic relationships of Cyprinidae by using mitogenome sequences and cloned nuclear RAG1.Consequently, they reassigned some genera to subfamily Cyprininae, and subdivided it into 11 tribes [9].Subsequently, Tan and Armbruster [10] upgraded these tribes to subfamilies.However, phylogenetic hypotheses revealed in several studies suggest recurrent, blurred origins via hybridization and polyploidization within Cyprinidae, which may be responsible for the controversial classi cation among subfamilies.The history of polyploidization in the Cyprininae remains uncertain and this poses signi cant challenges for phylogenetic systematics.
The ploidy of many cyprinids is clear; yet the history of polyploidization remains poorly understood.
Uncertainty involves the timing of whole genome duplications, origins of polyploidization and so on.Ma et al. [11] resolved the phylogenetic positions of the paternal and maternal ancestors of Cyprininae and estimated the dates of whole genome duplication to be around 10.71-12.42 Ma based on mitochondrial and nuclear DNA datasets.However, the evolution of polyploids in the other subfamilies await clari cation.Further, few studies use large-scale molecular datasets to explore the origin of polyploidization of Cyprinidae.
Analyses of mitochondrial and nuclear DNA marks can yield insights on the history of polyploidization.
On one hand, nuclear DNA sequences generally evolve slower than the mitochondrial DNA because of difference in effective population sizes [12][13][14].On the other hand, nuclear DNA of polyploids can yield signals of biparental contributions, while mitochondrial DNA only provide that of matrilineal genealogy [15,16].An understanding of complicated polyploidization requires discerning signals from both mitochondrial and nuclear DNAs.
Whole genome markers are necessary to assess the recurrent, complex origins of taxa via hybridization and polyploidization.Polyploid cyprinids have very large genomes and numbers of chromosomes, and the technical challenges and costs preclude whole-genome sequencing, yet it is possible to assess variation by using transcriptomic data, which are affordable.Herein, we generate transcriptomes from liver from one representative species of Poropuntiinae and Schizopygopsinae, and four species of Schizothoracinae (C-values: 2.19-3.24pg).Analyses use downloaded mitochondrial genomes, genomic and transcriptomic data from NCBI and Ensembl for representative species in Cyprininae, Labeoninae, Schizopygopsinae and outgroup.Analyses construct gene-and gene-copy trees, which track the evolution history of cyprinids.Because transcriptome assemblies have potential mistakes, we also clone and sequence 10 HOX genes and gene-copies from 14 representative species and one species complex, which covers seven subfamilies and one genus of Cyprinidae, and one outgroup to help clarify the large-scale history of polyploidization.
All assembled protein-coding genes of seven species obtained in the clustered unigenes were used to evaluate the assembly quality by comparing the assembled mitochondrial protein-coding sequence with downloaded sequences.The proportions of mismatch nucleotides relative to mitochondrial proteincoding genes for the unigenes of seven species ranged from 0.10% to 1.60%.Total coverage was at least 98.83%.More than 87.76% of paired reads mapped back to the unigenes after alignment using Bowtie2 [17].The low ratio of mismatch nucleotides, high coverage and reads utilization con rmed assemblies were reliable.

(c) Identi cation of ORFs and Homologous Genes Clusters
The number of ORFs ranged from 34,202 to 56,298 for G. pachycheilus, L. rohita, O. stewartia, P. huangchuchieni, Sc. kozlovi, Sc. macropogon, Sc. oconnori and Sc.waltoni (Supplementary Table S13).After ltering homologous gene clusters, 3,671 eligible homologous gene clusters were used to construct nuclear gene-trees to infer the phylogenetic relationships using multiple gene-copies.

(d) Gene-trees Resulting from Multi-omics
The matrilineal genealogy and nuclear genes-trees revealed the maternal and paternal lines, respectively.These trees clearly identi ed potential diploid progenitors, or at least narrowed parental origins.After ltering out the gene-trees having bootstrap values of less than 50 for key nodes (Fig. S1), 136 trees were retained.Further reduction was based on con ict with gene classi cation criteria.

(i) Phylogeny of Genera in Schizothoracinae and Schizopygopsinae
The gene-trees mainly depicted classes "Schizo_A" and "Schizo_B" (Fig. 2a, b).The rst, supported by 77.94% (106/136) of trees for Schizo_A resolved a monophyletic Schizothoracinae and a monophyletic Schizopygopsinae (Fig. 2a).At least two gene-copies within Schizothoracinae indicated a recent intergeneric WGD events.In Schizo_B, supported by 22.06% (30/136) of gene-trees (Table 1), the topology of gene copies from Schizothoracinae and Schizopygopsinae scattered with each other, and thus indicated introgression between these subfamilies.We used the 106 gene-trees supporting monophyly of both subfamilies to identify the evolutionary relationships of the lineages of Cyprininae, Poropuntiinae and Labeoninae (Fig. 2b).

(f) Time-Estimation of WGD in Cyprininae
The distributions of Ks values for homologous gene clusters from Sc. kozlovi, Sc. waltoni, Sc. oconnori and Sc.macropogon did not present a clear secondary peak.Thus, WGDs did not appear to accumulate (Fig. 4).For S. grahami, the distributions presented a clear peak (Fig. 4) with a Ks value of about 0.155 to 0.160.Assuming the synonymous substitution mutation rate was about 5.7-6.4 × 10 -9 substitutions/synonymous site/year [18], the duplication episode was estimated to have occurred 12.10~14.03Ma.

Discussion
To guarantee candidate genes meet the criteria for reconstructing their relationships given polyploidization, we employ several strict ltering steps, which result in a dramatic reduction in gene markers.The reduction owes to defective transcriptomes, which might owe to (1) cDNA being transcribed from partial gene sequences, or two cDNA sequences might be transcribed from non-overlapped regions of a same gene, and/or (2) multiple cDNAs are from different spliceosomes of the same gene.These phenomena make it di cult to assemble full-length genes from transcriptomic data [19,20].Further, the species of Schizothorax may be autopolyploids with multiple copies of genes [21,22], and accordingly high homology of gene assemblies may create sequences not in the genome.Considering the potential mistakes and bias caused by de novo assembly of transcriptomic data, we use cloning and Sanger sequencing of 10 HOX genes to reconstruct the gene-trees.
Several representative species within Cyprinidae reveal in detail the hybridization resulting in WGDs in the Cyprininae, Labeoninae, Poropuntiinae, Torinae, Schizopygopsinae and Schizothoracinae.By integrating mitogenomes, nuclear genomes and transcriptomes, tree topologies indicate that species of Schizothoracinae and Schizopygopsinae are the maternal siblings of Cyprininae and paternal siblings of Torinae.This advances knowledge of the origins of polyploidy within Cyprinidae, especially when compared to previous studies [9,11].Analyses of the RNA-seq data cluster gene-copies together in 77.94% of nuclear gene-trees and 76.47% of the nuclear gene-trees for matrilineal copies place Schizothoracinae in the Cyprininae.The phylogenies depict species of these two subfamilies as having polyploid origins and their matrilineal progenitor came from Cyprininae.The 10 HOX gene-trees are concordant with the conclusion indicated by RNA-seq data.This narrows the origins of the polyploids and shows that genes from species in these two subfamilies provide reliable makers for identifying maternally or paternally inherited copies of duplicated genes in Cyprinidae and Torinae.Analyses identify complicated gene fates and unstable topologies among seven subfamilies and one genus of Cyprinidae, which is inconsistent with previous ndings [23,24].Further, although the chromosome numbers in Schizothorax suggest that they might have experienced a recent WGD event, no statistical test can technically estimate the signi cance of a secondary peak (Fig. 4a-d).Nevertheless, the initial peak of recent duplications may hide secondary peaks.This may be a limitation of the unigene method, which requires a secondary peak to dissociate su ciently from another corresponding large-scale duplication, and that the two peaks have obvious distributions.Our four species of Schizothorax may be autopolyploids, because the divergence of duplicate gene-pairs is too slight to separate them from the rst peak (Fig. 4a-d).In contrast, a peak emerges above the background level for S. graham (Fig. 4e).The second peak re ects a period of increased accumulation of duplicate genes.Thus, analysis of complete genome sequences can detect large-scale duplications that are older or encompass fewer genes than events detected by the unigene method [5,[25][26][27][28].
Although Sinocyclocheilus assigns to the Cyprininae, its relationship with other subfamilies requires further exploration.Our results suggest that S. grahami is an allotetraploid, and shares the same allopolyploid events with gold sh and common carp, same to our result published previously [29].Lastly, the estimated time of allopolyploidization is about 12.10 ~ 14.03 Ma, which is consistent with a previous estimation in Cyprininae based on mtDNA and nDNA analyses [11].Thus, the shared allopolyploidization events unite the species by assigning copies to different subfamilies on the same gene-tree; this rejects their taxonomic hypothesis of monophyly.
The phylogenetic positions of polyploid subfamilies of Cyprinidae are complex, especially for Labeoninae.Two phenomena might lead to this.On one hand, species in the subfamily might have experienced hybridization.These species readily crossbreed.Our differing gene-tree distributions agree with the "genomic historical mosaicism" model of Folk et al. [30]; their origin began with hybridization and ended with xation of parental histories, i.e. coalescence of all alleles after the hybridization.On the other hand, species in the subfamily experienced genome downsizing (rediploidization) after its WGD [31].This process removed redundant DNA (repetitive sequences) and reduced chromosome numbers via complex chromosomal rearrangements [32].These complex processes likely result in its variable phylogenetic positions in gene-trees.However, this possibility awaits further testing from phylogenomic analyses.

Conclusions
In summary, our work identi es the maternal and paternal lineage of allopolyploids as being in Cyprininae and Torinae, respectively.This suggests that species of Sinocyclocheilus are also allopolyploids belonging to Cyprininae, and indicates that Labeoninae ts the model of "genomic historical mosaicism".However, the limited genomic/transcriptomic data for few representative species of cyprinids sh results in speculation on the history of hybridization and polyploidization in eight cyprinid subfamilies.Further strategies might help clarify the history of hybridization and WGDs by enlarging the number of representative species, combining phenomic and molecular data, and constructing gene-trees based on whole-genome markers.

Methods (a) Samples and downloaded data
One species each of Acrossocheilinae, Poropuntiinae and Torinae, four species of Cyprininae and Schizothoracinae, three species of Labeoninae, and two species of Schizopygopsinae were sampled.The detailed information on ploidy/genome size and sampling location were listed in Supplementary Table S1-3.Voucher specimens were deposited in the State Key Laboratory for Conservation and Utilization of Bio-resources, Yunnan University.Mitochondria genomes from 32 species were downloaded from NCBI, including for one species of Acrossocheilinae, Smiliogastrinae and Torinae, two species of Barbinae, eight species of Cyprininae, ve species of Labeoninae and Schizothoracinae, four species of Poropuntiinae and Schizopygopsinae, and outgroups including Danio rerio and Ctenopharyngodon idella (Supplementary Table S4).Genome sequences from ve species of Danio rerio, Carassius auratus, Ctenopharyngodon idella, Cyprinus carpio, and Sinocyclocheilus graham, and RNA-seq data from Gymnodiptychus pachycheilus and Labeo rohita were also downloaded from NCBI (Supplementary Table S2).

(b) Ploidy Con rmation, mRNA Isolation from Liver and Sequencing
Flow cytometry was used to measure the DNA content of blood cells for all four individuals of Schizothoracinae.Ploidy information for Poropuntius huangchuchieni was taken from Zheng et al. [33] (Supplementary Table S2).
Livers were excised carefully from one individual of P. huangchuchieni, four species of Schizothorax and one Oxygymnocypris stewartii.Samples were stored in RNALater (Ambion, Austin, TX, USA) at -80°C.RNA was extracted from all six samples following instructions for RNeasy Mini Kit (Qiagen, Chatsworth, CA, USA).After isolating mRNA, preparation of the transcriptome library was started with 10μg of total RNA from each sample using an mRNA-Seq Sample Prep Kit (Illumina Inc., San Diego, CA, USA).After purifying and fragmenting the mRNA, rst strand cDNA was used to synthesize the second strand.Following end repair and adding "A" bases, the adaptors were ligated to the ends of the cDNA fragments.After fragment size-selection on gels, suitable cDNA fragments were enriched by PCR.A MinElute PCR Puri cation Kit (Qiagen, Chatsworth, CA, USA) was used to purify the products at all steps.Finally, quality control and quanti cation of the library was assessed using Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, Calif., USA) and Qubit (v1.0, Invitrogen, Carlsbad, CA, USA) qPCR.Clusters were generated in a cluster station system and sequenced using the HiSeq 2000.The library was denatured and hybridized to a owcell.Captured DNA was used as a template for second strand synthesis and ampli ed into a clonal cluster.Subsequently, clusters were linearized, blocked and hybridized with a sequencing primer, which provided a site for sequencing by synthesis.Finally, the owcell was transferred to and sequenced by the HiSeq 2000 platform.
(c) RNA-seq Data Quality Control and Transcriptome de novo Assembly Raw reads of downloaded and our sequencing were ltered by excluding the adaptor sequence, the reads with more than 10% unknown nucleotides and low quality reads (more than 50% bases with quality value less than 10), by using in-house scripts.

(d) De nition of Homologous Gene Custers and Construction of Gene-trees
First, all-against-all protein sequence alignment were performed with BLASTP (v2.5.0) [36] using the protein sequences annotated by available genomesand ORFs annotated by TransDecoder (v3.0.0).Only the proteins with alignment length over 60 bp and identity more than 50% were retained.Next, homologous genes were clustered using OrthoMCL (v2.0.9) [37].Most important, homologous geneclusters were selected as follows: 1) haging at least one marker of outgroup (D. rerio or C. idella); 2) at least one species of Schizothorax with two paralogs; and 3) either Cy.carpio or Ca.auratus with two paralogs.Subsequently, gene-trees based on homologous gene clusters were constructed by PhyML (v3.1) [38,39], while employing both D. rerio and C. idella as outgroup taxa.Lastly, homologous genetrees were ltered as follow: 1) having at least one marker of outgroup (D. rerio or C. idella); 2) at least one species of Schizothorax with two paralogs; 3) either Cy.carpio or Ca.auratus with two paralogs from different origins; and 4) bootstrap value of key node was more than 50 (Fig. S1).Setting the conditions for ltering the homologous gene clusters and the homologous gene trees involved the following: 1) either D. rerio or C. idella should have one marker; 2) to identify the maternal and parental origins of Cyprininae, Cy. carpio or Ca.auratus, or Cy.carpio or Ca.auratus should have two homologous gene occurring on different branches in the gene-tree; and 3) all species of Schizothorax are polyploids, but their origins are unknown.We select at least one species of Schizothorax with two paralogs to explore the origin of Schizothorax.
Eventually, in total, 136 gene-trees were reserved.These included the outgroup and at least one species of Schizothorax with two copies, clear divergence of maternal and paternal copies in Cyprininae, and markers of L. rohita and P. huangchuchieni, which served to distinguish maternal and paternal copies.

(e) Construction of Matrilineal Genealogy Divergence Time Estimation
The mitochondria DNA of P. huangchuchieni were locally assembled using the transcriptomic data with Trinity (v2.3.4) [40].Next, 13 contigs were obtained and connected to a pseudo-mitochondria genome using the mitochondrial DNA of Barbonymus gonionotus as reference.Published mitogenome sequences for 32 species and our data were aligned using MUSCLE (v3.8.31) [41].Then, unreliably aligned regions were removed by using Gblocks (v0.91b) [42] with parameters set to "-b1=4 -b2=5 -b5=a -t=d".Subsequently, the retained data were used to construct phylogenetic trees.The maximum likelihood (ML) tree was constructed using PHyML (v3.1) [38,39] with the GTRGAMMAIG model and 1,000 random sequence addition replicates.Bayesian trees were constructed by using MrBayes (v3.2.2) [43] incorporating optimized models of sequence evolution selected by jModelTest2 [44,45].MCMC analyses were run for 5×10 7 generations, with sampling every 1,000 th generation and discarding the rst 1.25×10 7 generations (12,500 samples) as burn-in.The standard deviation of split frequencies was 0.007550.The Bayesian MCMC divergence time was computed by the mcmctree program in PAML v4.8 [46] using the mitochondrial DNA sequences and the substitution model GTR.Analyses used the relaxed clock with an uncorrelated log-normal distribution.The Birth-Death model served as the tree prior.The rst 50,000 iterations were discarded as burn-in and every 2,000 th iteration was sampled until obtaining 50,000 samples.Effective sample sizes estimated with Tracer (v1.6) ( http://tree.bio.ed.ac.uk/software/tracer/) exceeded 200.Three calibration points were used.(i) For the age of root of Xenocyprididae, Danionidae and Cyprinidae, the oldest reliable fossil record of 55.8 Ma was used as the maximum bound [47].(ii) The earliest fossil specimens of Labeoninae were from the Early Miocene, 11.63 Ma, and this was used as the minimum bound for its node and 16 Ma was used as maximum bound [48,49].(iii) The Cyprinidae-Xenocyprididae split had a minimum bound of 18.86 Ma and maximum of 22.14 Ma [50].

(f) Estimation of Ks Value and Date of Whole Genome Duplication
For the four species of Schizothorax and S. grahami, each pair of paralogs in the homologous gene clusters were aligned with MUSCLE v3.8.31 [41].Then, the levels of synonymous substitutions (Ks) between the pairs of paralogs were estimated by using PAML.Median Ks values were used for clusters with more than two homologous genes [51].The distribution of Ks values of each species was obtained by combining the Ks values of all homologous gene-clusters.The time of WGD was estimated using the formula: T=Ks/2r, where r represented the substitution rate and given values ranging from 5.7 to 6.4 × 10 - 9 substitutions per site per year among shes [18].
(g) DNA Extraction, PCR Ampli cation, Cloning, Sequencing, and Analysis for HOX Genes Total genomic DNA was isolated from frozen or ethanol-xed tissue samples using the standard phenolchloroform procedure.The concentration and quality of DNA were assessed by agarose gel electrophoresis and then the samples were stored at 4℃.Ten HOX genes were ampli ed from total DNA extracts via polymerase chain reaction (PCR) using published and/or optimized primers (Supplementary Table S5).Reaction mixtures contained approximately 25-50 ng of DNA template, 1x PCR buffer solution, 0.15 mM MgCl 2 (TaKaRa), 0.25 mM dNTPs (TaKaRa) and 1U Taq DNA polymerase, in a nal volume 50 μL.PCR included an initial denaturation step at 95℃ for 5 min, followed 33 cycles consisting of 94℃ for 30 sec, annealing for 30 sec at from 65℃ to 58℃, extension for 60 sec at 72℃, and a nal elongation step at 72℃ for 7 min.PCR always included negative controls.Several independent PCR products were combined before puri cation to assure presence of DNA ampli ed for each allele and each copy in the duplicated genome.All DNA was fractionated by agarose gel extraction kit (Huashun, Shanghai, China) and then cloned into a pMD19-T vector (TaKaRa) following each manufacturer's protocols.After each vector was transformed, culturing was performed in LB solid medium.To increase the probability of detecting duplicated paralogs, 16-36 clones of each species were sequenced (Supplementary Table S6).Plasmid DNAs of 10 HOX genes were directly extracted and sequenced on a 3730 DNA Analyzer (Applied Biosystems, Foster City, CA).

Figures Figure 1
Figures

Figure 3 Seven
Figure 3

1 )
The sample collection and sequencing were supported by the National Natural Science Foundation of China (91631305, 91331105, 31760306 and U1902201).Experiments and reagent supply were funded by National Postdoctoral Program for Innovative Talents (BX201600130) and China Postdoctoral Science Foundation (2017M613015).The cost of data analysis was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB13000000), the National Natural Science Foundation of China (91631305, 91331105, 31760306 and U1902201).53.Kumar S, Nei M, Dudley JT, Tamura K: MEGA: A biologist-centric software for evolutionary analysis of DNA and protein sequences.Brie ngs Bioinf 2008, 9(4):299-306.

Table 1
The classi cation of copy clades from all the gene trees