(a) Samples and downloaded data
One species each of Acrossocheilinae, Poropuntiinae and Torinae, four species of Cyprininae and Schizothoracinae, three species of Labeoninae, and two species of Schizopygopsinae were sampled. The detailed information on ploidy/genome size and sampling location were listed in Supplementary Table S1–3. Voucher specimens were deposited in the State Key Laboratory for Conservation and Utilization of Bio-resources, Yunnan University.
Mitochondria genomes from 32 species were downloaded from NCBI, including for one species of Acrossocheilinae, Smiliogastrinae and Torinae, two species of Barbinae, eight species of Cyprininae, five species of Labeoninae and Schizothoracinae, four species of Poropuntiinae and Schizopygopsinae, and outgroups including Danio rerio and Ctenopharyngodon idella (Supplementary Table S4). Genome sequences from five species of Danio rerio, Carassius auratus, Ctenopharyngodon idella, Cyprinus carpio, and Sinocyclocheilus graham, and RNA-seq data from Gymnodiptychus pachycheilus and Labeo rohita were also downloaded from NCBI (Supplementary Table S2).
(b) Ploidy Confirmation, mRNA Isolation from Liver and Sequencing
Flow cytometry was used to measure the DNA content of blood cells for all four individuals of Schizothoracinae. Ploidy information for Poropuntius huangchuchieni was taken from Zheng et al. [33] (Supplementary Table S2).
Livers were excised carefully from one individual of P. huangchuchieni, four species of Schizothorax and one Oxygymnocypris stewartii. Samples were stored in RNALater (Ambion, Austin, TX, USA) at –80°C. RNA was extracted from all six samples following instructions for RNeasy Mini Kit (Qiagen, Chatsworth, CA, USA). After isolating mRNA, preparation of the transcriptome library was started with 10μg of total RNA from each sample using an mRNA-Seq Sample Prep Kit (Illumina Inc., San Diego, CA, USA). After purifying and fragmenting the mRNA, first strand cDNA was used to synthesize the second strand. Following end repair and adding “A” bases, the adaptors were ligated to the ends of the cDNA fragments. After fragment size-selection on gels, suitable cDNA fragments were enriched by PCR. A MinElute PCR Purification Kit (Qiagen, Chatsworth, CA, USA) was used to purify the products at all steps. Finally, quality control and quantification of the library was assessed using Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, Calif., USA) and Qubit (v1.0, Invitrogen, Carlsbad, CA, USA) qPCR. Clusters were generated in a cluster station system and sequenced using the HiSeq 2000. The library was denatured and hybridized to a flowcell. Captured DNA was used as a template for second strand synthesis and amplified into a clonal cluster. Subsequently, clusters were linearized, blocked and hybridized with a sequencing primer, which provided a site for sequencing by synthesis. Finally, the flowcell was transferred to and sequenced by the HiSeq 2000 platform.
(c) RNA-seq Data Quality Control and Transcriptome de novo Assembly
Raw reads of downloaded and our sequencing were filtered by excluding the adaptor sequence, the reads with more than 10% unknown nucleotides and low quality reads (more than 50% bases with quality value less than 10), by using in-house scripts.
After filtering, Trinity (v2.3.4) [34] was used for the de novo assembly of transcriptomes with parameters set to “--min_kmer_cov 2 --min_glue 3”. The results were clustered into unigenes using TGICL (https://sourceforge.net/projects/tgicl/) [35] with default parameters, and unigenes were compared to 13 mitochondrial protein-coding genes of seven species downloaded from NCBI for evaluating the quality of the assemblies. The candidate coding regions of unigenes were identified by TransDecoder (v3.0.0) (https://transdecoder.github.io/) [34] with parameters set to “--retain_long_orfs 150 --single_best_orf”. To further maximize sensitivity for capturing open reading frames (ORFs) that may have functional significance, all unigenes were scanned for all possible ORFs to identify homologies to known proteins in the UniProt database (http://www.uniprot.org/) and common protein domains in the Pfam database (http://pfam.xfam.org/).
(d) Definition of Homologous Gene Custers and Construction of Gene-trees
First, all–against–all protein sequence alignment were performed with BLASTP (v2.5.0) [36] using the protein sequences annotated by available genomesand ORFs annotated by TransDecoder (v3.0.0). Only the proteins with alignment length over 60 bp and identity more than 50% were retained. Next, homologous genes were clustered using OrthoMCL (v2.0.9) [37]. Most important, homologous gene-clusters were selected as follows: 1) haging at least one marker of outgroup (D. rerio or C. idella); 2) at least one species of Schizothorax with two paralogs; and 3) either Cy. carpio or Ca. auratus with two paralogs. Subsequently, gene-trees based on homologous gene clusters were constructed by PhyML (v3.1) [38, 39], while employing both D. rerio and C. idella as outgroup taxa. Lastly, homologous gene-trees were filtered as follow: 1) having at least one marker of outgroup (D. rerio or C. idella); 2) at least one species of Schizothorax with two paralogs; 3) either Cy. carpio or Ca. auratus with two paralogs from different origins; and 4) bootstrap value of key node was more than 50 (Fig. S1). Setting the conditions for filtering the homologous gene clusters and the homologous gene trees involved the following: 1) either D. rerio or C. idella should have one marker; 2) to identify the maternal and parental origins of Cyprininae, Cy. carpio or Ca. auratus, or Cy. carpio or Ca. auratus should have two homologous gene occurring on different branches in the gene-tree; and 3) all species of Schizothorax are polyploids, but their origins are unknown. We select at least one species of Schizothorax with two paralogs to explore the origin of Schizothorax.
Eventually, in total, 136 gene-trees were reserved. These included the outgroup and at least one species of Schizothorax with two copies, clear divergence of maternal and paternal copies in Cyprininae, and markers of L. rohita and P. huangchuchieni, which served to distinguish maternal and paternal copies.
(e) Construction of Matrilineal Genealogy and Divergence Time Estimation
The mitochondria DNA of P. huangchuchieni were locally assembled using the transcriptomic data with Trinity (v2.3.4) [40]. Next, 13 contigs were obtained and connected to a pseudo-mitochondria genome using the mitochondrial DNA of Barbonymus gonionotus as reference. Published mitogenome sequences for 32 species and our data were aligned using MUSCLE (v3.8.31)[41]. Then, unreliably aligned regions were removed by using Gblocks (v0.91b) [42] with parameters set to “-b1=4 -b2=5 -b5=a -t=d”. Subsequently, the retained data were used to construct phylogenetic trees. The maximum likelihood (ML) tree was constructed using PHyML (v3.1) [38, 39] with the GTRGAMMAIG model and 1,000 random sequence addition replicates. Bayesian trees were constructed by using MrBayes (v3.2.2) [43] incorporating optimized models of sequence evolution selected by jModelTest2 [44, 45]. MCMC analyses were run for 5×107 generations, with sampling every 1,000th generation and discarding the first 1.25×107 generations (12,500 samples) as burn-in. The standard deviation of split frequencies was 0.007550.
The Bayesian MCMC divergence time was computed by the mcmctree program in PAML v4.8 [46] using the mitochondrial DNA sequences and the substitution model GTR. Analyses used the relaxed clock with an uncorrelated log-normal distribution. The Birth-Death model served as the tree prior. The first 50,000 iterations were discarded as burn-in and every 2,000th iteration was sampled until obtaining 50,000 samples. Effective sample sizes estimated with Tracer (v1.6) ( http://tree.bio.ed.ac.uk/software/tracer/) exceeded 200.
Three calibration points were used. (i) For the age of root of Xenocyprididae, Danionidae and Cyprinidae, the oldest reliable fossil record of 55.8 Ma was used as the maximum bound [47]. (ii) The earliest fossil specimens of Labeoninae were from the Early Miocene, 11.63 Ma, and this was used as the minimum bound for its node and 16 Ma was used as maximum bound [48, 49]. (iii) The Cyprinidae-Xenocyprididae split had a minimum bound of 18.86 Ma and maximum of 22.14 Ma [50].
(f) Estimation of Ks Value and Date of Whole Genome Duplication
For the four species of Schizothorax and S. grahami, each pair of paralogs in the homologous gene clusters were aligned with MUSCLE v3.8.31 [41]. Then, the levels of synonymous substitutions (Ks) between the pairs of paralogs were estimated by using PAML. Median Ks values were used for clusters with more than two homologous genes [51]. The distribution of Ks values of each species was obtained by combining the Ks values of all homologous gene-clusters. The time of WGD was estimated using the formula: T=Ks/2r, where r represented the substitution rate and given values ranging from 5.7 to 6.4 × 10-9 substitutions per site per year among fishes [18].
(g) DNA Extraction, PCR Amplification, Cloning, Sequencing, and Analysis for HOX Genes
Total genomic DNA was isolated from frozen or ethanol-fixed tissue samples using the standard phenol-chloroform procedure. The concentration and quality of DNA were assessed by agarose gel electrophoresis and then the samples were stored at 4℃. Ten HOX genes were amplified from total DNA extracts via polymerase chain reaction (PCR) using published and/or optimized primers (Supplementary Table S5). Reaction mixtures contained approximately 25–50 ng of DNA template, 1x PCR buffer solution, 0.15 mM MgCl2 (TaKaRa), 0.25 mM dNTPs (TaKaRa) and 1U Taq DNA polymerase, in a final volume 50 μL. PCR included an initial denaturation step at 95℃ for 5 min, followed 33 cycles consisting of 94℃ for 30 sec, annealing for 30 sec at from 65℃ to 58℃, extension for 60 sec at 72℃, and a final elongation step at 72℃ for 7 min. PCR always included negative controls. Several independent PCR products were combined before purification to assure presence of DNA amplified for each allele and each copy in the duplicated genome. All DNA was fractionated by agarose gel extraction kit (Huashun, Shanghai, China) and then cloned into a pMD19-T vector (TaKaRa) following each manufacturer’s protocols. After each vector was transformed, culturing was performed in LB solid medium. To increase the probability of detecting duplicated paralogs, 16–36 clones of each species were sequenced (Supplementary Table S6). Plasmid DNAs of 10 HOX genes were directly extracted and sequenced on a 3730 DNA Analyzer (Applied Biosystems, Foster City, CA).
All sequences were edited and initially aligned with DNASTAR Lasergene (v7.1.0) [52]. The sequences were checked manually adjusted when necessary. MEGA(v5.0) [53-55] was used to identify unique haplotypes in each species. A representative sequence of each haplotype was selected for further analysis. Phylogenetic reconstructions were performed by maximum likelihood (ML) using PHyML (v3.1) [38, 39], and Bayesian inference (BI) employing MrBayes (v3.2.2) [43]. In the BI analysis, four independent Markov Chain Monte Carlo (MCMC) chains were run simultaneously for 2,000,000 generations while sampling one tree per 100 replicates and the best-fitting model was found using jModelTest [56]. Two independent runs were conducted. The average standard deviation of split frequencies was required less than 0.01 and the posterior probabilities diagnostic for convergence of branch lengths (potential scale reduction factor) approached 1 after 50,000,000 generations. In the ML analysis, the best-fitting model of nucleotide substitution from jModelTest and 1,000 random sequence addition replicates were used to construct tree.