Taxonomy and Taxon sampling
We followed Wilson and Mittermeier (2018)7 for the taxonomy of extant erinaceids except for Hylomys megalotis which was recognized as in its own genus, namely Otohylomys 19 (Table S1). We sampled 29 tissues/specimens representing 18 out of the 26 recognized erinaceid species using museum destructive samples and newly collected fresh tissue samples (Table S2). Among these, Mesechinus wangi, Paraechinus nudiventris and Paraechinus microps were sequenced for the first time. A tissue sample of Podogymnura truei was loaned from the Field Museum of Natural History. Destructive sampling using museum specimens from the USNM was approved by the Department of Vertebrate Zoology; Division of Mammals (Invoice No. 2076134), National Museum of Natural history, Smithsonian Institution.
Dna Extraction, Mitogenome Enrichment And Sequencing
The total DNA was extracted using 10% sodium dodecyl sulphate (SDS) with proteinase K, and purified using the phenol/chloroform approach76. For DNA extracted from fresh tissue samples, we sheared the DNA into small fragments using NEBNext dsDNA Fragmentise (New England Biolabs, Canada). This step was skipped for DNA extracted from museum samples. We constructed barcoded DNA libraries using a NEBNext Fast DNA Library Prep Set (New England Biolabs, Canada) with barcode adapters (NEXTflex DNA Barcodes, BIOO Scientific, USA). We purified the libraries using magnetic beads, size-selected using a 2% E-gel on an E-Gel electrophoresis system (Thermo Fisher Scientific, Canada), and re-amplified using a NEBNext High-Fidelity 2X PCR Master Mix (New England Biolabs, Canada).
We enriched mitochondrial genes using a capture-hybridization protocol following He et al. (2018)20. In brief, we first prepared biotin-linked DNA probes using long-range PCR amplicons. We hybridized our DNA libraries with probes at 65°C for 24–48 hours. We reamplified the enriched libraries before purification and concentration measurement using a Qubit Fluorometric Quantitation (Thermo Fisher Scientific, Canada). We ran the sequencing using a 316-chip on an Ion Torrent Personal Genome Machine (PGM). All laboratory work was conducted in China except for P. nudiventris which was conducted separately in India.
Assembly, Annotation And Alignment
We trimmed the reads using Trimmomatic v0.3621 to clean adaptor sequences at both ends, and thereafter corrected sequence error using BFC22. We assembled mitogenomes using SPAdes v3.1123 and Geneious R1124, individually. For SPAdes assembly, we mapped the reads to a set of reference mitogenomes representing Erinaceidae species downloaded from GenBank using mrsFAST v3.3.025. The mapped reads were assembled de novo using SPAdes with untrust contigs to reference as recommended by the manual. The assembled contigs were used as new references to which the reads were mapped for another run of mapping. We repeated the processes 3–20 times until all mitochondrial genes were recovered. We also assembled the sequences using Geneious R11 as described before13. In brief, the reads were mapped to reference genomes using the Geneious assembler iteratively up to 25 times. The results generated from the two assemblers were aligned using MAFFT and mismatch regions were checked carefully by eye. We annotated the mitogenomes using MITOS with default parameters26.
For mitogenome alignment, we first downloaded all available erinaceid mitogenomes from GenBank and selected Solenodon paradoxus, Crocidura russula and Uropsilus gracilis as outgroups due to their close relationships with Erinaceidae (Table S2). The two ribosomal RNA genes and 13 protein-coding genes were aligned using MAFFT v.727 as implemented in Geneious. The protein-coding genes (PCGs) were carefully checked by eye and manually corrected premature stop codons.
Partitioned Bremer Support, Rdp5 And Saturation Analysis
Cross-contamination likely occurred during the DNA extraction and library preparation28. To minimize contaminated/unreliable sequences in our final alignment, we adopted a two-step strategy by calculating Partitioned Bremer supports (PBS) and detecting recombination using all 37 erinaceid mitogenomes and a corresponding mitogenome RAxML tree. Firstly, we calculated PBS for all genes on each node to detect the signal of recombination using PAUP v.4.0a16829 and a Tcl script30. A positive PBS indicates that a given gene supports a particular node on the tree, and a negative PBS indicates a gene favours an alternative relationship31. When we observed a negative PBS < -5 for a given gene at any node of the tree, we conducted a recombination detection (RDP) using RDP V.532. RDP detects whether a potential recombination breakpoint exists. Because of the maternal inheritance of the mammals' mitochondria, a “recombination” event suggests that one of the two parent fragments may be contaminated. RDP estimated UPGMA trees using the two fragments on both sides of a breakpoint (i.e., parent sequences), and we masked the parent sequences if the UPGMA tree violet our knowledge of the evolutionary history (Table S3). Finally, we estimated the PBS again using the masked alignment.
We examined the saturation of the coding genes, and for the 1st codon, 2nd codon and 3rd codon positions individually using DAMBE v.7.3.533 with an information entropy-based index of substitution saturation34. Then we plotted the transition versus transversion with F84 and GTR parameters.
Molecular Phylogenies Using Mitogenome And Supermatrix Data
We conducted molecular phylogenetic analyses using i) the complete mitogenomes (including only the two rRNAs and 12 coding genes on the heavy chain, and excluded ND6 on the light chain), and ii) a gene supermatrix. The complete mitogenome data set (MtG hereafter) comprises 40 mitogenomes (including three outgroups). We further included mitochondrial genes (cytochrome b, NADH dehydrogenase subunit 2 and 12S ribosomal RNA) sequences from GenBank to generate a sequence matrix using Geneious R11. This data set (supermatrix) comprises 132 erinaceid individuals.
We performed phylogenetic relationship analyses based on maximum-likelihood (ML) using RAxML-HPC v.835 and Bayesian inference (BI) using MrBayes v.3.2.7a36. The best partitioning scheme and associated models were estimated using PartitionFinder v.2.1.137 for each data set for RAxML and MrBayes individually. The results are given in Table S8. RAxML analyses were performed on XSEDE via the CIPRES web portal (http://www.phylo.org). Solenodon paradoxus was used to root the tree. We selected GTR + Gamma model in both trees search and bootstrapping phases. We didn't use the BFGS searching algorithm. We ran a rapid bootstrapping (-x) with recommended parameter 'autoMRE’ to let the program halt bootstrapping automatically and search for the best-scoring ML tree (-f a). We ran MrBayes using four simultaneous Markov chain Monte Carlo (MCMC) chains including one cold chain and three hot chains, a temperature of 0.065, and runs of 5 million generations. Trees were sampled every 5000 generations and the burn-in fraction was set as 30%. We repeated the analyses twice to ensure convergence on the same states.
Integrative Analyses On Dna And Morphological Data
We further used morphology-gene integrated data to estimate phylogenetic positions for living and fossil species that have no available genetic information. A morphological data matrix was generated by Gould38,39,40 representing 19 living and 22 fossil species (19 living and 11 fossil species used in this study), and morphological data for M. hughi was included in He et al. (2012)13. We selected a subset of 53 individuals representing all distinct lineages and integrated it with morphological data for 26 living species (supplementary alignment1) to estimate the phylogenetic positions of three extant species (i.e., Atelerix sclateri, Hemiechinus collaris, Podogymnura aureospinula) that has no available DNA sequences. We performed Bayesian analyses in MrBayes. The partitioning strategy and other parameters were set as above.
We further included 11 fossil taxa provided by Gould, (1995)38 (supplementary alignment2). The other 11 fossil taxa were not included because they were represented by too few characters (i.e., characters < 25), and because removing these “rogue taxa” could improve phylogenetic accuracy44. We ran MrBayes as mentioned above but constrained the following topologies: the monophyly of Erinaceidae, Galericinae, Brachyericinae, Erinaceinae as well as the monophyly of Deinogalerix, Galerix and Gymnnurechinnus. A brief taxonomic revision and systematic assignments for each fossil was fully explained in Table S7.
Divergence Time Estimation
We estimated divergence times using the MCMCTree program in PAML 4.9i45 and BEAST v.2.6.646. For BEAST analyses, we partitioned the data based on the result of PartitionFinder as mentioned above (Table S8). The clock model and time tree were linked across partitions and a birth-death model was used for the tree prior. We performed MCMC runs of 50 million generations and the trees were sampled every 5000 generations. The two calibrations were set as followed: we used a lognormal prior distribution (mean 3.61, SD 0.142, offset 0; mean 4.27, SD 0.084, offset 0 respectively), rendering the time distribution covering the fossil age. We used Tracer v.1.7 to estimate the posterior distribution of each parameter in the log file to ensure that analyses reached stationary states. Then we used TreeAnnotator (BEAST package) to summarize the sampled trees into a maximum clade credibility tree with mean heights parameters. Analyses were conducted twice.
For MCMCTree analyses, we adjusted the time unit to 100 Ma. The ML tree topology with two calibrations prior was used as an input tree file (supplement timetree1 file). Then MCMCTree calculated the tree's branch length, the gradient vector and the Hessian under the model GTR + G (model = 7). We used a lognormal clock model and set the mean substitution rate (rgene_gamma) and the rate drift parameter (sigma2_gamma) as G (2 2000) and G (1, 4.5) respectively. The birth-death parameters were set as 0.01, 0.01, and 0.1. Each run discarded the first 50000 generations as burn-in and sampled every 50 iterations until it reached 5 million. Similarly, we used Tracer v.1.7 to examine the stationary state and convergence of each run. Analyses were conducted twice.
We firstly used two fossil-based calibrations. One is based on the Adunator from the Torrejonian between 61.1 Ma and 84.2 Ma6. This is the oldest known fossil of erinaceomorph and was considered as on the stem Erinaceomorpha47. The second calibration was according to fossils Neurogymnurus indricotherii and Palaeoscaptor acridens both from the early Oligocene48, 49. N. indricotherii was the earliest unquestioned Galericine from western Kazakhstan in Asia. P. acridens from central Mongolia was devoted to the subfamily Erinaceinae49. We set the most recent common ancestor (MRCA) of the two subfamilies to 28.3 Ma − 48.8Ma referring to Meredith et al., (2011)6.
First, we ran BEAST and MCMCTree using the MtG data set. The estimated MRCA of both subfamilies were strikingly older than that estimated in previous studies13, 19 (see Result and Table S5). We further tested whether it was because of substitution saturation by conducting substitution saturation plots using DAMBE v7 for the 12 coding genes as well as for each codon position (Fig S16). We extracted the 2nd codon position which was not saturated (data set MtG2nd) following Zheng et al. (2011)51. The estimated divergence times using MtG2nd were not obviously different from that using MtG, we further included available nuclear genes from He et al., (2021)50 for erinaceids and outgroups. These included 17 nuclear gene fragments (~ 39443 bp) for six erinaceids and three outgroup taxa. Because this analysis still resulted in similar estimations for divergence times, we included two additional secondary calibrations on the MRCA of Erinaceinae and Galericinae in He et al. (2021)50. While using BEAST we added a MRCA prior with a normal distribution (mean 17.71, sigma 2.05, offset 0; mean 6.97, sigma 2.05, offset 0 respectively). And using MCMCTree we added two extra calibration priors to the input tree file (supplement timetree2 file).