2.1. Taxon Sampling
Ingroup sampling consisted of 133 specimens representing 20 of the 25 valid Distichodus species, thereby encompassing 80% of Distichodus currently recognized diversity (Table 1). Distichodus brevipinnis, D. langi, D. mossambicus, D. rufigiensis, and the newly described D. ingae [9], were not included in analyses due to unavailability of tissues. With the exception of D. altus, D. nefasch, D. rostratus, and D. petersii, for which only a single tissue sample was available, multiple individuals per species were sequenced to sample as large a portion of each species’ range as possible (Table 1). In addition to increasing geographic coverage, inclusion of multiple individuals per species allowed for testing the monophyletic status—and therefore species limits under the phylogenetic species concept [18, 19]––of nominal species from which more than one individual was available for sequencing. Sampling of multiple individuals per species, however, was not aimed at making inferences about tokogenetic (intraspecific) relationships and/or phylogeographic patterns. Paradistichodusdimidiatus was included as outgroup based on the findings from a relatively recent molecular phylogenetic study that investigated relationships of the Distichodontidae [15], which resolved the monotypic genus Paradistichodus as the sister group of Distichodus. Similarly, Nannocharax ansorgii was included as additional and outermost outgroup for molecular dating and inference of geographic range evolution analyses.
Most tissue samples were obtained from specimens collected during recent expeditions in West and West-Central Africa by a research team from the American Museum of Natural History (AMNH) (led by co-author MLJS). Specimens were handled and euthanized prior to preservation in accordance with recommended guidelines for the use of fishes in research [20] and stress was ameliorated by minimizing handling and through the use of the anesthetic Tricaine mesylate (MS-222) for euthanasia. Tissue samples were taken in the field and immediately preserved in 95% ethanol. Voucher specimens were fixed in formalin and subsequently transferred to 70% ethanol for long-term storage. Data for specimens cataloged and stored in the ichthyology collection of the AMNH, are available online at http://sci-web-001.amnh.org/db/emuwebamnh/index.php.
Specimen collection was made in accordance with ethical and legal guidelines for international animal research approved by the AMNH Institutional Animal Care and Use Committee (IACUC) (approval #36/06). The AMNH IACUC has guidelines relating to studies involving its members in different countries, and this study conforms to those guidelines. Specimen collection and exportation of samples used in this study follow institutional and national ethical and legal guidelines of the Ministry of Fishery and Aquaculture, Republic of Guinea, No. 65/MPA/DGAGSP/11; the Ministry of Scientific Research and Technical Innovation, Republic of Congo, No. 031/MRSIT/DGRST/GERBID.06.13; and the Ministry of Agriculture and Fisheries, Democratic Republic of Congo, No. 037/DP/SG/AGRIPEL/16.
Additional samples were obtained from colleagues at the Cornell University Museum of Vertebrates (CUMV), the Royal Museum for Central Africa (MRAC), and the South African Institute for Aquatic Biodiversity (SAIAB). Voucher specimens are deposited in the ichthyology collections of the AMNH, CUMV, MRAC, and SAIAB. Species identity of non-AMNH vouchers was confirmed either by direct examination of loaned specimens, photographs provided, or on taxonomic authority of the loaning institution. Voucher catalog numbers and GenBank accession numbers for the gene sequences generated and included in this study are listed in Table 1.
2.2. Gene sampling and nucleotide data collection
Eight gene fragments, including the seven protein-coding loci sampled by Arroyave et al. [15] to address distichodontid interrelationships (co1, cytb, enc1, glyt, myh6, nd2, and sh3px3) were sequenced. Additionally, a faster-evolving mitochondrial non-coding marker, control region (cr), was added to address more recent divergences within the genus. DNA sequence data was generated from a total of 133 Distichodus individuals. General procedures for DNA extraction, amplification, and purification, along with primers and thermal profiles for sequencing the protein-coding genes used in this study follow Arroyave and Stiassny [21] and Arroyave et al. [15]. Distichodus-specific primers for cr (cr_Dist_f: 5’-AGCGCCGGTCTTGTAATCCG-3’; cr_Dist_r: 5’-TGCTTGTGGAACTTTCTAGGGTCCAT-3’) were designed using the software Primer3 [22] from conserved flanking regions of aligned mtDNA control region sequences extracted from the two distichodontid complete mitochondrial genomes available in GenBank (Distichodus sexfasciatus AB070242 and Ichthyborus sp. AP011993). Amplification of cr via PCR was carried out using the following thermal profile: 5-min initial denaturation at 95°C, followed by 35 cycles of denaturation at 95°C for 60 s, annealing at 58°C for 60 s, and extension at 72°C for 120 s, followed by a 10-min final extension at 72°C.
2.3. Sequence editing and partitioning scheme/substitution model selection
Contig assembly and sequence editing was performed using Geneious v.11.0.2 [23]. IUPAC nucleotide ambiguity codes were used to represent heterozygous sites. The resulting sequences were trimmed to exclude primer regions and examined for appropriateness/homology using BLASTx [24]. Each gene was aligned using MUSCLE [25] under default parameters as implemented in Geneious, followed by concatenation of individual alignments. All sequences were checked for stop codons and for miscalled amino acids by examining translation alignments.
Best-fit partitioning schemes and models of molecular evolution for the nucleotide data were determined using PartitionFinder2 [26] based on 22 pre-defined data blocks: the non-coding mtDNA control region (1 block) plus the 1st, 2nd, and 3rd codon positions of the seven protein-coding genes (3 positions x 7 genes). The PartitionFinder2 greedy algorithm was employed to search for an optimal scheme under the assumption of independent model parameters and branch lengths for each partition. Selection of the partitioning scheme and models over the set of schemes and models produced during greedy search was accomplished using the Schwarz/Bayesian Information Criterion (BIC) [27].
2.4. Phylogenetic, biogeographic, and chronological analyses
Various analytical approaches were employed to infer phylogenetic relationships in Distichodus from the multilocus dataset generated in this study, one of which also simultaneously estimates absolute times of divergence in the resultant phylogeny. The results from the latter approach were subsequently used in analyses for testing historical biogeographic hypotheses of geographic range evolution in Distichodus.
2.4.1. Maximum Likelihood (ML) estimation of phylogeny
Phylogenetic analysis of the concatenated alignment of the eight sampled genes under a Total Evidence/Simultaneous Analysis [28, 29] approach was performed using the ML optimality criterion. Furthermore, to examine the degree of variation in topology, resolution, and clade support among the individual sampled loci, and to complement the inferences made from the simultaneous analysis of all markers, each of the nuclear genes (enc1, glyt, myh6, sh3px3) and a concatenated alignment of the mitochondrial genes (co1, cr, cytb, nd2; effectively inherited as a single locus), were independently analyzed, also using the ML optimality criterion. ML phylogenetic analyses were conducted with RAxML v.8 [30] through the CIPRES Science Gateway v.3.3 [31] as a single partition under the GTRGAMMA model with four rate classes using full ML optimization for the tree search and 1000 rapid bootstrap (BS) searches to assess nodal support [32].
2.4.2. Species-tree approaches
Although concatenation methods have been suggested to often perform well when incomplete lineage sorting (ILS) levels are low [33], the degree of ILS in Distochodus is unknown. To explore the outcomes of ILS-aware species-tree analyses relative to concatenation, both SVDquartets [34] and ASTRAL-III [35] were employed. SVDquartets has been suggested to perform well with low ILS and small numbers of sites per gene, and ASTRAL methods have been suggested to perform well under high ILS conditions, but may be sensitive to small numbers of sites per gene [33]. SVDquartets analysis was conducted in PAUP* v4.0a164 [36] sampling all ~8.6 million quartets under the multispecies coalescent on the full dataset, using the default QFM quartet assembly method. Bootstrap support values were assembled onto the SVDquartets tree using the sumtrees command in the DendroPy package [37]. Gene trees input to ASTRAL-III were estimated from best-fit codon models inferred in codonPhyML [38] under default search intensity, using custom R scripts written by the authors. Because the mitochondrial genome does not undergo recombination and is inherited as a single locus, the three protein-coding mitochondrial genes were fit with a single codon model and inferred gene tree. Gene trees for each autosomal locus were inferred separately.
2.4.3. Bayesian co-estimation of phylogeny and divergence times
Prior to co-estimation of phylogeny and divergence times, a new data matrix was created from the original multi-individual, multi-locus matrix, by including DNA sequence data from only a single individual per species, from or near the type locality whenever possible (for each sampled species, the first individual listed in Table 1). The resulting reduced matrix was analyzed in BEAST v.2.5.0 [39] under the optimal partitioning scheme and substitution models suggested by the PartitionFinder2 analysis. Node ages were estimated using a Bayesian relaxed-clock method [40] under the uncorrelated lognormal (UCLN) rate variation model, and assuming a birth-death process prior for topology and divergence times. By default, the prior on the mean parameter of the UCLN clock model (ucldMean.c) is a uniform distribution on the interval (0,¥), which is an uninformative and improper prior (it does not integrate to 1). Although improper priors can sometimes lead to proper posterior distributions, they may also have undesired effects and cause problems with mixing and convergence [41]. Based on previous findings regarding substitution rates in Distichodus [15], we assumed a log-normally distributed prior for the clock rate (ucldMean.c) with hyperparameters μ=0.003 and σ=0.5. On the other hand, the standard deviation parameter of the UCLN clock model (ucldStdev.c) is by default assigned a gamma distribution prior. Variation in substitution rates among branches in Distichodus, however, appears to be low in general [15]. Accordingly, we assumed an exponential prior distribution with 95% of the probability density on values < 1 for the standard deviation of the UCLN (ucldStdev.c).
The molecular clock was calibrated based on early Miocene (ca. 18 Ma) fossilized dentition attributable to Distichodus recovered from deposits of the Maradah Formation in Jabal Zaltan, Libya, by far the oldest fossil unambiguously assignable to the genus [42]. In fact, this fossil pushes back the first known appearance of Distichodus in the fossil record by 10 Ma with respect to the Distichodus calibration fossil used by Arroyave et al. [15] to infer a time-scaled phylogeny of citharinoid fishes. Although the Maradah fossil is unquestionably diagnostic of Distichodus (tall, slender necked tooth with a bifid apex bearing characteristically short and rounded lobes) and could potentially be ascribed to either Distichodus nefasch or D. rostratus on the basis of size and geographic distribution, its exact phylogenetic placement is unknown. The absence of relevant comparative morphological data in a phylogenetic context to which to integrate the fossil taxon, coupled with its fragmentary nature, renders it difficult to confidently assign it to a particular node and to determine whether it should be used to constrain the age of the stem or the crown group of the calibration node. Because of this phylogenetic uncertainty, along with the challenge of objectively establishing a maximum age constraint to the calibration node, we conducted a series of analyses (Table 2) to assess the robustness of node ages to analytical ambiguity and to offer alternative output scenarios based on a variety of reasonable input parameters, particularly with respect to the phylogenetic placement of the calibration node and its maximum age constraint. Specifically, we used three alternative calibration nodes: 1) MRCA of Distichodus and Paradistichodus (D+P), 2) MRCA of Distichodus (D), and 3) MRCA of D. nefasch and D. rostratus (Dne+Dro). The rationale behind this proposal is that, at the very least, the calibration fossil could be used to constrain the age of divergence between Distichodus and its sister group, Paradistichodus, but under more liberal phylogenetic designations, it could also be used to constrain the age of the entire genus or even the divergence between the species D. nefasch and D. rostratus. Furthermore, each calibration node was constrained both as stem and as crown group. Additionally, the temporal uncertainty of calibration nodes was modeled using log-normally distributed priors with a hard minimum bound set by the age of the fossil (18 Ma) and one of three alternative 95th percentile soft maximum bounds (P95 SMBs): 20, 30, and 40 Ma (Fig. 2; Table 2). The combinatorial exercise of choosing one of three alternative calibration nodes, constrained as stem or crown, and modeled by a log-normally distributed prior characterized by one of three alternative P95 SMBs, resulted in 18 different analyses (although effectively 15 since the node representing the MRCA of Distichodus as stem is equivalent to the node representing the MRCA of Distichodus and Paradistichodus as crown (Table 2). In each analysis, root age was indirectly constrained (as an implied prior) by the combined effects of the calibration prior on other internal node and the prior for topology and divergence times (birth-death process).
BEAST2 analyses were implemented using the Markov Chain Monte Carlo algorithm (MCMC) run for 50 million generations sampled every 1000 generations, under default proposal mechanisms and default priors for the parameters of the birth-death branching process used to provide the prior distribution for the non-calibration nodes (speciation and extinction rates) and the model of molecular evolution for each gene (substitution rates, base frequencies, gamma shape, and proportion of invariant sites). Convergence model parameter estimates were assessed via ESS values over 200, using Tracer v.1.7 [43]. Sufficient sampling of the estimate of the tree topology (ESS > 200) was determined by dividing the topological approximate ESS by the generation number of the approximate earliest stationary value in the topological autocorrelation plot, generated in the R package rwty [44]. Further assessment of MCMC convergence was undertaken by examination of the average standard deviation of split frequencies, with values << 0.01 taken as indicative of stationarity. All analyses used a 10% burn-in. A maximum clade credibility (MCC) topology was inferred using TreeAnnotator v.2.5 [39], resulting in a chronogram indicating posterior probabilities (PP) and mean ages of all nodes with their associated 95% highest posterior density (HPD) intervals.
2.4.4. Inference of geographic range evolution
The evolution of geographic ranges in Distichodus was investigated using the null-range-excluded dispersal-extinction-cladogenesis model (DEC*) [45], a modified version of the original likelihood-based dispersal-extinction-cladogenesis (DEC) model [46, 47]. The set of discrete geographic areas for the DEC* analysis consisted of the six Afrotropical ichthyofaunal provinces of Roberts [48] (modified by Lévêque [49]) with presence of Distichodus species: Congo Basin (CB), Zambezi (Z), Nilo-Sudan (NS), Upper Guinea (UG), Lower Guinea (LG), and East Coast (EC) (Fig. 1). African ichthyofaunal provinces were delimited on the basis of current and historical patterns of drainage connectivity and the composition of the fish fauna, and therefore represent regions with a distinctive evolutionary history and a more or less characteristic biota at the species and higher taxonomic levels [48, 49]. To assess the relative fits of alternative models of faunal assemblage in the Congo Basin, three variants of the DEC* model were fit to the data in the BioGeoBEARS R package [50], following the parameterization of dispersal multipliers from Day et al. [51]: M0, an unconstrained multiplier matrix allowing for dispersal to and from the Congo Basin; M1, an asymmetric multiplier matrix allowing only dispersal out of the Congo Basin (CB-as-source); M2, an asymmetric multiplier matrix allowing only dispersal into the Congo Basin (CB-as-sink). Tip-state ranges were assigned based on the presence of species in different ichthyofaunal provinces. In several cases, species spanned multiple provinces. The maximum range size was set to widespread (all six ichthyofaunal provinces). Given the high dimensionality of the transition matrix resulting from the combination of different provinces (areas) into ranges of sizes up to six, relative to the size of the dataset, 14 disjunct ranges of differing sizes were pruned from analysis, reducing the dimensionality of the matrix from 64 x 64 to 50 x 50. To assess the stability of numerical optimization, analysis was run five times from fresh R sessions. Model fits of the M0, M1, and M2 variants were compared using the Akaike information criterion [52] and supports were assessed using Akaike weights [53]. In an effort to take account of chronological uncertainty due to alternative molecular clock calibration scenarios, inference of geographic range evolution in Distichodus was conducted on three of the 15 time-scaled phylogenies previously inferred with BEAST2, namely the chronograms resulting from analyses based on each alternative calibration node constrained as crown and by a relatively moderate soft maximum bound (P95 SMB=30 Ma) (analyses 5, 8, and 14 in Table 2).