We used NGS to sequence and assemble the complete mitochondrial genomes of the Tanganyika sardines, S. tanganicae and L. miodon, and built a phylogeny of Clupeiformes using full mitochondrial sequences with a focus on the West and Central-African tribe Pellonulini. Based on these complete mitogenomes, we estimated the divergence time of the Tanganyika sardines to investigate the timing of their speciation in relation to the geology of Lake Tanganyika.
Conserved gene order in Clupeiformes
Generally, mitochondrial gene arrangements have remained stable for long evolutionary times, but rearrangements do occur in many lineages of both invertebrates and vertebrates. Small rearrangements of neighbouring genes, for example clusters of tRNA-genes, and non-coding regions are especially common (37–39). Several lineages of Actinopterygii are characterized by such rearrangements, but Clupeiformes is not one of them (35). Our gene order analysis confirmed the conserved arrangement of mitochondrial genes in this order, aside from one transposition of two tRNA genes (tRNA-Pro and tRNA-Thr) in Ilisha elongata.
Inconsistent tree topologies within and outside Pellonulini
Both of our phylogenies (ML and Bayesian) support a single common ancestor for all included pellonuline species and recover Ethmalosa fimbriata as their sister species with high statistical support, consistent with Lavoué et al. (21). This is in contrast with the results of Egan et al. (20) and Bloom & Lovejoy (19), neither of whom found good support for this sister-species relationship. None of the genera Microthrissa, Odaxothrissa, or Pellonula appeared monophyletic in our study, and the Tanganyika sardines rendered Potamothrissa paraphyletic. The position of Microthrissa differs in almost every study that has addressed it. According to Egan et al. (20), M. royauxi was more closely related to the Tanganyika sardines and Potamothrissa, while M. congica clustered with Pellonula and Odaxothrissa. In the study of Wilson et al. (17), two specimens of M. royauxi did not even cluster together. Bloom & Lovejoy (19), on the other hand, found both Microthrissa species more closely related to members of Pellonula and Odaxothrissa than to the clade including the Tanganyika sardines and Potamothrissa. In accordance, our analysis failed to recover the exact position of M. royauxi with high statistical support, but it did confirm that M. congica is more closely related to members of Pellonula and Odaxothrissa than to M. royauxi. The Lake Tanganyika sardines formed a well-supported clade nested within Potamothrissa, while M. congica clustered with members of Pellonula and Odaxothrissa. In the same clade, P. vorax and O. vittata were more closely related to each other than to P. leonensis or O. losera. Contrarily, our study is the first to place P. obtusirostris and P. acutirostris in the same clade. With the improved resolution resulting from our whole mitogenome approach, our study confirms E. fimbriata as sister species of Pellonulini and for the first time opposes monophyly of nearly all pellonuline genera.
The source of these inconsistent topologies is unclear, but could be related to smaller, more variable and partly incomplete gene datasets in previous studies. Egan et al. (20) included only the gene coding for CYT-B for most members of Pellonulini, including the LT sardines, and three nuclear genes and/or the 16S rRNA gene for others. Bloom & Lovejoy (19) did not include O. losera and P. acutirostris and used CYT-B and 16S rRNA genes for most, and two nuclear genes for two species, while Wilson et al. (17) used only mitochondrial genes (CYT-B, 16S rRNA, 12S rRNA). None of the previous studies included more than around 5 kbp of alignment data, except Lavoué et al. (21), who included all PCG, rRNA and tRNA sequences which amounted to slightly over 10kbp. Taxonomic coverage was also highly variable, ranging from 49 to 190 clupeoid species, the lowest number belonging to Wilson et al. (17), which also had the largest credibility interval but had the highest taxonomic coverage of Pellonulini. The Tanganyika sardines and other pellonulines were also missing from several of these studies. Although the inclusion of taxa with incomplete datasets can help to resolve phylogenies (40), there is a trade-off with increased risk of phylogenetic artefacts, and difficulty detecting multiple substitutions (41,42). In contrast, our study had the first nearly complete dataset for all taxa, thanks to the readily available mitochondrial genomes from many pellonuline and other clupeid species.
Morphological diversity is relatively low in representatives of Pellonulini compared to for example cichlids (15,43). In the FAO species catalogue of clupeoid fishes, several ambiguous identifications and uncertain species descriptions are mentioned. For instance, the distinction between O. losera and O. vittata is based solely on the number of gill rakers, which also varies with the age of the specimen, a common occurrence among clupeid fishes (43). In P. leonensis, there is also evidence for undescribed subspecies exhibiting characteristics of both P. leonensis and P. vorax, or even specimens belonging to Cynothrissa. It is thus not inconceivable that misidentifications have confounded past taxonomic studies, and that a taxonomic revision of these genera may be needed.
Outside of Pellonulini, we recovered most of the traditional families and subfamilies of Clupeiformes with high statistical support. The positions of the families with lower taxonomic coverage, including Pristigasteridae, Chirocentridae, Dussumieridae, remained unresolved, Clupeidae was not monophyletic and several species were also placed away from their congeners, for example in the genera Setipinna, Thryssa, Ilisha and Sprattus. These latter two findings are consistent with previous studies of Clupeiformes with taxonomic coverage comparable to ours (19,20,44), underlining the need for revision of these taxa.
Inconsistent divergence time estimates in Clupeiformes
Bayesian analysis estimated the divergence time of the Lake Tanganyika sardines at around 4.07 MYA [95% CI: 1.66, 7.06]. This estimate is younger than the previous estimate by Wilson et al. (17) at 7.6 MYA, but within its credibility interval [95% CI: 2.1, 15.9]. Conversely, their estimate fell just outside our credibility interval. Our estimate is also younger compared to the one by Bloom & Lovejoy (19) at 6.61 MYA [95% CI: 2.20, 11.01], but in accordance with Egan et al. (20) at 3.91 MYA [95% CI: 1.19, 6.64]. Our credibility intervals were smaller than those of Wilson et al. (17), but comparable to Bloom & Lovejoy (19) and Egan et al. (20). Other, deeper nodes of interest also differed between the studies. Overall, our estimates most closely agreed with those of Egan et al. (20), except for the divergence time of the pellonulines from other clupeids, which was around 10 MYA older in our study. Bloom & Lovejoy (19) found consistently older estimates, while Wilson et al. (17) estimated the more recent nodes as older, and the deeper nodes as younger than the other three studies.
The differences in estimated divergence times between the studies can be partially attributed to the different estimation procedures. Specifically, methodological choices for Bayesian dating of nodes can strongly influence the accuracy and precision of the divergence time estimates, for example the choice of priors to account for uncertainty surrounding the age of a fossil, and the choice of clock model (34,45). Almost all the studies we compared here dated divergence using a fossil-calibrated uncorrelated (relaxed) clock model implemented in BEAST, accounting for substitution rate heterogeneity among branches. Six to eight fossil calibrations were specified as exponential priors with soft maximum ages. Only Wilson et al. (17) used an autocorrelated clock approach with seven fossil calibrations specified as uniform priors. In accordance with the three more recent studies on the divergence times of Clupeiformes (19–21), but in contrast with Wilson et al. (17), we chose a relaxed clock model, in accordance with the varying speeds of diversification in different clupeid lineages (44). However, we decided to use pre-calibrated time scaling points (“secondary calibration”) over direct fossil calibration (“primary calibration”). Ideally, time-calibration of the diversification of the pellonulines would be based on fossils within this clade. Unfortunately, there are no known pellonuline fossils, and the fossil record of fish of Central Africa in general is sparse (16). Secondary calibrations can result in younger and falsely narrowed estimates of node ages (42,46). By basing secondary calibrations on 95% credibility intervals of primary estimates, 5% of the primary uncertainty (the values falling outside of the credibility interval), is lost (42). Nevertheless, secondary calibrations were necessary in our case to achieve convergence. In addition, the low number of fossil calibration points that was applicable to our dataset would likely have resulted in less accurate estimates than using secondary estimates from carefully calibrated phylogenies (47). Our use of normally distributed priors (a common approach for secondary calibrations) rather than lognormal or exponential distributions could have shifted node age estimates in the opposite direction to produce older estimates (46). Considering these effects, we are confident that our methodological choices have produced node age estimates with the best possible accuracy, but some caution is warranted when interpreting the width of our credibility intervals. The robustness of our approach is supported by correspondence of our estimates to the applied fossil ages (potential primary calibrations).
Utility of whole mitochondrial genomes for phylogenomic analysis and divergence time dating
Mitochondrial protein coding genes vary in their ability to recover known phylogenetic topologies. The sequences of ND4, ND5, COI and CYTB genes are generally useful for phylogenetic questions, while fast evolving genes such as ND4L and ATP8 are regarded as poor phylogenetic performers, although this differs per study and taxon (25,26). Indeed, we found relatively high nucleotide diversity in some genes or regions compared to others, including parts of the genes coding for the ATP6, ND1, ND2, ND3 and parts of ND5. However, whole mitochondrial genomes can recover accurate phylogenies with high resolution, despite containing “poor” phylogenetic performers (27,48). A smaller subset of “good” mitochondrial genes may be able to recover the same topology as the entire mitochondrial genome, but this is highly taxon-specific (25–27,48). Thus, utilizing more markers that provide complementary information is preferable if previous taxon-specific information on the utility of single markers is not available (27).
Nuclear and mitochondrial DNA can contrast or complement each other, both in terms of tree topology and branch lengths (7,8,44). Mitochondrial data has been extensively used due to its uniparental inheritance as a single linkage group, little recombination and fast evolutionary rate, but has been criticized due to frequent violation of the selective neutrality assumption and complications related to introgression and incomplete lineage sorting between species (24). Our estimated divergence time of 4.07 MYA might indicate the lower boundary of divergence time if, after initial isolation, secondary contact with gene flow has occurred between the two diverging sardine species. Due to haploidy and uniparental inheritance, the effective population size of mitochondrial DNA (mtDNA) is fourfold smaller compared to nuclear DNA (nDNA) (24), implying that lineage sorting will happen more rapidly in mtDNA compared to in nDNA (49–51). Due to a lack of recombination, mtDNA introgresses as one single block. Therefore it is possible that through introgression, followed by complete lineage sorting, the mtDNA of one of the species replaces the mtDNA of the other species without evidence of introgression (mitochondrial capture) (51–53). In this case, our estimate would indicate the divergence time after mitochondrial homogenization, instead of the original divergence time. There are several examples of ongoing hybridization between clupeid species (54–57). With their similar habitat, nursery areas, and modes of reproduction, the Tanganyika sardines may well have a history of introgression. A recent study using RAD-tag sequencing suggested that they have completely different sex chromosome organization (58), making this prospect less likely, but not impossible (59).
At the start of the Eocene (around 50 MYA), global sea levels were more than 100 m higher than today, and steadily decreased over the next 20 million years, with smaller maxima in between (16,60,61). These fluctuations may have allowed frequent isolations and reconnections in the Congo basin, favouring hybridization between other newly formed pellonuline species as well. To completely resolve the species tree of Pellonulini, phylogenomic analyses using nuclear genomic markers and multiple individuals per species are needed (but see Bloom & Egan (44), who found similar divergence time estimates with mtDNA and nDNA datasets).
Recent divergence time suggests intralacustrine speciation of the Tanganyika sardines
Present-day distributions of several Afrotropical freshwater fish lineages show striking overlap, including members of Pellonulini, Kneriidae and Phractolaemidae, providing evidence for a single marine-freshwater transition across West- and Central Africa around 50 MYA during a period of high sea levels (16). Despite the high sea levels, Lake Tanganyika was likely never in direct contact with the ocean and has not experienced much higher water levels than at present (3,6). Furthermore, due to uplift of the borders of the Congo basin from the Cenozoic onwards, the possibility of an additional marine incursion close to the lake is faint (18). It is therefore more likely the Lake Tanganyika sardines evolved from riverine clupeids. Indeed, the presence of a large body of water covering a large area of the Congo basin (“paleo-lake Congo”) until the Pliocene or early Pleistocene (2–12 MYA, Beadle, 1974; Peters and O’Brien, 2001), may have increased the connectivity between the Congo tributaries and its surrounding lakes, and may have facilitated the entry of riverine species into the predecessor of Lake Tanganyika at this time.
Our improved divergence time estimates of the Tanganyika sardines (4.07 MYA) and their MRCA from other pellonulines (11.88 MYA) help us to better understand their origin and colonisation time in connection to the geological history of the lake. Our estimates are compatible with (1) the entrance of the MRCA of the Tanganyika sardines into the newly formed Tanganyika basin (around 12 MYA) via the tributaries of the proto-Malagarasi-Congo river; and (2) intralacustrine speciation at the onset of deep- and clearwater conditions after the subbasins fused (5–6 MYA). However, based on the 95% credibility intervals of our estimates, we cannot exclude the possibility that the MRCA of the Tanganyika sardines diverged from P. obtusirostris outside of the proto-lake and entered it sometime between the time of its formation and the fusion of its subbasins.
Which environmental conditions triggered the divergence between S. tanganicae and L. miodon remains uncertain. Sexual selection, such as in cichlids (64), is unlikely to have played a large role due to the mode of reproduction of the clupeids. Ecological differences can be powerful drivers of speciation, even in (partial) sympatry (65,66). The newly fused basin, adding ecological heterogeneity to the ancestral sardine’s environment, may have favoured dietary specialization through divergent selection on polymorphic trophic traits. Niche separation and divergence can then prompt genetic reproductive isolation if reinforced by spatial or temporal separation of spawning or lower hybrid fitness (65–67). Indeed, contemporary populations of L. miodon seem to spawn all year round and mostly in the littoral, while populations of S. tanganicae exhibit clear peak spawning times in the pelagic (68). This suggests that at some point during their divergence, spawning became more common in their respective preferred habitats. An alternative explanation is that their speciation was triggered by periods of allopatry (65,67). Given our credibility intervals and the frequent water-level fluctuations potentially separating and reconnecting the southern and central subbasins of Lake Tanganyika several times, it is likely that ancestral sardine populations frequently occurred in partial or complete isolation.
A Limnothrissa-like ancestor of the Tanganyika sardines?
According to our ML and Bayesian phylogenies, the closest living relative of the Tanganyika sardines is P. obtusirostris, a riverine herring feeding primarily on insects that occurs in the northern and eastern stretches and tributaries of the Congo river system, all the way down to the Lukuga river, which was connected to the Malagarasi river east of Tanganyika around the time of the lake’s formation. Ecologically, L. miodon, with its generalist diet including insects and small fishes, is more similar to P. obtusirostris than S. tanganicae, which is a strict planktivore. In addition, individuals of L. miodon and species of Potamothrissa share a morphological feature that is otherwise rare in clupeid fishes: a row of saw-like teeth at the side of the lower jaw (43). We suggest that the ancestral Tanganyika sardine shared more ecological traits with L. miodon than with S. tanganicae. This is also reflected in the more shorebound and generalist lifestyle of L. miodon, and its ability to invade the Cahora Bassa reservoir though dispersal via the riverine environment of the Zambezi (69), suggesting a relatively high ecological flexibility compared to S. tanganicae (70), and thus a higher ability to colonize a new environment. Nevertheless, the presence of established contemporary populations of S. tanganicae in one of the Congo’s tributaries, the Lukuga, attest its ability to inhabit, or at least cross, non-pelagic environments, provided the water composition is sufficiently similar (71). We also found larger genetic differentiation of S. tanganicae than L. miodon from the remaining pellonulines in non-coding regions. This could further support our hypothesis of a higher relatedness between L. miodon and the ancestral sardine, but may also indicate different demographic histories (72). Kmentová et al. (73) found signatures of recent population expansion in both L. miodon and S. tanganicae, but these were more pronounced in the latter. The population expansion in S. tanganicae might be linked to the fusion of subbasins, or any other major lake-level fluctuation that increased the amount of pelagic habitat. Similarly, species of the pelagic cichlid tribe Bathybatini showed recent demographic expansions, probably also linked to lake-level fluctuations (74).