We demonstrate the utility of our program and also explore potential limitations and pitfalls of BUSCO orthologs for phylogenetics by analyzing three datasets that illustrate different classes of phylogenetic problems. First we test the utility of TOAST for resolving the earliest divergences within delphinoids, a clade that experienced a geologically recent pulse of diversification [25–27]. Second, we test whether strongly supported resolution of the sister lineage to teleost fishes is achievable using BUSCO orthologs. This dataset spans an evolutionary timescale on par with some of the earliest divergences within major groups of living jawed vertebrates. Finally, we assess the effect of tissue specific expression patterns on the representation of BUSCO orthologs between two closely related species of camelids. This third dataset provides expectations of data coverage heterogeneity and also demonstrates the ability of TOAST to link recovered loci to the gene ontology (GO) database to assess how the estimated functional roles of loci impact expression patterns.
Resolving the early divergences of delphids
We accessed nucleotide and TSA sequences for each species of Cetacea (whales and dolphins; NCBI ID = 9721) and using the laurasiatheria_odb9 dataset (database of single-copy laurasiatherian orthologs that are present in at least 90% of species) from the OrthoDB website [9, 13]. Using TOAST, we create a set of files for each species that includes gene IDs that match specific BUSCO IDs from the laurasiatheria_odb9 dataset. The representation of ortholog identification within a species varied from complete coverage (6253 loci) to less than 14 loci. However, exploration of missing data patterns demonstrates higher levels of missing data within toothed whales. Using an arbitrary threshold of including only cetaceans with at least 1000 of the over 6000 possible orthologs revealed that most missing data was localized in the dolphin Tursiops (Figure 1A), and that most taxa not meeting this threshold contained very few loci (Figure 1A). Further visualizations possible with TOAST demonstrate that this threshold of minimally containing 1000 loci would remove the majority of the missing data the concatenated alignment (Figure 1B, Supplemental materials).
Using this threshold, we constructed a concatenated alignment of 1,473,632 nucleotides for representatives of the major delphinid lineages Monodontidae (Delphinapterus leucas and Monodon monoceros), Phocoenidae (Neophocaena asiaeorientalis), Delphinidae (Lagenorhynchus obliquidens, Tursiops truncatus, Orcinus orca), and Lipotidae (Lipotes vexillifer) using Physeter catodon as an outgroup. Maximum likelihood based phylogenetic tree searches were conducted using this data in conjunction with best fit nucleotide substitution models and partitions chosen using Bayesian Information Criteria (BIC) in IQ-TREE [15]. Confidence in the topological inference was assessed using 1000 bootstrap replicates (Figure 1B). This resulting tree provides strongly supported topological resolution for the evolutionary relationships of major delphinid lineages, supporting previous hypotheses of a sister group relationship between Delphinidae and a clade comprised of Monodontidae + Phocoenidae [28–30] Our analyses also strongly support an early divergence of Lipotes prior to the radiation of delphioids (Figure 1B). This result is consistent with previous phylogenetic analyses that suggested Lipotes to be one of several deeply divergent lineages of river dolphins that invaded freshwater in the Miocene [31–33]. During this period, sea level rise led to the creation of shallow seaways that provided new estuarine habitats for cetaceans on several continents. Subsequent lowering sea levels is thought to have isolated these lineages from their marine relatives, resulting in the independent evolution of modern river dolphins [32]. Our phylogenetic results are consistent with other topological inferences that have been used as a basis to hypothesize that river dolphins such as Lipotes are vestiges of previously diverse marine cetacean clades that were replaced by the radiation of delphinoids [32], and simultaneously demonstrate the potential utility of TOAST for generating phylogenetic datasets for recent radiations.
Resolving the sister lineage to teleosts
We used transcriptomes of representative species of ray finned fishes from a recent study by Hughes et. al [34] to resolve the sister lineage to teleost fishes. Species selected span all major clades of non-teleost fishes, including representative bichirs (Erpetoichthys calabaricus, Polypterus bichir, Polypterus endlicheri), paddlefish (Polyodon spathula), gar (Atractosteus spatula), and bowfin (Amia calva). Teleosts were selected to include major divergences in the earliest diverging teleost lineages elopomorpha (Megalops cyprinoides, Gymnothorax reevesii, Conger cinereus) and osteoglossomorpha (Osteoglossum bicirrhosum, Mormyrus tapirus, Pantodon buchholzi, Papyrocranus afer), as well as several euteleost species (Engraulis encrasicolus, Lepidogalaxias salamandroides, Synodus intermedius, Porichthys notatus). Using the Actinopterygii_odb9 dataset (database of single-copy 4584 orthologs that are present in at least 90% of species) from the OrthoDB website [9, 13], we then used Using TOAST to harvest gene IDs that had a match to specific BUSCO IDs (4529 loci). Quantification of missing data patterns revealed that almost taxa had greater than 70% representation. However, high levels of missing data was localized to three species (Gymnothorax reevesii: 81% missing, Conger cinereus: 84 % missing, and Porichthys notatus: 94% missing Figure 2A). We constructed a concatenated alignment of 24,225,167 nucleotides for all taxa excluding these three species with high missing data values. We conducted a maximum likelihood analyses using best fit models and partitions selected with BIC in IQ-TREE with 1000 bootstrap replicates (Figure 2B). The resulting tree provides strongly supported topological resolution for every node, supporting previous hypotheses of a sister group relationship between teleosts and Holostei, a clade comprised of gars and bowfin.
For over 150 years, gars and bowfin have been repeatedly hypothesized to form a natural group: Holostei. Originally defined to include a combination of bichirs, gar, and bowfin [35], a revision by Huxley [36] redefined the clade to comprise only bowfin and gar. However, anatomical investigations in the 20th century have repeatedly challenged the validity of Holostei, often proposing instead that bowfin share a closer affinity to teleosts than gar, though recent work by Grande [37] again revived Holostei as sister to teleosts. In sharp contrast to decades of debate among morphologists [38–41], molecular phylogenetic analyses have been nearly unanimous in resolving a strongly supported holostei in studies using mtDNA genomes [42–44], nuclear exons [42, 45–47], as well as transcriptomic and genomic analyses [34, 48]. Such congruence is surprising across years of independent study, and our results add yet another line of evidence strongly supporting the monophyly of Holostei (Figure 2).
The resolution of Holostei as sister to teleosts has important implications for biomedical research using teleost model organisms. Linking genomic work in models such as zebrafish, sticklebacks, or medaka back to humans is often challenged by factors that include the loss of ohnologs following the early rounds of vertebrate genome duplication [49, 50], duplicates of loci found in humans as a consequence of the teleost genome duplication [51], and generally rapid rates of molecular evolution in target loci of many teleosts [52, 53]. However, sequencing of the genome of spotted gar (Lepisosteus oculatus) has revealed surprising similarity between this holostean genome and genomic features of both teleosts and tetrapods [48]. For example, analysis of the major histocompatibility (MHC) genes in gar revealed synteny to human loci as well as identification of other major groups of immune receptors thought to be teleost specific [48]. Similarly, the gar genome contains sequence information that can link hidden orthology of teleost and human microRNAs as well as conserved non-coding elements (CNEs) that appear to be highly informative for understanding the “fin to limb” transition [48]. Further investigation of gar CNEs also linked numerous human disease-associated haplotypes to locations within the zebrafish genome, providing new opportunities for functional experiments [48]. The biomedical importance of the genome of a single gar in combination with the recognition of holostean monophyly raises the question of what discoveries await discovery in not only the genomes of more gar, but also the bowfin genome. Given that both the fossil record and molecular clock studies point to an ancient divergence of gar and bowfin several hundred million years ago [45, 46], investigation of the bowfin genome is sure to illuminate not only biomedically relevant research, but also fundamental aspects of vertebrate genome biology.
Assessing tissue-level expression data in camelids
Transcriptome sequences were downloaded from NCBI’s TSA for two camel species (Cambac:Camelus bactrianus and Camdro:Camelus dromedarius) and eight matching tissues (brain: GAES01|GADT01, heart: GAET01|GADU01, hypothalamus: GAEU01|GADV01, kidney: GAEV01|GADW01, liver: GAEW01|GADX01, lung: GAEX01|GADY01, muscle: GAEY01|GADZ01, skin: GAFA01|GAEA01, and testis: GAEZ01|GAEB01). Utilizing these transcriptome datasets, we demonstrate TOAST’s ability to compare patterns data coverage within the same species and between species and/or tissue types and compare these patterns to expectations of predicted function.
TOAST builds upon the R packages GSEABase and GOstats [54] to assign the orthologs within a given BUSCO database into Gene Ontology Terms (GoTerm) categories. For the camelids, we used TOAST to organize the Laurasiatheria_odb9 database [55, 56], and display GoTerm assignation and overlap (GoTerms) using the dependent R package UpSetR [57] (Figure 3A). Our results demonstrate that the BUSCO loci for this subset of mammals are largely involved in developmental processes and signalling responses (Figure 3A). We demonstrate different degrees of missing data between species and tissue samples, with generally high numbers of fragmented loci in the testis and and very little coverage of loci within liver samples of both species (Figure 3B). Additionally, there appear to be minor differences in the relative percent of GoTerms sampled in brain tissues with the GoTerm “Development” to compared to other tissue types (Figure 3C). Depending on user goals, TOAST offers users the ability to remove fragmented loci based on a length threshold. In this case, our results demonstrate the heterogeneity of BUSCO coverage that can occur between the same tissue types of closely related species. Although it is premature to interpret biological significance from missing data patterns of this dataset, our results demonstrate the potential for BUSCO coverage limitations stemming from tissue and lineage specific patterns of expression.