Comparative analysis of dipteran non-coding DNA
Our previous study of 19 consecutive in vivo tested Drosophila enhancers, contained within a 28.9 kb intergenic region located between the vvl and Prat2 genes, revealed that each CSB cluster functioned independently as a spatial/temporal cis-regulatory enhancer [24]. The enhancers possessed a diversity of regulatory functions, including dynamic activation of expression in defined patterns within subsets of cells in discrete regions of the embryo, larva and/or adult.
Submission of the ~29 kb enhancer field to the RefSeq Genome Database of Ceratitis capitata via BLASTn revealed 17 uCSBs; all 17 regions were colinear and located between the Ceratitis orthologs of the Drosophila vvl and Prat2 genes. In each case, the match between Ceratitis and Drosophila corresponded to all or part of a CSB identified as being highly conserved among Drosophila species [24]. Submission of the same Drosophila region to the Musca domestica RefSeq Genome Database revealed 13 uCSBs that are colinearly arrayed within the Musca genome. Since the structural gene and these conserved uCSBs are currently on different contigs, the absolute orientation of the Musca sequences with respect to the Musca vvl structural gene could not be determined. Nine of the uCSBs were present in both Ceratitis and Musca and corresponded to CSBs contained in several of the enhancers identified in our previous study of the Drosophila enhancer field [24]. The conservation within one of these, the embryonic neuroblast enhancer vvl-41, is depicted in Fig. 1. We have annotated an EvoPrint of vvl-41 to show the CSBs shared with Ceratitis and Musca (Fig. 1A). Green CSBs are shared among all three species, red letters represent bases that are shared between Dm and Ceratitis, and blue letters represent bases that are shared exclusively between Dm and Musca. Fig. 1B shows two- and three-way alignments of the conserved CSBs in vvl-41 among the three species. In many cases, the uCSBs contained known DNA motifs for TFs. Each of the CSB elements in vvl-41 that is shared between Dm and Ceratitis is in the same orientation with respect to the vvl structural gene. In Musca, however, the orientation of the elements with respect to the structural gene is unknown, since the structural gene and the CSBs are on different contigs. Additional file 1, Fig. S1 presents three-way alignments of each of the other eight uCSBs within the vvl enhancer field that are shared among Dm, Ceratitis and Musca. The uCSB of vvl-49 in Ceratitis is in reverse orientation with respect to the vvl structural gene. Many of the uCSBs in Musca are in a different orientation on the contig than in Dm, indicating microinversions. We conclude that, except for microinversions, the order and orientation of highly conserved non-coding sequences with respect to flanking genes is the same in Drosophila, Ceratitis and Musca.
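The colinearity and orientation checks described above can be illustrated with a minimal sketch. It assumes BLASTn hits have already been parsed into query and subject coordinates, and that a minus-strand hit is reported with subject start greater than subject end (the BLAST tabular convention). The `Hit` record, the function names and all coordinates below are hypothetical and are not taken from the study's data.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One BLASTn match of a Drosophila CSB against a test-species contig."""
    name: str
    q_start: int  # start in the Drosophila (query) sequence
    q_end: int
    s_start: int  # start in the test-species (subject) contig
    s_end: int    # s_start > s_end marks a minus-strand match


def strand(hit: Hit) -> str:
    """Return '+' or '-' from the order of the subject coordinates."""
    return "+" if hit.s_start <= hit.s_end else "-"


def is_colinear(hits: List[Hit]) -> bool:
    """True if hits keep the same relative order in query and subject.

    A uniformly decreasing subject order is also accepted, because the
    whole contig may simply be reported in the opposite orientation.
    """
    ordered = sorted(hits, key=lambda h: h.q_start)
    mids = [(h.s_start + h.s_end) / 2 for h in ordered]
    pairs = list(zip(mids, mids[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def microinversions(hits: List[Hit]) -> List[str]:
    """Names of hits whose strand differs from the majority strand."""
    strands = [strand(h) for h in hits]
    majority = "+" if strands.count("+") >= strands.count("-") else "-"
    return [h.name for h in hits if strand(h) != majority]


# Hypothetical coordinates, for illustration only.
example_hits = [
    Hit("csb_a", 100, 140, 5000, 5040),
    Hit("csb_b", 900, 930, 7230, 7200),   # reversed relative to neighbours
    Hit("csb_c", 2000, 2050, 9100, 9150),
]
colinear = is_colinear(example_hits)
inverted = microinversions(example_hits)
```

In this toy input the three blocks keep their order along the contig while one block lies on the opposite strand, which is the pattern interpreted in the text as a microinversion within an otherwise colinear array.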
Many of the non-coding regions in dipteran genomes contain uCSBs, especially in and around developmental determinants, and many of these are likely to be cis-regulatory elements such as those found in the vvl enhancer field. Another example is the prevalence of uCSBs in the non-coding sequences associated with the Dm hth gene locus. A previous study identified an ultraconserved region in hth shared between Drosophila and Anopheles [19]. We have identified additional hth uCSBs shared among Dm, Ceratitis and Musca: a total of 16 CSBs shared among all three species, 8 CSBs shared between Dm and Ceratitis but not Musca, and 7 CSBs shared between Dm and Musca but not Ceratitis (Fig. 2 and data not shown). Both Ceratitis and Musca contain uCSBs that are in reversed orientation with respect to the orthologous Drosophila regions.
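The three-way and two-way tallies reported for hth amount to set operations on the lists of Dm CSBs that have a match in each test species. The sketch below is illustrative only: the integer CSB identifiers are placeholders, chosen so that the resulting counts match the 16, 8 and 7 given in the text, and the function name is an assumption.

```python
from typing import Dict, Set


def partition_shared(dm_vs_cer: Set[int], dm_vs_mus: Set[int]) -> Dict[str, Set[int]]:
    """Split Dm CSBs by which test species also contain them."""
    return {
        "all_three": dm_vs_cer & dm_vs_mus,        # matched in both species
        "ceratitis_only": dm_vs_cer - dm_vs_mus,   # matched in Ceratitis, not Musca
        "musca_only": dm_vs_mus - dm_vs_cer,       # matched in Musca, not Ceratitis
    }


# Placeholder identifiers: CSBs 0-23 match Ceratitis; 0-15 and 24-30 match Musca.
cer_matches = set(range(24))
mus_matches = set(range(16)) | set(range(24, 31))

groups = partition_shared(cer_matches, mus_matches)
counts = {label: len(ids) for label, ids in groups.items()}
```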
EvoPrint analysis of Drosophila hth sequences immediately upstream of and including the first exon revealed a conserved sequence cluster (Fig. 2) associated with the transcriptional start site. Fig. 2A illustrates the correspondence of the Dm conserved region with Ceratitis and Musca. Two of the longer CSBs were conserved in both Ceratitis and Musca, one shorter CSB was conserved only in Musca, and a second shorter CSB was conserved only in Ceratitis. Two- and three-way alignments of Dm, Ceratitis and Musca, as revealed by BLASTn, are shown in Fig. 2B. Each of the uCSBs is in the same orientation with respect to the hth structural gene.
Discovery of non-coding conserved sequence elements in mosquitoes
EvoPrinting with A. gambiae as the reference species and multiple test species from the Neocellia, Myzomyia and Neomyzomyia series provides sufficient evolutionary distance from A. gambiae to resolve CSBs. The CSB clusters resolved within the Anopheles species (data not shown) are similar to those detected using Dm as a reference sequence [30]. Phylogenetic analysis has revealed that the Anopheles species diverged from one another ~48 to ~30 million years ago [31], while Aedes and Culex diverged from the Anopheles lineage in the Jurassic (∼145–200 Mya) [27] or even earlier.
We sought to identify uCSBs in mosquitoes by comparing Anopheles species with Aedes and Culex. We used non-coding sequences associated with the mosquito homolog of the morphogen wingless [40] to discover associated conserved non-coding sequences. Fig. 3 illustrates a CSB cluster slightly more than 27,000 bp upstream of the A. gambiae wingless coding exons. The orientation of the CSBs with respect to the ORF was reversed in A. gambiae when compared with the orientations of both the Culex and Aedes CSBs. The uCSBs conserved in Culex and Aedes coincide with CSBs revealed by EvoPrint analysis of Anopheles non-coding sequences. Additional file 1, Fig. S2 illustrates an EvoPrinter scorecard for the non-coding wingless-associated CSB cluster described in Fig. 3. Scores for the first four species, all members of the gambiae complex, are similar to that of A. gambiae against itself, with subsequent scores reflecting increased divergence from A. gambiae. Culex and Aedes are distinguished from the other species by their membership in a distinct branch of the mosquito evolutionary tree, the Culicinae subfamily, and by their low scores against the A. gambiae input sequence. The mosquito EvoPrinter consists of 20 species, including 7 species of the Gambiae subgroup and related species A. christyi and A. epiroticus, 5 species of the Neocellia and Myzomyia series (A. stephensi, A. maculatus, A. culicifacies, A. funestus and A. minimus), 2 species of the Neomyzomyia series (A. dirus and A. farauti), 2 species of subgenus Anopheles (A. sinensis and A. atroparvus), Nyssorhynchus and other American species (A. albimanus and A. darlingi), and two species of the subfamily Culicinae (Aedes aegypti and Culex quinquefasciatus). Mosquito genomes are described by Holt et al., 2002; Nene et al., 2007; Reddy et al., 2012; and Neafsey et al., 2014 [33,34,35,36].
Conserved sequence elements in bees and ants
Bees and ants are members of the order Hymenoptera, representing the Apoidea (bee) and Vespoidea (ant) superfamilies. Current estimates suggest that the two families have evolved separately for over 100 million years [28]. To identify conserved sequences shared by bees and ants or unique to each family, we developed EvoPrinter alignment tools for seven bee and 13 ant species (Additional file 1; Table S1). Three approaches were employed to identify and confirm conserved elements (in both coding and non-coding sequences) and their positioning within bee and ant orthologous DNAs. First, EvoPrinter analysis of bee and ant genes identified conserved sequences in either bees or ants and ultra-conserved sequence elements shared by both families (Figs. 4, 5). Second, BLASTn alignments of the orthologous DNAs identified or confirmed CSBs that were either bee- or ant-specific or shared by both (data not shown). Third, side-by-side comparisons of ant and bee EvoPrints and BLASTn comparisons revealed similar positioning of orthologous CSBs relative to conserved exons (Fig. 6, Additional file 1, Fig. S2 and data not shown).
To identify conserved sequences within bee species, we initially generated EvoPrints of honey bee (Apis mellifera) genes using other Apis and Bombus species. EvoPrints of the Dscam2 locus resolved clusters of conserved sequences (Fig. 4). Dscam2 is implicated in axon guidance in Drosophila [37] and in regulation of social immunity behavior in honeybees [38,39]. The EvoPrint scorecard (Fig. 4a) reveals a high score (close relationship) with the homologous region in the other two Apis species. The more distant Bombus species score more than 50% lower, and Habropoda represents a further step down from the Bombus species. Megachile shows a significantly lower score, reflecting its more distant relationship to Apis mellifera. The relaxed EvoPrint readout reveals two CSB clusters (Fig. 4b). Only one sequence cluster, the lower 3’ cluster, is conserved in all six test species examined, while the 5’ cluster is present in all species except Megachile. BLAST searches confirmed that the 5’ cluster was absent from Megachile, from the more distant species Dufourea novaeangliae, and from all ant species in the RefSeq genome database (data not shown). BLASTn alignments also revealed conservation of the 3’ cluster in the bee species Dufourea novaeangliae, the wasp species Polistes canadensis and two ant species, Vollenhovia emeryi and Dinoponera quadriceps.
EvoPrinter analysis of bee and ant genes that are orthologs of the Drosophila neural development genes goosecoid (gsc) and castor (cas) revealed conserved non-coding DNA that is unique to either bees or ants or conserved in both (Fig. 5). The Drosophila Gsc homeodomain transcription factor is required for proper axon wiring during embryonic CNS development and has recently been linked to social immunity behavior in honeybees [38,39]. The Drosophila Cas Zn-finger transcription factor has been shown to be essential for neuroblast temporal identity decisions during neural lineage development [40, 41]. EvoPrints of the Hymenoptera orthologs identify non-coding conserved sequence clusters that contain core uCSBs shared by both the ant and bee superfamilies, and these uCSBs are frequently flanked by family-specific conserved clusters (Figs. 4, 5, 6 and data not shown). For example, analysis of the non-coding sequence upstream of the first exon of Wasmannia auropunctata (ant) cas identifies both a conserved sequence cluster that contains ant and bee uCSBs and an ant-specific conserved cluster that has no counterpart in bees (Fig. 5b and data not shown). It is likely that the ant-specific cluster was deleted in bees, since BLASTn searches of Wasmannia against the European paper wasp Polistes dominula reveal conservation of a core sequence corresponding to this cluster (data not shown).
The combined evolutionary divergence in the gsc and cas EvoPrints, achieved by using multiple test species, reveals that many of the codon positions that specify the amino acid are conserved, while the wobble positions in their ORFs are not. The lack of wobble conservation indicates that the combined divergence of the test species used to generate the prints affords near base-pair resolution of essential DNA.
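The wobble-position argument can be made concrete with a small sketch that tallies, for each of the three codon positions, the fraction of bases in a reference ORF that are identical in every test species. This is not the EvoPrinter algorithm; it assumes the sequences are already aligned, gap-free, in frame and of equal length, and the short sequences below are invented for illustration.

```python
from typing import List, Sequence


def conservation_by_codon_position(reference: str, tests: Sequence[str]) -> List[float]:
    """Fraction of bases conserved in all test sequences, per codon position.

    Index 0 and 1 are the first and second codon positions; index 2 is the
    third (wobble) position. Sequences must be aligned, in frame and gap-free.
    """
    if len(reference) == 0 or len(reference) % 3 != 0:
        raise ValueError("reference length must be a positive multiple of 3")
    if any(len(t) != len(reference) for t in tests):
        raise ValueError("all test sequences must match the reference length")

    conserved = [0, 0, 0]
    totals = [0, 0, 0]
    for i, base in enumerate(reference):
        position = i % 3
        totals[position] += 1
        # A base counts as conserved only if every test species agrees.
        if all(t[i] == base for t in tests):
            conserved[position] += 1
    return [c / n for c, n in zip(conserved, totals)]


# Invented three-codon ORF (Met-Ala-Lys) with synonymous third-position changes.
reference_orf = "ATGGCTAAA"
test_orfs = ["ATGGCCAAG", "ATGGCAAAA"]

fractions = conservation_by_codon_position(reference_orf, test_orfs)
```

With enough combined divergence among the test species, a low fraction at the third position alongside high fractions at the first two is the signature described in the text.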
Cross-group, side-by-side comparison of bee and ant conserved DNA was performed using bee-specific and ant-specific EvoPrints and BLASTn alignments (Fig. 6, Additional file 1, Fig. S2 and data not shown). Fig. 6 highlights the conservation observed among bee and ant exons and flanking sequences of the glass bottom boat (gbb, 60A) locus: the Apis mellifera locus EvoPrinted with four bee test species (panel A) and the Wasmannia auropunctata gbb locus EvoPrinted with three ant species (panel B). Coding sequences are underlined in red, non-coding homologous regions are underlined in blue, and novel CSBs present in either ants or bees but not both are indicated by the vertical lines to the side of each EvoPrint. Similarly, EvoPrinting a single exon and flanking regions of the Apis mellifera homothorax locus with four bee species, and generating an ant-specific EvoPrint of the orthologous Ooceraea biroi homothorax sequence with ten other ant species, reveals CSBs that are conserved in both Apis and Ooceraea, as well as sequences that are restricted to one of the two Hymenopteran families (Additional file 1, Fig. S2).