Sequencing
With the PacBio DNA sequencing, we generated 366,533 LRs comprising 2.31 Gb (8-fold coverage); mean size and N50 were 10.66 kb and 11.74 kb, respectively (Supplementary Fig. S1). With the Illumina DNA sequencings, we generated 345,692,872 SRs consisting of 45.70 Gb (161-fold coverage), including 276,666,288 reads (27.62 Gb) from SR1 and 69,026,584 reads (18.07 Gb) from SR3 (Supplementary Figs S2–S5). To generate SR2, 1,061,968 paired-end reads (0.25 Gb) were discarded from SR3, and then the filter-passed reads were merged into 33,982,308 single-end reads (9.70 Gb).
Transcriptome assembly and annotation
The nine RNAseq datasets for T. angustula analyzed by us included 577,856,230 reads, comprising 47.32 Gb (Supplementary Table S1). Read count and dataset size ranged from 27,737,128 to 41,105,492 reads (mean of 32,103,123 ± 5,237,347 reads) and from 1.62 to 3.27 Gb (mean of 2.62 ± 0.45 Gb), respectively, across datasets.
After filtering out the transcripts that did not achieve the expression threshold (n = 55,910), the de novo transcriptome of T. angustula comprised 138,341 assembled transcripts (GC 37.4%), totaling 221.88 Mb. The longest transcript, N50, and L50 were, respectively, 38.04 kb, 4.02 kb and 15,657 (Supplementary Fig. S6 and Supplementary Table S3). Our BUSCO assessment of the transcriptome returned 5,906 complete USCOs, thus yielding a completeness ratio of 97.2%.
Genome size estimation, assembly and annotation
Based on the k-mer frequency distribution of reads, we estimated the genome of T. angustula to be 317.60–317.92 Mb in size, of which 256.01–256.26 Mb (80.6%) and 61.59–61.65 Mb (19.4%) corresponded to the unique and repetitive regions, respectively (Supplementary Table S4). The estimated overall heterozygosity rate was 0.27%, with a distinct peak occurring at the k-mer coverage of 13.3 (Supplementary Fig. S7).
In total, ten genome assemblies were produced, and all of which were considerably smaller than previously estimated (Table 1). They were nonetheless fairly similar in size, ranging from 279.62 to 284.16 Mb (mean of 282.40 ± 1.83 Mb). Similarly, the calculated GC content was consistently around 37% irrespective of the strategy used. However, the number of contigs greater than 1 kb and the length of the longest contig varied substantially across strategies, ranging from 675 to 28,748 (mean of 574.37 ± 405.44) and from 0.17 to 4.97 Mb (mean of 2.56 ± 1.42 Mb), respectively. The L50 varied from 16.99 to 1,006.56 kb (mean 595.20 ± 345.38 kb), the N50 varied from 94 to 4,542 contigs (mean of 110.86 ± 99.93 contigs) and BUSCO completeness reached between 90.5 and 97.0% (mean of 96.01 ± 1.73%). Based on the assessed parameters, MaSuRCA yielded better preliminary assemblies than MEGAHIT, except when SOAPdenovo was employed within the former. The two best strategies used either SR2 (Masurca1) or SR3 (Masurca4), and both were carried out with Celera Assembler. Our preferred assembly strategy (i.e., Masurca4 + Samba + SSPACE) resulted in 705 contigs comprising 283.99 Mb (GC 37%), with N50 of 1,01 Mb, L50 of 94 contigs and BUSCO completeness reaching 97%.
Prior to gene prediction, we identified and masked 462,261 repetitive elements, which together comprised 65.94 Mb and corresponded to 23.22% of the genome of T. angustula (Supplementary Table S5). Most repetitive elements were interspersed (n = 235,426 or 55.25 Mb, 50.9%), followed by simple (n = 193,756 or 8.95 Mb, 41.9%) and low-complexity repeats (n = 33,079 or 1.73 Mb, 7.2%). We predicted that the genome of T. angustula comprises 16,958 PCGs totaling 68.61 Mb, which corresponded to 31.46% of the unmasked genome and 24.15% of the complete genome. These PCGs comprised a total of 80,341exons and 64,712 introns, thus averaging 4.73 exons and 3.81 introns per gene. We annotated the biological function of 13,959 PCGs (82.31% of the total), including 10,245 annotations (73.39%) based on Swiss-Prot and the other 3,714 (26.61%) based on TrEMBL (Supplementary Table S6). Using InterProScan, we also functionally annotated 14,597 PCGs (86.07% of the total) by associating them with 61,834 protein domains, 17,532 biological pathways and 8,243 GO terms previously identified. Finally, we identified 1,866 ncRNA genes (Supplementary Tables S7 and S8), most of which were small nuclear RNAs (n = 666, 35.6%), followed by micro RNAs (n = 445, 23.8%), cis-regulatory RNA elements (n = 272, 14.5%), transport RNAs (n = 177, 9.4%) and small RNAs (n = 163, 8.7%).
Mitogenome overview
The mitogenome of T. angustula, a circular double-stranded DNA molecule spanning 17,410 bp, encodes a set of 37 genes, including 2 rRNAs, 22 tRNAs and 13 PCGs (Fig. 1). All genes were identified on the putative plus strand, as in F. varia. Moreover, the organizational pattern of both mitogenomes are identical and seemingly unique among corbiculate bees (Fig. 2).
Fig. 1 Circular schematic representation of the complete mitogenome of Tetragonisca angustula. From the outer to inner circle: the arrows indicate gene directions, coverage plot of the Illumina SR alignment, coverage plot of PacBio HiFi LR alignment and GC skew plot.
Fig. 2 Linear schematic representation of the mitogenomic organization in Tetragoniscaangustula compared to other corbiculate bees. Colors indicate conserved blocks of protein-coding genes that maintain the same organization even after rearrangement events.
The mitogenome of T. angustula includes the longest A+T-rich region among all bee species sequenced to date, encompassing 2,276 bp. Putative replication origin stems, accompanied by flanking TA dinucleotide repeats, were identified. A significant decrease in coverage from LRs and, especially, from SRs, was observed for this region, with the latter also showing low coverage uniformity. Conserved motifs like T(n) (polyT stretch) and putative stop signals, previously reported for other bee species [84, 85] were not found.
Contaminants
Our search for contaminants found ten bacteria species often associated with bees (Supplementary Table S9). The most abundant was Gilliamela apis, a common bacteria of bee guts. A gene of the bacteria Pantoeaagglomerans was also identified; this species is usually associated with A. mellifera, typically in honey sacs. The set of species included gram-positive and gram-negative bacteria that are usually implied in bee intestinal functions.
Phylogenomic inference
Our final data matrix included 1,337 USCOs comprising a total of 653,067 sites (488.45 sites/USCO on average). These consisted of 227,238 parsimony-informative sites (34.8%), 115,415 singletons (17.7%) and 310,414 non-variable sites (47.5%), which were divided into 63 different partitions.
The ML analysis yielded a fully resolved phylogenomic tree (Fig. 3) in which all nodes were maximally supported (UFBoot = 100). The inferred high-level relationships were fully consistent with those from previous phylogenomic studies, including the following: 1) short- and long-tongued bees were recovered as reciprocally monophyletic in the absence of Melittidae; 2) Colletidae and Halictidae appeared as sister taxa, and together they comprised the sister group to Andrenidae; and 3) corbiculate bees formed a monophyletic clade within Apidae. SLBs and bumblebees, which were recovered as reciprocally monophyletic, likely diverged from each other 47.5 Mya in the mid-Eocene, according to our dating analysis. Tetragonisca angustula would have diverged from its closest relative, F. varia, 7.9 Mya during the late-Miocene, a relatively late divergence when compared to that of the congeneric species B. affinis and B. vancouverensis Cresson (8.4 Mya).
Fig. 3 Dated ML tree inferred from USCO data showing the phylogenetic placement of Tetragonisca angustula (bold) within Hymenoptera. Color-coded triangles depict the approximate number of orthogroups in expansion (shades of blue), contraction (shades of green) and rapid evolution (shades of red) for the corresponding internode or terminal species. Black and white numbers are the clade numbers attributed by CAFE. All clades were recovered with UFBoot = 100 (omitted).
Orthogroup identification and evolution
OrthoFinder identified 341,783 orthologs and assigned 321,485 (94.1%) of them to 19,339 orthogroups, yielding an average of 16.62 orthologs per orthogroup (Supplementary Table S10). Of these, 5,584 orthogroups (28.8%) were shared by all species, while 1,397 (7.2%) were species specific; the latter comprised 1.7% (n = 5,896) of the assigned orthologs. Tetragonisca angustula, in particular, had 13,318 of its 15,164 orthologs (87.8%) assigned to 11,817 orthogroups (1.12 orthologs/orthogroup), including 19 orthogroups and 82 orthologs that were found to be specific to the species. Six of the species-specific orthogroups of T. angustula included orthologs coding for the following: activating signal cointegrator 1, DUF5641 domain-containing protein, kinesin-like protein and mitochondrial-processing peptidase, LINE-1 retrotransposable element and reverse transcriptase. We could not functionally annotate the remaining 13 orthogroups specific to T. angustula.
We inferred 158,670 events in which an orthogroup underwent a change in size, including 135,168 contractions and 23,502 expansions (Supplementary Table S11), based on the lambda value (0.0030938) calculated by CAFE. However, only 570 of the identified orthogroups (2.9%) changed significantly in terms of gene composition (p < 0.05). Tetragonisca angustula accounted for 1,192 of the 158,670 change events, 809 of which (67.8%) represented contraction events. Overall, the number of contractions was significantly higher than that of expansions (F = 19.2047, p = 0.00003) among the sampled hymenopterans. Neither the mean number of contractions (F = 0.8970, p = 0.3530) nor of expansions (F = 2.4964, p = 0.1271) was significantly different between bees and non-bees. The same was found when SLBs and the other bees were compared with each other: contractions (F = 0.0812, p = 0.7791) and expansions (F = 0.4916, p = 0.8272).
CAFE also identified 1,202 events of rapid evolution spanning 281 different orthogroups, 64 of which were found in T. angustula, including 37 (57.8%) in expansion and 27 (42.2%) in contraction. At least six of these 64 rapidly-evolving orthogroups (we could not assign function to eight orthogroups) comprised odorant receptors (ORs)—which are found in the membranes of olfactory neurons in insects—including one orthogroup in expansion and five in contraction (Supplementary Table S12). Overall, a single OR ortholog was gained and 24 were lost by T. angustula, leaving a negative balance of 23 OR orthologs. In contrast, nine rapidly evolving OR orthogroups were predicted for F. varia, seven of them in expansion (+46) and two in contraction (-4), resulting in a balance of 42 more OR orthologs. Both M. bicolor (+7) and M. quadrifasciata (+1) as well as the inferred ancestors of Meliponini (+13) and T. angustula plus F. varia (+4) gained more OR orthologs than lost.