Convergent amino acid variants specific to avian vocal learners. Among the genomes of 48 avian species spanning most orders17, we compared 6 species from the three vocal learning orders or suborders (songbirds: zebra finch, medium ground finch, and American crow; parrots: budgerigar, and kea; and hummingbirds: Anna’s hummingbird) with 41 vocal non-learning birds (Fig. 1a, Fig. S1)15. Rifleman, a close relative of songbirds and suboscines, was initially excluded because of the uncertainty of its vocal learning ability, although assumed to be a vocal non-learner15. These were short-read genome assemblies and do not have all repetitive and GC-rich regions assembled, but do have most protein coding genes assembled21. We performed CSAV analysis for 4,519,041 homologous amino acid sites in an alignment of 8,295 orthologous genes in birds (Supplementary Table 1; methods in Supplementary Note 1). Out of these homologous sites, 154 sites (0.0034%) detected in 141 genes (1.7%) contained convergent single amino acid variants specific to avian vocal learners (AVL-CSAV; Supplementary Data 1). These amino acid differences were significantly supported (adjusted p < 0.05) by Fisher’s exact test22 (Supplementary Data 1). We classified AVL-CSAVs into four types, two identical and two different types by applying a Shannon’s entropy test23 (Fig. 1b; Supplementary Note 1 and Supplementary Data 1); the terminology, ‘identical’ and ‘different’ were used, instead of ‘convergent’ and ‘divergent’ or ‘parallel’20,24, as the term ‘convergent’ can be ambiguously interpreted as pertaining to only identical CSAVs or both identical and divergent CSAVs. Out of 154 AVL-CSAV sites, 25 sites (16%) were identical CSAVs (AVL-iCSAV) and 129 sites (84%) were different CSAVs (AVL-dCSAV) among vocal learners relative to vocal non-learners (Supplementary Data 1). For example, the 253rd site of B3GNT2 was a Type 1 (AVL-iCSAV) site with asparagine (N) observed in all avian vocal learning species and histidine (H) in all vocal non-learning species; while the 217th site of SMRC8 was a Type 3 (AVL-dCSAV) site with glutamine, valine, and leucine (Q, V, and L) observed in avian vocal learners and isoleucine (I) in all vocal non-learners (Fig. 1c).
Avian vocal learners did not show a preponderance of identical convergent single amino acid variants. We next tested whether avian vocal learners have a higher frequency of convergent substitutions relative to control sets of species. Considering the polyphyletic relationship of the 6 vocal learning species examined, we designed 2 types of clade-specific control sets: 1) Random controls consisting of 1,000 different species combinations with 6 randomly selected target species from 3 independent lineages without considering any traits; 2) Core controls consisting of all possible 59 sets given the phylogeny with 6 target species having at least 2 vocal learning clades and 1 non-learning clade, and 2 sets with different numbers of target species with other convergent traits (6 birds of prey from 4 clades and 15 waterbirds from 4 clades: Supplementary Note 2; Supplementary Table 2). We conducted the CSAV analysis for this total of 1,056 random and core control species combinations, and identified convergent identical (Ctrl-iCSAV) and different (Ctrl-dCSAV) amino acid substitutions in all control sets.
As an extension to the previous studies on convergent evolution in reptile and mammalian lineages18–20 that tested pair-wise combinations of two species (Supplementary Note 3), we found strong correlations between iCSAVs and dCSAVs tested in higher dimensional combinations of species (Fig. 1d,e). Although higher than the expectation (20.48 and 23.04 respectively) according to the regression with random and core control sets, the number of convergent substitutions in vocal learning birds was not an outlier (adjusted p > 0.05, Bonferroni Outlier Test 25) from the trend observed in the control sets (Fig. 1d,e). Several outliers did exist among the control sets, with the highest residual being 32.46 in one of the random control sets (4 Passeriformes, budgerigar, and falcon), and 17.61 in a core control set (3 songbirds, Anna’s hummingbird, and 2 land fowls; Fig. 1d,e). These species combinations of 2 control sets with the highest residuals, however, do not share any known convergent traits as far as we are aware. In the groups of species with other known convergent traits (birds of prey and waterbirds), both iCSAVs and dCSAVs were less frequently observed (Fig. 1e). These findings support that identical convergent single amino acid substitutions are widespread, and their numbers vary in different species combinations that does not appear to readily correlate with known convergent traits.
Amino acid convergences are associated with product of origin branch lengths. We sought a measure of molecular convergence that controls for phylogenetic relationships. According to previous studies on mammalian or drosophila genomes24, and vertebrate mitochondrial genomes19, fewer convergent substitutions are expected with the greater phylogenetic tree branch distance. However, the correlations found in these studies showed high variations, which makes it difficult to identify the outliers. Here, we took into consideration additional phylogenetic features, including the relationship between the convergent variants versus the most recent common ancestor [MRCA] branch of each clade (origin branches), the terminal branches, and the nodes of the tree (Fig. 2, top row; Supplementary Note 4; Supplementary Fig. 1). We observed strong and significant correlations between CSAVs and the product of MRCA branch length lengths (POB) for both random control (Fig. 1f and Fig. 2) and core control sets (Fig. 1g and Supplementary Fig. 4). The correlation was also observed for both iCSAV and dCSAV (Fig. 2a). Much weaker correlations of CSAVs were observed with the product of terminal branches (PTB), distances among terminal branches (DTB) and terminal nodes (DTN; Fig. 2b-d). In contrast to the iCSAV versus dCSAV correlation analyses, in the POB phylogenetic analyses the number of CSAVs in avian vocal learners was a significant outlier relative to random control species sets (adjusted p < 1.41e-08; Fig. 2a). The outlier position of CSAVs was driven mostly by the dCSAVs (adjusted p < 1.64e-07), but also combined trend for the iCSAVs in vocal learners (Fig. 2a). Two other species combinations were outliers, but different from the iCSAV versus dCSAV analyses: (4 Passeriformes, chicken, and duck) and (3 songbirds, Anna’s hummingbird, carmine bee-eater, and downy woodpecker). However, the avian vocal learners were not a significant outlier relative to core control sets only (Supplementary Fig. 4). These findings suggest that POB value can largely explain convergent variants at the amino acid level, where the longer their ancestral branch lengths the greater frequencies of convergence amino acid variants, and that vocal learning birds are somewhat different in this regard relative to other species combinations.
Convergent amino acid substitutions can arise from complex nucleotide variants at multiple sites. To investigate what types of codon and nucleotide variants can cause CSAVs, we modified the CSAV method to detect convergent single codon variants (CSCVs) made up of 3-nucleotides and convergent single nucleotide variants (CSNVs) in those codons (Fig. 3a,b; Supplementary Note 5). Similar to the CSAV analysis, the codon and nucleotide variants were classified into 2 identical types and 2 different types (Fig. 3a,b). A CSAV can theoretically originate from identical single nucleotide variants at a homologous codon position (CSNVs) or complex multiple nucleotide variants (CMNV) at different codon positions (Fig. 3c). However, some CSNVs can also give rise to no change in the amino acid, namely synonymous substitutions (Fig. 3d). We checked for overlaps among CSAVs, CSCVs, and CSNVs to trace the source of the convergent amino acid substitutions at the codon and nucleotide levels.
Analyzing 4,519,041 homologous codons and 13,557,123 homologous nucleotides of the 8,295 singleton orthologous genes in birds, we found 626 CSCV sites specific to avian vocal learners (AVL-CSCVs; Fig. 3e). Among them, 56 (15.7%) showed nonsynonymous CSNVs and 98 (8.9%) nonsynonymous CMNVs, resulting in 154 CSAVs among vocal learners (Fig. 3e). The remaining CSCVs consisted of 113 (18.1%) synonymous CSNVs and 359 (57.3%) synonymous CMNVs (Fig. 3e). An example of a nonsynonymous CSAV explained by a CSNV is in the 253rd codon site of B3GNT2, where all vocal learners had the same convergent nucleotide (A), codon (AAT), and amino acid (N, Asparagine) sequence relative to all vocal non-learners (e.g. C; CAT or CAC; and H, Histidine; Fig. 3f). An example of a nonsynonymous CSAVs explained by a CMNV is the 482nd site of LRRN4, where all of vocal learners showed identical amino acid convergence (AVL-iCSAV) to Histidine (H), while their codons consist of CAC or CAT for vocal learners and TTT, TAT, TAC, CGC, and TCG for vocal non-learners (Fig. 3f).
In the random and core control species sets, although we found different total numbers of CSCVs and their corresponding CSAVs and CSNVs, their relative proportions (%) were similar to each other and to that of vocal learners (Fig. 3e); about 1/3 of CSAVs of random and core control sets originated from CSNVs at each homologous nucleotide site, while 2/3 originated from multiple nucleotide changes at different nucleotide sites (Fig. 3e). These findings suggest that amino acid convergences originate not only from identical single nucleotide substitutions at each homologous site but also from complex nucleotide substitutions at multiple sites within a codon.
Convergent variants at all levels are best explained by the product of MRCA branch lengths. Next, we performed correlation tests between nine types of sequence variants (three types of convergences [all, identical, and different] at three levels [amino acid, codon, and nucleotide]) and four types of phylogenetic features (POB, PTB, DTB, and DTN; Supplementary Note 6). As expected, all nine types of convergent variants (CSAVs, iCSAVs, dCSAVs, CSCVs, iCSCVs, dCSCVs, CSNVs, iCSNVs, and dCSNVs) were highly correlated with each other, in both random control (Fig. 4) or core control species sets (Supplementary Fig. 5). For the phylogenetic features, the POB showed the strongest correlation with all variant types, where the others (DTB, DTN, and PTB) were either weaker or not correlated at all (Fig. 4). Correlations were overall weaker in the core control sets of species (Fig. 4 vs Supplementary Figs. 5,6), presumably due to a smaller number of species combinations than the random control sets. Unlike the CSAVs, the numbers of vocal learner-specific variants did not exceed that of control sets for the other variant types (Fig. 4).
Post hoc analyses of vocal learner substitutions in Rifleman. Rifleman and more broadly the New Zealand Wrens, a close relative of vocal learning songbirds, have been assumed to be a vocal non-learner15. Although rifleman was excluded from the initial CSAV search, we can ask whether its sequences match those of vocal non-learners as assumed. We applied principal component analysis (PCA) and phylogenetic analysis for the 154 AVL-CSAV sites and the subset of 25 AVL-iCSAV sites (Supplementary Note 7). PC1 and PC2 accounted for 53% and 66% of the total variances of the AVL-CSAV sites and AVL-iCSAV sites, respectively (Fig. 5a,b). The vocal learning birds clustered away from the vocal non-learning group as expected (Fig. 5a,b). For the AVL-CSAV sites, rifleman clustered with the vocal non-learners (Fig. 5a). For the AVL-iCSAV subset, rifleman was separate from the two groups, but was still closer to vocal non-learners (Fig. 5b). Phylogenetic analysis of these AVL-CSAVs was consistent with the PCA results, where instead of branching with its closest relatives, the songbirds, rifleman was on a branch outside and next to the vocal learners (Fig. 5c,d). These results support the assumption that rifleman is a vocal non-learner. The results also suggest that some of the convergent sites and genes could be associated with their convergent trait.
Functions of convergent genes. To investigate the biological functions of genes with convergent variants in vocal learners and in control sets, gene ontology (GO) analysis was performed on 9,513 (1,057 species sets * 9 variants types) gene lists with 1 or more convergent variants (Supplementary Note 8). Among them, at least one significant (adjusted p < 0.05) GO term was found for 1,038 gene lists (10.9%). We further found a positive correlation between the number of significant GO terms and the number of genes with convergent variants in each list (Fig. 6a); however, we found negative correlation between the average adjusted p value of significant GO terms and the number of genes with convergent variants (Fig. 6b).
In vocal learning birds, we did not find any GO enrichment for the total AVL-CSAV gene list. However, the AVL-iCSAVs gene list subset was significantly enriched for ‘learning’ (GO:0007612, adjusted p = 0.042). Four genes were responsible for this enrichment (DRD1B [also known as DRD5], LRRN4, PRKAR2B, and TANC1; Fig. 6c). The identical amino acid convergences of DRD1B, PRKAR2B, and TANC1 also showed identical codon convergences (AVL-iCSCVs) in all vocal learners, while that of LRRN4 showed different synonymous codon changes (AVL-dCSCV; Fig. 6d). Of the 1,056 control species combinations, only one control gene list (a Ctrl-dCSCV gene list) showed significant enrichment for ‘associative learning’ (GO:0007612, adjusted p = 0.035); the associated set of species (Fig. 6e) did not include any vocal learners, but a convergent variant in LRRN4 contributed to this functional enrichment (Fig. 6f). The findings indicate that convergent genes in vocal learners do function in the brain and for learning.
Convergent amino acid sites are fixed and positively selected. We checked for additional evidence of whether the CSAVs in vocal learners are reliable, by checking for sequence assembly artifacts and SNPs within species. We used the dbSNP database of a representative vocal learner (zebra finch; n = 1,257 samples; build 139) and a vocal non-learner (chicken; n = 9,586 sample, build 145; Supplementary Note 9). At the 154 AVL-CSAV sites, zebra finches showed complete fixation without any nonsynonymous polymorphisms. However, one missense SNP was found in the chicken OTOA gene (c.2581A > G, p.Thr861Ala) resulting in an amino acid change identical to that of vocal learners (Supplementary Fig. 7). We also validated fixation of the convergent substitution in DRD1B by PCR of genomic DNA and sequencing of 3 male and 3 female zebra finches and chickens (Supplementary Fig. 8). These findings indicate that the vast majority (99.4%) of the convergent amino acids we identified in vocal learners are the result of true species-specific variants.
In addition, to consider positive selection on the convergent sites, we performed dN/dS analysis with the branch-site model for CSAV genes in the avian vocal learning species and their closest control set (songbirds, parrots, and swifts; Supplementary Fig. 9). We found that under half of the iCSAV sites showed signs of positive selection (Likelihood ratio value (D) > 0, ω2.foreground > 1) in the vocal learning birds (10 of 25 genes, 40%) and the closest control set (12 of 26 genes, 46%), with 12% (3) and 23% (6) being statistically significant, respectively (adjusted p < 0.05, posterior probability > 0.5). These findings suggest that a subset of iCSAV genes in different species combinations have been positively selected whether the species share a convergent trait or not, and it does not seem to need a greater number of positively selected sites for the avian vocal learning ability.
Genes with convergent identical changes are specialized in vocal learning and the associated brain subdivisions. We next tested if the AVL-CSAV genes are expressed in vocal learning brain circuits. We analyzed 8 brain transcriptome data sets, which include genes that show singing-regulated expression (increased or decreased) in song learning nuclei of songbirds26 (Area X, HVC, LMAN, and RA), differential expression (increased or decreased) in song nuclei compared to their surrounding non-vocal motor brain regions (NUC vs SUR), one song nucleus compared among the other song nuclei (NUC vs other NUCs), and the surrounding regions of each song nucleus compared to the other surrounding regions (SUR vs other SURs), from independent experimental data sets (DEG_2014: microarray method in 201414,26, DEG_2019: micro-dissected RNA sequencing in 201927, and DEG_2020: laser capture microscope with RNA sequencing in 202027; Supplementary Note 11, Supplementary Fig. 10, and Supplementary Data 2). Relative to the average of all genes (8,295 avian orthologous genes) measured, we found no enrichment of AVL-CSAV genes (0–13%) among the singing regulated genes or song nuclei specialized genes relative to the surrounding brain regions, whether positively selected or not (Fig. 7a). However, in two independent transcriptome experiments, we found 60 to 100% of the positively selected AVL-iCSAV (AVL-PS-iCSAV) genes were enriched among the differentially expressed genes in one song nucleus relative to the others, and some of those were also enriched to a lesser degree in the adjacent surrounding brain subdivision relative to the others (Fig. 7a). These enrichments were not found for the AVL-dCSAV genes, not for any positively selected gene set in the closest related control set (Fig. 7a), nor for all control sets even before positive selection analyses (not shown). Out of 4 song nuclei, Area X involved in song learning showed the highest number of differentially regulated genes out of singleton orthologous genes of birds or genes with amino acid convergences specific to avian vocal learners in comparisons among a song nucleus and the other song nuclei (NUC vs other NUCs, DEG_2020; Supplementary Data 2, Fig. 7b).
Out of 25 AVL-iCSAV genes, a total of 8 genes (32%) had positively selected sites in vocal learners and differential expression specific to a song nucleus and surrounding brain subdivision: B3GNT2, DRD1B, FNDC1, HMGXB3, MTFR1, PIK3R4, PRKAR2B, and SMPD3 (Table 1). These include two genes, DRD1B and PRKAR2B, revealed in the GO analyses for learning functions (Fig. 6c, d). Further DRD1B has specialized up-regulation specific to adult Area X compared to its surrounding striatum (Table 1 and Fig. 7c)28–30
Table 1. Avian vocal learning-related candidate genes with iCSAV under positive selection supported by differential expression on song nuclei and surrounding regions. NUC vs SUR: song nucleus compared to its surrounding non-vocal motor brain regions. NUC vs other NUCs: a song nucleus compared to other song nuclei. SUR vs other SURs: a surrounding region of song nuclei compared to another song nucleus.
Gene Symbol
|
Ensembl Gene ID
|
CSAV position
|
AA of vocal learners
|
AA of vocal non-learners
|
Likelihood ratio value (D)
|
Adjusted p value (FDR)
|
dN/dS (ω2) of foreground branch
|
Posterior probability (B.E.B.)
|
DEG2014 (NUC vs other NUCs)
|
DEG2019 (NUC vs SUR)
|
DEG2020 (NUC vs SUR)
|
DEG2020 (NUC vs other NUCs)
|
DEG2020 (SUR vs other SURs)
|
B3GNT2
|
ENSTGUG00000002218
|
253
|
N
|
H
|
1.14
|
2.75.E-01
|
3.3
|
0.999**
|
AreaX Up
|
RA Down
|
|
AreaX Up
|
AreaX Up
|
DRD1B
(DRD5)
|
ENSTGUG00000009908
|
440
|
A
|
IV
|
1.22
|
2.74.E-01
|
3.7
|
0.5
|
AreaX Up
|
AreaX Up
|
LMAN Down
|
AreaX Up
& LMAN Down
|
AreaX Up
|
FNDC1
|
ENSTGUG00000011376
|
1175
|
S
|
G
|
4.31
|
6.37.E-02
|
6.7
|
0.981*
|
|
RA Up
|
|
RA Up
|
|
HMGXB3
|
ENSTGUG00000000975
|
325
|
D
|
-E
|
0.99
|
2.97.E-01
|
2.3
|
0.995**
|
AreaX Up
|
|
|
AreaX Up
|
|
MTFR1
|
ENSTGUG00000011257
|
104
|
T
|
-AGP
|
3.24
|
1.01.E-01
|
10.1
|
0.521
|
|
|
LMAN Up
|
AreaX Down
|
|
PIK3R4
|
ENSTGUG00000004027
|
673
|
C
|
R
|
4.87
|
4.93.E-02*
|
10.4
|
0.997**
|
AreaX Up
|
|
|
AreaX Up
|
|
PRKAR2B
|
ENSTGUG00000003068
|
98
|
V
|
-I
|
25.44
|
3.57.E-06**
|
295.8
|
0.999**
|
Ra Down
|
|
|
AreaX Up
& RA Down
|
AreaX Up
|
SMPD3
|
ENSTGUG00000006964
|
311
|
C
|
-Y
|
6.00
|
3.37.E-02*
|
14.3
|
0.994**
|
AreaX Up
|
|
|
AreaX Up
& RA Down
|
AreaX Up
& RA Down
|
Convergent evolution between vocal learning birds and human in the same genes. Lastly, we investigated amino acid convergences between vocal learning birds and humans. We performed SAV analysis for 18,565 singleton orthologous genes in primates, and discovered 11,317 genes (61.0%) contained 126,087 human-specific amino acid substitutions relative to other primates (Human-SAV; Fig. 8; Supplementary Note 12), including the well-known p.Thr303Asn and p.Asn325Ser amino acid substitutions in FOXP2 specific to humans (Supplementary Fig. 11). Since this is over 60% the coding genes in humans, it is unlikely that the majority of these substitutions are associated with differences between humans and other primates. Narrowing them down to our candidate gene list to those found in avian vocal learning species, out of 141 AVL-CSAV genes, 125 singleton orthologous genes were identified in human (Supplementary Data 3). Among the 125 genes, 85.6% (107 genes; well over 61%) contained 396 Human-SAV sites compared to other non-human primates (Fig. 8). For random and core control sets of birds, on average 64.8% and 77.9% included Human-SAVs, respectively (Fig. 8). Despite this lower average % in control sets, several random and core control species sets still showed higher percentages with Human-SAV (Fig. 8). These findings are consistent with analyses among birds, that although vocal learning birds and humans have convergences in the same genes, the total number is not higher than some control sets that may not have trait convergences with humans.
We scanned for overlaps of the amino acid substitutions in vocal learning birds and in humans. Out of 107 bird-primate orthologous genes with 120 AVL-CSAV sites among birds and 396 Human-SAV sites among primates, only SETD4 had an overlapping site, in the 103rd position of the primate and avian peptide alignment (Fig. 9a). However, the amino acid Arginine (R) in one avian vocal learning bird (Budgerigar) was identical to that of non-human primates, as opposed to Glycine (G) in human; vocal non-learning birds had a Glutamine (Q; Fig. 9a). In contrast to this result, we found that of the 107 genes with avian vocal learner convergences and with human-specific substitutions among primates, 79 showed 111 conserved protein domains containing at least one of AVL-CSAV and Human-SAV site, respectively (Supplementary Note 13). This included four genes with identical convergent substitutions in vocal learning birds, enrichment for learning in the GO analyses, with positive selection and differential expression between song learning nuclei (Figs. 9; DRD1B, LRRN4, PRKARB, and TANC1). DRD1B showed one AVL-iCSAV site and one Human-SAV site located in the c-terminal intracellular region that interact with DRD231, a gene also associated with a spoken language disorder32 (Fig. 9b). These convergent variants specific to vocal learning birds and humans, but at different sites in the signaling domain of the DRD1B receptor along with its specialized expression serve as candidates for changing dopamine receptor function in human speech and songbird vocal learning circuits28,33. These findings indicate that in more distantly related species, convergence in genes associated with convergent traits may occur with lineage-specific convergences occurring in different amino acid sites, but in the same protein domain.