The structure of the teleost gap junction protein gene family
The compressed tree with the connexin subfamilies for teleosts and mammals is shown in Fig. 1. All sequences involved are shown in Suppl. Fig. 1-12. A few of the expanded branches are shown in Figs. 2-6 (Fig. 2, gjb7; Fig. 3, gja4; Fig. 4, gjd2; Fig. 5, the “gjb4like” complex; Fig. 6, cx39.2), and the remaining branches are shown in Suppl. Fig. 14[1]. In this tree, and in all trees made for the major statistical analyses (Suppl. Table 1), the GJE1/gje1/cx23 group was omitted, because the inclusion of the GJE1 orthologous group caused long-branch attraction [53, 54]. In fact, the long-branch attraction was so intense that it ripped apart both the delta and gamma subfamilies, and caused the highly variable groups of GJC3 and GJD4 to locate in the vicinity of the GJE1 group (compare Fig. 1 and Suppl. Fig. 15). However, we did include a human pseudogene in the Cx39.2 group (Fig. 6), but not the corresponding pseudogenes from some other mammals (Suppl. Fig. 12). This orthologous group is further discussed below. We also excluded rodent gja6 (which is the ortholog of the human pseudogene sometimes called Cx43pX [29]) and a cod gjd2 sequence (Gm-NN-gjd2*1-G01582). This sequence often split out from its expected gjd2 group, and we excluded it to make clearer distinctions within the different gjd2 groups.
Overall, it was evident that the structure of the connexin gene family was similar across all the teleosts. There were examples of species-specific gene duplications or lack of genes, but at present we cannot with certainty ascribe all such “anomalies” either to biological and genetic reality or to partial genome sequencing and/or erroneous genome assembly. The overall similarity should make it rather simple to extend the gene identifications to other teleost species when their genomes are sequenced, thereby easing their annotation. However, this depends on consistency in the naming of the genes in the family, which is presently lacking, as shown below.
The mixture of nomenclatures
As can be seen in Figs. 2 to 6 (and also in Suppl. Fig. 14 and Suppl. Tables 3-6), there was often little consistency in naming within many of the gene clades, as some of the genes were named by the size nomenclature and others by the Greek nomenclature. Here we sum up the nomenclature for some of the teleost species. More details are found in the Supplementary Tables and Supplementary Figures.
Zebrafish is undoubtedly the most highly investigated teleost [47], with its genome sequencing starting in 2001, the first genome assemblies available in Ensembl around 2005, and the latest assemblies and annotations from 2017/2018 (Ensembl release 91, GRCz11). Thus, we would expect the gene nomenclature to be of good standard and consistent with the intentions expressed in the Zebrafish Nomenclature Conventions [52]. In zebrafish, among the 38 unique and predicted genes present in GenBank (Suppl. Table 3 and Suppl. Fig. 5), 25 genes followed the size nomenclature and 13 genes followed the Greek nomenclature. The naming of 37 predicted genes in Ensembl was rather similar to GenBank, with 31 sequences having the same name in both databases (Suppl. Table 3).
Fugu was the first teleost with its genome published [50], with the last genome assembly from 2011 (in Ensembl) and annotations from 2018 [55]. In July 2019, the Fugu annotation was updated in GenBank. Many of the previous GenBank predictions changed names from the combined Greek and size nomenclature (gja1-cx43) to Greek nomenclature only (gja1). In many cases the accession numbers also changed (Suppl. Table 4). After the GenBank update, three genes followed the size nomenclature and 38 genes followed Greek nomenclature. One previously predicted gene (Fr-gja3like-XM_003970457) was lost in the update. Fourteen genes can now be said to have the same naming in GenBank and Ensembl (disregarding upper/lower case letters, and considering gja5a = gja5), all in the Greek nomenclature (Suppl. Table 4). In Ensembl, 16 Fugu genes followed the Greek nomenclature, 21 genes followed size nomenclature, one gene had no name, and four genes were not predicted (Suppl. Table 4).
For cod sequences in Ensembl (Suppl. Fig. 10, Suppl. Table 5), eight followed Greek nomenclature (six in upper case and two in lower case), 18 followed size nomenclature, 17 were predicted but not named, and one was not predicted (but found by us). The recently available cod chromosome level genome assembly in GenBank [56] and the corresponding gene predictions provided us with the possibility to compare the naming of the new GenBank predictions with the Ensembl cod gene predictions (Suppl. Table 5). Only four sequences had been given the same name in Ensembl and GenBank (including lower/upper case letters as identical; Suppl. Tables 4 and 6).
For the GenBank predictions in herring, 32 genes followed the Greek nomenclature, four followed the size nomenclature, and eight followed a mixed nomenclature, in addition to two non-predicted genes (one of them, gja8, considered as a part of an unrelated gene; Suppl. Fig. 9 and Suppl. Table 6). In the recent annotation from the novel chromosomal level assembly of herring added to the Ensembl database (Sept. 2019) [57], the predictions contained nine genes in Greek nomenclature and 20 genes in size nomenclature, in addition to 14 predicted genes with no name, and three genes that were lost, probably due to erroneous genome assembly (see below) (Suppl. Table 6). Only two genes had totally identical names in Ensembl and GenBank; four genes if upper/lower case letters and combination Greek-size nomenclatures were considered identical to the lower case Greek nomenclature.
Only a few of the eel connexins in the GenBank transcriptome shotgun assemblies had been named, with several having a hybrid nomenclature not commonly used (such as CXA5, cxb1, CXG1, etc.).
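The tallies above (Greek, size, mixed, and unnamed predictions per database) can be reproduced with a simple pattern-based classifier. The following is a minimal sketch only: the regular expressions and the helper names `classify_name` and `tally` are illustrative assumptions, not the procedure used for the counts reported here, and real database names will contain variants that these patterns do not cover.

```python
import re
from collections import Counter

# Hypothetical patterns (assumptions for illustration only).
MIXED = re.compile(r"^gj[a-e]\d+[ab]?-cx\d+(\.\d+)?$", re.IGNORECASE)  # e.g. gja1-cx43
GREEK = re.compile(r"^gj[a-e]\d+[ab]?(like)?$", re.IGNORECASE)         # e.g. gja1, gjd1a
SIZE = re.compile(r"^cx\d+(\.\d+)?(like)?$", re.IGNORECASE)            # e.g. cx39.4


def classify_name(name):
    """Return 'mixed', 'greek', 'size', 'unnamed' or 'other' for one gene name."""
    if not name:
        return "unnamed"
    if MIXED.match(name):
        return "mixed"
    if GREEK.match(name):
        return "greek"
    if SIZE.match(name):
        return "size"
    return "other"  # e.g. hybrid forms such as CXA5


def tally(names):
    """Count how many names fall into each nomenclature class."""
    return Counter(classify_name(n) for n in names)


if __name__ == "__main__":
    example = ["gja1", "cx39.4", "gja1-cx43", "GJA3", "Cx32.2like", "", "CXA5"]
    print(tally(example))
```

Upper and lower case are treated as equivalent here, which mirrors the comparisons above in which case differences were disregarded.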
Multiple names for a distinct ortholog within teleosts
There were three common inconsistencies within an orthologous group, two of which are considered in this section, and the third in the next section. The first was that some genes within the group were named according to the Greek nomenclature, and other genes according to the size nomenclature. For example, within the GJB7 group (also called connexin25 in mammals), some teleost sequences were named gjb7, other sequences were named cx28.8, and some combined the Greek and size nomenclature, such as gjb7-cx25 (Fig. 2).
The second inconsistency was that evident orthologs had been given different numbers in the Greek nomenclature. One example was the teleost orthologs for mammalian GJA4, also called connexin37 (Fig. 3). They were called gja4 in Fugu, cx39.4 in Tetraodon, stickleback and zebrafish, and gja6like in Atlantic herring. It should be noted that GJA6 is a different gene group that was generated by a mammalian-specific gene duplication of GJA1 (connexin43), maybe by retrotransposition. GJA6 is a pseudogene in humans and some other species (called connexin43-related pseudogene on the X chromosome, Cx43pX, in ref. [23, 29]). In other species, including rodents, dog and elephant, GJA6 appears to be a functional gene [23, 29]. Another example is found within the major GJD2 group (Fig. 2C). Zebrafish NM_001128766 and stickleback ENSGACG00000020357 (no GenBank entry) were both called gjd1a, whereas the orthologs in Fugu were both called gjd2like (Fig. 4).
Distinct genes having identical names
The third common inconsistency was that clearly different sequences had the same name. In Fugu (using the GenBank sequences predicted before July 2019), there were two of each for Cx32.2like, gjb1like, gjb2like, and gjb3like genes; three gja3like and gjc1like genes; and four gjb4like and gjd2like genes (Fig. 4; Suppl. Table 4 and 7).
Atlantic herring (Clupea harengus) had its genome sequenced, assembled and annotated in GenBank in 2015 [17, 58], and recently a chromosomal level assembly [59, 60] was annotated in Ensembl in the fall of 2019 [57]. Thus, the prediction and naming of the genes should describe much of the current status for automatic annotation. In the GenBank 2015 annotation, there were two of each for gja5like, gjd2, and gjd3like; three of Cx32.2, gjc1like and gjd2like genes; and four genes called gja3like and gjb4like (Fig. 4, Suppl. Table 6 and 7). In the Ensembl annotation, each of the names occurred only once, but on the other hand, 14 of the gene predictions were un-named (Suppl. Table 6).
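Names shared by distinct sequences are straightforward to flag automatically. The sketch below assumes labels of the form name-accession, as used for the sequences in this paper, and groups accessions by name; the function `shared_names` is a hypothetical helper, and the herring labels in the example are taken from the text.

```python
from collections import defaultdict


def shared_names(labels):
    """Map each gene name carried by more than one accession to its accessions.

    Labels are assumed to look like 'name-accession'. The split is made at the
    LAST hyphen so that mixed names such as 'gja1-cx43' stay intact.
    """
    by_name = defaultdict(list)
    for label in labels:
        name, accession = label.rsplit("-", 1)
        by_name[name.lower()].append(accession)
    return {name: accs for name, accs in by_name.items() if len(accs) > 1}


if __name__ == "__main__":
    herring = [
        "gjb3like-XM_012822385",
        "gjb3like-XM_012818491",
        "gjb4like-XM_012818492",
        "gjd3like-XM_012837668",
        "gja9like-XM_012824682",
    ]
    print(shared_names(herring))
```

A shared name is only a warning sign: whether the accessions are distinct genes or assembly duplicates still has to be judged from the sequences themselves.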
We will use teleost gjd2like and gjd2 as examples. Gjd2like was used in several more or less closely related genes in the delta subfamily. More specifically, sequences with this name were found among the cx36.7, cx39.2, and the central gjd2 groups. These groups are briefly discussed below.
The central gjd2 group (Fig. 4) is a complex of sequences that are all closely related to the mammalian GJD2. Previously, these genes were named connexin36 in mammals and connexin35 or connexin35.1 [61] in fish. While mammals have one GJD2 gene, teleosts have up to four (as in zebrafish, Fugu, and stickleback) in this central gjd2 group. For convenience, we named groups of the teleost genes in the central gjd2 group as gjd2*1, gjd2*2 and gjd2*3, because they sometimes split into three groups, depending on the statistical analysis. Occasionally, one or two sequences split out of the gjd2*1 group, and ended in-between the other gjd2/GJD2 groups. This happened particularly often with Gm-NN-gjd2*1-G01582 (sequence found in Suppl. Fig. 10), which is why we excluded this sequence during the statistical analyses. Generally, the sequences within gjd2*2 and gjd2*3 stayed as unified groups, usually as a dichotomous clade (for discussion of ohnologies within these groups, see below).
The mammalian GJD2 was somewhat promiscuous in terms of which teleost sequence group it most closely adhered to, but most often it was gjd2*1 or gjd2*2. In zebrafish, these genes are among the few places where “a” and “b” have been added to some of the gene names in the databases. In the gjd2*2/*3 group, one of the zebrafish (and stickleback) genes is called gjd1a (but there is no gjd1b) and the other gjd2like. In the gjd2*1 group, one of the ohnologs in zebrafish, Tetraodon, stickleback and cod is called gjd2b (but there is no gjd2a).
Another group named gjd2like (in Fugu and Atlantic herring) was the cx36.7 group, called Dr17927 in a previous paper [24]. This group often branched off from the foot of the central gjd2 complex itself (Fig. 1), but in a few statistical analyses it located closer to gjd3 or gjd4 (Suppl. Table 1). As yet, there are no mammalian members in this group, and our previous work [24] suggested that this group was specific to fish.
Another orthologous group often named gjd2like has previously, and more uniquely, been called cx39.2 [29]. This orthologous group divided its location between the delta (most commonly) and gamma subfamilies depending on the model run, but it never located within or at the foot of the central gjd2 group (in contrast to cx36.7). The first mammalian member in the cx39.2 group was found in opossum [29], but we here show that this ortholog is also present in several other mammals, like several Afrotheria (Suppl. Fig. 12) and bats (Fig. 6), sometimes as obvious pseudogenes. A human pseudogene (NG_026166), named “Homo sapiens gap junction alpha 4 pseudogene on chromosome 17” (GJA4P) is not a pseudogene related to GJA4 but rather to the cx39.2 (GJD2-like) group according to the phylogenetic analyses (Fig. 6). Alignments of NG_026166 against GJA4 and representatives from the cx39.2-like group clearly indicated a closer relationship with the latter (Table 1; Suppl. Figs. 13A and 13B; Suppl. Tables 8 and 9). In a comparison at amino acid level between the conserved domains of human GJA4P and GJA4 vs. eel cx39.2 and cx39.4 (Table 1), the identity levels between the GJA4/cx39.4 (human/eel) orthologs were ~55%, the same as for GJA4P/cx39.2 (human/eel), which is clearly higher than GJA4P/GJA4 (human/human; ~38%) and GJA4P/cx39.4 (human/eel; ~34%). Also at nucleotide level, the human GJA4P showed higher identities to cx39.2 than to GJA4 from opossum and bat (Suppl. Table 9), e.g., conserved domains of Hs-GJA4P was 53.9% identical to opossum GJA4-XM_007492764 and 65.3% identical to opossum cx39.2 (= Md-GJD2like-XM_001376506) (Suppl. Table 9). Thus, the alignments were consistent with the phylogenetic results (Figs. 1 and 6), and thereby settled this pseudogene (NG_026166) to be incorrectly named in humans. It is not GJA4P, but rather Cx39.2P.
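The identity levels quoted above are percentages of matching positions in pairwise alignments. As a minimal sketch of that calculation, the function below counts matches over alignment columns in which neither sequence has a gap. This gap convention is an assumption made for illustration (the convention behind Table 1 and Suppl. Table 9 may differ), and the two short aligned strings are invented toy data, not connexin sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment: 5 ungapped columns, 4 of them identical.
    print(percent_identity("ACDE-F", "ACDQGF"))  # 80.0
```

Because the denominator changes with the gap convention, identities are only comparable when all pairs are scored the same way over the same conserved domains.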
On teleost connexin ohnologies
The phylogenetic analyses provided a strong indication of the presence of several ohnologous pairs in teleosts. However, distinguishing between paralogous pairs that have been created by tandem gene duplication and ohnologous pairs created by genome duplication might be difficult, especially if the assembly only exists as contigs or relatively short scaffolds. If a novel teleost genome assembly is being made, it would be valuable to have the answer to this question established in other species, simply because the naming should be different in the two cases. Thus, it is of importance to show whether the ohnologous relationship can be traced across teleosts in a reasonably systematic way. In other words, is the genomic location of a gene and its potential ohnolog in one or two species sufficient to give indications for other species?
As of today, most eukaryotic draft genome assemblies consist of thousands of scaffolds, and even if these scaffolds can be Mb long, they are just a fraction of the size of most eukaryotic chromosomes. For such scaffolds, only connexin genes positioned rather closely are informative. When this analysis started, chromosomal assemblies were not available for herring and cod, but both became available during the summer of 2019 [56, 59].
For looking more closely into ohnologous pairs, the Japanese eel genome assembly was used as a starting point, because eel is a member of the early diverging fishes. Table 2 summarizes the situation for the chromosomes (or linkage groups) containing the highest number of connexin genes, and Suppl. Table 10 gives the full overview. Eel linkage group (chromosome) 19 contained eight connexin genes (gja1, cx34.5, cx28.9, cx32.2, gja10, gjb7, gjd2*1, gje1). The same eight genes were found on zebrafish chromosome 20 and stickleback chromosome 18, and at least seven of them are collected at cod chromosome 21. Thus, there is a strong tendency that linked genes in eel also are linked in the other species. For some unknown evolutionary reason, this chromosome had relatively few examples of ohnologs. The ohnologous chromosome may have gone through some kind of genetic catastrophe. In fact, for the two connexins with the highest number of species showing ohnology, gja1 and gjd2*1, the ohnologs were found on an “unexpected” chromosome (7 in eel, 14 in herring and 17 in zebrafish). We use the term “unexpected” because these ohnologs deviated more from the location patterns we found for the other connexins on eel chromosome 7 (see below).
Eel chromosome 7 contained five connexin genes, in addition to the ohnologs of gja1 and gjd2*1, namely gja4, gja9, cx28.6, cx35.4 and cx34.4, and four of their ohnologs were placed at chromosome 4 (and chromosome 19 for gja1 and gjd2*1). In stickleback, all five genes were found on chromosome 10 (but chromosome 18 for gja1 and gjd2*1), and three of the ohnologs were found on chromosome 15, the fourth ohnolog (gja9) on chromosome 20, and the fifth was missing. In Tetraodon, three of the five genes (gja4, gja9, cx28.6) were found on chromosome 21, and two of these had ohnologs, gja9 on an unplaced scaffold, and cx28.6 on chromosome 10. Tetraodon chromosome 10 also contained the single copies of cx35.4 and cx34.4. In zebrafish, gja9, cx28.6, cx35.4, and cx34.4 were found on chromosome 17. Gja4 was present as a single paralog on chromosome 19, which also contained the ohnolog of cx28.6. Thus, we see for gja4, gja9, cx28.6, cx35.4 and cx34.4 on eel chromosome 7 that there was a strong tendency towards a pattern of consistency in distribution of ohnologous pairs to distinct chromosomes in all the investigated species, while gja1 and gjd2*1 tended to deviate.
In general, teleosts had four genes that were very closely related to mammalian GJD2. Although one or two of the sequences in the gjd2*1 group occasionally split out from the remaining genes, the two ohnologs (Table 2) generally stayed together, and there should be no doubt about the proper ohnology. In 14 of 21 statistical analyses gjd2*1 grouped together with mammalian GJD2, and these were considered as the appropriate orthologs. Gjd2*2 and gjd2*3 often dichotomously grouped together (in 11 of 21 statistical analyses), but other times split up. We believe that gjd2*2 and gjd2*3 most likely are ohnologs, although it could not be totally excluded that they are non-ohnologous paralogs located on different chromosomes.
If the genes that were linked in eel had broken linkages in other species, in many cases two or three of the most closely linked genes have moved to another chromosome than the rest of the group. A more complete overview containing all connexin genes and associated chromosomes is provided in Suppl. Table 10.
Of course, this analysis also showed closely related genes that were not ohnologs. E.g., the genes within the cx34.5 and cx32.2 groups (also known as cx32.7, cx32.3 and cx28.9) are not ohnologs, because they all are located on the same chromosome (19 in eel, 15 in herring, 20 in zebrafish, 21 in cod, 18 in stickleback, 14 in spotted pufferfish, and scaffold1917 in Fugu).
In summary, over the range of divergence time, large stretches of the chromosomes have been maintained reasonably intact subsequent to the teleost genome duplication. Thus, the corresponding ohnologs are found on other non-random chromosomes. However, both gene losses and tandem duplications might have occurred over the considered evolutionary period, which could complicate the interpretations. This is further complicated by the facts that sequencing probably does not reach complete coverage of the genome, which can cause the partial or full absence of a gene, and that the assembly process is not straightforward.
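The location-based reasoning used in this section can be summarized as a simple decision rule: closely related genes on the same chromosome are not treated as ohnologs, genes on different chromosomes are candidate ohnologs, and a gene on an unplaced scaffold is uninformative. The sketch below encodes only that rule. The function name and the use of `None` for an unplaced gene are assumptions for illustration, and "candidate" is deliberate, since paralogs on different chromosomes are not necessarily ohnologs.

```python
def classify_pair(chrom_a, chrom_b):
    """Classify a pair of closely related genes by chromosomal location.

    chrom_a and chrom_b are chromosome labels, or None if the gene sits on an
    unplaced scaffold or contig (assumed convention).
    """
    if chrom_a is None or chrom_b is None:
        return "uninformative"
    if chrom_a == chrom_b:
        return "same chromosome (not ohnologs)"
    return "candidate ohnologs"


if __name__ == "__main__":
    # Locations taken from the text.
    print(classify_pair("20", "17"))  # zebrafish gja1 and its ohnolog
    print(classify_pair("20", "20"))  # zebrafish cx34.5 and cx32.2 groups
    print(classify_pair("21", None))  # Tetraodon gja9; ohnolog on unplaced scaffold
```

In practice the rule is only a first filter; the phylogenetic grouping and the conservation of linkage across species, as described above, carry the actual argument.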
Lack of expected connexin genes may point to chromosomal misassemblies
As an example of practical use of this kind of information, we here briefly apply the knowledge of the outlined patterns of the connexin genes on (i) the first published herring genome assembly [17, 62], which was used as basis for GenBank gene predictions from 2015 (XM accession numbers in GenBank); (ii) the new herring chromosome level assembly [59, 60]; and (iii) a herring genome assembly made by the present authors [40, 58]. Although the GenBank herring gene predictions were superior when compared with most other fishes (in the sense that the predictions tended to follow the expected gene patterns), there were still some features worth noting.
- First, there was one easily found connexin (cx39.2) that was not predicted in the annotation from the 2015 herring genome assembly [17].
- Second, several connexin genes showed identical or near identical duplicates in the first herring genome assembly. The gjb3like-XM_012822385 (one of the ohnologs in the cx35.4 group) was identical to XM_012822374 and XM_012822365, found at three locations on scaffold NW_012217989. The gjb3like-XM_012818491 (the second ohnolog in the cx35.4 group) was identical to XM_012818489; found at two locations on scaffold NM_012210726. The gjb4like-XM_012818492 (one of the ohnologs in the cx34.4 group) was nearly identical to XM_012819490, and both were found on scaffold NW_012219726. The gjd3like-XM_012837668 was nearly identical to XM_012837669, and both were found on scaffold NW_012223269. Although such copies are not entirely biologically implausible, they are not probable, and are more likely caused by assembly errors. Indeed, in the initial stages of our own assembly most of them were not present in duplicate sequences, only becoming so in the last step where our assembly was fused with the published herring genome [40]. In the recently released (summer 2019) herring chromosome level assembly [59, 60] most of these duplicates have collapsed into a single copy of the sequence.
- Third, three connexin genes have “disappeared” from the new herring chromosome level assembly. These are gja9like-XM_012824682, gjb1like-XM_012819602 and gjb7like-XM_012823856. The corresponding orthologs are found in the other teleost species, and - even more importantly - hits were found in the two other herring genome assemblies. We have verified the presence of these genes in our early assemblies [40]. This strongly indicates misassemblies in the new chromosome level assembly. More specifically, the lack of gjb7 indicates a misassembly on chromosome 14 or 15. Indeed, an alignment of the relevant scaffold and chromosome showed breaks and inversions around the expected position of gjb7 at chromosome 15 (Fig. 7). The apparent lack of the gjb1 ohnolog indicates a misassembly on chromosome 8, where we indeed found breaks and inversions (not shown). We expected the lack of the gja9 ohnolog to indicate a misassembly on chromosome 14, but we found the relevant scaffold to align with chromosome 11, where again breaks and inversions were found (not shown).
Regarding the third point above, also the recent chromosome level assembly in cod [56] showed a “no hit” for the Gm-gja10 ohnolog Gm-cx52.6-G05425 and for Gm-gja5-G04028 (Suppl. Table 5 and 10). The lack of gja5 suggested a problem in the assembly of cod chromosome 20 around position 1,000,000 (Suppl. Fig. 16A). Gm-cx52.6 is located on a small and unplaced contig (not even containing the full-length sequence of the gene), which was unusable for dot plot alignment at a chromosomal scale. By using suitable scaffolds containing the cx52.6 ortholog from herring and stickleback, we believe there is a problem in assembly of cod chromosome 21 around position 2,700,000 (Suppl. Fig. 16B and C). Other alignments using the corresponding zebrafish sequence also pointed to the same location.
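The presence/absence reasoning in the third point can be expressed as a set comparison: a gene supported by hits in several assemblies but absent from one target assembly marks a region worth inspecting for misassembly. The sketch below is illustrative only. The function `missing_in_target`, the assembly labels, and the hit tables are assumptions (the hit sets are invented toy data built around the three herring genes named above), and a real comparison would start from sequence-search results rather than hand-written sets.

```python
from collections import Counter


def missing_in_target(target_hits, other_assemblies, min_support=2):
    """Genes found in at least `min_support` other assemblies but not in the target."""
    support = Counter()
    for hits in other_assemblies.values():
        support.update(set(hits))  # count each gene once per assembly
    return sorted(
        gene for gene, n in support.items()
        if n >= min_support and gene not in target_hits
    )


if __name__ == "__main__":
    # Toy hit tables (hypothetical), using gene labels from the text.
    others = {
        "first_assembly": {"gja9like-XM_012824682", "gjb1like-XM_012819602",
                           "gjb7like-XM_012823856", "gja1"},
        "authors_assembly": {"gja9like-XM_012824682", "gjb1like-XM_012819602",
                             "gjb7like-XM_012823856", "gja1"},
    }
    chromosome_level = {"gja1"}
    print(missing_in_target(chromosome_level, others))
```

Requiring support from more than one independent assembly reduces the risk that a flagged gene is itself an artifact of a single assembly.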
A more consistent nomenclature suggestion
We believe that improved gene predictions and annotations are possible through the proper incorporation of knowledge into the algorithms. Furthermore, it would certainly help if the genes were labeled with unique names, as is one of the underlying logics in the instructions from the Human Gene Nomenclature Committee and the Zebrafish Gene Nomenclature Conventions. For most of the genes in the teleost connexin family, it is easy to suggest names that follow the nomenclature guidelines. Suppl. Fig. 17 presents a suggestion that follows the recommendations from the Human and Mouse Gene Nomenclature Committees and the Zebrafish Nomenclature Conventions, and Table 3 shows a translation between the nomenclature systems, including the recently suggested “alphabetic” system in mammals [26]. In our suggestion, we maintain the Greek nomenclature naming and numbering of those genes that are well established names in human and mouse, and transfer the naming conventions to the corresponding orthologs in teleosts. We fully avoided the “-like” names, as they often are used for several distinct genes and thus do not indicate a concrete orthologous group, and in this way can be misleading.
The subfamily number (gjd1/2/3/4, etc.) for the groups where new names are suggested does not consider the chronological order of detection, but rather the numbers that are available. For example, cx39.9 is closely related to gja3, and is in fact often called gja3like. As gja1, gja3, gja4, gja5, and gja6 already are occupied, while gja2 is not, we suggest that cx39.9/gja3like be called gja2. The genes in the cx34.5, cx28.9 and cx32.2 groups are called gja11, gja12 and gja13, respectively. We skip gja7, as this name has historically been used for Cx45 (= GJC1).
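The numbering rule described above (take the lowest available number, skipping names with a conflicting history) can be sketched as follows. The helper `next_free_number` is hypothetical. The occupied set is an assumption for illustration: it contains the alpha numbers listed above plus 8, 9 and 10, which are taken as occupied because gja8, gja9 and gja10 are named elsewhere in this paper.

```python
def next_free_number(occupied, reserved=()):
    """Return the lowest positive integer that is neither occupied nor reserved."""
    n = 1
    while n in occupied or n in reserved:
        n += 1
    return n


if __name__ == "__main__":
    occupied = {1, 3, 4, 5, 6, 8, 9, 10}  # assumed occupied gja numbers
    reserved = {7}                         # gja7 was historically used for Cx45
    for group in ["cx39.9", "cx34.5", "cx28.9", "cx32.2"]:
        number = next_free_number(occupied, reserved)
        occupied.add(number)
        print(f"{group} -> gja{number}")
```

Under these assumptions the loop reproduces the assignments given above (gja2, then gja11 to gja13).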
Statistically, a strong link exists between cx35.4/gjb3like and GJB3. We therefore suggest that cx35.4 should be called gjb3, despite the lack of the hallmark of the mammalian GJB3 protein, namely the amino acid sequence CX5CX5C in the second extracellular loop, where all other connexins (except the GJE1 proteins) have the sequence CX4CX5C.
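The two cysteine spacings mentioned above can be told apart with regular expressions, reading X as any residue. The sketch below is a minimal illustration: the function name is hypothetical, the input is assumed to be the second extracellular loop only (searching a whole protein could give chance matches), and the two test strings are invented toy sequences, not connexin sequences.

```python
import re

# X is read as "any residue", so '.' is used for each X position.
CX4CX5C = re.compile(r"C.{4}C.{5}C")
CX5CX5C = re.compile(r"C.{5}C.{5}C")


def cysteine_patterns(loop_sequence):
    """Return the names of the cysteine spacing patterns found in a loop sequence."""
    seq = loop_sequence.upper()
    found = []
    if CX4CX5C.search(seq):
        found.append("CX4CX5C")
    if CX5CX5C.search(seq):
        found.append("CX5CX5C")
    return found


if __name__ == "__main__":
    print(cysteine_patterns("CAAAACAAAAAC"))   # toy loop with CX4CX5C spacing
    print(cysteine_patterns("CAAAAACAAAAAC"))  # toy loop with CX5CX5C spacing
```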
There is a particular problem in the beta subfamily in that the mammalian GJB2 and GJB6 are always located in a dichotomous manner in the phylogenetic analyses, and similarly for GJB4 and GJB5. There were no indications that cx30.3 located closer to either of GJB2 or GJB6, and similarly, cx34.4 did not locate closer to either of GJB4 or GJB5. It might be that cx30.3 is a precursor gene for both GJB2 and GJB6, and cx34.4 is a precursor gene for GJB4 and GJB5, as we have suggested earlier [23, 24]. Thus, several possibilities exist for naming these genes. Cx30.3 could be called gjb8 (following the present pattern in the Greek nomenclature), pre-gjb2/6 (indicating the potential of being a precursor for the two mammalian genes), or gjb26 (a variant of the previous, but with the potential danger that this could be mistaken for cx26). We suggest cx30.3 is called gjb8 and cx34.4 is called gjb10. We further suggest that cx28.6, which generally located at the root of ((GJB4-GJB5)-gjb10)-(GJB3-gjb3) (parentheses indicate branching structure), is called gjb9.
In the gamma subfamily, there are two groups concerned with renaming. The first one is in marsupials, where the majority of statistical analyses (Suppl. Table 1) support GJC1like/GJC2like genes probably being the orthologs of eutherian GJC3, as originally suggested [29]. The second group is cx43.4/44.2/gjc1like, which we suggest is renamed gjc4.
In the delta subfamily, the major problems concern the gjd2 complex. As briefly discussed above, we consider gjd2*2 and gjd2*3 probable ohnologs, and suggest that they are named gjd1, fitting with a zebrafish and a stickleback sequence within this group already named gjd1. The ohnolog pairs within gjd2*1 are probably orthologs with mammalian GJD2, and consequently we suggest they are named gjd2. The teleost cx36.7/gjd2like group never dichotomized with any of the mammalian genes and most often branched off from the root of the gjd2 complex. We suggest this group should be called gjd6. The last group is the little-studied cx39.2 group, which in mammals has a variety of names in database gene predictions, such as GJC2like, GJD2like and GJA4like. The mammalian genes robustly dichotomize with the corresponding teleost genes, which in the databases usually are called gjd2like. We suggest that this clade is called GJD5 in mammals (thus, the human pseudogene NG_026166 should be called GJD5P) and gjd5 in teleosts.
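For annotation pipelines, the renaming suggested in this section can be held as a lookup table keyed by the group names used in this paper. The sketch below lists only the renamings stated in the text above and is not a substitute for Table 3. The function `proposed_name` is a hypothetical helper, and the ambiguous “-like” database names are deliberately left out because a single such name can refer to several distinct groups.

```python
# Group name used in this paper -> suggested teleost name (from the text above).
PROPOSED = {
    "cx39.9": "gja2", "cx34.5": "gja11", "cx28.9": "gja12", "cx32.2": "gja13",
    "cx35.4": "gjb3", "cx30.3": "gjb8", "cx28.6": "gjb9", "cx34.4": "gjb10",
    "cx43.4": "gjc4", "cx44.2": "gjc4",
    "gjd2*2": "gjd1", "gjd2*3": "gjd1", "gjd2*1": "gjd2",
    "cx39.2": "gjd5", "cx36.7": "gjd6",
}


def proposed_name(group, mammal=False):
    """Return the suggested name for a group, or None if no renaming is listed.

    Mammalian names are written in upper case (e.g. GJD5), teleost names in
    lower case (e.g. gjd5).
    """
    name = PROPOSED.get(group.lower())
    if name is None:
        return None
    return name.upper() if mammal else name


if __name__ == "__main__":
    print(proposed_name("cx39.2"))               # gjd5
    print(proposed_name("cx39.2", mammal=True))  # GJD5
    print(proposed_name("gjd2like"))             # None: ambiguous, not mapped
```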
[1] Please note that the naming of each single sequence in most of the figures and tables follows as closely as possible the naming and orthography (including lower and upper case letters) in the respective databases from where the sequences were collected (see also detailed explanation in the Methods section and in legend to Fig. 2).