Composition and distribution of algal selenoproteome
We predicted more than 1000 selenoprotein genes from genomic (36 organisms) and/or transcriptomic (including EST) datasets of 137 algal species (detailed information about these organisms is given in Table S1 in supplementary file 1). The distribution of selenoproteins and their Cys-containing homologs in these organisms is shown in Figure S1 in supplementary file 1. Details of these selenoprotein genes are available at the SPDB database website (http://www.selenoprotein.com). Algal selenoproteins can be searched by textual information such as species or selenoprotein family name, or by sequence using a web BLAST tool[29]. For each selenoprotein gene, the nucleic acid sequence, amino acid sequence, SECIS element, gene splicing structure, and EST alignment information were recorded. A detailed description of this database is given in Figures S10-S15. Because the majority of organisms examined here had only transcriptomic data, we cannot exclude the possibility that additional selenoprotein genes in some of them were not sequenced. Figure 1A shows the distribution of different selenoproteins and their homologs in the 36 algae with genomic sequences.
According to the taxonomic classification of algae[30-35], we divided these species into Plantae (including Green algae, Red algae, and Glaucophytes), the SAR group (including Stramenopiles, Alveolates, and Rhizaria), Cryptophytes, and Haptophytes. The majority of these algae (34 out of 36) belong to Plantae and the SAR group. The composition of the algal selenoproteome varied dramatically among taxonomic groups, and in six species no selenoprotein gene could be detected (Figure 1 and Figures S1 and S3). However, within specific, relatively narrow lineages, the number of selenoproteins appeared to be more stable. For example, all the algae with larger selenoproteomes (more than 20 selenoproteins, shown by green branches in Figure 1) are concentrated in Mamiellales and Diatoms, whereas the algae with smaller selenoproteomes (fewer than 2 selenoproteins, shown by red branches in Figure 1) are concentrated in Red algae and Eustigmatophyceae.
In Plantae, the size of the selenoproteome varied significantly among organisms. Red algae and glaucophytes had very small selenoproteomes, including two organisms (Chondrus crispus and Cyanidioschyzon merolae) in which no selenoprotein gene could be detected. Among green algae, Mamiellales had the largest selenoproteomes (>20 selenoproteins), whereas Sphaeropleales, Streptophyta, and Trebouxiophyceae had the smallest (0-5 selenoproteins). Streptophyta is evolutionarily closer to land plants than other algae are. Although Klebsormidium flaccidum, the only organism in this clade with a sequenced genome, contained only two selenoprotein genes, more selenoprotein genes were detected in other streptophytes with EST data, such as Nitella hyalina and Chaetosphaeridium globosum, which are thought to be closer to higher plants than K. flaccidum (Figure S1).
The distribution of known selenoproteins in the SAR group was also highly variable. Stramenopiles is the largest group within SAR and includes Diatoms, Brown algae, Yellow-green algae, Phaeophyceae, and Eustigmatophyceae. In the pelagophyte Aureococcus anophagefferens, 82 selenoprotein genes belonging to 33 families were found; this organism was previously reported to have the largest eukaryotic selenoproteome[26]. The number of selenoprotein genes in diatoms varied from 20 to 44, which is similar to the size of the selenoproteomes in the Mamiellales order of green algae. Brown algae and yellow-green algae had much smaller selenoproteomes (5-6 selenoprotein genes). Moreover, no selenoprotein gene was detected in Eustigmatophyceae. Alveolates and Rhizaria are the other two groups of SAR; we detected 29 and 23 selenoprotein genes in Symbiodinium minutum (Alveolates) and Bigelowiella natans (Rhizaria), respectively.
Two additional algae with sequenced genomes are Guillardia theta (Cryptophyte) and Emiliania huxleyi (Haptophyte). Fourteen selenoproteins belonging to 12 families were detected in G. theta. Surprisingly, a total of 96 selenoprotein genes, belonging to 25 different families, were identified in E. huxleyi; this is the largest selenoproteome reported in any organism so far. Such a large number of selenoprotein genes might be related to the highly repetitive nature of the E. huxleyi genome[36].
In total, 42 selenoprotein families were predicted in algae. Many algal selenoproteins have homologs that contain no Sec. In the most common case, the position corresponding to Sec is occupied by Cys (hereinafter referred to as Cys-homologs). In addition, there are many other homologs of selenoproteins in which the position corresponding to Sec is occupied by neither Sec nor Cys (hereinafter referred to as Other-homologs). Although these homologous proteins (Cys-homologs and Other-homologs) are probably not related to selenium metabolism, their sequence similarity suggests that they may function similarly to Sec-containing selenoproteins. More importantly, they carry information on the evolution of selenoprotein families. Therefore, we also included Cys-homolog and Other-homolog data when analyzing the evolution and distribution of the selenoprotein families.
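The three categories defined above can be expressed as a simple rule on the residue aligned to the Sec position. The following is a minimal sketch, not the pipeline used in this study; the function name, the column index, and the toy sequences are hypothetical, and Sec is assumed to be written as U in the aligned protein sequences.

```python
def classify_homolog(aligned_seq: str, sec_column: int) -> str:
    """Classify one aligned protein by the residue at the Sec column.

    aligned_seq: protein sequence from a multiple alignment ('-' for gaps).
    sec_column:  0-based alignment column of the Sec residue (assumed known).
    """
    residue = aligned_seq[sec_column].upper()
    if residue == "U":          # selenocysteine
        return "Sec-containing"
    if residue == "C":          # cysteine in place of Sec
        return "Cys-homolog"
    return "Other-homolog"      # any other residue, or a gap


# Toy alignment rows (hypothetical sequences, Sec column = 2).
toy_rows = {"seq_a": "GCUGS", "seq_b": "GCCGS", "seq_c": "GCSGS"}
labels = {name: classify_homolog(seq, 2) for name, seq in toy_rows.items()}
print(labels)
```

Treating a gap as an Other-homolog is a simplification; a real analysis would probably handle truncated or misaligned sequences separately.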
Figure S2 shows the distribution of algae containing different selenoproteins and/or their homologs. When all types of homologous proteins (Sec-containing, Cys-homologs, and Other-homologs) are considered, the PDI_a and TXNRD families are present in all 36 algal genomes, and GPX and GRX are present in 35 species (Figure S2A). These protein families may therefore be essential for the majority of algae. However, the proportion of Sec-containing forms differs among these families: PDI_a and GRX are present in the Cys-containing form in most algae, whereas GPX and TXNRD are mainly present in the Sec-containing form. Figure S2B shows the ranked distribution of Sec-containing selenoproteins in different algae. Sec-containing forms of four selenoproteins, GPX, SELENOU, SELENOT, and TXNRD, were found in more than half of the 36 genomes, making them the most widely distributed selenoproteins in algae.
Figure S2C shows the proportion of Sec-containing members in each protein family. Some selenoprotein families, such as MSP, SELENOK, SELENOS, USGC, AhpC_b, SELENON, and FesRD, occur almost exclusively in the Sec-containing form. In addition, 80% of DIO, TlpA, SELENOEW, and Hypo members are Sec-containing. These selenoproteins have fewer non-Sec homologs, indicating that their function is more dependent on selenium metabolism in algae. In contrast, other selenoprotein families, such as MsrB, PDI_d, AhpC_a, and GST, occur in the Cys-homolog or Other-homolog form in nearly 90% of the algal genomes.
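The per-family proportion of Sec-containing members can be computed from a flat list of (family, form) records. The sketch below is only an illustration under that assumption; the record layout, the function name, and the toy records are hypothetical and are not data from this study.

```python
from collections import Counter, defaultdict


def sec_fraction_by_family(records):
    """Return {family: fraction of members in the Sec-containing form}.

    records: iterable of (family, form) pairs, where form is one of
             "Sec", "Cys", or "Other" (labels assumed for illustration).
    """
    counts = defaultdict(Counter)
    for family, form in records:
        counts[family][form] += 1
    return {fam: c["Sec"] / sum(c.values()) for fam, c in counts.items()}


# Toy records (hypothetical, for illustration only).
toy_records = [
    ("GPX", "Sec"), ("GPX", "Sec"), ("GPX", "Cys"),
    ("MsrB", "Cys"), ("MsrB", "Other"),
]
fractions = sec_fraction_by_family(toy_records)
for family, fraction in sorted(fractions.items()):
    print(f"{family}: {fraction:.2f}")
```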
Identification of novel selenoproteins
In this study, three novel selenoprotein families were found in different algae (Figure 1 and Figure 2).
PDI_e. We found a large number of PDI-like protein genes in algae. A thioredoxin-like fold domain could be detected in most of them, so their functions may be related to redox regulation. Based on the amino acid sequences around the Sec residue, PDI sequences could be divided into five subfamilies (Figure S3). PDI_a, PDI_b, PDI_c, and PDI_d contained only one Sec; however, PDI_e (as named in this study) was found to have three neighboring Sec residues that form a GUGUU motif (Figure 2A). This is the first report of a selenoprotein with two consecutive Sec residues. Because of this newly found Sec-Sec form, we considered PDI_e a novel selenoprotein (EST sequence alignments and predicted SECIS elements of PDI_e in several organisms are shown in Figure S4). We speculate that the selenoprotein synthesis system of organisms containing PDI_e is capable of decoding consecutive TGA codons. Consistent with this, several PDI_e-containing algae also have a large number of selenoproteins (Figure 1, Figure S1, and Figure S3). Many selenoproteins could be detected even in some PDI_e-containing algae without genomic sequences. For example, in Isochrysis galbana, 17 selenoproteins of 14 families were found in 6,432 assembled EST contigs, and in Karenia brevis, 29 selenoproteins of 17 families were found in 29,618 assembled EST contigs.
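To make the GUGUU motif concrete, the sketch below locates it in a protein sequence and reports the total number of Sec residues and the longest run of consecutive Sec residues. It is an illustration only; the function names and the toy sequence are hypothetical, and Sec is assumed to be encoded as U.

```python
import re


def find_motif(protein: str, motif: str = "GUGUU") -> list:
    """Return 0-based start positions of the motif (overlaps allowed)."""
    return [m.start() for m in re.finditer(f"(?={re.escape(motif)})", protein)]


def longest_sec_run(protein: str) -> int:
    """Length of the longest stretch of consecutive Sec (U) residues."""
    runs = re.findall(r"U+", protein)
    return max((len(run) for run in runs), default=0)


# Toy sequence (hypothetical), containing one GUGUU motif.
toy_protein = "MKAGUGUUKL"
print(find_motif(toy_protein))       # motif start positions
print(toy_protein.count("U"))        # total Sec residues
print(longest_sec_run(toy_protein))  # longest consecutive Sec run
```

In the toy sequence the motif contributes three Sec residues, two of which are adjacent, which matches the description of PDI_e given above.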
We found a total of 12 PDI_e genes in 10 different algae. They are mainly distributed in Haptophyceae and the SAR group. The GUGUU motif has been lost in the homologous proteins of Fistulifera solaris. No Sec-containing PDI_e was found in the Plantae group, where only non-Sec sequences homologous to PDI_e were detected. In addition to the PDI_e proteins found in algae, Figure 2A also shows the proteins in the NR database that have sequence similarity to PDI_e. The results show that there is no protein homologous to PDI_e in bacteria, fungi, or other multicellular eukaryotes, so we conclude that this selenoprotein is found only in unicellular eukaryotic algae.
AhpC_b. Two families of selenoproteins containing AhpC_TSA domains were found in algae, AhpC_a and AhpC_b. AhpC_a was detected in almost all algae, but most of its members were Cys-homologs. Sec-containing AhpC_a was present in only three algal species: Aureococcus anophagefferens, Emiliania huxleyi, and Symbiodinium minutum. AhpC_b was found in Thalassiosira oceanica. There is detectable similarity between AhpC_b and AhpC_a, but the Sec-flanking sequences are significantly different. In the NR database, we found several homologs of AhpC_b. Interestingly, all of these homologs were found in prokaryotic organisms and in the Cys form. Figure 2B shows the phylogenetic tree and multiple alignment of the amino acid sequences of AhpC_b, its best homologs from prokaryotic species, and all Sec-form algal AhpC_a. As shown in Figure 2B, the UxxC (CxxC) motif of AhpC_b and its prokaryotic homologs differs from the TGGUT motif of AhpC_a. Because both the key motif and the overall sequence differ, we consider AhpC_b a novel selenoprotein (the SECIS element is shown in Figure S4). Because no similar eukaryotic sequence was found, we speculate that it originated from a prokaryotic ancestor by horizontal gene transfer. Given its AhpC_TSA domain, AhpC_b may have an antioxidant function.
SymSEP. We found a selenoprotein family that was present in the Sec form only in the genus Symbiodinium, and named it SymSEP. Four SymSEP selenoproteins were found in genomic sequences and EST contigs from two species, Symbiodinium minutum and Symbiodinium sp. C3. The detected SECIS elements are shown in Figure S4 in supplementary file 1 (in unpublished data, we also found a SymSEP in Symbiodinium microadriaticum).
A phylogenetic tree and multiple sequence alignment of SymSEP homologs are shown in Figure 2C. The figure includes all proteins similar to SymSEP found in the data from all 137 algae, together with similar proteins detected in the NR database. As can be seen, the Sec-containing form is present only in the genus Symbiodinium. Cys-homologs containing CxxC motifs are widely distributed in a variety of eukaryotic algae and bacteria. In addition, there are two branches that contain neither Sec nor CxxC motifs. Based on the phylogenetic tree, we speculate that SymSEP originated from prokaryotes in the Cys form and acquired Sec only in Symbiodinium after its divergence. A Trx-like domain was also detected in its coding region, so the function of SymSEP may be related to redox regulation.
Substitution of Sec
Sec is the functional core site of a selenoprotein, and its codon is the termination codon TGA. Mutations in this codon convert Sec to other amino acids, such as Cys (TGC, TGT) or Trp (TGG), whose codons differ from TGA only in the third base. Among these amino acids, Cys is the closest to Sec in its properties, and most selenoproteins have homologs in which Sec is replaced by Cys. The substitution between Sec and Cys is an important event in the evolution of selenoproteins.
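The codon relationships described above can be checked directly against the standard genetic code. The sketch below builds the standard codon table and lists the amino acids reached from TGA by a single change at one codon position; the function name is hypothetical. Note that the standard table reports TGA as a stop (*), since its recoding as Sec depends on the SECIS-dependent machinery.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_base_neighbors(codon: str, position: int) -> dict:
    """Map each codon differing from `codon` only at `position` (0-based)
    to the one-letter amino acid it encodes in the standard code."""
    neighbors = {}
    for base in BASES:
        if base != codon[position]:
            mutant = codon[:position] + base + codon[position + 1:]
            neighbors[mutant] = CODON_TABLE[mutant]
    return neighbors


# Third-base changes of TGA give the two Cys codons and the Trp codon.
print(single_base_neighbors("TGA", 2))
```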
Because the correct translation of a Sec-encoding TGA requires a complex synthesis system, including a SECIS structure located downstream of the coding region, the change from Cys to Sec is theoretically more difficult than that from Sec to Cys. Traces of Sec-to-Cys conversion, in the form of a SECIS structure found downstream of a Cys-containing gene, have been reported previously. We also found a SECIS in a Cys-containing PRX from Symbiodinium minutum (see Figure S5A-C in supplementary file 1). More interestingly, we found a pair of GRX genes in Fragilariopsis cylindrus whose sequences are highly similar (positives > 80%), but one is Sec-containing and the other is Cys-containing. Analysis of these two GRXs revealed a typical Sec-Cys substitution event. Most algae contain Cys-containing GRX, and Sec-containing GRX was found only in several selenoprotein-rich species from the SAR group and Haptophytes. No Sec-containing GRX could be found in the Plantae group. A phylogenetic analysis of algal GRX revealed that the Sec-containing proteins clustered in one subtree. Figure S6A shows the majority of the algal Sec-containing GRXs and a subset of the Cys-containing GRXs that are closely related to them. It can be inferred from the phylogenetic tree that most of the Sec-containing GRXs share a common ancestor (except 001, 002, 006). However, this subtree also contains a few Cys-containing homologs, which may have undergone Sec-to-Cys changes. The Cys-GRX and Sec-GRX of Fragilariopsis cylindrus highlighted in Figure S6A share a common parent node; in other words, they diverged only recently. More interestingly, the flanking genomic sequences of the two GRXs are homologous (see Figure S5D). Therefore, we hypothesize that these two GRXs are derived from the same Sec-containing ancestral gene through a genomic duplication event that occurred in this species or its ancestors. The original single GRX gene was duplicated into two copies, and in one of the copies Sec was converted into Cys by mutation. This is the first discovery of a genomic duplication event associated with Sec-Cys substitution.
As discussed above, the special TGA decoding mechanism and the complex selenoprotein synthesis system make it very difficult for a Cys to change into a functional and heritable Sec during evolution. However, the change can be achieved in one specific situation: the Cys-to-Sec mutation occurs in a species with a functional selenoprotein synthesis system, and it occurs in a coding region upstream of a functional SECIS sequence. The mutation then produces a decodable TGA (Sec) codon. If the resulting protein retains a complete or partial function that allows the species to survive and reproduce, it will be retained as a functional gene. Such events have been reported previously in several selenoproteins, especially those containing multiple Sec residues, such as SELENOP and several SELENOW proteins. In this study, we found several new examples of Cys-to-Sec events. We previously found a SELENOW with two Sec residues in a UxxU motif in amphioxus, whereas other SELENOW proteins have only one Sec in a CxxU motif. Interestingly, another UxxU-type SELENOW was found in this work (from Ostreococcus lucimarinus). A multiple sequence alignment of these SELENOW proteins is shown in Figure S7 in supplementary file 1. Another example of a Cys-to-Sec mutation was found in the SELENOJ family. SELENOJ was first discovered in vertebrates and was thought to exist only in multicellular animals[37]. Interestingly, multiple SELENOJ selenoproteins and Cys-containing homologs were detected in algae, including one sequence from Alexandrium tamarense containing two Sec residues (Figure S6B). In this two-Sec SELENOJ, the first Sec is also present in several algae and animals, whereas the second Sec was found only in the EST sequences of A. tamarense. It could therefore be evidence of a Cys-to-Sec evolutionary event, which could confer a novel selenium-related function at the new Sec position.
In addition to Cys-homologs, we searched for non-Cys-containing homologs of the 42 selenoprotein families in the data from 137 algae and in the NR database. In these Other-homolog protein sequences, the local region corresponding to the Sec motif has changed to other motifs. SELENOF is one of the earliest discovered animal selenoproteins[38]. It occurs mainly in the Sec-containing form in multicellular animals and as a Cys-homolog in only a few invertebrates (Arthropoda, Ecdysozoa, etc.)[39-41]. SELENOF is also widely distributed in algae, and the Sec-containing algal SELENOF contains the same CxU motif as the animal SELENOF. Interestingly, no Cys-homolog of SELENOF was found in algae. Instead, Other-homologs with different motifs were found in various algae. The CxU motif is converted to CMR in terrestrial plants and certain algae, and to DQW in some green algae. In addition, the Sec motif has changed substantially in some SELENOF proteins, such as that of Micromonas commoda, resulting in the loss of local conservation. Despite the loss of the Sec-containing motif, these Other-homologs are still preserved and functional in the algal genomes of many different evolutionary lineages, indicating that SELENOF has functions other than selenium-related ones. Figure S6D shows the distribution of Sec-containing proteins, Cys-homologs, and Other-homologs of the 42 algal selenoprotein families across the evolutionary lineages of eukaryotic algae (including terrestrial plants). In GPX, GRX, GST, MDP, PDI, and other families, the core Sec motif has also changed to a non-Sec motif. The figure also shows the distribution of selenoprotein homologs in terrestrial plants. Although terrestrial plants have no Sec-containing proteins, they contain homologs of most unicellular algal selenoproteins. Lineages closely related to terrestrial plants, such as Charophyceae (Nitella hyalina) and Coleochaetophyceae (Chaetosphaeridium globosum), have more selenoproteins, suggesting that the loss of selenoproteins in terrestrial plants may have occurred at a later geological age.
Selenoprotein gene clusters and fusion genes
Genetic recombination, transposition, or whole-genome duplication can change the genomic location of a DNA fragment. These events may lead to the clustering or fusion of genes. Previously, we reported clusters of selenoprotein genes in several invertebrate genomes, which might suggest a functional correlation between them[42-45]. Here, selenoprotein clusters were also observed in algae. Figure 4A shows the types of clusters and their occurrence in different algae. The gene structure and position of these clusters are shown in Figure S8 in supplementary file 1. As shown in Figure 4A, clustering of selenoprotein genes was found in only 13 species, most frequently in Emiliania huxleyi. The selenoprotein families most frequently found in clusters were MSRA and SELENOU. The SELENOF-PDI_a pair is the only cluster we detected across species, which suggests that the function of SELENOF may be correlated with that of PDI in the Mamiellales. Moreover, genome synteny flanking this SELENOF-PDI_a pair was also detected in these Mamiellales algae (Figure 4B). Not all Mamiellales selenoprotein gene clusters have this cross-species distribution; for example, AhpC_a-PDI_a, GST-DsbA, and Rhod-MSRA are found only in specific Mamiellales genomes. Considering the genomic collinearity, we speculate that the genomic fragment in which SELENOF-PDI_a is located may have important functional or structural conservation in microalgae. In Micromonas commoda, this genomic-level conservation was retained even though the Sec motif was lost. In addition, three clusters were composed of genes from the same selenoprotein family: two SELENOW genes in Chlamydomonas reinhardtii, two SELENOU genes in Emiliania huxleyi, and three PRX genes in Symbiodinium minutum. The adjacency of these genes in the genome indicates that they potentially originated from the duplication and divergence of the same ancestral gene.
Recombination or transposition events that occur within the coding region of a gene may result in the truncation or fusion of genes. We scanned the conserved domains of all algal selenoproteins. Figure 4C shows that a total of 36 domains were detected in 29 algal selenoprotein families; domain alignment diagrams for all selenoprotein families are provided in supplementary file 3. The most frequently detected domain in algal selenoproteins was the Trx-like domain, which was present in about half (20) of the algal selenoprotein families. Several of these families, such as AhpC, PRX, PDI, DsbA, GPX, GRX, and GST, are functionally related to the thiol/disulfide redox system. Other Trx-like-containing families, such as DIO, SELENOF, SELENOM, SELENOH, SELENOT, SELENOW, SELENOU, SELENOL, and TlpA, also have oxidation-reduction-related functions. In several selenoproteins, such as PITH, rhodanese, MSRA, and MSRB, no Trx-like domain could be detected; however, some of them have been reported to be functionally related to the oxidation-reduction of sulphur. The PITH selenoprotein contains the proteasome-interacting thioredoxin domain. The rhodanese-like selenoprotein is likely to be a sulphur transferase involved in cyanide detoxification. MSRA and MSRB, which are widely present in animals, reduce methionine sulphoxide[46, 47]. Other important functions were also found in algal selenoproteins. The hemerythrin metal-binding domain found in the algal TlpA selenoprotein suggests an oxygen-binding function[48]. The domains found in the FeS-oxidoreductase and reductase indicate catalytic activity related to iron-sulfur cluster binding[49]. The methylated-DNA-[protein]-cysteine methyltransferase selenoprotein (MDP) is related to DNA repair[50-52].
As shown in Figures 4C and 4D, novel domain fusions were detected for several selenoprotein families in certain algae, including a SELENOM fused with the pVHL (Von Hippel-Lindau disease tumor suppressor beta domain) domain (Aureococcus anophagefferens), another SELENOM fused with the ShKT peptide toxin domain (Emiliania huxleyi), and a fusion protein of two selenoproteins (E. huxleyi). Their coding regions were found in both genomic and EST sequences. Multiple sequence alignments of these proteins are shown in supplementary file 2. As pVHL was previously reported to be the substrate recognition component of an E3 ubiquitin ligase complex[53], the SELENOM fused with pVHL may have a function related to tumor suppression[53]. Moreover, considering that the ShKT domain is often found in anemone toxin proteins, which act as potent inhibitors of K(+) or ion channels, the fused SELENOM of Emiliania huxleyi may be related to the toxicity of blooms[54].
A fusion of two selenoprotein genes, GST (glutathione S-transferase) and MSRA (methionine sulphoxide reductase A), was found in E. huxleyi. The fusion gene is composed of 4 exons, a structure that is also supported by the EST sequences (Figure 4D). A multiple sequence alignment of this fusion protein with other selenoproteins shows the homology of each of its parts (supplementary file 2). This is the first identification of a fusion event between two selenoprotein genes. GST participates in the detoxification of reactive electrophilic compounds by catalyzing their conjugation to glutathione. MSRA reverses the inactivation of many proteins caused by the oxidation of critical methionine residues, by reducing methionine sulphoxide (MetO) to methionine. Both GST and MSRA are considered detoxification enzymes because of their antioxidant function. GST and MSRA have been reported to be co-induced under chemical stress conditions in bacteria[55, 56], suggesting that they are correlated in function and biological processes. The fusion in Emiliania huxleyi may strengthen the association between these two related genes. Further efforts are needed to explore the biological pathways involving these two enzymes.