An ”omic“ approach to Pyrocystis lunula: New insights related with this bioluminescent dinoagellate

Background: Pyrocystis lunula (Schutt) is a photoautotrophic dinoagellate without armored, frequently found in marine environments. Today, there are several biotechnological applications derived from the bioluminescent system of this species. From a post-genomic perspective, in order to study the whole proteome of P. lunula, an ”omic“ approach (transcriptomic-proteomic analysis) was assessed using fresh microalgae samples. Results: A total of 80,874,825 raw reads were generated (11,292,087,505 pb; 55.82 % GC) by mRNA sequencing. Very high quality sequences were assembled into 414,295 contigs (219,203,407 pb; 55.38 % GC) by the Trinity software, generating a comprehensive reference transcriptome for this species. Then, a P. lunula proteome was predicted and further employed through the rst proteomic analysis on this species. A total of 17,461 peptides were identied, yielding 3,182 protein identication hits. The identied proteins were further categorized according to functional description and gene ontology classication. Conclusions: The rst comprehensive molecular analysis of the microalgae P. lunula using both, transcriptomic and proteomic approaches, has been reported. Proteomic results represent a valuable piece in the understanding of this microalgae regulation at molecular level, and shed light on the identication of important factors involved in gene expression regulation. Indeed, the presence of the luciferin-binding protein (LBP), which had not been described so far in the genus Pyrocystis has been highlighted.

A measurement of the DNA content of P. lunula has shown that the DNA synthesis is low. P. lunula vegetative cells are diploid and possess 2 of each of the 49 chromosomes. Although the genome size of P. lunula has not been determined yet, dino agellates present enormous genomes ranging from 2 to 200 pg/nucleus of DNA (about 1.5-245 Gbp) [ 16 , 17 ], representing 1 to 80 times more than a human haploid genome, and for this reason it has been di cult to sequence the complete genome of different members of this lineage [ 27 ], resulting in an additional limiting factor to carry out proteomic studies in these organisms. Nowadays, have been reported only a complete and a partial genome for Symbiodinium kawagutii and S. minutum. These two species have the smallest dino agellate genomes identi ed so far [28][29][30]. The real challenge relies on bioinformatic analysis to assemble reads into genes or chromosomes, and this is particular hard to achieve because dino agellate genomes are highly redundant [ 31 ].
Despite the development of proteomics platforms, not many approaches have been developed to study microalgae proteomes [ 6 , 32-54 ]. The main aim of most of them was oriented to improve biodiesel production, as well as other relevant aspects of their biology. These studies have used a variety of proteomic strategies, from the MALDI TOF/TOF analysis to the qualitative and quantitative mass spectrometry (LC-MS/MS) approaches [ 55 , 56 ]. Recently, new advances in the technology of massive sequencing have allowed the generation of transcriptomic databases that in the past would have been extremely di cult to achieve, not only in terms of economic expenses, but also of time spent. Thus, the quantity of transcriptomic data produced has grown very fast and can now be easily extend to a genome wide transcriptome [ 31 ]. These efforts are likely to provide valuable information for our understanding of the biology of this important group, as well as potentiate the impact of their use in biotechnological applications.
The synergistic combination between this tool and the modeling of the proteome by LC-MS/MS will allow obtaining a wealth of information that, to date is not available in the case of P. lunula. It is clear that omics provide information with important implications for systems biology approaches [ 57 ], and the experience has shown that such data can transform our understanding of the basic biology, at the same time that opens the door to new research directions. Today, the limited genomic information available is a signi cant obstacle to develop global studies at the molecular level in the case of P. lunula. Driven by this challenge, our main objective was to generate a reference transcriptome and proteome that will support and complement future research lines in P. lunula.

Results
De novo transcriptome assembly A total of 80,874,825 raw reads were generated (11,292,087,505 pb, 55.82 % GC, reads length media of 140 pb). The percentage of residual ribosomal RNA remaining in the sample after the enrichment step on messenger RNA was quanti ed as detailed in the material and methods section. Identi ed ribosomal RNAs were then categorized in an attempt to identify potential contamination of putative prokaryotes present in the sample. Thus, from the 11.39 % reads passing the threshold in the rRNA search (9,212,736 reads), 9.16 % and 1.87 % matched eukaryotic ribosomal 18s and 28s, respectively, while only 0.24 % and 0.12 % matched bacterial 16s and 23s, respectively. The raw reads generated for the de novo assembly were deposited at the NCBI SRA database, BioProject accession number PRJNA497104.
A total of 25 assemblies were generated and further assessed to select the most suited algorithm and parametrization for a successful P. lunula transcriptome assembly. Thus, an analysis of the assemblies structure, for contigs greater than 400 nt, was carried out. Attending to the three algorithms computational requirements [ 58 ], RAY was rst employed to analyze the effect of setting kmer parameter, library size and reads quality (in terms of minimum read size and mean phred scores) in the algorithm performance. Results clearly show that increasing the kmer parameter have a tremendously deleterious effect on nal number of contigs obtained, although, as it was expected, contigs' size distribution improves with higher kmer values, specially showing higher Nx for kmer over 41 ( Figure 1). However, almost identical structure was obtained for all datasets when the kmer parameter was set within 21-41, suggesting that lower kmer values (within 21-31) could be appropriate in this case, avoiding an important loss of new contigs. Indeed, when each kmer combination is explored in detail to evaluate the effect of library size and reads quality lters, a slightly better structure is obtained when the Q20 (highest quality lter) is applied. With respect to the reads' length lters (L100 and L30), as both seems to behave similarly for the assemblies structure, we decide to further use the more strict ltered library (L100), because this is actually the more con dent sequence collection (more strict quality and length lters applied). Moreover, its reduced size ts well within the Trinity and Mira pipelines, as these two algorithms present much higher computational requirements than Ray for large datasets [ 58 ]. Finally, comparing all 10 assemblies obtained from the Q20 L100 library (Figure 2), that one obtained with Trinity software, and k25_NC20 parameters, presented the best equilibrium in terms of structure and total number of contigs.
Then, to evaluate the representativeness of each assembly, a BLASTn search was carried out for each sequence in each assembly against each sequence in each assembly (all-against-all reciprocal search for similarity), including two additional datasets downloaded from repositories (P. lunula, strain CCCM 517, https://www.imicrobe.us/; S. minutum [ 28 ]). A total of 3,645 BLASTn searches (27x27 assemblies at 5 evalue thresholds) were employed to retrieve the very best hit passing a speci c e-value threshold [see Additional le 1]. The percentage of sequences of one assembly that nd a hit in each other assembly was determined (Figure 3). To facilitate visualization, only results from reciprocal comparisons of Trinity_Q20 L100_k25_NC20 to all 27 assemblies were plotted. Results clearly indicate that while almost all sequences from most assemblies presented a hit, meaning that are also present in the Trinity evaluated assembly (red lines), the other assemblies clearly miss many of the contigs that are presented in the evaluated one (blue lines). Considering the two external datasets, results manifest that the P. lunula strain CCCM 517 dataset is very poor (only 9 sequences longest than 400 nt), and that the similarity at the nucleotide level of P. lunula with S. minutum is scarce.
A nal evaluation of the generated assemblies was carried out by proteomics. Thus, inferred proteomes were employed as references in a proteomic assay for P. lunula protein samples. Searching the obtained spectra against Trinity_Q20 L100_k25_NC20 and Trinity_Q20 L100_k31_NC20 databases yielded the highest number of identi ed proteins. These results, together with representativeness and structure analyses commented above propose the Trinity_Q20 L100_k25_NC20 assembly (414,295 contigs (185,177 > 400 bases); 219,203,407 pb; 55.38 % GC) as the best suited to the identi cation pro les obtained by MS analysis [see Additional le 2].

Proteomic analysis
To infer a ne representative proteome from the transcript sequences assembled in Trinity_Q20 L100_k25_NC20, a BLASTx was performed against the "nr" database, which represents all the public, nonredundant, protein sequences. As a result, matching hits were found for around 100,000 sequences (threshold: e-value < e-15). This information was very useful for both, performing further functional annotation of these matching transcripts, as well as, to identify their reading frame, allowing the identi cation of a set of coding sequences of P. lunula. For the remaining 300,000 sequences that do not showed clear homology, all the possible ORFs of the 6 reading frames were identi ed, and the protein sequence encoded by the longer ORF was retained, together with the previously identi ed set of coding sequences, to be later searched within the proteomic assay. Although some of this newly inferred protein sequences could be artefactual (consequence of non-coding RNAs translations, or assembling/sequencing artifacts) introducing noise into the proteome database, we considered of great relevance analyzing putative novel proteins that could be present here that do not yet exist in public databases. This could be of special importance in species clearly underrepresented in databases, as occurs with the lack of molecular studies on this entire genus.
All the sequences generated in both groups, with or without homology against "nr", as long as they were greater than 50 aa, were preserved, and the reference proteome was created. The Proteome Discoverer 2.1 software, with Sequest HT algorithm, was used to search all the spectra that were obtained in the protein sample in the MS/MS analysis (17,461 peptides) against that predicted proteome, and with a threshold of FDR < 0.01, 3,182 identi cations were obtained. The bidimensional distribution of the identi ed proteins thorough the proteomic assay within the total proteins present in the predicted proteome, according to their MW and IP was studied ( Figure 4). The mass spectrometry data was deposited in the ProteomeXchange Consortium through the PRIDE [ 59 ] partner repository, dataset identi er PXD01155.

Functional annotation
Using BLAST2GO software, all the identi ed proteins were functionally annotated according to the GO categories, enzyme codes and InterPro domains. Thus, Figures 5 and 6 shows the main functions identi ed within the identi ed proteins according to the GO biological process (BP) and molecular function (MF) categories, respectively. A complete list of identi ed proteins, together with all de functional information associated is reported in the Additional le 3. Regarding the biological process category (BP), the rst place is occupied by proteins related to oxide-reduction processes, followed by proteins related to the translation process and biogenesis of ribosomes. At the level of molecular function (MF), ATP binding molecules were found to be most abundant among the proteome followed by structural constituent of ribosome, protein and metal ion binding. All the enzymatic codes associated were further located into the KEGG database, using the KEGG-mapper tool, to identify the pathways in which the identi ed proteins could be acting [see Additional le 4]. Attending to the species contributing to the functional assignments, S. microadriaticum was the top-hit species, followed by Vitrella brassicaformis, Perkinsus marinus and Heterocapsa triquetra (Figure 7).Although this analysis may undergo a bias by the abundance of sequences from each species present within the public databases, our results seems consistent with the evolutionary relationship between Dino agellata, Chromerida, and Perkinzoa (alveolates) [ 9 , 12 , 31 ].

Discussion
Novel proteins and proteome mining From the 3,182 proteins detected in P. lunula, 175 (5.49 %) turn out to be sequences that do not have any homology with any protein identi ed so far in the different databases. Within this group, 59 stand out (PSM ≥ 5) and should be considered with a high level of con dence as novel proteins [see Additional les 5 and 2]. This is not unexpected since the transcriptomics data sets of dino agellates has showed, among other things, that in some cases more than 50 % of the genes do not have homology with any gene documented so far [ 31 ]. Many genes appear to be unique in dino agellates. For example, although many of the redox-regulated genes identi ed in P. lunula are conserved, the majority, do not shown signi cant similarities with other genes in the databases as around 68 % were novel [ 9 ].
Evolutionarily, dino agellate nuclear genomes are very dynamic, because they have obtained plastid targeted genes through successive horizontal gene transfer from different kinds of sources (tertiary replacement plastid, peridinin plastid, bacteria, cyanobacteria, haptophyte, red and green algae), rising to a highly chimeric nuclear genome [60][61][62][63]. Furthermore, segmental genome duplications [ 64 , 65 ], and retro-transposition of cDNA to the genome [ 66 ], are likely contribute to the redundant of genes and expansion of these genomes [ 28 ]. Due to their immense proportions, only a few dino agellate genomes have been sequenced and the information of the genomic structure has mainly derived from the study of individual genes. Tubulin, actin, polyketide synthase, rubisco, heat shock proteins, chlorophyll-binding protein, proliferating cell nuclear antigen, LCF, and luciferin-binding protein (LBP), are among those that have been investigated [ 31 ]. Interestingly, we have detected all these mentioned proteins within our study, and particularly relevant is the identi cation of LBP [see Additional le 3], which was thought absent from the bioluminescent system in the genus Pyrocystis.
Luciferin-binding protein (LBP) The bioluminescence system in dino agellates is unique, from a cellular and molecular perspective. The production of light occurs in organelles named scintillons [ 67 ], which contain the substrate luciferin, the LCF enzyme, and in the most of the cases a LBP [68][69][70][71]. The light emission depends on LCF-catalyzed oxidation of the luciferin, generally protected from oxidation by a LBP that binds luciferin at physiological pH. However, this LBP has not received much attention as LCF, maybe because it has not been identi ed in all bioluminescent dino agellates, and therefore has not been considered to date as an essential component of this bioluminescence system [ 72 ]. Lingulodinium polyedrum (formerly Gonyaulax polyedra), the rst organism from which this protein was isolated, cloned and fully sequenced, is the main model organism for studies of LBP [73][74][75]. In L. polyedrum, the LBP gene present two different types or isoforms (LBPa and LBPb) which share 86 % sequence identity and encode two protein that are expressed in equal amounts [ 74 , 75 ]. Additionally, each gene type has the typical dino agellate genomic organization and is presented in several non-identical tandem copies. Furthermore, in the L. polyedrum proteome, LBP was found to be very abundant, up to 1 % of the total proteins [ 75 , 76 ].
The LBP is well known in L. polyedrum [ 68 , 77 ], Noctiluca scintillans [ 78 ], and it has also been found in the genera Gonyaulax, Ceratocorys, Protoceratium, and Alexandrium [ 79-82 ]; therefore, these genes seems to be a fundamental component of the molecular bioluminescence system in dino agellates. Molecular studies have demonstrated a high variation among gene copies, revealing a very diverse gene family comprising multiple gene types in some cases [ 72 ]. In fact, Pyrocystis LBP have not been reported in protein extracts screened by a universal antibody for LBP [ 68 , 70 ]. This could not be taken as a conclusive proof of the absence of the LBP protein, given the high degree of variability reported for this gene. Currently is unknown how many of the bioluminescent dino agellate species utilize LBP in their bioluminescence system [ 11 ]; nevertheless, emerging information shows substantial evidence that LBP is an important component of dino agellate bioluminescence system [ 72 ].
It was reported that dino agellates LBP (DinoLBP) primers yielded PCR products from eleven species out of the eighteen tested, demonstrated the present of LBP in all members of the Gonyaulacales with the exception of the genera Ceratium and Fragilidium. It is important to note that, due the differences of LBP between species, the e cient ampli cation of LBP from Ceratocorys horrida, Gonyaulax spinifer, and Alexandrium monilatum only were possible after a reduction in the PCR stringency. Furthermore, the DinoLBP primers ampli ed LBP in all the species known to contain it, except N. scintillans, which present a highly divergent LBP sequence [ 78 ]. Published research have shown the diversity of LBP in several closely related dino agellates, with rst reports in P. reticulatum, C. horrida, and G. spinifera, as well as the rst partial sequences from Alexandrium a ne, A. monilatum, A. tamarense, and A. fundyense [ 72 ]. These results also agree with an absence of LBP in Pyrocystis spp.; however, the lack of detection in N. scintillans suggest that the negative PCR result in P. lunula, Protoperidinium crassipes, Ceratium longipes, and Fragilidium cf. subglobosum, are unlikely to be conclusive. Despite many efforts, the differences in LBP between species has precluded the design of universal primers for LBP, which has previously done in the case of LCF [ 83 ]. LBP sequences have shown to be a very diverse gene family [ 75 , 76 ]. Furthermore, phylogenetic studies on Alexandrium spp. also suggest the presence of more than one LBP type, similar to previous observations with their LCF sequences [ 83 ]. Multiple gene types in genes involved in bioluminescence seem to be common in both LBP and LCF for several species [ 75 , 83 , 84 ], creating more divergence between gene copies of LBP in these genomes.
Studies in N. scintillans alsorevealed that LBP is present in diverse forms, and as the result of important evolutionary events like gene ssion or fusion. N. scintillans is unusual as LCF and LBP are found as two domains in one gene, whereas they are normally separate genes [ 72 ]. It has been suggested that the origin of the hybrid structure LCF/LBP was either by fusion of the two genes of photosynthetic species in N. scintillans, or by ssion of the N. scintillans gene into those species. The hybrid LCF/LBP gene (2,396 bp) reported in N. scintillans [ 72 , 78 ] consist of part of the N-terminal region and the LCF domain followed by the LBP domain. A detailed analysis of our results has revealed a similar situation, being able to detect two different isoforms of LBP, one version that seems to correspond to the individual gene LBP, very similar (query length 422, query cover 95 %, ident. 75 %) to that reported for A. tamarense (GenBank Additionally, in N. scintillans, both fragments present an identical N-terminal region of 108 bp. A study of the two cDNA sequences showed that in the shorter fragment, the N-terminal region led directly into the LBP domain in the absence of an LCF domain. Therefore, the combined LCF/LBP and the single separated LBP, which shared the N-terminal region and LBP domain, but not the LCF domain, corresponded to two different genes. It was also con rmed that N. scintillans did indeed express both of these genes [ 72 ]. It has been speculated that the second LBP in N. scintillans could have a role in binding luciferin in the scintillons or it could act to store it in the cytoplasm [ 78 ]. It has been suggested that if the LCF/LBP of N. scintillans is ancestral to the LCF and LBP of photosynthetic species [ 86 ], the single LBP of N. scintillans could have been formed by mRNA splicing of LCF/LBP and subsequent retro-transposition on the genome [ 66 , 87 ]. The N-terminal region in both genes seems to supports this hypothesis. Having separate LCF and LBP could enable differential regulation of each gene and this can be advantageous if LBP needs to be stoichiometrically proportional to luciferin [ 77 ], but LCF, which has a triple catalytic capacity in photosynthetic species, can be re-used [ 72 ].

Glutathione-S-transferase (GST) and the elusive luciferin
In our results, oxide-reduction processes are invoked as a key component, as evidenced by the GO and KO analyzes [see Additional les 3 and 4]. Oxidative stress is able to induce a transcriptional response. In L. polyedrum, the oxidative stress induced by metal increased the activity of the superoxide dismutase [ 88 ], being dependent on the type of metal, its time of exposure and concentration [ 89 , 90 ]. Furthermore, a study with 3,500 genes from P. lunula showed that up to 200 genes increased in abundance after treatment with sodium nitrite (1 mM) or paraquat (0.5 mM). It is also important to note that antioxidant enzyme GST (GenBank AAN85429.1) [ 9 ], whose presence has been detected in our results [see Additional le 3], also possesses the pfam05295 domain (Luciferase_N). In addition to the pfam05295 domain (N-terminal LCF/LBP), a GST-N-Sigma-like domain is also present, which belongs to the Thioredoxin-like superfamily. These proteins function as disul de oxidoreductases (PDOs), altering the redox state of others proteins through the reversible oxidation of their dithiol active site. The thiol group of cysteine, on the reduced state, is able to donate a reduction equivalent (H+, +e-) to other unstable molecules [ 9 ], such as reactive oxygen species like luciferin.
In P. lunula, the gene of theantioxidant enzyme GST has been showed an additional 255 bp extension at its end [ 9 ], encoding an N-terminal sequence previously reported in LCF and LBP from L. polyedrum [ 73 , 91 ]. The similarity between the N-termini of GST, LCF and LBP is around 45 %, compared with 50 % between L. polyedrum LCF and LBP [ 92 ]; exon recombination was suggested as a plausible explanation for this homology [ 93 ]. Although the function of this N-terminal region is still unknown, the presence of this conserved sequence in GST could indicates that its function is not restricted to bioluminescence and maybe related to others roles, like cellular processing or localization of these proteins. It is also important to note that in P. lunula, unlike others bioluminescent dino agellates, the amounts of luciferin and LCF are constant throughout the day and night [ 70 ]. Therefore, instead of daily de novo synthesis and destruction of all the components of the bioluminescent system, the rhythm involve changes in their intracellular localization [ 8 , 84 ], and it is likely that the GST protein may be involved in this process, however further studies are necessary in order to test this hypothesis.
The luciferin from P. lunula is a tetra pyrrole-type molecule, resembling to chlorophyll a and krill luciferin [ 94 ]. This luciferin presents photo-oxidation and it is extremely labile to 0 2 , at high salt concentration, and low pH [ 95 ]. P. lunula contains luciferin in larger amounts than other dino agellate species, even 100 times more than L. polyedrum [ 94 ], and this luciferin it seems to be universal because LCF from any dino agellate can use it as a substrate to produce light [ 96 ]. Surprisingly, this luciferin could even crossreact with the bioluminescent system of the krill (Euphasia superba) [ 68 , 97 ]. It has been hypothesized that P. lunula luciferin is a photo-oxidation breakdown product of chlorophyll a [ 98 ]. According to this paradigm, Liu and Hastings (2007) [ 78 ] hypothesized that heterotrophic dino agellates acquire luciferin from ingested prey, either directly or from the degradation of chlorophyll provide by the prey. However, the heterotrophic dino agellate P. crassipes can maintain its bioluminescence up to 1 year without the ingestion of chlorophyll or luciferin containing food [ 99 ], and therefore must contain a luciferin originating from a different precursor molecule. The presence of LBP in several photosynthetic species suggests that even within these species, there might be an alternative luciferin molecule that requires LBP for stabilization [ 74 ]. In any case, the hypothesis that the origin of luciferin is photo-oxidized chlorophyll a [ 98 ] would only be plausible for P. lunula, which maintains its luciferin throughout the light-dark cycle. L. polyedrum only contains luciferin at night when photo-oxidation is not possible [ 71 ], so its synthesis cannot be explained by the photo-oxidation mechanism. Therefore, it is likely that more than one mechanism is responsible of luciferin production even in closed related species [ 11 ].
In fact, previous reports has con rmed the intracellular production of luciferin in P. lunula [ 100 ]. In this regard, Fresneau and Arrio (1988) [ 101 ], argued that bioluminescence of P. lunula is controlled by the reduction state of the luciferin precursor. It was reported that that luciferin and its precursor P630, so called by his excitation wavelength peak (630 nm), possess the same peptide moiety. P630 is a chromopeptide more stable than luciferin at low temperature in methanolic solutions, and it is composed by a polypeptide (4.8 kDa) and a linear tetrapyrrole such as luciferin (600 Da). Cations could oxidize P630 or break the bond between the peptidic chain and the extended tetrapyrrole. Reduction of P630 is performed enzymatically by a NAD(P)H-dependent oxidoreductase or chemically by 2-mercaptoethanol or dithiothreitol. Reduced P630 has the same spectral characteristics as the puri ed luciferin, as LCF can oxidize this reduced molecule with a light emission at 480 nm. It is very important to note that puri ed luciferin spontaneously turned partly into P630, in methanol at -20 °C. All these observations suggest that the level of interconversion between P630 and luciferin could be the oxide-reduction equilibrium in this system [ 101 ]. These observations suggest that reduced P630 is a luciferin molecule, and the oxidized form seems, in these conditions, to be the precursor of luciferin [ 102 ]. According to Fresneau and Arrio (1988) [ 101 ], the bioluminescence is a complex mechanism controlled by at least two successive reactions. The rst is the reduction of the luciferin precursor P630 by a NAD(P)H-dependent reductase [ 102 ], and the second is the LCF-luciferin reaction which generates the light emission. Since this molecule is reversibly reduced, it may be considered as an interchange point of reducing power involving a currently unknown electron transfer pathway including NAD(P)H. P630 seems to be a fundamental component involved in complex light-modulated reactions which are widespread in plants [ 103 ]. Fresneau and Arrio (1988) [ 101 ] also suggest that a recycling process of the luciferin would lead to the impossibility of reaching a level of zero light intensity with homogenates or with mixtures of the puri ed enzymes and substrates. In such a cyclic process, we would expect a steady state in the light emission, as in re y extracts where ADP is converted into ATP by a kinase [ 104 ]. According with these authors [ 101 ], the bioluminescence could be considered as a metabolic process related to the regulation of excess intracellular reducing power produced through the photosynthesis and respiration. This assumption is consistent with the evidence of a speci c chlororespiration reported by Bennoun (1982) [ 105 ] for higher plant chloroplasts and other observations in cyanobacteria [ 106 , 107 ] of an alternative pathway of respiration in photosynthetic thylakoid membranes [ 108 ].
Furthermore, studies focused on the NADPH-dependent detoxi cation of reactive carbonyls, indicates that eukaryotic cells use different systems to deal with the harmful effects of this reactive molecules. The GSTs pathway, which is fueled by glutathione (GSH) or thioredoxin redox cycle, conjugates aldehydes with GSH, supplying an important mechanism of detoxi cation. In animals, the main detoxi cation pathway is GSH-dependent [ 109 ], and a GSH-dependent detoxi cation system is functional in plants as well. The GST proteins constitute a large group; some of these are localized in the chloroplasts and mitochondria. Because chloroplasts contain a milimolar order of GSH, the GSH-dependent detoxi cation system is seems to be very effective [ 110 ].
GSTs comprise a large family of eukaryotic and prokaryotic phase II metabolic isozymes that catalyze the conjugation of the reduced form of GSH to xenobiotic substrates for detoxi cation [ 111 , 112 ]. Could GST be the enzyme NAD(P)H-dependent reductase that controls the state of reduction of P630?. The enzyme GST is also associated with cytochrome P450 [see Additional le 3]. The presence of this GST with a characteristic domain pfam05295, which is part of the cytochrome P450 metabolic cycle, could indicate that the GST enzyme is involved in the synthesis process and/or storage of luciferin through an electron transfer system unknown until now. Also striking is the absence of cox1 (cytochrome c oxidase subunit I) and cox3 (cytochrome c oxidase subunit III) in our results(see section above mitochondrial and plastid genome), which could also be associated with this unknown electron transfer pathway. These are questions that open new lines of investigation in relation to the function of GST and the process of synthesis of luciferin in these organisms. The domain pfam05295 (Luciferase_N) is the common thread between LCF-LBP-GST in P. lunula, since all of these proteins contain it.

Nucleosome
The chromosomes of dino agellates, having a liquid crystalline structure [ 113 ] with bivalent cations acting as the stabilization of the matrix [ 114 ], are condensed permanently at all stages of the cell cycle.
It was suggested that this DNA could play a role in the organization of the chromosome, maybe by a relation with a matrix protein [ 28 , 115 ]. A large proportion of this transcriptionally inactive DNA seems to be repeated sequences, and it is possible that this could contribute to the organization of the genome [ 82 ]. The matrix of the nucleus is a network of bers that also plays an important role in the organization of the chromatin. Furthermore, studies in Amphidinium carterae showed the presence of two matrix proteins, topoisomerase II and lamins, similar to what is found in higher eukaryotes [ 116 ]. In our case, it was possible to detect the presence of topoisomerase II and III (3-alpha and 3-beta-1, respectively) [see Additional le 3]. Therefore, although this chromatin is organized differently than in other eukaryotes, its nuclear matrix is conserved [ 117 ].
A sequence-speci c DNA binding protein is the dino agellate nuclear associated protein (Dinap1) [ 118 ].Besides Dinap1, homologue of the Tubulin-like protein [ 119 ], a group transcription factors (membranetethered), related to signaling pathways [ 120 ], were found in Alexandrium. In our case, it was not possible to detect the presence of Dinap1, Dip1, DapB, DapC or DapG. However, the presence of the DapA (DapA-like;Aldolase-type TIM barrel) and tubulin-like proteins (Tubulin alpha chain-like) are noteworthy [see Additional le 3]. A group of histone-like proteins (HLPs) [ 121 ] were rst found in C. cohnii [ 122 , 123 ]. Nowadays, the presence of these proteins has been reported in other dino agellates [ 124 ], including our case, where HLPs were detected [see Additional le 3]. In the past, it was thought that dino agellates lacked histone proteins [ 125 ] and instead used HLPs for DNA organization [ 126 ]; nevertheless, recent studies have showed the presence of four core nucleosomal histones, H2A, H2B, H3, and H4 [ 127 , 128 ]. It is very important to note that three of the four histones that make up the core nucleosome in dino agellates have been found in our results (H2A, H3 and H4), and also two other Dino agellates genes contain very few or lack introns completely [ 73 , 93 , 139 ]. However, in P. lunula, the LCFc gene showed a 403 bp intron [ 84 ]. Additionally, studies on 17 dino agellates species, reported introns in only three: Peridinium willei, Polarella glacialis and Thecadiniium yashimaense [ 117 , 140 ]. Nevertheless, it is evident that spliced leader (SL) trans-splicing is ubiquitous in dino agellates [141][142][143]. In this process, a unique and a highly conserved spliced leader sequence is transplanted to the 5′ end of mRNA molecules. The RNA that donates the spliced leader was found to be mostly < 50 bp [ 142 , 144 , 145 ]. SL trans-splicing acts to transform polycistronic to monocistronic mRNA [ 117 , 141 ].  Dino agellate nuclear DNA is widely methylated. The genes associated with the regulation of methylation are unclear but some potential candidates have been reported. The S-adenosylmethionine synthetase (SAM), which has been detected in our study [see Additional le 3], has also been recognize in other dino agellates in the context of saxitoxin synthesis which requires methyl transfer [ 152 ]. In addition, SAM itself is methylated and the inhibition of its methylation produce the arrest on cell cycle [ 153 ].
Other components related to transcriptional regulation found in our data were: i) ruvB-like 2, RNA-binding protein (RBP), ii) RRM (polyadenylate-binding protein 1, splicing factor U2AF family SnRNP auxiliary factor large subunit, iii) RRM domain-containing protein, and putative RRM domain and KH domain -SPAC30D11.14-like KH-protein), iv) Pumilio (Pumilio homolog 2 isoform X1 and Pumilio RNA-binding repeat), v) S1 (30S ribosomal protein S1), vi) KH (Tudor and KH domain-containing protein), vii) SAP (Heterogeneous nuclear ribonucleoprotein U), viii) LSM (LSM domain, eukaryotic/archaea-type), ix) ATPdependent RNA helicase (OB NTP-binding), and x) La (Lupus La protein). RNA is attached to a number of RBP ribonucleoprotein complexes that in uence splicing, transport to the cytoplasm, translation, stability and subcellular localization [ 154 ]. Recently, has been reported a list of RBP containing different RNA binding domains on mammalian [ 155 ], and when these were used to compared with different dino agellate species, four RBPs were found to have the greatest level of sequence similarity: Pumilio, S1, the OB NTP fold, and RRM. All these domains have been detected in our results [see Additional le 3].
In general, the studies have revealed that although dino agellates make use of some transcriptional regulation, post-transcriptional control is the dominant process of the regulation of gene expression in these organisms [ 30 ].

Translational regulation
The basic mechanism of protein synthesis can be divided into four stages: i) initiation, ii) elongation, iii) termination, and iv) recycling. Although a modulation in the translation can take place on all the stages, the kinetics studies have showed that the fundamental regulatory step is the initiation [ 156 , 157 ]. The initiation alludes to the disposition of ribosomes (translation-competent), where the initiator tRNA (Met-tRNAi) in the ribosomal P-site is base-paired to the mRNA start codon. This comprise orthologs to the bacterial/eukaryotic factors IF1/eIF1A and IF2/eIF5B, which are widely conserved in all the domains of life [ 30 ]. In our results, we highlight the presence of eIF3F, translation initiation factor 1A (eIF1A), IF1, IF2, eIF4A, eIF4E, eIF4G, putative translation initiation factor eIF5A, and translation elongation factors (Tu, Ts, 1-alpha, 1-beta, 2, 3, P, and G) [see Additional le 3].
Additionally, a study of dino agellates transcriptomic datasets and environmental cDNA, revealed the presence of 79 ribosomal proteins that frequently are present in eukaryotes [ 145 ], some of which have been reported to be highly represented in dino agellates transcriptomes [ 31 ]. Currently, little is known about dino agellate ribosomes. Nevertheless, it was reported that Lingulodinium has ribosomal proteins comparable to those of higher plant and mammalian [ 28 , 29 , 130 ]. In our case, a plenty of ribosomal proteins were detected [see Additional le 8]. The importance of translational control is emphasize in L. polyedra by the scarce agreement between the quantity of a protein and the level of its transcript [ 158 ].

Mitochondrial and plastid genome
Dino agellates have an extremely reduced mitochondrial genome, which only contain three proteins: i) cob (cytochrome b), ii) cox1 (cytochrome c oxidase subunit I), and iii) cox3 (cytochrome c oxidase subunit III), and two fragmented rRNAs. Cox2 has apparently being relocated to the nuclear genome [ 31 ].
In the case of the proteome of P. lunula, evidence of the presence of the cob protein was found, but not of cox1 or cox3 [see Additional le 3]. In dino agellates, the mitochondrial genome is also extremely duplicated and recombined. All the genes are present in multiple copies, generally interrupted by fragments of other genes [ 159 ]. The stop codon TGA requires either its diversion or RNA editing (for example to TGG) to codons like serine or tryptophan [ 160 ]. These genes are transcribed, some of which are polyadenylated [ 161 ], and requires trans-splicing to be translatable [ 31 , 159 ].
In dino agellate containing peridinin plastids, the genome is separated into plasmid-like mini-circles, in most of the cases each containing 1-2 genes. The extension of the mini-circles is different between species. Nevertheless, all the mini-circles share a conserved core in the non-coding region that could compress transcription initiation signals [ 162 ]. The genes encoded in the plastid genome are linked to photosynthesis, including the core subunits of the photosystem, cytochrome b6f, and ATP synthase complex (atpA, atpB, petB, petD, psaA, psaB, psbA-E, psbI) as well as four other genes (ycf16, ycf24, rpl28, and rpl23) [ 31 ]. In the case of the proteome of P. lunula, evidence of expression of the proteins ATP synthase complex (petB, psaA, psaB) and cytochrome b6f was found. It is noteworthy the presence in our results of important proteins associated with plastids such as: i) ferredoxin, ii) TPT (Phosphotransferase KptA / Tpt1), iii) light harvesting protein, and iv) photosystem II (D2 subunit) [see Additional le 3]. The light harvesting protein and the photosystem II subunits seems to be related with the stabilization and protection of the photosystem, while ferredoxin and TPT play an important role in the process of exporting the products of photosynthesis from the plastid [ 31 , 163 ]. In addition, other proteins present in our results, related to plastids, were detected: coproporphyrinogen oxidase isoform 1, glyceraldehyde 3phosphate dehydrogenase, fructose 1,6-bisphosphatase, glutamate 1-semialdehyde 2,1aminotransferase, transketolase, NAP50, fructose 1,6 -bisphosphatase isoform 2, and uroporphyrinogen decarboxylase isoform 1 [see Additional le 3]. Studies carried out on non-photosynthetic dino agellates (Noctiluca, Oxyrrhis, and Dinophysis) revealed that contain cryptic plastid metabolisms and lack alternative cytosolic pathways, suggesting that all free-living dino agellates are metabolically dependent on plastids. These nding led to a hypothesis about the dependency on plastid organelles in eukaryotes that have lost photosynthesis. Furthermore, it is also been proposed that the evolutionary origin of bioluminescence in non-photosynthetic dino agellates may be linked to plastid tetrapyrrole biosynthesis [ 12 ]. However, further studies are necessary to test this hypothesis.

Proton-pump rhodopsin
The discovery of the bacterio-rhodopsin, which contains all-trans retinal as a chromophore, change the notion that biological conversion of light energy to ATP is only possible through photosynthesis [ 164 ]. In dino agellates, this rhodopsin was rst detected in P. lunula, in which was reported to be expressed more actively at early light phase than at early dark phase, suggesting its implication in circadian photoreception and phase shifting. This gene has been also reported in other dino agellates (O. marina, Polarella antartica, K. vene cum, A. cantenella, P. lunula) as well as in natural dino agellate assemblage. These results suggest that rhodopsin is ubiquitous in dino agellates [ 145 , 165 , 166 ]. In fact, in our transcriptomic data (>TRINITY_DN107399_c0_g1_i6 len=1,117) [see Additional le 2] was also detect the presence of rhodopsin (GenBank: AAO14677.1). Furthermore, this rhodopsin, closely related to xanthorhodopsin, is unique in the sense that harvests solar energy with carotenoid molecules as the antenna [ 167 ]. Even more, the dino agellate rhodopsin sequences contain the characteristic proton donor site. All these facts lead the proposal that dino agellate rhodopsin is a proton-pumping, instead of a sensory rhodopsin. On mixotrophic species such as K. vene cum, it is remarkable that light stimulate heterotrophic growth [ 168 ], where rhodopsin-generated energy may be used to improve ingestion and digestion of preys [ 145 ].

Conclusions
The major contribution of the present work is making available a reference transcriptome and proteome of P. lunula, that is now accessible for the research community, and a functional description of the 3,182 proteins identi ed by proteomics and transcriptomic, including 175 novel proteins, which have already been deposited in the ProteomeXchange and NCBI SRA databases, respectively. In addition to this, a series of important factors related to the regulation of gene expression were identi ed. This annotated transcriptome and proteome should help to accelerate functional genomics in this species and perhaps others commercially and environmentally important microalgae, while full genome sequencing projects are in progress. The transcriptomic and proteomic data should not be seen as a substitute for the sequencing of the genome. These approaches have different strengths and weaknesses and should be considered as complementary and not mutually exclusive. However, until more genomes are available, the omic approaches could provide a valuable reference point. The presence of the LBP protein stands out in our results, which previously had been considered absent in the bioluminescent system of the Pyrocystis genus. Furthermore, the presence of two isoforms of this gene in P. lunula, similar to the reports in other dino agellates species, highlights the importance of this protein in the bioluminescence system of P. lunula. It is also important to highlight the presence of the GST protein, which could be involved in the process of synthesis and storage (controlling the oxidative degradation) of luciferin in P. lunula.

Materials And Methods
Due to the lack of genomic information on P. lunula, a re ned reference transcriptome was generated through massive sequencing and whose assembly was made through the use of the three main bioinformatics programs currently used for this purpose (RAY, Trinity and MIRA). In parallel, puri cation of total protein extracts, subsequently analyzed by LC-MS, was carried out. Then, to assess the quality of different assemblies obtained, a multiple proteomic analysis was carried out, using the inferred proteomes from all the P. lunula assembled transcriptomes as searched database. Consequently, the best suited assembly was selected and re ned to nally produce a comprehensive P. lunula reference proteome. A ne-tuned proteomic analysis was carried out against this last, yielding very relevant information on the P. lunula proteome composition. Furthermore, a functional analysis of the identi ed proteins based on gene ontology terms (GO), enzyme codes and pathways (KEGG) associations, was made to obtain a broad picture of the cell functions represented by the molecular components here analyzed. Finally, a process of curing information and data mining was carried out to highlight some of the most relevant molecular components identi ed within the P. lunula proteome. Isolation of total RNA ARN was extracted using the NucleoSpin RNA® kit (Macherey-Nagel), following the indications of the manufacturer, and quanti ed by UV spectrophotometry, as well as by electrophoresis in agarose gels. The quality of the isolated RNA was assessed using the 2100 Bioanalyzer System (Agilent Technologies), with the RNA 6000 Nano® kit, following the indications of the manufacturer. Only high-quality RNA (RIN = 8.3 and A260:A280 ratio near 2.0) was used for subsequent experiments.

Transcriptome sequencing
The cDNA library was built using 2 ug of total RNA, in an Ion Chef System (Thermo Fisher Scienti c) and the Ion Total RNA-Seq Kit v2 for whole transcriptome libraries (Life Technologies Corporation, California, USA), according to manufacturer instructions. The next generation sequencing was carried out using the semiconductor technology through an Ion S5 System (Thermo Fisher Scienti c) using an Ion 540 sequencing chip.
De novo assembly and analysis of high throughput RNA sequencing data The raw reads obtained from the sequencing platform were pre-processed in order to retain only highquality sequences to be subsequently used for the assemblies (Cutadapt version 1.9, BBDuk version 35.43). Sequencing adapters were rst clipped, and low-quality bases (with phred score below a threshold) were trimmed in raw sequences. Two different phred score values were selected as thresholds (15 and 20), producing two sequence datasets. Then, these datasets were further processed into two each by ltering out reads shorter than 30 or 100 nt, for a total of four sequence datasets labeled according to the quality and length lters they passed (Q15 L30, Q15 L100, Q20 L30 and Q20 L100).
Reads quality assessment was carried out using FastQC software (version 0.11.3) to evaluate the effect of every step of this process. All subsequent analyses with RAY software were conducted using these four high quality datasets to evaluate the effect of library size and quality threshold in the tested assembly algorithm throughput. The remaining ribosomal RNA was detected by SortMeRNA software (version 2.0). Thus, adaptors clipped reads were mapped to Sortmerna prepackaged databases id98 (from Silva v.119, and Rfam) with default parameters. Three different softwares were employed to assemble the P. lunula transcriptome through a de novo approach, considering there is no reference genome available to date, which were further assessed to contrast the obtained assemblies by a Proteomic analysis. The four pre-processed sequence datasets described above were assembled into contigs using 5 different parametrizations of kmer ( Additionally, in order to generate more con dence in the assemblies, due to the lack of a reference genome, Trinity and Mira software were also used to generate de novo assemblies. Because the RAY assembler results were best with the Q20 L100 sequence dataset, and due to the tremendous computational resources required by Trinity and Mira with the increase of the input reads le, only that Functional annotation of the gene ontology (GO) terms, Enzyme codes, InterPro and KEGG pathways associations was done using the BLAST2GO program [176,177]. Associated Enzyme codes were mapped to reference KEGG pathways (KO) to better illustrate representative pathway components identi ed in P. lunula (KEGG mapper tool).

Veri cation of the presence of LBP and LCF/LBP by PCR
The puri cation of the total RNA was carried out following the indications of the commercial kit NucleoSpin RNA® (Macherey-Nagel). The synthesis of cDNA and ampli cation of the fragments by RT-PCR was carried out using qScript® kit ( File name: Additional le 2.
File format: .pdf Title of data: Assembly generated with Trinity software (Trinity_Q20 L100_k25_NC20).
Description of data: Transcriptome assembly that best suited to the identi cation pro les obtained by MS analysis.
File name: Additional le 3.
File format: .xlsx Title of data: Final result of the proteome annotation.
Description of data: File that includes the result of all identi cation rounds through BLAST2GO.
File name: Additional le 4.
File format: .pdf Title of data: KEGG analysis.
Description of data: All the enzymatic codes and pathways identi ed in the proteome of P. lunula.
File name: Additional le 5.
File format: .xlsx Title of data: Novel proteins in P. lunula.
Description of data: Sequences that do not have any homology with any protein identi ed so far in the different databases.
File name: Additional le 6.
File format: .docx Title of data: LBP sequence.
Description of data: Sequencing analyses of the LBP fragment obtain by PCR.
File name: Additional le 7.
File format: .docx Title of data: LCF/LBP sequence.
Description of data: Sequencing analyses of the LCF/LBP fragment obtain by PCR.
File name: Additional le 8.
File format: .xlsx Title of data: Ribosomal proteins detected in the proteome of P. lunula.
Description of data: All the ribosome related proteins identi ed in the proteome of P. lunula according to BLAST2GO. Table   Table 1: Spliced leader (SL) analysis.

Figure 1
Structural analysis of 20 assemblies generated by RAY Figure 2 Structural analysis of 10 assemblies generated by all three algorithms (RAY, Trinity, MIRA) from the Q20 L100 dataset.

Figure 3
Analysis of representativeness by sequence similarity for Trinity_Q20 L100_k25_NC20 against all 25 assemblies and two more public datasets Figure 4 Page 27/31 Distribution of the predicted and identi ed proteins according to their MW and IP calculated by the R library Peptides.

Figure 5
Functional description according to biological process (BP) classi cation.

Figure 6
Functional description according to molecular function (MF) classi cation.

Figure 7
Top-hit species identi cation according to BLAST2GO.