De novo genome assembly and annotation
To assemble unfragmented and non-redundant genomes of C. variabilis, C. vulgaris, and C. sorokiniana, we adopted a hybrid assembly technique using short- and long-read data. The read data of C. variabilis and C. sorokiniana were acquired from the public database. We sequenced the genomic DNA of C. vulgaris using PacBio RS II and Hiseq X Ten. RS II produced approximately 2.6 Gbp long-read data with an average length of 6.7 kbp, and HiSeq X produced 7.5 Gbp short-read data (Table S1). We assembled the hybrid data of the three species and evaluated the degree of fragmentation of these assemblies using indexes such as the number of contigs, the largest contig size, and N50. In the assembly of C. vulgaris, the number of contigs decreased from 237 to 91, the largest contig size increased from approximately 1.7 to 4.1 Mbp, and N50 increased from approximately 0.6 to 1.7 Mbp in the assembly using hybrid data, compared with the assembly using only long-read data. To connect the contigs and reduce the number of redundant contigs, we scaffolded each assembly of the three species and evaluated the efficiency of assembly by measuring the preservation rate of Chlorophyta universal single-copy orthologs using BUSCO (Seppey et al. 2019). The preservation rates of C. variabilis, C. vulgaris, and C. sorokiniana were 93.3%, 93.1%, and 92.8%, respectively, indicating that these genomes correctly reflect the original gene repertoire of each species. All proportions of redundantly predicted single-copy orthologs were less than 1% for the three species, indicating that these genomes correctly reflected the copy number of each gene (Table S2). Assembly of the genomes showed that the genome of C. variabilis is about 6 Mbp larger than that of C. vulgaris.
We predicted the gene structures—exons, introns, and repeat sequences—from the assembled genomes of three species and added their functional annotations. To obtain accurate gene structures, we predicted the structures based on expression data from RNA-seq. The average lengths of Chlorophyta universal single-copy orthologs obtained from the genomes of the three species were more than 3 kbp, the number of exons was 8, and the preservation rate was 93%, showing that the structure of each gene was accurately predicted (Table S3). We found that C. variabilis carried 1,475 more genes than C. vulgaris. These predicted genes were functionally annotated using the UniProtKB, Arabidopsis thaliana proteome, NCBI non-redundant (nr), and InterProScan databases for gene names, GSEA, contamination check, and domain search, respectively. In the homology search against the nr database, the proportions of genes originating from bacteria and viruses were less than 1%, showing that the genomes of the three species were free from contamination (Fig. S1). Next, we identified repeat sequences in the genomes of the three species using a de novo approach. We found that C. variabilis possessed 5% more repeats than the others (Table S4). The genome of C. variabilis was 6 Mbp larger than that of C. vulgaris, probably due to the presence of repeat sequences. The total length of the repeat sequences in the C. variabilis genome was 3 Mbp larger than that of C. vulgaris. C. variabilis possessed 1,475 more genes than C. vulgaris, and multiplying 1,475 by the average length of the C. variabilis gene region (2,954) yields about 4.5 Mbp. The statistics of the three genome datasets generated are summarized in Table 1.
Table 1
Summary of genome datasets of C. variabilis, C. vulgaris, and C. sorokiniana.
Species
|
C. variabilis
|
C. vulgaris
|
C. sorokiniana
|
Lifestyle
|
Symbiosis
|
Free-living
|
Free-living
|
Sequencer
|
illumina & Nanopore
|
illumina & PacBio
|
illumina & PacBio
|
Genome size (bp)
|
44.98 M
|
38.97 M
|
58.12 M
|
GC (%)
|
67.09
|
61.50
|
64.08
|
Genes
|
11,634
|
10,159
|
13,282
|
Repeat (%)
|
14.84
|
9.47
|
9.55
|
Phylogeny and molecular evolution of C. variabilis
To confirm a phylogenetic relationship between C. variabilis and C. vulgaris, we constructed a tree using 449 single-copy orthologs from 14 Trebouxiophyceae species including Coccomyxa subellipsoidea as the outgroup. The tree generated indicated that C. vulgaris is most closely related to C. variabilis (Fig. 1A). The terminal branch lengths of C. variabilis and C. vulgaris were 0.077 and 0.123, respectively, indicating that the mutation rate of C. variabilis was lower than that of C. vulgaris. The Ds value of C. variabilis, representing the number of mutations per synonymous site, was lower than that of C. vulgaris. The two species had similar Dn values (the number of mutation per non-synonymous site) and Dn/Ds ratios, so positively selected genes were not detected using the likelihood ratio test (Fig. 1B).
Comparative genomics of C. variabilis and C. vulgaris
To analyze the synteny between the three species at the whole-genome level, we produced dot plots. In comparisons between C. variabilis and C. vulgaris, each dot representing a homologous match between the two sequences was located on the diagonal more closely than the other comparison. These results showed that syntenies between C. variabilis and C. vulgaris are conserved throughout their genomes without large-scale insertions, deletions, inversions, or duplications (Fig. 1B). This finding also indicates that C. variabilis and C. vulgaris are closely related.
To identify the genetic features of C. variabilis, we compared the profiles of their GO terms and inferred orthologous gene groups among the three species. GO terms for each gene were derived from domain annotation, and these results were summarized in a bar chart. In the bar chart, most GO terms had the same patterns, but the “extracellular region part” term was presented only in C. variabilis (Fig. 2A). In ortholog analysis, we identified OGs, which are defined as groups of genes descended from a single ancestral gene (Emms & Kelly 2019). A total of 8,931 OGs were identified, containing 85.9% of all predicted genes among 3 species, and the remaining genes were not assigned to any OGs and were hence designated as singletons. These OGs were classified by conserved patterns among the three species and visualized in a Venn diagram (Fig. 2B). In the Venn diagram, 7,412 OGs containing 83.0% of all identified OGs were conserved among the 3 species, suggesting that the species are closely related. We then focused on three groups composed of OGs and singletons: C. variabilis-specific lost, gained, and duplicated genes.
The C. variabilis-specific lost genes were contained in the 287 OGs conserved only in C. vulgaris and C. sorokiniana (Fig. 2B). To estimate the functions included in these OGs, we conducted GSEA, which identifies significantly enriched biological terms using GO terms, resulting in the identification of “cellular response to reactive oxygen species (ROS)” being the most enriched (Fig. S2A). Superoxide dismutase (SOD, OG0002817), which catalyzes the dismutation of superoxide radicals into oxygen and hydrogen peroxide; monodehydroascorbate reductase (MDR, OG0008335), which catalyzes the conversion of monodehydroascorbate to ascorbate of major antioxidant; and peptide-methionine (R)-S-oxide reductase (OG0008244), which protects against oxidation of proteins, were lost (Table S5). The C. variabilis-specific gained genes were contained in 10 OGs with 39 genes and 1,426 singletons conserved only in C. variabilis (Fig. 2B). In these OGs, “protein phosphatase 1 binding” was the most enriched term, as identified using GSEA (Fig. S2B). The C. variabilis-specific duplicated genes were contained in 297 OGs with 978 genes extracted from the 3 categories of OGs in the Venn diagrams: 7,412 OGs with 8,655 genes conserved among 3 species, 394 OGs with 510 genes conserved only in C. variabilis and C. vulgaris, and 805 OGs with 1,004 genes conserved only in C. variabilis and C. sorokiniana (Fig. 2C). In these OGs, “calmodulin-dependent protein kinase activity” was the most enriched, as identified using GSEA (Fig. S2C). Among C. variabilis-specific gained and duplicated genes, we analyzed the transcriptome under symbiotic conditions to identify the genes involved in the symbiotic lifestyle.
Transcriptome dynamics of C. variabilis under symbiosis
To identify DEGs under symbiotic conditions, we compared RNA-seq data derived from the three conditions illustrated in Fig. S3, namely, C. variabilis, cultured under non-symbiotic conditions without host P. bursaria (the free-living condition; FL) as a control, cultured under symbiotic conditions in the P. bursaria log phase (the symbiotic and log phase condition; SL), and cultured under symbiotic conditions in the P. bursaria stationary phase (the symbiotic and stationary phase condition; SS). More than 98% of reads remained after filtering, and more than 86% of reads mapped to the genome, showing that the sequencing process was effective, and six samples had no contamination. These mapped reads were counted by gene, and after filtering and normalization, we extracted 4,401 significantly DEGs from the pairwise comparison among the 3 conditions, using the generalized likelihood ratio test.
We calculated the relative expression values for each DEG, performed hierarchical clustering based on the expression patterns of each gene, and visualized these results with a heat map and dendrogram (Fig. 3A). The 4,401 DEGs were classified into four Clusters with similar expression patterns (Fig. 3B). Under symbiotic conditions, 2,085 genes were downregulated (Cluster 1) and 1,038 genes were upregulated (Cluster 2). In the SL condition, 1,114 genes were upregulated (Cluster 3), and in SS conditions, 164 genes were upregulated (Cluster 4). Cluster 1 included genes involved in cellular quality control of DNA and proteins, with GO terms such as “cellular response to DNA damage stimulus,” “damaged DNA binding,” and “proteolysis” being strongly enriched in GSEA (Fig. S4A). Cluster 2 included genes related to rRNA and tRNA, with GO terms “ncRNA metabolic process” identified by GSEA (Fig. S4B), suggesting activation of protein synthesis. Cluster 3 included genes linked to GO terms such as “cell cycle,” “DNA replication,” and “chromosome” (Fig. S4C), showing that C. variabilis proliferates in synchronization with host cell division, a finding that is consistent with previous reports. The genes localized in the chloroplast were contained in Clusters 2, 3, and 4 (Fig. S4B, C, and D). Genes related to chlorophyl belonged to Cluster 4 (Fig. S4D), showing the specific activation of photosynthesis under symbiotic conditions.
Multi-omics analysis of C. variabilis
To identify genes playing a crucial role under symbiotic conditions, we combined the results of ortholog and transcriptome analysis and listed them in the supplementary table. C. variabilis-specific gained or duplicated OGs and DEGs were extracted. Among the C. variabilis-specific 1,465 genes gained, 468 DEGs were included, composed of 237, 106, 102, and 23 DEGs belonging to Clusters 1, 2, 3, and 4, respectively. Only 95 of the 468 DEGs were annotated against a gene in the BLAST search of UniProtKB, including chitosanase (OG0011577; g11572), which degrades the chitosan constituting the cell wall of Chlorella, included in Cluster 2, chloride channel (OG0011693; g572) included in Cluster 3, and permease (OG0012137; g2843), a membrane transport protein included in Cluster 4 (Table S6). Unannotated OG0000793 was composed of five genes, and this OG was specifically gained and duplicated in C. variabilis. OG0000793 contained genes with different expression patterns, with three genes belonging to Clusters 2, 3, and 4, respectively (g5461, g6542, and g11141), and the other two genes (g3072 and g3111) were not DEGs, suggesting that they play different roles, or subfunctionalization (Table S6).
Among the C. variabilis-specific 978 duplicated genes in the 297 OGs, 406 DEGs of 204 OGs were included, composed of 203 genes from 130 OGs, 81 from 68 OGs, 102 from 68 OGs, and 19 DEGs from 16 OGs belonging to Clusters 1, 2, 3, and 4, respectively. We focused on 152 OGs including 202 DEGs belonging to Clusters 2, 3, and 4, because the expression of these genes was upregulated under symbiotic conditions, indicating a high likelihood of involvement in symbiosis. In the 152 OGs, the genes were related to cell wall biogenesis or degradation and metabolic exchange with the host. The former included alpha-galactosidase (OG0000160), arabinosyltransferase (OG0000063), chitosanase (OG0000466), and beta-glucan synthesis-associated protein (OG0000291). The latter facilitates mutual transport of certain metabolites between C. variabilis and P. bursaria, including efflux transmembrane transporter (OG0000073) transferring a specific substance from the inside of the cell to the outside, proton/sulfate cotransporter (OG0000123), sugar-phosphate/phosphate translocator (OG0001113), transporter of a specific amino acid (OG0001180), ammonium transporter (OG0000350), and glutamate dehydrogenase (OG0001044), which assimilates ammonium to glutamate.
Symbiosis gene network and pathways
To investigate the pathways participating in symbiosis, we automatically extracted the genes corresponding to each component constituting pathways from integrated data of ortholog and transcriptome analysis (supplementary figures 1) and plotted their expression patterns on the map from the KEGG Pathway database. These genes were manually curated and are summarized in Table S7. In the pathway “porphyrin and chlorophyl metabolism,” the expression of genes encoding 16 enzymes constituting the biosynthetic pathway of chlorophyl a and b from glutamate tended to increase in SS-specific conditions (Fig. 4A). Among 30 genes corresponding to 16 enzymatic reactions, 16 genes were DEGs, including 9, 2, and 5 genes belonging to Clusters 2, 3, and 4, respectively. OG0000400 and OG0000788, corresponding to coproporphyrinogen III oxidase (1.3.3.3) and magnesium-protoporphyrin IX monomethyl ester (oxidative) cyclase (1.14.13.81), respectively, were C. variabilis-specific duplicated and contained DEGs belonging to Clusters 2 and 3. In addition to chlorophyl biosynthesis, all genes encoding chlorophyl-binding subunits of LHC, which enables efficient absorption of light energy, were highly expressed in SS-specific conditions (Fig. 4B). These results suggest that the increase in chlorophyl and LHC content enhances light absorption and conversion into energy, which then promotes photosynthesis.
In the pathway “carbon fixation in photosynthetic organisms,” the gene expression of 13 enzymes constituting the Calvin-Benson cycle tended to increase under both SL and SS conditions, compared with FL conditions (Fig. 4C). Among 25 genes of 20 OGs corresponding to 13 enzymatic reactions, 17 genes were DEGs, including 2, 12, and 3 genes belonging to Clusters 1, 2, and 3, respectively. The two copies of the ribulose-bisphosphate carboxylase, or Rubisco (4.1.1.39), gene were both DEGs in Cluster 2. The upregulation of each gene in the Calvin cycle, such as Rubisco, leads to the synthesis of a large number of corresponding enzymes, suggesting enhancement of carbon fixation, which then promotes photosynthesis. In “starch and sucrose metabolism,” the expression patterns of the genes encoding eight enzymes constituting the biosynthetic pathway of maltose from D-fructose-6P tended to increase in SL-specific conditions (Fig. 4D). Among 21 genes of 15 OGs corresponding to 8 enzymatic reactions, 13 genes were DEGs, including the 2, 3, and 8 genes belonging to Clusters 1, 2, and 3, respectively. The genes encoding endoglucanase (3.2.1.4) and beta-glucosidase (3.2.1.4) were DEGs in Cluster 3, suggesting acceleration of the degradation from cellulose to glucose under SL-specific conditions. The gene encoding glucose-6-phosphate isomerase (5.3.1.9) was a DEG in Cluster 2, indicating that maltose is produced from D-fructose-6P in the Calvin cycle via starch. The enzymes degrading starch into maltose include alpha-amylase (3.2.1.1), beta-amylase (3.2.1.2), and isoamylase (3.2.1.68). All the genes of alpha-amylase were DEGs and may play a particularly crucial role in supplying maltose to P. bursaria. The gene encoding alpha-glucosidase (3.2.1.20) degrading maltose is a DEG in Cluster 1, indicating the suppression of maltose degradation under endosymbiosis.