Multi-omics Analysis of the Symbiotic Green Algae, Chlorella Variabilis, Revealing the Genetic Basis of the Obligate Endosymbiotic Lifestyle

Ryuhei Minei Nagahama Institute of Bio-Science and Technology Ryo Hoshina Nagahama Institute of Bio-Science and Technology Rina Higuchi Kobe University Lin Chen Kobe University Yuki Akizuki Nagahama Institute of Bio-Science and Technology Yasunobu Terabayashi Takara Bio Inc Satoshi Kira Takara Bio Inc Masanari Kitagawa Takara Bio Inc Toshinobu Suzaki Kobe University Atsushi Ogura (  aogu@whelix.info ) Nagahama Institute of Bio-Science and Technology


Introduction
Endosymbiosis is a driving force of evolution, facilitating the diversi cation of organisms. Mitochondria and plastids originated from endosymbiotic alpha-proteobacteria and cyanobacteria, respectively (Gray 1999). Secondary endosymbiosis leads to the diversi cation of the algal lineage. Plastid acquisition originated from endosymbiosis with cyanobacteria, a primary endosymbiosis whose descendants form the monophyletic supergroup Archaeplastida, composed of three lineages: green algae and land plants, red algae, and glaucophytes (Adl et al. 2019). Green and red algae have been repeatedly taken up and integrated as plastids by other eukaryotes, a process known as secondary endosymbiosis, whereas primary endosymbiosis seldom occurs. Besides Archaeplastida, secondary endosymbiosis has led to the establishment of broad diversi ed algal lineages: the Chlorarachniophyta and Euglenozoa descended from green algae and the Alveolata, Crypthphyta, Haptophyta, and Stramenopile from red algae (Ponce-Toledo et al. 2019). Inouye and Okamoto (2005) have proposed that secondary endosymbiosis requires many steps for plastid establishment. Previously, several organisms at various intermediate stages from endosymbiosis to plastid have been reported and broadly classi ed into three stages. Early in secondary endosymbiosis, in the rst stage, the relationship between host and symbiont is facultative endosymbiosis. A phagotrophic protist engulfs algae as prey and retains the undigested algae as a temporary symbiont. A host needs to take up of new algae repeatedly, because cell division does not become synchronized between host and symbiont. In the second stage, the facultative endosymbiotic relationship progresses to obligate endosymbiosis. Synchronized cell division between host and symbiont makes the relationship permanent, without the need for the incorporation of new algae. During the third stage, just before plastid establishment, many genes of the symbiont are horizontally transferred to the host genome, so that the nucleus of the symbiont markedly shrinks, becoming a vestigial nucleus, referred to as a nucleomorph.
The mitochondria of the symbiont disappear. Finally, the nucleus of the symbiont disappears and the symbiont becomes a plastid.
Many studies have investigated the genetic changes from symbiont to the plastid in the third stage of secondary endosymbiosis. Nucleomorphs have been found in Chlorarachniophyta, Crypthphyta, and some dino agellates (Sarai et al. 2020). Several nucleomorphs from Chlorarachniophyta and Crypthphyta have been sequenced. The genome size of the nucleomorphs ranges from 373 to 703 kbp (Suzuki et al. 2015), whereas that of green and red algae ranges from 12 to 343 Mbp and 11 to 105 Mbp, respectively (Blaby-Haas & Merchant 2019). The genome of the symbiont at the third stage is therefore signi cantly reduced. In a nucleomorph, 284-610 genes are housekeeping genes, such as those coding for rRNA and proteins involved in translation and transcription. The nuclear genomes of Chlorarachniophyta Bigelowiella natans and the cryptophyte Guillardia theta have been sequenced and were found to contain 353 and 508 genes derived from the symbiont, respectively, re ecting gene transfer from the symbiont to host (Curtis et al. 2012). Despite these studies, little is known about genetic changes involved in the transition from facultative to obligate endosymbiont in the second stage.
Among the various organisms in the second stage of secondary endosymbiosis, we focused on the wellknown endosymbiotic relationship between the ciliate Paramecium bursaria and the green algae Chlorella variabilis. Their cell division is synchronized (Kadono et al. 2004), indicating that they have established obligate endosymbiosis. In the natural environment, the aposymbiotic P. bursaria is rarely found (Tonooka & Watanabe 2002), and non-symbiotic C. variabilis has never been found (Hoshina et al. 2010). In the laboratory, methods of culturing P. bursaria and C. variabilis separately have been established (Siegel 1960), so various interactions between them have been reported. For instance, P. bursaria supplies nitrogen to C. variabilis (Kamako et al. 2005), while C. variabilis supplies a large amount of photosynthetic product in the form of maltose to P. bursaria (Muscatine et al. 1967). The genetic mechanisms underlying these transfers remain unclear.
As an initial approach to the elucidation of the genetic basis of the second stage of secondary symbiosis, comprehensive studies using data from next-generation sequencings, such as comparative genomics and transcriptome analysis, are powerful. Genomic and transcriptomic studies of P. bursaria have shed light on the genetic mechanisms of symbiosis with C. variabilis from the perspective of the host (Kodama et al. 2014;He et al. 2019). From the perspective of the symbiont, the genomes of C. variabilis (Blanc et al. 2010) and Micractinium conductrix (Arriola et al. 2018), another symbiont of P. bursaria, have been reported, but no comparative genomic studies have been performed to clarify the genetic changes from facultative to obligate endosymbionts in the second stage. Transcriptome studies under symbiotic conditions are limited (Quispe et al. 2016), and no comprehensive data are available to clarify the genetic mechanism of symbiosis.
In this study, to reveal the genetic basis of obligate endosymbiosis, we prepared genome datasets of C. variabilis, C. vulgaris, and C. sorokiniana and transcriptomes of C. variabilis, under symbiotic and nonsymbiotic conditions, and analyzed them. Since the taxonomy of the genus Chlorella remains controversial (Heeg & Wolf 2015), we selected C. vulgaris, the type species of the genus Chlorella, as the non-symbiotic free-living species, and C. sorokiniana as the outgroup, for genome comparison with C. variabilis. Although we expected to observe a presage of large-scale genome reduction in the third stage of secondary endosymbiosis, the genome of C. variabilis did not shrink but rather expanded, compared to that of C. vulgaris. Some of the C. variabilis-speci c expanded genes were signi cantly upregulated under symbiotic conditions compared to non-symbiotic conditions, suggesting that they play an important role in endosymbiosis. They have been found to contain genes possibly involved in cell wall biogenesis and degradation and metabolic exchange with the host. We also investigated the gene networks and pathways that play a central role in host-symbiont interactions: chlorophyl synthesis, the light-harvesting complex (LHC), the Calvin cycle, and the D-fructose-6P to maltose pathways. Enhanced photosynthesis caused by the increase of amount of chlorophyl, LHC, and enzymes comprising the Calvin cycle is likely to enable the transfer of a large supply of maltose to the host.

Read data preparation
Read data of Chlorella sorokiniana UTEX 1602 were downloaded from the NCBI BioProject PRJNA290386 (Arriola et al. 2018). In Chlorella variabilis NC64A (ATCC 50528), read data of a genome were downloaded from NCBI BioProject PRJDB7392 (Minei et al. 2018), and RNA-seq data were generated in this study. Details of the RNA-seq are described in the transcriptome analysis section. Chlorella vulgaris NIES-686 (= SAG 211-11b; CCAP 211/11b; UTEX 259) was cultured in G medium ) under 14-h light and 10-h dark conditions at 25°C. Genomic DNA was extracted using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany). Samples of 50-100 µg of cells were incubated for 5 min at 65℃ in 400 µl Buffer AP1 and 4 µl RNase A, and 400 µl glass beads (0.1 mm) were added. Cells were disrupted using Bead Smash 12 (Wakenyaku Co. Ltd., Kyoto, Japan) at 5,000 rpm for 30 s. This disruption was repeated ve times, and then the samples were incubated for 10 min at 65℃. Thereafter, we followed the manufacturer's protocol. Before total RNA extraction using RNeasy Mini Kit (Qiagen), collected cells were resuspended in ten times the amount of RNAlater® (Thermo Fisher Scienti c, Waltham, MA, USA) and disrupted in the same way as for DNA extraction. After adding 500 µl Buffer RLT, cell lysates were vortexed and incubated for 3 min at 56℃. The supernatants were recovered by centrifugation at 13,000 g and used as input to the manufacturer's protocol. Extracted genomic DNA and total RNA were shipped to BGI for library preparation and sequencing by PacBio® RS II (Paci c Biosciences, Menlo Park, CA) and HiSeq X Ten (Illumina, San Diego, CA, USA), following the manufacturer's protocol. Details on the read data are summarized in Table S1.

Genome dataset construction
Prepared read data were checked using SeqKit (Shen et al. 2016) and Seqtk (Li), and the statistics are summarized in Table S1. Long reads from RS II were assembled using SMART analysis software ( using chlorophyta_odb10. After removing chloroplast and mitochondrial genomes from the assembled genomes, using BLAST, repeat sequences were predicted using RepeatModeler (Flynn et al. 2020) and then softmasked; that is, repeat sequences of genomes were converted to lower case bases, using RepeatMasker. After quality control of the RNA-seq data using AfterQC (Chen et al. 2017), the ltered reads were mapped to organelle-eliminated genomes using HISAT2 (ver.

Molecular evolution analysis
Single-copy orthologs were identi ed from proteome sequence data of C. variabilis, C. vulgaris, and C. sorokiniana as the outgroup. Their protein-coding DNA sequences and aligned protein sequences using MAFFT were prepared as input for PAL2NAL (Suyama et al. 2006). This software converted multiple sequence alignments of proteins and their corresponding DNA sequences into codon alignments. The Ds, Dn, and Dn/Ds values of each branch were calculated using the CODEML included PAML package (Yang 2007) with the free ratio model. The genes with alignment length shorter than 150 bp, with Ds or Dn values higher than 2, or with Dn/Ds higher than 5 were excluded. These values were visualized in violin plots and two-dimensional density distributions using the R package ggplot. To detect positively selected genes, the branch-site model of CODEML was used, setting C. variabilis as the foreground branch and C. vulgaris and C. sorokiniana as the background branches. Maximum-likelihood ratio tests were performed comparing two models: the null model assuming all codons in all branches with Dn/Ds ≤ 1 and the alternative model assuming some codons in the foreground branch with Dn/Ds > 1. The p-value was calculated using chi-square statistics and adjusted for multiple testing using the false discovery rate (FDR) method. Genes with adjusted p-value < 0.05 were identi ed as candidate positively selected genes.

Transcriptome analysis
In a preparation of free-living conditions (FL), C. variabilis strain Kb1, previously isolated and cloned (Higuchi et al., 2018), was cultured in C medium (Ichimura, 1971) supplemented with 0.03% (w/v) L-serine under 12-h light and 12-h dark conditions at 25℃ (Kato & Imamura 2008). For symbiotic conditions, Paramecium bursaria strain Pb-Kb1, having C. variabilis strain Kb1 inside the cells, were cultured monoxenically with Chlorogonium capillatum (NIES-3374) as the sole food source, in sterilized Volvic® supplemented with 1g/L sodium acetate and 5 g/L yeast extract under 12-h light and 12-h dark conditions at 25℃. The C. variabilis cells were isolated by gently disrupting the cells of P. bursaria using an ultrasonic disrupter Handy Sonic UR-21P (Tomy Seiko Co. Ltd., Tokyo, Japan). After four and eight days of incubation, collected cells were used as symbiotic log and stationary phase conditions (SL and SS), respectively. All experiments were performed in duplicate. Filtered reads were mapped to the genome of C. variabilis using HISAT2 (ver. 2.1.0) and counted per gene using featureCounts included in the Subread package (Liao et al. 2014). Using the R package edgeR (Robinson et al. 2010), raw count data were ltered using the lterByExpr function and normalized using calcNormFactors. Genes with FDRadjusted p-value < 0.05 were extracted as signi cantly differentially expressed genes (DEGs) using the glmLRT function. Heat maps, dendrograms, and trend lines were generated by the pheatmap function included in the R package gplots and ggplot, respectively. GSEA was performed for genes belonging to each Cluster, with corresponding A. thaliana UniProt id by Metascape using GO Molecular Function, Biological Process, and Cellular Component annotations.

Results
De novo genome assembly and annotation To assemble unfragmented and non-redundant genomes of C. variabilis, C. vulgaris, and C. sorokiniana, we adopted a hybrid assembly technique using short-and long-read data. The read data of C. variabilis and C. sorokiniana were acquired from the public database. We sequenced the genomic DNA of C. vulgaris using PacBio RS II and Hiseq X Ten. RS II produced approximately 2.6 Gbp long-read data with an average length of 6.7 kbp, and HiSeq X produced 7.5 Gbp short-read data (Table S1). We assembled the hybrid data of the three species and evaluated the degree of fragmentation of these assemblies using indexes such as the number of contigs, the largest contig size, and N50. In the assembly of C. vulgaris, the number of contigs decreased from 237 to 91, the largest contig size increased from approximately 1.7 to 4.1 Mbp, and N50 increased from approximately 0.6 to 1.7 Mbp in the assembly using hybrid data, compared with the assembly using only long-read data. To connect the contigs and reduce the number of redundant contigs, we scaffolded each assembly of the three species and evaluated the e ciency of assembly by measuring the preservation rate of Chlorophyta universal single-copy orthologs using BUSCO (Seppey et al. 2019). The preservation rates of C. variabilis, C. vulgaris, and C. sorokiniana were 93.3%, 93.1%, and 92.8%, respectively, indicating that these genomes correctly re ect the original gene repertoire of each species. All proportions of redundantly predicted single-copy orthologs were less than 1% for the three species, indicating that these genomes correctly re ected the copy number of each gene (Table S2). Assembly of the genomes showed that the genome of C. variabilis is about 6 Mbp larger than that of C. vulgaris.
We predicted the gene structures-exons, introns, and repeat sequences-from the assembled genomes of three species and added their functional annotations. To obtain accurate gene structures, we predicted the structures based on expression data from RNA-sEq. The average lengths of Chlorophyta universal single-copy orthologs obtained from the genomes of the three species were more than 3 kbp, the number of exons was 8, and the preservation rate was 93%, showing that the structure of each gene was accurately predicted (Table S3). We found that C. variabilis carried 1,475 more genes than C. vulgaris. These predicted genes were functionally annotated using the UniProtKB, Arabidopsis thaliana proteome, NCBI non-redundant (nr), and InterProScan databases for gene names, GSEA, contamination check, and domain search, respectively. In the homology search against the nr database, the proportions of genes originating from bacteria and viruses were less than 1%, showing that the genomes of the three species were free from contamination (Fig. S1). Next, we identi ed repeat sequences in the genomes of the three species using a de novo approach. We found that C. variabilis possessed 5% more repeats than the others (Table S4). The genome of C. variabilis was 6 Mbp larger than that of C. vulgaris, probably due to the presence of repeat sequences. The total length of the repeat sequences in the C. variabilis genome was 3 Mbp larger than that of C. vulgaris. C. variabilis possessed 1,475 more genes than C. vulgaris, and multiplying 1,475 by the average length of the C. variabilis gene region (2,954) yields about 4.5 Mbp. The statistics of the three genome datasets generated are summarized in Table 1.
Phylogeny and molecular evolution of C. variabilis To con rm a phylogenetic relationship between C. variabilis and C. vulgaris, we constructed a tree using 449 single-copy orthologs from 14 Trebouxiophyceae species including Coccomyxa subellipsoidea as the outgroup. The tree generated indicated that C. vulgaris is most closely related to C. variabilis (Fig. 1A). The terminal branch lengths of C. variabilis and C. vulgaris were 0.077 and 0.123, respectively, indicating that the mutation rate of C. variabilis was lower than that of C. vulgaris. The Ds value of C. variabilis, representing the number of mutations per synonymous site, was lower than that of C. vulgaris. The two species had similar Dn values (the number of mutation per non-synonymous site) and Dn/Ds ratios, so positively selected genes were not detected using the likelihood ratio test (Fig. 1B).
Comparative genomics of C. variabilis and C. vulgaris To analyze the synteny between the three species at the whole-genome level, we produced dot plots. In comparisons between C. variabilis and C. vulgaris, each dot representing a homologous match between the two sequences was located on the diagonal more closely than the other comparison. These results showed that syntenies between C. variabilis and C. vulgaris are conserved throughout their genomes without large-scale insertions, deletions, inversions, or duplications (Fig. 1B). This nding also indicates that C. variabilis and C. vulgaris are closely related.
To identify the genetic features of C. variabilis, we compared the pro les of their GO terms and inferred orthologous gene groups among the three species. GO terms for each gene were derived from domain annotation, and these results were summarized in a bar chart. In the bar chart, most GO terms had the same patterns, but the "extracellular region part" term was presented only in C. variabilis ( Fig. 2A). In ortholog analysis, we identi ed OGs, which are de ned as groups of genes descended from a single ancestral gene (Emms & Kelly 2019). A total of 8,931 OGs were identi ed, containing 85.9% of all predicted genes among 3 species, and the remaining genes were not assigned to any OGs and were hence designated as singletons. These OGs were classi ed by conserved patterns among the three species and visualized in a Venn diagram (Fig. 2B). In the Venn diagram, 7,412 OGs containing 83.0% of all identi ed OGs were conserved among the 3 species, suggesting that the species are closely related.
We then focused on three groups composed of OGs and singletons: C. variabilis-speci c lost, gained, and duplicated genes.
The C. variabilis-speci c lost genes were contained in the 287 OGs conserved only in C. vulgaris and C. sorokiniana (Fig. 2B). To estimate the functions included in these OGs, we conducted GSEA, which identi es signi cantly enriched biological terms using GO terms, resulting in the identi cation of "cellular response to reactive oxygen species (ROS)" being the most enriched (Fig. S2A). Superoxide dismutase (SOD, OG0002817), which catalyzes the dismutation of superoxide radicals into oxygen and hydrogen peroxide; monodehydroascorbate reductase (MDR, OG0008335), which catalyzes the conversion of monodehydroascorbate to ascorbate of major antioxidant; and peptide-methionine (R)-S-oxide reductase (OG0008244), which protects against oxidation of proteins, were lost (Table S5). The C. variabilis-speci c gained genes were contained in 10 OGs with 39 genes and 1,426 singletons conserved only in C. variabilis (Fig. 2B). In these OGs, "protein phosphatase 1 binding" was the most enriched term, as identi ed using GSEA (Fig. S2B). The C. variabilis-speci c duplicated genes were contained in 297 OGs with 978 genes extracted from the 3 categories of OGs in the Venn diagrams: 7,412 OGs with 8,655 genes conserved among 3 species, 394 OGs with 510 genes conserved only in C. variabilis and C. vulgaris, and 805 OGs with 1,004 genes conserved only in C. variabilis and C. sorokiniana (Fig. 2C). In these OGs, "calmodulin-dependent protein kinase activity" was the most enriched, as identi ed using GSEA (Fig.   S2C). Among C. variabilis-speci c gained and duplicated genes, we analyzed the transcriptome under symbiotic conditions to identify the genes involved in the symbiotic lifestyle.

Transcriptome dynamics of C. variabilis under symbiosis
To identify DEGs under symbiotic conditions, we compared RNA-seq data derived from the three conditions illustrated in Fig. S3, namely, C. variabilis, cultured under non-symbiotic conditions without host P. bursaria (the free-living condition; FL) as a control, cultured under symbiotic conditions in the P. bursaria log phase (the symbiotic and log phase condition; SL), and cultured under symbiotic conditions in the P. bursaria stationary phase (the symbiotic and stationary phase condition; SS). More than 98% of reads remained after ltering, and more than 86% of reads mapped to the genome, showing that the sequencing process was effective, and six samples had no contamination. These mapped reads were counted by gene, and after ltering and normalization, we extracted 4,401 signi cantly DEGs from the pairwise comparison among the 3 conditions, using the generalized likelihood ratio test.
We calculated the relative expression values for each DEG, performed hierarchical clustering based on the expression patterns of each gene, and visualized these results with a heat map and dendrogram (Fig. 3A). The 4,401 DEGs were classi ed into four Clusters with similar expression patterns (Fig. 3B). Under symbiotic conditions, 2,085 genes were downregulated (Cluster 1) and 1,038 genes were upregulated (Cluster 2). In the SL condition, 1,114 genes were upregulated (Cluster 3), and in SS conditions, 164 genes were upregulated (Cluster 4). Cluster 1 included genes involved in cellular quality control of DNA and proteins, with GO terms such as "cellular response to DNA damage stimulus," "damaged DNA binding," and "proteolysis" being strongly enriched in GSEA (Fig. S4A). Cluster 2 included genes related to rRNA and tRNA, with GO terms "ncRNA metabolic process" identi ed by GSEA (Fig. S4B), suggesting activation of protein synthesis. Cluster 3 included genes linked to GO terms such as "cell cycle," "DNA replication," and "chromosome" (Fig. S4C), showing that C. variabilis proliferates in synchronization with host cell division, a nding that is consistent with previous reports. The genes localized in the chloroplast were contained in Clusters 2, 3, and 4 ( Fig. S4B, C, and D). Genes related to chlorophyl belonged to Cluster 4 ( Fig. S4D), showing the speci c activation of photosynthesis under symbiotic conditions.

Multi-omics analysis of C. variabilis
To identify genes playing a crucial role under symbiotic conditions, we combined the results of ortholog and transcriptome analysis and listed them in the Supplemental Data. C. variabilis-speci c gained or duplicated OGs and DEGs were extracted. Among the C. variabilis-speci c 1,465 genes gained, 468 DEGs were included, composed of 237, 106, 102, and 23 DEGs belonging to Clusters 1, 2, 3, and 4, respectively.
Only 95 of the 468 DEGs were annotated against a gene in the BLAST search of UniProtKB, including chitosanase (OG0011577; g11572), which degrades the chitosan constituting the cell wall of Chlorella, included in Cluster 2, chloride channel (OG0011693; g572) included in Cluster 3, and permease (OG0012137; g2843), a membrane transport protein included in Cluster 4 (Table S6). Unannotated OG0000793 was composed of ve genes, and this OG was speci cally gained and duplicated in C. variabilis. OG0000793 contained genes with different expression patterns, with three genes belonging to Clusters 2, 3, and 4, respectively (g5461, g6542, and g11141), and the other two genes (g3072 and g3111) were not DEGs, suggesting that they play different roles, or subfunctionalization (Table S6). 16 OGs belonging to Clusters 1, 2, 3, and 4, respectively. We focused on 152 OGs including 202 DEGs belonging to Clusters 2, 3, and 4, because the expression of these genes was upregulated under symbiotic conditions, indicating a high likelihood of involvement in symbiosis. In the 152 OGs, the genes were related to cell wall biogenesis or degradation and metabolic exchange with the host. The former included alpha-galactosidase (OG0000160), arabinosyltransferase (OG0000063), chitosanase (OG0000466), and beta-glucan synthesis-associated protein (OG0000291). The latter facilitates mutual transport of certain metabolites between C. variabilis and P. bursaria, including e ux transmembrane transporter (OG0000073) transferring a speci c substance from the inside of the cell to the outside, proton/sulfate cotransporter (OG0000123), sugar-phosphate/phosphate translocator (OG0001113), transporter of a speci c amino acid (OG0001180), ammonium transporter (OG0000350), and glutamate dehydrogenase (OG0001044), which assimilates ammonium to glutamate.

Symbiosis gene network and pathways
To investigate the pathways participating in symbiosis, we automatically extracted the genes corresponding to each component constituting pathways from integrated data of ortholog and transcriptome analysis (Supplemental Data 1) and plotted their expression patterns on the map from the KEGG Pathway database. These genes were manually curated and are summarized in Table S7. In the pathway "porphyrin and chlorophyl metabolism," the expression of genes encoding 16 enzymes constituting the biosynthetic pathway of chlorophyl a and b from glutamate tended to increase in SS-speci c conditions (Fig. 4A). Among 30 genes corresponding to 16 enzymatic reactions, 16 genes were DEGs, including 9, 2, and 5 genes belonging to Clusters 2, 3, and 4, respectively. OG0000400 and OG0000788, corresponding to coproporphyrinogen III oxidase (1.3.3.3) and magnesium-protoporphyrin IX monomethyl ester (oxidative) cyclase (1.14.13.81), respectively, were C. variabilis-speci c duplicated and contained DEGs belonging to Clusters 2 and 3. In addition to chlorophyl biosynthesis, all genes encoding chlorophyl-binding subunits of LHC, which enables e cient absorption of light energy, were highly expressed in SS-speci c conditions (Fig. 4B). These results suggest that the increase in chlorophyl and LHC content enhances light absorption and conversion into energy, which then promotes photosynthesis.
In the pathway "carbon xation in photosynthetic organisms," the gene expression of 13 enzymes constituting the Calvin-Benson cycle tended to increase under both SL and SS conditions, compared with FL conditions (Fig. 4C). Among 25 genes of 20 OGs corresponding to 13 enzymatic reactions, 17 genes were DEGs, including 2, 12, and 3 genes belonging to Clusters 1, 2, and 3, respectively. The two copies of the ribulose-bisphosphate carboxylase, or Rubisco

Discussion
In this study, we investigated the genetic changes occurring in the second stage of the move from facultative to obligate endosymbiosis and the genetic mechanisms of the endosymbiotic lifestyle, using C. variabilis. The genetic change, contrary to our hypothesis, did not involve the reduction of the size of the genome of C. variabilis. The genes involved in the biosynthesis and degradation of the cell wall and the metabolic exchange with the host play crucial roles in endosymbiosis. The genetic mechanism underlying the enhancement of photosynthetic capacity during symbiosis increases light absorption e ciency and carbon xation capacity, resulting in a massive supply of maltose to the P. bursaria.

Quality of the genome datasets
For comparative genomics, we prepared high-quality genome datasets of C. variabilis, C. vulgaris, and C. sorokiniana, organisms that are closely related, using the same pipeline. This approach ensured the accuracy of the ndings in this study. The three gene models predicted from the unfragmented and nonredundant genomes (Table S2) correctly re ected the original gene repertoire and copy number (Table  S3), and contamination was not detected (Fig. S1). Although the taxonomy of the genus Chlorella is unresolved (Heeg & Wolf 2015), the results of the phylogenetic tree (Fig. 1A), synteny ( Fig. 2A), and ortholog inference (Fig. 2C)  Non-reduction of the genome of C. variabilis Genome comparisons showed that C. variabilis has a larger genome size (6 Mbp), more genes (1,475) ( Table 1), and no higher mutation rate (Fig. 1A) than C. vulgaris, and contrary to expectations, we did not observe a presage of large-scale genome reduction in the third stage of secondary endosymbiosis.
However, C. variabilis lost some ROS-related genes ( Fig. S2A and Table S5), and the expressions of genes involved in cellular quality control of DNA and protein were downregulated under symbiotic conditions compared to non-symbiotic conditions (Figs. 3 and S4A), which may presage genome reduction. The fact that C. variabilis and P. bursaria can be arti cially cultured separately, and their symbiotic associations can be reconstructed (Siegel 1960), suggests that their endosymbiosis is in the relatively early part of stage 2. The larger number of genes in C. variabilis probably derives from gene gain or duplication (Fig. 2B), which may have arisen before the establishment of endosymbiosis with P. bursaria rather than after it. As a result, the adaptive capacity of C. variabilis has probably been enhanced, enabling endosymbiotic life. C. variabilis-speci c gained and duplicated genes were upregulated under symbiosis (Table S6 and Supplemental Data), suggesting that they play an important role in endosymbiosis. It has been reported that the properties of C. variabilis are variable and the organism is capable of environmental adaptation (Krauss & Shihira 1965).

C. variabilis -speci c gained and duplicated DEGs
Ortholog and transcriptome analyzes identi ed C. variabilis-speci c gained and duplicated genes and OGs, some of which were upregulated under symbiotic conditions compared to non-symbiotic conditions (Supplemental Data). These genes included some related to cell wall biosynthesis and degradation and metabolic exchange with P. bursaria (Table S6), an observation that suggests that they play an important role in endosymbiosis. The structure of the cell wall may have required adaptation to the intracellular environment of P. bursaria, which is different from that under FL conditions. In this study, C. variabilis possessed the highest number of genes associated with the GO term "extracellular region part," compared to C. vulgaris and C. sorokiniana (Fig. 2B). Pathway analysis indicated that cellulose is degraded into glucose under symbiosis ( Fig. 4D and Table S7). It has been reported that the composition of the cell wall is different between symbiotic and free-living Chlorella (Takeda 1995) and the thickness of the cell wall of C. variabilis under symbiotic conditions is about half compared with that in non-symbiotic conditions and has a different composition (Higuchi et al. 2018). We investigated the metabolites exchanged using transfer proteins, based on the results of the homology search (Table S6). The supplied ammonium (OG0000350 and OG0001044) and amino acids (OG0001180) are consistent with a previous study showing that C. variabilis can utilize them as a nitrogen source (Kamako et al. 2005). To the best of our knowledge, this is the rst time that the exchange of chloride (OG0011693), sugar-phosphate/phosphate (OG0001113), and sulfate (OG0000123) has been reported. Sulfurate exchange has been reported in symbiotic relationships between corals-algae (Yuyama & Watanabe 2008) and hydra-algae (Cook 1976).
The C. variabilis-speci c gained/duplicated DEGs whose functions were not predictable using homology search (e.g., OG0000793) are expected to reveal important ndings through further analysis using other methods.
Genetic mechanism of the enhancement of photosynthesis The pathway analysis found that the expression of C. variabilis genes corresponding to components the chlorophyl synthesis, LHC, Calvin cycle, and D-fructose-6P to maltose pathways tended to be upregulated under endosymbiosis ( Fig. 4 and Table S7). The increase in chlorophyl ( The gene expression pattern of each pathway was different. Genes constituting the Calvin cycle tended to be upregulated in both SS and SL conditions (Fig. 4C), genes constituting the pathway of chlorophyl synthesis (Fig. 4A) and LHC (Fig. 4C) tended to be upregulated speci cally in SS conditions, and genes constituting the pathway from D-fructose-6P to maltose tended to be upregulated speci cally in SL phase (Fig. 4D). These results suggest that P. bursaria suppresses the synthesis of chlorophyl and LHC in C. variabilis by decreasing the supply of the nitrogen source, glutamine, to C. variabilis, and promoting the supply of maltose to P. bursaria in C. variabilis during their proliferation. Chlorophyll synthesis is closely linked to nitrogen content (Liang et al. 2020), but C. variabilis can only utilize NH 3 and amino acids as nitrogen sources (Kamako et al. 2005). It has been reported that C. variabilis can utilize glutamine among amino acids (Hamada et al. 2018). Glutamate, which is the starting material for chlorophyl synthesis, can be synthesized from glutamine. He et al. (2019) found that the RNAi knockdown of glutamate synthase in P. bursaria reduced the number of symbiotic C. variabilis. Therefore, P. bursaria may control chlorophyl synthesis in C. variabilis according to the level of glutamine supplied to C. variabilis.
In conclusion, we found a non-reduced genome of C. variabilis, indicating that the symbiotic relationship between C. variabilis and P. bursaria is relatively early in stage 2 of secondary endosymbiosis. We identi ed the genes and pathways playing crucial roles under conditions of endosymbiosis. However, this study has several limitations. To characterize the stage of symbiosis between C. variabilis and P. bursaria, further comparative analysis with other symbiotic endosymbiotic relations is needed. Algae closely related to C. variabilis include several species that, as in C. variabilis, engage in endosymbiosis with speci c protists, which have arisen as phylogenetically independent organisms (Hoshina & Kusuoka 2016). Additional functional analyzes are needed, using techniques such as genome editing of genes related to symbiosis, as identi ed in this study. However, our study sheds new light on the second stage of secondary endosymbiosis, using genomic comparisons between closely related symbiotic and nonsymbiotic species and their transcriptomes under endosymbiosis. We believe that our ndings will play a leading role in providing a basis for uncovering the evolution of chloroplast acquisition by secondary symbiosis.

Declarations
Ethics approval and consent to participate Not applicable.

Not applicable
Availability of data and materials The datasets generated and analysed during the current study are available in the DDBJ Sequence Read Archive, with the BioProject Accession ID: PRJDB12190.

Competing interests
The authors declare that they have no potential con ict of interest. Dr. Ogura, associate editor for BMC genomics, was not involved in the editorial review of or decision to publish this article.    Within the squares are the corresponding protein symbols and EC numbers, and their annotations are in Table S7. The square is divided into three zones, each colored based on the median relative expression level of the genes encoding this protein under three conditions: FL, SL, and SS conditions, from left to right. Metabolites expected to increase biosynthesis/degradation are highlighted in pink. FL, the freeliving condition; SL, the symbiotic and log phase condition; SS, the symbiotic and stationary phase condition. The uncolored squares, enzymes, indicate that the corresponding genes were not identi ed

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.