Morphological features and genome assembly
Self-flocculation of S. obliquus AS-6-11 cells was observed by scanning electron microscopy (SEM). The cells are round and form aggregates through cell-cell contacts (Fig. 1), in contrast to the other reported Scenedesmus strains, which are spindle-shaped [12].
The MECAT assembly of the S. obliquus AS-6-11 genome has an estimated size of 172.3 Mbp in 2,772 contigs, with an N50 contig size of 94.4 kbp (Additional file 1: Table S1; NCBI BioProject ID: PRJNA593662). MECAT gave a better assembly of S. obliquus AS-6-11 than SMRT Portal: the contig number is 58.1% lower and the N50 value is 1.5-fold higher (Additional file 1: Table S1). The genome sizes of the released Scenedesmus strains [20-24] range from 23.4 to 208.0 Mbp (Table 1). Among the available assemblies, the N50 contig sizes of S. obliquus AS-6-11 (this study) and S. obliquus strain DOE0152z, both obtained with PacBio technology, are markedly higher than those of the other Scenedesmus strains, which were sequenced with second-generation sequencing (SGS) (Table 1). The N50 contig size of S. obliquus AS-6-11 is 1.2-fold and 10.7-fold higher than those of Scenedesmus sp. MC-1 and S. quadricauda LWG 002611, respectively. In addition, the GC content of the Scenedesmus strains ranges from 52.0% to 63.2%, and S. obliquus AS-6-11 has the lowest GC content (Table 1). Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the assembly of S. obliquus AS-6-11 is 87.1% complete with 2,168 BUSCO groups (Additional file 2).
Table 1 Genomic information of the reported Scenedesmus strains*
Strains | Genome size (Mbp) | GC content (%) | Contig numbers | N50 value (bp) | Sequencing technology | Gene number | Reference/BioProject |
Scenedesmus sp. ARA | 93.2 | 56.8 | 4,727 | 37,561 | Illumina HiSeq | - | [20] |
Scenedesmus sp. MC-1 | 38.2 | 61.4 | - | 42,815 | Illumina HiSeq 2000 | 8,652 | [21] |
S. vacuolatus | 23.4 | 53.6 | 20,139 | 1,571 | 454 | 20,139 | PRJNA498405 |
S. quadricauda isolate LWG 002611 | 65.4 | 63.2 | 13,425 | 8,094 | Ion Proton | 13,514 | [22] |
Tetradesmus obliquus UTEX393 | 108.7 | 56.8 | 9,191 | - | Illumina HiSeq 2000 | - | [23] |
S. obliquus strain DOE0152z | 208.0 | 56.7 | 2,705 | 155,544 | PacBio | - | [24] |
S. obliquus AS-6-11 | 172.3 | 52.0 | 2,772 | 94,410 | PacBio | 31,964 | This study |
*‘-’ indicates that the information is not available.
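For readers unfamiliar with the N50 statistic compared in Table 1, the following is a minimal sketch of how it is commonly computed from a list of contig lengths, followed by a check of the fold differences quoted in the text. The function name `n50` and the toy contig lengths are illustrative assumptions, not the assembler's own code or data from this study; only the three N50 values are copied from Table 1.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


# Hypothetical toy assembly (not data from this study).
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)

# N50 values in bp, copied from Table 1.
n50_as611 = 94_410   # S. obliquus AS-6-11
n50_mc1 = 42_815     # Scenedesmus sp. MC-1
n50_lwg = 8_094      # S. quadricauda LWG 002611

# "x-fold higher" is read here as (ratio - 1), which is an interpretation.
fold_higher_mc1 = n50_as611 / n50_mc1 - 1
fold_higher_lwg = n50_as611 / n50_lwg - 1

print(toy_n50, round(fold_higher_mc1, 1), round(fold_higher_lwg, 1))
```

Under that reading of "fold higher", the Table 1 values give 1.2 and 10.7, matching the figures in the text.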
Genome annotations
A total of 31,964 protein-coding genes were predicted in the S. obliquus AS-6-11 genome (Table 2). This gene number is considerably higher than those reported for the other Scenedesmus strains with available gene counts (8,652 to 20,139; Table 1). Based on the Non-redundant protein (NR), SWISS-PROT, and Pfam protein families databases, 19,847, 13,099, and 13,612 proteins were annotated, respectively (Table 2). The number of proteins annotated with the NR database is the largest, 1.52 times that obtained with the SWISS-PROT database. In addition, 65 GO terms and 428 pathways were predicted in S. obliquus AS-6-11 using the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, respectively.
The top 20 GO terms and KEGG pathways in the gene function annotation of the S. obliquus AS-6-11 genome are shown in Fig. 2. The top 20 GO terms mainly fall into the biological process (10) and cellular component (8) categories, with cell, cell part, and organelle as the top three GO terms (Fig. 2a). The top 20 KEGG pathways are mainly related to genetic information processing (14), with chromosome and associated proteins, membrane trafficking, and spliceosome as the top three KEGG pathways (Fig. 2b).
Table 2 Summary of the S. obliquus AS-6-11 genome annotation
Protein database | Annotated protein numbers |
NR | 19,847 |
SWISS-PROT | 13,099 |
Pfam | 13,612 |
GO | 11,734 |
KEGG | 3,302 |
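As a worked check of the annotation figures, the short sketch below expresses each count in Table 2 as a percentage of the 31,964 predicted genes and recomputes the NR to SWISS-PROT ratio. The variable names are illustrative; the percentages are derived here from the table and are not values reported by the authors.

```python
# Counts copied from the text and Table 2.
total_genes = 31_964
annotated = {
    "NR": 19_847,
    "SWISS-PROT": 13_099,
    "Pfam": 13_612,
    "GO": 11_734,
    "KEGG": 3_302,
}

# Percentage of predicted genes annotated by each database.
coverage = {db: round(100 * n / total_genes, 1) for db, n in annotated.items()}

# Ratio of NR-annotated to SWISS-PROT-annotated proteins.
nr_vs_swissprot = round(annotated["NR"] / annotated["SWISS-PROT"], 2)

for db, pct in coverage.items():
    print(f"{db}: {pct}%")
print("NR / SWISS-PROT:", nr_vs_swissprot)
```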
Comparative genomic analysis based on KEGG pathways
A total of 428 pathways were annotated in the S. obliquus AS-6-11 genome. Among the five microalgae compared, S. obliquus AS-6-11 has the fewest genes annotated in lipid metabolism (171), especially in glycerolipid metabolism, glycerophospholipid metabolism and arachidonic acid metabolism (Table 3). However, S. obliquus AS-6-11 has more genes related to fatty acid biosynthesis (26) than C. reinhardtii (23) and V. carteri (24), and more genes related to fatty acid elongation (8) than C. reinhardtii (7), although the same number as V. carteri (8) (Table 3). Moreover, S. obliquus AS-6-11 has the fewest genes in carotenoid biosynthesis.
Table 3 Analysis of gene numbers of the key metabolic pathways among the five microalgae
KEGG pathways | C. reinhardtii | C. variabilis | M. conductrix | V. carteri | S. obliquus AS-6-11 |
Lipid metabolism |
Fatty acid biosynthesis | 23 | 26 | 27 | 24 | 26 |
Fatty acid elongation | 7 | 8 | 10 | 8 | 8 |
Fatty acid degradation | 16 | 21 | 15 | 18 | 18 |
Steroid biosynthesis | 9 | 12 | 14 | 9 | 10 |
Steroid hormone biosynthesis | 5 | 4 | 4 | 4 | 3 |
Glycerolipid metabolism | 28 | 28 | 30 | 28 | 21 |
Glycerophospholipid metabolism | 35 | 37 | 35 | 32 | 30 |
Ether lipid metabolism | 5 | 9 | 7 | 6 | 5 |
Sphingolipid metabolism | 18 | 16 | 14 | 16 | 17 |
Arachidonic acid metabolism | 14 | 13 | 13 | 10 | 7 |
Alpha-linolenic acid metabolism | 10 | 13 | 14 | 11 | 9 |
Biosynthesis of unsaturated fatty acids | 10 | 15 | 14 | 12 | 11 |
Metabolism of terpenoids and polyketides |
Carotenoid biosynthesis | 12 | 11 | 14 | 12 | 10 |
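The comparison in Table 3 can be tallied with a short script. The sketch below sums the lipid metabolism rows per species and lists the pathways in which S. obliquus AS-6-11 has strictly fewer genes than all four other species. It uses only the rows printed in Table 3; the total of 171 quoted in the text may include lipid pathways that are not listed in the table (an assumption), so the sums here are not expected to reproduce that figure.

```python
species = [
    "C. reinhardtii", "C. variabilis", "M. conductrix",
    "V. carteri", "S. obliquus AS-6-11",
]

# Gene counts per lipid metabolism pathway, copied from Table 3
# in the column order given by `species`.
lipid_rows = {
    "Fatty acid biosynthesis": [23, 26, 27, 24, 26],
    "Fatty acid elongation": [7, 8, 10, 8, 8],
    "Fatty acid degradation": [16, 21, 15, 18, 18],
    "Steroid biosynthesis": [9, 12, 14, 9, 10],
    "Steroid hormone biosynthesis": [5, 4, 4, 4, 3],
    "Glycerolipid metabolism": [28, 28, 30, 28, 21],
    "Glycerophospholipid metabolism": [35, 37, 35, 32, 30],
    "Ether lipid metabolism": [5, 9, 7, 6, 5],
    "Sphingolipid metabolism": [18, 16, 14, 16, 17],
    "Arachidonic acid metabolism": [14, 13, 13, 10, 7],
    "Alpha-linolenic acid metabolism": [10, 13, 14, 11, 9],
    "Biosynthesis of unsaturated fatty acids": [10, 15, 14, 12, 11],
}

# Column-wise totals over the listed pathways.
totals = {
    name: sum(row[i] for row in lipid_rows.values())
    for i, name in enumerate(species)
}

# Pathways where the last column (S. obliquus AS-6-11) is strictly lowest.
strictly_lowest = [
    pathway for pathway, row in lipid_rows.items()
    if row[-1] < min(row[:-1])
]

print(totals)
print(strictly_lowest)
```

Over the listed rows, S. obliquus AS-6-11 has the smallest column total, and glycerolipid, glycerophospholipid and arachidonic acid metabolism are among the pathways where it is strictly lowest, consistent with the text.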
Comparative genomic analysis of orthologous gene clusters
In the comparison with the other four species, S. obliquus AS-6-11 has 15,879 gene clusters, comprising 14,576 orthologous clusters and 1,303 single-copy gene clusters (Fig. 3). The five microalgae share 3,357 orthologous gene clusters. S. obliquus AS-6-11 has the most gene clusters and the most singletons (defined as genes for which no orthologs could be found in any of the other species [25]); its number of singletons (8,751) is 1.26-fold, 3.71-fold, 5.34-fold and 1.67-fold higher than that in C. reinhardtii, C. variabilis, M. conductrix and V. carteri, respectively (Fig. 3). Comparative orthologous gene cluster analysis also showed that S. obliquus AS-6-11 is phylogenetically close to the other four microalgae (Additional file 3: Fig. S1).
Comparative genomic analysis based on gene families
A total of 3,608 gene families were identified in S. obliquus AS-6-11, of which 136 are unique (Fig. 4). Both the total and the unique gene families in S. obliquus AS-6-11 are more abundant than those in the other four microalgae (Fig. 4). The number of unique gene families in S. obliquus AS-6-11 is 0.86-, 1.19-, 1.31- and 1.39-fold larger than that in C. reinhardtii, C. variabilis, M. conductrix and V. carteri, respectively (Fig. 4). In the S. obliquus AS-6-11 genome, the unique gene families include membrane protein (PF10160), red chlorophyll catabolite reductase (RCC reductase, PF06405), D-mannose binding lectin (PF01453), lipase maturation factor (PF06762), lipid-A-disaccharide synthetase (PF02684) and thioesterase-like superfamily (PF13279), among others. In addition, S. obliquus AS-6-11 and M. conductrix share the largest number of gene families (Fig. 4).
Analysis of the genome features related to cell self-flocculation
Cell self-flocculation of the budding yeast Saccharomyces cerevisiae has been well studied. The flocculation proteins, for example Flo1p, Flo5p, Flo9p, and Flo10p, are cell wall proteins (CWPs) and are also called lectins [26, 27]. The GPI anchor has been reported as a common element of cell adhesion proteins, and the GPI-anchored adhesins of the yeast species Candida albicans and S. cerevisiae are well-known fungal adhesins [28]. In S. obliquus AS-6-11, a total of 432 GPI-anchored CWPs were identified. Analysis of the top 10 GPI-anchored CWPs indicated that seven of them have a transmembrane region and eight of them have a signal peptide (Table 4). The isoelectric point (pI) and molecular weight (Mw) of these GPI-anchored CWPs range from 4.95 to 9.58 and from 6.10 kDa to 78.84 kDa, respectively (Table 4).
Table 4 Analysis of the top 10 GPI-anchored CWPs with signal peptides*
Protein name | GPI probability (%) | pI | Mw (kDa) | SMART analysis | Subcellular localization sites |
Sco00011036 | 99.82 | 6.75 | 7.13 | TMR | vacu: 8, chlo: 2, plas: 2, extr: 1, golg: 1 |
Sco00023226 | 99.73 | 5.38 | 30.84 | TMR | vacu: 8, plas: 4, extr: 2 |
Sco00002357 | 99.65 | 6.36 | 9.68 | TMR | extr: 7, E.R.: 3.5, E.R._plas: 3, mito: 2, plas: 1.5 |
Sco00003994 | 99.51 | 4.95 | 21.28 | - | extr: 12, mito: 1, E.R.: 1 |
Sco00022819 | 99.47 | 8.59 | 29.73 | TMR | extr: 11, mito: 2, vacu: 1 |
Sco00000470 | 99.41 | 9.58 | 14.83 | TMR | vacu: 7, plas: 3, extr: 2, E.R.: 1, golg: 1 |
Sco00000669 | 99.33 | 8.48 | 6.10 | - | extr: 12, mito: 1, plas: 1 |
Sco00004618 | 99.02 | 7.51 | 15.28 | TMR | extr: 9, vacu: 3, chlo: 2 |
Sco00003952 | 98.73 | 7.51 | 78.84 | TMR | plas: 11, vacu: 2, E.R.: 1 |
Sco00008125 | 98.71 | 5.22 | 8.87 | - | plas: 11, extr: 11, vacu: 2, nucl: 1, cyto: 1, E.R.: 1 |
*chlo: chloroplast; cyto: cytoplasmic; E.R.: endoplasmic reticulum; extr: secreted; golg: Golgi apparatus; mito: mitochondrial matrix; nucl: nuclear; plas: membrane protein; TMR: transmembrane region; vacu: vacuolar. ‘-’ indicates that no information is available.
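The localization column of Table 4 lists site scores as text such as "extr: 12, mito: 1". As a minimal sketch of how such strings might be tallied, the code below parses each entry into a dictionary, reports the highest-scoring site per protein, and counts the proteins with a transmembrane region. The helper names `parse_sites` and `top_site` are hypothetical, and the parsing convention is an assumption about the printed table format rather than the output format of the prediction tool the authors used.

```python
def parse_sites(text):
    """Turn 'vacu: 8, plas: 4' into {'vacu': 8.0, 'plas': 4.0}."""
    scores = {}
    for item in text.split(","):
        # Split on the last colon so labels such as 'E.R._plas' stay intact.
        site, value = item.rsplit(":", 1)
        scores[site.strip()] = float(value)
    return scores


def top_site(text):
    """Return the site with the highest score (first listed wins on ties)."""
    scores = parse_sites(text)
    return max(scores, key=scores.get)


# (protein, SMART result, localization string), copied from Table 4.
table4 = [
    ("Sco00011036", "TMR", "vacu: 8, chlo: 2, plas: 2, extr: 1, golg: 1"),
    ("Sco00023226", "TMR", "vacu: 8, plas: 4, extr: 2"),
    ("Sco00002357", "TMR", "extr: 7, E.R.: 3.5, E.R._plas: 3, mito: 2, plas: 1.5"),
    ("Sco00003994", "-", "extr: 12, mito: 1, E.R.: 1"),
    ("Sco00022819", "TMR", "extr: 11, mito: 2, vacu: 1"),
    ("Sco00000470", "TMR", "vacu: 7, plas: 3, extr: 2, E.R.: 1, golg: 1"),
    ("Sco00000669", "-", "extr: 12, mito: 1, plas: 1"),
    ("Sco00004618", "TMR", "extr: 9, vacu: 3, chlo: 2"),
    ("Sco00003952", "TMR", "plas: 11, vacu: 2, E.R.: 1"),
    ("Sco00008125", "-", "plas: 11, extr: 11, vacu: 2, nucl: 1, cyto: 1, E.R.: 1"),
]

tmr_count = sum(1 for _, smart, _ in table4 if smart == "TMR")
top_sites = {name: top_site(sites) for name, _, sites in table4}

print("Proteins with a transmembrane region:", tmr_count)
for name, site in top_sites.items():
    print(name, site)
```

The count of seven proteins with a transmembrane region agrees with the text; the tie between plas and extr for Sco00008125 is resolved here simply by listing order.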
Fasciclin (PF02469) is an ancient extracellular cell adhesion domain (http://pfam.xfam.org/family/PF02469) that is common to plants and animals. So far, fasciclin domain proteins have not been analyzed in microalgae. In the S. obliquus AS-6-11 genome, a total of 33 fasciclin domain proteins were identified, which are divided into three groups (Fig. 5a). Three main motifs are randomly distributed across the fasciclin domain proteins (Fig. 5b). The predicted pI values and Mw differ greatly among the fasciclin domain proteins (Additional file 4: Table S2). Subcellular localization prediction indicated that most of the fasciclin domain proteins have cytoplasmic (cyto) sites, and 15 of them have secreted (extr) sites (Additional file 4: Table S2). Further analysis of these 15 fasciclin domain proteins with extr sites showed that six proteins are homologous to the reported fasciclin proteins of Monoraphidium neglectum (64.84%), Aquabacterium sp. (61.36%), Scenedesmus sp. Ki4 (48.09%), and Pelomonas puraquae (46.94%) (Table 5). Additionally, two of the predicted proteins are annotated with the GO term extracellular region part.
Table 5 Analysis of predicted extracellular secreted fasciclin domain proteins in S. obliquus AS-6-11
Protein name | pI | Mw (kDa) | Signal peptide | The most similar homologous protein and the source organism | Identity to the most similar sequence |
Sco00000123 | 7.5 | 32.5 | - | hypothetical protein MNEG_1104 [Monoraphidium neglectum] | 63.00% |
Sco00000322-1 | 9.2 | 43.4 | 1 | hypothetical protein A1O9_09854 [Exophiala aquamarina CBS 119918] | 40.00% |
Sco00000322-2 | 8.8 | 35.1 | 1 | fasciclin domain-containing protein [Aquabacterium sp.] | 61.36% |
Sco00000402-1 | 8.9 | 23.4 | 1 | hypothetical protein DI09_43p180 [Mitosporidium daphniae] | 38.78% |
Sco00001432 | 7.7 | 44.1 | 1 | Nex18 symbiotically induced [Micractinium conductrix] | 52.08% |
Sco00002253 | 7.1 | 80.0 | - | hypothetical protein MNEG_2497 [Monoraphidium neglectum] | 46.00% |
Sco00003534 | 4.2 | 29.9 | 1 | astaxanthin binding fasciclin family protein [Scenedesmus sp. Ki4] | 48.09% |
Sco00003587 | 7.6 | 84.4 | 1 | - | - |
Sco00004297 | 6.3 | 16.4 | - | hypothetical protein Rsub_06992 [Raphidocelis subcapitata] | 61.54% |
Sco00009020 | 8.9 | 18.3 | - | fasciclin domain-containing protein [Aquabacterium sp.] | 61.36% |
Sco00022889-1 | 8.9 | 33.8 | 1 | fasciclin-like protein [Chlamydomonas reinhardtii] | 38.75% |
Sco00022889-2 | 7.7 | 26.7 | 1 | fasciclin [Pelomonas puraquae] | 46.94% |
Sco00022879 | 5.2 | 12.2 | - | beta-Ig-H3/fasciclin [Monoraphidium neglectum] | 64.84% |
Sco00000669 | 4.9 | 14.5 | - | fasciclin domain-containing protein [Marinobacter] | 53.85% |
*The protein-coding genes that encode GPI-anchored CWPs are shown in bold font; ‘-’ indicates that no information is available.
Combining the analyses of GPI-anchored CWPs and fasciclin domain proteins, four fasciclin domain proteins were found among the GPI-anchored CWPs (Fig. 6a; Additional file 5). Of these, one has two FAS1 domains (four repeated domains in the fasciclin I family of proteins), two have transmembrane regions, and one has a signal peptide (Fig. 6a). Comparative genomic analysis of S. obliquus AS-6-11 and the other four microalgal species (C. reinhardtii, C. variabilis, M. conductrix and V. carteri) revealed no proteins similar to these four fasciclin domain proteins. We also performed a comparative transcriptome analysis of S. obliquus AS-6-11 and the non-flocculating S. obliquus FSP-3. The four fasciclin domain protein-encoding genes (Fig. 6a) were transcribed in S. obliquus AS-6-11, whereas their transcripts could not be detected in S. obliquus FSP-3 (Additional file 6: Table S3).
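The overlap analysis described above amounts to a set intersection of protein identifiers. The sketch below illustrates the idea using only the identifiers printed in Tables 4 and 5, that is, the top 10 GPI-anchored CWPs and the fasciclin domain proteins with secreted sites, not the full sets of 432 and 33 proteins, so it cannot reproduce the four proteins reported in the text. Treating suffixes such as "-1" and "-2" as separate entries of the same gene is an assumption made for illustration.

```python
# Identifiers copied from Table 4 (top 10 GPI-anchored CWPs).
gpi_top10 = {
    "Sco00011036", "Sco00023226", "Sco00002357", "Sco00003994",
    "Sco00022819", "Sco00000470", "Sco00000669", "Sco00004618",
    "Sco00003952", "Sco00008125",
}

# Identifiers copied from Table 5 (fasciclin domain proteins with extr sites).
fasciclin_extr = [
    "Sco00000123", "Sco00000322-1", "Sco00000322-2", "Sco00000402-1",
    "Sco00001432", "Sco00002253", "Sco00003534", "Sco00003587",
    "Sco00004297", "Sco00009020", "Sco00022889-1", "Sco00022889-2",
    "Sco00022879", "Sco00000669",
]


def gene_id(protein_name):
    """Drop a trailing '-1' / '-2' style suffix (assumed to mark sub-entries)."""
    return protein_name.split("-")[0]


fasciclin_genes = {gene_id(name) for name in fasciclin_extr}

# Proteins that appear in both lists.
shared = gpi_top10 & fasciclin_genes
print(sorted(shared))
```

With only the tabulated subsets, a single identifier, Sco00000669, appears in both lists.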
The unique gene family D-mannose binding lectin was also analyzed (Additional file 5). One gene belonging to this family was identified, and the encoded protein has two conserved domains: a CAP (cysteine-rich secretory proteins) domain and a B_lectin (D-mannose binding lectin) domain. The putative D-mannose binding lectin of S. obliquus AS-6-11 is homologous to the secreted glycoprotein Pry1p of S. cerevisiae YJM693 (SGD ID: S000003615), with an identity of 58% (Fig. 6b). The similarity between Pry1p and the D-mannose binding lectin is attributed to their shared CAP domain (Fig. 6b).